2007-05-01 11:19:46

by Andi Kleen

Subject: Re: 2.6.22 -mm merge plans

Andrew Morton <[email protected]> writes:


> Static markers. Will merge.
There don't seem to be any users of this. How do you know it hasn't
already bitrotted?

It seems quite overcomplicated to me. Has the complexity been justified?

>
> Will merge the rustyvisor.

IMHO the user code still doesn't belong in Documentation.
Also it needs another review round I guess. And some beta testing by
more people.

> Hopefully Wu will be coming up with a much simpler best-of-readahead patch
> soon. I don't think we can get these patches over the hump and they are
> somewhat costly to maintain.

Didn't he have one already? There was a relatively simple readahead
patch recently, although it was unclear what dependencies it needed.
IMHO this work has much potential, so I hope the benchmarking/review
process can be done quickly.

-Andi


2007-05-01 22:08:08

by Mathieu Desnoyers

Subject: Re: 2.6.22 -mm merge plans

Hi Andi,

* Andi Kleen ([email protected]) wrote:
> Andrew Morton <[email protected]> writes:
>
>
> > Static markers. Will merge.
> There don't seem to be any users of this. How do you know it hasn't
> already bitrotted?
>

See the detailed explanation at :
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc7/2.6.21-rc7-mm2/broken-out/linux-kernel-markers-kconfig-menus.patch

Major points :

It is currently used as an instrumentation infrastructure for the LTTng
tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
WindRiver products. The SystemTAP project also plans to use this type of
infrastructure to trace sites that are hard to instrument. The Linux Kernel
Markers have the support of Frank C. Eigler, author of their current
marker alternative (which he wishes to drop in order to adopt the
markers infrastructure as soon as it hits mainline).

Quoting Jim Keniston <[email protected]> :

"kprobes remains a vital foundation for SystemTap. But markers are
attractive as an alternate source of trace/debug info. Here's why:
[...]"

> It seems quite overcomplicated to me. Has the complexity been justified?
>

To summarize the document pointed to by the URL above, here are the
key goals of the markers, showing the rationale behind the most
important design choices :
- Almost imperceptible impact on production machines when compiled in
but markers are "disabled".
- Use a separate section to keep the data, to minimize d-cache
thrashing.
- Put the code (stack setup and function call) in the unlikely branch of
the if() condition to minimize i-cache impact.
- Since it is required to allow instrumentation of variables within
the body of a function, accept the impact on the compiler's
optimizations and let it keep the variables "live" sometimes longer
than required. It is up to the person who puts the marker in the
code to choose a location that will have a small impact in this
respect.
- Allow per-architecture optimized versions which remove the need for
a d-cache based branch (patch a "load immediate" instruction
instead). This minimizes the d-cache impact of disabled markers.
- Accept the cost of an unlikely branch at the marker site because the
gcc compiler does not give the ability to put "nops" instead of a
branch generated from C code. Keep this in mind for future
per-architecture optimizations.
- Instrumentation of challenging kernel sites
- Instrumentation such as that already provided by the lock
dependency checker (lockdep) and instrumentation of trap
handlers imply being reentrant in such contexts. Therefore, the
implementation must be lock-free and update the state in an atomic
fashion (rcu-style). It must also give the programmer who describes
a marker site the ability to specify what is forbidden in the probe
that will be connected to the marker : can it generate a trap ? Can
it call lockdep (irq disable, take any type of lock) ? Can it call
printk ? This is why flags can be passed to the _MARK() marker,
while the MARK() marker has the default flags.
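
To make the "almost zero cost when disabled" point concrete, here is a
rough, compilable user-space sketch of the fast path (illustrative
only : the real MARK() declares the per-site data itself, in a
dedicated section, from the marker name, and the optimized variant
patches a "load immediate" instead of loading the flag from memory) :

#include <stdarg.h>
#include <stdio.h>

#define unlikely(x)	__builtin_expect(!!(x), 0)

struct marker {
	const char *name;
	const char *format;
	char enabled;			/* flipped at runtime by the tracer */
	void (*probe)(const char *fmt, ...);
};

/* Disabled cost : one flag test and one predicted-not-taken branch;
 * the stack setup and function call sit in the unlikely branch. */
#define MARK(m, fmt, ...)					\
	do {							\
		if (unlikely((m).enabled))			\
			(m).probe(fmt, ## __VA_ARGS__);		\
	} while (0)

static void demo_probe(const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	vprintf(fmt, args);
	va_end(args);
}

static struct marker subsys_event = { "subsys_event", "%d\n", 0, demo_probe };

int main(void)
{
	MARK(subsys_event, "%d\n", 1);	/* disabled : branch only */
	subsys_event.enabled = 1;	/* a probe gets connected */
	MARK(subsys_event, "%d\n", 2);	/* now calls the probe */
	return 0;
}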

Please tell me if I forgot to explain the rationale behind some
implementation detail and I will be happy to explain in more depth.

Regards,

Mathieu


--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-02 00:31:25

by Rusty Russell

Subject: Re: 2.6.22 -mm merge plans

On Tue, 2007-05-01 at 14:17 +0200, Andi Kleen wrote:
> Andrew Morton <[email protected]> writes:
> > Will merge the rustyvisor.
>
> IMHO the user code still doesn't belong in Documentation.
> Also it needs another review round I guess. And some beta testing by
> more people.

Like any piece of code, more review and more testing would be great.
(Your earlier review was particularly useful!) But it's not clear that
waiting for longer will achieve either.

Look at kvm's experience for the reverse case: it went in, then got
rewritten.

As for the code in Documentation, my initial attempts tried to get
around the need for a userspace part by putting everything in the kernel
module. It meant you could launch a guest by writing a string
to /dev/lguest (no real ABI burden there), but it's a worse solution
than some user code in the kernel tree 8(

Cheers,
Rusty.

2007-05-02 10:31:01

by Andi Kleen

Subject: Re: 2.6.22 -mm merge plans

On Wed, May 02, 2007 at 10:31:10AM +1000, Rusty Russell wrote:
> On Tue, 2007-05-01 at 14:17 +0200, Andi Kleen wrote:
> > Andrew Morton <[email protected]> writes:
> > > Will merge the rustyvisor.
> >
> > IMHO the user code still doesn't belong in Documentation.
> > Also it needs another review round I guess. And some beta testing by
> > more people.
>
> Like any piece of code, more review and more testing would be great.
> (Your earlier review was particularly useful!) But it's not clear that
> waiting for longer will achieve either.

Not clear to me. Release a clear lguest patchkit with documentation
on l-k several times and you'll probably get both reviewers and testers.
Then the confidence level will rise.

>
> Look at kvm's experience for the reverse case: it went in, then got
> rewritten.

They at least already had some user base at that point.

> As for the code in Documentation, my initial attempts tried to get
> around the need for a userspace part by putting everything in the kernel
> module. It meant you could launch a guest by writing a string
> to /dev/lguest (no real ABI burden there), but it's a worse solution
> than some user code in the kernel tree 8(

Just put it into a separate tarball.

-Andi

2007-05-02 10:44:16

by Andi Kleen

Subject: Re: 2.6.22 -mm merge plans

> It is currently used as an instrumentation infrastructure for the LTTng
> tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> WindRiver products. The SystemTAP project also plans to use this type of
> infrastructure to trace sites that are hard to instrument. The Linux Kernel
> Markers have the support of Frank C. Eigler, author of their current
> marker alternative (which he wishes to drop in order to adopt the
> markers infrastructure as soon as it hits mainline).

All of the above don't use mainline kernels.
That doesn't constitute using it.

> Quoting Jim Keniston <[email protected]> :
>
> "kprobes remains a vital foundation for SystemTap. But markers are
> attractive as an alternate source of trace/debug info. Here's why:
> [...]"

Talk is cheap. Do they have working code to use it?

> - Allow per-architecture optimized versions which remove the need for
> a d-cache based branch (patch a "load immediate" instruction
> instead). This minimizes the d-cache impact of disabled markers.

That's a good idea in general, but should be generalized (available
independently), not hidden in your subsystem. I know a couple of places
that could use this successfully.

> - Accept the cost of an unlikely branch at the marker site because the
> gcc compiler does not give the ability to put "nops" instead of a
> branch generated from C code. Keep this in mind for future
> per-architecture optimizations.

See upcoming paravirt code for a way to do this.

> - Instrumentation of challenging kernel sites
> - Instrumentation such as that already provided by the lock
> dependency checker (lockdep) and instrumentation of trap
> handlers imply being reentrant in such contexts. Therefore, the
> implementation must be lock-free and update the state in an atomic
> fashion (rcu-style). It must also give the programmer who describes
> a marker site the ability to specify what is forbidden in the probe
> that will be connected to the marker : can it generate a trap ? Can
> it call lockdep (irq disable, take any type of lock) ? Can it call
> printk ? This is why flags can be passed to the _MARK() marker,
> while the MARK() marker has the default flags.

Why can't you just generally forbid probes from doing all of this?
It would greatly simplify your code, wouldn't it?

Keep it simple please.

> Please tell me if I forgot to explain the rationale behind some
> implementation detail and I will be happy to explain in more depth.

Having lots of optional flags to do things differently normally
sets off all the warning lights of early over-design. While Linux
has this sometimes, it is generally only in mature old subsystems.
But when something is freshly merged it shouldn't be like this.
That is because code tends to grow more complicated over its lifetime,
and when it is already complicated at the beginning it will eventually
fall over (you can study current slab as a poster child of this).

-Andi

2007-05-02 16:38:00

by Frank Ch. Eigler

Subject: Re: 2.6.22 -mm merge plans


Andi Kleen <[email protected]> writes:

> > [...] The SystemTAP project also plans to use this type of
> > infrastructure to trace sites that are hard to instrument. The Linux Kernel
> > Markers have the support of Frank C. Eigler, author of their current
> > marker alternative [...]
>
> All of the above don't use mainline kernels.
> That doesn't constitute using it.

Systemtap does run on mainline kernels.

> > "kprobes remains a vital foundation for SystemTap. But markers are
> > attractive as an alternate source of trace/debug info. Here's why:
> > [...]"
>
> Talk is cheap. Do they have working code to use it? [...]

We had been waiting on the chicken & egg semaphore. LTTng has had
working code since "yesterday" (months ago); systemtap will have it
"tomorrow" (a week or a few).


- FChE

2007-05-02 16:48:12

by Andrew Morton

Subject: Re: 2.6.22 -mm merge plans

On Wed, 2 May 2007 12:44:13 +0200 Andi Kleen <[email protected]> wrote:

> > It is currently used as an instrumentation infrastructure for the LTTng
> > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> > WindRiver products. The SystemTAP project also plans to use this type of
> > infrastructure to trace sites that are hard to instrument. The Linux Kernel
> > Markers have the support of Frank C. Eigler, author of their current
> > marker alternative (which he wishes to drop in order to adopt the
> > markers infrastructure as soon as it hits mainline).
>
> All of the above don't use mainline kernels.

That's because they have to add a markers patch!

> That doesn't constitute using it.

Andi, there was a huge amount of discussion about all this in September last
year (subjects: *markers* and *LTTng*). The outcome of all that was, I
believe, that the kernel should have a static marker infrastructure.

2007-05-02 17:24:57

by Mathieu Desnoyers

Subject: Re: 2.6.22 -mm merge plans


* Andi Kleen ([email protected]) wrote:
> > It is currently used as an instrumentation infrastructure for the LTTng
> > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> > WindRiver products. The SystemTAP project also plans to use this type of
> > infrastructure to trace sites that are hard to instrument. The Linux Kernel
> > Markers have the support of Frank C. Eigler, author of their current
> > marker alternative (which he wishes to drop in order to adopt the
> > markers infrastructure as soon as it hits mainline).
>
> All of the above don't use mainline kernels.
> That doesn't constitute using it.
>

I am afraid this argument does not hold :

- These companies are not shipping their products with mainline kernels
to make sure things have time to stabilize.
- They eventually get to the next version some time after it is not
"head" anymore. They still want to benefit from the features of the
newer versions.
- All these companies would be really happy to have a marker
infrastructure in mainline so they can stop applying a separate set of
patches to provide this functionality.
- Arguing that "they apply their set of patches anyway" goes
against the advice I have received from Greg KH, which can be
reworded as : please submit your patches to mainline instead of
keeping your separate set of patches. See his various presentations
about "mainlining" for more info about this.

Because of these 4 arguments, I think that these companies can be
considered users of, and contributors to, mainline kernels.


> > Quoting Jim Keniston <[email protected]> :
> >
> > "kprobes remains a vital foundation for SystemTap. But markers are
> > attractive as an alternate source of trace/debug info. Here's why:
> > [...]"
>
> Talk is cheap. Do they have working code to use it?
>

LTTng has been using the markers for about 6 months now. SystemTAP is
waiting on the "it hits mainline" signal before they switch from their
STP_MARK() markers to this infrastructure. Give them a few days and they
will proceed with the change.



> > - Allow per-architecture optimized versions which remove the need for
> > a d-cache based branch (patch a "load immediate" instruction
> > instead). This minimizes the d-cache impact of disabled markers.
>
> That's a good idea in general, but should be generalized (available
> independently), not hidden in your subsystem. I know a couple of places
> that could use this successfully.
>

I agree that an efficient hooking mechanism is useful to many users;
security hooks and instrumentation for tracing come to mind. What other
usage scenarios do you have in mind that could not fit in my marker
infrastructure ? I have tried to generalize this as much as possible,
but if you see, within this, a piece of infrastructure that could be
taken apart and used more widely, I will be happy to submit it
separately to increase its usefulness.


> > - Accept the cost of an unlikely branch at the marker site because the
> > gcc compiler does not give the ability to put "nops" instead of a
> > branch generated from C code. Keep this in mind for future
> > per-architecture optimizations.
>
> See upcoming paravirt code for a way to do this.
>

I have looked at the paravirt code in Andrew's 2.6.21-rc7-mm2. A few
reasons why I do not plan to use it :


1 - It requires specific arg setup for the calls to be crafted by hand,
in assembly, for each and every number of parameters and each type, for
each architecture. I use a variable argument list as a parameter to my
marker to make sure that a single macro can be used for markup in a
generic manner (a small sketch of this follows the list below).

Quoting : http://lkml.org/lkml/2007/4/4/577
"+ * Unfortunately there's no way to get gcc to generate the args setup
+ * for the call, and then allow the call itself to be generated by an
+ * inline asm. Because of this, we must do the complete arg setup and
+ * return value handling from within these macros. This is fairly
+ * cumbersome."


2 - I also provide an architecture independent "generic" version which
does not depend on per-architecture assembly. From what I see, paravirt
is only offered for i386 and x86_64. Are there any plans to support the
other ~12 architectures ? Does it offer an architecture-agnostic fallback
in the cases where it is not implemented for a given architecture ?


3 - It can only alter instructions "safely" in the UP case before the
other CPUs are turned on. See my arch/i386/marker.c code patcher for
XMC-safe instruction patching. Marker activation must be done at
runtime, when the system is fully operational.

Quoting 2.6.21 arch/i386/kernel/alternative.c
"/* Replace instructions with better alternatives for this CPU type.
This runs before SMP is initialized to avoid SMP problems with
self modifying code. This implies that assymetric systems where
APs have less capabilities than the boot processor are not handled.
Tough. Make sure you disable such features by hand. */

void apply_alternatives(struct alt_instr *start, struct alt_instr *end)"


4 - paravirt does not offer the ability to replace a branch instruction,
generated by gcc, through its mechanism. If I choose to use the paravirt
mechanism, I must do the stack setup and function call by hand, which
points (1) and (2) argue against. GCC must itself generate the
branch instruction to jump over the function call containing the
variable argument list.
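
As a small user-space illustration of point (1) (a sketch under my
design, not the patch's exact API) : the marker site passes a format
string plus a variable argument list, and the probe unpacks the
arguments with va_arg() according to the format it registered for :

#include <stdarg.h>
#include <stdio.h>

/* The probe side of a hypothetical "%p %p" marker. */
static void probe(const char *fmt, ...)
{
	va_list args;
	void *q, *rq;

	va_start(args, fmt);
	q = va_arg(args, void *);	/* first %p */
	rq = va_arg(args, void *);	/* second %p */
	va_end(args);

	printf("probe: q=%p rq=%p (format \"%s\")\n", q, rq, fmt);
}

int main(void)
{
	int queue, request;		/* stand-ins for kernel objects */

	/* A single generic call form works for any argument count. */
	probe("%p %p", (void *)&queue, (void *)&request);
	return 0;
}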


> > - Instrumentation of challenging kernel sites
> > - Instrumentation such as that already provided by the lock
> > dependency checker (lockdep) and instrumentation of trap
> > handlers imply being reentrant in such contexts. Therefore, the
> > implementation must be lock-free and update the state in an atomic
> > fashion (rcu-style). It must also give the programmer who describes
> > a marker site the ability to specify what is forbidden in the probe
> > that will be connected to the marker : can it generate a trap ? Can
> > it call lockdep (irq disable, take any type of lock) ? Can it call
> > printk ? This is why flags can be passed to the _MARK() marker,
> > while the MARK() marker has the default flags.
>
> Why can't you just generally forbid probes from doing all of this?
> It would greatly simplify your code, wouldn't it?
>
> Keep it simple please.
>

An example, taken from the marker mechanism itself (no probe involved),
shows how difficult it can be to "forbid all of this" :

The optimized version patches code while the system is live. This
implies cross-modifying code in an SMP environment. It can be done safely
on x86 and x86_64 by using a breakpoint during the code modification to
make sure the CPU issues a serializing instruction between the moment a
given CPU speculates the code execution and actually reaches it. It
implies going through a trap, which does funny things such as enabling
interrupts, which calls into lockdep. Therefore, adding a marker into
the lockdep code cannot be done with a breakpoint-based marker on these
architectures. We have to provide an alternative way to do this, less
intrusive, which is exactly what the "generic" markers provide. The same
applies to instrumentation of the breakpoint trap handler.

I strongly doubt that _every_ user of the markers would be comfortable
with the "write your code so it does not take any lock and does
everything atomically" constraint. I have done it in LTTng so I could
have a fully reentrant tracer, but even then you can be limited by the
nature of where you want to send the data. Richard Purdie implemented a
serial port based data relay as an alternative data relay mechanism
connected to LTTng; he needed a spinlock because of the semantics of his
port, so he has to accept the limitation regarding the sites that can
and cannot be probed. Providing an explicit declaration of site
limitations makes sense in this regard.

On other architectures, it is the time source which requires a read
seqlock. It is not atomic in the sense that a reader can nest over a
writer (if coming from NMI context) and spin forever.
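
To illustrate the nesting problem (a user-space model only, not the
kernel's seqlock implementation) : the read side spins until the
sequence count is even and unchanged, so a reader nested over the
writer on the same CPU can never make progress :

#include <stdio.h>

struct seqcount { unsigned seq; };	/* even: idle, odd: write in progress */

static unsigned read_begin(const struct seqcount *s)
{
	unsigned ret;

	while ((ret = s->seq) & 1)
		;			/* writer in progress : spin */
	return ret;
}

static int read_retry(const struct seqcount *s, unsigned start)
{
	return s->seq != start;		/* raced with a writer : retry */
}

int main(void)
{
	struct seqcount clock = { 0 };
	unsigned start;

	do {
		start = read_begin(&clock);
		/* sample the time source here */
	} while (read_retry(&clock, start));

	/*
	 * If this read loop ran from an NMI that interrupted the
	 * writer (seq odd) on the same CPU, read_begin() above would
	 * spin forever : that is the nesting problem described above.
	 */
	printf("consistent read at sequence %u\n", start);
	return 0;
}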

I can list a lot of situations where we cannot _require_ the probe to
run atomically in every aspect; so generally forbidding these actions
does not seem to be a viable solution. In fact, this would be the best
way to make sure the marker infrastructure is never used by early
adopters, because of the complexity of writing probes under these
"rules".


> > Please tell me if I forgot to explain the rationale behind some
> > implementation detail and I will be happy to explain in more depth.
>
> Having lots of optional flags to do things differently normally
> sets off all the warning lights of early over-design. While Linux
> has this sometimes, it is generally only in mature old subsystems.
> But when something is freshly merged it shouldn't be like this.
> That is because code tends to grow more complicated over its lifetime,
> and when it is already complicated at the beginning it will eventually
> fall over (you can study current slab as a poster child of this).
>
> -Andi
>

Explicitly identifying "hard to instrument" sites is nothing new. It has
been done in different manners in the past. Kprobes sprinkles
"__kprobes" declarations before function declarations all over the place
to specify which ones cannot be safely instrumented. It results in a
visually less appealing source code and it limits the sites that can be
probed.

The goal of the marker infrastructure is exactly to instrument those
sites. Therefore, the approach "we forbid instrumentation of sites hard
to instrument" misses the point of this infrastructure. We can leverage
the fact that the marker is put in a context known by the programmer; it
makes sense to give him the ability to specify the restrictions
on the probes connected to this marker with some level of granularity.
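
As a rough sketch of what this granularity could look like (the flag
names and the check are made up for illustration; they are not the
actual flags of the patch) :

#include <stdio.h>

/* Things a probe may do; a marker site forbids a subset of them. */
#define MF_TRAP		(1u << 0)	/* may fault / generate a trap */
#define MF_LOCK		(1u << 1)	/* may take locks / call lockdep */
#define MF_PRINTK	(1u << 2)	/* may call printk */

struct marker_site {
	const char *name;
	unsigned int forbidden;		/* what probes must NOT do here */
};

/* A default-flags MARK() site forbids nothing; a _MARK() site in a
 * trap handler or in lockdep itself forbids everything risky. */
static const struct marker_site sites[] = {
	{ "trap_entry",		MF_TRAP | MF_LOCK | MF_PRINTK },
	{ "elv_insert_request",	0u },
};

static int probe_allowed(const struct marker_site *s, unsigned int does)
{
	return (s->forbidden & does) == 0;
}

int main(void)
{
	unsigned int does = MF_PRINTK;	/* a probe that logs via printk */
	unsigned int i;

	for (i = 0; i < sizeof(sites) / sizeof(sites[0]); i++)
		printf("%s: %s\n", sites[i].name,
		       probe_allowed(&sites[i], does) ? "ok" : "refused");
	return 0;
}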


Regards,

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-02 17:29:52

by Christoph Hellwig

Subject: Re: 2.6.22 -mm merge plans

On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > That doesn't constitute using it.
>
> Andi, there was a huge amount of discussion about all this in September last
> year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> believe, that the kernel should have a static marker infrastructure.

Only when it's actually usable. A prerequisite for merging it is
having an actual trace transport infrastructure as well as a few
actually useful tracing modules in the kernel tree.

Let this count as a vote to merge the markers once we have the
infrastructure above ready; it'll be very useful then.

2007-05-02 17:49:44

by Andi Kleen

Subject: Re: 2.6.22 -mm merge plans

On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> On Wed, 2 May 2007 12:44:13 +0200 Andi Kleen <[email protected]> wrote:
>
> > > It is currently used as an instrumentation infrastructure for the LTTng
> > > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> > > WindRiver products. The SystemTAP project also plans to use this type of
> > > infrastructure to trace sites that are hard to instrument. The Linux Kernel
> > > Markers have the support of Frank C. Eigler, author of their current
> > > marker alternative (which he wishes to drop in order to adopt the
> > > markers infrastructure as soon as it hits mainline).
> >
> > All of the above don't use mainline kernels.
>
> That's because they have to add a markers patch!

I meant they use very old kernels. Their experience doesn't apply
to bitrot in mainline.

> > That doesn't constitute using it.
>
> Andi, there was a huge amount of discussion about all this in September last
> year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> believe, that the kernel should have a static marker infrastructure.

I have no problem with that in principle; just some doubts about
the current proposed implementation: in particular its complexity.
And also I think when something is merged it should have some users in tree.

-Andi

2007-05-02 20:41:38

by Mathieu Desnoyers

Subject: Re: 2.6.22 -mm merge plans

* Christoph Hellwig ([email protected]) wrote:
> On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > That doesn't constitute using it.
> >
> > Andi, there was a huge amount of discussion about all this in September last
> > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > believe, that the kernel should have a static marker infrastructure.
>
> Only when it's actually usable. A prerequisite for merging it is
> having an actual trace transport infrastructure as well as a few
> actually useful tracing modules in the kernel tree.
>
> Let this count as a vote to merge the markers once we have the
> infrastructure above ready; it'll be very useful then.

Hi Christoph,

The idea is the following : either we integrate the infrastructure for
instrumentation / data serialization / buffer management / extraction of
data to user space in multiple different steps, which makes code review
easier for you guys, or we bring in the main pieces of the LTTng project
together with the Linux Kernel Markers, which would result in a bigger
change.

Based on the premise that discussing logically distinct pieces of
infrastructure is easier and can be done more thoroughly when done
separately, we decided to submit the markers first, with the other
pieces planned for the near future.

I agree that it would be very useful to have the full tracing stack
available in the Linux kernel, but we inevitably face the argument :
"this change is too big" if we submit all LTTng modules at once or
the argument : "we want the whole tracing stack, not just part of it"
if we don't.

This is why we chose to push the tracing infrastructure chunk by chunk :
to make code review and criticism more efficient.

Regards,

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-02 20:54:10

by Andrew Morton

Subject: Re: 2.6.22 -mm merge plans

On Wed, 2 May 2007 16:36:27 -0400
Mathieu Desnoyers <[email protected]> wrote:

> * Christoph Hellwig ([email protected]) wrote:
> > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > > That doesn't constitute using it.
> > >
> > > Andi, there was a huge amount of discussion about all this in September last
> > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > > believe, that the kernel should have a static marker infrastructure.
> >
> > Only when it's actually useable. A prerequisite for merging it is
> > having an actual trace transport infrastructure aswell as a few actually
> > useful tracing modules in the kernel tree.
> >
> > Let this count as a vote to merge the markers once we have the infrastructure
> > above ready, it'll be very useful then.
>
> Hi Christoph,
>
> The idea is the following : either we integrate the infrastructure for
> instrumentation / data serialization / buffer management / extraction of
> data to user space in multiple different steps, which makes code review
> easier for you guys, or we bring in the main pieces of the LTTng project
> together with the Linux Kernel Markers, which would result in a bigger
> change.
>
> Based on the premise that discussing logically distinct pieces of
> infrastructure is easier and can be done more thoroughly when done
> separately, we decided to submit the markers first, with the other
> pieces planned for the near future.
>
> I agree that it would be very useful to have the full tracing stack
> available in the Linux kernel, but we inevitably face the argument :
> "this change is too big" if we submit all LTTng modules at once or
> the argument : "we want the whole tracing stack, not just part of it"
> if we don't.
>
> This is why we chose to push the tracing infrastructure chunk by chunk :
> to make code review and criticism more efficient.
>

I didn't know that this was the plan.

The problem I have with this is that once we've merged one part, we're
committed to merging the other parts even though we haven't seen them yet.

What happens if there's a revolt over the next set of patches? Do we
remove the core markers patches again? We end up in a can't-go-forward,
can't-go-backward situation.

I thought the existing code was useful as-is for several projects, without
requiring additional patching to core kernel. If such additional patching
_is_ needed to make the markers code useful then I agree that we should
continue to buffer the markers code in -mm until the
use-markers-for-something patches have been eyeballed.

In which case we have:

atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
atomich-complete-atomic_long-operations-in-asm-generic.patch
atomich-i386-type-safety-fix.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
local_t-architecture-independant-extension.patch
local_t-alpha-extension.patch
local_t-i386-extension.patch
local_t-ia64-extension.patch
local_t-mips-extension.patch
local_t-parisc-cleanup.patch
local_t-powerpc-extension.patch
local_t-sparc64-cleanup.patch
local_t-x86_64-extension.patch

For 2.6.22

linux-kernel-markers-kconfig-menus.patch
linux-kernel-markers-architecture-independant-code.patch
linux-kernel-markers-powerpc-optimization.patch
linux-kernel-markers-i386-optimization.patch
markers-add-instrumentation-markers-menus-to-avr32.patch
linux-kernel-markers-non-optimized-architectures.patch
markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
linux-kernel-markers-documentation.patch
#
markers-define-the-linker-macro-extra_rwdata.patch
markers-use-extra_rwdata-in-architectures.patch
#
some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
no-longer-include-asm-kdebugh.patch

Hold.

2007-05-02 21:43:50

by Tilman Schmidt

Subject: Re: 2.6.22 -mm merge plans

On 02.05.2007 19:49, Andi Kleen wrote:
> And also I think when something is merged it should have some users in tree.

Isn't that a circular dependency?

--
Tilman Schmidt E-Mail: [email protected]
Bonn, Germany
- Undetected errors are handled as if no error occurred. (IBM) -



2007-05-02 23:11:09

by Mathieu Desnoyers

Subject: Re: 2.6.22 -mm merge plans

* Andrew Morton ([email protected]) wrote:
> On Wed, 2 May 2007 16:36:27 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>
> > * Christoph Hellwig ([email protected]) wrote:
> > > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > > > That doesn't constitute using it.
> > > >
> > > > Andi, there was a huge amount of discussion about all this in September last
> > > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > > > believe, that the kernel should have a static marker infrastructure.
> > >
> > > Only when it's actually usable. A prerequisite for merging it is
> > > having an actual trace transport infrastructure as well as a few
> > > actually useful tracing modules in the kernel tree.
> > >
> > > Let this count as a vote to merge the markers once we have the
> > > infrastructure above ready; it'll be very useful then.
> >
> > Hi Christoph,
> >
> > The idea is the following : either we integrate the infrastructure for
> > instrumentation / data serialization / buffer management / extraction of
> > data to user space in multiple different steps, which makes code review
> > easier for you guys, or we bring in the main pieces of the LTTng project
> > together with the Linux Kernel Markers, which would result in a bigger
> > change.
> >
> > Based on the premise that discussing logically distinct pieces of
> > infrastructure is easier and can be done more thoroughly when done
> > separately, we decided to submit the markers first, with the other
> > pieces planned for the near future.
> >
> > I agree that it would be very useful to have the full tracing stack
> > available in the Linux kernel, but we inevitably face the argument :
> > "this change is too big" if we submit all LTTng modules at once or
> > the argument : "we want the whole tracing stack, not just part of it"
> > if we don't.
> >
> > This is why we chose to push the tracing infrastructure chunk by chunk :
> > to make code review and criticism more efficient.
> >
>
> I didn't know that this was the plan.
>
> The problem I have with this is that once we've merged one part, we're
> committed to merging the other parts even though we haven't seen them yet.
>
> What happens if there's a revolt over the next set of patches? Do we
> > remove the core markers patches again? We end up in a can't-go-forward,
> > can't-go-backward situation.
>
> I thought the existing code was useful as-is for several projects, without
> requiring additional patching to core kernel. If such additional patching
> _is_ needed to make the markers code useful then I agree that we should
> continue to buffer the markers code in -mm until the
> use-markers-for-something patches have been eyeballed.
>

My statement was probably not clear enough. The actual marker code is
useful as-is without any further kernel patching required : SystemTAP is
an example where they use external modules to load probes that can
connect either to markers or through kprobes. LTTng, in its current state,
has a mostly modular core that also uses the markers.

Although some, like Christoph and myself, think that it would benefit
the kernel community to have a common infrastructure for more than just
markers (meaning common serialization and buffering mechanism), it does
not change the fact that the markers, being in mainline, are usable by
projects through additional kernel modules.
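
For instance, a minimal external module connecting one probe would look
roughly like this (a sketch : the marker name and probe body are made
up, but marker_set_probe()/marker_remove_probe() and the probe
signature follow the blktrace conversion shown below) :

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/marker.h>
#include <linux/rcupdate.h>

static void my_probe(const struct __mark_marker_data *mdata,
		     const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	/* unpack the arguments according to fmt, hand them to a tracer */
	va_end(args);
}

static int __init my_init(void)
{
	if (!marker_set_probe("subsys_event", "%p %d", my_probe, NULL))
		printk(KERN_INFO "unable to register probe subsys_event\n");
	return 0;
}

static void __exit my_exit(void)
{
	marker_remove_probe("subsys_event");
	synchronize_sched();	/* wait for in-flight probes to finish */
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");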

If we are looking at current "potential users" that are already in
mainline, we could change blktrace to make it use the markers.

Mathieu


> In which case we have:
>
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
> atomich-complete-atomic_long-operations-in-asm-generic.patch
> atomich-i386-type-safety-fix.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
> atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
> local_t-architecture-independant-extension.patch
> local_t-alpha-extension.patch
> local_t-i386-extension.patch
> local_t-ia64-extension.patch
> local_t-mips-extension.patch
> local_t-parisc-cleanup.patch
> local_t-powerpc-extension.patch
> local_t-sparc64-cleanup.patch
> local_t-x86_64-extension.patch
>
> For 2.6.22
>
> linux-kernel-markers-kconfig-menus.patch
> linux-kernel-markers-architecture-independant-code.patch
> linux-kernel-markers-powerpc-optimization.patch
> linux-kernel-markers-i386-optimization.patch
> markers-add-instrumentation-markers-menus-to-avr32.patch
> linux-kernel-markers-non-optimized-architectures.patch
> markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
> linux-kernel-markers-documentation.patch
> #
> markers-define-the-linker-macro-extra_rwdata.patch
> markers-use-extra_rwdata-in-architectures.patch
> #
> some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
> no-longer-include-asm-kdebugh.patch
>
> Hold.
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-02 23:22:01

by Andrew Morton

Subject: Re: 2.6.22 -mm merge plans

On Wed, 2 May 2007 19:11:04 -0400
Mathieu Desnoyers <[email protected]> wrote:

> > I didn't know that this was the plan.
> >
> > The problem I have with this is that once we've merged one part, we're
> > committed to merging the other parts even though we haven't seen them yet.
> >
> > What happens if there's a revolt over the next set of patches? Do we
> > remove the core markers patches again? We end up in a can't-go-forward,
> > can't-go-backward situation.
> >
> > I thought the existing code was useful as-is for several projects, without
> > requiring additional patching to core kernel. If such additional patching
> > _is_ needed to make the markers code useful then I agree that we should
> > continue to buffer the markers code in -mm until the
> > use-markers-for-something patches have been eyeballed.
> >
>
> My statement was probably not clear enough. The actual marker code is
> useful as-is without any further kernel patching required : SystemTAP is
> an example where they use external modules to load probes that can
> connect either to markers or through kprobes. LTTng, in its current state,
> has a mostly modular core that also uses the markers.

OK, that's what I thought.

> Although some, like Christoph and myself, think that it would benefit
> the kernel community to have a common infrastructure for more than just
> markers (meaning common serialization and buffering mechanism), it does
> not change the fact that the markers, being in mainline, are usable by
> projects through additional kernel modules.
>
> If we are looking at current "potential users" that are already in
> mainline, we could change blktrace to make it use the markers.

That'd be a useful demonstration.

2007-05-03 08:06:54

by Christoph Hellwig

Subject: Re: 2.6.22 -mm merge plans

On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote:
> My statement was probably not clear enough. The actual marker code is
> useful as-is without any further kernel patching required : SystemTAP is
> an example where they use external modules to load probes that can
> connect either to markers or through kprobes. LTTng, in its current state,
> has a mostly modular core that also uses the markers.

That just means you have to load an enormous amount of external crap
that replaces the missing kernel functionality. It's exactly the
situation we want to avoid.

2007-05-03 08:08:22

by Christoph Hellwig

Subject: Re: 2.6.22 -mm merge plans

On Wed, May 02, 2007 at 04:36:27PM -0400, Mathieu Desnoyers wrote:
> The idea is the following : either we integrate the infrastructure for
> instrumentation / data serialization / buffer management / extraction of
> data to user space in multiple different steps, which makes code review
> easier for you guys,

the staging area for that is -mm

2007-05-03 08:09:16

by Christoph Hellwig

Subject: Re: 2.6.22 -mm merge plans

On Wed, May 02, 2007 at 01:53:36PM -0700, Andrew Morton wrote:
> In which case we have:
>
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
> atomich-complete-atomic_long-operations-in-asm-generic.patch
> atomich-i386-type-safety-fix.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
> atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
> local_t-architecture-independant-extension.patch
> local_t-alpha-extension.patch
> local_t-i386-extension.patch
> local_t-ia64-extension.patch
> local_t-mips-extension.patch
> local_t-parisc-cleanup.patch
> local_t-powerpc-extension.patch
> local_t-sparc64-cleanup.patch
> local_t-x86_64-extension.patch
>
> For 2.6.22
>
> linux-kernel-markers-kconfig-menus.patch
> linux-kernel-markers-architecture-independant-code.patch
> linux-kernel-markers-powerpc-optimization.patch
> linux-kernel-markers-i386-optimization.patch
> markers-add-instrumentation-markers-menus-to-avr32.patch
> linux-kernel-markers-non-optimized-architectures.patch
> markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
> linux-kernel-markers-documentation.patch
> #
> markers-define-the-linker-macro-extra_rwdata.patch
> markers-use-extra_rwdata-in-architectures.patch
> #
> some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
> no-longer-include-asm-kdebugh.patch

This is a plan I can fully agree with.

2007-05-03 10:12:34

by Andi Kleen

Subject: Re: 2.6.22 -mm merge plans

On Wed, May 02, 2007 at 11:46:40PM +0200, Tilman Schmidt wrote:
> Am 02.05.2007 19:49 schrieb Andi Kleen:
> > And also I think when something is merged it should have some users in tree.
>
> Isn't that a circular dependency?

The normal mode of operation is to merge the initial users and the
subsystem at the same time.

-Andi

2007-05-03 10:31:53

by Andi Kleen

Subject: Re: 2.6.22 -mm merge plans

> If we are looking at current "potential users" that are already in
> mainline, we could change blktrace to make it use the markers.

Ok, but do it step by step:
- split out useful pieces like the "patched in enable/disable flags"
and submit them separately with an example user or two
[I got a couple of candidates e.g. with some of the sysctls in VM or
networking]
- post and merge that.
- don't implement anything initially that is not needed by blktrace
- post a minimal marker patch together with the blktrace
conversion for review again on linux-kernel
- await review comments. This review would not cover the basic
need of markers, just the specific implementation.
- then potentially incorporate review comments
- then merge
- later add features with individual review/discussion as new users in the
kernel are added.

-Andi

2007-05-03 14:48:40

by Mathieu Desnoyers

Subject: Re: 2.6.22 -mm merge plans

* Christoph Hellwig ([email protected]) wrote:
> On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote:
> > My statement was probably not clear enough. The actual marker code is
> > useful as-is without any further kernel patching required : SystemTAP is
> > an example where they use external modules to load probes that can
> > connect either to markers or through kprobes. LTTng, in its current state,
> > has a mostly modular core that also uses the markers.
>
> That just means you have to load an enormous amount of external crap
> that replaces the missing kernel functionality. It's exactly the
> situation we want to avoid.
>

It makes sense to use -mm to hold the whole usable infrastructure before
submitting it to mainline. I will submit my core LTTng patches to Andrew
in the following weeks. There is no hurry, from the LTTng perspective, to
merge the markers sooner, although they could be useful to other
(external) projects meanwhile.

Mathieu


--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-03 14:49:26

by Mathieu Desnoyers

Subject: Re: 2.6.22 -mm merge plans

Hi Andi,

This plan makes sense. I will split the "patched in enable/disable
flags" part into a separate piece (good idea!) and then submit the LTTng
core to Andrew. Christoph has a good point about wanting a usable
infrastructure to go in. Regarding your plan, I must argue that
blktrace is not a general-purpose tracing infrastructure, but one
dedicated to block io tracing. Therefore, it makes sense to bring in
the generic infrastructure first and then convert blktrace to it.

Mathieu

* Andi Kleen ([email protected]) wrote:
> > If we are looking at current "potential users" that are already in
> > mainline, we could change blktrace to make it use the markers.
>
> Ok, but do it step by step:
> - split out useful pieces like the "patched in enable/disable flags"
> and submit them separately with an example user or two
> [I got a couple of candidates e.g. with some of the sysctls in VM or
> networking]
> - post and merge that.
> - don't implement anything initially that is not needed by blktrace
> - post a minimal marker patch together with the blktrace
> conversion for review again on linux-kernel
> - await review comments. This review would not cover the basic
> need of markers, just the specific implementation.
> - then potentially incorporate review comments
> - then merge
> - later add features with individual review/discussion as new users in the
> kernel are added.
>
> -Andi

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-03 15:09:34

by Mathieu Desnoyers

Subject: Re: 2.6.22 -mm merge plans

* Andrew Morton ([email protected]) wrote:
> > Although some, like Christoph and myself, think that it would benefit to
> > the kernel community to have a common infrastructure for more than just
> > markers (meaning common serialization and buffering mechanism), it does
> > not change the fact that the markers, being in mainline, are usable by
> > projects through additional kernel modules.
> >
> > If we are looking at current "potential users" that are already in
> > mainline, we could change blktrace to make it use the markers.
>
> That'd be a useful demonstration.

Here is a proof-of-concept patch, for demonstration purposes, of moving
blktrace to the markers.

A few remarks : this patch has the positive effect of removing some code
from the block io tracing hot paths, minimizing the i-cache impact in a
system where the io tracing is compiled in but inactive.

It also moves the blk tracing code from a header (and therefore from the
body of the instrumented functions) to a separate C file.

There, as soon as one device has to be traced, every device has to
fall into the tracing function call. This is slower than the previous
inline function, which tested the condition quickly. If it becomes a
show-stopper, it could be fixed by having the possibility to test a
supplementary condition, dependent on the marker context, at the marker
site, just after the enable/disable test.

It does not make the code smaller, since I left all the specialized
tracing functions for requests, bio, generic, remap, which would go away
once a generic infrastructure is in place to serialize the information
passed to the marker. This is mostly why I consider it a proof of
concept.

Patch named "markers-port-blktrace-to-markers.patch", can be placed
after the marker patches in the 2.6.21-rc7-mm2 series.

Signed-off-by: Mathieu Desnoyers <[email protected]>

Index: linux-2.6-lttng/block/elevator.c
===================================================================
--- linux-2.6-lttng.orig/block/elevator.c 2007-05-02 20:33:22.000000000 -0400
+++ linux-2.6-lttng/block/elevator.c 2007-05-02 20:33:49.000000000 -0400
@@ -32,7 +32,7 @@
#include <linux/init.h>
#include <linux/compiler.h>
#include <linux/delay.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <linux/hash.h>

#include <asm/uaccess.h>
@@ -571,7 +571,7 @@
unsigned ordseq;
int unplug_it = 1;

- blk_add_trace_rq(q, rq, BLK_TA_INSERT);
+ MARK(blk_request_insert, "%p %p", q, rq);

rq->q = q;

@@ -757,7 +757,7 @@
* not be passed by new incoming requests
*/
rq->cmd_flags |= REQ_STARTED;
- blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+ MARK(blk_request_issue, "%p %p", q, rq);
}

if (!q->boundary_rq || q->boundary_rq == rq) {
Index: linux-2.6-lttng/block/ll_rw_blk.c
===================================================================
--- linux-2.6-lttng.orig/block/ll_rw_blk.c 2007-05-02 20:33:32.000000000 -0400
+++ linux-2.6-lttng/block/ll_rw_blk.c 2007-05-02 23:21:02.000000000 -0400
@@ -28,6 +28,7 @@
#include <linux/task_io_accounting_ops.h>
#include <linux/interrupt.h>
#include <linux/cpu.h>
+#include <linux/marker.h>
#include <linux/blktrace_api.h>
#include <linux/fault-inject.h>

@@ -1551,7 +1552,7 @@

if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
- blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
+ MARK(blk_plug_device, "%p %p %d", q, NULL, 0);
}
}

@@ -1617,7 +1618,7 @@
* devices don't necessarily have an ->unplug_fn defined
*/
if (q->unplug_fn) {
- blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+ MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);

q->unplug_fn(q);
@@ -1628,7 +1629,7 @@
{
request_queue_t *q = container_of(work, request_queue_t, unplug_work);

- blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+ MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);

q->unplug_fn(q);
@@ -1638,7 +1639,7 @@
{
request_queue_t *q = (request_queue_t *)data;

- blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
+ MARK(blk_pdu_unplug_timer, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);

kblockd_schedule_work(&q->unplug_work);
@@ -2148,7 +2149,7 @@

rq_init(q, rq);

- blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
+ MARK(blk_get_request, "%p %p %d", q, bio, rw);
out:
return rq;
}
@@ -2178,7 +2179,7 @@
if (!rq) {
struct io_context *ioc;

- blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+ MARK(blk_sleep_request, "%p %p %d", q, bio, rw);

__generic_unplug_device(q);
spin_unlock_irq(q->queue_lock);
@@ -2252,7 +2253,7 @@
*/
void blk_requeue_request(request_queue_t *q, struct request *rq)
{
- blk_add_trace_rq(q, rq, BLK_TA_REQUEUE);
+ MARK(blk_requeue, "%p %p", q, rq);

if (blk_rq_tagged(rq))
blk_queue_end_tag(q, rq);
@@ -2937,7 +2938,7 @@
if (!ll_back_merge_fn(q, req, bio))
break;

- blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE);
+ MARK(blk_bio_backmerge, "%p %p", q, bio);

req->biotail->bi_next = bio;
req->biotail = bio;
@@ -2954,7 +2955,7 @@
if (!ll_front_merge_fn(q, req, bio))
break;

- blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE);
+ MARK(blk_bio_frontmerge, "%p %p", q, bio);

bio->bi_next = req->bio;
req->bio = bio;
@@ -3184,10 +3185,10 @@
blk_partition_remap(bio);

if (old_sector != -1)
- blk_add_trace_remap(q, bio, old_dev, bio->bi_sector,
- old_sector);
+ MARK(blk_remap, "%p %p %u %llu %llu", q, bio, old_dev,
+ (u64)bio->bi_sector, (u64)old_sector);

- blk_add_trace_bio(q, bio, BLK_TA_QUEUE);
+ MARK(blk_bio_queue, "%p %p", q, bio);

old_sector = bio->bi_sector;
old_dev = bio->bi_bdev->bd_dev;
@@ -3329,7 +3330,7 @@
int total_bytes, bio_nbytes, error, next_idx = 0;
struct bio *bio;

- blk_add_trace_rq(req->q, req, BLK_TA_COMPLETE);
+ MARK(blk_request_complete, "%p %p", req->q, req);

/*
* extend uptodate bool to allow < 0 value to be direct io error
Index: linux-2.6-lttng/block/Kconfig
===================================================================
--- linux-2.6-lttng.orig/block/Kconfig 2007-05-02 20:34:30.000000000 -0400
+++ linux-2.6-lttng/block/Kconfig 2007-05-02 20:34:53.000000000 -0400
@@ -32,6 +32,7 @@
depends on SYSFS
select RELAY
select DEBUG_FS
+ select MARKERS
help
Say Y here, if you want to be able to trace the block layer actions
on a given queue. Tracing allows you to see any traffic happening
Index: linux-2.6-lttng/block/Makefile
===================================================================
--- linux-2.6-lttng.orig/block/Makefile 2007-05-02 21:20:30.000000000 -0400
+++ linux-2.6-lttng/block/Makefile 2007-05-02 21:20:46.000000000 -0400
@@ -9,4 +9,4 @@
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o

-obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
+obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o blk-probe.o
Index: linux-2.6-lttng/block/blk-probe.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/block/blk-probe.c 2007-05-02 23:43:44.000000000 -0400
@@ -0,0 +1,276 @@
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/crc32.h>
+#include <linux/marker.h>
+#include <linux/blktrace_api.h>
+
+
+/**
+ * blk_add_trace_rq - Add a trace for a request oriented action
+ * Expected variable arguments :
+ * @q: queue the io is for
+ * @rq: the source request
+ *
+ * Description:
+ * Records an action against a request. Will log the bio offset + size.
+ *
+ **/
+static void blk_add_trace_rq(const struct __mark_marker_data *mdata,
+ const char *fmt, ...)
+{
+ va_list args;
+ u32 what;
+ struct blk_trace *bt;
+ int rw;
+ struct blk_probe_data *pinfo = mdata->pdata;
+ struct request_queue *q;
+ struct request *rq;
+
+ va_start(args, fmt);
+ q = va_arg(args, struct request_queue *);
+ rq = va_arg(args, struct request *);
+ va_end(args);
+
+ what = pinfo->flags;
+ bt = q->blk_trace;
+ rw = rq->cmd_flags & 0x03;
+
+ if (likely(!bt))
+ return;
+
+ if (blk_pc_request(rq)) {
+ what |= BLK_TC_ACT(BLK_TC_PC);
+ __blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
+ } else {
+ what |= BLK_TC_ACT(BLK_TC_FS);
+ __blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
+ }
+}
+
+/**
+ * blk_add_trace_bio - Add a trace for a bio oriented action
+ * Expected variable arguments :
+ * @q: queue the io is for
+ * @bio: the source bio
+ *
+ * Description:
+ * Records an action against a bio. Will log the bio offset + size.
+ *
+ **/
+static void blk_add_trace_bio(const struct __mark_marker_data *mdata,
+ const char *fmt, ...)
+{
+ va_list args;
+ u32 what;
+ struct blk_trace *bt;
+ struct blk_probe_data *pinfo = mdata->pdata;
+ struct request_queue *q;
+ struct bio *bio;
+
+ va_start(args, fmt);
+ q = va_arg(args, struct request_queue *);
+ bio = va_arg(args, struct bio *);
+ va_end(args);
+
+ what = pinfo->flags;
+ bt = q->blk_trace;
+
+ if (likely(!bt))
+ return;
+
+ __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
+}
+
+/**
+ * blk_add_trace_generic - Add a trace for a generic action
+ * Expected variable arguments :
+ * @q: queue the io is for
+ * @bio: the source bio
+ * @rw: the data direction
+ *
+ * Description:
+ * Records a simple trace
+ *
+ **/
+static void blk_add_trace_generic(const struct __mark_marker_data *mdata,
+ const char *fmt, ...)
+{
+ va_list args;
+ struct blk_trace *bt;
+ u32 what;
+ struct blk_probe_data *pinfo = mdata->pdata;
+ struct request_queue *q;
+ struct bio *bio;
+ int rw;
+
+ va_start(args, fmt);
+ q = va_arg(args, struct request_queue *);
+ bio = va_arg(args, struct bio *);
+ rw = va_arg(args, int);
+ va_end(args);
+
+ what = pinfo->flags;
+ bt = q->blk_trace;
+
+ if (likely(!bt))
+ return;
+
+ if (bio)
+ blk_add_trace_bio(mdata, "%p %p", q, bio);
+ else
+ __blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
+}
+
+/**
+ * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload
+ * Expected variable arguments :
+ * @q: queue the io is for
+ * @bio: the source bio
+ * @pdu: the integer payload
+ *
+ * Description:
+ * Adds a trace with some integer payload. This might be an unplug
+ * option given as the action, with the depth at unplug time given
+ * as the payload
+ *
+ **/
+static void blk_add_trace_pdu_int(const struct __mark_marker_data *mdata,
+ const char *fmt, ...)
+{
+ va_list args;
+ struct blk_trace *bt;
+ u32 what;
+ struct blk_probe_data *pinfo = mdata->pdata;
+ struct request_queue *q;
+ struct bio *bio;
+ unsigned int pdu;
+ __be64 rpdu;
+
+ va_start(args, fmt);
+ q = va_arg(args, struct request_queue *);
+ bio = va_arg(args, struct bio *);
+ pdu = va_arg(args, unsigned int);
+ va_end(args);
+
+ what = pinfo->flags;
+ bt = q->blk_trace;
+ rpdu = cpu_to_be64(pdu);
+
+ if (likely(!bt))
+ return;
+
+ if (bio)
+ __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu);
+ else
+ __blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
+}
+
+/**
+ * blk_add_trace_remap - Add a trace for a remap operation
+ * Expected variable arguments :
+ * @q: queue the io is for
+ * @bio: the source bio
+ * @dev: target device
+ * @from: source sector
+ * @to: target sector
+ *
+ * Description:
+ * Device mapper or raid target sometimes need to split a bio because
+ * it spans a stripe (or similar). Add a trace for that action.
+ *
+ **/
+static void blk_add_trace_remap(const struct __mark_marker_data *mdata,
+ const char *fmt, ...)
+{
+ va_list args;
+ struct blk_trace *bt;
+ struct blk_io_trace_remap r;
+ u32 what;
+ struct blk_probe_data *pinfo = mdata->pdata;
+ struct request_queue *q;
+ struct bio *bio;
+ u64 dev, from, to;
+
+ va_start(args, fmt);
+ q = va_arg(args, struct request_queue *);
+ bio = va_arg(args, struct bio *);
+ dev = va_arg(args, u64);
+ from = va_arg(args, u64);
+ to = va_arg(args, u64);
+ va_end(args);
+
+ what = pinfo->flags;
+ bt = q->blk_trace;
+
+ if (likely(!bt))
+ return;
+
+ r.device = cpu_to_be32(dev);
+ r.sector = cpu_to_be64(to);
+
+ __blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
+}
+
+#define FACILITY_NAME "blk"
+
+static struct blk_probe_data probe_array[] =
+{
+ { "blk_bio_queue", "%p %p", BLK_TA_QUEUE, blk_add_trace_bio },
+ { "blk_bio_backmerge", "%p %p", BLK_TA_BACKMERGE, blk_add_trace_bio },
+ { "blk_bio_frontmerge", "%p %p", BLK_TA_FRONTMERGE, blk_add_trace_bio },
+ { "blk_get_request", "%p %p %d", BLK_TA_GETRQ, blk_add_trace_generic },
+ { "blk_sleep_request", "%p %p %d", BLK_TA_SLEEPRQ,
+ blk_add_trace_generic },
+ { "blk_requeue", "%p %p", BLK_TA_REQUEUE, blk_add_trace_rq },
+ { "blk_request_issue", "%p %p", BLK_TA_ISSUE, blk_add_trace_rq },
+ { "blk_request_complete", "%p %p", BLK_TA_COMPLETE, blk_add_trace_rq },
+ { "blk_plug_device", "%p %p %d", BLK_TA_PLUG, blk_add_trace_generic },
+ { "blk_pdu_unplug_io", "%p %p %d", BLK_TA_UNPLUG_IO,
+ blk_add_trace_pdu_int },
+ { "blk_pdu_unplug_timer", "%p %p %d", BLK_TA_UNPLUG_TIMER,
+ blk_add_trace_pdu_int },
+ { "blk_request_insert", "%p %p", BLK_TA_INSERT,
+ blk_add_trace_rq },
+ { "blk_pdu_split", "%p %p %d", BLK_TA_SPLIT,
+ blk_add_trace_pdu_int },
+ { "blk_bio_bounce", "%p %p", BLK_TA_BOUNCE, blk_add_trace_bio },
+ { "blk_remap", "%p %p %u %llu %llu", BLK_TA_REMAP,
+ blk_add_trace_remap },
+};
+
+
+#define NUM_PROBES ARRAY_SIZE(probe_array)
+
+int blk_probe_connect(void)
+{
+ int result;
+ uint8_t i;
+
+ for (i = 0; i < NUM_PROBES; i++) {
+ result = marker_set_probe(probe_array[i].name,
+ probe_array[i].format,
+ probe_array[i].callback, &probe_array[i]);
+ if (!result)
+ printk(KERN_INFO
+ "blktrace unable to register probe %s\n",
+ probe_array[i].name);
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(blk_probe_connect);
+
+void blk_probe_disconnect(void)
+{
+ uint8_t i;
+
+ for (i = 0; i < NUM_PROBES; i++) {
+ marker_remove_probe(probe_array[i].name);
+ }
+ synchronize_sched(); /* Wait for probes to finish */
+}
+EXPORT_SYMBOL_GPL(blk_probe_disconnect);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Mathieu Desnoyers");
+MODULE_DESCRIPTION(FACILITY_NAME " probe");
Index: linux-2.6-lttng/block/blktrace.c
===================================================================
--- linux-2.6-lttng.orig/block/blktrace.c 2007-05-02 20:33:15.000000000 -0400
+++ linux-2.6-lttng/block/blktrace.c 2007-05-02 23:48:32.000000000 -0400
@@ -28,6 +28,10 @@
static DEFINE_PER_CPU(unsigned long long, blk_trace_cpu_offset) = { 0, };
static unsigned int blktrace_seq __read_mostly = 1;

+/* Global reference count of probes */
+static struct mutex blk_probe_mutex;
+static int blk_probes_ref = 0;
+
/*
* Send out a notify message.
*/
@@ -229,6 +233,12 @@
blk_remove_tree(bt->dir);
free_percpu(bt->sequence);
kfree(bt);
+ mutex_lock(&blk_probe_mutex);
+ if (blk_probes_ref == 1) {
+ blk_probe_disconnect();
+ blk_probes_ref--;
+ }
+ mutex_unlock(&blk_probe_mutex);
}

static int blk_trace_remove(request_queue_t *q)
@@ -386,6 +396,14 @@
goto err;
}

+ /* Connect probes to markers */
+ mutex_lock(&blk_probe_mutex);
+ if (!blk_probes_ref) {
+ blk_probe_connect();
+ blk_probes_ref++;
+ }
+ mutex_unlock(&blk_probe_mutex);
+
return 0;
err:
if (dir)
@@ -552,6 +570,7 @@
static __init int blk_trace_init(void)
{
mutex_init(&blk_tree_mutex);
+ mutex_init(&blk_probe_mutex);
on_each_cpu(blk_trace_check_cpu_time, NULL, 1, 1);
blk_trace_set_ht_offsets();

Index: linux-2.6-lttng/include/linux/blktrace_api.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/blktrace_api.h 2007-05-02 20:45:58.000000000 -0400
+++ linux-2.6-lttng/include/linux/blktrace_api.h 2007-05-02 22:12:46.000000000 -0400
@@ -3,6 +3,7 @@

#include <linux/blkdev.h>
#include <linux/relay.h>
+#include <linux/marker.h>

/*
* Trace categories
@@ -142,149 +143,24 @@
u32 pid;
};

+/* Probe data used for probe-marker connection */
+struct blk_probe_data {
+ const char *name;
+ const char *format;
+ u32 flags;
+ marker_probe_func *callback;
+};
+
#if defined(CONFIG_BLK_DEV_IO_TRACE)
extern int blk_trace_ioctl(struct block_device *, unsigned, char __user *);
extern void blk_trace_shutdown(request_queue_t *);
extern void __blk_add_trace(struct blk_trace *, sector_t, int, int, u32, int, int, void *);
-
-/**
- * blk_add_trace_rq - Add a trace for a request oriented action
- * @q: queue the io is for
- * @rq: the source request
- * @what: the action
- *
- * Description:
- * Records an action against a request. Will log the bio offset + size.
- *
- **/
-static inline void blk_add_trace_rq(struct request_queue *q, struct request *rq,
- u32 what)
-{
- struct blk_trace *bt = q->blk_trace;
- int rw = rq->cmd_flags & 0x03;
-
- if (likely(!bt))
- return;
-
- if (blk_pc_request(rq)) {
- what |= BLK_TC_ACT(BLK_TC_PC);
- __blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
- } else {
- what |= BLK_TC_ACT(BLK_TC_FS);
- __blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
- }
-}
-
-/**
- * blk_add_trace_bio - Add a trace for a bio oriented action
- * @q: queue the io is for
- * @bio: the source bio
- * @what: the action
- *
- * Description:
- * Records an action against a bio. Will log the bio offset + size.
- *
- **/
-static inline void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
- u32 what)
-{
- struct blk_trace *bt = q->blk_trace;
-
- if (likely(!bt))
- return;
-
- __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
-}
-
-/**
- * blk_add_trace_generic - Add a trace for a generic action
- * @q: queue the io is for
- * @bio: the source bio
- * @rw: the data direction
- * @what: the action
- *
- * Description:
- * Records a simple trace
- *
- **/
-static inline void blk_add_trace_generic(struct request_queue *q,
- struct bio *bio, int rw, u32 what)
-{
- struct blk_trace *bt = q->blk_trace;
-
- if (likely(!bt))
- return;
-
- if (bio)
- blk_add_trace_bio(q, bio, what);
- else
- __blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
-}
-
-/**
- * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload
- * @q: queue the io is for
- * @what: the action
- * @bio: the source bio
- * @pdu: the integer payload
- *
- * Description:
- * Adds a trace with some integer payload. This might be an unplug
- * option given as the action, with the depth at unplug time given
- * as the payload
- *
- **/
-static inline void blk_add_trace_pdu_int(struct request_queue *q, u32 what,
- struct bio *bio, unsigned int pdu)
-{
- struct blk_trace *bt = q->blk_trace;
- __be64 rpdu = cpu_to_be64(pdu);
-
- if (likely(!bt))
- return;
-
- if (bio)
- __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu);
- else
- __blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
-}
-
-/**
- * blk_add_trace_remap - Add a trace for a remap operation
- * @q: queue the io is for
- * @bio: the source bio
- * @dev: target device
- * @from: source sector
- * @to: target sector
- *
- * Description:
- * Device mapper or raid target sometimes need to split a bio because
- * it spans a stripe (or similar). Add a trace for that action.
- *
- **/
-static inline void blk_add_trace_remap(struct request_queue *q, struct bio *bio,
- dev_t dev, sector_t from, sector_t to)
-{
- struct blk_trace *bt = q->blk_trace;
- struct blk_io_trace_remap r;
-
- if (likely(!bt))
- return;
-
- r.device = cpu_to_be32(dev);
- r.sector = cpu_to_be64(to);
-
- __blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
-}
+extern int blk_probe_connect(void);
+extern void blk_probe_disconnect(void);

#else /* !CONFIG_BLK_DEV_IO_TRACE */
#define blk_trace_ioctl(bdev, cmd, arg) (-ENOTTY)
#define blk_trace_shutdown(q) do { } while (0)
-#define blk_add_trace_rq(q, rq, what) do { } while (0)
-#define blk_add_trace_bio(q, rq, what) do { } while (0)
-#define blk_add_trace_generic(q, rq, rw, what) do { } while (0)
-#define blk_add_trace_pdu_int(q, what, bio, pdu) do { } while (0)
-#define blk_add_trace_remap(q, bio, dev, f, t) do {} while (0)
#endif /* CONFIG_BLK_DEV_IO_TRACE */

#endif
Index: linux-2.6-lttng/mm/bounce.c
===================================================================
--- linux-2.6-lttng.orig/mm/bounce.c 2007-05-02 21:34:39.000000000 -0400
+++ linux-2.6-lttng/mm/bounce.c 2007-05-02 21:36:17.000000000 -0400
@@ -13,7 +13,7 @@
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <asm/tlbflush.h>

#define POOL_SIZE 64
@@ -237,7 +237,7 @@
if (!bio)
return;

- blk_add_trace_bio(q, *bio_orig, BLK_TA_BOUNCE);
+ MARK(blk_bio_bounce, "%p %p", q, *bio_orig);

/*
* at least one page was bounced, fill in possible non-highmem
Index: linux-2.6-lttng/mm/highmem.c
===================================================================
--- linux-2.6-lttng.orig/mm/highmem.c 2007-05-02 21:36:27.000000000 -0400
+++ linux-2.6-lttng/mm/highmem.c 2007-05-02 21:36:39.000000000 -0400
@@ -26,7 +26,7 @@
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <asm/tlbflush.h>

/*
Index: linux-2.6-lttng/fs/bio.c
===================================================================
--- linux-2.6-lttng.orig/fs/bio.c 2007-05-02 21:37:52.000000000 -0400
+++ linux-2.6-lttng/fs/bio.c 2007-05-02 21:40:30.000000000 -0400
@@ -25,7 +25,7 @@
#include <linux/module.h>
#include <linux/mempool.h>
#include <linux/workqueue.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <scsi/sg.h> /* for struct sg_iovec */

#define BIO_POOL_SIZE 2
@@ -1081,7 +1081,7 @@
if (!bp)
return bp;

- blk_add_trace_pdu_int(bdev_get_queue(bi->bi_bdev), BLK_TA_SPLIT, bi,
+ MARK(blk_pdu_split, "%p %p %d", bdev_get_queue(bi->bi_bdev), bi,
bi->bi_sector + first_sectors);

BUG_ON(bi->bi_vcnt != 1);
Index: linux-2.6-lttng/drivers/block/cciss.c
===================================================================
--- linux-2.6-lttng.orig/drivers/block/cciss.c 2007-05-02 21:44:30.000000000 -0400
+++ linux-2.6-lttng/drivers/block/cciss.c 2007-05-02 21:45:08.000000000 -0400
@@ -37,7 +37,7 @@
#include <linux/hdreg.h>
#include <linux/spinlock.h>
#include <linux/compat.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <asm/uaccess.h>
#include <asm/io.h>

@@ -2502,7 +2502,7 @@
}
cmd->rq->data_len = 0;
cmd->rq->completion_data = cmd;
- blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
+ MARK(blk_request_complete, "%p %p", cmd->rq->q, cmd->rq);
blk_complete_request(cmd->rq);
}

Index: linux-2.6-lttng/drivers/md/dm.c
===================================================================
--- linux-2.6-lttng.orig/drivers/md/dm.c 2007-05-02 21:44:41.000000000 -0400
+++ linux-2.6-lttng/drivers/md/dm.c 2007-05-02 21:47:19.000000000 -0400
@@ -19,7 +19,7 @@
#include <linux/slab.h>
#include <linux/idr.h>
#include <linux/hdreg.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <linux/smp_lock.h>

#define DM_MSG_PREFIX "core"
@@ -485,8 +485,8 @@
wake_up(&io->md->wait);

if (io->error != DM_ENDIO_REQUEUE) {
- blk_add_trace_bio(io->md->queue, io->bio,
- BLK_TA_COMPLETE);
+ MARK(blk_request_complete, "%p %p",
+ io->md->queue, io->bio);

bio_endio(io->bio, io->bio->bi_size, io->error);
}
@@ -582,10 +582,10 @@
r = ti->type->map(ti, clone, &tio->info);
if (r == DM_MAPIO_REMAPPED) {
/* the bio has been remapped so dispatch it */
-
- blk_add_trace_remap(bdev_get_queue(clone->bi_bdev), clone,
- tio->io->bio->bi_bdev->bd_dev, sector,
- clone->bi_sector);
+ MARK(blk_remap, "%p %p %u %llu %llu",
+ bdev_get_queue(clone->bi_bdev), clone,
+ (u64)tio->io->bio->bi_bdev->bd_dev, (u64)sector,
+ (u64)clone->bi_sector);

generic_make_request(clone);
} else if (r < 0 || r == DM_MAPIO_REQUEUE) {
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-03 15:12:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.22 -mm merge plans

On Thu, May 03, 2007 at 11:04:15AM -0400, Mathieu Desnoyers wrote:
> - blk_add_trace_rq(q, rq, BLK_TA_INSERT);
> + MARK(blk_request_insert, "%p %p", q, rq);

I don't really like the shouting MARK name very much. Can we have
a less generic, less shouting name, e.g. trace_mark? The above would then
be:

trace_mark(blk_request_insert, "%p %p", q, rq);

> +#define NUM_PROBES ARRAY_SIZE(probe_array)

Just get rid of this and use ARRAY_SIZE directly below.

> +int blk_probe_connect(void)
> +{
> + int result;
> + uint8_t i;

Just use an int for the for loops. It's easier to read and probably
faster on most systems (if the compiler isn't smart enough to promote it
to an int anyway during code generation).

> +void blk_probe_disconnect(void)
> +{
> + uint8_t i;
> +
> + for (i = 0; i < NUM_PROBES; i++) {
> + marker_remove_probe(probe_array[i].name);
> + }
> + synchronize_sched(); /* Wait for probes to finish */

kprobes does this kind of synchronization internally, so the marker
wrapper probably should as well.
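
I.e., roughly (just a sketch; __marker_remove_probe is a made-up name
for whatever helper does the actual unregistration walk):

	int marker_remove_probe(const char *name)
	{
		int found;

		/* hypothetical helper that disconnects the probe */
		found = __marker_remove_probe(name);
		/* wait for in-flight probe callbacks to finish */
		synchronize_sched();
		return found;
	}

so that every caller gets the quiescing for free.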

> +static int blk_probes_ref = 0;

no need to initialize this.

> /*
> * Send out a notify message.
> */
> @@ -229,6 +233,12 @@
> blk_remove_tree(bt->dir);
> free_percpu(bt->sequence);
> kfree(bt);
> + mutex_lock(&blk_probe_mutex);
> + if (blk_probes_ref == 1) {
> + blk_probe_disconnect();
> + blk_probes_ref--;
> + }

if (--blk_probes_ref == 0)
blk_probe_disconnect();

would probably be a tad cleaner.

> + if (!blk_probes_ref) {
> + blk_probe_connect();
> + blk_probes_ref++;
> + }

Dito here with a:

if (!blk_probes_ref++)
blk_probe_connect();

Also, the "connect" in the name seems rather odd; what about arm/disarm instead?

> static __init int blk_trace_init(void)
> {
> mutex_init(&blk_tree_mutex);
> + mutex_init(&blk_probe_mutex);

Both should use DEFINE_MUTEX for compile-time initialization instead.
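
I.e. just:

	static DEFINE_MUTEX(blk_tree_mutex);
	static DEFINE_MUTEX(blk_probe_mutex);

and drop the two mutex_init() calls from blk_trace_init().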


Also, it's probably better to put the trace points into blktrace.c, so
that all the blktrace code can be static and self-contained. And we can
probably do some additional cleanups by simplifying things later on.

2007-05-03 17:22:03

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: 2.6.22 -mm merge plans

Here is the reworked patch; all comments have been addressed except this one:

* Christoph Hellwig ([email protected]) wrote:
> > +void blk_probe_disconnect(void)
> > +{
> > + uint8_t i;
> > +
> > + for (i = 0; i < NUM_PROBES; i++) {
> > + marker_remove_probe(probe_array[i].name);
> > + }
> > + synchronize_sched(); /* Wait for probes to finish */
>
> kprobes does this kind of synchronization internally, so the marker
> wrapper probably should as well.
>

The problem appears on heavily loaded systems. Doing 50
synchronize_sched() calls in a row can take up to a few seconds on a
4-way machine. This is why I prefer to do it in the module to which
the callbacks belong.

Here is the revised patch. It depends on a newer version of the markers
that I'll send to Andrew soon.

Signed-off-by: Mathieu Desnoyers <[email protected]>


Index: linux-2.6-lttng/block/elevator.c
===================================================================
--- linux-2.6-lttng.orig/block/elevator.c 2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/block/elevator.c 2007-05-03 12:54:58.000000000 -0400
@@ -32,7 +32,7 @@
#include <linux/init.h>
#include <linux/compiler.h>
#include <linux/delay.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <linux/hash.h>

#include <asm/uaccess.h>
@@ -571,7 +571,7 @@
unsigned ordseq;
int unplug_it = 1;

- blk_add_trace_rq(q, rq, BLK_TA_INSERT);
+ trace_mark(blk_request_insert, "%p %p", q, rq);

rq->q = q;

@@ -757,7 +757,7 @@
* not be passed by new incoming requests
*/
rq->cmd_flags |= REQ_STARTED;
- blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+ trace_mark(blk_request_issue, "%p %p", q, rq);
}

if (!q->boundary_rq || q->boundary_rq == rq) {
Index: linux-2.6-lttng/block/ll_rw_blk.c
===================================================================
--- linux-2.6-lttng.orig/block/ll_rw_blk.c 2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/block/ll_rw_blk.c 2007-05-03 12:54:58.000000000 -0400
@@ -28,6 +28,7 @@
#include <linux/task_io_accounting_ops.h>
#include <linux/interrupt.h>
#include <linux/cpu.h>
+#include <linux/marker.h>
#include <linux/blktrace_api.h>
#include <linux/fault-inject.h>

@@ -1551,7 +1552,7 @@

if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
- blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
+ trace_mark(blk_plug_device, "%p %p %d", q, NULL, 0);
}
}

@@ -1617,7 +1618,7 @@
* devices don't necessarily have an ->unplug_fn defined
*/
if (q->unplug_fn) {
- blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+ trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);

q->unplug_fn(q);
@@ -1628,7 +1629,7 @@
{
request_queue_t *q = container_of(work, request_queue_t, unplug_work);

- blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+ trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);

q->unplug_fn(q);
@@ -1638,7 +1639,7 @@
{
request_queue_t *q = (request_queue_t *)data;

- blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
+ trace_mark(blk_pdu_unplug_timer, "%p %p %d", q, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);

kblockd_schedule_work(&q->unplug_work);
@@ -2148,7 +2149,7 @@

rq_init(q, rq);

- blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
+ trace_mark(blk_get_request, "%p %p %d", q, bio, rw);
out:
return rq;
}
@@ -2178,7 +2179,7 @@
if (!rq) {
struct io_context *ioc;

- blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+ trace_mark(blk_sleep_request, "%p %p %d", q, bio, rw);

__generic_unplug_device(q);
spin_unlock_irq(q->queue_lock);
@@ -2252,7 +2253,7 @@
*/
void blk_requeue_request(request_queue_t *q, struct request *rq)
{
- blk_add_trace_rq(q, rq, BLK_TA_REQUEUE);
+ trace_mark(blk_requeue, "%p %p", q, rq);

if (blk_rq_tagged(rq))
blk_queue_end_tag(q, rq);
@@ -2937,7 +2938,7 @@
if (!ll_back_merge_fn(q, req, bio))
break;

- blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE);
+ trace_mark(blk_bio_backmerge, "%p %p", q, bio);

req->biotail->bi_next = bio;
req->biotail = bio;
@@ -2954,7 +2955,7 @@
if (!ll_front_merge_fn(q, req, bio))
break;

- blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE);
+ trace_mark(blk_bio_frontmerge, "%p %p", q, bio);

bio->bi_next = req->bio;
req->bio = bio;
@@ -3184,10 +3185,10 @@
blk_partition_remap(bio);

if (old_sector != -1)
- blk_add_trace_remap(q, bio, old_dev, bio->bi_sector,
- old_sector);
+ trace_mark(blk_remap, "%p %p %u %llu %llu", q, bio, old_dev,
+ (u64)bio->bi_sector, (u64)old_sector);

- blk_add_trace_bio(q, bio, BLK_TA_QUEUE);
+ trace_mark(blk_bio_queue, "%p %p", q, bio);

old_sector = bio->bi_sector;
old_dev = bio->bi_bdev->bd_dev;
@@ -3329,7 +3330,7 @@
int total_bytes, bio_nbytes, error, next_idx = 0;
struct bio *bio;

- blk_add_trace_rq(req->q, req, BLK_TA_COMPLETE);
+ trace_mark(blk_request_complete, "%p %p", req->q, req);

/*
* extend uptodate bool to allow < 0 value to be direct io error
Index: linux-2.6-lttng/block/Kconfig
===================================================================
--- linux-2.6-lttng.orig/block/Kconfig 2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/block/Kconfig 2007-05-03 12:54:58.000000000 -0400
@@ -32,6 +32,7 @@
depends on SYSFS
select RELAY
select DEBUG_FS
+ select MARKERS
help
Say Y here, if you want to be able to trace the block layer actions
on a given queue. Tracing allows you to see any traffic happening
Index: linux-2.6-lttng/block/blktrace.c
===================================================================
--- linux-2.6-lttng.orig/block/blktrace.c 2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/block/blktrace.c 2007-05-03 13:05:30.000000000 -0400
@@ -23,11 +23,19 @@
#include <linux/mutex.h>
#include <linux/debugfs.h>
#include <linux/time.h>
+#include <linux/marker.h>
#include <asm/uaccess.h>

static DEFINE_PER_CPU(unsigned long long, blk_trace_cpu_offset) = { 0, };
static unsigned int blktrace_seq __read_mostly = 1;

+/* Global reference count of probes */
+static DEFINE_MUTEX(blk_probe_mutex);
+static int blk_probes_ref;
+
+int blk_probe_arm(void);
+void blk_probe_disarm(void);
+
/*
* Send out a notify message.
*/
@@ -179,7 +187,7 @@
EXPORT_SYMBOL_GPL(__blk_add_trace);

static struct dentry *blk_tree_root;
-static struct mutex blk_tree_mutex;
+static DEFINE_MUTEX(blk_tree_mutex);
static unsigned int root_users;

static inline void blk_remove_root(void)
@@ -229,6 +237,10 @@
blk_remove_tree(bt->dir);
free_percpu(bt->sequence);
kfree(bt);
+ mutex_lock(&blk_probe_mutex);
+ if (--blk_probes_ref == 0)
+ blk_probe_disarm();
+ mutex_unlock(&blk_probe_mutex);
}

static int blk_trace_remove(request_queue_t *q)
@@ -386,6 +398,11 @@
goto err;
}

+ mutex_lock(&blk_probe_mutex);
+ if (!blk_probes_ref++)
+ blk_probe_arm();
+ mutex_unlock(&blk_probe_mutex);
+
return 0;
err:
if (dir)
@@ -549,9 +566,270 @@
#endif
}

+/**
+ * blk_add_trace_rq - Add a trace for a request oriented action
+ * Expected variable arguments:
+ * @q: queue the io is for
+ * @rq: the source request
+ *
+ * Description:
+ * Records an action against a request. Will log the bio offset + size.
+ *
+ **/
+static void blk_add_trace_rq(const struct __mark_marker_data *mdata,
+ const char *fmt, ...)
+{
+ va_list args;
+ u32 what;
+ struct blk_trace *bt;
+ int rw;
+ struct blk_probe_data *pinfo = mdata->pdata;
+ struct request_queue *q;
+ struct request *rq;
+
+ va_start(args, fmt);
+ q = va_arg(args, struct request_queue *);
+ rq = va_arg(args, struct request *);
+ va_end(args);
+
+ what = pinfo->flags;
+ bt = q->blk_trace;
+ rw = rq->cmd_flags & 0x03;
+
+ if (likely(!bt))
+ return;
+
+ if (blk_pc_request(rq)) {
+ what |= BLK_TC_ACT(BLK_TC_PC);
+ __blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
+ } else {
+ what |= BLK_TC_ACT(BLK_TC_FS);
+ __blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
+ }
+}
+
+/**
+ * blk_add_trace_bio - Add a trace for a bio oriented action
+ * Expected variable arguments:
+ * @q: queue the io is for
+ * @bio: the source bio
+ *
+ * Description:
+ * Records an action against a bio. Will log the bio offset + size.
+ *
+ **/
+static void blk_add_trace_bio(const struct __mark_marker_data *mdata,
+ const char *fmt, ...)
+{
+ va_list args;
+ u32 what;
+ struct blk_trace *bt;
+ struct blk_probe_data *pinfo = mdata->pdata;
+ struct request_queue *q;
+ struct bio *bio;
+
+ va_start(args, fmt);
+ q = va_arg(args, struct request_queue *);
+ bio = va_arg(args, struct bio *);
+ va_end(args);
+
+ what = pinfo->flags;
+ bt = q->blk_trace;
+
+ if (likely(!bt))
+ return;
+
+ __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
+}
+
+/**
+ * blk_add_trace_generic - Add a trace for a generic action
+ * Expected variable arguments :
+ * @q: queue the io is for
+ * @bio: the source bio
+ * @rw: the data direction
+ *
+ * Description:
+ * Records a simple trace
+ *
+ **/
+static void blk_add_trace_generic(const struct __mark_marker_data *mdata,
+ const char *fmt, ...)
+{
+ va_list args;
+ struct blk_trace *bt;
+ u32 what;
+ struct blk_probe_data *pinfo = mdata->pdata;
+ struct request_queue *q;
+ struct bio *bio;
+ int rw;
+
+ va_start(args, fmt);
+ q = va_arg(args, struct request_queue *);
+ bio = va_arg(args, struct bio *);
+ rw = va_arg(args, int);
+ va_end(args);
+
+ what = pinfo->flags;
+ bt = q->blk_trace;
+
+ if (likely(!bt))
+ return;
+
+ if (bio)
+ blk_add_trace_bio(mdata, "%p %p", q, bio);
+ else
+ __blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
+}
+
+/**
+ * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload
+ * Expected variable arguments:
+ * @q: queue the io is for
+ * @bio: the source bio
+ * @pdu: the integer payload
+ *
+ * Description:
+ * Adds a trace with some integer payload. This might be an unplug
+ * option given as the action, with the depth at unplug time given
+ * as the payload
+ *
+ **/
+static void blk_add_trace_pdu_int(const struct __mark_marker_data *mdata,
+ const char *fmt, ...)
+{
+ va_list args;
+ struct blk_trace *bt;
+ u32 what;
+ struct blk_probe_data *pinfo = mdata->pdata;
+ struct request_queue *q;
+ struct bio *bio;
+ unsigned int pdu;
+ __be64 rpdu;
+
+ va_start(args, fmt);
+ q = va_arg(args, struct request_queue *);
+ bio = va_arg(args, struct bio *);
+ pdu = va_arg(args, unsigned int);
+ va_end(args);
+
+ what = pinfo->flags;
+ bt = q->blk_trace;
+ rpdu = cpu_to_be64(pdu);
+
+ if (likely(!bt))
+ return;
+
+ if (bio)
+ __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu);
+ else
+ __blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
+}
+
+/**
+ * blk_add_trace_remap - Add a trace for a remap operation
+ * Expected variable arguments:
+ * @q: queue the io is for
+ * @bio: the source bio
+ * @dev: target device
+ * @from: source sector
+ * @to: target sector
+ *
+ * Description:
+ * Device mapper or raid targets sometimes need to split a bio because
+ * it spans a stripe (or similar). Add a trace for that action.
+ *
+ **/
+static void blk_add_trace_remap(const struct __mark_marker_data *mdata,
+ const char *fmt, ...)
+{
+ va_list args;
+ struct blk_trace *bt;
+ struct blk_io_trace_remap r;
+ u32 what;
+ struct blk_probe_data *pinfo = mdata->pdata;
+ struct request_queue *q;
+ struct bio *bio;
+ u64 dev, from, to;
+
+ va_start(args, fmt);
+ q = va_arg(args, struct request_queue *);
+ bio = va_arg(args, struct bio *);
+ dev = va_arg(args, u64);
+ from = va_arg(args, u64);
+ to = va_arg(args, u64);
+ va_end(args);
+
+ what = pinfo->flags;
+ bt = q->blk_trace;
+
+ if (likely(!bt))
+ return;
+
+ r.device = cpu_to_be32(dev);
+ r.sector = cpu_to_be64(to);
+
+ __blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
+}
+
+#define FACILITY_NAME "blk"
+
+static struct blk_probe_data probe_array[] =
+{
+ { "blk_bio_queue", "%p %p", BLK_TA_QUEUE, blk_add_trace_bio },
+ { "blk_bio_backmerge", "%p %p", BLK_TA_BACKMERGE, blk_add_trace_bio },
+ { "blk_bio_frontmerge", "%p %p", BLK_TA_FRONTMERGE, blk_add_trace_bio },
+ { "blk_get_request", "%p %p %d", BLK_TA_GETRQ, blk_add_trace_generic },
+ { "blk_sleep_request", "%p %p %d", BLK_TA_SLEEPRQ,
+ blk_add_trace_generic },
+ { "blk_requeue", "%p %p", BLK_TA_REQUEUE, blk_add_trace_rq },
+ { "blk_request_issue", "%p %p", BLK_TA_ISSUE, blk_add_trace_rq },
+ { "blk_request_complete", "%p %p", BLK_TA_COMPLETE, blk_add_trace_rq },
+ { "blk_plug_device", "%p %p %d", BLK_TA_PLUG, blk_add_trace_generic },
+ { "blk_pdu_unplug_io", "%p %p %d", BLK_TA_UNPLUG_IO,
+ blk_add_trace_pdu_int },
+ { "blk_pdu_unplug_timer", "%p %p %d", BLK_TA_UNPLUG_TIMER,
+ blk_add_trace_pdu_int },
+ { "blk_request_insert", "%p %p", BLK_TA_INSERT,
+ blk_add_trace_rq },
+ { "blk_pdu_split", "%p %p %d", BLK_TA_SPLIT,
+ blk_add_trace_pdu_int },
+ { "blk_bio_bounce", "%p %p", BLK_TA_BOUNCE, blk_add_trace_bio },
+ { "blk_remap", "%p %p %u %llu %llu", BLK_TA_REMAP,
+ blk_add_trace_remap },
+};
+
+
+int blk_probe_arm(void)
+{
+ int result;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(probe_array); i++) {
+ result = marker_set_probe(probe_array[i].name,
+ probe_array[i].format,
+ probe_array[i].callback, &probe_array[i]);
+ if (!result)
+ printk(KERN_INFO
+ "blktrace unable to register probe %s\n",
+ probe_array[i].name);
+ }
+ return 0;
+}
+
+void blk_probe_disarm(void)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(probe_array); i++) {
+ marker_remove_probe(probe_array[i].name);
+ }
+ synchronize_sched(); /* Wait for probes to finish */
+}
+
+
static __init int blk_trace_init(void)
{
- mutex_init(&blk_tree_mutex);
on_each_cpu(blk_trace_check_cpu_time, NULL, 1, 1);
blk_trace_set_ht_offsets();

Index: linux-2.6-lttng/include/linux/blktrace_api.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/blktrace_api.h 2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/include/linux/blktrace_api.h 2007-05-03 12:54:58.000000000 -0400
@@ -3,6 +3,7 @@

#include <linux/blkdev.h>
#include <linux/relay.h>
+#include <linux/marker.h>

/*
* Trace categories
@@ -142,149 +143,24 @@
u32 pid;
};

+/* Probe data used for probe-marker connection */
+struct blk_probe_data {
+ const char *name;
+ const char *format;
+ u32 flags;
+ marker_probe_func *callback;
+};
+
#if defined(CONFIG_BLK_DEV_IO_TRACE)
extern int blk_trace_ioctl(struct block_device *, unsigned, char __user *);
extern void blk_trace_shutdown(request_queue_t *);
extern void __blk_add_trace(struct blk_trace *, sector_t, int, int, u32, int, int, void *);
-
-/**
- * blk_add_trace_rq - Add a trace for a request oriented action
- * @q: queue the io is for
- * @rq: the source request
- * @what: the action
- *
- * Description:
- * Records an action against a request. Will log the bio offset + size.
- *
- **/
-static inline void blk_add_trace_rq(struct request_queue *q, struct request *rq,
- u32 what)
-{
- struct blk_trace *bt = q->blk_trace;
- int rw = rq->cmd_flags & 0x03;
-
- if (likely(!bt))
- return;
-
- if (blk_pc_request(rq)) {
- what |= BLK_TC_ACT(BLK_TC_PC);
- __blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
- } else {
- what |= BLK_TC_ACT(BLK_TC_FS);
- __blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
- }
-}
-
-/**
- * blk_add_trace_bio - Add a trace for a bio oriented action
- * @q: queue the io is for
- * @bio: the source bio
- * @what: the action
- *
- * Description:
- * Records an action against a bio. Will log the bio offset + size.
- *
- **/
-static inline void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
- u32 what)
-{
- struct blk_trace *bt = q->blk_trace;
-
- if (likely(!bt))
- return;
-
- __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
-}
-
-/**
- * blk_add_trace_generic - Add a trace for a generic action
- * @q: queue the io is for
- * @bio: the source bio
- * @rw: the data direction
- * @what: the action
- *
- * Description:
- * Records a simple trace
- *
- **/
-static inline void blk_add_trace_generic(struct request_queue *q,
- struct bio *bio, int rw, u32 what)
-{
- struct blk_trace *bt = q->blk_trace;
-
- if (likely(!bt))
- return;
-
- if (bio)
- blk_add_trace_bio(q, bio, what);
- else
- __blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
-}
-
-/**
- * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload
- * @q: queue the io is for
- * @what: the action
- * @bio: the source bio
- * @pdu: the integer payload
- *
- * Description:
- * Adds a trace with some integer payload. This might be an unplug
- * option given as the action, with the depth at unplug time given
- * as the payload
- *
- **/
-static inline void blk_add_trace_pdu_int(struct request_queue *q, u32 what,
- struct bio *bio, unsigned int pdu)
-{
- struct blk_trace *bt = q->blk_trace;
- __be64 rpdu = cpu_to_be64(pdu);
-
- if (likely(!bt))
- return;
-
- if (bio)
- __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu);
- else
- __blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
-}
-
-/**
- * blk_add_trace_remap - Add a trace for a remap operation
- * @q: queue the io is for
- * @bio: the source bio
- * @dev: target device
- * @from: source sector
- * @to: target sector
- *
- * Description:
- * Device mapper or raid target sometimes need to split a bio because
- * it spans a stripe (or similar). Add a trace for that action.
- *
- **/
-static inline void blk_add_trace_remap(struct request_queue *q, struct bio *bio,
- dev_t dev, sector_t from, sector_t to)
-{
- struct blk_trace *bt = q->blk_trace;
- struct blk_io_trace_remap r;
-
- if (likely(!bt))
- return;
-
- r.device = cpu_to_be32(dev);
- r.sector = cpu_to_be64(to);
-
- __blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
-}
+extern int blk_probe_arm(void);
+extern void blk_probe_disarm(void);

#else /* !CONFIG_BLK_DEV_IO_TRACE */
#define blk_trace_ioctl(bdev, cmd, arg) (-ENOTTY)
#define blk_trace_shutdown(q) do { } while (0)
-#define blk_add_trace_rq(q, rq, what) do { } while (0)
-#define blk_add_trace_bio(q, rq, what) do { } while (0)
-#define blk_add_trace_generic(q, rq, rw, what) do { } while (0)
-#define blk_add_trace_pdu_int(q, what, bio, pdu) do { } while (0)
-#define blk_add_trace_remap(q, bio, dev, f, t) do {} while (0)
#endif /* CONFIG_BLK_DEV_IO_TRACE */

#endif
Index: linux-2.6-lttng/mm/bounce.c
===================================================================
--- linux-2.6-lttng.orig/mm/bounce.c 2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/mm/bounce.c 2007-05-03 12:54:58.000000000 -0400
@@ -13,7 +13,7 @@
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <asm/tlbflush.h>

#define POOL_SIZE 64
@@ -237,7 +237,7 @@
if (!bio)
return;

- blk_add_trace_bio(q, *bio_orig, BLK_TA_BOUNCE);
+ trace_mark(blk_bio_bounce, "%p %p", q, *bio_orig);

/*
* at least one page was bounced, fill in possible non-highmem
Index: linux-2.6-lttng/mm/highmem.c
===================================================================
--- linux-2.6-lttng.orig/mm/highmem.c 2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/mm/highmem.c 2007-05-03 12:54:58.000000000 -0400
@@ -26,7 +26,7 @@
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <asm/tlbflush.h>

/*
Index: linux-2.6-lttng/fs/bio.c
===================================================================
--- linux-2.6-lttng.orig/fs/bio.c 2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/fs/bio.c 2007-05-03 12:54:58.000000000 -0400
@@ -25,7 +25,7 @@
#include <linux/module.h>
#include <linux/mempool.h>
#include <linux/workqueue.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <scsi/sg.h> /* for struct sg_iovec */

#define BIO_POOL_SIZE 2
@@ -1081,7 +1081,7 @@
if (!bp)
return bp;

- blk_add_trace_pdu_int(bdev_get_queue(bi->bi_bdev), BLK_TA_SPLIT, bi,
+ trace_mark(blk_pdu_split, "%p %p %d", bdev_get_queue(bi->bi_bdev), bi,
bi->bi_sector + first_sectors);

BUG_ON(bi->bi_vcnt != 1);
Index: linux-2.6-lttng/drivers/block/cciss.c
===================================================================
--- linux-2.6-lttng.orig/drivers/block/cciss.c 2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/drivers/block/cciss.c 2007-05-03 12:54:58.000000000 -0400
@@ -37,7 +37,7 @@
#include <linux/hdreg.h>
#include <linux/spinlock.h>
#include <linux/compat.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <asm/uaccess.h>
#include <asm/io.h>

@@ -2502,7 +2502,7 @@
}
cmd->rq->data_len = 0;
cmd->rq->completion_data = cmd;
- blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
+ trace_mark(blk_request_complete, "%p %p", cmd->rq->q, cmd->rq);
blk_complete_request(cmd->rq);
}

Index: linux-2.6-lttng/drivers/md/dm.c
===================================================================
--- linux-2.6-lttng.orig/drivers/md/dm.c 2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/drivers/md/dm.c 2007-05-03 12:54:58.000000000 -0400
@@ -19,7 +19,7 @@
#include <linux/slab.h>
#include <linux/idr.h>
#include <linux/hdreg.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
#include <linux/smp_lock.h>

#define DM_MSG_PREFIX "core"
@@ -485,8 +485,8 @@
wake_up(&io->md->wait);

if (io->error != DM_ENDIO_REQUEUE) {
- blk_add_trace_bio(io->md->queue, io->bio,
- BLK_TA_COMPLETE);
+ trace_mark(blk_request_complete, "%p %p",
+ io->md->queue, io->bio);

bio_endio(io->bio, io->bio->bi_size, io->error);
}
@@ -582,10 +582,10 @@
r = ti->type->map(ti, clone, &tio->info);
if (r == DM_MAPIO_REMAPPED) {
/* the bio has been remapped so dispatch it */
-
- blk_add_trace_remap(bdev_get_queue(clone->bi_bdev), clone,
- tio->io->bio->bi_bdev->bd_dev, sector,
- clone->bi_sector);
+ trace_mark(blk_remap, "%p %p %u %llu %llu",
+ bdev_get_queue(clone->bi_bdev), clone,
+ (u64)tio->io->bio->bi_bdev->bd_dev, (u64)sector,
+ (u64)clone->bi_sector);

generic_make_request(clone);
} else if (r < 0 || r == DM_MAPIO_REQUEUE) {
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-03 17:25:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.22 -mm merge plans

On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > kprobes does this kind of synchronization internally, so the marker
> > wrapper probably should as well.
> >
>
> The problem appears on heavily loaded systems. Doing 50
> synchronize_sched() calls in a row can take up to a few seconds on a
> 4-way machine. This is why I prefer to do it in the module to which
> the callbacks belong.

We recently had a discussion about a batch unregistration interface for
kprobes. I'm not very happy with having such different interfaces for
different kinds of probe registration.

2007-05-10 19:39:49

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: 2.6.22 -mm merge plans

* Christoph Hellwig ([email protected]) wrote:
> On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > kprobes does this kind of synchronization internally, so the marker
> > > wrapper probably should as well.
> > >
> >
> > The problem appears on heavily loaded systems. Doing 50
> > synchronize_sched() calls in a row can take up to a few seconds on a
> > 4-way machine. This is why I prefer to do it in the module to which
> > the callbacks belong.
>
> We recently had a discussion about a batch unregistration interface for
> kprobes. I'm not very happy with having such different interfaces for
> different kinds of probe registration.
>

Ok, I've had a look at the kprobes batch registration mechanisms and...
well, they do not look well suited for the markers. Adding
supplementary data structures such as linked lists of probes does not
look like a good match.

However, I agree with you that providing a similar API is good.

Therefore, here is my proposal:

The goal is to do the synchronization just after we unregister the last
probe handler provided by a given module. Since the unregistration
functions iterate over every marker present in the kernel, we can keep a
count of how many probes provided by the same module are still present.
If we see that we unregistered the last probe pointing to this module,
we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what
is already there; we only have to do two passes over the marker structures:
the first one finds the module associated with the callback, and the
second disables the callbacks and keeps a count of the number of
callbacks associated with the module.

Mathieu

P.S.: here is the code.


Linux Kernel Markers - Architecture-independent code: provide internal
synchronize_sched() batching.

The goal is to do the synchronization just after we unregister the last
probe handler provided by a given module. Since the unregistration
functions iterate over every marker present in the kernel, we can keep a
count of how many probes provided by the same module are still present.
If we see that we unregistered the last probe pointing to this module,
we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what
is already there; we only have to do two passes over the marker structures:
the first one finds the module associated with the callback, and the
second disables the callbacks and keeps a count of the number of
callbacks associated with the module.

Signed-off-by: Mathieu Desnoyers <[email protected]>
---
kernel/module.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 52 insertions(+), 10 deletions(-)

Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c 2007-05-10 14:48:28.000000000 -0400
+++ linux-2.6-lttng/kernel/module.c 2007-05-10 15:38:27.000000000 -0400
@@ -404,8 +404,12 @@
}

/* Sets a range of markers to a disabled state : unset the enable bit and
- * provide the empty callback. */
+ * provide the empty callback.
+ * Keep a count of other markers connected to the same module as the one
+ * provided as a parameter. */
static int marker_remove_probe_range(const char *name,
+ struct module *probe_module,
+ int *ref_count,
const struct __mark_marker *begin,
const struct __mark_marker *end)
{
@@ -413,12 +417,17 @@
int found = 0;

for (iter = begin; iter < end; iter++) {
- if (strcmp(name, iter->mdata->name) == 0) {
- marker_set_enable(iter->enable, 0,
- iter->mdata->flags);
- iter->mdata->call = __mark_empty_function;
- found++;
+ if (strcmp(name, iter->mdata->name) != 0) {
+ if (probe_module)
+ if (__module_text_address(
+ (unsigned long)iter->mdata->call)
+ == probe_module)
+ (*ref_count)++;
+ continue;
}
+ marker_set_enable(iter->enable, 0, iter->mdata->flags);
+ iter->mdata->call = __mark_empty_function;
+ found++;
}
return found;
}
@@ -450,6 +459,29 @@
return found;
}

+/* Get the module to which the probe handler's text belongs.
+ * Called with module_mutex taken.
+ * Returns NULL if the probe handler is not in a module. */
+static struct module *__marker_get_probe_module(const char *name)
+{
+ struct module *mod;
+ const struct __mark_marker *iter;
+
+ list_for_each_entry(mod, &modules, list) {
+ if (mod->taints)
+ continue;
+ for (iter = mod->markers;
+ iter < mod->markers+mod->num_markers; iter++) {
+ if (strcmp(name, iter->mdata->name) != 0)
+ continue;
+ if (iter->mdata->call)
+ return __module_text_address(
+ (unsigned long)iter->mdata->call);
+ }
+ }
+ return NULL;
+}
+
/* Calls _marker_set_probe_range for the core markers and modules markers.
* Marker enabling/disabling use the modlist_lock to synchronise. */
int _marker_set_probe(int flags, const char *name, const char *format,
@@ -477,23 +509,33 @@
EXPORT_SYMBOL_GPL(_marker_set_probe);

/* Calls _marker_remove_probe_range for the core markers and modules markers.
- * Marker enabling/disabling use the modlist_lock to synchronise. */
+ * Marker enabling/disabling use the modlist_lock to synchronise.
+ * ref_count is the number of markers still connected to the same module
+ * as the one containing the probe handler currently being removed,
+ * excluding the one being removed. If the count is 0, we issue a
+ * synchronize_sched() to make sure the module can safely unload. */
int marker_remove_probe(const char *name)
{
- struct module *mod;
+ struct module *mod, *probe_module;
int found = 0;
+ int ref_count = 0;

mutex_lock(&module_mutex);
+ /* In what module is the probe handler ? */
+ probe_module = __marker_get_probe_module(name);
/* Core kernel markers */
- found += marker_remove_probe_range(name,
+ found += marker_remove_probe_range(name, probe_module, &ref_count,
__start___markers, __stop___markers);
/* Markers in modules. */
list_for_each_entry(mod, &modules, list) {
if (!mod->taints)
- found += marker_remove_probe_range(name,
+ found += marker_remove_probe_range(name, probe_module,
+ &ref_count,
mod->markers, mod->markers+mod->num_markers);
}
mutex_unlock(&module_mutex);
+ if (!ref_count)
+ synchronize_sched();
return found;
}
EXPORT_SYMBOL_GPL(marker_remove_probe);

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-13 21:04:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.22 -mm merge plans

On Thu, May 10, 2007 at 03:39:36PM -0400, Mathieu Desnoyers wrote:
> * Christoph Hellwig ([email protected]) wrote:
> > On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > > kprobes does this kind of synchronization internally, so the marker
> > > > wrapper probably should as well.
> > > >
> > >
> > > The problem appears on heavily loaded systems. Doing 50
> > > synchronize_sched() calls in a row can take up to a few seconds on a
> > > 4-way machine. This is why I prefer to do it in the module to which
> > > the callbacks belong.
> >
> > We recently had a discussion about a batch unregistration interface for
> > kprobes. I'm not very happy with having such different interfaces for
> > different kinds of probe registration.
> >
>
> Ok, I've had a look at the kprobes batch registration mechanisms and...
> well, they do not look well suited for the markers. Adding
> supplementary data structures such as linked lists of probes does not
> look like a good match.
>
> However, I agree with you that providing a similar API is good.
>
> Therefore, here is my proposal:
>
> The goal is to do the synchronization just after we unregister the last
> probe handler provided by a given module. Since the unregistration
> functions iterate over every marker present in the kernel, we can keep a
> count of how many probes provided by the same module are still present.
> If we see that we unregistered the last probe pointing to this module,
> we issue a synchronize_sched().
>
> It adds no data structure and keeps the same order of complexity as what
> is already there; we only have to do two passes over the marker structures:
> the first one finds the module associated with the callback, and the
> second disables the callbacks and keeps a count of the number of
> callbacks associated with the module.
>
> Mathieu
>
> P.S.: here is the code.
>
>
> Linux Kernel Markers - Architecture-independent code: provide internal
> synchronize_sched() batching.
>
> The goal is to do the synchronization just after we unregister the last
> probe handler provided by a given module. Since the unregistration
> functions iterate over every marker present in the kernel, we can keep a
> count of how many probes provided by the same module are still present.
> If we see that we unregistered the last probe pointing to this module,
> we issue a synchronize_sched().
>
> It adds no data structure and keeps the same order of complexity as what
> is already there; we only have to do two passes over the marker structures:
> the first one finds the module associated with the callback, and the
> second disables the callbacks and keeps a count of the number of
> callbacks associated with the module.

Looks good to me, please incorporate this in the next round of the
markers patch series.