Hi Mr. Axboe,
Tom Zanussi just informed me of your work on blktrace. I think that some aspects
of our respective projects (mine being LTTng) show resemblances.
LTTng is a system-wide tracer that aims at tracing programs/libraries and the
kernel. There is a version currently available at http://ltt.polymtl.ca. It has
a viewer counterpart (LTTV) that analyzes and graphically displays the traces
taken by LTTng.
I just looked at the blktrace code, and here are some parts we share:
- We both use RelayFS for data transfer to user space.
- We both need to get highly precise timestamps.
Where LTTng might have more constraints is on the performance and reentrancy
side: my tracer is NMI-reentrant (using atomic operations) and must be able to
dump data at a rate high enough to sustain a loopback ping flood (with
interrupts logged). Time precision must make it possible to reorder events
occurring very close together on two different CPUs.
I already developed a multithreaded daemon (lttd) that generically reads
RelayFS channels: it uses mmap() and four ioctl() calls to control the channel
buffers. I just discussed it with Tom Zanussi in the following thread:
http://www.listserv.shafik.org/pipermail/ltt-dev/2006-February/001245.html
Here is the argument I gave to justify the use of ioctl() for RelayFS
channels (in the same discussion):
http://www.listserv.shafik.org/pipermail/ltt-dev/2006-February/001247.html
I suggest integrating my ioctl() additions into the RelayFS channels so they
can be used very efficiently with both mmap() and ioctl() to control the
reader.
It would be trivial to use send() instead of write() to adapt lttd to the
networked case.
Using mmap() and write() instead of read() and write() would eliminate the
extra copy blktrace is doing when it writes to disk.
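To make this more concrete, here is a rough sketch of such an mmap()+ioctl()
reader loop. The ioctl names and numbers below are only placeholders for
illustration, not the exact ones from my lttd patch:

/*
 * Rough sketch of an mmap()+ioctl() relay channel reader.  The ioctl
 * names and numbers are placeholders, not the real lttd ABI.
 */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define RELAY_GET_SUBBUF _IOR(0xF5, 0x00, unsigned int) /* placeholder */
#define RELAY_PUT_SUBBUF _IOW(0xF5, 0x01, unsigned int) /* placeholder */

static int consume_channel(const char *chan_path, int out_fd,
                           size_t subbuf_size, unsigned int n_subbufs)
{
        int fd = open(chan_path, O_RDONLY);
        void *buf;

        if (fd < 0)
                return -1;

        /* Map the whole channel once; events are never copied to user space. */
        buf = mmap(NULL, subbuf_size * n_subbufs, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
                close(fd);
                return -1;
        }

        for (;;) {
                unsigned int idx;

                /* Ask for the next filled subbuffer; fails on channel teardown. */
                if (ioctl(fd, RELAY_GET_SUBBUF, &idx) < 0)
                        break;

                /* write() straight from the mapping: no extra copy.
                 * For the networked case this becomes a send(). */
                write(out_fd, (char *)buf + (size_t)idx * subbuf_size,
                      subbuf_size);

                /* Hand the subbuffer back to the producer. */
                ioctl(fd, RELAY_PUT_SUBBUF, &idx);
        }

        munmap(buf, subbuf_size * n_subbufs);
        close(fd);
        return 0;
}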
What do you think about it?
On another point, I looked at your timekeeping in blktrace and I think you could
gain precision by using a monotonic clock instead of do_gettimeofday (which is
altered by NTP).
Mathieu
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Tue, Feb 21 2006, Mathieu Desnoyers wrote:
> Hi Mr. Axboe,
>
> Tom Zanussi just informed me of your work on blktrace. I think that
> some aspects of our respective projects (mine being LTTng) show
> resemblances.
>
> LTTng is a system-wide tracer that aims at tracing programs/libraries
> and the kernel. There is a version currently available at
> http://ltt.polymtl.ca. It has a viewer counterpart (LTTV) that
> analyzes and graphically displays the traces taken by LTTng.
>
> I just looked at the blktrace code, and here are some parts we share :
>
> - We both use RelayFS for data transfer to user space.
> - We both need to get highly precise timestamps.
That's true.
> Where LTTng might have more constraints is on the performance and
> reentrancy side: my tracer is NMI-reentrant (using atomic operations)
> and must be able to dump data at a rate high enough to sustain a
> loopback ping flood (with interrupts logged). Time precision must make
> it possible to reorder events occurring very close together on two
> different CPUs.
I don't think we differ a lot there, actually. blktrace also needs to have
minimal impact - for the cases where we are not io bound (e.g. queueing
io), the logging rate can be quite high.
> I already developed a multithreaded daemon (lttd) that generically reads
> RelayFS channels: it uses mmap() and four ioctl() calls to control the
> channel buffers. I just discussed it with Tom Zanussi in the following
> thread:
>
> http://www.listserv.shafik.org/pipermail/ltt-dev/2006-February/001245.html
>
> Here is the argument I gave to justify the use of ioctl() for RelayFS
> channels (in the same discussion):
>
> http://www.listserv.shafik.org/pipermail/ltt-dev/2006-February/001247.html
>
> I suggest integrating my ioctl() additions into the RelayFS channels so
> they can be used very efficiently with both mmap() and ioctl() to
> control the reader.
>
> It would be trivial to use send() instead of write() to adapt lttd to
> the networked case.
>
> Using mmap() and write() instead of read() and write() would eliminate
> the extra copy blktrace is doing when it writes to disk.
blktrace currently uses read() to mmap'ed file buffers for local
storage, not read+write. We could mmap both ends of course and just copy
the data, I'm not sure it would buy me a lot though. For local storage,
blktrace's biggest worry is perturbing the vm/io side of things so we skew
the results of what we are tracing. That is usually more important than
using that extra 0.1% of cpu cycles, as most io tests are not CPU bound.
The sendfile() support should work now, so the preferred approach now
becomes using blktrace in net client mode and sendfile() the data out
without it ever being copied either in-kernel or to-user.
That said, the "complexity" of controlling produced/consumed numbers is
what has kept me away from doing mmap() of the relayfs buffers for
local storage. With an easier control mechanism in place, I might be
convinced to switch blktrace as well.
> On another point, I looked at your timekeeping in blktrace and I think
> you could gain precision by using a monotonic clock instead of
> do_gettimeofday (which is altered by NTP).
I don't use gettimeofday() for time keeping, unless sched_clock() winds
up using that for some cases. Haven't looked much into that yet, but on
some systems the granularity of sched_clock() is jiffies which doesn't
work very well for us of course.
What does LTT use in the kernel?
--
Jens Axboe
* Jens Axboe ([email protected]) wrote:
> blktrace currently uses read() to mmap'ed file buffers for local
> storage, not read+write.
I see that blktrace uses fwrite() in write_data(). Isn't it a disk write
scheme where you read() from the RelayFS channel and (f)write() to a file? Oh,
but the mmap'ed file is the output... I see. However, you have to mmap/unmap
the output file between each subbuffer, which costs you time.
> We could mmap both ends of course and just copy
> the data, I'm not sure it would buy me a lot though. For local storage,
> blktrace's biggest worry is perturbing the vm/io side of things so we skew
> the results of what we are tracing. That is usually more important than
> using that extra 0.1% of cpu cycles, as most io tests are not CPU bound.
> The sendfile() support should work now, so the preferred approach now
> becomes using blktrace in net client mode and sendfile() the data out
> without it ever being copied either in-kernel or to-user.
>
As I said earlier, using sendfile() or mmap+send() should lead to a similar
result.
> That said, the "complexity" of controlling produced/consumed numbers is
> what has kept me away from doing mmap() of the relayfs buffers for
> local storage.
Yes, it has been my reaction too.
> With an easier control mechanism in place, I might be
> convinced to switch blktrace as well.
>
Well, if you want to try the current lttd disk dumper, it's quite simple: you
fork from blktrace, exec lttd with 2-3 parameters, and it will simply
recursively open a directory structure, create the exact same trace directory
structure as output, mmap each buffer and wait for data. It quits when the last
buffer has hung up.
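Just to give an idea, spawning it boils down to something like the sketch
below; the option letters are only an example, not the exact lttd command line:

#include <stdio.h>
#include <unistd.h>

/*
 * Example only: "-t" (channel directory) and "-o" (output directory)
 * are illustrative option names, not necessarily lttd's real ones.
 */
static pid_t spawn_lttd(const char *chan_dir, const char *out_dir)
{
        pid_t pid = fork();

        if (pid == 0) {
                /* Child: lttd recursively opens chan_dir, recreates the
                 * same directory structure under out_dir, mmaps each
                 * buffer and writes subbuffers out as they fill. */
                execlp("lttd", "lttd", "-t", chan_dir, "-o", out_dir,
                       (char *)NULL);
                perror("execlp");
                _exit(1);
        }
        return pid;     /* parent keeps tracing; waitpid() at teardown */
}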
Once adapted to a network send(), I don't see any limitation in its genericity.
> > On another point, I looked at your timekeeping in blktrace and I think
> > you could gain precision by using a monotonic clock instead of
> > do_gettimeofday (which is altered by NTP).
>
> I don't use gettimeofday() for time keeping, unless sched_clock() winds
> up using that for some cases. Haven't looked much into that yet, but on
> some systems the granularity of sched_clock() is jiffies which doesn't
> work very well for us of course.
>
> What does LTT use in the kernel?
>
I looked closely at the timekeeping in the kernel and found out that
cycles_2_ns uses a precomputed variable which is not precise enough for my
needs, as it is computed with integers.
So I have two modes, one with and one without TSC. The simplest one is the TSC
mode, where I get the TSC of the CPUs. I also log the cpu_khz variable at the
beginning of the trace, so I can calculate the time myself from the TSC, but I
do it later, in double precision, in the analyser.
In the non-TSC case, I use the jiffies counter or'd with a logical clock.
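As an illustration of the TSC mode described above, the offline conversion done
in the analyser amounts to something like this (the function and variable names
are invented for the example):

#include <stdint.h>
#include <stdio.h>

/*
 * Offline conversion done in double precision from the cpu_khz value
 * logged at trace start; this avoids the rounding of the kernel's
 * precomputed integer cycles_2_ns factor.  Names are illustrative.
 */
static double tsc_to_seconds(uint64_t tsc_delta, double cpu_khz)
{
        /* cpu_khz is the CPU frequency in kHz, i.e. cycles per millisecond. */
        return (double)tsc_delta / (cpu_khz * 1000.0);
}

int main(void)
{
        double cpu_khz = 2394000.0;        /* example: 2.394 GHz CPU */
        uint64_t delta = 2394000000ULL;    /* example cycle count */

        printf("%.9f s after trace start\n", tsc_to_seconds(delta, cpu_khz));
        return 0;
}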
Mathieu
> --
> Jens Axboe
>
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, Feb 22 2006, Mathieu Desnoyers wrote:
> * Jens Axboe ([email protected]) wrote:
> > blktrace currently uses read() to mmap'ed file buffers for local
> > storage, not read+write.
>
> I see that blktrace uses fwrite() in write_data(). Isn't it a disk
> write scheme where you read() from the RelayFS channel and (f)write()
> to a file? Oh, but the mmap'ed file is the output... I see. However,
> you have to mmap/unmap the output file between each subbuffer, which
> costs you time.
That's just a coding thing; if it really were an issue I could
mremap()-extend it instead.
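Something like the sketch below would do it; this is purely illustrative, not
actual blktrace code:

#define _GNU_SOURCE             /* for mremap() and MREMAP_MAYMOVE */
#include <sys/mman.h>
#include <unistd.h>

/*
 * Illustration only: grow the file-backed output mapping in place
 * instead of doing munmap()+mmap() for every new subbuffer.
 */
static void *extend_output_map(int fd, void *map, size_t old_len,
                               size_t new_len)
{
        /* Grow the backing file first so the new pages are valid. */
        if (ftruncate(fd, new_len) < 0)
                return MAP_FAILED;

        /* Then grow the mapping; the kernel is free to move it. */
        return mremap(map, old_len, new_len, MREMAP_MAYMOVE);
}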
> > We could mmap both ends of course and just copy
> > the data, I'm not sure it would buy me a lot though. For local storage,
> > blktrace's biggest worry is perturbing the vm/io side of things so we skew
> > the results of what we are tracing. That is usually more important than
> > using that extra 0.1% of cpu cycles, as most io tests are not CPU bound.
> > The sendfile() support should work now, so the preferred approach now
> > becomes using blktrace in net client mode and sendfile() the data out
> > without it ever being copied either in-kernel or to-user.
> >
>
> As I said earlier, using sendfile() or mmap+send() should lead to a similar
> result.
Not following, similar what result?
> > That said, the "complexity" of controlling produced/consumed numbers is
> > what has kept me away from doing mmap() of the relayfs buffers for
> > local storage.
>
> Yes, it has been my reaction too.
Currently I just do sendfile() per subbuffer; if I limit it to one
subbuffer in flight at a time, I can reliably use poll() to check
whether a new subbuffer is available for transfer. I still have to use
a relay control file to get the subbuffer padding at the end of the
subbuffer, if the trace info doesn't fully fill a subbuffer.
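Roughly, the per-subbuffer loop looks like this; get_subbuf_padding() and
net_send_header() stand in for the blktrace helpers, the rest is plain
syscalls:

#include <poll.h>
#include <unistd.h>
#include <sys/sendfile.h>

/* Stand-ins for the blktrace helpers, declared here only for the sketch. */
extern int get_subbuf_padding(int ctrl_fd);
extern int net_send_header(int sock, size_t len);

static int send_subbufs(int relay_fd, int ctrl_fd, int sock, size_t subbuf_size)
{
        struct pollfd pfd = { .fd = relay_fd, .events = POLLIN };

        for (;;) {
                /* One subbuffer in flight: wait until a full one is ready. */
                if (poll(&pfd, 1, -1) < 0)
                        return -1;

                /* Padding at the end of the subbuffer comes from the relay
                 * control file and is not transmitted. */
                int pad = get_subbuf_padding(ctrl_fd);
                if (pad < 0)
                        return -1;

                size_t len = subbuf_size - pad;
                if (net_send_header(sock, len) < 0)
                        return -1;

                /* NULL offset: use and advance the relay file's own position,
                 * so the subbuffer is consumed as it is sent. */
                if (sendfile(sock, relay_fd, NULL, len) < 0)
                        return -1;
        }
}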
> > With an easier control mechanism in place, I might be
> > convinced to switch blktrace as well.
> >
>
> Well, if you want to try the current lttd disk dumper, it's quite
> simple: you fork from blktrace, exec lttd with 2-3 parameters, and it
> will simply recursively open a directory structure, create the exact
> same trace directory structure as output, mmap each buffer and wait
> for data. It quits when the last buffer has hung up.
>
> Once adapted to a network send(), I don't see any limitation in its
> genericity.
It might be a slight improvement in the local trace case, however as I
said it's not really an issue for me. Even for the local trace case, the
read-to-mmap isn't close to being the top bottleneck for traces. As it
stands right now, there's little incentive for me to do anything :)
As mentioned, blktrace prefers sendfile() for the network side, which is
still a win over send() as you'd still have to copy the data over. A
quick test here shows 0.5-0.6% more idle time with the sendfile()
approach over send(), with ~0.3% of that being copy_user_generic()
overhead.
> > > On another point, I looked at your timekeeping in blktrace and I think
> > > you could gain precision by using a monotonic clock instead of
> > > do_gettimeofday (which is altered by NTP).
> >
> > I don't use gettimeofday() for time keeping, unless sched_clock() winds
> > up using that for some cases. Haven't looked much into that yet, but on
> > some systems the granularity of sched_clock() is jiffies which doesn't
> > work very well for us of course.
> >
> > What does LTT use in the kernel?
> >
>
> I looked closely at the timekeeping in the kernel and found out that
> cycles_2_ns uses a precomputed variable which is not precise
> enough for my needs, as it is computed with integers.
>
> So I has two mode, one with and one without TSC. The simplest one is
> the TSC mode, where I get the TSC of the CPUs. I also log the cpu_khz
> variable at the beginning of the trace, so I can calculate the time
> myself from the tsc, but I do it later, in double precision with the
> analyser.
>
> In the non-TSC case, I use the jiffies counter or'd with a logical clock.
Sounds like we can share some code there, I basically just used
sched_clock() as it was available and had good granularity on the
systems I tested then. An internal get_good_clock() that does the right
thing would be appreciated :)
--
Jens Axboe
* Jens Axboe ([email protected]) wrote:
> On Wed, Feb 22 2006, Mathieu Desnoyers wrote:
> > * Jens Axboe ([email protected]) wrote:
> > I see that blktrace uses fwrite() in write_data(). Isn't it a disk
> > write scheme where you read() from the RelayFS channel and (f)write()
> > to a file? Oh, but the mmap'ed file is the output... I see. However,
> > you have to mmap/unmap the output file between each subbuffer, which
> > costs you time.
>
> That's just a coding thing; if it really were an issue I could
> mremap()-extend it instead.
>
It would limit you to the 3GB addressable memory space. You couldn't write files
bigger than that, so it's not just a coding thing :)
> > As I said earlier, using sendfile() or mmap+send() should lead to a similar
> > result.
>
> Not following, similar what result?
>
As I understand it, mmap() and send() would take approximately the same amount
of CPU time as sendfile() does.
> Currently I just do sendfile() per subbuffer; if I limit it to one
> subbuffer in flight at a time, I can reliably use poll() to check
> whether a new subbuffer is available for transfer. I still have to use
> a relay control file to get the subbuffer padding at the end of the
> subbuffer, if the trace info doesn't fully fill a subbuffer.
>
Do you transfer the padding at the end of subbuffers, or do you truncate it?
Currently, my ioctl()+lttd does not tell the size of the padding, but it would
be very simple to add this output parameter to the reserve subbuf ioctl. It
would be particularly useful to reduce the network traffic on almost empty
subbuffers. However, as the majority of subbuffers are almost full on a heavily
used system, I doubt it would be useful at all to know the padding size in the
disk/network writer daemon.
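Purely for the sake of the example, the extension could look something like
the sketch below; the structure and ioctl name are invented here, they are not
part of my current patch:

#include <sys/ioctl.h>

/*
 * Invented for illustration: return the padding together with the
 * subbuffer index from the "get subbuffer" ioctl.
 */
struct relay_get_subbuf_arg {
        unsigned int subbuf_idx;        /* out: index of the filled subbuffer */
        unsigned int padding;           /* out: unused bytes at its end */
};

#define RELAY_GET_SUBBUF_PADDED _IOR(0xF5, 0x02, struct relay_get_subbuf_arg)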
> It might be a slight improvement in the local trace case, however as I
> said it's not really an issue for me. Even for the local trace case, the
> read-to-mmap isn't close to being the top bottleneck for traces. As it
> stands right now, there's little incentive for me to do anything :)
>
How would you like having both mmap()+ioctl() and sendfile() added to the
standard RelayFS file operations? This would make both disk write and network
send quite simple and efficient.
However, I would tweak the sendfile() operation a little bit: I would make it
use the current consumer position and only send the number of subbuffers that
are ready to be read. It would then support more than one subbuffer in flight
per channel.
> As mentioned, blktrace prefers sendfile() for the network side which is
> still a win over send() as you'd still have to copy the data over. A
> quick test here shows 0.5-0.6% more idle time with the sendfile()
> approach over send(), with ~0.3% of that being copy_user_generic()
> overhead.
>
You are right: the sendfile() approach requires less checking of the copied
memory region, so there should be a gain.
> > So I have two modes, one with and one without TSC. The simplest one is
> > the TSC mode, where I get the TSC of the CPUs. I also log the cpu_khz
> > variable at the beginning of the trace, so I can calculate the time
> > myself from the tsc, but I do it later, in double precision with the
> > analyser.
> >
> > In non TSC case, I use the jiffies counter or'd with a logical clock.
>
> Sounds like we can share some code there, I basically just used
> sched_clock() as it was available and had good granularity on the
> systems I tested then. An internal get_good_clock() that does the right
> thing would be appreciated :)
>
The LTTng code is available at http://ltt.polymtl.ca. The major part of the
timekeeping sits in include/asm-*/ltt.h. Note that there is a small detail
about it: I need to reset the logical clock at each jiffy increment (for
architectures without TSC).
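A very rough sketch of the idea, with names and bit split invented here (the
real code in include/asm-*/ltt.h is the reference):

#include <stdint.h>

#define LOGICAL_BITS 16         /* low bits reserved for the logical clock */

static uint64_t last_jiffies;
static uint64_t logical_count;

/*
 * The real version uses atomic operations so it stays NMI reentrant;
 * this sketch only shows the jiffies|logical-clock layout.
 */
static uint64_t ltt_timestamp_sketch(uint64_t jiffies_now)
{
        if (jiffies_now != last_jiffies) {
                /* Reset the logical clock at each jiffy increment. */
                last_jiffies = jiffies_now;
                logical_count = 0;
        }
        return (jiffies_now << LOGICAL_BITS) | logical_count++;
}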
Mathieu
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, Feb 22 2006, Mathieu Desnoyers wrote:
> * Jens Axboe ([email protected]) wrote:
> > On Wed, Feb 22 2006, Mathieu Desnoyers wrote:
> > > * Jens Axboe ([email protected]) wrote:
> > > I see that blktrace uses fwrite() in write_data(). Isn't it a disk
> > > write scheme where you read() from the RelayFS channel and (f)write()
> > > to a file? Oh, but the mmap'ed file is the output... I see. However,
> > > you have to mmap/unmap the output file between each subbuffer, which
> > > costs you time.
> >
> > That's just a coding thing; if it really were an issue I could
> > mremap()-extend it instead.
> >
>
> It would limit you to the 3GB addressable memory space. You couldn't
> write files bigger than that, so it's not just a coding thing :)
I'm assuming you are talking about 32-bit machines here, but even there
I can mmap() larger than that. I'd never keep segments that large
anyway; were I to do anything like that, I'd just munmap/mmap again like
I currently do. But again, it's not an issue for me; it seems we are
arguing over pretty irrelevant details right now.
> > Currently I just do sendfile() per subbuffer; if I limit it to one
> > subbuffer in flight at a time, I can reliably use poll() to check
> > whether a new subbuffer is available for transfer. I still have to
> > use a relay control file to get the subbuffer padding at the end of
> > the subbuffer, if the trace info doesn't fully fill a subbuffer.
> >
>
> Do you transfer the padding at the end of subbuffers, or do you
> truncate it? Currently, my ioctl()+lttd does not tell the size of
I don't transfer it - this is what happens:
/* padding at the end of this subbuffer, read from the relay control file */
pad = get_subbuf_padding(tip, tip->ofile_offset);
if (pad == -1)
        goto err;

/* send only the filled part of the subbuffer: header first, then data */
ts->len = buf_size - pad;
if (net_send_header(tip, ts->len))
        goto err;
if (net_sendfile(tip, ts))
        goto err;
> the padding, but it would be very simple to add this output parameter
> to the reserve subbuf ioctl. It would be particularly useful to reduce
> the network traffic on almost empty subbuffers. However, as the majority
> of subbuffers are almost full on a heavily used system, I doubt it would
> be useful at all to know the padding size in the disk/network writer
> daemon.
Yeah, I don't check the padding in order to save sending that extra
sizeof(blk_io_trace) - foo, which would be a net saving of at most 48 bytes or
so; the check is done purely to send only sane data.
> > It might be a slight improvement in the local trace case, however as
> > I said it's not really an issue for me. Even for the local trace
> > case, the read-to-mmap isn't close to being the top bottleneck for
> > traces. As it stands right now, there's little incentive for me to
> > do anything :)
> >
>
> How would you like having both mmap()+ioctl() and sendfile() added to
> the standard RelayFS file operations? This would make both disk write
> and network send quite simple and efficient.
I have added sendfile() support already, as that is what I'm using :-)
If you pull the current blktrace git repo, you'll see the blktrace
kernel patch contains a sendfile() hook for file_operations.
If there were a straightforward-to-use mmap() operation, I would surely
use it and scrap the read().
> However, I would tweak the sendfile() operation a little bit: I would
> make it use the current consumer position and only send the number of
> subbuffers that are ready to be read. It would then support more than
> one subbuffer in flight per channel.
The sendfile() I did is comparable to the read(), it will serve as much
as you have available. blktrace does per-subbuffer sends though, as we
need to figure the padding. blktrace passes NULL as offset to use the
current position, and it does update the consumer position at the end.
So it works pretty much as you expect.
> > > So I have two modes, one with and one without TSC. The simplest
> > > one is the TSC mode, where I get the TSC of the CPUs. I also log
> > > the cpu_khz variable at the beginning of the trace, so I can
> > > calculate the time myself from the tsc, but I do it later, in
> > > double precision with the analyser.
> > >
> > > In the non-TSC case, I use the jiffies counter or'd with a logical
> > > clock.
> >
> > Sounds like we can share some code there, I basically just used
> > sched_clock() as it was available and had good granularity on the
> > systems I tested then. An internal get_good_clock() that does the
> > right thing would be appreciated :)
> >
>
> The LTTng code is available at http://ltt.polymtl.ca. The major part
> of the timekeeping sits in include/asm-*/ltt.h. Note that there is a
> small detail about it: I need to reset the logical clock at each
> jiffy increment (for architectures without TSC).
Ok, I'll have a look and see if I can extract something interesting for
my use (and generically).
--
Jens Axboe