This patch set makes the block layer maintain statistics for request
queues. Resulting data closely resembles the actual I/O traffic to a
device, as the block layer takes hints from block device drivers when a
request is being issued as well as when it is about to complete.
It is crucial (for us) to be able to look at such kernel level data in
case of customer situations. It allows us to determine what kind of
requests might be involved in performance situations. This information
helps to understand whether one faces a device issue or a Linux issue.
Not being able to tap into performance data is regarded as a big minus
by some enterprise customers, who are reluctant to use Linux SCSI
support or Linux.
Statistics data includes:
- request sizes (read + write),
- residual bytes of partially completed requests (read + write),
- request latencies (read + write),
- request retries (read + write),
- request concurrency,
For sample data please have a look at the SCSI stack patch or the DASD
driver patch respectively. This data is only gathered if statistics have
been enabled by users at run time (default is off).
The advantage of instrumenting request queues is that we can cover a
broad range of devices, including SCSI tape devices.
Having the block layer maintain such statistics on behalf of drivers
provides for comparability through a common set of statistics.
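For illustration, a minimal sketch of what such a common per-queue set
of counters could look like (all names below are invented for this
example and do not match the actual patches, which build on the
statistics component mentioned further down):

    /* one instance each for reads and for writes */
    struct queue_stats_dir {
        unsigned long long requests;   /* completed requests */
        unsigned long long bytes;      /* sum of request sizes */
        unsigned long long residual;   /* bytes left over on partial completion */
        unsigned long long latency_us; /* accumulated issue-to-completion time */
        unsigned long long retries;    /* requeued/retried requests */
    };

    struct queue_stats {
        struct queue_stats_dir read;
        struct queue_stats_dir write;
        unsigned int in_flight;        /* current request concurrency */
    };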
A previous approach was to put all the code into the SCSI stack:
http://marc.theaimsgroup.com/?l=linux-kernel&m=115928678420835&w=2
I gathered from feedback that moving that stuff to the block layer
might be preferable.
These patches use the statistics component described in
Documentation/statistics.txt.
Patches are against 2.6.19-rc2-mm2.
[Patch 1/5] I/O statistics through request queues: timeval_to_us()
[Patch 2/5] I/O statistics through request queues: queue instrumentation
[Patch 3/5] I/O statistics through request queues: small SCSI cleanup
[Patch 4/5] I/O statistics through request queues: SCSI
[Patch 5/5] I/O statistics through request queues: DASD
Martin
On Sat, Oct 21 2006, Martin Peschke wrote:
> This patch set makes the block layer maintain statistics for request
> queues. Resulting data closely resembles the actual I/O traffic to a
> device, as the block layer takes hints from block device drivers when a
> request is being issued as well as when it is about to complete.
>
> It is crucial (for us) to be able to look at such kernel level data in
> case of customer situations. It allows us to determine what kind of
> requests might be involved in performance situations. This information
> helps to understand whether one faces a device issue or a Linux issue.
> Not being able to tap into performance data is regarded as a big minus
> by some enterprise customers, who are reluctant to use Linux SCSI
> support or Linux.
>
> Statistics data includes:
> - request sizes (read + write),
> - residual bytes of partially completed requests (read + write),
> - request latencies (read + write),
> - request retries (read + write),
> - request concurrency,
Question - what part of this does blktrace currently not do? In case
it's missing something, why not add it there instead of putting new
trace code in?
--
Jens Axboe
Jens Axboe wrote:
> On Sat, Oct 21 2006, Martin Peschke wrote:
>> This patch set makes the block layer maintain statistics for request
>> queues. Resulting data closely resembles the actual I/O traffic to a
>> device, as the block layer takes hints from block device drivers when a
>> request is being issued as well as when it is about to complete.
>>
>> It is crucial (for us) to be able to look at such kernel level data in
>> case of customer situations. It allows us to determine what kind of
>> requests might be involved in performance situations. This information
>> helps to understand whether one faces a device issue or a Linux issue.
>> Not being able to tap into performance data is regarded as a big minus
>> by some enterprise customers, who are reluctant to use Linux SCSI
>> support or Linux.
>>
>> Statistics data includes:
>> - request sizes (read + write),
>> - residual bytes of partially completed requests (read + write),
>> - request latencies (read + write),
>> - request retries (read + write),
>> - request concurrency,
>
> Question - what part of this does blktrace currently not do?
The Dispatch / Complete events of blktrace aren't as accurate as
the additional "markers" introduced by my patch. A request might have
been dispatched (to the block device driver) from the block layer's
point of view, although this request still lingers in the low level
device driver.
For example, the s390 DASD driver accepts a small batch of requests
from the block layer and translates them into DASD requests. Such DASD
requests stay in an internal queue until an interrupt triggers the DASD
driver to issue the next ready-made DASD request without further delay,
saving on latency.
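As a rough sketch of the idea - the hook names below are hypothetical
and not the interface of these patches - the low level driver tells the
block layer when a request really goes out to the hardware and when it
comes back, so the measured interval excludes time spent in
driver-internal queues:

    #include <linux/blkdev.h>

    /* hypothetical hooks provided by the instrumented block layer */
    void blk_stat_report_issue(struct request *rq);
    void blk_stat_report_complete(struct request *rq, unsigned int residual);

    /* in the DASD (or any other) low level driver: */
    static void drv_start_on_hardware(struct request *rq)
    {
        blk_stat_report_issue(rq);   /* request leaves the driver now */
        /* ... start the channel program / hardware command ... */
    }

    static void drv_irq_done(struct request *rq, unsigned int residual)
    {
        /* ... hardware completion handling ... */
        blk_stat_report_complete(rq, residual);
    }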
For SCSI, the accuracy of the Dispatch / Complete events of blktrace
is not such an issue, since SCSI doesn't queue stuff on its own, but
relies on the queues implemented in SCSI devices. Still, command pre-
and postprocessing in the SCSI stack adds to the latency that can be
observed through the Dispatch / Complete events of blktrace.
Of course, the addition of two events to blktrace could fix that.
And it would be some effort to teach the blktrace tools family to
calculate the set of statistics I have proposed. But that's not a
reason to do things in the kernel...
> In case it's missing something, why not add it there instead of
> putting new trace code in?
The question is:
Is blktrace a performance tool? Or a development tool, or what?
Our tests indicate that the blktrace approach is fine for performance
analysis as long as the system to be analysed isn't too busy.
But once the system faces a considerable amount of non-sequential I/O
workload, the sheer volume of blktrace-generated data starts to get painful.
The majority of the scenarios that are likely to become the subject of a
performance analysis due to some customer complaint fit into the
category of workloads that will be affected by the activation of
blktrace.
If the system runs I/O-bound, how to write out traces without
stealing bandwidth and causing side effects?
And if CPU utilisation is high, how to make sure that blktrace
tools get the required share without seriously impacting
applications which are responsible for the workload to be analysed?
How much memory is required for per-cpu and per-device relay buffers
and for the processing done by blktrace tools at run time?
What if other subsystems get rigged with relay-based traces,
following the blktrace example? I think that's okay - much better
than cluttering the printk buffer with data that doesn't necessarily
require user attention. I am advocating the renovation of
arch/s390/kernel/debug.c - a tracing facility widely used throughout
the s390 code - so that it is switched over to blktrace-like techniques
(and ideally shares code and is slimmed down).
With blktrace-like (utt-based?) tracing facilities the data
stream will swell. But if those were required to get an overview
of the performance of subsystems or drivers...
In my opinion, neither trace events relayed to user space nor
performance counters maintained in the kernel are the sole answer
to all information needs. The trick is to deliberate about when to
use which approach, and for what. Performance counters should give
directions for further investigation. Traces are fine for debugging
a specific subsystem.
Would this be a good candidate to implement using kprobes? I was under
the impression that basing instrumentation on kprobes would be a good
thing since you can load the instrumentation code only when needed, then
unload it.
Martin Peschke wrote:
> This patch set makes the block layer maintain statistics for request
> queues. Resulting data closely resembles the actual I/O traffic to a
> device, as the block layer takes hints from block device drivers when a
> request is being issued as well as when it is about to complete.
>
> It is crucial (for us) to be able to look at such kernel level data in
> case of customer situations. It allows us to determine what kind of
> requests might be involved in performance situations. This information
> helps to understand whether one faces a device issue or a Linux issue.
> Not being able to tap into performance data is regarded as a big minus
> by some enterprise customers, who are reluctant to use Linux SCSI
> support or Linux.
>
> Statistics data includes:
> - request sizes (read + write),
> - residual bytes of partially completed requests (read + write),
> - request latencies (read + write),
> - request retries (read + write),
> - request concurrency,
>
> For sample data please have a look at the SCSI stack patch or the DASD
> driver patch respectively. This data is only gathered if statistics have
> been enabled by users at run time (default is off).
>
> The advantage of instrumenting request queues is that we can cover a
> broad range of devices, including SCSI tape devices.
> Having the block layer maintain such statistics on behalf of drivers
> provides for comparability through a common set of statistics.
>
> A previous approach was to put all the code into the SCSI stack:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=115928678420835&w=2
> I gathered from feedback that moving that stuff to the block layer
> might be preferable.
>
> These patches use the statistics component described in
> Documentation/statistics.txt.
>
> Patches are against 2.6.19-rc2-mm2.
>
> [Patch 1/5] I/O statistics through request queues: timeval_to_us()
> [Patch 2/5] I/O statistics through request queues: queue instrumentation
> [Patch 3/5] I/O statistics through request queues: small SCSI cleanup
> [Patch 4/5] I/O statistics through request queues: SCSI
> [Patch 5/5] I/O statistics through request queues: DASD
>
> Martin
>
On Mon, Oct 23 2006, Martin Peschke wrote:
> Jens Axboe wrote:
> >On Sat, Oct 21 2006, Martin Peschke wrote:
> >>This patch set makes the block layer maintain statistics for request
> >>queues. Resulting data closely resembles the actual I/O traffic to a
> >>device, as the block layer takes hints from block device drivers when a
> >>request is being issued as well as when it is about to complete.
> >>
> >>It is crucial (for us) to be able to look at such kernel level data in
> >>case of customer situations. It allows us to determine what kind of
> >>requests might be involved in performance situations. This information
> >>helps to understand whether one faces a device issue or a Linux issue.
> >>Not being able to tap into performance data is regarded as a big minus
> >>by some enterprise customers, who are reluctant to use Linux SCSI
> >>support or Linux.
> >>
> >>Statistics data includes:
> >>- request sizes (read + write),
> >>- residual bytes of partially completed requests (read + write),
> >>- request latencies (read + write),
> >>- request retries (read + write),
> >>- request concurrency,
> >
> >Question - what part of this does blktrace currently not do?
>
> The Dispatch / Complete events of blktrace aren't as accurate as
> the additional "markers" introduced by my patch. A request might have
> been dispatched (to the block device driver) from the block layer's
> point of view, although this request still lingers in the low level
> device driver.
>
> For example, the s390 DASD driver accepts a small batch of requests
> from the block layer and translates them into DASD requests. Such DASD
> requests stay in an internal queue until an interrupt triggers the DASD
> driver to issue the next ready-made DASD request without further delay,
> saving on latency.
>
> For SCSI, the accuracy of the Dispatch / Complete events of blktrace
> is not such an issue, since SCSI doesn't queue stuff on its own, but
> relies on the queues implemented in SCSI devices. Still, command pre-
> and postprocessing in the SCSI stack adds to the latency that can be
> observed through the Dispatch / Complete events of blktrace.
>
> Of course, the addition of two events to blktrace could fix that.
Right, that's pretty close to what I am thinking :-)
> > In case it's missing something, why not add it there instead of
> > putting new trace code in?
>
> The question is:
> Is blktrace a performance tool? Or a development tool, or what?
I hope it's flexible enough to do both. I certainly have used it for
both. Others use it as a performance tool.
> Our tests indicate that the blktrace approach is fine for performance
> analysis as long as the system to be analysed isn't too busy.
> But once the system faces a considerable amount of non-sequential I/O
> workload, the sheer volume of blktrace-generated data starts to get painful.
Why haven't you done an analysis and posted it here? I surely cannot fix
what nobody tells me is broken or suboptimal. I have to say it's news to
me that it's performance intensive, tests I did with Alan Brunelle a
year or so ago showed it to be quite low impact. We don't even do any
sort of locking in the log path and everything is per-CPU.
> The majority of the scenarios that are likely to become the subject of a
> performance analysis due to some customer complaint fit into the
> category of workloads that will be affected by the activation of
> blktrace.
Any sort of change will impact the running system, that's a given.
> If the system runs I/O-bound, how to write out traces without
> stealing bandwidth and causing side effects?
You'd be silly to locally store traces, send them out over the network.
> And if CPU utilisation is high, how to make sure that blktrace
> tools get the required share without seriously impacting
> applications which are responsible for the workload to be analysed?
The only tool running locally is blktrace, and if you run in remote mode
it doesn't even touch the data. So far I see a lot of hand waving,
please show me some real evidence of problems you are seeing. First of
all, you probably need to investigate doing remote logging.
> How much memory is required for per-cpu and per-device relay buffers
> and for the processing done by blktrace tools at run time?
That's a tough problem, I agree. There's really no way to answer that
without doing tests. The defaults should be good enough for basically
anything you throw at it; if you lose events, they are not.
> What if other subsystems get rigged with relay-based traces,
> following the blktrace example? I think that's okay - much better
> than cluttering the printk buffer with data that doesn't necessarily
> require user attention. I am advocating the renovation of
> arch/s390/kernel/debug.c - a tracing facility widely used throughout
> the s390 code - so that it is switched over to blktrace-like techniques
> (and ideally shares code and is slimmed down).
>
> With blktrace-like (utt-based?) tracing facilities the data
> stream will swell. But if those were required to get an overview
> of the performance of subsystems or drivers...
>
> In my opinion, neither trace events relayed to user space nor
> performance counters maintained in the kernel are the sole answer
> to all information needs. The trick is to deliberate about when to
> use which approach, and for what. Performance counters should give
> directions for further investigation. Traces are fine for debugging
> a specific subsystem.
Your problem seems dasd specific so far, so probably the best solution
(if you can't/won't use blktrace) is to keep it there.
--
Jens Axboe
Jens Axboe wrote:
>> Our tests indicate that the blktrace approach is fine for performance
>> analysis as long as the system to be analysed isn't too busy.
>> But once the system faces a considerable amount of non-sequential I/O
>> workload, the sheer volume of blktrace-generated data starts to get painful.
>
> Why haven't you done an analysis and posted it here? I surely cannot fix
> what nobody tells me is broken or suboptimal.
Fair enough. We have tried out the silly way of blktrace-ing, storing
data locally. So, it's probably not worthwhile discussing that.
> I have to say it's news to
> me that it's performance intensive, tests I did with Alan Brunelle a
> year or so ago showed it to be quite low impact.
I found some discussions on linux-btrace (February 2006).
There is little information on how the alleged 2 percent impact has
been determined. Test cases seem to comprise formatting disks ...hmm.
>> If the system runs I/O-bound, how to write out traces without
>> stealing bandwidth and causing side effects?
>
> You'd be silly to locally store traces, send them out over the network.
Will try this next and post complaints, if any, along with numbers.
However, a fast network connection plus a second system for blktrace
data processing are serious requirements. Think of servers secured
by firewalls. Reading some counters in debugfs, sysfs or whatever
might be more appropriate for someone who has noticed an unexpected
I/O slowdown and needs directions for further investigation.
Martin
Well, the instrumentation "on demand" aspect is half of the truth.
A probe inserted through kprobes impacts performance more than static
instrumentation.
Phillip Susi wrote:
> Would this be a good candidate to implement using kprobes? I was under
> the impression that basing instrumentation on kprobes would be a good
> thing since you can load the instrumentation code only when needed, then
> unload it.
>
> Martin Peschke wrote:
>> This patch set makes the block layer maintain statistics for request
>> queues. Resulting data closely resembles the actual I/O traffic to a
>> device, as the block layer takes hints from block device drivers when a
>> request is being issued as well as when it is about to complete.
>>
>> It is crucial (for us) to be able to look at such kernel level data in
>> case of customer situations. It allows us to determine what kind of
>> requests might be involved in performance situations. This information
>> helps to understand whether one faces a device issue or a Linux issue.
>> Not being able to tap into performance data is regarded as a big minus
>> by some enterprise customers, who are reluctant to use Linux SCSI
>> support or Linux.
>>
>> Statistics data includes:
>> - request sizes (read + write),
>> - residual bytes of partially completed requests (read + write),
>> - request latencies (read + write),
>> - request retries (read + write),
>> - request concurrency,
>>
>> For sample data please have a look at the SCSI stack patch or the DASD
>> driver patch respectively. This data is only gathered if statistics have
>> been enabled by users at run time (default is off).
>>
>> The advantage of instrumenting request queues is that we can cover a
>> broad range of devices, including SCSI tape devices.
>> Having the block layer maintain such statistics on behalf of drivers
>> provides for comparability through a common set of statistics.
>>
>> A previous approach was to put all the code into the SCSI stack:
>> http://marc.theaimsgroup.com/?l=linux-kernel&m=115928678420835&w=2
>> I gathered from feedback that moving that stuff to the block layer
>> might be preferable.
>>
>> These patches use the statistics component described in
>> Documentation/statistics.txt.
>>
>> Patches are against 2.6.19-rc2-mm2.
>>
>> [Patch 1/5] I/O statistics through request queues: timeval_to_us()
>> [Patch 2/5] I/O statistics through request queues: queue instrumentation
>> [Patch 3/5] I/O statistics through request queues: small SCSI cleanup
>> [Patch 4/5] I/O statistics through request queues: SCSI
>> [Patch 5/5] I/O statistics through request queues: DASD
>>
>> Martin
>>
>
On Tue, Oct 24 2006, Martin Peschke wrote:
> Jens Axboe wrote:
> >>Our tests indicate that the blktrace approach is fine for performance
> >>analysis as long as the system to be analysed isn't too busy.
> >>But once the system faces a considerable amount of non-sequential I/O
> >>workload, the sheer volume of blktrace-generated data starts to get painful.
> >
> >Why haven't you done an analysis and posted it here? I surely cannot fix
> >what nobody tells me is broken or suboptimal.
>
> Fair enough. We have tried out the silly way of blktrace-ing, storing
> data locally. So, it's probably not worthwhile discussing that.
You'd probably never want to do local traces for performance analysis.
It may be handy for other purposes, though.
> > I have to say it's news to
> >me that it's performance intensive, tests I did with Alan Brunelle a
> >year or so ago showed it to be quite low impact.
>
> I found some discussions on linux-btrace (February 2006).
> There is little information on how the alleged 2 percent impact has
> been determined. Test cases seem to comprise formatting disks ...hmm.
It may sound strange, but formatting a large drive generates a huge
flood of block layer events from lots of io queued and merged. So it's
not a bad benchmark for this type of thing. And it's easy to test :-)
> >>If the system runs I/O-bound, how to write out traces without
> >>stealing bandwidth and causing side effects?
> >
> >You'd be silly to locally store traces, send them out over the network.
>
> Will try this next and post complaints, if any, along with numbers.
Thanks! Also note that you do not need to log every event, just register
a mask of interesting ones to decrease the output logging rate. We could
do with some better setup for that though, but at least you should be
able to filter out some unwanted events.
> However, a fast network connection plus a second system for blktrace
> data processing are serious requirements. Think of servers secured
> by firewalls. Reading some counters in debugfs, sysfs or whatever
> might be more appropriate for someone who has noticed an unexpected
> I/O slowdown and needs directions for further investigation.
It's hard to make something that will suit everybody. Maintaining some
counters in sysfs is of course less expensive when your POV is cpu
cycles.
--
Jens Axboe
This discussion seems to involve two different solutions to two
different problems. If it is a simple counter you want to be able to
poll, then sysfs/debugfs is an appropriate place to make the count
available. If it is a detailed log of IO requests that you are after,
then blktrace is appropriate.
I did not read the patch to see, so I must ask: does it merely keep
statistics or does it log events? If it is just statistics you are
after, then clearly blktrace is not the appropriate tool to use.
Jens Axboe wrote:
> On Tue, Oct 24 2006, Martin Peschke wrote:
>> Jens Axboe wrote:
>>>> Our tests indicate that the blktrace approach is fine for performance
>>>> analysis as long as the system to be analysed isn't too busy.
>>>> But once the system faces a considerable amount of non-sequential I/O
>>>> workload, the sheer volume of blktrace-generated data starts to get painful.
>>> Why haven't you done an analysis and posted it here? I surely cannot fix
>>> what nobody tells me is broken or suboptimal.
>> Fair enough. We have tried out the silly way of blktrace-ing, storing
>> data locally. So, it's probably not worthwhile discussing that.
>
> You'd probably never want to do local traces for performance analysis.
> It may be handy for other purposes, though.
>
>>> I have to say it's news to
>>> me that it's performance intensive, tests I did with Alan Brunelle a
>>> year or so ago showed it to be quite low impact.
>> I found some discussions on linux-btrace (February 2006).
>> There is little information on how the alleged 2 percent impact has
>> been determined. Test cases seem to comprise formatting disks ...hmm.
>
> It may sound strange, but formatting a large drive generates a huge
> flood of block layer events from lots of io queued and merged. So it's
> not a bad benchmark for this type of thing. And it's easy to test :-)
>
>>>> If the system runs I/O-bound, how to write out traces without
>>>> stealing bandwidth and causing side effects?
>>> You'd be silly to locally store traces, send them out over the network.
>> Will try this next and post complaints, if any, along with numbers.
>
> Thanks! Also note that you do not need to log every event, just register
> a mask of interesting ones to decrease the output logging rate. We could
> do with some better setup for that though, but at least you should be
> able to filter out some unwanted events.
>
>> However, a fast network connection plus a second system for blktrace
>> data processing are serious requirements. Think of servers secured
>> by firewalls. Reading some counters in debugfs, sysfs or whatever
>> might be more appropriate for someone who has noticed an unexpected
>> I/O slowdown and needs directions for further investigation.
>
> It's hard to make something that will suit everybody. Maintaining some
> counters in sysfs is of course less expensive when your POV is cpu
> cycles.
>
Martin Peschke wrote:
> Well, the instrumentation "on demand" aspect is half of the truth.
> A probe inserted through kprobes impacts performance more than static
> instrumentation.
True, but given that there are going to be a number of things you might
want to instrument at some point, and that at any given time you might
only be interested in a few of those, it likely will be better overall
to spend some more time only on the few than less time on the many.
Phillip Susi wrote:
> This discussion seems to involve two different solutions to two
> different problems. If it is a simple counter you want to be able to
> poll, then sysfs/debugfs is an appropriate place to make the count
> available. If it is a detailed log of IO requests that you are after,
> then blktrace is appropriate.
It's about counters ... well, sometimes a bunch of counters called
a histogram.
> I did not read the patch to see, so I must ask: does it merely keep
> statistics or does it log events? If it is just statistics you are
> after, then clearly blktrace is not the appropriate tool to use.
If matters were as simple as that, sigh.
Statistics feed on data reported through events.
"Oh, this request has completed - time to update I/O counters."
The tricky question is: is event processing, that is, statistics data
aggregation, better done later (in user space), or immediately
(in the kernel). Both approaches exist: blktrace/btt vs.
gendisk statistics used by iostat, for example.
My feeling was that the in-kernel counters approach of my patch
was fine with regard to the purpose of these statistics. But blktrace
exists, undeniably, and deserves a closer look.
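To make the contrast concrete, a toy sketch of the two models - plain C,
nothing taken from an actual implementation. In-kernel aggregation folds
each completion into a handful of counters right away, much like the
gendisk statistics behind iostat; the blktrace/btt model only emits a
small record and leaves the arithmetic to user space:

    /* in-kernel aggregation: a few additions per completion */
    struct io_counters {
        unsigned long long requests;
        unsigned long long bytes;
        unsigned long long latency_us;
    };

    static void account_completion(struct io_counters *c,
                                   unsigned int bytes,
                                   unsigned long long latency_us)
    {
        c->requests++;
        c->bytes += bytes;
        c->latency_us += latency_us;
    }

    /* deferred aggregation: emit one small record per event instead */
    struct io_event {
        unsigned long long time_us;
        unsigned int bytes;
        unsigned char write;   /* 0 = read, 1 = write */
        unsigned char what;    /* e.g. issue or complete */
    };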
Martin
Phillip Susi wrote:
> Martin Peschke wrote:
>> Well, the instrumentation "on demand" aspect is half of the truth.
>> A probe inserted through kprobes impacts performance more than static
>> instrumentation.
>
> True, but given that there are going to be a number of things you might
> want to instrument at some point, and that at any given time you might
> only be interested in a few of those, it likely will be better overall
> to spend some more time only on the few than less time on the many.
I'm sure there will be more discussions to sort out which data should be
retrieved through which kind of instrumentation, and to find the right mix.
But I won't dare speculate about the outcome. I am just tossing another
request for data into the discussion (important, in my eyes), along with a
method for retrieving such data (less important, in my eyes).
Jens Axboe wrote:
> On Tue, Oct 24 2006, Martin Peschke wrote:
>> Jens Axboe wrote:
>>>> Our tests indicate that the blktrace approach is fine for performance
>>>> analysis as long as the system to be analysed isn't too busy.
>>>> But once the system faces a considerable amount of non-sequential I/O
>>>> workload, the sheer volume of blktrace-generated data starts to get painful.
>>> Why haven't you done an analysis and posted it here? I surely cannot fix
>>> what nobody tells me is broken or suboptimal.
>> Fair enough. We have tried out the silly way of blktrace-ing, storing
>> data locally. So, it's probably not worthwhile discussing that.
>
> You'd probably never want to do local traces for performance analysis.
> It may be handy for other purposes, though.
"...probably not worthwhile discussing that" in the context of performance
analysis.
>>> I have to say it's news to
>>> me that it's performance intensive, tests I did with Alan Brunelle a
>>> year or so ago showed it to be quite low impact.
>> I found some discussions on linux-btrace (February 2006).
>> There is little information on how the alleged 2 percent impact has
>> been determined. Test cases seem to comprise formatting disks ...hmm.
>
> It may sound strange, but formatting a large drive generates a huge
> flood of block layer events from lots of io queued and merged. So it's
> not a bad benchmark for this type of thing. And it's easy to test :-)
Just wondering to what degree this might resemble I/O workloads run
by customers in their data centers.
>>> You'd be silly to locally store traces, send them out over the network.
>> Will try this next and post complaints, if any, along with numbers.
>
> Thanks! Also note that you do not need to log every event, just register
> a mask of interesting ones to decrease the output logging rate. We could
> do with some better setup for that though, but at least you should be
> able to filter out some unwanted events.
...and consequently try to scale down relay buffers, reducing the risk of
memory constraints caused by blktrace activation.
>> However, a fast network connection plus a second system for blktrace
>> data processing are serious requirements. Think of servers secured
>> by firewalls. Reading some counters in debugfs, sysfs or whatever
>> might be more appropriate for someone who has noticed an unexpected
>> I/O slowdown and needs directions for further investigation.
>
> It's hard to make something that will suit everybody. Maintaining some
> counters in sysfs is of course less expensive when your POV is cpu
> cycles.
Counters are also cheaper with regard to memory consumption. Counters
probably cause fewer side effects, but are less flexible than
full-blown traces.
On Wed, Oct 25 2006, Martin Peschke wrote:
> >>>I have to say it's news to
> >>>me that it's performance intensive, tests I did with Alan Brunelle a
> >>>year or so ago showed it to be quite low impact.
> >>I found some discussions on linux-btrace (February 2006).
> >>There is little information on how the alleged 2 percent impact has
> >>been determined. Test cases seem to comprise formatting disks ...hmm.
> >
> >It may sound strange, but formatting a large drive generates a huge
> >flood of block layer events from lots of io queued and merged. So it's
> >not a bad benchmark for this type of thing. And it's easy to test :-)
>
> Just wondering to what degree this might resemble I/O workloads run
> by customers in their data centers.
It won't, of course; the point is to generate a flood of events to put as
much pressure on blktrace logging as possible. Dirtying tons of data
does that.
> >>>You'd be silly to locally store traces, send them out over the network.
> >>Will try this next and post complaints, if any, along with numbers.
> >
> >Thanks! Also note that you do not need to log every event, just register
> >a mask of interesting ones to decrease the output logging rate. We could
> >do with some better setup for that though, but at least you should be
> >able to filter out some unwanted events.
>
> ...and consequently try to scale down relay buffers, reducing the risk of
> memory constraints caused by blktrace activation.
Pretty pointless, unless you are tracing lots of disks. 4x128kb gone
won't be a showstopper for anyone.
> >>However, a fast network connection plus a second system for blktrace
> >>data processing are serious requirements. Think of servers secured
> >>by firewalls. Reading some counters in debugfs, sysfs or whatever
> >>might be more appropriate for some one who has noticed an unexpected
> >>I/O slowdown and needs directions for further investigation.
> >
> >It's hard to make something that will suit everybody. Maintaining some
> >counters in sysfs is of course less expensive when your POV is cpu
> >cycles.
>
> Counters are also cheaper with regard to memory consumption. Counters
> probably cause fewer side effects, but are less flexible than
> full-blown traces.
And the counters are special cases and extremely inflexible.
--
Jens Axboe
Jens Axboe wrote:
>>> Thanks! Also note that you do not need to log every event, just register
>>> a mask of interesting ones to decrease the output logging rate. We could
>>> do with some better setup for that though, but at least you should be
>>> able to filter out some unwanted events.
>> ...and consequently try to scale down relay buffers, reducing the risk of
>> memory constraints caused by blktrace activation.
>
> Pretty pointless, unless you are tracing lots of disks. 4x128kb gone
> won't be a showstopper for anyone.
per (online) CPU and device?
>>>> However, a fast network connection plus a second system for blktrace
>>>> data processing are serious requirements. Think of servers secured
>>>> by firewalls. Reading some counters in debugfs, sysfs or whatever
>>>> might be more appropriate for someone who has noticed an unexpected
>>>> I/O slowdown and needs directions for further investigation.
>>> It's hard to make something that will suit everybody. Maintaining some
>>> counters in sysfs is of course less expensive when your POV is cpu
>>> cycles.
>> Counters are also cheaper with regard to memory consumption. Counters
>> probably cause fewer side effects, but are less flexible than
>> full-blown traces.
>
> And the counters are special cases and extremely inflexible.
Well, I disagree with "extremely".
These statistics have attributes that allow users to adjust data
aggregation, e.g. to retain more detail in a histogram by adding
buckets.
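For what it's worth, the kind of adjustable aggregation meant here could
look roughly like this - illustrative only, not the interface of the
statistics component. The bucket boundaries live in a table that can be
re-created with finer granularity while the statistic is switched off:

    struct latency_histogram {
        unsigned int nr_buckets;
        unsigned long long *bound_us;  /* nr_buckets - 1 upper bounds */
        unsigned long long *count;     /* one counter per bucket */
    };

    static void histogram_add(struct latency_histogram *h,
                              unsigned long long latency_us)
    {
        unsigned int i;

        for (i = 0; i < h->nr_buckets - 1; i++)
            if (latency_us <= h->bound_us[i])
                break;
        h->count[i]++;                 /* last bucket catches the rest */
    }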
On Wed, Oct 25 2006, Martin Peschke wrote:
> Jens Axboe wrote:
> >>>Thanks! Also note that you do not need to log every event, just register
> >>>a mask of interesting ones to decrease the output logging rate. We could
> >>>do with some better setup for that though, but at least you should be
> >>>able to filter out some unwanted events.
> >>...and consequently try to scale down relay buffers, reducing the risk of
> >>memory constraints caused by blktrace activation.
> >
> >Pretty pointless, unless you are tracing lots of disks. 4x128kb gone
> >won't be a showstopper for anyone.
>
> per (online) CPU and device?
Yes, per device and per CPU. It does add up of course, but even some
megabytes of buffers should be ok for most uses. You can shrink them, if
you don't need that much. It is advised to have at least two sub-buffers
though, so shrinking the size is probably better.
> >>>>However, a fast network connection plus a second system for blktrace
> >>>>data processing are serious requirements. Think of servers secured
> >>>>by firewalls. Reading some counters in debugfs, sysfs or whatever
> >>>>might be more appropriate for someone who has noticed an unexpected
> >>>>I/O slowdown and needs directions for further investigation.
> >>>It's hard to make something that will suit everybody. Maintaining some
> >>>counters in sysfs is of course less expensive when your POV is cpu
> >>>cycles.
> >>Counters are also cheaper with regard to memory consumption. Counters
> >>probably cause fewer side effects, but are less flexible than
> >>full-blown traces.
> >
> >And the counters are special cases and extremely inflexible.
>
> Well, I disagree with "extremely".
Fairly inflexible, then :-)
> These statistics have attributes that allow users to adjust data
> aggregation, e.g. to retain more detail in a histogram by adding
> buckets.
I'm not saying they aren't useful, just saying that they are mainly
useful for dasd, apparently.
Let's let this go until we know more about how to proceed.
--
Jens Axboe
Martin Peschke <[email protected]> writes:
> [...] The tricky question is: is event processing, that is,
> statistics data aggregation, better done later (in user space), or
> immediately (in the kernel). Both approaches exist: blktrace/btt vs.
> gendisk statistics used by iostat, for example. [...]
I would put it one step farther: the tricky question is whether it's
worth separating marking the state change events ("request X
enqueued") from the action to be taken ("track statistics", "collect
trace records").
The reason I brought up the lttng/marker thread here was because that
suggests a way of addressing several of the problems at the same time.
This could work thusly: (This will sound familiar to OLS attendees.)
- The blktrace code would adopt a generic marker mechanism such as
that (still) evolving within the lttng/systemtap effort. These
markers would replace calls to inline functions such as
blk_add_trace_bio(q,bio,BLK_TA_QUEUE);
with something like
MARK(blk_bio_queue,q,bio);
- The blktrace code that formats and manages trace data would become a
consumer of the marker API. It would be hooked up at runtime to
these markers. When the events fire, the tracing backend receiving
the callbacks could do the same thing it does now. (With the
markers dormant, the cost should not be much higher than the current
(likely (!q->blk_trace)) conditional.)
- The mp3 statistics code would be an alternate backend to these same
markers. It could be activated or deactivated on the fly (to let
another subsystem use the markers). The code would maintain statistics
in its own memory and could present the data on /proc or whatnot, the
same way as today.
- Additional backends would be immediately possible: lttng style
tracing or even fully programmable systemtap probing / analysis
could all be dynamically activated without further kernel patches or
rebooting.
From a user's point of view, it could be the best of all worlds: easy
to get a complete trace for detailed analysis, easy to retain plain
statistics for simple monitoring, easy to do something more elaborate
if necessary.
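A very rough sketch of what that split could look like in code; the
marker macro and the registration call are invented names, since the
lttng/systemtap interface was still in flux at the time:

    #include <linux/blkdev.h>
    #include <linux/init.h>

    /* block layer side: only mark the event, attach no policy */
    static void blk_issue_to_driver(struct request_queue *q, struct request *rq)
    {
        MARK(blk_request_issue, q, rq);   /* hypothetical marker macro */
        /* ... hand the request to the driver ... */
    }

    /* statistics backend, loaded on demand, consuming the same marker */
    static void stat_issue_probe(struct request_queue *q, struct request *rq)
    {
        /* fold the event into in-kernel counters, as the static patch would */
    }

    static int __init stat_backend_init(void)
    {
        /* name and signature are assumptions, not a real marker API */
        return marker_probe_register("blk_request_issue", stat_issue_probe);
    }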
- FChE
Frank Ch. Eigler wrote:
> Martin Peschke <[email protected]> writes:
>
>> [...] The tricky question is: is event processing, that is,
>> statistics data aggregation, better done later (in user space), or
>> immediately (in the kernel). Both approaches exist: blktrace/btt vs.
>> gendisk statistics used by iostat, for example. [...]
>
> I would put it one step farther: the tricky question is whether it's
> worth separating marking the state change events ("request X
> enqueued") from the action to be taken ("track statistics", "collect
> trace records").
>
> The reason I brought up the lttng/marker thread here was because that
> suggests a way of addressing several of the problems at the same time.
> This could work thusly: (This will sound familiar to OLS attendees.)
>
> - The blktrace code would adopt a generic marker mechanism such as
> that (still) evolving within the lttng/systemtap effort. These
> markers would replace calls to inline functions such as
> blk_add_trace_bio(q,bio,BLK_TA_QUEUE);
> with something like
> MARK(blk_bio_queue,q,bio);
>
> - The blktrace code that formats and manages trace data would become a
> consumer of the marker API. It would be hooked up at runtime to
> these markers.
I suppose the marker approach will be adopted if jumping from a marker
to code hooked up there can be made fast and secure enough for
prominent architectures.
> When the events fire, the tracing backend receiving
> the callbacks could do the same thing it does now. (With the
> markers dormant, the cost should not be much higher than the current
> (likely (!q->blk_trace)) conditional.)
>
> - The mp3 statistics code would be an alternate backend to these same
> markers. It could be activated or deactivated on the fly (to let
> another subsystem use the markers). The code would maintain statistics
> in its own memory and could present the data on /proc or whatnot, the
> same way as today.
Basically, I agree. But the devil is in the details.
Dynamic instrumentation based on markers allows code to be added,
but it doesn't allow data structures to be extended, AFAICS.
Statistics might require temporary results to be stored per
entity.
For example, latencies require two timestamps. The older one needs to
be stored somewhere until the second timestamp can be determined and
a latency is calculated. I would add a field to struct request
for this purpose.
The workaround would be to pass any intermediate result in the form
of a trace event up to user space and try to sort it out later -
which takes us back to the blktrace approach.
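In code, the static variant described above amounts to something like
the following; the extra field name is invented for this sketch, and
patch 1/5 would supply a timeval-to-microseconds helper instead of the
open-coded arithmetic:

    #include <linux/blkdev.h>
    #include <linux/time.h>

    /*
     * struct request would grow one extra field, e.g.:
     *     struct timeval stat_issue_tv;
     */

    static void stat_note_issue(struct request *rq)
    {
        do_gettimeofday(&rq->stat_issue_tv);
    }

    static long long stat_latency_us(struct request *rq)
    {
        struct timeval now;
        long long us;

        do_gettimeofday(&now);
        us  = (long long)(now.tv_sec - rq->stat_issue_tv.tv_sec) * 1000000LL;
        us += now.tv_usec - rq->stat_issue_tv.tv_usec;
        return us;
    }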
Martin
Hi -
On Thu, Oct 26, 2006 at 01:07:53PM +0200, Martin Peschke wrote:
> [...]
> I suppose the marker approach will be adopted if jumping from a
> marker to code hooked up there can be made fast and secure enough
> for prominent architectures.
Agree, and I think we're not far. By "secure" you mean "robust"
right?
> [...]
> Dynamic instrumentation based on markers allows code to be added,
> but it doesn't allow data structures to be extended, AFAICS.
>
> Statistics might require temporary results to be stored per
> entity.
The data can be kept in data structures private to the instrumentation
module. Instead of growing the base structure, you have a lookup
table indexed by a key of the base structure. In the lookup table,
you store whatever you would need: timestamps, whatnot.
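For instance, a simplified side table keyed by the request pointer
might look like the sketch below - direct-mapped and lock-free for
brevity, so colliding entries simply overwrite each other; systemtap's
internal tables are more careful than that:

    #include <linux/blkdev.h>
    #include <linux/hash.h>
    #include <linux/time.h>

    #define STAT_TBL_BITS 10              /* 1024 slots, allocated up front */

    struct stat_entry {
        struct request *key;              /* NULL means "free for reuse" */
        struct timeval issue_tv;
    };

    static struct stat_entry stat_tbl[1 << STAT_TBL_BITS];

    static void side_note_issue(struct request *rq)
    {
        struct stat_entry *e = &stat_tbl[hash_ptr(rq, STAT_TBL_BITS)];

        e->key = rq;
        do_gettimeofday(&e->issue_tv);
    }

    static int side_get_issue(struct request *rq, struct timeval *tv)
    {
        struct stat_entry *e = &stat_tbl[hash_ptr(rq, STAT_TBL_BITS)];

        if (e->key != rq)
            return 0;                     /* lost to a collision */
        *tv = e->issue_tv;
        e->key = NULL;                    /* slot immediately reusable */
        return 1;
    }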
> The workaround would be to pass any intermediate result in the form
> of a trace event up to user space and try to sort it out later -
> which takes us back to the blktrace approach.
In systemtap, it is routine to store such intermediate data in kernel
space, and process it into aggregate statistics on demand, still in
kernel space. User space need only see finished results. This part
is not complicated.
- FChE
Frank Ch. Eigler wrote:
> Hi -
>
> On Thu, Oct 26, 2006 at 01:07:53PM +0200, Martin Peschke wrote:
>> [...]
>> I suppose the marker approach will be adopted if jumping from a
>> marker to code hooked up there can be made fast and secure enough
>> for prominent architectures.
>
> Agree, and I think we're not far. By "secure" you mean "robust"
> right?
yes
>> [...]
>> Dynamic instrumentation based on markers allows code to be added,
>> but it doesn't allow data structures to be extended, AFAICS.
>>
>> Statistics might require temporary results to be stored per
>> entity.
>
> The data can be kept in data structures private to the instrumentation
> module. Instead of growing the base structure, you have a lookup
> table indexed by a key of the base structure. In the lookup table,
> you store whatever you would need: timestamps, whatnot.
lookup_table[key] = value , or
lookup_table[key]++
How does this scale?
It must be something other than an array, because key boundaries
aren't known when the lookup table is created, right?
And actual keys might be few and far between.
So you have got some sort of list or tree and do some searching,
don't you?
What if the heap of intermediate results grows into thousands or more?
>> The workaround would be to pass any intermediate result in the form
>> of a trace event up to user space and try to sort it out later -
>> which takes us back to the blktrace approach.
>
> In systemtap, it is routine to store such intermediate data in kernel
> space, and process it into aggregate statistics on demand, still in
> kernel space. User space need only see finished results. This part
> is not complicated.
Yes. I tried it out earlier this year.
Hi -
On Thu, Oct 26, 2006 at 03:37:54PM +0200, Martin Peschke wrote:
> [...]
> lookup_table[key] = value , or
> lookup_table[key]++
>
> How does this scale?
It depends. If one is interested in only aggregates as an end result,
then intermediate totals can be tracked individually per-cpu with no
locking contention, so this scales well.
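The per-cpu part in miniature - illustrative only, though systemtap
generates comparable code: each CPU touches nothing but its own counters
on the hot path, and a reader sums over all CPUs only when the aggregate
is actually requested:

    #include <linux/percpu.h>

    struct lat_sum {
        unsigned long long count;
        unsigned long long total_us;
    };

    static DEFINE_PER_CPU(struct lat_sum, lat_sums);

    static void lat_account(unsigned long long latency_us)
    {
        struct lat_sum *s = &get_cpu_var(lat_sums);

        s->count++;
        s->total_us += latency_us;
        put_cpu_var(lat_sums);
    }

    static void lat_read(struct lat_sum *out)
    {
        int cpu;

        out->count = 0;
        out->total_us = 0;
        for_each_possible_cpu(cpu) {
            out->count += per_cpu(lat_sums, cpu).count;
            out->total_us += per_cpu(lat_sums, cpu).total_us;
        }
    }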
> It must be something other than an array, because key boundaries
> aren't known when the lookup table is created, right?
> And actual keys might be few and far between.
In systemtap, we use a hash table.
> What if the heap of intermediate results grows into thousands or
> more? [...]
It depends whether you mean "rows" or "columns".
By "rows", if you need to track thousands of queues, you will need
memory to store some data for each of them. In systemtap's case, the
maximum number of elements in a hash table is configurable, and is all
allocated at startup time. (The default is a couple of thousand.)
This is of course still larger than enlarging the base structures the
way your code does. But it's only larger by a constant amount, and
makes it unnecessary to patch the code.
By "columns", if you need to track statistical aggregates of thousands
of data points for an individual queue, then one can use a handful of
fixed-size counters, as you already have for histograms.
Anyway, my point was not that you should use systemtap proper, or that
you need to use the same techniques for managing data on the side.
It's that by using instrumentation markers, more things are possible.
- FChE
Frank Ch. Eigler wrote:
> Hi -
>
> On Thu, Oct 26, 2006 at 03:37:54PM +0200, Martin Peschke wrote:
>> [...]
>> lookup_table[key] = value , or
>> lookup_table[key]++
>>
>> How does this scale?
>
> It depends. If one is interested in only aggregates as an end result,
> then intermediate totals can be tracked individually per-cpu with no
> locking contention, so this scales well.
Sorry for not being clear.
I meant scaling with regard to lots of different keys.
This is what you have described as "By 'rows'", isn't it?
For example, if I wanted to store a timestamp for each request
issued, and there were lots of devices and the I/O was driving
the system crazy - how would that affect lookup time?
I would use, say, the address of struct request as key. I would
store the start time of a request there. Once a request completes
I would look up the start time and calculate a latency.
I would be done with that lookup table entry then.
But it won't go away, will it? Is this an issue?
>> It must be something other than an array, because key boundaries
>> aren't known when the lookup table is created, right?
>> And actual keys might be few and far between.
>
> In systemtap, we use a hash table.
>
>> What if the heap of intermediate results grows into thousands or
>> more? [...]
>
> It depends whether you mean "rows" or "columns".
>
> By "rows", if you need to track thousands of queues, you will need
> memory to store some data for each of them. In systemtap's case, the
> maximum number of elements in a hash table is configurable, and is all
> allocated at startup time. (The default is a couple of thousand.)
> This is of course still larger than enlarging the base structures the
> way your code does. But it's only larger by a constant amount, and
> makes it unnecessary to patch the code.
>
> By "columns", if you need to track statistical aggregates of thousands
> of data points for an individual queue, then one can use a handful of
> fixed-size counters, as you already have for histograms.
>
>
> Anyway, my point was not that you should use systemtap proper, or that
> you need to use the same techniques for managing data on the side.
> It's that by using instrumentation markers, more things are possible.
Right, I have gone off on a tangent - systemtap, just one out of many
likely exploiters of markers.
Anyway, I think this discussion shows that any dynamically added client
of kernel markers which needs to hold extra data for entities like
requests might be difficult to implement efficiently (compared
to static instrumentation), because markers, by nature, only allow
for code additions, but not for additions to existing data structures.
Hi -
On Thu, Oct 26, 2006 at 05:36:59PM +0200, Martin Peschke wrote:
> [...] I meant scaling with regard to lots of different keys. This
> is what you have described as "By 'rows'", isn't it?
Yes.
> For example, if I wanted to store a timestamp for each request
> issued, and there were lots of devices and the I/O was driving the
> system crazy - how would that affect lookup time?
If you have only hundreds or thousands of such requests on the go
at any given time, that's not a problem. Hash by pointer.
> [...] I would be done with that lookup table entry then. But it
> won't go away, will it? Is this an issue?
The entry can be instantly cleared for reuse by another future
key-value pair. Think of it like a mini slab cache.
> [...] Anyway, I think this discussion shows that any dynamically
> added client of kernel markers which needs to hold extra data for
> entities like requests might be difficult to implement
> efficiently (compared to static instrumentation), because markers,
> by nature, only allow for code additions, but not for additions to
> existing data structures.
It's a question that mixes quantitative and policy matters. It's
certainly *somewhat* slower to store data on the side, but whether in
the context of the event source that is okay or not Just Depends. On
the flip side, patching in hard-coded extra data storage for busy
structures also has a cost if the statistics gathering is not actually
requested by the end-user. (On the policy side, one must weigh to
what extent it's reasonable to pad more and more data structures, just
in case.)
- FChE
Jens, Andrew,
I am not pursuing this patch set. I have convinced myself that blktrace
is capable of being used as an analysis tool for low-level I/O
performance, if invoked with appropriate settings. In our tests, we
limited traces to Issue/Complete events, cut down on relay buffers, and
redirected traces to a network connection instead of filling up locally
attached storage. Cost is about 2 percent, as advertised.
Thanks,
Martin
On Thu, Nov 02 2006, martin wrote:
> Jens, Andrew,
> I am not pursuing this patch set. I have convinced myself that blktrace
> is capable of being used as an analysis tool for low-level I/O
> performance, if invoked with appropriate settings. In our tests, we
> limited traces to Issue/Complete events, cut down on relay buffers, and
> redirected traces to a network connection instead of filling up locally
> attached storage. Cost is about 2 percent, as advertised.
Martin, thanks for following up on this and verifying the cost numbers.
--
Jens Axboe