2006-05-19 16:12:18

by Martin Peschke

[permalink] [raw]
Subject: [Patch 4/6] statistics infrastructure - documentation

documentation for developers and users

Signed-off-by: Martin Peschke <[email protected]>
---

00-INDEX | 2
statistics.txt | 741 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 743 insertions(+)

diff -Nurp a/Documentation/00-INDEX b/Documentation/00-INDEX
--- a/Documentation/00-INDEX 2006-03-20 06:53:29.000000000 +0100
+++ b/Documentation/00-INDEX 2006-05-19 16:02:09.000000000 +0200
@@ -264,6 +264,8 @@ stable_kernel_rules.txt
- rules and procedures for the -stable kernel releases.
stallion.txt
- info on using the Stallion multiport serial driver.
+statistics.txt
+ - info on statistics infrastructure available for drivers and others
svga.txt
- short guide on selecting video modes at boot via VGA BIOS.
sx.txt
diff -Nurp a/Documentation/statistics.txt b/Documentation/statistics.txt
--- a/Documentation/statistics.txt 1970-01-01 01:00:00.000000000 +0100
+++ b/Documentation/statistics.txt 2006-05-19 16:23:32.000000000 +0200
@@ -0,0 +1,741 @@
+
+ Statistics infrastructure
+
+0. Which problems is it meant to solve?
+1. Concepts
+2. Performance
+3. Modes of data processing
+4. User interface
+5. Programming interface
+6. Possible future enhancements and known bugs
+7. Contact
+
+
+
+
+ 0. Which problems is it meant to solve?
+
+This common code layer implements statistics in a device driver independent
+and architecture independent way. It minimizes the effort by providing a simple
+programming interface. Added benefit is a generic user interface.
+
+
+
+
+ 1. Concepts
+
+ Overview
+
+The following figure depicts how the statistics infrastructure
+fits into the global picture, and how it interacts with both exploiting
+kernel code as well as users.
+
+ USER : KERNEL
+ :
+ user statistics programming
+ interface infrastructure interface exploiter
+ : +------------------+ : +-----------------+
+ : | process data and | : | collect and |
+ "data" : | provide output | (X, Y) | report data |
+ <====================| to user |<==============| as (X, Y) pairs |
+ file : | ^ | : | |
+ : | ^ | : | |
+ : | ^ | : | |
+ : | ^ | : | create/discard |
+ "definition" : | display settings | : | statistics, |
+ <===================>| and accept |<==============| provide default |
+ file : | changed settings | : | settings |
+ : +------------------+ : +-----------------+
+ : :
+
+Actual semantics of the data that feeds a statistic is unimportant when it
+comes to data processing. All that matters is how the user wants the data to
+be presented (counters, histograms, and so on). That's a job that can be
+be done by a generic layer without intervention by the device driver
+which is the actual source of statistics data.
+
+
+ The role of the statistics infrastructure
+
+It is the statistic infrastructure's task to accept or drop, accumulate,
+compute and store, as well as display statistics data according to the
+current settings.
+
+
+ The role of exploiters
+
+It is the exploiter's (e.g. device driver's) responsibility to feed the
+statistics infrastructure with sampled data for the statistics maintained by the
+statistics infrastructure on behalf of the exploiter.
+
+It would be nice of any exploiter to provide a default configuration for each
+statistic that most likely works best for general purpose use.
+
+
+ The role of users
+
+It is the user's freedom to configure how accumulation, computation and
+display of data are done, according to their needs.
+
+
+ The form of reported data
+
+Exploiters report data in the form of (X, Y) value pairs with X being
+a quantity for the main characteristic of the statistic, like a request size
+or request latency, and with Y being a qualifier for that characteristic,
+i.e. the occurrence of a particular X-value.
+
+Thus, the Y-part can be seen as an optimisation that allows exploiters
+to report a bunch of similar measurements in one call (see statistic_add()).
+For the programmer's convenience, Y can be omitted when it would be always 1
+(see statistic_inc()).
+
+
+ How data is reported
+
+There are two methods how such data can be provided to the statistics
+infrastructure, a push interface and a pull interface. Each statistic
+is either a pull-type or push-type statistic as determined by the exploiter.
+
+The push-interface is suitable for data feeds that report incremental updates
+to statistics, and where actual accumulation can be left to the statistics
+infrastructure. New measurements usually trigger pushing data.
+(see statistics_add() and statistic_inc())
+
+The pull-interface is suitable for data that already comes in an aggregated
+form, like hardware measurement data or counters already maintained and
+used by exploiters for other purposes. Reading statistics data from files
+triggers an optional callback of the exploiter, which can update pull-type
+statistics then (see statistic_set()).
+
+
+ How data is processed
+
+(X, Y) pairs can be processed in different ways by the statistics
+infrastructure, according to the current settings applicable to a
+particular statistic.
+
+For example, X might be used to distinguish certain buckets (see histogram).
+Minimum, average and maximum X values might be determined (see utilisation).
+Y might be summed up (see counter, for example).
+
+See below for a reference of processing options.
+
+Please note that the statistics infrastructure does not care about the
+actual semantics of (X, Y), and that it just adheres to abstract rules
+describing what to do with (X, Y) pairs for certain settings.
+It is up to the user to interpret processed data, to add semantics
+back to it, and to choose settings and, therewith, data processing modes
+according to their needs.
+
+
+ How statistics are organised
+
+Statistics are grouped within "interfaces" (debugfs entries) by exploiters,
+in order to reflect collections of related statistics of an entity,
+which is also quite efficient with regard to memory use.
+
+ statistics infrastructure
+ |
+ +----- statistic interface
+ | |
+ | +----- statistic
+ | | (comprising definition and data)
+ | |
+ | +----- statistic
+ | |
+ | +----- statistic
+ | |
+ | :
+ |
+ |
+ +----- statistic interface
+ | |
+ | +----- statistic
+ | |
+ : :
+
+The user interface, the programming interface and the internal data
+structures are organised like this.
+
+
+ Why debugfs for now
+
+While sysfs comes with a refined structure reflecting almost everything
+in the system, it is (by design) not good at representing large and variable
+amounts of data, that is, more than one value per file. As for statistics,
+we could make good use of the former, but not of the latter.
+
+For example, the same statistic might work as a single counter, or as a
+histogram comprising a variable (user-defined) number of buckets, or as an
+adaptable list of buckets for sparse concrete values, etc. Whatever the result
+looks like should be left to the individual modes of data processing.
+In order to reduce all kinds of data processing and their output to a common
+denominator, an output format along the following lines is suggested and
+has been implemented:
+
+ latency_write <=0 0 \
+ latency_write <=1 13 |
+ latency_write <=2 13 |
+ latency_write <=4 56 |
+ latency_write <=8 144 |
+ latency_write <=16 184 | a histogran with
+ latency_write <=32 181 > 13 buckets
+ latency_write <=64 74 |
+ latency_write <=128 271 |
+ latency_write <=256 0 |
+ latency_write <=512 33 |
+ latency_write <=1024 0 |
+ latency_write >1024 0 /
+ latency_read <=0 0 \
+ ... > another histogram
+ latency_read >1024 0 /
+ size_write missed 0x0 \
+ size_write 0x1000 143 |
+ size_write 0xc000 42 |
+ size_write 0x10000 14 | an adaptable list
+ size_write 0xf000 13 > with a growing number of buckets
+ size_write 0x1e000 12 | (up to a defined limit only)
+ size_write 0x14000 12 |
+ ... |
+ size_write 0x9000 1 /
+ queue_used_depth 970 1 18.122 32 > num min avg max for a queue
+
+Such output can grow as needed in debugfs files. It is human-readable and
+could be parsed and postprocessed by simple scripts that are aware of what the
+output of the various data processing modes looks like.
+
+
+ State machine
+
+Each statistic has a state that should be initialised by exploiters.
+Users probably want to adjust this state, e.g. enable
+data gathering. Defined states and transitions are:
+
+ state=unconfigured (mode of data processing has not yet been defined)
+ A
+ |
+ |
+ V
+ state=released (mode of data processing has been defined, but memory
+ A required for data gathering has not yet been allocated
+ | - would be a good default setup provided by exploiters)
+ |
+ V
+ state=off (all memory required for the defined mode of data
+ A processing has been allocated, but data gathering is
+ | currently disabled - data available to users, though)
+ |
+ V
+ state=on (data gathering is enabled and being done according to
+ the currently defined processing mode - data available
+ to users)
+
+How to alter states is explained in the user interface section of this document.
+
+
+
+
+ 2. Performance
+
+
+ Some preliminary numbers
+
+FIXME
+
+ Per-CPU data
+
+Measurements reported by exploiters are accumulated into per-CPU data areas
+in order to avoid the introduction of serialisation during the
+execution of statistic_add(). Locking of per-CPU data is done by disabling
+preemption and interrupts per CPU for the short time of a statistic update.
+
+Per-CPU data is not used if a statistic doesn't collect incremental data,
+i.e. if is only updated using statistic_set().
+
+
+ Path length of statistic_add() & friends
+
+Some flexibility and functionality are achieved on the expense of slightly
+increased path length.
+
+A function pointer is used to implement different data processing modes
+users can choose from.
+
+Individual data processing modes might come with their particular knobs,
+e.g. resolution or precision. That means that the statistics infrastructure
+has to retrieve some values required for calculation at run time.
+
+
+ Memory footprint
+
+Because the statistics code uses per-CPU data, it observes CPU hot-(un)plug
+events and allocates and releases per-CPU data as sparingly as possible.
+
+The differentiation of:
+
+- struct statistic (any data required for gathering data for a statistic),
+- struct statistic_info (description of a class of statistics),
+- struct statistic_discipline (description of a data processing mode), and
+- struct statistic_interface (user interface for a collection of statistics)
+
+means avoidance of storing redundant data per statistic. Struct statistic
+can be kept quite small.
+
+
+ Disabling statistics
+
+Data gathering can be turned off (by default or by users), which reduces
+statistic_add() to a check.
+
+
+ Kernel configuration option
+
+CONFIG_STATISTICS can be used to include or exclude statistics during the
+kernel build process.
+
+
+
+
+ 3. Modes of data processing
+
+So far, available are:
+
+
+ type=counter_inc
+
+A counter sums up all Y-values of (X, Y) data pairs reported, regardless of the
+X-part.
+
+For example, a (request size, occurrence)-statistic would yield the
+total of requests observed.
+
+
+ type=counter_prod
+
+A counter sums up all X*Y with X and Y belonging to the same (X, Y).
+
+For example, a (request size, occurrence)-statistic would yield the
+total of bytes transfered.
+
+
+ type=utilisation
+
+Provides a set of values comprising:
+- the sum of all Y-values,
+- the minimum X
+- the average X
+- the maximum X
+
+This appears to be a useful fill level indicator for queues etc.
+
+For example, a (request size, occurrence)-statistic would yield a very
+basic statement about the traffic pattern, with information about the range
+of request sizes observed.
+
+
+ type=histogram_lin
+
+Comprises a set of counters, with each counter summing up all those Y-values
+reported for an assigned range or interval of X-values. All intervals of
+X-values are equal.
+
+Additional required parameters include:
+- entries (number of buckets, at least 2 required)
+- range_min (first bucket stands for <=range_min)
+- base_interval (interval size each bucket covers)
+
+For example, a (request size, occurrence)-statistic would yield a histogram
+of observed request sizes, with the same precision for small, medium and
+large request sizes.
+
+
+ type=histogram_log2
+
+Similar to type=histogram_lin, except that the intervals double
+from bucket to bucket. That is, the histogram loses in precision for
+larger X-values.
+
+
+ type=sparse
+
+This one is similar to other histograms, with the exception that it provides
+buckets for discrete X-values instead of ranges of X-values. Since it
+utilises a list instead of an array, it is suited for compiling histogram-like
+results for rather few, sparse X-values which users want to measure
+separately.
+
+Additional required parameters include:
+- entries (list is capped at this number of entries)
+
+For example, a (request size, occurrence)-statistic would yield the
+occurrences of all request sizes. Since it records precise sizes,
+it can also show the odd one out, which might be problematic; who knows...
+
+
+ Other
+
+The statistic infrastructure has been designed to make the addition
+of more ways of data processing easy (see struct statistic_discipline).
+
+For example, two more types had been implemented which are not included
+in the source code:
+
+- A "raw" type statistic which provides a record of (X, Y)-pairs.
+ Nice for verification and debugging purposes.
+
+- An enhancement of other basic types, like "counter" or "utilisation"
+ by the dimension time, which provides a time-tagged history of their
+ results for successive periods of time.
+ For example, a (request size, occurrence)-statistic could yield the
+ transfer rate over time, like bytes per second.
+
+
+
+
+ 4. User interface
+
+ Locating statistics
+
+The statistics infrastructure's user interface is in the
+/sys/kernel/debug/statistics directory, assuming debugfs has been mounted at
+/sys/kernel/debug. The "statistics" directory holds interface subdirectories
+created on the behalf of exploiters, for example:
+
+ drwxr-xr-x 2 root root 0 Jul 28 02:16 zfcp-0.0.50d4
+
+An interface subdirectory contains two files, a data and a definition file:
+
+ -r-------- 1 root root 0 Jul 28 02:16 data
+ -rw------- 1 root root 0 Jul 28 02:16 definition
+
+
+ The "definition" file
+
+The statistics infrastructure processes reported data according to the
+settings in the definition file, particularly the type attribute. You
+can change some statistic attributes and thereby change how data is
+processed.
+
+Some of the attributes shown are common to all statistics, others only apply
+to specific statistic type of data processing (see there for description).
+Some attributes can be changed by users, others are read-only. All timestamps
+are in the style of printk-timestamps.
+
+
+ Common statistic attributes
+
+ Attribute Changeable Comment
+
+ name No The device driver provides the name that
+ defines a statistic.
+
+ units No Units defines what the device driver reports
+ as (X, Y) pair.
+
+ state Yes Valid assignments are
+ on, off, released, unconfigured.
+ Note: Transition from unconfigured requires
+ the specification of type and all
+ additional attributes for that type.
+
+ type Yes The attribute determines the way sampled data
+ is processed and displayed. See corresponding
+ section in this document for valid assignements.
+
+ data No The age of sampled data, that is, the time
+ since last reset.
+
+ started No The last time the statistic was started.
+ Depends on the state attribute.
+
+ stopped No The last time the statistic was stopped.
+ Depends on the state attribute.
+
+
+ Changing statistics
+
+Here are some commented examples. This is how we start off:
+
+ [scsi-0:0:0:0]# head -n 1 definition
+ name=issued_write state=off units=bytes/request type=sparse entries=256
+ data=[7283835.951603] started=[7283835.951604] stopped=[7283856.502492]
+
+Let's get rid of any data and setup for this particular statistic:
+
+ [scsi-0:0:0:0]# echo name=issued_write state=unconfigured > definition;
+ head -n 1 definition
+ name=issued_write state=unconfigured units=bytes/request
+
+Redefine the statistic without enabling data gathering:
+
+ [scsi-0:0:0:0]# echo name=issued_write type=utilisation > definition;
+ head -n 1 definition
+ name=issued_write state=released units=bytes/request type=utilisation
+
+Enable data gathering:
+
+ [scsi-0:0:0:0]# echo name=issued_write state=on > definition;
+ head -n 1 definition
+ name=issued_write state=on units=bytes/request type=utilisation
+ data=[7284319.773163] started=[7284319.773170] stopped=[7283856.502492]
+
+Discard data without changing any settings (note the time stamps):
+
+ [scsi-0:0:0:0]# echo name=issued_write data=reset > definition;
+ head -n 1 definition
+ name=issued_write state=on units=bytes/request type=utilisation
+ data=[7284495.638893] started=[7284495.638894] stopped=[7284495.638844]
+
+Change statistic back to defaults as provided by device driver:
+(Note this discards all data.)
+
+ [scsi-0:0:0:0]# echo name=issued_write defaults > definition;
+ head -n 1 definition
+ name=issued_write state=on units=bytes/request type=sparse entries=256
+ data=[7284603.624271] started=[7284603.624273] stopped=[7284603.624199]
+
+Change a type specific attribute - here, reduce maximum list size:
+(Note this discards all data.)
+
+ [scsi-0:0:0:0]# echo name=issued_write entries=16 > definition;
+ head -n 1 definition
+ name=issued_write state=on units=bytes/request type=sparse entries=16
+ data=[7285008.933757] started=[7285008.933758] stopped=[7285008.933699]
+
+Turn data gathering off and release all resources allocated:
+
+ [scsi-0:0:0:0]# echo name=issued_write state=released > definition;
+ head -n 1 definition
+ name=issued_write state=released units=bytes/request type=sparse entries=256
+
+One can either write the entire line describing a statistic, including
+read-only attributes (which are ignored by the statistics infrastructure, as
+is any junk). Or one may modify single attributes, assuming this results
+in a valid configuration.
+
+This simplifies the procedures (copy, paste to command line,
+modify command line, echo attributes from command line into definition file)
+and (cat to file, modify file content, cat file back into definition file).
+
+Some operations can be done in an atomic fashion for all statistics grouped
+within the scope of an interface. Simply omit the name= attribute:
+
+ echo state=on > definition
+
+ echo data=reset > definition
+
+ echo defaults > definition
+
+
+ Reading statistic output from the "data" file
+
+The "data" file contains the output of all statistics available through a
+particular interface. File content comes in ASCII. Depending on the type of a
+statistic, the output for a statistic consists of a single line or a bunch
+of lines. Each line delivers one value or one result of a statistic and
+consists of several strings separated by spaces. The beginning of each line
+is tagged with the name of the statistic the line belongs to. The rest of
+a line is statistic-type specific. The content of a "data" might look like
+this:
+
+ foo 0x1000 4
+ foo 0x2000 1
+ foo 0x5000 2
+ bar 961 1 42.000 128
+
+
+ Output formats of different statistic types
+
+ Statistic Type Output Format Number of Lines
+
+ counter_inc <name> <total of Y> 1
+
+ counter_prod <name> <total of Xi*Yi> 1
+
+ utilisation <name> <total of Y> <min X> <avg X> <max X> 1
+
+ sparse <name> <Xn> <total of Y for Xn> <= entries
+ ...
+
+ histogram_lin <name> "<="<Xn> <Y-total for interval> number of
+ histogram_log2 ... intervals as
+ <name> ">"<Xm> <Y-total for interval> determined by
+ base_interval,
+ entries and
+ range_min
+
+For sample output please see above.
+
+
+
+
+ 5. Programming interface
+
+The programming interface can be retrieved from the kernel-doc-style comments
+available for all interface functions. Programming examples can be found in
+drivers/scsi and drivers/s390/scsi.
+
+
+ Creating statistics
+
+Assuming one wants to embed an array of statistics into a structure
+representing some entity, the following members need to be added:
+
+ struct my_entity {
+ ...
+ struct statistic_interface stat_int;
+ struct statistic stat[N];
+ }
+
+stat is an array of N statistics of various sorts.
+
+Since one might want to create several instances of struct my_entity
+each coming with its own set of statistics (stat[N]) setup using the
+same template, provisions for such a template have been made as part of the
+programming interface. An array of struct statistic_info complements an
+array of struct statistic.
+
+ struct statistic_info[] {
+ { "refund", "cent", "bottle", 0, "type=counter_prod" },
+ { "fill_level", "millilitre", "bottle", 1, "type=utilisation" },
+ ...
+ } my_entity_stat_info;
+
+An enum that helps addressing individual statistics of an array comes in handy:
+
+ enum my_entitiy_stat_num {
+ MY_ENTITY_STAT_REFUND,
+ MY_ENTITY_STAT_FILL,
+ ...
+ N
+ };
+
+Now, here is how to tie the knot for statistics and templates:
+
+ {
+ struct my_entity *one;
+ ...
+
+ /* required */
+ one->stat_int.stat = one->stat;
+ one->stat_int.info = my_entity_stat_info;
+ one->stat_int.number = N;
+
+ /* Optional callback triggers update of MY_ENTITY_STAT_FILL
+ when user reads statistic data from file */
+ one->stat_int.pull = my_entity_pull_fn;
+ one->stat_int.pull_private = one;
+
+ retval = statistic_create(&one->stat_int, "bottled_stats");
+ /* now we can report statistics data */
+ ...
+ }
+
+
+ Reporting statistics data
+
+Add statistic_add*() or statistic_inc*() calls where appropriate for
+reporting statistics data. Data to be reported through these functions has the
+form of (X, Y) as explained above:
+
+ {
+ struct my_entity *one;
+ ...
+
+ statistic_add(&one->stat, MY_ENTITY_STAT_REFUND, pennies, bottles);
+ ...
+ }
+
+Which is equivalent to:
+
+ {
+ struct my_entity *one;
+ int i;
+ ...
+
+ for (i = 0; i < bottles; i++)
+ statistic_inc(&one->stat, MY_ENTITY_STAT_REFUND, pennies);
+ ...
+ }
+
+Of course, this example is not optimal. It tries to show how statistic_add() and
+statistic_inc() compare. Sometimes statistic_inc() might be just what you need.
+
+If there is a bunch of statistics to be updated in one go, consider these
+flavours of statistic_add() which require the exploiter to lock per-CPU data
+in one go for improved performance:
+
+ {
+ struct my_entity *one;
+ unsigned long flags;
+ ...
+
+ local_irq_save(flags);
+ statistic_inc_nolock(&one->stat, MY_ENTITY_STAT_X, x);
+ statistic_inc_nolock(&one->stat, MY_ENTITY_STAT_Y, y);
+ statistic_add_nolock(&one->stat, MY_ENTITY_STAT_Z, z, number);
+ ...
+ local_irq_restore(flags);
+ }
+
+The above examples show statistics that feed on incremental updates that
+get accumulated by the statistics infrastructure on top of data already
+gathered by the statistics infrastructure.
+That is why statistic_add() or statistic_inc() respectively are used.
+
+There might be statistics that come as total numbers, e.g. because they feed
+on counters already maintained by the exploiter or some hardware feature.
+These numbers can be exported through the statistics infrastructure along
+with any other statistic. In this case, use statistic_set() to report data.
+Usually it is sufficient to do so when the user opens the corresponding
+file to read statistic data. This will trigger the optional callback function
+to be executed. Place statistic_set() calls there. In the same context calling
+statistic_add() or statistic_inc() for other statistics that feed on
+incremental data works as well, in case that's needed:
+
+ my_entity_stat_pull_fn(void *__one)
+ {
+ struct my_entity *one = __one;
+ ...
+
+ statistic_set(&one->stat, MY_ENTITY_STAT_FILL, one->fill, 1);
+ ...
+ }
+
+
+ Removing statistics
+
+The function statistic_remove() cleans up an entire interface with
+all statistics attached:
+
+ {
+ struct my_entity *one;
+ ...
+
+ /* by this time we must have ceased reporting statistics data */
+ retval = statistic_remove(&one->stat_int);
+ ...
+ }
+
+
+ Adding a another data processing mode
+
+There is a statistics infrastructure internal programmimg interface that
+allows to add other data processing modes fairly easily.
+Please see kernel-doc description of struct statistics_discipline.
+
+
+
+
+ 6. Possible future enhancements / known bugs
+
+There are several possible enhancements and optimizations documented
+at the head of lib/statistic.c, where I keep track of bugs as well.
+
+
+
+
+ 7. Contact
+
+See MAINTAINERS file.



2006-05-23 14:01:23

by Balbir Singh

[permalink] [raw]
Subject: Re: [Patch 4/6] statistics infrastructure - documentation

<snip>

> +Actual semantics of the data that feeds a statistic is unimportant when it
> +comes to data processing. All that matters is how the user wants the data to
> +be presented (counters, histograms, and so on). That's a job that can be
> +be done by a generic layer without intervention by the device driver
> +which is the actual source of statistics data.

Can't this be pushed to user space? Can the same thing be accomplished with
the help of a user space library?

<snip>

> +
> + How data is reported
> +
> +There are two methods how such data can be provided to the statistics
> +infrastructure, a push interface and a pull interface. Each statistic
> +is either a pull-type or push-type statistic as determined by the exploiter.
> +
> +The push-interface is suitable for data feeds that report incremental updates
> +to statistics, and where actual accumulation can be left to the statistics
> +infrastructure. New measurements usually trigger pushing data.
> +(see statistics_add() and statistic_inc())
> +
> +The pull-interface is suitable for data that already comes in an aggregated
> +form, like hardware measurement data or counters already maintained and
> +used by exploiters for other purposes. Reading statistics data from files
> +triggers an optional callback of the exploiter, which can update pull-type
> +statistics then (see statistic_set()).
> +

(Ge)netlink does a great job of supporting the push and pull interfaces.

<snip>

> +For example, the same statistic might work as a single counter, or as a
> +histogram comprising a variable (user-defined) number of buckets, or as an
> +adaptable list of buckets for sparse concrete values, etc. Whatever the result
> +looks like should be left to the individual modes of data processing.
> +In order to reduce all kinds of data processing and their output to a common
> +denominator, an output format along the following lines is suggested and
> +has been implemented:
> +
> + latency_write <=0 0 \
> + latency_write <=1 13 |
> + latency_write <=2 13 |
> + latency_write <=4 56 |
> + latency_write <=8 144 |
> + latency_write <=16 184 | a histogran with
> + latency_write <=32 181 > 13 buckets
> + latency_write <=64 74 |
> + latency_write <=128 271 |
> + latency_write <=256 0 |
> + latency_write <=512 33 |
> + latency_write <=1024 0 |
> + latency_write >1024 0 /
> + latency_read <=0 0 \
> + ... > another histogram
> + latency_read >1024 0 /
> + size_write missed 0x0 \
> + size_write 0x1000 143 |
> + size_write 0xc000 42 |
> + size_write 0x10000 14 | an adaptable list
> + size_write 0xf000 13 > with a growing number of buckets
> + size_write 0x1e000 12 | (up to a defined limit only)
> + size_write 0x14000 12 |
> + ... |
> + size_write 0x9000 1 /
> + queue_used_depth 970 1 18.122 32 > num min avg max for a queue
> +
> +Such output can grow as needed in debugfs files. It is human-readable and
> +could be parsed and postprocessed by simple scripts that are aware of what the
> +output of the various data processing modes looks like.

What is the extent to which the data is buffered? Lets say the file
contains 1000 such records - they all need to be maintained in memory
till the file is closed or removed - right?

<snip>

> + 2. Performance
> +
> +
> + Some preliminary numbers
> +
> +FIXME
> +
> + Per-CPU data
> +
> +Measurements reported by exploiters are accumulated into per-CPU data areas
> +in order to avoid the introduction of serialisation during the
> +execution of statistic_add(). Locking of per-CPU data is done by disabling
> +preemption and interrupts per CPU for the short time of a statistic update.
> +

Is this not an overkill. What if the subsystem updating the statistics
does not require interrupts to be disabled for serialization.

<snip>

> + Memory footprint
> +
> +Because the statistics code uses per-CPU data, it observes CPU hot-(un)plug
> +events and allocates and releases per-CPU data as sparingly as possible.
> +
> +The differentiation of:
> +
> +- struct statistic (any data required for gathering data for a statistic),
> +- struct statistic_info (description of a class of statistics),
> +- struct statistic_discipline (description of a data processing mode), and
> +- struct statistic_interface (user interface for a collection of statistics)
> +
> +means avoidance of storing redundant data per statistic. Struct statistic
> +can be kept quite small.
> +
> +
> + Disabling statistics
> +
> +Data gathering can be turned off (by default or by users), which reduces
> +statistic_add() to a check.
> +
> +
> + Kernel configuration option
> +
> +CONFIG_STATISTICS can be used to include or exclude statistics during the
> +kernel build process.
> +
> +
> +
> +
> + 3. Modes of data processing
> +
> +So far, available are:
> +
> +
> + type=counter_inc
> +
> +A counter sums up all Y-values of (X, Y) data pairs reported, regardless of the
> +X-part.
> +
> +For example, a (request size, occurrence)-statistic would yield the
> +total of requests observed.
> +
> +
> + type=counter_prod
> +
> +A counter sums up all X*Y with X and Y belonging to the same (X, Y).
> +
> +For example, a (request size, occurrence)-statistic would yield the
> +total of bytes transfered.
> +
> +
> + type=utilisation
> +
> +Provides a set of values comprising:
> +- the sum of all Y-values,
> +- the minimum X
> +- the average X
> +- the maximum X
> +
> +This appears to be a useful fill level indicator for queues etc.
> +
> +For example, a (request size, occurrence)-statistic would yield a very
> +basic statement about the traffic pattern, with information about the range
> +of request sizes observed.
> +
> +
> + type=histogram_lin
> +
> +Comprises a set of counters, with each counter summing up all those Y-values
> +reported for an assigned range or interval of X-values. All intervals of
> +X-values are equal.
> +
> +Additional required parameters include:
> +- entries (number of buckets, at least 2 required)
> +- range_min (first bucket stands for <=range_min)
> +- base_interval (interval size each bucket covers)
> +
> +For example, a (request size, occurrence)-statistic would yield a histogram
> +of observed request sizes, with the same precision for small, medium and
> +large request sizes.
> +
> +
> + type=histogram_log2
> +
> +Similar to type=histogram_lin, except that the intervals double
> +from bucket to bucket. That is, the histogram loses in precision for
> +larger X-values.
> +
> +
> + type=sparse
> +
> +This one is similar to other histograms, with the exception that it provides
> +buckets for discrete X-values instead of ranges of X-values. Since it
> +utilises a list instead of an array, it is suited for compiling histogram-like
> +results for rather few, sparse X-values which users want to measure
> +separately.
> +
> +Additional required parameters include:
> +- entries (list is capped at this number of entries)
> +
> +For example, a (request size, occurrence)-statistic would yield the
> +occurrences of all request sizes. Since it records precise sizes,
> +it can also show the odd one out, which might be problematic; who knows...
> +
> +
> + Other
> +
> +The statistic infrastructure has been designed to make the addition
> +of more ways of data processing easy (see struct statistic_discipline).
> +
> +For example, two more types had been implemented which are not included
> +in the source code:
> +
> +- A "raw" type statistic which provides a record of (X, Y)-pairs.
> + Nice for verification and debugging purposes.
> +
> +- An enhancement of other basic types, like "counter" or "utilisation"
> + by the dimension time, which provides a time-tagged history of their
> + results for successive periods of time.
> + For example, a (request size, occurrence)-statistic could yield the
> + transfer rate over time, like bytes per second.
> +

Cant all of this be moved to user space if (X, Y) tuples are passed down?


<snip>

Warm Regards,
Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-05-23 15:59:55

by Martin Peschke

[permalink] [raw]
Subject: Re: [Patch 4/6] statistics infrastructure - documentation

Balbir Singh wrote:
> <snip>
>
>> +Actual semantics of the data that feeds a statistic is unimportant when it
>> +comes to data processing. All that matters is how the user wants the data to
>> +be presented (counters, histograms, and so on). That's a job that can be
>> +be done by a generic layer without intervention by the device driver
>> +which is the actual source of statistics data.
>
> Can't this be pushed to user space? Can the same thing be accomplished with
> the help of a user space library?

Does your qestion refer to the formatting or accumulation of data?

As to the accumulation of data:

Usually, that should be done in the kernel, instead of reporting zillions
of events up to user space and maintaining counters or whatever there.

As to the formatting of data:

Well, there has to be an agreement between kernel and user space on
how data looks like when being passed between the two. Some protocol -
though it might be quite lightweight - like the netlink taskstats
headers etc., or the format of my debugfs files.

My point is that such a protocol can be and should be generic
with regard to its payload; or, to be precise, with regard to the
origin of the statistics data. A counter is a counter, a histogram
is a histogram, regardless of the kernel component being the source
of data, and regardless of the entity that a statistic is embedded in.

That is why, my protocol has been reduced to only reflect which
information makes for a histograms etc. I think, the minimal formatting or
packaging work that needs to be done in the kernel is exactly about that.
I agree that anything beyond that should be done in user space
(Some script that is capable of generating bar charts in the format of
one's choice comes to mind).

>
> <snip>
>
>> +
>> + How data is reported
>> +
>> +There are two methods how such data can be provided to the statistics
>> +infrastructure, a push interface and a pull interface. Each statistic
>> +is either a pull-type or push-type statistic as determined by the exploiter.
>> +
>> +The push-interface is suitable for data feeds that report incremental updates
>> +to statistics, and where actual accumulation can be left to the statistics
>> +infrastructure. New measurements usually trigger pushing data.
>> +(see statistics_add() and statistic_inc())
>> +
>> +The pull-interface is suitable for data that already comes in an aggregated
>> +form, like hardware measurement data or counters already maintained and
>> +used by exploiters for other purposes. Reading statistics data from files
>> +triggers an optional callback of the exploiter, which can update pull-type
>> +statistics then (see statistic_set()).
>> +
>
> (Ge)netlink does a great job of supporting the push and pull interfaces.

Okay. Sorry, I am still reading code and documentation.

But, I think we have to be careful here not to mix up concepts, as there are
two interfaces: the programming interface provided by the generic layer to
exploiters, and the user interface provided by the generic layer to users.
The former supports both a push and pull method (statistic_add/inc/set),
because of the different natures of data feeds (as explained above).
The latter is about making accumulated data accessible and is currently
implemeted as files, and so it's a sort of pull interface.

I am not sure yet whether a more sophisticated user interface is really
required. So far, my position on this has been: "user pulls data from file
if needed" is good enough for now; and if a use case for the other method
pops up, we can discuss an enhancement.

I would appriciate feedback on the need for a push-method being part of the
_user interface_, as it seems to be a key point in the debugfs vs. netlink
discussion. Any use cases?

>> +For example, the same statistic might work as a single counter, or as a
>> +histogram comprising a variable (user-defined) number of buckets, or as an
>> +adaptable list of buckets for sparse concrete values, etc. Whatever the result
>> +looks like should be left to the individual modes of data processing.
>> +In order to reduce all kinds of data processing and their output to a common
>> +denominator, an output format along the following lines is suggested and
>> +has been implemented:
>> +
>> + latency_write <=0 0 \
>> + latency_write <=1 13 |
>> + latency_write <=2 13 |
>> + latency_write <=4 56 |
>> + latency_write <=8 144 |
>> + latency_write <=16 184 | a histogran with
>> + latency_write <=32 181 > 13 buckets
>> + latency_write <=64 74 |
>> + latency_write <=128 271 |
>> + latency_write <=256 0 |
>> + latency_write <=512 33 |
>> + latency_write <=1024 0 |
>> + latency_write >1024 0 /
>> + latency_read <=0 0 \
>> + ... > another histogram
>> + latency_read >1024 0 /
>> + size_write missed 0x0 \
>> + size_write 0x1000 143 |
>> + size_write 0xc000 42 |
>> + size_write 0x10000 14 | an adaptable list
>> + size_write 0xf000 13 > with a growing number of buckets
>> + size_write 0x1e000 12 | (up to a defined limit only)
>> + size_write 0x14000 12 |
>> + ... |
>> + size_write 0x9000 1 /
>> + queue_used_depth 970 1 18.122 32 > num min avg max for a queue
>> +
>> +Such output can grow as needed in debugfs files. It is human-readable and
>> +could be parsed and postprocessed by simple scripts that are aware of what the
>> +output of the various data processing modes looks like.
>
> What is the extent to which the data is buffered? Lets say the file
> contains 1000 such records - they all need to be maintained in memory
> till the file is closed or removed - right?

Okay, as a user I would get this by cat'ing the file content to the console
or into some file on disk. This costs about as much as reading from any seq_file.

I don't think 1000 lines is a typical use case. No exploiter has that many
statsitics to report for an entity. And most likely no user is interested in so
much detail (histogram with 1000 buckets, for example). However, 1000 lines
of such output would be about 1000*25 bytes. My code allocates a scatter-gather
output buffer of the required size build up from single pages allocated with
GFP_KERNEL - for the short time of 'cat data'. I don't see that as a real issue.

> <snip>
>
>> + 2. Performance
>> +
>> +
>> + Some preliminary numbers
>> +
>> +FIXME
>> +
>> + Per-CPU data
>> +
>> +Measurements reported by exploiters are accumulated into per-CPU data areas
>> +in order to avoid the introduction of serialisation during the
>> +execution of statistic_add(). Locking of per-CPU data is done by disabling
>> +preemption and interrupts per CPU for the short time of a statistic update.
>> +
>
> Is this not an overkill. What if the subsystem updating the statistics
> does not require interrupts to be disabled for serialization.

Public opinion seems to second a per-CPU data approach:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113458576022747&w=2

Anyway, if the exploiter doesn't require such precautions, it can indicate
to the generic layer trhough a flag not to use per-CPU data. And it can
use statistic_add/inc_nolock() - statistic_add flavours which come without
the serialisation code.

>> + Memory footprint
>> +
>> +Because the statistics code uses per-CPU data, it observes CPU hot-(un)plug
>> +events and allocates and releases per-CPU data as sparingly as possible.
>> +
>> +The differentiation of:
>> +
>> +- struct statistic (any data required for gathering data for a statistic),
>> +- struct statistic_info (description of a class of statistics),
>> +- struct statistic_discipline (description of a data processing mode), and
>> +- struct statistic_interface (user interface for a collection of statistics)
>> +
>> +means avoidance of storing redundant data per statistic. Struct statistic
>> +can be kept quite small.
>> +
>> +
>> + Disabling statistics
>> +
>> +Data gathering can be turned off (by default or by users), which reduces
>> +statistic_add() to a check.
>> +
>> +
>> + Kernel configuration option
>> +
>> +CONFIG_STATISTICS can be used to include or exclude statistics during the
>> +kernel build process.
>> +
>> +
>> +
>> +
>> + 3. Modes of data processing
>> +
>> +So far, available are:
>> +
>> +
>> + type=counter_inc
>> +
>> +A counter sums up all Y-values of (X, Y) data pairs reported, regardless of the
>> +X-part.
>> +
>> +For example, a (request size, occurrence)-statistic would yield the
>> +total of requests observed.
>> +
>> +
>> + type=counter_prod
>> +
>> +A counter sums up all X*Y with X and Y belonging to the same (X, Y).
>> +
>> +For example, a (request size, occurrence)-statistic would yield the
>> +total of bytes transfered.
>> +
>> +
>> + type=utilisation
>> +
>> +Provides a set of values comprising:
>> +- the sum of all Y-values,
>> +- the minimum X
>> +- the average X
>> +- the maximum X
>> +
>> +This appears to be a useful fill level indicator for queues etc.
>> +
>> +For example, a (request size, occurrence)-statistic would yield a very
>> +basic statement about the traffic pattern, with information about the range
>> +of request sizes observed.
>> +
>> +
>> + type=histogram_lin
>> +
>> +Comprises a set of counters, with each counter summing up all those Y-values
>> +reported for an assigned range or interval of X-values. All intervals of
>> +X-values are equal.
>> +
>> +Additional required parameters include:
>> +- entries (number of buckets, at least 2 required)
>> +- range_min (first bucket stands for <=range_min)
>> +- base_interval (interval size each bucket covers)
>> +
>> +For example, a (request size, occurrence)-statistic would yield a histogram
>> +of observed request sizes, with the same precision for small, medium and
>> +large request sizes.
>> +
>> +
>> + type=histogram_log2
>> +
>> +Similar to type=histogram_lin, except that the intervals double
>> +from bucket to bucket. That is, the histogram loses in precision for
>> +larger X-values.
>> +
>> +
>> + type=sparse
>> +
>> +This one is similar to other histograms, with the exception that it provides
>> +buckets for discrete X-values instead of ranges of X-values. Since it
>> +utilises a list instead of an array, it is suited for compiling histogram-like
>> +results for rather few, sparse X-values which users want to measure
>> +separately.
>> +
>> +Additional required parameters include:
>> +- entries (list is capped at this number of entries)
>> +
>> +For example, a (request size, occurrence)-statistic would yield the
>> +occurrences of all request sizes. Since it records precise sizes,
>> +it can also show the odd one out, which might be problematic; who knows...
>> +
>> +
>> + Other
>> +
>> +The statistic infrastructure has been designed to make the addition
>> +of more ways of data processing easy (see struct statistic_discipline).
>> +
>> +For example, two more types had been implemented which are not included
>> +in the source code:
>> +
>> +- A "raw" type statistic which provides a record of (X, Y)-pairs.
>> + Nice for verification and debugging purposes.
>> +
>> +- An enhancement of other basic types, like "counter" or "utilisation"
>> + by the dimension time, which provides a time-tagged history of their
>> + results for successive periods of time.
>> + For example, a (request size, occurrence)-statistic could yield the
>> + transfer rate over time, like bytes per second.
>> +
>
> Cant all of this be moved to user space if (X, Y) tuples are passed down?

No.

My sample SCSI statistics say that I can have easily millions of requests
and, therewith, millions of such (X, Y) tuples showing up during my
lunch break at the cafeteria.

You don't want to pass every update to user space and then have user space,
in a simple case, just increase a counter, or, in the worst case,
just drop this information because it's not interested.

It's much more efficient to do some sort of accumulation immediately and
to dramatically reduce the amount of data that needs to commute from
kernel to user space.

Martin