2010-04-15 05:43:53

by Divyesh Shah

[permalink] [raw]
Subject: [PATCH 0/4] block: Per-partition block IO performance histograms

The following patchset implements per partition 2-d histograms for IO to block
devices. The 3 types of histograms added are:

1) request histograms - 2-d histogram of total request time in ms (queueing +
service) broken down by IO size (in bytes).
2) dma histograms - 2-d histogram of total service time in ms broken down by
IO size (in bytes).
3) seek histograms - 1-d histogram of seek distance

All of these histograms are per-partition. The first 2 are further divided into
separate read and write histograms. The buckets for these histograms are
configurable via config options as well as at runtime (per-device).

These histograms have proven very valuable to us over the years to understand
the seek distribution of IOs over our production machines, detect large
queueing delays, find latency outliers, etc. by being used as part of an
always-on monitoring system.

They can be reset by writing any value to them which makes them useful for
tests and debugging too.

This was initially written by Edward Falk in 2006 and I've forward ported
and improved it a few times it across kernel versions.

He had also sent a very old version of this patchset (minus some features like
runtime configurable buckets) back then to lkml - see
http://lkml.indiana.edu/hypermail/linux/kernel/0611.1/2684.html
Some of the reasons mentioned for not including these patches are given below.

I'm requesting re-consideration for this patchset in light of the following
arguments.

1) This can be done with blktrace too, why add another API?

Yes blktrace can be used to get this kind of information w/ some help from
userspace post-processing. However, to use this as an always-on monitoring tool
w/ blktrace and have negligible performance overhead is difficult to achieve.
I did a quick 10-thread iozone direct IO write phase run w/ and w/o blktrace
on a traditional rotational disk to get a feel of the impact on throughput.
This was kernel built from Jens' for-2.6.35 branch and did not have these new
block histogram patches.
o w/o blktrace:
Children see throughput for 10 initial writers = 95211.22 KB/sec
Parent sees throughput for 10 initial writers = 37593.20 KB/sec
Min throughput per thread = 9078.65 KB/sec
Max throughput per thread = 10055.59 KB/sec
Avg throughput per thread = 9521.12 KB/sec
Min xfer = 462848.00 KB

o w/ blktrace:
Children see throughput for 10 initial writers = 93527.98 KB/sec
Parent sees throughput for 10 initial writers = 38594.47 KB/sec
Min throughput per thread = 9197.06 KB/sec
Max throughput per thread = 9640.09 KB/sec
Avg throughput per thread = 9352.80 KB/sec
Min xfer = 490496.00 KB

This is about 1.8% average throughput loss per thread.
The extra cpu time spent with blktrace is in addition to this loss of
throughput. This overhead will only go up on faster SSDs.

2) sysfs should be only for one value per file. There are some exceptions but we
are working on fixing them. Please don't add new ones.

There are excpetions like meminfo, etc. that violate this guideline (I'm not
sure if its an enforced rule) and some actually make sense since there is no way
of representing structured data. Though these block histograms are multi-valued
one can also interpret them as one logical piece of information.

IMO these histograms add value and given there seems to be no better way of
exporting this information w/o performance overhead, it might be ok to allow
this exception.

I might be wrong here and there are indeed better ways of exporting this data.
Any comments/suggestions for different representations for achieveing the same
goal are more than welcome.
---

Divyesh Shah (4):
Make base bucket for the histograms to be configurable per-part.
Add seek histograms to the block histograms
Add disk performance histograms which can be read from sysfs and cleared
Re-introduce rq->__nr_sectors to maintain the original size of the request


block/Kconfig | 35 ++++
block/blk-core.c | 5 +
block/blk-merge.c | 1
block/genhd.c | 410 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/partitions/check.c | 29 +++
include/linux/blkdev.h | 11 +
include/linux/genhd.h | 73 +++++++++
include/linux/time.h | 5 +
8 files changed, 567 insertions(+), 2 deletions(-)

--
Thanks,
Divyesh


2010-04-15 05:44:38

by Divyesh Shah

[permalink] [raw]
Subject: [PATCH 1/4] block: Re-introduce rq->__nr_sectors to maintain the

original size of the request right through completion where it will be useful
for the disk performance histograms.

Signed-off-by: Divyesh Shah <[email protected]>
---

block/blk-core.c | 4 ++++
block/blk-merge.c | 1 +
include/linux/blkdev.h | 7 +++++++
3 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e9a5ae2..f18e7b7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1159,6 +1159,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
struct request *req;
int el_ret;
unsigned int bytes = bio->bi_size;
+ const unsigned int nr_sectors = bio_sectors(bio);
const unsigned short prio = bio_prio(bio);
const bool sync = bio_rw_flagged(bio, BIO_RW_SYNCIO);
const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG);
@@ -1198,6 +1199,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
req->biotail->bi_next = bio;
req->biotail = bio;
req->__data_len += bytes;
+ req->__nr_sectors += nr_sectors;
req->ioprio = ioprio_best(req->ioprio, prio);
if (!blk_rq_cpu_valid(req))
req->cpu = bio->bi_comp_cpu;
@@ -1232,6 +1234,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
req->buffer = bio_data(bio);
req->__sector = bio->bi_sector;
req->__data_len += bytes;
+ req->__nr_sectors += nr_sectors;
req->ioprio = ioprio_best(req->ioprio, prio);
if (!blk_rq_cpu_valid(req))
req->cpu = bio->bi_comp_cpu;
@@ -2350,6 +2353,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
}
rq->__data_len = bio->bi_size;
rq->bio = rq->biotail = bio;
+ rq->__nr_sectors = bio_sectors(bio);

if (bio->bi_bdev)
rq->rq_disk = bio->bi_bdev->bd_disk;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5e7dc99..ee0bb50 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -411,6 +411,7 @@ static int attempt_merge(struct request_queue *q, struct request *req,
req->biotail = next->biotail;

req->__data_len += blk_rq_bytes(next);
+ req->__nr_sectors += blk_rq_size(next);

elv_merge_requests(q, req, next);

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d483c49..4cc2cdf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -170,6 +170,7 @@ struct request {
/* the following two fields are internal, NEVER access directly */
unsigned int __data_len; /* total data len */
sector_t __sector; /* sector cursor */
+ sector_t __nr_sectors; /* Total number of sectors */

struct bio *bio;
struct bio *biotail;
@@ -844,6 +845,7 @@ static inline struct request_queue *bdev_get_queue(struct block_device *bdev)

/*
* blk_rq_pos() : the current sector
+ * blk_rq_size() : the size of the request in sectors
* blk_rq_bytes() : bytes left in the entire request
* blk_rq_cur_bytes() : bytes left in the current segment
* blk_rq_err_bytes() : bytes left till the next error boundary
@@ -855,6 +857,11 @@ static inline sector_t blk_rq_pos(const struct request *rq)
return rq->__sector;
}

+static inline sector_t blk_rq_size(const struct request *rq)
+{
+ return rq->__nr_sectors;
+}
+
static inline unsigned int blk_rq_bytes(const struct request *rq)
{
return rq->__data_len;

2010-04-15 05:45:25

by Divyesh Shah

[permalink] [raw]
Subject: [PATCH 2/4] block: Add disk performance histograms which can be read

from sysfs and cleared upon writing.

Signed-off-by: Divyesh Shah <[email protected]>
From: Edward Falk <[email protected]>
---

block/Kconfig | 26 +++++
block/blk-core.c | 1
block/genhd.c | 270 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/partitions/check.c | 16 +++
include/linux/blkdev.h | 4 -
include/linux/genhd.h | 48 +++++++++
include/linux/time.h | 5 +
7 files changed, 368 insertions(+), 2 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index f9e89f4..b62fe49 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -100,6 +100,32 @@ config DEBUG_BLK_CGROUP
in the blk group which can be used by cfq for tracing various
group related activity.

+config BLOCK_HISTOGRAM
+ bool "Performance histogram data"
+ default n
+ ---help---
+ This option causes block devices to collect statistics on transfer
+ sizes and times. Useful for performance-tuning a system. Creates
+ entries in /sysfs/block/.
+
+ If you are unsure, say N here.
+
+config HISTO_SIZE_BUCKETS
+ int "Number of size buckets in histogram"
+ depends on BLOCK_HISTOGRAM
+ default "10"
+ ---help---
+ This option controls how many buckets are used to collect
+ transfer size statistics.
+
+config HISTO_TIME_BUCKETS
+ int "Number of time buckets in histogram"
+ depends on BLOCK_HISTOGRAM
+ default "11"
+ ---help---
+ This option controls how many buckets are used to collect
+ transfer time statistics.
+
endif # BLOCK

config BLOCK_COMPAT
diff --git a/block/blk-core.c b/block/blk-core.c
index f18e7b7..6432b14 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1744,6 +1744,7 @@ static void blk_account_io_done(struct request *req)
part_stat_inc(cpu, part, ios[rw]);
part_stat_add(cpu, part, ticks[rw], duration);
part_round_stats(cpu, part);
+ block_histogram_completion(cpu, part, req);
part_dec_in_flight(part, rw);

part_stat_unlock();
diff --git a/block/genhd.c b/block/genhd.c
index d13ba76..3666cf2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -881,6 +881,16 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, disk_discard_alignment_show,
static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL);
static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+#ifdef CONFIG_BLOCK_HISTOGRAM
+static DEVICE_ATTR(read_request_histo, S_IRUGO | S_IWUSR,
+ part_read_request_histo_show, part_read_histo_clear);
+static DEVICE_ATTR(read_dma_histo, S_IRUGO | S_IWUSR, part_read_dma_histo_show,
+ part_read_histo_clear);
+static DEVICE_ATTR(write_request_histo, S_IRUGO | S_IWUSR,
+ part_write_request_histo_show, part_write_histo_clear);
+static DEVICE_ATTR(write_dma_histo, S_IRUGO | S_IWUSR,
+ part_write_dma_histo_show, part_write_histo_clear);
+#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -902,6 +912,12 @@ static struct attribute *disk_attrs[] = {
&dev_attr_capability.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+#ifdef CONFIG_BLOCK_HISTOGRAM
+ &dev_attr_read_request_histo.attr,
+ &dev_attr_read_dma_histo.attr,
+ &dev_attr_write_request_histo.attr,
+ &dev_attr_write_dma_histo.attr,
+#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
@@ -1286,3 +1302,257 @@ int invalidate_partition(struct gendisk *disk, int partno)
}

EXPORT_SYMBOL(invalidate_partition);
+
+#ifdef CONFIG_BLOCK_HISTOGRAM
+/*
+ * Clear one per-cpu instance of a particular I/O histogram. This should always
+ * be called between part_stat_lock() and part_stat_unklock() calls.
+ */
+static inline void __block_part_histogram_reset(struct disk_stats *stats,
+ int direction)
+{
+ if (direction == READ)
+ memset(&stats->rd_histo, 0, sizeof(stats->rd_histo));
+ else
+ memset(&stats->wr_histo, 0, sizeof(stats->wr_histo));
+}
+
+/*
+ * Clear the I/O histogram for a given partition.
+ */
+static void block_part_histogram_reset(struct hd_struct *part, int direction)
+{
+#ifdef CONFIG_SMP
+ int i;
+
+ part_stat_lock();
+ for_each_possible_cpu(i) {
+ if (cpu_possible(i))
+ __block_part_histogram_reset(per_cpu_ptr(part->dkstats,
+ i), direction);
+ }
+#else
+ part_stat_lock();
+ __block_part_histogram_reset(&part.dkstats, direction);
+#endif
+ part_stat_unlock();
+}
+
+/*
+ * Iterate though all partitions of the disk and clear the specified
+ * (read/write) histogram.
+ */
+static int block_disk_histogram_reset(struct hd_struct *part, int direction)
+{
+ struct disk_part_iter piter;
+ struct gendisk *disk = part_to_disk(part);
+ struct hd_struct *temp;
+
+ if (!disk)
+ return -ENODEV;
+
+ disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY_PART0);
+ while ((temp = disk_part_iter_next(&piter)))
+ block_part_histogram_reset(temp, direction);
+ disk_part_iter_exit(&piter);
+ return 0;
+}
+
+/*
+ * Map transfer size to histogram bucket. Transfer sizes are exponentially
+ * increasing. For example: 4,8,16,... sectors.
+ */
+static inline int stats_size_bucket(sector_t sectors)
+{
+ int i;
+ /* To make sure bucket for x bytes captures all IOs <= x bytes. */
+ --sectors;
+ do_div(sectors, BASE_HISTO_SIZE);
+ if (sectors >= (1 << (CONFIG_HISTO_SIZE_BUCKETS - 2)))
+ return CONFIG_HISTO_SIZE_BUCKETS - 1;
+
+ for (i = 0; sectors > 0; ++i, sectors /= 2)
+ ;
+ return i;
+}
+
+/*
+ * Map transfer time to histogram bucket. This also uses an exponential
+ * increment, but we like the 1,2,5,10,20,50 progression.
+ */
+static inline int stats_time_bucket(int jiffies)
+{
+ int i;
+ int ms = msecs_to_jiffies(jiffies);
+ int t = BASE_HISTO_TIME;
+
+ for (i = 0;; t *= 10) {
+ if (++i >= CONFIG_HISTO_TIME_BUCKETS || ms <= t)
+ return i - 1;
+ if (++i >= CONFIG_HISTO_TIME_BUCKETS || ms <= t*2)
+ return i - 1;
+ if (++i >= CONFIG_HISTO_TIME_BUCKETS || ms <= t*5)
+ return i - 1;
+ }
+}
+
+/*
+ * Log I/O completion, update histogram.
+ *
+ * @part: disk device partition
+ * @req: pointer to request
+ * @req_ms: time transfer required
+ * @dma_ms: time dma required
+ */
+static inline void __block_histogram_completion(int cpu, struct hd_struct *part,
+ struct request *req, unsigned int req_ms, unsigned int dma_ms)
+{
+ sector_t sectors = blk_rq_size(req);
+ int size_idx = stats_size_bucket(sectors);
+ int req_time_idx = stats_time_bucket(req_ms);
+ int dma_time_idx = stats_time_bucket(dma_ms);
+
+ if (!rq_data_dir(req))
+ part_stat_inc(cpu, part,
+ rd_histo[HISTO_REQUEST][size_idx][req_time_idx]);
+ else
+ part_stat_inc(cpu, part,
+ wr_histo[HISTO_REQUEST][size_idx][req_time_idx]);
+
+ if (!rq_data_dir(req))
+ part_stat_inc(cpu, part,
+ rd_histo[HISTO_DMA][size_idx][dma_time_idx]);
+ else
+ part_stat_inc(cpu, part,
+ wr_histo[HISTO_DMA][size_idx][dma_time_idx]);
+}
+
+/*
+ * Called after a dma interrupt. Should be called between part_stat_lock()
+ * and part_stat_unlock() calls.
+ * Note that for block devices with queue_depth > 1, the io_elapsed will not be
+ * accurate as it may include time spent in the disk queue due to re-ordering of
+ * requests by the disk.
+ */
+void block_histogram_completion(int cpu, struct hd_struct *part,
+ struct request *req)
+{
+ unsigned long long now = sched_clock();
+ uint64_t rq_elapsed = 0, io_elapsed = 0;
+
+ if (time_after64(now, rq_start_time_ns(req)))
+ rq_elapsed = now - rq_start_time_ns(req);
+ if (time_after64(now, rq_io_start_time_ns(req)))
+ io_elapsed = now - rq_io_start_time_ns(req);
+ __block_histogram_completion(cpu, part, req, nsecs_to_msecs(rq_elapsed),
+ nsecs_to_msecs(io_elapsed));
+}
+
+static uint64_t histo_stat_read(struct hd_struct *part, int direction,
+ int i, int j, int k)
+{
+ return (direction == READ) ? part_stat_read(part, rd_histo[i][j][k]) :
+ part_stat_read(part, wr_histo[i][j][k]);
+}
+
+/*
+ * Dumps the specified 'type' of histogram for part to out.
+ * The result must be less than PAGE_SIZE.
+ */
+static int dump_histo(struct hd_struct *part, int direction, int type,
+ char *page)
+{
+ ssize_t rem = PAGE_SIZE;
+ char *optr = page;
+ int i, j, len, ms, size = BASE_HISTO_SIZE * 512;
+ static const int mults[3] = {1, 2, 5};
+
+ /*
+ * Documentation/filesystems/sysfs.txt strongly discourages the use of
+ * any kind of fancy formatting here. We *are* emitting an array, so
+ * there needs to be some amount of formatting.
+ */
+
+ /* Row header */
+ len = snprintf(page, rem, " ");
+ page += len;
+ rem -= len;
+ for (i = 0, ms = BASE_HISTO_TIME; i < CONFIG_HISTO_TIME_BUCKETS;
+ ms *= 10) {
+ for (j = 0; j < 3 && i < CONFIG_HISTO_TIME_BUCKETS; ++j, ++i) {
+ len = snprintf(page, rem, "\t%d", ms * mults[j]);
+ page += len;
+ rem -= len;
+ }
+ }
+ len = snprintf(page, rem, "\n");
+ page += len;
+ rem -= len;
+
+ /* Payload */
+ for (i = 0; i < CONFIG_HISTO_SIZE_BUCKETS; i++, size *= 2) {
+ len = snprintf(page, rem, "%7d", size);
+ page += len;
+ rem -= len;
+ for (j = 0; j < CONFIG_HISTO_TIME_BUCKETS; j++) {
+ len = snprintf(page, rem, "\t%llu",
+ histo_stat_read(part, direction, type, i, j));
+ page += len;
+ rem -= len;
+ }
+ len = snprintf(page, rem, "\n");
+ page += len;
+ rem -= len;
+ }
+ return page - optr;
+}
+
+/*
+ * sysfs show() methods for the four histogram channels.
+ */
+ssize_t part_read_request_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page)
+{
+ return dump_histo(dev_to_part(dev), READ, HISTO_REQUEST, page);
+}
+
+ssize_t part_read_dma_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page)
+{
+ return dump_histo(dev_to_part(dev), READ, HISTO_DMA, page);
+}
+
+ssize_t part_write_request_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page)
+{
+ return dump_histo(dev_to_part(dev), WRITE, HISTO_REQUEST, page);
+}
+
+ssize_t part_write_dma_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page)
+{
+ return dump_histo(dev_to_part(dev), WRITE, HISTO_DMA, page);
+}
+
+/*
+ * Reinitializes the read histograms to 0.
+ */
+ssize_t part_read_histo_clear(struct device *dev,
+ struct device_attribute *attr, const char *page, size_t count)
+{
+ /* Ignore the data, just clear the histogram */
+ int retval = block_disk_histogram_reset(dev_to_part(dev), READ);
+ return (retval == 0 ? count : retval);
+}
+
+/*
+ * Reinitializes the write histograms to 0.
+ */
+ssize_t part_write_histo_clear(struct device *dev,
+ struct device_attribute *attr, const char *page, size_t count)
+{
+ int retval = block_disk_histogram_reset(dev_to_part(dev), WRITE);
+ return (retval == 0 ? count : retval);
+}
+
+#endif
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index e238ab2..e0044d4 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -300,6 +300,16 @@ static DEVICE_ATTR(discard_alignment, S_IRUGO, part_discard_alignment_show,
NULL);
static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL);
static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL);
+#ifdef CONFIG_BLOCK_HISTOGRAM
+static DEVICE_ATTR(read_request_histo, S_IRUGO | S_IWUSR,
+ part_read_request_histo_show, part_read_histo_clear);
+static DEVICE_ATTR(read_dma_histo, S_IRUGO | S_IWUSR, part_read_dma_histo_show,
+ part_read_histo_clear);
+static DEVICE_ATTR(write_request_histo, S_IRUGO | S_IWUSR,
+ part_write_request_histo_show, part_write_histo_clear);
+static DEVICE_ATTR(write_dma_histo, S_IRUGO | S_IWUSR,
+ part_write_dma_histo_show, part_write_histo_clear);
+#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
@@ -313,6 +323,12 @@ static struct attribute *part_attrs[] = {
&dev_attr_discard_alignment.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+#ifdef CONFIG_BLOCK_HISTOGRAM
+ &dev_attr_read_request_histo.attr,
+ &dev_attr_read_dma_histo.attr,
+ &dev_attr_write_request_histo.attr,
+ &dev_attr_write_dma_histo.attr,
+#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4cc2cdf..2e5e083 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -195,7 +195,7 @@ struct request {

struct gendisk *rq_disk;
unsigned long start_time;
-#ifdef CONFIG_BLK_CGROUP
+#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLOCK_HISTOGRAM)
unsigned long long start_time_ns;
unsigned long long io_start_time_ns; /* when passed to hardware */
#endif
@@ -1206,7 +1206,7 @@ static inline void put_dev_sector(Sector p)
struct work_struct;
int kblockd_schedule_work(struct request_queue *q, struct work_struct *work);

-#ifdef CONFIG_BLK_CGROUP
+#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLOCK_HISTOGRAM)
static inline void set_start_time_ns(struct request *req)
{
req->start_time_ns = sched_clock();
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 5f2f4c4..7406533 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -65,6 +65,8 @@ enum {
#include <linux/fs.h>
#include <linux/workqueue.h>

+struct request;
+
struct partition {
unsigned char boot_ind; /* 0x80 - active */
unsigned char head; /* starting head */
@@ -78,6 +80,13 @@ struct partition {
__le32 nr_sects; /* nr of sectors in partition */
} __attribute__((packed));

+#define BASE_HISTO_SIZE 4 /* smallest transfer size, sectors */
+#define BASE_HISTO_TIME 10 /* shortest transfer time, ms */
+
+/* Index into the histo arrays */
+#define HISTO_REQUEST 0
+#define HISTO_DMA 1
+
struct disk_stats {
unsigned long sectors[2]; /* READs and WRITEs */
unsigned long ios[2];
@@ -85,6 +94,23 @@ struct disk_stats {
unsigned long ticks[2];
unsigned long io_ticks;
unsigned long time_in_queue;
+#ifdef CONFIG_BLOCK_HISTOGRAM
+ /*
+ * Implement 2-variable histograms, with transfers tracked by transfer
+ * size and completion time. The sysfs files are
+ * /sys/block/DEV/PART/read_request_histo,
+ * /sys/block/DEV/PART/write_request_histo,
+ * /sys/block/DEV/PART/read_dma_histo,
+ * /sys/block/DEV/PART/write_dma_histo and the
+ * /sys/block/DEV counterparts.
+ *
+ * The *request_histo files measure time from when the request is first
+ * submitted into the reuqest queue. The *dma_histo files measure time
+ * from when the request is dispatched from the queue to the device.
+ */
+ uint64_t rd_histo[2][CONFIG_HISTO_SIZE_BUCKETS][CONFIG_HISTO_TIME_BUCKETS];
+ uint64_t wr_histo[2][CONFIG_HISTO_SIZE_BUCKETS][CONFIG_HISTO_TIME_BUCKETS];
+#endif
};

struct hd_struct {
@@ -360,6 +386,28 @@ static inline int get_disk_ro(struct gendisk *disk)
return disk->part0.policy;
}

+#ifdef CONFIG_BLOCK_HISTOGRAM
+extern void block_histogram_completion(int cpu, struct hd_struct *part,
+ struct request *req);
+extern ssize_t part_read_request_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page);
+extern ssize_t part_read_dma_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page);
+extern ssize_t part_write_request_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page);
+extern ssize_t part_write_dma_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page);
+extern ssize_t part_write_dma_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page);
+extern ssize_t part_read_histo_clear(struct device *dev,
+ struct device_attribute *attr, const char *page, size_t count);
+extern ssize_t part_write_histo_clear(struct device *dev,
+ struct device_attribute *attr, const char *page, size_t count);
+#else
+static inline void block_histogram_completion(int cpu, struct hd_struct *part,
+ struct request *req) {}
+#endif
+
/* drivers/char/random.c */
extern void add_disk_randomness(struct gendisk *disk);
extern void rand_initialize_disk(struct gendisk *disk);
diff --git a/include/linux/time.h b/include/linux/time.h
index 6e026e4..fa1b9de 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -42,6 +42,11 @@ extern struct timezone sys_tz;

#define TIME_T_MAX (time_t)((1UL << ((sizeof(time_t) << 3) - 1)) - 1)

+static inline unsigned int nsecs_to_msecs(uint64_t ns)
+{
+ return ns / NSEC_PER_MSEC;
+}
+
static inline int timespec_equal(const struct timespec *a,
const struct timespec *b)
{

2010-04-15 05:45:52

by Divyesh Shah

[permalink] [raw]
Subject: [PATCH 3/4] block: Add seek histograms to the block histograms

Signed-off-by: Divyesh Shah <[email protected]>
From: Edward Falk <[email protected]>
---

block/Kconfig | 9 ++++
block/genhd.c | 103 +++++++++++++++++++++++++++++++++++++++++++------
fs/partitions/check.c | 4 ++
include/linux/genhd.h | 14 ++++++-
4 files changed, 117 insertions(+), 13 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index b62fe49..5dbc10b 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -126,6 +126,15 @@ config HISTO_TIME_BUCKETS
This option controls how many buckets are used to collect
transfer time statistics.

+config HISTO_SEEK_BUCKETS
+ int "Number of seek buckets in histogram"
+ depends on BLOCK_HISTOGRAM
+ default "20"
+ ---help---
+ This option controls how many buckets are used to collect
+ disk seek statistics. The actual number of buckets is 1 greater
+ than the number specified here as the last bucket is a catch-all one.
+
endif # BLOCK

config BLOCK_COMPAT
diff --git a/block/genhd.c b/block/genhd.c
index 3666cf2..8920994 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -890,6 +890,8 @@ static DEVICE_ATTR(write_request_histo, S_IRUGO | S_IWUSR,
part_write_request_histo_show, part_write_histo_clear);
static DEVICE_ATTR(write_dma_histo, S_IRUGO | S_IWUSR,
part_write_dma_histo_show, part_write_histo_clear);
+static DEVICE_ATTR(seek_histo, S_IRUGO | S_IWUSR,
+ part_seek_histo_show, part_seek_histo_clear);
#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
@@ -917,6 +919,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_read_dma_histo.attr,
&dev_attr_write_request_histo.attr,
&dev_attr_write_dma_histo.attr,
+ &dev_attr_seek_histo.attr,
#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
@@ -1304,6 +1307,8 @@ int invalidate_partition(struct gendisk *disk, int partno)
EXPORT_SYMBOL(invalidate_partition);

#ifdef CONFIG_BLOCK_HISTOGRAM
+typedef void (part_histo_reset) (struct disk_stats *, int);
+
/*
* Clear one per-cpu instance of a particular I/O histogram. This should always
* be called between part_stat_lock() and part_stat_unklock() calls.
@@ -1317,23 +1322,27 @@ static inline void __block_part_histogram_reset(struct disk_stats *stats,
memset(&stats->wr_histo, 0, sizeof(stats->wr_histo));
}

+static inline void __block_part_seek_histogram_reset(struct disk_stats *stats,
+ int dummy)
+{
+ memset(&stats->seek_histo, 0, sizeof(stats->seek_histo));
+}
+
/*
* Clear the I/O histogram for a given partition.
*/
-static void block_part_histogram_reset(struct hd_struct *part, int direction)
+static void block_part_histogram_reset(struct hd_struct *part,
+ part_histo_reset *reset_fn, int direction)
{
#ifdef CONFIG_SMP
int i;

part_stat_lock();
- for_each_possible_cpu(i) {
- if (cpu_possible(i))
- __block_part_histogram_reset(per_cpu_ptr(part->dkstats,
- i), direction);
- }
+ for_each_possible_cpu(i)
+ reset_fn(per_cpu_ptr(part->dkstats, i), direction);
#else
part_stat_lock();
- __block_part_histogram_reset(&part.dkstats, direction);
+ reset_fn(&part.dkstats, direction);
#endif
part_stat_unlock();
}
@@ -1342,7 +1351,8 @@ static void block_part_histogram_reset(struct hd_struct *part, int direction)
* Iterate though all partitions of the disk and clear the specified
* (read/write) histogram.
*/
-static int block_disk_histogram_reset(struct hd_struct *part, int direction)
+static int block_disk_histogram_reset(struct hd_struct *part,
+ part_histo_reset *reset_fn, int direction)
{
struct disk_part_iter piter;
struct gendisk *disk = part_to_disk(part);
@@ -1353,11 +1363,16 @@ static int block_disk_histogram_reset(struct hd_struct *part, int direction)

disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY_PART0);
while ((temp = disk_part_iter_next(&piter)))
- block_part_histogram_reset(temp, direction);
+ block_part_histogram_reset(temp, reset_fn, direction);
disk_part_iter_exit(&piter);
return 0;
}

+void init_part_histo_defaults(struct hd_struct *part)
+{
+ part->last_end_sector = part->start_sect;
+}
+
/*
* Map transfer size to histogram bucket. Transfer sizes are exponentially
* increasing. For example: 4,8,16,... sectors.
@@ -1397,6 +1412,15 @@ static inline int stats_time_bucket(int jiffies)
}

/*
+ * Map seek distance to histogram bucket. This also uses an exponential
+ * increment : 8, 16, 32, ... sectors.
+ */
+static inline int stats_seek_bucket(sector_t distance)
+{
+ return min(fls64(distance >> 3), CONFIG_HISTO_SEEK_BUCKETS);
+}
+
+/*
* Log I/O completion, update histogram.
*
* @part: disk device partition
@@ -1407,11 +1431,20 @@ static inline int stats_time_bucket(int jiffies)
static inline void __block_histogram_completion(int cpu, struct hd_struct *part,
struct request *req, unsigned int req_ms, unsigned int dma_ms)
{
- sector_t sectors = blk_rq_size(req);
+ sector_t sectors = blk_rq_size(req), end_sector = blk_rq_pos(req);
+ sector_t distance, start_sector = end_sector - sectors;
int size_idx = stats_size_bucket(sectors);
int req_time_idx = stats_time_bucket(req_ms);
int dma_time_idx = stats_time_bucket(dma_ms);

+ if (start_sector >= part->last_end_sector)
+ distance = start_sector - part->last_end_sector;
+ else
+ distance = part->last_end_sector - start_sector;
+
+ part_stat_inc(cpu, part, seek_histo[stats_seek_bucket(distance)]);
+ part->last_end_sector = end_sector;
+
if (!rq_data_dir(req))
part_stat_inc(cpu, part,
rd_histo[HISTO_REQUEST][size_idx][req_time_idx]);
@@ -1455,6 +1488,11 @@ static uint64_t histo_stat_read(struct hd_struct *part, int direction,
part_stat_read(part, wr_histo[i][j][k]);
}

+static uint64_t seek_histo_stat_read(struct hd_struct *part, int i)
+{
+ return part_stat_read(part, seek_histo[i]);
+}
+
/*
* Dumps the specified 'type' of histogram for part to out.
* The result must be less than PAGE_SIZE.
@@ -1508,6 +1546,28 @@ static int dump_histo(struct hd_struct *part, int direction, int type,
}

/*
+ * Dumps the seek histogram for part. The result must be less than PAGE_SIZE.
+ */
+static int dump_seek_histo(struct hd_struct *part, char* page)
+{
+ ssize_t rem = PAGE_SIZE;
+ char *optr = page;
+ int i, len;
+
+ for (i = 0; i < CONFIG_HISTO_SEEK_BUCKETS + 1; i++) {
+ if (i < CONFIG_HISTO_SEEK_BUCKETS)
+ len = snprintf(page, rem, "%ld\t%llu\n",
+ 1UL << (i + 3), seek_histo_stat_read(part, i));
+ else
+ len = snprintf(page, rem, "inf\t%llu\n",
+ seek_histo_stat_read(part, i));
+ page += len;
+ rem -= len;
+ }
+ return page - optr;
+}
+
+/*
* sysfs show() methods for the four histogram channels.
*/
ssize_t part_read_request_histo_show(struct device *dev,
@@ -1534,6 +1594,12 @@ ssize_t part_write_dma_histo_show(struct device *dev,
return dump_histo(dev_to_part(dev), WRITE, HISTO_DMA, page);
}

+ssize_t part_seek_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page)
+{
+ return dump_seek_histo(dev_to_part(dev), page);
+}
+
/*
* Reinitializes the read histograms to 0.
*/
@@ -1541,7 +1607,8 @@ ssize_t part_read_histo_clear(struct device *dev,
struct device_attribute *attr, const char *page, size_t count)
{
/* Ignore the data, just clear the histogram */
- int retval = block_disk_histogram_reset(dev_to_part(dev), READ);
+ int retval = block_disk_histogram_reset(dev_to_part(dev),
+ __block_part_histogram_reset, READ);
return (retval == 0 ? count : retval);
}

@@ -1551,7 +1618,19 @@ ssize_t part_read_histo_clear(struct device *dev,
ssize_t part_write_histo_clear(struct device *dev,
struct device_attribute *attr, const char *page, size_t count)
{
- int retval = block_disk_histogram_reset(dev_to_part(dev), WRITE);
+ int retval = block_disk_histogram_reset(dev_to_part(dev),
+ __block_part_histogram_reset, WRITE);
+ return (retval == 0 ? count : retval);
+}
+
+/*
+ * Reinitializes the seek histograms to 0.
+ */
+ssize_t part_seek_histo_clear(struct device *dev,
+ struct device_attribute *attr, const char *page, size_t count)
+{
+ int retval = block_disk_histogram_reset(dev_to_part(dev),
+ __block_part_seek_histogram_reset, 0);
return (retval == 0 ? count : retval);
}

diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index e0044d4..47e2591 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -309,6 +309,8 @@ static DEVICE_ATTR(write_request_histo, S_IRUGO | S_IWUSR,
part_write_request_histo_show, part_write_histo_clear);
static DEVICE_ATTR(write_dma_histo, S_IRUGO | S_IWUSR,
part_write_dma_histo_show, part_write_histo_clear);
+static DEVICE_ATTR(seek_histo, S_IRUGO | S_IWUSR,
+ part_seek_histo_show, part_seek_histo_clear);
#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
@@ -328,6 +330,7 @@ static struct attribute *part_attrs[] = {
&dev_attr_read_dma_histo.attr,
&dev_attr_write_request_histo.attr,
&dev_attr_write_dma_histo.attr,
+ &dev_attr_seek_histo.attr,
#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
@@ -436,6 +439,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
p->nr_sects = len;
p->partno = partno;
p->policy = get_disk_ro(disk);
+ init_part_histo_defaults(p);

dname = dev_name(ddev);
if (isdigit(dname[strlen(dname) - 1]))
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 7406533..746b36b 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -101,7 +101,8 @@ struct disk_stats {
* /sys/block/DEV/PART/read_request_histo,
* /sys/block/DEV/PART/write_request_histo,
* /sys/block/DEV/PART/read_dma_histo,
- * /sys/block/DEV/PART/write_dma_histo and the
+ * /sys/block/DEV/PART/write_dma_histo,
+ * /sysfs/block/DEV/PART/seek_histo and the
* /sys/block/DEV counterparts.
*
* The *request_histo files measure time from when the request is first
@@ -110,6 +111,7 @@ struct disk_stats {
*/
uint64_t rd_histo[2][CONFIG_HISTO_SIZE_BUCKETS][CONFIG_HISTO_TIME_BUCKETS];
uint64_t wr_histo[2][CONFIG_HISTO_SIZE_BUCKETS][CONFIG_HISTO_TIME_BUCKETS];
+ uint64_t seek_histo[CONFIG_HISTO_SEEK_BUCKETS + 1];
#endif
};

@@ -131,6 +133,9 @@ struct hd_struct {
#else
struct disk_stats dkstats;
#endif
+#ifdef CONFIG_BLOCK_HISTOGRAM
+ sector_t last_end_sector;
+#endif
struct rcu_head rcu_head;
};

@@ -399,13 +404,20 @@ extern ssize_t part_write_dma_histo_show(struct device *dev,
struct device_attribute *attr, char *page);
extern ssize_t part_write_dma_histo_show(struct device *dev,
struct device_attribute *attr, char *page);
+extern ssize_t part_seek_histo_show(struct device *dev,
+ struct device_attribute *attr, char *page);
extern ssize_t part_read_histo_clear(struct device *dev,
struct device_attribute *attr, const char *page, size_t count);
extern ssize_t part_write_histo_clear(struct device *dev,
struct device_attribute *attr, const char *page, size_t count);
+extern ssize_t part_seek_histo_clear(struct device *dev,
+ struct device_attribute *attr, const char *page, size_t count);
+
+extern void init_part_histo_defaults(struct hd_struct *part);
#else
static inline void block_histogram_completion(int cpu, struct hd_struct *part,
struct request *req) {}
+static inline void init_part_histo_defaults(struct hd_struct *part) {}
#endif

/* drivers/char/random.c */

2010-04-15 05:46:58

by Divyesh Shah

[permalink] [raw]
Subject: [PATCH 4/4] block: Make base bucket for the histograms configurable

Signed-off-by: Divyesh Shah<[email protected]>
---

block/genhd.c | 87 ++++++++++++++++++++++++++++++++++++++++++-------
fs/partitions/check.c | 9 +++++
include/linux/genhd.h | 19 +++++++++--
3 files changed, 99 insertions(+), 16 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 8920994..c2ba04b 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -892,6 +892,12 @@ static DEVICE_ATTR(write_dma_histo, S_IRUGO | S_IWUSR,
part_write_dma_histo_show, part_write_histo_clear);
static DEVICE_ATTR(seek_histo, S_IRUGO | S_IWUSR,
part_seek_histo_show, part_seek_histo_clear);
+static DEVICE_ATTR(base_histo_size, S_IRUGO | S_IWUSR,
+ part_base_histo_size_show, part_base_histo_size_write);
+static DEVICE_ATTR(base_histo_time, S_IRUGO | S_IWUSR,
+ part_base_histo_time_show, part_base_histo_time_write);
+static DEVICE_ATTR(base_histo_seek, S_IRUGO | S_IWUSR,
+ part_base_histo_seek_show, part_base_histo_seek_write);
#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
@@ -920,6 +926,9 @@ static struct attribute *disk_attrs[] = {
&dev_attr_write_request_histo.attr,
&dev_attr_write_dma_histo.attr,
&dev_attr_seek_histo.attr,
+ &dev_attr_base_histo_size.attr,
+ &dev_attr_base_histo_time.attr,
+ &dev_attr_base_histo_seek.attr,
#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
@@ -1206,6 +1215,7 @@ struct gendisk *alloc_disk_node(int minors, int node_id)
return NULL;
}
disk->part_tbl->part[0] = &disk->part0;
+ init_part_histo_defaults(&disk->part0);

disk->minors = minors;
rand_initialize_disk(disk);
@@ -1307,6 +1317,13 @@ int invalidate_partition(struct gendisk *disk, int partno)
EXPORT_SYMBOL(invalidate_partition);

#ifdef CONFIG_BLOCK_HISTOGRAM
+/* Smallest transfer size identifiable (in sectors) by histograms. */
+static const int base_histo_size = 4;
+/* Smallest transfer time identifiable (in ms) by histograms. */
+static const int base_histo_time = 10;
+/* Smallest seek distance identifiable (in log base 2 sectors). */
+static const int base_histo_seek = 3;
+
typedef void (part_histo_reset) (struct disk_stats *, int);

/*
@@ -1371,18 +1388,22 @@ static int block_disk_histogram_reset(struct hd_struct *part,
void init_part_histo_defaults(struct hd_struct *part)
{
part->last_end_sector = part->start_sect;
+ part->base_histo_size = base_histo_size;
+ part->base_histo_time = base_histo_time;
+ part->base_histo_seek = base_histo_seek;
}

/*
* Map transfer size to histogram bucket. Transfer sizes are exponentially
* increasing. For example: 4,8,16,... sectors.
*/
-static inline int stats_size_bucket(sector_t sectors)
+static inline int stats_size_bucket(sector_t sectors, int base_histo_size)
{
int i;
/* To make sure bucket for x bytes captures all IOs <= x bytes. */
--sectors;
- do_div(sectors, BASE_HISTO_SIZE);
+ BUG_ON(!base_histo_size);
+ do_div(sectors, base_histo_size);
if (sectors >= (1 << (CONFIG_HISTO_SIZE_BUCKETS - 2)))
return CONFIG_HISTO_SIZE_BUCKETS - 1;

@@ -1395,11 +1416,11 @@ static inline int stats_size_bucket(sector_t sectors)
* Map transfer time to histogram bucket. This also uses an exponential
* increment, but we like the 1,2,5,10,20,50 progression.
*/
-static inline int stats_time_bucket(int jiffies)
+static inline int stats_time_bucket(int jiffies, int base_histo_time)
{
int i;
int ms = msecs_to_jiffies(jiffies);
- int t = BASE_HISTO_TIME;
+ int t = base_histo_time;

for (i = 0;; t *= 10) {
if (++i >= CONFIG_HISTO_TIME_BUCKETS || ms <= t)
@@ -1415,9 +1436,10 @@ static inline int stats_time_bucket(int jiffies)
* Map seek distance to histogram bucket. This also uses an exponential
* increment : 8, 16, 32, ... sectors.
*/
-static inline int stats_seek_bucket(sector_t distance)
+static inline int stats_seek_bucket(sector_t distance, int base_histo_seek)
{
- return min(fls64(distance >> 3), CONFIG_HISTO_SEEK_BUCKETS);
+ return min(fls64(distance >> base_histo_seek),
+ CONFIG_HISTO_SEEK_BUCKETS);
}

/*
@@ -1433,16 +1455,17 @@ static inline void __block_histogram_completion(int cpu, struct hd_struct *part,
{
sector_t sectors = blk_rq_size(req), end_sector = blk_rq_pos(req);
sector_t distance, start_sector = end_sector - sectors;
- int size_idx = stats_size_bucket(sectors);
- int req_time_idx = stats_time_bucket(req_ms);
- int dma_time_idx = stats_time_bucket(dma_ms);
+ int size_idx = stats_size_bucket(sectors, part->base_histo_size);
+ int req_time_idx = stats_time_bucket(req_ms, part->base_histo_time);
+ int dma_time_idx = stats_time_bucket(dma_ms, part->base_histo_time);

if (start_sector >= part->last_end_sector)
distance = start_sector - part->last_end_sector;
else
distance = part->last_end_sector - start_sector;

- part_stat_inc(cpu, part, seek_histo[stats_seek_bucket(distance)]);
+ part_stat_inc(cpu, part, seek_histo[stats_seek_bucket(distance,
+ part->base_histo_seek)]);
part->last_end_sector = end_sector;

if (!rq_data_dir(req))
@@ -1502,7 +1525,7 @@ static int dump_histo(struct hd_struct *part, int direction, int type,
{
ssize_t rem = PAGE_SIZE;
char *optr = page;
- int i, j, len, ms, size = BASE_HISTO_SIZE * 512;
+ int i, j, len, ms, size = part->base_histo_size * 512;
static const int mults[3] = {1, 2, 5};

/*
@@ -1515,7 +1538,7 @@ static int dump_histo(struct hd_struct *part, int direction, int type,
len = snprintf(page, rem, " ");
page += len;
rem -= len;
- for (i = 0, ms = BASE_HISTO_TIME; i < CONFIG_HISTO_TIME_BUCKETS;
+ for (i = 0, ms = part->base_histo_time; i < CONFIG_HISTO_TIME_BUCKETS;
ms *= 10) {
for (j = 0; j < 3 && i < CONFIG_HISTO_TIME_BUCKETS; ++j, ++i) {
len = snprintf(page, rem, "\t%d", ms * mults[j]);
@@ -1557,7 +1580,8 @@ static int dump_seek_histo(struct hd_struct *part, char* page)
for (i = 0; i < CONFIG_HISTO_SEEK_BUCKETS + 1; i++) {
if (i < CONFIG_HISTO_SEEK_BUCKETS)
len = snprintf(page, rem, "%ld\t%llu\n",
- 1UL << (i + 3), seek_histo_stat_read(part, i));
+ 1UL << (i + part->base_histo_seek),
+ seek_histo_stat_read(part, i));
else
len = snprintf(page, rem, "inf\t%llu\n",
seek_histo_stat_read(part, i));
@@ -1634,4 +1658,41 @@ ssize_t part_seek_histo_clear(struct device *dev,
return (retval == 0 ? count : retval);
}

+#define SHOW_BASE_FUNCTION(__VAR) \
+ssize_t part_##__VAR##_show(struct device *dev, \
+ struct device_attribute *attr, char *page) \
+{ \
+ struct hd_struct *part = dev_to_part(dev); \
+ \
+ return sprintf(page, "%d\n", part->__VAR); \
+}
+
+SHOW_BASE_FUNCTION(base_histo_size);
+SHOW_BASE_FUNCTION(base_histo_time);
+SHOW_BASE_FUNCTION(base_histo_seek);
+#undef SHOW_BASE_FUNCTION
+
+#define WRITE_BASE_FUNCTION(__VAR, MIN, reset_fn) \
+ssize_t part_##__VAR##_write(struct device *dev, \
+ struct device_attribute *attr, const char *page, size_t count) \
+{ \
+ struct hd_struct *part = dev_to_part(dev); \
+ char *p = (char *)page; \
+ unsigned long __data; \
+ int data, retval = strict_strtoul(p, 10, &__data); \
+ \
+ if (retval) \
+ return retval; \
+ data = __data; \
+ part->__VAR = max(data, MIN); \
+ block_disk_histogram_reset(part, reset_fn, READ); \
+ block_disk_histogram_reset(part, reset_fn, WRITE); \
+ return count; \
+}
+
+WRITE_BASE_FUNCTION(base_histo_size, 1, __block_part_histogram_reset);
+WRITE_BASE_FUNCTION(base_histo_time, 1, __block_part_histogram_reset);
+WRITE_BASE_FUNCTION(base_histo_seek, 1, __block_part_seek_histogram_reset);
+#undef WRITE_BASE_FUNCTION
+
#endif
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 47e2591..ddc0262 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -311,6 +311,12 @@ static DEVICE_ATTR(write_dma_histo, S_IRUGO | S_IWUSR,
part_write_dma_histo_show, part_write_histo_clear);
static DEVICE_ATTR(seek_histo, S_IRUGO | S_IWUSR,
part_seek_histo_show, part_seek_histo_clear);
+static DEVICE_ATTR(base_histo_size, S_IRUGO | S_IWUSR,
+ part_base_histo_size_show, part_base_histo_size_write);
+static DEVICE_ATTR(base_histo_time, S_IRUGO | S_IWUSR,
+ part_base_histo_time_show, part_base_histo_time_write);
+static DEVICE_ATTR(base_histo_seek, S_IRUGO | S_IWUSR,
+ part_base_histo_seek_show, part_base_histo_seek_write);
#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
@@ -331,6 +337,9 @@ static struct attribute *part_attrs[] = {
&dev_attr_write_request_histo.attr,
&dev_attr_write_dma_histo.attr,
&dev_attr_seek_histo.attr,
+ &dev_attr_base_histo_size.attr,
+ &dev_attr_base_histo_time.attr,
+ &dev_attr_base_histo_seek.attr,
#endif
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 746b36b..b64d983 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -80,9 +80,6 @@ struct partition {
__le32 nr_sects; /* nr of sectors in partition */
} __attribute__((packed));

-#define BASE_HISTO_SIZE 4 /* smallest transfer size, sectors */
-#define BASE_HISTO_TIME 10 /* shortest transfer time, ms */
-
/* Index into the histo arrays */
#define HISTO_REQUEST 0
#define HISTO_DMA 1
@@ -135,6 +132,9 @@ struct hd_struct {
#endif
#ifdef CONFIG_BLOCK_HISTOGRAM
sector_t last_end_sector;
+ int base_histo_size;
+ int base_histo_time;
+ int base_histo_seek;
#endif
struct rcu_head rcu_head;
};
@@ -414,6 +414,19 @@ extern ssize_t part_seek_histo_clear(struct device *dev,
struct device_attribute *attr, const char *page, size_t count);

extern void init_part_histo_defaults(struct hd_struct *part);
+
+extern ssize_t part_base_histo_size_show(struct device *dev,
+ struct device_attribute *attr, char *page);
+extern ssize_t part_base_histo_time_show(struct device *dev,
+ struct device_attribute *attr, char *page);
+extern ssize_t part_base_histo_seek_show(struct device *dev,
+ struct device_attribute *attr, char *page);
+extern ssize_t part_base_histo_size_write(struct device *dev,
+ struct device_attribute *attr, const char *page, size_t count);
+extern ssize_t part_base_histo_time_write(struct device *dev,
+ struct device_attribute *attr, const char *page, size_t count);
+extern ssize_t part_base_histo_seek_write(struct device *dev,
+ struct device_attribute *attr, const char *page, size_t count);
#else
static inline void block_histogram_completion(int cpu, struct hd_struct *part,
struct request *req) {}

2010-04-15 10:30:00

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/4] block: Per-partition block IO performance histograms

On Wed, Apr 14 2010, Divyesh Shah wrote:
> The following patchset implements per partition 2-d histograms for IO to block
> devices. The 3 types of histograms added are:
>
> 1) request histograms - 2-d histogram of total request time in ms (queueing +
> service) broken down by IO size (in bytes).
> 2) dma histograms - 2-d histogram of total service time in ms broken down by
> IO size (in bytes).
> 3) seek histograms - 1-d histogram of seek distance
>
> All of these histograms are per-partition. The first 2 are further divided into
> separate read and write histograms. The buckets for these histograms are
> configurable via config options as well as at runtime (per-device).
>
> These histograms have proven very valuable to us over the years to understand
> the seek distribution of IOs over our production machines, detect large
> queueing delays, find latency outliers, etc. by being used as part of an
> always-on monitoring system.
>
> They can be reset by writing any value to them which makes them useful for
> tests and debugging too.
>
> This was initially written by Edward Falk in 2006 and I've forward ported
> and improved it a few times it across kernel versions.
>
> He had also sent a very old version of this patchset (minus some features like
> runtime configurable buckets) back then to lkml - see
> http://lkml.indiana.edu/hypermail/linux/kernel/0611.1/2684.html
> Some of the reasons mentioned for not including these patches are given below.
>
> I'm requesting re-consideration for this patchset in light of the following
> arguments.
>
> 1) This can be done with blktrace too, why add another API?
>
> Yes blktrace can be used to get this kind of information w/ some help from
> userspace post-processing. However, to use this as an always-on monitoring tool
> w/ blktrace and have negligible performance overhead is difficult to achieve.
> I did a quick 10-thread iozone direct IO write phase run w/ and w/o blktrace
> on a traditional rotational disk to get a feel of the impact on throughput.
> This was kernel built from Jens' for-2.6.35 branch and did not have these new
> block histogram patches.
> o w/o blktrace:
> Children see throughput for 10 initial writers = 95211.22 KB/sec
> Parent sees throughput for 10 initial writers = 37593.20 KB/sec
> Min throughput per thread = 9078.65 KB/sec
> Max throughput per thread = 10055.59 KB/sec
> Avg throughput per thread = 9521.12 KB/sec
> Min xfer = 462848.00 KB
>
> o w/ blktrace:
> Children see throughput for 10 initial writers = 93527.98 KB/sec
> Parent sees throughput for 10 initial writers = 38594.47 KB/sec
> Min throughput per thread = 9197.06 KB/sec
> Max throughput per thread = 9640.09 KB/sec
> Avg throughput per thread = 9352.80 KB/sec
> Min xfer = 490496.00 KB
>
> This is about 1.8% average throughput loss per thread.
> The extra cpu time spent with blktrace is in addition to this loss of
> throughput. This overhead will only go up on faster SSDs.

blktrace definitely has a bit of overhead, even if I tried to keep it at
a minimum. I'm not too crazy about adding all this extra accounting for
something we can already get with the tracing that we have available.

The above blktrace run, I take it that was just a regular unmasked run?
Did you try and tailor the information logged? If you restricted to
logging just the particual event(s) that you need to generate this data,
the overhead would be a LOT smaller.

> 2) sysfs should be only for one value per file. There are some exceptions but we
> are working on fixing them. Please don't add new ones.
>
> There are excpetions like meminfo, etc. that violate this guideline (I'm not
> sure if its an enforced rule) and some actually make sense since there is no way
> of representing structured data. Though these block histograms are multi-valued
> one can also interpret them as one logical piece of information.

Not a problem in my book. There's also the case of giving a real
snapshot of the information as opposed to collecting from several files.

--
Jens Axboe

2010-04-15 13:40:20

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 0/4] block: Per-partition block IO performance histograms

Divyesh Shah <[email protected]> writes:

> The following patchset implements per partition 2-d histograms for IO to block
> devices. The 3 types of histograms added are:
>
> 1) request histograms - 2-d histogram of total request time in ms (queueing +
> service) broken down by IO size (in bytes).
> 2) dma histograms - 2-d histogram of total service time in ms broken down by
> IO size (in bytes).
> 3) seek histograms - 1-d histogram of seek distance
>
> All of these histograms are per-partition. The first 2 are further divided into
> separate read and write histograms. The buckets for these histograms are
> configurable via config options as well as at runtime (per-device).

Do you also keep track of statistics for the entire device? The I/O
schedulers operate at the device level, not the partition level.

> These histograms have proven very valuable to us over the years to understand
> the seek distribution of IOs over our production machines, detect large
> queueing delays, find latency outliers, etc. by being used as part of an
> always-on monitoring system.
>
> They can be reset by writing any value to them which makes them useful for
> tests and debugging too.
>
> This was initially written by Edward Falk in 2006 and I've forward ported
> and improved it a few times it across kernel versions.
>
> He had also sent a very old version of this patchset (minus some features like
> runtime configurable buckets) back then to lkml - see
> http://lkml.indiana.edu/hypermail/linux/kernel/0611.1/2684.html
> Some of the reasons mentioned for not including these patches are given below.
>
> I'm requesting re-consideration for this patchset in light of the following
> arguments.
>
> 1) This can be done with blktrace too, why add another API?
[...]
> This is about 1.8% average throughput loss per thread.
> The extra cpu time spent with blktrace is in addition to this loss of
> throughput. This overhead will only go up on faster SSDs.

I don't see any analysis of the overhead of your patch set. Would you
mind providing those numbers?

Thanks,
Jeff

2010-04-15 23:50:44

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 0/4] block: Per-partition block IO performance histograms

On Thu, Apr 15, 2010 at 6:40 AM, Jeff Moyer <[email protected]> wrote:
> Divyesh Shah <[email protected]> writes:
>
>> The following patchset implements per partition 2-d histograms for IO to block
>> devices. The 3 types of histograms added are:
>>
>> 1) request histograms - 2-d histogram of total request time in ms (queueing +
>> ? ?service) broken down by IO size (in bytes).
>> 2) dma histograms - 2-d histogram of total service time in ms broken down by
>> ? ?IO size (in bytes).
>> 3) seek histograms - 1-d histogram of seek distance
>>
>> All of these histograms are per-partition. The first 2 are further divided into
>> separate read and write histograms. The buckets for these histograms are
>> configurable via config options as well as at runtime (per-device).
>
> Do you also keep track of statistics for the entire device? ?The I/O
> schedulers operate at the device level, not the partition level.

Yes. This patch maintains stats for part0 too which represents the
entire device.

>
>> These histograms have proven very valuable to us over the years to understand
>> the seek distribution of IOs over our production machines, detect large
>> queueing delays, find latency outliers, etc. by being used as part of an
>> always-on monitoring system.
>>
>> They can be reset by writing any value to them which makes them useful for
>> tests and debugging too.
>>
>> This was initially written by Edward Falk in 2006 and I've forward ported
>> and improved it a few times it across kernel versions.
>>
>> He had also sent a very old version of this patchset (minus some features like
>> runtime configurable buckets) back then to lkml - see
>> http://lkml.indiana.edu/hypermail/linux/kernel/0611.1/2684.html
>> Some of the reasons mentioned for not including these patches are given below.
>>
>> I'm requesting re-consideration for this patchset in light of the following
>> arguments.
>>
>> 1) This can be done with blktrace too, why add another API?
> [...]
>> This is about 1.8% average throughput loss per thread.
>> The extra cpu time spent with blktrace is in addition to this loss of
>> throughput. This overhead will only go up on faster SSDs.
>
> I don't see any analysis of the overhead of your patch set. ?Would you
> mind providing those numbers?

I will try to run some tests and come back with more results (as
mentioned on the earlier response, there will be some delay).

Thanks,
Divyesh

>
> Thanks,
> Jeff
>

2010-04-15 23:56:48

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 0/4] block: Per-partition block IO performance histograms

On Thu, Apr 15, 2010 at 3:29 AM, Jens Axboe <[email protected]> wrote:
> On Wed, Apr 14 2010, Divyesh Shah wrote:
>> The following patchset implements per partition 2-d histograms for IO to block
>> devices. The 3 types of histograms added are:
>>
>> 1) request histograms - 2-d histogram of total request time in ms (queueing +
>> ? ?service) broken down by IO size (in bytes).
>> 2) dma histograms - 2-d histogram of total service time in ms broken down by
>> ? ?IO size (in bytes).
>> 3) seek histograms - 1-d histogram of seek distance
>>
>> All of these histograms are per-partition. The first 2 are further divided into
>> separate read and write histograms. The buckets for these histograms are
>> configurable via config options as well as at runtime (per-device).
>>
>> These histograms have proven very valuable to us over the years to understand
>> the seek distribution of IOs over our production machines, detect large
>> queueing delays, find latency outliers, etc. by being used as part of an
>> always-on monitoring system.
>>
>> They can be reset by writing any value to them which makes them useful for
>> tests and debugging too.
>>
>> This was initially written by Edward Falk in 2006 and I've forward ported
>> and improved it a few times it across kernel versions.
>>
>> He had also sent a very old version of this patchset (minus some features like
>> runtime configurable buckets) back then to lkml - see
>> http://lkml.indiana.edu/hypermail/linux/kernel/0611.1/2684.html
>> Some of the reasons mentioned for not including these patches are given below.
>>
>> I'm requesting re-consideration for this patchset in light of the following
>> arguments.
>>
>> 1) This can be done with blktrace too, why add another API?
>>
>> Yes blktrace can be used to get this kind of information w/ some help from
>> userspace post-processing. However, to use this as an always-on monitoring tool
>> w/ blktrace and have negligible performance overhead is difficult to achieve.
>> I did a quick 10-thread iozone direct IO write phase run w/ and w/o blktrace
>> on a traditional rotational disk to get a feel of the impact on throughput.
>> This was kernel built from Jens' for-2.6.35 branch and did not have these new
>> block histogram patches.
>> ? o w/o blktrace:
>> ? ? ? ? Children see throughput for 10 initial writers ?= ? 95211.22 KB/sec
>> ? ? ? ? Parent sees throughput for 10 initial writers ? = ? 37593.20 KB/sec
>> ? ? ? ? Min throughput per thread ? ? ? ? ? ? ? ? ? ? ? = ? ?9078.65 KB/sec
>> ? ? ? ? Max throughput per thread ? ? ? ? ? ? ? ? ? ? ? = ? 10055.59 KB/sec
>> ? ? ? ? Avg throughput per thread ? ? ? ? ? ? ? ? ? ? ? = ? ?9521.12 KB/sec
>> ? ? ? ? Min xfer ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= ?462848.00 KB
>>
>> ? o w/ blktrace:
>> ? ? ? ? Children see throughput for 10 initial writers ?= ? 93527.98 KB/sec
>> ? ? ? ? Parent sees throughput for 10 initial writers ? = ? 38594.47 KB/sec
>> ? ? ? ? Min throughput per thread ? ? ? ? ? ? ? ? ? ? ? = ? ?9197.06 KB/sec
>> ? ? ? ? Max throughput per thread ? ? ? ? ? ? ? ? ? ? ? = ? ?9640.09 KB/sec
>> ? ? ? ? Avg throughput per thread ? ? ? ? ? ? ? ? ? ? ? = ? ?9352.80 KB/sec
>> ? ? ? ? Min xfer ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?= ?490496.00 KB
>>
>> This is about 1.8% average throughput loss per thread.
>> The extra cpu time spent with blktrace is in addition to this loss of
>> throughput. This overhead will only go up on faster SSDs.
>
> blktrace definitely has a bit of overhead, even if I tried to keep it at
> a minimum. I'm not too crazy about adding all this extra accounting for
> something we can already get with the tracing that we have available.
>
> The above blktrace run, I take it that was just a regular unmasked run?
> Did you try and tailor the information logged? If you restricted to
> logging just the particual event(s) that you need to generate this data,
> the overhead would be a LOT smaller.

Yes this was an unmasked run. I will try running some tests for only
these specific events and report back the results. However, I am going
to be away from work/email for the next 6 days (on vacation) so there
will be some delay before I can reply back.

>> 2) sysfs should be only for one value per file. There are some exceptions but we
>> ? ?are working on fixing them. Please don't add new ones.
>>
>> There are excpetions like meminfo, etc. that violate this guideline (I'm not
>> sure if its an enforced rule) and some actually make sense since there is no way
>> of representing structured data. Though these block histograms are multi-valued
>> one can also interpret them as one logical piece of information.
>
> Not a problem in my book. There's also the case of giving a real
> snapshot of the information as opposed to collecting from several files.

That is a good point too. Thanks for your comments!

>
> --
> Jens Axboe
>
>