2010-04-01 22:01:08

by Divyesh Shah

[permalink] [raw]
Subject: [PATCH 0/3] blkio: IO controller stats

The following series implements some additional stats for IO controller.

These stats have helped us debug issues with earlier IO controller
versions and should be useful now as well.
We've been using these stats for monitoring and debugging problems after the
fact as these stats can be collected and stored for later use.

One might argue that most of this information can be exported using blktrace
when debugging. However, blktrace has non-trivial performance impact and
cannot be always turned on. These stats provide a way for continuous monitoring
without losing much performance on rotational disks. We've been able to look
at these stats and debug issues after problems have been reported in the wild
and understand the IO pattern of the affected workloads.
Some of these stats are also a good data source for high-level analysis and
capacity planning.

This patchset adds 4 stats and I will send out another patchset later for
stats like io_merged and some stats that can be turned on only for
debugging - idle_time (total time spent idling for this blkio_group),
wait_time (total time spent by the blkio_group waiting before any one of its
queues got a timeslice). I've tried to breakdown the stats and sent the most
basic ones here.
---

Divyesh Shah (3):
Increment the blkio cgroup stats for real now.
Add io controller stats like
Remove per-cfqq nr_sectors as we'll be passing that info at request dispatch


block/blk-cgroup.c | 239 ++++++++++++++++++++++++++++++++++++++++++++----
block/blk-cgroup.h | 57 +++++++++--
block/blk-core.c | 6 +
block/cfq-iosched.c | 14 +--
include/linux/blkdev.h | 20 ++++
5 files changed, 291 insertions(+), 45 deletions(-)

--
Divyesh


2010-04-01 22:01:26

by Divyesh Shah

[permalink] [raw]
Subject: [PATCH 1/3] blkio: Remove per-cfqq nr_sectors as we'll be passing

that info at request dispatch with other stats now. This patch removes the
existing support for accounting sectors for a blkio_group. This will be added
back differently in the next two patches.

Signed-off-by: Divyesh Shah<[email protected]>
---

block/blk-cgroup.c | 3 +--
block/blk-cgroup.h | 6 ++----
block/cfq-iosched.c | 10 ++--------
3 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4b686ad..5be3981 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -56,10 +56,9 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);

void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
- unsigned long time, unsigned long sectors)
+ unsigned long time)
{
blkg->time += time;
- blkg->sectors += sectors;
}
EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_stats);

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 8ccc204..fe44517 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -106,7 +106,7 @@ extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
void *key);
void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
- unsigned long time, unsigned long sectors);
+ unsigned long time);
#else
struct cgroup;
static inline struct blkio_cgroup *
@@ -123,8 +123,6 @@ blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
static inline struct blkio_group *
blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
static inline void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
- unsigned long time, unsigned long sectors)
-{
-}
+ unsigned long time) {}
#endif
#endif /* _BLK_CGROUP_H */
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ef1680b..c18e348 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -141,8 +141,6 @@ struct cfq_queue {
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
struct cfq_group *orig_cfqg;
- /* Sectors dispatched in current dispatch round */
- unsigned long nr_sectors;
};

/*
@@ -882,8 +880,7 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
slice_used = cfqq->allocated_slice;
}

- cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u sect=%lu", slice_used,
- cfqq->nr_sectors);
+ cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
return slice_used;
}

@@ -917,8 +914,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
- blkiocg_update_blkio_group_stats(&cfqg->blkg, used_sl,
- cfqq->nr_sectors);
+ blkiocg_update_blkio_group_stats(&cfqg->blkg, used_sl);
}

#ifdef CONFIG_CFQ_GROUP_IOSCHED
@@ -1524,7 +1520,6 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfqq->allocated_slice = 0;
cfqq->slice_end = 0;
cfqq->slice_dispatch = 0;
- cfqq->nr_sectors = 0;

cfq_clear_cfqq_wait_request(cfqq);
cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1869,7 +1864,6 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
elv_dispatch_sort(q, rq);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
- cfqq->nr_sectors += blk_rq_sectors(rq);
}

/*

2010-04-01 22:01:55

by Divyesh Shah

[permalink] [raw]
Subject: [PATCH 2/3] blkio: Add io controller stats like

- io_service_time
- io_wait_time
- io_serviced
- io_service_bytes

These stats are accumulated per operation type helping us to distinguish between
read and write, and sync and async IO. This patch does not increment any of
these stats.

Signed-off-by: Divyesh Shah<[email protected]>
---

block/blk-cgroup.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++-----
block/blk-cgroup.h | 39 +++++++++--
block/cfq-iosched.c | 2 -
3 files changed, 194 insertions(+), 25 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5be3981..ad6843f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -55,12 +55,15 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
}
EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);

-void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
- unsigned long time)
+void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
{
- blkg->time += time;
+ unsigned long flags;
+
+ spin_lock_irqsave(&blkg->stats_lock, flags);
+ blkg->stats.time += time;
+ spin_unlock_irqrestore(&blkg->stats_lock, flags);
}
-EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_stats);
+EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);

void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key, dev_t dev)
@@ -170,13 +173,121 @@ blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
return 0;
}

-#define SHOW_FUNCTION_PER_GROUP(__VAR) \
+static int
+blkiocg_reset_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
+{
+ struct blkio_cgroup *blkcg;
+ struct blkio_group *blkg;
+ struct hlist_node *n;
+ struct blkio_group_stats *stats;
+
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ spin_lock_irq(&blkcg->lock);
+ hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
+ spin_lock(&blkg->stats_lock);
+ stats = &blkg->stats;
+ memset(stats, 0, sizeof(struct blkio_group_stats));
+ spin_unlock(&blkg->stats_lock);
+ }
+ spin_unlock_irq(&blkcg->lock);
+ return 0;
+}
+
+void get_key_name(int type, char *disk_id, char *str, int chars_left)
+{
+ strlcpy(str, disk_id, chars_left);
+ chars_left -= strlen(str);
+ if (chars_left <= 0) {
+ printk(KERN_WARNING
+ "Possibly incorrect cgroup stat display format");
+ return;
+ }
+ switch (type) {
+ case IO_READ:
+ strlcat(str, " Read", chars_left);
+ break;
+ case IO_WRITE:
+ strlcat(str, " Write", chars_left);
+ break;
+ case IO_SYNC:
+ strlcat(str, " Sync", chars_left);
+ break;
+ case IO_ASYNC:
+ strlcat(str, " Async", chars_left);
+ break;
+ case IO_TYPE_MAX:
+ strlcat(str, " Total", chars_left);
+ break;
+ default:
+ strlcat(str, " Invalid", chars_left);
+ }
+}
+
+typedef uint64_t (get_var) (struct blkio_group *, int);
+
+#define MAX_KEY_LEN 100
+uint64_t get_typed_stat(struct blkio_group *blkg, struct cgroup_map_cb *cb,
+ get_var *getvar, char *disk_id)
+{
+ uint64_t disk_total;
+ char key_str[MAX_KEY_LEN];
+ int type;
+
+ for (type = 0; type < IO_TYPE_MAX; type++) {
+ get_key_name(type, disk_id, key_str, MAX_KEY_LEN);
+ cb->fill(cb, key_str, getvar(blkg, type));
+ }
+ disk_total = getvar(blkg, IO_READ) + getvar(blkg, IO_WRITE);
+ get_key_name(IO_TYPE_MAX, disk_id, key_str, MAX_KEY_LEN);
+ cb->fill(cb, key_str, disk_total);
+ return disk_total;
+}
+
+uint64_t get_stat(struct blkio_group *blkg, struct cgroup_map_cb *cb,
+ get_var *getvar, char *disk_id)
+{
+ uint64_t var = getvar(blkg, 0);
+ cb->fill(cb, disk_id, var);
+ return var;
+}
+
+#define GET_STAT_INDEXED(__VAR) \
+uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int type) \
+{ \
+ return blkg->stats.__VAR[type]; \
+} \
+
+GET_STAT_INDEXED(io_service_bytes);
+GET_STAT_INDEXED(io_serviced);
+GET_STAT_INDEXED(io_service_time);
+GET_STAT_INDEXED(io_wait_time);
+#undef GET_STAT_INDEXED
+
+#define GET_STAT(__VAR, __CONV) \
+uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int dummy) \
+{ \
+ uint64_t data = blkg->stats.__VAR; \
+ if (__CONV) \
+ data = (uint64_t)jiffies_to_msecs(data) * NSEC_PER_MSEC;\
+ return data; \
+}
+
+GET_STAT(time, 1);
+GET_STAT(sectors, 0);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+GET_STAT(dequeue, 0);
+#endif
+#undef GET_STAT
+
+#define SHOW_FUNCTION_PER_GROUP(__VAR, get_stats, getvar, show_total) \
static int blkiocg_##__VAR##_read(struct cgroup *cgroup, \
- struct cftype *cftype, struct seq_file *m) \
+ struct cftype *cftype, struct cgroup_map_cb *cb) \
{ \
struct blkio_cgroup *blkcg; \
struct blkio_group *blkg; \
struct hlist_node *n; \
+ uint64_t cgroup_total = 0; \
+ char disk_id[10]; \
\
if (!cgroup_lock_live_group(cgroup)) \
return -ENODEV; \
@@ -184,19 +295,32 @@ static int blkiocg_##__VAR##_read(struct cgroup *cgroup, \
blkcg = cgroup_to_blkio_cgroup(cgroup); \
rcu_read_lock(); \
hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {\
- if (blkg->dev) \
- seq_printf(m, "%u:%u %lu\n", MAJOR(blkg->dev), \
- MINOR(blkg->dev), blkg->__VAR); \
+ if (blkg->dev) { \
+ spin_lock_irq(&blkg->stats_lock); \
+ snprintf(disk_id, 10, "%u:%u", MAJOR(blkg->dev),\
+ MINOR(blkg->dev)); \
+ cgroup_total += get_stats(blkg, cb, getvar, \
+ disk_id); \
+ spin_unlock_irq(&blkg->stats_lock); \
+ } \
} \
+ if (show_total) \
+ cb->fill(cb, "Total", cgroup_total); \
rcu_read_unlock(); \
cgroup_unlock(); \
return 0; \
}

-SHOW_FUNCTION_PER_GROUP(time);
-SHOW_FUNCTION_PER_GROUP(sectors);
+SHOW_FUNCTION_PER_GROUP(time, get_stat, get_time_stat, 0);
+SHOW_FUNCTION_PER_GROUP(sectors, get_stat, get_sectors_stat, 0);
+SHOW_FUNCTION_PER_GROUP(io_service_bytes, get_typed_stat,
+ get_io_service_bytes_stat, 1);
+SHOW_FUNCTION_PER_GROUP(io_serviced, get_typed_stat, get_io_serviced_stat, 1);
+SHOW_FUNCTION_PER_GROUP(io_service_time, get_typed_stat,
+ get_io_service_time_stat, 1);
+SHOW_FUNCTION_PER_GROUP(io_wait_time, get_typed_stat, get_io_wait_time_stat, 1);
#ifdef CONFIG_DEBUG_BLK_CGROUP
-SHOW_FUNCTION_PER_GROUP(dequeue);
+SHOW_FUNCTION_PER_GROUP(dequeue, get_stat, get_dequeue_stat, 0);
#endif
#undef SHOW_FUNCTION_PER_GROUP

@@ -204,7 +328,7 @@ SHOW_FUNCTION_PER_GROUP(dequeue);
void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
unsigned long dequeue)
{
- blkg->dequeue += dequeue;
+ blkg->stats.dequeue += dequeue;
}
EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_dequeue_stats);
#endif
@@ -217,16 +341,38 @@ struct cftype blkio_files[] = {
},
{
.name = "time",
- .read_seq_string = blkiocg_time_read,
+ .read_map = blkiocg_time_read,
+ .write_u64 = blkiocg_reset_write,
},
{
.name = "sectors",
- .read_seq_string = blkiocg_sectors_read,
+ .read_map = blkiocg_sectors_read,
+ .write_u64 = blkiocg_reset_write,
+ },
+ {
+ .name = "io_service_bytes",
+ .read_map = blkiocg_io_service_bytes_read,
+ .write_u64 = blkiocg_reset_write,
+ },
+ {
+ .name = "io_serviced",
+ .read_map = blkiocg_io_serviced_read,
+ .write_u64 = blkiocg_reset_write,
+ },
+ {
+ .name = "io_service_time",
+ .read_map = blkiocg_io_service_time_read,
+ .write_u64 = blkiocg_reset_write,
+ },
+ {
+ .name = "io_wait_time",
+ .read_map = blkiocg_io_wait_time_read,
+ .write_u64 = blkiocg_reset_write,
},
#ifdef CONFIG_DEBUG_BLK_CGROUP
{
.name = "dequeue",
- .read_seq_string = blkiocg_dequeue_read,
+ .read_map = blkiocg_dequeue_read,
},
#endif
};
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index fe44517..5c5e529 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -23,6 +23,14 @@ extern struct cgroup_subsys blkio_subsys;
#define blkio_subsys_id blkio_subsys.subsys_id
#endif

+enum io_type {
+ IO_READ = 0,
+ IO_WRITE,
+ IO_SYNC,
+ IO_ASYNC,
+ IO_TYPE_MAX
+};
+
struct blkio_cgroup {
struct cgroup_subsys_state css;
unsigned int weight;
@@ -30,6 +38,23 @@ struct blkio_cgroup {
struct hlist_head blkg_list;
};

+struct blkio_group_stats {
+ /* total disk time and nr sectors dispatched by this group */
+ uint64_t time;
+ uint64_t sectors;
+ /* Total disk time used by IOs in ns */
+ uint64_t io_service_time[IO_TYPE_MAX];
+ uint64_t io_service_bytes[IO_TYPE_MAX]; /* Total bytes transferred */
+ /* Total IOs serviced, post merge */
+ uint64_t io_serviced[IO_TYPE_MAX];
+ /* Total time spent waiting in scheduler queue in ns */
+ uint64_t io_wait_time[IO_TYPE_MAX];
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ /* How many times this group has been removed from service tree */
+ unsigned long dequeue;
+#endif
+};
+
struct blkio_group {
/* An rcu protected unique identifier for the group */
void *key;
@@ -38,15 +63,13 @@ struct blkio_group {
#ifdef CONFIG_DEBUG_BLK_CGROUP
/* Store cgroup path */
char path[128];
- /* How many times this group has been removed from service tree */
- unsigned long dequeue;
#endif
/* The device MKDEV(major, minor), this group has been created for */
dev_t dev;

- /* total disk time and nr sectors dispatched by this group */
- unsigned long time;
- unsigned long sectors;
+ /* Need to serialize the stats in the case of reset/update */
+ spinlock_t stats_lock;
+ struct blkio_group_stats stats;
};

typedef void (blkio_unlink_group_fn) (void *key, struct blkio_group *blkg);
@@ -105,8 +128,8 @@ extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
void *key);
-void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
- unsigned long time);
+void blkiocg_update_timeslice_used(struct blkio_group *blkg,
+ unsigned long time);
#else
struct cgroup;
static inline struct blkio_cgroup *
@@ -122,7 +145,7 @@ blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }

static inline struct blkio_group *
blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
-static inline void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+static inline void blkiocg_update_timeslice_used(struct blkio_group *blkg,
unsigned long time) {}
#endif
#endif /* _BLK_CGROUP_H */
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c18e348..d91df9f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -914,7 +914,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,

cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
st->min_vdisktime);
- blkiocg_update_blkio_group_stats(&cfqg->blkg, used_sl);
+ blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
}

#ifdef CONFIG_CFQ_GROUP_IOSCHED

2010-04-01 22:02:11

by Divyesh Shah

[permalink] [raw]
Subject: [PATCH 3/3] blkio: Increment the blkio cgroup stats for real now

We also add start_time_ns and io_start_time_ns fields to struct request
here to record the time when a request is created and when it is
dispatched to device. We use ns uints here as ms and jiffies are
not very useful for non-rotational media.

Signed-off-by: Divyesh Shah<[email protected]>
---

block/blk-cgroup.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++--
block/blk-cgroup.h | 14 +++++++++--
block/blk-core.c | 6 +++--
block/cfq-iosched.c | 4 ++-
include/linux/blkdev.h | 20 +++++++++++++++-
5 files changed, 95 insertions(+), 9 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ad6843f..9af7257 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -15,6 +15,7 @@
#include <linux/kdev_t.h>
#include <linux/module.h>
#include <linux/err.h>
+#include <linux/blkdev.h>
#include "blk-cgroup.h"

static DEFINE_SPINLOCK(blkio_list_lock);
@@ -55,6 +56,26 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
}
EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);

+/*
+ * Add to the appropriate stat variable depending on the request type.
+ * This should be called with the blkg->stats_lock held.
+ */
+void io_add_stat(uint64_t *stat, uint64_t add, unsigned int flags)
+{
+ if (flags & REQ_RW)
+ stat[IO_WRITE] += add;
+ else
+ stat[IO_READ] += add;
+ /*
+ * Everywhere in the block layer, an IO is treated as sync if it is a
+ * read or a SYNC write. We follow the same norm.
+ */
+ if (!(flags & REQ_RW) || flags & REQ_RW_SYNC)
+ stat[IO_SYNC] += add;
+ else
+ stat[IO_ASYNC] += add;
+}
+
void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
{
unsigned long flags;
@@ -65,6 +86,41 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
}
EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);

+void blkiocg_update_request_dispatch_stats(struct blkio_group *blkg,
+ struct request *rq)
+{
+ struct blkio_group_stats *stats;
+ unsigned long flags;
+
+ spin_lock_irqsave(&blkg->stats_lock, flags);
+ stats = &blkg->stats;
+ stats->sectors += blk_rq_sectors(rq);
+ io_add_stat(stats->io_serviced, 1, rq->cmd_flags);
+ io_add_stat(stats->io_service_bytes, blk_rq_sectors(rq) << 9,
+ rq->cmd_flags);
+ spin_unlock_irqrestore(&blkg->stats_lock, flags);
+}
+
+void blkiocg_update_request_completion_stats(struct blkio_group *blkg,
+ struct request *rq)
+{
+ struct blkio_group_stats *stats;
+ unsigned long flags;
+ unsigned long long now = sched_clock();
+
+ spin_lock_irqsave(&blkg->stats_lock, flags);
+ stats = &blkg->stats;
+ if (time_after64(now, rq->io_start_time_ns))
+ io_add_stat(stats->io_service_time, now - rq->io_start_time_ns,
+ rq->cmd_flags);
+ if (time_after64(rq->io_start_time_ns, rq->start_time_ns))
+ io_add_stat(stats->io_wait_time,
+ rq->io_start_time_ns - rq->start_time_ns,
+ rq->cmd_flags);
+ spin_unlock_irqrestore(&blkg->stats_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blkiocg_update_request_completion_stats);
+
void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key, dev_t dev)
{
@@ -325,12 +381,12 @@ SHOW_FUNCTION_PER_GROUP(dequeue, get_stat, get_dequeue_stat, 0);
#undef SHOW_FUNCTION_PER_GROUP

#ifdef CONFIG_DEBUG_BLK_CGROUP
-void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
+void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
unsigned long dequeue)
{
blkg->stats.dequeue += dequeue;
}
-EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_dequeue_stats);
+EXPORT_SYMBOL_GPL(blkiocg_update_dequeue_stats);
#endif

struct cftype blkio_files[] = {
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 5c5e529..80010ef 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -112,12 +112,12 @@ static inline char *blkg_path(struct blkio_group *blkg)
{
return blkg->path;
}
-void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
+void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
unsigned long dequeue);
#else
static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
-static inline void blkiocg_update_blkio_group_dequeue_stats(
- struct blkio_group *blkg, unsigned long dequeue) {}
+static inline void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
+ unsigned long dequeue) {}
#endif

#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
@@ -130,6 +130,10 @@ extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
void *key);
void blkiocg_update_timeslice_used(struct blkio_group *blkg,
unsigned long time);
+void blkiocg_update_request_dispatch_stats(struct blkio_group *blkg,
+ struct request *rq);
+void blkiocg_update_request_completion_stats(struct blkio_group *blkg,
+ struct request *rq);
#else
struct cgroup;
static inline struct blkio_cgroup *
@@ -147,5 +151,9 @@ static inline struct blkio_group *
blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
static inline void blkiocg_update_timeslice_used(struct blkio_group *blkg,
unsigned long time) {}
+static inline void blkiocg_update_request_dispatch_stats(
+ struct blkio_group *blkg, struct request *rq) {}
+static inline void blkiocg_update_request_completion_stats(
+ struct blkio_group *blkg, struct request *rq) {}
#endif
#endif /* _BLK_CGROUP_H */
diff --git a/block/blk-core.c b/block/blk-core.c
index 9fe174d..1d94f15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -127,6 +127,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
rq->tag = -1;
rq->ref_count = 1;
rq->start_time = jiffies;
+ set_start_time_ns(rq);
}
EXPORT_SYMBOL(blk_rq_init);

@@ -1855,8 +1856,10 @@ void blk_dequeue_request(struct request *rq)
* and to it is freed is accounted as io that is in progress at
* the driver side.
*/
- if (blk_account_rq(rq))
+ if (blk_account_rq(rq)) {
q->in_flight[rq_is_sync(rq)]++;
+ set_io_start_time_ns(rq);
+ }
}

/**
@@ -2517,4 +2520,3 @@ int __init blk_dev_init(void)

return 0;
}
-
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d91df9f..5de0e54 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -854,7 +854,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
if (!RB_EMPTY_NODE(&cfqg->rb_node))
cfq_rb_erase(&cfqg->rb_node, st);
cfqg->saved_workload_slice = 0;
- blkiocg_update_blkio_group_dequeue_stats(&cfqg->blkg, 1);
+ blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
}

static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
@@ -1864,6 +1864,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
elv_dispatch_sort(q, rq);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
+ blkiocg_update_request_dispatch_stats(&cfqq->cfqg->blkg, rq);
}

/*
@@ -3284,6 +3285,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
WARN_ON(!cfqq->dispatched);
cfqd->rq_in_driver--;
cfqq->dispatched--;
+ blkiocg_update_request_completion_stats(&cfqq->cfqg->blkg, rq);

cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ebd22db..c793019 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -193,7 +193,10 @@ struct request {

struct gendisk *rq_disk;
unsigned long start_time;
-
+#ifdef CONFIG_BLK_CGROUP
+ unsigned long long start_time_ns;
+ unsigned long long io_start_time_ns; /* when passed to hardware */
+#endif
/* Number of scatter-gather DMA addr+len pairs after
* physical address coalescing is performed.
*/
@@ -1219,6 +1222,21 @@ static inline void put_dev_sector(Sector p)
struct work_struct;
int kblockd_schedule_work(struct request_queue *q, struct work_struct *work);

+#ifdef CONFIG_BLK_CGROUP
+static inline void set_start_time_ns(struct request *req)
+{
+ req->start_time_ns = sched_clock();
+}
+
+static inline void set_io_start_time_ns(struct request *req)
+{
+ req->io_start_time_ns = sched_clock();
+}
+#else
+static inline void set_start_time_ns(struct request *req) {}
+static inline void set_io_start_time_ns(struct request *req) {}
+#endif
+
#define MODULE_ALIAS_BLOCKDEV(major,minor) \
MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor))
#define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \

2010-04-02 06:45:21

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 0/3] blkio: IO controller stats

On Thu, Apr 01 2010, Divyesh Shah wrote:
> The following series implements some additional stats for IO controller.
>
> These stats have helped us debug issues with earlier IO controller
> versions and should be useful now as well.
> We've been using these stats for monitoring and debugging problems after the
> fact as these stats can be collected and stored for later use.
>
> One might argue that most of this information can be exported using blktrace
> when debugging. However, blktrace has non-trivial performance impact and
> cannot be always turned on. These stats provide a way for continuous monitoring
> without losing much performance on rotational disks. We've been able to look
> at these stats and debug issues after problems have been reported in the wild
> and understand the IO pattern of the affected workloads.
> Some of these stats are also a good data source for high-level analysis and
> capacity planning.
>
> This patchset adds 4 stats and I will send out another patchset later for
> stats like io_merged and some stats that can be turned on only for
> debugging - idle_time (total time spent idling for this blkio_group),
> wait_time (total time spent by the blkio_group waiting before any one of its
> queues got a timeslice). I've tried to breakdown the stats and sent the most
> basic ones here.

Thanks, I think this makes sense. I've applied it for 2.6.35.

--
Jens Axboe

2010-04-02 18:11:07

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 2/3] blkio: Add io controller stats like

On Thu, Apr 01, 2010 at 03:01:24PM -0700, Divyesh Shah wrote:
> - io_service_time
> - io_wait_time
> - io_serviced
> - io_service_bytes
>

Hi Divyesh,

Some more description what exactly these stats are will be helpful. Please
also update Documentation/cgroup/blkio-controller.txt file appropriately.
Especially, "Details of cgroup files" section.

> These stats are accumulated per operation type helping us to distinguish between
> read and write, and sync and async IO. This patch does not increment any of
> these stats.
>
> Signed-off-by: Divyesh Shah<[email protected]>
> ---
>
> block/blk-cgroup.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++-----
> block/blk-cgroup.h | 39 +++++++++--
> block/cfq-iosched.c | 2 -
> 3 files changed, 194 insertions(+), 25 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 5be3981..ad6843f 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -55,12 +55,15 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
> }
> EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
>
> -void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
> - unsigned long time)
> +void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
> {
> - blkg->time += time;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&blkg->stats_lock, flags);
> + blkg->stats.time += time;
> + spin_unlock_irqrestore(&blkg->stats_lock, flags);
> }
> -EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_stats);
> +EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
>
> void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> struct blkio_group *blkg, void *key, dev_t dev)
> @@ -170,13 +173,121 @@ blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
> return 0;
> }
>
> -#define SHOW_FUNCTION_PER_GROUP(__VAR) \
> +static int
> +blkiocg_reset_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
> +{
> + struct blkio_cgroup *blkcg;
> + struct blkio_group *blkg;
> + struct hlist_node *n;
> + struct blkio_group_stats *stats;
> +
> + blkcg = cgroup_to_blkio_cgroup(cgroup);
> + spin_lock_irq(&blkcg->lock);
> + hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
> + spin_lock(&blkg->stats_lock);
> + stats = &blkg->stats;
> + memset(stats, 0, sizeof(struct blkio_group_stats));
> + spin_unlock(&blkg->stats_lock);
> + }
> + spin_unlock_irq(&blkcg->lock);
> + return 0;
> +}
> +
> +void get_key_name(int type, char *disk_id, char *str, int chars_left)
> +{

get_key_name() should be static? Can we prefix all these internal function
names with blkio?

Do we have to introduce this separate notion of char *disk_id. Can we
simply stick to dev_t and do the sprintf like functions to convert that
into key. For me personally it is just less confusion while reading code.

> + strlcpy(str, disk_id, chars_left);
> + chars_left -= strlen(str);
> + if (chars_left <= 0) {
> + printk(KERN_WARNING
> + "Possibly incorrect cgroup stat display format");
> + return;
> + }
> + switch (type) {
> + case IO_READ:
> + strlcat(str, " Read", chars_left);
> + break;
> + case IO_WRITE:
> + strlcat(str, " Write", chars_left);
> + break;
> + case IO_SYNC:
> + strlcat(str, " Sync", chars_left);
> + break;
> + case IO_ASYNC:
> + strlcat(str, " Async", chars_left);
> + break;
> + case IO_TYPE_MAX:
> + strlcat(str, " Total", chars_left);
> + break;

Should we use IO_TYPE_TOTAL for "Total" stats.

> + default:
> + strlcat(str, " Invalid", chars_left);
> + }
> +}
> +
> +typedef uint64_t (get_var) (struct blkio_group *, int);
> +
> +#define MAX_KEY_LEN 100

Can we move above define to the beginning of the file. So that it is
easily visible.

> +uint64_t get_typed_stat(struct blkio_group *blkg, struct cgroup_map_cb *cb,
> + get_var *getvar, char *disk_id)
> +{

static function? blkio_get_typed_stat()?

Again, can we use dev_t instead of char *disk_id.

> + uint64_t disk_total;
> + char key_str[MAX_KEY_LEN];
> + int type;
> +
> + for (type = 0; type < IO_TYPE_MAX; type++) {
> + get_key_name(type, disk_id, key_str, MAX_KEY_LEN);
> + cb->fill(cb, key_str, getvar(blkg, type));
> + }
> + disk_total = getvar(blkg, IO_READ) + getvar(blkg, IO_WRITE);
> + get_key_name(IO_TYPE_MAX, disk_id, key_str, MAX_KEY_LEN);
> + cb->fill(cb, key_str, disk_total);
> + return disk_total;
> +}
> +
> +uint64_t get_stat(struct blkio_group *blkg, struct cgroup_map_cb *cb,
> + get_var *getvar, char *disk_id)
> +{
> + uint64_t var = getvar(blkg, 0);
> + cb->fill(cb, disk_id, var);
> + return var;
> +}
> +
> +#define GET_STAT_INDEXED(__VAR) \
> +uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int type) \
> +{ \
> + return blkg->stats.__VAR[type]; \
> +} \
> +
> +GET_STAT_INDEXED(io_service_bytes);
> +GET_STAT_INDEXED(io_serviced);
> +GET_STAT_INDEXED(io_service_time);
> +GET_STAT_INDEXED(io_wait_time);
> +#undef GET_STAT_INDEXED
> +
> +#define GET_STAT(__VAR, __CONV) \
> +uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int dummy) \
> +{ \
> + uint64_t data = blkg->stats.__VAR; \
> + if (__CONV) \
> + data = (uint64_t)jiffies_to_msecs(data) * NSEC_PER_MSEC;\
> + return data; \
> +}
> +
> +GET_STAT(time, 1);
> +GET_STAT(sectors, 0);
> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> +GET_STAT(dequeue, 0);
> +#endif
> +#undef GET_STAT
> +

I think now so many macros to define different kind of functions is
becoming really confusing. How about creating a two dimensional stat
array which is indexed by type of io stat like "io_service_time" and
then indexed by sub io type that is "READ, WRITE, SYNC" etc. And we
can define a single function to read all kind of stats.

That way we will not have to define one GET_STAT_INDEXED() and GET_STAT()
for each unindexed variable. If there are not multiple functions then,
then everywhere we don't have to pass around a function pointer to read
the variable stat and code should become much simpler. Example, code
follows.


enum stat_type {
BLKIO_STAT_TIME
BLKIO_STAT_SECTORS
BLKIO_STAT_WAIT_TIME
BLKIO_STAT_SERVICE_TIME
}

enum stat_sub_type {
BLKIO_STAT_READ
BLKIO_STAT_WRITE
BLKIO_STAT_SYNC
}

blkio_get_stat_var(enum stat_type var, enum stat_sub_type var_type) {

if (var == BLKIO_STAT_TIME)
reutrn blkg->stat.time

/* Same as above for sectors */

return blkg->stat[var][var_type];
}

Also all these get_* function needs to be static.

> +#define SHOW_FUNCTION_PER_GROUP(__VAR, get_stats, getvar, show_total) \
> static int blkiocg_##__VAR##_read(struct cgroup *cgroup, \
> - struct cftype *cftype, struct seq_file *m) \
> + struct cftype *cftype, struct cgroup_map_cb *cb) \
> { \
> struct blkio_cgroup *blkcg; \
> struct blkio_group *blkg; \
> struct hlist_node *n; \
> + uint64_t cgroup_total = 0; \
> + char disk_id[10]; \
> \
> if (!cgroup_lock_live_group(cgroup)) \
> return -ENODEV; \
> @@ -184,19 +295,32 @@ static int blkiocg_##__VAR##_read(struct cgroup *cgroup, \
> blkcg = cgroup_to_blkio_cgroup(cgroup); \
> rcu_read_lock(); \
> hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {\
> - if (blkg->dev) \
> - seq_printf(m, "%u:%u %lu\n", MAJOR(blkg->dev), \
> - MINOR(blkg->dev), blkg->__VAR); \
> + if (blkg->dev) { \
> + spin_lock_irq(&blkg->stats_lock); \
> + snprintf(disk_id, 10, "%u:%u", MAJOR(blkg->dev),\
> + MINOR(blkg->dev)); \
> + cgroup_total += get_stats(blkg, cb, getvar, \
> + disk_id); \
> + spin_unlock_irq(&blkg->stats_lock); \
> + } \
> } \
> + if (show_total) \
> + cb->fill(cb, "Total", cgroup_total); \
> rcu_read_unlock(); \
> cgroup_unlock(); \
> return 0; \
> }
>
> -SHOW_FUNCTION_PER_GROUP(time);
> -SHOW_FUNCTION_PER_GROUP(sectors);
> +SHOW_FUNCTION_PER_GROUP(time, get_stat, get_time_stat, 0);
> +SHOW_FUNCTION_PER_GROUP(sectors, get_stat, get_sectors_stat, 0);
> +SHOW_FUNCTION_PER_GROUP(io_service_bytes, get_typed_stat,
> + get_io_service_bytes_stat, 1);
> +SHOW_FUNCTION_PER_GROUP(io_serviced, get_typed_stat, get_io_serviced_stat, 1);
> +SHOW_FUNCTION_PER_GROUP(io_service_time, get_typed_stat,
> + get_io_service_time_stat, 1);
> +SHOW_FUNCTION_PER_GROUP(io_wait_time, get_typed_stat, get_io_wait_time_stat, 1);
> #ifdef CONFIG_DEBUG_BLK_CGROUP
> -SHOW_FUNCTION_PER_GROUP(dequeue);
> +SHOW_FUNCTION_PER_GROUP(dequeue, get_stat, get_dequeue_stat, 0);
> #endif
> #undef SHOW_FUNCTION_PER_GROUP
>
> @@ -204,7 +328,7 @@ SHOW_FUNCTION_PER_GROUP(dequeue);
> void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
> unsigned long dequeue)
> {
> - blkg->dequeue += dequeue;
> + blkg->stats.dequeue += dequeue;
> }
> EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_dequeue_stats);
> #endif
> @@ -217,16 +341,38 @@ struct cftype blkio_files[] = {
> },
> {
> .name = "time",
> - .read_seq_string = blkiocg_time_read,
> + .read_map = blkiocg_time_read,
> + .write_u64 = blkiocg_reset_write,
> },
> {
> .name = "sectors",
> - .read_seq_string = blkiocg_sectors_read,
> + .read_map = blkiocg_sectors_read,
> + .write_u64 = blkiocg_reset_write,
> + },
> + {
> + .name = "io_service_bytes",
> + .read_map = blkiocg_io_service_bytes_read,
> + .write_u64 = blkiocg_reset_write,
> + },
> + {
> + .name = "io_serviced",
> + .read_map = blkiocg_io_serviced_read,
> + .write_u64 = blkiocg_reset_write,
> + },
> + {
> + .name = "io_service_time",
> + .read_map = blkiocg_io_service_time_read,
> + .write_u64 = blkiocg_reset_write,
> + },
> + {
> + .name = "io_wait_time",
> + .read_map = blkiocg_io_wait_time_read,
> + .write_u64 = blkiocg_reset_write,
> },

blkiocg_reset_write() is invoked if a user does some write on any of the
above files. This will reset all the stats. So if I do echo x >
blkio.time, I will see all other files being reset? I think this is
little counter intutive.

Either we need to reset only those specific stats (the file being written
to) or may be we can provide a separate file "blkio.reset_stats" to reset
all the stats?

Secondly, instead of providing separate files for all the stats
"io_service_bytes, io_serviced...." do you think it makes sense to put
these in a single file "blkio.stat". Something like what memory controller
does for providing stats in a single file. I think number of files per
cgroup will be little less overwhelming that way.

Thinking loud, Hm.., but these stats are per device (unlink memory cgroup)
that too each stat has been broken down by type (READ, WRITE, SYNC..etc), so it
might be too much of info in a single file. That way, keeping these in
separate files makese sense. I guess, per stat file is ok, given the
fact that stats are per device.

Thanks
Vivek

> #ifdef CONFIG_DEBUG_BLK_CGROUP
> {
> .name = "dequeue",
> - .read_seq_string = blkiocg_dequeue_read,
> + .read_map = blkiocg_dequeue_read,
> },
> #endif
> };
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index fe44517..5c5e529 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -23,6 +23,14 @@ extern struct cgroup_subsys blkio_subsys;
> #define blkio_subsys_id blkio_subsys.subsys_id
> #endif
>
> +enum io_type {
> + IO_READ = 0,
> + IO_WRITE,
> + IO_SYNC,
> + IO_ASYNC,
> + IO_TYPE_MAX
> +};
> +
> struct blkio_cgroup {
> struct cgroup_subsys_state css;
> unsigned int weight;
> @@ -30,6 +38,23 @@ struct blkio_cgroup {
> struct hlist_head blkg_list;
> };
>
> +struct blkio_group_stats {
> + /* total disk time and nr sectors dispatched by this group */
> + uint64_t time;
> + uint64_t sectors;
> + /* Total disk time used by IOs in ns */
> + uint64_t io_service_time[IO_TYPE_MAX];
> + uint64_t io_service_bytes[IO_TYPE_MAX]; /* Total bytes transferred */
> + /* Total IOs serviced, post merge */
> + uint64_t io_serviced[IO_TYPE_MAX];
> + /* Total time spent waiting in scheduler queue in ns */
> + uint64_t io_wait_time[IO_TYPE_MAX];
> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> + /* How many times this group has been removed from service tree */
> + unsigned long dequeue;
> +#endif
> +};
> +
> struct blkio_group {
> /* An rcu protected unique identifier for the group */
> void *key;
> @@ -38,15 +63,13 @@ struct blkio_group {
> #ifdef CONFIG_DEBUG_BLK_CGROUP
> /* Store cgroup path */
> char path[128];
> - /* How many times this group has been removed from service tree */
> - unsigned long dequeue;
> #endif
> /* The device MKDEV(major, minor), this group has been created for */
> dev_t dev;
>
> - /* total disk time and nr sectors dispatched by this group */
> - unsigned long time;
> - unsigned long sectors;
> + /* Need to serialize the stats in the case of reset/update */
> + spinlock_t stats_lock;
> + struct blkio_group_stats stats;
> };
>
> typedef void (blkio_unlink_group_fn) (void *key, struct blkio_group *blkg);
> @@ -105,8 +128,8 @@ extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
> extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
> void *key);
> -void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
> - unsigned long time);
> +void blkiocg_update_timeslice_used(struct blkio_group *blkg,
> + unsigned long time);
> #else
> struct cgroup;
> static inline struct blkio_cgroup *
> @@ -122,7 +145,7 @@ blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
>
> static inline struct blkio_group *
> blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
> -static inline void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
> +static inline void blkiocg_update_timeslice_used(struct blkio_group *blkg,
> unsigned long time) {}
> #endif
> #endif /* _BLK_CGROUP_H */
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index c18e348..d91df9f 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -914,7 +914,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>
> cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
> st->min_vdisktime);
> - blkiocg_update_blkio_group_stats(&cfqg->blkg, used_sl);
> + blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
> }
>
> #ifdef CONFIG_CFQ_GROUP_IOSCHED

2010-04-02 18:18:16

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 2/3] blkio: Add io controller stats like

On Thu, Apr 01, 2010 at 03:01:24PM -0700, Divyesh Shah wrote:

[..]
> +#define GET_STAT(__VAR, __CONV) \
> +uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int dummy) \
> +{ \
> + uint64_t data = blkg->stats.__VAR; \
> + if (__CONV) \
> + data = (uint64_t)jiffies_to_msecs(data) * NSEC_PER_MSEC;\
> + return data; \
> +}
> +

Hi Divyesh,

I think now you are exporting blkio.time in ns instead of ms?

- First of all your are breaking ABI.
- Secondly, how does that help. You are capturing all your slice used
stats in ms. You already lost any ns granularity. What's the point
in converting these to ns now?
- Does user space software really need that fine grained accounting. If
this information is used for charging purposes, isn't ms good enough.

Thanks
Vivek

2010-04-02 18:55:17

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 2/3] blkio: Add io controller stats like

On Fri, Apr 2, 2010 at 11:17 AM, Vivek Goyal <[email protected]> wrote:
> On Thu, Apr 01, 2010 at 03:01:24PM -0700, Divyesh Shah wrote:
>
> [..]
>> +#define GET_STAT(__VAR, __CONV) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> +uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int dummy) ? ? \
>> +{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> + ? ? uint64_t data = blkg->stats.__VAR; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> + ? ? if (__CONV) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> + ? ? ? ? ? ? data = (uint64_t)jiffies_to_msecs(data) * NSEC_PER_MSEC;\
>> + ? ? return data; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> +}
>> +
>
> Hi Divyesh,
>
> I think now you are exporting blkio.time in ns instead of ms?

This is an error.. I should be exporting this in ms. Will send out a
patchset with the change once I also address you other comments in a
bit. Thanks for the review!

> - First of all your are breaking ABI.
> - Secondly, how does that help. You are capturing all your slice used
> ?stats in ms. You already lost any ns granularity. What's the point
> ?in converting these to ns now?
> - Does user space software really need that fine grained accounting. If
> ?this information is used for charging purposes, isn't ms good enough.
>
> Thanks
> Vivek
>

2010-04-02 19:10:24

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 3/3] blkio: Increment the blkio cgroup stats for real now

On Thu, Apr 01, 2010 at 03:01:41PM -0700, Divyesh Shah wrote:
> We also add start_time_ns and io_start_time_ns fields to struct request
> here to record the time when a request is created and when it is
> dispatched to device. We use ns uints here as ms and jiffies are
> not very useful for non-rotational media.
>
> Signed-off-by: Divyesh Shah<[email protected]>
> ---
>
> block/blk-cgroup.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++--
> block/blk-cgroup.h | 14 +++++++++--
> block/blk-core.c | 6 +++--
> block/cfq-iosched.c | 4 ++-
> include/linux/blkdev.h | 20 +++++++++++++++-
> 5 files changed, 95 insertions(+), 9 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index ad6843f..9af7257 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -15,6 +15,7 @@
> #include <linux/kdev_t.h>
> #include <linux/module.h>
> #include <linux/err.h>
> +#include <linux/blkdev.h>
> #include "blk-cgroup.h"
>
> static DEFINE_SPINLOCK(blkio_list_lock);
> @@ -55,6 +56,26 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
> }
> EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
>
> +/*
> + * Add to the appropriate stat variable depending on the request type.
> + * This should be called with the blkg->stats_lock held.
> + */
> +void io_add_stat(uint64_t *stat, uint64_t add, unsigned int flags)
> +{
> + if (flags & REQ_RW)
> + stat[IO_WRITE] += add;
> + else
> + stat[IO_READ] += add;
> + /*
> + * Everywhere in the block layer, an IO is treated as sync if it is a
> + * read or a SYNC write. We follow the same norm.
> + */
> + if (!(flags & REQ_RW) || flags & REQ_RW_SYNC)
> + stat[IO_SYNC] += add;
> + else
> + stat[IO_ASYNC] += add;
> +}
> +

Hi Divyesh,

Can we have any request based information limited to cfq and not put that
in blkio-cgroup. The reason being that I am expecting that some kind of
max bw policy interface will not necessarily be implemented at CFQ
level. We might have to implement it at higher level so that it can
work with all dm/md devices. If that's the case, then it might very well
be either a bio based interface also.

So just keeping that possibility in mind, can we keep blk-cgroup as
generic as possible and not necessarily make it dependent on "struct
request".

If you implement, two dimensional arrays for stats then we can have
following function.

blkio_add_stat(enum stat_type var enum stat_sub_type var_type, u64 val)

The idea is that let policy specify what is the type of IO completed,
read, or write or sync, async and lets not make assumption in blkio-cgroup
that it is request based interface and use flags like REQ_RW_SYNC etc.

> void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
> {
> unsigned long flags;
> @@ -65,6 +86,41 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
> }
> EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
>
> +void blkiocg_update_request_dispatch_stats(struct blkio_group *blkg,
> + struct request *rq)
> +{
> + struct blkio_group_stats *stats;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&blkg->stats_lock, flags);
> + stats = &blkg->stats;
> + stats->sectors += blk_rq_sectors(rq);
> + io_add_stat(stats->io_serviced, 1, rq->cmd_flags);
> + io_add_stat(stats->io_service_bytes, blk_rq_sectors(rq) << 9,
> + rq->cmd_flags);

io_service_bytes is esentially same as blkio.sectors but it is giving
info in bytes instead of sectors and it is breaking down IO into subtypes.
That's fine if it is helping you.

instead of blk_rq_sectors(rq) << 9, you can probably use blk_rq_bytes().

Again, can we keep rq information out of blk-cgroup.c.

> + spin_unlock_irqrestore(&blkg->stats_lock, flags);
> +}
> +
> +void blkiocg_update_request_completion_stats(struct blkio_group *blkg,
> + struct request *rq)
> +{
> + struct blkio_group_stats *stats;
> + unsigned long flags;
> + unsigned long long now = sched_clock();
> +
> + spin_lock_irqsave(&blkg->stats_lock, flags);
> + stats = &blkg->stats;
> + if (time_after64(now, rq->io_start_time_ns))
> + io_add_stat(stats->io_service_time, now - rq->io_start_time_ns,
> + rq->cmd_flags);
> + if (time_after64(rq->io_start_time_ns, rq->start_time_ns))
> + io_add_stat(stats->io_wait_time,
> + rq->io_start_time_ns - rq->start_time_ns,
> + rq->cmd_flags);
> + spin_unlock_irqrestore(&blkg->stats_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(blkiocg_update_request_completion_stats);
> +

Same here that knowledge of rq, move into CFQ and keep blk-cgroup.c free
of it.

> void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> struct blkio_group *blkg, void *key, dev_t dev)
> {
> @@ -325,12 +381,12 @@ SHOW_FUNCTION_PER_GROUP(dequeue, get_stat, get_dequeue_stat, 0);
> #undef SHOW_FUNCTION_PER_GROUP
>
> #ifdef CONFIG_DEBUG_BLK_CGROUP
> -void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
> +void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
> unsigned long dequeue)
> {
> blkg->stats.dequeue += dequeue;
> }
> -EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_dequeue_stats);
> +EXPORT_SYMBOL_GPL(blkiocg_update_dequeue_stats);
> #endif
>
> struct cftype blkio_files[] = {
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 5c5e529..80010ef 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -112,12 +112,12 @@ static inline char *blkg_path(struct blkio_group *blkg)
> {
> return blkg->path;
> }
> -void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
> +void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
> unsigned long dequeue);
> #else
> static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
> -static inline void blkiocg_update_blkio_group_dequeue_stats(
> - struct blkio_group *blkg, unsigned long dequeue) {}
> +static inline void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
> + unsigned long dequeue) {}
> #endif
>
> #if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
> @@ -130,6 +130,10 @@ extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
> void *key);
> void blkiocg_update_timeslice_used(struct blkio_group *blkg,
> unsigned long time);
> +void blkiocg_update_request_dispatch_stats(struct blkio_group *blkg,
> + struct request *rq);
> +void blkiocg_update_request_completion_stats(struct blkio_group *blkg,
> + struct request *rq);
> #else
> struct cgroup;
> static inline struct blkio_cgroup *
> @@ -147,5 +151,9 @@ static inline struct blkio_group *
> blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
> static inline void blkiocg_update_timeslice_used(struct blkio_group *blkg,
> unsigned long time) {}
> +static inline void blkiocg_update_request_dispatch_stats(
> + struct blkio_group *blkg, struct request *rq) {}
> +static inline void blkiocg_update_request_completion_stats(
> + struct blkio_group *blkg, struct request *rq) {}
> #endif
> #endif /* _BLK_CGROUP_H */
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 9fe174d..1d94f15 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -127,6 +127,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
> rq->tag = -1;
> rq->ref_count = 1;
> rq->start_time = jiffies;
> + set_start_time_ns(rq);
> }
> EXPORT_SYMBOL(blk_rq_init);
>
> @@ -1855,8 +1856,10 @@ void blk_dequeue_request(struct request *rq)
> * and to it is freed is accounted as io that is in progress at
> * the driver side.
> */
> - if (blk_account_rq(rq))
> + if (blk_account_rq(rq)) {
> q->in_flight[rq_is_sync(rq)]++;
> + set_io_start_time_ns(rq);
> + }
> }
>
> /**
> @@ -2517,4 +2520,3 @@ int __init blk_dev_init(void)
>
> return 0;
> }
> -
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index d91df9f..5de0e54 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -854,7 +854,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
> if (!RB_EMPTY_NODE(&cfqg->rb_node))
> cfq_rb_erase(&cfqg->rb_node, st);
> cfqg->saved_workload_slice = 0;
> - blkiocg_update_blkio_group_dequeue_stats(&cfqg->blkg, 1);
> + blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
> }
>
> static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> @@ -1864,6 +1864,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
> elv_dispatch_sort(q, rq);
>
> cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
> + blkiocg_update_request_dispatch_stats(&cfqq->cfqg->blkg, rq);
> }
>
> /*
> @@ -3284,6 +3285,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> WARN_ON(!cfqq->dispatched);
> cfqd->rq_in_driver--;
> cfqq->dispatched--;
> + blkiocg_update_request_completion_stats(&cfqq->cfqg->blkg, rq);
>
> cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ebd22db..c793019 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -193,7 +193,10 @@ struct request {
>
> struct gendisk *rq_disk;
> unsigned long start_time;
> -
> +#ifdef CONFIG_BLK_CGROUP
> + unsigned long long start_time_ns;
> + unsigned long long io_start_time_ns; /* when passed to hardware */
> +#endif
> /* Number of scatter-gather DMA addr+len pairs after
> * physical address coalescing is performed.
> */
> @@ -1219,6 +1222,21 @@ static inline void put_dev_sector(Sector p)
> struct work_struct;
> int kblockd_schedule_work(struct request_queue *q, struct work_struct *work);
>
> +#ifdef CONFIG_BLK_CGROUP
> +static inline void set_start_time_ns(struct request *req)
> +{
> + req->start_time_ns = sched_clock();
> +}
> +
> +static inline void set_io_start_time_ns(struct request *req)
> +{
> + req->io_start_time_ns = sched_clock();
> +}
> +#else
> +static inline void set_start_time_ns(struct request *req) {}
> +static inline void set_io_start_time_ns(struct request *req) {}
> +#endif
> +
> #define MODULE_ALIAS_BLOCKDEV(major,minor) \
> MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor))
> #define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \

2010-04-02 20:54:00

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 2/3] blkio: Add io controller stats like

Vivek, thanks for the detailed review. Comments inline. I'll send a V2
patchset soon.

On Fri, Apr 2, 2010 at 11:10 AM, Vivek Goyal <[email protected]> wrote:
> On Thu, Apr 01, 2010 at 03:01:24PM -0700, Divyesh Shah wrote:
>> - io_service_time
>> - io_wait_time
>> - io_serviced
>> - io_service_bytes
>>
>
> Hi Divyesh,
>
> Some more description what exactly these stats are will be helpful. Please
> also update Documentation/cgroup/blkio-controller.txt file appropriately.
> Especially, "Details of cgroup files" section.

Done. Will be in V2

>
>> These stats are accumulated per operation type helping us to distinguish between
>> read and write, and sync and async IO. This patch does not increment any of
>> these stats.
>>
>> Signed-off-by: Divyesh Shah<[email protected]>
>> ---
>>
>> ?block/blk-cgroup.c ?| ?178 ++++++++++++++++++++++++++++++++++++++++++++++-----
>> ?block/blk-cgroup.h ?| ? 39 +++++++++--
>> ?block/cfq-iosched.c | ? ?2 -
>> ?3 files changed, 194 insertions(+), 25 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 5be3981..ad6843f 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -55,12 +55,15 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
>> ?}
>> ?EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
>>
>> -void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long time)
>> +void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
>> ?{
>> - ? ? blkg->time += time;
>> + ? ? unsigned long flags;
>> +
>> + ? ? spin_lock_irqsave(&blkg->stats_lock, flags);
>> + ? ? blkg->stats.time += time;
>> + ? ? spin_unlock_irqrestore(&blkg->stats_lock, flags);
>> ?}
>> -EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_stats);
>> +EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
>>
>> ?void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
>> ? ? ? ? ? ? ? ? ? ? ? struct blkio_group *blkg, void *key, dev_t dev)
>> @@ -170,13 +173,121 @@ blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
>> ? ? ? return 0;
>> ?}
>>
>> -#define SHOW_FUNCTION_PER_GROUP(__VAR) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> +static int
>> +blkiocg_reset_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
>> +{
>> + ? ? struct blkio_cgroup *blkcg;
>> + ? ? struct blkio_group *blkg;
>> + ? ? struct hlist_node *n;
>> + ? ? struct blkio_group_stats *stats;
>> +
>> + ? ? blkcg = cgroup_to_blkio_cgroup(cgroup);
>> + ? ? spin_lock_irq(&blkcg->lock);
>> + ? ? hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
>> + ? ? ? ? ? ? spin_lock(&blkg->stats_lock);
>> + ? ? ? ? ? ? stats = &blkg->stats;
>> + ? ? ? ? ? ? memset(stats, 0, sizeof(struct blkio_group_stats));
>> + ? ? ? ? ? ? spin_unlock(&blkg->stats_lock);
>> + ? ? }
>> + ? ? spin_unlock_irq(&blkcg->lock);
>> + ? ? return 0;
>> +}
>> +
>> +void get_key_name(int type, char *disk_id, char *str, int chars_left)
>> +{
>
> get_key_name() should be static? Can we prefix all these internal function
> names with blkio?
Done. Will be in V2.

>
> Do we have to introduce this separate notion of char *disk_id. Can we
> simply stick to dev_t and do the sprintf like functions to convert that
> into key. For me personally it is just less confusion while reading code.
>
>> + ? ? strlcpy(str, disk_id, chars_left);
>> + ? ? chars_left -= strlen(str);
>> + ? ? if (chars_left <= 0) {
>> + ? ? ? ? ? ? printk(KERN_WARNING
>> + ? ? ? ? ? ? ? ? ? ? "Possibly incorrect cgroup stat display format");
>> + ? ? ? ? ? ? return;
>> + ? ? }
>> + ? ? switch (type) {
>> + ? ? case IO_READ:
>> + ? ? ? ? ? ? strlcat(str, " Read", chars_left);
>> + ? ? ? ? ? ? break;
>> + ? ? case IO_WRITE:
>> + ? ? ? ? ? ? strlcat(str, " Write", chars_left);
>> + ? ? ? ? ? ? break;
>> + ? ? case IO_SYNC:
>> + ? ? ? ? ? ? strlcat(str, " Sync", chars_left);
>> + ? ? ? ? ? ? break;
>> + ? ? case IO_ASYNC:
>> + ? ? ? ? ? ? strlcat(str, " Async", chars_left);
>> + ? ? ? ? ? ? break;
>> + ? ? case IO_TYPE_MAX:
>> + ? ? ? ? ? ? strlcat(str, " Total", chars_left);
>> + ? ? ? ? ? ? break;
>
> Should we use IO_TYPE_TOTAL for "Total" stats.

Can certainly replace MAX with TOTAL.

>
>> + ? ? default:
>> + ? ? ? ? ? ? strlcat(str, " Invalid", chars_left);
>> + ? ? }
>> +}
>> +
>> +typedef uint64_t (get_var) (struct blkio_group *, int);
>> +
>> +#define MAX_KEY_LEN 100
>
> Can we move above define to the beginning of the file. So that it is
> easily visible.
Done
>
>> +uint64_t get_typed_stat(struct blkio_group *blkg, struct cgroup_map_cb *cb,
>> + ? ? ? ? ? ? get_var *getvar, char *disk_id)
>> +{
>
> static function? blkio_get_typed_stat()?
Done
>
> Again, can we use dev_t instead of char *disk_id.
Good point. Done
>
>> + ? ? uint64_t disk_total;
>> + ? ? char key_str[MAX_KEY_LEN];
>> + ? ? int type;
>> +
>> + ? ? for (type = 0; type < IO_TYPE_MAX; type++) {
>> + ? ? ? ? ? ? get_key_name(type, disk_id, key_str, MAX_KEY_LEN);
>> + ? ? ? ? ? ? cb->fill(cb, key_str, getvar(blkg, type));
>> + ? ? }
>> + ? ? disk_total = getvar(blkg, IO_READ) + getvar(blkg, IO_WRITE);
>> + ? ? get_key_name(IO_TYPE_MAX, disk_id, key_str, MAX_KEY_LEN);
>> + ? ? cb->fill(cb, key_str, disk_total);
>> + ? ? return disk_total;
>> +}
>> +
>> +uint64_t get_stat(struct blkio_group *blkg, struct cgroup_map_cb *cb,
>> + ? ? ? ? ? ? get_var *getvar, char *disk_id)
>> +{
>> + ? ? uint64_t var = getvar(blkg, 0);
>> + ? ? cb->fill(cb, disk_id, var);
>> + ? ? return var;
>> +}
>> +
>> +#define GET_STAT_INDEXED(__VAR) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> +uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int type) ? ? ? ? ? ? ?\
>> +{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> + ? ? return blkg->stats.__VAR[type]; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> +} ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> +
>> +GET_STAT_INDEXED(io_service_bytes);
>> +GET_STAT_INDEXED(io_serviced);
>> +GET_STAT_INDEXED(io_service_time);
>> +GET_STAT_INDEXED(io_wait_time);
>> +#undef GET_STAT_INDEXED
>> +
>> +#define GET_STAT(__VAR, __CONV) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> +uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int dummy) ? ? \
>> +{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> + ? ? uint64_t data = blkg->stats.__VAR; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> + ? ? if (__CONV) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> + ? ? ? ? ? ? data = (uint64_t)jiffies_to_msecs(data) * NSEC_PER_MSEC;\
>> + ? ? return data; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> +}
>> +
>> +GET_STAT(time, 1);
>> +GET_STAT(sectors, 0);
>> +#ifdef CONFIG_DEBUG_BLK_CGROUP
>> +GET_STAT(dequeue, 0);
>> +#endif
>> +#undef GET_STAT
>> +
>
> I think now so many macros to define different kind of functions is
> becoming really confusing. How about creating a two dimensional stat
> array which is indexed by type of io stat like "io_service_time" and
> then indexed by sub io type that is "READ, WRITE, SYNC" etc. And we
> can define a single function to read all kind of stats.
>
> That way we will not have to define one GET_STAT_INDEXED() and GET_STAT()
> for each unindexed variable. If there are not multiple functions then,
> then everywhere we don't have to pass around a function pointer to read
> the variable stat and code should become much simpler. Example, code
> follows.
>
>
> enum stat_type {
> ? ? ? ?BLKIO_STAT_TIME
> ? ? ? ?BLKIO_STAT_SECTORS
> ? ? ? ?BLKIO_STAT_WAIT_TIME
> ? ? ? ?BLKIO_STAT_SERVICE_TIME
> }
>
> enum stat_sub_type {
> ? ? ? ?BLKIO_STAT_READ
> ? ? ? ?BLKIO_STAT_WRITE
> ? ? ? ?BLKIO_STAT_SYNC
> }
>
> blkio_get_stat_var(enum stat_type var, enum stat_sub_type var_type) {
>
> ? ? ? ?if (var == BLKIO_STAT_TIME)
> ? ? ? ? ? ? ? ?reutrn blkg->stat.time
>
> ? ? ? ?/* Same as above for sectors */
>
> ? ? ? ?return blkg->stat[var][var_type];
> }

I'm not sure adding this level of complexity (2nd dimension for stats)
and the numerous if (..) statements (or switch) is warranted when they
only apply for the get_stat() and get_typed_stat() functions. This
seems like a more complicated simplification :).

>
> Also all these get_* function needs to be static.
Done
>
>> +#define SHOW_FUNCTION_PER_GROUP(__VAR, get_stats, getvar, show_total) ? ? ? ?\
>> ?static int blkiocg_##__VAR##_read(struct cgroup *cgroup, ? ? ? ? ? ? \
>> - ? ? ? ? ? ? ? ? ? ? struct cftype *cftype, struct seq_file *m) ? ? ?\
>> + ? ? ? ? ? ? struct cftype *cftype, struct cgroup_map_cb *cb) ? ? ? ?\
>> ?{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> ? ? ? struct blkio_cgroup *blkcg; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> ? ? ? struct blkio_group *blkg; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> ? ? ? struct hlist_node *n; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> + ? ? uint64_t cgroup_total = 0; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> + ? ? char disk_id[10]; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> ? ? ? if (!cgroup_lock_live_group(cgroup)) ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> ? ? ? ? ? ? ? return -ENODEV; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> @@ -184,19 +295,32 @@ static int blkiocg_##__VAR##_read(struct cgroup *cgroup, ? ? ? ? ? ? ? ?\
>> ? ? ? blkcg = cgroup_to_blkio_cgroup(cgroup); ? ? ? ? ? ? ? ? ? ? ? ? \
>> ? ? ? rcu_read_lock(); ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> ? ? ? hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {\
>> - ? ? ? ? ? ? if (blkg->dev) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> - ? ? ? ? ? ? ? ? ? ? seq_printf(m, "%u:%u %lu\n", MAJOR(blkg->dev), ?\
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?MINOR(blkg->dev), blkg->__VAR); ? ? ? ?\
>> + ? ? ? ? ? ? if (blkg->dev) { ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> + ? ? ? ? ? ? ? ? ? ? spin_lock_irq(&blkg->stats_lock); ? ? ? ? ? ? ? \
>> + ? ? ? ? ? ? ? ? ? ? snprintf(disk_id, 10, "%u:%u", MAJOR(blkg->dev),\
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? MINOR(blkg->dev)); ? ? ? ? ? ? ?\
>> + ? ? ? ? ? ? ? ? ? ? cgroup_total += get_stats(blkg, cb, getvar, ? ? \
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? disk_id); ? ? ? ? ? ? ? \
>> + ? ? ? ? ? ? ? ? ? ? spin_unlock_irq(&blkg->stats_lock); ? ? ? ? ? ? \
>> + ? ? ? ? ? ? } ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> ? ? ? } ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> + ? ? if (show_total) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> + ? ? ? ? ? ? cb->fill(cb, "Total", cgroup_total); ? ? ? ? ? ? ? ? ? ?\
>> ? ? ? rcu_read_unlock(); ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> ? ? ? cgroup_unlock(); ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> ? ? ? return 0; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> ?}
>>
>> -SHOW_FUNCTION_PER_GROUP(time);
>> -SHOW_FUNCTION_PER_GROUP(sectors);
>> +SHOW_FUNCTION_PER_GROUP(time, get_stat, get_time_stat, 0);
>> +SHOW_FUNCTION_PER_GROUP(sectors, get_stat, get_sectors_stat, 0);
>> +SHOW_FUNCTION_PER_GROUP(io_service_bytes, get_typed_stat,
>> + ? ? ? ? ? ? ? ? ? ? get_io_service_bytes_stat, 1);
>> +SHOW_FUNCTION_PER_GROUP(io_serviced, get_typed_stat, get_io_serviced_stat, 1);
>> +SHOW_FUNCTION_PER_GROUP(io_service_time, get_typed_stat,
>> + ? ? ? ? ? ? ? ? ? ? get_io_service_time_stat, 1);
>> +SHOW_FUNCTION_PER_GROUP(io_wait_time, get_typed_stat, get_io_wait_time_stat, 1);
>> ?#ifdef CONFIG_DEBUG_BLK_CGROUP
>> -SHOW_FUNCTION_PER_GROUP(dequeue);
>> +SHOW_FUNCTION_PER_GROUP(dequeue, get_stat, get_dequeue_stat, 0);
>> ?#endif
>> ?#undef SHOW_FUNCTION_PER_GROUP
>>
>> @@ -204,7 +328,7 @@ SHOW_FUNCTION_PER_GROUP(dequeue);
>> ?void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
>> ? ? ? ? ? ? ? ? ? ? ? unsigned long dequeue)
>> ?{
>> - ? ? blkg->dequeue += dequeue;
>> + ? ? blkg->stats.dequeue += dequeue;
>> ?}
>> ?EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_dequeue_stats);
>> ?#endif
>> @@ -217,16 +341,38 @@ struct cftype blkio_files[] = {
>> ? ? ? },
>> ? ? ? {
>> ? ? ? ? ? ? ? .name = "time",
>> - ? ? ? ? ? ? .read_seq_string = blkiocg_time_read,
>> + ? ? ? ? ? ? .read_map = blkiocg_time_read,
>> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> ? ? ? },
>> ? ? ? {
>> ? ? ? ? ? ? ? .name = "sectors",
>> - ? ? ? ? ? ? .read_seq_string = blkiocg_sectors_read,
>> + ? ? ? ? ? ? .read_map = blkiocg_sectors_read,
>> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> + ? ? },
>> + ? ? {
>> + ? ? ? ? ? ? .name = "io_service_bytes",
>> + ? ? ? ? ? ? .read_map = blkiocg_io_service_bytes_read,
>> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> + ? ? },
>> + ? ? {
>> + ? ? ? ? ? ? .name = "io_serviced",
>> + ? ? ? ? ? ? .read_map = blkiocg_io_serviced_read,
>> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> + ? ? },
>> + ? ? {
>> + ? ? ? ? ? ? .name = "io_service_time",
>> + ? ? ? ? ? ? .read_map = blkiocg_io_service_time_read,
>> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> + ? ? },
>> + ? ? {
>> + ? ? ? ? ? ? .name = "io_wait_time",
>> + ? ? ? ? ? ? .read_map = blkiocg_io_wait_time_read,
>> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> ? ? ? },
>
> blkiocg_reset_write() is invoked if a user does some write on any of the
> above files. This will reset all the stats. So if I do echo x >
> blkio.time, I will see all other files being reset? I think this is
> little counter intutive.
>
> Either we need to reset only those specific stats (the file being written
> to) or may be we can provide a separate file "blkio.reset_stats" to reset
> all the stats?

I added a note in the Documentation file that writing an int to any of
the stats should reset all. I don't get the point of introducing
another file for this and would want stats to be cleared all together
in order to have say io_service_time and io_serviced in sync, i.e.,
the io_service_time should always be the total service time for the
IOs accounted for in io_serviced.

> Secondly, instead of providing separate files for all the stats
> "io_service_bytes, io_serviced...." do you think it makes sense to put
> these in a single file "blkio.stat". Something like what memory controller
> does for providing stats in a single file. I think number of files per
> cgroup will be little less overwhelming that way.
>
> Thinking loud, Hm.., but these stats are per device (unlink memory cgroup)
> that too each stat has been broken down by type (READ, WRITE, SYNC..etc), so it
> might be too much of info in a single file. That way, keeping these in
> separate files makese sense. I guess, per stat file is ok, given the
> fact that stats are per device.

Yeah we'd rather keep them separate.
>
> Thanks
> Vivek
>
>> ?#ifdef CONFIG_DEBUG_BLK_CGROUP
>> ? ? ? ? {
>> ? ? ? ? ? ? ? .name = "dequeue",
>> - ? ? ? ? ? ? .read_seq_string = blkiocg_dequeue_read,
>> + ? ? ? ? ? ? .read_map = blkiocg_dequeue_read,
>> ? ? ? ? },
>> ?#endif
>> ?};
>> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
>> index fe44517..5c5e529 100644
>> --- a/block/blk-cgroup.h
>> +++ b/block/blk-cgroup.h
>> @@ -23,6 +23,14 @@ extern struct cgroup_subsys blkio_subsys;
>> ?#define blkio_subsys_id blkio_subsys.subsys_id
>> ?#endif
>>
>> +enum io_type {
>> + ? ? IO_READ = 0,
>> + ? ? IO_WRITE,
>> + ? ? IO_SYNC,
>> + ? ? IO_ASYNC,
>> + ? ? IO_TYPE_MAX
>> +};
>> +
>> ?struct blkio_cgroup {
>> ? ? ? struct cgroup_subsys_state css;
>> ? ? ? unsigned int weight;
>> @@ -30,6 +38,23 @@ struct blkio_cgroup {
>> ? ? ? struct hlist_head blkg_list;
>> ?};
>>
>> +struct blkio_group_stats {
>> + ? ? /* total disk time and nr sectors dispatched by this group */
>> + ? ? uint64_t time;
>> + ? ? uint64_t sectors;
>> + ? ? /* Total disk time used by IOs in ns */
>> + ? ? uint64_t io_service_time[IO_TYPE_MAX];
>> + ? ? uint64_t io_service_bytes[IO_TYPE_MAX]; /* Total bytes transferred */
>> + ? ? /* Total IOs serviced, post merge */
>> + ? ? uint64_t io_serviced[IO_TYPE_MAX];
>> + ? ? /* Total time spent waiting in scheduler queue in ns */
>> + ? ? uint64_t io_wait_time[IO_TYPE_MAX];
>> +#ifdef CONFIG_DEBUG_BLK_CGROUP
>> + ? ? /* How many times this group has been removed from service tree */
>> + ? ? unsigned long dequeue;
>> +#endif
>> +};
>> +
>> ?struct blkio_group {
>> ? ? ? /* An rcu protected unique identifier for the group */
>> ? ? ? void *key;
>> @@ -38,15 +63,13 @@ struct blkio_group {
>> ?#ifdef CONFIG_DEBUG_BLK_CGROUP
>> ? ? ? /* Store cgroup path */
>> ? ? ? char path[128];
>> - ? ? /* How many times this group has been removed from service tree */
>> - ? ? unsigned long dequeue;
>> ?#endif
>> ? ? ? /* The device MKDEV(major, minor), this group has been created for */
>> ? ? ? dev_t ? dev;
>>
>> - ? ? /* total disk time and nr sectors dispatched by this group */
>> - ? ? unsigned long time;
>> - ? ? unsigned long sectors;
>> + ? ? /* Need to serialize the stats in the case of reset/update */
>> + ? ? spinlock_t stats_lock;
>> + ? ? struct blkio_group_stats stats;
>> ?};
>>
>> ?typedef void (blkio_unlink_group_fn) (void *key, struct blkio_group *blkg);
>> @@ -105,8 +128,8 @@ extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
>> ?extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
>> ?extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? void *key);
>> -void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long time);
>> +void blkiocg_update_timeslice_used(struct blkio_group *blkg,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long time);
>> ?#else
>> ?struct cgroup;
>> ?static inline struct blkio_cgroup *
>> @@ -122,7 +145,7 @@ blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
>>
>> ?static inline struct blkio_group *
>> ?blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
>> -static inline void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
>> +static inline void blkiocg_update_timeslice_used(struct blkio_group *blkg,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long time) {}
>> ?#endif
>> ?#endif /* _BLK_CGROUP_H */
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index c18e348..d91df9f 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -914,7 +914,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>
>> ? ? ? cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? st->min_vdisktime);
>> - ? ? blkiocg_update_blkio_group_stats(&cfqg->blkg, used_sl);
>> + ? ? blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
>> ?}
>>
>> ?#ifdef CONFIG_CFQ_GROUP_IOSCHED
>

2010-04-02 23:37:14

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 3/3] blkio: Increment the blkio cgroup stats for real now

On Fri, Apr 2, 2010 at 12:10 PM, Vivek Goyal <[email protected]> wrote:
> On Thu, Apr 01, 2010 at 03:01:41PM -0700, Divyesh Shah wrote:
>> We also add start_time_ns and io_start_time_ns fields to struct request
>> here to record the time when a request is created and when it is
>> dispatched to device. We use ns uints here as ms and jiffies are
>> not very useful for non-rotational media.
>>
>> Signed-off-by: Divyesh Shah<[email protected]>
>> ---
>>
>> ?block/blk-cgroup.c ? ? | ? 60 ++++++++++++++++++++++++++++++++++++++++++++++--
>> ?block/blk-cgroup.h ? ? | ? 14 +++++++++--
>> ?block/blk-core.c ? ? ? | ? ?6 +++--
>> ?block/cfq-iosched.c ? ?| ? ?4 ++-
>> ?include/linux/blkdev.h | ? 20 +++++++++++++++-
>> ?5 files changed, 95 insertions(+), 9 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index ad6843f..9af7257 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -15,6 +15,7 @@
>> ?#include <linux/kdev_t.h>
>> ?#include <linux/module.h>
>> ?#include <linux/err.h>
>> +#include <linux/blkdev.h>
>> ?#include "blk-cgroup.h"
>>
>> ?static DEFINE_SPINLOCK(blkio_list_lock);
>> @@ -55,6 +56,26 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
>> ?}
>> ?EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
>>
>> +/*
>> + * Add to the appropriate stat variable depending on the request type.
>> + * This should be called with the blkg->stats_lock held.
>> + */
>> +void io_add_stat(uint64_t *stat, uint64_t add, unsigned int flags)
>> +{
>> + ? ? if (flags & REQ_RW)
>> + ? ? ? ? ? ? stat[IO_WRITE] += add;
>> + ? ? else
>> + ? ? ? ? ? ? stat[IO_READ] += add;
>> + ? ? /*
>> + ? ? ?* Everywhere in the block layer, an IO is treated as sync if it is a
>> + ? ? ?* read or a SYNC write. We follow the same norm.
>> + ? ? ?*/
>> + ? ? if (!(flags & REQ_RW) || flags & REQ_RW_SYNC)
>> + ? ? ? ? ? ? stat[IO_SYNC] += add;
>> + ? ? else
>> + ? ? ? ? ? ? stat[IO_ASYNC] += add;
>> +}
>> +
>
> Hi Divyesh,
>
> Can we have any request based information limited to cfq and not put that
> in blkio-cgroup. The reason being that I am expecting that some kind of
> max bw policy interface will not necessarily be implemented at CFQ
> level. We might have to implement it at higher level so that it can
> work with all dm/md devices. If that's the case, then it might very well
> be either a bio based interface also.
>
> So just keeping that possibility in mind, can we keep blk-cgroup as
> generic as possible and not necessarily make it dependent on "struct
> request".

Ok. I do understand the motivation for keeping the request related
info out of blk-cgroup. Everything except the rq->cmd_flags can be
easily done away with. Maybe I'll need to have CFQ send the sync and
direction bits as args to the functions that need it. Not ideal coz
we'll have functions with many args but I guess its not that bad too.

>
> If you implement, two dimensional arrays for stats then we can have
> following function.
>
> blkio_add_stat(enum stat_type var enum stat_sub_type var_type, u64 val)

I would want to avoid calls like these from CFQ into the blkcg code
because many CFQ events trigger update for multiple stats (you'll see
more with stats in later patchsets) and doing these calls
independently for each stat would mean that we would also need to grab
the stats_lock multiple times when we could've avoided that.

>
> The idea is that let policy specify what is the type of IO completed,
> read, or write or sync, async and lets not make assumption in blkio-cgroup
> that it is request based interface and use flags like REQ_RW_SYNC etc.
>
>> ?void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
>> ?{
>> ? ? ? unsigned long flags;
>> @@ -65,6 +86,41 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
>> ?}
>> ?EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
>>
>> +void blkiocg_update_request_dispatch_stats(struct blkio_group *blkg,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct request *rq)
>> +{
>> + ? ? struct blkio_group_stats *stats;
>> + ? ? unsigned long flags;
>> +
>> + ? ? spin_lock_irqsave(&blkg->stats_lock, flags);
>> + ? ? stats = &blkg->stats;
>> + ? ? stats->sectors += blk_rq_sectors(rq);
>> + ? ? io_add_stat(stats->io_serviced, 1, rq->cmd_flags);
>> + ? ? io_add_stat(stats->io_service_bytes, blk_rq_sectors(rq) << 9,
>> + ? ? ? ? ? ? ? ? ? ? rq->cmd_flags);
>
> io_service_bytes is esentially same as blkio.sectors but it is giving
> info in bytes instead of sectors and it is breaking down IO into subtypes.
> That's fine if it is helping you.
>
> instead of blk_rq_sectors(rq) << 9, you can probably use blk_rq_bytes().

Good point. However, if I'm passing rq related data from CFQ now, I'd
just pass one of these 2 pieces and get the second one from it.

>
> Again, can we keep rq information out of blk-cgroup.c.
>
>> + ? ? spin_unlock_irqrestore(&blkg->stats_lock, flags);
>> +}
>> +
>> +void blkiocg_update_request_completion_stats(struct blkio_group *blkg,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct request *rq)
>> +{
>> + ? ? struct blkio_group_stats *stats;
>> + ? ? unsigned long flags;
>> + ? ? unsigned long long now = sched_clock();
>> +
>> + ? ? spin_lock_irqsave(&blkg->stats_lock, flags);
>> + ? ? stats = &blkg->stats;
>> + ? ? if (time_after64(now, rq->io_start_time_ns))
>> + ? ? ? ? ? ? io_add_stat(stats->io_service_time, now - rq->io_start_time_ns,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? rq->cmd_flags);
>> + ? ? if (time_after64(rq->io_start_time_ns, rq->start_time_ns))
>> + ? ? ? ? ? ? io_add_stat(stats->io_wait_time,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? rq->io_start_time_ns - rq->start_time_ns,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? rq->cmd_flags);
>> + ? ? spin_unlock_irqrestore(&blkg->stats_lock, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(blkiocg_update_request_completion_stats);
>> +
>
> Same here that knowledge of rq, move into CFQ and keep blk-cgroup.c free
> of it.
Will do.
>
>> ?void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
>> ? ? ? ? ? ? ? ? ? ? ? struct blkio_group *blkg, void *key, dev_t dev)
>> ?{
>> @@ -325,12 +381,12 @@ SHOW_FUNCTION_PER_GROUP(dequeue, get_stat, get_dequeue_stat, 0);
>> ?#undef SHOW_FUNCTION_PER_GROUP
>>
>> ?#ifdef CONFIG_DEBUG_BLK_CGROUP
>> -void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
>> +void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>> ? ? ? ? ? ? ? ? ? ? ? unsigned long dequeue)
>> ?{
>> ? ? ? blkg->stats.dequeue += dequeue;
>> ?}
>> -EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_dequeue_stats);
>> +EXPORT_SYMBOL_GPL(blkiocg_update_dequeue_stats);
>> ?#endif
>>
>> ?struct cftype blkio_files[] = {
>> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
>> index 5c5e529..80010ef 100644
>> --- a/block/blk-cgroup.h
>> +++ b/block/blk-cgroup.h
>> @@ -112,12 +112,12 @@ static inline char *blkg_path(struct blkio_group *blkg)
>> ?{
>> ? ? ? return blkg->path;
>> ?}
>> -void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
>> +void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long dequeue);
>> ?#else
>> ?static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
>> -static inline void blkiocg_update_blkio_group_dequeue_stats(
>> - ? ? ? ? ? ? ? ? ? ? struct blkio_group *blkg, unsigned long dequeue) {}
>> +static inline void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long dequeue) {}
>> ?#endif
>>
>> ?#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
>> @@ -130,6 +130,10 @@ extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? void *key);
>> ?void blkiocg_update_timeslice_used(struct blkio_group *blkg,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long time);
>> +void blkiocg_update_request_dispatch_stats(struct blkio_group *blkg,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct request *rq);
>> +void blkiocg_update_request_completion_stats(struct blkio_group *blkg,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct request *rq);
>> ?#else
>> ?struct cgroup;
>> ?static inline struct blkio_cgroup *
>> @@ -147,5 +151,9 @@ static inline struct blkio_group *
>> ?blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
>> ?static inline void blkiocg_update_timeslice_used(struct blkio_group *blkg,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long time) {}
>> +static inline void blkiocg_update_request_dispatch_stats(
>> + ? ? ? ? ? ? struct blkio_group *blkg, struct request *rq) {}
>> +static inline void blkiocg_update_request_completion_stats(
>> + ? ? ? ? ? ? struct blkio_group *blkg, struct request *rq) {}
>> ?#endif
>> ?#endif /* _BLK_CGROUP_H */
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 9fe174d..1d94f15 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -127,6 +127,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>> ? ? ? rq->tag = -1;
>> ? ? ? rq->ref_count = 1;
>> ? ? ? rq->start_time = jiffies;
>> + ? ? set_start_time_ns(rq);
>> ?}
>> ?EXPORT_SYMBOL(blk_rq_init);
>>
>> @@ -1855,8 +1856,10 @@ void blk_dequeue_request(struct request *rq)
>> ? ? ? ?* and to it is freed is accounted as io that is in progress at
>> ? ? ? ?* the driver side.
>> ? ? ? ?*/
>> - ? ? if (blk_account_rq(rq))
>> + ? ? if (blk_account_rq(rq)) {
>> ? ? ? ? ? ? ? q->in_flight[rq_is_sync(rq)]++;
>> + ? ? ? ? ? ? set_io_start_time_ns(rq);
>> + ? ? }
>> ?}
>>
>> ?/**
>> @@ -2517,4 +2520,3 @@ int __init blk_dev_init(void)
>>
>> ? ? ? return 0;
>> ?}
>> -
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index d91df9f..5de0e54 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -854,7 +854,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> ? ? ? if (!RB_EMPTY_NODE(&cfqg->rb_node))
>> ? ? ? ? ? ? ? cfq_rb_erase(&cfqg->rb_node, st);
>> ? ? ? cfqg->saved_workload_slice = 0;
>> - ? ? blkiocg_update_blkio_group_dequeue_stats(&cfqg->blkg, 1);
>> + ? ? blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
>> ?}
>>
>> ?static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>> @@ -1864,6 +1864,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
>> ? ? ? elv_dispatch_sort(q, rq);
>>
>> ? ? ? cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
>> + ? ? blkiocg_update_request_dispatch_stats(&cfqq->cfqg->blkg, rq);
>> ?}
>>
>> ?/*
>> @@ -3284,6 +3285,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>> ? ? ? WARN_ON(!cfqq->dispatched);
>> ? ? ? cfqd->rq_in_driver--;
>> ? ? ? cfqq->dispatched--;
>> + ? ? blkiocg_update_request_completion_stats(&cfqq->cfqg->blkg, rq);
>>
>> ? ? ? cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;
>>
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index ebd22db..c793019 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -193,7 +193,10 @@ struct request {
>>
>> ? ? ? struct gendisk *rq_disk;
>> ? ? ? unsigned long start_time;
>> -
>> +#ifdef CONFIG_BLK_CGROUP
>> + ? ? unsigned long long start_time_ns;
>> + ? ? unsigned long long io_start_time_ns; ? ?/* when passed to hardware */
>> +#endif
>> ? ? ? /* Number of scatter-gather DMA addr+len pairs after
>> ? ? ? ?* physical address coalescing is performed.
>> ? ? ? ?*/
>> @@ -1219,6 +1222,21 @@ static inline void put_dev_sector(Sector p)
>> ?struct work_struct;
>> ?int kblockd_schedule_work(struct request_queue *q, struct work_struct *work);
>>
>> +#ifdef CONFIG_BLK_CGROUP
>> +static inline void set_start_time_ns(struct request *req)
>> +{
>> + ? ? req->start_time_ns = sched_clock();
>> +}
>> +
>> +static inline void set_io_start_time_ns(struct request *req)
>> +{
>> + ? ? req->io_start_time_ns = sched_clock();
>> +}
>> +#else
>> +static inline void set_start_time_ns(struct request *req) {}
>> +static inline void set_io_start_time_ns(struct request *req) {}
>> +#endif
>> +
>> ?#define MODULE_ALIAS_BLOCKDEV(major,minor) \
>> ? ? ? MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor))
>> ?#define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \
>

2010-04-05 14:46:11

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 2/3] blkio: Add io controller stats like

On Fri, Apr 02, 2010 at 01:53:30PM -0700, Divyesh Shah wrote:

[..]
> >> +#define GET_STAT_INDEXED(__VAR) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> +uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int type) ? ? ? ? ? ? ?\
> >> +{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> + ? ? return blkg->stats.__VAR[type]; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> +} ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> +
> >> +GET_STAT_INDEXED(io_service_bytes);
> >> +GET_STAT_INDEXED(io_serviced);
> >> +GET_STAT_INDEXED(io_service_time);
> >> +GET_STAT_INDEXED(io_wait_time);
> >> +#undef GET_STAT_INDEXED
> >> +
> >> +#define GET_STAT(__VAR, __CONV) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> +uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int dummy) ? ? \
> >> +{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> + ? ? uint64_t data = blkg->stats.__VAR; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> + ? ? if (__CONV) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> + ? ? ? ? ? ? data = (uint64_t)jiffies_to_msecs(data) * NSEC_PER_MSEC;\
> >> + ? ? return data; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> +}
> >> +
> >> +GET_STAT(time, 1);
> >> +GET_STAT(sectors, 0);
> >> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> >> +GET_STAT(dequeue, 0);
> >> +#endif
> >> +#undef GET_STAT
> >> +
> >
> > I think now so many macros to define different kind of functions is
> > becoming really confusing. How about creating a two dimensional stat
> > array which is indexed by type of io stat like "io_service_time" and
> > then indexed by sub io type that is "READ, WRITE, SYNC" etc. And we
> > can define a single function to read all kind of stats.
> >
> > That way we will not have to define one GET_STAT_INDEXED() and GET_STAT()
> > for each unindexed variable. If there are not multiple functions then,
> > then everywhere we don't have to pass around a function pointer to read
> > the variable stat and code should become much simpler. Example, code
> > follows.
> >
> >
> > enum stat_type {
> > ? ? ? ?BLKIO_STAT_TIME
> > ? ? ? ?BLKIO_STAT_SECTORS
> > ? ? ? ?BLKIO_STAT_WAIT_TIME
> > ? ? ? ?BLKIO_STAT_SERVICE_TIME
> > }
> >
> > enum stat_sub_type {
> > ? ? ? ?BLKIO_STAT_READ
> > ? ? ? ?BLKIO_STAT_WRITE
> > ? ? ? ?BLKIO_STAT_SYNC
> > }
> >
> > blkio_get_stat_var(enum stat_type var, enum stat_sub_type var_type) {
> >
> > ? ? ? ?if (var == BLKIO_STAT_TIME)
> > ? ? ? ? ? ? ? ?reutrn blkg->stat.time
> >
> > ? ? ? ?/* Same as above for sectors */
> >
> > ? ? ? ?return blkg->stat[var][var_type];
> > }
>
> I'm not sure adding this level of complexity (2nd dimension for stats)
> and the numerous if (..) statements (or switch) is warranted

I think we need only two conditions to check. BLKIO_STAT_TIME and
BLKIO_STAT_SECTROS. Rest of these fall into typed category and we
don't have to check for tyeps.

> when they
> only apply for the get_stat() and get_typed_stat() functions. This
> seems like a more complicated simplification :).

I think this really is simplification. This also gets rid of all the
function pointers you are passing around to differentiate between two
types of stats. Also gets rid of special macros to generate one function
each for one kind of stat. These dynamic macros make code hard to read
and understand.

Can you please try this two dimensional array for stats approach once.
I belive your code will be much easier to read.

>
> >
> > Also all these get_* function needs to be static.
> Done
> >
> >> +#define SHOW_FUNCTION_PER_GROUP(__VAR, get_stats, getvar, show_total) ? ? ? ?\
> >> ?static int blkiocg_##__VAR##_read(struct cgroup *cgroup, ? ? ? ? ? ? \
> >> - ? ? ? ? ? ? ? ? ? ? struct cftype *cftype, struct seq_file *m) ? ? ?\
> >> + ? ? ? ? ? ? struct cftype *cftype, struct cgroup_map_cb *cb) ? ? ? ?\
> >> ?{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> ? ? ? struct blkio_cgroup *blkcg; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> ? ? ? struct blkio_group *blkg; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> ? ? ? struct hlist_node *n; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> + ? ? uint64_t cgroup_total = 0; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> + ? ? char disk_id[10]; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> ? ? ? if (!cgroup_lock_live_group(cgroup)) ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> ? ? ? ? ? ? ? return -ENODEV; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> @@ -184,19 +295,32 @@ static int blkiocg_##__VAR##_read(struct cgroup *cgroup, ? ? ? ? ? ? ? ?\
> >> ? ? ? blkcg = cgroup_to_blkio_cgroup(cgroup); ? ? ? ? ? ? ? ? ? ? ? ? \
> >> ? ? ? rcu_read_lock(); ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> ? ? ? hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {\
> >> - ? ? ? ? ? ? if (blkg->dev) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> - ? ? ? ? ? ? ? ? ? ? seq_printf(m, "%u:%u %lu\n", MAJOR(blkg->dev), ?\
> >> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?MINOR(blkg->dev), blkg->__VAR); ? ? ? ?\
> >> + ? ? ? ? ? ? if (blkg->dev) { ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> + ? ? ? ? ? ? ? ? ? ? spin_lock_irq(&blkg->stats_lock); ? ? ? ? ? ? ? \
> >> + ? ? ? ? ? ? ? ? ? ? snprintf(disk_id, 10, "%u:%u", MAJOR(blkg->dev),\
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? MINOR(blkg->dev)); ? ? ? ? ? ? ?\
> >> + ? ? ? ? ? ? ? ? ? ? cgroup_total += get_stats(blkg, cb, getvar, ? ? \
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? disk_id); ? ? ? ? ? ? ? \
> >> + ? ? ? ? ? ? ? ? ? ? spin_unlock_irq(&blkg->stats_lock); ? ? ? ? ? ? \
> >> + ? ? ? ? ? ? } ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> ? ? ? } ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> + ? ? if (show_total) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> + ? ? ? ? ? ? cb->fill(cb, "Total", cgroup_total); ? ? ? ? ? ? ? ? ? ?\
> >> ? ? ? rcu_read_unlock(); ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> ? ? ? cgroup_unlock(); ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
> >> ? ? ? return 0; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> >> ?}
> >>
> >> -SHOW_FUNCTION_PER_GROUP(time);
> >> -SHOW_FUNCTION_PER_GROUP(sectors);
> >> +SHOW_FUNCTION_PER_GROUP(time, get_stat, get_time_stat, 0);
> >> +SHOW_FUNCTION_PER_GROUP(sectors, get_stat, get_sectors_stat, 0);
> >> +SHOW_FUNCTION_PER_GROUP(io_service_bytes, get_typed_stat,
> >> + ? ? ? ? ? ? ? ? ? ? get_io_service_bytes_stat, 1);
> >> +SHOW_FUNCTION_PER_GROUP(io_serviced, get_typed_stat, get_io_serviced_stat, 1);
> >> +SHOW_FUNCTION_PER_GROUP(io_service_time, get_typed_stat,
> >> + ? ? ? ? ? ? ? ? ? ? get_io_service_time_stat, 1);
> >> +SHOW_FUNCTION_PER_GROUP(io_wait_time, get_typed_stat, get_io_wait_time_stat, 1);
> >> ?#ifdef CONFIG_DEBUG_BLK_CGROUP
> >> -SHOW_FUNCTION_PER_GROUP(dequeue);
> >> +SHOW_FUNCTION_PER_GROUP(dequeue, get_stat, get_dequeue_stat, 0);
> >> ?#endif
> >> ?#undef SHOW_FUNCTION_PER_GROUP
> >>
> >> @@ -204,7 +328,7 @@ SHOW_FUNCTION_PER_GROUP(dequeue);
> >> ?void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
> >> ? ? ? ? ? ? ? ? ? ? ? unsigned long dequeue)
> >> ?{
> >> - ? ? blkg->dequeue += dequeue;
> >> + ? ? blkg->stats.dequeue += dequeue;
> >> ?}
> >> ?EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_dequeue_stats);
> >> ?#endif
> >> @@ -217,16 +341,38 @@ struct cftype blkio_files[] = {
> >> ? ? ? },
> >> ? ? ? {
> >> ? ? ? ? ? ? ? .name = "time",
> >> - ? ? ? ? ? ? .read_seq_string = blkiocg_time_read,
> >> + ? ? ? ? ? ? .read_map = blkiocg_time_read,
> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
> >> ? ? ? },
> >> ? ? ? {
> >> ? ? ? ? ? ? ? .name = "sectors",
> >> - ? ? ? ? ? ? .read_seq_string = blkiocg_sectors_read,
> >> + ? ? ? ? ? ? .read_map = blkiocg_sectors_read,
> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
> >> + ? ? },
> >> + ? ? {
> >> + ? ? ? ? ? ? .name = "io_service_bytes",
> >> + ? ? ? ? ? ? .read_map = blkiocg_io_service_bytes_read,
> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
> >> + ? ? },
> >> + ? ? {
> >> + ? ? ? ? ? ? .name = "io_serviced",
> >> + ? ? ? ? ? ? .read_map = blkiocg_io_serviced_read,
> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
> >> + ? ? },
> >> + ? ? {
> >> + ? ? ? ? ? ? .name = "io_service_time",
> >> + ? ? ? ? ? ? .read_map = blkiocg_io_service_time_read,
> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
> >> + ? ? },
> >> + ? ? {
> >> + ? ? ? ? ? ? .name = "io_wait_time",
> >> + ? ? ? ? ? ? .read_map = blkiocg_io_wait_time_read,
> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
> >> ? ? ? },
> >
> > blkiocg_reset_write() is invoked if a user does some write on any of the
> > above files. This will reset all the stats. So if I do echo x >
> > blkio.time, I will see all other files being reset? I think this is
> > little counter intutive.
> >
> > Either we need to reset only those specific stats (the file being written
> > to) or may be we can provide a separate file "blkio.reset_stats" to reset
> > all the stats?
>
> I added a note in the Documentation file that writing an int to any of
> the stats should reset all. I don't get the point of introducing
> another file for this and would want stats to be cleared all together
> in order to have say io_service_time and io_serviced in sync, i.e.,
> the io_service_time should always be the total service time for the
> IOs accounted for in io_serviced.
>

I agree that reset should mean reset of all stats. It is a matter of
whether to introduce a separate file or write to one file resets all the
other files. Though I find writting to one file resetting the stats of
other file un-intitutive, at the same time don't feel too excited about
introducing a new file just for reset.

Do we really need to introduce reset interface. Changing the ioscheduler to
noop or deadline and then setting it back to cfq should reset the stats for
all the cgroups.

In fact we can probably get rid of per cgroup reset stats interface. We
will then get rid of extra stat spin lock. If somebody wants to reset
the stats, just change the ioscheduler.

This will work until and unless somebody wants to reset the stats of one
cgroup but not the other.

Thanks
Vivek

2010-04-05 15:12:51

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 3/3] blkio: Increment the blkio cgroup stats for real now

On Fri, Apr 02, 2010 at 04:36:34PM -0700, Divyesh Shah wrote:
> On Fri, Apr 2, 2010 at 12:10 PM, Vivek Goyal <[email protected]> wrote:
> > On Thu, Apr 01, 2010 at 03:01:41PM -0700, Divyesh Shah wrote:
> >> We also add start_time_ns and io_start_time_ns fields to struct request
> >> here to record the time when a request is created and when it is
> >> dispatched to device. We use ns uints here as ms and jiffies are
> >> not very useful for non-rotational media.
> >>
> >> Signed-off-by: Divyesh Shah<[email protected]>
> >> ---
> >>
> >> ?block/blk-cgroup.c ? ? | ? 60 ++++++++++++++++++++++++++++++++++++++++++++++--
> >> ?block/blk-cgroup.h ? ? | ? 14 +++++++++--
> >> ?block/blk-core.c ? ? ? | ? ?6 +++--
> >> ?block/cfq-iosched.c ? ?| ? ?4 ++-
> >> ?include/linux/blkdev.h | ? 20 +++++++++++++++-
> >> ?5 files changed, 95 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> >> index ad6843f..9af7257 100644
> >> --- a/block/blk-cgroup.c
> >> +++ b/block/blk-cgroup.c
> >> @@ -15,6 +15,7 @@
> >> ?#include <linux/kdev_t.h>
> >> ?#include <linux/module.h>
> >> ?#include <linux/err.h>
> >> +#include <linux/blkdev.h>
> >> ?#include "blk-cgroup.h"
> >>
> >> ?static DEFINE_SPINLOCK(blkio_list_lock);
> >> @@ -55,6 +56,26 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
> >> ?}
> >> ?EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
> >>
> >> +/*
> >> + * Add to the appropriate stat variable depending on the request type.
> >> + * This should be called with the blkg->stats_lock held.
> >> + */
> >> +void io_add_stat(uint64_t *stat, uint64_t add, unsigned int flags)
> >> +{
> >> + ? ? if (flags & REQ_RW)
> >> + ? ? ? ? ? ? stat[IO_WRITE] += add;
> >> + ? ? else
> >> + ? ? ? ? ? ? stat[IO_READ] += add;
> >> + ? ? /*
> >> + ? ? ?* Everywhere in the block layer, an IO is treated as sync if it is a
> >> + ? ? ?* read or a SYNC write. We follow the same norm.
> >> + ? ? ?*/
> >> + ? ? if (!(flags & REQ_RW) || flags & REQ_RW_SYNC)
> >> + ? ? ? ? ? ? stat[IO_SYNC] += add;
> >> + ? ? else
> >> + ? ? ? ? ? ? stat[IO_ASYNC] += add;
> >> +}
> >> +
> >
> > Hi Divyesh,
> >
> > Can we have any request based information limited to cfq and not put that
> > in blkio-cgroup. The reason being that I am expecting that some kind of
> > max bw policy interface will not necessarily be implemented at CFQ
> > level. We might have to implement it at higher level so that it can
> > work with all dm/md devices. If that's the case, then it might very well
> > be either a bio based interface also.
> >
> > So just keeping that possibility in mind, can we keep blk-cgroup as
> > generic as possible and not necessarily make it dependent on "struct
> > request".
>
> Ok. I do understand the motivation for keeping the request related
> info out of blk-cgroup. Everything except the rq->cmd_flags can be
> easily done away with. Maybe I'll need to have CFQ send the sync and
> direction bits as args to the functions that need it. Not ideal coz
> we'll have functions with many args but I guess its not that bad too.
>
> >
> > If you implement, two dimensional arrays for stats then we can have
> > following function.
> >
> > blkio_add_stat(enum stat_type var enum stat_sub_type var_type, u64 val)
>
> I would want to avoid calls like these from CFQ into the blkcg code
> because many CFQ events trigger update for multiple stats (you'll see
> more with stats in later patchsets) and doing these calls
> independently for each stat would mean that we would also need to grab
> the stats_lock multiple times when we could've avoided that.

I understand the need to club the updates and reduce the need of taking
stats_lock multiple times. I was thinking of any of following.

- Get rid of reset interface per cgroup. Rely on changing ioscheduler on
request queue and that will get rid of stats_lock entirely.

- Can we use a function blkio_add_stat() with variable number of arguments
so that more than one stat can be updated in a single call?

If you have other ideas to implement it without assuming "struct rq" in
blk-cgroup, please do that.

Thanks
Vivek

2010-04-05 16:53:55

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 3/3] blkio: Increment the blkio cgroup stats for real now

On Mon, Apr 5, 2010 at 8:12 AM, Vivek Goyal <[email protected]> wrote:
> On Fri, Apr 02, 2010 at 04:36:34PM -0700, Divyesh Shah wrote:
>> On Fri, Apr 2, 2010 at 12:10 PM, Vivek Goyal <[email protected]> wrote:
>> > On Thu, Apr 01, 2010 at 03:01:41PM -0700, Divyesh Shah wrote:
>> >> We also add start_time_ns and io_start_time_ns fields to struct request
>> >> here to record the time when a request is created and when it is
>> >> dispatched to device. We use ns uints here as ms and jiffies are
>> >> not very useful for non-rotational media.
>> >>
>> >> Signed-off-by: Divyesh Shah<[email protected]>
>> >> ---
>> >>
>> >> ?block/blk-cgroup.c ? ? | ? 60 ++++++++++++++++++++++++++++++++++++++++++++++--
>> >> ?block/blk-cgroup.h ? ? | ? 14 +++++++++--
>> >> ?block/blk-core.c ? ? ? | ? ?6 +++--
>> >> ?block/cfq-iosched.c ? ?| ? ?4 ++-
>> >> ?include/linux/blkdev.h | ? 20 +++++++++++++++-
>> >> ?5 files changed, 95 insertions(+), 9 deletions(-)
>> >>
>> >> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> >> index ad6843f..9af7257 100644
>> >> --- a/block/blk-cgroup.c
>> >> +++ b/block/blk-cgroup.c
>> >> @@ -15,6 +15,7 @@
>> >> ?#include <linux/kdev_t.h>
>> >> ?#include <linux/module.h>
>> >> ?#include <linux/err.h>
>> >> +#include <linux/blkdev.h>
>> >> ?#include "blk-cgroup.h"
>> >>
>> >> ?static DEFINE_SPINLOCK(blkio_list_lock);
>> >> @@ -55,6 +56,26 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
>> >> ?}
>> >> ?EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
>> >>
>> >> +/*
>> >> + * Add to the appropriate stat variable depending on the request type.
>> >> + * This should be called with the blkg->stats_lock held.
>> >> + */
>> >> +void io_add_stat(uint64_t *stat, uint64_t add, unsigned int flags)
>> >> +{
>> >> + ? ? if (flags & REQ_RW)
>> >> + ? ? ? ? ? ? stat[IO_WRITE] += add;
>> >> + ? ? else
>> >> + ? ? ? ? ? ? stat[IO_READ] += add;
>> >> + ? ? /*
>> >> + ? ? ?* Everywhere in the block layer, an IO is treated as sync if it is a
>> >> + ? ? ?* read or a SYNC write. We follow the same norm.
>> >> + ? ? ?*/
>> >> + ? ? if (!(flags & REQ_RW) || flags & REQ_RW_SYNC)
>> >> + ? ? ? ? ? ? stat[IO_SYNC] += add;
>> >> + ? ? else
>> >> + ? ? ? ? ? ? stat[IO_ASYNC] += add;
>> >> +}
>> >> +
>> >
>> > Hi Divyesh,
>> >
>> > Can we have any request based information limited to cfq and not put that
>> > in blkio-cgroup. The reason being that I am expecting that some kind of
>> > max bw policy interface will not necessarily be implemented at CFQ
>> > level. We might have to implement it at higher level so that it can
>> > work with all dm/md devices. If that's the case, then it might very well
>> > be either a bio based interface also.
>> >
>> > So just keeping that possibility in mind, can we keep blk-cgroup as
>> > generic as possible and not necessarily make it dependent on "struct
>> > request".
>>
>> Ok. I do understand the motivation for keeping the request related
>> info out of blk-cgroup. Everything except the rq->cmd_flags can be
>> easily done away with. Maybe I'll need to have CFQ send the sync and
>> direction bits as args to the functions that need it. Not ideal coz
>> we'll have functions with many args but I guess its not that bad too.
>>
>> >
>> > If you implement, two dimensional arrays for stats then we can have
>> > following function.
>> >
>> > blkio_add_stat(enum stat_type var enum stat_sub_type var_type, u64 val)
>>
>> I would want to avoid calls like these from CFQ into the blkcg code
>> because many CFQ events trigger update for multiple stats (you'll see
>> more with stats in later patchsets) and doing these calls
>> independently for each stat would mean that we would also need to grab
>> the stats_lock multiple times when we could've avoided that.
>
> I understand the need to club the updates and reduce the need of taking
> stats_lock multiple times. I was thinking of any of following.
>
> - Get rid of reset interface per cgroup. Rely on changing ioscheduler on
> ?request queue and that will get rid of stats_lock entirely.

This takes away the ability to reset stats at will which is very
useful when debugging and for testing IO controller.

> - Can we use a function blkio_add_stat() with variable number of arguments
> ?so that more than one stat can be updated in a single call?

I really don't like this at all.

> If you have other ideas to implement it without assuming "struct rq" in
> blk-cgroup, please do that.

I've already got rid of any rq assumptions in blk-cgroup. The only
place where we're using rq is for rq_start_time_ns() and
rq_io-start_time_ns() functions but they are not used by the
blk-cgroup code directly (only CFQ uses them). For another user of
io-controller, we can implement a bio based functions.

>
> Thanks
> Vivek
>

2010-04-05 17:29:27

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 3/3] blkio: Increment the blkio cgroup stats for real now

On Mon, Apr 05, 2010 at 09:53:25AM -0700, Divyesh Shah wrote:
> On Mon, Apr 5, 2010 at 8:12 AM, Vivek Goyal <[email protected]> wrote:
> > On Fri, Apr 02, 2010 at 04:36:34PM -0700, Divyesh Shah wrote:
> >> On Fri, Apr 2, 2010 at 12:10 PM, Vivek Goyal <[email protected]> wrote:
> >> > On Thu, Apr 01, 2010 at 03:01:41PM -0700, Divyesh Shah wrote:
> >> >> We also add start_time_ns and io_start_time_ns fields to struct request
> >> >> here to record the time when a request is created and when it is
> >> >> dispatched to device. We use ns uints here as ms and jiffies are
> >> >> not very useful for non-rotational media.
> >> >>
> >> >> Signed-off-by: Divyesh Shah<[email protected]>
> >> >> ---
> >> >>
> >> >> ?block/blk-cgroup.c ? ? | ? 60 ++++++++++++++++++++++++++++++++++++++++++++++--
> >> >> ?block/blk-cgroup.h ? ? | ? 14 +++++++++--
> >> >> ?block/blk-core.c ? ? ? | ? ?6 +++--
> >> >> ?block/cfq-iosched.c ? ?| ? ?4 ++-
> >> >> ?include/linux/blkdev.h | ? 20 +++++++++++++++-
> >> >> ?5 files changed, 95 insertions(+), 9 deletions(-)
> >> >>
> >> >> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> >> >> index ad6843f..9af7257 100644
> >> >> --- a/block/blk-cgroup.c
> >> >> +++ b/block/blk-cgroup.c
> >> >> @@ -15,6 +15,7 @@
> >> >> ?#include <linux/kdev_t.h>
> >> >> ?#include <linux/module.h>
> >> >> ?#include <linux/err.h>
> >> >> +#include <linux/blkdev.h>
> >> >> ?#include "blk-cgroup.h"
> >> >>
> >> >> ?static DEFINE_SPINLOCK(blkio_list_lock);
> >> >> @@ -55,6 +56,26 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
> >> >> ?}
> >> >> ?EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
> >> >>
> >> >> +/*
> >> >> + * Add to the appropriate stat variable depending on the request type.
> >> >> + * This should be called with the blkg->stats_lock held.
> >> >> + */
> >> >> +void io_add_stat(uint64_t *stat, uint64_t add, unsigned int flags)
> >> >> +{
> >> >> + ? ? if (flags & REQ_RW)
> >> >> + ? ? ? ? ? ? stat[IO_WRITE] += add;
> >> >> + ? ? else
> >> >> + ? ? ? ? ? ? stat[IO_READ] += add;
> >> >> + ? ? /*
> >> >> + ? ? ?* Everywhere in the block layer, an IO is treated as sync if it is a
> >> >> + ? ? ?* read or a SYNC write. We follow the same norm.
> >> >> + ? ? ?*/
> >> >> + ? ? if (!(flags & REQ_RW) || flags & REQ_RW_SYNC)
> >> >> + ? ? ? ? ? ? stat[IO_SYNC] += add;
> >> >> + ? ? else
> >> >> + ? ? ? ? ? ? stat[IO_ASYNC] += add;
> >> >> +}
> >> >> +
> >> >
> >> > Hi Divyesh,
> >> >
> >> > Can we have any request based information limited to cfq and not put that
> >> > in blkio-cgroup. The reason being that I am expecting that some kind of
> >> > max bw policy interface will not necessarily be implemented at CFQ
> >> > level. We might have to implement it at higher level so that it can
> >> > work with all dm/md devices. If that's the case, then it might very well
> >> > be either a bio based interface also.
> >> >
> >> > So just keeping that possibility in mind, can we keep blk-cgroup as
> >> > generic as possible and not necessarily make it dependent on "struct
> >> > request".
> >>
> >> Ok. I do understand the motivation for keeping the request related
> >> info out of blk-cgroup. Everything except the rq->cmd_flags can be
> >> easily done away with. Maybe I'll need to have CFQ send the sync and
> >> direction bits as args to the functions that need it. Not ideal coz
> >> we'll have functions with many args but I guess its not that bad too.
> >>
> >> >
> >> > If you implement, two dimensional arrays for stats then we can have
> >> > following function.
> >> >
> >> > blkio_add_stat(enum stat_type var enum stat_sub_type var_type, u64 val)
> >>
> >> I would want to avoid calls like these from CFQ into the blkcg code
> >> because many CFQ events trigger update for multiple stats (you'll see
> >> more with stats in later patchsets) and doing these calls
> >> independently for each stat would mean that we would also need to grab
> >> the stats_lock multiple times when we could've avoided that.
> >
> > I understand the need to club the updates and reduce the need of taking
> > stats_lock multiple times. I was thinking of any of following.
> >
> > - Get rid of reset interface per cgroup. Rely on changing ioscheduler on
> > ?request queue and that will get rid of stats_lock entirely.
>
> This takes away the ability to reset stats at will which is very
> useful when debugging and for testing IO controller.
>

What do you mean by "reset stats at will"? You can change ioscheduler at
will and reset stats? The only possible issue I could think of is that only
admin can change the ioscheduler in providing per cgroup interface, one can
give write permission to indiviaul user and allow users to reset stats.

I am not sure in practice why would you allow a user to reset stats.
Especially if somebody's accounting software is based on these stats.

> > - Can we use a function blkio_add_stat() with variable number of arguments
> > ?so that more than one stat can be updated in a single call?
>
> I really don't like this at all.
>
> > If you have other ideas to implement it without assuming "struct rq" in
> > blk-cgroup, please do that.
>
> I've already got rid of any rq assumptions in blk-cgroup. The only
> place where we're using rq is for rq_start_time_ns() and
> rq_io-start_time_ns() functions but they are not used by the
> blk-cgroup code directly (only CFQ uses them). For another user of
> io-controller, we can implement a bio based functions.

Ok, you have made blkg->stats_lock visible to cfq. That's fine too.
Can you rename io_add_stat to blkio_add_stat. I think in V2 also, it is
still io_add_stat.

Vivek

2010-04-05 22:07:01

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 3/3] blkio: Increment the blkio cgroup stats for real now

On Mon, Apr 5, 2010 at 10:29 AM, Vivek Goyal <[email protected]> wrote:
> On Mon, Apr 05, 2010 at 09:53:25AM -0700, Divyesh Shah wrote:
>> On Mon, Apr 5, 2010 at 8:12 AM, Vivek Goyal <[email protected]> wrote:
>> > On Fri, Apr 02, 2010 at 04:36:34PM -0700, Divyesh Shah wrote:
>> >> On Fri, Apr 2, 2010 at 12:10 PM, Vivek Goyal <[email protected]> wrote:
>> >> > On Thu, Apr 01, 2010 at 03:01:41PM -0700, Divyesh Shah wrote:
>> >> >> We also add start_time_ns and io_start_time_ns fields to struct request
>> >> >> here to record the time when a request is created and when it is
>> >> >> dispatched to device. We use ns uints here as ms and jiffies are
>> >> >> not very useful for non-rotational media.
>> >> >>
>> >> >> Signed-off-by: Divyesh Shah<[email protected]>
>> >> >> ---
>> >> >>
>> >> >> ?block/blk-cgroup.c ? ? | ? 60 ++++++++++++++++++++++++++++++++++++++++++++++--
>> >> >> ?block/blk-cgroup.h ? ? | ? 14 +++++++++--
>> >> >> ?block/blk-core.c ? ? ? | ? ?6 +++--
>> >> >> ?block/cfq-iosched.c ? ?| ? ?4 ++-
>> >> >> ?include/linux/blkdev.h | ? 20 +++++++++++++++-
>> >> >> ?5 files changed, 95 insertions(+), 9 deletions(-)
>> >> >>
>> >> >> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> >> >> index ad6843f..9af7257 100644
>> >> >> --- a/block/blk-cgroup.c
>> >> >> +++ b/block/blk-cgroup.c
>> >> >> @@ -15,6 +15,7 @@
>> >> >> ?#include <linux/kdev_t.h>
>> >> >> ?#include <linux/module.h>
>> >> >> ?#include <linux/err.h>
>> >> >> +#include <linux/blkdev.h>
>> >> >> ?#include "blk-cgroup.h"
>> >> >>
>> >> >> ?static DEFINE_SPINLOCK(blkio_list_lock);
>> >> >> @@ -55,6 +56,26 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
>> >> >> ?}
>> >> >> ?EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
>> >> >>
>> >> >> +/*
>> >> >> + * Add to the appropriate stat variable depending on the request type.
>> >> >> + * This should be called with the blkg->stats_lock held.
>> >> >> + */
>> >> >> +void io_add_stat(uint64_t *stat, uint64_t add, unsigned int flags)
>> >> >> +{
>> >> >> + ? ? if (flags & REQ_RW)
>> >> >> + ? ? ? ? ? ? stat[IO_WRITE] += add;
>> >> >> + ? ? else
>> >> >> + ? ? ? ? ? ? stat[IO_READ] += add;
>> >> >> + ? ? /*
>> >> >> + ? ? ?* Everywhere in the block layer, an IO is treated as sync if it is a
>> >> >> + ? ? ?* read or a SYNC write. We follow the same norm.
>> >> >> + ? ? ?*/
>> >> >> + ? ? if (!(flags & REQ_RW) || flags & REQ_RW_SYNC)
>> >> >> + ? ? ? ? ? ? stat[IO_SYNC] += add;
>> >> >> + ? ? else
>> >> >> + ? ? ? ? ? ? stat[IO_ASYNC] += add;
>> >> >> +}
>> >> >> +
>> >> >
>> >> > Hi Divyesh,
>> >> >
>> >> > Can we have any request based information limited to cfq and not put that
>> >> > in blkio-cgroup. The reason being that I am expecting that some kind of
>> >> > max bw policy interface will not necessarily be implemented at CFQ
>> >> > level. We might have to implement it at higher level so that it can
>> >> > work with all dm/md devices. If that's the case, then it might very well
>> >> > be either a bio based interface also.
>> >> >
>> >> > So just keeping that possibility in mind, can we keep blk-cgroup as
>> >> > generic as possible and not necessarily make it dependent on "struct
>> >> > request".
>> >>
>> >> Ok. I do understand the motivation for keeping the request related
>> >> info out of blk-cgroup. Everything except the rq->cmd_flags can be
>> >> easily done away with. Maybe I'll need to have CFQ send the sync and
>> >> direction bits as args to the functions that need it. Not ideal coz
>> >> we'll have functions with many args but I guess its not that bad too.
>> >>
>> >> >
>> >> > If you implement, two dimensional arrays for stats then we can have
>> >> > following function.
>> >> >
>> >> > blkio_add_stat(enum stat_type var enum stat_sub_type var_type, u64 val)
>> >>
>> >> I would want to avoid calls like these from CFQ into the blkcg code
>> >> because many CFQ events trigger update for multiple stats (you'll see
>> >> more with stats in later patchsets) and doing these calls
>> >> independently for each stat would mean that we would also need to grab
>> >> the stats_lock multiple times when we could've avoided that.
>> >
>> > I understand the need to club the updates and reduce the need of taking
>> > stats_lock multiple times. I was thinking of any of following.
>> >
>> > - Get rid of reset interface per cgroup. Rely on changing ioscheduler on
>> > ?request queue and that will get rid of stats_lock entirely.
>>
>> This takes away the ability to reset stats at will which is very
>> useful when debugging and for testing IO controller.
>>
>
> What do you mean by "reset stats at will"? You can change ioscheduler at
> will and reset stats? The only possible issue I could think of is that only
> admin can change the ioscheduler in providing per cgroup interface, one can
> give write permission to indiviaul user and allow users to reset stats.

I should've said reset stats for a given cgroup at will. This is
mostly needed when debugging and for automated tests where we're
testing multiple cgroups and/or IO controller behavior between
different workloads (sync readers, buffered writers, sync_writers,
directIO read/write, etc). Resetting cgroup stats is something we use
regularly and depend on. If you feel very strongly about this being
counter-intuitive I can add a reset_stats file and make the other
stats read-only.

> I am not sure in practice why would you allow a user to reset stats.
> Especially if somebody's accounting software is based on these stats.
>
>> > - Can we use a function blkio_add_stat() with variable number of arguments
>> > ?so that more than one stat can be updated in a single call?
>>
>> I really don't like this at all.
>>
>> > If you have other ideas to implement it without assuming "struct rq" in
>> > blk-cgroup, please do that.
>>
>> I've already got rid of any rq assumptions in blk-cgroup. The only
>> place where we're using rq is for rq_start_time_ns() and
>> rq_io-start_time_ns() functions but they are not used by the
>> blk-cgroup code directly (only CFQ uses them). For another user of
>> io-controller, we can implement a bio based functions.
>
> Ok, you have made blkg->stats_lock visible to cfq. That's fine too.

By visible you mean accessible through cfqg->blkg, right? However, I
don't intend to use it anywhere in cfq code.

> Can you rename io_add_stat to blkio_add_stat. I think in V2 also, it is
> still io_add_stat.

Sorry I missed that one. Will do in v3.
>
> Vivek
>

2010-04-05 22:17:07

by Divyesh Shah

[permalink] [raw]
Subject: Re: [PATCH 2/3] blkio: Add io controller stats like

On Mon, Apr 5, 2010 at 7:45 AM, Vivek Goyal <[email protected]> wrote:
> On Fri, Apr 02, 2010 at 01:53:30PM -0700, Divyesh Shah wrote:
>
> [..]
>> >> +#define GET_STAT_INDEXED(__VAR) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> +uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int type) ? ? ? ? ? ? ?\
>> >> +{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> + ? ? return blkg->stats.__VAR[type]; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> +} ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> +
>> >> +GET_STAT_INDEXED(io_service_bytes);
>> >> +GET_STAT_INDEXED(io_serviced);
>> >> +GET_STAT_INDEXED(io_service_time);
>> >> +GET_STAT_INDEXED(io_wait_time);
>> >> +#undef GET_STAT_INDEXED
>> >> +
>> >> +#define GET_STAT(__VAR, __CONV) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> +uint64_t get_##__VAR##_stat(struct blkio_group *blkg, int dummy) ? ? \
>> >> +{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> + ? ? uint64_t data = blkg->stats.__VAR; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> + ? ? if (__CONV) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> + ? ? ? ? ? ? data = (uint64_t)jiffies_to_msecs(data) * NSEC_PER_MSEC;\
>> >> + ? ? return data; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> +}
>> >> +
>> >> +GET_STAT(time, 1);
>> >> +GET_STAT(sectors, 0);
>> >> +#ifdef CONFIG_DEBUG_BLK_CGROUP
>> >> +GET_STAT(dequeue, 0);
>> >> +#endif
>> >> +#undef GET_STAT
>> >> +
>> >
>> > I think now so many macros to define different kind of functions is
>> > becoming really confusing. How about creating a two dimensional stat
>> > array which is indexed by type of io stat like "io_service_time" and
>> > then indexed by sub io type that is "READ, WRITE, SYNC" etc. And we
>> > can define a single function to read all kind of stats.
>> >
>> > That way we will not have to define one GET_STAT_INDEXED() and GET_STAT()
>> > for each unindexed variable. If there are not multiple functions then,
>> > then everywhere we don't have to pass around a function pointer to read
>> > the variable stat and code should become much simpler. Example, code
>> > follows.
>> >
>> >
>> > enum stat_type {
>> > ? ? ? ?BLKIO_STAT_TIME
>> > ? ? ? ?BLKIO_STAT_SECTORS
>> > ? ? ? ?BLKIO_STAT_WAIT_TIME
>> > ? ? ? ?BLKIO_STAT_SERVICE_TIME
>> > }
>> >
>> > enum stat_sub_type {
>> > ? ? ? ?BLKIO_STAT_READ
>> > ? ? ? ?BLKIO_STAT_WRITE
>> > ? ? ? ?BLKIO_STAT_SYNC
>> > }
>> >
>> > blkio_get_stat_var(enum stat_type var, enum stat_sub_type var_type) {
>> >
>> > ? ? ? ?if (var == BLKIO_STAT_TIME)
>> > ? ? ? ? ? ? ? ?reutrn blkg->stat.time
>> >
>> > ? ? ? ?/* Same as above for sectors */
>> >
>> > ? ? ? ?return blkg->stat[var][var_type];
>> > }
>>
>> I'm not sure adding this level of complexity (2nd dimension for stats)
>> and the numerous if (..) statements (or switch) is warranted
>
> I think we need only two conditions to check. BLKIO_STAT_TIME and
> BLKIO_STAT_SECTROS. Rest of these fall into typed category and we
> don't have to check for tyeps.
>
>> when they
>> only apply for the get_stat() and get_typed_stat() functions. This
>> seems like a more complicated simplification :).
>
> I think this really is simplification. This also gets rid of all the
> function pointers you are passing around to differentiate between two
> types of stats. Also gets rid of special macros to generate one function
> each for one kind of stat. These dynamic macros make code hard to read
> and understand.
>
> Can you please try this two dimensional array for stats approach once.
> I belive your code will be much easier to read.

There are 2 conditions now but as I mentioned that this is one of 2-3
patchsets for adding stats. More stats will increase the special
casing. I do agree the macros aren't exactly easy to read. I'll try
your suggested approach next and post a patchset when I'm done.

>
>>
>> >
>> > Also all these get_* function needs to be static.
>> Done
>> >
>> >> +#define SHOW_FUNCTION_PER_GROUP(__VAR, get_stats, getvar, show_total) ? ? ? ?\
>> >> ?static int blkiocg_##__VAR##_read(struct cgroup *cgroup, ? ? ? ? ? ? \
>> >> - ? ? ? ? ? ? ? ? ? ? struct cftype *cftype, struct seq_file *m) ? ? ?\
>> >> + ? ? ? ? ? ? struct cftype *cftype, struct cgroup_map_cb *cb) ? ? ? ?\
>> >> ?{ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> ? ? ? struct blkio_cgroup *blkcg; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> ? ? ? struct blkio_group *blkg; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> ? ? ? struct hlist_node *n; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> + ? ? uint64_t cgroup_total = 0; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> + ? ? char disk_id[10]; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> ? ? ? if (!cgroup_lock_live_group(cgroup)) ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> ? ? ? ? ? ? ? return -ENODEV; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> @@ -184,19 +295,32 @@ static int blkiocg_##__VAR##_read(struct cgroup *cgroup, ? ? ? ? ? ? ? ?\
>> >> ? ? ? blkcg = cgroup_to_blkio_cgroup(cgroup); ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> ? ? ? rcu_read_lock(); ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> ? ? ? hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {\
>> >> - ? ? ? ? ? ? if (blkg->dev) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> - ? ? ? ? ? ? ? ? ? ? seq_printf(m, "%u:%u %lu\n", MAJOR(blkg->dev), ?\
>> >> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?MINOR(blkg->dev), blkg->__VAR); ? ? ? ?\
>> >> + ? ? ? ? ? ? if (blkg->dev) { ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> + ? ? ? ? ? ? ? ? ? ? spin_lock_irq(&blkg->stats_lock); ? ? ? ? ? ? ? \
>> >> + ? ? ? ? ? ? ? ? ? ? snprintf(disk_id, 10, "%u:%u", MAJOR(blkg->dev),\
>> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? MINOR(blkg->dev)); ? ? ? ? ? ? ?\
>> >> + ? ? ? ? ? ? ? ? ? ? cgroup_total += get_stats(blkg, cb, getvar, ? ? \
>> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? disk_id); ? ? ? ? ? ? ? \
>> >> + ? ? ? ? ? ? ? ? ? ? spin_unlock_irq(&blkg->stats_lock); ? ? ? ? ? ? \
>> >> + ? ? ? ? ? ? } ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> ? ? ? } ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> + ? ? if (show_total) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> + ? ? ? ? ? ? cb->fill(cb, "Total", cgroup_total); ? ? ? ? ? ? ? ? ? ?\
>> >> ? ? ? rcu_read_unlock(); ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> ? ? ? cgroup_unlock(); ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?\
>> >> ? ? ? return 0; ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
>> >> ?}
>> >>
>> >> -SHOW_FUNCTION_PER_GROUP(time);
>> >> -SHOW_FUNCTION_PER_GROUP(sectors);
>> >> +SHOW_FUNCTION_PER_GROUP(time, get_stat, get_time_stat, 0);
>> >> +SHOW_FUNCTION_PER_GROUP(sectors, get_stat, get_sectors_stat, 0);
>> >> +SHOW_FUNCTION_PER_GROUP(io_service_bytes, get_typed_stat,
>> >> + ? ? ? ? ? ? ? ? ? ? get_io_service_bytes_stat, 1);
>> >> +SHOW_FUNCTION_PER_GROUP(io_serviced, get_typed_stat, get_io_serviced_stat, 1);
>> >> +SHOW_FUNCTION_PER_GROUP(io_service_time, get_typed_stat,
>> >> + ? ? ? ? ? ? ? ? ? ? get_io_service_time_stat, 1);
>> >> +SHOW_FUNCTION_PER_GROUP(io_wait_time, get_typed_stat, get_io_wait_time_stat, 1);
>> >> ?#ifdef CONFIG_DEBUG_BLK_CGROUP
>> >> -SHOW_FUNCTION_PER_GROUP(dequeue);
>> >> +SHOW_FUNCTION_PER_GROUP(dequeue, get_stat, get_dequeue_stat, 0);
>> >> ?#endif
>> >> ?#undef SHOW_FUNCTION_PER_GROUP
>> >>
>> >> @@ -204,7 +328,7 @@ SHOW_FUNCTION_PER_GROUP(dequeue);
>> >> ?void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
>> >> ? ? ? ? ? ? ? ? ? ? ? unsigned long dequeue)
>> >> ?{
>> >> - ? ? blkg->dequeue += dequeue;
>> >> + ? ? blkg->stats.dequeue += dequeue;
>> >> ?}
>> >> ?EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_dequeue_stats);
>> >> ?#endif
>> >> @@ -217,16 +341,38 @@ struct cftype blkio_files[] = {
>> >> ? ? ? },
>> >> ? ? ? {
>> >> ? ? ? ? ? ? ? .name = "time",
>> >> - ? ? ? ? ? ? .read_seq_string = blkiocg_time_read,
>> >> + ? ? ? ? ? ? .read_map = blkiocg_time_read,
>> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> >> ? ? ? },
>> >> ? ? ? {
>> >> ? ? ? ? ? ? ? .name = "sectors",
>> >> - ? ? ? ? ? ? .read_seq_string = blkiocg_sectors_read,
>> >> + ? ? ? ? ? ? .read_map = blkiocg_sectors_read,
>> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> >> + ? ? },
>> >> + ? ? {
>> >> + ? ? ? ? ? ? .name = "io_service_bytes",
>> >> + ? ? ? ? ? ? .read_map = blkiocg_io_service_bytes_read,
>> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> >> + ? ? },
>> >> + ? ? {
>> >> + ? ? ? ? ? ? .name = "io_serviced",
>> >> + ? ? ? ? ? ? .read_map = blkiocg_io_serviced_read,
>> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> >> + ? ? },
>> >> + ? ? {
>> >> + ? ? ? ? ? ? .name = "io_service_time",
>> >> + ? ? ? ? ? ? .read_map = blkiocg_io_service_time_read,
>> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> >> + ? ? },
>> >> + ? ? {
>> >> + ? ? ? ? ? ? .name = "io_wait_time",
>> >> + ? ? ? ? ? ? .read_map = blkiocg_io_wait_time_read,
>> >> + ? ? ? ? ? ? .write_u64 = blkiocg_reset_write,
>> >> ? ? ? },
>> >
>> > blkiocg_reset_write() is invoked if a user does some write on any of the
>> > above files. This will reset all the stats. So if I do echo x >
>> > blkio.time, I will see all other files being reset? I think this is
>> > little counter intutive.
>> >
>> > Either we need to reset only those specific stats (the file being written
>> > to) or may be we can provide a separate file "blkio.reset_stats" to reset
>> > all the stats?
>>
>> I added a note in the Documentation file that writing an int to any of
>> the stats should reset all. I don't get the point of introducing
>> another file for this and would want stats to be cleared all together
>> in order to have say io_service_time and io_serviced in sync, i.e.,
>> the io_service_time should always be the total service time for the
>> IOs accounted for in io_serviced.
>>
>
> I agree that reset should mean reset of all stats. It is a matter of
> whether to introduce a separate file or write to one file resets all the
> other files. Though I find writting to one file resetting the stats of
> other file un-intitutive, at the same time don't feel too excited about
> introducing a new file just for reset.
>
> Do we really need to introduce reset interface. Changing the ioscheduler to
> noop or deadline and then setting it back to cfq should reset the stats for
> all the cgroups.
>
> In fact we can probably get rid of per cgroup reset stats interface. We
> will then get rid of extra stat spin lock. If somebody wants to reset
> the stats, just change the ioscheduler.
> This will work until and unless somebody wants to reset the stats of one
> cgroup but not the other.

As mentioned on the other thread this is exactly what I find the main
use case of this interface. Once again, if you feel too strongly about
it we can have a separate reset_stats interface to make it more
intuitive.

>
> Thanks
> Vivek
>