2011-05-18 19:13:34

by Vivek Goyal

Subject: [RFC PATCH 00/14] blk-throttle: lockless bio processing for no throttle rule group

Hi,

The block throttling code takes the request queue lock for every incoming bio
(blk_throtl_bio()). This is true even if there are no throttle rules in
the group. This is the common case for the root cgroup: distributions
ship with throttling support compiled in, but the vast majority of users
never specify a throttling rule.

This patch series tries to make bio processing lockless (no request
queue lock) if there are no rules specified for the group. Once
a bio is submitted, we search for the group under rcu_read_lock(),
update the stats and release the rcu lock. The request queue lock is taken
only if there are throttling rules specified in the group.

I have made some of the dispatch stats per cpu so that they can be updated
without taking the request queue lock.
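
In outline, the fast path then looks like this (a simplified sketch; the
helper names follow the patches below, and patch 14 has the real code):

	rcu_read_lock();
	blkcg = task_blkio_cgroup(current);
	tg = throtl_find_tg(td, blkcg);
	if (tg && tg_no_rule_group(tg, rw)) {
		/* No rules for this group: update the per cpu dispatch
		 * stats and let the bio through, no queue lock needed. */
		blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
					rw, bio->bi_rw & REQ_SYNC);
		rcu_read_unlock();
		return 0;
	}
	rcu_read_unlock();
	/* Rules exist: fall back to the queue lock path. */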

On my system, for the simple dd below, the request queue lock acquisition
count went down by roughly 11%.

dd if=/mnt/zerofile-1G of=/dev/null bs=4K iflag=direct

lockstat output, vanilla kernel
-------------------------------
class name                     acquisitions   holdtime-total

&(&q->__queue_lock)->rlock:         2360944       1850183.07

lockstat output, patched kernel
-------------------------------
class name                     acquisitions   holdtime-total

&(&q->__queue_lock)->rlock:         2098599       1430478.79
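
(For reference: acquisitions drop from 2360944 to 2098599, i.e.
(2360944 - 2098599)/2360944 ~= 11.1%, and total hold time drops from
1850183.07 to 1430478.79, roughly a 22.7% reduction.)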


I tested on a 4 cpu system doing IO to one SSD and did not see any
significant improvement in throughput. I suspect I never saturated
the cpus, hence no visible throughput gain. I will see if I can get
more testing done on this and check whether IO throughput improves.

Jens, the first patch of the series is already in your for-linus branch. I
was waiting for it to be pushed to Linus, after which I can drop that first
patch.

Thanks
Vivek


Vivek Goyal (14):
blk-throttle: Use task_subsys_state() to determine a task's
blkio_cgroup
blk-throttle: Do the new group initialization with the help of a
function
blk-cgroup: move some fields of unaccounted_time file under right
config option
cfq-iosched: Get rid of redundant function parameter "create"
cfq-iosched: Fix a possible race with cfq cgroup removal code
blk-cgroup: Allow sleeping while dynamically allocating a group
blk-throttle: Dynamically allocate root group
blk-throttle: Introduce a helper function to fill in device details
blk-throttle: Use helper function to add root throtl group to lists
blk-throttle: Free up a group only after one rcu grace period
blk-throttle: Make dispatch stats per cpu
blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch
stats
blk-throttle: Make no throttling rule group processing lockless

block/blk-cgroup.c | 187 ++++++++++++++++++++++++------
block/blk-cgroup.h | 39 +++++--
block/blk-core.c | 3 +-
block/blk-throttle.c | 318 +++++++++++++++++++++++++++++++++++++-------------
block/cfq-iosched.c | 207 ++++++++++++++++++++++++---------
5 files changed, 578 insertions(+), 176 deletions(-)


2011-05-18 19:15:42

by Vivek Goyal

Subject: [PATCH 01/14] blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup

Currently we first map the task to a cgroup and then the cgroup to a
blkio_cgroup. There is a more direct way to get to the blkio_cgroup
from the task, using task_subsys_state(). Use that.

The real reason for the fix is that it also avoids a race in the generic
cgroup code. During remount/umount, rebind_subsystems() is called and
it can do the following without any rcu protection:

cgrp->subsys[i] = NULL;

That means if somebody got hold of a cgroup under rcu and then tried
to dereference cgroup->subsys[] to get to the blkio_cgroup, it would get
NULL, which is wrong. I was running into this race condition with LTP
running on an upstream-derived kernel, and it led to a crash.

So ideally we should also fix the generic cgroup code to wait for an rcu
grace period before setting the pointer to NULL. Li Zefan is not very keen
on introducing synchronize_wait() as he thinks it will slow
down mount/remount/umount operations.

So for the time being, at least fix the kernel crash by taking a more
direct route to the blkio_cgroup.
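
For reference, the change in the lookup route (a sketch matching the diff
below):

	/* old: task -> cgroup -> blkio_cgroup; the second step
	 * dereferences cgrp->subsys[] and races with
	 * rebind_subsystems() clearing it */
	cgroup = task_cgroup(current, blkio_subsys_id);
	blkcg = cgroup_to_blkio_cgroup(cgroup);

	/* new: straight from the task's subsys state */
	blkcg = task_blkio_cgroup(current);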

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 7 +++++++
block/blk-cgroup.h | 3 +++
block/blk-throttle.c | 9 ++++-----
block/cfq-iosched.c | 11 +++++------
4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f0605ab..471fdcc 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -114,6 +114,13 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
}
EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);

+struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk)
+{
+ return container_of(task_subsys_state(tsk, blkio_subsys_id),
+ struct blkio_cgroup, css);
+}
+EXPORT_SYMBOL_GPL(task_blkio_cgroup);
+
static inline void
blkio_update_group_weight(struct blkio_group *blkg, unsigned int weight)
{
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 10919fa..c774930 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -291,6 +291,7 @@ static inline void blkiocg_set_start_empty_time(struct blkio_group *blkg) {}
#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
extern struct blkio_cgroup blkio_root_cgroup;
extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
+extern struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk);
extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key, dev_t dev,
enum blkio_policy_id plid);
@@ -314,6 +315,8 @@ void blkiocg_update_io_remove_stats(struct blkio_group *blkg,
struct cgroup;
static inline struct blkio_cgroup *
cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }
+static inline struct blkio_cgroup *
+task_blkio_cgroup(struct task_struct *tsk) { return NULL; }

static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key, dev_t dev,
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0475a22..252a81a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -160,9 +160,8 @@ static void throtl_put_tg(struct throtl_grp *tg)
}

static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
- struct cgroup *cgroup)
+ struct blkio_cgroup *blkcg)
{
- struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
struct throtl_grp *tg = NULL;
void *key = td;
struct backing_dev_info *bdi = &td->queue->backing_dev_info;
@@ -229,12 +228,12 @@ done:

static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
{
- struct cgroup *cgroup;
struct throtl_grp *tg = NULL;
+ struct blkio_cgroup *blkcg;

rcu_read_lock();
- cgroup = task_cgroup(current, blkio_subsys_id);
- tg = throtl_find_alloc_tg(td, cgroup);
+ blkcg = task_blkio_cgroup(current);
+ tg = throtl_find_alloc_tg(td, blkcg);
if (!tg)
tg = &td->root_tg;
rcu_read_unlock();
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5b52011..ab7a9e6 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1014,10 +1014,9 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
cfqg->needs_update = true;
}

-static struct cfq_group *
-cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
+ struct blkio_cgroup *blkcg, int create)
{
- struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
struct cfq_group *cfqg = NULL;
void *key = cfqd;
int i, j;
@@ -1079,12 +1078,12 @@ done:
*/
static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
{
- struct cgroup *cgroup;
+ struct blkio_cgroup *blkcg;
struct cfq_group *cfqg = NULL;

rcu_read_lock();
- cgroup = task_cgroup(current, blkio_subsys_id);
- cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
+ blkcg = task_blkio_cgroup(current);
+ cfqg = cfq_find_alloc_cfqg(cfqd, blkcg, create);
if (!cfqg && create)
cfqg = &cfqd->root_group;
rcu_read_unlock();
--
1.7.1

2011-05-18 19:13:30

by Vivek Goyal

Subject: [PATCH 02/14] blk-throttle: Do the new group initialization with the help of a function

Group initialization code currently lives in two places: root group
initialization in blk_throtl_init() and dynamically allocated groups
in throtl_find_alloc_tg(). Create a common function and use it in both
places.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 64 +++++++++++++++++++++++++++----------------------
1 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 252a81a..fa9a900 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -159,6 +159,35 @@ static void throtl_put_tg(struct throtl_grp *tg)
kfree(tg);
}

+static void throtl_init_group(struct throtl_grp *tg)
+{
+ INIT_HLIST_NODE(&tg->tg_node);
+ RB_CLEAR_NODE(&tg->rb_node);
+ bio_list_init(&tg->bio_lists[0]);
+ bio_list_init(&tg->bio_lists[1]);
+ tg->limits_changed = false;
+
+ /* Practically unlimited BW */
+ tg->bps[0] = tg->bps[1] = -1;
+ tg->iops[0] = tg->iops[1] = -1;
+
+ /*
+ * Take the initial reference that will be released on destroy
+ * This can be thought of a joint reference by cgroup and
+ * request queue which will be dropped by either request queue
+ * exit or cgroup deletion path depending on who is exiting first.
+ */
+ atomic_set(&tg->ref, 1);
+}
+
+/* Should be called with rcu read lock held (needed for blkcg) */
+static void
+throtl_add_group_to_td_list(struct throtl_data *td, struct throtl_grp *tg)
+{
+ hlist_add_head(&tg->tg_node, &td->tg_list);
+ td->nr_undestroyed_grps++;
+}
+
static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
struct blkio_cgroup *blkcg)
{
@@ -196,19 +225,7 @@ static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
if (!tg)
goto done;

- INIT_HLIST_NODE(&tg->tg_node);
- RB_CLEAR_NODE(&tg->rb_node);
- bio_list_init(&tg->bio_lists[0]);
- bio_list_init(&tg->bio_lists[1]);
- td->limits_changed = false;
-
- /*
- * Take the initial reference that will be released on destroy
- * This can be thought of a joint reference by cgroup and
- * request queue which will be dropped by either request queue
- * exit or cgroup deletion path depending on who is exiting first.
- */
- atomic_set(&tg->ref, 1);
+ throtl_init_group(tg);

/* Add group onto cgroup list */
sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
@@ -220,8 +237,7 @@ static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
tg->iops[READ] = blkcg_get_read_iops(blkcg, tg->blkg.dev);
tg->iops[WRITE] = blkcg_get_write_iops(blkcg, tg->blkg.dev);

- hlist_add_head(&tg->tg_node, &td->tg_list);
- td->nr_undestroyed_grps++;
+ throtl_add_group_to_td_list(td, tg);
done:
return tg;
}
@@ -1060,18 +1076,11 @@ int blk_throtl_init(struct request_queue *q)
INIT_HLIST_HEAD(&td->tg_list);
td->tg_service_tree = THROTL_RB_ROOT;
td->limits_changed = false;
+ INIT_DELAYED_WORK(&td->throtl_work, blk_throtl_work);

/* Init root group */
tg = &td->root_tg;
- INIT_HLIST_NODE(&tg->tg_node);
- RB_CLEAR_NODE(&tg->rb_node);
- bio_list_init(&tg->bio_lists[0]);
- bio_list_init(&tg->bio_lists[1]);
-
- /* Practically unlimited BW */
- tg->bps[0] = tg->bps[1] = -1;
- tg->iops[0] = tg->iops[1] = -1;
- td->limits_changed = false;
+ throtl_init_group(tg);

/*
* Set root group reference to 2. One reference will be dropped when
@@ -1080,16 +1089,13 @@ int blk_throtl_init(struct request_queue *q)
* as it is statically allocated and gets destroyed when throtl_data
* goes away.
*/
- atomic_set(&tg->ref, 2);
- hlist_add_head(&tg->tg_node, &td->tg_list);
- td->nr_undestroyed_grps++;
-
- INIT_DELAYED_WORK(&td->throtl_work, blk_throtl_work);
+ atomic_inc(&tg->ref);

rcu_read_lock();
blkiocg_add_blkio_group(&blkio_root_cgroup, &tg->blkg, (void *)td,
0, BLKIO_POLICY_THROTL);
rcu_read_unlock();
+ throtl_add_group_to_td_list(td, tg);

/* Attach throtl data to request queue */
td->queue = q;
--
1.7.1

2011-05-18 19:14:24

by Vivek Goyal

Subject: [PATCH 03/14] blk-cgroup: move some fields of unaccounted_time file under right config option

The cgroup unaccounted_time file is created only if CONFIG_DEBUG_BLK_CGROUP=y,
yet some of its backing fields are outside this config option. Fix that.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 2 ++
block/blk-cgroup.h | 7 ++++---
2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 471fdcc..b0592bc 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -385,7 +385,9 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time,

spin_lock_irqsave(&blkg->stats_lock, flags);
blkg->stats.time += time;
+#ifdef CONFIG_DEBUG_BLK_CGROUP
blkg->stats.unaccounted_time += unaccounted_time;
+#endif
spin_unlock_irqrestore(&blkg->stats_lock, flags);
}
EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index c774930..63f1ef4 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -49,9 +49,9 @@ enum stat_type {
/* All the single valued stats go below this */
BLKIO_STAT_TIME,
BLKIO_STAT_SECTORS,
+#ifdef CONFIG_DEBUG_BLK_CGROUP
/* Time not charged to this cgroup */
BLKIO_STAT_UNACCOUNTED_TIME,
-#ifdef CONFIG_DEBUG_BLK_CGROUP
BLKIO_STAT_AVG_QUEUE_SIZE,
BLKIO_STAT_IDLE_TIME,
BLKIO_STAT_EMPTY_TIME,
@@ -117,10 +117,11 @@ struct blkio_group_stats {
/* total disk time and nr sectors dispatched by this group */
uint64_t time;
uint64_t sectors;
- /* Time not charged to this cgroup */
- uint64_t unaccounted_time;
uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
#ifdef CONFIG_DEBUG_BLK_CGROUP
+ /* Time not charged to this cgroup */
+ uint64_t unaccounted_time;
+
/* Sum of number of IOs queued across all samples */
uint64_t avg_queue_size_sum;
/* Count of samples taken for average */
--
1.7.1

2011-05-18 19:13:32

by Vivek Goyal

Subject: [PATCH 04/14] cfq-iosched: Get rid of redundant function parameter "create"

The only caller of cfq_find_alloc_cfqg() passes "create" = 1, so the
function parameter is redundant. Get rid of it.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 18 +++++++++---------
1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ab7a9e6..a905b55 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1015,7 +1015,7 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
}

static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
- struct blkio_cgroup *blkcg, int create)
+ struct blkio_cgroup *blkcg)
{
struct cfq_group *cfqg = NULL;
void *key = cfqd;
@@ -1030,7 +1030,7 @@ static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
cfqg->blkg.dev = MKDEV(major, minor);
goto done;
}
- if (cfqg || !create)
+ if (cfqg)
goto done;

cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
@@ -1073,18 +1073,18 @@ done:
}

/*
- * Search for the cfq group current task belongs to. If create = 1, then also
- * create the cfq group if it does not exist. request_queue lock must be held.
+ * Search for the cfq group current task belongs to. request_queue lock must
+ * be held.
*/
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
{
struct blkio_cgroup *blkcg;
struct cfq_group *cfqg = NULL;

rcu_read_lock();
blkcg = task_blkio_cgroup(current);
- cfqg = cfq_find_alloc_cfqg(cfqd, blkcg, create);
- if (!cfqg && create)
+ cfqg = cfq_find_alloc_cfqg(cfqd, blkcg);
+ if (!cfqg)
cfqg = &cfqd->root_group;
rcu_read_unlock();
return cfqg;
@@ -1176,7 +1176,7 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
}

#else /* GROUP_IOSCHED */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
{
return &cfqd->root_group;
}
@@ -2911,7 +2911,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
struct cfq_group *cfqg;

retry:
- cfqg = cfq_get_cfqg(cfqd, 1);
+ cfqg = cfq_get_cfqg(cfqd);
cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);
--
1.7.1

2011-05-18 19:13:36

by Vivek Goyal

Subject: [PATCH 05/14] cfq-iosched: Fix a possible race with cfq cgroup removal code

blkg->key = cfqd is an rcu protected pointer and hence we used to do
call_rcu(cfqd->rcu_head) to free up cfqd after one rcu grace period.

The problem here is that even though cfqd is around, there is no
guarantee that the associated request queue (cfqd->queue) or q->queue_lock
is still around. A driver might have called blk_cleanup_queue() and
released the lock.

It might happen that after the lock has been freed we dereference
blkg->key->queue->queue_lock and crash. This is possible in the following
path:

blkiocg_destroy()
blkio_unlink_group_fn()
cfq_unlink_blkio_group()

Hence, wait for an rcu grace period if there are groups which have not
been unlinked from blkcg->blkg_list. That way, any group taking the
cfq_unlink_blkio_group() path can safely take the queue
lock.

This is how the race has already been handled in the throttling logic.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 48 ++++++++++++++++++++++++++++++++++++------------
1 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a905b55..e2e6719 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -300,7 +300,9 @@ struct cfq_data {

/* List of cfq groups being managed on this device*/
struct hlist_head cfqg_list;
- struct rcu_head rcu;
+
+ /* Number of groups which are on blkcg->blkg_list */
+ unsigned int nr_blkcg_linked_grps;
};

static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
@@ -1063,6 +1065,7 @@ static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
0);

+ cfqd->nr_blkcg_linked_grps++;
cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);

/* Add group on cfqd list */
@@ -3815,15 +3818,11 @@ static void cfq_put_async_queues(struct cfq_data *cfqd)
cfq_put_queue(cfqd->async_idle_cfqq);
}

-static void cfq_cfqd_free(struct rcu_head *head)
-{
- kfree(container_of(head, struct cfq_data, rcu));
-}
-
static void cfq_exit_queue(struct elevator_queue *e)
{
struct cfq_data *cfqd = e->elevator_data;
struct request_queue *q = cfqd->queue;
+ bool wait = false;

cfq_shutdown_timer_wq(cfqd);

@@ -3842,7 +3841,13 @@ static void cfq_exit_queue(struct elevator_queue *e)

cfq_put_async_queues(cfqd);
cfq_release_cfq_groups(cfqd);
- cfq_blkiocg_del_blkio_group(&cfqd->root_group.blkg);
+
+ /*
+ * If there are groups which we could not unlink from blkcg list,
+ * wait for a rcu period for them to be freed.
+ */
+ if (cfqd->nr_blkcg_linked_grps)
+ wait = true;

spin_unlock_irq(q->queue_lock);

@@ -3852,8 +3857,20 @@ static void cfq_exit_queue(struct elevator_queue *e)
ida_remove(&cic_index_ida, cfqd->cic_index);
spin_unlock(&cic_index_lock);

- /* Wait for cfqg->blkg->key accessors to exit their grace periods. */
- call_rcu(&cfqd->rcu, cfq_cfqd_free);
+ /*
+ * Wait for cfqg->blkg->key accessors to exit their grace periods.
+ * Do this wait only if there are other unlinked groups out
+ * there. This can happen if cgroup deletion path claimed the
+ * responsibility of cleaning up a group before queue cleanup code
+ * get to the group.
+ *
+ * Do not call synchronize_rcu() unconditionally as there are drivers
+ * which create/delete request queue hundreds of times during scan/boot
+ * and synchronize_rcu() can take significant time and slow down boot.
+ */
+ if (wait)
+ synchronize_rcu();
+ kfree(cfqd);
}

static int cfq_alloc_cic_index(void)
@@ -3909,14 +3926,21 @@ static void *cfq_init_queue(struct request_queue *q)

#ifdef CONFIG_CFQ_GROUP_IOSCHED
/*
- * Take a reference to root group which we never drop. This is just
- * to make sure that cfq_put_cfqg() does not try to kfree root group
+ * Set root group reference to 2. One reference will be dropped when
+ * all groups on cfqd->cfqg_list are being deleted during queue exit.
+ * Other reference will remain there as we don't want to delete this
+ * group as it is statically allocated and gets destroyed when
+ * throtl_data goes away.
*/
- cfqg->ref = 1;
+ cfqg->ref = 2;
rcu_read_lock();
cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
(void *)cfqd, 0);
rcu_read_unlock();
+ cfqd->nr_blkcg_linked_grps++;
+
+ /* Add group on cfqd->cfqg_list */
+ hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
#endif
/*
* Not strictly needed (since RB_ROOT just clears the node and we
--
1.7.1

2011-05-18 19:15:29

by Vivek Goyal

Subject: [PATCH 06/14] blk-cgroup: Allow sleeping while dynamically allocating a group

Currently, all the cfq_group and throtl_group allocations happen while
we are holding ->queue_lock, where sleeping is not allowed.

Soon we will move to per cpu stats and will also need to allocate the
per group stats. As one can not call alloc_percpu() from atomic
context (it can sleep), we need to drop ->queue_lock, allocate the
group, retake the lock and continue processing.

In the throttling code, I check the queue DEAD flag again to make sure
the driver did not call blk_cleanup_queue() in the meantime.
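
The pattern, condensed (a sketch; the blk-throttle hunk below is the
complete version, including the queue reference and the DEAD check):

	rcu_read_unlock();
	spin_unlock_irq(q->queue_lock);	/* can't sleep under these */

	tg = throtl_alloc_tg(td);	/* may sleep in alloc_percpu() */

	spin_lock_irq(q->queue_lock);
	rcu_read_lock();
	__tg = throtl_find_tg(td, blkcg);	/* did someone race us? */
	if (__tg) {
		kfree(tg);			/* lost the race, use theirs */
		tg = __tg;
	}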

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-core.c | 3 +-
block/blk-throttle.c | 141 ++++++++++++++++++++++++++++++++++++++------------
block/cfq-iosched.c | 128 +++++++++++++++++++++++++++++++++------------
3 files changed, 205 insertions(+), 67 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a2e58ee..c95b3d2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1548,7 +1548,8 @@ static inline void __generic_make_request(struct bio *bio)
goto end_io;
}

- blk_throtl_bio(q, &bio);
+ if (blk_throtl_bio(q, &bio))
+ goto end_io;

/*
* If bio = NULL, bio has been throttled and will be submitted
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index fa9a900..c201967 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -188,8 +188,40 @@ throtl_add_group_to_td_list(struct throtl_data *td, struct throtl_grp *tg)
td->nr_undestroyed_grps++;
}

-static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
- struct blkio_cgroup *blkcg)
+static void throtl_init_add_tg_lists(struct throtl_data *td,
+ struct throtl_grp *tg, struct blkio_cgroup *blkcg)
+{
+ struct backing_dev_info *bdi = &td->queue->backing_dev_info;
+ unsigned int major, minor;
+
+ /* Add group onto cgroup list */
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ blkiocg_add_blkio_group(blkcg, &tg->blkg, (void *)td,
+ MKDEV(major, minor), BLKIO_POLICY_THROTL);
+
+ tg->bps[READ] = blkcg_get_read_bps(blkcg, tg->blkg.dev);
+ tg->bps[WRITE] = blkcg_get_write_bps(blkcg, tg->blkg.dev);
+ tg->iops[READ] = blkcg_get_read_iops(blkcg, tg->blkg.dev);
+ tg->iops[WRITE] = blkcg_get_write_iops(blkcg, tg->blkg.dev);
+
+ throtl_add_group_to_td_list(td, tg);
+}
+
+/* Should be called without queue lock and outside of rcu period */
+static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td)
+{
+ struct throtl_grp *tg = NULL;
+
+ tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
+ if (!tg)
+ return NULL;
+
+ throtl_init_group(tg);
+ return tg;
+}
+
+static struct
+throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
{
struct throtl_grp *tg = NULL;
void *key = td;
@@ -197,12 +229,6 @@ static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
unsigned int major, minor;

/*
- * TODO: Speed up blkiocg_lookup_group() by maintaining a radix
- * tree of blkg (instead of traversing through hash list all
- * the time.
- */
-
- /*
* This is the common case when there are no blkio cgroups.
* Avoid lookup in this case
*/
@@ -215,43 +241,83 @@ static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
if (tg && !tg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
tg->blkg.dev = MKDEV(major, minor);
- goto done;
}

- if (tg)
- goto done;
-
- tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
- if (!tg)
- goto done;
-
- throtl_init_group(tg);
-
- /* Add group onto cgroup list */
- sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
- blkiocg_add_blkio_group(blkcg, &tg->blkg, (void *)td,
- MKDEV(major, minor), BLKIO_POLICY_THROTL);
-
- tg->bps[READ] = blkcg_get_read_bps(blkcg, tg->blkg.dev);
- tg->bps[WRITE] = blkcg_get_write_bps(blkcg, tg->blkg.dev);
- tg->iops[READ] = blkcg_get_read_iops(blkcg, tg->blkg.dev);
- tg->iops[WRITE] = blkcg_get_write_iops(blkcg, tg->blkg.dev);
-
- throtl_add_group_to_td_list(td, tg);
-done:
return tg;
}

+/*
+ * This function returns with queue lock unlocked in case of error, like
+ * request queue is no more
+ */
static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
{
- struct throtl_grp *tg = NULL;
+ struct throtl_grp *tg = NULL, *__tg = NULL;
struct blkio_cgroup *blkcg;
+ struct request_queue *q = td->queue;

rcu_read_lock();
blkcg = task_blkio_cgroup(current);
- tg = throtl_find_alloc_tg(td, blkcg);
- if (!tg)
+ tg = throtl_find_tg(td, blkcg);
+ if (tg) {
+ rcu_read_unlock();
+ return tg;
+ }
+
+ /*
+ * Need to allocate a group. Allocation of group also needs allocation
+ * of per cpu stats which in-turn takes a mutex() and can block. Hence
+ * we need to drop rcu lock and queue_lock before we call alloc
+ *
+ * Take the request queue reference to make sure queue does not
+ * go away once we return from allocation.
+ */
+ blk_get_queue(q);
+ rcu_read_unlock();
+ spin_unlock_irq(q->queue_lock);
+
+ tg = throtl_alloc_tg(td);
+ /*
+ * We might have slept in group allocation. Make sure queue is not
+ * dead
+ */
+ if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
+ blk_put_queue(q);
+ if (tg)
+ kfree(tg);
+
+ return ERR_PTR(-ENODEV);
+ }
+ blk_put_queue(q);
+
+ /* Group allocated and queue is still alive. take the lock */
+ spin_lock_irq(q->queue_lock);
+
+ /*
+ * Initialize the new group. After sleeping, read the blkcg again.
+ */
+ rcu_read_lock();
+ blkcg = task_blkio_cgroup(current);
+
+ /*
+ * If some other thread already allocated the group while we were
+ * not holding queue lock, free up the group
+ */
+ __tg = throtl_find_tg(td, blkcg);
+
+ if (__tg) {
+ kfree(tg);
+ rcu_read_unlock();
+ return __tg;
+ }
+
+ /* Group allocation failed. Account the IO to root group */
+ if (!tg) {
tg = &td->root_tg;
+ return tg;
+ }
+
+ throtl_init_add_tg_lists(td, tg, blkcg);
rcu_read_unlock();
return tg;
}
@@ -1014,6 +1080,15 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
spin_lock_irq(q->queue_lock);
tg = throtl_get_tg(td);

+ if (IS_ERR(tg)) {
+ if (PTR_ERR(tg) == -ENODEV) {
+ /*
+ * Queue is gone. No queue lock held here.
+ */
+ return -ENODEV;
+ }
+ }
+
if (tg->nr_queued[rw]) {
/*
* There is already another bio queued in same dir. No
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e2e6719..606020f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1016,28 +1016,47 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
cfqg->needs_update = true;
}

-static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
- struct blkio_cgroup *blkcg)
+static void cfq_init_add_cfqg_lists(struct cfq_data *cfqd,
+ struct cfq_group *cfqg, struct blkio_cgroup *blkcg)
{
- struct cfq_group *cfqg = NULL;
- void *key = cfqd;
- int i, j;
- struct cfq_rb_root *st;
struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
unsigned int major, minor;

- cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
- if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+ /*
+ * Add group onto cgroup list. It might happen that bdi->dev is
+ * not initialized yet. Initialize this new group without major
+ * and minor info and this info will be filled in once a new thread
+ * comes for IO.
+ */
+ if (bdi->dev) {
sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
- cfqg->blkg.dev = MKDEV(major, minor);
- goto done;
- }
- if (cfqg)
- goto done;
+ cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg,
+ (void *)cfqd, MKDEV(major, minor));
+ } else
+ cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg,
+ (void *)cfqd, 0);
+
+ cfqd->nr_blkcg_linked_grps++;
+ cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+
+ /* Add group on cfqd list */
+ hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+}
+
+/*
+ * Should be called from sleepable context. No request queue lock as per
+ * cpu stats are allocated dynamically and alloc_percpu needs to be called
+ * from sleepable context.
+ */
+static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
+{
+ struct cfq_group *cfqg = NULL;
+ int i, j;
+ struct cfq_rb_root *st;

cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
if (!cfqg)
- goto done;
+ return NULL;

for_each_cfqg_st(cfqg, i, j, st)
*st = CFQ_RB_ROOT;
@@ -1050,28 +1069,31 @@ static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
* or cgroup deletion path depending on who is exiting first.
*/
cfqg->ref = 1;
+ return cfqg;
+}
+
+static struct cfq_group *
+cfq_find_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg)
+{
+ struct cfq_group *cfqg = NULL;
+ void *key = cfqd;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+ unsigned int major, minor;

/*
- * Add group onto cgroup list. It might happen that bdi->dev is
- * not initialized yet. Initialize this new group without major
- * and minor info and this info will be filled in once a new thread
- * comes for IO. See code above.
+ * This is the common case when there are no blkio cgroups.
+ * Avoid lookup in this case
*/
- if (bdi->dev) {
- sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
- cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
- MKDEV(major, minor));
- } else
- cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
- 0);
-
- cfqd->nr_blkcg_linked_grps++;
- cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+ if (blkcg == &blkio_root_cgroup)
+ cfqg = &cfqd->root_group;
+ else
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));

- /* Add group on cfqd list */
- hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+ if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ cfqg->blkg.dev = MKDEV(major, minor);
+ }

-done:
return cfqg;
}

@@ -1082,13 +1104,53 @@ done:
static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
{
struct blkio_cgroup *blkcg;
- struct cfq_group *cfqg = NULL;
+ struct cfq_group *cfqg = NULL, *__cfqg = NULL;
+ struct request_queue *q = cfqd->queue;
+
+ rcu_read_lock();
+ blkcg = task_blkio_cgroup(current);
+ cfqg = cfq_find_cfqg(cfqd, blkcg);
+ if (cfqg) {
+ rcu_read_unlock();
+ return cfqg;
+ }
+
+ /*
+ * Need to allocate a group. Allocation of group also needs allocation
+ * of per cpu stats which in-turn takes a mutex() and can block. Hence
+ * we need to drop rcu lock and queue_lock before we call alloc.
+ *
+ * Not taking any queue reference here and assuming that queue is
+ * around by the time we return. CFQ queue allocation code does
+ * the same. It might be racy though.
+ */
+
+ rcu_read_unlock();
+ spin_unlock_irq(q->queue_lock);
+
+ cfqg = cfq_alloc_cfqg(cfqd);
+
+ spin_lock_irq(q->queue_lock);

rcu_read_lock();
blkcg = task_blkio_cgroup(current);
- cfqg = cfq_find_alloc_cfqg(cfqd, blkcg);
+
+ /*
+ * If some other thread already allocated the group while we were
+ * not holding queue lock, free up the group
+ */
+ __cfqg = cfq_find_cfqg(cfqd, blkcg);
+
+ if (__cfqg) {
+ kfree(cfqg);
+ rcu_read_unlock();
+ return __cfqg;
+ }
+
if (!cfqg)
cfqg = &cfqd->root_group;
+
+ cfq_init_add_cfqg_lists(cfqd, cfqg, blkcg);
rcu_read_unlock();
return cfqg;
}
--
1.7.1

2011-05-18 19:15:37

by Vivek Goyal

Subject: [PATCH 07/14] blk-throttle: Dynamically allocate root group

Currently we allocate the root throtl_grp statically. But as we will be
introducing per cpu stat pointers, which are allocated
dynamically even for the root group, we might as well make the whole root
throtl_grp allocation dynamic and treat it the same way as the other
groups.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 27 ++++++++++++---------------
1 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c201967..68f2ac3 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -88,7 +88,7 @@ struct throtl_data
/* service tree for active throtl groups */
struct throtl_rb_root tg_service_tree;

- struct throtl_grp root_tg;
+ struct throtl_grp *root_tg;
struct request_queue *queue;

/* Total Number of queued bios on READ and WRITE lists */
@@ -233,7 +233,7 @@ throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
* Avoid lookup in this case
*/
if (blkcg == &blkio_root_cgroup)
- tg = &td->root_tg;
+ tg = td->root_tg;
else
tg = tg_of_blkg(blkiocg_lookup_group(blkcg, key));

@@ -313,7 +313,7 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)

/* Group allocation failed. Account the IO to root group */
if (!tg) {
- tg = &td->root_tg;
+ tg = td->root_tg;
return tg;
}

@@ -1153,18 +1153,16 @@ int blk_throtl_init(struct request_queue *q)
td->limits_changed = false;
INIT_DELAYED_WORK(&td->throtl_work, blk_throtl_work);

- /* Init root group */
- tg = &td->root_tg;
- throtl_init_group(tg);
+ /* alloc and Init root group. */
+ td->queue = q;
+ tg = throtl_alloc_tg(td);

- /*
- * Set root group reference to 2. One reference will be dropped when
- * all groups on tg_list are being deleted during queue exit. Other
- * reference will remain there as we don't want to delete this group
- * as it is statically allocated and gets destroyed when throtl_data
- * goes away.
- */
- atomic_inc(&tg->ref);
+ if (!tg) {
+ kfree(td);
+ return -ENOMEM;
+ }
+
+ td->root_tg = tg;

rcu_read_lock();
blkiocg_add_blkio_group(&blkio_root_cgroup, &tg->blkg, (void *)td,
@@ -1173,7 +1171,6 @@ int blk_throtl_init(struct request_queue *q)
throtl_add_group_to_td_list(td, tg);

/* Attach throtl data to request queue */
- td->queue = q;
q->td = td;
return 0;
}
--
1.7.1

2011-05-18 19:15:31

by Vivek Goyal

Subject: [PATCH 08/14] blk-throttle: Introduce a helper function to fill in device details

A helper function for code which is used in two or three places. Makes the
code a little easier to read.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 35 +++++++++++++++++++++++------------
1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 68f2ac3..c2c2888 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -188,16 +188,34 @@ throtl_add_group_to_td_list(struct throtl_data *td, struct throtl_grp *tg)
td->nr_undestroyed_grps++;
}

+static void
+__throtl_tg_fill_dev_details(struct throtl_data *td, struct throtl_grp *tg)
+{
+ struct backing_dev_info *bdi = &td->queue->backing_dev_info;
+ unsigned int major, minor;
+
+ if (!tg || tg->blkg.dev)
+ return;
+
+ /*
+ * Fill in device details for a group which might not have been
+ * filled at group creation time as queue was being instantiated
+ * and driver had not attached a device yet
+ */
+ if (bdi->dev && dev_name(bdi->dev)) {
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ tg->blkg.dev = MKDEV(major, minor);
+ }
+}
+
static void throtl_init_add_tg_lists(struct throtl_data *td,
struct throtl_grp *tg, struct blkio_cgroup *blkcg)
{
- struct backing_dev_info *bdi = &td->queue->backing_dev_info;
- unsigned int major, minor;
+ __throtl_tg_fill_dev_details(td, tg);

/* Add group onto cgroup list */
- sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
blkiocg_add_blkio_group(blkcg, &tg->blkg, (void *)td,
- MKDEV(major, minor), BLKIO_POLICY_THROTL);
+ tg->blkg.dev, BLKIO_POLICY_THROTL);

tg->bps[READ] = blkcg_get_read_bps(blkcg, tg->blkg.dev);
tg->bps[WRITE] = blkcg_get_write_bps(blkcg, tg->blkg.dev);
@@ -225,8 +243,6 @@ throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
{
struct throtl_grp *tg = NULL;
void *key = td;
- struct backing_dev_info *bdi = &td->queue->backing_dev_info;
- unsigned int major, minor;

/*
* This is the common case when there are no blkio cgroups.
@@ -237,12 +253,7 @@ throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
else
tg = tg_of_blkg(blkiocg_lookup_group(blkcg, key));

- /* Fill in device details for root group */
- if (tg && !tg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
- sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
- tg->blkg.dev = MKDEV(major, minor);
- }
-
+ __throtl_tg_fill_dev_details(td, tg);
return tg;
}

--
1.7.1

2011-05-18 19:13:39

by Vivek Goyal

Subject: [PATCH 09/14] blk-throttle: Use helper function to add root throtl group to lists

Use the same helper function for the root group that we use with
dynamically allocated groups to add it to the various lists.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 4 +---
1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c2c2888..5543c82 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1176,10 +1176,8 @@ int blk_throtl_init(struct request_queue *q)
td->root_tg = tg;

rcu_read_lock();
- blkiocg_add_blkio_group(&blkio_root_cgroup, &tg->blkg, (void *)td,
- 0, BLKIO_POLICY_THROTL);
+ throtl_init_add_tg_lists(td, tg, &blkio_root_cgroup);
rcu_read_unlock();
- throtl_add_group_to_td_list(td, tg);

/* Attach throtl data to request queue */
q->td = td;
--
1.7.1

2011-05-18 19:15:39

by Vivek Goyal

Subject: [PATCH 10/14] blk-throttle: Free up a group only after one rcu grace period

Soon we will allow accessing a throtl_grp under rcu_read_lock(). Hence
start freeing up throtl_grp after one rcu grace period.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5543c82..64fdde5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -78,6 +78,8 @@ struct throtl_grp {

/* Some throttle limits got updated for the group */
int limits_changed;
+
+ struct rcu_head rcu_head;
};

struct throtl_data
@@ -151,12 +153,30 @@ static inline struct throtl_grp *throtl_ref_get_tg(struct throtl_grp *tg)
return tg;
}

+static void throtl_free_tg(struct rcu_head *head)
+{
+ struct throtl_grp *tg;
+
+ tg = container_of(head, struct throtl_grp, rcu_head);
+ kfree(tg);
+}
+
static void throtl_put_tg(struct throtl_grp *tg)
{
BUG_ON(atomic_read(&tg->ref) <= 0);
if (!atomic_dec_and_test(&tg->ref))
return;
- kfree(tg);
+
+ /*
+ * A group is freed in rcu manner. But having an rcu lock does not
+ * mean that one can access all the fields of blkg and assume these
+ * are valid. For example, don't try to follow throtl_data and
+ * request queue links.
+ *
+ * Having a reference to blkg under an rcu allows acess to only
+ * values local to groups like group stats and group rate limits
+ */
+ call_rcu(&tg->rcu_head, throtl_free_tg);
}

static void throtl_init_group(struct throtl_grp *tg)
--
1.7.1

2011-05-18 19:14:26

by Vivek Goyal

Subject: [PATCH 11/14] blk-throttle: Make dispatch stats per cpu

Currently we take the blkg stats lock even just to update the stats. So even
if a group has no throttling rules (the common case for the root group), we
end up taking blkg->stats_lock to update the stats.

Make the dispatch stats per cpu so that they can be updated without taking
the blkg lock.

If a cpu goes offline, these stats simply disappear. No protection has
been provided for that yet. Do we really need anything there?
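
The update side then becomes a plain per cpu operation, and the file read
side sums over the cpus (a sketch; the full code is in the diff below):

	/* update: called under rcu or queue lock so blkg stays valid */
	struct blkio_group_stats_cpu *sc = this_cpu_ptr(blkg->stats_cpu);
	sc->sectors += bytes >> 9;

	/* read: sum the counters over all possible cpus */
	for_each_possible_cpu(cpu)
		val += per_cpu_ptr(blkg->stats_cpu, cpu)->sectors;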

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 135 +++++++++++++++++++++++++++++++++++++-------------
block/blk-cgroup.h | 27 ++++++++--
block/blk-throttle.c | 9 +++
block/cfq-iosched.c | 18 ++++++-
4 files changed, 147 insertions(+), 42 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b0592bc..34bfcef 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -392,20 +392,22 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time,
}
EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);

+/*
+ * should be called under rcu read lock or queue lock to make sure blkg pointer
+ * is valid.
+ */
void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
uint64_t bytes, bool direction, bool sync)
{
- struct blkio_group_stats *stats;
- unsigned long flags;
+ struct blkio_group_stats_cpu *stats_cpu;

- spin_lock_irqsave(&blkg->stats_lock, flags);
- stats = &blkg->stats;
- stats->sectors += bytes >> 9;
- blkio_add_stat(stats->stat_arr[BLKIO_STAT_SERVICED], 1, direction,
- sync);
- blkio_add_stat(stats->stat_arr[BLKIO_STAT_SERVICE_BYTES], bytes,
- direction, sync);
- spin_unlock_irqrestore(&blkg->stats_lock, flags);
+ stats_cpu = this_cpu_ptr(blkg->stats_cpu);
+
+ stats_cpu->sectors += bytes >> 9;
+ blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_SERVICED],
+ 1, direction, sync);
+ blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_SERVICE_BYTES],
+ bytes, direction, sync);
}
EXPORT_SYMBOL_GPL(blkiocg_update_dispatch_stats);

@@ -440,6 +442,20 @@ void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
}
EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);

+/*
+ * This function allocates the per cpu stats for blkio_group. Should be called
+ * from sleepable context as alloc_per_cpu() requires that.
+ */
+int blkio_alloc_blkg_stats(struct blkio_group *blkg)
+{
+ /* Allocate memory for per cpu stats */
+ blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
+ if (!blkg->stats_cpu)
+ return -ENOMEM;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats);
+
void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key, dev_t dev,
enum blkio_policy_id plid)
@@ -600,6 +616,53 @@ static uint64_t blkio_fill_stat(char *str, int chars_left, uint64_t val,
return val;
}

+
+static uint64_t blkio_read_stat_cpu(struct blkio_group *blkg,
+ enum stat_type_cpu type, enum stat_sub_type sub_type)
+{
+ int cpu;
+ struct blkio_group_stats_cpu *stats_cpu;
+ uint64_t val = 0;
+
+ for_each_possible_cpu(cpu) {
+ stats_cpu = per_cpu_ptr(blkg->stats_cpu, cpu);
+
+ if (type == BLKIO_STAT_CPU_SECTORS)
+ val += stats_cpu->sectors;
+ else
+ val += stats_cpu->stat_arr_cpu[type][sub_type];
+ }
+
+ return val;
+}
+
+static uint64_t blkio_get_stat_cpu(struct blkio_group *blkg,
+ struct cgroup_map_cb *cb, dev_t dev, enum stat_type_cpu type)
+{
+ uint64_t disk_total, val;
+ char key_str[MAX_KEY_LEN];
+ enum stat_sub_type sub_type;
+
+ if (type == BLKIO_STAT_CPU_SECTORS) {
+ val = blkio_read_stat_cpu(blkg, type, 0);
+ return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, val, cb, dev);
+ }
+
+ for (sub_type = BLKIO_STAT_READ; sub_type < BLKIO_STAT_TOTAL;
+ sub_type++) {
+ blkio_get_key_name(sub_type, dev, key_str, MAX_KEY_LEN, false);
+ val = blkio_read_stat_cpu(blkg, type, sub_type);
+ cb->fill(cb, key_str, val);
+ }
+
+ disk_total = blkio_read_stat_cpu(blkg, type, BLKIO_STAT_READ) +
+ blkio_read_stat_cpu(blkg, type, BLKIO_STAT_WRITE);
+
+ blkio_get_key_name(BLKIO_STAT_TOTAL, dev, key_str, MAX_KEY_LEN, false);
+ cb->fill(cb, key_str, disk_total);
+ return disk_total;
+}
+
/* This should be called with blkg->stats_lock held */
static uint64_t blkio_get_stat(struct blkio_group *blkg,
struct cgroup_map_cb *cb, dev_t dev, enum stat_type type)
@@ -611,9 +674,6 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
if (type == BLKIO_STAT_TIME)
return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
blkg->stats.time, cb, dev);
- if (type == BLKIO_STAT_SECTORS)
- return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
- blkg->stats.sectors, cb, dev);
#ifdef CONFIG_DEBUG_BLK_CGROUP
if (type == BLKIO_STAT_UNACCOUNTED_TIME)
return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
@@ -1077,8 +1137,8 @@ static int blkiocg_file_read(struct cgroup *cgrp, struct cftype *cft,
}

static int blkio_read_blkg_stats(struct blkio_cgroup *blkcg,
- struct cftype *cft, struct cgroup_map_cb *cb, enum stat_type type,
- bool show_total)
+ struct cftype *cft, struct cgroup_map_cb *cb,
+ enum stat_type type, bool show_total, bool pcpu)
{
struct blkio_group *blkg;
struct hlist_node *n;
@@ -1089,10 +1149,15 @@ static int blkio_read_blkg_stats(struct blkio_cgroup *blkcg,
if (blkg->dev) {
if (!cftype_blkg_same_policy(cft, blkg))
continue;
- spin_lock_irq(&blkg->stats_lock);
- cgroup_total += blkio_get_stat(blkg, cb, blkg->dev,
- type);
- spin_unlock_irq(&blkg->stats_lock);
+ if (pcpu)
+ cgroup_total += blkio_get_stat_cpu(blkg, cb,
+ blkg->dev, type);
+ else {
+ spin_lock_irq(&blkg->stats_lock);
+ cgroup_total += blkio_get_stat(blkg, cb,
+ blkg->dev, type);
+ spin_unlock_irq(&blkg->stats_lock);
+ }
}
}
if (show_total)
@@ -1116,47 +1181,47 @@ static int blkiocg_file_read_map(struct cgroup *cgrp, struct cftype *cft,
switch(name) {
case BLKIO_PROP_time:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_TIME, 0);
+ BLKIO_STAT_TIME, 0, 0);
case BLKIO_PROP_sectors:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_SECTORS, 0);
+ BLKIO_STAT_CPU_SECTORS, 0, 1);
case BLKIO_PROP_io_service_bytes:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_SERVICE_BYTES, 1);
+ BLKIO_STAT_CPU_SERVICE_BYTES, 1, 1);
case BLKIO_PROP_io_serviced:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_SERVICED, 1);
+ BLKIO_STAT_CPU_SERVICED, 1, 1);
case BLKIO_PROP_io_service_time:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_SERVICE_TIME, 1);
+ BLKIO_STAT_SERVICE_TIME, 1, 0);
case BLKIO_PROP_io_wait_time:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_WAIT_TIME, 1);
+ BLKIO_STAT_WAIT_TIME, 1, 0);
case BLKIO_PROP_io_merged:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_MERGED, 1);
+ BLKIO_STAT_MERGED, 1, 0);
case BLKIO_PROP_io_queued:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_QUEUED, 1);
+ BLKIO_STAT_QUEUED, 1, 0);
#ifdef CONFIG_DEBUG_BLK_CGROUP
case BLKIO_PROP_unaccounted_time:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_UNACCOUNTED_TIME, 0);
+ BLKIO_STAT_UNACCOUNTED_TIME, 0, 0);
case BLKIO_PROP_dequeue:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_DEQUEUE, 0);
+ BLKIO_STAT_DEQUEUE, 0, 0);
case BLKIO_PROP_avg_queue_size:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_AVG_QUEUE_SIZE, 0);
+ BLKIO_STAT_AVG_QUEUE_SIZE, 0, 0);
case BLKIO_PROP_group_wait_time:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_GROUP_WAIT_TIME, 0);
+ BLKIO_STAT_GROUP_WAIT_TIME, 0, 0);
case BLKIO_PROP_idle_time:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_IDLE_TIME, 0);
+ BLKIO_STAT_IDLE_TIME, 0, 0);
case BLKIO_PROP_empty_time:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_EMPTY_TIME, 0);
+ BLKIO_STAT_EMPTY_TIME, 0, 0);
#endif
default:
BUG();
@@ -1166,10 +1231,10 @@ static int blkiocg_file_read_map(struct cgroup *cgrp, struct cftype *cft,
switch(name){
case BLKIO_THROTL_io_service_bytes:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_SERVICE_BYTES, 1);
+ BLKIO_STAT_CPU_SERVICE_BYTES, 1, 1);
case BLKIO_THROTL_io_serviced:
return blkio_read_blkg_stats(blkcg, cft, cb,
- BLKIO_STAT_SERVICED, 1);
+ BLKIO_STAT_CPU_SERVICED, 1, 1);
default:
BUG();
}
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 63f1ef4..fd730a2 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -36,10 +36,6 @@ enum stat_type {
* request completion for IOs doen by this cgroup. This may not be
* accurate when NCQ is turned on. */
BLKIO_STAT_SERVICE_TIME = 0,
- /* Total bytes transferred */
- BLKIO_STAT_SERVICE_BYTES,
- /* Total IOs serviced, post merge */
- BLKIO_STAT_SERVICED,
/* Total time spent waiting in scheduler queue in ns */
BLKIO_STAT_WAIT_TIME,
/* Number of IOs merged */
@@ -48,7 +44,6 @@ enum stat_type {
BLKIO_STAT_QUEUED,
/* All the single valued stats go below this */
BLKIO_STAT_TIME,
- BLKIO_STAT_SECTORS,
#ifdef CONFIG_DEBUG_BLK_CGROUP
/* Time not charged to this cgroup */
BLKIO_STAT_UNACCOUNTED_TIME,
@@ -60,6 +55,16 @@ enum stat_type {
#endif
};

+/* Per cpu stats */
+enum stat_type_cpu {
+ BLKIO_STAT_CPU_SECTORS,
+ /* Total bytes transferred */
+ BLKIO_STAT_CPU_SERVICE_BYTES,
+ /* Total IOs serviced, post merge */
+ BLKIO_STAT_CPU_SERVICED,
+ BLKIO_STAT_CPU_NR
+};
+
enum stat_sub_type {
BLKIO_STAT_READ = 0,
BLKIO_STAT_WRITE,
@@ -116,7 +121,6 @@ struct blkio_cgroup {
struct blkio_group_stats {
/* total disk time and nr sectors dispatched by this group */
uint64_t time;
- uint64_t sectors;
uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
#ifdef CONFIG_DEBUG_BLK_CGROUP
/* Time not charged to this cgroup */
@@ -146,6 +150,12 @@ struct blkio_group_stats {
#endif
};

+/* Per cpu blkio group stats */
+struct blkio_group_stats_cpu {
+ uint64_t sectors;
+ uint64_t stat_arr_cpu[BLKIO_STAT_CPU_NR][BLKIO_STAT_TOTAL];
+};
+
struct blkio_group {
/* An rcu protected unique identifier for the group */
void *key;
@@ -161,6 +171,8 @@ struct blkio_group {
/* Need to serialize the stats in the case of reset/update */
spinlock_t stats_lock;
struct blkio_group_stats stats;
+ /* Per cpu stats pointer */
+ struct blkio_group_stats_cpu __percpu *stats_cpu;
};

struct blkio_policy_node {
@@ -296,6 +308,7 @@ extern struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk);
extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key, dev_t dev,
enum blkio_policy_id plid);
+extern int blkio_alloc_blkg_stats(struct blkio_group *blkg);
extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
void *key);
@@ -323,6 +336,8 @@ static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key, dev_t dev,
enum blkio_policy_id plid) {}

+static inline int blkio_alloc_blkg_stats(struct blkio_group *blkg) { return 0; }
+
static inline int
blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 64fdde5..44ce1a5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -158,6 +158,7 @@ static void throtl_free_tg(struct rcu_head *head)
struct throtl_grp *tg;

tg = container_of(head, struct throtl_grp, rcu_head);
+ free_percpu(tg->blkg.stats_cpu);
kfree(tg);
}

@@ -249,11 +250,19 @@ static void throtl_init_add_tg_lists(struct throtl_data *td,
static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td)
{
struct throtl_grp *tg = NULL;
+ int ret;

tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
if (!tg)
return NULL;

+ ret = blkio_alloc_blkg_stats(&tg->blkg);
+
+ if (ret) {
+ kfree(tg);
+ return NULL;
+ }
+
throtl_init_group(tg);
return tg;
}
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 606020f..d646b27 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1051,7 +1051,7 @@ static void cfq_init_add_cfqg_lists(struct cfq_data *cfqd,
static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
{
struct cfq_group *cfqg = NULL;
- int i, j;
+ int i, j, ret;
struct cfq_rb_root *st;

cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
@@ -1069,6 +1069,13 @@ static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
* or cgroup deletion path depending on who is exiting first.
*/
cfqg->ref = 1;
+
+ ret = blkio_alloc_blkg_stats(&cfqg->blkg);
+ if (ret) {
+ kfree(cfqg);
+ return NULL;
+ }
+
return cfqg;
}

@@ -1183,6 +1190,7 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
return;
for_each_cfqg_st(cfqg, i, j, st)
BUG_ON(!RB_EMPTY_ROOT(&st->rb));
+ free_percpu(cfqg->blkg.stats_cpu);
kfree(cfqg);
}

@@ -3995,7 +4003,15 @@ static void *cfq_init_queue(struct request_queue *q)
* throtl_data goes away.
*/
cfqg->ref = 2;
+
+ if (blkio_alloc_blkg_stats(&cfqg->blkg)) {
+ kfree(cfqg);
+ kfree(cfqd);
+ return NULL;
+ }
+
rcu_read_lock();
+
cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
(void *)cfqd, 0);
rcu_read_unlock();
--
1.7.1

2011-05-18 19:15:36

by Vivek Goyal

Subject: [PATCH 12/14] blk-cgroup: Make 64bit per cpu stats safe on 32bit arch

Some of the stats are 64bit, and updating them is non-atomic on 32bit
architectures. Use sequence counters on 32bit arches to make reading
the stats safe.
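
The writer wraps the update in u64_stats_update_begin/end(), and readers
retry if they raced with a writer (a sketch; the diff below is the real
code):

	u64_stats_update_begin(&stats_cpu->syncp);
	stats_cpu->sectors += bytes >> 9;
	u64_stats_update_end(&stats_cpu->syncp);

	/* reader side */
	do {
		start = u64_stats_fetch_begin(&stats_cpu->syncp);
		tval = stats_cpu->sectors;
	} while (u64_stats_fetch_retry(&stats_cpu->syncp, start));

On 64bit arches these helpers compile away, so the fast path is unaffected.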

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 27 ++++++++++++++++++++++-----
block/blk-cgroup.h | 2 ++
2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 34bfcef..3622518 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -400,14 +400,25 @@ void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
uint64_t bytes, bool direction, bool sync)
{
struct blkio_group_stats_cpu *stats_cpu;
+ unsigned long flags;
+
+ /*
+ * Disabling interrupts to provide mutual exclusion between two
+ * writes on same cpu. It probably is not needed for 64bit. Not
+ * optimizing that case yet.
+ */
+ local_irq_save(flags);

stats_cpu = this_cpu_ptr(blkg->stats_cpu);

+ u64_stats_update_begin(&stats_cpu->syncp);
stats_cpu->sectors += bytes >> 9;
blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_SERVICED],
1, direction, sync);
blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_SERVICE_BYTES],
bytes, direction, sync);
+ u64_stats_update_end(&stats_cpu->syncp);
+ local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(blkiocg_update_dispatch_stats);

@@ -622,15 +633,21 @@ static uint64_t blkio_read_stat_cpu(struct blkio_group *blkg,
{
int cpu;
struct blkio_group_stats_cpu *stats_cpu;
- uint64_t val = 0;
+ u64 val = 0, tval;

for_each_possible_cpu(cpu) {
+ unsigned int start;
stats_cpu = per_cpu_ptr(blkg->stats_cpu, cpu);

- if (type == BLKIO_STAT_CPU_SECTORS)
- val += stats_cpu->sectors;
- else
- val += stats_cpu->stat_arr_cpu[type][sub_type];
+ do {
+ start = u64_stats_fetch_begin(&stats_cpu->syncp);
+ if (type == BLKIO_STAT_CPU_SECTORS)
+ tval = stats_cpu->sectors;
+ else
+ tval = stats_cpu->stat_arr_cpu[type][sub_type];
+ } while(u64_stats_fetch_retry(&stats_cpu->syncp, start));
+
+ val += tval;
}

return val;
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index fd730a2..2622267 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -14,6 +14,7 @@
*/

#include <linux/cgroup.h>
+#include <linux/u64_stats_sync.h>

enum blkio_policy_id {
BLKIO_POLICY_PROP = 0, /* Proportional Bandwidth division */
@@ -154,6 +155,7 @@ struct blkio_group_stats {
struct blkio_group_stats_cpu {
uint64_t sectors;
uint64_t stat_arr_cpu[BLKIO_STAT_CPU_NR][BLKIO_STAT_TOTAL];
+ struct u64_stats_sync syncp;
};

struct blkio_group {
--
1.7.1

2011-05-18 19:14:19

by Vivek Goyal

Subject: [PATCH 13/14] blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats

Now the dispatch stats update is lock free. But resetting these stats still
takes blkg->stats_lock and depends on it. As the stats are per cpu,
we should be able to just reset them on each cpu without any locks
(at least on 64bit arches).

On 32bit arches there is a small race where 64bit updates are not atomic.
The result of this race can be that, in the presence of other writers,
one might not read a 0 value after resetting a stat and might see something
intermediate.

One could write more complicated code to cover this race, like sending IPIs
to other cpus to reset stats and resetting the stats of offline cpus directly.

Right now I am not taking that path because stats reset is more of a
debug feature, the race can happen only on 32bit arches, and the possibility
of it happening is small. I will fix it if it becomes a real problem. For
the time being, going for code simplicity.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 28 ++++++++++++++++++++++++++++
1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3622518..e41cc6f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -537,6 +537,30 @@ struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key)
}
EXPORT_SYMBOL_GPL(blkiocg_lookup_group);

+static void blkio_reset_stats_cpu(struct blkio_group *blkg)
+{
+ struct blkio_group_stats_cpu *stats_cpu;
+ int i, j, k;
+ /*
+ * Note: On 64-bit arches this is not an issue. On 32-bit arches
+ * this can return an inconsistent value, as 64-bit updates are
+ * not atomic there. Taking care of that corner case would make
+ * the code very complicated (sending IPIs to cpus, handling the
+ * stats of offline cpus, etc.).
+ *
+ * Stats reset is anyway more of a debug feature and this sounds
+ * like a corner case, so do not complicate the code unless this
+ * becomes a real issue.
+ */
+ for_each_possible_cpu(i) {
+ stats_cpu = per_cpu_ptr(blkg->stats_cpu, i);
+ stats_cpu->sectors = 0;
+ for (j = 0; j < BLKIO_STAT_CPU_NR; j++)
+ for (k = 0; k < BLKIO_STAT_TOTAL; k++)
+ stats_cpu->stat_arr_cpu[j][k] = 0;
+ }
+}
+
static int
blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
{
@@ -581,7 +605,11 @@ blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
}
#endif
spin_unlock(&blkg->stats_lock);
+
+ /* Reset per cpu stats which don't take blkg->stats_lock */
+ blkio_reset_stats_cpu(blkg);
}
+
spin_unlock_irq(&blkcg->lock);
return 0;
}
--
1.7.1

2011-05-18 19:14:22

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 14/14] blk-throttle: Make no throttling rule group processing lockless

Currently we take the queue lock on each bio to check whether any
throttling rules are associated with the group, and also to update the
stats. Now we access the group under rcu and update the stats without
taking the queue lock; the queue lock is taken only if throttling
rules are associated with the group.

So in the common case of the root group with no rules, this saves the
unnecessary pounding of the request queue lock.
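
The reason rcu_read_lock() is sufficient on this path is that a group
is now freed only after an RCU grace period (see the earlier patch in
this series that frees a group only after one rcu grace period). A
minimal self-contained sketch of that pairing (all names below are
illustrative, not the kernel's):

#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/percpu.h>
#include <linux/slab.h>

struct tg {
	u64 __percpu *stats;
	struct rcu_head rcu_head;
};

static struct tg __rcu *current_tg;

/* reader: the lockless fast path */
static void account_bio(unsigned int bytes)
{
	struct tg *tg;

	rcu_read_lock();
	tg = rcu_dereference(current_tg);
	if (tg)
		this_cpu_add(*tg->stats, bytes); /* tg cannot be freed here */
	rcu_read_unlock();
}

/* writer: the removal side */
static void free_tg_cb(struct rcu_head *head)
{
	struct tg *tg = container_of(head, struct tg, rcu_head);

	free_percpu(tg->stats);
	kfree(tg);
}

static void remove_tg(struct tg *tg)
{
	rcu_assign_pointer(current_tg, NULL);
	call_rcu(&tg->rcu_head, free_tg_cb); /* freed after a grace period */
}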

Signed-off-by: Vivek Goyal <[email protected]>
---
block/blk-throttle.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 44ce1a5..7df7067 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -229,6 +229,22 @@ __throtl_tg_fill_dev_details(struct throtl_data *td, struct throtl_grp *tg)
}
}

+/*
+ * Should be called without the queue lock held. The queue lock is
+ * taken here only rarely: at most once during the lifetime of a
+ * group, if need be.
+ */
+static void
+throtl_tg_fill_dev_details(struct throtl_data *td, struct throtl_grp *tg)
+{
+ if (!tg || tg->blkg.dev)
+ return;
+
+ spin_lock_irq(td->queue->queue_lock);
+ __throtl_tg_fill_dev_details(td, tg);
+ spin_unlock_irq(td->queue->queue_lock);
+}
+
static void throtl_init_add_tg_lists(struct throtl_data *td,
struct throtl_grp *tg, struct blkio_cgroup *blkcg)
{
@@ -666,6 +682,12 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
return 0;
}

+static bool tg_no_rule_group(struct throtl_grp *tg, bool rw)
+{
+ if (tg->bps[rw] == -1 && tg->iops[rw] == -1)
+ return 1;
+ return 0;
+}
+
/*
* Returns whether one can dispatch a bio or not. Also returns approx number
 * of jiffies to wait before this bio is within the IO rate and can be dispatched
@@ -730,10 +752,6 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
tg->bytes_disp[rw] += bio->bi_size;
tg->io_disp[rw]++;

- /*
- * TODO: This will take blkg->stats_lock. Figure out a way
- * to avoid this cost.
- */
blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync);
}

@@ -1111,12 +1129,39 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
struct throtl_grp *tg;
struct bio *bio = *biop;
bool rw = bio_data_dir(bio), update_disptime = true;
+ struct blkio_cgroup *blkcg;

if (bio->bi_rw & REQ_THROTTLED) {
bio->bi_rw &= ~REQ_THROTTLED;
return 0;
}

+ /*
+ * A throtl_grp pointer retrieved under rcu can be used to access
+ * basic fields like stats and io rates. If a group has no rules,
+ * just update the dispatch stats in a lockless manner and return.
+ */
+
+ rcu_read_lock();
+ blkcg = task_blkio_cgroup(current);
+ tg = throtl_find_tg(td, blkcg);
+ if (tg) {
+ throtl_tg_fill_dev_details(td, tg);
+
+ if (tg_no_rule_group(tg, rw)) {
+ blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
+ rw, bio->bi_rw & REQ_SYNC);
+ rcu_read_unlock();
+ return 0;
+ }
+ }
+ rcu_read_unlock();
+
+ /*
+ * Either the group has not been allocated yet, or it is not an
+ * unlimited IO group.
+ */
+
spin_lock_irq(q->queue_lock);
tg = throtl_get_tg(td);

--
1.7.1

2011-05-19 18:33:20

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] blk-throttle: lockless bio processing for no throttle rule group

On 2011-05-18 21:13, Vivek Goyal wrote:
> Hi,
>
> Block throttling code takes request queue lock for every incoming bio
> (blk_throtl_bio()). This is true even if there are no throttle rules in
> the group. This is a common case for root cgroup where distributions
> will have throttling support compiled in but a vast majority of users
> will not be specifying throttling rule.
>
> [...]

Vivek, I get weird things in these patches. In fact I always get this
on your patches: '=' comes through as =3D, =20 shows up in some
places, and there are broken line breaks. Can I ask you to try and
resend the series to [email protected], just to see whether it's the
company MTA screwing things up or something at your end?
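
(For the record, =3D and =20 are the quoted-printable escapes for '='
and a space, so the mail is apparently being re-encoded somewhere in
transit. A minimal sketch of the escape decoding, with a hypothetical
helper name:)

#include <ctype.h>

/* Decode one quoted-printable escape: "=3D" -> '=', "=20" -> ' '.
 * Illustration only; real decoders also handle soft line breaks
 * ("=" at the end of a line). */
static int qp_decode_byte(const char *s)
{
	int hi, lo;

	if (s[0] != '=' || !isxdigit((unsigned char)s[1]) ||
	    !isxdigit((unsigned char)s[2]))
		return -1;
	hi = isdigit((unsigned char)s[1]) ?
		s[1] - '0' : toupper((unsigned char)s[1]) - 'A' + 10;
	lo = isdigit((unsigned char)s[2]) ?
		s[2] - '0' : toupper((unsigned char)s[2]) - 'A' + 10;
	return (hi << 4) | lo;
}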

--
Jens Axboe

2011-05-19 18:44:35

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] blk-throttle: lockless bio processing for no throttle rule group

On Thu, May 19, 2011 at 08:33:10PM +0200, Jens Axboe wrote:
> On 2011-05-18 21:13, Vivek Goyal wrote:
> > [...]
>
> Vivek, I get weird things in these patches. In fact I always get this
> on your patches: '=' comes through as =3D, =20 shows up in some
> places, and there are broken line breaks. Can I ask you to try and
> resend the series to [email protected], just to see whether it's the
> company MTA screwing things up or something at your end?

Sure, I am resending the series to your [email protected] id. If you still
see the problem there, then I need to look into my setup.

I had cced the patches to myself, so I took the mbox of the series
and did "git am <patch-series>", and it gave two warnings about space
before tab in indent.

/home/vgoyal/main/git/linux-2.6/.git/rebase-apply/patch:16: space before tab in indent.
struct backing_dev_info *bdi = &td->queue->backing_dev_info;
/home/vgoyal/main/git/linux-2.6/.git/rebase-apply/patch:17: space before tab in indent.
unsigned int major, minor;
warning: 2 lines add whitespace errors.

Apart from that, nothing else appeared. I will fix the above two
warnings and resend the series, this time to your [email protected] id.
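
(As an aside, applying with "git am --whitespace=fix" would also
clean such whitespace errors up automatically on the receiving side.)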

Thanks
Vivek