Hi everyone,
This series of dm-ioband patches now includes "the bio tracking mechanism,"
which has previously been posted to this mailing list on its own.
This makes it easy for anybody to control I/O bandwidth even when
the I/O is a delayed-write request.
Have fun!
This series of patches consists of two parts:
1. dm-ioband
Dm-ioband is an I/O bandwidth controller implemented as a
device-mapper driver, which gives a specified bandwidth to each job
running on the same physical device. A job is a group of processes
with the same pid, pgrp, or uid, or a virtual machine such as KVM
or Xen. A job can also be a cgroup when the bio-cgroup patch is applied.
2. bio-cgroup
Bio-cgroup is a BIO tracking mechanism implemented on top of
the cgroup memory subsystem. With this mechanism, it is possible to
determine which cgroup each bio belongs to, even when the bio
is a delayed-write request issued from a kernel thread
such as pdflush.
Until now, the above two parts have been posted individually to this
mailing list, but from this release on we will post them together.
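For a quick idea of how dm-ioband is used, here is a sketch of a dmsetup
invocation that creates a band device (the device path, group type, and
weight values are illustrative assumptions; see the documentation patch
and the project page below for authoritative examples):

```shell
# Table format, as accepted by the dm-ioband constructor:
#   <device> <device-group-id> <io_throttle> <io_limit>
#   <type> <policy> <policy-param...> <group-id:group-param...>
# io_throttle=0 and io_limit=0 select the driver defaults.
SIZE=$(blockdev --getsize /dev/sda1)
echo "0 $SIZE ioband /dev/sda1 1 0 0 none weight 0 :40" \
    | dmsetup create ioband1
```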
[PATCH 1/7] dm-ioband: Patch of device-mapper driver
[PATCH 2/7] dm-ioband: Documentation of design overview, installation,
command reference and examples.
[PATCH 3/7] bio-cgroup: Introduction
[PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
[PATCH 5/7] bio-cgroup: Remove a lot of "#ifdef"s
[PATCH 6/7] bio-cgroup: Implement the bio-cgroup
[PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband
Please see the following site for more information:
Linux Block I/O Bandwidth Control Project
http://people.valinux.co.jp/~ryov/bwctl/
Thanks,
Ryo Tsuruta
This is the dm-ioband version 1.4.0 release.
Dm-ioband is an I/O bandwidth controller implemented as a device-mapper
driver, which gives a specified bandwidth to each job running on the same
physical device.
- Can be applied to the kernel 2.6.27-rc1-mm1.
- Changes from 1.3.0 (posted on July 11, 2008):
- Fix a problem in the handling of urgent I/O requests.
Dm-ioband gives priority to I/O requests on pages with the
PG_reclaim flag set. We assumed this situation only arises for
write requests, but it also occurred for read requests and caused
urgent I/O requests to be mishandled. We have not yet determined
whether this behavior is correct.
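The urgency check mentioned above can be modeled in user space as follows.
This is a simplified sketch of the classification logic in is_urgent_bio()
from the patch below; the struct and function names here are made up for
illustration, and the two booleans merely stand in for the kernel's
PG_reclaim and PG_swapcache page flags:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model of dm-ioband's urgency classification.
 * A bio is treated as urgent when its first page is marked for
 * reclaim, and as urgent swap I/O when that page is also in the
 * swap cache.
 */
struct page_model {
	bool pg_reclaim;   /* stands in for the PG_reclaim flag */
	bool pg_swapcache; /* stands in for the PG_swapcache flag */
};

/* 0: not urgent, 1: urgent, 2: urgent swap I/O */
static int is_urgent(const struct page_model *page)
{
	if (!page->pg_reclaim)
		return 0;
	if (page->pg_swapcache)
		return 2;
	return 1;
}
```

An urgent bio bypasses the per-group queues and is held on the
device-wide g_urgent_bios list, which is drained ahead of normal bios;
the changelog entry above notes that this path turned out to be reached
by reads as well as writes.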
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <[email protected]>
Signed-off-by: Hirokazu Takahashi <[email protected]>
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/Kconfig linux-2.6.27-rc1-mm1/drivers/md/Kconfig
--- linux-2.6.27-rc1-mm1.orig/drivers/md/Kconfig 2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/Kconfig 2008-08-01 16:44:02.000000000 +0900
@@ -275,4 +275,17 @@ config DM_UEVENT
---help---
Generate udev events for DM events.
+config DM_IOBAND
+ tristate "I/O bandwidth control (EXPERIMENTAL)"
+ depends on BLK_DEV_DM && EXPERIMENTAL
+ ---help---
+ This device-mapper target allows you to define how the
+ available bandwidth of a storage device should be
+ shared between processes, cgroups, partitions or LUNs.
+
+ Information on how to use dm-ioband is available in:
+ <file:Documentation/device-mapper/ioband.txt>.
+
+ If unsure, say N.
+
endif # MD
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/Makefile linux-2.6.27-rc1-mm1/drivers/md/Makefile
--- linux-2.6.27-rc1-mm1.orig/drivers/md/Makefile 2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/Makefile 2008-08-01 16:44:02.000000000 +0900
@@ -7,6 +7,7 @@ dm-mod-objs := dm.o dm-table.o dm-target
dm-multipath-objs := dm-path-selector.o dm-mpath.o
dm-snapshot-objs := dm-snap.o dm-exception-store.o
dm-mirror-objs := dm-raid1.o
+dm-ioband-objs := dm-ioband-ctl.o dm-ioband-policy.o dm-ioband-type.o
md-mod-objs := md.o bitmap.o
raid456-objs := raid5.o raid6algos.o raid6recov.o raid6tables.o \
raid6int1.o raid6int2.o raid6int4.o \
@@ -36,6 +37,7 @@ obj-$(CONFIG_DM_MULTIPATH) += dm-multipa
obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o
obj-$(CONFIG_DM_MIRROR) += dm-mirror.o dm-log.o
obj-$(CONFIG_DM_ZERO) += dm-zero.o
+obj-$(CONFIG_DM_IOBAND) += dm-ioband.o
quiet_cmd_unroll = UNROLL $@
cmd_unroll = $(PERL) $(srctree)/$(src)/unroll.pl $(UNROLL) \
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-ctl.c linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-ctl.c
--- linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-ctl.c 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-ctl.c 2008-08-01 16:44:02.000000000 +0900
@@ -0,0 +1,1319 @@
+/*
+ * Copyright (C) 2008 VA Linux Systems Japan K.K.
+ * Authors: Hirokazu Takahashi <[email protected]>
+ * Ryo Tsuruta <[email protected]>
+ *
+ * I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/raid/md.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "dm-bio-list.h"
+#include "dm-ioband.h"
+
+#define DM_MSG_PREFIX "ioband"
+#define POLICY_PARAM_START 6
+#define POLICY_PARAM_DELIM "=:,"
+
+static LIST_HEAD(ioband_device_list);
+/* to protect ioband_device_list */
+static DEFINE_SPINLOCK(ioband_devicelist_lock);
+
+static void suspend_ioband_device(struct ioband_device *, unsigned long, int);
+static void resume_ioband_device(struct ioband_device *);
+static void ioband_conduct(struct work_struct *);
+static void ioband_hold_bio(struct ioband_group *, struct bio *);
+static struct bio *ioband_pop_bio(struct ioband_group *);
+static int ioband_set_param(struct ioband_group *, char *, char *);
+static int ioband_group_attach(struct ioband_group *, int, char *);
+static int ioband_group_type_select(struct ioband_group *, char *);
+
+long ioband_debug; /* just for debugging */
+
+static void do_nothing(void) {}
+
+static int policy_init(struct ioband_device *dp, char *name,
+ int argc, char **argv)
+{
+ struct policy_type *p;
+ struct ioband_group *gp;
+ unsigned long flags;
+ int r;
+
+ for (p = dm_ioband_policy_type; p->p_name; p++) {
+ if (!strcmp(name, p->p_name))
+ break;
+ }
+ if (!p->p_name)
+ return -EINVAL;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (dp->g_policy == p) {
+ /* do nothing if the same policy is already set */
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return 0;
+ }
+
+ suspend_ioband_device(dp, flags, 1);
+ list_for_each_entry(gp, &dp->g_groups, c_list)
+ dp->g_group_dtr(gp);
+
+ /* switch to the new policy */
+ dp->g_policy = p;
+ r = p->p_policy_init(dp, argc, argv);
+ if (!dp->g_hold_bio)
+ dp->g_hold_bio = ioband_hold_bio;
+ if (!dp->g_pop_bio)
+ dp->g_pop_bio = ioband_pop_bio;
+
+ list_for_each_entry(gp, &dp->g_groups, c_list)
+ dp->g_group_ctr(gp, NULL);
+ resume_ioband_device(dp);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+}
+
+static struct ioband_device *alloc_ioband_device(char *name,
+ int io_throttle, int io_limit)
+
+{
+ struct ioband_device *dp, *new;
+ unsigned long flags;
+
+ new = kzalloc(sizeof(struct ioband_device), GFP_KERNEL);
+ if (!new)
+ return NULL;
+
+ spin_lock_irqsave(&ioband_devicelist_lock, flags);
+ list_for_each_entry(dp, &ioband_device_list, g_list) {
+ if (!strcmp(dp->g_name, name)) {
+ dp->g_ref++;
+ spin_unlock_irqrestore(&ioband_devicelist_lock, flags);
+ kfree(new);
+ return dp;
+ }
+ }
+
+ /*
+ * Prepare its own workqueue as generic_make_request() may
+ * potentially block the workqueue when submitting BIOs.
+ */
+ new->g_ioband_wq = create_workqueue("kioband");
+ if (!new->g_ioband_wq) {
+ spin_unlock_irqrestore(&ioband_devicelist_lock, flags);
+ kfree(new);
+ return NULL;
+ }
+
+ INIT_DELAYED_WORK(&new->g_conductor, ioband_conduct);
+ INIT_LIST_HEAD(&new->g_groups);
+ INIT_LIST_HEAD(&new->g_list);
+ spin_lock_init(&new->g_lock);
+ mutex_init(&new->g_lock_device);
+ bio_list_init(&new->g_urgent_bios);
+ new->g_io_throttle = io_throttle;
+ new->g_io_limit[0] = io_limit;
+ new->g_io_limit[1] = io_limit;
+ new->g_issued[0] = 0;
+ new->g_issued[1] = 0;
+ new->g_blocked = 0;
+ new->g_ref = 1;
+ new->g_flags = 0;
+ strlcpy(new->g_name, name, sizeof(new->g_name));
+ new->g_policy = NULL;
+ new->g_hold_bio = NULL;
+ new->g_pop_bio = NULL;
+ init_waitqueue_head(&new->g_waitq);
+ init_waitqueue_head(&new->g_waitq_suspend);
+ init_waitqueue_head(&new->g_waitq_flush);
+ list_add_tail(&new->g_list, &ioband_device_list);
+
+ spin_unlock_irqrestore(&ioband_devicelist_lock, flags);
+ return new;
+}
+
+static void release_ioband_device(struct ioband_device *dp)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&ioband_devicelist_lock, flags);
+ dp->g_ref--;
+ if (dp->g_ref > 0) {
+ spin_unlock_irqrestore(&ioband_devicelist_lock, flags);
+ return;
+ }
+ list_del(&dp->g_list);
+ spin_unlock_irqrestore(&ioband_devicelist_lock, flags);
+ destroy_workqueue(dp->g_ioband_wq);
+ kfree(dp);
+}
+
+static int is_ioband_device_flushed(struct ioband_device *dp,
+ int wait_completion)
+{
+ struct ioband_group *gp;
+
+ if (wait_completion && dp->g_issued[0] + dp->g_issued[1] > 0)
+ return 0;
+ if (dp->g_blocked || waitqueue_active(&dp->g_waitq))
+ return 0;
+ list_for_each_entry(gp, &dp->g_groups, c_list)
+ if (waitqueue_active(&gp->c_waitq))
+ return 0;
+ return 1;
+}
+
+static void suspend_ioband_device(struct ioband_device *dp,
+ unsigned long flags, int wait_completion)
+{
+ struct ioband_group *gp;
+
+ /* block incoming bios */
+ set_device_suspended(dp);
+
+ /* wake up all blocked processes and go down all ioband groups */
+ wake_up_all(&dp->g_waitq);
+ list_for_each_entry(gp, &dp->g_groups, c_list) {
+ if (!is_group_down(gp)) {
+ set_group_down(gp);
+ set_group_need_up(gp);
+ }
+ wake_up_all(&gp->c_waitq);
+ }
+
+ /* flush the already mapped bios */
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ flush_workqueue(dp->g_ioband_wq);
+
+ /* wait for all processes to wake up and bios to release */
+ spin_lock_irqsave(&dp->g_lock, flags);
+ wait_event_lock_irq(dp->g_waitq_flush,
+ is_ioband_device_flushed(dp, wait_completion),
+ dp->g_lock, do_nothing());
+}
+
+static void resume_ioband_device(struct ioband_device *dp)
+{
+ struct ioband_group *gp;
+
+ /* go up ioband groups */
+ list_for_each_entry(gp, &dp->g_groups, c_list) {
+ if (group_need_up(gp)) {
+ clear_group_need_up(gp);
+ clear_group_down(gp);
+ }
+ }
+
+ /* accept incoming bios */
+ wake_up_all(&dp->g_waitq_suspend);
+ clear_device_suspended(dp);
+}
+
+static struct ioband_group *ioband_group_find(
+ struct ioband_group *head, int id)
+{
+ struct rb_node *node = head->c_group_root.rb_node;
+
+ while (node) {
+ struct ioband_group *p =
+ container_of(node, struct ioband_group, c_group_node);
+
+ if (p->c_id == id || id == IOBAND_ID_ANY)
+ return p;
+ node = (id < p->c_id) ? node->rb_left : node->rb_right;
+ }
+ return NULL;
+}
+
+static void ioband_group_add_node(struct rb_root *root,
+ struct ioband_group *gp)
+{
+ struct rb_node **new = &root->rb_node, *parent = NULL;
+ struct ioband_group *p;
+
+ while (*new) {
+ p = container_of(*new, struct ioband_group, c_group_node);
+ parent = *new;
+ new = (gp->c_id < p->c_id) ?
+ &(*new)->rb_left : &(*new)->rb_right;
+ }
+
+ rb_link_node(&gp->c_group_node, parent, new);
+ rb_insert_color(&gp->c_group_node, root);
+}
+
+static int ioband_group_init(struct ioband_group *gp,
+ struct ioband_group *head, struct ioband_device *dp, int id, char *param)
+{
+ unsigned long flags;
+ int r;
+
+ INIT_LIST_HEAD(&gp->c_list);
+ bio_list_init(&gp->c_blocked_bios);
+ bio_list_init(&gp->c_prio_bios);
+ gp->c_id = id; /* should be verified */
+ gp->c_blocked = 0;
+ gp->c_prio_blocked = 0;
+ memset(gp->c_stat, 0, sizeof(gp->c_stat));
+ init_waitqueue_head(&gp->c_waitq);
+ gp->c_flags = 0;
+ gp->c_group_root = RB_ROOT;
+ gp->c_banddev = dp;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (head && ioband_group_find(head, id)) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ DMWARN("ioband_group: id=%d already exists.", id);
+ return -EEXIST;
+ }
+
+ list_add_tail(&gp->c_list, &dp->g_groups);
+
+ r = dp->g_group_ctr(gp, param);
+ if (r) {
+ list_del(&gp->c_list);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+ }
+
+ if (head) {
+ ioband_group_add_node(&head->c_group_root, gp);
+ gp->c_dev = head->c_dev;
+ gp->c_target = head->c_target;
+ }
+
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+
+ return 0;
+}
+
+static void ioband_group_release(struct ioband_group *head,
+ struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ list_del(&gp->c_list);
+ if (head)
+ rb_erase(&gp->c_group_node, &head->c_group_root);
+ dp->g_group_dtr(gp);
+ kfree(gp);
+}
+
+static void ioband_group_destroy_all(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *group;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ while ((group = ioband_group_find(gp, IOBAND_ID_ANY)))
+ ioband_group_release(gp, group);
+ ioband_group_release(NULL, gp);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static void ioband_group_stop_all(struct ioband_group *head, int suspend)
+{
+ struct ioband_device *dp = head->c_banddev;
+ struct ioband_group *p;
+ struct rb_node *node;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ set_group_down(p);
+ if (suspend) {
+ set_group_suspended(p);
+ dprintk(KERN_ERR "ioband suspend: gp(%p)\n", p);
+ }
+ }
+ set_group_down(head);
+ if (suspend) {
+ set_group_suspended(head);
+ dprintk(KERN_ERR "ioband suspend: gp(%p)\n", head);
+ }
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ flush_workqueue(dp->g_ioband_wq);
+}
+
+static void ioband_group_resume_all(struct ioband_group *head)
+{
+ struct ioband_device *dp = head->c_banddev;
+ struct ioband_group *p;
+ struct rb_node *node;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ for (node = rb_first(&head->c_group_root); node;
+ node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ clear_group_down(p);
+ clear_group_suspended(p);
+ dprintk(KERN_ERR "ioband resume: gp(%p)\n", p);
+ }
+ clear_group_down(head);
+ clear_group_suspended(head);
+ dprintk(KERN_ERR "ioband resume: gp(%p)\n", head);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static int split_string(char *s, long *id, char **v)
+{
+ char *p, *q;
+ int r = 0;
+
+ *id = IOBAND_ID_ANY;
+ p = strsep(&s, POLICY_PARAM_DELIM);
+ q = strsep(&s, POLICY_PARAM_DELIM);
+ if (!q) {
+ *v = p;
+ } else {
+ r = strict_strtol(p, 0, id);
+ *v = q;
+ }
+ return r;
+}
+
+/*
+ * Create a new band device:
+ * parameters: <device> <device-group-id> <io_throttle> <io_limit>
+ * <type> <policy> <policy-param...> <group-id:group-param...>
+ */
+static int ioband_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct ioband_group *gp;
+ struct ioband_device *dp;
+ struct dm_dev *dev;
+ int io_throttle;
+ int io_limit;
+ int i, r, start;
+ long val, id;
+ char *param;
+
+ if (argc < POLICY_PARAM_START) {
+ ti->error = "Requires " __stringify(POLICY_PARAM_START)
+ " or more arguments";
+ return -EINVAL;
+ }
+
+ if (strlen(argv[1]) > IOBAND_NAME_MAX) {
+ ti->error = "Ioband device name is too long";
+ return -EINVAL;
+ }
+ dprintk(KERN_ERR "ioband_ctr ioband device name:%s\n", argv[1]);
+
+ r = strict_strtol(argv[2], 0, &val);
+ if (r || val < 0) {
+ ti->error = "Invalid io_throttle";
+ return -EINVAL;
+ }
+ io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val;
+
+ r = strict_strtol(argv[3], 0, &val);
+ if (r || val < 0) {
+ ti->error = "Invalid io_limit";
+ return -EINVAL;
+ }
+ io_limit = val;
+
+ r = dm_get_device(ti, argv[0], 0, ti->len,
+ dm_table_get_mode(ti->table), &dev);
+ if (r) {
+ ti->error = "Device lookup failed";
+ return r;
+ }
+
+ if (io_limit == 0) {
+ struct request_queue *q;
+
+ q = bdev_get_queue(dev->bdev);
+ if (!q) {
+ ti->error = "Can't get queue size";
+ r = -ENXIO;
+ goto release_dm_device;
+ }
+ dprintk(KERN_ERR "ioband_ctr nr_requests:%lu\n",
+ q->nr_requests);
+ io_limit = q->nr_requests;
+ }
+
+ if (io_limit < io_throttle)
+ io_limit = io_throttle;
+ dprintk(KERN_ERR "ioband_ctr io_throttle:%d io_limit:%d\n",
+ io_throttle, io_limit);
+
+ dp = alloc_ioband_device(argv[1], io_throttle, io_limit);
+ if (!dp) {
+ ti->error = "Cannot create ioband device";
+ r = -EINVAL;
+ goto release_dm_device;
+ }
+
+ mutex_lock(&dp->g_lock_device);
+ r = policy_init(dp, argv[POLICY_PARAM_START - 1],
+ argc - POLICY_PARAM_START, &argv[POLICY_PARAM_START]);
+ if (r) {
+ ti->error = "Invalid policy parameter";
+ goto release_ioband_device;
+ }
+
+ gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+ if (!gp) {
+ ti->error = "Cannot allocate memory for ioband group";
+ r = -ENOMEM;
+ goto release_ioband_device;
+ }
+
+ ti->private = gp;
+ gp->c_target = ti;
+ gp->c_dev = dev;
+
+ /* Find a default group parameter */
+ for (start = POLICY_PARAM_START; start < argc; start++)
+ if (argv[start][0] == ':')
+ break;
+ param = (start < argc) ? &argv[start][1] : NULL;
+
+ /* Create a default ioband group */
+ r = ioband_group_init(gp, NULL, dp, IOBAND_ID_ANY, param);
+ if (r) {
+ kfree(gp);
+ ti->error = "Cannot create default ioband group";
+ goto release_ioband_device;
+ }
+
+ r = ioband_group_type_select(gp, argv[4]);
+ if (r) {
+ ti->error = "Cannot set ioband group type";
+ goto release_ioband_group;
+ }
+
+ /* Create sub ioband groups */
+ for (i = start + 1; i < argc; i++) {
+ r = split_string(argv[i], &id, ¶m);
+ if (r) {
+ ti->error = "Invalid ioband group parameter";
+ goto release_ioband_group;
+ }
+ r = ioband_group_attach(gp, id, param);
+ if (r) {
+ ti->error = "Cannot create ioband group";
+ goto release_ioband_group;
+ }
+ }
+ mutex_unlock(&dp->g_lock_device);
+ return 0;
+
+release_ioband_group:
+ ioband_group_destroy_all(gp);
+release_ioband_device:
+ mutex_unlock(&dp->g_lock_device);
+ release_ioband_device(dp);
+release_dm_device:
+ dm_put_device(ti, dev);
+ return r;
+}
+
+static void ioband_dtr(struct dm_target *ti)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+
+ mutex_lock(&dp->g_lock_device);
+ ioband_group_stop_all(gp, 0);
+ cancel_delayed_work_sync(&dp->g_conductor);
+ dm_put_device(ti, gp->c_dev);
+ ioband_group_destroy_all(gp);
+ mutex_unlock(&dp->g_lock_device);
+ release_ioband_device(dp);
+}
+
+static void ioband_hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+ /* Todo: The list should be split into a read list and a write list */
+ bio_list_add(&gp->c_blocked_bios, bio);
+}
+
+static struct bio *ioband_pop_bio(struct ioband_group *gp)
+{
+ return bio_list_pop(&gp->c_blocked_bios);
+}
+
+static int is_urgent_bio(struct bio *bio)
+{
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ /*
+ * ToDo: A new flag should be added to struct bio, which indicates
+ * it contains urgent I/O requests.
+ */
+ if (!PageReclaim(page))
+ return 0;
+ if (PageSwapCache(page))
+ return 2;
+ return 1;
+}
+
+static inline void resume_to_accept_bios(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (is_device_blocked(dp)
+ && dp->g_blocked < dp->g_io_limit[0]+dp->g_io_limit[1]) {
+ clear_device_blocked(dp);
+ wake_up_all(&dp->g_waitq);
+ }
+ if (is_group_blocked(gp)) {
+ clear_group_blocked(gp);
+ wake_up_all(&gp->c_waitq);
+ }
+}
+
+static inline int device_should_block(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (is_group_down(gp))
+ return 0;
+ if (is_device_blocked(dp))
+ return 1;
+ if (dp->g_blocked >= dp->g_io_limit[0] + dp->g_io_limit[1]) {
+ set_device_blocked(dp);
+ return 1;
+ }
+ return 0;
+}
+
+static inline int group_should_block(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (is_group_down(gp))
+ return 0;
+ if (is_group_blocked(gp))
+ return 1;
+ if (dp->g_should_block(gp)) {
+ set_group_blocked(gp);
+ return 1;
+ }
+ return 0;
+}
+
+static void prevent_burst_bios(struct ioband_group *gp, struct bio *bio)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (current->flags & PF_KTHREAD || is_urgent_bio(bio)) {
+ /*
+ * Kernel threads shouldn't be blocked easily since each of
+ * them may handle BIOs for several groups on several
+ * partitions.
+ */
+ wait_event_lock_irq(dp->g_waitq, !device_should_block(gp),
+ dp->g_lock, do_nothing());
+ } else {
+ wait_event_lock_irq(gp->c_waitq, !group_should_block(gp),
+ dp->g_lock, do_nothing());
+ }
+}
+
+static inline int should_pushback_bio(struct ioband_group *gp)
+{
+ return is_group_suspended(gp) && dm_noflush_suspending(gp->c_target);
+}
+
+static inline int prepare_to_issue(struct ioband_group *gp, struct bio *bio)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ dp->g_issued[bio_data_dir(bio)]++;
+ return dp->g_prepare_bio(gp, bio, 0);
+}
+
+static inline int room_for_bio(struct ioband_device *dp)
+{
+ return dp->g_issued[0] < dp->g_io_limit[0]
+ || dp->g_issued[1] < dp->g_io_limit[1];
+}
+
+static void hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ dp->g_blocked++;
+ if (is_urgent_bio(bio)) {
+ /*
+ * ToDo:
+ * When barrier mode is supported, write bios sharing the same
+ * file system with the current one would all be moved
+ * to the g_urgent_bios list.
+ * You don't have to care about barrier handling if the bio
+ * is for swapping.
+ */
+ dp->g_prepare_bio(gp, bio, IOBAND_URGENT);
+ bio_list_add(&dp->g_urgent_bios, bio);
+ } else {
+ gp->c_blocked++;
+ dp->g_hold_bio(gp, bio);
+ }
+}
+
+static inline int room_for_bio_rw(struct ioband_device *dp, int direct)
+{
+ return dp->g_issued[direct] < dp->g_io_limit[direct];
+}
+
+static void push_prio_bio(struct ioband_group *gp, struct bio *bio, int direct)
+{
+ if (bio_list_empty(&gp->c_prio_bios))
+ set_prio_queue(gp, direct);
+ bio_list_add(&gp->c_prio_bios, bio);
+ gp->c_prio_blocked++;
+}
+
+static struct bio *pop_prio_bio(struct ioband_group *gp)
+{
+ struct bio *bio = bio_list_pop(&gp->c_prio_bios);
+
+ if (bio_list_empty(&gp->c_prio_bios))
+ clear_prio_queue(gp);
+
+ if (bio)
+ gp->c_prio_blocked--;
+ return bio;
+}
+
+static int make_issue_list(struct ioband_group *gp, struct bio *bio,
+ struct bio_list *issue_list, struct bio_list *pushback_list)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ dp->g_blocked--;
+ gp->c_blocked--;
+ if (!gp->c_blocked)
+ resume_to_accept_bios(gp);
+ if (should_pushback_bio(gp))
+ bio_list_add(pushback_list, bio);
+ else {
+ int rw = bio_data_dir(bio);
+
+ gp->c_stat[rw].deferred++;
+ gp->c_stat[rw].sectors += bio_sectors(bio);
+ bio_list_add(issue_list, bio);
+ }
+ return prepare_to_issue(gp, bio);
+}
+
+static void release_urgent_bios(struct ioband_device *dp,
+ struct bio_list *issue_list, struct bio_list *pushback_list)
+{
+ struct bio *bio;
+
+ if (bio_list_empty(&dp->g_urgent_bios))
+ return;
+ while (room_for_bio_rw(dp, 1)) {
+ bio = bio_list_pop(&dp->g_urgent_bios);
+ if (!bio)
+ return;
+ dp->g_blocked--;
+ dp->g_issued[bio_data_dir(bio)]++;
+ bio_list_add(issue_list, bio);
+ }
+}
+
+static int release_prio_bios(struct ioband_group *gp,
+ struct bio_list *issue_list, struct bio_list *pushback_list)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct bio *bio;
+ int direct;
+ int ret;
+
+ if (bio_list_empty(&gp->c_prio_bios))
+ return R_OK;
+ direct = prio_queue_direct(gp);
+ while (gp->c_prio_blocked) {
+ if (!dp->g_can_submit(gp))
+ return R_BLOCK;
+ if (!room_for_bio_rw(dp, direct))
+ return R_OK;
+ bio = pop_prio_bio(gp);
+ if (!bio)
+ return R_OK;
+ ret = make_issue_list(gp, bio, issue_list, pushback_list);
+ if (ret)
+ return ret;
+ }
+ return R_OK;
+}
+
+static int release_norm_bios(struct ioband_group *gp,
+ struct bio_list *issue_list, struct bio_list *pushback_list)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct bio *bio;
+ int direct;
+ int ret;
+
+ while (gp->c_blocked - gp->c_prio_blocked) {
+ if (!dp->g_can_submit(gp))
+ return R_BLOCK;
+ if (!room_for_bio(dp))
+ return R_OK;
+ bio = dp->g_pop_bio(gp);
+ if (!bio)
+ return R_OK;
+
+ direct = bio_data_dir(bio);
+ if (!room_for_bio_rw(dp, direct)) {
+ push_prio_bio(gp, bio, direct);
+ continue;
+ }
+ ret = make_issue_list(gp, bio, issue_list, pushback_list);
+ if (ret)
+ return ret;
+ }
+ return R_OK;
+}
+
+static inline int release_bios(struct ioband_group *gp,
+ struct bio_list *issue_list, struct bio_list *pushback_list)
+{
+ int ret = release_prio_bios(gp, issue_list, pushback_list);
+ if (ret)
+ return ret;
+ return release_norm_bios(gp, issue_list, pushback_list);
+}
+
+static struct ioband_group *ioband_group_get(struct ioband_group *head,
+ struct bio *bio)
+{
+ struct ioband_group *gp;
+
+ if (!head->c_type->t_getid)
+ return head;
+
+ gp = ioband_group_find(head, head->c_type->t_getid(bio));
+
+ if (!gp)
+ gp = head;
+ return gp;
+}
+
+/*
+ * Start to control the bandwidth once the number of uncompleted BIOs
+ * exceeds the value of "io_throttle".
+ */
+static int ioband_map(struct dm_target *ti, struct bio *bio,
+ union map_info *map_context)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+ unsigned long flags;
+ int rw;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+
+ /*
+ * The device is suspended while some of the ioband device
+ * configurations are being changed.
+ */
+ if (is_device_suspended(dp))
+ wait_event_lock_irq(dp->g_waitq_suspend,
+ !is_device_suspended(dp), dp->g_lock, do_nothing());
+
+ gp = ioband_group_get(gp, bio);
+ prevent_burst_bios(gp, bio);
+ if (should_pushback_bio(gp)) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return DM_MAPIO_REQUEUE;
+ }
+
+ bio->bi_bdev = gp->c_dev->bdev;
+ bio->bi_sector -= ti->begin;
+ rw = bio_data_dir(bio);
+
+ if (!gp->c_blocked && room_for_bio_rw(dp, rw)) {
+ if (dp->g_can_submit(gp)) {
+ prepare_to_issue(gp, bio);
+ gp->c_stat[rw].immediate++;
+ gp->c_stat[rw].sectors += bio_sectors(bio);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return DM_MAPIO_REMAPPED;
+ } else if (!dp->g_blocked
+ && dp->g_issued[0] + dp->g_issued[1] == 0) {
+ dprintk(KERN_ERR "ioband_map: token expired "
+ "gp:%p bio:%p\n", gp, bio);
+ queue_delayed_work(dp->g_ioband_wq,
+ &dp->g_conductor, 1);
+ }
+ }
+ hold_bio(gp, bio);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+
+ return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * Select the best group to resubmit its BIOs.
+ */
+static struct ioband_group *choose_best_group(struct ioband_device *dp)
+{
+ struct ioband_group *gp;
+ struct ioband_group *best = NULL;
+ int highest = 0;
+ int pri;
+
+ /* Todo: The algorithm should be optimized.
+ * It would be better to use rbtree.
+ */
+ list_for_each_entry(gp, &dp->g_groups, c_list) {
+ if (!gp->c_blocked || !room_for_bio(dp))
+ continue;
+ if (gp->c_blocked == gp->c_prio_blocked
+ && !room_for_bio_rw(dp, prio_queue_direct(gp))) {
+ continue;
+ }
+ pri = dp->g_can_submit(gp);
+ if (pri > highest) {
+ highest = pri;
+ best = gp;
+ }
+ }
+
+ return best;
+}
+
+/*
+ * This function is called right after it becomes able to resubmit BIOs.
+ * It selects the best BIOs and passes them to the underlying layer.
+ */
+static void ioband_conduct(struct work_struct *work)
+{
+ struct ioband_device *dp =
+ container_of(work, struct ioband_device, g_conductor.work);
+ struct ioband_group *gp = NULL;
+ struct bio *bio;
+ unsigned long flags;
+ struct bio_list issue_list, pushback_list;
+
+ bio_list_init(&issue_list);
+ bio_list_init(&pushback_list);
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ release_urgent_bios(dp, &issue_list, &pushback_list);
+ if (dp->g_blocked) {
+ gp = choose_best_group(dp);
+ if (gp && release_bios(gp, &issue_list, &pushback_list)
+ == R_YIELD)
+ queue_delayed_work(dp->g_ioband_wq,
+ &dp->g_conductor, 0);
+ }
+
+ if (dp->g_blocked && room_for_bio_rw(dp, 0) && room_for_bio_rw(dp, 1) &&
+ bio_list_empty(&issue_list) && bio_list_empty(&pushback_list) &&
+ dp->g_restart_bios(dp)) {
+ dprintk(KERN_ERR "ioband_conduct: token expired dp:%p "
+ "issued(%d,%d) g_blocked(%d)\n", dp,
+ dp->g_issued[0], dp->g_issued[1], dp->g_blocked);
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ }
+
+
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+
+ while ((bio = bio_list_pop(&issue_list)))
+ generic_make_request(bio);
+ while ((bio = bio_list_pop(&pushback_list)))
+ bio_endio(bio, -EIO);
+}
+
+static int ioband_end_io(struct dm_target *ti, struct bio *bio,
+ int error, union map_info *map_context)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+ unsigned long flags;
+ int r = error;
+
+ /*
+ * XXX: A new error code for device mapper devices should be used
+ * rather than EIO.
+ */
+ if (error == -EIO && should_pushback_bio(gp)) {
+ /* This ioband device is suspending */
+ r = DM_ENDIO_REQUEUE;
+ }
+ /*
+ * Todo: The algorithm should be optimized to eliminate the spinlock.
+ */
+ spin_lock_irqsave(&dp->g_lock, flags);
+ dp->g_issued[bio_data_dir(bio)]--;
+
+ /*
+ * Todo: It would be better to introduce high/low water marks here
+ * not to kick the workqueues so often.
+ */
+ if (dp->g_blocked)
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ else if (is_device_suspended(dp)
+ && dp->g_issued[0] + dp->g_issued[1] == 0)
+ wake_up_all(&dp->g_waitq_flush);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+}
+
+static void ioband_presuspend(struct dm_target *ti)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+
+ mutex_lock(&dp->g_lock_device);
+ ioband_group_stop_all(gp, 1);
+ mutex_unlock(&dp->g_lock_device);
+}
+
+static void ioband_resume(struct dm_target *ti)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+
+ mutex_lock(&dp->g_lock_device);
+ ioband_group_resume_all(gp);
+ mutex_unlock(&dp->g_lock_device);
+}
+
+
+static void ioband_group_status(struct ioband_group *gp, int *szp,
+ char *result, unsigned int maxlen)
+{
+ struct ioband_group_stat *stat;
+ int i, sz = *szp; /* used in DMEMIT() */
+
+ DMEMIT(" %d", gp->c_id);
+ for (i = 0; i < 2; i++) {
+ stat = &gp->c_stat[i];
+ DMEMIT(" %lu %lu %lu",
+ stat->immediate + stat->deferred, stat->deferred,
+ stat->sectors);
+ }
+ *szp = sz;
+}
+
+static int ioband_status(struct dm_target *ti, status_type_t type,
+ char *result, unsigned int maxlen)
+{
+ struct ioband_group *gp = ti->private, *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ int sz = 0; /* used in DMEMIT() */
+ unsigned long flags;
+
+ mutex_lock(&dp->g_lock_device);
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ spin_lock_irqsave(&dp->g_lock, flags);
+ DMEMIT("%s", dp->g_name);
+ ioband_group_status(gp, &sz, result, maxlen);
+ for (node = rb_first(&gp->c_group_root); node;
+ node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ ioband_group_status(p, &sz, result, maxlen);
+ }
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ break;
+
+ case STATUSTYPE_TABLE:
+ spin_lock_irqsave(&dp->g_lock, flags);
+ DMEMIT("%s %s %d %d %s %s",
+ gp->c_dev->name, dp->g_name,
+ dp->g_io_throttle, dp->g_io_limit[0],
+ gp->c_type->t_name, dp->g_policy->p_name);
+ dp->g_show(gp, &sz, result, maxlen);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ break;
+ }
+
+ mutex_unlock(&dp->g_lock_device);
+ return 0;
+}
+
+static int ioband_group_type_select(struct ioband_group *gp, char *name)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct group_type *t;
+ unsigned long flags;
+
+ for (t = dm_ioband_group_type; (t->t_name); t++) {
+ if (!strcmp(name, t->t_name))
+ break;
+ }
+ if (!t->t_name) {
+ DMWARN("ioband type select: %s isn't supported.", name);
+ return -EINVAL;
+ }
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (!RB_EMPTY_ROOT(&gp->c_group_root)) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -EBUSY;
+ }
+ gp->c_type = t;
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+
+ return 0;
+}
+
+static int ioband_set_param(struct ioband_group *gp, char *cmd, char *value)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ char *val_str;
+ long id;
+ unsigned long flags;
+ int r;
+
+ r = split_string(value, &id, &val_str);
+ if (r)
+ return r;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (id != IOBAND_ID_ANY) {
+ gp = ioband_group_find(gp, id);
+ if (!gp) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ DMWARN("ioband_set_param: id=%ld not found.", id);
+ return -EINVAL;
+ }
+ }
+ r = dp->g_set_param(gp, cmd, val_str);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+}
+
+static int ioband_group_attach(struct ioband_group *gp, int id, char *param)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *sub_gp;
+ int r;
+
+ if (id < 0) {
+ DMWARN("ioband_group_attach: invalid id:%d", id);
+ return -EINVAL;
+ }
+ if (!gp->c_type->t_getid) {
+ DMWARN("ioband_group_attach: "
+ "no ioband group type is specified");
+ return -EINVAL;
+ }
+
+ sub_gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+ if (!sub_gp)
+ return -ENOMEM;
+
+ r = ioband_group_init(sub_gp, gp, dp, id, param);
+ if (r < 0) {
+ kfree(sub_gp);
+ return r;
+ }
+ return 0;
+}
+
+static int ioband_group_detach(struct ioband_group *gp, int id)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *sub_gp;
+ unsigned long flags;
+
+ if (id < 0) {
+ DMWARN("ioband_group_detach: invalid id:%d", id);
+ return -EINVAL;
+ }
+ spin_lock_irqsave(&dp->g_lock, flags);
+ sub_gp = ioband_group_find(gp, id);
+ if (!sub_gp) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+		DMWARN("ioband_group_detach: id=%d not found.", id);
+ return -EINVAL;
+ }
+
+ /*
+ * Todo: Calling suspend_ioband_device() before releasing the
+ * ioband group has a large overhead. Need improvement.
+ */
+ suspend_ioband_device(dp, flags, 0);
+ ioband_group_release(gp, sub_gp);
+ resume_ioband_device(dp);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return 0;
+}
+
+/*
+ * Message parameters:
+ * "policy" <name>
+ * e.g.
+ * "policy" "weight"
+ * "type" "none"|"pid"|"pgrp"|"node"|"cpuset"|"cgroup"|"user"|"gid"
+ * "io_throttle" <value>
+ * "io_limit" <value>
+ * "attach" <group id>
+ * "detach" <group id>
+ * "any-command" <group id>:<value>
+ * e.g.
+ * "weight" 0:<value>
+ * "token" 24:<value>
+ */
+static int __ioband_message(struct dm_target *ti,
+ unsigned int argc, char **argv)
+{
+ struct ioband_group *gp = ti->private, *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ long val;
+ int r = 0;
+ unsigned long flags;
+
+ if (argc == 1 && !strcmp(argv[0], "reset")) {
+ spin_lock_irqsave(&dp->g_lock, flags);
+ memset(gp->c_stat, 0, sizeof(gp->c_stat));
+ for (node = rb_first(&gp->c_group_root); node;
+ node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ memset(p->c_stat, 0, sizeof(p->c_stat));
+ }
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return 0;
+ }
+
+ if (argc != 2) {
+ DMWARN("Unrecognised band message received.");
+ return -EINVAL;
+ }
+ if (!strcmp(argv[0], "debug")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r || val < 0)
+ return -EINVAL;
+ ioband_debug = val;
+ return 0;
+ } else if (!strcmp(argv[0], "io_throttle")) {
+ r = strict_strtol(argv[1], 0, &val);
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (r || val < 0 ||
+ val > dp->g_io_limit[0] || val > dp->g_io_limit[1]) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -EINVAL;
+ }
+ dp->g_io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val;
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ ioband_set_param(gp, argv[0], argv[1]);
+ return 0;
+ } else if (!strcmp(argv[0], "io_limit")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r || val < 0)
+ return -EINVAL;
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (val == 0) {
+ struct request_queue *q;
+
+ q = bdev_get_queue(gp->c_dev->bdev);
+ if (!q) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -ENXIO;
+ }
+ val = q->nr_requests;
+ }
+ if (val < dp->g_io_throttle) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -EINVAL;
+ }
+ dp->g_io_limit[0] = dp->g_io_limit[1] = val;
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ ioband_set_param(gp, argv[0], argv[1]);
+ return 0;
+ } else if (!strcmp(argv[0], "type")) {
+ return ioband_group_type_select(gp, argv[1]);
+ } else if (!strcmp(argv[0], "attach")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r)
+ return r;
+ return ioband_group_attach(gp, val, NULL);
+ } else if (!strcmp(argv[0], "detach")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r)
+ return r;
+ return ioband_group_detach(gp, val);
+ } else if (!strcmp(argv[0], "policy")) {
+ r = policy_init(dp, argv[1], 0, &argv[2]);
+ return r;
+ } else {
+ /* message anycommand <group-id>:<value> */
+ r = ioband_set_param(gp, argv[0], argv[1]);
+ if (r < 0)
+ DMWARN("Unrecognised band message received.");
+ return r;
+ }
+ return 0;
+}
+
+static int ioband_message(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+ int r;
+
+ mutex_lock(&dp->g_lock_device);
+ r = __ioband_message(ti, argc, argv);
+ mutex_unlock(&dp->g_lock_device);
+ return r;
+}
+
+static struct target_type ioband_target = {
+ .name = "ioband",
+ .module = THIS_MODULE,
+ .version = {1, 4, 0},
+ .ctr = ioband_ctr,
+ .dtr = ioband_dtr,
+ .map = ioband_map,
+ .end_io = ioband_end_io,
+ .presuspend = ioband_presuspend,
+ .resume = ioband_resume,
+ .status = ioband_status,
+ .message = ioband_message,
+};
+
+static int __init dm_ioband_init(void)
+{
+ int r;
+
+ r = dm_register_target(&ioband_target);
+ if (r < 0) {
+ DMERR("register failed %d", r);
+ return r;
+ }
+ return r;
+}
+
+static void __exit dm_ioband_exit(void)
+{
+ int r;
+
+ r = dm_unregister_target(&ioband_target);
+ if (r < 0)
+ DMERR("unregister failed %d", r);
+}
+
+module_init(dm_ioband_init);
+module_exit(dm_ioband_exit);
+
+MODULE_DESCRIPTION(DM_NAME " I/O bandwidth control");
+MODULE_AUTHOR("Hirokazu Takahashi <[email protected]>, "
+	      "Ryo Tsuruta <[email protected]>");
+MODULE_LICENSE("GPL");
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-policy.c linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-policy.c
--- linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-policy.c 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-policy.c 2008-08-01 16:44:02.000000000 +0900
@@ -0,0 +1,447 @@
+/*
+ * Copyright (C) 2008 VA Linux Systems Japan K.K.
+ *
+ * I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "dm-bio-list.h"
+#include "dm-ioband.h"
+
+/*
+ * The following functions determine when and which BIOs should
+ * be submitted to control the I/O flow.
+ * New BIO scheduling policies can also be added here.
+ */
+
+
+/*
+ * Functions for weight balancing policy based on the number of I/Os.
+ */
+#define DEFAULT_WEIGHT 100
+#define DEFAULT_TOKENPOOL 2048
+#define DEFAULT_BUCKET 2
+#define IOBAND_IOPRIO_BASE 100
+#define TOKEN_BATCH_UNIT 20
+#define PROCEED_THRESHOLD 8
+#define LOCAL_ACTIVE_RATIO 8
+#define GLOBAL_ACTIVE_RATIO 16
+#define OVERCOMMIT_RATE 4
+
+/*
+ * Calculate the effective number of tokens this group has.
+ */
+static int get_token(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int token = gp->c_token;
+ int allowance = dp->g_epoch - gp->c_my_epoch;
+
+ if (allowance) {
+ if (allowance > dp->g_carryover)
+ allowance = dp->g_carryover;
+ token += gp->c_token_initial * allowance;
+ }
+ if (is_group_down(gp))
+ token += gp->c_token_initial * dp->g_carryover * 2;
+
+ return token;
+}
+
+/*
+ * Calculate the priority of a given group.
+ */
+static int iopriority(struct ioband_group *gp)
+{
+ return get_token(gp) * IOBAND_IOPRIO_BASE / gp->c_token_initial + 1;
+}
+
+/*
+ * This function is called when all the active groups on the same ioband
+ * device have used up their tokens. It makes a new global epoch so that
+ * all groups on this device will get freshly assigned tokens.
+ */
+static int make_global_epoch(struct ioband_device *dp)
+{
+ struct ioband_group *gp = dp->g_dominant;
+
+ /*
+ * Don't make a new epoch if the dominant group still has a lot of
+ * tokens, except when the I/O load is low.
+ */
+ if (gp) {
+ int iopri = iopriority(gp);
+ if (iopri * PROCEED_THRESHOLD > IOBAND_IOPRIO_BASE &&
+ dp->g_issued[0] + dp->g_issued[1] >= dp->g_io_throttle)
+ return 0;
+ }
+
+ dp->g_epoch++;
+ dprintk(KERN_ERR "make_epoch %d --> %d\n",
+ dp->g_epoch-1, dp->g_epoch);
+
+ /* The leftover tokens will be used in the next epoch. */
+ dp->g_token_extra = dp->g_token_left;
+ if (dp->g_token_extra < 0)
+ dp->g_token_extra = 0;
+ dp->g_token_left = dp->g_token_bucket;
+
+ dp->g_expired = NULL;
+ dp->g_dominant = NULL;
+
+ return 1;
+}
+
+/*
+ * This function is called when this group has used up its own tokens.
+ * It checks whether it's possible to make a new epoch for this group.
+ */
+static inline int make_epoch(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int allowance = dp->g_epoch - gp->c_my_epoch;
+
+ if (!allowance)
+ return 0;
+ if (allowance > dp->g_carryover)
+ allowance = dp->g_carryover;
+ gp->c_my_epoch = dp->g_epoch;
+ return allowance;
+}
+
+/*
+ * Check whether this group has tokens to issue an I/O. Return 0 if it
+ * doesn't have any, otherwise return the priority of this group.
+ */
+static int is_token_left(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int allowance;
+ int delta;
+ int extra;
+
+ if (gp->c_token > 0)
+ return iopriority(gp);
+
+ if (is_group_down(gp)) {
+ gp->c_token = gp->c_token_initial;
+ return iopriority(gp);
+ }
+ allowance = make_epoch(gp);
+ if (!allowance)
+ return 0;
+ /*
+ * If this group has the right to get tokens for several epochs,
+ * give all of them to the group here.
+ */
+ delta = gp->c_token_initial * allowance;
+ dp->g_token_left -= delta;
+ /*
+	 * Give this group some extra tokens when unused tokens are left
+	 * over on this ioband device from the previous epoch.
+ */
+ extra = dp->g_token_extra * gp->c_token_initial /
+ (dp->g_token_bucket - dp->g_token_extra/2);
+ delta += extra;
+ gp->c_token += delta;
+ gp->c_consumed = 0;
+
+ if (gp == dp->g_current)
+ dp->g_yield_mark += delta;
+ dprintk(KERN_ERR "refill token: "
+ "gp:%p token:%d->%d extra(%d) allowance(%d)\n",
+ gp, gp->c_token - delta, gp->c_token, extra, allowance);
+ if (gp->c_token > 0)
+ return iopriority(gp);
+ dprintk(KERN_ERR "refill token: yet empty gp:%p token:%d\n",
+ gp, gp->c_token);
+ return 0;
+}
+
+/*
+ * Use tokens to issue an I/O. After the operation, the number of tokens left
+ * on this group may become a negative value, which will be treated as debt.
+ */
+static int consume_token(struct ioband_group *gp, int count, int flag)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (gp->c_consumed * LOCAL_ACTIVE_RATIO < gp->c_token_initial &&
+ gp->c_consumed * GLOBAL_ACTIVE_RATIO < dp->g_token_bucket) {
+ ; /* Do nothing unless this group is really active. */
+ } else if (!dp->g_dominant ||
+ get_token(gp) > get_token(dp->g_dominant)) {
+ /*
+ * Regard this group as the dominant group on this
+		 * ioband device when it has a larger number of tokens
+		 * than the previous dominant group.
+ */
+ dp->g_dominant = gp;
+ }
+ if (dp->g_epoch == gp->c_my_epoch &&
+ gp->c_token > 0 && gp->c_token - count <= 0) {
+ /* Remember the last group which used up its own tokens. */
+ dp->g_expired = gp;
+ if (dp->g_dominant == gp)
+ dp->g_dominant = NULL;
+ }
+
+ if (gp != dp->g_current) {
+		/* Make this group the current group. */
+ dp->g_current = gp;
+ dp->g_yield_mark =
+ gp->c_token - (TOKEN_BATCH_UNIT << dp->g_token_unit);
+ }
+ gp->c_token -= count;
+ gp->c_consumed += count;
+ if (gp->c_token <= dp->g_yield_mark && !(flag & IOBAND_URGENT)) {
+ /*
+		 * Returning R_YIELD means that this policy requests
+		 * dm-ioband to give another group a chance to be selected,
+		 * since this group has already issued enough I/Os.
+ */
+ dp->g_current = NULL;
+ return R_YIELD;
+ }
+ /*
+	 * Returning R_OK means that this policy allows dm-ioband to select
+ * this group to issue I/Os without a break.
+ */
+ return R_OK;
+}
+
+/*
+ * Consume one token on each I/O.
+ */
+static int prepare_token(struct ioband_group *gp, struct bio *bio, int flag)
+{
+ return consume_token(gp, 1, flag);
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ */
+static int is_queue_full(struct ioband_group *gp)
+{
+ return gp->c_blocked >= gp->c_limit;
+}
+
+static void set_weight(struct ioband_group *gp, int new)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+
+ dp->g_weight_total += (new - gp->c_weight);
+ gp->c_weight = new;
+
+ if (dp->g_weight_total == 0) {
+ list_for_each_entry(p, &dp->g_groups, c_list)
+ p->c_token = p->c_token_initial = p->c_limit = 1;
+ } else {
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ p->c_token = p->c_token_initial =
+ dp->g_token_bucket * p->c_weight /
+ dp->g_weight_total + 1;
+ p->c_limit = (dp->g_io_limit[0] + dp->g_io_limit[1]) *
+ p->c_weight / dp->g_weight_total /
+ OVERCOMMIT_RATE + 1;
+ }
+ }
+}
+
+static void init_token_bucket(struct ioband_device *dp, int val)
+{
+ dp->g_token_bucket = ((dp->g_io_limit[0] + dp->g_io_limit[1]) *
+ DEFAULT_BUCKET) << dp->g_token_unit;
+ if (!val)
+ val = DEFAULT_TOKENPOOL << dp->g_token_unit;
+ else if (val < dp->g_token_bucket)
+ val = dp->g_token_bucket;
+ dp->g_carryover = val/dp->g_token_bucket;
+ dp->g_token_left = 0;
+}
+
+static int policy_weight_param(struct ioband_group *gp, char *cmd, char *value)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ long val;
+ int r = 0, err;
+
+ err = strict_strtol(value, 0, &val);
+ if (!strcmp(cmd, "weight")) {
+ if (!err && 0 < val && val <= SHORT_MAX)
+ set_weight(gp, val);
+ else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "token")) {
+ if (!err && val > 0) {
+ init_token_bucket(dp, val);
+ set_weight(gp, gp->c_weight);
+ dp->g_token_extra = 0;
+ } else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "io_limit")) {
+ init_token_bucket(dp, dp->g_token_bucket * dp->g_carryover);
+ set_weight(gp, gp->c_weight);
+ } else {
+ r = -EINVAL;
+ }
+ return r;
+}
+
+static int policy_weight_ctr(struct ioband_group *gp, char *arg)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (!arg)
+ arg = __stringify(DEFAULT_WEIGHT);
+ gp->c_my_epoch = dp->g_epoch;
+ gp->c_weight = 0;
+ gp->c_consumed = 0;
+ return policy_weight_param(gp, "weight", arg);
+}
+
+static void policy_weight_dtr(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ set_weight(gp, 0);
+ dp->g_dominant = NULL;
+ dp->g_expired = NULL;
+}
+
+static void policy_weight_show(struct ioband_group *gp, int *szp,
+ char *result, unsigned int maxlen)
+{
+ struct ioband_group *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ int sz = *szp; /* used in DMEMIT() */
+
+ DMEMIT(" %d :%d", dp->g_token_bucket * dp->g_carryover, gp->c_weight);
+
+ for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ DMEMIT(" %d:%d", p->c_id, p->c_weight);
+ }
+ *szp = sz;
+}
+
+/*
+ * <Method> <description>
+ * g_can_submit : To determine whether a given group has the right to
+ * submit BIOs. The larger the return value the higher the
+ * priority to submit. Zero means it has no right.
+ * g_prepare_bio : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ * of them can be submitted now. This method has to
+ * reinitialize its data so that BIO submission can restart,
+ * and return 0 or 1.
+ * The return value 0 means that it has become able to submit
+ * them, so this ioband device will continue its work.
+ * The return value 1 means that it is still unable to submit
+ * them, so this device will stop its work, and this policy
+ * module has to reactivate the device when it becomes able
+ * to submit BIOs again.
+ * g_hold_bio : To hold a given BIO until it is submitted.
+ * The default function is used when this method is undefined.
+ * g_pop_bio : To select and get the best BIO to submit.
+ * g_group_ctr : To initialize the policy's own members of struct ioband_group.
+ * g_group_dtr : Called when struct ioband_group is removed.
+ * g_set_param : To update the policy's own data.
+ * The parameters can be passed through "dmsetup message"
+ * command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ * Return 1 if a given group can't receive any more BIOs,
+ * otherwise return 0.
+ * g_show : Show the configuration.
+ */
+static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
+{
+ long val;
+ int r = 0;
+
+ if (argc < 1)
+ val = 0;
+ else {
+ r = strict_strtol(argv[0], 0, &val);
+ if (r || val < 0)
+ return -EINVAL;
+ }
+
+ dp->g_can_submit = is_token_left;
+ dp->g_prepare_bio = prepare_token;
+ dp->g_restart_bios = make_global_epoch;
+ dp->g_group_ctr = policy_weight_ctr;
+ dp->g_group_dtr = policy_weight_dtr;
+ dp->g_set_param = policy_weight_param;
+ dp->g_should_block = is_queue_full;
+ dp->g_show = policy_weight_show;
+
+ dp->g_epoch = 0;
+ dp->g_weight_total = 0;
+ dp->g_current = NULL;
+ dp->g_dominant = NULL;
+ dp->g_expired = NULL;
+ dp->g_token_extra = 0;
+ dp->g_token_unit = 0;
+ init_token_bucket(dp, val);
+ dp->g_token_left = dp->g_token_bucket;
+
+ return 0;
+}
+/* weight balancing policy based on the number of I/Os. --- End --- */
+
+
+/*
+ * Functions for weight balancing policy based on I/O size.
+ * It just borrows a lot of functions from the regular weight balancing policy.
+ */
+static int w2_prepare_token(struct ioband_group *gp, struct bio *bio, int flag)
+{
+ /* Consume tokens depending on the size of a given bio. */
+ return consume_token(gp, bio_sectors(bio), flag);
+}
+
+static int w2_policy_weight_init(struct ioband_device *dp,
+ int argc, char **argv)
+{
+ long val;
+ int r = 0;
+
+ if (argc < 1)
+ val = 0;
+ else {
+ r = strict_strtol(argv[0], 0, &val);
+ if (r || val < 0)
+ return -EINVAL;
+ }
+
+ r = policy_weight_init(dp, argc, argv);
+ if (r < 0)
+ return r;
+
+ dp->g_prepare_bio = w2_prepare_token;
+ dp->g_token_unit = PAGE_SHIFT - 9;
+ init_token_bucket(dp, val);
+ dp->g_token_left = dp->g_token_bucket;
+ return 0;
+}
+/* weight balancing policy based on I/O size. --- End --- */
+
+
+static int policy_default_init(struct ioband_device *dp,
+ int argc, char **argv)
+{
+ return policy_weight_init(dp, argc, argv);
+}
+
+struct policy_type dm_ioband_policy_type[] = {
+ {"default", policy_default_init},
+ {"weight", policy_weight_init},
+ {"weight-iosize", w2_policy_weight_init},
+ {NULL, policy_default_init}
+};
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-type.c linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-type.c
--- linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband-type.c 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/dm-ioband-type.c 2008-08-01 16:44:02.000000000 +0900
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2008 VA Linux Systems Japan K.K.
+ *
+ * I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include "dm.h"
+#include "dm-bio-list.h"
+#include "dm-ioband.h"
+
+/*
+ * Any I/O bandwidth can be divided into several bandwidth groups, each of which
+ * has its own unique ID. The following functions are called to determine
+ * which group a given BIO belongs to and return the ID of the group.
+ */
+
+/* ToDo: unsigned long value would be better for group ID */
+
+static int ioband_process_id(struct bio *bio)
+{
+ /*
+ * This function will work for KVM and Xen.
+ */
+ return (int)current->tgid;
+}
+
+static int ioband_process_group(struct bio *bio)
+{
+ return (int)task_pgrp_nr(current);
+}
+
+static int ioband_uid(struct bio *bio)
+{
+ return (int)current->uid;
+}
+
+static int ioband_gid(struct bio *bio)
+{
+ return (int)current->gid;
+}
+
+static int ioband_cpuset(struct bio *bio)
+{
+ return 0; /* not implemented yet */
+}
+
+static int ioband_node(struct bio *bio)
+{
+ return 0; /* not implemented yet */
+}
+
+static int ioband_cgroup(struct bio *bio)
+{
+ /*
+ * This function should return the ID of the cgroup which issued "bio".
+	 * The ID of the cgroup which the current process belongs to won't be
+	 * a suitable ID for this purpose, since some BIOs will be handled by
+	 * kernel threads like aio or pdflush on behalf of the process
+	 * requesting the BIOs.
+ */
+ return 0; /* not implemented yet */
+}
+
+struct group_type dm_ioband_group_type[] = {
+ {"none", NULL},
+ {"pgrp", ioband_process_group},
+ {"pid", ioband_process_id},
+ {"node", ioband_node},
+ {"cpuset", ioband_cpuset},
+ {"cgroup", ioband_cgroup},
+ {"user", ioband_uid},
+ {"uid", ioband_uid},
+ {"gid", ioband_gid},
+ {NULL, NULL}
+};
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband.h linux-2.6.27-rc1-mm1/drivers/md/dm-ioband.h
--- linux-2.6.27-rc1-mm1.orig/drivers/md/dm-ioband.h 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/dm-ioband.h 2008-08-01 16:44:02.000000000 +0900
@@ -0,0 +1,190 @@
+/*
+ * Copyright (C) 2008 VA Linux Systems Japan K.K.
+ *
+ * I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+
+#include <linux/version.h>
+#include <linux/wait.h>
+
+#define DEFAULT_IO_THROTTLE 4
+#define DEFAULT_IO_LIMIT 128
+#define IOBAND_NAME_MAX 31
+#define IOBAND_ID_ANY (-1)
+
+struct ioband_group;
+
+struct ioband_device {
+ struct list_head g_groups;
+ struct delayed_work g_conductor;
+ struct workqueue_struct *g_ioband_wq;
+ struct bio_list g_urgent_bios;
+ int g_io_throttle;
+ int g_io_limit[2];
+ int g_issued[2];
+ int g_blocked;
+ spinlock_t g_lock;
+ struct mutex g_lock_device;
+ wait_queue_head_t g_waitq;
+ wait_queue_head_t g_waitq_suspend;
+ wait_queue_head_t g_waitq_flush;
+
+ int g_ref;
+ struct list_head g_list;
+ int g_flags;
+ char g_name[IOBAND_NAME_MAX + 1];
+ struct policy_type *g_policy;
+
+ /* policy dependent */
+ int (*g_can_submit)(struct ioband_group *);
+ int (*g_prepare_bio)(struct ioband_group *, struct bio *, int);
+ int (*g_restart_bios)(struct ioband_device *);
+ void (*g_hold_bio)(struct ioband_group *, struct bio *);
+ struct bio * (*g_pop_bio)(struct ioband_group *);
+ int (*g_group_ctr)(struct ioband_group *, char *);
+ void (*g_group_dtr)(struct ioband_group *);
+ int (*g_set_param)(struct ioband_group *, char *cmd, char *value);
+ int (*g_should_block)(struct ioband_group *);
+ void (*g_show)(struct ioband_group *, int *, char *, unsigned int);
+
+ /* members for weight balancing policy */
+ int g_epoch;
+ int g_weight_total;
+ /* the number of tokens which can be used in every epoch */
+ int g_token_bucket;
+ /* how many epochs tokens can be carried over */
+ int g_carryover;
+ /* how many tokens should be used for one page-sized I/O */
+ int g_token_unit;
+ /* the last group which used a token */
+ struct ioband_group *g_current;
+	/* give another group a chance to be scheduled when the
+	   remaining tokens of the current group reach this mark */
+ int g_yield_mark;
+ /* the latest group which used up its tokens */
+ struct ioband_group *g_expired;
+	/* the group which has the largest number of tokens among
+	   the active groups */
+ struct ioband_group *g_dominant;
+ /* the number of unused tokens in this epoch */
+ int g_token_left;
+ /* left-over tokens from the previous epoch */
+ int g_token_extra;
+};
+
+struct ioband_group_stat {
+ unsigned long sectors;
+ unsigned long immediate;
+ unsigned long deferred;
+};
+
+struct ioband_group {
+ struct list_head c_list;
+ struct ioband_device *c_banddev;
+ struct dm_dev *c_dev;
+ struct dm_target *c_target;
+ struct bio_list c_blocked_bios;
+ struct bio_list c_prio_bios;
+ struct rb_root c_group_root;
+ struct rb_node c_group_node;
+ int c_id; /* should be unsigned long or unsigned long long */
+ char c_name[IOBAND_NAME_MAX + 1]; /* rfu */
+ int c_blocked;
+ int c_prio_blocked;
+ wait_queue_head_t c_waitq;
+ int c_flags;
+ struct ioband_group_stat c_stat[2]; /* hold rd/wr status */
+ struct group_type *c_type;
+
+ /* members for weight balancing policy */
+ int c_weight;
+ int c_my_epoch;
+ int c_token;
+ int c_token_initial;
+ int c_limit;
+ int c_consumed;
+
+ /* rfu */
+ /* struct bio_list c_ordered_tag_bios; */
+};
+
+#define IOBAND_URGENT 1
+
+#define DEV_BIO_BLOCKED 1
+#define DEV_SUSPENDED 2
+
+#define set_device_blocked(dp) ((dp)->g_flags |= DEV_BIO_BLOCKED)
+#define clear_device_blocked(dp) ((dp)->g_flags &= ~DEV_BIO_BLOCKED)
+#define is_device_blocked(dp) ((dp)->g_flags & DEV_BIO_BLOCKED)
+
+#define set_device_suspended(dp) ((dp)->g_flags |= DEV_SUSPENDED)
+#define clear_device_suspended(dp) ((dp)->g_flags &= ~DEV_SUSPENDED)
+#define is_device_suspended(dp) ((dp)->g_flags & DEV_SUSPENDED)
+
+#define IOG_PRIO_BIO_WRITE 1
+#define IOG_PRIO_QUEUE 2
+#define IOG_BIO_BLOCKED 4
+#define IOG_GOING_DOWN 8
+#define IOG_SUSPENDED 16
+#define IOG_NEED_UP 32
+
+#define R_OK 0
+#define R_BLOCK 1
+#define R_YIELD 2
+
+#define set_group_blocked(gp) ((gp)->c_flags |= IOG_BIO_BLOCKED)
+#define clear_group_blocked(gp) ((gp)->c_flags &= ~IOG_BIO_BLOCKED)
+#define is_group_blocked(gp) ((gp)->c_flags & IOG_BIO_BLOCKED)
+
+#define set_group_down(gp) ((gp)->c_flags |= IOG_GOING_DOWN)
+#define clear_group_down(gp) ((gp)->c_flags &= ~IOG_GOING_DOWN)
+#define is_group_down(gp) ((gp)->c_flags & IOG_GOING_DOWN)
+
+#define set_group_suspended(gp) ((gp)->c_flags |= IOG_SUSPENDED)
+#define clear_group_suspended(gp) ((gp)->c_flags &= ~IOG_SUSPENDED)
+#define is_group_suspended(gp) ((gp)->c_flags & IOG_SUSPENDED)
+
+#define set_group_need_up(gp) ((gp)->c_flags |= IOG_NEED_UP)
+#define clear_group_need_up(gp) ((gp)->c_flags &= ~IOG_NEED_UP)
+#define group_need_up(gp) ((gp)->c_flags & IOG_NEED_UP)
+
+#define set_prio_read(gp) ((gp)->c_flags |= IOG_PRIO_QUEUE)
+#define clear_prio_read(gp) ((gp)->c_flags &= ~IOG_PRIO_QUEUE)
+#define is_prio_read(gp) \
+	(((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE)) == IOG_PRIO_QUEUE)
+
+#define set_prio_write(gp) \
+ ((gp)->c_flags |= (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE))
+#define clear_prio_write(gp) \
+ ((gp)->c_flags &= ~(IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE))
+#define is_prio_write(gp) \
+	(((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE)) == \
+	 (IOG_PRIO_QUEUE|IOG_PRIO_BIO_WRITE))
+
+#define set_prio_queue(gp, direct) \
+	((gp)->c_flags |= (IOG_PRIO_QUEUE|(direct)))
+#define clear_prio_queue(gp) clear_prio_write(gp)
+#define is_prio_queue(gp) ((gp)->c_flags & IOG_PRIO_QUEUE)
+#define prio_queue_direct(gp) ((gp)->c_flags & IOG_PRIO_BIO_WRITE)
+
+
+struct policy_type {
+ const char *p_name;
+ int (*p_policy_init)(struct ioband_device *, int, char **);
+};
+
+extern struct policy_type dm_ioband_policy_type[];
+
+struct group_type {
+ const char *t_name;
+ int (*t_getid)(struct bio *);
+};
+
+extern struct group_type dm_ioband_group_type[];
+
+/* Just for debugging */
+extern long ioband_debug;
+#define dprintk(format, a...)					\
+	do {							\
+		if (ioband_debug > 0) {				\
+			ioband_debug--;				\
+			printk(format, ##a);			\
+		}						\
+	} while (0)
Here is the documentation covering the design overview, installation,
command reference, and examples.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <[email protected]>
Signed-off-by: Hirokazu Takahashi <[email protected]>
diff -uprN linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt
--- linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt 2008-08-01 16:44:02.000000000 +0900
@@ -0,0 +1,937 @@
+ Block I/O bandwidth control: dm-ioband
+
+ -------------------------------------------------------
+
+ Table of Contents
+
+ [1]What's dm-ioband all about?
+
+ [2]Differences from the CFQ I/O scheduler
+
+ [3]How dm-ioband works.
+
+ [4]Setup and Installation
+
+ [5]Getting started
+
+ [6]Command Reference
+
+ [7]Examples
+
+What's dm-ioband all about?
+
+ dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+ driver. Several jobs using the same physical device have to share its
+ bandwidth. dm-ioband gives bandwidth to each job according to its weight,
+ which can be set individually for each job.
+
+ A job is a group of processes with the same pid or pgrp or uid or a
+ virtual machine such as KVM or Xen. A job can also be a cgroup by applying
+ the bio-cgroup patch, which can be found at
+ http://people.valinux.co.jp/~ryov/bio-cgroup/.
+
+ +------+ +------+ +------+ +------+ +------+ +------+
+ |cgroup| |cgroup| | the | | pid | | pid | | the | jobs
+ | A | | B | |others| | X | | Y | |others|
+ +--|---+ +--|---+ +--|---+ +--|---+ +--|---+ +--|---+
+ +--V----+---V---+----V---+ +--V----+---V---+----V---+
+ | group | group | default| | group | group | default| ioband groups
+ | | | group | | | | group |
+ +-------+-------+--------+ +-------+-------+--------+
+ | ioband1 | | ioband2 | ioband devices
+ +-----------|------------+ +-----------|------------+
+ +-----------V--------------+-------------V------------+
+ | | |
+ | sdb1 | sdb2 | physical devices
+ +--------------------------+--------------------------+
+
+
+ --------------------------------------------------------------------------
+
+Differences from the CFQ I/O scheduler
+
+ Dm-ioband allows flexible configuration of bandwidth settings.
+
+ Dm-ioband can work with any type of I/O scheduler, such as the NOOP
+ scheduler, which is often chosen for high-end storage, since dm-ioband is
+ implemented outside the I/O scheduling layer. It allows both
+ partition-based and job-based --- a job being a group of processes ---
+ bandwidth control. In addition, a different configuration can be set on
+ each physical device to control its bandwidth.
+
+ Meanwhile, the current implementation of the CFQ scheduler has 8 I/O
+ priority levels, and all jobs whose processes have the same I/O priority
+ share the bandwidth assigned to that level. Moreover, I/O priority is an
+ attribute of a process, so it applies equally to all block devices.
+
+ --------------------------------------------------------------------------
+
+How dm-ioband works.
+
+ Every ioband device has one ioband group, which by default is called the
+ default group.
+
+ Ioband devices can also have extra ioband groups in them. Each ioband
+ group has a job to support and a weight. dm-ioband gives tokens to the
+ group in proportion to its weight.
+
+ A group passes on the I/O requests that its job issues to the underlying
+ layer as long as it has tokens left, while requests are blocked if there
+ aren't any tokens left in the group. Tokens are refilled once all of the
+ groups that have requests on a given physical device have used up their
+ tokens.
+
+ There are two policies for token consumption. One is that a token is
+ consumed for each I/O request. The other is that a token is consumed for
+ each I/O sector; for example, a single 4KB read request (8 sectors of
+ 512 bytes) consumes 8 tokens. A user can choose either policy.
+
+ With this approach, a job running on an ioband group with a large weight
+ is guaranteed a wide I/O bandwidth.
+
+ --------------------------------------------------------------------------
+
+Setup and Installation
+
+ Build a kernel with these options enabled:
+
+ CONFIG_MD
+ CONFIG_BLK_DEV_DM
+ CONFIG_DM_IOBAND
+
+
+ If compiled as a module, use modprobe to load dm-ioband.
+
+ # make modules
+ # make modules_install
+ # depmod -a
+ # modprobe dm-ioband
+
+
+ The "dmsetup targets" command shows all available device-mapper targets.
+ "ioband" is displayed if dm-ioband has been loaded.
+
+ # dmsetup targets
+ ioband v1.4.0
+
+
+ --------------------------------------------------------------------------
+
+Getting started
+
+ The following is a brief description of how to control the I/O bandwidth of
+ disks. In this description, we'll take one disk with two partitions as an
+ example target.
+
+ --------------------------------------------------------------------------
+
+ Create and map ioband devices
+
+ Create two ioband devices "ioband1" and "ioband2". "ioband1" is mapped
+ to "/dev/sda1" and has a weight of 40. "ioband2" is mapped to "/dev/sda2"
+ and has a weight of 10. "ioband1" can use 80% --- 40/(40+10)*100 --- of
+ the bandwidth of the physical disk "/dev/sda" while "ioband2" can use 20%.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
+ "weight 0 :40" | dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \
+ "weight 0 :10" | dmsetup create ioband2
+
+
+ If the commands are successful then the device files
+ "/dev/mapper/ioband1" and "/dev/mapper/ioband2" will have been created.
+
+ --------------------------------------------------------------------------
+
+ Additional bandwidth control
+
+ In this example two extra ioband groups are created on "ioband1". The
+ first group consists of all the processes with user-id 1000 and the second
+ group consists of all the processes with user-id 2000. Their weights are
+ 30 and 20 respectively.
+
+ # dmsetup message ioband1 0 type user
+ # dmsetup message ioband1 0 attach 1000
+ # dmsetup message ioband1 0 attach 2000
+ # dmsetup message ioband1 0 weight 1000:30
+ # dmsetup message ioband1 0 weight 2000:20
+
+
+ Now the processes in the user-id 1000 group can use 30% ---
+ 30/(30+20+40+10)*100 --- of the bandwidth of the physical disk.
+
+ Table 1. Weight assignments
+
+ +----------------------------------------------------------------+
+ | ioband device | ioband group | ioband weight |
+ |---------------+--------------------------------+---------------|
+ | ioband1 | user id 1000 | 30 |
+ |---------------+--------------------------------+---------------|
+ | ioband1 | user id 2000 | 20 |
+ |---------------+--------------------------------+---------------|
+ | ioband1 | default group(the other users) | 40 |
+ |---------------+--------------------------------+---------------|
+ | ioband2 | default group | 10 |
+ +----------------------------------------------------------------+
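As a quick check of the shares implied by Table 1: all four groups sit on one physical disk, so each group's share is its weight divided by the total weight.

```python
# Shares implied by Table 1. All four ioband groups share one physical
# disk, so each share is weight / total weight.
weights = {"uid 1000 (ioband1)": 30, "uid 2000 (ioband1)": 20,
           "default (ioband1)": 40, "default (ioband2)": 10}
total = sum(weights.values())            # 30 + 20 + 40 + 10 = 100
shares = {k: 100 * w // total for k, w in weights.items()}
print(shares)  # each group's percentage of the disk bandwidth
```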
+
+ --------------------------------------------------------------------------
+
+ Remove the ioband devices
+
+ Remove the ioband devices when no longer used.
+
+ # dmsetup remove ioband1
+ # dmsetup remove ioband2
+
+
+ --------------------------------------------------------------------------
+
+Command Reference
+
+ Create an ioband device
+
+ SYNOPSIS
+
+ dmsetup create IOBAND_DEVICE
+
+ DESCRIPTION
+
+ Create an ioband device with the given name IOBAND_DEVICE.
+ Generally, dmsetup reads a table from standard input. Each line of
+ the table specifies a single target and is of the form:
+
+ start_sector num_sectors "ioband" device_file ioband_device_id \
+ io_throttle io_limit ioband_group_type policy token_base \
+ :weight [ioband_group_id:weight...]
+
+
+ start_sector, num_sectors
+
+ The sector range of the underlying device where
+ dm-ioband maps.
+
+ ioband
+
+ Specify the string "ioband" as a target type.
+
+ device_file
+
+ Underlying device name.
+
+ ioband_device_id
+
+ The ID number for an ioband device. The same ID
+ must be set among the ioband devices that share the
+ same bandwidth, which means they work on the same
+ physical disk.
+
+ io_throttle
+
+ Dm-ioband starts to control the bandwidth when the
+ number of BIOs in progress exceeds this value. If 0
+ is specified, dm-ioband uses the default value.
+
+ io_limit
+
+ Dm-ioband blocks all I/O requests for the
+ IOBAND_DEVICE when the number of BIOs in progress
+ exceeds this value. If 0 is specified, dm-ioband uses
+ the default value.
+
+ ioband_group_type
+
+ Specify how to evaluate the ioband group ID. The
+ type must be one of "none", "user", "gid", "pid" or
+ "pgrp." The type "cgroup" is enabled by applying the
+ bio-cgroup patch. Specify "none" if you don't need
+ any ioband groups other than the default ioband
+ group.
+
+ policy
+
+ Specify bandwidth control policy. A user can choose
+ either policy "weight" or "weight-iosize."
+
+ weight
+
+ This policy controls bandwidth in
+ proportion to the weight of each
+ ioband group, based on the number
+ of I/O requests.
+
+ weight-iosize
+
+ This policy controls bandwidth in
+ proportion to the weight of each
+ ioband group, based on the number
+ of I/O sectors.
+
+ token_base
+
+ The number of tokens specified by token_base
+ will be distributed to all ioband groups in
+ proportion to their weights. If 0 is specified,
+ dm-ioband uses the default value.
+
+ ioband_group_id:weight
+
+ Set the weight of the ioband group specified by
+ ioband_group_id. If ioband_group_id is omitted, the
+ weight is assigned to the default ioband group.
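As a sanity check of the field order above, the table line can be split mechanically. The sketch below follows the synopsis; the function name is invented for illustration and is not part of dm-ioband.

```python
def parse_ioband_table(line):
    """Split an ioband table line into named fields (sketch only)."""
    f = line.split()
    assert f[2] == "ioband", "target type must be 'ioband'"
    params = {
        "start_sector": int(f[0]), "num_sectors": int(f[1]),
        "device_file": f[3], "ioband_device_id": f[4],
        "io_throttle": int(f[5]), "io_limit": int(f[6]),
        "ioband_group_type": f[7], "policy": f[8],
        "token_base": int(f[9]),
    }
    weights = {}
    for spec in f[10:]:               # ":100", "1000:80", ...
        gid, w = spec.split(":")
        weights[gid or "default"] = int(w)   # empty ID = default group
    params["weights"] = weights
    return params

t = parse_ioband_table(
    "0 1000000 ioband /dev/sda1 128 10 400 user weight 2048 :100 1000:80")
print(t["weights"])   # {'default': 100, '1000': 80}
```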
+
+ EXAMPLE
+
+ Create an ioband device with the following parameters:
+
+ * Starting sector = "0"
+
+ * The number of sectors = "$(blockdev --getsize /dev/sda1)"
+
+ * Target type = "ioband"
+
+ * Underlying device name = "/dev/sda1"
+
+ * Ioband device ID = "128"
+
+ * I/O throttle = "10"
+
+ * I/O limit = "400"
+
+ * Ioband group type = "user"
+
+ * Bandwidth control policy = "weight"
+
+ * Token base = "2048"
+
+ * Weight for the default ioband group = "100"
+
+ * Weight for the ioband group 1000 = "80"
+
+ * Weight for the ioband group 2000 = "20"
+
+ * Ioband device name = "ioband1"
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband" \
+ "/dev/sda1 128 10 400 user weight 2048 :100 1000:80 2000:20" \
+ | dmsetup create ioband1
+
+
+ Create two device groups (ID=1,2). The bandwidths of these
+ device groups will be individually controlled.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1" \
+ "0 0 none weight 0 :80" | dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1" \
+ "0 0 none weight 0 :20" | dmsetup create ioband2
+ # echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 2" \
+ "0 0 none weight 0 :60" | dmsetup create ioband3
+ # echo "0 $(blockdev --getsize /dev/sdb4) ioband /dev/sdb4 2" \
+ "0 0 none weight 0 :40" | dmsetup create ioband4
+
+
+ --------------------------------------------------------------------------
+
+ Remove the ioband device
+
+ SYNOPSIS
+
+ dmsetup remove IOBAND_DEVICE
+
+ DESCRIPTION
+
+ Remove the specified ioband device IOBAND_DEVICE. All the ioband
+ groups attached to the ioband device are also removed
+ automatically.
+
+ EXAMPLE
+
+ Remove ioband device "ioband1."
+
+ # dmsetup remove ioband1
+
+
+ --------------------------------------------------------------------------
+
+ Set an ioband group type
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 type TYPE
+
+ DESCRIPTION
+
+ Set the ioband group type of the specified ioband device
+ IOBAND_DEVICE. TYPE must be one of "none", "user", "gid", "pid" or
+ "pgrp." The type "cgroup" is enabled by applying the bio-cgroup
+ patch. Once the type is set, new ioband groups can be created on
+ IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Set the ioband group type of ioband device "ioband1" to "user."
+
+ # dmsetup message ioband1 0 type user
+
+
+ --------------------------------------------------------------------------
+
+ Create an ioband group
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 attach ID
+
+ DESCRIPTION
+
+ Create an ioband group and attach it to IOBAND_DEVICE. ID
+ specifies a user-id, group-id, process-id or process-group-id
+ depending on the ioband group type of IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Create an ioband group which consists of all processes with
+ user-id 1000 and attach it to ioband device "ioband1."
+
+ # dmsetup message ioband1 0 type user
+ # dmsetup message ioband1 0 attach 1000
+
+
+ --------------------------------------------------------------------------
+
+ Detach the ioband group
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 detach ID
+
+ DESCRIPTION
+
+ Detach the ioband group specified by ID from ioband device
+ IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Detach the ioband group with ID "2000" from ioband device
+ "ioband2."
+
+ # dmsetup message ioband2 0 detach 2000
+
+
+ --------------------------------------------------------------------------
+
+ Set bandwidth control policy
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 policy policy
+
+ DESCRIPTION
+
+ Set bandwidth control policy. This command applies to all ioband
+ devices which have the same ioband device ID as IOBAND_DEVICE. A
+ user can choose either policy "weight" or "weight-iosize."
+
+ weight
+
+ This policy controls bandwidth in proportion
+ to the weight of each ioband group, based on
+ the number of I/O requests.
+
+ weight-iosize
+
+ This policy controls bandwidth in proportion
+ to the weight of each ioband group, based on
+ the number of I/O sectors.
+
+ EXAMPLE
+
+ Set bandwidth control policy of ioband devices which have the
+ same ioband device ID as "ioband1" to "weight-iosize."
+
+ # dmsetup message ioband1 0 policy weight-iosize
+
+
+ --------------------------------------------------------------------------
+
+ Set the weight of an ioband group
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 weight VAL
+
+ dmsetup message IOBAND_DEVICE 0 weight ID:VAL
+
+ DESCRIPTION
+
+ Set the weight of the ioband group specified by ID. Set the
+ weight of the default ioband group of IOBAND_DEVICE if ID isn't
+ specified.
+
+ The following example means that "ioband1" can use 80% ---
+ 40/(40+10)*100 --- of the bandwidth of the physical disk while
+ "ioband2" can use 20%.
+
+ # dmsetup message ioband1 0 weight 40
+ # dmsetup message ioband2 0 weight 10
+
+
+ The following lines have the same effect as the above:
+
+ # dmsetup message ioband1 0 weight 4
+ # dmsetup message ioband2 0 weight 1
+
+
+ VAL must be an integer larger than 0. Only the ratio between the
+ weights matters. The default value, which is assigned to newly
+ created ioband groups, is 100.
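Because only the ratio of the weights matters, 40:10 and 4:1 produce identical shares. A one-line check of that equivalence, under the assumption that shares are simply weight divided by total weight:

```python
def shares(*weights):
    """Fraction of bandwidth each weight receives (illustrative only)."""
    total = sum(weights)
    return [w / total for w in weights]

print(shares(40, 10))   # [0.8, 0.2]
```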
+
+ EXAMPLE
+
+ Set the weight of the default ioband group of "ioband1" to 40.
+
+ # dmsetup message ioband1 0 weight 40
+
+
+ Set the weight of the ioband group of "ioband1" with ID "1000"
+ to 10.
+
+ # dmsetup message ioband1 0 weight 1000:10
+
+
+ --------------------------------------------------------------------------
+
+ Set the number of tokens
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 token VAL
+
+ DESCRIPTION
+
+ Set the number of tokens to VAL. When the ioband groups on the
+ physical device to which ioband device IOBAND_DEVICE belongs use
+ up their tokens, this number of tokens will be distributed to
+ them in proportion to their weights.
+
+ VAL must be an integer greater than 0. The default is 2048.
+
+ EXAMPLE
+
+ Set the number of tokens of the physical device to which
+ "ioband1" belongs to 256.
+
+ # dmsetup message ioband1 0 token 256
+
+
+ --------------------------------------------------------------------------
+
+ Set I/O throttling
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 io_throttle VAL
+
+ DESCRIPTION
+
+ Set the I/O throttling value of the physical disk to which
+ ioband device IOBAND_DEVICE belongs to VAL. Dm-ioband starts to
+ control the bandwidth when the number of BIOs in progress on the
+ physical disk exceeds this value.
+
+ EXAMPLE
+
+ Set the I/O throttling value of "ioband1" to 16.
+
+ # dmsetup message ioband1 0 io_throttle 16
+
+
+ --------------------------------------------------------------------------
+
+ Set I/O limiting
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 io_limit VAL
+
+ DESCRIPTION
+
+ Set the I/O limiting value of the physical disk to which ioband
+ device IOBAND_DEVICE belongs to VAL. Dm-ioband will block all I/O
+ requests for the physical device if the number of BIOs in progress
+ on the physical disk exceeds this value.
+
+ EXAMPLE
+
+ Set the I/O limiting value of "ioband1" to 128.
+
+ # dmsetup message ioband1 0 io_limit 128
+
+
+ --------------------------------------------------------------------------
+
+ Display settings
+
+ SYNOPSIS
+
+ dmsetup table --target ioband
+
+ DESCRIPTION
+
+ Display the current table for the ioband device. See the
+ "dmsetup create" command for information on the table format.
+
+ EXAMPLE
+
+ The following output shows the current table of "ioband1."
+
+ # dmsetup table --target ioband
+ ioband: 0 32129937 ioband1 8:29 128 10 400 user weight \
+ 2048 :100 1000:80 2000:20
+
+
+ --------------------------------------------------------------------------
+
+ Display Statistics
+
+ SYNOPSIS
+
+ dmsetup status --target ioband
+
+ DESCRIPTION
+
+ Display the statistics of all the ioband devices whose target
+ type is "ioband."
+
+ The output format is as follows. The first five columns show:
+
+ * ioband device name
+
+ * logical start sector of the device (must be 0)
+
+ * device size in sectors
+
+ * target type (must be "ioband")
+
+ * device group ID
+
+ The remaining columns show the statistics of each ioband group
+ on the ioband device. Each group uses seven columns for its
+ statistics.
+
+ * ioband group ID (-1 means default)
+
+ * total read requests
+
+ * delayed read requests
+
+ * total read sectors
+
+ * total write requests
+
+ * delayed write requests
+
+ * total write sectors
+
+ EXAMPLE
+
+ The following output shows the statistics of two ioband devices.
+ Ioband2 only has the default ioband group and ioband1 has three
+ (default, 1001, 1002) ioband groups.
+
+ # dmsetup status
+ ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352
+ ioband1: 0 44371467 ioband 128 -1 223 172 408 211 136 600 1001 \
+ 166 107 472 139 95 352 1002 211 146 520 210 147 504
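The status columns described above can be decoded mechanically. The sketch below follows the field layout given here; it is an illustration, not an official parser.

```python
def parse_status(line):
    """Decode one line of 'dmsetup status --target ioband' (sketch only)."""
    head, rest = line.split(":", 1)
    f = rest.split()
    dev = {"name": head, "start": int(f[0]), "size": int(f[1]),
           "target": f[2], "group_id": int(f[3])}
    names = ("reads", "delayed_reads", "read_sectors",
             "writes", "delayed_writes", "write_sectors")
    stats = f[4:]                     # seven columns per ioband group
    groups = {}
    for i in range(0, len(stats), 7):
        gid = int(stats[i])           # -1 means the default group
        groups[gid] = dict(zip(names, map(int, stats[i + 1:i + 7])))
    dev["groups"] = groups
    return dev

s = parse_status("ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352")
print(s["groups"][-1]["writes"])   # 122
```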
+
+
+ --------------------------------------------------------------------------
+
+ Reset status counter
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 reset
+
+ DESCRIPTION
+
+ Reset the statistics of ioband device IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Reset the statistics of "ioband1."
+
+ # dmsetup message ioband1 0 reset
+
+
+ --------------------------------------------------------------------------
+
+Examples
+
+ Example #1: Bandwidth control on Partitions
+
+ This example describes how to control the bandwidth with disk
+ partitions. The following diagram illustrates the configuration of this
+ example. You may want to run a database on /dev/mapper/ioband1 and web
+ applications on /dev/mapper/ioband2.
+
+ /mnt1 /mnt2 mount points
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +--------------------------+ +--------------------------+
+ | default group | | default group | ioband groups
+ | (80) | | (40) | (weight)
+ +-------------|------------+ +-------------|------------+
+ | |
+ +-------------V-------------+--------------V------------+
+ | /dev/sda1 | /dev/sda2 | physical devices
+ +---------------------------+---------------------------+
+
+
+ To setup the above configuration, follow these steps:
+
+ 1. Create ioband devices with the same device group ID and assign
+ weights of 80 and 40 to the default ioband groups respectively.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0" \
+ "none weight 0 :80" | dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0" \
+ "none weight 0 :40" | dmsetup create ioband2
+
+
+ 2. Create filesystems on the ioband devices and mount them.
+
+ # mkfs.ext3 /dev/mapper/ioband1
+ # mount /dev/mapper/ioband1 /mnt1
+
+ # mkfs.ext3 /dev/mapper/ioband2
+ # mount /dev/mapper/ioband2 /mnt2
+
+
+ --------------------------------------------------------------------------
+
+ Example #2: Bandwidth control on Logical Volumes
+
+ This example is similar to example #1 but uses LVM logical
+ volumes instead of disk partitions. It shows how to configure
+ ioband devices on two striped logical volumes.
+
+ /mnt1 /mnt2 mount points
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +--------------------------+ +--------------------------+
+ | default group | | default group | ioband groups
+ | (80) | | (40) | (weight)
+ +-------------|------------+ +-------------|------------+
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/lv0 | | /dev/mapper/lv1 | striped logical
+ | | | | volumes
+ +-------------------------------------------------------+
+ | vg0 | volume group
+ +-------------|----------------------------|------------+
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/sdb | | /dev/sdc | physical devices
+ +--------------------------+ +--------------------------+
+
+
+ To setup the above configuration, follow these steps:
+
+ 1. Initialize the partitions for use by LVM.
+
+ # pvcreate /dev/sdb
+ # pvcreate /dev/sdc
+
+
+ 2. Create a new volume group named "vg0" with /dev/sdb and /dev/sdc.
+
+ # vgcreate vg0 /dev/sdb /dev/sdc
+
+
+ 3. Create two logical volumes in "vg0." The volumes have to be striped.
+
+ # lvcreate -n lv0 -i 2 -l 64 vg0 -L 1024M
+ # lvcreate -n lv1 -i 2 -l 64 vg0 -L 1024M
+
+
+ The rest is the same as in example #1.
+
+ 4. Create ioband devices corresponding to each logical volume and
+ assign weights of 80 and 40 to the default ioband groups respectively.
+
+ # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv0)" \
+ "ioband /dev/mapper/vg0-lv0 1 0 0 none weight 0 :80" | \
+ dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv1)" \
+ "ioband /dev/mapper/vg0-lv1 1 0 0 none weight 0 :40" | \
+ dmsetup create ioband2
+
+
+ 5. Create filesystems on the ioband devices and mount them.
+
+ # mkfs.ext3 /dev/mapper/ioband1
+ # mount /dev/mapper/ioband1 /mnt1
+
+ # mkfs.ext3 /dev/mapper/ioband2
+ # mount /dev/mapper/ioband2 /mnt2
+
+
+ --------------------------------------------------------------------------
+
+ Example #3: Bandwidth control on processes
+
+ This example describes how to control the bandwidth with groups of
+ processes. You may want to run an additional application on the machine
+ described in example #1. This example shows how to add a new ioband
+ group for that application.
+
+ /mnt1 /mnt2 mount points
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +-------------+------------+ +-------------+------------+
+ | default | | user=1000 | default | ioband groups
+ | (80) | | (20) | (40) | (weight)
+ +-------------+------------+ +-------------+------------+
+ | |
+ +-------------V-------------+--------------V------------+
+ | /dev/sda1 | /dev/sda2 | physical device
+ +---------------------------+---------------------------+
+
+
+ The following shows how to set up a new ioband group on the machine that
+ is already configured as in example #1. The application will have a
+ weight of 20 and run with user-id 1000 on /dev/mapper/ioband2.
+
+ 1. Set the type of ioband2 to "user."
+
+ # dmsetup message ioband2 0 type user
+
+
+ 2. Create a new ioband group on ioband2.
+
+ # dmsetup message ioband2 0 attach 1000
+
+
+ 3. Assign a weight of 20 to this newly created ioband group.
+
+ # dmsetup message ioband2 0 weight 1000:20
+
+
+ --------------------------------------------------------------------------
+
+ Example #4: Bandwidth control for Xen virtual block devices
+
+ This example describes how to control the bandwidth for Xen virtual
+ block devices. The following diagram illustrates the configuration of this
+ example.
+
+ Virtual Machine 1 Virtual Machine 2 virtual machines
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/xvda1 | | /dev/xvda1 | virtual block
+ +-------------|------------+ +-------------|------------+ devices
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +--------------------------+ +--------------------------+
+ | default group | | default group | ioband groups
+ | (80) | | (40) | (weight)
+ +-------------|------------+ +-------------|------------+
+ | |
+ +-------------V-------------+--------------V------------+
+ | /dev/sda1 | /dev/sda2 | physical device
+ +---------------------------+---------------------------+
+
+
+ The following shows how to map ioband devices "ioband1" and "ioband2" to
+ virtual block device "/dev/xvda1" on Virtual Machine 1 and "/dev/xvda1"
+ on Virtual Machine 2 respectively, on the machine configured as in
+ example #1. Add the following lines to the configuration files that are
+ referenced when creating "Virtual Machine 1" and "Virtual Machine 2."
+
+ For "Virtual Machine 1"
+ disk = [ 'phy:/dev/mapper/ioband1,xvda,w' ]
+
+ For "Virtual Machine 2"
+ disk = [ 'phy:/dev/mapper/ioband2,xvda,w' ]
+
+
+ --------------------------------------------------------------------------
+
+ Example #5: Bandwidth control for Xen blktap devices
+
+ This example describes how to control the bandwidth for Xen virtual
+ block devices when Xen blktap devices are used. The following diagram
+ illustrates the configuration of this example.
+
+ Virtual Machine 1 Virtual Machine 2 virtual machines
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/xvda1 | | /dev/xvda1 | virtual block
+ +-------------|------------+ +-------------|------------+ devices
+ | |
+ +-------------V----------------------------V------------+
+ | /dev/mapper/ioband1 | ioband device
+ +---------------------------+---------------------------+
+ | default group | default group | ioband groups
+ | (80) | (40) | (weight)
+ +-------------|-------------+--------------|------------+
+ | |
+ +-------------|----------------------------|------------+
+ | +----------V----------+ +----------V---------+ |
+ | | vm1.img | | vm2.img | | disk image files
+ | +---------------------+ +--------------------+ |
+ | /vmdisk | mount point
+ +---------------------------|---------------------------+
+ |
+ +---------------------------V---------------------------+
+ | /dev/sda1 | physical device
+ +-------------------------------------------------------+
+
+
+ To setup the above configuration, follow these steps:
+
+ 1. Create an ioband device.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+ "1 0 0 none weight 0 :100" | dmsetup create ioband1
+
+
+ 2. Add the following lines to the configuration files that are
+ referenced when creating "Virtual Machine 1" and "Virtual Machine 2."
+ Disk image files "/vmdisk/vm1.img" and "/vmdisk/vm2.img" will be used.
+
+ For "Virtual Machine 1"
+ disk = [ 'tap:aio:/vmdisk/vm1.img,xvda,w', ]
+
+ For "Virtual Machine 2"
+ disk = [ 'tap:aio:/vmdisk/vm2.img,xvda,w', ]
+
+
+ 3. Run the virtual machines.
+
+ # xm create vm1
+ # xm create vm2
+
+
+ 4. Find out the process IDs of the daemons which control the blktap
+ devices.
+
+ # lsof /vmdisk/vm[12].img
+ COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
+ tapdisk 15011 root 11u REG 253,0 2147483648 48961 /vmdisk/vm1.img
+ tapdisk 15276 root 13u REG 253,0 2147483648 48962 /vmdisk/vm2.img
+
+
+ 5. Create new ioband groups for pid 15011 and pid 15276, the
+ process IDs of the tapdisks, and assign weights of 80 and 40 to the
+ groups respectively.
+
+ # dmsetup message ioband1 0 type pid
+ # dmsetup message ioband1 0 attach 15011
+ # dmsetup message ioband1 0 weight 15011:80
+ # dmsetup message ioband1 0 attach 15276
+ # dmsetup message ioband1 0 weight 15276:40
With this series of bio-cgroup patches, you can determine the owner of
any type of I/O, which enables dm-ioband, the I/O bandwidth controller,
to control block I/O bandwidth even when it accepts delayed write
requests.
Dm-ioband can find the owner cgroup of each request.
Others who work on I/O bandwidth throttling could also use this
functionality to control asynchronous I/Os with a little enhancement.
You have to apply the patch dm-ioband v1.4.0 before applying this series
of patches.
You also have to select the following config options when compiling the kernel:
CONFIG_CGROUPS=y
CONFIG_CGROUP_BIO=y
I also recommend selecting the options for the cgroup memory subsystem,
because this makes it possible to give both some I/O bandwidth and some
memory to a certain cgroup to control delayed write requests; the
processes in the cgroup will only be able to dirty pages inside the
cgroup, even when the given bandwidth is narrow.
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_MEM_RES_CTLR=y
This code is based on part of the cgroup memory subsystem. I don't think
the accuracy and overhead of the subsystem can be ignored at this time,
so we need to keep tuning it.
--------------------------------------------------------
The following shows how to use dm-ioband with cgroups.
Please assume that you want to make two cgroups, which we call "bio
cgroups" here, to track block I/Os and assign them to ioband device "ioband1".
First, mount the bio cgroup filesystem.
# mount -t cgroup -o bio none /cgroup/bio
Then, make new bio cgroups and put some processes in them.
# mkdir /cgroup/bio/bgroup1
# mkdir /cgroup/bio/bgroup2
# echo 1234 > /cgroup/bio/bgroup1/tasks
# echo 5678 > /cgroup/bio/bgroup2/tasks
Now, check the ID of each newly created bio cgroup.
# cat /cgroup/bio/bgroup1/bio.id
1
# cat /cgroup/bio/bgroup2/bio.id
2
Finally, attach the cgroups to "ioband1" and assign them weights.
# dmsetup message ioband1 0 type cgroup
# dmsetup message ioband1 0 attach 1
# dmsetup message ioband1 0 attach 2
# dmsetup message ioband1 0 weight 1:30
# dmsetup message ioband1 0 weight 2:60
You can also make use of the dm-ioband administration tool if you want.
The tool will be found here:
http://people.valinux.co.jp/~kaizuka/dm-ioband/iobandctl/manual.html
You can set up the device with the tool as follows.
In this case, you don't need to know the IDs of the cgroups.
# iobandctl.py group /dev/mapper/ioband1 cgroup /cgroup/bio/bgroup1:30 /cgroup/bio/bgroup2:60
This patch splits the cgroup memory subsystem into two parts.
One is for tracking pages to find out their owners. The other is
for controlling how much memory should be assigned to each cgroup.
With this patch, you can use the page tracking mechanism even if
the memory subsystem is off.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <[email protected]>
Signed-off-by: Hirokazu Takahashi <[email protected]>
diff -Ndupr linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h
--- linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h 2008-08-01 12:18:28.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h 2008-08-01 19:03:21.000000000 +0900
@@ -20,12 +20,62 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
+#include <linux/rcupdate.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+
struct mem_cgroup;
struct page_cgroup;
struct page;
struct mm_struct;
+#ifdef CONFIG_CGROUP_PAGE
+/*
+ * We use the lower bit of the page->page_cgroup pointer as a bit spin
+ * lock. We need to ensure that page->page_cgroup is at least two
+ * byte aligned (based on comments from Nick Piggin). But since
+ * bit_spin_lock doesn't actually set that lock bit in a non-debug
+ * uniprocessor kernel, we should avoid setting it here too.
+ */
+#define PAGE_CGROUP_LOCK_BIT 0x0
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
+#else
+#define PAGE_CGROUP_LOCK 0x0
+#endif
+
+/*
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ */
+struct page_cgroup {
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct list_head lru; /* per cgroup LRU list */
+ struct mem_cgroup *mem_cgroup;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ struct page *page;
+ int flags;
+};
+#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
+#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
+#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
+
+static inline void lock_page_cgroup(struct page *page)
+{
+ bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
+static inline int try_lock_page_cgroup(struct page *page)
+{
+ return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
+static inline void unlock_page_cgroup(struct page *page)
+{
+ bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
#define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0)
@@ -34,45 +84,15 @@ extern int mem_cgroup_charge(struct page
gfp_t gfp_mask);
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
-extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
-
-extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
- struct list_head *dst,
- unsigned long *scanned, int order,
- int mode, struct zone *z,
- struct mem_cgroup *mem_cont,
- int active, int file);
-extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
-
-extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-
-#define mm_match_cgroup(mm, cgroup) \
- ((cgroup) == mem_cgroup_from_task((mm)->owner))
extern int
mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
extern void mem_cgroup_end_migration(struct page *page);
+extern void page_cgroup_init(void);
-/*
- * For memory reclaim.
- */
-extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
-extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
-
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
- int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
- int priority);
-
-extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
- int priority, enum lru_list lru);
-
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
static inline void page_reset_bad_cgroup(struct page *page)
{
}
@@ -102,6 +122,53 @@ static inline void mem_cgroup_uncharge_c
{
}
+static inline int
+mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+{
+ return 0;
+}
+
+static inline void mem_cgroup_end_migration(struct page *page)
+{
+}
+#endif /* CONFIG_CGROUP_PAGE */
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+
+extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
+extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
+
+extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
+ struct list_head *dst,
+ unsigned long *scanned, int order,
+ int mode, struct zone *z,
+ struct mem_cgroup *mem_cont,
+ int active, int file);
+extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
+int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+
+extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+
+#define mm_match_cgroup(mm, cgroup) \
+ ((cgroup) == mem_cgroup_from_task((mm)->owner))
+
+/*
+ * For memory reclaim.
+ */
+extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
+extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
+
+extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
+extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
+ int priority);
+extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
+ int priority);
+
+extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+ int priority, enum lru_list lru);
+
+#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+
static inline int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask)
{
return 0;
@@ -122,16 +189,6 @@ static inline int task_in_mem_cgroup(str
return 1;
}
-static inline int
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
-{
- return 0;
-}
-
-static inline void mem_cgroup_end_migration(struct page *page)
-{
-}
-
static inline int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem)
{
return 0;
@@ -163,7 +220,8 @@ static inline long mem_cgroup_calc_recla
{
return 0;
}
-#endif /* CONFIG_CGROUP_MEM_CONT */
+
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
#endif /* _LINUX_MEMCONTROL_H */
diff -Ndupr linux-2.6.27-rc1-mm1-ioband/include/linux/mm_types.h linux-2.6.27-rc1-mm1.cg0/include/linux/mm_types.h
--- linux-2.6.27-rc1-mm1-ioband/include/linux/mm_types.h 2008-08-01 12:18:28.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/include/linux/mm_types.h 2008-08-01 19:03:21.000000000 +0900
@@ -92,7 +92,7 @@ struct page {
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
unsigned long page_cgroup;
#endif
diff -Ndupr linux-2.6.27-rc1-mm1-ioband/init/Kconfig linux-2.6.27-rc1-mm1.cg0/init/Kconfig
--- linux-2.6.27-rc1-mm1-ioband/init/Kconfig 2008-08-01 12:18:28.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/init/Kconfig 2008-08-01 19:03:21.000000000 +0900
@@ -418,6 +418,10 @@ config CGROUP_MEMRLIMIT_CTLR
memory RSS and Page Cache control. Virtual address space control
is provided by this controller.
+config CGROUP_PAGE
+ def_bool y
+ depends on CGROUP_MEM_RES_CTLR
+
config SYSFS_DEPRECATED
bool
diff -Ndupr linux-2.6.27-rc1-mm1-ioband/mm/Makefile linux-2.6.27-rc1-mm1.cg0/mm/Makefile
--- linux-2.6.27-rc1-mm1-ioband/mm/Makefile 2008-08-01 12:18:28.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/mm/Makefile 2008-08-01 19:03:21.000000000 +0900
@@ -34,5 +34,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += memcontrol.o
obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
diff -Ndupr linux-2.6.27-rc1-mm1-ioband/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c
--- linux-2.6.27-rc1-mm1-ioband/mm/memcontrol.c 2008-08-01 12:18:28.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c 2008-08-01 19:48:55.000000000 +0900
@@ -36,10 +36,25 @@
#include <asm/uaccess.h>
-struct cgroup_subsys mem_cgroup_subsys __read_mostly;
+enum charge_type {
+ MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
+ MEM_CGROUP_CHARGE_TYPE_MAPPED,
+ MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
+};
+
+static void __mem_cgroup_uncharge_common(struct page *, enum charge_type);
+
static struct kmem_cache *page_cgroup_cache __read_mostly;
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+struct cgroup_subsys mem_cgroup_subsys __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5
+static inline int mem_cgroup_disabled(void)
+{
+ return mem_cgroup_subsys.disabled;
+}
+
/*
* Statistics for memory cgroup.
*/
@@ -136,35 +151,6 @@ struct mem_cgroup {
};
static struct mem_cgroup init_mem_cgroup;
-/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin). But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK 0x0
-#endif
-
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
- struct list_head lru; /* per cgroup LRU list */
- struct page *page;
- struct mem_cgroup *mem_cgroup;
- int flags;
-};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
-
static int page_cgroup_nid(struct page_cgroup *pc)
{
return page_to_nid(pc->page);
@@ -175,12 +161,6 @@ static enum zone_type page_cgroup_zid(st
return page_zonenum(pc->page);
}
-enum charge_type {
- MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
- MEM_CGROUP_CHARGE_TYPE_MAPPED,
- MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
-};
-
/*
* Always modified under lru lock. Then, not necessary to preempt_disable()
*/
@@ -248,37 +228,6 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}
-static inline int page_cgroup_locked(struct page *page)
-{
- return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
-{
- VM_BUG_ON(!page_cgroup_locked(page));
- page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
-}
-
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
- return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
- bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
- return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
- bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
@@ -367,7 +316,7 @@ void mem_cgroup_move_lists(struct page *
struct mem_cgroup_per_zone *mz;
unsigned long flags;
- if (mem_cgroup_subsys.disabled)
+ if (mem_cgroup_disabled())
return;
/*
@@ -506,273 +455,6 @@ unsigned long mem_cgroup_isolate_pages(u
}
/*
- * Charge the memory controller for page usage.
- * Return
- * 0 if the charge was successful
- * < 0 if the cgroup is over its limit
- */
-static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype,
- struct mem_cgroup *memcg)
-{
- struct mem_cgroup *mem;
- struct page_cgroup *pc;
- unsigned long flags;
- unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct mem_cgroup_per_zone *mz;
-
- pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
- if (unlikely(pc == NULL))
- goto err;
-
- /*
- * We always charge the cgroup the mm_struct belongs to.
- * The mm_struct's mem_cgroup changes on task migration if the
- * thread group leader migrates. It's possible that mm is not
- * set, if so charge the init_mm (happens for pagecache usage).
- */
- if (likely(!memcg)) {
- rcu_read_lock();
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- /*
- * For every charge from the cgroup, increment reference count
- */
- css_get(&mem->css);
- rcu_read_unlock();
- } else {
- mem = memcg;
- css_get(&memcg->css);
- }
-
- while (res_counter_charge(&mem->res, PAGE_SIZE)) {
- if (!(gfp_mask & __GFP_WAIT))
- goto out;
-
- if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
- continue;
-
- /*
- * try_to_free_mem_cgroup_pages() might not give us a full
- * picture of reclaim. Some pages are reclaimed and might be
- * moved to swap cache or just unmapped from the cgroup.
- * Check the limit again to see if the reclaim reduced the
- * current usage of the cgroup before giving up
- */
- if (res_counter_check_under_limit(&mem->res))
- continue;
-
- if (!nr_retries--) {
- mem_cgroup_out_of_memory(mem, gfp_mask);
- goto out;
- }
- }
-
- pc->mem_cgroup = mem;
- pc->page = page;
- /*
- * If a page is accounted as a page cache, insert to inactive list.
- * If anon, insert to active list.
- */
- if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
- pc->flags = PAGE_CGROUP_FLAG_CACHE;
- if (page_is_file_cache(page))
- pc->flags |= PAGE_CGROUP_FLAG_FILE;
- else
- pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
- } else
- pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
-
- lock_page_cgroup(page);
- if (unlikely(page_get_page_cgroup(page))) {
- unlock_page_cgroup(page);
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
- kmem_cache_free(page_cgroup_cache, pc);
- goto done;
- }
- page_assign_page_cgroup(page, pc);
-
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-
- unlock_page_cgroup(page);
-done:
- return 0;
-out:
- css_put(&mem->css);
- kmem_cache_free(page_cgroup_cache, pc);
-err:
- return -ENOMEM;
-}
-
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
-{
- if (mem_cgroup_subsys.disabled)
- return 0;
-
- /*
- * If already mapped, we don't have to account.
- * If page cache, page->mapping has address_space.
- * But page->mapping may have out-of-use anon_vma pointer,
- * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
- * is NULL.
- */
- if (page_mapped(page) || (page->mapping && !PageAnon(page)))
- return 0;
- if (unlikely(!mm))
- mm = &init_mm;
- return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
-}
-
-int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask)
-{
- if (mem_cgroup_subsys.disabled)
- return 0;
-
- /*
- * Corner case handling. This is called from add_to_page_cache()
- * in usual. But some FS (shmem) precharges this page before calling it
- * and call add_to_page_cache() with GFP_NOWAIT.
- *
- * For GFP_NOWAIT case, the page may be pre-charged before calling
- * add_to_page_cache(). (See shmem.c) check it here and avoid to call
- * charge twice. (It works but has to pay a bit larger cost.)
- */
- if (!(gfp_mask & __GFP_WAIT)) {
- struct page_cgroup *pc;
-
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (pc) {
- VM_BUG_ON(pc->page != page);
- VM_BUG_ON(!pc->mem_cgroup);
- unlock_page_cgroup(page);
- return 0;
- }
- unlock_page_cgroup(page);
- }
-
- if (unlikely(!mm))
- mm = &init_mm;
-
- return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
-}
-
-/*
- * uncharge if !page_mapped(page)
- */
-static void
-__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
-{
- struct page_cgroup *pc;
- struct mem_cgroup *mem;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-
- if (mem_cgroup_subsys.disabled)
- return;
-
- /*
- * Check if our page_cgroup is valid
- */
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (unlikely(!pc))
- goto unlock;
-
- VM_BUG_ON(pc->page != page);
-
- if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
- && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
- || page_mapped(page)))
- goto unlock;
-
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-
- page_assign_page_cgroup(page, NULL);
- unlock_page_cgroup(page);
-
- mem = pc->mem_cgroup;
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
-
- kmem_cache_free(page_cgroup_cache, pc);
- return;
-unlock:
- unlock_page_cgroup(page);
-}
-
-void mem_cgroup_uncharge_page(struct page *page)
-{
- __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
-}
-
-void mem_cgroup_uncharge_cache_page(struct page *page)
-{
- VM_BUG_ON(page_mapped(page));
- __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
-}
-
-/*
- * Before starting migration, account against new page.
- */
-int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
-{
- struct page_cgroup *pc;
- struct mem_cgroup *mem = NULL;
- enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
- int ret = 0;
-
- if (mem_cgroup_subsys.disabled)
- return 0;
-
- lock_page_cgroup(page);
- pc = page_get_page_cgroup(page);
- if (pc) {
- mem = pc->mem_cgroup;
- css_get(&mem->css);
- if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
- ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
- }
- unlock_page_cgroup(page);
- if (mem) {
- ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
- ctype, mem);
- css_put(&mem->css);
- }
- return ret;
-}
-
-/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct page *newpage)
-{
- /*
- * At success, page->mapping is not NULL.
- * special rollback care is necessary when
- * 1. at migration failure. (newpage->mapping is cleared in this case)
- * 2. the newpage was moved but not remapped again because the task
- * exits and the newpage is obsolete. In this case, the new page
- * may be a swapcache. So, we just call mem_cgroup_uncharge_page()
- * always for avoiding mess. The page_cgroup will be removed if
- * unnecessary. File cache pages is still on radix-tree. Don't
- * care it.
- */
- if (!newpage->mapping)
- __mem_cgroup_uncharge_common(newpage,
- MEM_CGROUP_CHARGE_TYPE_FORCE);
- else if (PageAnon(newpage))
- mem_cgroup_uncharge_page(newpage);
-}
-
-/*
* A call to try to shrink memory usage under specified resource controller.
* This is typically used for page reclaiming for shmem for reducing side
* effect of page allocation from shmem, which is used by some mem_cgroup.
@@ -783,7 +465,7 @@ int mem_cgroup_shrink_usage(struct mm_st
int progress = 0;
int retry = MEM_CGROUP_RECLAIM_RETRIES;
- if (mem_cgroup_subsys.disabled)
+ if (mem_cgroup_disabled())
return 0;
rcu_read_lock();
@@ -1104,7 +786,7 @@ mem_cgroup_create(struct cgroup_subsys *
if (unlikely((cont->parent) == NULL)) {
mem = &init_mem_cgroup;
- page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
+ page_cgroup_init();
} else {
mem = mem_cgroup_alloc();
if (!mem)
@@ -1188,3 +870,325 @@ struct cgroup_subsys mem_cgroup_subsys =
.attach = mem_cgroup_move_task,
.early_init = 0,
};
+
+#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+static inline int mem_cgroup_disabled(void)
+{
+ return 1;
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+static inline int page_cgroup_locked(struct page *page)
+{
+ return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
+static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
+{
+ VM_BUG_ON(!page_cgroup_locked(page));
+ page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
+}
+
+struct page_cgroup *page_get_page_cgroup(struct page *page)
+{
+ return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
+}
+
+/*
+ * Charge the memory controller for page usage.
+ * Return
+ * 0 if the charge was successful
+ * < 0 if the cgroup is over its limit
+ */
+static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask, enum charge_type ctype,
+ struct mem_cgroup *memcg)
+{
+ struct page_cgroup *pc;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct mem_cgroup *mem;
+ unsigned long flags;
+ unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+ struct mem_cgroup_per_zone *mz;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+ pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
+ if (unlikely(pc == NULL))
+ goto err;
+
+ /*
+ * We always charge the cgroup the mm_struct belongs to.
+ * The mm_struct's mem_cgroup changes on task migration if the
+ * thread group leader migrates. It's possible that mm is not
+ * set, if so charge the init_mm (happens for pagecache usage).
+ */
+ if (likely(!memcg)) {
+ rcu_read_lock();
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ /*
+ * For every charge from the cgroup, increment reference count
+ */
+ css_get(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ rcu_read_unlock();
+ } else {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ mem = memcg;
+ css_get(&memcg->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ }
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ while (res_counter_charge(&mem->res, PAGE_SIZE)) {
+ if (!(gfp_mask & __GFP_WAIT))
+ goto out;
+
+ if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+ continue;
+
+ /*
+ * try_to_free_mem_cgroup_pages() might not give us a full
+ * picture of reclaim. Some pages are reclaimed and might be
+ * moved to swap cache or just unmapped from the cgroup.
+ * Check the limit again to see if the reclaim reduced the
+ * current usage of the cgroup before giving up
+ */
+ if (res_counter_check_under_limit(&mem->res))
+ continue;
+
+ if (!nr_retries--) {
+ mem_cgroup_out_of_memory(mem, gfp_mask);
+ goto out;
+ }
+ }
+ pc->mem_cgroup = mem;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+ pc->page = page;
+ /*
+ * If a page is accounted as a page cache, insert to inactive list.
+ * If anon, insert to active list.
+ */
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
+ pc->flags = PAGE_CGROUP_FLAG_CACHE;
+ if (page_is_file_cache(page))
+ pc->flags |= PAGE_CGROUP_FLAG_FILE;
+ else
+ pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+ } else
+ pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
+
+ lock_page_cgroup(page);
+ if (unlikely(page_get_page_cgroup(page))) {
+ unlock_page_cgroup(page);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ kmem_cache_free(page_cgroup_cache, pc);
+ goto done;
+ }
+ page_assign_page_cgroup(page, pc);
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ mz = page_cgroup_zoneinfo(pc);
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ __mem_cgroup_add_list(mz, pc);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+ unlock_page_cgroup(page);
+done:
+ return 0;
+out:
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ css_put(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ kmem_cache_free(page_cgroup_cache, pc);
+err:
+ return -ENOMEM;
+}
+
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
+{
+ if (mem_cgroup_disabled())
+ return 0;
+
+ /*
+ * If already mapped, we don't have to account.
+ * If page cache, page->mapping has address_space.
+ * But page->mapping may have out-of-use anon_vma pointer,
+	 * detect it by a PageAnon() check. A newly-mapped anon page's page->mapping
+ * is NULL.
+ */
+ if (page_mapped(page) || (page->mapping && !PageAnon(page)))
+ return 0;
+ if (unlikely(!mm))
+ mm = &init_mm;
+ return mem_cgroup_charge_common(page, mm, gfp_mask,
+ MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+}
+
+int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask)
+{
+ if (mem_cgroup_disabled())
+ return 0;
+
+ /*
+ * Corner case handling. This is called from add_to_page_cache()
+ * in usual. But some FS (shmem) precharges this page before calling it
+ * and call add_to_page_cache() with GFP_NOWAIT.
+ *
+ * For GFP_NOWAIT case, the page may be pre-charged before calling
+ * add_to_page_cache(). (See shmem.c) check it here and avoid to call
+ * charge twice. (It works but has to pay a bit larger cost.)
+ */
+ if (!(gfp_mask & __GFP_WAIT)) {
+ struct page_cgroup *pc;
+
+ lock_page_cgroup(page);
+ pc = page_get_page_cgroup(page);
+ if (pc) {
+ VM_BUG_ON(pc->page != page);
+ VM_BUG_ON(!pc->mem_cgroup);
+ unlock_page_cgroup(page);
+ return 0;
+ }
+ unlock_page_cgroup(page);
+ }
+
+ if (unlikely(!mm))
+ mm = &init_mm;
+
+ return mem_cgroup_charge_common(page, mm, gfp_mask,
+ MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
+}
+
+/*
+ * uncharge if !page_mapped(page)
+ */
+static void
+__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
+{
+ struct page_cgroup *pc;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct mem_cgroup *mem;
+ struct mem_cgroup_per_zone *mz;
+ unsigned long flags;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+ if (mem_cgroup_disabled())
+ return;
+
+ /*
+ * Check if our page_cgroup is valid
+ */
+ lock_page_cgroup(page);
+ pc = page_get_page_cgroup(page);
+ if (unlikely(!pc))
+ goto unlock;
+
+ VM_BUG_ON(pc->page != page);
+
+ if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
+ || page_mapped(page)
+ || PageSwapCache(page)))
+ goto unlock;
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ mz = page_cgroup_zoneinfo(pc);
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ __mem_cgroup_remove_list(mz, pc);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+ page_assign_page_cgroup(page, NULL);
+ unlock_page_cgroup(page);
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ mem = pc->mem_cgroup;
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ css_put(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+ kmem_cache_free(page_cgroup_cache, pc);
+ return;
+unlock:
+ unlock_page_cgroup(page);
+}
+
+void mem_cgroup_uncharge_page(struct page *page)
+{
+ __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+}
+
+void mem_cgroup_uncharge_cache_page(struct page *page)
+{
+ VM_BUG_ON(page_mapped(page));
+ __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
+}
+
+/*
+ * Before starting migration, account against new page.
+ */
+int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+{
+ struct page_cgroup *pc;
+ struct mem_cgroup *mem = NULL;
+ enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+ int ret = 0;
+
+ if (mem_cgroup_disabled())
+ return 0;
+
+ lock_page_cgroup(page);
+ pc = page_get_page_cgroup(page);
+ if (pc) {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ mem = pc->mem_cgroup;
+ css_get(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
+ ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
+ }
+ unlock_page_cgroup(page);
+ if (mem) {
+ ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
+ ctype, mem);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ css_put(&mem->css);
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ }
+ return ret;
+}
+
+/* remove redundant charge if migration failed*/
+void mem_cgroup_end_migration(struct page *newpage)
+{
+ /*
+	 * On success, page->mapping is not NULL.
+	 * Special rollback care is necessary when
+	 * 1. migration fails (newpage->mapping is cleared in this case), or
+	 * 2. the newpage was moved but not remapped again because the task
+	 *    exited and the newpage is obsolete. In this case, the new page
+	 *    may be a swapcache, so we always call mem_cgroup_uncharge_page()
+	 *    to avoid a mess; the page_cgroup will be removed if it is
+	 *    unnecessary. File cache pages are still on the radix tree,
+	 *    so don't worry about them.
+ */
+ if (!newpage->mapping)
+ __mem_cgroup_uncharge_common(newpage,
+ MEM_CGROUP_CHARGE_TYPE_FORCE);
+ else if (PageAnon(newpage))
+ mem_cgroup_uncharge_page(newpage);
+}
+
+void page_cgroup_init(void)
+{
+ if (!page_cgroup_cache)
+ page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
+}
This patch cleans up the code of the cgroup memory subsystem,
removing a lot of "#ifdef"s.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <[email protected]>
Signed-off-by: Hirokazu Takahashi <[email protected]>
diff -Ndupr linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c
--- linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c 2008-08-01 19:48:55.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c 2008-08-01 19:49:38.000000000 +0900
@@ -228,6 +228,47 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}
+static inline void get_mem_cgroup(struct mem_cgroup *mem)
+{
+ css_get(&mem->css);
+}
+
+static inline void put_mem_cgroup(struct mem_cgroup *mem)
+{
+ css_put(&mem->css);
+}
+
+static inline void set_mem_cgroup(struct page_cgroup *pc,
+ struct mem_cgroup *mem)
+{
+ pc->mem_cgroup = mem;
+}
+
+static inline void clear_mem_cgroup(struct page_cgroup *pc)
+{
+ struct mem_cgroup *mem = pc->mem_cgroup;
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ pc->mem_cgroup = NULL;
+ put_mem_cgroup(mem);
+}
+
+static inline struct mem_cgroup *get_mem_page_cgroup(struct page_cgroup *pc)
+{
+ struct mem_cgroup *mem = pc->mem_cgroup;
+ css_get(&mem->css);
+ return mem;
+}
+
+/* This should be called in an RCU-protected section. */
+static inline struct mem_cgroup *mm_get_mem_cgroup(struct mm_struct *mm)
+{
+ struct mem_cgroup *mem;
+
+ mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ get_mem_cgroup(mem);
+ return mem;
+}
+
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
@@ -297,6 +338,26 @@ static void __mem_cgroup_move_lists(stru
list_move(&pc->lru, &mz->lists[lru]);
}
+static inline void mem_cgroup_add_page(struct page_cgroup *pc)
+{
+ struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+ unsigned long flags;
+
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ __mem_cgroup_add_list(mz, pc);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+}
+
+static inline void mem_cgroup_remove_page(struct page_cgroup *pc)
+{
+ struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+ unsigned long flags;
+
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ __mem_cgroup_remove_list(mz, pc);
+ spin_unlock_irqrestore(&mz->lru_lock, flags);
+}
+
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
{
int ret;
@@ -339,6 +400,36 @@ void mem_cgroup_move_lists(struct page *
unlock_page_cgroup(page);
}
+static inline int mem_cgroup_try_to_allocate(struct mem_cgroup *mem,
+ gfp_t gfp_mask)
+{
+ unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+
+ while (res_counter_charge(&mem->res, PAGE_SIZE)) {
+ if (!(gfp_mask & __GFP_WAIT))
+ return -1;
+
+ if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+ continue;
+
+ /*
+ * try_to_free_mem_cgroup_pages() might not give us a full
+ * picture of reclaim. Some pages are reclaimed and might be
+ * moved to swap cache or just unmapped from the cgroup.
+ * Check the limit again to see if the reclaim reduced the
+ * current usage of the cgroup before giving up
+ */
+ if (res_counter_check_under_limit(&mem->res))
+ continue;
+
+ if (!nr_retries--) {
+ mem_cgroup_out_of_memory(mem, gfp_mask);
+ return -1;
+ }
+ }
+ return 0;
+}
+
/*
* Calculate mapped_ratio under memory controller. This will be used in
* vmscan.c for deteremining we have to reclaim mapped pages.
@@ -469,15 +560,14 @@ int mem_cgroup_shrink_usage(struct mm_st
return 0;
rcu_read_lock();
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- css_get(&mem->css);
+ mem = mm_get_mem_cgroup(mm);
rcu_read_unlock();
do {
progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
} while (!progress && --retry);
- css_put(&mem->css);
+ put_mem_cgroup(mem);
if (!retry)
return -ENOMEM;
return 0;
@@ -558,7 +648,7 @@ static int mem_cgroup_force_empty(struct
int ret = -EBUSY;
int node, zid;
- css_get(&mem->css);
+ get_mem_cgroup(mem);
/*
* page reclaim code (kswapd etc..) will move pages between
* active_list <-> inactive_list while we don't take a lock.
@@ -578,7 +668,7 @@ static int mem_cgroup_force_empty(struct
}
ret = 0;
out:
- css_put(&mem->css);
+ put_mem_cgroup(mem);
return ret;
}
@@ -873,10 +963,37 @@ struct cgroup_subsys mem_cgroup_subsys =
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+struct mem_cgroup;
+
static inline int mem_cgroup_disabled(void)
{
return 1;
}
+
+static inline void mem_cgroup_add_page(struct page_cgroup *pc) {}
+static inline void mem_cgroup_remove_page(struct page_cgroup *pc) {}
+static inline void get_mem_cgroup(struct mem_cgroup *mem) {}
+static inline void put_mem_cgroup(struct mem_cgroup *mem) {}
+static inline void set_mem_cgroup(struct page_cgroup *pc,
+ struct mem_cgroup *mem) {}
+static inline void clear_mem_cgroup(struct page_cgroup *pc) {}
+
+static inline struct mem_cgroup *get_mem_page_cgroup(struct page_cgroup *pc)
+{
+ return NULL;
+}
+
+static inline struct mem_cgroup *mm_get_mem_cgroup(struct mm_struct *mm)
+{
+ return NULL;
+}
+
+static inline int mem_cgroup_try_to_allocate(struct mem_cgroup *mem,
+ gfp_t gfp_mask)
+{
+ return 0;
+}
+
#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
static inline int page_cgroup_locked(struct page *page)
@@ -906,12 +1023,7 @@ static int mem_cgroup_charge_common(stru
struct mem_cgroup *memcg)
{
struct page_cgroup *pc;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem;
- unsigned long flags;
- unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct mem_cgroup_per_zone *mz;
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
if (unlikely(pc == NULL))
@@ -925,47 +1037,16 @@ static int mem_cgroup_charge_common(stru
*/
if (likely(!memcg)) {
rcu_read_lock();
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- /*
- * For every charge from the cgroup, increment reference count
- */
- css_get(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ mem = mm_get_mem_cgroup(mm);
rcu_read_unlock();
} else {
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
mem = memcg;
- css_get(&memcg->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
- }
-
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- while (res_counter_charge(&mem->res, PAGE_SIZE)) {
- if (!(gfp_mask & __GFP_WAIT))
- goto out;
-
- if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
- continue;
-
- /*
- * try_to_free_mem_cgroup_pages() might not give us a full
- * picture of reclaim. Some pages are reclaimed and might be
- * moved to swap cache or just unmapped from the cgroup.
- * Check the limit again to see if the reclaim reduced the
- * current usage of the cgroup before giving up
- */
- if (res_counter_check_under_limit(&mem->res))
- continue;
-
- if (!nr_retries--) {
- mem_cgroup_out_of_memory(mem, gfp_mask);
- goto out;
- }
+ get_mem_cgroup(mem);
}
- pc->mem_cgroup = mem;
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ if (mem_cgroup_try_to_allocate(mem, gfp_mask) < 0)
+ goto out;
+ set_mem_cgroup(pc, mem);
pc->page = page;
/*
* If a page is accounted as a page cache, insert to inactive list.
@@ -983,29 +1064,19 @@ static int mem_cgroup_charge_common(stru
lock_page_cgroup(page);
if (unlikely(page_get_page_cgroup(page))) {
unlock_page_cgroup(page);
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ clear_mem_cgroup(pc);
kmem_cache_free(page_cgroup_cache, pc);
goto done;
}
page_assign_page_cgroup(page, pc);
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_add_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ mem_cgroup_add_page(pc);
unlock_page_cgroup(page);
done:
return 0;
out:
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- css_put(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ put_mem_cgroup(mem);
kmem_cache_free(page_cgroup_cache, pc);
err:
return -ENOMEM;
@@ -1074,11 +1145,6 @@ static void
__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
{
struct page_cgroup *pc;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- struct mem_cgroup *mem;
- struct mem_cgroup_per_zone *mz;
- unsigned long flags;
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
if (mem_cgroup_disabled())
return;
@@ -1099,21 +1165,12 @@ __mem_cgroup_uncharge_common(struct page
|| PageSwapCache(page)))
goto unlock;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- mz = page_cgroup_zoneinfo(pc);
- spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_remove_list(mz, pc);
- spin_unlock_irqrestore(&mz->lru_lock, flags);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ mem_cgroup_remove_page(pc);
page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- mem = pc->mem_cgroup;
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- css_put(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ clear_mem_cgroup(pc);
kmem_cache_free(page_cgroup_cache, pc);
return;
@@ -1148,10 +1205,7 @@ int mem_cgroup_prepare_migration(struct
lock_page_cgroup(page);
pc = page_get_page_cgroup(page);
if (pc) {
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- mem = pc->mem_cgroup;
- css_get(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ mem = get_mem_page_cgroup(pc);
if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
}
@@ -1159,9 +1213,7 @@ int mem_cgroup_prepare_migration(struct
if (mem) {
ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
ctype, mem);
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
- css_put(&mem->css);
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ put_mem_cgroup(mem);
}
return ret;
}
This patch implements bio-cgroup on top of the cgroup memory subsystem.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <[email protected]>
Signed-off-by: Hirokazu Takahashi <[email protected]>
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c
--- linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c 2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c 2008-08-01 19:18:38.000000000 +0900
@@ -84,24 +84,28 @@ void exit_io_context(void)
}
}
+void init_io_context(struct io_context *ioc)
+{
+ atomic_set(&ioc->refcount, 1);
+ atomic_set(&ioc->nr_tasks, 1);
+ spin_lock_init(&ioc->lock);
+ ioc->ioprio_changed = 0;
+ ioc->ioprio = 0;
+ ioc->last_waited = jiffies; /* doesn't matter... */
+ ioc->nr_batch_requests = 0; /* because this is 0 */
+ ioc->aic = NULL;
+ INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+ INIT_HLIST_HEAD(&ioc->cic_list);
+ ioc->ioc_data = NULL;
+}
+
struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
{
struct io_context *ret;
ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
- if (ret) {
- atomic_set(&ret->refcount, 1);
- atomic_set(&ret->nr_tasks, 1);
- spin_lock_init(&ret->lock);
- ret->ioprio_changed = 0;
- ret->ioprio = 0;
- ret->last_waited = jiffies; /* doesn't matter... */
- ret->nr_batch_requests = 0; /* because this is 0 */
- ret->aic = NULL;
- INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
- INIT_HLIST_HEAD(&ret->cic_list);
- ret->ioc_data = NULL;
- }
+ if (ret)
+ init_io_context(ret);
return ret;
}
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h 2008-08-01 19:21:56.000000000 +0900
@@ -0,0 +1,159 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/memcontrol.h>
+
+#ifndef _LINUX_BIOCONTROL_H
+#define _LINUX_BIOCONTROL_H
+
+#ifdef CONFIG_CGROUP_BIO
+
+struct io_context;
+struct block_device;
+
+struct bio_cgroup {
+ struct cgroup_subsys_state css;
+ int id;
+ struct io_context *io_context; /* default io_context */
+/* struct radix_tree_root io_context_root; per device io_context */
+ spinlock_t page_list_lock;
+ struct list_head page_list;
+};
+
+static inline int bio_cgroup_disabled(void)
+{
+ return bio_cgroup_subsys.disabled;
+}
+
+static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
+ struct bio_cgroup, css);
+}
+
+static inline void __bio_cgroup_add_page(struct page_cgroup *pc)
+{
+ struct bio_cgroup *biog = pc->bio_cgroup;
+ list_add(&pc->blist, &biog->page_list);
+}
+
+static inline void bio_cgroup_add_page(struct page_cgroup *pc)
+{
+ struct bio_cgroup *biog = pc->bio_cgroup;
+ unsigned long flags;
+ spin_lock_irqsave(&biog->page_list_lock, flags);
+ __bio_cgroup_add_page(pc);
+ spin_unlock_irqrestore(&biog->page_list_lock, flags);
+}
+
+static inline void __bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+ list_del_init(&pc->blist);
+}
+
+static inline void bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+ struct bio_cgroup *biog = pc->bio_cgroup;
+ unsigned long flags;
+ spin_lock_irqsave(&biog->page_list_lock, flags);
+ __bio_cgroup_remove_page(pc);
+ spin_unlock_irqrestore(&biog->page_list_lock, flags);
+}
+
+static inline void get_bio_cgroup(struct bio_cgroup *biog)
+{
+ css_get(&biog->css);
+}
+
+static inline void put_bio_cgroup(struct bio_cgroup *biog)
+{
+ css_put(&biog->css);
+}
+
+static inline void set_bio_cgroup(struct page_cgroup *pc,
+ struct bio_cgroup *biog)
+{
+ pc->bio_cgroup = biog;
+}
+
+static inline void clear_bio_cgroup(struct page_cgroup *pc)
+{
+ struct bio_cgroup *biog = pc->bio_cgroup;
+ pc->bio_cgroup = NULL;
+ put_bio_cgroup(biog);
+}
+
+static inline struct bio_cgroup *get_bio_page_cgroup(struct page_cgroup *pc)
+{
+ struct bio_cgroup *biog = pc->bio_cgroup;
+ css_get(&biog->css);
+ return biog;
+}
+
+/* This should be called in an RCU-protected section. */
+static inline struct bio_cgroup *mm_get_bio_cgroup(struct mm_struct *mm)
+{
+ struct bio_cgroup *biog;
+ biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
+ get_bio_cgroup(biog);
+ return biog;
+}
+
+extern struct io_context *get_bio_cgroup_iocontext(struct bio *bio);
+
+#else /* CONFIG_CGROUP_BIO */
+
+struct bio_cgroup;
+
+static inline int bio_cgroup_disabled(void)
+{
+ return 1;
+}
+
+static inline void bio_cgroup_add_page(struct page_cgroup *pc)
+{
+}
+
+static inline void bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+}
+
+static inline void get_bio_cgroup(struct bio_cgroup *biog)
+{
+}
+
+static inline void put_bio_cgroup(struct bio_cgroup *biog)
+{
+}
+
+static inline void set_bio_cgroup(struct page_cgroup *pc,
+ struct bio_cgroup *biog)
+{
+}
+
+static inline void clear_bio_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline struct bio_cgroup *get_bio_page_cgroup(struct page_cgroup *pc)
+{
+ return NULL;
+}
+
+static inline struct bio_cgroup *mm_get_bio_cgroup(struct mm_struct *mm)
+{
+ return NULL;
+}
+
+static inline int get_bio_cgroup_id(struct page *page)
+{
+ return 0;
+}
+
+static inline struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_CGROUP_BIO */
+
+#endif /* _LINUX_BIOCONTROL_H */
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/cgroup_subsys.h linux-2.6.27-rc1-mm1.cg2/include/linux/cgroup_subsys.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/cgroup_subsys.h 2008-08-01 12:18:28.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/cgroup_subsys.h 2008-08-01 19:18:38.000000000 +0900
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
/* */
+#ifdef CONFIG_CGROUP_BIO
+SUBSYS(bio_cgroup)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/iocontext.h linux-2.6.27-rc1-mm1.cg2/include/linux/iocontext.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/iocontext.h 2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/iocontext.h 2008-08-01 19:18:38.000000000 +0900
@@ -83,6 +83,8 @@ struct io_context {
struct radix_tree_root radix_root;
struct hlist_head cic_list;
void *ioc_data;
+
+ int id; /* cgroup ID */
};
static inline struct io_context *ioc_task_link(struct io_context *ioc)
@@ -104,6 +106,7 @@ int put_io_context(struct io_context *io
void exit_io_context(void);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
void copy_io_context(struct io_context **pdst, struct io_context **psrc);
#else
static inline void exit_io_context(void)
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/memcontrol.h linux-2.6.27-rc1-mm1.cg2/include/linux/memcontrol.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/memcontrol.h 2008-08-01 19:03:21.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/memcontrol.h 2008-08-01 19:22:10.000000000 +0900
@@ -54,6 +54,10 @@ struct page_cgroup {
struct list_head lru; /* per cgroup LRU list */
struct mem_cgroup *mem_cgroup;
#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+#ifdef CONFIG_CGROUP_BIO
+ struct list_head blist; /* for bio_cgroup page list */
+ struct bio_cgroup *bio_cgroup;
+#endif
struct page *page;
int flags;
};
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/init/Kconfig linux-2.6.27-rc1-mm1.cg2/init/Kconfig
--- linux-2.6.27-rc1-mm1.cg1/init/Kconfig 2008-08-01 19:03:21.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/init/Kconfig 2008-08-01 19:18:38.000000000 +0900
@@ -418,9 +418,20 @@ config CGROUP_MEMRLIMIT_CTLR
memory RSS and Page Cache control. Virtual address space control
is provided by this controller.
+config CGROUP_BIO
+ bool "Block I/O cgroup subsystem"
+ depends on CGROUPS
+ select MM_OWNER
+ help
+ Provides a Resource Controller which enables tracking of the
+ owner of every block I/O.
+ The information this subsystem provides can be used from any
+ kind of module such as dm-ioband device mapper modules or
+ the cfq-scheduler.
+
config CGROUP_PAGE
def_bool y
- depends on CGROUP_MEM_RES_CTLR
+ depends on CGROUP_MEM_RES_CTLR || CGROUP_BIO
config SYSFS_DEPRECATED
bool
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/mm/Makefile linux-2.6.27-rc1-mm1.cg2/mm/Makefile
--- linux-2.6.27-rc1-mm1.cg1/mm/Makefile 2008-08-01 19:03:21.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/mm/Makefile 2008-08-01 19:18:38.000000000 +0900
@@ -35,4 +35,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_PAGE) += memcontrol.o
+obj-$(CONFIG_CGROUP_BIO) += biocontrol.o
obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/mm/biocontrol.c linux-2.6.27-rc1-mm1.cg2/mm/biocontrol.c
--- linux-2.6.27-rc1-mm1.cg1/mm/biocontrol.c 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/mm/biocontrol.c 2008-08-01 19:35:51.000000000 +0900
@@ -0,0 +1,233 @@
+/* biocontrol.c - Block I/O Controller
+ *
+ * Copyright IBM Corporation, 2007
+ * Author Balbir Singh <[email protected]>
+ *
+ * Copyright 2007 OpenVZ SWsoft Inc
+ * Author: Pavel Emelianov <[email protected]>
+ *
+ * Copyright VA Linux Systems Japan, 2008
+ * Author Hirokazu Takahashi <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/blkdev.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/idr.h>
+#include <linux/err.h>
+#include <linux/biocontrol.h>
+
+/* return corresponding bio_cgroup object of a cgroup */
+static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, bio_cgroup_subsys_id),
+ struct bio_cgroup, css);
+}
+
+static struct idr bio_cgroup_id;
+static DEFINE_SPINLOCK(bio_cgroup_idr_lock);
+
+static struct cgroup_subsys_state *
+bio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct bio_cgroup *biog;
+ struct io_context *ioc;
+ int error;
+
+ if (!cgrp->parent) {
+ static struct bio_cgroup default_bio_cgroup;
+ static struct io_context default_bio_io_context;
+
+ biog = &default_bio_cgroup;
+ ioc = &default_bio_io_context;
+ init_io_context(ioc);
+
+ idr_init(&bio_cgroup_id);
+ biog->id = 0;
+
+ page_cgroup_init();
+ } else {
+ biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+ ioc = alloc_io_context(GFP_KERNEL, -1);
+ if (!ioc || !biog) {
+ error = -ENOMEM;
+ goto out;
+ }
+retry:
+ if (unlikely(!idr_pre_get(&bio_cgroup_id, GFP_KERNEL))) {
+ error = -EAGAIN;
+ goto out;
+ }
+ spin_lock_irq(&bio_cgroup_idr_lock);
+ error = idr_get_new_above(&bio_cgroup_id,
+ (void *)biog, 1, &biog->id);
+ spin_unlock_irq(&bio_cgroup_idr_lock);
+ if (error == -EAGAIN)
+ goto retry;
+ else if (error)
+ goto out;
+ }
+
+ ioc->id = biog->id;
+ biog->io_context = ioc;
+
+ INIT_LIST_HEAD(&biog->page_list);
+ spin_lock_init(&biog->page_list_lock);
+
+ /* Bind the cgroup to bio_cgroup object we just created */
+ biog->css.cgroup = cgrp;
+
+ return &biog->css;
+out:
+ if (ioc)
+ put_io_context(ioc);
+ kfree(biog);
+ return ERR_PTR(error);
+}
+
+#define FORCE_UNCHARGE_BATCH (128)
+static void bio_cgroup_force_empty(struct bio_cgroup *biog)
+{
+ struct page_cgroup *pc;
+ struct page *page;
+ int count = FORCE_UNCHARGE_BATCH;
+ struct list_head *list = &biog->page_list;
+ unsigned long flags;
+
+ spin_lock_irqsave(&biog->page_list_lock, flags);
+ while (!list_empty(list)) {
+ pc = list_entry(list->prev, struct page_cgroup, blist);
+ page = pc->page;
+ get_page(page);
+ spin_unlock_irqrestore(&biog->page_list_lock, flags);
+ mem_cgroup_uncharge_page(page);
+ put_page(page);
+ if (--count <= 0) {
+ count = FORCE_UNCHARGE_BATCH;
+ cond_resched();
+ }
+ spin_lock_irqsave(&biog->page_list_lock, flags);
+ }
+ spin_unlock_irqrestore(&biog->page_list_lock, flags);
+ return;
+}
+
+static void bio_cgroup_pre_destroy(struct cgroup_subsys *ss,
+ struct cgroup *cgrp)
+{
+ struct bio_cgroup *biog = cgroup_bio(cgrp);
+ bio_cgroup_force_empty(biog);
+}
+
+static void bio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct bio_cgroup *biog = cgroup_bio(cgrp);
+
+ put_io_context(biog->io_context);
+
+ spin_lock_irq(&bio_cgroup_idr_lock);
+ idr_remove(&bio_cgroup_id, biog->id);
+ spin_unlock_irq(&bio_cgroup_idr_lock);
+
+ kfree(biog);
+}
+
+struct bio_cgroup *find_bio_cgroup(int id)
+{
+ struct bio_cgroup *biog;
+ spin_lock_irq(&bio_cgroup_idr_lock);
+ biog = (struct bio_cgroup *)
+ idr_find(&bio_cgroup_id, id);
+ spin_unlock_irq(&bio_cgroup_idr_lock);
+ get_bio_cgroup(biog);
+ return biog;
+}
+
+struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
+{
+ struct io_context *ioc;
+ struct page_cgroup *pc;
+ struct bio_cgroup *biog;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+
+ lock_page_cgroup(page);
+ pc = page_get_page_cgroup(page);
+ if (pc)
+ biog = pc->bio_cgroup;
+ else
+ biog = bio_cgroup_from_task(rcu_dereference(init_mm.owner));
+ ioc = biog->io_context; /* default io_context for this cgroup */
+ atomic_inc(&ioc->refcount);
+ unlock_page_cgroup(page);
+ return ioc;
+}
+EXPORT_SYMBOL(get_bio_cgroup_iocontext);
+
+static u64 bio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct bio_cgroup *biog = cgroup_bio(cgrp);
+
+ return (u64) biog->id;
+}
+
+
+static struct cftype bio_files[] = {
+ {
+ .name = "id",
+ .read_u64 = bio_id_read,
+ },
+};
+
+static int bio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ if (bio_cgroup_disabled())
+ return 0;
+ return cgroup_add_files(cont, ss, bio_files, ARRAY_SIZE(bio_files));
+}
+
+static void bio_cgroup_move_task(struct cgroup_subsys *ss,
+ struct cgroup *cont,
+ struct cgroup *old_cont,
+ struct task_struct *p)
+{
+ struct mm_struct *mm;
+ struct bio_cgroup *biog, *old_biog;
+
+ if (bio_cgroup_disabled())
+ return;
+
+ mm = get_task_mm(p);
+ if (mm == NULL)
+ return;
+
+ biog = cgroup_bio(cont);
+ old_biog = cgroup_bio(old_cont);
+
+ mmput(mm);
+ return;
+}
+
+
+struct cgroup_subsys bio_cgroup_subsys = {
+ .name = "bio",
+ .subsys_id = bio_cgroup_subsys_id,
+ .create = bio_cgroup_create,
+ .destroy = bio_cgroup_destroy,
+ .pre_destroy = bio_cgroup_pre_destroy,
+ .populate = bio_cgroup_populate,
+ .attach = bio_cgroup_move_task,
+ .early_init = 0,
+};
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg2/mm/memcontrol.c
--- linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c 2008-08-01 19:49:38.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/mm/memcontrol.c 2008-08-01 19:49:53.000000000 +0900
@@ -20,6 +20,7 @@
#include <linux/res_counter.h>
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
+#include <linux/biocontrol.h>
#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/page-flags.h>
@@ -1019,11 +1020,12 @@ struct page_cgroup *page_get_page_cgroup
* < 0 if the cgroup is over its limit
*/
static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, enum charge_type ctype,
- struct mem_cgroup *memcg)
+ gfp_t gfp_mask, enum charge_type ctype,
+ struct mem_cgroup *memcg, struct bio_cgroup *biocg)
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
+ struct bio_cgroup *biog;
pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
if (unlikely(pc == NULL))
@@ -1035,18 +1037,15 @@ static int mem_cgroup_charge_common(stru
* thread group leader migrates. It's possible that mm is not
* set, if so charge the init_mm (happens for pagecache usage).
*/
- if (likely(!memcg)) {
- rcu_read_lock();
- mem = mm_get_mem_cgroup(mm);
- rcu_read_unlock();
- } else {
- mem = memcg;
- get_mem_cgroup(mem);
- }
+ rcu_read_lock();
+ mem = memcg ? memcg : mm_get_mem_cgroup(mm);
+ biog = biocg ? biocg : mm_get_bio_cgroup(mm);
+ rcu_read_unlock();
if (mem_cgroup_try_to_allocate(mem, gfp_mask) < 0)
goto out;
set_mem_cgroup(pc, mem);
+ set_bio_cgroup(pc, biog);
pc->page = page;
/*
* If a page is accounted as a page cache, insert to inactive list.
@@ -1065,18 +1064,21 @@ static int mem_cgroup_charge_common(stru
if (unlikely(page_get_page_cgroup(page))) {
unlock_page_cgroup(page);
clear_mem_cgroup(pc);
+ clear_bio_cgroup(pc);
kmem_cache_free(page_cgroup_cache, pc);
goto done;
}
page_assign_page_cgroup(page, pc);
mem_cgroup_add_page(pc);
+ bio_cgroup_add_page(pc);
unlock_page_cgroup(page);
done:
return 0;
out:
put_mem_cgroup(mem);
+ put_bio_cgroup(biog);
kmem_cache_free(page_cgroup_cache, pc);
err:
return -ENOMEM;
@@ -1099,7 +1101,7 @@ int mem_cgroup_charge(struct page *page,
if (unlikely(!mm))
mm = &init_mm;
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+ MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL, NULL);
}
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -1135,7 +1137,7 @@ int mem_cgroup_cache_charge(struct page
mm = &init_mm;
return mem_cgroup_charge_common(page, mm, gfp_mask,
- MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
+ MEM_CGROUP_CHARGE_TYPE_CACHE, NULL, NULL);
}
/*
@@ -1146,7 +1148,7 @@ __mem_cgroup_uncharge_common(struct page
{
struct page_cgroup *pc;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && bio_cgroup_disabled())
return;
/*
@@ -1166,11 +1168,13 @@ __mem_cgroup_uncharge_common(struct page
goto unlock;
mem_cgroup_remove_page(pc);
+ bio_cgroup_remove_page(pc);
page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);
clear_mem_cgroup(pc);
+ clear_bio_cgroup(pc);
kmem_cache_free(page_cgroup_cache, pc);
return;
@@ -1196,24 +1200,29 @@ int mem_cgroup_prepare_migration(struct
{
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
+ struct bio_cgroup *biog = NULL;
enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
int ret = 0;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && bio_cgroup_disabled())
return 0;
lock_page_cgroup(page);
pc = page_get_page_cgroup(page);
if (pc) {
mem = get_mem_page_cgroup(pc);
+ biog = get_bio_page_cgroup(pc);
if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
}
unlock_page_cgroup(page);
- if (mem) {
+ if (pc) {
ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
- ctype, mem);
- put_mem_cgroup(mem);
+ ctype, mem, biog);
+ if (mem)
+ put_mem_cgroup(mem);
+ if (biog)
+ put_bio_cgroup(biog);
}
return ret;
}
With this patch, dm-ioband can work with the bio cgroup.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <[email protected]>
Signed-off-by: Hirokazu Takahashi <[email protected]>
diff -Ndupr linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c
--- linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c 2008-08-01 16:53:57.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c 2008-08-01 19:44:36.000000000 +0900
@@ -6,6 +6,7 @@
* This file is released under the GPL.
*/
#include <linux/bio.h>
+#include <linux/biocontrol.h>
#include "dm.h"
#include "dm-bio-list.h"
#include "dm-ioband.h"
@@ -53,13 +54,13 @@ static int ioband_node(struct bio *bio)
static int ioband_cgroup(struct bio *bio)
{
- /*
- * This function should return the ID of the cgroup which issued "bio".
- * The ID of the cgroup which the current process belongs to won't be
- * suitable ID for this purpose, since some BIOs will be handled by kernel
- * threads like aio or pdflush on behalf of the process requesting the BIOs.
- */
- return 0; /* not implemented yet */
+ struct io_context *ioc = get_bio_cgroup_iocontext(bio);
+ int id = 0;
+ if (ioc) {
+ id = ioc->id;
+ put_io_context(ioc);
+ }
+ return id;
}
struct group_type dm_ioband_group_type[] = {
On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
> This series of patches of dm-ioband now includes "The bio tracking mechanism,"
> which has been posted individually to this mailing list.
> This makes it easy for anybody to control the I/O bandwidth even when
> the I/O is one of delayed-write requests.
During the Containers mini-summit at OLS, it was mentioned that there
are at least *FOUR* of these I/O controllers floating around. Have you
talked to the other authors? (I've cc'd at least one of them).
We obviously can't come to any kind of real consensus with people just
tossing the same patches back and forth.
-- Dave
Dave Hansen wrote:
> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
>> This series of patches of dm-ioband now includes "The bio tracking mechanism,"
>> which has been posted individually to this mailing list.
>> This makes it easy for anybody to control the I/O bandwidth even when
>> the I/O is one of delayed-write requests.
>
> During the Containers mini-summit at OLS, it was mentioned that there
> are at least *FOUR* of these I/O controllers floating around. Have you
> talked to the other authors? (I've cc'd at least one of them).
>
> We obviously can't come to any kind of real consensus with people just
> tossing the same patches back and forth.
>
> -- Dave
>
Dave,
thanks for this email first of all. I've talked with Satoshi (cc-ed)
about his solution "Yet another I/O bandwidth controlling subsystem for
CGroups based on CFQ".
I did some experiments trying to implement minimum bandwidth requirements
for my io-throttle controller, mapping the requirements to CFQ prio and
using Satoshi's controller. But this needs additional work and
testing right now, so I've not posted anything yet, just informed
Satoshi about this.
Unfortunately I've not talked to Ryo yet. I've continued my work using a
quite different approach, because the dm-ioband solution didn't work
with delayed-write requests. Now the bio tracking feature seems really
promising and I would like to do some tests ASAP, and review the patch
as well.
But I'm not yet convinced that limiting the IO writes at the device
mapper layer is the best solution. IMHO it would be better to throttle
applications' writes when they're dirtying pages in the page cache (the
io-throttle way), because when the IO requests arrive to the device
mapper it's too late (we would only have a lot of dirty pages that are
waiting to be flushed to the limited block devices, and maybe this could
lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
(at least for my requirements). Ryo, correct me if I'm wrong or if I've
not understood the dm-ioband approach.
Another thing I prefer is to directly define bandwidth limiting rules,
instead of using priorities/weights (i.e. 10MiB/s for /dev/sda), but
this seems to be in the dm-ioband TODO list, so maybe we can merge the
work I did in io-throttle to define such rules.
Anyway, I still need to look at the dm-ioband and bio-cgroup code in
detail, so probably all I said above is totally wrong...
-Andrea
Dave Hansen wrote:
> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
>> This series of patches of dm-ioband now includes "The bio tracking mechanism,"
>> which has been posted individually to this mailing list.
>> This makes it easy for anybody to control the I/O bandwidth even when
>> the I/O is one of delayed-write requests.
>
> During the Containers mini-summit at OLS, it was mentioned that there
> are at least *FOUR* of these I/O controllers floating around. Have you
> talked to the other authors? (I've cc'd at least one of them).
>
> We obviously can't come to any kind of real consensus with people just
> tossing the same patches back and forth.
Ryo and Andrea - Naveen and Satoshi met up at OLS and discussed their approach.
It would be really nice to see an RFC, I know Andrea did work on this and
compared the approaches.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote:
> But I'm not yet convinced that limiting the IO writes at the device
> mapper layer is the best solution. IMHO it would be better to throttle
> applications' writes when they're dirtying pages in the page cache (the
> io-throttle way), because when the IO requests arrive to the device
> mapper it's too late (we would only have a lot of dirty pages that are
> waiting to be flushed to the limited block devices, and maybe this could
> lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
> (at least for my requirements). Ryo, correct me if I'm wrong or if I've
> not understood the dm-ioband approach.
The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if
you look at this in combination with the memory controller, they would
make a great team.
The memory controller keeps you from dirtying more than your limit of
pages (and pinning too much memory) even if the dm layer is doing the
throttling and itself can't throttle the memory usage.
I also don't think this is any different from the problems we have in
the regular VM these days. Right now, people can dirty lots of pages on
devices that are slow. The only thing dm-ioband would add is
changing how those devices *got* slow. :)
-- Dave
Balbir Singh wrote:
> Dave Hansen wrote:
>> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
>>> This series of patches of dm-ioband now includes "The bio tracking mechanism,"
>>> which has been posted individually to this mailing list.
>>> This makes it easy for anybody to control the I/O bandwidth even when
>>> the I/O is one of delayed-write requests.
>> During the Containers mini-summit at OLS, it was mentioned that there
>> are at least *FOUR* of these I/O controllers floating around. Have you
>> talked to the other authors? (I've cc'd at least one of them).
>>
>> We obviously can't come to any kind of real consensus with people just
>> tossing the same patches back and forth.
>
> Ryo and Andrea - Naveen and Satoshi met up at OLS and discussed their approach.
> It would be really nice to see an RFC, I know Andrea did work on this and
> compared the approaches.
>
yes, I wrote down something about the comparison of priority-based vs
bandwidth shaping solutions in terms of performance predictability. And
other considerations, like the one I cited before, about dirty-ratio
throttling in memory, AIO handling, etc.
Something is also reported in the io-throttle documentation:
http://marc.info/?l=linux-kernel&m=121780176907686&w=2
But ok, I agree with Balbir, I can try to put the things together (in a
better form in particular) and try to post an RFC together with Ryo.
Ryo, do you have other documentation besides the info reported in the
dm-ioband website?
Thanks,
-Andrea
Dave Hansen wrote:
> On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote:
>> But I'm not yet convinced that limiting the IO writes at the device
>> mapper layer is the best solution. IMHO it would be better to throttle
>> applications' writes when they're dirtying pages in the page cache (the
>> io-throttle way), because when the IO requests arrive to the device
>> mapper it's too late (we would only have a lot of dirty pages that are
>> waiting to be flushed to the limited block devices, and maybe this could
>> lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
>> (at least for my requirements). Ryo, correct me if I'm wrong or if I've
>> not understood the dm-ioband approach.
>
> The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if
> you look at this in combination with the memory controller, they would
> make a great team.
>
> The memory controller keeps you from dirtying more than your limit of
> pages (and pinning too much memory) even if the dm layer is doing the
> throttling and itself can't throttle the memory usage.
mmh... but in this way we would just move the OOM inside the cgroup,
which is a nice improvement, but the main problem is not resolved...
A safer approach IMHO is to force the tasks to wait synchronously on
each operation that directly or indirectly generates i/o.
In particular the solution used by the io-throttle controller to limit
the dirty-ratio in memory is to impose a sleep via
schedule_timeout_killable() in balance_dirty_pages() when a generic
process exceeds the limits defined for the belonging cgroup.
Limiting read operations is a lot more easy, because they're always
synchronized with i/o requests.
-Andrea
On Mon, 2008-08-04 at 22:44 +0200, Andrea Righi wrote:
> Dave Hansen wrote:
> > On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote:
> >> But I'm not yet convinced that limiting the IO writes at the device
> >> mapper layer is the best solution. IMHO it would be better to throttle
> >> applications' writes when they're dirtying pages in the page cache (the
> >> io-throttle way), because when the IO requests arrive to the device
> >> mapper it's too late (we would only have a lot of dirty pages that are
> >> waiting to be flushed to the limited block devices, and maybe this could
> >> lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
> >> (at least for my requirements). Ryo, correct me if I'm wrong or if I've
> >> not understood the dm-ioband approach.
> >
> > The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if
> > you look at this in combination with the memory controller, they would
> > make a great team.
> >
> > The memory controller keeps you from dirtying more than your limit of
> > pages (and pinning too much memory) even if the dm layer is doing the
> > throttling and itself can't throttle the memory usage.
>
> mmh... but in this way we would just move the OOM inside the cgroup,
> that is a nice improvement, but the main problem is not resolved...
>
> A safer approach IMHO is to force the tasks to wait synchronously on
> each operation that directly or indirectly generates i/o.
Fine in theory, hard in practice. :)
I think the best we can hope for is to keep parity with what happens in
the rest of the kernel. We already have a problem today with people
mmap()'ing lots of memory and dirtying it all at once. Adding an i/o
bandwidth controller or a memory controller isn't really going to fix
that. I think it is outside the scope of the i/o (and memory)
controllers until we solve it generically, first.
-- Dave
Hi, Andrea.
I participated in the Containers Mini-summit.
And I talked with Mr. Andrew Morton at The Linux Foundation Japan
Symposium BoF in Japan on July 10th.
Currently, several I/O controller patches have been posted to the ML,
and each author keeps sending improved versions of his own patch.
Neither we nor the maintainers like this situation.
We wanted to resolve it at the Mini-summit, but unfortunately
no other developers participated.
(I couldn't give an opinion, because my English skill is low.)
Mr. Naveen presented his approach at the Linux Symposium, and we
discussed I/O control for a while after his presentation.
Mr. Andrew advised me that we should discuss the design more.
And, at the Containers Mini-summit (and Linux Symposium 2008 in Ottawa),
Paul said that what we need first is to decide on the requirements.
So, we must discuss requirements and design.
My requirement is
* to be able to distribute performance moderately.
(* to be able to isolate each group (environment).)
I guess (it may be wrong)
Naveen's requirement is
* to be able to handle latency.
  (high-priority I/O always takes precedence in handling;
   priority is not given just by share, as in CFQ.)
* to be able to distribute performance moderately.
Andrea's requirement is
* to be able to set and control by absolute (direct) performance.
Ryo's requirement is
* to be able to distribute performance moderately.
* to be able to set and control I/Os over a flexible range
  (multiple devices, such as LVM).
I think that most solutions control I/O performance moderately
(by using weight/priority/percentage/etc. rather than absolute values),
because disk I/O performance is inconstant and is affected by the
situation (such as the application, file/data layout, and so on).
So it is difficult to guarantee performance that is set as an
absolute bandwidth.
If devices had constant performance, it would be good to control by
absolute bandwidth.
Guaranteeing it is also possible by assuming a conservatively low
device ability, but no one likes to waste resources that way.
And, he gave a advice "Can't a framework which organized each way,
such as I/O elevator, be made?".
I try to consider such framework (in elevator layer or block layer).
Now, I look at the other methods, again.
I think that OOM problems caused by memory/cache systems.
So, it will be better that I/O controller created out of these problems
first, although a lateness of the I/O device would be related.
If these problem can be resolved, its technique should be applied into
normal I/O control as well as cgroups.
Buffered write I/O is also related with cache system.
We must consider this problem as I/O control.
I don't have a good way which can resolve this problems.
> I did some experiments trying to implement minimum bandwidth requirements
> for my io-throttle controller, mapping the requirements to CFQ prio and
> using the Satoshi's controller. But this needs additional work and
> testing right now, so I've not posted anything yet, just informed
> Satoshi about this.
I'm very interested in these results.
Thanks,
Satoshi Uchida.
> -----Original Message-----
> From: Andrea Righi [mailto:[email protected]]
> Sent: Tuesday, August 05, 2008 3:23 AM
> To: Dave Hansen
> Cc: Ryo Tsuruta; [email protected]; [email protected];
> [email protected];
> [email protected];
> [email protected]; [email protected]; Satoshi UCHIDA
> Subject: Re: Too many I/O controller patches
>
> Dave Hansen wrote:
> > On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
> >> This series of patches of dm-ioband now includes "The bio tracking
> mechanism,"
> >> which has been posted individually to this mailing list.
> >> This makes it easy for anybody to control the I/O bandwidth even when
> >> the I/O is one of delayed-write requests.
> >
> > During the Containers mini-summit at OLS, it was mentioned that there
> > are at least *FOUR* of these I/O controllers floating around. Have you
> > talked to the other authors? (I've cc'd at least one of them).
> >
> > We obviously can't come to any kind of real consensus with people just
> > tossing the same patches back and forth.
> >
> > -- Dave
> >
>
> Dave,
>
> thanks for this email first of all. I've talked with Satoshi (cc-ed)
> about his solution "Yet another I/O bandwidth controlling subsystem for
> CGroups based on CFQ".
>
> I did some experiments trying to implement minimum bandwidth requirements
> for my io-throttle controller, mapping the requirements to CFQ prio and
> using the Satoshi's controller. But this needs additional work and
> testing right now, so I've not posted anything yet, just informed
> Satoshi about this.
>
> Unfortunately I've not talked to Ryo yet. I've continued my work using a
> quite different approach, because the dm-ioband solution didn't work
> with delayed-write requests. Now the bio tracking feature seems really
> promising and I would like to do some tests ASAP, and review the patch
> as well.
>
> But I'm not yet convinced that limiting the IO writes at the device
> mapper layer is the best solution. IMHO it would be better to throttle
> applications' writes when they're dirtying pages in the page cache (the
> io-throttle way), because when the IO requests arrive to the device
> mapper it's too late (we would only have a lot of dirty pages that are
> waiting to be flushed to the limited block devices, and maybe this could
> lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
> (at least for my requirements). Ryo, correct me if I'm wrong or if I've
> not understood the dm-ioband approach.
>
> Another thing I prefer is to directly define bandwidth limiting rules,
> instead of using priorities/weights (i.e. 10MiB/s for /dev/sda), but
> this seems to be in the dm-ioband TODO list, so maybe we can merge the
> work I did in io-throttle to define such rules.
>
> Anyway, I still need to look at the dm-ioband and bio-cgroup code in
> details, so probably all I said above is totally wrong...
>
> -Andrea
On Mon, Aug 4, 2008 at 1:44 PM, Andrea Righi <[email protected]> wrote:
>
> A safer approach IMHO is to force the tasks to wait synchronously on
> each operation that directly or indirectly generates i/o.
>
> In particular the solution used by the io-throttle controller to limit
> the dirty-ratio in memory is to impose a sleep via
> schedule_timeout_killable() in balance_dirty_pages() when a generic
> process exceeds the limits defined for the belonging cgroup.
>
> Limiting read operations is a lot more easy, because they're always
> synchronized with i/o requests.
I think that you're conflating two issues:
- controlling how much dirty memory a cgroup can have at any given
time (since dirty memory is much harder/slower to reclaim than clean
memory)
- controlling how much effect a cgroup can have on a given I/O device.
By controlling the rate at which a task can generate dirty pages,
you're not really limiting either of these. I think you'd have to set
your I/O limits artificially low to prevent a case of a process
writing a large data file and then doing fsync() on it, which would
then hit the disk with the entire file at once, and blow away any QoS
guarantees for other groups.
As Dave suggested, I think it would make more sense to have your
page-dirtying throttle points hook into the memory controller instead,
and allow the memory controller to track/limit dirty pages for a
cgroup, and potentially do throttling as part of that.
Paul
Paul Menage wrote:
> On Mon, Aug 4, 2008 at 1:44 PM, Andrea Righi <[email protected]> wrote:
>> A safer approach IMHO is to force the tasks to wait synchronously on
>> each operation that directly or indirectly generates i/o.
>>
>> In particular the solution used by the io-throttle controller to limit
>> the dirty-ratio in memory is to impose a sleep via
>> schedule_timeout_killable() in balance_dirty_pages() when a generic
>> process exceeds the limits defined for the belonging cgroup.
>>
>> Limiting read operations is a lot more easy, because they're always
>> synchronized with i/o requests.
>
> I think that you're conflating two issues:
>
> - controlling how much dirty memory a cgroup can have at any given
> time (since dirty memory is much harder/slower to reclaim than clean
> memory)
>
> - controlling how much effect a cgroup can have on a given I/O device.
>
> By controlling the rate at which a task can generate dirty pages,
> you're not really limiting either of these. I think you'd have to set
> your I/O limits artificially low to prevent a case of a process
> writing a large data file and then doing fsync() on it, which would
> then hit the disk with the entire file at once, and blow away any QoS
> guarantees for other groups.
>
> As Dave suggested, I think it would make more sense to have your
> page-dirtying throttle points hook into the memory controller instead,
> and allow the memory controller to track/limit dirty pages for a
> cgroup, and potentially do throttling as part of that.
Yes, that would be nicer. The IO controller should control both reads
and writes, and dirty pages are mostly related to writes.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Hi, Andrea,
I'm working with Ryo on dm-ioband and other stuff.
> > On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote:
> >> But I'm not yet convinced that limiting the IO writes at the device
> >> mapper layer is the best solution. IMHO it would be better to throttle
> >> applications' writes when they're dirtying pages in the page cache (the
> >> io-throttle way), because when the IO requests arrive to the device
> >> mapper it's too late (we would only have a lot of dirty pages that are
> >> waiting to be flushed to the limited block devices, and maybe this could
> >> lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
> >> (at least for my requirements). Ryo, correct me if I'm wrong or if I've
> >> not understood the dm-ioband approach.
> >
> > The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if
> > you look at this in combination with the memory controller, they would
> > make a great team.
> >
> > The memory controller keeps you from dirtying more than your limit of
> > pages (and pinning too much memory) even if the dm layer is doing the
> > throttling and itself can't throttle the memory usage.
>
> mmh... but in this way we would just move the OOM inside the cgroup,
> that is a nice improvement, but the main problem is not resolved...
The concept of dm-ioband is that it should be used with the cgroup
memory controller as well as with bio-cgroup. The memory controller is
supposed to control memory allocation and the dirty-page ratio inside
each cgroup.
Some guys on the cgroup memory controller team have just started to
implement the latter mechanism. They are trying to make each cgroup
have a threshold to limit the number of dirty pages in the group.
I feel this is a good approach, since each function can work
independently.
> A safer approach IMHO is to force the tasks to wait synchronously on
> each operation that directly or indirectly generates i/o.
>
> In particular the solution used by the io-throttle controller to limit
> the dirty-ratio in memory is to impose a sleep via
> schedule_timeout_killable() in balance_dirty_pages() when a generic
> process exceeds the limits defined for the belonging cgroup.
I guess it would make the memory controller guys happier if you could
help them make the design of their dirty-page ratio controlling
functionality cooler and more generic. I think their goal is almost
the same as yours.
> Limiting read operations is a lot more easy, because they're always
> synchronized with i/o requests.
>
> -Andrea
Thank you,
Hirokazu Takahashi.
Hi,
> > >> But I'm not yet convinced that limiting the IO writes at the device
> > >> mapper layer is the best solution. IMHO it would be better to throttle
> > >> applications' writes when they're dirtying pages in the page cache (the
> > >> io-throttle way), because when the IO requests arrive to the device
> > >> mapper it's too late (we would only have a lot of dirty pages that are
> > >> waiting to be flushed to the limited block devices, and maybe this could
> > >> lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
> > >> (at least for my requirements). Ryo, correct me if I'm wrong or if I've
> > >> not understood the dm-ioband approach.
> > >
> > > The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if
> > > you look at this in combination with the memory controller, they would
> > > make a great team.
> > >
> > > The memory controller keeps you from dirtying more than your limit of
> > > pages (and pinning too much memory) even if the dm layer is doing the
> > > throttling and itself can't throttle the memory usage.
> >
> > mmh... but in this way we would just move the OOM inside the cgroup,
> > that is a nice improvement, but the main problem is not resolved...
> >
> > A safer approach IMHO is to force the tasks to wait synchronously on
> > each operation that directly or indirectly generates i/o.
>
> Fine in theory, hard in practice. :)
>
> I think the best we can hope for is to keep parity with what happens in
> the rest of the kernel. We already have a problem today with people
> mmap()'ing lots of memory and dirtying it all at once. Adding a i/o
> bandwidth controller or a memory controller isn't really going to fix
> that. I think it is outside the scope of the i/o (and memory)
> controllers until we solve it generically, first.
Yes, that's right. This should be solved.
But there is one good thing about using a memory controller:
a problem that occurs in a certain cgroup will be confined to that
cgroup.
I think this is a great point, don't you think?
Thank you,
Hirokazu Takahashi.
Hi,
> > Hi, Andrea,
> >
> > I'm working with Ryo on dm-ioband and other stuff.
> >
> >>> On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote:
> >>>> But I'm not yet convinced that limiting the IO writes at the device
> >>>> mapper layer is the best solution. IMHO it would be better to throttle
> >>>> applications' writes when they're dirtying pages in the page cache (the
> >>>> io-throttle way), because when the IO requests arrive to the device
> >>>> mapper it's too late (we would only have a lot of dirty pages that are
> >>>> waiting to be flushed to the limited block devices, and maybe this could
> >>>> lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
> >>>> (at least for my requirements). Ryo, correct me if I'm wrong or if I've
> >>>> not understood the dm-ioband approach.
> >>> The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if
> >>> you look at this in combination with the memory controller, they would
> >>> make a great team.
> >>>
> >>> The memory controller keeps you from dirtying more than your limit of
> >>> pages (and pinning too much memory) even if the dm layer is doing the
> >>> throttling and itself can't throttle the memory usage.
> >> mmh... but in this way we would just move the OOM inside the cgroup,
> >> that is a nice improvement, but the main problem is not resolved...
> >
> > The concept of dm-ioband includes it should be used with cgroup memory
> > controller as well as the bio cgroup. The memory controller is supposed
> > to control memory allocation and dirty-page ratio inside each cgroup.
> >
> > Some guys of cgroup memory controller team just started to implement
> > the latter mechanism. They try to make each cgroup have a threshold
> > to limit the number of dirty pages in the group.
>
> Interesting, did they also post a patch or RFC?
You can take a look at the thread starting from
http://www.ussg.iu.edu/hypermail/linux/kernel/0807.1/0472.html,
whose subject is "[PATCH][RFC] dirty balancing for cgroups."
This project has just started, so it would be a good time to
discuss it with them.
Thanks,
Hirokazu Takahashi.
Ryo Tsuruta wrote:
> +static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> +				gfp_t gfp_mask, enum charge_type ctype,
> +				struct mem_cgroup *memcg)
> +{
> +	struct page_cgroup *pc;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +	struct mem_cgroup *mem;
> +	unsigned long flags;
> +	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	struct mem_cgroup_per_zone *mz;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
> +	if (unlikely(pc == NULL))
> +		goto err;
> +
> +	/*
> +	 * We always charge the cgroup the mm_struct belongs to.
> +	 * The mm_struct's mem_cgroup changes on task migration if the
> +	 * thread group leader migrates. It's possible that mm is not
> +	 * set, if so charge the init_mm (happens for pagecache usage).
> +	 */
> +	if (likely(!memcg)) {
> +		rcu_read_lock();
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> +		/*
> +		 * For every charge from the cgroup, increment reference count
> +		 */
> +		css_get(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +		rcu_read_unlock();
> +	} else {
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +		mem = memcg;
> +		css_get(&memcg->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +	}
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
> +		if (!(gfp_mask & __GFP_WAIT))
> +			goto out;
> +
> +		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> +			continue;
> +
> +		/*
> +		 * try_to_free_mem_cgroup_pages() might not give us a full
> +		 * picture of reclaim. Some pages are reclaimed and might be
> +		 * moved to swap cache or just unmapped from the cgroup.
> +		 * Check the limit again to see if the reclaim reduced the
> +		 * current usage of the cgroup before giving up
> +		 */
> +		if (res_counter_check_under_limit(&mem->res))
> +			continue;
> +
> +		if (!nr_retries--) {
> +			mem_cgroup_out_of_memory(mem, gfp_mask);
> +			goto out;
> +		}
> +	}
> +	pc->mem_cgroup = mem;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
you can remove some ifdefs doing:
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
	if (likely(!memcg)) {
		rcu_read_lock();
		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
		/*
		 * For every charge from the cgroup, increment reference count
		 */
		css_get(&mem->css);
		rcu_read_unlock();
	} else {
		mem = memcg;
		css_get(&memcg->css);
	}
	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
		if (!(gfp_mask & __GFP_WAIT))
			goto out;
		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
			continue;
		/*
		 * try_to_free_mem_cgroup_pages() might not give us a full
		 * picture of reclaim. Some pages are reclaimed and might be
		 * moved to swap cache or just unmapped from the cgroup.
		 * Check the limit again to see if the reclaim reduced the
		 * current usage of the cgroup before giving up
		 */
		if (res_counter_check_under_limit(&mem->res))
			continue;
		if (!nr_retries--) {
			mem_cgroup_out_of_memory(mem, gfp_mask);
			goto out;
		}
	}
	pc->mem_cgroup = mem;
#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
Hi, Andrea,
> you can remove some ifdefs doing:
I think you don't have to care about this much, since one of the following
patches removes most of these ifdefs.
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> 	if (likely(!memcg)) {
> 		rcu_read_lock();
> 		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> 		/*
> 		 * For every charge from the cgroup, increment reference count
> 		 */
> 		css_get(&mem->css);
> 		rcu_read_unlock();
> 	} else {
> 		mem = memcg;
> 		css_get(&memcg->css);
> 	}
> 	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
> 		if (!(gfp_mask & __GFP_WAIT))
> 			goto out;
>
> 		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> 			continue;
>
> 		/*
> 		 * try_to_free_mem_cgroup_pages() might not give us a full
> 		 * picture of reclaim. Some pages are reclaimed and might be
> 		 * moved to swap cache or just unmapped from the cgroup.
> 		 * Check the limit again to see if the reclaim reduced the
> 		 * current usage of the cgroup before giving up
> 		 */
> 		if (res_counter_check_under_limit(&mem->res))
> 			continue;
>
> 		if (!nr_retries--) {
> 			mem_cgroup_out_of_memory(mem, gfp_mask);
> 			goto out;
> 		}
> 	}
> 	pc->mem_cgroup = mem;
> #endif /* CONFIG_CGROUP_MEM_RES_CTLR */
Hirokazu Takahashi wrote:
> Hi, Andrea,
>
> I'm working with Ryo on dm-ioband and other stuff.
>
>>> On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote:
>>>> But I'm not yet convinced that limiting the IO writes at the device
>>>> mapper layer is the best solution. IMHO it would be better to throttle
>>>> applications' writes when they're dirtying pages in the page cache (the
>>>> io-throttle way), because when the IO requests arrive to the device
>>>> mapper it's too late (we would only have a lot of dirty pages that are
>>>> waiting to be flushed to the limited block devices, and maybe this could
>>>> lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
>>>> (at least for my requirements). Ryo, correct me if I'm wrong or if I've
>>>> not understood the dm-ioband approach.
>>> The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if
>>> you look at this in combination with the memory controller, they would
>>> make a great team.
>>>
>>> The memory controller keeps you from dirtying more than your limit of
>>> pages (and pinning too much memory) even if the dm layer is doing the
>>> throttling and itself can't throttle the memory usage.
>> mmh... but in this way we would just move the OOM inside the cgroup,
>> that is a nice improvement, but the main problem is not resolved...
>
> The concept of dm-ioband includes it should be used with cgroup memory
> controller as well as the bio cgroup. The memory controller is supposed
> to control memory allocation and dirty-page ratio inside each cgroup.
>
> Some guys of cgroup memory controller team just started to implement
> the latter mechanism. They try to make each cgroup have a threshold
> to limit the number of dirty pages in the group.
Interesting, did they also post a patch or RFC?
-Andrea
Paul Menage wrote:
> On Mon, Aug 4, 2008 at 1:44 PM, Andrea Righi <[email protected]> wrote:
>> A safer approach IMHO is to force the tasks to wait synchronously on
>> each operation that directly or indirectly generates i/o.
>>
>> In particular the solution used by the io-throttle controller to limit
>> the dirty-ratio in memory is to impose a sleep via
>> schedule_timeout_killable() in balance_dirty_pages() when a generic
>> process exceeds the limits defined for the belonging cgroup.
>>
>> Limiting read operations is a lot more easy, because they're always
>> synchronized with i/o requests.
>
> I think that you're conflating two issues:
>
> - controlling how much dirty memory a cgroup can have at any given
> time (since dirty memory is much harder/slower to reclaim than clean
> memory)
>
> - controlling how much effect a cgroup can have on a given I/O device.
>
> By controlling the rate at which a task can generate dirty pages,
> you're not really limiting either of these. I think you'd have to set
> your I/O limits artificially low to prevent a case of a process
> writing a large data file and then doing fsync() on it, which would
> then hit the disk with the entire file at once, and blow away any QoS
> guarantees for other groups.
Anyway, the dirty-page ratio is directly proportional to the IO that
will be performed on the real device, isn't it? This wouldn't prevent
IO bursts, as you correctly say, but IMHO it is a simple and quite
effective way to measure the IO write activity of each cgroup on each
affected device.
To prevent the IO peaks I usually reduce vm_dirty_ratio but, OK, this
is a workaround, not a solution to the problem either.
IMHO, based on the dirty-page rate measurement, we should apply both
limiting methods: throttle the dirty-page ratio to prevent too many
dirty pages in the system (they are hard to reclaim and generate
unpredictable/unpleasant/unresponsive behaviour), and throttle the
dispatching of IO requests at the device-mapper/IO-scheduler layer to
smooth the IO peaks/bursts generated by fsync() and similar scenarios.
Another, different approach could be to implement the measurement in
the elevator, looking at the time elapsed between when an IO request
is issued to the drive and when it is served. So, look at the start
time T1 and the completion time T2, take the difference (T2 - T1) and
say: cgroup C1 consumed an amount of IO time equal to (T2 - T1); then
use a token-bucket policy to fill/reduce the "credits" of each IO
cgroup in terms of IO time slots. This would be a more precise
measurement than trying to predict how expensive each IO operation
will be by looking only at the dirty-page ratio. Then throttle both
the dirty-page ratio *and* the dispatching of the IO requests
submitted by the cgroup that exceeds the limits.
>
> As Dave suggested, I think it would make more sense to have your
> page-dirtying throttle points hook into the memory controller instead,
> and allow the memory controller to track/limit dirty pages for a
> cgroup, and potentially do throttling as part of that.
>
> Paul
Yes, implementing page-dirty throttling in the memory controller seems
absolutely reasonable. I can try to move in this direction, merge the
page-dirty throttling into the memory controller, and also post an RFC.
Thanks,
-Andrea
Satoshi UCHIDA wrote:
> Andrea's requirement is
> * to be able to set and control by absolute(direct) performance.
* improve IO performance predictability of each cgroup
(try to guarantee more precise IO performance values)
> And, he gave me the advice, "Can't a framework which organizes each
> approach, such as the I/O elevator, be made?"
> I will try to consider such a framework (in the elevator layer or the
> block layer).
It would be probably the best place to evaluate the "cost" of each
IO operation.
> I think that the OOM problems are caused by the memory/cache systems.
> So, it would be better to create the I/O controller independently of
> these problems first, although the latency of the I/O device would be
> related.
> If these problems can be resolved, the technique should be applied to
> normal I/O control as well as to cgroups.
>
> Buffered write I/O is also related to the cache system.
> We must consider this problem as part of I/O control.
Agree. At least, maybe we should consider if an IO controller could be
a valid solution also for these problems.
>> I did some experiments trying to implement minimum bandwidth requirements
>> for my io-throttle controller, mapping the requirements to CFQ prio and
>> using the Satoshi's controller. But this needs additional work and
>> testing right now, so I've not posted anything yet, just informed
>> Satoshi about this.
>
> I'm very interested in these results.
I'll collect some numbers and keep you informed.
-Andrea
Hi,
> I think that the OOM problems are caused by the memory/cache systems.
> So, it would be better to create the I/O controller independently of
> these problems first, although the latency of the I/O device would be
> related.
> If these problems can be resolved, the technique should be applied to
> normal I/O control as well as to cgroups.
Yes, this is one of the problems linux kernel still has, which should
be solved.
But I believe this should be done in the linux memory management layer
including the cgroup memory controller, which has to work correctly
on any type of device with various access speeds.
I think it's better for I/O controllers to focus only on the flow of
I/O requests. This approach will keep the implementation of the Linux
kernel simple.
> Buffered write I/O is also related to the cache system.
> We must consider this problem as part of I/O control.
> I don't have a good way to resolve these problems.
>
Thank you,
Hirokazu Takahashi.
Hi Andrea, Satoshi and all,
Thanks for giving us a chance to discuss this.
> Mr. Andrew gave me the advice, "You should discuss the design more
> and more."
> And, at the Containers Mini-summit (and the Linux Symposium 2008 in
> Ottawa), Paul said that what we need first is to decide on requirements.
> So, we must discuss requirements and design.
We've implemented dm-ioband and bio-cgroup to meet the following requirements:
* Assign some bandwidth to each group on the same device.
A group is a set of processes, which may be a cgroup.
* Assign some bandwidth to each partition on the same device.
It can work with the process group based bandwidth control.
ex) With this feature, you can assign 40% of the bandwidth of a
disk to /root and 60% of it to /usr.
* It can work with virtual machines such as Xen and KVM.
I/O requests issued from virtual machines have to be controlled.
* It should work with any type of I/O scheduler, including ones which
will be released in the future.
* Support multiple devices which share the same bandwidth such as
raid disks and LVM.
* Handle asynchronous I/O requests such as AIO requests and delayed
write requests.
- This can be done with bio-cgroup, which uses the page-tracking
mechanism the cgroup memory controller has.
* Control the dirty page ratio.
- This can be done with the cgroup memory controller in the near
future. It would be great if you could also use other features the
memory controller is going to have together with dm-ioband.
* Make it easy to enhance.
- The current implementation of dm-ioband has an interface to
add a new policy to control I/O requests. You can easily add
I/O throttling policy if you want.
* Fine grained bandwidth control.
* Keep I/O throughput.
* Make it scalable.
* It should work correctly when the I/O load is quite high,
even when the I/O request queue of a certain disk overflows.
> Ryo, do you have other documentation besides the info reported in the
> dm-ioband website?
I don't have any documentation besides what is on the website.
Thanks,
Ryo Tsuruta
On Tue, 2008-08-05 at 11:28 +0200, Andrea Righi wrote:
> > Buffered write I/O is also related to the cache system.
> > We must consider this problem as part of I/O control.
>
> Agree. At least, maybe we should consider if an IO controller could be
> a valid solution also for these problems.
Isn't this one of the core points that we keep going back and forth
over? It seems like people are arguing in circles over this:
Do we:
1. Control potential memory usage by throttling I/O, or
2. Throttle I/O when memory is full?
I might lean toward (1) if we didn't already have a memory controller.
But, we have one, and it works. Also, we *already* do (2) in the
kernel, so it would seem to graft well onto existing mechanisms that we
have.
I/O controllers should not worry about memory. They're going to have a
hard enough time getting the I/O part right. :)
Or, am I over-simplifying this?
-- Dave
On Mon, 2008-08-04 at 22:55 -0700, Paul Menage wrote:
>
> As Dave suggested, I think it would make more sense to have your
> page-dirtying throttle points hook into the memory controller instead,
> and allow the memory controller to track/limit dirty pages for a
> cgroup, and potentially do throttling as part of that.
Yeah, I'm sure we're going to have to get to setting the dirty ratio
$ cat /proc/sys/vm/dirty_ratio
40
on a per-container basis at *some* point. We might as well do it
earlier rather than later.
-- Dave
On Mon, 2008-08-04 at 10:20 -0700, Dave Hansen wrote:
> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
> > This series of patches of dm-ioband now includes "The bio tracking mechanism,"
> > which has been posted individually to this mailing list.
> > This makes it easy for anybody to control the I/O bandwidth even when
> > the I/O is one of delayed-write requests.
>
> During the Containers mini-summit at OLS, it was mentioned that there
> are at least *FOUR* of these I/O controllers floating around. Have you
> talked to the other authors? (I've cc'd at least one of them).
>
> We obviously can't come to any kind of real consensus with people just
> tossing the same patches back and forth.
>
> -- Dave
Hi Dave,
I have been tracking the memory controller patches for a while which
spurred my interest in cgroups and prompted me to start working on I/O
bandwidth controlling mechanisms. This year I have had several
opportunities to discuss the design challenges of i/o controllers with
the NEC and VALinux Japan teams (CCed), most recently last month during
the Linux Foundation Japan Linux Symposium, where we took advantage of
Andrew Morton's visit to Japan to do some brainstorming on this topic. I
will try so summarize what was discussed there (and in the Linux Storage
& Filesystem Workshop earlier this year) and propose a hopefully
acceptable way to proceed and try to get things started.
This RFC ended up being a bit longer than I had originally intended, but
hopefully it will serve as the start of a fruitful discussion.
As you pointed out, it seems that there is not much consensus building
going on, but that does not mean there is a lack of interest. To get the
ball rolling it is probably a good idea to clarify the state of things
and try to establish what we are trying to accomplish.
*** State of things in the mainstream kernel
The kernel has had somewhat advanced I/O control capabilities for quite
some time now: CFQ. But the current CFQ has some problems:
- I/O priority can be set by PID, PGRP, or UID, but...
- ...all the processes that fall within the same class/priority are
scheduled together and arbitrary grouping are not possible.
- Buffered I/O is not handled properly.
- CFQ's IO priority is an attribute of a process that affects all
devices it sends I/O requests to. In other words, with the current
implementation it is not possible to assign per-device IO priorities to
a task.
*** Goals
1. Cgroups-aware I/O scheduling (being able to define arbitrary
groupings of processes and treat each group as a single scheduling
entity).
2. Being able to perform I/O bandwidth control independently on each
device.
3. I/O bandwidth shaping.
4. Scheduler-independent I/O bandwidth control.
5. Usable with stacking devices (md, dm and other devices of that
ilk).
6. I/O tracking (handle buffered and asynchronous I/O properly).
The list of goals above is not exhaustive and it is also likely to
contain some not-so-nice-to-have features so your feedback would be
appreciated.
1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
groupings of processes and treat each group as a single scheduling
entity).
We obviously need this because our final goal is to be able to control
the IO generated by a Linux container. The good news is that we already
have the cgroups infrastructure so, regarding this problem, we would
just have to transform our I/O bandwidth controller into a cgroup
subsystem.
This seems to be the easiest part, but the current cgroups
infrastructure has some limitations when it comes to dealing with block
devices: impossibility of creating/removing certain control structures
dynamically and hardcoding of subsystems (i.e. resource controllers).
This makes it difficult to handle block devices that can be hotplugged
and go away at any time (this applies not only to usb storage but also
to some SATA and SCSI devices). To cope with this situation properly we
would need hotplug support in cgroups, but, as suggested before and
discussed in the past (see (0) below), there are some limitations.
Even in the non-hotplug case it would be nice if we could treat each
block I/O device as an independent resource, which means we could do
things like allocating I/O bandwidth on a per-device basis. As long as
performance is not compromised too much, adding some kind of basic
hotplug support to cgroups is probably worth it.
(0) http://lkml.org/lkml/2008/5/21/12
3. & 4. & 5. - I/O bandwidth shaping & General design aspects
The implementation of an I/O scheduling algorithm is to a certain extent
influenced by what we are trying to achieve in terms of I/O bandwidth
shaping, but, as discussed below, the required accuracy can determine
the layer where the I/O controller has to reside. Off the top of my
head, there are three basic operations we may want to perform:
- I/O nice prioritization: ionice-like approach.
- Proportional bandwidth scheduling: each process/group of processes
has a weight that determines the share of bandwidth they receive.
- I/O limiting: set an upper limit to the bandwidth a group of tasks
can use.
If we are pursuing an I/O prioritization model à la CFQ the temptation is
to implement it at the elevator layer or extend any of the existing I/O
schedulers.
There have been several proposals that extend either the CFQ scheduler
(see (1), (2) below) or the AS scheduler (see (3) below). The problem
with these controllers is that they are scheduler dependent, which means
that they become unusable when we change the scheduler or when we want
to control stacking devices which define their own make_request_fn
function (md and dm come to mind). It could be argued that the physical
devices controlled by a dm or md driver are likely to be fed by
traditional I/O schedulers such as CFQ, but these I/O schedulers would
be running independently from each other, each one controlling its own
device, ignoring the fact that they are part of a stacking device. This lack
of information at the elevator layer makes it pretty difficult to obtain
accurate results when using stacking devices. It seems that unless we
can make the elevator layer aware of the topology of stacking devices
(possibly by extending the elevator API?) elevator-based approaches do
not constitute a generic solution. From here onwards, for discussion
purposes, I will refer to this type of I/O bandwidth controllers as
elevator-based I/O controllers.
A simple way of solving the problems discussed in the previous paragraph
is to perform I/O control before the I/O actually enters the block layer
either at the pagecache level (when pages are dirtied) or at the entry
point to the generic block layer (generic_make_request()). Andrea's I/O
throttling patches stick to the former variant (see (4) below) and
Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) takes the latter
approach. The rationale is that by hooking into the source of I/O
requests we can perform I/O control in a topology-agnostic and
elevator-agnostic way. I will refer to this new type of I/O bandwidth
controller as block layer I/O controller.
By residing just above the generic block layer the implementation of a
block layer I/O controller becomes relatively easy, but by not taking
into account the characteristics of the underlying devices we might risk
underutilizing them. For this reason, in some cases it would probably
make sense to complement a generic I/O controller with an elevator-based
I/O controller, so that the maximum throughput can be squeezed from the
physical devices.
(1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
(2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
(3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
(4) Andrea Righi's I/O bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
(5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
6.- I/O tracking
This is arguably the most important part, since to perform I/O control
we need to be able to determine where the I/O is coming from.
Reads are trivial because they are served in the context of the task
that generated the I/O. But most writes are performed by pdflush,
kswapd, and friends so performing I/O control just in the synchronous
I/O path would lead to large inaccuracy. To get this right we would need
to track ownership all the way up to the pagecache page. In other words,
it is necessary to track who is dirtying pages so that when they are
written to disk the right task is charged for that I/O.
Fortunately, such tracking of pages is one of the things the existing
memory resource controller is doing to control memory usage. This is a
clever observation which has a useful implication: if the rather
imbricated tracking and accounting parts of the memory resource
controller were split, the I/O controller could leverage the existing
infrastructure to track buffered and asynchronous I/O. This is exactly
what the bio-cgroup (see (6) below) patches set out to do.
It is also possible to do without I/O tracking. For that we would need
to hook into the synchronous I/O path and every place in the kernel
where pages are dirtied (see (4) above for details). However, controlling
the rate at which a cgroup can generate dirty pages seems to be a task
that belongs in the memory controller, not the I/O controller. As Dave
and Paul suggested, it's probably better to delegate this to the memory
controller. In fact, it seems that Yamamoto-san is cooking some patches
that implement just that: dirty balancing for cgroups (see (7) for
details).
Another argument in favor of I/O tracking is that not only block layer
I/O controllers would benefit from it, but also the existing I/O
schedulers and the elevator-based I/O controllers proposed by
Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
are working on this and hopefully will be sending patches soon).
(6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
(7) Yamamoto-san's dirty balancing patches: http://lwn.net/Articles/289237/
*** How to move on
As discussed before, it probably makes sense to have both a block layer
I/O controller and an elevator-based one, and they could certainly
coexist. All of them need I/O tracking
capabilities, so I would like to suggest the plan below to get things
started:
- Improve the I/O tracking patches (see (6) above) until they are in
mergeable shape.
- Fix CFQ and AS to use the new I/O tracking functionality to show its
benefits. If the performance impact is acceptable this should suffice to
convince the respective maintainer and get the I/O tracking patches
merged.
- Implement a block layer resource controller. dm-ioband is a working,
feature-rich solution, but its dependency on the dm infrastructure is
likely to find opposition (the dm layer does not handle barriers
properly and the maximum size of I/O requests can be limited in some
cases). In such a case, we could either try to build a standalone
resource controller based on dm-ioband (which would probably hook into
generic_make_request) or try to come up with something new.
- If the I/O tracking patches make it into the kernel we could move on
and try to get the Cgroup extensions to CFQ and AS mentioned before (see
(1), (2), and (3) above for details) merged.
- Delegate the task of controlling the rate at which a task can
generate dirty pages to the memory controller.
This RFC is somewhat vague, but my feeling is that we should build some
consensus on the goals and basic design aspects before delving into
implementation details.
I would appreciate your comments and feedback.
- Fernando
On Tue, 05 Aug 2008 09:20:18 -0700
Dave Hansen <[email protected]> wrote:
> On Tue, 2008-08-05 at 11:28 +0200, Andrea Righi wrote:
> > > Buffered write I/O is also related with cache system.
> > > We must consider this problem as I/O control.
> >
> > Agree. At least, maybe we should consider if an IO controller could be
> > a valid solution also for these problems.
>
> Isn't this one of the core points that we keep going back and forth
> over? It seems like people are arguing in circles over this:
>
> Do we:
> 1. control potential memory usage by throttling I/O
> or
> 2. Throttle I/O when memory is full
>
> I might lean toward (1) if we didn't already have a memory controller.
> But, we have one, and it works. Also, we *already* do (2) in the
> kernel, so it would seem to graft well onto existing mechanisms that we
> have.
>
> I/O controllers should not worry about memory.
I agree here ;)
> They're going to have a hard enough time getting the I/O part right. :)
>
memcg has more problems now ;(
The only difficult thing about limiting the dirty ratio in memcg is how to
count dirty pages. If the I/O controller's hooks help with that, good.
My small concern is what happens if we throttle the I/O bandwidth too
aggressively under some memcg. In such a cgroup we may see more OOMs because
I/O will not finish in time.
A system admin has to find some way to avoid this.
But please do I/O control first. Dirty-page control is a related problem, but
it belongs to a different layer, I think.
Thanks,
-Kame
> Or, am I over-simplifying this?
>
KAMEZAWA Hiroyuki wrote:
> On Tue, 05 Aug 2008 09:20:18 -0700
> Dave Hansen <[email protected]> wrote:
>
>> On Tue, 2008-08-05 at 11:28 +0200, Andrea Righi wrote:
>>>> Buffered write I/O is also related with cache system.
>>>> We must consider this problem as I/O control.
>>> Agree. At least, maybe we should consider if an IO controller could be
>>> a valid solution also for these problems.
>> Isn't this one of the core points that we keep going back and forth
>> over? It seems like people are arguing in circles over this:
>>
>> Do we:
>> 1. control potential memory usage by throttling I/O
>> or
>> 2. Throttle I/O when memory is full
>>
>> I might lean toward (1) if we didn't already have a memory controller.
>> But, we have one, and it works. Also, we *already* do (2) in the
>> kernel, so it would seem to graft well onto existing mechanisms that we
>> have.
>>
>> I/O controllers should not worry about memory.
> I agree here ;)
>
>> They're going to have a hard enough time getting the I/O part right. :)
>>
> memcg has more problems now ;(
>
> The only difficult thing about limiting the dirty ratio in memcg is how to
> count dirty pages. If the I/O controller's hooks help with that, good.
>
> My small concern is what happens if we throttle the I/O bandwidth too
> aggressively under some memcg. In such a cgroup we may see more OOMs because
> I/O will not finish in time.
> A system admin has to find some way to avoid this.
>
> But please do I/O control first. Dirty-page control is a related problem, but
> it belongs to a different layer, I think.
Yes, please solve the I/O control problem first.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Hi Fernando,
> This RFC ended up being a bit longer than I had originally intended, but
> hopefully it will serve as the start of a fruitful discussion.
Thanks a lot for posting the RFC.
> *** Goals
> 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity).
> 2. Being able to perform I/O bandwidth control independently on each
> device.
> 3. I/O bandwidth shaping.
> 4. Scheduler-independent I/O bandwidth control.
> 5. Usable with stacking devices (md, dm and other devices of that
> ilk).
> 6. I/O tracking (handle buffered and asynchronous I/O properly).
>
> The list of goals above is not exhaustive and it is also likely to
> contain some not-so-nice-to-have features so your feedback would be
> appreciated.
I'd like to add the following item to the goals.
7. The ability to select from multiple bandwidth control policies
(proportional sharing, maximum rate limiting, ...) just as an I/O
scheduler can be selected.
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and an elevator-based one, and they could certainly
> coexist. All of them need I/O tracking
> capabilities, so I would like to suggest the plan below to get things
> started:
>
> - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.
> - Fix CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable this should suffice to
> convince the respective maintainer and get the I/O tracking patches
> merged.
> - Implement a block layer resource controller. dm-ioband is a working,
> feature-rich solution, but its dependency on the dm infrastructure is
> likely to find opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
> - If the I/O tracking patches make it into the kernel we could move on
> and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
> - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
I agree with your plan.
We will keep improving bio-cgroup and porting it to the latest kernel.
Thanks,
Ryo Tsuruta
On Wed, 2008-08-06 at 15:18 +0900, Ryo Tsuruta wrote:
> Hi Fernando,
>
> > This RFC ended up being a bit longer than I had originally intended, but
> > hopefully it will serve as the start of a fruitful discussion.
>
> Thanks a lot for posting the RFC.
>
> > *** Goals
> > 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > entity).
> > 2. Being able to perform I/O bandwidth control independently on each
> > device.
> > 3. I/O bandwidth shaping.
> > 4. Scheduler-independent I/O bandwidth control.
> > 5. Usable with stacking devices (md, dm and other devices of that
> > ilk).
> > 6. I/O tracking (handle buffered and asynchronous I/O properly).
> >
> > The list of goals above is not exhaustive and it is also likely to
> > contain some not-so-nice-to-have features so your feedback would be
> > appreciated.
>
> I'd like to add the following item to the goals.
>
> 7. The ability to select from multiple bandwidth control policies
> (proportional sharing, maximum rate limiting, ...) just as an I/O
> scheduler can be selected.
Yep, makes sense.
> > *** How to move on
> >
> > As discussed before, it probably makes sense to have both a block layer
> > I/O controller and an elevator-based one, and they could certainly
> > coexist. All of them need I/O tracking
> > capabilities, so I would like to suggest the plan below to get things
> > started:
> >
> > - Improve the I/O tracking patches (see (6) above) until they are in
> > mergeable shape.
> > - Fix CFQ and AS to use the new I/O tracking functionality to show its
> > benefits. If the performance impact is acceptable this should suffice to
> > convince the respective maintainer and get the I/O tracking patches
> > merged.
> > - Implement a block layer resource controller. dm-ioband is a working,
> > feature-rich solution, but its dependency on the dm infrastructure is
> > likely to find opposition (the dm layer does not handle barriers
> > properly and the maximum size of I/O requests can be limited in some
> > cases). In such a case, we could either try to build a standalone
> > resource controller based on dm-ioband (which would probably hook into
> > generic_make_request) or try to come up with something new.
> > - If the I/O tracking patches make it into the kernel we could move on
> > and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> > (1), (2), and (3) above for details) merged.
> > - Delegate the task of controlling the rate at which a task can
> > generate dirty pages to the memory controller.
>
> I agree with your plan.
> We will keep improving bio-cgroup and porting it to the latest kernel.
Having more users of bio-cgroup would probably help to get it merged, so
we'll certainly send patches as soon as we get our cfq prototype in
shape.
Regards,
Fernando
Hi,
> > > > Buffered write I/O is also related with cache system.
> > > > We must consider this problem as I/O control.
> > >
> > > Agree. At least, maybe we should consider if an IO controller could be
> > > a valid solution also for these problems.
> >
> > Isn't this one of the core points that we keep going back and forth
> > over? It seems like people are arguing in circles over this:
> >
> > Do we:
> > 1. control potential memory usage by throttling I/O
> > or
> > 2. Throttle I/O when memory is full
> >
> > I might lean toward (1) if we didn't already have a memory controller.
> > But, we have one, and it works. Also, we *already* do (2) in the
> > kernel, so it would seem to graft well onto existing mechanisms that we
> > have.
> >
> > I/O controllers should not worry about memory.
> I agree here ;)
>
> > They're going to have a hard enough time getting the I/O part right. :)
> >
> memcg has more problems now ;(
>
> The only difficult thing about limiting the dirty ratio in memcg is how to
> count dirty pages. If the I/O controller's hooks help with that, good.
>
> My small concern is what happens if we throttle the I/O bandwidth too
> aggressively under some memcg. In such a cgroup we may see more OOMs because
> I/O will not finish in time.
I/O controllers are just supposed to emulate a slow device from the
point of view of the processes in a certain cgroup. I think the memory
management layer and the memory controller are the ones which should be
able to handle such situations, even if the emulated device ends up
being as slow as a floppy disk.
> A system admin has to find some way to avoid this.
>
> But please do I/O control first. Dirty-page control is a related problem, but
> it belongs to a different layer, I think.
Yup.
Thanks,
Hirokazu Takahashi.
On Mon, 04 Aug 2008 17:57:48 +0900 (JST)
Ryo Tsuruta <[email protected]> wrote:
> This patch splits the cgroup memory subsystem into two parts.
> One is for tracking pages to find out the owners. The other is
> for controlling how much amount of memory should be assigned to
> each cgroup.
>
> With this patch, you can use the page tracking mechanism even if
> the memory subsystem is off.
>
> Based on 2.6.27-rc1-mm1
> Signed-off-by: Ryo Tsuruta <[email protected]>
> Signed-off-by: Hirokazu Takahashi <[email protected]>
>
Please CC me, Balbir, or Pavel (see the maintainers list) when you try this ;)
After this patch, the total structure is
page <-> page_cgroup <-> bio_cgroup.
(multiple bio_cgroups can be attached to a page_cgroup)
Will this pointer chain add
- significant performance regressions or
- new race conditions?
I would like a looser relationship between them.
For example, adding a simple function:
==
int get_page_io_id(struct page *)
- returns an I/O cgroup ID for this page. If no ID is found, -1 is
returned. The ID is not guaranteed to be a valid value. (The ID can be
obsolete.)
==
And just store the cgroup ID in page_cgroup at page allocation.
Then bio_cgroup becomes independent from page_cgroup; it can get the ID
if available and avoid too much pointer walking.
Thanks,
-Kame
> diff -Ndupr linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h
> --- linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h 2008-08-01 12:18:28.000000000 +0900
> +++ linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h 2008-08-01 19:03:21.000000000 +0900
> @@ -20,12 +20,62 @@
> #ifndef _LINUX_MEMCONTROL_H
> #define _LINUX_MEMCONTROL_H
>
> +#include <linux/rcupdate.h>
> +#include <linux/mm.h>
> +#include <linux/smp.h>
> +#include <linux/bit_spinlock.h>
> +
> struct mem_cgroup;
> struct page_cgroup;
> struct page;
> struct mm_struct;
>
> +#ifdef CONFIG_CGROUP_PAGE
> +/*
> + * We use the lower bit of the page->page_cgroup pointer as a bit spin
> + * lock. We need to ensure that page->page_cgroup is at least two
> + * byte aligned (based on comments from Nick Piggin). But since
> + * bit_spin_lock doesn't actually set that lock bit in a non-debug
> + * uniprocessor kernel, we should avoid setting it here too.
> + */
> +#define PAGE_CGROUP_LOCK_BIT 0x0
> +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> +#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
> +#else
> +#define PAGE_CGROUP_LOCK 0x0
> +#endif
> +
> +/*
> + * A page_cgroup page is associated with every page descriptor. The
> + * page_cgroup helps us identify information about the cgroup
> + */
> +struct page_cgroup {
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + struct list_head lru; /* per cgroup LRU list */
> + struct mem_cgroup *mem_cgroup;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> + struct page *page;
> + int flags;
> +};
> +#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
> +#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
> +#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
> +#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
> +
> +static inline void lock_page_cgroup(struct page *page)
> +{
> + bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> +}
> +
> +static inline int try_lock_page_cgroup(struct page *page)
> +{
> + return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> +}
> +
> +static inline void unlock_page_cgroup(struct page *page)
> +{
> + bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> +}
>
> #define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0)
>
> @@ -34,45 +84,15 @@ extern int mem_cgroup_charge(struct page
> gfp_t gfp_mask);
> extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> -extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
> extern void mem_cgroup_uncharge_page(struct page *page);
> extern void mem_cgroup_uncharge_cache_page(struct page *page);
> -extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
> -
> -extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> - struct list_head *dst,
> - unsigned long *scanned, int order,
> - int mode, struct zone *z,
> - struct mem_cgroup *mem_cont,
> - int active, int file);
> -extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
> -int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
> -
> -extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> -
> -#define mm_match_cgroup(mm, cgroup) \
> - ((cgroup) == mem_cgroup_from_task((mm)->owner))
>
> extern int
> mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
> extern void mem_cgroup_end_migration(struct page *page);
> +extern void page_cgroup_init(void);
>
> -/*
> - * For memory reclaim.
> - */
> -extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
> -extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
> -
> -extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
> -extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
> - int priority);
> -extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
> - int priority);
> -
> -extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
> - int priority, enum lru_list lru);
> -
> -#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +#else /* CONFIG_CGROUP_PAGE */
> static inline void page_reset_bad_cgroup(struct page *page)
> {
> }
> @@ -102,6 +122,53 @@ static inline void mem_cgroup_uncharge_c
> {
> }
>
> +static inline int
> +mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> +{
> + return 0;
> +}
> +
> +static inline void mem_cgroup_end_migration(struct page *page)
> +{
> +}
> +#endif /* CONFIG_CGROUP_PAGE */
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +
> +extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
> +extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
> +
> +extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> + struct list_head *dst,
> + unsigned long *scanned, int order,
> + int mode, struct zone *z,
> + struct mem_cgroup *mem_cont,
> + int active, int file);
> +extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
> +int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
> +
> +extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> +
> +#define mm_match_cgroup(mm, cgroup) \
> + ((cgroup) == mem_cgroup_from_task((mm)->owner))
> +
> +/*
> + * For memory reclaim.
> + */
> +extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
> +extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
> +
> +extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
> +extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
> + int priority);
> +extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
> + int priority);
> +
> +extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
> + int priority, enum lru_list lru);
> +
> +#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> static inline int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask)
> {
> return 0;
> @@ -122,16 +189,6 @@ static inline int task_in_mem_cgroup(str
> return 1;
> }
>
> -static inline int
> -mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> -{
> - return 0;
> -}
> -
> -static inline void mem_cgroup_end_migration(struct page *page)
> -{
> -}
> -
> static inline int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem)
> {
> return 0;
> @@ -163,7 +220,8 @@ static inline long mem_cgroup_calc_recla
> {
> return 0;
> }
> -#endif /* CONFIG_CGROUP_MEM_CONT */
> +
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
>
> #endif /* _LINUX_MEMCONTROL_H */
>
> diff -Ndupr linux-2.6.27-rc1-mm1-ioband/include/linux/mm_types.h linux-2.6.27-rc1-mm1.cg0/include/linux/mm_types.h
> --- linux-2.6.27-rc1-mm1-ioband/include/linux/mm_types.h 2008-08-01 12:18:28.000000000 +0900
> +++ linux-2.6.27-rc1-mm1.cg0/include/linux/mm_types.h 2008-08-01 19:03:21.000000000 +0900
> @@ -92,7 +92,7 @@ struct page {
> void *virtual; /* Kernel virtual address (NULL if
> not kmapped, ie. highmem) */
> #endif /* WANT_PAGE_VIRTUAL */
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> unsigned long page_cgroup;
> #endif
>
> diff -Ndupr linux-2.6.27-rc1-mm1-ioband/init/Kconfig linux-2.6.27-rc1-mm1.cg0/init/Kconfig
> --- linux-2.6.27-rc1-mm1-ioband/init/Kconfig 2008-08-01 12:18:28.000000000 +0900
> +++ linux-2.6.27-rc1-mm1.cg0/init/Kconfig 2008-08-01 19:03:21.000000000 +0900
> @@ -418,6 +418,10 @@ config CGROUP_MEMRLIMIT_CTLR
> memory RSS and Page Cache control. Virtual address space control
> is provided by this controller.
>
> +config CGROUP_PAGE
> + def_bool y
> + depends on CGROUP_MEM_RES_CTLR
> +
> config SYSFS_DEPRECATED
> bool
>
> diff -Ndupr linux-2.6.27-rc1-mm1-ioband/mm/Makefile linux-2.6.27-rc1-mm1.cg0/mm/Makefile
> --- linux-2.6.27-rc1-mm1-ioband/mm/Makefile 2008-08-01 12:18:28.000000000 +0900
> +++ linux-2.6.27-rc1-mm1.cg0/mm/Makefile 2008-08-01 19:03:21.000000000 +0900
> @@ -34,5 +34,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
> obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_SMP) += allocpercpu.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_PAGE) += memcontrol.o
> obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
> diff -Ndupr linux-2.6.27-rc1-mm1-ioband/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c
> --- linux-2.6.27-rc1-mm1-ioband/mm/memcontrol.c 2008-08-01 12:18:28.000000000 +0900
> +++ linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c 2008-08-01 19:48:55.000000000 +0900
> @@ -36,10 +36,25 @@
>
> #include <asm/uaccess.h>
>
> -struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> +enum charge_type {
> + MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> + MEM_CGROUP_CHARGE_TYPE_MAPPED,
> + MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
> +};
> +
> +static void __mem_cgroup_uncharge_common(struct page *, enum charge_type);
> +
> static struct kmem_cache *page_cgroup_cache __read_mostly;
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> #define MEM_CGROUP_RECLAIM_RETRIES 5
>
> +static inline int mem_cgroup_disabled(void)
> +{
> + return mem_cgroup_subsys.disabled;
> +}
> +
> /*
> * Statistics for memory cgroup.
> */
> @@ -136,35 +151,6 @@ struct mem_cgroup {
> };
> static struct mem_cgroup init_mem_cgroup;
>
> -/*
> - * We use the lower bit of the page->page_cgroup pointer as a bit spin
> - * lock. We need to ensure that page->page_cgroup is at least two
> - * byte aligned (based on comments from Nick Piggin). But since
> - * bit_spin_lock doesn't actually set that lock bit in a non-debug
> - * uniprocessor kernel, we should avoid setting it here too.
> - */
> -#define PAGE_CGROUP_LOCK_BIT 0x0
> -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> -#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
> -#else
> -#define PAGE_CGROUP_LOCK 0x0
> -#endif
> -
> -/*
> - * A page_cgroup page is associated with every page descriptor. The
> - * page_cgroup helps us identify information about the cgroup
> - */
> -struct page_cgroup {
> - struct list_head lru; /* per cgroup LRU list */
> - struct page *page;
> - struct mem_cgroup *mem_cgroup;
> - int flags;
> -};
> -#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
> -#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
> -#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
> -#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
> -
> static int page_cgroup_nid(struct page_cgroup *pc)
> {
> return page_to_nid(pc->page);
> @@ -175,12 +161,6 @@ static enum zone_type page_cgroup_zid(st
> return page_zonenum(pc->page);
> }
>
> -enum charge_type {
> - MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> - MEM_CGROUP_CHARGE_TYPE_MAPPED,
> - MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
> -};
> -
> /*
> * Always modified under lru lock. Then, not necessary to preempt_disable()
> */
> @@ -248,37 +228,6 @@ struct mem_cgroup *mem_cgroup_from_task(
> struct mem_cgroup, css);
> }
>
> -static inline int page_cgroup_locked(struct page *page)
> -{
> - return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
> -{
> - VM_BUG_ON(!page_cgroup_locked(page));
> - page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
> -}
> -
> -struct page_cgroup *page_get_page_cgroup(struct page *page)
> -{
> - return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
> -}
> -
> -static void lock_page_cgroup(struct page *page)
> -{
> - bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static int try_lock_page_cgroup(struct page *page)
> -{
> - return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> -static void unlock_page_cgroup(struct page *page)
> -{
> - bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> -}
> -
> static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
> struct page_cgroup *pc)
> {
> @@ -367,7 +316,7 @@ void mem_cgroup_move_lists(struct page *
> struct mem_cgroup_per_zone *mz;
> unsigned long flags;
>
> - if (mem_cgroup_subsys.disabled)
> + if (mem_cgroup_disabled())
> return;
>
> /*
> @@ -506,273 +455,6 @@ unsigned long mem_cgroup_isolate_pages(u
> }
>
> /*
> - * Charge the memory controller for page usage.
> - * Return
> - * 0 if the charge was successful
> - * < 0 if the cgroup is over its limit
> - */
> -static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> - gfp_t gfp_mask, enum charge_type ctype,
> - struct mem_cgroup *memcg)
> -{
> - struct mem_cgroup *mem;
> - struct page_cgroup *pc;
> - unsigned long flags;
> - unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> - struct mem_cgroup_per_zone *mz;
> -
> - pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
> - if (unlikely(pc == NULL))
> - goto err;
> -
> - /*
> - * We always charge the cgroup the mm_struct belongs to.
> - * The mm_struct's mem_cgroup changes on task migration if the
> - * thread group leader migrates. It's possible that mm is not
> - * set, if so charge the init_mm (happens for pagecache usage).
> - */
> - if (likely(!memcg)) {
> - rcu_read_lock();
> - mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> - /*
> - * For every charge from the cgroup, increment reference count
> - */
> - css_get(&mem->css);
> - rcu_read_unlock();
> - } else {
> - mem = memcg;
> - css_get(&memcg->css);
> - }
> -
> - while (res_counter_charge(&mem->res, PAGE_SIZE)) {
> - if (!(gfp_mask & __GFP_WAIT))
> - goto out;
> -
> - if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> - continue;
> -
> - /*
> - * try_to_free_mem_cgroup_pages() might not give us a full
> - * picture of reclaim. Some pages are reclaimed and might be
> - * moved to swap cache or just unmapped from the cgroup.
> - * Check the limit again to see if the reclaim reduced the
> - * current usage of the cgroup before giving up
> - */
> - if (res_counter_check_under_limit(&mem->res))
> - continue;
> -
> - if (!nr_retries--) {
> - mem_cgroup_out_of_memory(mem, gfp_mask);
> - goto out;
> - }
> - }
> -
> - pc->mem_cgroup = mem;
> - pc->page = page;
> - /*
> - * If a page is accounted as a page cache, insert to inactive list.
> - * If anon, insert to active list.
> - */
> - if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
> - pc->flags = PAGE_CGROUP_FLAG_CACHE;
> - if (page_is_file_cache(page))
> - pc->flags |= PAGE_CGROUP_FLAG_FILE;
> - else
> - pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> - } else
> - pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> -
> - lock_page_cgroup(page);
> - if (unlikely(page_get_page_cgroup(page))) {
> - unlock_page_cgroup(page);
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> - css_put(&mem->css);
> - kmem_cache_free(page_cgroup_cache, pc);
> - goto done;
> - }
> - page_assign_page_cgroup(page, pc);
> -
> - mz = page_cgroup_zoneinfo(pc);
> - spin_lock_irqsave(&mz->lru_lock, flags);
> - __mem_cgroup_add_list(mz, pc);
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> - unlock_page_cgroup(page);
> -done:
> - return 0;
> -out:
> - css_put(&mem->css);
> - kmem_cache_free(page_cgroup_cache, pc);
> -err:
> - return -ENOMEM;
> -}
> -
> -int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> -{
> - if (mem_cgroup_subsys.disabled)
> - return 0;
> -
> - /*
> - * If already mapped, we don't have to account.
> - * If page cache, page->mapping has address_space.
> - * But page->mapping may have out-of-use anon_vma pointer,
> - * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
> - * is NULL.
> - */
> - if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> - return 0;
> - if (unlikely(!mm))
> - mm = &init_mm;
> - return mem_cgroup_charge_common(page, mm, gfp_mask,
> - MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> -}
> -
> -int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> - gfp_t gfp_mask)
> -{
> - if (mem_cgroup_subsys.disabled)
> - return 0;
> -
> - /*
> - * Corner case handling. This is called from add_to_page_cache()
> - * in usual. But some FS (shmem) precharges this page before calling it
> - * and call add_to_page_cache() with GFP_NOWAIT.
> - *
> - * For GFP_NOWAIT case, the page may be pre-charged before calling
> - * add_to_page_cache(). (See shmem.c) check it here and avoid to call
> - * charge twice. (It works but has to pay a bit larger cost.)
> - */
> - if (!(gfp_mask & __GFP_WAIT)) {
> - struct page_cgroup *pc;
> -
> - lock_page_cgroup(page);
> - pc = page_get_page_cgroup(page);
> - if (pc) {
> - VM_BUG_ON(pc->page != page);
> - VM_BUG_ON(!pc->mem_cgroup);
> - unlock_page_cgroup(page);
> - return 0;
> - }
> - unlock_page_cgroup(page);
> - }
> -
> - if (unlikely(!mm))
> - mm = &init_mm;
> -
> - return mem_cgroup_charge_common(page, mm, gfp_mask,
> - MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
> -}
> -
> -/*
> - * uncharge if !page_mapped(page)
> - */
> -static void
> -__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> -{
> - struct page_cgroup *pc;
> - struct mem_cgroup *mem;
> - struct mem_cgroup_per_zone *mz;
> - unsigned long flags;
> -
> - if (mem_cgroup_subsys.disabled)
> - return;
> -
> - /*
> - * Check if our page_cgroup is valid
> - */
> - lock_page_cgroup(page);
> - pc = page_get_page_cgroup(page);
> - if (unlikely(!pc))
> - goto unlock;
> -
> - VM_BUG_ON(pc->page != page);
> -
> - if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> - && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
> - || page_mapped(page)))
> - goto unlock;
> -
> - mz = page_cgroup_zoneinfo(pc);
> - spin_lock_irqsave(&mz->lru_lock, flags);
> - __mem_cgroup_remove_list(mz, pc);
> - spin_unlock_irqrestore(&mz->lru_lock, flags);
> -
> - page_assign_page_cgroup(page, NULL);
> - unlock_page_cgroup(page);
> -
> - mem = pc->mem_cgroup;
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> - css_put(&mem->css);
> -
> - kmem_cache_free(page_cgroup_cache, pc);
> - return;
> -unlock:
> - unlock_page_cgroup(page);
> -}
> -
> -void mem_cgroup_uncharge_page(struct page *page)
> -{
> - __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> -}
> -
> -void mem_cgroup_uncharge_cache_page(struct page *page)
> -{
> - VM_BUG_ON(page_mapped(page));
> - __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> -}
> -
> -/*
> - * Before starting migration, account against new page.
> - */
> -int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> -{
> - struct page_cgroup *pc;
> - struct mem_cgroup *mem = NULL;
> - enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
> - int ret = 0;
> -
> - if (mem_cgroup_subsys.disabled)
> - return 0;
> -
> - lock_page_cgroup(page);
> - pc = page_get_page_cgroup(page);
> - if (pc) {
> - mem = pc->mem_cgroup;
> - css_get(&mem->css);
> - if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
> - ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> - }
> - unlock_page_cgroup(page);
> - if (mem) {
> - ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> - ctype, mem);
> - css_put(&mem->css);
> - }
> - return ret;
> -}
> -
> -/* remove redundant charge if migration failed*/
> -void mem_cgroup_end_migration(struct page *newpage)
> -{
> - /*
> - * At success, page->mapping is not NULL.
> - * special rollback care is necessary when
> - * 1. at migration failure. (newpage->mapping is cleared in this case)
> - * 2. the newpage was moved but not remapped again because the task
> - * exits and the newpage is obsolete. In this case, the new page
> - * may be a swapcache. So, we just call mem_cgroup_uncharge_page()
> - * always for avoiding mess. The page_cgroup will be removed if
> - * unnecessary. File cache pages is still on radix-tree. Don't
> - * care it.
> - */
> - if (!newpage->mapping)
> - __mem_cgroup_uncharge_common(newpage,
> - MEM_CGROUP_CHARGE_TYPE_FORCE);
> - else if (PageAnon(newpage))
> - mem_cgroup_uncharge_page(newpage);
> -}
> -
> -/*
> * A call to try to shrink memory usage under specified resource controller.
> * This is typically used for page reclaiming for shmem for reducing side
> * effect of page allocation from shmem, which is used by some mem_cgroup.
> @@ -783,7 +465,7 @@ int mem_cgroup_shrink_usage(struct mm_st
> int progress = 0;
> int retry = MEM_CGROUP_RECLAIM_RETRIES;
>
> - if (mem_cgroup_subsys.disabled)
> + if (mem_cgroup_disabled())
> return 0;
>
> rcu_read_lock();
> @@ -1104,7 +786,7 @@ mem_cgroup_create(struct cgroup_subsys *
>
> if (unlikely((cont->parent) == NULL)) {
> mem = &init_mem_cgroup;
> - page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
> + page_cgroup_init();
> } else {
> mem = mem_cgroup_alloc();
> if (!mem)
> @@ -1188,3 +870,325 @@ struct cgroup_subsys mem_cgroup_subsys =
> .attach = mem_cgroup_move_task,
> .early_init = 0,
> };
> +
> +#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +static inline int mem_cgroup_disabled(void)
> +{
> + return 1;
> +}
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> +static inline int page_cgroup_locked(struct page *page)
> +{
> + return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
> +}
> +
> +static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
> +{
> + VM_BUG_ON(!page_cgroup_locked(page));
> + page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
> +}
> +
> +struct page_cgroup *page_get_page_cgroup(struct page *page)
> +{
> + return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
> +}
> +
> +/*
> + * Charge the memory controller for page usage.
> + * Return
> + * 0 if the charge was successful
> + * < 0 if the cgroup is over its limit
> + */
> +static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> + gfp_t gfp_mask, enum charge_type ctype,
> + struct mem_cgroup *memcg)
> +{
> + struct page_cgroup *pc;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + struct mem_cgroup *mem;
> + unsigned long flags;
> + unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> + struct mem_cgroup_per_zone *mz;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> + pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
> + if (unlikely(pc == NULL))
> + goto err;
> +
> + /*
> + * We always charge the cgroup the mm_struct belongs to.
> + * The mm_struct's mem_cgroup changes on task migration if the
> + * thread group leader migrates. It's possible that mm is not
> + * set, if so charge the init_mm (happens for pagecache usage).
> + */
> + if (likely(!memcg)) {
> + rcu_read_lock();
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + /*
> + * For every charge from the cgroup, increment reference count
> + */
> + css_get(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> + rcu_read_unlock();
> + } else {
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + mem = memcg;
> + css_get(&memcg->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> + }
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + while (res_counter_charge(&mem->res, PAGE_SIZE)) {
> + if (!(gfp_mask & __GFP_WAIT))
> + goto out;
> +
> + if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> + continue;
> +
> + /*
> + * try_to_free_mem_cgroup_pages() might not give us a full
> + * picture of reclaim. Some pages are reclaimed and might be
> + * moved to swap cache or just unmapped from the cgroup.
> + * Check the limit again to see if the reclaim reduced the
> + * current usage of the cgroup before giving up
> + */
> + if (res_counter_check_under_limit(&mem->res))
> + continue;
> +
> + if (!nr_retries--) {
> + mem_cgroup_out_of_memory(mem, gfp_mask);
> + goto out;
> + }
> + }
> + pc->mem_cgroup = mem;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> + pc->page = page;
> + /*
> + * If a page is accounted as a page cache, insert to inactive list.
> + * If anon, insert to active list.
> + */
> + if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
> + pc->flags = PAGE_CGROUP_FLAG_CACHE;
> + if (page_is_file_cache(page))
> + pc->flags |= PAGE_CGROUP_FLAG_FILE;
> + else
> + pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
> + } else
> + pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
> +
> + lock_page_cgroup(page);
> + if (unlikely(page_get_page_cgroup(page))) {
> + unlock_page_cgroup(page);
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + css_put(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> + kmem_cache_free(page_cgroup_cache, pc);
> + goto done;
> + }
> + page_assign_page_cgroup(page, pc);
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + mz = page_cgroup_zoneinfo(pc);
> + spin_lock_irqsave(&mz->lru_lock, flags);
> + __mem_cgroup_add_list(mz, pc);
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> + unlock_page_cgroup(page);
> +done:
> + return 0;
> +out:
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + css_put(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> + kmem_cache_free(page_cgroup_cache, pc);
> +err:
> + return -ENOMEM;
> +}
> +
> +int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +{
> + if (mem_cgroup_disabled())
> + return 0;
> +
> + /*
> + * If already mapped, we don't have to account.
> + * If page cache, page->mapping has address_space.
> + * But page->mapping may have out-of-use anon_vma pointer,
> + * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
> + * is NULL.
> + */
> + if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> + return 0;
> + if (unlikely(!mm))
> + mm = &init_mm;
> + return mem_cgroup_charge_common(page, mm, gfp_mask,
> + MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> +}
> +
> +int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> + gfp_t gfp_mask)
> +{
> + if (mem_cgroup_disabled())
> + return 0;
> +
> + /*
> + * Corner case handling. This is called from add_to_page_cache()
> + * in usual. But some FS (shmem) precharges this page before calling it
> + * and call add_to_page_cache() with GFP_NOWAIT.
> + *
> + * For GFP_NOWAIT case, the page may be pre-charged before calling
> + * add_to_page_cache(). (See shmem.c) check it here and avoid to call
> + * charge twice. (It works but has to pay a bit larger cost.)
> + */
> + if (!(gfp_mask & __GFP_WAIT)) {
> + struct page_cgroup *pc;
> +
> + lock_page_cgroup(page);
> + pc = page_get_page_cgroup(page);
> + if (pc) {
> + VM_BUG_ON(pc->page != page);
> + VM_BUG_ON(!pc->mem_cgroup);
> + unlock_page_cgroup(page);
> + return 0;
> + }
> + unlock_page_cgroup(page);
> + }
> +
> + if (unlikely(!mm))
> + mm = &init_mm;
> +
> + return mem_cgroup_charge_common(page, mm, gfp_mask,
> + MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
> +}
> +
> +/*
> + * uncharge if !page_mapped(page)
> + */
> +static void
> +__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
> +{
> + struct page_cgroup *pc;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + struct mem_cgroup *mem;
> + struct mem_cgroup_per_zone *mz;
> + unsigned long flags;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + /*
> + * Check if our page_cgroup is valid
> + */
> + lock_page_cgroup(page);
> + pc = page_get_page_cgroup(page);
> + if (unlikely(!pc))
> + goto unlock;
> +
> + VM_BUG_ON(pc->page != page);
> +
> + if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> + && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
> + || page_mapped(page)
> + || PageSwapCache(page)))
> + goto unlock;
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + mz = page_cgroup_zoneinfo(pc);
> + spin_lock_irqsave(&mz->lru_lock, flags);
> + __mem_cgroup_remove_list(mz, pc);
> + spin_unlock_irqrestore(&mz->lru_lock, flags);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> + page_assign_page_cgroup(page, NULL);
> + unlock_page_cgroup(page);
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + mem = pc->mem_cgroup;
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + css_put(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +
> + kmem_cache_free(page_cgroup_cache, pc);
> + return;
> +unlock:
> + unlock_page_cgroup(page);
> +}
> +
> +void mem_cgroup_uncharge_page(struct page *page)
> +{
> + __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> +}
> +
> +void mem_cgroup_uncharge_cache_page(struct page *page)
> +{
> + VM_BUG_ON(page_mapped(page));
> + __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> +}
> +
> +/*
> + * Before starting migration, account against new page.
> + */
> +int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
> +{
> + struct page_cgroup *pc;
> + struct mem_cgroup *mem = NULL;
> + enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
> + int ret = 0;
> +
> + if (mem_cgroup_disabled())
> + return 0;
> +
> + lock_page_cgroup(page);
> + pc = page_get_page_cgroup(page);
> + if (pc) {
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + mem = pc->mem_cgroup;
> + css_get(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> + if (pc->flags & PAGE_CGROUP_FLAG_CACHE)
> + ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> + }
> + unlock_page_cgroup(page);
> + if (mem) {
> + ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
> + ctype, mem);
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + css_put(&mem->css);
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> + }
> + return ret;
> +}
> +
> +/* remove redundant charge if migration failed*/
> +void mem_cgroup_end_migration(struct page *newpage)
> +{
> + /*
> + * At success, page->mapping is not NULL.
> + * special rollback care is necessary when
> + * 1. at migration failure. (newpage->mapping is cleared in this case)
> + * 2. the newpage was moved but not remapped again because the task
> + * exits and the newpage is obsolete. In this case, the new page
> + * may be a swapcache. So, we just call mem_cgroup_uncharge_page()
> + * always for avoiding mess. The page_cgroup will be removed if
> + * unnecessary. File cache pages is still on radix-tree. Don't
> + * care it.
> + */
> + if (!newpage->mapping)
> + __mem_cgroup_uncharge_common(newpage,
> + MEM_CGROUP_CHARGE_TYPE_FORCE);
> + else if (PageAnon(newpage))
> + mem_cgroup_uncharge_page(newpage);
> +}
> +
> +void page_cgroup_init()
> +{
> + if (!page_cgroup_cache)
> + page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
> +}
Hi,
> > This patch splits the cgroup memory subsystem into two parts.
> > One is for tracking pages to find out the owners. The other is
> > for controlling how much amount of memory should be assigned to
> > each cgroup.
> >
> > With this patch, you can use the page tracking mechanism even if
> > the memory subsystem is off.
> >
> > Based on 2.6.27-rc1-mm1
> > Signed-off-by: Ryo Tsuruta <[email protected]>
> > Signed-off-by: Hirokazu Takahashi <[email protected]>
> >
>
> Plese CC me or Balbir or Pavel (See Maintainer list) when you try this ;)
>
> After this patch, the total structure is
>
> page <-> page_cgroup <-> bio_cgroup.
> (multiple bio_cgroup can be attached to page_cgroup)
>
> Does this pointer chain will add
> - significant performance regression or
> - new race condtions
> ?
I don't think it will cause a significant performance loss, because
the link between a page and its page_cgroup already exists; the memory
resource controller set it up, and bio-cgroup simply reuses it as is.
The link between a page_cgroup and its bio_cgroup isn't protected
by any additional spin-locks either, since the associated bio_cgroup is
guaranteed to exist as long as it owns pages.
I've noticed that most of the overhead comes from the spin-locks
taken when reclaiming pages inside mem_cgroups and from the spin-locks
that protect the links between pages and page_cgroups.
The latter overhead is a consequence of the policy your team has
chosen, namely that page_cgroup structures are allocated on demand. I
still feel this approach doesn't make much sense, because the Linux
kernel tries to make use of as many pages as it can, so most of them
end up being assigned a page_cgroup anyway. It would make us happy
if page_cgroups were allocated at boot time.
> For example, adding a simple function.
> ==
> int get_page_io_id(struct page *)
> - returns a I/O cgroup ID for this page. If ID is not found, -1 is returned.
> ID is not guaranteed to be valid value. (ID can be obsolete)
> ==
> And just storing cgroup ID to page_cgroup at page allocation.
> Then, making bio_cgroup independent from page_cgroup and
> get ID if avialble and avoid too much pointer walking.
I don't think there is any real difference between a pointer and an ID;
the ID is just an encoded version of the pointer.
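To make the point concrete, the ID scheme proposed above can be modeled in a few lines of user-space C. This is only an illustrative sketch, not the kernel implementation; the names `io_cgroup`, `table`, `generation`, `mk_id` and `lookup_id` are all made up for the example. The point under discussion is that a pointer always dereferences to a live object, while an ID must be looked up and may have become obsolete, which the caller has to tolerate:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical controller object; "generation" is bumped when the
 * slot is reused, which is what makes an old ID detectably obsolete. */
struct io_cgroup {
	int generation;
	long bytes;		/* example accounting field */
};

#define NR_SLOTS 8
static struct io_cgroup table[NR_SLOTS];

/* Encode slot index and generation into one integer ID. */
static int mk_id(int slot)
{
	return slot | (table[slot].generation << 8);
}

/* Decode an ID; return NULL if the slot was reused since the ID
 * was taken, i.e. the ID is obsolete. */
static struct io_cgroup *lookup_id(int id)
{
	int slot = id & 0xff;

	if (slot >= NR_SLOTS || table[slot].generation != (id >> 8))
		return NULL;
	return &table[slot];
}
```

In this model the ID really is just an encoded pointer plus a staleness check, which is the trade-off being debated: the lookup is cheap, but every caller must handle the "obsolete ID" case.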
> Thanks,
> -Kame
----- Original Message -----
>> > This patch splits the cgroup memory subsystem into two parts.
>> > One is for tracking pages to find out the owners. The other is
>> > for controlling how much amount of memory should be assigned to
>> > each cgroup.
>> >
>> > With this patch, you can use the page tracking mechanism even if
>> > the memory subsystem is off.
>> >
>> > Based on 2.6.27-rc1-mm1
>> > Signed-off-by: Ryo Tsuruta <[email protected]>
>> > Signed-off-by: Hirokazu Takahashi <[email protected]>
>> >
>>
>> Plese CC me or Balbir or Pavel (See Maintainer list) when you try this ;)
>>
>> After this patch, the total structure is
>>
>> page <-> page_cgroup <-> bio_cgroup.
>> (multiple bio_cgroup can be attached to page_cgroup)
>>
>> Does this pointer chain will add
>> - significant performance regression or
>> - new race condtions
>> ?
>
>I don't think it will cause significant performance loss, because
>the link between a page and a page_cgroup has already existed, which
>the memory resource controller prepared. Bio_cgroup uses this as it is,
>and does nothing about this.
>
>And the link between page_cgroup and bio_cgroup isn't protected
>by any additional spin-locks, since the associated bio_cgroup is
>guaranteed to exist as long as the bio_cgroup owns pages.
>
Hmm, I think page_cgroup's cost is visible when
1. a page changes to the in-use state (fault or radix-tree insert),
2. a page changes to the out-of-use state (fault or radix-tree removal), or
3. memcg hits its limit or global LRU reclaim runs.
"1" and "2" can be measured as a 5% loss of exec throughput.
"3" has not been measured (because the LRU walk itself is heavy).
What new occasions to access page_cgroup will you add?
I'll have to take them into account.
>I've just noticed that most of overhead comes from the spin-locks
>when reclaiming the pages inside mem_cgroups and the spin-locks to
>protect the links between pages and page_cgroups.
The overhead of the page <-> page_cgroup lock cannot be caught by
lock_stat at the moment. Do you have numbers?
But OK, there are too many locks ;(
>The latter overhead comes from the policy your team has chosen
>that page_cgroup structures are allocated on demand. I still feel
>this approach doesn't make any sense because linux kernel tries to
>make use of most of the pages as far as it can, so most of them
>have to be assigned its related page_cgroup. It would make us happy
>if page_cgroups are allocated at the booting time.
>
A multi-sized page cache has been under discussion for a long time. If
that is our direction, on-demand page_cgroup allocation makes sense.
>> For example, adding a simple function.
>> ==
>> int get_page_io_id(struct page *)
>> - returns a I/O cgroup ID for this page. If ID is not found, -1 is returned.
>> ID is not guaranteed to be valid value. (ID can be obsolete)
>> ==
>> And just storing cgroup ID to page_cgroup at page allocation.
>> Then, making bio_cgroup independent from page_cgroup and
>> get ID if avialble and avoid too much pointer walking.
>
>I don't think there are any diffrences between a poiter and ID.
>I think this ID is just a encoded version of the pointer.
>
An ID can become obsolete, a pointer cannot. Does the memory cgroup
have to take care of bio-cgroup's race conditions? (Race conditions are
already complicated enough.)
To be honest, I think adding a new (4 or 8 byte) field to struct page
and recording bio-control information there is the more straightforward
approach. But as you might think, "there is no room".
Thanks,
-Kame
On Wed, 2008-08-06 at 15:41 +0900, Fernando Luis Vázquez Cao wrote:
> > I agree with your plan.
> > We keep bio-cgroup improving and porting to the latest kernel.
> Having more users of bio-cgroup would probably help to get it merged, so
> we'll certainly send patches as soon as we get our cfq prototype in
> shape.
I'm confused. Are these two of the competing controllers? Or are they
complementary in some way?
-- Dave
Fernando Luis Vázquez Cao wrote:
> On Mon, 2008-08-04 at 10:20 -0700, Dave Hansen wrote:
>> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
>>> This series of patches of dm-ioband now includes "The bio tracking mechanism,"
>>> which has been posted individually to this mailing list.
>>> This makes it easy for anybody to control the I/O bandwidth even when
>>> the I/O is one of delayed-write requests.
>> During the Containers mini-summit at OLS, it was mentioned that there
>> are at least *FOUR* of these I/O controllers floating around. Have you
>> talked to the other authors? (I've cc'd at least one of them).
>>
>> We obviously can't come to any kind of real consensus with people just
>> tossing the same patches back and forth.
>>
>> -- Dave
>
> Hi Dave,
>
> I have been tracking the memory controller patches for a while which
> spurred my interest in cgroups and prompted me to start working on I/O
> bandwidth controlling mechanisms. This year I have had several
> opportunities to discuss the design challenges of i/o controllers with
> the NEC and VALinux Japan teams (CCed), most recently last month during
> the Linux Foundation Japan Linux Symposium, where we took advantage of
> Andrew Morton's visit to Japan to do some brainstorming on this topic. I
> will try so summarize what was discussed there (and in the Linux Storage
> & Filesystem Workshop earlier this year) and propose a hopefully
> acceptable way to proceed and try to get things started.
>
> This RFC ended up being a bit longer than I had originally intended, but
> hopefully it will serve as the start of a fruitful discussion.
>
> As you pointed out, it seems that there is not much consensus building
> going on, but that does not mean there is a lack of interest. To get the
> ball rolling it is probably a good idea to clarify the state of things
> and try to establish what we are trying to accomplish.
>
> *** State of things in the mainstream kernel
> The kernel has had somewhat advanced I/O control capabilities for quite
> some time now: CFQ. But the current CFQ has some problems:
> - I/O priority can be set by PID, PGRP, or UID, but...
> - ...all the processes that fall within the same class/priority are
> scheduled together and arbitrary groupings are not possible.
> - Buffered I/O is not handled properly.
> - CFQ's IO priority is an attribute of a process that affects all
> devices it sends I/O requests to. In other words, with the current
> implementation it is not possible to assign per-device IO priorities to
> a task.
>
> *** Goals
> 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity).
> 2. Being able to perform I/O bandwidth control independently on each
> device.
> 3. I/O bandwidth shaping.
> 4. Scheduler-independent I/O bandwidth control.
> 5. Usable with stacking devices (md, dm and other devices of that
> ilk).
> 6. I/O tracking (handle buffered and asynchronous I/O properly).
>
> The list of goals above is not exhaustive and it is also likely to
> contain some not-so-nice-to-have features so your feedback would be
> appreciated.
>
Would you like to split I/O up into reads and writes? We know that
reads can be very latency-sensitive compared to writes. Should we
consider them separately in the RFC?
> 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity)
>
> We obviously need this because our final goal is to be able to control
> the IO generated by a Linux container. The good news is that we already
> have the cgroups infrastructure so, regarding this problem, we would
> just have to transform our I/O bandwidth controller into a cgroup
> subsystem.
>
> This seems to be the easiest part, but the current cgroups
> infrastructure has some limitations when it comes to dealing with block
> devices: impossibility of creating/removing certain control structures
> dynamically and hardcoding of subsystems (i.e. resource controllers).
> This makes it difficult to handle block devices that can be hotplugged
> and go away at any time (this applies not only to usb storage but also
> to some SATA and SCSI devices). To cope with this situation properly we
> would need hotplug support in cgroups, but, as suggested before and
> discussed in the past (see (0) below), there are some limitations.
>
> Even in the non-hotplug case it would be nice if we could treat each
> block I/O device as an independent resource, which means we could do
> things like allocating I/O bandwidth on a per-device basis. As long as
> performance is not compromised too much, adding some kind of basic
> hotplug support to cgroups is probably worth it.
>
Won't that get too complex? What if the user has thousands of disks with
several partitions on each?
> (0) http://lkml.org/lkml/2008/5/21/12
>
> 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
>
> The implementation of an I/O scheduling algorithm is to a certain extent
> influenced by what we are trying to achieve in terms of I/O bandwidth
> shaping, but, as discussed below, the required accuracy can determine
> the layer where the I/O controller has to reside. Off the top of my
> head, there are three basic operations we may want to perform:
> - I/O nice prioritization: ionice-like approach.
> - Proportional bandwidth scheduling: each process/group of processes
> has a weight that determines the share of bandwidth they receive.
> - I/O limiting: set an upper limit to the bandwidth a group of tasks
> can use.
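Of the three operations above, proportional bandwidth scheduling is perhaps the least self-explanatory. The weighted-share arithmetic can be sketched in a few lines of user-space C; this is only an illustration of the idea, not code from dm-ioband or any of the schedulers referenced below, and the names `io_group` and `grant_bw` are invented for the example:

```c
#include <assert.h>

/* Toy model of proportional bandwidth scheduling: each group is
 * granted a share of the device bandwidth proportional to its
 * weight relative to the sum of all weights. */
struct io_group {
	const char *name;
	unsigned int weight;	/* relative share */
};

/* Return the bandwidth (e.g. in KiB/s) granted to group i. */
static unsigned long grant_bw(unsigned long total_bw,
			      const struct io_group *groups, int n, int i)
{
	unsigned long sum = 0;
	int j;

	for (j = 0; j < n; j++)
		sum += groups[j].weight;
	return sum ? total_bw * groups[i].weight / sum : 0;
}
```

With weights 60/20/20 on a 100 MB/s device, the groups are granted 60, 20 and 20 MB/s respectively; unlike I/O limiting, an idle group's share can be redistributed by a real scheduler, which this toy model does not attempt.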
>
> If we are pursuing an I/O prioritization model à la CFQ the temptation is
> to implement it at the elevator layer or extend any of the existing I/O
> schedulers.
>
> There have been several proposals that extend either the CFQ scheduler
> (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> with these controllers is that they are scheduler dependent, which means
> that they become unusable when we change the scheduler or when we want
> to control stacking devices which define their own make_request_fn
> function (md and dm come to mind). It could be argued that the physical
> devices controlled by a dm or md driver are likely to be fed by
> traditional I/O schedulers such as CFQ, but these I/O schedulers would
> be running independently from each other, each one controlling its own
> device, ignoring the fact that they are part of a stacking device. This lack
> of information at the elevator layer makes it pretty difficult to obtain
> accurate results when using stacking devices. It seems that unless we
> can make the elevator layer aware of the topology of stacking devices
> (possibly by extending the elevator API?) elevator-based approaches do
> not constitute a generic solution. Here onwards, for discussion
> purposes, I will refer to this type of I/O bandwidth controllers as
> elevator-based I/O controllers.
>
> A simple way of solving the problems discussed in the previous paragraph
> is to perform I/O control before the I/O actually enters the block layer
> either at the pagecache level (when pages are dirtied) or at the entry
> point to the generic block layer (generic_make_request()). Andrea's I/O
> throttling patches stick to the former variant (see (4) below) and
> Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) take the latter
> approach. The rationale is that by hooking into the source of I/O
> requests we can perform I/O control in a topology-agnostic and
> elevator-agnostic way. I will refer to this new type of I/O bandwidth
> controller as a block layer I/O controller.
>
> By residing just above the generic block layer the implementation of a
> block layer I/O controller becomes relatively easy, but by not taking
> into account the characteristics of the underlying devices we might risk
> underutilizing them. For this reason, in some cases it would probably
> make sense to complement a generic I/O controller with an elevator-based
> I/O controller, so that the maximum throughput can be squeezed from the
> physical devices.
>
> (1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
> (2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
> (3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
> (4) Andrea Righi's i/o bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
> (5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
>
> 6.- I/O tracking
>
> This is arguably the most important part, since to perform I/O control
> we need to be able to determine where the I/O is coming from.
>
> Reads are trivial because they are served in the context of the task
> that generated the I/O. But most writes are performed by pdflush,
> kswapd, and friends so performing I/O control just in the synchronous
> I/O path would lead to large inaccuracy. To get this right we would need
> to track ownership all the way up to the pagecache page. In other words,
> it is necessary to track who is dirtying pages so that when they are
> written to disk the right task is charged for that I/O.
>
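The tracking requirement above can be pictured with a minimal sketch (Python, purely illustrative; the real kernel tracks this via page_cgroup structures, not a dictionary): record the dirtier when the page is dirtied, then charge that owner at writeback time even though writeback runs in a kernel thread's context.

```python
# Toy model of page-ownership tracking (not kernel code): remember who
# dirties each page so writeback, which runs in pdflush's context,
# can still charge the right cgroup. All names here are hypothetical.
page_owner = {}   # page id -> cgroup that dirtied it
io_charged = {}   # cgroup  -> bytes of write I/O charged to it

def dirty_page(page, cgroup):
    """Called when a task in `cgroup` dirties `page`."""
    page_owner[page] = cgroup

def writeback(page, size, flusher="pdflush"):
    """The flusher's own identity is irrelevant; charge the tracked owner."""
    owner = page_owner.get(page, flusher)
    io_charged[owner] = io_charged.get(owner, 0) + size

dirty_page(42, "cgroupA")
writeback(42, 4096)     # issued by pdflush...
print(io_charged)       # ...but charged to cgroupA: {'cgroupA': 4096}
```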
> Fortunately, such tracking of pages is one of the things the existing
> memory resource controller is doing to control memory usage. This is a
> clever observation which has a useful implication: if the rather
> imbricated tracking and accounting parts of the memory resource
> controller were split the I/O controller could leverage the existing
> infrastructure to track buffered and asynchronous I/O. This is exactly
> what the bio-cgroup (see (6) below) patches set out to do.
>
Are you suggesting that the IO and memory controller should always be bound
together?
> It is also possible to do without I/O tracking. For that we would need
> to hook into the synchronous I/O path and every place in the kernel
> where pages are dirtied (see (4) above for details). However controlling
> the rate at which a cgroup can generate dirty pages seems to be a task
> that belongs in the memory controller not the I/O controller. As Dave
> and Paul suggested, it's probably better to delegate this to the memory
> controller. In fact, it seems that Yamamoto-san is cooking some patches
> that implement just that: dirty balancing for cgroups (see (7) for
> details).
>
> Another argument in favor of I/O tracking is that not only block layer
> I/O controllers would benefit from it, but also the existing I/O
> schedulers and the elevator-based I/O controllers proposed by
> Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
> are working on this and hopefully will be sending patches soon).
>
> (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> (7) Yamamoto-san's dirty balancing patches: http://lwn.net/Articles/289237/
>
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and an elevator-based one, and they could certainly
> cohabitate. As discussed before, all of them need I/O tracking
> capabilities so I would like to suggest the plan below to get things
> started:
>
> - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.
Yes, I agree with this step as being the first step. Maybe extending the
current task I/O accounting to cgroups could be done as a part of this.
> - Fix CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable this should suffice to
> convince the respective maintainer and get the I/O tracking patches
> merged.
> - Implement a block layer resource controller. dm-ioband is a working
> solution and feature rich but its dependency on the dm infrastructure is
> likely to find opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
> - If the I/O tracking patches make it into the kernel we could move on
> and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
> - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
>
> This RFC is somewhat vague, but my feeling is that we should build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
>
> I would appreciate your comments and feedback.
Very nice summary
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
On Wed, 2008-08-06 at 22:12 +0530, Balbir Singh wrote:
> Would you like to split up IO into read and write IO? We know that read can be
> very latency sensitive when compared to writes. Should we consider them
> separately in the RFC?
I'd just suggest doing what is simplest and can be done in the smallest
amount of code. As long as it is functional in some way and can be
extended to cover the end goal, I say keep it tiny.
> > Even in the non-hotplug case it would be nice if we could treat each
> > block I/O device as an independent resource, which means we could do
> > things like allocating I/O bandwidth on a per-device basis. As long as
> > performance is not compromised too much, adding some kind of basic
> > hotplug support to cgroups is probably worth it.
>
> Won't that get too complex? What if the user has thousands of disks with several
> partitions on each?
I think what Fernando is suggesting is that we *allow* each disk to be
treated separately, not that we actually separate them out. I agree
that with large disk count systems, it would get a bit nutty to deal
with I/O limits on each of them. It would also probably be nutty for
some dude with two disks in his system to have to set (or care about)
individual limits.
I guess I'm just arguing that we should allow pretty arbitrary grouping
of block devices into these resource pools.
-- Dave
Fernando
Nice summary. My comments are inline.
-Naveen
2008/8/5 Fernando Luis Vázquez Cao <[email protected]>:
> On Mon, 2008-08-04 at 10:20 -0700, Dave Hansen wrote:
>> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
>> > This series of patches of dm-ioband now includes "The bio tracking mechanism,"
>> > which has been posted individually to this mailing list.
>> > This makes it easy for anybody to control the I/O bandwidth even when
>> > the I/O is one of delayed-write requests.
>>
>> During the Containers mini-summit at OLS, it was mentioned that there
>> are at least *FOUR* of these I/O controllers floating around. Have you
>> talked to the other authors? (I've cc'd at least one of them).
>>
>> We obviously can't come to any kind of real consensus with people just
>> tossing the same patches back and forth.
>>
>> -- Dave
>
> Hi Dave,
>
> I have been tracking the memory controller patches for a while which
> spurred my interest in cgroups and prompted me to start working on I/O
> bandwidth controlling mechanisms. This year I have had several
> opportunities to discuss the design challenges of i/o controllers with
> the NEC and VALinux Japan teams (CCed), most recently last month during
> the Linux Foundation Japan Linux Symposium, where we took advantage of
> Andrew Morton's visit to Japan to do some brainstorming on this topic. I
> will try to summarize what was discussed there (and in the Linux Storage
> & Filesystem Workshop earlier this year) and propose a hopefully
> acceptable way to proceed and try to get things started.
>
> This RFC ended up being a bit longer than I had originally intended, but
> hopefully it will serve as the start of a fruitful discussion.
>
> As you pointed out, it seems that there is not much consensus building
> going on, but that does not mean there is a lack of interest. To get the
> ball rolling it is probably a good idea to clarify the state of things
> and try to establish what we are trying to accomplish.
>
> *** State of things in the mainstream kernel
> The kernel has had somewhat advanced I/O control capabilities for quite
> some time now: CFQ. But the current CFQ has some problems:
> - I/O priority can be set by PID, PGRP, or UID, but...
> - ...all the processes that fall within the same class/priority are
> scheduled together and arbitrary groupings are not possible.
> - Buffered I/O is not handled properly.
> - CFQ's IO priority is an attribute of a process that affects all
> devices it sends I/O requests to. In other words, with the current
> implementation it is not possible to assign per-device IO priorities to
> a task.
>
> *** Goals
> 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity).
> 2. Being able to perform I/O bandwidth control independently on each
> device.
> 3. I/O bandwidth shaping.
> 4. Scheduler-independent I/O bandwidth control.
> 5. Usable with stacking devices (md, dm and other devices of that
> ilk).
> 6. I/O tracking (handle buffered and asynchronous I/O properly).
>
> The list of goals above is not exhaustive and it is also likely to
> contain some not-so-nice-to-have features so your feedback would be
> appreciated.
>
> 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> identity)
>
> We obviously need this because our final goal is to be able to control
> the IO generated by a Linux container. The good news is that we already
> have the cgroups infrastructure so, regarding this problem, we would
> just have to transform our I/O bandwidth controller into a cgroup
> subsystem.
>
> This seems to be the easiest part, but the current cgroups
> infrastructure has some limitations when it comes to dealing with block
> devices: impossibility of creating/removing certain control structures
> dynamically and hardcoding of subsystems (i.e. resource controllers).
> This makes it difficult to handle block devices that can be hotplugged
> and go away at any time (this applies not only to usb storage but also
> to some SATA and SCSI devices). To cope with this situation properly we
> would need hotplug support in cgroups, but, as suggested before and
> discussed in the past (see (0) below), there are some limitations.
>
> Even in the non-hotplug case it would be nice if we could treat each
> block I/O device as an independent resource, which means we could do
> things like allocating I/O bandwidth on a per-device basis. As long as
> performance is not compromised too much, adding some kind of basic
> hotplug support to cgroups is probably worth it.
>
> (0) http://lkml.org/lkml/2008/5/21/12
>
> 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
>
> The implementation of an I/O scheduling algorithm is to a certain extent
> influenced by what we are trying to achieve in terms of I/O bandwidth
> shaping, but, as discussed below, the required accuracy can determine
> the layer where the I/O controller has to reside. Off the top of my
> head, there are three basic operations we may want to perform:
> - I/O nice prioritization: ionice-like approach.
> - Proportional bandwidth scheduling: each process/group of processes
> has a weight that determines the share of bandwidth they receive.
> - I/O limiting: set an upper limit to the bandwidth a group of tasks
> can use.
I/O limiting can be a special case of proportional bandwidth
scheduling. A process/process group can use its share of
bandwidth, and if there is spare bandwidth it can be allowed to use it.
And if we want to restrict it absolutely, we add another flag which
specifies that the specified proportion is exact and has an upper
bound.
Let's say the ideal b/w for a device is 100MB/s and process 1 is
assigned a b/w of 20%. When we say that the proportion is strict, the
b/w for process 1 will be 20% of the actual b/w (which may be less
than 100MB/s), subject to a max of 20MB/s.
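The strict-proportion arithmetic above can be sketched as follows (Python, illustrative numbers only; `strict_allocation` is a hypothetical helper name, not anything from the patches):

```python
# Sketch of the strict 20% example above. The device's ideal bandwidth
# is 100 MB/s; process 1 holds a strict 20% share, so it gets 20% of
# the achieved bandwidth, never more than 20% of the ideal.
IDEAL_BPS = 100_000_000   # 100 MB/s
SHARE_PCT = 20            # process 1's assigned proportion

def strict_allocation(observed_bps):
    return min(observed_bps * SHARE_PCT // 100,
               IDEAL_BPS * SHARE_PCT // 100)

print(strict_allocation(100_000_000))  # device at full speed: 20 MB/s
print(strict_allocation(50_000_000))   # device degraded: 10 MB/s
```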
>
> If we are pursuing an I/O prioritization model à la CFQ, the temptation is
> to implement it at the elevator layer or extend any of the existing I/O
> schedulers.
>
> There have been several proposals that extend either the CFQ scheduler
> (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> with these controllers is that they are scheduler dependent, which means
> that they become unusable when we change the scheduler or when we want
> to control stacking devices which define their own make_request_fn
> function (md and dm come to mind). It could be argued that the physical
> devices controlled by a dm or md driver are likely to be fed by
> traditional I/O schedulers such as CFQ, but these I/O schedulers would
> be running independently from each other, each one controlling its own
> device, ignoring the fact that they are part of a stacking device. This lack
> of information at the elevator layer makes it pretty difficult to obtain
> accurate results when using stacking devices. It seems that unless we
> can make the elevator layer aware of the topology of stacking devices
> (possibly by extending the elevator API?) elevator-based approaches do
> not constitute a generic solution. Here onwards, for discussion
> purposes, I will refer to this type of I/O bandwidth controllers as
> elevator-based I/O controllers.
It can be argued that any scheduling decision wrt i/o belongs to
elevators. Till now they have been used to improve performance. But
with new requirements to isolate i/o based on process or cgroup, we
need to change the elevators.
If we add another layer of i/o scheduling (block layer I/O controller)
above elevators
1) It builds another layer of i/o scheduling (bandwidth or priority)
2) This new layer can make i/o scheduling decisions which conflict
with the underlying elevator. e.g. If we decide to do b/w scheduling in
this new layer, there is no way a priority-based elevator could work
underneath it.
If a custom make_request_fn is defined (which means the said device is
not using an existing elevator), it could build its own scheduling
rather than asking the kernel to add another layer at the time of i/o
submission, since it has complete control of the i/o.
>
> A simple way of solving the problems discussed in the previous paragraph
> is to perform I/O control before the I/O actually enters the block layer
> either at the pagecache level (when pages are dirtied) or at the entry
> point to the generic block layer (generic_make_request()). Andrea's I/O
> throttling patches stick to the former variant (see (4) below) and
> Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) take the latter
> approach. The rationale is that by hooking into the source of I/O
> requests we can perform I/O control in a topology-agnostic and
> elevator-agnostic way. I will refer to this new type of I/O bandwidth
> controller as a block layer I/O controller.
>
> By residing just above the generic block layer the implementation of a
> block layer I/O controller becomes relatively easy, but by not taking
> into account the characteristics of the underlying devices we might risk
> underutilizing them. For this reason, in some cases it would probably
> make sense to complement a generic I/O controller with an elevator-based
> I/O controller, so that the maximum throughput can be squeezed from the
> physical devices.
>
> (1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
> (2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
> (3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
> (4) Andrea Righi's i/o bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
> (5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
>
> 6.- I/O tracking
>
> This is arguably the most important part, since to perform I/O control
> we need to be able to determine where the I/O is coming from.
>
> Reads are trivial because they are served in the context of the task
> that generated the I/O. But most writes are performed by pdflush,
> kswapd, and friends so performing I/O control just in the synchronous
> I/O path would lead to large inaccuracy. To get this right we would need
> to track ownership all the way up to the pagecache page. In other words,
> it is necessary to track who is dirtying pages so that when they are
> written to disk the right task is charged for that I/O.
>
> Fortunately, such tracking of pages is one of the things the existing
> memory resource controller is doing to control memory usage. This is a
> clever observation which has a useful implication: if the rather
> imbricated tracking and accounting parts of the memory resource
> controller were split the I/O controller could leverage the existing
> infrastructure to track buffered and asynchronous I/O. This is exactly
> what the bio-cgroup (see (6) below) patches set out to do.
>
> It is also possible to do without I/O tracking. For that we would need
> to hook into the synchronous I/O path and every place in the kernel
> where pages are dirtied (see (4) above for details). However controlling
> the rate at which a cgroup can generate dirty pages seems to be a task
> that belongs in the memory controller not the I/O controller. As Dave
> and Paul suggested, it's probably better to delegate this to the memory
> controller. In fact, it seems that Yamamoto-san is cooking some patches
> that implement just that: dirty balancing for cgroups (see (7) for
> details).
>
> Another argument in favor of I/O tracking is that not only block layer
> I/O controllers would benefit from it, but also the existing I/O
> schedulers and the elevator-based I/O controllers proposed by
> Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
> are working on this and hopefully will be sending patches soon).
>
> (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> (7) Yamamoto-san's dirty balancing patches: http://lwn.net/Articles/289237/
>
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and an elevator-based one, and they could certainly
> cohabitate. As discussed before, all of them need I/O tracking
> capabilities so I would like to suggest the plan below to get things
> started:
>
> - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.
> - Fix CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable this should suffice to
> convince the respective maintainer and get the I/O tracking patches
> merged.
> - Implement a block layer resource controller. dm-ioband is a working
> solution and feature rich but its dependency on the dm infrastructure is
> likely to find opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
> - If the I/O tracking patches make it into the kernel we could move on
> and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
> - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
>
> This RFC is somewhat vague, but my feeling is that we should build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
>
> I would appreciate your comments and feedback.
>
> - Fernando
>
>
On Wed, 2008-08-06 at 22:12 +0530, Balbir Singh wrote:
> > 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > identity)
> >
> > We obviously need this because our final goal is to be able to control
> > the IO generated by a Linux container. The good news is that we already
> > have the cgroups infrastructure so, regarding this problem, we would
> > just have to transform our I/O bandwidth controller into a cgroup
> > subsystem.
> >
> > This seems to be the easiest part, but the current cgroups
> > infrastructure has some limitations when it comes to dealing with block
> > devices: impossibility of creating/removing certain control structures
> > dynamically and hardcoding of subsystems (i.e. resource controllers).
> > This makes it difficult to handle block devices that can be hotplugged
> > and go away at any time (this applies not only to usb storage but also
> > to some SATA and SCSI devices). To cope with this situation properly we
> > would need hotplug support in cgroups, but, as suggested before and
> > discussed in the past (see (0) below), there are some limitations.
> >
> > Even in the non-hotplug case it would be nice if we could treat each
> > block I/O device as an independent resource, which means we could do
> > things like allocating I/O bandwidth on a per-device basis. As long as
> > performance is not compromised too much, adding some kind of basic
> > hotplug support to cgroups is probably worth it.
> >
>
> Won't that get too complex? What if the user has thousands of disks with several
> partitions on each?
As Dave pointed out I just think that we should allow each disk to be
treated separately. To avoid the administration nightmare you mention
adding block device grouping capabilities should suffice to solve most
of the issues.
> > 6.- I/O tracking
> >
> > This is arguably the most important part, since to perform I/O control
> > we need to be able to determine where the I/O is coming from.
> >
> > Reads are trivial because they are served in the context of the task
> > that generated the I/O. But most writes are performed by pdflush,
> > kswapd, and friends so performing I/O control just in the synchronous
> > I/O path would lead to large inaccuracy. To get this right we would need
> > to track ownership all the way up to the pagecache page. In other words,
> > it is necessary to track who is dirtying pages so that when they are
> > written to disk the right task is charged for that I/O.
> >
> > Fortunately, such tracking of pages is one of the things the existing
> > memory resource controller is doing to control memory usage. This is a
> > clever observation which has a useful implication: if the rather
> > imbricated tracking and accounting parts of the memory resource
> > controller were split the I/O controller could leverage the existing
> > infrastructure to track buffered and asynchronous I/O. This is exactly
> > what the bio-cgroup (see (6) below) patches set out to do.
> >
>
> Are you suggesting that the IO and memory controller should always be bound
> together?
That is a really good question. The I/O tracking patches split the
memory controller in two functional parts: (1) page tracking and (2)
memory accounting/cgroup policy enforcement. By doing so the memory
controller specific code can be separated from the rest, which
admittedly, will not benefit the memory controller a great deal but,
hopefully, we can get cleaner code that is easier to maintain.
The important thing, though, is that with this separation the page
tracking bits can be easily reused by any subsystem that needs to keep
track of pages, and the I/O controller is certainly one such candidate.
Synchronous I/O is easy to deal with because everything is done in the
context of the task that generated the I/O, but buffered I/O and
asynchronous I/O are problematic. However, with the observation that the
owner of an I/O request happens to be the owner of the pages the I/O
buffers of that request reside in, it becomes clear that pdflush and
friends could use that information to determine who the originator of
the I/O is and charge the I/O request accordingly.
Going back to your question, with the current I/O tracking patches I/O
controller would be bound to the page tracking functionality of cgroups
(page_cgroup) not the memory controller. We would not even need to
compile the memory controller. The dependency on cgroups would still be
there though.
As an aside, I guess that with some effort we could get rid of this
dependency by providing some basic tracking capabilities even when the
cgroups infrastructure is not being used. By doing so traditional I/O
schedulers such as CFQ could benefit from proper I/O tracking
capabilities without using cgroups. Of course if the kernel has cgroups
support compiled in the cgroups I/O tracking would be used instead (this
idea was inspired by CFS' group scheduling, which works both with and
without cgroups support). I am currently trying to implement this.
> > (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> > *** How to move on
> >
> > As discussed before, it probably makes sense to have both a block layer
> > I/O controller and a elevator-based one, and they could certainly
> > cohabitate. As discussed before, all of them need I/O tracking
> > capabilities so I would like to suggest the plan below to get things
> > started:
> >
> > - Improve the I/O tracking patches (see (6) above) until they are in
> > mergeable shape.
>
> Yes, I agree with this step as being the first step. Maybe extending the
> current task I/O accounting to cgroups could be done as a part of this.
Yes, makes sense.
> > - Fix CFQ and AS to use the new I/O tracking functionality to show its
> > benefits. If the performance impact is acceptable this should suffice to
> > convince the respective maintainer and get the I/O tracking patches
> > merged.
> > - Implement a block layer resource controller. dm-ioband is a working
> > solution and feature rich but its dependency on the dm infrastructure is
> > likely to find opposition (the dm layer does not handle barriers
> > properly and the maximum size of I/O requests can be limited in some
> > cases). In such a case, we could either try to build a standalone
> > resource controller based on dm-ioband (which would probably hook into
> > generic_make_request) or try to come up with something new.
> > - If the I/O tracking patches make it into the kernel we could move on
> > and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> > (1), (2), and (3) above for details) merged.
> > - Delegate the task of controlling the rate at which a task can
> > generate dirty pages to the memory controller.
> >
> > This RFC is somewhat vague but my feeling is that we build some
> > consensus on the goals and basic design aspects before delving into
> > implementation details.
> >
> > I would appreciate your comments and feedback.
>
> Very nice summary
Thank you!
On Wed, 2008-08-06 at 22:12 +0530, Balbir Singh wrote:
> > *** Goals
> > 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > entity).
> > 2. Being able to perform I/O bandwidth control independently on each
> > device.
> > 3. I/O bandwidth shaping.
> > 4. Scheduler-independent I/O bandwidth control.
> > 5. Usable with stacking devices (md, dm and other devices of that
> > ilk).
> > 6. I/O tracking (handle buffered and asynchronous I/O properly).
> >
> > The list of goals above is not exhaustive and it is also likely to
> > contain some not-so-nice-to-have features so your feedback would be
> > appreciated.
> >
>
> Would you like to split up IO into read and write IO? We know that read can be
> very latency sensitive when compared to writes. Should we consider them
> separately in the RFC?
Oops, I somehow ended up leaving your first question unanswered. Sorry.
I do not think we should consider them separately, as long as there is a
proper IO tracking infrastructure in place. As you mentioned, reads can
be very latency sensitive, but the read case could be treated as a
special case by the IO controller/IO tracking subsystem. There certainly
are optimization opportunities. For example, in the synchronous I/O path
we could mark bios with the iocontext of the current task, because it
will happen to be the originator of that IO. By effectively caching the
ownership information in the bio we can avoid all the accesses to struct
page, page_cgroup, etc, and reads would definitely benefit from that.
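That caching idea could look roughly like this (a Python sketch; the class and field names are hypothetical stand-ins, not the kernel's actual struct bio):

```python
# Hypothetical sketch: cache the originator in the bio at submission
# time on the synchronous path, so the owner lookup can skip the
# page -> page_cgroup chain; fall back to page tracking otherwise.
class Bio:
    def __init__(self, page, cached_owner=None):
        self.page = page
        self.cached_owner = cached_owner  # set when the submitter is known

page_owner = {7: "cgroupB"}   # slow path: page -> owner tracking table

def bio_owner(bio):
    if bio.cached_owner is not None:
        return bio.cached_owner           # fast path: no page access at all
    return page_owner.get(bio.page)       # buffered/async: consult tracking

print(bio_owner(Bio(page=7, cached_owner="cgroupA")))  # cgroupA (cached)
print(bio_owner(Bio(page=7)))                          # cgroupB (tracked)
```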
On Wed, 2008-08-06 at 08:48 -0700, Dave Hansen wrote:
> On Wed, 2008-08-06 at 15:41 +0900, Fernando Luis Vázquez Cao wrote:
> > > I agree with your plan.
> > > We keep bio-cgroup improving and porting to the latest kernel.
> > Having more users of bio-cgroup would probably help to get it merged, so
> > we'll certainly send patches as soon as we get our cfq prototype in
> > shape.
>
> I'm confused. Are these two of the competing controllers? Or are the
> complementary in some way?
Sorry, I did not explain myself correctly. I was not referring to a new
IO _controller_, I was just trying to say that the traditional IO
_schedulers_ already present in the mainstream kernel would benefit from
proper IO tracking too. As an example, the current implementation of CFQ
assumes that all IO is generated in the IO context of the current task,
which is only true in the synchronous path. This renders CFQ almost
unusable for controlling asynchronous and buffered IO. Of course CFQ
is not to blame here, since it has no way to tell who the real
originator of the IO was (CFQ just sees IO requests coming from pdflush
and other kernel threads).
However, once we have a working IO tracking infrastructure in place, the
existing IO _schedulers_ could be modified so that they use it to
determine the real owner/originator of asynchronous and buffered IO.
This can be done without turning IO schedulers into IO resource
controllers. If we can demonstrate that an IO tracking infrastructure
would also be beneficial outside the cgroups arena, it should be easier
to get it merged.
Hi,
> >> > This patch splits the cgroup memory subsystem into two parts.
> >> > One is for tracking pages to find out the owners. The other is
> >> > for controlling how much amount of memory should be assigned to
> >> > each cgroup.
> >> >
> >> > With this patch, you can use the page tracking mechanism even if
> >> > the memory subsystem is off.
> >> >
> >> > Based on 2.6.27-rc1-mm1
> >> > Signed-off-by: Ryo Tsuruta <[email protected]>
> >> > Signed-off-by: Hirokazu Takahashi <[email protected]>
> >> >
> >>
> >> Please CC me or Balbir or Pavel (See Maintainer list) when you try this ;)
> >>
> >> After this patch, the total structure is
> >>
> >> page <-> page_cgroup <-> bio_cgroup.
> >> (multiple bio_cgroup can be attached to page_cgroup)
> >>
> >> Will this pointer chain add
> >> - significant performance regression or
> >> - new race conditions
> >> ?
> >
> >I don't think it will cause significant performance loss, because
> >the link between a page and a page_cgroup has already existed, which
> >the memory resource controller prepared. Bio_cgroup uses this as it is,
> >and does nothing about this.
> >
> >And the link between page_cgroup and bio_cgroup isn't protected
> >by any additional spin-locks, since the associated bio_cgroup is
> >guaranteed to exist as long as the bio_cgroup owns pages.
> >
> Hmm, I think page_cgroup's cost is visible when
> 1. a page is changed to be in-use state. (fault or radix-tree insert)
> 2. a page is changed to be out-of-use state (fault or radix-tree removal)
> 3. memcg hits its limit or the global LRU reclaim runs.
> "1" and "2" can be observed as a 5% loss of exec throughput.
> "3" is not measured (because the LRU walk itself is heavy.)
>
> What new chances to access page_cgroup will you add?
> I'll have to take them into account.
I haven't added any at this moment, but I think some people may want
to move some pages in the page-cache from one cgroup to another.
When that time comes, I'll try to minimize the cost: I will probably
only update the link between a page_cgroup and a bio_cgroup and
leave the others untouched.
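For readers following along, the pointer chain being discussed (page <-> page_cgroup <-> bio_cgroup) can be sketched roughly as below. This is a standalone illustration, not the actual patch code; all structure and field names here are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of the tracking chain under discussion:
 * page <-> page_cgroup <-> bio_cgroup. Names are hypothetical. */
struct bio_cgroup {
	int id;				/* cgroup charged for the I/O */
};

struct page_cgroup {
	struct page *page;		/* back-pointer to the page */
	struct bio_cgroup *biog;	/* owner of buffered/async I/O */
};

struct page {
	struct page_cgroup *pc;		/* link the memory subsystem sets up */
};

/* Resolve the bio_cgroup a page's writeback should be charged to. */
static struct bio_cgroup *page_to_bio_cgroup(struct page *page)
{
	if (!page->pc)
		return NULL;
	return page->pc->biog;
}
```

The point raised in the thread is that the page -> page_cgroup link already exists for memcg, so only the page_cgroup -> bio_cgroup hop is new.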
> >I've just noticed that most of overhead comes from the spin-locks
> >when reclaiming the pages inside mem_cgroups and the spin-locks to
> >protect the links between pages and page_cgroups.
> The overhead of the page <-> page_cgroup lock cannot be caught by
> lock_stat now. Do you have numbers?
> But ok, there are too many locks ;(
The problem is that every time the lock is held, the associated
cache line is flushed.
> >The latter overhead comes from the policy your team has chosen
> >that page_cgroup structures are allocated on demand. I still feel
> >this approach doesn't make any sense because linux kernel tries to
> >make use of most of the pages as far as it can, so most of them
> >have to be assigned its related page_cgroup. It would make us happy
> >if page_cgroups are allocated at the booting time.
> >
> Now, multi-sized-page-cache has been discussed for a long time. If it's our
> direction, on-demand page_cgroup makes sense.
I don't think I can agree to this.
When multi-sized-page-cache is introduced, some data structures will be
allocated to manage multi-sized-pages. I think page_cgroups should be
allocated at the same time. This approach will make things simple.
It seems like the on-demand allocation approach leads not only to
overhead but also to complexity and a lot of race conditions.
If you allocate page_cgroups when allocating page structures,
you can get rid of most of the locks and you don't have to care about
allocation errors of page_cgroups anymore.
And it will also give us the flexibility that memcg-related data can be
referred to and updated inside critical sections.
> >> For example, adding a simple function.
> >> ==
> >> int get_page_io_id(struct page *)
> >> - returns an I/O cgroup ID for this page. If the ID is not found, -1 is returned.
> >> The ID is not guaranteed to be a valid value. (The ID can be obsolete.)
> >> ==
> >> And just storing cgroup ID to page_cgroup at page allocation.
> >> Then, making bio_cgroup independent from page_cgroup and
> >> get the ID if available and avoid too much pointer walking.
> >
> >I don't think there are any differences between a pointer and an ID.
> >I think this ID is just an encoded version of the pointer.
> >
> An ID can be obsolete, a pointer cannot. Does the memory cgroup have to take care
> of the bio cgroup's race conditions? (About race conditions, it's already
> complicated enough)
Bio-cgroup just expects that the callbacks bio-cgroup prepares are called
when the status of a page_cgroup gets changed.
> To be honest, I think adding a new (4 or 8 byte) member to the page struct and
> recording bio-control information there is a more straightforward approach. But
> as you might think, "there is no room"
But only if everyone allows me to add some new members into "struct page."
I think the same thing goes with memcg you're working on.
Thank you,
Hirokazu Takahashi.
Fernando Luis Vázquez Cao wrote:
> This RFC ended up being a bit longer than I had originally intended, but
> hopefully it will serve as the start of a fruitful discussion.
Thanks for posting this detailed RFC! A few comments below.
> As you pointed out, it seems that there is not much consensus building
> going on, but that does not mean there is a lack of interest. To get the
> ball rolling it is probably a good idea to clarify the state of things
> and try to establish what we are trying to accomplish.
>
> *** State of things in the mainstream kernel
> The kernel has had somewhat advanced I/O control capabilities for quite
> some time now: CFQ. But the current CFQ has some problems:
> - I/O priority can be set by PID, PGRP, or UID, but...
> - ...all the processes that fall within the same class/priority are
> scheduled together and arbitrary groupings are not possible.
> - Buffered I/O is not handled properly.
> - CFQ's IO priority is an attribute of a process that affects all
> devices it sends I/O requests to. In other words, with the current
> implementation it is not possible to assign per-device IO priorities to
> a task.
>
> *** Goals
> 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity).
> 2. Being able to perform I/O bandwidth control independently on each
> device.
> 3. I/O bandwidth shaping.
> 4. Scheduler-independent I/O bandwidth control.
> 5. Usable with stacking devices (md, dm and other devices of that
> ilk).
> 6. I/O tracking (handle buffered and asynchronous I/O properly).
The same applies to IO operations/sec (bandwidth intended not only
in terms of bytes/sec), plus:
7. Optimal bandwidth usage: allow cgroups to exceed their IO limits to take
advantage of free/unused IO resources (i.e. allow "bursts" when the
whole physical bandwidth for a block device is not fully used and
"throttle" again when IO from unlimited cgroups comes into play)
8. "fair throttling": avoid always throttling the same task within a
cgroup; instead try to distribute the throttling among all the tasks
belonging to the throttled cgroup
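The burst-then-throttle behavior of point 7 can be sketched as a token bucket that is simply bypassed while the device is uncontended. This is purely illustrative (neither dm-ioband nor the io-throttle patches necessarily work this way); all names are made up:

```c
#include <assert.h>

/* Hypothetical token-bucket limiter: a cgroup may burst while the
 * device is otherwise idle, and is throttled once competing I/O
 * appears and its bucket drains. All names are illustrative. */
struct io_bucket {
	long tokens;	/* bytes the cgroup may currently send */
	long rate;	/* bytes refilled per tick (the limit)  */
	long burst_max;	/* cap on accumulated credit            */
};

/* Periodic refill, capped so idle periods don't bank unlimited credit. */
static void bucket_tick(struct io_bucket *b)
{
	b->tokens += b->rate;
	if (b->tokens > b->burst_max)
		b->tokens = b->burst_max;
}

/* Returns nonzero if an I/O of 'bytes' may be dispatched now.
 * 'contended' is set when other cgroups are queuing I/O; while the
 * device is uncontended the cgroup may exceed its nominal rate. */
static int bucket_may_dispatch(struct io_bucket *b, long bytes, int contended)
{
	if (!contended)
		return 1;		/* burst: unused bandwidth is free */
	if (b->tokens < bytes)
		return 0;		/* throttle */
	b->tokens -= bytes;
	return 1;
}
```

Point 8 (fair throttling) would then be a policy for choosing *which* task inside the cgroup blocks when `bucket_may_dispatch` returns 0.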
> The list of goals above is not exhaustive and it is also likely to
> contain some not-so-nice-to-have features so your feedback would be
> appreciated.
>
> 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> identity)
>
> We obviously need this because our final goal is to be able to control
> the IO generated by a Linux container. The good news is that we already
> have the cgroups infrastructure so, regarding this problem, we would
> just have to transform our I/O bandwidth controller into a cgroup
> subsystem.
>
> This seems to be the easiest part, but the current cgroups
> infrastructure has some limitations when it comes to dealing with block
> devices: impossibility of creating/removing certain control structures
> dynamically and hardcoding of subsystems (i.e. resource controllers).
> This makes it difficult to handle block devices that can be hotplugged
> and go away at any time (this applies not only to usb storage but also
> to some SATA and SCSI devices). To cope with this situation properly we
> would need hotplug support in cgroups, but, as suggested before and
> discussed in the past (see (0) below), there are some limitations.
>
> Even in the non-hotplug case it would be nice if we could treat each
> block I/O device as an independent resource, which means we could do
> things like allocating I/O bandwidth on a per-device basis. As long as
> performance is not compromised too much, adding some kind of basic
> hotplug support to cgroups is probably worth it.
>
> (0) http://lkml.org/lkml/2008/5/21/12
What about using major,minor numbers to identify each device and account
IO statistics? If a device is unplugged we could reset its IO statistics
and/or remove the IO limitations for that device from userspace (i.e. by a
daemon), but plugging/unplugging the device would not be blocked/affected
in any case. Or am I oversimplifying the problem?
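The suggestion above amounts to keying accounting state on (major, minor) and simply discarding it on unplug, so hotplug is never blocked. A minimal sketch (a fixed-size table stands in for whatever lookup structure a real controller would use; every name here is hypothetical):

```c
#include <assert.h>
#include <string.h>

/* Sketch: per-device I/O accounting keyed by (major, minor). */
#define MAX_DEVS 16

struct dev_stats {
	int major, minor;
	int in_use;
	unsigned long bytes;	/* bytes accounted for this device */
};

static struct dev_stats table[MAX_DEVS];

/* Find the stats slot for a device, allocating one on first sight. */
static struct dev_stats *dev_lookup(int major, int minor)
{
	for (int i = 0; i < MAX_DEVS; i++)
		if (table[i].in_use && table[i].major == major &&
		    table[i].minor == minor)
			return &table[i];
	for (int i = 0; i < MAX_DEVS; i++)
		if (!table[i].in_use) {
			table[i] = (struct dev_stats){ major, minor, 1, 0 };
			return &table[i];
		}
	return NULL;
}

/* On unplug: just forget the stats, so hotplug never has to block. */
static void dev_reset(int major, int minor)
{
	struct dev_stats *d = dev_lookup(major, minor);
	if (d)
		memset(d, 0, sizeof(*d));
}
```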
> 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
>
> The implementation of an I/O scheduling algorithm is to a certain extent
> influenced by what we are trying to achieve in terms of I/O bandwidth
> shaping, but, as discussed below, the required accuracy can determine
> the layer where the I/O controller has to reside. Off the top of my
> head, there are three basic operations we may want to perform:
> - I/O nice prioritization: ionice-like approach.
> - Proportional bandwidth scheduling: each process/group of processes
> has a weight that determines the share of bandwidth they receive.
> - I/O limiting: set an upper limit to the bandwidth a group of tasks
> can use.
Using deadline-based IO scheduling could be an interesting path to
explore as well, IMHO, to try to guarantee per-cgroup minimum bandwidth
requirements.
>
> If we are pursuing an I/O prioritization model à la CFQ the temptation is
> to implement it at the elevator layer or extend any of the existing I/O
> schedulers.
>
> There have been several proposals that extend either the CFQ scheduler
> (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> with these controllers is that they are scheduler dependent, which means
> that they become unusable when we change the scheduler or when we want
> to control stacking devices which define their own make_request_fn
> function (md and dm come to mind). It could be argued that the physical
> devices controlled by a dm or md driver are likely to be fed by
> traditional I/O schedulers such as CFQ, but these I/O schedulers would
> be running independently from each other, each one controlling its own
> device ignoring the fact that they are part of a stacking device. This lack
> of information at the elevator layer makes it pretty difficult to obtain
> accurate results when using stacking devices. It seems that unless we
> can make the elevator layer aware of the topology of stacking devices
> (possibly by extending the elevator API?) elevator-based approaches do
> not constitute a generic solution. Here onwards, for discussion
> purposes, I will refer to this type of I/O bandwidth controllers as
> elevator-based I/O controllers.
>
> A simple way of solving the problems discussed in the previous paragraph
> is to perform I/O control before the I/O actually enters the block layer
> either at the pagecache level (when pages are dirtied) or at the entry
> point to the generic block layer (generic_make_request()). Andrea's I/O
> throttling patches stick to the former variant (see (4) below) and
> Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) take the latter
> approach. The rationale is that by hooking into the source of I/O
> requests we can perform I/O control in a topology-agnostic and
> elevator-agnostic way. I will refer to this new type of I/O bandwidth
> controller as block layer I/O controller.
>
> By residing just above the generic block layer the implementation of a
> block layer I/O controller becomes relatively easy, but by not taking
> into account the characteristics of the underlying devices we might risk
> underutilizing them. For this reason, in some cases it would probably
> make sense to complement a generic I/O controller with elevator-based
> I/O controller, so that the maximum throughput can be squeezed from the
> physical devices.
>
> (1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
> (2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
> (3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
> (4) Andrea Righi's i/o bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
> (5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
>
> 6.- I/O tracking
>
> This is arguably the most important part, since to perform I/O control
> we need to be able to determine where the I/O is coming from.
>
> Reads are trivial because they are served in the context of the task
> that generated the I/O. But most writes are performed by pdflush,
> kswapd, and friends so performing I/O control just in the synchronous
> I/O path would lead to large inaccuracy. To get this right we would need
> to track ownership all the way up to the pagecache page. In other words,
> it is necessary to track who is dirtying pages so that when they are
> written to disk the right task is charged for that I/O.
>
> Fortunately, such tracking of pages is one of the things the existing
> memory resource controller is doing to control memory usage. This is a
> clever observation which has a useful implication: if the rather
> imbricated tracking and accounting parts of the memory resource
> controller were split the I/O controller could leverage the existing
> infrastructure to track buffered and asynchronous I/O. This is exactly
> what the bio-cgroup (see (6) below) patches set out to do.
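The dirty-time tracking idea described above can be sketched as follows. This is a userspace illustration only; the hook and field names are hypothetical and the actual bio-cgroup patches differ in detail:

```c
#include <assert.h>

/* Sketch of charging writeback to the dirtier, not the flusher.
 * At dirty time we record the dirtying cgroup's id on the page's
 * tracking structure; at writeback time a kernel thread such as
 * pdflush reads it back. All names here are illustrative. */
struct tracked_page {
	int dirty;
	int owner_cgroup;	/* recorded when the page is dirtied */
};

/* Called in the context of the task actually dirtying the page. */
static void track_set_page_dirty(struct tracked_page *p, int current_cgroup)
{
	p->dirty = 1;
	p->owner_cgroup = current_cgroup;
}

/* Called later from writeback: the charge goes to the recorded
 * owner, not to the flusher thread's own context. */
static int track_writeback_owner(const struct tracked_page *p)
{
	return p->dirty ? p->owner_cgroup : -1;
}
```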
>
> It is also possible to do without I/O tracking. For that we would need
> to hook into the synchronous I/O path and every place in the kernel
> where pages are dirtied (see (4) above for details). However controlling
> the rate at which a cgroup can generate dirty pages seems to be a task
> that belongs in the memory controller, not the I/O controller. As Dave
> and Paul suggested, it's probably better to delegate this to the memory
> controller. In fact, it seems that Yamamoto-san is cooking some patches
> that implement just that: dirty balancing for cgroups (see (7) for
> details).
>
> Another argument in favor of I/O tracking is that not only block layer
> I/O controllers would benefit from it, but also the existing I/O
> schedulers and the elevator-based I/O controllers proposed by
> Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
> are working on this and hopefully will be sending patches soon).
>
> (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> (7) Yamamoto-san dirty balancing patches: http://lwn.net/Articles/289237/
>
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and an elevator-based one, and they could certainly
> cohabitate. In any case, all of them need I/O tracking
> capabilities so I would like to suggest the plan below to get things
> started:
>
> - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.
> - Fix CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable this should suffice to
> convince the respective maintainer and get the I/O tracking patches
> merged.
> - Implement a block layer resource controller. dm-ioband is a working
> solution and feature rich but its dependency on the dm infrastructure is
> likely to find opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
> - If the I/O tracking patches make it into the kernel we could move on
> and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
> - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
>
> This RFC is somewhat vague but my feeling is that we should build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
>
> I would appreciate your comments and feedback.
Very nice RFC.
-Andrea
On Thu, 07 Aug 2008 16:25:12 +0900 (JST)
Hirokazu Takahashi <[email protected]> wrote:
> > >I've just noticed that most of overhead comes from the spin-locks
> > >when reclaiming the pages inside mem_cgroups and the spin-locks to
> > >protect the links between pages and page_cgroups.
> > The overhead of the page <-> page_cgroup lock cannot be caught by
> > lock_stat now. Do you have numbers?
> > But ok, there are too many locks ;(
>
> The problem is that every time the lock is held, the associated
> cache line is flushed.
I think "page" and "page_cgroup" are not such heavily shared objects in the fast path.
The foot-print is also important here.
(anyway, I'd like to remove lock_page_cgroup() when I find a chance)
>
> > >The latter overhead comes from the policy your team has chosen
> > >that page_cgroup structures are allocated on demand. I still feel
> > >this approach doesn't make any sense because linux kernel tries to
> > >make use of most of the pages as far as it can, so most of them
> > >have to be assigned its related page_cgroup. It would make us happy
> > >if page_cgroups are allocated at the booting time.
> > >
> > Now, multi-sized-page-cache has been discussed for a long time. If it's our
> > direction, on-demand page_cgroup makes sense.
>
> I don't think I can agree to this.
> When multi-sized-page-cache is introduced, some data structures will be
> allocated to manage multi-sized-pages.
Maybe not. It will be encoded into struct page.
> I think page_cgroups should be allocated at the same time.
> This approach will make things simple.
yes, of course.
>
> It seems like the on-demand allocation approach leads not only to
> overhead but also to complexity and a lot of race conditions.
> If you allocate page_cgroups when allocating page structures,
> you can get rid of most of the locks and you don't have to care about
> allocation errors of page_cgroups anymore.
>
> And it will also give us the flexibility that memcg-related data can be
> referred to and updated inside critical sections.
>
But it's not good for the systems with small "NORMAL" pages.
This discussion should be done again when more users of page_cgroup appear and
its overhead is obvious.
Thanks,
-Kame
Hi, Naveen,
> > If we are pursuing an I/O prioritization model à la CFQ the temptation is
> > to implement it at the elevator layer or extend any of the existing I/O
> > schedulers.
> >
> > There have been several proposals that extend either the CFQ scheduler
> > (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> > with these controllers is that they are scheduler dependent, which means
> > that they become unusable when we change the scheduler or when we want
> > to control stacking devices which define their own make_request_fn
> > function (md and dm come to mind). It could be argued that the physical
> > devices controlled by a dm or md driver are likely to be fed by
> > traditional I/O schedulers such as CFQ, but these I/O schedulers would
> > be running independently from each other, each one controlling its own
> > device ignoring the fact that they are part of a stacking device. This lack
> > of information at the elevator layer makes it pretty difficult to obtain
> > accurate results when using stacking devices. It seems that unless we
> > can make the elevator layer aware of the topology of stacking devices
> > (possibly by extending the elevator API?) elevator-based approaches do
> > not constitute a generic solution. Here onwards, for discussion
> > purposes, I will refer to this type of I/O bandwidth controllers as
> > elevator-based I/O controllers.
>
> It can be argued that any scheduling decision wrt i/o belongs to
> elevators. Till now they have been used to improve performance. But
> with new requirements to isolate i/o based on process or cgroup, we
> need to change the elevators.
>
> If we add another layer of i/o scheduling (block layer I/O controller)
> above elevators
> 1) It builds another layer of i/o scheduling (bandwidth or priority)
> 2) This new layer can have decisions for i/o scheduling which conflict
> with underlying elevator. e.g. If we decide to do b/w scheduling in
> this new layer, there is no way a priority based elevator could work
> underneath it.
It seems like the same goes for the current Linux kernel implementation:
if processes issue a lot of I/O requests and the io-request queue
of a disk overflows, all subsequent I/O requests will be blocked
and their priorities become meaningless.
In other words, it won't work if it receives more requests than
the ability/bandwidth of the disk can handle.
So it doesn't seem so weird that it won't work when a cgroup issues
more I/O requests than the bandwidth which is assigned to the cgroup.
> If a custom make_request_fn is defined (which means the said device is
> not using an existing elevator), it could build its own scheduling
> rather than asking kernel to add another layer at the time of i/o
> submission. Since it has complete control of i/o.
Thanks,
Hirokazu Takahashi
Hi,
> > > >I've just noticed that most of overhead comes from the spin-locks
> > > >when reclaiming the pages inside mem_cgroups and the spin-locks to
> > > >protect the links between pages and page_cgroups.
> > > The overhead of the page <-> page_cgroup lock cannot be caught by
> > > lock_stat now. Do you have numbers?
> > > But ok, there are too many locks ;(
> >
> > The problem is that every time the lock is held, the associated
> > cache line is flushed.
> I think "page" and "page_cgroup" are not such heavily shared objects in the fast path.
> The foot-print is also important here.
> (anyway, I'd like to remove lock_page_cgroup() when I find a chance)
OK.
> > > >The latter overhead comes from the policy your team has chosen
> > > >that page_cgroup structures are allocated on demand. I still feel
> > > >this approach doesn't make any sense because linux kernel tries to
> > > >make use of most of the pages as far as it can, so most of them
> > > >have to be assigned its related page_cgroup. It would make us happy
> > > >if page_cgroups are allocated at the booting time.
> > > >
> > > Now, multi-sized-page-cache has been discussed for a long time. If it's our
> > > direction, on-demand page_cgroup makes sense.
> >
> > I don't think I can agree to this.
> > When multi-sized-page-cache is introduced, some data structures will be
> > allocated to manage multi-sized-pages.
> Maybe not. It will be encoded into struct page.
It will be nice and simple if it is.
> > I think page_cgroups should be allocated at the same time.
> > This approach will make things simple.
> yes, of course.
>
> >
> > It seems like the on-demand allocation approach leads not only to
> > overhead but also to complexity and a lot of race conditions.
> > If you allocate page_cgroups when allocating page structures,
> > you can get rid of most of the locks and you don't have to care about
> > allocation errors of page_cgroups anymore.
> >
> > And it will also give us the flexibility that memcg-related data can be
> > referred to and updated inside critical sections.
> >
> But it's not good for the systems with small "NORMAL" pages.
Even when it happens to be a system with small "NORMAL" pages, if you
want to use the memcg feature, you have to allocate page_cgroups for most
of the pages in the system. It's impossible to avoid the allocation as
long as you use memcg.
> This discussion should be done again when more users of page_cgroup appear and
> its overhead is obvious.
Thanks,
Hirokazu Takahashi.
Hi Naveen,
On Wed, 2008-08-06 at 12:37 -0700, Naveen Gupta wrote:
> > 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
> >
> > The implementation of an I/O scheduling algorithm is to a certain extent
> > influenced by what we are trying to achieve in terms of I/O bandwidth
> > shaping, but, as discussed below, the required accuracy can determine
> > the layer where the I/O controller has to reside. Off the top of my
> > head, there are three basic operations we may want to perform:
> > - I/O nice prioritization: ionice-like approach.
> > - Proportional bandwidth scheduling: each process/group of processes
> > has a weight that determines the share of bandwidth they receive.
> > - I/O limiting: set an upper limit to the bandwidth a group of tasks
> > can use.
>
> I/O limiting can be a special case of proportional bandwidth
> scheduling. A process/process group can use its share of
> bandwidth, and if there is spare bandwidth it can be allowed to use it. And
> if we want to absolutely restrict it we add another flag which
> specifies that the specified proportion is exact and has an upper
> bound.
>
> Let's say the ideal b/w for a device is 100MB/s
>
> And process 1 is assigned b/w of 20%. When we say that the proportion
> is strict, the b/w for process 1 will be 20% of the max b/w (which may
> be less than 100MB/s) subject to a max of 20MB/s.
I essentially agree with you. The nice thing about proportional
bandwidth scheduling is that we get bandwidth guarantees when there is
contention for the block device, but still get the benefits of
statistical multiplexing in the non-contended case. With strict IO
limiting we risk underusing the block devices.
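Naveen's 20% example can be stated concretely. The arithmetic below is just the rule from the example above, not code from any of the patches: under a strict proportion, a group gets its share of whatever bandwidth the device currently delivers, capped at its share of the device's ideal bandwidth.

```c
#include <assert.h>

/* Strict-proportion rule from the example: grant pct% of the current
 * device bandwidth, but never more than pct% of the ideal bandwidth.
 * Bandwidths are in MB/s; names are illustrative. */
static long strict_share(long cur_bw, long ideal_bw, int pct)
{
	long grant = cur_bw * pct / 100;	/* share of current b/w */
	long cap   = ideal_bw * pct / 100;	/* absolute upper bound */
	return grant < cap ? grant : cap;
}
```

So with an ideal bandwidth of 100MB/s and a strict 20% share, the group gets 20MB/s at full device speed, 10MB/s when the device only delivers 50MB/s, and never more than 20MB/s even if the device momentarily does better than its ideal figure.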
> > If we are pursuing an I/O prioritization model à la CFQ the temptation is
> > to implement it at the elevator layer or extend any of the existing I/O
> > schedulers.
> >
> > There have been several proposals that extend either the CFQ scheduler
> > (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> > with these controllers is that they are scheduler dependent, which means
> > that they become unusable when we change the scheduler or when we want
> > to control stacking devices which define their own make_request_fn
> > function (md and dm come to mind). It could be argued that the physical
> > devices controlled by a dm or md driver are likely to be fed by
> > traditional I/O schedulers such as CFQ, but these I/O schedulers would
> > be running independently from each other, each one controlling its own
> > device ignoring the fact that they are part of a stacking device. This lack
> > of information at the elevator layer makes it pretty difficult to obtain
> > accurate results when using stacking devices. It seems that unless we
> > can make the elevator layer aware of the topology of stacking devices
> > (possibly by extending the elevator API?) elevator-based approaches do
> > not constitute a generic solution. Here onwards, for discussion
> > purposes, I will refer to this type of I/O bandwidth controllers as
> > elevator-based I/O controllers.
>
> It can be argued that any scheduling decision wrt i/o belongs to
> elevators. Till now they have been used to improve performance. But
> with new requirements to isolate i/o based on process or cgroup, we
> need to change the elevators.
I have the impression there is a tendency to conflate two different
issues when discussing I/O schedulers and resource controllers, so let
me elaborate on this point.
On the one hand, we have the problem of feeding physical devices with IO
requests in such a way that we squeeze the maximum performance out of
them. Of course in some cases we may want to prioritize responsiveness
over throughput. In either case the kernel has to perform the same basic
operations: merging and dispatching IO requests. There is no discussion
this is the elevator's job and the elevator should take into account the
physical characteristics of the device.
On the other hand, there is the problem of sharing an IO resource, i.e.
block device, between multiple tasks or groups of tasks. There are many
ways of sharing an IO resource depending on what we are trying to
accomplish: proportional bandwidth scheduling, priority-based
scheduling, etc. But to implement this sharing algorithms the kernel has
to determine the task whose IO will be submitted. In a sense, we are
scheduling tasks (and groups of tasks) not IO requests (which has much
in common with CPU scheduling). Besides, the sharing problem is not
directly related to the characteristics of the underlying device, which
means it does not need to be implemented at the elevator layer.
Traditional elevators limit themselves to schedule IO requests to disk
with no regard to where it came from. However, new IO schedulers such as
CFQ combine this with IO prioritization capabilities. This means that
the elevator decides the application whose IO will be dispatched next.
The problem is that at this layer there is not enough information to
make such decisions in an accurate way, because, as mentioned in the
RFC, the elevator has no way to know the block IO topology. The
implication of this is that the elevator does not know the impact a
particular scheduling decision will make in the IO throughput seen by
applications, which is what users care about.
For all these reasons, I think the elevator should take care of
optimizing the last stretch of the IO path (generic block layer -> block
device) for performance/responsiveness, and leave the job of ensuring
that each task is guaranteed a fair share of the kernel's IO resources
to the upper layers (for example a block layer resource controller).
I recognize that in some cases global performance could be improved if
the block layer had access to information from the elevator, and that is
why I mentioned in the RFC that in some cases it might make sense to
combine a block layer resource controller and an elevator-layer one (we
would just need to figure out a way for the two to communicate with each
other and work well in tandem).
> If we add another layer of i/o scheduling (block layer I/O controller)
> above elevators
> 1) It builds another layer of i/o scheduling (bandwidth or priority)
As I mentioned before we are trying to achieve two things: making the
best use of block devices, and sharing those IO resources between tasks
or groups of tasks. There are two possible approaches here: implement
everything in the elevator or move the sharing bits somewhere above the
elevator layer. In either case we have to carry out the same tasks so
the impact of delegating part of the work to a new layer should not be
that big, and, hopefully, will improve maintainability.
> 2) This new layer can have decisions for i/o scheduling which conflict
> with underlying elevator. e.g. If we decide to do b/w scheduling in
> this new layer, there is no way a priority based elevator could work
> underneath it.
The priority system could be implemented above the elevator layer in the
block layer resource controller, which means that the elevator would
only have to worry about scheduling the requests it receives from the
block layer and dispatching them to disk in the best possible way.
An alternative would be using a block layer resource controller and an
elevator-based resource controller in tandem.
> If a custom make_request_fn is defined (which means the said device is
> not using an existing elevator),
Please note that each of the block devices that constitute a stacking
device could have its own elevator.
> it could build its own scheduling
> rather than asking kernel to add another layer at the time of i/o
> submission. Since it has complete control of i/o.
I think that is something we should avoid. The IO scheduling behavior
that the user sees should not depend on the topology of the system. We
certainly do not want to reimplement the same scheduling algorithm for
every RAID driver. I am of the opinion that whatever IO scheduling
algorithm we choose should be implemented just once and usable under any
IO configuration.
Hi Andrea!
On Thu, 2008-08-07 at 09:46 +0200, Andrea Righi wrote:
> Fernando Luis Vázquez Cao wrote:
> > This RFC ended up being a bit longer than I had originally intended, but
> > hopefully it will serve as the start of a fruitful discussion.
>
> Thanks for posting this detailed RFC! A few comments below.
>
> > As you pointed out, it seems that there is not much consensus building
> > going on, but that does not mean there is a lack of interest. To get the
> > ball rolling it is probably a good idea to clarify the state of things
> > and try to establish what we are trying to accomplish.
> >
> > *** State of things in the mainstream kernel
> > The kernel has had somewhat advanced I/O control capabilities for quite
> > some time now: CFQ. But the current CFQ has some problems:
> > - I/O priority can be set by PID, PGRP, or UID, but...
> > - ...all the processes that fall within the same class/priority are
> > scheduled together and arbitrary groupings are not possible.
> > - Buffered I/O is not handled properly.
> > - CFQ's IO priority is an attribute of a process that affects all
> > devices it sends I/O requests to. In other words, with the current
> > implementation it is not possible to assign per-device IO priorities to
> > a task.
> >
> > *** Goals
> > 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > entity).
> > 2. Being able to perform I/O bandwidth control independently on each
> > device.
> > 3. I/O bandwidth shaping.
> > 4. Scheduler-independent I/O bandwidth control.
> > 5. Usable with stacking devices (md, dm and other devices of that
> > ilk).
> > 6. I/O tracking (handle buffered and asynchronous I/O properly).
>
> The same as above also applies to IO operations/sec (bandwidth intended
> not only in terms of bytes/sec), plus:
>
> 7. Optimal bandwidth usage: allow exceeding the IO limits to take
> advantage of free/unused IO resources (i.e. allow "bursts" when the
> whole physical bandwidth for a block device is not fully used and then
> "throttle" again when IO from unlimited cgroups comes into place)
>
> 8. "fair throttling": avoid always throttling the same task within a
> cgroup; try to distribute the throttling among all the tasks
> belonging to the throttled cgroup
Thank you for the ideas!
By the way, point "3." above (I/O bandwidth shaping) refers to IO
scheduling algorithms in general. When I wrote the RFC I thought that
once we have the IO tracking and accounting mechanisms in place choosing
and implementing an algorithm (fair throttling, proportional bandwidth
scheduling, etc) would be easy in comparison, and that is the reason a
list was not included.
Once I get more feedback from all of you I will resend a more detailed
RFC that will include your suggestions.
> > 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > identity)
> >
> > We obviously need this because our final goal is to be able to control
> > the IO generated by a Linux container. The good news is that we already
> > have the cgroups infrastructure so, regarding this problem, we would
> > just have to transform our I/O bandwidth controller into a cgroup
> > subsystem.
> >
> > This seems to be the easiest part, but the current cgroups
> > infrastructure has some limitations when it comes to dealing with block
> > devices: impossibility of creating/removing certain control structures
> > dynamically and hardcoding of subsystems (i.e. resource controllers).
> > This makes it difficult to handle block devices that can be hotplugged
> > and go away at any time (this applies not only to usb storage but also
> > to some SATA and SCSI devices). To cope with this situation properly we
> > would need hotplug support in cgroups, but, as suggested before and
> > discussed in the past (see (0) below), there are some limitations.
> >
> > Even in the non-hotplug case it would be nice if we could treat each
> > block I/O device as an independent resource, which means we could do
> > things like allocating I/O bandwidth on a per-device basis. As long as
> > performance is not compromised too much, adding some kind of basic
> > hotplug support to cgroups is probably worth it.
> >
> > (0) http://lkml.org/lkml/2008/5/21/12
>
> What about using major,minor numbers to identify each device and account
> IO statistics? If a device is unplugged we could reset IO statistics
> and/or remove IO limitations for that device from userspace (i.e. by a
> daemon), but plugging/unplugging the device would not be blocked/affected
> in any case. Or am I oversimplifying the problem?
If a resource we want to control (a block device in this case) is
hot-plugged/unplugged the corresponding cgroup-related structures inside
the kernel need to be allocated/freed dynamically, respectively. The
problem is that this is not always possible. For example, with the
current implementation of cgroups it is not possible to treat each block
device as a different cgroup subsystem/resource controller, because
subsystems are created at compile time.
> > 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
> >
> > The implementation of an I/O scheduling algorithm is to a certain extent
> > influenced by what we are trying to achieve in terms of I/O bandwidth
> > shaping, but, as discussed below, the required accuracy can determine
> > the layer where the I/O controller has to reside. Off the top of my
> > head, there are three basic operations we may want perform:
> > - I/O nice prioritization: ionice-like approach.
> > - Proportional bandwidth scheduling: each process/group of processes
> > has a weight that determines the share of bandwidth they receive.
> > - I/O limiting: set an upper limit to the bandwidth a group of tasks
> > can use.
>
> Using deadline-based IO scheduling could be an interesting path to
> explore as well, IMHO, to try to guarantee per-cgroup minimum bandwidth
> requirements.
Please note that the only thing we can do is guarantee a minimum
bandwidth requirement when there is contention for an IO resource, which
is precisely what a proportional bandwidth scheduler does. Am I missing
something?
Hi Fernando,
Good work!
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and a elevator-based one, and they could certainly
> cohabitate. As discussed before, all of them need I/O tracking
> capabilities so I would like to suggest the plan below to get things
> started:
>
> - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.
The current implementation of bio-cgroup is quite basic: a page is
owned by the cgroup that allocated it, which is the same way the memory
controller works. In most cases this is enough and it helps minimize
the overhead.
I think you may want to add a feature to change the owner of a page.
It will be fine to implement that step by step. I know there will be a
tradeoff between the overhead and the accuracy of tracking pages.
We are also trying to reduce the overhead of the tracking, though its
code comes from the memory controller. We should all help the memory
controller team with this.
> - Fix CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable this should suffice to
> convince the respective maintainer and get the I/O tracking patches
> merged.
Yes.
> - Implement a block layer resource controller. dm-ioband is a working
> solution and feature rich but its dependency on the dm infrastructure is
> likely to find opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
I have doubts about the maximum-I/O-request-size problem. You can't
avoid it as long as you use device mapper modules in such a bad manner,
even if the controller is implemented as a stand-alone controller.
There is no limitation if you only use dm-ioband without any other
device mapper modules.
And I think the device mapper team has just started designing barrier
support; I guess it won't take long. Right, Alasdair?
We should note that it is logically impossible to support barriers on
some types of device mapper modules such as LVM. You can't avoid the
barrier problem when you use this kind of stacked device even if you
implement the controller in the block layer.
But I think a stand-alone implementation would have the merit of being
easier to set up than dm-ioband.
From this point of view, it would be good to move the algorithm of
dm-ioband into the block layer.
On the other hand, we should note it will make it impossible to use
the dm infrastructure from the controller, though that isn't so rich.
> - If the I/O tracking patches make it into the kernel we could move on
> and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
> - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
>
> This RFC is somewhat vague but my feeling is that we build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
>
> I would appreciate your comments and feedback.
>
> - Fernando
>
Hi,
> > - Implement a block layer resource controller. dm-ioband is a working
> > solution and feature rich but its dependency on the dm infrastructure is
> > likely to find opposition (the dm layer does not handle barriers
> > properly and the maximum size of I/O requests can be limited in some
> > cases). In such a case, we could either try to build a standalone
> > resource controller based on dm-ioband (which would probably hook into
> > generic_make_request) or try to come up with something new.
>
> I doubt about the maximum size of I/O requests problem. You can't avoid
> this problem as far as you use device mapper modules with such a bad
> manner, even if the controller is implemented as a stand-alone controller.
> There is no limitation if you only use dm-ioband without any other device
> mapper modules.
The following is a part of source code where the limitation comes from.
dm-table.c: dm_set_device_limits()
/*
* Check if merge fn is supported.
* If not we'll force DM to use PAGE_SIZE or
* smaller I/O, just to be safe.
*/
if (q->merge_bvec_fn && !ti->type->merge)
rs->max_sectors =
min_not_zero(rs->max_sectors,
(unsigned int) (PAGE_SIZE >> 9));
As far as I can tell, in 2.6.27-rc1-mm1 only some software RAID
drivers and the pktcdvd driver define merge_bvec_fn().
Thanks,
Ryo Tsuruta
Ryo Tsuruta wrote:
> +static void bio_cgroup_move_task(struct cgroup_subsys *ss,
> + struct cgroup *cont,
> + struct cgroup *old_cont,
> + struct task_struct *p)
> +{
> + struct mm_struct *mm;
> + struct bio_cgroup *biog, *old_biog;
> +
> + if (bio_cgroup_disabled())
> + return;
> +
> + mm = get_task_mm(p);
> + if (mm == NULL)
> + return;
> +
> + biog = cgroup_bio(cont);
> + old_biog = cgroup_bio(old_cont);
> +
> + mmput(mm);
> + return;
> +}
Is this function fully implemented?
I tried to put a process into a group by writing to
"/cgroup/bio/BGROUP/tasks" but failed.
I think this function is not enough to be used as "attach."
> +
> +
> +struct cgroup_subsys bio_cgroup_subsys = {
> + .name = "bio",
> + .subsys_id = bio_cgroup_subsys_id,
> + .create = bio_cgroup_create,
> + .destroy = bio_cgroup_destroy,
> + .pre_destroy = bio_cgroup_pre_destroy,
> + .populate = bio_cgroup_populate,
> + .attach = bio_cgroup_move_task,
> + .early_init = 0,
> +};
Without a working "attach" function, it is difficult to check
the effectiveness of block I/O tracking.
Thanks,
- Takuya Yoshikawa
On Fri, 2008-08-08 at 16:20 +0900, Ryo Tsuruta wrote:
> > > - Implement a block layer resource controller. dm-ioband is a working
> > > solution and feature rich but its dependency on the dm infrastructure is
> > > likely to find opposition (the dm layer does not handle barriers
> > > properly and the maximum size of I/O requests can be limited in some
> > > cases). In such a case, we could either try to build a standalone
> > > resource controller based on dm-ioband (which would probably hook into
> > > generic_make_request) or try to come up with something new.
> >
> > I doubt about the maximum size of I/O requests problem. You can't avoid
> > this problem as far as you use device mapper modules with such a bad
> > manner, even if the controller is implemented as a stand-alone controller.
> > There is no limitation if you only use dm-ioband without any other device
> > mapper modules.
>
> The following is a part of source code where the limitation comes from.
>
> dm-table.c: dm_set_device_limits()
> /*
> * Check if merge fn is supported.
> * If not we'll force DM to use PAGE_SIZE or
> * smaller I/O, just to be safe.
> */
>
> if (q->merge_bvec_fn && !ti->type->merge)
> rs->max_sectors =
> min_not_zero(rs->max_sectors,
> (unsigned int) (PAGE_SIZE >> 9));
>
> As far as I can find, In 2.6.27-rc1-mm1, Only some software RAID
> drivers and pktcdvd driver define merge_bvec_fn().
Yup, exactly. The implication of this is that we may see a drop in
performance in some RAID configurations.
Hi Yoshikawa-san,
> > +static void bio_cgroup_move_task(struct cgroup_subsys *ss,
> > + struct cgroup *cont,
> > + struct cgroup *old_cont,
> > + struct task_struct *p)
> > +{
> > + struct mm_struct *mm;
> > + struct bio_cgroup *biog, *old_biog;
> > +
> > + if (bio_cgroup_disabled())
> > + return;
> > +
> > + mm = get_task_mm(p);
> > + if (mm == NULL)
> > + return;
> > +
> > + biog = cgroup_bio(cont);
> > + old_biog = cgroup_bio(old_cont);
> > +
> > + mmput(mm);
> > + return;
> > +}
>
> Is this function fully implemented?
This function can be simplified further; it still contains some
unnecessary code from an old version.
> I tried to put a process into a group by writing to
> "/cgroup/bio/BGROUP/tasks" but failed.
Could you tell me what you actually did? I will try the same thing.
--
Ryo Tsuruta <[email protected]>
Hi Tsuruta-san,
Ryo Tsuruta wrote:
> Hi Yoshikawa-san,
>
>>> +static void bio_cgroup_move_task(struct cgroup_subsys *ss,
>>> + struct cgroup *cont,
>>> + struct cgroup *old_cont,
>>> + struct task_struct *p)
>>> +{
>>> + struct mm_struct *mm;
>>> + struct bio_cgroup *biog, *old_biog;
>>> +
>>> + if (bio_cgroup_disabled())
>>> + return;
>>> +
>>> + mm = get_task_mm(p);
>>> + if (mm == NULL)
>>> + return;
>>> +
>>> + biog = cgroup_bio(cont);
>>> + old_biog = cgroup_bio(old_cont);
>>> +
>>> + mmput(mm);
>>> + return;
>>> +}
>> Is this function fully implemented?
>
> This function can be more simplified, there is some unnecessary code
> from old version.
>
I think it is necessary to attach the task p to the new biog.
>> I tried to put a process into a group by writing to
>> "/cgroup/bio/BGROUP/tasks" but failed.
>
> Could you tell me what you actually did? I will try the same thing.
>
> --
> Ryo Tsuruta <[email protected]>
>
I wanted to test my own scheduler, which uses bio tracking information.
So I tried your patch, especially get_bio_cgroup_iocontext(), to get
the io_context from a bio.
In my test, I made some threads with certain iopriorities run
concurrently. To schedule these threads based on their iopriorities,
I made BGROUP directories for each iopriority,
e.g. /cgroup/bio/be0 ... /cgroup/bio/be7
Then, I tried to attach the processes to the appropriate groups.
But the processes stayed in the original group (id=0).
...
I am sorry but I have to leave now and I cannot come here next week.
--> I will take summer holidays.
I will reply to you later.
Thanks,
- Takuya Yoshikawa
Hi Fernando,
> > > > - Implement a block layer resource controller. dm-ioband is a working
> > > > solution and feature rich but its dependency on the dm infrastructure is
> > > > likely to find opposition (the dm layer does not handle barriers
> > > > properly and the maximum size of I/O requests can be limited in some
> > > > cases). In such a case, we could either try to build a standalone
> > > > resource controller based on dm-ioband (which would probably hook into
> > > > generic_make_request) or try to come up with something new.
> > >
> > > I doubt about the maximum size of I/O requests problem. You can't avoid
> > > this problem as far as you use device mapper modules with such a bad
> > > manner, even if the controller is implemented as a stand-alone controller.
> > > There is no limitation if you only use dm-ioband without any other device
> > > mapper modules.
> >
> > The following is a part of source code where the limitation comes from.
> >
> > dm-table.c: dm_set_device_limits()
> > /*
> > * Check if merge fn is supported.
> > * If not we'll force DM to use PAGE_SIZE or
> > * smaller I/O, just to be safe.
> > */
> >
> > if (q->merge_bvec_fn && !ti->type->merge)
> > rs->max_sectors =
> > min_not_zero(rs->max_sectors,
> > (unsigned int) (PAGE_SIZE >> 9));
> >
> > As far as I can find, In 2.6.27-rc1-mm1, Only some software RAID
> > drivers and pktcdvd driver define merge_bvec_fn().
>
> Yup, exactly. The implication of this is that we may see a drop in
> performance in some RAID configurations.
The current device-mapper introduces a bvec merge function for device
mapper devices. IMHO, the limitation goes away once we implement this
in dm-ioband. Am I right, Alasdair?
Thanks,
Ryo Tsuruta
Hi,
> > Would you like to split up IO into read and write IO. We know that read can be
> > very latency sensitive when compared to writes. Should we consider them
> > separately in the RFC?
> Oops, I somehow ended up leaving your first question unanswered. Sorry.
>
> I do not think we should consider them separately, as long as there is a
> proper IO tracking infrastructure in place. As you mentioned, reads can
> be very latency sensitive, but the read case could be treated as a
> special case by the IO controller/IO tracking subsystem. There certainly are
> optimization opportunities. For example, in the synchronous I/O patch we
> could mark bios with the iocontext of the current task, because it will
> happen to be the originator of that IO. By effectively caching the ownership
> information in the bio we can avoid all the accesses to struct page,
> page_cgroup, etc, and reads would definitely benefit from that.
FYI, we should also take special care of pages being reclaimed, since
the free memory of the cgroup these pages belong to may be really low.
Dm-ioband is already doing this.
Thanks,
Hirokazu Takahashi.
Hi Yoshikawa-san,
> I wanted to test my own scheduler which uses bio tracking information.
> SO I tried your patch, especially, get_bio_cgroup_iocontext(), to get
> the io_context from bio.
>
> In my test, I made some threads with certain iopriorities run
> concurrently. To schedule these threads based on their iopriorities,
> I made BGROUP directories for each iopriorities.
> e.g. /cgroup/bio/be0 ... /cgroup/bio/be7
> Then, I tried to attach the processes to the appropriate groups.
>
> But the processes stayed in the original group(id=0).
In the current implementation, when a process moves to another cgroup:
- Memory that is already allocated does not move; it remains charged
  to the original cgroup.
- Only memory allocated after the move belongs to the new cgroup.
This behavior follows the memory controller.
Memory does not move between cgroups since that is such a heavy
operation, though it might be worthwhile under certain conditions.
Could you try to move a process between cgroups in the following way?
# echo $$ > /cgroup/bio/be0
# run_your_program
# echo $$ > /cgroup/bio/be1
# run_your_program
...
> I am sorry but I have to leave now and I cannot come here next week.
> --> I will take summer holidays.
Have a nice vacation!
Thanks,
Ryo Tsuruta
Hi, Fernando,
> > - Implement a block layer resource controller. dm-ioband is a working
> > solution and feature rich but its dependency on the dm infrastructure is
> > likely to find opposition (the dm layer does not handle barriers
> > properly and the maximum size of I/O requests can be limited in some
> > cases). In such a case, we could either try to build a standalone
> > resource controller based on dm-ioband (which would probably hook into
> > generic_make_request) or try to come up with something new.
>
> I doubt about the maximum size of I/O requests problem. You can't avoid
> this problem as far as you use device mapper modules with such a bad
> manner, even if the controller is implemented as a stand-alone controller.
> There is no limitation if you only use dm-ioband without any other device
> mapper modules.
Ryo told me this isn't true anymore. The dm infrastructure introduced
a new feature to support I/O requests bigger than page size, which was
just merged into the current Linux tree. So you and I don't need to
worry about this anymore.
Ryo said he was going to make dm-ioband support this new feature and
post the patches soon.
> And I think the device mapper team just started designing barriers support.
> I guess it won't take long. Right, Alasdair?
> We should know it is logically impossible to support barriers on some
> types of device mapper modules such as LVM. You can't avoid the barrier
> problem when you use this kind of multiple devices even if you implement
> the controller in the block layer.
>
> But I think a stand-alone implementation will have a merit that it
> makes it easier to setup the configuration rather than dm-ioband.
> From this point of view, it would be good that you move the algorithm
> of dm-ioband into the block layer.
> On the other hand, we should know it will make it impossible to use
> the dm infrastructure from the controller, though it isn't so rich.
>
> > - If the I/O tracking patches make it into the kernel we could move on
> > and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> > (1), (2), and (3) above for details) merged.
> > - Delegate the task of controlling the rate at which a task can
> > generate dirty pages to the memory controller.
> >
> > This RFC is somewhat vague but my feeling is that we build some
> > consensus on the goals and basic design aspects before delving into
> > implementation details.
> >
> > I would appreciate your comments and feedback.
> >
> > - Fernando
Thanks,
Hirokazu Takahashi.
Hello Fernando
2008/8/7 Fernando Luis Vázquez Cao <[email protected]>:
> Hi Naveen,
>
> On Wed, 2008-08-06 at 12:37 -0700, Naveen Gupta wrote:
>> > 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
>> >
>> > The implementation of an I/O scheduling algorithm is to a certain extent
>> > influenced by what we are trying to achieve in terms of I/O bandwidth
>> > shaping, but, as discussed below, the required accuracy can determine
>> > the layer where the I/O controller has to reside. Off the top of my
>> > head, there are three basic operations we may want perform:
>> > - I/O nice prioritization: ionice-like approach.
>> > - Proportional bandwidth scheduling: each process/group of processes
>> > has a weight that determines the share of bandwidth they receive.
>> > - I/O limiting: set an upper limit to the bandwidth a group of tasks
>> > can use.
>>
>> I/O limiting can be a special case of proportional bandwidth
>> scheduling. A process/process group can use its share of
>> bandwidth, and if there is spare bandwidth it can be allowed to use it. And
>> if we want to absolutely restrict it we add another flag which
>> specifies that the specified proportion is exact and has an upper
>> bound.
>>
>> Let's say the ideal b/w for a device is 100MB/s
>>
>> And process 1 is assigned b/w of 20%. When we say that the proportion
>> is strict, the b/w for process 1 will be 20% of the max b/w (which may
>> be less than 100MB/s) subject to a max of 20MB/s.
> I essentially agree with you. The nice thing about proportional
> bandwidth scheduling is that we get bandwidth guarantees when there is
> contention for the block device, but still get the benefits of
> statistical multiplexing in the non-contended case. With strict IO
> limiting we risk underusing the block devices.
>
>> > If we are pursuing an I/O prioritization model à la CFQ the temptation is
>> > to implement it at the elevator layer or extend any of the existing I/O
>> > schedulers.
>> >
>> > There have been several proposals that extend either the CFQ scheduler
>> > (see (1), (2) below) or the AS scheduler (see (3) below). The problem
>> > with these controllers is that they are scheduler dependent, which means
>> > that they become unusable when we change the scheduler or when we want
>> > to control stacking devices which define their own make_request_fn
>> > function (md and dm come to mind). It could be argued that the physical
>> > devices controlled by a dm or md driver are likely to be fed by
>> > traditional I/O schedulers such as CFQ, but these I/O schedulers would
>> > be running independently from each other, each one controlling its own
>> > device, ignoring the fact that they are part of a stacking device. This lack
>> > of information at the elevator layer makes it pretty difficult to obtain
>> > accurate results when using stacking devices. It seems that unless we
>> > can make the elevator layer aware of the topology of stacking devices
>> > (possibly by extending the elevator API?) elevator-based approaches do
>> > not constitute a generic solution. Here onwards, for discussion
>> > purposes, I will refer to this type of I/O bandwidth controllers as
>> > elevator-based I/O controllers.
>>
>> It can be argued that any scheduling decision wrt to i/o belongs to
>> elevators. Till now they have been used to improve performance. But
>> with new requirements to isolate i/o based on process or cgroup, we
>> need to change the elevators.
> I have the impression there is a tendency to conflate two different
> issues when discussing I/O schedulers and resource controllers, so let
> me elaborate on this point.
>
> On the one hand, we have the problem of feeding physical devices with IO
> requests in such a way that we squeeze the maximum performance out of
> them. Of course in some cases we may want to prioritize responsiveness
> over throughput. In either case the kernel has to perform the same basic
> operations: merging and dispatching IO requests. There is no discussion
> this is the elevator's job and the elevator should take into account the
> physical characteristics of the device.
>
> On the other hand, there is the problem of sharing an IO resource, i.e.
> block device, between multiple tasks or groups of tasks. There are many
> ways of sharing an IO resource depending on what we are trying to
> accomplish: proportional bandwidth scheduling, priority-based
> scheduling, etc. But to implement this sharing algorithms the kernel has
> to determine the task whose IO will be submitted. In a sense, we are
> scheduling tasks (and groups of tasks) not IO requests (which has much
> in common with CPU scheduling). Besides, the sharing problem is not
> directly related to the characteristics of the underlying device, which
> means it does not need to be implemented at the elevator layer.
What if we pass the task-specific information to the elevator? We do
this for CFQ (where we pass the priority). And if we need any
additional information to be passed we could add that in a similar
manner.
I really liked your initial suggestion where step 1 would be to add
I/O tracking patches. And then use this in CFQ and AS to do resource
sharing. And if we see any shortcoming with this approach. Let's see
what the best place is to solve remaining problems.
>
> Traditional elevators limit themselves to schedule IO requests to disk
> with no regard to where it came from. However, new IO schedulers such as
> CFQ combine this with IO prioritization capabilities. This means that
> the elevator decides the application whose IO will be dispatched next.
> The problem is that at this layer there is not enough information to
> make such decisions in an accurate way, because, as mentioned in the
> RFC, the elevator has no way to know the block IO topology. The
> implication of this is that the elevator does not know the impact a
> particular scheduling decision will make in the IO throughput seen by
> applications, which is what users care about.
Is it possible to send the topology information to the elevators. And
then they can make global as well as local decisions.
>
> For all these reasons, I think the elevator should take care of
> optimizing the last stretch of the IO path (generic block layer -> block
> device) for performance/responsiveness, and leave the job of ensuring
> that each task is guaranteed a fair share of the kernel's IO resources
> to the upper layers (for example a block layer resource controller).
>
> I recognize that in some cases global performance could be improved if
> the block layer had access to information from the elevator, and that is
> why I mentioned in the RFC that in some cases it might make sense to
> combine a block layer resource controller and an elevator layer one (we
> would just need to figure out a way for the two to communicate with each
> other and work well in tandem).
>
>> If we add another layer of i/o scheduling (block layer I/O controller)
>> above elevators
>> 1) It builds another layer of i/o scheduling (bandwidth or priority)
> As I mentioned before we are trying to achieve two things: making the
> best use of block devices, and sharing those IO resources between tasks
> or groups of tasks. There are two possible approaches here: implement
> everything in the elevator or move the sharing bits somewhere above the
> elevator layer. In either case we have to carry out the same tasks so
> the impact of delegating part of the work to a new layer should not be
> that big, and, hopefully, will improve maintainability.
>
>> 2) This new layer can have decisions for i/o scheduling which conflict
>> with underlying elevator. e.g. If we decide to do b/w scheduling in
>> this new layer, there is no way a priority based elevator could work
>> underneath it.
> The priority system could be implemented above the elevator layer in the
> block layer resource controller, which means that the elevator would
> only have to worry about scheduling the requests it receives from the
> block layer and dispatching them to disk in the best possible way.
>
> An alternative would be using a block layer resource controller and a
> elavator-based resource controller in tandem.
>
>> If a custom make_request_fn is defined (which means the said device is
>> not using existing elevator),
> Please note that each of the block devices that constitute a stacking
> device could have its own elevator.
Another possible approach, if the top layer cannot pass topology info
to the underlying block device elevators: we could use FIFO for the
underlying block devices, effectively disabling them. The top layer
would make its scheduling decision in a custom __make_request and the
layers below would just forward. And we can easily avoid any conflict.
>
>> it could build it's own scheduling
>> rather than asking kernel to add another layer at the time of i/o
>> submission. Since it has complete control of i/o.
> I think that is something we should avoid. The IO scheduling behavior
> that the user sees should not depend on the topology of the system. We
> certainly do not want to reimplement the same scheduling algorithm for
> every RAID driver. I am of the opinion that whatever IO scheduling
> algorithm we choose should be implemented just once and usable under any
> IO configuration.
>
I agree that we shouldn't be reinventing things for every RAID driver.
We could have a generic algorithm which everyone plugs into. If that
is not possible, we always have the option to create one in a custom
__make_request.
>
-Naveen
A minor sidebar:
2008/8/7 Fernando Luis Vázquez Cao <[email protected]> wrote:
>>On the one hand, we have the problem of feeding physical devices with IO
>>requests in such a way that we squeeze the maximum performance out of
>>them. Of course in some cases we may want to prioritize responsiveness
>>over throughput. In either case ...
Maximum performance for non-batch processes is reached soon after
the point at which latency starts to degrade badly, so be very
careful when you're trying for max throughput on anything
that has a queue (e.g., CPUs, disks, etc).
My usual example is a device that takes 10 milliseconds from
request to end of response, shown in the gif below.
At load = 10, theoretical throughput should be at 100%. In
fact, it's around 82%, because some queuing is already happening.
Sitting in queue cuts performance, so the response time is already
up to 0.030 seconds (30 milliseconds).
Trying for 95% of max throughput requires a load of 14, with
a truly horrid response time of 0.050 seconds (50 milliseconds).
Therefore: we're trying to deliver good response times as a *first*
priority, without throttling the device so much that it can't deliver
the majority of its throughput, not the other way around.
By the way, this is the kind of calculation that leads people
to the rule of thumb of using 80% of max throughput as a good
target.
--dave
--
David Collier-Brown | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
[email protected] | -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
Fernando Luis Vázquez Cao wrote:
>>> This seems to be the easiest part, but the current cgroups
>>> infrastructure has some limitations when it comes to dealing with block
>>> devices: impossibility of creating/removing certain control structures
>>> dynamically and hardcoding of subsystems (i.e. resource controllers).
>>> This makes it difficult to handle block devices that can be hotplugged
>>> and go away at any time (this applies not only to usb storage but also
>>> to some SATA and SCSI devices). To cope with this situation properly we
>>> would need hotplug support in cgroups, but, as suggested before and
>>> discussed in the past (see (0) below), there are some limitations.
>>>
>>> Even in the non-hotplug case it would be nice if we could treat each
>>> block I/O device as an independent resource, which means we could do
>>> things like allocating I/O bandwidth on a per-device basis. As long as
>>> performance is not compromised too much, adding some kind of basic
>>> hotplug support to cgroups is probably worth it.
>>>
>>> (0) http://lkml.org/lkml/2008/5/21/12
>> What about using major,minor numbers to identify each device and account
>> IO statistics? If a device is unplugged we could reset IO statistics
>> and/or remove IO limitations for that device from userspace (e.g. by a
>> daemon), but plugging/unplugging the device would not be blocked/affected
>> in any case. Or am I oversimplifying the problem?
> If a resource we want to control (a block device in this case) is
> hot-plugged/unplugged the corresponding cgroup-related structures inside
> the kernel need to be allocated/freed dynamically, respectively. The
> problem is that this is not always possible. For example, with the
> current implementation of cgroups it is not possible to treat each block
> device as a different cgroup subsystem/resource controller, because
> subsystems are created at compile time.
The whole subsystem is created at compile time, but controller data
structures are allocated dynamically (e.g. see struct mem_cgroup for
the memory controller). So, identifying each device with a name, or a
key like major,minor, instead of a reference/pointer to a struct could
help to handle this in userspace. I mean, if a device is unplugged a
userspace daemon can just handle the event and delete the controller
data structures allocated for this device, asynchronously, via a
userspace->kernel interface, and without holding a reference to that
particular block device in the kernel. Anyway, implementing a generic
interface that would allow defining hooks for hot-pluggable devices
(or similar events) in cgroups would be interesting.
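A minimal userspace-style sketch of that idea: controller records are
kept in a list keyed by (major, minor) rather than by a pointer to the
device, so a hotplug daemon can create and destroy entries as devices
come and go without anything holding a reference to the device itself.
All names here (the ctl_* helpers, the bandwidth field) are
hypothetical, for illustration only:

```c
#include <stdlib.h>

/* Hypothetical per-device controller record, keyed by major:minor
 * instead of a pointer to the block device itself. */
struct dev_ctl {
    unsigned major, minor;
    unsigned long bandwidth_kbps;   /* illustrative per-device limit */
    struct dev_ctl *next;
};

static struct dev_ctl *ctl_list;

/* Called (e.g. by a udev-driven daemon) when a device appears. */
static struct dev_ctl *ctl_add(unsigned major, unsigned minor,
                               unsigned long kbps)
{
    struct dev_ctl *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    c->major = major;
    c->minor = minor;
    c->bandwidth_kbps = kbps;
    c->next = ctl_list;
    ctl_list = c;
    return c;
}

static struct dev_ctl *ctl_find(unsigned major, unsigned minor)
{
    for (struct dev_ctl *c = ctl_list; c; c = c->next)
        if (c->major == major && c->minor == minor)
            return c;
    return NULL;
}

/* Called when the device goes away: the entry is simply dropped,
 * no reference to the vanished device is needed. */
static void ctl_remove(unsigned major, unsigned minor)
{
    struct dev_ctl **pp = &ctl_list;

    while (*pp) {
        if ((*pp)->major == major && (*pp)->minor == minor) {
            struct dev_ctl *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
        pp = &(*pp)->next;
    }
}
```

With this shape, a "device unplugged" event handled in userspace maps
to a single ctl_remove(major, minor); nothing blocks the unplug itself.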
>>> 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
>>>
>>> The implementation of an I/O scheduling algorithm is to a certain extent
>>> influenced by what we are trying to achieve in terms of I/O bandwidth
>>> shaping, but, as discussed below, the required accuracy can determine
>>> the layer where the I/O controller has to reside. Off the top of my
>>> head, there are three basic operations we may want to perform:
>>> - I/O nice prioritization: ionice-like approach.
>>> - Proportional bandwidth scheduling: each process/group of processes
>>> has a weight that determines the share of bandwidth they receive.
>>> - I/O limiting: set an upper limit to the bandwidth a group of tasks
>>> can use.
>> Using deadline-based IO scheduling could be an interesting path to be
>> explored as well, IMHO, to try to guarantee per-cgroup minimum bandwidth
>> requirements.
> Please note that the only thing we can do is to guarantee minimum
> bandwidth requirement when there is contention for an IO resource, which
> is precisely what a proportional bandwidth scheduler does. Am I missing
> something?
Correct. Proportional bandwidth scheduling automatically guarantees
minimum requirements (unlike the IO limiting approach, which needs
additional mechanisms to achieve this).
In any case there's no guarantee that a cgroup/application can sustain,
e.g., 10MB/s on a certain device, but this is a hard problem anyway, and
the best we can do is to try to satisfy "soft" constraints.
-Andrea
On Fri, 2008-08-08 at 20:39 +0900, Hirokazu Takahashi wrote:
> Hi,
>
> > > Would you like to split up IO into read and write IO? We know that reads
> > > can be very latency sensitive when compared to writes. Should we consider
> > > them separately in the RFC?
> > Oops, I somehow ended up leaving your first question unanswered. Sorry.
> >
> > I do not think we should consider them separately, as long as there is a
> > proper IO tracking infrastructure in place. As you mentioned, reads can
> > be very latency sensitive, but the read case could be treated as a
> > special case by the IO controller/IO tracking subsystem. There certainly
> > are optimization opportunities. For example, in the synchronous I/O path
> > we could mark bios with the iocontext of the current task, because it
> > will happen to be the originator of that IO. By effectively caching the
> > ownership information in the bio we can avoid all the accesses to struct
> > page, page_cgroup, etc, and reads would definitely benefit from that.
>
> FYI, we should also take special care of pages being reclaimed, since
> the free memory of the cgroup these pages belong to may be really low.
> Dm-ioband is doing this.
Thank you for the heads-up.
- Fernando