2009-04-28 08:44:18

by Andrea Righi

Subject: [PATCH v15 0/7] cgroup: io-throttle controller

Objective
~~~~~~~~~
The objective of the io-throttle controller is to improve IO performance
predictability of different cgroups that share the same block devices.

State of the art (quick overview)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recent work by Vivek proposes a weighted BW solution, introducing fair
queuing support in the elevator layer and modifying the existing IO
schedulers to use that functionality
(https://lists.linux-foundation.org/pipermail/containers/2009-March/016129.html).

For the fair queuing part Vivek's IO controller makes use of the BFQ
code as posted by Paolo and Fabio (http://lkml.org/lkml/2008/11/11/148).

The dm-ioband controller by the valinux guys proposes a proportional
ticket-based solution fully implemented at the device mapper level
(http://people.valinux.co.jp/~ryov/dm-ioband/).

The bio-cgroup patch (http://people.valinux.co.jp/~ryov/bio-cgroup/) is
a BIO tracking mechanism for cgroups, implemented in the cgroup memory
subsystem. It is maintained by Ryo and it allows dm-ioband to track
writeback requests issued by kernel threads (pdflush).

Another work by Satoshi implements cgroup awareness in CFQ, mapping
per-cgroup priority to CFQ IO priorities; this too provides only
proportional BW support (http://lwn.net/Articles/306772/).

Please correct me or integrate if I missed someone or something. :)

Proposed solution
~~~~~~~~~~~~~~~~~
Compared to other priority/weight-based solutions, the approach used by
this controller is to explicitly choke applications' requests that
directly or indirectly generate IO activity in the system (this
controller addresses both synchronous IO and writeback/buffered IO).

Limiting bandwidth and iops has the advantage of improving performance
predictability, at the cost of reducing, in general, the overall
throughput of the system.

IO throttling and accounting are performed during the submission of IO
requests, independently of the particular IO scheduler.
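
Conceptually, the submission-time hook reduces to the following sketch
(hypothetical helper name; the real entry points are introduced in
[PATCH 4/7]):

	/*
	 * account the request and, if the cgroup exceeded its limits, put
	 * the current task to sleep for the returned amount of jiffies
	 */
	sleep = iothrottle_evaluate(bio->bi_bdev, bio->bi_size);
	if (sleep)
		schedule_timeout_killable(sleep);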

Detailed information about design, goals and usage is provided in the
documentation (see [PATCH 1/7]).

Implementation
~~~~~~~~~~~~~~
Patchset against latest Linus' git:

[PATCH v15 0/7] cgroup: block device IO controller
[PATCH v15 1/7] io-throttle documentation
[PATCH v15 2/7] res_counter: introduce ratelimiting attributes
[PATCH v15 3/7] page_cgroup: provide a generic page tracking infrastructure
[PATCH v15 4/7] io-throttle controller infrastructure
[PATCH v15 5/7] kiothrottled: throttle buffered (writeback) IO
[PATCH v15 6/7] io-throttle instrumentation
[PATCH v15 7/7] io-throttle: export per-task statistics to userspace

The v15 all-in-one patch, along with the previous versions, can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/

Changelog (v14 -> v15)
~~~~~~~~~~~~~~~~~~~~~~
* performance optimization for direct IO (O_DIRECT): in submit_bio(), instead
  of checking whether the bio was generated by the current task via the slow
  get_iothrottle_from_bio(), use the faster is_in_dio(), which simply checks
  the value of task_struct->in_dio, set before submitting O_DIRECT requests
  and cleared once they have been submitted
* block tasks that have exceeded the cgroup limits also in
  balance_dirty_pages_ratelimited_nr(): when the submission of IO requests is
  blocked by io-throttle, we also want to throttle the dirty-page rate, to
  reduce the generation of hard-to-reclaim dirty pages and prevent potential
  OOM conditions
* explicitly check that cgroup_lock() is held when updating the iothrottle
  block device list (suggested by: Paul E. McKenney <[email protected]>)
* fixed a build bug in page_cgroup.c when CONFIG_SPARSEMEM was not set
(reported by: Gui Jianfeng <[email protected]>)
* small styling fixes in res_counter

Overall diffstat
~~~~~~~~~~~~~~~~
Documentation/cgroups/io-throttle.txt | 417 ++++++++++++++++
block/Makefile | 1 +
block/blk-core.c | 8 +
block/blk-io-throttle.c | 851 +++++++++++++++++++++++++++++++++
block/kiothrottled.c | 341 +++++++++++++
fs/aio.c | 12 +
fs/buffer.c | 2 +
fs/direct-io.c | 3 +
fs/proc/base.c | 18 +
include/linux/blk-io-throttle.h | 168 +++++++
include/linux/cgroup.h | 1 +
include/linux/cgroup_subsys.h | 6 +
include/linux/memcontrol.h | 6 +
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 33 ++-
include/linux/res_counter.h | 69 ++-
include/linux/sched.h | 8 +
init/Kconfig | 16 +
kernel/cgroup.c | 9 +
kernel/fork.c | 8 +
kernel/res_counter.c | 73 +++
mm/Makefile | 3 +-
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 6 +
mm/page-writeback.c | 13 +
mm/page_cgroup.c | 96 ++++-
mm/readahead.c | 3 +
28 files changed, 2145 insertions(+), 34 deletions(-)

-Andrea


2009-04-28 08:44:35

by Andrea Righi

Subject: [PATCH v15 1/7] io-throttle documentation

Documentation of the block device I/O controller: description, usage,
advantages and design.

Signed-off-by: Andrea Righi <[email protected]>
---
Documentation/cgroups/io-throttle.txt | 417 +++++++++++++++++++++++++++++++++
1 files changed, 417 insertions(+), 0 deletions(-)
create mode 100644 Documentation/cgroups/io-throttle.txt

diff --git a/Documentation/cgroups/io-throttle.txt b/Documentation/cgroups/io-throttle.txt
new file mode 100644
index 0000000..789116c
--- /dev/null
+++ b/Documentation/cgroups/io-throttle.txt
@@ -0,0 +1,417 @@
+
+ Block device I/O bandwidth controller
+
+----------------------------------------------------------------------
+1. DESCRIPTION
+
+This controller allows limiting the I/O bandwidth of specific block devices
+for specific process containers (cgroups [1]) by imposing additional delays on
+the I/O requests of those processes that exceed the limits defined in the
+control group filesystem.
+
+Bandwidth limiting rules offer better control over QoS than priority or
+weight-based solutions, which only express applications' relative performance
+requirements. Moreover, priority-based solutions are affected by performance
+bursts when only low-priority requests are submitted to a general purpose
+resource dispatcher.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability from the applications' point of view and provide performance
+isolation of different control groups sharing the same block devices.
+
+NOTE #1: If you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+NOTE #2: The current implementation does not guarantee minimum bandwidth
+levels; QoS is implemented only by slowing down I/O "traffic" that exceeds the
+limits specified by the user. Minimum I/O rate thresholds are supposed to be
+guaranteed if the user configures a proper I/O bandwidth partitioning of the
+block devices shared among the different cgroups (theoretically, if the sum of
+all the single limits defined for a block device doesn't exceed the total I/O
+bandwidth of that device).
+
+----------------------------------------------------------------------
+2. USER INTERFACE
+
+2.1. Configure I/O limiting rules
+
+A new I/O limitation rule is described using the files:
+- blockio.bandwidth-max
+- blockio.iops-max
+
+The I/O bandwidth file (blockio.bandwidth-max) can be used to limit the
+throughput of a certain cgroup, while blockio.iops-max can be used to throttle
+cgroups containing applications doing a sparse/seeky I/O workload. Any
+combination of them can be used to define more complex I/O limiting rules,
+expressed in terms of both iops and bandwidth.
+
+The same files can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+The following syntax can be used to configure any limiting rule:
+
+# /bin/echo DEV:LIMIT:STRATEGY:BUCKET_SIZE > CGROUP/FILE
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- LIMIT is the maximum I/O activity allowed on DEV by CGROUP; LIMIT can
+ represent a bandwidth limitation (expressed in bytes/s) when writing to
+ blockio.bandwidth-max, or a limit on the maximum I/O operations per
+ second (expressed in iops) issued by CGROUP.
+
+ A generic I/O limiting rule for a block device DEV can be removed by setting
+ LIMIT to 0.
+
+- STRATEGY is the throttling strategy used to throttle the applications' I/O
+ requests from/to device DEV. At the moment two different strategies can be
+ used [2][3]:
+
+ 0 = leaky bucket: the controller accepts at most B bytes (B = LIMIT * time)
+ or O operations (O = LIMIT * time); further I/O requests
+ are delayed by scheduling a timeout for the tasks that made
+ those requests.
+
+ Different I/O flow
+ | | |
+ | v |
+ | v
+ v
+ .......
+ \ /
+ \ / leaky-bucket
+ ---
+ |||
+ vvv
+ Smoothed I/O flow
+
+ 1 = token bucket: LIMIT tokens are added to the bucket every second; the
+ bucket can hold at the most BUCKET_SIZE tokens; I/O
+ requests are accepted if there are available tokens in the
+ bucket; when a request of N bytes arrives N tokens are
+ removed from the bucket; if fewer than N tokens are
+ available the request is delayed until a sufficient amount
+ of tokens is available in the bucket.
+
+ Tokens (I/O rate)
+ o
+ o
+ o
+ ....... <--.
+ \ / | Bucket size (burst limit)
+ \ooo/ |
+ --- <--'
+ |ooo
+ Incoming --->|---> Conforming
+ I/O |oo I/O
+ requests -->|--> requests
+ |
+ ---->|
+
+ Leaky bucket is more precise than token bucket at respecting the limits,
+ because bursty workloads are always smoothed. Token bucket, instead, allows
+ a small degree of irregularity in the I/O flows (burst limit) and, for this
+ reason, it is better in terms of efficiency (bursty workloads are not
+ smoothed when there are sufficient tokens in the bucket); see the sketch
+ after this list.
+
+- BUCKET_SIZE is used only with token bucket (STRATEGY == 1) and defines the
+ size of the bucket in bytes (blockio.bandwidth-max) or in I/O operations
+ (blockio.iops-max).
+
+- CGROUP is the name of the limited process container.
+
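+As an illustration only, the token bucket accounting described above can be
+sketched in a few lines of C (the in-kernel implementation lives in
+kernel/res_counter.c and works in jiffies):
+
+    struct bucket { long long tokens, capacity, limit, timestamp; };
+
+    /* limit is in bytes/s, now in seconds; returns seconds to sleep */
+    long long token_bucket(struct bucket *b, long long bytes, long long now)
+    {
+            b->tokens += (now - b->timestamp) * b->limit;   /* refill */
+            if (b->tokens > b->capacity)
+                    b->tokens = b->capacity;                /* burst limit */
+            b->timestamp = now;
+            b->tokens -= bytes;                             /* consume */
+            return b->tokens < 0 ? -b->tokens / b->limit : 0;
+    }
+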
+The following shorthand syntaxes are also allowed:
+
+- remove an I/O bandwidth limiting rule
+# /bin/echo DEV:0 > CGROUP/blockio.bandwidth-max
+
+- configure a limiting rule using leaky bucket throttling (ignore bucket size):
+# /bin/echo DEV:LIMIT:0 > CGROUP/blockio.bandwidth-max
+
+- configure a limiting rule using token bucket throttling
+ (with bucket size == LIMIT):
+# /bin/echo DEV:LIMIT:1 > CGROUP/blockio.bandwidth-max
+
+2.2. Show I/O limiting rules
+
+All the defined rules and statistics for a specific cgroup can be shown by
+reading the files blockio.bandwidth-max, for bandwidth constraints, and
+blockio.iops-max, for I/O operations per second constraints.
+
+The following syntax is used:
+
+$ cat CGROUP/blockio.bandwidth-max
+MAJOR MINOR LIMIT STRATEGY LEAKY_STAT BUCKET_SIZE BUCKET_FILL TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- LIMIT, STRATEGY and BUCKET_SIZE are the same parameters defined above
+
+- LEAKY_STAT is the amount of bytes (blockio.bandwidth-max) or I/O operations
+ (blockio.iops-max) currently allowed by the I/O controller (only used with
+ leaky bucket strategy - STRATEGY == 0)
+
+- BUCKET_FILL represents the amount of tokens present in the bucket (only used
+ with token bucket strategy - STRATEGY == 1)
+
+- TIME_DELTA can be one of the following:
+ - the amount of jiffies elapsed from the last I/O request (token bucket)
+ - the amount of jiffies during which the bytes or the number of I/O
+ operations given by LEAKY_STAT have been accumulated (leaky bucket)
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 .. n):
+
+$ cat CGROUP/blockio.bandwidth-max
+MAJOR1 MINOR1 BW1 STRATEGY1 LEAKY_STAT1 BUCKET_SIZE1 BUCKET_FILL1 TIME_DELTA1
+MAJOR2 MINOR2 BW2 STRATEGY2 LEAKY_STAT2 BUCKET_SIZE2 BUCKET_FILL2 TIME_DELTA2
+...
+MAJORn MINORn BWn STRATEGYn LEAKY_STATn BUCKET_SIZEn BUCKET_FILLn TIME_DELTAn
+
+The same fields are used to describe I/O operations/sec rules. The only
+difference is that the cost of each I/O operation is scaled up by a factor of
+1000. This allows applying finer-grained sleeps and provides a more precise
+throttling.
+
+$ cat CGROUP/blockio.iops-max
+MAJOR MINOR LIMITx1000 STRATEGY LEAKY_STATx1000 BUCKET_SIZEx1000 BUCKET_FILLx1000 TIME_DELTA
+...
+
+2.3. Additional I/O statistics
+
+Additional cgroup I/O throttling statistics are reported in
+blockio.throttlecnt:
+
+$ cat CGROUP/blockio.throttlecnt
+MAJOR MINOR BW_COUNTER BW_SLEEP IOPS_COUNTER IOPS_SLEEP
+
+ - MAJOR, MINOR are respectively the major and the minor number of the device
+ the following statistics refer to
+ - BW_COUNTER gives the number of times that the cgroup bandwidth limit of
+ this particular device was exceeded
+ - BW_SLEEP is the amount of sleep time measured in clock ticks (divide
+ by sysconf(_SC_CLK_TCK)) imposed on the processes of this cgroup that
+ exceeded the bandwidth limit for this particular device
+ - IOPS_COUNTER gives the number of times that the cgroup I/O operation per
+ second limit of this particular device was exceeded
+ - IOPS_SLEEP is the amount of sleep time measured in clock ticks (divide
+ by sysconf(_SC_CLK_TCK)) imposed on the processes of this cgroup that
+ exceeded the I/O operations per second limit for this particular device
+
+Example:
+$ cat CGROUP/blockio.throttlecnt
+8 0 0 0 0 0
+^ ^ ^ ^ ^ ^
+ \ \ \ \ \ \___iops sleep (in clock ticks)
+ \ \ \ \ \____iops throttle counter
+ \ \ \ \_____bandwidth sleep (in clock ticks)
+ \ \ \______bandwidth throttle counter
+ \ \_______minor dev. number
+ \________major dev. number
+
+Distinct statistics for each process are reported in
+/proc/PID/io-throttle-stat:
+
+$ cat /proc/PID/io-throttle-stat
+BW_COUNTER BW_SLEEP IOPS_COUNTER IOPS_SLEEP
+
+Example:
+$ cat /proc/$$/io-throttle-stat
+0 0 0 0
+^ ^ ^ ^
+ \ \ \ \_____global iops sleep (in clock ticks)
+ \ \ \______global iops counter
+ \ \_______global bandwidth sleep (clock ticks)
+ \________global bandwidth counter
+
+2.4. Generic usage examples
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda for the cgroup "foo", using
+ leaky bucket throttling strategy:
+ # /bin/echo /dev/sda:$((1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo", using
+ token bucket throttling strategy, bucket size = 8MiB:
+ # /bin/echo /dev/sdb:$((8 * 1024 * 1024)):1:$((8 * 1024 * 1024)) > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda (controlled by leaky bucket throttling)
+ and 8MiB/s on /dev/sdb (controlled by token bucket throttling)
+
+* Run a benchmark doing I/O on /dev/sda and /dev/sdb; I/O limits and usage
+ defined for cgroup "foo" can be shown as follows:
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 16 8388608 1 0 8388608 -522560 48
+ 8 0 1048576 0 737280 0 0 216
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 16MiB/s on /dev/sda:
+ # /bin/echo /dev/sda:$((16 * 1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 16 8388608 1 0 8388608 -84432 206436
+ 8 0 16777216 0 0 0 0 15212
+
+* Remove limiting rule on /dev/sdb for cgroup "foo":
+ # /bin/echo /dev/sdb:0:0:0 > /mnt/cgroup/foo/blockio.bandwidth-max
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 0 16777216 0 0 0 0 110388
+
+* Set a maximum of 100 I/O operations/sec (leaky bucket strategy) to /dev/sdc
+ for cgroup "foo":
+ # /bin/echo /dev/sdc:100:0 > /mnt/cgroup/foo/blockio.iops-max
+ # cat /mnt/cgroup/foo/blockio.iops-max
+ 8 32 100000 0 846000 0 2113
+ ^ ^
+ /________/
+ /
+ Remember: these values are scaled up by a factor of 1000 to apply
+ fine-grained throttling (i.e. LIMIT == 100000 means a maximum of 100 I/O
+ operations per second)
+
+* Remove limiting rule for I/O operations from /dev/sdc for cgroup "foo":
+ # /bin/echo /dev/sdc:0 > /mnt/cgroup/foo/blockio.iops-max
+
+----------------------------------------------------------------------
+3. ADVANTAGES OF PROVIDING THIS FEATURE
+
+* Allow I/O traffic shaping for block devices shared among different cgroups
+* Improve I/O performance predictability on block devices shared between
+ different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+ deadline, CFQ, noop) or on the type of the underlying block devices
+* The bandwidth limitations are guaranteed both for synchronous and
+ asynchronous operations, including I/O passing through the page cache or
+ buffers, not only direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+ adjust the I/O workload of different process containers at run-time,
+ according to the particular users' requirements and applications' performance
+ constraints
+
+----------------------------------------------------------------------
+4. DESIGN
+
+I/O throttling is performed by imposing an explicit timeout on the processes
+that exceed the I/O limits dedicated to the cgroup they belong to. I/O
+accounting happens per cgroup.
+
+Only the actual I/O that flows to the block devices is considered. Multiple
+re-reads of pages already present in the page cache, as well as re-writes of
+dirty pages, are not accounted or throttled, since they don't actually
+generate any real I/O operation.
+
+This means that a process that re-reads or re-writes the same blocks of a
+file multiple times is affected by the I/O limitations only for the actual
+I/O performed from/to the underlying block devices.
+
+4.1. Synchronous I/O tracking and throttling
+
+The io-throttle controller works as expected for synchronous (read and
+write) operations: the real I/O activity is reduced synchronously, according
+to the defined limitations.
+
+If the operation is synchronous we automatically know that the context of the
+request is the current task, so we can charge the cgroup the current task
+belongs to, and throttle the current task as well if it has exceeded the
+cgroup limitations.
+
+4.2. Buffered I/O (write-back) tracking
+
+For buffered writes the scenario is a bit more complex, because writes in
+the page cache are processed asynchronously by kernel threads (pdflush) using
+a write-back policy, so the real writes to the underlying block devices occur
+in a different I/O context than the task that originally generated the dirty
+pages.
+
+The I/O bandwidth controller uses the following solution to resolve this
+problem.
+
+If the operation is a buffered write, we can charge the right cgroup by
+looking at the owner of the first page involved in the I/O operation, which
+gives the context that generated the I/O activity at the source. This
+information can be retrieved using the page_cgroup functionality originally
+provided by the cgroup memory controller [4], and now provided by a modified
+version of the bio-cgroup controller [5], embedding the page tracking feature
+directly into the io-throttle controller.
+
+The page_cgroup structure is used to encode the owner of each struct page:
+this information is encoded in page_cgroup->flags. An owner is characterized
+by a numeric ID: the io-throttle css_id(). The owner of a page is set when a
+page is dirtied or added to the page cache. At the moment, I/O generated by
+anonymous pages (swap) is not considered by the io-throttle controller.
+
+In this way we can correctly account the I/O cost to the right cgroup, but we
+cannot throttle the current task at this stage because, in general, it is a
+different task (e.g., pdflush, which is processing the dirty pages
+asynchronously).
+
+For this reason, all the write-back requests that are not directly submitted by
+the real owner and that need to be throttled are not dispatched immediately in
+submit_bio(). Instead, they are added into an rbtree and processed
+asynchronously by a dedicated kernel thread: kiothrottled.
+
+A deadline is associated with each throttled write-back request, depending on
+the bandwidth usage of the cgroup it belongs to. When a request is inserted
+into the rbtree, kiothrottled is awakened. This thread periodically selects
+all the requests with an expired deadline and submits the selected requests
+to the underlying block devices using generic_make_request().
+
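+A simplified sketch of the attribution step (hypothetical helper; the real
+controller code is more involved):
+
+    /* charge a writeback bio to the cgroup that dirtied its first page */
+    static unsigned long writeback_bio_owner(struct bio *bio)
+    {
+            struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+
+            /* the owner ID was stored at page-dirtying time; 0 = unknown */
+            return page_cgroup_get_owner(page);
+    }
+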
+4.3. Per-block device IO limiting rules
+
+Multiple rules for different block devices are stored in a linked list, using
+the dev_t number of each block device as the key to uniquely identify each
+element of the list. RCU synchronization is used to protect the whole list
+structure, since the elements in the list are not supposed to change
+frequently (they change only when a new rule is defined or an old rule is
+removed or updated), while reads of the list occur at each operation that
+generates I/O. This provides zero overhead for cgroups that do not use any
+limitation.
+
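+The read side reduces to an RCU-protected list walk, as in the following
+sketch (simplified from the controller code):
+
+    rcu_read_lock();
+    list_for_each_entry_rcu(n, &iot->list, node)
+            if (n->dev == dev) {
+                    /* found the per-device rule: evaluate the sleep time */
+                    sleep = res_counter_ratelimit_sleep(&n->bw, bytes);
+                    break;
+            }
+    rcu_read_unlock();
+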
+WARNING: per-block device limiting rules always refer to the dev_t device
+number. If a block device is unplugged (e.g. a USB device), the limiting
+rules defined for that device persist and are still valid if a new device is
+plugged into the system and uses the same major and minor numbers.
+
+4.4. Asynchronous I/O (AIO) handling
+
+Explicit sleeps are *not* imposed on tasks doing asynchronous I/O (AIO)
+operations; AIO throttling is performed by returning -EAGAIN from
+sys_io_submit(). Userspace applications must be able to handle this error
+code appropriately.
+
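+A user-space sketch of how an AIO submitter might handle this (using libaio;
+the backoff policy is only an example):
+
+    #include <libaio.h>
+    #include <unistd.h>
+
+    int submit_with_backoff(io_context_t ctx, struct iocb *cb)
+    {
+            struct iocb *ios[1] = { cb };
+            int ret;
+
+            /* io_submit() returns -EAGAIN while the cgroup is throttled */
+            while ((ret = io_submit(ctx, 1, ios)) == -EAGAIN)
+                    usleep(1000);    /* back off, then retry */
+            return ret;
+    }
+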
+----------------------------------------------------------------------
+5. TODO
+
+* Support proportional I/O bandwidth for optimal bandwidth usage. For
+ example, use the kiothrottled rbtree: all the requests queued to the I/O
+ subsystem first go into the rbtree; then, based on a per-cgroup I/O
+ priority and feedback from the I/O schedulers, dispatch the requests to the
+ elevator. This would provide both bandwidth limiting and proportional
+ bandwidth functionality using a generic approach.
+
+* Implement a fair throttling policy: distribute the time to sleep equally
+ among all the tasks of a cgroup that exceeded the I/O limits, e.g. depending
+ on the amount of I/O activity previously generated by each task (see
+ task_io_accounting).
+
+----------------------------------------------------------------------
+6. REFERENCES
+
+[1] Documentation/cgroups/cgroups.txt
+[2] http://en.wikipedia.org/wiki/Leaky_bucket
+[3] http://en.wikipedia.org/wiki/Token_bucket
+[4] Documentation/controllers/memory.txt
+[5] http://people.valinux.co.jp/~ryov/bio-cgroup
--
1.6.0.4

2009-04-28 08:45:27

by Andrea Righi

Subject: [PATCH v15 3/7] page_cgroup: provide a generic page tracking infrastructure

Dirty pages in the page cache can be processed asynchronously by kernel
threads (pdflush) using a writeback policy. For this reason the real
writes to the underlying block devices occur in a different IO context
than the task that originally generated the dirty pages involved in the
IO operation. This makes tracking and throttling writeback IO more
complicated than synchronous IO.

The page_cgroup infrastructure, currently available only for the memory
cgroup controller, can be used to store the owner of each page and
properly track writeback IO. This information is encoded in
page_cgroup->flags.

An owner can be identified using a generic ID number, and the following
interfaces are provided to store and retrieve this information:

unsigned long page_cgroup_get_owner(struct page *page);
int page_cgroup_set_owner(struct page *page, unsigned long id);
int page_cgroup_copy_owner(struct page *npage, struct page *opage);

The io-throttle controller uses the cgroup css_id() as the owner's ID
number.
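
A minimal sketch of how a page owner can be tagged at dirtying time
(illustration only; the actual hooks are added in the following patches
of this series):

	unsigned long id;

	/* store the io-throttle css_id of the current task into the page */
	rcu_read_lock();
	id = css_id(task_subsys_state(current, iothrottle_subsys_id));
	rcu_read_unlock();
	page_cgroup_set_owner(page, id);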

A big part of this code is taken from Ryo and Hirokazu's bio-cgroup
controller (http://people.valinux.co.jp/~ryov/bio-cgroup/).

TODO: try to remove the lock/unlock_page_cgroup() in
page_cgroup_*_owner().

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>
---
include/linux/memcontrol.h | 6 +++
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 33 +++++++++++++-
init/Kconfig | 4 ++
mm/Makefile | 3 +-
mm/memcontrol.c | 6 +++
mm/page_cgroup.c | 96 ++++++++++++++++++++++++++++++++++++++-----
7 files changed, 135 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a9e3b76..e80e335 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
* (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
*/

+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
@@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..b178eb9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -607,7 +607,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_PAGE_TRACKING
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -958,7 +958,7 @@ struct mem_section {

/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_PAGE_TRACKING
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..f24d081 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H

-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_PAGE_TRACKING
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -12,11 +12,38 @@
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct mem_cgroup *mem_cgroup;
struct list_head lru; /* per cgroup LRU list */
+#endif
};

+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PAGE_TRACKING_ID_SHIFT (16)
+#define PAGE_TRACKING_ID_BITS \
+ (8 * sizeof(unsigned long) - PAGE_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with lock_page_cgroup() held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+ return pc->flags >> PAGE_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with lock_page_cgroup() held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+ WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
+ pc->flags &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
+ pc->flags |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
+}
+
+unsigned long page_cgroup_get_owner(struct page *page);
+int page_cgroup_set_owner(struct page *page, unsigned long id);
+int page_cgroup_copy_owner(struct page *npage, struct page *opage);
+
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
void __init page_cgroup_init(void);
struct page_cgroup *lookup_page_cgroup(struct page *page);
@@ -71,7 +98,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
bit_spin_unlock(PCG_LOCK, &pc->flags);
}

-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_PAGE_TRACKING */
struct page_cgroup;

static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..5428ac7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -569,6 +569,7 @@ config CGROUP_MEM_RES_CTLR
bool "Memory Resource Controller for Control Groups"
depends on CGROUPS && RESOURCE_COUNTERS
select MM_OWNER
+ select PAGE_TRACKING
help
Provides a memory resource controller that manages both anonymous
memory and page cache. (See Documentation/cgroups/memory.txt)
@@ -611,6 +612,9 @@ endif # CGROUPS
config MM_OWNER
bool

+config PAGE_TRACKING
+ bool
+
config SYSFS_DEPRECATED
bool

diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..b94e074 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,4 +37,5 @@ else
obj-$(CONFIG_SMP) += allocpercpu.o
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_PAGE_TRACKING) += page_cgroup.o
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..69d1c31 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2524,6 +2524,12 @@ struct cgroup_subsys mem_cgroup_subsys = {
.use_id = 1,
};

+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+ INIT_LIST_HEAD(&pc->lru);
+}
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP

static int __init disable_swap_account(char *s)
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 791905c..ec9ed05 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -3,6 +3,7 @@
#include <linux/bootmem.h>
#include <linux/bit_spinlock.h>
#include <linux/page_cgroup.h>
+#include <linux/blk-io-throttle.h>
#include <linux/hash.h>
#include <linux/slab.h>
#include <linux/memory.h>
@@ -14,9 +15,8 @@ static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
- INIT_LIST_HEAD(&pc->lru);
+ __init_mem_page_cgroup(pc);
}
static unsigned long total_usage;

@@ -74,7 +74,7 @@ void __init page_cgroup_init(void)

int nid, fail;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && iothrottle_disabled())
return;

for_each_online_node(nid) {
@@ -83,12 +83,13 @@ void __init page_cgroup_init(void)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you"
- " don't want\n");
+ printk(KERN_INFO
+ "try cgroup_disable=memory,blockio option if you don't want\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
- printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+ printk(KERN_CRIT
+ "try cgroup_disable=memory,blockio boot option\n");
panic("Out of memory");
}

@@ -248,7 +249,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int fail = 0;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && iothrottle_disabled())
return;

for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -257,14 +258,15 @@ void __init page_cgroup_init(void)
fail = init_section_page_cgroup(pfn);
}
if (fail) {
- printk(KERN_CRIT "try cgroup_disable=memory boot option\n");
+ printk(KERN_CRIT
+ "try cgroup_disable=memory,blockio boot option\n");
panic("Out of memory");
} else {
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
- " want\n");
+ printk(KERN_INFO
+ "try cgroup_disable=memory,blockio option if you don't want\n");
}

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -272,8 +274,80 @@ void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
return;
}

-#endif
+#endif /* !defined(CONFIG_SPARSEMEM) */
+
+/**
+ * page_cgroup_get_owner() - get the owner ID of a page
+ * @page: the page we want to find the owner of
+ *
+ * Returns the owner ID of the page, 0 means that the owner cannot be
+ * retrieved.
+ **/
+unsigned long page_cgroup_get_owner(struct page *page)
+{
+ struct page_cgroup *pc;
+ unsigned long ret;
+
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return 0;
+
+ lock_page_cgroup(pc);
+ ret = page_cgroup_get_id(pc);
+ unlock_page_cgroup(pc);
+ return ret;
+}
+
+/**
+ * page_cgroup_set_owner() - set the owner ID of a page
+ * @page: the page we want to tag
+ * @id: the ID number that will be associated with the page
+ *
+ * Returns 0 if the owner is correctly associated with the page. Returns a
+ * negative value in case of failure.
+ **/
+int page_cgroup_set_owner(struct page *page, unsigned long id)
+{
+ struct page_cgroup *pc;
+
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return -ENOENT;
+
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, id);
+ unlock_page_cgroup(pc);
+ return 0;
+}

+/**
+ * page_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage: the page where we want to copy the owner
+ * @opage: the page from which we want to copy the ID
+ *
+ * Returns 0 if the owner is correctly associated with npage. Returns a negative
+ * value in case of failure.
+ **/
+int page_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+ unsigned long id;
+
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return -ENOENT;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return -ENOENT;
+ lock_page_cgroup(opc);
+ lock_page_cgroup(npc);
+ id = page_cgroup_get_id(opc);
+ page_cgroup_set_id(npc, id);
+ unlock_page_cgroup(npc);
+ unlock_page_cgroup(opc);
+
+ return 0;
+}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP

--
1.6.0.4

2009-04-28 08:44:53

by Andrea Righi

Subject: [PATCH v15 2/7] res_counter: introduce ratelimiting attributes

Introduce attributes and functions in res_counter to implement
throttling-based cgroup subsystems.

The following attributes have been added to struct res_counter:
* @policy: the limiting policy / algorithm
* @capacity: the maximum capacity of the resource (the unit of
measurement depends on the particular resource)
* @timestamp: timestamp of the last accounted resource request

Currently the available policies are token bucket and leaky bucket; the
@capacity attribute is used only by the token-bucket policy (to
represent the bucket size).

The following function has been implemented to return the amount of time
a cgroup should be throttled to remain within the defined resource
limits.

unsigned long long
res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);

[ Note: only the interfaces needed by the cgroup IO controller are
implemented right now ]
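
A sketch of the intended usage pattern from a throttling subsystem
(hypothetical caller; error handling omitted):

	unsigned long long sleep;

	/* account val and get the time to sleep to stay within the limits */
	sleep = res_counter_ratelimit_sleep(&counter, val);
	if (sleep)
		schedule_timeout_killable(sleep);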

TODO (reduce the size of struct res_counter):
- replace policy with a more generic unsigned int flags and encode the
policy using a single bit of flags
- use int for failcnt (unsigned long long is probably too much)
- union max_usage and capacity: max_usage is not used in ratelimited
resources and capacity is not used in all the other cases (it has
been introduced only for ratelimited resources)

Signed-off-by: Andrea Righi <[email protected]>
---
include/linux/res_counter.h | 69 ++++++++++++++++++++++++++++++----------
kernel/res_counter.c | 73 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 125 insertions(+), 17 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 4c5bcf6..9bed6af 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -14,30 +14,36 @@
*/

#include <linux/cgroup.h>
+#include <linux/jiffies.h>

-/*
- * The core object. the cgroup that wishes to account for some
- * resource may include this counter into its structures and use
- * the helpers described beyond
- */
+/* The various policies that can be used for ratelimiting resources */
+#define RATELIMIT_LEAKY_BUCKET 0
+#define RATELIMIT_TOKEN_BUCKET 1

+/**
+ * struct res_counter - the core object to account cgroup resources
+ *
+ * @usage: the current resource consumption level
+ * @max_usage: the maximal value of the usage from the counter creation
+ * @limit: the limit that usage cannot exceed
+ * @failcnt: the number of unsuccessful attempts to consume the resource
+ * @policy: the limiting policy / algorithm
+ * @capacity: the maximum capacity of the resource
+ * @timestamp: timestamp of the last accounted resource request
+ * @lock: the lock to protect all of the above.
+ * The routines below consider this to be IRQ-safe
+ *
+ * The cgroup that wishes to account for some resource may include this counter
+ * into its structures and use the helpers described beyond.
+ */
struct res_counter {
- /*
- * the current resource consumption level
- */
unsigned long long usage;
- /*
- * the maximal value of the usage from the counter creation
- */
unsigned long long max_usage;
- /*
- * the limit that usage cannot exceed
- */
unsigned long long limit;
- /*
- * the number of unsuccessful attempts to consume the resource
- */
unsigned long long failcnt;
+ unsigned long long policy;
+ unsigned long long capacity;
+ unsigned long long timestamp;
/*
* the lock to protect all of the above.
* the routines below consider this to be IRQ-safe
@@ -84,6 +90,9 @@ enum {
RES_USAGE,
RES_MAX_USAGE,
RES_LIMIT,
+ RES_POLICY,
+ RES_TIMESTAMP,
+ RES_CAPACITY,
RES_FAILCNT,
};

@@ -130,6 +139,15 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
return false;
}

+static inline unsigned long long
+res_counter_ratelimit_delta_t(struct res_counter *res)
+{
+ return (long long)get_jiffies_64() - (long long)res->timestamp;
+}
+
+unsigned long long
+res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
+
/*
* Helper function to detect if the cgroup is within it's limit or
* not. It's currently called from cgroup_rss_prepare()
@@ -163,6 +181,23 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)
spin_unlock_irqrestore(&cnt->lock, flags);
}

+static inline int
+res_counter_ratelimit_set_limit(struct res_counter *cnt,
+ unsigned long long policy,
+ unsigned long long limit, unsigned long long max)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->limit = limit;
+ cnt->capacity = max;
+ cnt->policy = policy;
+ cnt->timestamp = get_jiffies_64();
+ cnt->usage = 0;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
static inline int res_counter_set_limit(struct res_counter *cnt,
unsigned long long limit)
{
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index bf8e753..6f882c6 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -9,6 +9,7 @@

#include <linux/types.h>
#include <linux/parser.h>
+#include <linux/jiffies.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/res_counter.h>
@@ -20,6 +21,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
counter->parent = parent;
+ counter->capacity = (unsigned long long)LLONG_MAX;
+ counter->timestamp = get_jiffies_64();
}

int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -99,6 +102,12 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->max_usage;
case RES_LIMIT:
return &counter->limit;
+ case RES_POLICY:
+ return &counter->policy;
+ case RES_TIMESTAMP:
+ return &counter->timestamp;
+ case RES_CAPACITY:
+ return &counter->capacity;
case RES_FAILCNT:
return &counter->failcnt;
};
@@ -163,3 +172,67 @@ int res_counter_write(struct res_counter *counter, int member,
spin_unlock_irqrestore(&counter->lock, flags);
return 0;
}
+
+/* Note: called with res->lock held */
+static unsigned long long
+ratelimit_leaky_bucket(struct res_counter *res, ssize_t val)
+{
+ unsigned long long delta, t;
+
+ res->usage += val;
+ delta = res_counter_ratelimit_delta_t(res);
+ if (!delta)
+ return 0;
+ t = res->usage * USEC_PER_SEC;
+ t = usecs_to_jiffies(div_u64(t, res->limit));
+ if (t > delta)
+ return t - delta;
+ /* Reset i/o statistics */
+ res->usage = 0;
+ res->timestamp = get_jiffies_64();
+ return 0;
+}
+
+/* Note: called with res->lock held */
+static unsigned long long
+ratelimit_token_bucket(struct res_counter *res, ssize_t val)
+{
+ unsigned long long delta;
+ long long tok;
+
+ res->usage -= val;
+ delta = jiffies_to_msecs(res_counter_ratelimit_delta_t(res));
+ res->timestamp = get_jiffies_64();
+ tok = (long long)res->usage * MSEC_PER_SEC;
+ if (delta) {
+ long long max = (long long)res->capacity * MSEC_PER_SEC;
+
+ tok += delta * res->limit;
+ tok = min_t(long long, tok, max); /* cap tokens at bucket capacity */
+ res->usage = (unsigned long long)div_s64(tok, MSEC_PER_SEC);
+ }
+ return (tok < 0) ? msecs_to_jiffies(div_u64(-tok, res->limit)) : 0;
+}
+
+unsigned long long
+res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val)
+{
+ unsigned long long sleep = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&res->lock, flags);
+ if (res->limit)
+ switch (res->policy) {
+ case RATELIMIT_LEAKY_BUCKET:
+ sleep = ratelimit_leaky_bucket(res, val);
+ break;
+ case RATELIMIT_TOKEN_BUCKET:
+ sleep = ratelimit_token_bucket(res, val);
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+ spin_unlock_irqrestore(&res->lock, flags);
+ return sleep;
+}
--
1.6.0.4

2009-04-28 08:45:49

by Andrea Righi

Subject: [PATCH v15 4/7] io-throttle controller infrastructure

This is the core of the io-throttle kernel infrastructure. It creates
the basic interfaces to the cgroup subsystem and implements the I/O
measurement and throttling functionality.

Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
---
block/Makefile | 1 +
block/blk-io-throttle.c | 851 +++++++++++++++++++++++++++++++++++++++
include/linux/blk-io-throttle.h | 168 ++++++++
include/linux/cgroup.h | 1 +
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 12 +
kernel/cgroup.c | 9 +
7 files changed, 1048 insertions(+), 0 deletions(-)
create mode 100644 block/blk-io-throttle.c
create mode 100644 include/linux/blk-io-throttle.h

diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..42b6a46 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,5 +13,6 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o

+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..380a21a
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,851 @@
+/*
+ * blk-io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <[email protected]>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/res_counter.h>
+#include <linux/memcontrol.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/genhd.h>
+#include <linux/hardirq.h>
+#include <linux/list.h>
+#include <linux/seq_file.h>
+#include <linux/spinlock.h>
+#include <linux/blk-io-throttle.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+#include <linux/sched.h>
+#include <linux/bio.h>
+
+/*
+ * Statistics for I/O bandwidth controller.
+ */
+enum iothrottle_stat_index {
+ /* # of times the cgroup has been throttled for bw limit */
+ IOTHROTTLE_STAT_BW_COUNT,
+ /* # of jiffies spent to sleep for throttling for bw limit */
+ IOTHROTTLE_STAT_BW_SLEEP,
+ /* # of times the cgroup has been throttled for iops limit */
+ IOTHROTTLE_STAT_IOPS_COUNT,
+ /* # of jiffies spent to sleep for throttling for iops limit */
+ IOTHROTTLE_STAT_IOPS_SLEEP,
+ /* total number of bytes read and written */
+ IOTHROTTLE_STAT_BYTES_TOT,
+ /* total number of I/O operations */
+ IOTHROTTLE_STAT_IOPS_TOT,
+
+ IOTHROTTLE_STAT_NSTATS,
+};
+
+struct iothrottle_stat_cpu {
+ unsigned long long count[IOTHROTTLE_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct iothrottle_stat {
+ struct iothrottle_stat_cpu cpustat[NR_CPUS];
+};
+
+static void iothrottle_stat_add(struct iothrottle_stat *stat,
+ enum iothrottle_stat_index type, unsigned long long val)
+{
+ int cpu = get_cpu();
+
+ stat->cpustat[cpu].count[type] += val;
+ put_cpu();
+}
+
+static void iothrottle_stat_add_sleep(struct iothrottle_stat *stat,
+ int type, unsigned long long sleep)
+{
+ int cpu = get_cpu();
+
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_COUNT]++;
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_SLEEP] += sleep;
+ break;
+ case IOTHROTTLE_IOPS:
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_COUNT]++;
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_SLEEP] += sleep;
+ break;
+ }
+ put_cpu();
+}
+
+static unsigned long long iothrottle_read_stat(struct iothrottle_stat *stat,
+ enum iothrottle_stat_index idx)
+{
+ int cpu;
+ unsigned long long ret = 0;
+
+ for_each_possible_cpu(cpu)
+ ret += stat->cpustat[cpu].count[idx];
+ return ret;
+}
+
+struct iothrottle_sleep {
+ unsigned long long bw_sleep;
+ unsigned long long iops_sleep;
+};
+
+/*
+ * struct iothrottle_node - throttling rule of a single block device
+ * @node: list of per block device throttling rules
+ * @dev: block device number, used as key in the list
+ * @bw: max i/o bandwidth (in bytes/s)
+ * @iops: max i/o operations per second
+ * @stat: throttling statistics
+ *
+ * Define an i/o throttling rule for a single block device.
+ *
+ * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
+ * the limiting rules defined for that device persist and they are still valid
+ * if a new device is plugged and it uses the same dev_t number.
+ */
+struct iothrottle_node {
+ struct list_head node;
+ dev_t dev;
+ struct res_counter bw;
+ struct res_counter iops;
+ struct iothrottle_stat stat;
+};
+
+/**
+ * struct iothrottle - throttling rules for a cgroup
+ * @css: pointer to the cgroup state
+ * @list: list of iothrottle_node elements
+ *
+ * Define multiple per-block device i/o throttling rules.
+ * Note: the list of the throttling rules is protected by RCU locking:
+ * - hold cgroup_lock() for update.
+ * - hold rcu_read_lock() for read.
+ */
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ struct list_head list;
+};
+static struct iothrottle init_iothrottle;
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() held.
+ */
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() or cgroup_lock() held.
+ */
+static struct iothrottle_node *
+iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ if (list_empty(&iot->list))
+ return NULL;
+ list_for_each_entry_rcu(n, &iot->list, node)
+ if (n->dev == dev)
+ return n;
+ return NULL;
+}
+
+/*
+ * Note: called with cgroup_lock() held.
+ */
+static void iothrottle_insert_node(struct iothrottle *iot,
+ struct iothrottle_node *n)
+{
+ WARN_ON_ONCE(!cgroup_is_locked());
+ list_add_rcu(&n->node, &iot->list);
+}
+
+/*
+ * Note: called with cgroup_lock() held.
+ */
+static void
+iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
+ struct iothrottle_node *new)
+{
+ WARN_ON_ONCE(!cgroup_is_locked());
+ list_replace_rcu(&old->node, &new->node);
+}
+
+/*
+ * Note: called with cgroup_lock() held.
+ */
+static void
+iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
+{
+ WARN_ON_ONCE(!cgroup_is_locked());
+ list_del_rcu(&n->node);
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *
+iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle *iot;
+
+ if (unlikely((cgrp->parent) == NULL)) {
+ iot = &init_iothrottle;
+ } else {
+ iot = kzalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+ }
+ INIT_LIST_HEAD(&iot->list);
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle_node *n, *p;
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+
+ free_css_id(&iothrottle_subsys, &iot->css);
+ /*
+ * don't worry about locking here; at this point there cannot be any
+ * reference to the list.
+ */
+ if (!list_empty(&iot->list))
+ list_for_each_entry_safe(n, p, &iot->list, node)
+ kfree(n);
+ kfree(iot);
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ *
+ * do not care too much about locking for single res_counter values here.
+ */
+static void iothrottle_show_limit(struct seq_file *m, dev_t dev,
+ struct res_counter *res)
+{
+ if (!res->limit)
+ return;
+ seq_printf(m, "%u %u %llu %llu %lli %llu %li\n",
+ MAJOR(dev), MINOR(dev),
+ res->limit, res->policy,
+ (long long)res->usage, res->capacity,
+ jiffies_to_clock_t(res_counter_ratelimit_delta_t(res)));
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ *
+ */
+static void iothrottle_show_failcnt(struct seq_file *m, dev_t dev,
+ struct iothrottle_stat *stat)
+{
+ unsigned long long bw_count, bw_sleep, iops_count, iops_sleep;
+
+ bw_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_COUNT);
+ bw_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_SLEEP);
+ iops_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_COUNT);
+ iops_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_SLEEP);
+
+ seq_printf(m, "%u %u %llu %li %llu %li\n", MAJOR(dev), MINOR(dev),
+ bw_count, jiffies_to_clock_t(bw_sleep),
+ iops_count, jiffies_to_clock_t(iops_sleep));
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_show_stat(struct seq_file *m, dev_t dev,
+ struct iothrottle_stat *stat)
+{
+ unsigned long long bytes, iops;
+
+ bytes = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BYTES_TOT);
+ iops = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_TOT);
+
+ seq_printf(m, "%u %u %llu %llu\n", MAJOR(dev), MINOR(dev), bytes, iops);
+}
+
+static int iothrottle_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+ struct iothrottle_node *n;
+
+ rcu_read_lock();
+ if (list_empty(&iot->list))
+ goto unlock_and_return;
+ list_for_each_entry_rcu(n, &iot->list, node) {
+ BUG_ON(!n->dev);
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ iothrottle_show_limit(m, n->dev, &n->bw);
+ break;
+ case IOTHROTTLE_IOPS:
+ iothrottle_show_limit(m, n->dev, &n->iops);
+ break;
+ case IOTHROTTLE_FAILCNT:
+ iothrottle_show_failcnt(m, n->dev, &n->stat);
+ break;
+ case IOTHROTTLE_STAT:
+ iothrottle_show_stat(m, n->dev, &n->stat);
+ break;
+ }
+ }
+unlock_and_return:
+ rcu_read_unlock();
+ return 0;
+}
+
+static dev_t devname2dev_t(const char *buf)
+{
+ struct block_device *bdev;
+ dev_t dev = 0;
+ struct gendisk *disk;
+ int part;
+
+ /* use a lookup to validate the block device */
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return 0;
+ /* only entire devices are allowed, not single partitions */
+ disk = get_gendisk(bdev->bd_dev, &part);
+ if (disk && !part) {
+ BUG_ON(!bdev->bd_inode);
+ dev = bdev->bd_inode->i_rdev;
+ }
+ bdput(bdev);
+
+ return dev;
+}
+
+/*
+ * The userspace input string must use one of the following syntaxes:
+ *
+ * dev:0 <- delete an i/o limiting rule
+ * dev:io-limit:0 <- set a leaky bucket throttling rule
+ * dev:io-limit:1:bucket-size <- set a token bucket throttling rule
+ * dev:io-limit:1 <- set a token bucket throttling rule using
+ * bucket-size == io-limit
+ */
+static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
+ dev_t *dev, unsigned long long *iolimit,
+ unsigned long long *strategy,
+ unsigned long long *bucket_size)
+{
+ char *p;
+ int count = 0;
+ char *s[4];
+ int ret;
+
+ memset(s, 0, sizeof(s));
+ *dev = 0;
+ *iolimit = 0;
+ *strategy = 0;
+ *bucket_size = 0;
+
+ /* split the colon-delimited input string into its elements */
+ while (count < ARRAY_SIZE(s)) {
+ p = strsep(&buf, ":");
+ if (!p)
+ break;
+ if (!*p)
+ continue;
+ s[count++] = p;
+ }
+
+ /* i/o limit */
+ if (!s[1])
+ return -EINVAL;
+ ret = strict_strtoull(s[1], 10, iolimit);
+ if (ret < 0)
+ return ret;
+ if (!*iolimit)
+ goto out;
+ /* throttling strategy (leaky bucket / token bucket) */
+ if (!s[2])
+ return -EINVAL;
+ ret = strict_strtoull(s[2], 10, strategy);
+ if (ret < 0)
+ return ret;
+ switch (*strategy) {
+ case RATELIMIT_LEAKY_BUCKET:
+ goto out;
+ case RATELIMIT_TOKEN_BUCKET:
+ break;
+ default:
+ return -EINVAL;
+ }
+ /* bucket size */
+ if (!s[3])
+ *bucket_size = *iolimit;
+ else {
+ ret = strict_strtoull(s[3], 10, bucket_size);
+ if (ret < 0)
+ return ret;
+ }
+ if (!*bucket_size)
+ return -EINVAL;
+out:
+ /* block device number */
+ *dev = devname2dev_t(s[0]);
+ return *dev ? 0 : -EINVAL;
+}
+
+static int iothrottle_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n, *newn = NULL;
+ dev_t dev;
+ unsigned long long iolimit, strategy, bucket_size;
+ char *buf;
+ size_t nbytes = strlen(buffer);
+ int ret = 0;
+
+ /*
+ * We need to allocate a new buffer here, because
+ * iothrottle_parse_args() can modify it and the buffer provided by
+ * write_string is supposed to be const.
+ */
+ buf = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+ memcpy(buf, buffer, nbytes + 1);
+
+ ret = iothrottle_parse_args(buf, nbytes, cft->private, &dev, &iolimit,
+ &strategy, &bucket_size);
+ if (ret)
+ goto out1;
+ newn = kzalloc(sizeof(*newn), GFP_KERNEL);
+ if (!newn) {
+ ret = -ENOMEM;
+ goto out1;
+ }
+ newn->dev = dev;
+ res_counter_init(&newn->bw, NULL);
+ res_counter_init(&newn->iops, NULL);
+
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ res_counter_ratelimit_set_limit(&newn->iops, 0, 0, 0);
+ res_counter_ratelimit_set_limit(&newn->bw, strategy,
+ ALIGN(iolimit, 1024), ALIGN(bucket_size, 1024));
+ break;
+ case IOTHROTTLE_IOPS:
+ res_counter_ratelimit_set_limit(&newn->bw, 0, 0, 0);
+ /*
+ * scale up the iops cost by a factor of 1000; this allows applying
+ * more fine-grained sleeps and makes the throttling more precise.
+ */
+ res_counter_ratelimit_set_limit(&newn->iops, strategy,
+ iolimit * 1000, bucket_size * 1000);
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+
+ if (!cgroup_lock_live_group(cgrp)) {
+ ret = -ENODEV;
+ goto out1;
+ }
+ iot = cgroup_to_iothrottle(cgrp);
+
+ n = iothrottle_search_node(iot, dev);
+ if (!n) {
+ if (iolimit) {
+ /* Add a new block device limiting rule */
+ iothrottle_insert_node(iot, newn);
+ newn = NULL;
+ }
+ goto out2;
+ }
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ if (!iolimit && !n->iops.limit) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, n);
+ goto out2;
+ }
+ if (!n->iops.limit)
+ break;
+ /* Update a block device limiting rule */
+ newn->iops = n->iops;
+ break;
+ case IOTHROTTLE_IOPS:
+ if (!iolimit && !n->bw.limit) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, n);
+ goto out2;
+ }
+ if (!n->bw.limit)
+ break;
+ /* Update a block device limiting rule */
+ newn->bw = n->bw;
+ break;
+ }
+ iothrottle_replace_node(iot, n, newn);
+ newn = NULL;
+out2:
+ cgroup_unlock();
+ if (n) {
+ synchronize_rcu();
+ kfree(n);
+ }
+out1:
+ kfree(newn);
+ kfree(buf);
+ return ret;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth-max",
+ .read_seq_string = iothrottle_read,
+ .write_string = iothrottle_write,
+ .max_write_len = 256,
+ .private = IOTHROTTLE_BANDWIDTH,
+ },
+ {
+ .name = "iops-max",
+ .read_seq_string = iothrottle_read,
+ .write_string = iothrottle_write,
+ .max_write_len = 256,
+ .private = IOTHROTTLE_IOPS,
+ },
+ {
+ .name = "throttlecnt",
+ .read_seq_string = iothrottle_read,
+ .private = IOTHROTTLE_FAILCNT,
+ },
+ {
+ .name = "stat",
+ .read_seq_string = iothrottle_read,
+ .private = IOTHROTTLE_STAT,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .subsys_id = iothrottle_subsys_id,
+ .early_init = 1,
+ .use_id = 1,
+};
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_evaluate_sleep(struct iothrottle_sleep *sleep,
+ struct iothrottle *iot,
+ struct block_device *bdev, ssize_t bytes)
+{
+ struct iothrottle_node *n;
+ dev_t dev;
+
+ BUG_ON(!iot);
+
+ /* accounting and throttling are done only on entire block devices */
+ dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev), bdev->bd_disk->first_minor);
+ n = iothrottle_search_node(iot, dev);
+ if (!n)
+ return;
+
+ /* Update statistics */
+ iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_BYTES_TOT, bytes);
+ if (bytes)
+ iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_IOPS_TOT, 1);
+
+ /* Evaluate sleep values */
+ sleep->bw_sleep = res_counter_ratelimit_sleep(&n->bw, bytes);
+ /*
+ * Scale up the iops cost by a factor of 1000: this allows us to apply
+ * more fine-grained sleeps, and the throttling works better this way.
+ *
+ * Note: do not account any i/o operation if bytes is negative or zero.
+ */
+ sleep->iops_sleep = res_counter_ratelimit_sleep(&n->iops,
+ bytes ? 1000 : 0);
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_acct_stat(struct iothrottle *iot,
+ struct block_device *bdev, int type,
+ unsigned long long sleep)
+{
+ struct iothrottle_node *n;
+ dev_t dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev),
+ bdev->bd_disk->first_minor);
+
+ n = iothrottle_search_node(iot, dev);
+ if (!n)
+ return;
+ iothrottle_stat_add_sleep(&n->stat, type, sleep);
+}
+
+static void iothrottle_acct_task_stat(int type, unsigned long long sleep)
+{
+ /*
+ * XXX: per-task statistics may be inaccurate (this is not a
+ * critical issue and is preferable to introducing locking
+ * overhead or increasing the size of task_struct).
+ */
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ current->io_throttle_bw_cnt++;
+ current->io_throttle_bw_sleep += sleep;
+ break;
+
+ case IOTHROTTLE_IOPS:
+ current->io_throttle_iops_cnt++;
+ current->io_throttle_iops_sleep += sleep;
+ break;
+ }
+}
+
+/*
+ * A helper function to get iothrottle from css id.
+ *
+ * NOTE: must be called under rcu_read_lock(). The caller must check
+ * css_is_removed() or something similar if it is a concern.
+ */
+static struct iothrottle *iothrottle_lookup(unsigned long id)
+{
+ struct cgroup_subsys_state *css;
+
+ if (!id)
+ return NULL;
+ css = css_lookup(&iothrottle_subsys, id);
+ if (!css)
+ return NULL;
+ return container_of(css, struct iothrottle, css);
+}
+
+static struct iothrottle *get_iothrottle_from_page(struct page *page)
+{
+ struct iothrottle *iot;
+ unsigned long id;
+
+ BUG_ON(!page);
+ id = page_cgroup_get_owner(page);
+
+ rcu_read_lock();
+ iot = iothrottle_lookup(id);
+ if (!iot)
+ goto out;
+ css_get(&iot->css);
+out:
+ rcu_read_unlock();
+ return iot;
+}
+
+static struct iothrottle *get_iothrottle_from_bio(struct bio *bio)
+{
+ if (!bio)
+ return NULL;
+ return get_iothrottle_from_page(bio_page(bio));
+}
+
+int iothrottle_set_page_owner(struct page *page, struct mm_struct *mm)
+{
+ struct iothrottle *iot;
+ unsigned short id = 0;
+
+ if (iothrottle_disabled())
+ return 0;
+ if (!mm)
+ goto out;
+ rcu_read_lock();
+ iot = task_to_iothrottle(rcu_dereference(mm->owner));
+ if (likely(iot))
+ id = css_id(&iot->css);
+ rcu_read_unlock();
+out:
+ return page_cgroup_set_owner(page, id);
+}
+
+int iothrottle_set_pagedirty_owner(struct page *page, struct mm_struct *mm)
+{
+ if (PageSwapCache(page) || PageAnon(page))
+ return 0;
+ if (current->flags & PF_MEMALLOC)
+ return 0;
+ return iothrottle_set_page_owner(page, mm);
+}
+
+int iothrottle_copy_page_owner(struct page *npage, struct page *opage)
+{
+ if (iothrottle_disabled())
+ return 0;
+ return page_cgroup_copy_owner(npage, opage);
+}
+
+static inline int is_kthread_io(void)
+{
+ return current->flags & (PF_KTHREAD | PF_FLUSHER | PF_KSWAPD);
+}
+
+static bool is_urgent_io(struct bio *bio)
+{
+ if (bio && (bio_rw_meta(bio) || bio_noidle(bio)))
+ return true;
+ if (has_fs_excl())
+ return true;
+ return false;
+}
+
+static void iothrottle_force_sleep(int type, unsigned long long sleep)
+{
+ pr_debug("io-throttle: task %p (%s) must sleep %llu jiffies\n",
+ current, current->comm, sleep);
+ iothrottle_acct_task_stat(type, sleep);
+ schedule_timeout_killable(sleep);
+}
+
+/**
+ * cgroup_io_throttle() - account and throttle synchronous i/o activity
+ * @bio: the bio structure used to retrieve the owner of the i/o
+ * operation.
+ * @bdev: block device involved for the i/o.
+ * @bytes: size in bytes of the i/o operation.
+ *
+ * This is the core of the block device i/o bandwidth controller. This function
+ * must be called by any function that generates i/o activity (directly or
+ * indirectly). It provides both i/o accounting and throttling functionality;
+ * throttling is skipped whenever the current task cannot safely sleep (kernel
+ * threads, urgent IO, AIO contexts).
+ *
+ * Returns the value of sleep in jiffies if it was not possible to schedule the
+ * timeout.
+ **/
+unsigned long long
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
+{
+ struct iothrottle *iot = NULL, *curr_iot;
+ struct iothrottle_sleep s = {};
+ unsigned long long sleep;
+ int type, can_sleep = 1;
+
+ if (iothrottle_disabled())
+ return 0;
+ if (unlikely(!bdev))
+ return 0;
+ BUG_ON(!bdev->bd_inode || !bdev->bd_disk);
+ /*
+ * Never throttle kernel threads directly, since they may completely
+ * block other cgroups, the i/o on other block devices or even the
+ * whole system.
+ *
+ * For the same reason never throttle IO that comes from tasks that are
+ * holding exclusive access resources (urgent IO).
+ *
+ * And never sleep if we're inside an AIO context; just account the i/o
+ * activity. Throttling is performed in io_submit_one() returning
+ * -EAGAIN when the limits are exceeded.
+ */
+ if (is_kthread_io() || is_urgent_io(bio) || is_in_aio())
+ can_sleep = 0;
+ /*
+ * WARNING: in_atomic() does not know about held spinlocks in
+ * non-preemptible kernels, but we check it here anyway to catch
+ * potential bugs when a preemptible kernel is used.
+ */
+ WARN_ON_ONCE(can_sleep &&
+ (irqs_disabled() || in_interrupt() || in_atomic()));
+ /*
+ * Evaluate the IO context of bio.
+ *
+ * In O_DIRECT mode the context of bio always refers to the current
+ * task. Otherwise, to differentiate writeback IO from synchronous IO
+ * we compare the bio's io-throttle cgroup with the current task's
+ * cgroup. If they're different we're doing writeback IO and we can't
+ * throttle the current task directly.
+ */
+ if (!is_in_dio())
+ iot = get_iothrottle_from_bio(bio);
+ rcu_read_lock();
+ curr_iot = task_to_iothrottle(current);
+ if (curr_iot != iot) {
+ css_get(&curr_iot->css);
+ /*
+ * IO occurs in a different context from the current task
+ * (writeback IO).
+ *
+ * Do not throttle current task directly in this case, just
+ * delay the submission of the IO request (that will be
+ * dispatched by kiothrottled).
+ */
+ can_sleep = 0;
+ }
+ if (iot == NULL) {
+ /* IO occurs in the same context as the current task */
+ iot = curr_iot;
+ }
+ /* Apply IO throttling */
+ iothrottle_evaluate_sleep(&s, iot, bdev, bytes);
+ sleep = max(s.bw_sleep, s.iops_sleep);
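+ /* charge the sleep to whichever limit imposes the longer delay */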
+ type = (s.bw_sleep < s.iops_sleep) ?
+ IOTHROTTLE_IOPS : IOTHROTTLE_BANDWIDTH;
+ if (unlikely(sleep && can_sleep))
+ iothrottle_acct_stat(iot, bdev, type, sleep);
+ css_put(&iot->css);
+ if (curr_iot != iot)
+ css_put(&curr_iot->css);
+ rcu_read_unlock();
+ if (unlikely(sleep && can_sleep)) {
+ /* Throttle the current task directly */
+ iothrottle_force_sleep(type, sleep);
+ return 0;
+ }
+ /*
+ * Account, but do not throttle, filesystems' metadata IO or IO that is
+ * explicitly marked to not wait or to not be anticipated, e.g. writes
+ * with wbc->sync_mode set to WB_SYNC_ALL (fsync()) or journal activity.
+ */
+ if (is_urgent_io(bio))
+ sleep = 0;
+ return sleep;
+}
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
new file mode 100644
index 0000000..e448130
--- /dev/null
+++ b/include/linux/blk-io-throttle.h
@@ -0,0 +1,168 @@
+#ifndef BLK_IO_THROTTLE_H
+#define BLK_IO_THROTTLE_H
+
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/sched.h>
+#include <linux/cgroup.h>
+#include <asm/atomic.h>
+#include <asm/current.h>
+
+#define IOTHROTTLE_BANDWIDTH 0
+#define IOTHROTTLE_IOPS 1
+#define IOTHROTTLE_FAILCNT 2
+#define IOTHROTTLE_STAT 3
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+
+static inline bool iothrottle_disabled(void)
+{
+ if (iothrottle_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern unsigned long long
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes);
+
+extern int iothrottle_make_request(struct bio *bio, unsigned long deadline);
+
+int iothrottle_set_page_owner(struct page *page, struct mm_struct *mm);
+int iothrottle_set_pagedirty_owner(struct page *page, struct mm_struct *mm);
+int iothrottle_copy_page_owner(struct page *npage, struct page *opage);
+
+extern int iothrottle_sync(void);
+
+static inline void set_in_aio(void)
+{
+ atomic_set(&current->in_aio, 1);
+}
+
+static inline void unset_in_aio(void)
+{
+ atomic_set(&current->in_aio, 0);
+}
+
+static inline int is_in_aio(void)
+{
+ return atomic_read(&current->in_aio);
+}
+
+static inline void set_in_dio(void)
+{
+ atomic_set(&current->in_dio, 1);
+}
+
+static inline void unset_in_dio(void)
+{
+ atomic_set(&current->in_dio, 0);
+}
+
+static inline int is_in_dio(void)
+{
+ return atomic_read(&current->in_dio);
+}
+
+static inline unsigned long long
+get_io_throttle_cnt(struct task_struct *t, int type)
+{
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ return t->io_throttle_bw_cnt;
+ case IOTHROTTLE_IOPS:
+ return t->io_throttle_iops_cnt;
+ }
+ BUG();
+}
+
+static inline unsigned long long
+get_io_throttle_sleep(struct task_struct *t, int type)
+{
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ return jiffies_to_clock_t(t->io_throttle_bw_sleep);
+ case IOTHROTTLE_IOPS:
+ return jiffies_to_clock_t(t->io_throttle_iops_sleep);
+ }
+ BUG();
+}
+#else /* CONFIG_CGROUP_IO_THROTTLE */
+
+static inline bool iothrottle_disabled(void)
+{
+ return true;
+}
+
+static inline unsigned long long
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
+{
+ return 0;
+}
+
+static inline int
+iothrottle_make_request(struct bio *bio, unsigned long deadline)
+{
+ return 0;
+}
+
+static inline int iothrottle_set_page_owner(struct page *page,
+ struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline int iothrottle_set_pagedirty_owner(struct page *page,
+ struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline int iothrottle_copy_page_owner(struct page *npage,
+ struct page *opage)
+{
+ return 0;
+}
+
+static inline int iothrottle_sync(void)
+{
+ return 0;
+}
+
+static inline void set_in_aio(void) { }
+
+static inline void unset_in_aio(void) { }
+
+static inline int is_in_aio(void)
+{
+ return 0;
+}
+
+static inline void set_in_dio(void) { }
+
+static inline void unset_in_dio(void) { }
+
+static inline int is_in_dio(void)
+{
+ return 0;
+}
+
+static inline unsigned long long
+get_io_throttle_cnt(struct task_struct *t, int type)
+{
+ return 0;
+}
+
+static inline unsigned long long
+get_io_throttle_sleep(struct task_struct *t, int type)
+{
+ return 0;
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+static inline struct block_device *as_to_bdev(struct address_space *mapping)
+{
+ return (mapping->host && mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+}
+
+#endif /* BLK_IO_THROTTLE_H */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 665fa70..40cb412 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -28,6 +28,7 @@ struct css_id;
extern int cgroup_init_early(void);
extern int cgroup_init(void);
extern void cgroup_lock(void);
+extern int cgroup_is_locked(void);
extern bool cgroup_lock_live_group(struct cgroup *cgrp);
extern void cgroup_unlock(void);
extern void cgroup_fork(struct task_struct *p);
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..c37cc4b 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)

/* */

+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
diff --git a/init/Kconfig b/init/Kconfig
index 5428ac7..d496c5f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -565,6 +565,18 @@ config RESOURCE_COUNTERS
infrastructure that works with cgroups.
depends on CGROUPS

+config CGROUP_IO_THROTTLE
+ bool "Enable cgroup I/O throttling"
+ depends on CGROUPS && RESOURCE_COUNTERS && EXPERIMENTAL
+ select MM_OWNER
+ select PAGE_TRACKING
+ help
+ This allows you to limit the maximum I/O bandwidth of specific
+ cgroup(s).
+ See Documentation/cgroups/io-throttle.txt for more information.
+
+ If unsure, say N.
+
config CGROUP_MEM_RES_CTLR
bool "Memory Resource Controller for Control Groups"
depends on CGROUPS && RESOURCE_COUNTERS
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 382109b..5dbb2a7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -584,6 +584,15 @@ void cgroup_unlock(void)
mutex_unlock(&cgroup_mutex);
}

+/**
+ * cgroup_is_locked - check if the cgroup mutex is locked
+ *
+ */
+int cgroup_is_locked(void)
+{
+ return mutex_is_locked(&cgroup_mutex);
+}
+
/*
* A couple of forward declarations required, due to cyclic reference loop:
* cgroup_mkdir -> cgroup_create -> cgroup_populate_dir ->
--
1.6.0.4

2009-04-28 08:46:37

by Andrea Righi

[permalink] [raw]
Subject: [PATCH v15 6/7] io-throttle instrumentation

Apply the io-throttle control and page tracking to the appropriate kernel
functions.

Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-core.c | 8 ++++++++
fs/aio.c | 12 ++++++++++++
fs/buffer.c | 2 ++
fs/direct-io.c | 3 +++
include/linux/sched.h | 8 ++++++++
kernel/fork.c | 8 ++++++++
mm/bounce.c | 2 ++
mm/filemap.c | 2 ++
mm/page-writeback.c | 13 +++++++++++++
mm/readahead.c | 3 +++
10 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2998fe3..a9689df 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blktrace_api.h>
#include <linux/fault-inject.h>
#include <trace/block.h>
@@ -1549,11 +1550,16 @@ void submit_bio(int rw, struct bio *bio)
* go through the normal accounting stuff before submission.
*/
if (bio_has_data(bio)) {
+ unsigned long sleep = 0;
+
if (rw & WRITE) {
count_vm_events(PGPGOUT, count);
+ sleep = cgroup_io_throttle(bio,
+ bio->bi_bdev, bio->bi_size);
} else {
task_io_account_read(bio->bi_size);
count_vm_events(PGPGIN, count);
+ cgroup_io_throttle(NULL, bio->bi_bdev, bio->bi_size);
}

if (unlikely(block_dump)) {
@@ -1564,6 +1570,8 @@ void submit_bio(int rw, struct bio *bio)
(unsigned long long)bio->bi_sector,
bdevname(bio->bi_bdev, b));
}
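+ /* over quota and not throttled in place: defer the bio to kiothrottled */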
+ if (sleep && !iothrottle_make_request(bio, jiffies + sleep))
+ return;
}

generic_make_request(bio);
diff --git a/fs/aio.c b/fs/aio.c
index 76da125..ab6c457 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -22,6 +22,7 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/blk-io-throttle.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/slab.h>
@@ -1587,6 +1588,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
{
struct kiocb *req;
struct file *file;
+ struct block_device *bdev;
ssize_t ret;

/* enforce forwards compatibility on users */
@@ -1609,6 +1611,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
if (unlikely(!file))
return -EBADF;

+ /* check if we're exceeding the IO throttling limits */
+ bdev = as_to_bdev(file->f_mapping);
+ ret = cgroup_io_throttle(NULL, bdev, 0);
+ if (unlikely(ret)) {
+ fput(file);
+ return -EAGAIN;
+ }
+
req = aio_get_req(ctx); /* returns with 2 references to req */
if (unlikely(!req)) {
fput(file);
@@ -1652,12 +1662,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;

spin_lock_irq(&ctx->ctx_lock);
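+ /* flag the AIO context: IO is accounted here but never throttled by sleeping */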
+ set_in_aio();
aio_run_iocb(req);
if (!list_empty(&ctx->run_list)) {
/* drain the run list */
while (__aio_run_iocbs(ctx))
;
}
+ unset_in_aio();
spin_unlock_irq(&ctx->ctx_lock);
aio_put_req(req); /* drop extra ref to req */
return 0;
diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7..2eb581f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/blk-io-throttle.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ iothrottle_set_pagedirty_owner(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 05763bb..1b304b6 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -28,6 +28,7 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/bio.h>
#include <linux/wait.h>
#include <linux/err.h>
@@ -340,7 +341,9 @@ static void dio_bio_submit(struct dio *dio)
if (dio->is_async && dio->rw == READ)
bio_set_pages_dirty(bio);

+ set_in_dio();
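+ /* in_dio tells cgroup_io_throttle() to charge this IO to the current task */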
submit_bio(dio->rw, bio);
+ unset_in_dio();

dio->bio = NULL;
dio->boundary = 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..3294430 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1356,6 +1356,14 @@ struct task_struct {
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
struct task_io_accounting ioac;
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_t in_aio;
+ atomic_t in_dio;
+ unsigned long long io_throttle_bw_cnt;
+ unsigned long long io_throttle_bw_sleep;
+ unsigned long long io_throttle_iops_cnt;
+ unsigned long long io_throttle_iops_sleep;
+#endif
#if defined(CONFIG_TASK_XACCT)
u64 acct_rss_mem1; /* accumulated rss usage */
u64 acct_vm_mem1; /* accumulated virtual memory usage */
diff --git a/kernel/fork.c b/kernel/fork.c
index b9e2edd..7b4d991 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1043,6 +1043,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
task_io_accounting_init(&p->ioac);
acct_clear_integrals(p);

+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_set(&p->in_aio, 0);
+ atomic_set(&p->in_dio, 0);
+ p->io_throttle_bw_cnt = 0;
+ p->io_throttle_bw_sleep = 0;
+ p->io_throttle_iops_cnt = 0;
+ p->io_throttle_iops_sleep = 0;
+#endif
posix_cpu_timers_init(p);

p->lock_depth = -1; /* -1 = no lock */
diff --git a/mm/bounce.c b/mm/bounce.c
index e590272..80bf52c 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -10,6 +10,7 @@
#include <linux/pagemap.h>
#include <linux/mempool.h>
#include <linux/blkdev.h>
+#include <linux/blk-io-throttle.h>
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
@@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ iothrottle_copy_page_owner(to->bv_page, page);

if (rw == WRITE) {
char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..5498d1d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -28,6 +28,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/blkdev.h>
+#include <linux/blk-io-throttle.h>
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/cpuset.h>
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
+ iothrottle_set_page_owner(page, current->mm);

error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..90cd65a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -24,6 +24,7 @@
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/blkdev.h>
+#include <linux/blk-io-throttle.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
#include <linux/percpu.h>
@@ -626,12 +627,23 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
static DEFINE_PER_CPU(unsigned long, ratelimits) = 0;
unsigned long ratelimit;
unsigned long *p;
+ struct block_device *bdev = as_to_bdev(mapping);

ratelimit = ratelimit_pages;
if (mapping->backing_dev_info->dirty_exceeded)
ratelimit = 8;

/*
+ * Just check if we've exceeded cgroup IO limits, but do not account
+ * anything here because we're not actually doing IO at this stage.
+ *
+ * We just want to stop dirtying additional pages in the system,
+ * because we're not dispatching the IO requests generated by this
+ * cgroup.
+ */
+ cgroup_io_throttle(NULL, bdev, 0);
+
+ /*
* Check the rate limiting. Also, we do not want to throttle real-time
* tasks in balance_dirty_pages(). Period.
*/
@@ -1243,6 +1255,7 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ iothrottle_set_pagedirty_owner(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/mm/readahead.c b/mm/readahead.c
index 133b6d5..25cae4c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>

@@ -81,6 +82,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
+ struct block_device *bdev = as_to_bdev(mapping);
int ret = 0;

while (!list_empty(pages)) {
@@ -99,6 +101,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
+ cgroup_io_throttle(NULL, bdev, PAGE_CACHE_SIZE);
}
return ret;
}
--
1.6.0.4

2009-04-28 08:46:53

by Andrea Righi

[permalink] [raw]
Subject: [PATCH v15 7/7] io-throttle: export per-task statistics to userspace

Export the throttling statistics collected for each task through
/proc/PID/io-throttle-stat.

Example:
$ cat /proc/$$/io-throttle-stat
0 0 0 0
^ ^ ^ ^
\ \ \ \_____global iops sleep (in clock ticks)
\ \ \______global iops counter
\ \_______global bandwidth sleep (in clock ticks)
\________global bandwidth counter
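
A minimal userspace sketch (illustrative only, assuming a kernel with
this patchset applied) that reads the four counters back:

#include <stdio.h>

int main(void)
{
	unsigned long long bw_cnt, bw_sleep, iops_cnt, iops_sleep;
	FILE *f = fopen("/proc/self/io-throttle-stat", "r");

	if (!f)
		return 1;
	if (fscanf(f, "%llu %llu %llu %llu",
		   &bw_cnt, &bw_sleep, &iops_cnt, &iops_sleep) == 4)
		printf("bw: %llu sleeps (%llu ticks), iops: %llu sleeps (%llu ticks)\n",
		       bw_cnt, bw_sleep, iops_cnt, iops_sleep);
	fclose(f);
	return 0;
}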

Signed-off-by: Andrea Righi <[email protected]>
---
fs/proc/base.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index aa763ab..94061bf 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -54,6 +54,7 @@
#include <linux/proc_fs.h>
#include <linux/stat.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/init.h>
#include <linux/capability.h>
#include <linux/file.h>
@@ -2453,6 +2454,17 @@ static int proc_tgid_io_accounting(struct task_struct *task, char *buffer)
}
#endif /* CONFIG_TASK_IO_ACCOUNTING */

+#ifdef CONFIG_CGROUP_IO_THROTTLE
+static int proc_iothrottle_stat(struct task_struct *task, char *buffer)
+{
+ return sprintf(buffer, "%llu %llu %llu %llu\n",
+ get_io_throttle_cnt(task, IOTHROTTLE_BANDWIDTH),
+ get_io_throttle_sleep(task, IOTHROTTLE_BANDWIDTH),
+ get_io_throttle_cnt(task, IOTHROTTLE_IOPS),
+ get_io_throttle_sleep(task, IOTHROTTLE_IOPS));
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -2539,6 +2551,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, proc_tgid_io_accounting),
#endif
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ INF("io-throttle-stat", S_IRUGO, proc_iothrottle_stat),
+#endif
};

static int proc_tgid_base_readdir(struct file * filp,
@@ -2874,6 +2889,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, proc_tid_io_accounting),
#endif
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ INF("io-throttle-stat", S_IRUGO, proc_iothrottle_stat),
+#endif
};

static int proc_tid_base_readdir(struct file * filp,
--
1.6.0.4

2009-04-28 08:47:40

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v15 0/7] cgroup: io-throttle controller

I've repeated some tests with this new version (v15) of the io-throttle
controller.

The following results have been generated using the io-throttle
testcase, available at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/testcase/

The testcase is an updated version of the io-throttle testcase included
in LTP
(http://ltp.cvs.sourceforge.net/viewvc/ltp/ltp/testcases/kernel/controllers/io-throttle/).

Summary
~~~~~~~
The goal of this test is to highlight the effectiveness of io-throttle
in controlling direct and writeback IO by applying maximum BW limits (the
proportional BW approach is not addressed by this test, only absolute BW
limits are considered).

Benchmark #1 runs different numbers of parallel streams per cgroup
without imposing any IO limitation. Benchmark #2 repeats the same tests
using 4 cgroups with BW limits of 2MB/s, 4MB/s, 6MB/s and 8MB/s
respectively. The disk IO is constantly monitored (using dstat) to
evaluate the amount of writeback IO with and without the IO BW limits.

The results of benchmark #2 show the validity of the IO controller both
from the application's and disk's point of view.

Experimental Results
~~~~~~~~~~~~~~~~~~~~
==> collect system info <==
* kernel: 2.6.30-rc3
* disk: /dev/sda:
Timing cached reads: 1380 MB in 2.00 seconds = 690.79 MB/sec
Timing buffered disk reads: 76 MB in 3.07 seconds = 24.73 MB/sec
* filesystem: ext3
* VM dirty_ratio/dirty_background_ratio: 20/10

==> start benchmark #1 <==
* block-size: 16384 KB
* file-size: 262144 KB
* using 4 io-throttle cgroups:
- unlimited IO BW

==> results #1 (avg io-rate per cgroup) <==
* 1 parallel streams per cgroup, O_DIRECT=n
(cgroup-1, 1 tasks) [async-io] rate 12695 KiB/s
(cgroup-2, 1 tasks) [async-io] rate 13671 KiB/s
(cgroup-3, 1 tasks) [async-io] rate 12695 KiB/s
(cgroup-4, 1 tasks) [async-io] rate 12695 KiB/s
* 2 parallel streams per cgroup, O_DIRECT=n
(cgroup-1, 2 tasks) [async-io] rate 14648 KiB/s
(cgroup-2, 2 tasks) [async-io] rate 15625 KiB/s
(cgroup-3, 2 tasks) [async-io] rate 14648 KiB/s
(cgroup-4, 2 tasks) [async-io] rate 14648 KiB/s
* 4 parallel streams per cgroup, O_DIRECT=n
(cgroup-1, 4 tasks) [async-io] rate 20507 KiB/s
(cgroup-2, 4 tasks) [async-io] rate 20507 KiB/s
(cgroup-3, 4 tasks) [async-io] rate 20507 KiB/s
(cgroup-4, 4 tasks) [async-io] rate 20507 KiB/s
* 1 parallel streams per cgroup, O_DIRECT=y
(cgroup-1, 1 tasks) [direct-io] rate 3906 KiB/s
(cgroup-2, 1 tasks) [direct-io] rate 3906 KiB/s
(cgroup-3, 1 tasks) [direct-io] rate 3906 KiB/s
(cgroup-4, 1 tasks) [direct-io] rate 3906 KiB/s
* 2 parallel streams per cgroup, O_DIRECT=y
(cgroup-1, 2 tasks) [direct-io] rate 3906 KiB/s
(cgroup-2, 2 tasks) [direct-io] rate 3906 KiB/s
(cgroup-3, 2 tasks) [direct-io] rate 3906 KiB/s
(cgroup-4, 2 tasks) [direct-io] rate 3906 KiB/s
* 4 parallel streams per cgroup, O_DIRECT=y
(cgroup-1, 4 tasks) [direct-io] rate 3906 KiB/s
(cgroup-2, 4 tasks) [direct-io] rate 3906 KiB/s
(cgroup-3, 4 tasks) [direct-io] rate 3906 KiB/s
(cgroup-4, 4 tasks) [direct-io] rate 3906 KiB/s

A snapshot of the writeback IO (with O_DIRECT=n) in bytes/sec:
(statistics collected using dstat)

/dev/sda
----------
21729280.0
20733952.0
19628032.0
19390464.0
... <-- uniform to 20MB/s for the whole run

average: 19563861.33
stdev: 1078639.21


==> start benchmark #2 <==
* block-size: 16384 KB
* file-size: 262144 KB
* using 4 io-throttle cgroups:
- cgroup 1: 2048 KB/s on /dev/sda
- cgroup 2: 4096 KB/s on /dev/sda
- cgroup 3: 6144 KB/s on /dev/sda
- cgroup 4: 8192 KB/s on /dev/sda

==> results #2 (avg io-rate per cgroup) <==
* 1 parallel streams per cgroup, O_DIRECT=n
(cgroup-1, 1 tasks) [async-io] io-bw 2048 KiB/s, io-rate 12695 KiB/s
(cgroup-2, 1 tasks) [async-io] io-bw 4096 KiB/s, io-rate 15625 KiB/s
(cgroup-3, 1 tasks) [async-io] io-bw 6144 KiB/s, io-rate 15625 KiB/s
(cgroup-4, 1 tasks) [async-io] io-bw 8192 KiB/s, io-rate 22460 KiB/s
* 2 parallel streams per cgroup, O_DIRECT=n
(cgroup-1, 2 tasks) [async-io] io-bw 2048 KiB/s, io-rate 14648 KiB/s
(cgroup-2, 2 tasks) [async-io] io-bw 4096 KiB/s, io-rate 20507 KiB/s
(cgroup-3, 2 tasks) [async-io] io-bw 6144 KiB/s, io-rate 23437 KiB/s
(cgroup-4, 2 tasks) [async-io] io-bw 8192 KiB/s, io-rate 29296 KiB/s
* 4 parallel streams per cgroup, O_DIRECT=n
(cgroup-1, 4 tasks) [async-io] io-bw 2048 KiB/s, io-rate 10742 KiB/s
(cgroup-2, 4 tasks) [async-io] io-bw 4096 KiB/s, io-rate 16601 KiB/s
(cgroup-3, 4 tasks) [async-io] io-bw 6144 KiB/s, io-rate 21484 KiB/s
(cgroup-4, 4 tasks) [async-io] io-bw 8192 KiB/s, io-rate 23437 KiB/s
* 1 parallel streams per cgroup, O_DIRECT=y
(cgroup-1, 1 tasks) [direct-io] io-bw 2048 KiB/s, io-rate 2929 KiB/s
(cgroup-2, 1 tasks) [direct-io] io-bw 4096 KiB/s, io-rate 3906 KiB/s
(cgroup-3, 1 tasks) [direct-io] io-bw 6144 KiB/s, io-rate 4882 KiB/s
(cgroup-4, 1 tasks) [direct-io] io-bw 8192 KiB/s, io-rate 5859 KiB/s
* 2 parallel streams per cgroup, O_DIRECT=y
(cgroup-1, 2 tasks) [direct-io] io-bw 2048 KiB/s, io-rate 2929 KiB/s
(cgroup-2, 2 tasks) [direct-io] io-bw 4096 KiB/s, io-rate 4882 KiB/s
(cgroup-3, 2 tasks) [direct-io] io-bw 6144 KiB/s, io-rate 5859 KiB/s
(cgroup-4, 2 tasks) [direct-io] io-bw 8192 KiB/s, io-rate 5859 KiB/s
* 4 parallel streams per cgroup, O_DIRECT=y
(cgroup-1, 4 tasks) [direct-io] io-bw 2048 KiB/s, io-rate 976 KiB/s
(cgroup-2, 4 tasks) [direct-io] io-bw 4096 KiB/s, io-rate 1953 KiB/s
(cgroup-3, 4 tasks) [direct-io] io-bw 6144 KiB/s, io-rate 2929 KiB/s
(cgroup-4, 4 tasks) [direct-io] io-bw 8192 KiB/s, io-rate 3906 KiB/s

A snapshot of the writeback IO (with O_DIRECT=n) in bytes/sec:
(statistics collected using dstat)

/dev/sda
----------
... <-- all cgroups running (expected io-rate 2+4+6+8 = 20MB/s)
19550208.0
19030016.0
19546112.0
20070400.0
... <-- 1st cgroup (the 8MB/s one) ends (expected io-rate 12MB/s)
12673024.0
11304960.0
10604544.0
12357632.0
... <-- 2nd cgroup (the 6MB/s one) ends (expected io-rate 6MB/s)
6332416.0
6324224.0
6324224.0
6320128.0
... <-- 3rd cgroup (the 4MB/s one) ends (expected io-rate 2MB/s)
2105344.0
2113536.0
2097152.0
2101248.0

Open issues & thoughts
~~~~~~~~~~~~~~~~~~~~~
1) impact for the VM

Limiting the IO without considering the amount of dirty pages per cgroup
can cause potential OOM conditions due to the presence of hard
reclaimable pages (that must be flushed to disk before being evicted
from memory). At the moment, when a cgroup exceeds the IO BW limit,
direct IO requests are delayed and at the same time each task (in the
exceeding cgroups) is blocked in balance_dirty_pages_ratelimited_nr() to
prevent the generation of additional dirty pages in the system.

This will probably be handled in a better way by the memory cgroup soft
limits (under development by Kamezawa) and by the per-cgroup accounting
of dirty pages. For the accounting, something like task_struct->dirties
should probably be implemented in struct mem_cgroup (i.e.
memcg->dirties) to keep track of the dirty pages generated by a given
mem_cgroup. Based on these statistics, when a threshold is exceeded,
tasks should start to actively write back dirty inodes in proportion to
the dirty pages they previously generated.

There are also cgroup interactions to take into consideration. As a
practical example, if task T belonging to cgroup A is dirtying some
pages in cgroup B, and the other tasks in cgroup A previously dirtied a
lot of pages across the whole system, then task T should be forced to
write back some pages in proportion to the dirtied pages accounted to A.

In general, all the IO requests generated to write back dirty pages
must be subject to the IO BW limiting rules of the cgroup that
originally dirtied those pages. So, if B is under memory pressure and
task T stops writing, the tasks in cgroup B must start to actively
write back some dirty pages, but using the BW limits defined for cgroup
A (which originally dirtied the pages).

From the IO controller's point of view, it should only keep track of the
cgroup that dirtied each page and apply the IO BW limits defined for
this cgroup for the writeback IO. At the moment the io-throttle
controller uses this approach to throttle writeback IO requests.
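
As a rough illustration of this attribution rule (a sketch with
hypothetical names, not part of the patchset): every dirtied page
remembers the cgroup that dirtied it, and the writeback IO for that page
is throttled with that cgroup's limits, no matter which task flushes it.

#include <stdio.h>

struct page_sim {
	int owner;			/* id of the cgroup that dirtied the page */
};

struct cgroup_sim {
	unsigned long long bw_limit;	/* bytes/s */
};

static struct cgroup_sim cgroups[] = {
	{ .bw_limit = 2 << 20 },	/* cgroup A: 2MB/s */
	{ .bw_limit = 8 << 20 },	/* cgroup B: 8MB/s */
};

/* The flusher applies the limit of the page owner, not its own limit. */
static unsigned long long writeback_limit(const struct page_sim *page)
{
	return cgroups[page->owner].bw_limit;
}

int main(void)
{
	struct page_sim page = { .owner = 0 };	/* dirtied by cgroup A */

	printf("writeback limited to %llu bytes/s\n", writeback_limit(&page));
	return 0;
}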

2) impact for the IO subsystem

A block device with IO limits should be considered just like a "normal"
slow device by the kernel. Following the io-throttle approach, the
slowness is implemented in the submission of IO requests (throttling the
tasks directly in submit_bio() for synchronous requests, or delaying
these requests, adding them in a rbtree and dispatch them asynchronously
by kiothrottled, for writeback IO). Implementing such slowness at the IO
scheduler level is another valid solution that should be explored (i.e.
the work proposed by Vivek).

3) proportional BW and absolute BW limits

Other related works, like dm-ioband or a recent work proposed by Vivek
(https://lists.linux-foundation.org/pipermail/containers/2009-March/016129.html)
implement the logic of proportional weight IO control.

Proportional BW control makes it possible to better exploit the whole physical BW
of the disk: if a cgroup is not using its dedicated BW, other cgroups
sharing the same disk can make use of the spare BW.

OTOH, absolute limiting rules do not fully exploit all the physical BW,
but offer an immediate action on policy enforcement. With absolute BW
limits the problem is mitigated before it happens, because the system
guarantees that the "hard" limits are never exceeded. IOW it is a kind
of performance isolation through static partitioning. This approach can
be suitable for environments where certain critical/low-latency
applications must respect strict timing constraints (real-time), or in
hosted environments where we want to "contain" classes of users as if
they were on a virtual private system (depending on how much the
customer pays).

A good "general-purpose" IO controller should be able to provide both
solutions to satisfy all the possible user requirements.

Currently, proportional BW control is not provided by io-throttle. It
is on the TODO list, but it requires additional work, especially to keep
the controller "light" and to avoid introducing too much overhead or
complexity.

4) sync(2) handling

What is the correct behaviour when a user issues "sync" in the presence
of the io-throttle controller?

From the sync(2) manpage:

According to the standard specification (e.g., POSIX.1-2001), sync() schedules
the writes, but may return before the actual writing is done. However,
since version 1.3.20 Linux does actually wait. (This still does not
guarantee data integrity: modern disks have large caches.)

In the current io-throttle implementation sync(2) waits until all the
delayed IO requests pending in the rbtree are flushed back to disk. This
way a cgroup can obviously be forced to wait on other cgroups' BW
limits; that may sound strange, but it is probably the correct behaviour
to respect the semantics of this command.

-Andrea

2009-04-28 08:46:17

by Andrea Righi

[permalink] [raw]
Subject: [PATCH v15 5/7] kiothrottled: throttle buffered (writeback) IO

Together with cgroup_io_throttle() the kiothrottled kernel thread
represents the core of the io-throttle subsystem.

All the writeback IO requests that need to be throttled are not
dispatched immediately in submit_bio(). Instead, they are added into an
rbtree by iothrottle_make_request() and processed asynchronously by
kiothrottled.

A deadline is associated with each request depending on the bandwidth
usage of the cgroup it belongs to. When a request is inserted into the
rbtree kiothrottled is awakened. This thread selects all the requests
with an expired deadline and submits the selected requests to the
underlying block devices using generic_make_request().
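
A minimal userspace sketch (illustrative only, simplified from the
leaky-bucket logic used by the ratelimited res_counter) of how such a
deadline can be derived: the cgroup consumed `usage' bytes since it
started; at `limit' bytes/s that budget covers usage/limit seconds, so
any excess over the elapsed time becomes the delay before the request
may be dispatched.

#include <stdio.h>

static double request_delay(unsigned long long usage,
			    unsigned long long limit, double elapsed)
{
	double budget = (double)usage / (double)limit;	/* seconds covered */

	return budget > elapsed ? budget - elapsed : 0.0;
}

int main(void)
{
	/* 12MB submitted in 2s against a 4MB/s limit -> 1s of delay. */
	printf("delay: %.2fs\n", request_delay(12 << 20, 4 << 20, 2.0));
	return 0;
}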

Signed-off-by: Andrea Righi <[email protected]>
---
block/Makefile | 2 +-
block/kiothrottled.c | 341 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 342 insertions(+), 1 deletions(-)
create mode 100644 block/kiothrottled.c

diff --git a/block/Makefile b/block/Makefile
index 42b6a46..5f10a45 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,6 +13,6 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o

-obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o kiothrottled.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff --git a/block/kiothrottled.c b/block/kiothrottled.c
new file mode 100644
index 0000000..3df22c1
--- /dev/null
+++ b/block/kiothrottled.c
@@ -0,0 +1,341 @@
+/*
+ * kiothrottled.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <[email protected]>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/kthread.h>
+#include <linux/jiffies.h>
+#include <linux/ioprio.h>
+#include <linux/rbtree.h>
+#include <linux/blkdev.h>
+
+/* io-throttle bio element */
+struct iot_bio {
+ struct rb_node node;
+ unsigned long deadline;
+ struct bio *bio;
+};
+
+/* io-throttle bio tree */
+struct iot_bio_tree {
+ /* Protect the iothrottle rbtree */
+ spinlock_t lock;
+ struct rb_root tree;
+};
+
+/*
+ * TODO: create one iothrottle rbtree per block device and many kiothrottled
+ * threads per rbtree, instead of a poor scalable single rbtree / single thread
+ * solution.
+ */
+static struct iot_bio_tree *iot;
+static struct task_struct *kiothrottled_thread;
+
+/* Timer used to periodically wake-up kiothrottled */
+static struct timer_list kiothrottled_timer;
+
+/* Insert a new iot_bio element in the iot_bio_tree */
+static void iot_bio_insert(struct rb_root *root, struct iot_bio *data)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ while (*new) {
+ struct iot_bio *this = container_of(*new, struct iot_bio, node);
+ parent = *new;
+ if (data->deadline < this->deadline)
+ new = &((*new)->rb_left);
+ else
+ new = &((*new)->rb_right);
+ }
+ rb_link_node(&data->node, parent, new);
+ rb_insert_color(&data->node, root);
+}
+
+/*
+ * NOTE: no need to care about locking here, we're flushing all the pending
+ * requests, kiothrottled has been stopped and no additional requests will be
+ * submitted to the tree.
+ */
+static void iot_bio_cleanup(struct rb_root *root)
+{
+ struct iot_bio *data;
+ struct rb_node *next;
+
+ next = rb_first(root);
+ while (next) {
+ data = rb_entry(next, struct iot_bio, node);
+ pr_debug("%s: dispatching element: %p (%lu)\n",
+ __func__, data->bio, data->deadline);
+ generic_make_request(data->bio);
+ next = rb_next(&data->node);
+ rb_erase(&data->node, root);
+ kfree(data);
+ }
+}
+
+/**
+ * iothrottle_make_request() - submit a delayed IO request that will be
+ * processed asynchronously by kiothrottled.
+ *
+ * @bio: the bio structure that contains the IO request's information
+ * @deadline: the request will actually be dispatched only when the deadline
+ * expires
+ *
+ * Returns 0 if the request is successfully submitted and inserted into the
+ * iot_bio_tree, or a negative value in case of failure.
+ **/
+int iothrottle_make_request(struct bio *bio, unsigned long deadline)
+{
+ struct iot_bio *data;
+
+ BUG_ON(!iot);
+
+ if (unlikely(!kiothrottled_thread))
+ return -ENOENT;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
+ if (unlikely(!data))
+ return -ENOMEM;
+ data->deadline = deadline;
+ data->bio = bio;
+
+ spin_lock_irq(&iot->lock);
+ iot_bio_insert(&iot->tree, data);
+ spin_unlock_irq(&iot->lock);
+
+ wake_up_process(kiothrottled_thread);
+ return 0;
+}
+EXPORT_SYMBOL(iothrottle_make_request);
+
+static void kiothrottled_timer_expired(unsigned long __unused)
+{
+ wake_up_process(kiothrottled_thread);
+}
+
+static void kiothrottled_sleep(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule();
+}
+
+/**
+ * kiothrottled() - throttle buffered (writeback) i/o activity
+ *
+ * Together with cgroup_io_throttle() this kernel thread represents the core of
+ * the cgroup-io-throttle subsystem.
+ *
+ * All the writeback IO requests that need to be throttled are not dispatched
+ * immediately in submit_bio(). Instead, they are added into the iot_bio_tree
+ * rbtree by iothrottle_make_request() and processed asynchronously by
+ * kiothrottled.
+ *
+ * A deadline is associated with each request depending on the bandwidth usage
+ * of the cgroup it belongs to. When a request is inserted into the rbtree
+ * kiothrottled is awakened. This thread selects all the requests with an
+ * expired deadline and submits them to the underlying block devices using
+ * generic_make_request().
+ **/
+static int kiothrottled(void *__unused)
+{
+ /*
+ * kiothrottled is responsible for dispatching all the writeback IO
+ * requests with an expired deadline. To dispatch those requests as
+ * soon as possible and to avoid priority inversion problems, set the
+ * maximum real-time IO priority for this thread.
+ */
+ set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
+
+ while (!kthread_should_stop()) {
+ struct iot_bio *data;
+ struct rb_node *req;
+ struct rb_root staging_tree = RB_ROOT;
+ unsigned long now = jiffies;
+ long delta_t = 0;
+
+ /* Select requests to dispatch */
+ spin_lock_irq(&iot->lock);
+ req = rb_first(&iot->tree);
+ while (req) {
+ data = rb_entry(req, struct iot_bio, node);
+ delta_t = (long)data->deadline - (long)now;
+ if (delta_t > 0)
+ break;
+ req = rb_next(&data->node);
+ rb_erase(&data->node, &iot->tree);
+ iot_bio_insert(&staging_tree, data);
+ }
+ spin_unlock_irq(&iot->lock);
+
+ /* Dispatch requests */
+ req = rb_first(&staging_tree);
+ while (req) {
+ data = rb_entry(req, struct iot_bio, node);
+ req = rb_next(&data->node);
+ rb_erase(&data->node, &staging_tree);
+ pr_debug("%s: dispatching request: %p (%lu)\n",
+ __func__, data->bio, data->deadline);
+ generic_make_request(data->bio);
+ kfree(data);
+ }
+
+ /* Wait for new requests ready to be dispatched */
+ if (delta_t > 0)
+ mod_timer(&kiothrottled_timer, jiffies + HZ);
+ kiothrottled_sleep();
+ }
+ return 0;
+}
+
+/* TODO: handle concurrent startup and shutdown */
+static void kiothrottle_shutdown(void)
+{
+ if (!kiothrottled_thread)
+ return;
+ del_timer(&kiothrottled_timer);
+ printk(KERN_INFO "%s: stopping kiothrottled\n", __func__);
+ kthread_stop(kiothrottled_thread);
+ printk(KERN_INFO "%s: flushing pending requests\n", __func__);
+ spin_lock_irq(&iot->lock);
+ kiothrottled_thread = NULL;
+ spin_unlock_irq(&iot->lock);
+ iot_bio_cleanup(&iot->tree);
+}
+
+static int kiothrottle_startup(void)
+{
+ init_timer(&kiothrottled_timer);
+ kiothrottled_timer.function = kiothrottled_timer_expired;
+
+ printk(KERN_INFO "%s: starting kiothrottled\n", __func__);
+ kiothrottled_thread = kthread_run(kiothrottled, NULL, "kiothrottled");
+ if (IS_ERR(kiothrottled_thread))
+ return PTR_ERR(kiothrottled_thread);
+ return 0;
+}
+
+/*
+ * NOTE: provide this interface only for emergency situations, when we need to
+ * force an immediate flush of the pending throttled (writeback) IO requests.
+ */
+int iothrottle_sync(void)
+{
+ kiothrottle_shutdown();
+ return kiothrottle_startup();
+}
+EXPORT_SYMBOL(iothrottle_sync);
+
+/*
+ * Writing in /proc/kiothrottled_debug enforces an immediate flush of throttled
+ * IO requests.
+ */
+static ssize_t kiothrottle_write(struct file *filp, const char __user *buffer,
+ size_t count, loff_t *data)
+{
+ int ret;
+
+ ret = iothrottle_sync();
+ if (ret)
+ return ret;
+ return count;
+}
+
+/*
+ * Export to userspace the list of pending throttled IO requests.
+ * TODO: this is useful only for debugging; maybe we should make this
+ * interface optional, depending on an appropriate compile-time config option.
+ */
+static int kiothrottle_show(struct seq_file *m, void *v)
+{
+ struct iot_bio *data;
+ struct rb_node *next;
+ unsigned long now = jiffies;
+ long delta_t;
+
+ spin_lock_irq(&iot->lock);
+ next = rb_first(&iot->tree);
+ while (next) {
+ data = rb_entry(next, struct iot_bio, node);
+ delta_t = (long)data->deadline - (long)now;
+ seq_printf(m, "%p %lu %lu %li\n", data->bio,
+ data->deadline, now, delta_t);
+ next = rb_next(&data->node);
+ }
+ spin_unlock_irq(&iot->lock);
+
+ return 0;
+}
+
+static int kiothrottle_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, kiothrottle_show, NULL);
+}
+
+static const struct file_operations kiothrottle_ops = {
+ .open = kiothrottle_open,
+ .read = seq_read,
+ .write = kiothrottle_write,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+int __init kiothrottled_init(void)
+{
+ struct proc_dir_entry *pe;
+ int ret;
+
+ iot = kzalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return -ENOMEM;
+ spin_lock_init(&iot->lock);
+ iot->tree = RB_ROOT;
+
+ pe = create_proc_entry("kiothrottled_debug", 0644, NULL);
+ if (!pe) {
+ kfree(iot);
+ return -ENOMEM;
+ }
+ pe->proc_fops = &kiothrottle_ops;
+
+ ret = kiothrottle_startup();
+ if (ret) {
+ remove_proc_entry("kiothrottled_debug", NULL);
+ kfree(iot);
+ return ret;
+ }
+ printk(KERN_INFO "%s: initialized\n", __func__);
+ return 0;
+}
+
+void __exit kiothrottled_exit(void)
+{
+ kiothrottle_shutdown();
+ remove_proc_entry("kiothrottled_debug", NULL);
+ kfree(iot);
+ printk(KERN_INFO "%s: unloaded\n", __func__);
+}
+
+module_init(kiothrottled_init);
+module_exit(kiothrottled_exit);
+MODULE_LICENSE("GPL");
--
1.6.0.4

2009-04-28 14:39:43

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v15 2/7] res_counter: introduce ratelimiting attributes

Subject: io-throttle: reduce the size of res_counter

Reduce the size of struct res_counter after the introduction of
ratelimited resources:

- replace policy with a more generic unsigned long flags and encode the
throttling policy using a single bit of flags

- remove the attribute capacity: max_usage is not used in ratelimited
resources and capacity is not used in any of the other cases (it was
introduced only for token-bucket ratelimited resources), so just merge
capacity into max_usage

On a 64-bit architecture:

vanilla: sizeof(struct res_counter) = 48
with-io-throttle: sizeof(struct res_counter) = 72
with-io-throttle-and-reduced-res-counter: sizeof(struct res_counter) = 64

[ This patch must be applied on top of io-throttle v15. ]
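
The size figures above can be double-checked with a quick userspace
mock-up (illustrative only: the layouts are simplified from the patch,
and an int stands in for spinlock_t, which is 4 bytes on x86_64 without
lock debugging):

#include <stdio.h>

struct res_counter_old {	/* io-throttle v15, before this patch */
	unsigned long long usage, max_usage, limit, failcnt;
	unsigned long long policy, capacity, timestamp;
	int lock;		/* stand-in for spinlock_t */
	struct res_counter_old *parent;
};

struct res_counter_new {	/* flags added, capacity merged into max_usage */
	unsigned long flags;
	unsigned long long usage, max_usage, limit, failcnt, timestamp;
	int lock;		/* stand-in for spinlock_t */
	struct res_counter_new *parent;
};

int main(void)
{
	/* on x86_64 this prints: old: 72, new: 64 */
	printf("old: %zu, new: %zu\n",
	       sizeof(struct res_counter_old),
	       sizeof(struct res_counter_new));
	return 0;
}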

Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-io-throttle.c | 14 ++++++++------
include/linux/res_counter.h | 40 +++++++++++++++++++++++++---------------
kernel/res_counter.c | 23 ++++++-----------------
3 files changed, 39 insertions(+), 38 deletions(-)

diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index 8dc2c93..a7edc47 100644
--- a/block/blk-io-throttle.c
+++ b/block/blk-io-throttle.c
@@ -257,10 +257,11 @@ static void iothrottle_show_limit(struct seq_file *m, dev_t dev,
{
if (!res->limit)
return;
- seq_printf(m, "%u %u %llu %llu %lli %llu %li\n",
+ /* maj min bw-limit ratelimit-policy usage bucket-size delta-time */
+ seq_printf(m, "%u %u %llu %lu %lli %llu %li\n",
MAJOR(dev), MINOR(dev),
- res->limit, res->policy,
- (long long)res->usage, res->capacity,
+ res->limit, res_counter_flagged(res, RES_COUNTER_POLICY),
+ (long long)res->usage, res->max_usage,
jiffies_to_clock_t(res_counter_ratelimit_delta_t(res)));
}

@@ -361,7 +362,7 @@ static dev_t devname2dev_t(const char *buf)
*/
static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
dev_t *dev, unsigned long long *iolimit,
- unsigned long long *strategy,
+ unsigned long *strategy,
unsigned long long *bucket_size)
{
char *p;
@@ -396,7 +397,7 @@ static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
/* throttling strategy (leaky bucket / token bucket) */
if (!s[2])
return -EINVAL;
- ret = strict_strtoull(s[2], 10, strategy);
+ ret = strict_strtoul(s[2], 10, strategy);
if (ret < 0)
return ret;
switch (*strategy) {
@@ -429,7 +430,8 @@ static int iothrottle_write(struct cgroup *cgrp, struct cftype *cft,
struct iothrottle *iot;
struct iothrottle_node *n, *newn = NULL;
dev_t dev;
- unsigned long long iolimit, strategy, bucket_size;
+ unsigned long long iolimit, bucket_size;
+ unsigned long strategy;
char *buf;
size_t nbytes = strlen(buffer);
int ret = 0;
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 9bed6af..c18cee2 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -16,6 +16,15 @@
#include <linux/cgroup.h>
#include <linux/jiffies.h>

+/*
+ * res_counter flags
+ *
+ * bit 0 -- ratelimiting policy: leaky bucket / token bucket
+ */
+#define RES_COUNTER_POLICY 0
+
+#define res_counter_flagged(rc, flag) ((rc)->flags & (1 << (flag)))
+
/* The various policies that can be used for ratelimiting resources */
#define RATELIMIT_LEAKY_BUCKET 0
#define RATELIMIT_TOKEN_BUCKET 1
@@ -23,35 +32,32 @@
/**
* struct res_counter - the core object to account cgroup resources
*
+ * @flags: resource counter attributes
* @usage: the current resource consumption level
- * @max_usage: the maximal value of the usage from the counter creation
+ * @max_usage: the maximal value of the usage from the counter creation,
+ * or the maximum capacity of the resource (for ratelimited
+ * resources)
* @limit: the limit that usage cannot be exceeded
* @failcnt: the number of unsuccessful attempts to consume the resource
- * @policy: the limiting policy / algorithm
- * @capacity: the maximum capacity of the resource
* @timestamp: timestamp of the last accounted resource request
- * @lock: the lock to protect all of the above.
- * The routines below consider this to be IRQ-safe
+ * @lock: the lock to protect all of the above
+ * @parent: Parent counter, used for hierarchial resource accounting
*
* The cgroup that wishes to account for some resource may include this counter
* into its structures and use the helpers described beyond.
*/
struct res_counter {
+ unsigned long flags;
unsigned long long usage;
unsigned long long max_usage;
unsigned long long limit;
unsigned long long failcnt;
- unsigned long long policy;
- unsigned long long capacity;
unsigned long long timestamp;
/*
* the lock to protect all of the above.
* the routines below consider this to be IRQ-safe
*/
spinlock_t lock;
- /*
- * Parent counter, used for hierarchial resource accounting
- */
struct res_counter *parent;
};

@@ -90,9 +96,7 @@ enum {
RES_USAGE,
RES_MAX_USAGE,
RES_LIMIT,
- RES_POLICY,
RES_TIMESTAMP,
- RES_CAPACITY,
RES_FAILCNT,
};

@@ -183,15 +187,21 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)

static inline int
res_counter_ratelimit_set_limit(struct res_counter *cnt,
- unsigned long long policy,
+ unsigned long policy,
unsigned long long limit, unsigned long long max)
{
unsigned long flags;

spin_lock_irqsave(&cnt->lock, flags);
cnt->limit = limit;
- cnt->capacity = max;
- cnt->policy = policy;
+ /*
+ * In ratelimited res_counter max_usage is used to save the token
+ * bucket capacity.
+ */
+ cnt->max_usage = max;
+ cnt->flags = 0;
+ if (policy == RATELIMIT_TOKEN_BUCKET)
+ set_bit(RES_COUNTER_POLICY, &cnt->flags);
cnt->timestamp = get_jiffies_64();
cnt->usage = 0;
spin_unlock_irqrestore(&cnt->lock, flags);
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 6f882c6..f6d97a2 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -21,7 +21,6 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
counter->parent = parent;
- counter->capacity = (unsigned long long)LLONG_MAX;
counter->timestamp = get_jiffies_64();
}

@@ -102,12 +101,8 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->max_usage;
case RES_LIMIT:
return &counter->limit;
- case RES_POLICY:
- return &counter->policy;
case RES_TIMESTAMP:
return &counter->timestamp;
- case RES_CAPACITY:
- return &counter->capacity;
case RES_FAILCNT:
return &counter->failcnt;
};
@@ -205,7 +200,7 @@ ratelimit_token_bucket(struct res_counter *res, ssize_t val)
res->timestamp = get_jiffies_64();
tok = (long long)res->usage * MSEC_PER_SEC;
if (delta) {
- long long max = (long long)res->capacity * MSEC_PER_SEC;
+ long long max = (long long)res->max_usage * MSEC_PER_SEC;

tok += delta * res->limit;
tok = max_t(long long, tok, max);
@@ -221,18 +216,12 @@ res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val)
unsigned long flags;

spin_lock_irqsave(&res->lock, flags);
- if (res->limit)
- switch (res->policy) {
- case RATELIMIT_LEAKY_BUCKET:
- sleep = ratelimit_leaky_bucket(res, val);
- break;
- case RATELIMIT_TOKEN_BUCKET:
+ if (res->limit) {
+ if (res_counter_flagged(res, RES_COUNTER_POLICY))
sleep = ratelimit_token_bucket(res, val);
- break;
- default:
- WARN_ON(1);
- break;
- }
+ else
+ sleep = ratelimit_leaky_bucket(res, val);
+ }
spin_unlock_irqrestore(&res->lock, flags);
return sleep;
}

2009-04-28 14:51:19

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v15 4/7] io-throttle controller infrastructure

On Tue, Apr 28, 2009 at 10:43:51AM +0200, Andrea Righi wrote:
> This is the core of the io-throttle kernel infrastructure. It creates
> the basic interfaces to the cgroup subsystem and implements the I/O
> measurement and throttling functionality.

Subject: io-throttle: correctly throttle O_DIRECT reads

There's a bug in the latest io-throttle patchset: the IO generated by
O_DIRECT reads is correctly accounted, but tasks doing direct IO are not
correctly throttled.

The following fix applies the correct behaviour, throttling the tasks
that are doing O_DIRECT reads directly, instead of delaying their IO
requests.

[ This patch must be applied on top of io-throttle v15 ]

Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-io-throttle.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index 380a21a..8dc2c93 100644
--- a/block/blk-io-throttle.c
+++ b/block/blk-io-throttle.c
@@ -803,12 +803,21 @@ cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
* cgroup. If they're different we're doing writeback IO and we can't
* throttle the current task directly.
*/
- if (!is_in_dio())
+ if (!is_in_dio()) {
+ /*
+ * We're not doing O_DIRECT: find the source of this IO
+ * request.
+ */
iot = get_iothrottle_from_bio(bio);
+ }
rcu_read_lock();
curr_iot = task_to_iothrottle(current);
- if (curr_iot != iot) {
- css_get(&curr_iot->css);
+ if (iot == NULL) {
+ /* IO occurs in the same context as the current task */
+ iot = curr_iot;
+ css_get(&iot->css);
+ }
+ if (iot != curr_iot) {
/*
+ * IO occurs in a different context from the current task
* (writeback IO).
@@ -819,10 +828,6 @@ cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
*/
can_sleep = 0;
}
- if (iot == NULL) {
- /* IO occurs in the same context of the current task */
- iot = curr_iot;
- }
/* Apply IO throttling */
iothrottle_evaluate_sleep(&s, iot, bdev, bytes);
sleep = max(s.bw_sleep, s.iops_sleep);
@@ -831,8 +836,6 @@ cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
if (unlikely(sleep && can_sleep))
iothrottle_acct_stat(iot, bdev, type, sleep);
css_put(&iot->css);
- if (curr_iot != iot)
- css_put(&curr_iot->css);
rcu_read_unlock();
if (unlikely(sleep && can_sleep)) {
/* Throttle the current task directly */