Objective
~~~~~~~~~
The objective of the io-throttle controller is to improve IO performance
predictability of different cgroups that share the same block devices.
State of the art (quick overview)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recent work by Vivek proposes a weighted BW solution, introducing fair
queuing support in the elevator layer and modifying the existing IO
schedulers to use that functionality
(https://lists.linux-foundation.org/pipermail/containers/2009-March/016129.html).
For the fair queuing part Vivek's IO controller makes use of the BFQ
code as posted by Paolo and Fabio (http://lkml.org/lkml/2008/11/11/148).
The dm-ioband controller by the valinux guys also proposes a
proportional ticket-based solution, fully implemented at the device
mapper level (http://people.valinux.co.jp/~ryov/dm-ioband/).
The bio-cgroup patch (http://people.valinux.co.jp/~ryov/bio-cgroup/) is
a BIO tracking mechanism for cgroups, implemented in the cgroup memory
subsystem. It is maintained by Ryo and it allows dm-ioband to track
writeback requests issued by kernel threads (pdflush).
Another work by Satoshi implements cgroup awareness in CFQ, mapping
per-cgroup priority to CFQ IO priorities; it also provides only
proportional BW support (http://lwn.net/Articles/306772/).
Please correct me or integrate if I missed someone or something. :)
Proposed solution
~~~~~~~~~~~~~~~~~
Compared with other priority/weight-based solutions, the approach used
by this controller is to explicitly choke applications' requests that
directly or indirectly generate IO activity in the system (this
controller addresses both synchronous IO and writeback/buffered IO).
The bandwidth and iops limiting method has the advantage of improving
performance predictability, at the cost of reducing, in general, the
overall throughput of the system.
IO throttling and accounting are performed during the submission of IO
requests and are independent of the particular IO scheduler.
Detailed information about design, goals and usage is provided in the
documentation (see [PATCH 1/7]).
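
To illustrate the idea, here is a simplified sketch of the decision taken
at IO submission time (the helper names below are made up for clarity; the
real code lives in blk-io-throttle.c and kiothrottled.c):

        /*
         * Simplified sketch only (hypothetical helpers): synchronous IO is
         * throttled by putting the submitting task to sleep, writeback IO
         * is deferred to the kiothrottled kernel thread.
         */
        static void iothrottle_submit_sketch(struct bio *bio)
        {
                if (is_synchronous_io(bio)) {
                        /* charge the current task's cgroup, sleep if needed */
                        sleep_if_limit_exceeded(bio);
                } else {
                        /*
                         * queue the bio into the kiothrottled rbtree; it will
                         * be dispatched when its deadline expires
                         */
                        queue_to_kiothrottled(bio);
                }
        }
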
Implementation
~~~~~~~~~~~~~~
Patchset against latest Linus' git:
[PATCH 0/7] cgroup: block device IO controller (v14)
[PATCH 1/7] io-throttle documentation
[PATCH 2/7] res_counter: introduce ratelimiting attributes
[PATCH 3/7] page_cgroup: provide a generic page tracking infrastructure
[PATCH 4/7] io-throttle controller infrastructure
[PATCH 5/7] kiothrottled: throttle buffered (writeback) IO
[PATCH 6/7] io-throttle instrumentation
[PATCH 7/7] export per-task io-throttle statistics to userspace
The v14 all-in-one patch, along with the previous versions, can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/
What's new
~~~~~~~~~~
In this new version I've embedded the bio-cgroup code inside
io-throttle, providing the page_cgroup page tracking infrastructure.
This completely removes the complexity and the overhead of associating
multiple IO controllers (bio-cgroup groups and io-throttle groups) from
userspace, while preserving the same tracking and throttling
functionalities for writeback IO. It is also possible to bind other
cgroup subsystems with io-throttle.
I've removed the tracking of IO generated by anonymous pages (swap) to
reduce the overhead of the page tracking functionality (and it is
probably not a good idea to delay IO requests that come from
swap-in/swap-out operations).
I've also removed the ext3 specific patch to tag journal IO with
BIO_RW_META to never throttle such IO requests.
As suggested by Ted and Jens, we need a more specific solution, where
filesystems inform the IO subsystem which IO requests come from tasks
that are holding filesystem exclusive resources (journal IO, metadata,
etc.). Then the IO subsystem (both the IO scheduler and the IO
controller) will be able to dispatch those "special" requests at the
highest priority to avoid the classic priority inversion problems.
Changelog (v13 -> v14)
~~~~~~~~~~~~~~~~~~~~~~
* implemented the bio-cgroup functionality as pure infrastructure for page
tracking capability
* removed the tracking and throttling of IO generated by anonymous pages (swap)
* updated documentation
Overall diffstat
~~~~~~~~~~~~~~~~
Documentation/cgroups/io-throttle.txt | 417 +++++++++++++++++
block/Makefile | 1 +
block/blk-core.c | 8 +
block/blk-io-throttle.c | 822 +++++++++++++++++++++++++++++++++
block/kiothrottled.c | 341 ++++++++++++++
fs/aio.c | 12 +
fs/buffer.c | 2 +
fs/proc/base.c | 18 +
include/linux/blk-io-throttle.h | 144 ++++++
include/linux/cgroup_subsys.h | 6 +
include/linux/memcontrol.h | 6 +
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 33 ++-
include/linux/res_counter.h | 69 ++-
include/linux/sched.h | 7 +
init/Kconfig | 16 +
kernel/fork.c | 7 +
kernel/res_counter.c | 72 +++
mm/Makefile | 3 +-
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 6 +
mm/page-writeback.c | 2 +
mm/page_cgroup.c | 95 ++++-
mm/readahead.c | 3 +
25 files changed, 2065 insertions(+), 33 deletions(-)
-Andrea
Documentation of the block device I/O controller: description, usage,
advantages and design.
Signed-off-by: Andrea Righi <[email protected]>
---
Documentation/cgroups/io-throttle.txt | 417 +++++++++++++++++++++++++++++++++
1 files changed, 417 insertions(+), 0 deletions(-)
create mode 100644 Documentation/cgroups/io-throttle.txt
diff --git a/Documentation/cgroups/io-throttle.txt b/Documentation/cgroups/io-throttle.txt
new file mode 100644
index 0000000..789116c
--- /dev/null
+++ b/Documentation/cgroups/io-throttle.txt
@@ -0,0 +1,417 @@
+
+ Block device I/O bandwidth controller
+
+----------------------------------------------------------------------
+1. DESCRIPTION
+
+This controller allows limiting the I/O bandwidth of specific block devices for
+specific process containers (cgroups [1]) by imposing additional delays on I/O
+requests for those processes that exceed the limits defined in the control
+group filesystem.
+
+Bandwidth limiting rules offer better control over QoS than priority or
+weight-based solutions, which only express the applications' relative
+performance requirements. Moreover, priority-based solutions are affected by
+performance bursts when only low-priority requests are submitted to a general
+purpose resource dispatcher.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability from the applications' point of view and provide performance
+isolation of different control groups sharing the same block devices.
+
+NOTE #1: If you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+NOTE #2: The current implementation does not guarantee minimum bandwidth
+levels; QoS is implemented only by slowing down I/O "traffic" that exceeds the
+limits specified by the user. Minimum I/O rate thresholds are supposed to be
+guaranteed if the user configures a proper I/O bandwidth partitioning of the
+block devices shared among the different cgroups (theoretically, if the sum of
+all the single limits defined for a block device doesn't exceed the total I/O
+bandwidth of that device).
+
+----------------------------------------------------------------------
+2. USER INTERFACE
+
+2.1. Set I/O limiting rules
+
+A new I/O limitation rule is described using the files:
+- blockio.bandwidth-max
+- blockio.iops-max
+
+The I/O bandwidth file (blockio.bandwidth-max) can be used to limit the
+throughput of a certain cgroup, while blockio.iops-max can be used to throttle
+cgroups containing applications doing a sparse/seeky I/O workload. Any
+combination of them can be used to define more complex I/O limiting rules,
+expressed both in terms of iops and bandwidth.
+
+The same files can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+The following syntax can be used to configure any limiting rule:
+
+# /bin/echo DEV:LIMIT:STRATEGY:BUCKET_SIZE > CGROUP/FILE
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- LIMIT is the maximum I/O activity allowed on DEV by CGROUP; LIMIT can
+  represent a bandwidth limitation (expressed in bytes/s) when writing to
+  blockio.bandwidth-max, or a limitation of the maximum I/O operations per
+  second (expressed in iops) issued by CGROUP.
+
+  A generic I/O limiting rule for a block device DEV can be removed by setting
+  LIMIT to 0.
+
+- STRATEGY is the throttling strategy used to throttle the applications' I/O
+ requests from/to device DEV. At the moment two different strategies can be
+ used [2][3]:
+
+  0 = leaky bucket: the controller accepts at most B bytes (B = LIMIT * time)
+                    or O operations (O = LIMIT * time); further I/O requests
+                    are delayed by scheduling a timeout for the tasks that
+                    made those requests.
+
+                     Different I/O flow
+                           |  |  |
+                           |  v  |
+                           |     v
+                           v
+                           .......
+                           \     /
+                            \   / leaky-bucket
+                             ---
+                             |||
+                             vvv
+                      Smoothed I/O flow
+
+  1 = token bucket: LIMIT tokens are added to the bucket every second; the
+                    bucket can hold at most BUCKET_SIZE tokens; I/O requests
+                    are accepted if there are available tokens in the bucket;
+                    when a request of N bytes arrives, N tokens are removed
+                    from the bucket; if fewer than N tokens are available,
+                    the request is delayed until a sufficient amount of
+                    tokens is available in the bucket.
+
+                       Tokens (I/O rate)
+                               o
+                               o
+                               o
+                            ....... <--.
+                            \     /    | Bucket size (burst limit)
+                             \ooo/     |
+                              ---   <--'
+                               |ooo
+                  Incoming --->|---> Conforming
+                  I/O          |oo   I/O
+                  requests  -->|-->  requests
+                               |
+                          ---->|
+
+  Leaky bucket respects the limits more precisely than token bucket, because
+  bursty workloads are always smoothed. Token bucket, instead, allows a small
+  degree of irregularity in the I/O flows (the burst limit) and is therefore
+  more efficient (bursty workloads are not smoothed when there are sufficient
+  tokens in the bucket).
+
+- BUCKET_SIZE is used only with token bucket (STRATEGY == 1) and defines the
+ size of the bucket in bytes (blockio.bandwidth-max) or in I/O operations
+ (blockio.iops-max).
+
+- CGROUP is the name of the limited process container.
+
+The following syntaxes are also allowed:
+
+- remove an I/O bandwidth limiting rule
+# /bin/echo DEV:0 > CGROUP/blockio.bandwidth-max
+
+- configure a limiting rule using leaky bucket throttling (ignore bucket size):
+# /bin/echo DEV:LIMIT:0 > CGROUP/blockio.bandwidth-max
+
+- configure a limiting rule using token bucket throttling
+ (with bucket size == LIMIT):
+# /bin/echo DEV:LIMIT:1 > CGROUP/blockio.bandwidth-max
+
+2.2. Show I/O limiting rules
+
+All the defined rules and statistics for a specific cgroup can be shown by
+reading the files blockio.bandwidth-max (for bandwidth constraints) and
+blockio.iops-max (for I/O operations per second constraints).
+
+The following syntax is used:
+
+$ cat CGROUP/blockio.bandwidth-max
+MAJOR MINOR LIMIT STRATEGY LEAKY_STAT BUCKET_SIZE BUCKET_FILL TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- LIMIT, STRATEGY and BUCKET_SIZE are the same parameters defined above
+
+- LEAKY_STAT is the number of bytes (blockio.bandwidth-max) or I/O operations
+  (blockio.iops-max) currently allowed by the I/O controller (only used with
+  the leaky bucket strategy - STRATEGY == 0)
+
+- BUCKET_FILL represents the number of tokens present in the bucket (only used
+  with the token bucket strategy - STRATEGY == 1)
+
+- TIME_DELTA can be one of the following:
+  - the number of jiffies elapsed since the last I/O request (token bucket)
+  - the number of jiffies during which the bytes or the number of I/O
+    operations given by LEAKY_STAT have been accumulated (leaky bucket)
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 .. n):
+
+$ cat CGROUP/blockio.bandwidth-max
+MAJOR1 MINOR1 BW1 STRATEGY1 LEAKY_STAT1 BUCKET_SIZE1 BUCKET_FILL1 TIME_DELTA1
+MAJOR1 MINOR1 BW2 STRATEGY2 LEAKY_STAT2 BUCKET_SIZE2 BUCKET_FILL2 TIME_DELTA2
+...
+MAJORn MINORn BWn STRATEGYn LEAKY_STATn BUCKET_SIZEn BUCKET_FILLn TIME_DELTAn
+
+The same fields are used to describe I/O operations/sec rules. The only
+difference is that the cost of each I/O operation is scaled up by a factor of
+1000. This allows applying finer-grained sleeps and provides more precise
+throttling.
+
+$ cat CGROUP/blockio.iops-max
+MAJOR MINOR LIMITx1000 STRATEGY LEAKY_STATx1000 BUCKET_SIZEx1000 BUCKET_FILLx1000 TIME_DELTA
+...
+
+2.3. Additional I/O statistics
+
+Additional cgroup I/O throttling statistics are reported in
+blockio.throttlecnt:
+
+$ cat CGROUP/blockio.throttlecnt
+MAJOR MINOR BW_COUNTER BW_SLEEP IOPS_COUNTER IOPS_SLEEP
+
+ - MAJOR, MINOR are respectively the major and the minor number of the device
+   the following statistics refer to
+ - BW_COUNTER gives the number of times that the cgroup bandwidth limit of
+   this particular device was exceeded
+ - BW_SLEEP is the amount of sleep time, measured in clock ticks (divide
+   by sysconf(_SC_CLK_TCK)), imposed on the processes of this cgroup that
+   exceeded the bandwidth limit for this particular device
+ - IOPS_COUNTER gives the number of times that the cgroup I/O operations per
+   second limit of this particular device was exceeded
+ - IOPS_SLEEP is the amount of sleep time, measured in clock ticks (divide
+   by sysconf(_SC_CLK_TCK)), imposed on the processes of this cgroup that
+   exceeded the I/O operations per second limit for this particular device
+
+Example:
+$ cat CGROUP/blockio.throttlecnt
+8 0 0 0 0 0
+^ ^ ^ ^ ^ ^
+ \ \ \ \ \ \___iops sleep (in clock ticks)
+  \ \ \ \ \____iops throttle counter
+   \ \ \ \_____bandwidth sleep (in clock ticks)
+    \ \ \______bandwidth throttle counter
+     \ \_______minor dev. number
+      \________major dev. number
+
+2.4. Per-task statistics
+
+Distinct statistics for each process are reported in
+/proc/PID/io-throttle-stat:
+
+$ cat /proc/PID/io-throttle-stat
+BW_COUNTER BW_SLEEP IOPS_COUNTER IOPS_SLEEP
+
+Example:
+$ cat /proc/$$/io-throttle-stat
+0 0 0 0
+^ ^ ^ ^
+ \ \ \ \_____global iops sleep (in clock ticks)
+  \ \ \______global iops counter
+   \ \_______global bandwidth sleep (clock ticks)
+    \________global bandwidth counter
+
+2.5. Generic usage examples
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda for the cgroup "foo", using
+ leaky bucket throttling strategy:
+ # /bin/echo /dev/sda:$((1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo", using
+ token bucket throttling strategy, bucket size = 8MiB:
+ # /bin/echo /dev/sdb:$((8 * 1024 * 1024)):1:$((8 * 1024 * 1024)) > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda (controlled by leaky bucket throttling)
+ and 8MiB/s on /dev/sdb (controlled by token bucket throttling)
+
+* Run a benchmark doing I/O on /dev/sda and /dev/sdb; I/O limits and usage
+  defined for cgroup "foo" can be shown as follows:
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 16 8388608 1 0 8388608 -522560 48
+ 8 0 1048576 0 737280 0 0 216
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 16MiB/s on /dev/sda:
+ # /bin/echo /dev/sda:$((16 * 1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 16 8388608 1 0 8388608 -84432 206436
+ 8 0 16777216 0 0 0 0 15212
+
+* Remove limiting rule on /dev/sdb for cgroup "foo":
+ # /bin/echo /dev/sdb:0:0:0 > /mnt/cgroup/foo/blockio.bandwidth-max
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 0 16777216 0 0 0 0 110388
+
+* Set a maximum of 100 I/O operations/sec (leaky bucket strategy) on /dev/sdc
+  for cgroup "foo":
+ # /bin/echo /dev/sdc:100:0 > /mnt/cgroup/foo/blockio.iops-max
+ # cat /mnt/cgroup/foo/blockio.iops-max
+  8 32 100000 0 846000 0 2113
+         ^        ^
+         \________/
+             |
+  Remember: these values are scaled up by a factor of 1000 to apply
+  fine-grained throttling (i.e. LIMIT == 100000 means a maximum of 100 I/O
+  operations per second)
+
+* Remove limiting rule for I/O operations from /dev/sdc for cgroup "foo":
+ # /bin/echo /dev/sdc:0 > /mnt/cgroup/foo/blockio.iops-max
+
+----------------------------------------------------------------------
+3. ADVANTAGES OF PROVIDING THIS FEATURE
+
+* Allow I/O traffic shaping for block devices shared among different cgroups
+* Improve I/O performance predictability on block devices shared between
+  different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+  deadline, CFQ, noop) and/or the type of the underlying block devices
+* The bandwidth limitations are guaranteed both for synchronous and
+  asynchronous operations, including the I/O passing through the page cache
+  or buffers and not only direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+  adjust the I/O workload of different process containers at run-time,
+  according to the particular users' requirements and applications'
+  performance constraints
+
+----------------------------------------------------------------------
+4. DESIGN
+
+The I/O throttling is performed by imposing an explicit timeout on the
+processes that exceed the I/O limits of the cgroup they belong to. I/O
+accounting happens per cgroup.
+
+Only the actual I/O that flows to the block devices is considered. Multiple
+re-reads of pages already present in the page cache, as well as re-writes of
+dirty pages, are not accounted or throttled, since they don't actually
+generate any real I/O operation.
+
+This means that a process that re-reads or re-writes the same blocks of a file
+multiple times is affected by the I/O limitations only for the actual I/O
+performed from/to the underlying block devices.
+
+4.1. Synchronous I/O tracking and throttling
+
+The io-throttle controller works as expected for synchronous (read and write)
+operations: the real I/O activity is reduced synchronously according to the
+defined limitations.
+
+If the operation is synchronous we automatically know that the context of the
+request is the current task, so we can charge the cgroup the current task
+belongs to and throttle the task itself if it exceeded the cgroup limits.
+
+4.2. Buffered I/O (write-back) tracking
+
+For buffered writes the scenario is a bit more complex, because the writes in
+the page cache are processed asynchronously by kernel threads (pdflush), using
+a write-back policy. So the real writes to the underlying block devices occur
+in a different I/O context with respect to the task that originally generated
+the dirty pages.
+
+The I/O bandwidth controller uses the following solution to resolve this
+problem.
+
+If the operation is a buffered write, we can charge the right cgroup by looking
+at the owner of the first page involved in the I/O operation, which identifies
+the context that generated the I/O activity at its source. This information can
+be retrieved using the page_cgroup functionality originally provided by the
+cgroup memory controller [4], and now by a modified version of the bio-cgroup
+controller [5] that embeds the page tracking feature directly into the
+io-throttle controller.
+
+The page_cgroup structure is used to encode the owner of each struct page: this
+information is encoded in page_cgroup->flags. An owner is characterized by a
+numeric ID: the io-throttle css_id(). The owner of a page is set when the page
+is dirtied or added to the page cache. At the moment I/O generated by anonymous
+pages (swap) is not considered by the io-throttle controller.
+
+In this way we can correctly account the I/O cost to the right cgroup, but we
+cannot throttle the current task at this stage because, in general, it is a
+different task (e.g., pdflush, which is asynchronously processing the dirty
+page).
+
+For this reason, all the write-back requests that are not directly submitted by
+the real owner and that need to be throttled are not dispatched immediately in
+submit_bio(). Instead, they are added into an rbtree and processed
+asynchronously by a dedicated kernel thread: kiothrottled.
+
+A deadline is associated with each throttled write-back request, depending on
+the bandwidth usage of the cgroup it belongs to. When a request is inserted
+into the rbtree kiothrottled is awakened. This thread periodically selects all
+the requests with an expired deadline and submits them to the underlying block
+devices using generic_make_request().
+
+4.3. Per-block device IO limiting rules
+
+Multiple rules for different block devices are stored in a linked list, using
+the dev_t number of each block device as the key to uniquely identify each
+element of the list. RCU synchronization is used to protect the whole list
+structure, since the elements in the list are not supposed to change frequently
+(they change only when a new rule is defined or an old rule is removed or
+updated), while the reads in the list occur at each operation that generates
+I/O. This provides zero overhead for cgroups that do not use any limitation.
+
+WARNING: per-block device limiting rules always refer to the dev_t device
+number. If a block device is unplugged (e.g. a USB device) the limiting rules
+defined for that device persist and they are still valid if a new device is
+plugged into the system and it uses the same major and minor numbers.
+
+4.4. Asynchronous I/O (AIO) handling
+
+Explicit sleeps are *not* imposed on tasks doing asynchronous I/O (AIO)
+operations; AIO throttling is performed by returning -EAGAIN from
+sys_io_submit(). Userspace applications must be able to handle this error code
+appropriately.
+
+----------------------------------------------------------------------
+5. TODO
+
+* Support proportional I/O bandwidth for an optimal bandwidth usage. For
+  example, use the kiothrottled rbtree: all the requests queued to the I/O
+  subsystem first go into the rbtree; then, based on a per-cgroup I/O priority
+  and feedback from the I/O schedulers, dispatch the requests to the elevator.
+  This would allow providing both bandwidth limiting and proportional bandwidth
+  functionalities using a generic approach.
+
+* Implement a fair throttling policy: distribute the time to sleep equally
+  among all the tasks of a cgroup that exceeded the I/O limits, e.g., depending
+  on the amount of I/O activity previously generated by each task (see
+  task_io_accounting).
+
+----------------------------------------------------------------------
+6. REFERENCES
+
+[1] Documentation/cgroups/cgroups.txt
+[2] http://en.wikipedia.org/wiki/Leaky_bucket
+[3] http://en.wikipedia.org/wiki/Token_bucket
+[4] Documentation/controllers/memory.txt
+[5] http://people.valinux.co.jp/~ryov/bio-cgroup
--
1.5.6.3
Introduce attributes and functions in res_counter to implement throttling-based
cgroup subsystems.
The following attributes have been added to struct res_counter:
* @policy: the limiting policy / algorithm
* @capacity: the maximum capacity of the resource
* @timestamp: timestamp of the last accounted resource request
Currently the available policies are token bucket and leaky bucket; the
@capacity attribute is only used by the token-bucket policy (to represent the
bucket size).
The following function has been implemented to return the amount of time a
cgroup should sleep to remain within the defined resource limits.
unsigned long long
res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
[ Note: only the interfaces needed by the cgroup IO controller are implemented
right now ]
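A minimal usage sketch (hypothetical caller, not part of this patch), roughly
mirroring what the IO controller does with this interface:

        /* configure the counter once: leaky bucket limited to 1 MiB/s */
        static void ratelimit_setup(struct res_counter *bw)
        {
                res_counter_init(bw, NULL);
                /* @capacity (last argument) is only used by token bucket */
                res_counter_ratelimit_set_limit(bw, RATELIMIT_LEAKY_BUCKET,
                                                1024 * 1024, 0);
        }

        /* charge 'bytes' against the limit and sleep if it was exceeded */
        static void ratelimit_charge(struct res_counter *bw, ssize_t bytes)
        {
                unsigned long long sleep;

                /* number of jiffies to sleep to remain within the limit */
                sleep = res_counter_ratelimit_sleep(bw, bytes);
                if (sleep)
                        schedule_timeout_killable(sleep);
        }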
Signed-off-by: Andrea Righi <[email protected]>
---
include/linux/res_counter.h | 69 +++++++++++++++++++++++++++++++----------
kernel/res_counter.c | 72 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 124 insertions(+), 17 deletions(-)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 4c5bcf6..9bed6af 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -14,30 +14,36 @@
*/
#include <linux/cgroup.h>
+#include <linux/jiffies.h>
-/*
- * The core object. the cgroup that wishes to account for some
- * resource may include this counter into its structures and use
- * the helpers described beyond
- */
+/* The various policies that can be used for ratelimiting resources */
+#define RATELIMIT_LEAKY_BUCKET 0
+#define RATELIMIT_TOKEN_BUCKET 1
+/**
+ * struct res_counter - the core object to account cgroup resources
+ *
+ * @usage: the current resource consumption level
+ * @max_usage: the maximal value of the usage from the counter creation
+ * @limit: the limit that usage cannot be exceeded
+ * @failcnt: the number of unsuccessful attempts to consume the resource
+ * @policy: the limiting policy / algorithm
+ * @capacity: the maximum capacity of the resource
+ * @timestamp: timestamp of the last accounted resource request
+ * @lock: the lock to protect all of the above.
+ * The routines below consider this to be IRQ-safe
+ *
+ * The cgroup that wishes to account for some resource may include this counter
+ * into its structures and use the helpers described beyond.
+ */
struct res_counter {
- /*
- * the current resource consumption level
- */
unsigned long long usage;
- /*
- * the maximal value of the usage from the counter creation
- */
unsigned long long max_usage;
- /*
- * the limit that usage cannot exceed
- */
unsigned long long limit;
- /*
- * the number of unsuccessful attempts to consume the resource
- */
unsigned long long failcnt;
+ unsigned long long policy;
+ unsigned long long capacity;
+ unsigned long long timestamp;
/*
* the lock to protect all of the above.
* the routines below consider this to be IRQ-safe
@@ -84,6 +90,9 @@ enum {
RES_USAGE,
RES_MAX_USAGE,
RES_LIMIT,
+ RES_POLICY,
+ RES_TIMESTAMP,
+ RES_CAPACITY,
RES_FAILCNT,
};
@@ -130,6 +139,15 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
return false;
}
+static inline unsigned long long
+res_counter_ratelimit_delta_t(struct res_counter *res)
+{
+ return (long long)get_jiffies_64() - (long long)res->timestamp;
+}
+
+unsigned long long
+res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
+
/*
* Helper function to detect if the cgroup is within it's limit or
* not. It's currently called from cgroup_rss_prepare()
@@ -163,6 +181,23 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)
spin_unlock_irqrestore(&cnt->lock, flags);
}
+static inline int
+res_counter_ratelimit_set_limit(struct res_counter *cnt,
+ unsigned long long policy,
+ unsigned long long limit, unsigned long long max)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->limit = limit;
+ cnt->capacity = max;
+ cnt->policy = policy;
+ cnt->timestamp = get_jiffies_64();
+ cnt->usage = 0;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
static inline int res_counter_set_limit(struct res_counter *cnt,
unsigned long long limit)
{
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index bf8e753..b62319c 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -9,6 +9,7 @@
#include <linux/types.h>
#include <linux/parser.h>
+#include <linux/jiffies.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/res_counter.h>
@@ -20,6 +21,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
counter->parent = parent;
+ counter->capacity = (unsigned long long)LLONG_MAX;
+ counter->timestamp = get_jiffies_64();
}
int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -99,6 +102,12 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->max_usage;
case RES_LIMIT:
return &counter->limit;
+ case RES_POLICY:
+ return &counter->policy;
+ case RES_TIMESTAMP:
+ return &counter->timestamp;
+ case RES_CAPACITY:
+ return &counter->capacity;
case RES_FAILCNT:
return &counter->failcnt;
};
@@ -163,3 +172,66 @@ int res_counter_write(struct res_counter *counter, int member,
spin_unlock_irqrestore(&counter->lock, flags);
return 0;
}
+
+static unsigned long long
+ratelimit_leaky_bucket(struct res_counter *res, ssize_t val)
+{
+ unsigned long long delta, t;
+
+ res->usage += val;
+ delta = res_counter_ratelimit_delta_t(res);
+ if (!delta)
+ return 0;
+ t = res->usage * USEC_PER_SEC;
+ t = usecs_to_jiffies(div_u64(t, res->limit));
+ if (t > delta)
+ return t - delta;
+ /* Reset i/o statistics */
+ res->usage = 0;
+ res->timestamp = get_jiffies_64();
+ return 0;
+}
+
+static unsigned long long
+ratelimit_token_bucket(struct res_counter *res, ssize_t val)
+{
+ unsigned long long delta;
+ long long tok;
+
+ res->usage -= val;
+ delta = jiffies_to_msecs(res_counter_ratelimit_delta_t(res));
+ res->timestamp = get_jiffies_64();
+ tok = (long long)res->usage * MSEC_PER_SEC;
+ if (delta) {
+ long long max = (long long)res->capacity * MSEC_PER_SEC;
+
+ tok += delta * res->limit;
+ if (tok > max)
+ tok = max;
+ res->usage = (unsigned long long)div_s64(tok, MSEC_PER_SEC);
+ }
+ return (tok < 0) ? msecs_to_jiffies(div_u64(-tok, res->limit)) : 0;
+}
+
+unsigned long long
+res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val)
+{
+ unsigned long long sleep = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&res->lock, flags);
+ if (res->limit)
+ switch (res->policy) {
+ case RATELIMIT_LEAKY_BUCKET:
+ sleep = ratelimit_leaky_bucket(res, val);
+ break;
+ case RATELIMIT_TOKEN_BUCKET:
+ sleep = ratelimit_token_bucket(res, val);
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+ spin_unlock_irqrestore(&res->lock, flags);
+ return sleep;
+}
--
1.5.6.3
Dirty pages in the page cache can be processed asynchronously by kernel
threads (pdflush) using a writeback policy. For this reason the real
writes to the underlying block devices occur in a different IO context
with respect to the task that originally generated the dirty pages
involved in the IO operation. This makes the tracking and throttling of
writeback IO more complicated than that of synchronous IO.
The page_cgroup infrastructure, currently available only for the memory
cgroup controller, can be used to store the owner of each page and
properly track the writeback IO. This information is encoded in
page_cgroup->flags.
An owner can be identified using a generic ID number and the following
interfaces are provided to store and retrieve this information:
unsigned long page_cgroup_get_owner(struct page *page);
int page_cgroup_set_owner(struct page *page, unsigned long id);
int page_cgroup_copy_owner(struct page *npage, struct page *opage);
The io-throttle controller uses the cgroup css_id() as the owner's ID
number.
A big part of this code is taken from Ryo and Hirokazu's bio-cgroup
controller (http://people.valinux.co.jp/~ryov/bio-cgroup/).
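As a usage sketch (the helper names below are made up for illustration), a
controller can tag a page with the css_id() of its cgroup when the page is
dirtied or added to the page cache, and look the owner up again at writeback
time to charge the right cgroup:

        /* tag 'page' with the ID of the cgroup identified by 'css' */
        static int tag_page_owner(struct page *page,
                                  struct cgroup_subsys_state *css)
        {
                /* 0 on success, -ENOENT if the page has no page_cgroup */
                return page_cgroup_set_owner(page, css_id(css));
        }

        /* retrieve the ID stored above (0 if the owner cannot be found) */
        static unsigned long lookup_page_owner(struct page *page)
        {
                return page_cgroup_get_owner(page);
        }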
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>
---
include/linux/memcontrol.h | 6 +++
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 33 +++++++++++++-
init/Kconfig | 4 ++
mm/Makefile | 3 +-
mm/memcontrol.c | 6 +++
mm/page_cgroup.c | 95 ++++++++++++++++++++++++++++++++++++++-----
7 files changed, 135 insertions(+), 16 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..f3e0e64 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
* (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
*/
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
@@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..b178eb9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -607,7 +607,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_PAGE_TRACKING
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -958,7 +958,7 @@ struct mem_section {
/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_PAGE_TRACKING
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..f24d081 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_PAGE_TRACKING
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -12,11 +12,38 @@
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct mem_cgroup *mem_cgroup;
struct list_head lru; /* per cgroup LRU list */
+#endif
};
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PAGE_TRACKING_ID_SHIFT (16)
+#define PAGE_TRACKING_ID_BITS \
+ (8 * sizeof(unsigned long) - PAGE_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with page_cgroup() held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+ return pc->flags >> PAGE_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with page_cgroup() held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+ WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
+ pc->flags &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
+ pc->flags |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
+}
+
+unsigned long page_cgroup_get_owner(struct page *page);
+int page_cgroup_set_owner(struct page *page, unsigned long id);
+int page_cgroup_copy_owner(struct page *npage, struct page *opage);
+
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
void __init page_cgroup_init(void);
struct page_cgroup *lookup_page_cgroup(struct page *page);
@@ -71,7 +98,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
bit_spin_unlock(PCG_LOCK, &pc->flags);
}
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_PAGE_TRACKING */
struct page_cgroup;
static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..5428ac7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -569,6 +569,7 @@ config CGROUP_MEM_RES_CTLR
bool "Memory Resource Controller for Control Groups"
depends on CGROUPS && RESOURCE_COUNTERS
select MM_OWNER
+ select PAGE_TRACKING
help
Provides a memory resource controller that manages both anonymous
memory and page cache. (See Documentation/cgroups/memory.txt)
@@ -611,6 +612,9 @@ endif # CGROUPS
config MM_OWNER
bool
+config PAGE_TRACKING
+ bool
+
config SYSFS_DEPRECATED
bool
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..b94e074 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,4 +37,5 @@ else
obj-$(CONFIG_SMP) += allocpercpu.o
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_PAGE_TRACKING) += page_cgroup.o
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..69d1c31 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2524,6 +2524,12 @@ struct cgroup_subsys mem_cgroup_subsys = {
.use_id = 1,
};
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+ INIT_LIST_HEAD(&pc->lru);
+}
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
static int __init disable_swap_account(char *s)
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 791905c..b3b394c 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -3,6 +3,7 @@
#include <linux/bootmem.h>
#include <linux/bit_spinlock.h>
#include <linux/page_cgroup.h>
+#include <linux/blk-io-throttle.h>
#include <linux/hash.h>
#include <linux/slab.h>
#include <linux/memory.h>
@@ -14,9 +15,8 @@ static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
- INIT_LIST_HEAD(&pc->lru);
+ __init_mem_page_cgroup(pc);
}
static unsigned long total_usage;
@@ -74,7 +74,7 @@ void __init page_cgroup_init(void)
int nid, fail;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && iothrottle_disabled())
return;
for_each_online_node(nid) {
@@ -83,12 +83,13 @@ void __init page_cgroup_init(void)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you"
- " don't want\n");
+ printk(KERN_INFO
+ "try cgroup_disable=memory,blockio option if you don't want\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
- printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+ printk(KERN_CRIT
+ "try cgroup_disable=memory,blockio boot option\n");
panic("Out of memory");
}
@@ -243,12 +244,85 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
#endif
+/**
+ * page_cgroup_get_owner() - get the owner ID of a page
+ * @page: the page whose owner we want to find
+ *
+ * Returns the owner ID of the page; 0 means that the owner cannot be
+ * retrieved.
+ **/
+unsigned long page_cgroup_get_owner(struct page *page)
+{
+ struct page_cgroup *pc;
+ unsigned long ret;
+
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return 0;
+
+ lock_page_cgroup(pc);
+ ret = page_cgroup_get_id(pc);
+ unlock_page_cgroup(pc);
+ return ret;
+}
+
+/**
+ * page_cgroup_set_owner() - set the owner ID of a page
+ * @page: the page we want to tag
+ * @id: the ID number that will be associated with the page
+ *
+ * Returns 0 if the owner is correctly associated with the page. Returns a
+ * negative value in case of failure.
+ **/
+int page_cgroup_set_owner(struct page *page, unsigned long id)
+{
+ struct page_cgroup *pc;
+
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return -ENOENT;
+
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, id);
+ unlock_page_cgroup(pc);
+ return 0;
+}
+
+/**
+ * page_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage: the page where we want to copy the owner
+ * @opage: the page from which we want to copy the ID
+ *
+ * Returns 0 if the owner is correctly associated with npage. Returns a negative
+ * value in case of failure.
+ **/
+int page_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+ unsigned long id;
+
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return -ENOENT;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return -ENOENT;
+ lock_page_cgroup(opc);
+ lock_page_cgroup(npc);
+ id = page_cgroup_get_id(opc);
+ page_cgroup_set_id(npc, id);
+ unlock_page_cgroup(npc);
+ unlock_page_cgroup(opc);
+
+ return 0;
+}
+
void __init page_cgroup_init(void)
{
unsigned long pfn;
int fail = 0;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && iothrottle_disabled())
return;
for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -257,14 +331,15 @@ void __init page_cgroup_init(void)
fail = init_section_page_cgroup(pfn);
}
if (fail) {
- printk(KERN_CRIT "try cgroup_disable=memory boot option\n");
+ printk(KERN_CRIT
+ "try cgroup_disable=memory,blockio boot option\n");
panic("Out of memory");
} else {
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
- " want\n");
+ printk(KERN_INFO
+ "try cgroup_disable=memory,blockio option if you don't want\n");
}
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
--
1.5.6.3
This is the core of the io-throttle kernel infrastructure. It creates
the basic interfaces to the cgroup subsystem and implements the I/O
measurement and throttling functionality.
Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
---
block/Makefile | 1 +
block/blk-io-throttle.c | 822 +++++++++++++++++++++++++++++++++++++++
include/linux/blk-io-throttle.h | 144 +++++++
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 12 +
5 files changed, 985 insertions(+), 0 deletions(-)
create mode 100644 block/blk-io-throttle.c
create mode 100644 include/linux/blk-io-throttle.h
diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..42b6a46 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,5 +13,6 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..c8214fc
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,822 @@
+/*
+ * blk-io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <[email protected]>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/res_counter.h>
+#include <linux/memcontrol.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/genhd.h>
+#include <linux/hardirq.h>
+#include <linux/list.h>
+#include <linux/seq_file.h>
+#include <linux/spinlock.h>
+#include <linux/blk-io-throttle.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+#include <linux/sched.h>
+#include <linux/bio.h>
+
+/*
+ * Statistics for I/O bandwidth controller.
+ */
+enum iothrottle_stat_index {
+ /* # of times the cgroup has been throttled for bw limit */
+ IOTHROTTLE_STAT_BW_COUNT,
+ /* # of jiffies spent to sleep for throttling for bw limit */
+ IOTHROTTLE_STAT_BW_SLEEP,
+ /* # of times the cgroup has been throttled for iops limit */
+ IOTHROTTLE_STAT_IOPS_COUNT,
+ /* # of jiffies spent to sleep for throttling for iops limit */
+ IOTHROTTLE_STAT_IOPS_SLEEP,
+ /* total number of bytes read and written */
+ IOTHROTTLE_STAT_BYTES_TOT,
+ /* total number of I/O operations */
+ IOTHROTTLE_STAT_IOPS_TOT,
+
+ IOTHROTTLE_STAT_NSTATS,
+};
+
+struct iothrottle_stat_cpu {
+ unsigned long long count[IOTHROTTLE_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct iothrottle_stat {
+ struct iothrottle_stat_cpu cpustat[NR_CPUS];
+};
+
+static void iothrottle_stat_add(struct iothrottle_stat *stat,
+ enum iothrottle_stat_index type, unsigned long long val)
+{
+ int cpu = get_cpu();
+
+ stat->cpustat[cpu].count[type] += val;
+ put_cpu();
+}
+
+static void iothrottle_stat_add_sleep(struct iothrottle_stat *stat,
+ int type, unsigned long long sleep)
+{
+ int cpu = get_cpu();
+
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_COUNT]++;
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_SLEEP] += sleep;
+ break;
+ case IOTHROTTLE_IOPS:
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_COUNT]++;
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_SLEEP] += sleep;
+ break;
+ }
+ put_cpu();
+}
+
+static unsigned long long iothrottle_read_stat(struct iothrottle_stat *stat,
+ enum iothrottle_stat_index idx)
+{
+ int cpu;
+ unsigned long long ret = 0;
+
+ for_each_possible_cpu(cpu)
+ ret += stat->cpustat[cpu].count[idx];
+ return ret;
+}
+
+struct iothrottle_sleep {
+ unsigned long long bw_sleep;
+ unsigned long long iops_sleep;
+};
+
+/*
+ * struct iothrottle_node - throttling rule of a single block device
+ * @node: list of per block device throttling rules
+ * @dev: block device number, used as key in the list
+ * @bw: max i/o bandwidth (in bytes/s)
+ * @iops: max i/o operations per second
+ * @stat: throttling statistics
+ *
+ * Define an i/o throttling rule for a single block device.
+ *
+ * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
+ * the limiting rules defined for that device persist and they are still valid
+ * if a new device is plugged and it uses the same dev_t number.
+ */
+struct iothrottle_node {
+ struct list_head node;
+ dev_t dev;
+ struct res_counter bw;
+ struct res_counter iops;
+ struct iothrottle_stat stat;
+};
+
+/**
+ * struct iothrottle - throttling rules for a cgroup
+ * @css: pointer to the cgroup state
+ * @list: list of iothrottle_node elements
+ *
+ * Define multiple per-block device i/o throttling rules.
+ * Note: the list of the throttling rules is protected by RCU locking:
+ * - hold cgroup_lock() for update.
+ * - hold rcu_read_lock() for read.
+ */
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ struct list_head list;
+};
+static struct iothrottle init_iothrottle;
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() held.
+ */
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() or iot->lock held.
+ */
+static struct iothrottle_node *
+iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ if (list_empty(&iot->list))
+ return NULL;
+ list_for_each_entry_rcu(n, &iot->list, node)
+ if (n->dev == dev)
+ return n;
+ return NULL;
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void iothrottle_insert_node(struct iothrottle *iot,
+ struct iothrottle_node *n)
+{
+ list_add_rcu(&n->node, &iot->list);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
+ struct iothrottle_node *new)
+{
+ list_replace_rcu(&old->node, &new->node);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
+{
+ list_del_rcu(&n->node);
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *
+iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle *iot;
+
+ if (unlikely((cgrp->parent) == NULL)) {
+ iot = &init_iothrottle;
+ } else {
+ iot = kzalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+ }
+ INIT_LIST_HEAD(&iot->list);
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle_node *n, *p;
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+
+ free_css_id(&iothrottle_subsys, &iot->css);
+ /*
+ * don't worry about locking here, at this point there must not be any
+ * reference to the list.
+ */
+ if (!list_empty(&iot->list))
+ list_for_each_entry_safe(n, p, &iot->list, node)
+ kfree(n);
+ kfree(iot);
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ *
+ * do not care too much about locking for single res_counter values here.
+ */
+static void iothrottle_show_limit(struct seq_file *m, dev_t dev,
+ struct res_counter *res)
+{
+ if (!res->limit)
+ return;
+ seq_printf(m, "%u %u %llu %llu %lli %llu %li\n",
+ MAJOR(dev), MINOR(dev),
+ res->limit, res->policy,
+ (long long)res->usage, res->capacity,
+ jiffies_to_clock_t(res_counter_ratelimit_delta_t(res)));
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ *
+ */
+static void iothrottle_show_failcnt(struct seq_file *m, dev_t dev,
+ struct iothrottle_stat *stat)
+{
+ unsigned long long bw_count, bw_sleep, iops_count, iops_sleep;
+
+ bw_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_COUNT);
+ bw_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_SLEEP);
+ iops_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_COUNT);
+ iops_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_SLEEP);
+
+ seq_printf(m, "%u %u %llu %li %llu %li\n", MAJOR(dev), MINOR(dev),
+ bw_count, jiffies_to_clock_t(bw_sleep),
+ iops_count, jiffies_to_clock_t(iops_sleep));
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_show_stat(struct seq_file *m, dev_t dev,
+ struct iothrottle_stat *stat)
+{
+ unsigned long long bytes, iops;
+
+ bytes = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BYTES_TOT);
+ iops = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_TOT);
+
+ seq_printf(m, "%u %u %llu %llu\n", MAJOR(dev), MINOR(dev), bytes, iops);
+}
+
+static int iothrottle_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+ struct iothrottle_node *n;
+
+ rcu_read_lock();
+ if (list_empty(&iot->list))
+ goto unlock_and_return;
+ list_for_each_entry_rcu(n, &iot->list, node) {
+ BUG_ON(!n->dev);
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ iothrottle_show_limit(m, n->dev, &n->bw);
+ break;
+ case IOTHROTTLE_IOPS:
+ iothrottle_show_limit(m, n->dev, &n->iops);
+ break;
+ case IOTHROTTLE_FAILCNT:
+ iothrottle_show_failcnt(m, n->dev, &n->stat);
+ break;
+ case IOTHROTTLE_STAT:
+ iothrottle_show_stat(m, n->dev, &n->stat);
+ break;
+ }
+ }
+unlock_and_return:
+ rcu_read_unlock();
+ return 0;
+}
+
+static dev_t devname2dev_t(const char *buf)
+{
+ struct block_device *bdev;
+ dev_t dev = 0;
+ struct gendisk *disk;
+ int part;
+
+ /* use a lookup to validate the block device */
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return 0;
+ /* only entire devices are allowed, not single partitions */
+ disk = get_gendisk(bdev->bd_dev, &part);
+ if (disk && !part) {
+ BUG_ON(!bdev->bd_inode);
+ dev = bdev->bd_inode->i_rdev;
+ }
+ bdput(bdev);
+
+ return dev;
+}
+
+/*
+ * The userspace input string must use one of the following syntaxes:
+ *
+ * dev:0 <- delete an i/o limiting rule
+ * dev:io-limit:0 <- set a leaky bucket throttling rule
+ * dev:io-limit:1:bucket-size <- set a token bucket throttling rule
+ * dev:io-limit:1 <- set a token bucket throttling rule using
+ * bucket-size == io-limit
+ */
+static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
+ dev_t *dev, unsigned long long *iolimit,
+ unsigned long long *strategy,
+ unsigned long long *bucket_size)
+{
+ char *p;
+ int count = 0;
+ char *s[4];
+ int ret;
+
+ memset(s, 0, sizeof(s));
+ *dev = 0;
+ *iolimit = 0;
+ *strategy = 0;
+ *bucket_size = 0;
+
+ /* split the colon-delimited input string into its elements */
+ while (count < ARRAY_SIZE(s)) {
+ p = strsep(&buf, ":");
+ if (!p)
+ break;
+ if (!*p)
+ continue;
+ s[count++] = p;
+ }
+
+ /* i/o limit */
+ if (!s[1])
+ return -EINVAL;
+ ret = strict_strtoull(s[1], 10, iolimit);
+ if (ret < 0)
+ return ret;
+ if (!*iolimit)
+ goto out;
+ /* throttling strategy (leaky bucket / token bucket) */
+ if (!s[2])
+ return -EINVAL;
+ ret = strict_strtoull(s[2], 10, strategy);
+ if (ret < 0)
+ return ret;
+ switch (*strategy) {
+ case RATELIMIT_LEAKY_BUCKET:
+ goto out;
+ case RATELIMIT_TOKEN_BUCKET:
+ break;
+ default:
+ return -EINVAL;
+ }
+ /* bucket size */
+ if (!s[3])
+ *bucket_size = *iolimit;
+ else {
+ ret = strict_strtoll(s[3], 10, bucket_size);
+ if (ret < 0)
+ return ret;
+ }
+ if (*bucket_size <= 0)
+ return -EINVAL;
+out:
+ /* block device number */
+ *dev = devname2dev_t(s[0]);
+ return *dev ? 0 : -EINVAL;
+}
+
+static int iothrottle_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n, *newn = NULL;
+ dev_t dev;
+ unsigned long long iolimit, strategy, bucket_size;
+ char *buf;
+ size_t nbytes = strlen(buffer);
+ int ret = 0;
+
+ /*
+ * We need to allocate a new buffer here, because
+ * iothrottle_parse_args() can modify it and the buffer provided by
+ * write_string is supposed to be const.
+ */
+ buf = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+ memcpy(buf, buffer, nbytes + 1);
+
+ ret = iothrottle_parse_args(buf, nbytes, cft->private, &dev, &iolimit,
+ &strategy, &bucket_size);
+ if (ret)
+ goto out1;
+ newn = kzalloc(sizeof(*newn), GFP_KERNEL);
+ if (!newn) {
+ ret = -ENOMEM;
+ goto out1;
+ }
+ newn->dev = dev;
+ res_counter_init(&newn->bw, NULL);
+ res_counter_init(&newn->iops, NULL);
+
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ res_counter_ratelimit_set_limit(&newn->iops, 0, 0, 0);
+ res_counter_ratelimit_set_limit(&newn->bw, strategy,
+ ALIGN(iolimit, 1024), ALIGN(bucket_size, 1024));
+ break;
+ case IOTHROTTLE_IOPS:
+ res_counter_ratelimit_set_limit(&newn->bw, 0, 0, 0);
+ /*
+ * scale up the iops cost by a factor of 1000: this allows applying
+ * finer-grained sleeps, and the throttling is more precise this way.
+ */
+ res_counter_ratelimit_set_limit(&newn->iops, strategy,
+ iolimit * 1000, bucket_size * 1000);
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+
+ if (!cgroup_lock_live_group(cgrp)) {
+ ret = -ENODEV;
+ goto out1;
+ }
+ iot = cgroup_to_iothrottle(cgrp);
+
+ n = iothrottle_search_node(iot, dev);
+ if (!n) {
+ if (iolimit) {
+ /* Add a new block device limiting rule */
+ iothrottle_insert_node(iot, newn);
+ newn = NULL;
+ }
+ goto out2;
+ }
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ if (!iolimit && !n->iops.limit) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, n);
+ goto out2;
+ }
+ if (!n->iops.limit)
+ break;
+ /* Update a block device limiting rule */
+ newn->iops = n->iops;
+ break;
+ case IOTHROTTLE_IOPS:
+ if (!iolimit && !n->bw.limit) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, n);
+ goto out2;
+ }
+ if (!n->bw.limit)
+ break;
+ /* Update a block device limiting rule */
+ newn->bw = n->bw;
+ break;
+ }
+ iothrottle_replace_node(iot, n, newn);
+ newn = NULL;
+out2:
+ cgroup_unlock();
+ if (n) {
+ synchronize_rcu();
+ kfree(n);
+ }
+out1:
+ kfree(newn);
+ kfree(buf);
+ return ret;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth-max",
+ .read_seq_string = iothrottle_read,
+ .write_string = iothrottle_write,
+ .max_write_len = 256,
+ .private = IOTHROTTLE_BANDWIDTH,
+ },
+ {
+ .name = "iops-max",
+ .read_seq_string = iothrottle_read,
+ .write_string = iothrottle_write,
+ .max_write_len = 256,
+ .private = IOTHROTTLE_IOPS,
+ },
+ {
+ .name = "throttlecnt",
+ .read_seq_string = iothrottle_read,
+ .private = IOTHROTTLE_FAILCNT,
+ },
+ {
+ .name = "stat",
+ .read_seq_string = iothrottle_read,
+ .private = IOTHROTTLE_STAT,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .subsys_id = iothrottle_subsys_id,
+ .early_init = 1,
+ .use_id = 1,
+};
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_evaluate_sleep(struct iothrottle_sleep *sleep,
+ struct iothrottle *iot,
+ struct block_device *bdev, ssize_t bytes)
+{
+ struct iothrottle_node *n;
+ dev_t dev;
+
+ BUG_ON(!iot);
+
+ /* accounting and throttling is done only on entire block devices */
+ dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev), bdev->bd_disk->first_minor);
+ n = iothrottle_search_node(iot, dev);
+ if (!n)
+ return;
+
+ /* Update statistics */
+ iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_BYTES_TOT, bytes);
+ if (bytes)
+ iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_IOPS_TOT, 1);
+
+ /* Evaluate sleep values */
+ sleep->bw_sleep = res_counter_ratelimit_sleep(&n->bw, bytes);
+ /*
+ * Scale up the iops cost by a factor of 1000: this allows us to apply
+ * more fine-grained sleeps, which makes the throttling more precise.
+ *
+ * Note: do not account any i/o operation if bytes is negative or zero.
+ */
+ sleep->iops_sleep = res_counter_ratelimit_sleep(&n->iops,
+ bytes ? 1000 : 0);
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_acct_stat(struct iothrottle *iot,
+ struct block_device *bdev, int type,
+ unsigned long long sleep)
+{
+ struct iothrottle_node *n;
+ dev_t dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev),
+ bdev->bd_disk->first_minor);
+
+ n = iothrottle_search_node(iot, dev);
+ if (!n)
+ return;
+ iothrottle_stat_add_sleep(&n->stat, type, sleep);
+}
+
+static void iothrottle_acct_task_stat(int type, unsigned long long sleep)
+{
+ /*
+ * XXX: per-task statistics may be inaccurate; anyway, this is not a
+ * critical issue compared to introducing locking overhead or
+ * increasing the size of task_struct.
+ */
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ current->io_throttle_bw_cnt++;
+ current->io_throttle_bw_sleep += sleep;
+ break;
+
+ case IOTHROTTLE_IOPS:
+ current->io_throttle_iops_cnt++;
+ current->io_throttle_iops_sleep += sleep;
+ break;
+ }
+}
+
+/*
+ * A helper function to get iothrottle from css id.
+ *
+ * NOTE: must be called under rcu_read_lock(). The caller must check
+ * css_is_removed() or similar if that is a concern.
+ */
+static struct iothrottle *iothrottle_lookup(unsigned long id)
+{
+ struct cgroup_subsys_state *css;
+
+ if (!id)
+ return NULL;
+ css = css_lookup(&iothrottle_subsys, id);
+ if (!css)
+ return NULL;
+ return container_of(css, struct iothrottle, css);
+}
+
+static struct iothrottle *get_iothrottle_from_page(struct page *page)
+{
+ struct iothrottle *iot;
+ unsigned long id;
+
+ BUG_ON(!page);
+ id = page_cgroup_get_owner(page);
+
+ rcu_read_lock();
+ iot = iothrottle_lookup(id);
+ if (!iot)
+ goto out;
+ css_get(&iot->css);
+out:
+ rcu_read_unlock();
+ return iot;
+}
+
+static struct iothrottle *get_iothrottle_from_bio(struct bio *bio)
+{
+ if (!bio)
+ return NULL;
+ return get_iothrottle_from_page(bio_page(bio));
+}
+
+int iothrottle_set_page_owner(struct page *page, struct mm_struct *mm)
+{
+ struct iothrottle *iot;
+ unsigned short id = 0;
+
+ if (iothrottle_disabled())
+ return 0;
+ if (!mm)
+ goto out;
+ rcu_read_lock();
+ iot = task_to_iothrottle(rcu_dereference(mm->owner));
+ if (likely(iot))
+ id = css_id(&iot->css);
+ rcu_read_unlock();
+out:
+ return page_cgroup_set_owner(page, id);
+}
+
+int iothrottle_set_pagedirty_owner(struct page *page, struct mm_struct *mm)
+{
+ if (PageSwapCache(page) || PageAnon(page))
+ return 0;
+ if (current->flags & PF_MEMALLOC)
+ return 0;
+ return iothrottle_set_page_owner(page, mm);
+}
+
+int iothrottle_copy_page_owner(struct page *npage, struct page *opage)
+{
+ if (iothrottle_disabled())
+ return 0;
+ return page_cgroup_copy_owner(npage, opage);
+}
+
+static inline int is_kthread_io(void)
+{
+ return current->flags & (PF_KTHREAD | PF_FLUSHER | PF_KSWAPD);
+}
+
+static inline bool is_urgent_io(struct bio *bio)
+{
+ if (bio && (bio_rw_meta(bio) || bio_noidle(bio)))
+ return true;
+ if (has_fs_excl())
+ return true;
+ return false;
+}
+
+/**
+ * cgroup_io_throttle() - account and throttle synchronous i/o activity
+ * @bio: the bio structure used to retrieve the owner of the i/o
+ * operation.
+ * @bdev: block device involved for the i/o.
+ * @bytes: size in bytes of the i/o operation.
+ *
+ * This is the core of the block device i/o bandwidth controller. This function
+ * must be called by any function that generates i/o activity (directly or
+ * indirectly). It provides both i/o accounting and throttling functionalities;
+ * throttling is skipped when the calling context cannot sleep (kernel
+ * threads, urgent i/o, AIO contexts).
+ *
+ * Returns the value of sleep in jiffies if it was not possible to schedule the
+ * timeout.
+ **/
+unsigned long long
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
+{
+ struct iothrottle *iot = NULL;
+ struct iothrottle_sleep s = {};
+ unsigned long long sleep;
+ int can_sleep = 1;
+
+ if (iothrottle_disabled())
+ return 0;
+ if (unlikely(!bdev))
+ return 0;
+ BUG_ON(!bdev->bd_inode || !bdev->bd_disk);
+ /*
+ * Never throttle kernel threads directly, since they may completely
+ * block other cgroups, the i/o on other block devices or even the
+ * whole system.
+ *
+ * For the same reason never throttle IO that comes from tasks that are
+ * holding exclusive access resources (urgent IO).
+ *
+ * And never sleep if we're inside an AIO context; just account the i/o
+ * activity. Throttling is performed in io_submit_one() returning
+ * -EAGAIN when the limits are exceeded.
+ */
+ if (is_kthread_io() || is_urgent_io(bio) || is_in_aio())
+ can_sleep = 0;
+ /*
+ * WARNING: in_atomic() does not know about held spinlocks in
+ * non-preemptible kernels, but we still check it here to catch
+ * potential bugs when a preemptible kernel is used.
+ */
+ WARN_ON_ONCE(can_sleep &&
+ (irqs_disabled() || in_interrupt() || in_atomic()));
+
+ /* Apply IO throttling */
+ iot = get_iothrottle_from_bio(bio);
+ rcu_read_lock();
+ if (!iot) {
+ /* IO occurs in the same context of the current task */
+ iot = task_to_iothrottle(current);
+ css_get(&iot->css);
+ }
+ iothrottle_evaluate_sleep(&s, iot, bdev, bytes);
+ sleep = max(s.bw_sleep, s.iops_sleep);
+ if (unlikely(sleep && can_sleep)) {
+ int type = (s.bw_sleep < s.iops_sleep) ?
+ IOTHROTTLE_IOPS : IOTHROTTLE_BANDWIDTH;
+
+ iothrottle_acct_stat(iot, bdev, type, sleep);
+ css_put(&iot->css);
+ rcu_read_unlock();
+
+ pr_debug("io-throttle: task %p (%s) must sleep %llu jiffies\n",
+ current, current->comm, sleep);
+ iothrottle_acct_task_stat(type, sleep);
+ schedule_timeout_killable(sleep);
+ return 0;
+ }
+ css_put(&iot->css);
+ rcu_read_unlock();
+
+ /*
+ * Account, but do not delay, filesystems' metadata IO or IO that is
+ * explicitly marked not to wait or not to be anticipated, i.e. writes
+ * with wbc->sync_mode set to WB_SYNC_ALL - fsync() - or journal activity.
+ */
+ if (is_urgent_io(bio))
+ sleep = 0;
+ return sleep;
+}
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
new file mode 100644
index 0000000..304c56c
--- /dev/null
+++ b/include/linux/blk-io-throttle.h
@@ -0,0 +1,144 @@
+#ifndef BLK_IO_THROTTLE_H
+#define BLK_IO_THROTTLE_H
+
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/sched.h>
+#include <linux/cgroup.h>
+#include <asm/atomic.h>
+#include <asm/current.h>
+
+#define IOTHROTTLE_BANDWIDTH 0
+#define IOTHROTTLE_IOPS 1
+#define IOTHROTTLE_FAILCNT 2
+#define IOTHROTTLE_STAT 3
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+
+static inline bool iothrottle_disabled(void)
+{
+ if (iothrottle_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern unsigned long long
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes);
+
+extern int iothrottle_make_request(struct bio *bio, unsigned long deadline);
+
+int iothrottle_set_page_owner(struct page *page, struct mm_struct *mm);
+int iothrottle_set_pagedirty_owner(struct page *page, struct mm_struct *mm);
+int iothrottle_copy_page_owner(struct page *npage, struct page *opage);
+
+extern int iothrottle_sync(void);
+
+static inline void set_in_aio(void)
+{
+ atomic_set(&current->in_aio, 1);
+}
+
+static inline void unset_in_aio(void)
+{
+ atomic_set(&current->in_aio, 0);
+}
+
+static inline int is_in_aio(void)
+{
+ return atomic_read(&current->in_aio);
+}
+
+static inline unsigned long long
+get_io_throttle_cnt(struct task_struct *t, int type)
+{
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ return t->io_throttle_bw_cnt;
+ case IOTHROTTLE_IOPS:
+ return t->io_throttle_iops_cnt;
+ }
+ BUG();
+}
+
+static inline unsigned long long
+get_io_throttle_sleep(struct task_struct *t, int type)
+{
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ return jiffies_to_clock_t(t->io_throttle_bw_sleep);
+ case IOTHROTTLE_IOPS:
+ return jiffies_to_clock_t(t->io_throttle_iops_sleep);
+ }
+ BUG();
+}
+#else /* CONFIG_CGROUP_IO_THROTTLE */
+
+static inline bool iothrottle_disabled(void)
+{
+ return true;
+}
+
+static inline unsigned long long
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
+{
+ return 0;
+}
+
+static inline int
+iothrottle_make_request(struct bio *bio, unsigned long deadline)
+{
+ return 0;
+}
+
+static inline int iothrottle_set_page_owner(struct page *page,
+ struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline int iothrottle_set_pagedirty_owner(struct page *page,
+ struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline int iothrottle_copy_page_owner(struct page *npage,
+ struct page *opage)
+{
+ return 0;
+}
+
+static inline int iothrottle_sync(void)
+{
+ return 0;
+}
+
+static inline void set_in_aio(void) { }
+
+static inline void unset_in_aio(void) { }
+
+static inline int is_in_aio(void)
+{
+ return 0;
+}
+
+static inline unsigned long long
+get_io_throttle_cnt(struct task_struct *t, int type)
+{
+ return 0;
+}
+
+static inline unsigned long long
+get_io_throttle_sleep(struct task_struct *t, int type)
+{
+ return 0;
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+static inline struct block_device *as_to_bdev(struct address_space *mapping)
+{
+ return (mapping->host && mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+}
+
+#endif /* BLK_IO_THROTTLE_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..c37cc4b 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
/* */
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
diff --git a/init/Kconfig b/init/Kconfig
index 5428ac7..d496c5f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -565,6 +565,18 @@ config RESOURCE_COUNTERS
infrastructure that works with cgroups.
depends on CGROUPS
+config CGROUP_IO_THROTTLE
+ bool "Enable cgroup I/O throttling"
+ depends on CGROUPS && RESOURCE_COUNTERS && EXPERIMENTAL
+ select MM_OWNER
+ select PAGE_TRACKING
+ help
+ This allows the maximum I/O bandwidth to be limited for specific
+ cgroup(s).
+ See Documentation/cgroups/io-throttle.txt for more information.
+
+ If unsure, say N.
+
config CGROUP_MEM_RES_CTLR
bool "Memory Resource Controller for Control Groups"
depends on CGROUPS && RESOURCE_COUNTERS
--
1.5.6.3
Apply the io-throttle control and page tracking hooks to the appropriate
kernel functions.
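Once these hooks are in place, the accounting and throttling can be
exercised from userspace. A minimal, purely illustrative sketch, assuming
the rule syntax described in the documentation patch ([PATCH 1/7]) and that
/dev/sda is the device to throttle (paths and sizes are only examples):
  # mount -t cgroup -o blockio none /mnt/cgroup
  # mkdir /mnt/cgroup/foo
  # echo /dev/sda:$((10 * 1024 * 1024)):0 > /mnt/cgroup/foo/blockio.bandwidth-max
  # echo $$ > /mnt/cgroup/foo/tasks
  # dd if=/dev/zero of=/mnt/data/zero bs=1M count=100   (assuming /mnt/data is on /dev/sda)
  # cat /mnt/cgroup/foo/blockio.throttlecnt /mnt/cgroup/foo/blockio.stat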
Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-core.c | 8 ++++++++
fs/aio.c | 12 ++++++++++++
fs/buffer.c | 2 ++
include/linux/sched.h | 7 +++++++
kernel/fork.c | 7 +++++++
mm/bounce.c | 2 ++
mm/filemap.c | 2 ++
mm/page-writeback.c | 2 ++
mm/readahead.c | 3 +++
9 files changed, 45 insertions(+), 0 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 07ab754..4d7f9f6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blktrace_api.h>
#include <linux/fault-inject.h>
#include <trace/block.h>
@@ -1547,11 +1548,16 @@ void submit_bio(int rw, struct bio *bio)
* go through the normal accounting stuff before submission.
*/
if (bio_has_data(bio)) {
+ unsigned long sleep = 0;
+
if (rw & WRITE) {
count_vm_events(PGPGOUT, count);
+ sleep = cgroup_io_throttle(bio,
+ bio->bi_bdev, bio->bi_size);
} else {
task_io_account_read(bio->bi_size);
count_vm_events(PGPGIN, count);
+ cgroup_io_throttle(NULL, bio->bi_bdev, bio->bi_size);
}
if (unlikely(block_dump)) {
@@ -1562,6 +1568,8 @@ void submit_bio(int rw, struct bio *bio)
(unsigned long long)bio->bi_sector,
bdevname(bio->bi_bdev, b));
}
+ if (sleep && !iothrottle_make_request(bio, jiffies + sleep))
+ return;
}
generic_make_request(bio);
diff --git a/fs/aio.c b/fs/aio.c
index 76da125..ab6c457 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -22,6 +22,7 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/blk-io-throttle.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/slab.h>
@@ -1587,6 +1588,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
{
struct kiocb *req;
struct file *file;
+ struct block_device *bdev;
ssize_t ret;
/* enforce forwards compatibility on users */
@@ -1609,6 +1611,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
if (unlikely(!file))
return -EBADF;
+ /* check if we're exceeding the IO throttling limits */
+ bdev = as_to_bdev(file->f_mapping);
+ ret = cgroup_io_throttle(NULL, bdev, 0);
+ if (unlikely(ret)) {
+ fput(file);
+ return -EAGAIN;
+ }
+
req = aio_get_req(ctx); /* returns with 2 references to req */
if (unlikely(!req)) {
fput(file);
@@ -1652,12 +1662,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;
spin_lock_irq(&ctx->ctx_lock);
+ set_in_aio();
aio_run_iocb(req);
if (!list_empty(&ctx->run_list)) {
/* drain the run list */
while (__aio_run_iocbs(ctx))
;
}
+ unset_in_aio();
spin_unlock_irq(&ctx->ctx_lock);
aio_put_req(req); /* drop extra ref to req */
return 0;
diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7..2eb581f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/blk-io-throttle.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ iothrottle_set_pagedirty_owner(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..e0cd710 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1356,6 +1356,13 @@ struct task_struct {
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
struct task_io_accounting ioac;
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_t in_aio;
+ unsigned long long io_throttle_bw_cnt;
+ unsigned long long io_throttle_bw_sleep;
+ unsigned long long io_throttle_iops_cnt;
+ unsigned long long io_throttle_iops_sleep;
+#endif
#if defined(CONFIG_TASK_XACCT)
u64 acct_rss_mem1; /* accumulated rss usage */
u64 acct_vm_mem1; /* accumulated virtual memory usage */
diff --git a/kernel/fork.c b/kernel/fork.c
index b9e2edd..272c461 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1043,6 +1043,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
task_io_accounting_init(&p->ioac);
acct_clear_integrals(p);
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_set(&p->in_aio, 0);
+ p->io_throttle_bw_cnt = 0;
+ p->io_throttle_bw_sleep = 0;
+ p->io_throttle_iops_cnt = 0;
+ p->io_throttle_iops_sleep = 0;
+#endif
posix_cpu_timers_init(p);
p->lock_depth = -1; /* -1 = no lock */
diff --git a/mm/bounce.c b/mm/bounce.c
index e590272..80bf52c 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -10,6 +10,7 @@
#include <linux/pagemap.h>
#include <linux/mempool.h>
#include <linux/blkdev.h>
+#include <linux/blk-io-throttle.h>
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
@@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ iothrottle_copy_page_owner(to->bv_page, page);
if (rw == WRITE) {
char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..5498d1d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -28,6 +28,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/blkdev.h>
+#include <linux/blk-io-throttle.h>
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/cpuset.h>
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
+ iothrottle_set_page_owner(page, current->mm);
error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..46cf92e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -24,6 +24,7 @@
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/blkdev.h>
+#include <linux/blk-io-throttle.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
#include <linux/percpu.h>
@@ -1243,6 +1244,7 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ iothrottle_set_pagedirty_owner(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/mm/readahead.c b/mm/readahead.c
index 133b6d5..25cae4c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>
@@ -81,6 +82,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
+ struct block_device *bdev = as_to_bdev(mapping);
int ret = 0;
while (!list_empty(pages)) {
@@ -99,6 +101,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
+ cgroup_io_throttle(NULL, bdev, PAGE_CACHE_SIZE);
}
return ret;
}
--
1.5.6.3
Together with cgroup_io_throttle() the kiothrottled kernel thread
represents the core of the io-throttle subsystem.
Writeback IO requests that need to be throttled are not dispatched
immediately in submit_bio(). Instead, they are added into an rbtree by
iothrottle_make_request() and processed asynchronously by kiothrottled.
A deadline is associated with each request depending on the bandwidth
usage of the cgroup it belongs to. When a request is inserted into the
rbtree kiothrottled is awakened. The thread selects all the requests
with an expired deadline and submits the batch of selected requests to
the underlying block devices using generic_make_request().
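The patch also adds a small debugging interface under procfs. A possible
usage sketch (the file name comes from the code below; any write to the
file, whatever the value, simply triggers iothrottle_sync()):
  # cat /proc/kiothrottled_debug          (list pending bios: <bio> <deadline> <now> <delta>)
  # echo 1 > /proc/kiothrottled_debug     (force an immediate flush of the pending requests)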
Signed-off-by: Andrea Righi <[email protected]>
---
block/Makefile | 2 +-
block/kiothrottled.c | 341 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 342 insertions(+), 1 deletions(-)
create mode 100644 block/kiothrottled.c
diff --git a/block/Makefile b/block/Makefile
index 42b6a46..5f10a45 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,6 +13,6 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
-obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o kiothrottled.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff --git a/block/kiothrottled.c b/block/kiothrottled.c
new file mode 100644
index 0000000..3df22c1
--- /dev/null
+++ b/block/kiothrottled.c
@@ -0,0 +1,341 @@
+/*
+ * kiothrottled.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <[email protected]>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/kthread.h>
+#include <linux/jiffies.h>
+#include <linux/ioprio.h>
+#include <linux/rbtree.h>
+#include <linux/blkdev.h>
+
+/* io-throttle bio element */
+struct iot_bio {
+ struct rb_node node;
+ unsigned long deadline;
+ struct bio *bio;
+};
+
+/* io-throttle bio tree */
+struct iot_bio_tree {
+ /* Protect the iothrottle rbtree */
+ spinlock_t lock;
+ struct rb_root tree;
+};
+
+/*
+ * TODO: create one iothrottle rbtree per block device and many kiothrottled
+ * threads per rbtree, instead of a poorly scalable single rbtree / single
+ * thread solution.
+ */
+static struct iot_bio_tree *iot;
+static struct task_struct *kiothrottled_thread;
+
+/* Timer used to periodically wake-up kiothrottled */
+static struct timer_list kiothrottled_timer;
+
+/* Insert a new iot_bio element in the iot_bio_tree */
+static void iot_bio_insert(struct rb_root *root, struct iot_bio *data)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ while (*new) {
+ struct iot_bio *this = container_of(*new, struct iot_bio, node);
+ parent = *new;
+ if (data->deadline < this->deadline)
+ new = &((*new)->rb_left);
+ else
+ new = &((*new)->rb_right);
+ }
+ rb_link_node(&data->node, parent, new);
+ rb_insert_color(&data->node, root);
+}
+
+/*
+ * NOTE: no locking is needed here: we're flushing all the pending
+ * requests, kiothrottled has been stopped and no additional requests will
+ * be inserted into the tree.
+ */
+static void iot_bio_cleanup(struct rb_root *root)
+{
+ struct iot_bio *data;
+ struct rb_node *next;
+
+ next = rb_first(root);
+ while (next) {
+ data = rb_entry(next, struct iot_bio, node);
+ pr_debug("%s: dispatching element: %p (%lu)\n",
+ __func__, data->bio, data->deadline);
+ generic_make_request(data->bio);
+ next = rb_next(&data->node);
+ rb_erase(&data->node, root);
+ kfree(data);
+ }
+}
+
+/**
+ * iothrottle_make_request() - submit a delayed IO request that will be
+ * processed asynchronously by kiothrottled.
+ *
+ * @bio: the bio structure that contains the IO request's information
+ * @deadline: the request is actually dispatched only when this deadline
+ * expires
+ *
+ * Returns 0 if the request is successfully submitted and inserted into the
+ * iot_bio_tree, or a negative value in case of failure.
+ **/
+int iothrottle_make_request(struct bio *bio, unsigned long deadline)
+{
+ struct iot_bio *data;
+
+ BUG_ON(!iot);
+
+ if (unlikely(!kiothrottled_thread))
+ return -ENOENT;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
+ if (unlikely(!data))
+ return -ENOMEM;
+ data->deadline = deadline;
+ data->bio = bio;
+
+ spin_lock_irq(&iot->lock);
+ iot_bio_insert(&iot->tree, data);
+ spin_unlock_irq(&iot->lock);
+
+ wake_up_process(kiothrottled_thread);
+ return 0;
+}
+EXPORT_SYMBOL(iothrottle_make_request);
+
+static void kiothrottled_timer_expired(unsigned long __unused)
+{
+ wake_up_process(kiothrottled_thread);
+}
+
+static void kiothrottled_sleep(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule();
+}
+
+/**
+ * kiothrottled() - throttle buffered (writeback) i/o activity
+ *
+ * Together with cgroup_io_throttle() this kernel thread represents the core of
+ * the cgroup-io-throttle subsystem.
+ *
+ * All the writeback IO requests that need to be throttled are not dispatched
+ * immediately in submit_bio(). Instead, they are added into the iot_bio_tree
+ * rbtree by iothrottle_make_request() and processed asynchronously by
+ * kiothrottled.
+ *
+ * A deadline is associated with each request depending on the bandwidth
+ * usage of the cgroup it belongs to. When a request is inserted into the
+ * rbtree kiothrottled is awakened. This thread selects all the requests
+ * with an expired deadline and submits the batch of selected requests to
+ * the underlying block devices using generic_make_request().
+ **/
+static int kiothrottled(void *__unused)
+{
+ /*
+ * kiothrottled is responsible for dispatching all the writeback IO
+ * requests with an expired deadline. To dispatch those requests as
+ * soon as possible and to avoid priority inversion problems, set the
+ * maximum real-time IO priority for this thread.
+ */
+ set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
+
+ while (!kthread_should_stop()) {
+ struct iot_bio *data;
+ struct rb_node *req;
+ struct rb_root staging_tree = RB_ROOT;
+ unsigned long now = jiffies;
+ long delta_t = 0;
+
+ /* Select requests to dispatch */
+ spin_lock_irq(&iot->lock);
+ req = rb_first(&iot->tree);
+ while (req) {
+ data = rb_entry(req, struct iot_bio, node);
+ delta_t = (long)data->deadline - (long)now;
+ if (delta_t > 0)
+ break;
+ req = rb_next(&data->node);
+ rb_erase(&data->node, &iot->tree);
+ iot_bio_insert(&staging_tree, data);
+ }
+ spin_unlock_irq(&iot->lock);
+
+ /* Dispatch requests */
+ req = rb_first(&staging_tree);
+ while (req) {
+ data = rb_entry(req, struct iot_bio, node);
+ req = rb_next(&data->node);
+ rb_erase(&data->node, &staging_tree);
+ pr_debug("%s: dispatching request: %p (%lu)\n",
+ __func__, data->bio, data->deadline);
+ generic_make_request(data->bio);
+ kfree(data);
+ }
+
+ /* Wait for new requests ready to be dispatched */
+ if (delta_t > 0)
+ mod_timer(&kiothrottled_timer, jiffies + HZ);
+ kiothrottled_sleep();
+ }
+ return 0;
+}
+
+/* TODO: handle concurrent startup and shutdown */
+static void kiothrottle_shutdown(void)
+{
+ if (!kiothrottled_thread)
+ return;
+ del_timer(&kiothrottled_timer);
+ printk(KERN_INFO "%s: stopping kiothrottled\n", __func__);
+ kthread_stop(kiothrottled_thread);
+ printk(KERN_INFO "%s: flushing pending requests\n", __func__);
+ spin_lock_irq(&iot->lock);
+ kiothrottled_thread = NULL;
+ spin_unlock_irq(&iot->lock);
+ iot_bio_cleanup(&iot->tree);
+}
+
+static int kiothrottle_startup(void)
+{
+ init_timer(&kiothrottled_timer);
+ kiothrottled_timer.function = kiothrottled_timer_expired;
+
+ printk(KERN_INFO "%s: starting kiothrottled\n", __func__);
+ kiothrottled_thread = kthread_run(kiothrottled, NULL, "kiothrottled");
+ if (IS_ERR(kiothrottled_thread))
+ return PTR_ERR(kiothrottled_thread);
+ return 0;
+}
+
+/*
+ * NOTE: this interface is provided only for emergency situations, when we
+ * need to force an immediate flush of the pending (writeback) throttled IO
+ * requests.
+ */
+int iothrottle_sync(void)
+{
+ kiothrottle_shutdown();
+ return kiothrottle_startup();
+}
+EXPORT_SYMBOL(iothrottle_sync);
+
+/*
+ * Writing in /proc/kiothrottled_debug enforces an immediate flush of throttled
+ * IO requests.
+ */
+static ssize_t kiothrottle_write(struct file *filp, const char __user *buffer,
+ size_t count, loff_t *data)
+{
+ int ret;
+
+ ret = iothrottle_sync();
+ if (ret)
+ return ret;
+ return count;
+}
+
+/*
+ * Export to userspace the list of pending IO throttled requests.
+ * TODO: this is useful only for debugging, so maybe we should make this
+ * interface optional, depending on a dedicated compile-time config option.
+ */
+static int kiothrottle_show(struct seq_file *m, void *v)
+{
+ struct iot_bio *data;
+ struct rb_node *next;
+ unsigned long now = jiffies;
+ long delta_t;
+
+ spin_lock_irq(&iot->lock);
+ next = rb_first(&iot->tree);
+ while (next) {
+ data = rb_entry(next, struct iot_bio, node);
+ delta_t = (long)data->deadline - (long)now;
+ seq_printf(m, "%p %lu %lu %li\n", data->bio,
+ data->deadline, now, delta_t);
+ next = rb_next(&data->node);
+ }
+ spin_unlock_irq(&iot->lock);
+
+ return 0;
+}
+
+static int kiothrottle_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, kiothrottle_show, NULL);
+}
+
+static const struct file_operations kiothrottle_ops = {
+ .open = kiothrottle_open,
+ .read = seq_read,
+ .write = kiothrottle_write,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+int __init kiothrottled_init(void)
+{
+ struct proc_dir_entry *pe;
+ int ret;
+
+ iot = kzalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return -ENOMEM;
+ spin_lock_init(&iot->lock);
+ iot->tree = RB_ROOT;
+
+ pe = create_proc_entry("kiothrottled_debug", 0644, NULL);
+ if (!pe) {
+ kfree(iot);
+ return -ENOMEM;
+ }
+ pe->proc_fops = &kiothrottle_ops;
+
+ ret = kiothrottle_startup();
+ if (ret) {
+ remove_proc_entry("kiothrottled_debug", NULL);
+ kfree(iot);
+ return ret;
+ }
+ printk(KERN_INFO "%s: initialized\n", __func__);
+ return 0;
+}
+
+void __exit kiothrottled_exit(void)
+{
+ kiothrottle_shutdown();
+ remove_proc_entry("kiothrottled_debug", NULL);
+ kfree(iot);
+ printk(KERN_INFO "%s: unloaded\n", __func__);
+}
+
+module_init(kiothrottled_init);
+module_exit(kiothrottled_exit);
+MODULE_LICENSE("GPL");
--
1.5.6.3
Export the throttling statistics collected for each task through
/proc/PID/io-throttle-stat.
Example:
$ cat /proc/$$/io-throttle-stat
  0 0 0 0
  ^ ^ ^ ^
  | | | \_____ global iops sleep (in clock ticks)
  | | \_______ global iops counter
  | \_________ global bandwidth sleep (in clock ticks)
  \___________ global bandwidth counter
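The two sleep columns are expressed in clock ticks; a quick, purely
illustrative way to convert them to seconds from the shell (assuming the
tick rate reported by getconf CLK_TCK):
  $ t=$(getconf CLK_TCK)
  $ awk -v t="$t" '{ printf "bw: %d throttles, %.2fs; iops: %d throttles, %.2fs\n", $1, $2/t, $3, $4/t }' /proc/$$/io-throttle-stat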
Signed-off-by: Andrea Righi <[email protected]>
---
fs/proc/base.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index aa763ab..94061bf 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -54,6 +54,7 @@
#include <linux/proc_fs.h>
#include <linux/stat.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/init.h>
#include <linux/capability.h>
#include <linux/file.h>
@@ -2453,6 +2454,17 @@ static int proc_tgid_io_accounting(struct task_struct *task, char *buffer)
}
#endif /* CONFIG_TASK_IO_ACCOUNTING */
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+static int proc_iothrottle_stat(struct task_struct *task, char *buffer)
+{
+ return sprintf(buffer, "%llu %llu %llu %llu\n",
+ get_io_throttle_cnt(task, IOTHROTTLE_BANDWIDTH),
+ get_io_throttle_sleep(task, IOTHROTTLE_BANDWIDTH),
+ get_io_throttle_cnt(task, IOTHROTTLE_IOPS),
+ get_io_throttle_sleep(task, IOTHROTTLE_IOPS));
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -2539,6 +2551,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, proc_tgid_io_accounting),
#endif
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ INF("io-throttle-stat", S_IRUGO, proc_iothrottle_stat),
+#endif
};
static int proc_tgid_base_readdir(struct file * filp,
@@ -2874,6 +2889,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, proc_tid_io_accounting),
#endif
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ INF("io-throttle-stat", S_IRUGO, proc_iothrottle_stat),
+#endif
};
static int proc_tid_base_readdir(struct file * filp,
--
1.5.6.3
On Sat, Apr 18, 2009 at 11:38:29PM +0200, Andrea Righi wrote:
> This is the core of the io-throttle kernel infrastructure. It creates
> the basic interfaces to the cgroup subsystem and implements the I/O
> measurement and throttling functionality.
A few questions interspersed below.
Thanx, Paul
> Signed-off-by: Gui Jianfeng <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> ---
> block/Makefile | 1 +
> block/blk-io-throttle.c | 822 +++++++++++++++++++++++++++++++++++++++
> include/linux/blk-io-throttle.h | 144 +++++++
> include/linux/cgroup_subsys.h | 6 +
> init/Kconfig | 12 +
> 5 files changed, 985 insertions(+), 0 deletions(-)
> create mode 100644 block/blk-io-throttle.c
> create mode 100644 include/linux/blk-io-throttle.h
>
> diff --git a/block/Makefile b/block/Makefile
> index e9fa4dd..42b6a46 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -13,5 +13,6 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
> obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
> obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
>
> +obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
> obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
> obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
> diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
> new file mode 100644
> index 0000000..c8214fc
> --- /dev/null
> +++ b/block/blk-io-throttle.c
> @@ -0,0 +1,822 @@
> +/*
> + * blk-io-throttle.c
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + *
> + * Copyright (C) 2008 Andrea Righi <[email protected]>
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/res_counter.h>
> +#include <linux/memcontrol.h>
> +#include <linux/slab.h>
> +#include <linux/gfp.h>
> +#include <linux/err.h>
> +#include <linux/genhd.h>
> +#include <linux/hardirq.h>
> +#include <linux/list.h>
> +#include <linux/seq_file.h>
> +#include <linux/spinlock.h>
> +#include <linux/blk-io-throttle.h>
> +#include <linux/mm.h>
> +#include <linux/page_cgroup.h>
> +#include <linux/sched.h>
> +#include <linux/bio.h>
> +
> +/*
> + * Statistics for I/O bandwidth controller.
> + */
> +enum iothrottle_stat_index {
> + /* # of times the cgroup has been throttled for bw limit */
> + IOTHROTTLE_STAT_BW_COUNT,
> + /* # of jiffies spent to sleep for throttling for bw limit */
> + IOTHROTTLE_STAT_BW_SLEEP,
> + /* # of times the cgroup has been throttled for iops limit */
> + IOTHROTTLE_STAT_IOPS_COUNT,
> + /* # of jiffies spent to sleep for throttling for iops limit */
> + IOTHROTTLE_STAT_IOPS_SLEEP,
> + /* total number of bytes read and written */
> + IOTHROTTLE_STAT_BYTES_TOT,
> + /* total number of I/O operations */
> + IOTHROTTLE_STAT_IOPS_TOT,
> +
> + IOTHROTTLE_STAT_NSTATS,
> +};
> +
> +struct iothrottle_stat_cpu {
> + unsigned long long count[IOTHROTTLE_STAT_NSTATS];
> +} ____cacheline_aligned_in_smp;
> +
> +struct iothrottle_stat {
> + struct iothrottle_stat_cpu cpustat[NR_CPUS];
> +};
> +
> +static void iothrottle_stat_add(struct iothrottle_stat *stat,
> + enum iothrottle_stat_index type, unsigned long long val)
> +{
> + int cpu = get_cpu();
> +
> + stat->cpustat[cpu].count[type] += val;
> + put_cpu();
> +}
> +
> +static void iothrottle_stat_add_sleep(struct iothrottle_stat *stat,
> + int type, unsigned long long sleep)
> +{
> + int cpu = get_cpu();
> +
> + switch (type) {
> + case IOTHROTTLE_BANDWIDTH:
> + stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_COUNT]++;
> + stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_SLEEP] += sleep;
> + break;
> + case IOTHROTTLE_IOPS:
> + stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_COUNT]++;
> + stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_SLEEP] += sleep;
> + break;
> + }
> + put_cpu();
> +}
> +
> +static unsigned long long iothrottle_read_stat(struct iothrottle_stat *stat,
> + enum iothrottle_stat_index idx)
> +{
> + int cpu;
> + unsigned long long ret = 0;
> +
> + for_each_possible_cpu(cpu)
> + ret += stat->cpustat[cpu].count[idx];
> + return ret;
> +}
> +
> +struct iothrottle_sleep {
> + unsigned long long bw_sleep;
> + unsigned long long iops_sleep;
> +};
> +
> +/*
> + * struct iothrottle_node - throttling rule of a single block device
> + * @node: list of per block device throttling rules
> + * @dev: block device number, used as key in the list
> + * @bw: max i/o bandwidth (in bytes/s)
> + * @iops: max i/o operations per second
> + * @stat: throttling statistics
> + *
> + * Define a i/o throttling rule for a single block device.
> + *
> + * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
> + * the limiting rules defined for that device persist and they are still valid
> + * if a new device is plugged and it uses the same dev_t number.
> + */
> +struct iothrottle_node {
> + struct list_head node;
> + dev_t dev;
> + struct res_counter bw;
> + struct res_counter iops;
> + struct iothrottle_stat stat;
> +};
> +
> +/**
> + * struct iothrottle - throttling rules for a cgroup
> + * @css: pointer to the cgroup state
> + * @list: list of iothrottle_node elements
> + *
> + * Define multiple per-block device i/o throttling rules.
> + * Note: the list of the throttling rules is protected by RCU locking:
> + * - hold cgroup_lock() for update.
> + * - hold rcu_read_lock() for read.
> + */
> +struct iothrottle {
> + struct cgroup_subsys_state css;
> + struct list_head list;
> +};
> +static struct iothrottle init_iothrottle;
> +
> +static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
> +{
> + return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
> + struct iothrottle, css);
> +}
> +
> +/*
> + * Note: called with rcu_read_lock() held.
> + */
> +static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
> +{
> + return container_of(task_subsys_state(task, iothrottle_subsys_id),
> + struct iothrottle, css);
OK, task_subsys_state() has an rcu_dereference(), so...
> +}
> +
> +/*
> + * Note: called with rcu_read_lock() or iot->lock held.
> + */
> +static struct iothrottle_node *
> +iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
> +{
> + struct iothrottle_node *n;
> +
> + if (list_empty(&iot->list))
> + return NULL;
> + list_for_each_entry_rcu(n, &iot->list, node)
> + if (n->dev == dev)
> + return n;
> + return NULL;
> +}
> +
> +/*
> + * Note: called with iot->lock held.
Should this be a WARN_ON() or something similar? The machine is unable
to enforce a comment. ;-)
> + */
> +static inline void iothrottle_insert_node(struct iothrottle *iot,
> + struct iothrottle_node *n)
> +{
> + list_add_rcu(&n->node, &iot->list);
> +}
> +
> +/*
> + * Note: called with iot->lock held.
Ditto.
> + */
> +static inline void
> +iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
> + struct iothrottle_node *new)
> +{
> + list_replace_rcu(&old->node, &new->node);
> +}
> +
> +/*
> + * Note: called with iot->lock held.
Ditto.
> + */
> +static inline void
> +iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
> +{
> + list_del_rcu(&n->node);
> +}
> +
> +/*
> + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> + */
> +static struct cgroup_subsys_state *
> +iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + struct iothrottle *iot;
> +
> + if (unlikely((cgrp->parent) == NULL)) {
> + iot = &init_iothrottle;
> + } else {
> + iot = kzalloc(sizeof(*iot), GFP_KERNEL);
> + if (unlikely(!iot))
> + return ERR_PTR(-ENOMEM);
> + }
> + INIT_LIST_HEAD(&iot->list);
> +
> + return &iot->css;
> +}
> +
> +/*
> + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> + */
> +static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + struct iothrottle_node *n, *p;
> + struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
> +
> + free_css_id(&iothrottle_subsys, &iot->css);
> + /*
> + * don't worry about locking here, at this point there must be not any
> + * reference to the list.
> + */
> + if (!list_empty(&iot->list))
> + list_for_each_entry_safe(n, p, &iot->list, node)
> + kfree(n);
> + kfree(iot);
> +}
> +
> +/*
> + * NOTE: called with rcu_read_lock() held.
> + *
> + * do not care too much about locking for single res_counter values here.
> + */
> +static void iothrottle_show_limit(struct seq_file *m, dev_t dev,
> + struct res_counter *res)
> +{
> + if (!res->limit)
> + return;
> + seq_printf(m, "%u %u %llu %llu %lli %llu %li\n",
> + MAJOR(dev), MINOR(dev),
> + res->limit, res->policy,
> + (long long)res->usage, res->capacity,
> + jiffies_to_clock_t(res_counter_ratelimit_delta_t(res)));
OK, looks like the rcu_dereference() in the list_for_each_entry_rcu() in
the caller suffices here. But thought I should ask the question anyway,
even though at first glance it does look correct.
> +}
> +
> +/*
> + * NOTE: called with rcu_read_lock() held.
> + *
> + */
> +static void iothrottle_show_failcnt(struct seq_file *m, dev_t dev,
> + struct iothrottle_stat *stat)
> +{
> + unsigned long long bw_count, bw_sleep, iops_count, iops_sleep;
> +
> + bw_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_COUNT);
> + bw_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_SLEEP);
> + iops_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_COUNT);
> + iops_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_SLEEP);
> +
> + seq_printf(m, "%u %u %llu %li %llu %li\n", MAJOR(dev), MINOR(dev),
> + bw_count, jiffies_to_clock_t(bw_sleep),
> + iops_count, jiffies_to_clock_t(iops_sleep));
Ditto.
> +}
> +
> +/*
> + * NOTE: called with rcu_read_lock() held.
> + */
> +static void iothrottle_show_stat(struct seq_file *m, dev_t dev,
> + struct iothrottle_stat *stat)
> +{
> + unsigned long long bytes, iops;
> +
> + bytes = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BYTES_TOT);
> + iops = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_TOT);
> +
> + seq_printf(m, "%u %u %llu %llu\n", MAJOR(dev), MINOR(dev), bytes, iops);
Ditto.
> +}
> +
> +static int iothrottle_read(struct cgroup *cgrp, struct cftype *cft,
> + struct seq_file *m)
> +{
> + struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
> + struct iothrottle_node *n;
> +
> + rcu_read_lock();
> + if (list_empty(&iot->list))
> + goto unlock_and_return;
> + list_for_each_entry_rcu(n, &iot->list, node) {
> + BUG_ON(!n->dev);
> + switch (cft->private) {
> + case IOTHROTTLE_BANDWIDTH:
> + iothrottle_show_limit(m, n->dev, &n->bw);
> + break;
> + case IOTHROTTLE_IOPS:
> + iothrottle_show_limit(m, n->dev, &n->iops);
> + break;
> + case IOTHROTTLE_FAILCNT:
> + iothrottle_show_failcnt(m, n->dev, &n->stat);
> + break;
> + case IOTHROTTLE_STAT:
> + iothrottle_show_stat(m, n->dev, &n->stat);
> + break;
> + }
> + }
> +unlock_and_return:
> + rcu_read_unlock();
> + return 0;
> +}
> +
> +static dev_t devname2dev_t(const char *buf)
> +{
> + struct block_device *bdev;
> + dev_t dev = 0;
> + struct gendisk *disk;
> + int part;
> +
> + /* use a lookup to validate the block device */
> + bdev = lookup_bdev(buf);
> + if (IS_ERR(bdev))
> + return 0;
> + /* only entire devices are allowed, not single partitions */
> + disk = get_gendisk(bdev->bd_dev, &part);
> + if (disk && !part) {
> + BUG_ON(!bdev->bd_inode);
> + dev = bdev->bd_inode->i_rdev;
> + }
> + bdput(bdev);
> +
> + return dev;
> +}
> +
> +/*
> + * The userspace input string must use one of the following syntaxes:
> + *
> + * dev:0 <- delete an i/o limiting rule
> + * dev:io-limit:0 <- set a leaky bucket throttling rule
> + * dev:io-limit:1:bucket-size <- set a token bucket throttling rule
> + * dev:io-limit:1 <- set a token bucket throttling rule using
> + * bucket-size == io-limit
> + */
> +static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
> + dev_t *dev, unsigned long long *iolimit,
> + unsigned long long *strategy,
> + unsigned long long *bucket_size)
> +{
> + char *p;
> + int count = 0;
> + char *s[4];
> + int ret;
> +
> + memset(s, 0, sizeof(s));
> + *dev = 0;
> + *iolimit = 0;
> + *strategy = 0;
> + *bucket_size = 0;
> +
> + /* split the colon-delimited input string into its elements */
> + while (count < ARRAY_SIZE(s)) {
> + p = strsep(&buf, ":");
> + if (!p)
> + break;
> + if (!*p)
> + continue;
> + s[count++] = p;
> + }
> +
> + /* i/o limit */
> + if (!s[1])
> + return -EINVAL;
> + ret = strict_strtoull(s[1], 10, iolimit);
> + if (ret < 0)
> + return ret;
> + if (!*iolimit)
> + goto out;
> + /* throttling strategy (leaky bucket / token bucket) */
> + if (!s[2])
> + return -EINVAL;
> + ret = strict_strtoull(s[2], 10, strategy);
> + if (ret < 0)
> + return ret;
> + switch (*strategy) {
> + case RATELIMIT_LEAKY_BUCKET:
> + goto out;
> + case RATELIMIT_TOKEN_BUCKET:
> + break;
> + default:
> + return -EINVAL;
> + }
> + /* bucket size */
> + if (!s[3])
> + *bucket_size = *iolimit;
> + else {
> + ret = strict_strtoll(s[3], 10, bucket_size);
> + if (ret < 0)
> + return ret;
> + }
> + if (*bucket_size <= 0)
> + return -EINVAL;
> +out:
> + /* block device number */
> + *dev = devname2dev_t(s[0]);
> + return *dev ? 0 : -EINVAL;
> +}
> +
> +static int iothrottle_write(struct cgroup *cgrp, struct cftype *cft,
> + const char *buffer)
> +{
> + struct iothrottle *iot;
> + struct iothrottle_node *n, *newn = NULL;
> + dev_t dev;
> + unsigned long long iolimit, strategy, bucket_size;
> + char *buf;
> + size_t nbytes = strlen(buffer);
> + int ret = 0;
> +
> + /*
> + * We need to allocate a new buffer here, because
> + * iothrottle_parse_args() can modify it and the buffer provided by
> + * write_string is supposed to be const.
> + */
> + buf = kmalloc(nbytes + 1, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> + memcpy(buf, buffer, nbytes + 1);
> +
> + ret = iothrottle_parse_args(buf, nbytes, cft->private, &dev, &iolimit,
> + &strategy, &bucket_size);
> + if (ret)
> + goto out1;
> + newn = kzalloc(sizeof(*newn), GFP_KERNEL);
> + if (!newn) {
> + ret = -ENOMEM;
> + goto out1;
> + }
> + newn->dev = dev;
> + res_counter_init(&newn->bw, NULL);
> + res_counter_init(&newn->iops, NULL);
> +
> + switch (cft->private) {
> + case IOTHROTTLE_BANDWIDTH:
> + res_counter_ratelimit_set_limit(&newn->iops, 0, 0, 0);
> + res_counter_ratelimit_set_limit(&newn->bw, strategy,
> + ALIGN(iolimit, 1024), ALIGN(bucket_size, 1024));
> + break;
> + case IOTHROTTLE_IOPS:
> + res_counter_ratelimit_set_limit(&newn->bw, 0, 0, 0);
> + /*
> + * scale up iops cost by a factor of 1000, this allows to apply
> + * a more fine grained sleeps, and throttling results more
> + * precise this way.
> + */
> + res_counter_ratelimit_set_limit(&newn->iops, strategy,
> + iolimit * 1000, bucket_size * 1000);
> + break;
> + default:
> + WARN_ON(1);
> + break;
> + }
> +
> + if (!cgroup_lock_live_group(cgrp)) {
> + ret = -ENODEV;
> + goto out1;
> + }
> + iot = cgroup_to_iothrottle(cgrp);
> +
> + n = iothrottle_search_node(iot, dev);
> + if (!n) {
> + if (iolimit) {
> + /* Add a new block device limiting rule */
> + iothrottle_insert_node(iot, newn);
> + newn = NULL;
> + }
> + goto out2;
> + }
> + switch (cft->private) {
> + case IOTHROTTLE_BANDWIDTH:
> + if (!iolimit && !n->iops.limit) {
> + /* Delete a block device limiting rule */
> + iothrottle_delete_node(iot, n);
> + goto out2;
> + }
> + if (!n->iops.limit)
> + break;
> + /* Update a block device limiting rule */
> + newn->iops = n->iops;
> + break;
> + case IOTHROTTLE_IOPS:
> + if (!iolimit && !n->bw.limit) {
> + /* Delete a block device limiting rule */
> + iothrottle_delete_node(iot, n);
> + goto out2;
> + }
> + if (!n->bw.limit)
> + break;
> + /* Update a block device limiting rule */
> + newn->bw = n->bw;
> + break;
> + }
> + iothrottle_replace_node(iot, n, newn);
> + newn = NULL;
> +out2:
> + cgroup_unlock();
How does the above lock relate to the iot->lock called out in the comment
headers in the earlier functions? Hmmm... Come to think of it, I don't
see an acquisition of iot->lock anywhere.
So, what is the story here?
> + if (n) {
> + synchronize_rcu();
> + kfree(n);
> + }
> +out1:
> + kfree(newn);
> + kfree(buf);
> + return ret;
> +}
> +
> +static struct cftype files[] = {
> + {
> + .name = "bandwidth-max",
> + .read_seq_string = iothrottle_read,
> + .write_string = iothrottle_write,
> + .max_write_len = 256,
> + .private = IOTHROTTLE_BANDWIDTH,
> + },
> + {
> + .name = "iops-max",
> + .read_seq_string = iothrottle_read,
> + .write_string = iothrottle_write,
> + .max_write_len = 256,
> + .private = IOTHROTTLE_IOPS,
> + },
> + {
> + .name = "throttlecnt",
> + .read_seq_string = iothrottle_read,
> + .private = IOTHROTTLE_FAILCNT,
> + },
> + {
> + .name = "stat",
> + .read_seq_string = iothrottle_read,
> + .private = IOTHROTTLE_STAT,
> + },
> +};
> +
> +static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
> +}
> +
> +struct cgroup_subsys iothrottle_subsys = {
> + .name = "blockio",
> + .create = iothrottle_create,
> + .destroy = iothrottle_destroy,
> + .populate = iothrottle_populate,
> + .subsys_id = iothrottle_subsys_id,
> + .early_init = 1,
> + .use_id = 1,
> +};
> +
> +/*
> + * NOTE: called with rcu_read_lock() held.
> + */
> +static void iothrottle_evaluate_sleep(struct iothrottle_sleep *sleep,
> + struct iothrottle *iot,
> + struct block_device *bdev, ssize_t bytes)
> +{
> + struct iothrottle_node *n;
> + dev_t dev;
> +
> + BUG_ON(!iot);
> +
> + /* accounting and throttling is done only on entire block devices */
> + dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev), bdev->bd_disk->first_minor);
> + n = iothrottle_search_node(iot, dev);
> + if (!n)
> + return;
> +
> + /* Update statistics */
> + iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_BYTES_TOT, bytes);
> + if (bytes)
> + iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_IOPS_TOT, 1);
> +
> + /* Evaluate sleep values */
> + sleep->bw_sleep = res_counter_ratelimit_sleep(&n->bw, bytes);
> + /*
> + * scale up iops cost by a factor of 1000, this allows to apply
> + * a more fine grained sleeps, and throttling works better in
> + * this way.
> + *
> + * Note: do not account any i/o operation if bytes is negative or zero.
> + */
> + sleep->iops_sleep = res_counter_ratelimit_sleep(&n->iops,
> + bytes ? 1000 : 0);
> +}
> +
> +/*
> + * NOTE: called with rcu_read_lock() held.
> + */
> +static void iothrottle_acct_stat(struct iothrottle *iot,
> + struct block_device *bdev, int type,
> + unsigned long long sleep)
> +{
> + struct iothrottle_node *n;
> + dev_t dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev),
> + bdev->bd_disk->first_minor);
> +
> + n = iothrottle_search_node(iot, dev);
> + if (!n)
> + return;
> + iothrottle_stat_add_sleep(&n->stat, type, sleep);
> +}
> +
> +static void iothrottle_acct_task_stat(int type, unsigned long long sleep)
> +{
> + /*
> + * XXX: per-task statistics may be inaccurate (this is not a
> + * critical issue, anyway, respect to introduce locking
> + * overhead or increase the size of task_struct).
> + */
> + switch (type) {
> + case IOTHROTTLE_BANDWIDTH:
> + current->io_throttle_bw_cnt++;
> + current->io_throttle_bw_sleep += sleep;
> + break;
> +
> + case IOTHROTTLE_IOPS:
> + current->io_throttle_iops_cnt++;
> + current->io_throttle_iops_sleep += sleep;
> + break;
> + }
> +}
> +
> +/*
> + * A helper function to get iothrottle from css id.
> + *
> + * NOTE: must be called under rcu_read_lock(). The caller must check
> + * css_is_removed() or some if it's concern.
> + */
> +static struct iothrottle *iothrottle_lookup(unsigned long id)
> +{
> + struct cgroup_subsys_state *css;
> +
> + if (!id)
> + return NULL;
> + css = css_lookup(&iothrottle_subsys, id);
> + if (!css)
> + return NULL;
> + return container_of(css, struct iothrottle, css);
> +}
> +
> +static struct iothrottle *get_iothrottle_from_page(struct page *page)
> +{
> + struct iothrottle *iot;
> + unsigned long id;
> +
> + BUG_ON(!page);
> + id = page_cgroup_get_owner(page);
> +
> + rcu_read_lock();
> + iot = iothrottle_lookup(id);
> + if (!iot)
> + goto out;
> + css_get(&iot->css);
> +out:
> + rcu_read_unlock();
> + return iot;
> +}
> +
> +static struct iothrottle *get_iothrottle_from_bio(struct bio *bio)
> +{
> + if (!bio)
> + return NULL;
> + return get_iothrottle_from_page(bio_page(bio));
> +}
> +
> +int iothrottle_set_page_owner(struct page *page, struct mm_struct *mm)
> +{
> + struct iothrottle *iot;
> + unsigned short id = 0;
> +
> + if (iothrottle_disabled())
> + return 0;
> + if (!mm)
> + goto out;
> + rcu_read_lock();
> + iot = task_to_iothrottle(rcu_dereference(mm->owner));
Given that task_to_iothrottle() calls task_subsys_state(), which contains
an rcu_dereference(), why is the rcu_dereference() above required?
(There might well be a good reason, just cannot see it right offhand.)
> + if (likely(iot))
> + id = css_id(&iot->css);
> + rcu_read_unlock();
> +out:
> + return page_cgroup_set_owner(page, id);
> +}
> +
> +int iothrottle_set_pagedirty_owner(struct page *page, struct mm_struct *mm)
> +{
> + if (PageSwapCache(page) || PageAnon(page))
> + return 0;
> + if (current->flags & PF_MEMALLOC)
> + return 0;
> + return iothrottle_set_page_owner(page, mm);
> +}
> +
> +int iothrottle_copy_page_owner(struct page *npage, struct page *opage)
> +{
> + if (iothrottle_disabled())
> + return 0;
> + return page_cgroup_copy_owner(npage, opage);
> +}
> +
> +static inline int is_kthread_io(void)
> +{
> + return current->flags & (PF_KTHREAD | PF_FLUSHER | PF_KSWAPD);
> +}
> +
> +static inline bool is_urgent_io(struct bio *bio)
> +{
> + if (bio && (bio_rw_meta(bio) || bio_noidle(bio)))
> + return true;
> + if (has_fs_excl())
> + return true;
> + return false;
> +}
> +
> +/**
> + * cgroup_io_throttle() - account and throttle synchronous i/o activity
> + * @bio: the bio structure used to retrieve the owner of the i/o
> + * operation.
> + * @bdev: block device involved for the i/o.
> + * @bytes: size in bytes of the i/o operation.
> + *
> + * This is the core of the block device i/o bandwidth controller. This function
> + * must be called by any function that generates i/o activity (directly or
> + * indirectly). It provides both i/o accounting and throttling functionalities;
> + * throttling is disabled if @can_sleep is set to 0.
> + *
> + * Returns the value of sleep in jiffies if it was not possible to schedule the
> + * timeout.
> + **/
> +unsigned long long
> +cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
> +{
> + struct iothrottle *iot = NULL;
> + struct iothrottle_sleep s = {};
> + unsigned long long sleep;
> + int can_sleep = 1;
> +
> + if (iothrottle_disabled())
> + return 0;
> + if (unlikely(!bdev))
> + return 0;
> + BUG_ON(!bdev->bd_inode || !bdev->bd_disk);
> + /*
> + * Never throttle kernel threads directly, since they may completely
> + * block other cgroups, the i/o on other block devices or even the
> + * whole system.
> + *
> + * For the same reason never throttle IO that comes from tasks that are
> + * holding exclusive access resources (urgent IO).
> + *
> + * And never sleep if we're inside an AIO context; just account the i/o
> + * activity. Throttling is performed in io_submit_one() returning
> + * -EAGAIN when the limits are exceeded.
> + */
> + if (is_kthread_io() || is_urgent_io(bio) || is_in_aio())
> + can_sleep = 0;
> + /*
> + * WARNING: in_atomic() do not know about held spinlocks in
> + * non-preemptible kernels, but we want to check it here to raise
> + * potential bugs when a preemptible kernel is used.
> + */
> + WARN_ON_ONCE(can_sleep &&
> + (irqs_disabled() || in_interrupt() || in_atomic()));
> +
> + /* Apply IO throttling */
> + iot = get_iothrottle_from_bio(bio);
> + rcu_read_lock();
> + if (!iot) {
> + /* IO occurs in the same context of the current task */
> + iot = task_to_iothrottle(current);
> + css_get(&iot->css);
> + }
> + iothrottle_evaluate_sleep(&s, iot, bdev, bytes);
> + sleep = max(s.bw_sleep, s.iops_sleep);
> + if (unlikely(sleep && can_sleep)) {
> + int type = (s.bw_sleep < s.iops_sleep) ?
> + IOTHROTTLE_IOPS : IOTHROTTLE_BANDWIDTH;
> +
> + iothrottle_acct_stat(iot, bdev, type, sleep);
> + css_put(&iot->css);
> + rcu_read_unlock();
> +
> + pr_debug("io-throttle: task %p (%s) must sleep %llu jiffies\n",
> + current, current->comm, sleep);
> + iothrottle_acct_task_stat(type, sleep);
> + schedule_timeout_killable(sleep);
> + return 0;
> + }
> + css_put(&iot->css);
> + rcu_read_unlock();
> +
> + /*
> + * Account, but do not delay filesystems' metadata IO or IO that is
> + * explicitly marked to not wait or being anticipated, i.e. writes with
> + * wbc->sync_mode set to WBC_SYNC_ALL - fsync() - or journal activity.
> + */
> + if (is_urgent_io(bio))
> + sleep = 0;
> + return sleep;
> +}
> diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
> new file mode 100644
> index 0000000..304c56c
> --- /dev/null
> +++ b/include/linux/blk-io-throttle.h
> @@ -0,0 +1,144 @@
> +#ifndef BLK_IO_THROTTLE_H
> +#define BLK_IO_THROTTLE_H
> +
> +#include <linux/fs.h>
> +#include <linux/jiffies.h>
> +#include <linux/sched.h>
> +#include <linux/cgroup.h>
> +#include <asm/atomic.h>
> +#include <asm/current.h>
> +
> +#define IOTHROTTLE_BANDWIDTH 0
> +#define IOTHROTTLE_IOPS 1
> +#define IOTHROTTLE_FAILCNT 2
> +#define IOTHROTTLE_STAT 3
> +
> +#ifdef CONFIG_CGROUP_IO_THROTTLE
> +
> +static inline bool iothrottle_disabled(void)
> +{
> + if (iothrottle_subsys.disabled)
> + return true;
> + return false;
> +}
> +
> +extern unsigned long long
> +cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes);
> +
> +extern int iothrottle_make_request(struct bio *bio, unsigned long deadline);
> +
> +int iothrottle_set_page_owner(struct page *page, struct mm_struct *mm);
> +int iothrottle_set_pagedirty_owner(struct page *page, struct mm_struct *mm);
> +int iothrottle_copy_page_owner(struct page *npage, struct page *opage);
> +
> +extern int iothrottle_sync(void);
> +
> +static inline void set_in_aio(void)
> +{
> + atomic_set(&current->in_aio, 1);
> +}
> +
> +static inline void unset_in_aio(void)
> +{
> + atomic_set(&current->in_aio, 0);
> +}
> +
> +static inline int is_in_aio(void)
> +{
> + return atomic_read(&current->in_aio);
> +}
> +
> +static inline unsigned long long
> +get_io_throttle_cnt(struct task_struct *t, int type)
> +{
> + switch (type) {
> + case IOTHROTTLE_BANDWIDTH:
> + return t->io_throttle_bw_cnt;
> + case IOTHROTTLE_IOPS:
> + return t->io_throttle_iops_cnt;
> + }
> + BUG();
> +}
> +
> +static inline unsigned long long
> +get_io_throttle_sleep(struct task_struct *t, int type)
> +{
> + switch (type) {
> + case IOTHROTTLE_BANDWIDTH:
> + return jiffies_to_clock_t(t->io_throttle_bw_sleep);
> + case IOTHROTTLE_IOPS:
> + return jiffies_to_clock_t(t->io_throttle_iops_sleep);
> + }
> + BUG();
> +}
> +#else /* CONFIG_CGROUP_IO_THROTTLE */
> +
> +static inline bool iothrottle_disabled(void)
> +{
> + return true;
> +}
> +
> +static inline unsigned long long
> +cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
> +{
> + return 0;
> +}
> +
> +static inline int
> +iothrottle_make_request(struct bio *bio, unsigned long deadline)
> +{
> + return 0;
> +}
> +
> +static inline int iothrottle_set_page_owner(struct page *page,
> + struct mm_struct *mm)
> +{
> + return 0;
> +}
> +
> +static inline int iothrottle_set_pagedirty_owner(struct page *page,
> + struct mm_struct *mm)
> +{
> + return 0;
> +}
> +
> +static inline int iothrottle_copy_page_owner(struct page *npage,
> + struct page *opage)
> +{
> + return 0;
> +}
> +
> +static inline int iothrottle_sync(void)
> +{
> + return 0;
> +}
> +
> +static inline void set_in_aio(void) { }
> +
> +static inline void unset_in_aio(void) { }
> +
> +static inline int is_in_aio(void)
> +{
> + return 0;
> +}
> +
> +static inline unsigned long long
> +get_io_throttle_cnt(struct task_struct *t, int type)
> +{
> + return 0;
> +}
> +
> +static inline unsigned long long
> +get_io_throttle_sleep(struct task_struct *t, int type)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_CGROUP_IO_THROTTLE */
> +
> +static inline struct block_device *as_to_bdev(struct address_space *mapping)
> +{
> + return (mapping->host && mapping->host->i_sb->s_bdev) ?
> + mapping->host->i_sb->s_bdev : NULL;
> +}
> +
> +#endif /* BLK_IO_THROTTLE_H */
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 9c8d31b..c37cc4b 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
>
> /* */
>
> +#ifdef CONFIG_CGROUP_IO_THROTTLE
> +SUBSYS(iothrottle)
> +#endif
> +
> +/* */
> +
> #ifdef CONFIG_CGROUP_DEVICE
> SUBSYS(devices)
> #endif
> diff --git a/init/Kconfig b/init/Kconfig
> index 5428ac7..d496c5f 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -565,6 +565,18 @@ config RESOURCE_COUNTERS
> infrastructure that works with cgroups.
> depends on CGROUPS
>
> +config CGROUP_IO_THROTTLE
> + bool "Enable cgroup I/O throttling"
> + depends on CGROUPS && RESOURCE_COUNTERS && EXPERIMENTAL
> + select MM_OWNER
> + select PAGE_TRACKING
> + help
> + This allows you to limit the maximum I/O bandwidth for specific
> + cgroup(s).
> + See Documentation/cgroups/io-throttle.txt for more information.
> +
> + If unsure, say N.
> +
> config CGROUP_MEM_RES_CTLR
> bool "Memory Resource Controller for Control Groups"
> depends on CGROUPS && RESOURCE_COUNTERS
> --
> 1.5.6.3
>
On Mon, Apr 20, 2009 at 10:59:04AM -0700, Paul E. McKenney wrote:
> On Sat, Apr 18, 2009 at 11:38:29PM +0200, Andrea Righi wrote:
> > This is the core of the io-throttle kernel infrastructure. It creates
> > the basic interfaces to the cgroup subsystem and implements the I/O
> > measurement and throttling functionality.
>
> A few questions interspersed below.
>
> Thanx, Paul
>
> > Signed-off-by: Gui Jianfeng <[email protected]>
> > Signed-off-by: Andrea Righi <[email protected]>
> > ---
> > block/Makefile | 1 +
> > block/blk-io-throttle.c | 822 +++++++++++++++++++++++++++++++++++++++
> > include/linux/blk-io-throttle.h | 144 +++++++
> > include/linux/cgroup_subsys.h | 6 +
> > init/Kconfig | 12 +
> > 5 files changed, 985 insertions(+), 0 deletions(-)
> > create mode 100644 block/blk-io-throttle.c
> > create mode 100644 include/linux/blk-io-throttle.h
> >
> > diff --git a/block/Makefile b/block/Makefile
> > index e9fa4dd..42b6a46 100644
> > --- a/block/Makefile
> > +++ b/block/Makefile
> > @@ -13,5 +13,6 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
> > obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
> > obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
> >
> > +obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
> > obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
> > obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
> > diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
> > new file mode 100644
> > index 0000000..c8214fc
> > --- /dev/null
> > +++ b/block/blk-io-throttle.c
> > @@ -0,0 +1,822 @@
> > +/*
> > + * blk-io-throttle.c
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2 of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; if not, write to the
> > + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> > + * Boston, MA 021110-1307, USA.
> > + *
> > + * Copyright (C) 2008 Andrea Righi <[email protected]>
> > + */
> > +
> > +#include <linux/init.h>
> > +#include <linux/module.h>
> > +#include <linux/res_counter.h>
> > +#include <linux/memcontrol.h>
> > +#include <linux/slab.h>
> > +#include <linux/gfp.h>
> > +#include <linux/err.h>
> > +#include <linux/genhd.h>
> > +#include <linux/hardirq.h>
> > +#include <linux/list.h>
> > +#include <linux/seq_file.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/blk-io-throttle.h>
> > +#include <linux/mm.h>
> > +#include <linux/page_cgroup.h>
> > +#include <linux/sched.h>
> > +#include <linux/bio.h>
> > +
> > +/*
> > + * Statistics for I/O bandwidth controller.
> > + */
> > +enum iothrottle_stat_index {
> > + /* # of times the cgroup has been throttled for bw limit */
> > + IOTHROTTLE_STAT_BW_COUNT,
> > + /* # of jiffies spent to sleep for throttling for bw limit */
> > + IOTHROTTLE_STAT_BW_SLEEP,
> > + /* # of times the cgroup has been throttled for iops limit */
> > + IOTHROTTLE_STAT_IOPS_COUNT,
> > + /* # of jiffies spent to sleep for throttling for iops limit */
> > + IOTHROTTLE_STAT_IOPS_SLEEP,
> > + /* total number of bytes read and written */
> > + IOTHROTTLE_STAT_BYTES_TOT,
> > + /* total number of I/O operations */
> > + IOTHROTTLE_STAT_IOPS_TOT,
> > +
> > + IOTHROTTLE_STAT_NSTATS,
> > +};
> > +
> > +struct iothrottle_stat_cpu {
> > + unsigned long long count[IOTHROTTLE_STAT_NSTATS];
> > +} ____cacheline_aligned_in_smp;
> > +
> > +struct iothrottle_stat {
> > + struct iothrottle_stat_cpu cpustat[NR_CPUS];
> > +};
> > +
> > +static void iothrottle_stat_add(struct iothrottle_stat *stat,
> > + enum iothrottle_stat_index type, unsigned long long val)
> > +{
> > + int cpu = get_cpu();
> > +
> > + stat->cpustat[cpu].count[type] += val;
> > + put_cpu();
> > +}
> > +
> > +static void iothrottle_stat_add_sleep(struct iothrottle_stat *stat,
> > + int type, unsigned long long sleep)
> > +{
> > + int cpu = get_cpu();
> > +
> > + switch (type) {
> > + case IOTHROTTLE_BANDWIDTH:
> > + stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_COUNT]++;
> > + stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_SLEEP] += sleep;
> > + break;
> > + case IOTHROTTLE_IOPS:
> > + stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_COUNT]++;
> > + stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_SLEEP] += sleep;
> > + break;
> > + }
> > + put_cpu();
> > +}
> > +
> > +static unsigned long long iothrottle_read_stat(struct iothrottle_stat *stat,
> > + enum iothrottle_stat_index idx)
> > +{
> > + int cpu;
> > + unsigned long long ret = 0;
> > +
> > + for_each_possible_cpu(cpu)
> > + ret += stat->cpustat[cpu].count[idx];
> > + return ret;
> > +}
> > +
> > +struct iothrottle_sleep {
> > + unsigned long long bw_sleep;
> > + unsigned long long iops_sleep;
> > +};
> > +
> > +/*
> > + * struct iothrottle_node - throttling rule of a single block device
> > + * @node: list of per block device throttling rules
> > + * @dev: block device number, used as key in the list
> > + * @bw: max i/o bandwidth (in bytes/s)
> > + * @iops: max i/o operations per second
> > + * @stat: throttling statistics
> > + *
> > + * Define a i/o throttling rule for a single block device.
> > + *
> > + * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
> > + * the limiting rules defined for that device persist and they are still valid
> > + * if a new device is plugged and it uses the same dev_t number.
> > + */
> > +struct iothrottle_node {
> > + struct list_head node;
> > + dev_t dev;
> > + struct res_counter bw;
> > + struct res_counter iops;
> > + struct iothrottle_stat stat;
> > +};
> > +
> > +/**
> > + * struct iothrottle - throttling rules for a cgroup
> > + * @css: pointer to the cgroup state
> > + * @list: list of iothrottle_node elements
> > + *
> > + * Define multiple per-block device i/o throttling rules.
> > + * Note: the list of the throttling rules is protected by RCU locking:
> > + * - hold cgroup_lock() for update.
> > + * - hold rcu_read_lock() for read.
> > + */
> > +struct iothrottle {
> > + struct cgroup_subsys_state css;
> > + struct list_head list;
> > +};
> > +static struct iothrottle init_iothrottle;
> > +
> > +static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
> > +{
> > + return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
> > + struct iothrottle, css);
> > +}
> > +
> > +/*
> > + * Note: called with rcu_read_lock() held.
> > + */
> > +static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
> > +{
> > + return container_of(task_subsys_state(task, iothrottle_subsys_id),
> > + struct iothrottle, css);
>
> OK, task_subsys_state() has an rcu_dereference(), so...
Do you mean the comment is obvious and it can be just removed?
>
> > +}
> > +
> > +/*
> > + * Note: called with rcu_read_lock() or iot->lock held.
> > + */
> > +static struct iothrottle_node *
> > +iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
> > +{
> > + struct iothrottle_node *n;
> > +
> > + if (list_empty(&iot->list))
> > + return NULL;
> > + list_for_each_entry_rcu(n, &iot->list, node)
> > + if (n->dev == dev)
> > + return n;
> > + return NULL;
> > +}
> > +
> > +/*
> > + * Note: called with iot->lock held.
>
> Should this be a WARN_ON() or something similar? The machine is unable
> to enforce a comment. ;-)
Right. :) Actually this is an old and never fixed comment... the
iot->list is always modified only under cgroup_lock(), so there's no
need to introduce another lock in struct iothrottle.
Anyway, adding a WARN_ON() seems a good idea, probably a
WARN_ON_ONCE(!cgroup_is_locked()); that also means defining
cgroup_is_locked(), because cgroup_mutex is not exported outside
kernel/cgroup.c.
I'll fix the comment and add the check.
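Just to sketch it (cgroup_is_locked() does not exist today and would have
to live in kernel/cgroup.c, since cgroup_mutex is static there; names and
placement are only tentative):

	/* kernel/cgroup.c (tentative), plus a declaration in cgroup.h */
	int cgroup_is_locked(void)
	{
		return mutex_is_locked(&cgroup_mutex);
	}

	/* block/blk-io-throttle.c: enforce the rule instead of only
	 * documenting it in a comment */
	static inline void iothrottle_insert_node(struct iothrottle *iot,
						  struct iothrottle_node *n)
	{
		WARN_ON_ONCE(!cgroup_is_locked());
		list_add_rcu(&n->node, &iot->list);
	}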
>
> > + */
> > +static inline void iothrottle_insert_node(struct iothrottle *iot,
> > + struct iothrottle_node *n)
> > +{
> > + list_add_rcu(&n->node, &iot->list);
> > +}
> > +
> > +/*
> > + * Note: called with iot->lock held.
>
> Ditto.
OK, see above.
>
> > + */
> > +static inline void
> > +iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
> > + struct iothrottle_node *new)
> > +{
> > + list_replace_rcu(&old->node, &new->node);
> > +}
> > +
> > +/*
> > + * Note: called with iot->lock held.
>
> Ditto.
OK, see above.
>
> > + */
> > +static inline void
> > +iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
> > +{
> > + list_del_rcu(&n->node);
> > +}
> > +
> > +/*
> > + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> > + */
> > +static struct cgroup_subsys_state *
> > +iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
> > +{
> > + struct iothrottle *iot;
> > +
> > + if (unlikely((cgrp->parent) == NULL)) {
> > + iot = &init_iothrottle;
> > + } else {
> > + iot = kzalloc(sizeof(*iot), GFP_KERNEL);
> > + if (unlikely(!iot))
> > + return ERR_PTR(-ENOMEM);
> > + }
> > + INIT_LIST_HEAD(&iot->list);
> > +
> > + return &iot->css;
> > +}
> > +
> > +/*
> > + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> > + */
> > +static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
> > +{
> > + struct iothrottle_node *n, *p;
> > + struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
> > +
> > + free_css_id(&iothrottle_subsys, &iot->css);
> > + /*
> > + * don't worry about locking here, at this point there must be not any
> > + * reference to the list.
> > + */
> > + if (!list_empty(&iot->list))
> > + list_for_each_entry_safe(n, p, &iot->list, node)
> > + kfree(n);
> > + kfree(iot);
> > +}
> > +
> > +/*
> > + * NOTE: called with rcu_read_lock() held.
> > + *
> > + * do not care too much about locking for single res_counter values here.
> > + */
> > +static void iothrottle_show_limit(struct seq_file *m, dev_t dev,
> > + struct res_counter *res)
> > +{
> > + if (!res->limit)
> > + return;
> > + seq_printf(m, "%u %u %llu %llu %lli %llu %li\n",
> > + MAJOR(dev), MINOR(dev),
> > + res->limit, res->policy,
> > + (long long)res->usage, res->capacity,
> > + jiffies_to_clock_t(res_counter_ratelimit_delta_t(res)));
>
> OK, looks like the rcu_dereference() in the list_for_each_entry_rcu() in
> the caller suffices here. But thought I should ask the question anyway,
> even though at first glance it does look correct.
>
> > +}
> > +
> > +/*
> > + * NOTE: called with rcu_read_lock() held.
> > + *
> > + */
> > +static void iothrottle_show_failcnt(struct seq_file *m, dev_t dev,
> > + struct iothrottle_stat *stat)
> > +{
> > + unsigned long long bw_count, bw_sleep, iops_count, iops_sleep;
> > +
> > + bw_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_COUNT);
> > + bw_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_SLEEP);
> > + iops_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_COUNT);
> > + iops_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_SLEEP);
> > +
> > + seq_printf(m, "%u %u %llu %li %llu %li\n", MAJOR(dev), MINOR(dev),
> > + bw_count, jiffies_to_clock_t(bw_sleep),
> > + iops_count, jiffies_to_clock_t(iops_sleep));
>
> Ditto.
>
> > +}
> > +
> > +/*
> > + * NOTE: called with rcu_read_lock() held.
> > + */
> > +static void iothrottle_show_stat(struct seq_file *m, dev_t dev,
> > + struct iothrottle_stat *stat)
> > +{
> > + unsigned long long bytes, iops;
> > +
> > + bytes = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BYTES_TOT);
> > + iops = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_TOT);
> > +
> > + seq_printf(m, "%u %u %llu %llu\n", MAJOR(dev), MINOR(dev), bytes, iops);
>
> Ditto.
>
> > +}
> > +
> > +static int iothrottle_read(struct cgroup *cgrp, struct cftype *cft,
> > + struct seq_file *m)
> > +{
> > + struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
> > + struct iothrottle_node *n;
> > +
> > + rcu_read_lock();
> > + if (list_empty(&iot->list))
> > + goto unlock_and_return;
> > + list_for_each_entry_rcu(n, &iot->list, node) {
> > + BUG_ON(!n->dev);
> > + switch (cft->private) {
> > + case IOTHROTTLE_BANDWIDTH:
> > + iothrottle_show_limit(m, n->dev, &n->bw);
> > + break;
> > + case IOTHROTTLE_IOPS:
> > + iothrottle_show_limit(m, n->dev, &n->iops);
> > + break;
> > + case IOTHROTTLE_FAILCNT:
> > + iothrottle_show_failcnt(m, n->dev, &n->stat);
> > + break;
> > + case IOTHROTTLE_STAT:
> > + iothrottle_show_stat(m, n->dev, &n->stat);
> > + break;
> > + }
> > + }
> > +unlock_and_return:
> > + rcu_read_unlock();
> > + return 0;
> > +}
> > +
> > +static dev_t devname2dev_t(const char *buf)
> > +{
> > + struct block_device *bdev;
> > + dev_t dev = 0;
> > + struct gendisk *disk;
> > + int part;
> > +
> > + /* use a lookup to validate the block device */
> > + bdev = lookup_bdev(buf);
> > + if (IS_ERR(bdev))
> > + return 0;
> > + /* only entire devices are allowed, not single partitions */
> > + disk = get_gendisk(bdev->bd_dev, &part);
> > + if (disk && !part) {
> > + BUG_ON(!bdev->bd_inode);
> > + dev = bdev->bd_inode->i_rdev;
> > + }
> > + bdput(bdev);
> > +
> > + return dev;
> > +}
> > +
> > +/*
> > + * The userspace input string must use one of the following syntaxes:
> > + *
> > + * dev:0 <- delete an i/o limiting rule
> > + * dev:io-limit:0 <- set a leaky bucket throttling rule
> > + * dev:io-limit:1:bucket-size <- set a token bucket throttling rule
> > + * dev:io-limit:1 <- set a token bucket throttling rule using
> > + * bucket-size == io-limit
> > + */
> > +static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
> > + dev_t *dev, unsigned long long *iolimit,
> > + unsigned long long *strategy,
> > + unsigned long long *bucket_size)
> > +{
> > + char *p;
> > + int count = 0;
> > + char *s[4];
> > + int ret;
> > +
> > + memset(s, 0, sizeof(s));
> > + *dev = 0;
> > + *iolimit = 0;
> > + *strategy = 0;
> > + *bucket_size = 0;
> > +
> > + /* split the colon-delimited input string into its elements */
> > + while (count < ARRAY_SIZE(s)) {
> > + p = strsep(&buf, ":");
> > + if (!p)
> > + break;
> > + if (!*p)
> > + continue;
> > + s[count++] = p;
> > + }
> > +
> > + /* i/o limit */
> > + if (!s[1])
> > + return -EINVAL;
> > + ret = strict_strtoull(s[1], 10, iolimit);
> > + if (ret < 0)
> > + return ret;
> > + if (!*iolimit)
> > + goto out;
> > + /* throttling strategy (leaky bucket / token bucket) */
> > + if (!s[2])
> > + return -EINVAL;
> > + ret = strict_strtoull(s[2], 10, strategy);
> > + if (ret < 0)
> > + return ret;
> > + switch (*strategy) {
> > + case RATELIMIT_LEAKY_BUCKET:
> > + goto out;
> > + case RATELIMIT_TOKEN_BUCKET:
> > + break;
> > + default:
> > + return -EINVAL;
> > + }
> > + /* bucket size */
> > + if (!s[3])
> > + *bucket_size = *iolimit;
> > + else {
> > + ret = strict_strtoll(s[3], 10, bucket_size);
> > + if (ret < 0)
> > + return ret;
> > + }
> > + if (*bucket_size <= 0)
> > + return -EINVAL;
> > +out:
> > + /* block device number */
> > + *dev = devname2dev_t(s[0]);
> > + return *dev ? 0 : -EINVAL;
> > +}
> > +
> > +static int iothrottle_write(struct cgroup *cgrp, struct cftype *cft,
> > + const char *buffer)
> > +{
> > + struct iothrottle *iot;
> > + struct iothrottle_node *n, *newn = NULL;
> > + dev_t dev;
> > + unsigned long long iolimit, strategy, bucket_size;
> > + char *buf;
> > + size_t nbytes = strlen(buffer);
> > + int ret = 0;
> > +
> > + /*
> > + * We need to allocate a new buffer here, because
> > + * iothrottle_parse_args() can modify it and the buffer provided by
> > + * write_string is supposed to be const.
> > + */
> > + buf = kmalloc(nbytes + 1, GFP_KERNEL);
> > + if (!buf)
> > + return -ENOMEM;
> > + memcpy(buf, buffer, nbytes + 1);
> > +
> > + ret = iothrottle_parse_args(buf, nbytes, cft->private, &dev, &iolimit,
> > + &strategy, &bucket_size);
> > + if (ret)
> > + goto out1;
> > + newn = kzalloc(sizeof(*newn), GFP_KERNEL);
> > + if (!newn) {
> > + ret = -ENOMEM;
> > + goto out1;
> > + }
> > + newn->dev = dev;
> > + res_counter_init(&newn->bw, NULL);
> > + res_counter_init(&newn->iops, NULL);
> > +
> > + switch (cft->private) {
> > + case IOTHROTTLE_BANDWIDTH:
> > + res_counter_ratelimit_set_limit(&newn->iops, 0, 0, 0);
> > + res_counter_ratelimit_set_limit(&newn->bw, strategy,
> > + ALIGN(iolimit, 1024), ALIGN(bucket_size, 1024));
> > + break;
> > + case IOTHROTTLE_IOPS:
> > + res_counter_ratelimit_set_limit(&newn->bw, 0, 0, 0);
> > + /*
> > + * scale up iops cost by a factor of 1000, this allows to apply
> > + * a more fine grained sleeps, and throttling results more
> > + * precise this way.
> > + */
> > + res_counter_ratelimit_set_limit(&newn->iops, strategy,
> > + iolimit * 1000, bucket_size * 1000);
> > + break;
> > + default:
> > + WARN_ON(1);
> > + break;
> > + }
> > +
> > + if (!cgroup_lock_live_group(cgrp)) {
> > + ret = -ENODEV;
> > + goto out1;
> > + }
> > + iot = cgroup_to_iothrottle(cgrp);
> > +
> > + n = iothrottle_search_node(iot, dev);
> > + if (!n) {
> > + if (iolimit) {
> > + /* Add a new block device limiting rule */
> > + iothrottle_insert_node(iot, newn);
> > + newn = NULL;
> > + }
> > + goto out2;
> > + }
> > + switch (cft->private) {
> > + case IOTHROTTLE_BANDWIDTH:
> > + if (!iolimit && !n->iops.limit) {
> > + /* Delete a block device limiting rule */
> > + iothrottle_delete_node(iot, n);
> > + goto out2;
> > + }
> > + if (!n->iops.limit)
> > + break;
> > + /* Update a block device limiting rule */
> > + newn->iops = n->iops;
> > + break;
> > + case IOTHROTTLE_IOPS:
> > + if (!iolimit && !n->bw.limit) {
> > + /* Delete a block device limiting rule */
> > + iothrottle_delete_node(iot, n);
> > + goto out2;
> > + }
> > + if (!n->bw.limit)
> > + break;
> > + /* Update a block device limiting rule */
> > + newn->bw = n->bw;
> > + break;
> > + }
> > + iothrottle_replace_node(iot, n, newn);
> > + newn = NULL;
> > +out2:
> > + cgroup_unlock();
>
> How does the above lock relate to the iot->lock called out in the comment
> headers in the earlier functions? Hmmm... Come to think of it, I don't
> see an acquisition of iot->lock anywhere.
>
> So, what is the story here?
As said before, only the comment in struct iothrottle is correct, we use
cgroup_lock() to protect iot->list, so there's no need to introduce
another lock inside struct iothrottle.
And the other comments about iot->lock must be fixed.
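Spelled out, the rule is just the pattern iothrottle_write() already
follows; a rough sketch (the two helper names below are made up for the
example, they are not in the patch):

	/* writer side: iot->list is changed only under cgroup_lock() */
	static void iothrottle_update_rule(struct iothrottle *iot,
					   struct iothrottle_node *old,
					   struct iothrottle_node *new)
	{
		cgroup_lock();
		iothrottle_replace_node(iot, old, new);
		cgroup_unlock();
		/* readers may still see the old node, so wait for a grace
		 * period before freeing it */
		synchronize_rcu();
		kfree(old);
	}

	/* reader side: rcu_read_lock() is enough, as long as the node is
	 * only used inside the read-side critical section */
	static void iothrottle_account_bytes(struct iothrottle *iot, dev_t dev,
					     ssize_t bytes)
	{
		struct iothrottle_node *n;

		rcu_read_lock();
		n = iothrottle_search_node(iot, dev);
		if (n)
			iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_BYTES_TOT,
					    bytes);
		rcu_read_unlock();
	}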
>
> > + if (n) {
> > + synchronize_rcu();
> > + kfree(n);
> > + }
> > +out1:
> > + kfree(newn);
> > + kfree(buf);
> > + return ret;
> > +}
> > +
> > +static struct cftype files[] = {
> > + {
> > + .name = "bandwidth-max",
> > + .read_seq_string = iothrottle_read,
> > + .write_string = iothrottle_write,
> > + .max_write_len = 256,
> > + .private = IOTHROTTLE_BANDWIDTH,
> > + },
> > + {
> > + .name = "iops-max",
> > + .read_seq_string = iothrottle_read,
> > + .write_string = iothrottle_write,
> > + .max_write_len = 256,
> > + .private = IOTHROTTLE_IOPS,
> > + },
> > + {
> > + .name = "throttlecnt",
> > + .read_seq_string = iothrottle_read,
> > + .private = IOTHROTTLE_FAILCNT,
> > + },
> > + {
> > + .name = "stat",
> > + .read_seq_string = iothrottle_read,
> > + .private = IOTHROTTLE_STAT,
> > + },
> > +};
> > +
> > +static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> > +{
> > + return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
> > +}
> > +
> > +struct cgroup_subsys iothrottle_subsys = {
> > + .name = "blockio",
> > + .create = iothrottle_create,
> > + .destroy = iothrottle_destroy,
> > + .populate = iothrottle_populate,
> > + .subsys_id = iothrottle_subsys_id,
> > + .early_init = 1,
> > + .use_id = 1,
> > +};
> > +
> > +/*
> > + * NOTE: called with rcu_read_lock() held.
> > + */
> > +static void iothrottle_evaluate_sleep(struct iothrottle_sleep *sleep,
> > + struct iothrottle *iot,
> > + struct block_device *bdev, ssize_t bytes)
> > +{
> > + struct iothrottle_node *n;
> > + dev_t dev;
> > +
> > + BUG_ON(!iot);
> > +
> > + /* accounting and throttling is done only on entire block devices */
> > + dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev), bdev->bd_disk->first_minor);
> > + n = iothrottle_search_node(iot, dev);
> > + if (!n)
> > + return;
> > +
> > + /* Update statistics */
> > + iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_BYTES_TOT, bytes);
> > + if (bytes)
> > + iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_IOPS_TOT, 1);
> > +
> > + /* Evaluate sleep values */
> > + sleep->bw_sleep = res_counter_ratelimit_sleep(&n->bw, bytes);
> > + /*
> > + * scale up iops cost by a factor of 1000, this allows to apply
> > + * a more fine grained sleeps, and throttling works better in
> > + * this way.
> > + *
> > + * Note: do not account any i/o operation if bytes is negative or zero.
> > + */
> > + sleep->iops_sleep = res_counter_ratelimit_sleep(&n->iops,
> > + bytes ? 1000 : 0);
> > +}
> > +
> > +/*
> > + * NOTE: called with rcu_read_lock() held.
> > + */
> > +static void iothrottle_acct_stat(struct iothrottle *iot,
> > + struct block_device *bdev, int type,
> > + unsigned long long sleep)
> > +{
> > + struct iothrottle_node *n;
> > + dev_t dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev),
> > + bdev->bd_disk->first_minor);
> > +
> > + n = iothrottle_search_node(iot, dev);
> > + if (!n)
> > + return;
> > + iothrottle_stat_add_sleep(&n->stat, type, sleep);
> > +}
> > +
> > +static void iothrottle_acct_task_stat(int type, unsigned long long sleep)
> > +{
> > + /*
> > + * XXX: per-task statistics may be inaccurate (this is not a
> > + * critical issue, anyway, respect to introduce locking
> > + * overhead or increase the size of task_struct).
> > + */
> > + switch (type) {
> > + case IOTHROTTLE_BANDWIDTH:
> > + current->io_throttle_bw_cnt++;
> > + current->io_throttle_bw_sleep += sleep;
> > + break;
> > +
> > + case IOTHROTTLE_IOPS:
> > + current->io_throttle_iops_cnt++;
> > + current->io_throttle_iops_sleep += sleep;
> > + break;
> > + }
> > +}
> > +
> > +/*
> > + * A helper function to get iothrottle from css id.
> > + *
> > + * NOTE: must be called under rcu_read_lock(). The caller must check
> > + * css_is_removed() or some if it's concern.
> > + */
> > +static struct iothrottle *iothrottle_lookup(unsigned long id)
> > +{
> > + struct cgroup_subsys_state *css;
> > +
> > + if (!id)
> > + return NULL;
> > + css = css_lookup(&iothrottle_subsys, id);
> > + if (!css)
> > + return NULL;
> > + return container_of(css, struct iothrottle, css);
> > +}
> > +
> > +static struct iothrottle *get_iothrottle_from_page(struct page *page)
> > +{
> > + struct iothrottle *iot;
> > + unsigned long id;
> > +
> > + BUG_ON(!page);
> > + id = page_cgroup_get_owner(page);
> > +
> > + rcu_read_lock();
> > + iot = iothrottle_lookup(id);
> > + if (!iot)
> > + goto out;
> > + css_get(&iot->css);
> > +out:
> > + rcu_read_unlock();
> > + return iot;
> > +}
> > +
> > +static struct iothrottle *get_iothrottle_from_bio(struct bio *bio)
> > +{
> > + if (!bio)
> > + return NULL;
> > + return get_iothrottle_from_page(bio_page(bio));
> > +}
> > +
> > +int iothrottle_set_page_owner(struct page *page, struct mm_struct *mm)
> > +{
> > + struct iothrottle *iot;
> > + unsigned short id = 0;
> > +
> > + if (iothrottle_disabled())
> > + return 0;
> > + if (!mm)
> > + goto out;
> > + rcu_read_lock();
> > + iot = task_to_iothrottle(rcu_dereference(mm->owner));
>
> Given that task_to_iothrottle() calls task_subsys_state(), which contains
> an rcu_dereference(), why is the rcu_dereference() above required?
> (There might well be a good reason, just cannot see it right offhand.)
The first rcu_dereference() is required to safely get a task_struct from
mm_struct. The second rcu_dereference() inside task_to_iothrottle() is
required to safely get the struct iothrottle from task_struct.
Thanks for your comments!
-Andrea
On Sat, 18 Apr 2009 23:38:27 +0200
Andrea Righi <[email protected]> wrote:
> Introduce attributes and functions in res_counter to implement throttling-based
> cgroup subsystems.
>
> The following attributes have been added to struct res_counter:
> * @policy: the limiting policy / algorithm
> * @capacity: the maximum capacity of the resource
> * @timestamp: timestamp of the last accounted resource request
>
> Currently the available policies are: token-bucket and leaky-bucket and the
> attribute @capacity is only used by token-bucket policy (to represent the
> bucket size).
>
> The following function has been implemented to return the amount of time a
> cgroup should sleep to remain within the defined resource limits.
>
> unsigned long long
> res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
>
> [ Note: only the interfaces needed by the cgroup IO controller are implemented
> right now ]
>
> Signed-off-by: Andrea Righi <[email protected]>
> ---
> include/linux/res_counter.h | 69 +++++++++++++++++++++++++++++++----------
> kernel/res_counter.c | 72 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 124 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 4c5bcf6..9bed6af 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -14,30 +14,36 @@
> */
>
> #include <linux/cgroup.h>
> +#include <linux/jiffies.h>
>
> -/*
> - * The core object. the cgroup that wishes to account for some
> - * resource may include this counter into its structures and use
> - * the helpers described beyond
> - */
> +/* The various policies that can be used for ratelimiting resources */
> +#define RATELIMIT_LEAKY_BUCKET 0
> +#define RATELIMIT_TOKEN_BUCKET 1
>
> +/**
> + * struct res_counter - the core object to account cgroup resources
> + *
> + * @usage: the current resource consumption level
> + * @max_usage: the maximal value of the usage from the counter creation
> + * @limit: the limit that usage cannot be exceeded
> + * @failcnt: the number of unsuccessful attempts to consume the resource
> + * @policy: the limiting policy / algorithm
> + * @capacity: the maximum capacity of the resource
> + * @timestamp: timestamp of the last accounted resource request
> + * @lock: the lock to protect all of the above.
> + * The routines below consider this to be IRQ-safe
> + *
> + * The cgroup that wishes to account for some resource may include this counter
> + * into its structures and use the helpers described beyond.
> + */
> struct res_counter {
> - /*
> - * the current resource consumption level
> - */
> unsigned long long usage;
> - /*
> - * the maximal value of the usage from the counter creation
> - */
> unsigned long long max_usage;
> - /*
> - * the limit that usage cannot exceed
> - */
> unsigned long long limit;
> - /*
> - * the number of unsuccessful attempts to consume the resource
> - */
> unsigned long long failcnt;
> + unsigned long long policy;
> + unsigned long long capacity;
> + unsigned long long timestamp;
>
Andrea, sizeof(struct res_counter) is getting close to 128 bytes (and
someone may add more fields later).
Could you check again whether "unsigned long" or "unsigned int" would be
enough for the new fields?
It would be bad if, in the future, the spinlock ended up on a different
cacheline than the data fields it protects.
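To make the suggestion concrete, the kind of change meant here is roughly
the following (field list abridged; whether 32 bits are really enough for
each new field is exactly what needs to be double-checked):

	struct res_counter {
		unsigned long long usage;
		unsigned long long max_usage;
		unsigned long long limit;
		unsigned long long failcnt;
		unsigned int policy;		/* was unsigned long long */
		unsigned long long capacity;
		unsigned long long timestamp;
		spinlock_t lock;
		/* remaining fields unchanged */
	};

capacity and timestamp would deserve the same review.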
Thanks,
-Kame
On Mon, Apr 20, 2009 at 11:22:27PM +0200, Andrea Righi wrote:
> On Mon, Apr 20, 2009 at 10:59:04AM -0700, Paul E. McKenney wrote:
> > On Sat, Apr 18, 2009 at 11:38:29PM +0200, Andrea Righi wrote:
> > > This is the core of the io-throttle kernel infrastructure. It creates
> > > the basic interfaces to the cgroup subsystem and implements the I/O
> > > measurement and throttling functionality.
> >
> > A few questions interspersed below.
> >
> > Thanx, Paul
> >
> > > Signed-off-by: Gui Jianfeng <[email protected]>
> > > Signed-off-by: Andrea Righi <[email protected]>
> > > ---
> > > block/Makefile | 1 +
> > > block/blk-io-throttle.c | 822 +++++++++++++++++++++++++++++++++++++++
> > > include/linux/blk-io-throttle.h | 144 +++++++
> > > include/linux/cgroup_subsys.h | 6 +
> > > init/Kconfig | 12 +
> > > 5 files changed, 985 insertions(+), 0 deletions(-)
> > > create mode 100644 block/blk-io-throttle.c
> > > create mode 100644 include/linux/blk-io-throttle.h
> > >
> > > diff --git a/block/Makefile b/block/Makefile
> > > index e9fa4dd..42b6a46 100644
> > > --- a/block/Makefile
> > > +++ b/block/Makefile
> > > @@ -13,5 +13,6 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
> > > obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
> > > obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
> > >
> > > +obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
> > > obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
> > > obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
> > > diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
> > > new file mode 100644
> > > index 0000000..c8214fc
> > > --- /dev/null
> > > +++ b/block/blk-io-throttle.c
> > > @@ -0,0 +1,822 @@
> > > +/*
> > > + * blk-io-throttle.c
> > > + *
> > > + * This program is free software; you can redistribute it and/or
> > > + * modify it under the terms of the GNU General Public
> > > + * License as published by the Free Software Foundation; either
> > > + * version 2 of the License, or (at your option) any later version.
> > > + *
> > > + * This program is distributed in the hope that it will be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > > + * General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU General Public
> > > + * License along with this program; if not, write to the
> > > + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> > > + * Boston, MA 021110-1307, USA.
> > > + *
> > > + * Copyright (C) 2008 Andrea Righi <[email protected]>
> > > + */
> > > +
> > > +#include <linux/init.h>
> > > +#include <linux/module.h>
> > > +#include <linux/res_counter.h>
> > > +#include <linux/memcontrol.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/gfp.h>
> > > +#include <linux/err.h>
> > > +#include <linux/genhd.h>
> > > +#include <linux/hardirq.h>
> > > +#include <linux/list.h>
> > > +#include <linux/seq_file.h>
> > > +#include <linux/spinlock.h>
> > > +#include <linux/blk-io-throttle.h>
> > > +#include <linux/mm.h>
> > > +#include <linux/page_cgroup.h>
> > > +#include <linux/sched.h>
> > > +#include <linux/bio.h>
> > > +
> > > +/*
> > > + * Statistics for I/O bandwidth controller.
> > > + */
> > > +enum iothrottle_stat_index {
> > > + /* # of times the cgroup has been throttled for bw limit */
> > > + IOTHROTTLE_STAT_BW_COUNT,
> > > + /* # of jiffies spent to sleep for throttling for bw limit */
> > > + IOTHROTTLE_STAT_BW_SLEEP,
> > > + /* # of times the cgroup has been throttled for iops limit */
> > > + IOTHROTTLE_STAT_IOPS_COUNT,
> > > + /* # of jiffies spent to sleep for throttling for iops limit */
> > > + IOTHROTTLE_STAT_IOPS_SLEEP,
> > > + /* total number of bytes read and written */
> > > + IOTHROTTLE_STAT_BYTES_TOT,
> > > + /* total number of I/O operations */
> > > + IOTHROTTLE_STAT_IOPS_TOT,
> > > +
> > > + IOTHROTTLE_STAT_NSTATS,
> > > +};
> > > +
> > > +struct iothrottle_stat_cpu {
> > > + unsigned long long count[IOTHROTTLE_STAT_NSTATS];
> > > +} ____cacheline_aligned_in_smp;
> > > +
> > > +struct iothrottle_stat {
> > > + struct iothrottle_stat_cpu cpustat[NR_CPUS];
> > > +};
> > > +
> > > +static void iothrottle_stat_add(struct iothrottle_stat *stat,
> > > + enum iothrottle_stat_index type, unsigned long long val)
> > > +{
> > > + int cpu = get_cpu();
> > > +
> > > + stat->cpustat[cpu].count[type] += val;
> > > + put_cpu();
> > > +}
> > > +
> > > +static void iothrottle_stat_add_sleep(struct iothrottle_stat *stat,
> > > + int type, unsigned long long sleep)
> > > +{
> > > + int cpu = get_cpu();
> > > +
> > > + switch (type) {
> > > + case IOTHROTTLE_BANDWIDTH:
> > > + stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_COUNT]++;
> > > + stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_SLEEP] += sleep;
> > > + break;
> > > + case IOTHROTTLE_IOPS:
> > > + stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_COUNT]++;
> > > + stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_SLEEP] += sleep;
> > > + break;
> > > + }
> > > + put_cpu();
> > > +}
> > > +
> > > +static unsigned long long iothrottle_read_stat(struct iothrottle_stat *stat,
> > > + enum iothrottle_stat_index idx)
> > > +{
> > > + int cpu;
> > > + unsigned long long ret = 0;
> > > +
> > > + for_each_possible_cpu(cpu)
> > > + ret += stat->cpustat[cpu].count[idx];
> > > + return ret;
> > > +}
> > > +
> > > +struct iothrottle_sleep {
> > > + unsigned long long bw_sleep;
> > > + unsigned long long iops_sleep;
> > > +};
> > > +
> > > +/*
> > > + * struct iothrottle_node - throttling rule of a single block device
> > > + * @node: list of per block device throttling rules
> > > + * @dev: block device number, used as key in the list
> > > + * @bw: max i/o bandwidth (in bytes/s)
> > > + * @iops: max i/o operations per second
> > > + * @stat: throttling statistics
> > > + *
> > > + * Define a i/o throttling rule for a single block device.
> > > + *
> > > + * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
> > > + * the limiting rules defined for that device persist and they are still valid
> > > + * if a new device is plugged and it uses the same dev_t number.
> > > + */
> > > +struct iothrottle_node {
> > > + struct list_head node;
> > > + dev_t dev;
> > > + struct res_counter bw;
> > > + struct res_counter iops;
> > > + struct iothrottle_stat stat;
> > > +};
> > > +
> > > +/**
> > > + * struct iothrottle - throttling rules for a cgroup
> > > + * @css: pointer to the cgroup state
> > > + * @list: list of iothrottle_node elements
> > > + *
> > > + * Define multiple per-block device i/o throttling rules.
> > > + * Note: the list of the throttling rules is protected by RCU locking:
> > > + * - hold cgroup_lock() for update.
> > > + * - hold rcu_read_lock() for read.
> > > + */
> > > +struct iothrottle {
> > > + struct cgroup_subsys_state css;
> > > + struct list_head list;
> > > +};
> > > +static struct iothrottle init_iothrottle;
> > > +
> > > +static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
> > > +{
> > > + return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
> > > + struct iothrottle, css);
> > > +}
> > > +
> > > +/*
> > > + * Note: called with rcu_read_lock() held.
> > > + */
> > > +static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
> > > +{
> > > + return container_of(task_subsys_state(task, iothrottle_subsys_id),
> > > + struct iothrottle, css);
> >
> > OK, task_subsys_state() has an rcu_dereference(), so...
>
> Do you mean the comment is obvious and it can be just removed?
Sorry, no, I mean "this code looks OK to me".
> > > +}
> > > +
> > > +/*
> > > + * Note: called with rcu_read_lock() or iot->lock held.
> > > + */
> > > +static struct iothrottle_node *
> > > +iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
> > > +{
> > > + struct iothrottle_node *n;
> > > +
> > > + if (list_empty(&iot->list))
> > > + return NULL;
> > > + list_for_each_entry_rcu(n, &iot->list, node)
> > > + if (n->dev == dev)
> > > + return n;
> > > + return NULL;
> > > +}
> > > +
> > > +/*
> > > + * Note: called with iot->lock held.
> >
> > Should this be a WARN_ON() or something similar? The machine is unable
> > to enforce a comment. ;-)
>
> Right. :) Actually this is an old and never fixed comment... the
> iot->list is always modified only under cgroup_lock(), so there's no
> need to introduce another lock in struct iothrottle.
Ah!!! That explains why I couldn't find an iot->lock acquisition. ;-)
> Anyway, adding a WARN_ON() seems a good idea, probably a
> WARN_ON_ONCE(!cgroup_is_locked()); that also means defining
> cgroup_is_locked(), because cgroup_mutex is not exported outside
> kernel/cgroup.c.
>
> I'll fix the comment and add the check.
Very good!
> > > + */
> > > +static inline void iothrottle_insert_node(struct iothrottle *iot,
> > > + struct iothrottle_node *n)
> > > +{
> > > + list_add_rcu(&n->node, &iot->list);
> > > +}
> > > +
> > > +/*
> > > + * Note: called with iot->lock held.
> >
> > Ditto.
>
> OK, see above.
>
> >
> > > + */
> > > +static inline void
> > > +iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
> > > + struct iothrottle_node *new)
> > > +{
> > > + list_replace_rcu(&old->node, &new->node);
> > > +}
> > > +
> > > +/*
> > > + * Note: called with iot->lock held.
> >
> > Ditto.
>
> OK, see above.
>
> >
> > > + */
> > > +static inline void
> > > +iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
> > > +{
> > > + list_del_rcu(&n->node);
> > > +}
> > > +
> > > +/*
> > > + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> > > + */
> > > +static struct cgroup_subsys_state *
> > > +iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
> > > +{
> > > + struct iothrottle *iot;
> > > +
> > > + if (unlikely((cgrp->parent) == NULL)) {
> > > + iot = &init_iothrottle;
> > > + } else {
> > > + iot = kzalloc(sizeof(*iot), GFP_KERNEL);
> > > + if (unlikely(!iot))
> > > + return ERR_PTR(-ENOMEM);
> > > + }
> > > + INIT_LIST_HEAD(&iot->list);
> > > +
> > > + return &iot->css;
> > > +}
> > > +
> > > +/*
> > > + * Note: called from kernel/cgroup.c with cgroup_lock() held.
> > > + */
> > > +static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
> > > +{
> > > + struct iothrottle_node *n, *p;
> > > + struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
> > > +
> > > + free_css_id(&iothrottle_subsys, &iot->css);
> > > + /*
> > > + * don't worry about locking here, at this point there must be not any
> > > + * reference to the list.
> > > + */
> > > + if (!list_empty(&iot->list))
> > > + list_for_each_entry_safe(n, p, &iot->list, node)
> > > + kfree(n);
> > > + kfree(iot);
> > > +}
> > > +
> > > +/*
> > > + * NOTE: called with rcu_read_lock() held.
> > > + *
> > > + * do not care too much about locking for single res_counter values here.
> > > + */
> > > +static void iothrottle_show_limit(struct seq_file *m, dev_t dev,
> > > + struct res_counter *res)
> > > +{
> > > + if (!res->limit)
> > > + return;
> > > + seq_printf(m, "%u %u %llu %llu %lli %llu %li\n",
> > > + MAJOR(dev), MINOR(dev),
> > > + res->limit, res->policy,
> > > + (long long)res->usage, res->capacity,
> > > + jiffies_to_clock_t(res_counter_ratelimit_delta_t(res)));
> >
> > OK, looks like the rcu_dereference() in the list_for_each_entry_rcu() in
> > the caller suffices here. But thought I should ask the question anyway,
> > even though at first glance it does look correct.
> >
> > > +}
> > > +
> > > +/*
> > > + * NOTE: called with rcu_read_lock() held.
> > > + *
> > > + */
> > > +static void iothrottle_show_failcnt(struct seq_file *m, dev_t dev,
> > > + struct iothrottle_stat *stat)
> > > +{
> > > + unsigned long long bw_count, bw_sleep, iops_count, iops_sleep;
> > > +
> > > + bw_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_COUNT);
> > > + bw_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_SLEEP);
> > > + iops_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_COUNT);
> > > + iops_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_SLEEP);
> > > +
> > > + seq_printf(m, "%u %u %llu %li %llu %li\n", MAJOR(dev), MINOR(dev),
> > > + bw_count, jiffies_to_clock_t(bw_sleep),
> > > + iops_count, jiffies_to_clock_t(iops_sleep));
> >
> > Ditto.
> >
> > > +}
> > > +
> > > +/*
> > > + * NOTE: called with rcu_read_lock() held.
> > > + */
> > > +static void iothrottle_show_stat(struct seq_file *m, dev_t dev,
> > > + struct iothrottle_stat *stat)
> > > +{
> > > + unsigned long long bytes, iops;
> > > +
> > > + bytes = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BYTES_TOT);
> > > + iops = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_TOT);
> > > +
> > > + seq_printf(m, "%u %u %llu %llu\n", MAJOR(dev), MINOR(dev), bytes, iops);
> >
> > Ditto.
> >
> > > +}
> > > +
> > > +static int iothrottle_read(struct cgroup *cgrp, struct cftype *cft,
> > > + struct seq_file *m)
> > > +{
> > > + struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
> > > + struct iothrottle_node *n;
> > > +
> > > + rcu_read_lock();
> > > + if (list_empty(&iot->list))
> > > + goto unlock_and_return;
> > > + list_for_each_entry_rcu(n, &iot->list, node) {
> > > + BUG_ON(!n->dev);
> > > + switch (cft->private) {
> > > + case IOTHROTTLE_BANDWIDTH:
> > > + iothrottle_show_limit(m, n->dev, &n->bw);
> > > + break;
> > > + case IOTHROTTLE_IOPS:
> > > + iothrottle_show_limit(m, n->dev, &n->iops);
> > > + break;
> > > + case IOTHROTTLE_FAILCNT:
> > > + iothrottle_show_failcnt(m, n->dev, &n->stat);
> > > + break;
> > > + case IOTHROTTLE_STAT:
> > > + iothrottle_show_stat(m, n->dev, &n->stat);
> > > + break;
> > > + }
> > > + }
> > > +unlock_and_return:
> > > + rcu_read_unlock();
> > > + return 0;
> > > +}
> > > +
> > > +static dev_t devname2dev_t(const char *buf)
> > > +{
> > > + struct block_device *bdev;
> > > + dev_t dev = 0;
> > > + struct gendisk *disk;
> > > + int part;
> > > +
> > > + /* use a lookup to validate the block device */
> > > + bdev = lookup_bdev(buf);
> > > + if (IS_ERR(bdev))
> > > + return 0;
> > > + /* only entire devices are allowed, not single partitions */
> > > + disk = get_gendisk(bdev->bd_dev, &part);
> > > + if (disk && !part) {
> > > + BUG_ON(!bdev->bd_inode);
> > > + dev = bdev->bd_inode->i_rdev;
> > > + }
> > > + bdput(bdev);
> > > +
> > > + return dev;
> > > +}
> > > +
> > > +/*
> > > + * The userspace input string must use one of the following syntaxes:
> > > + *
> > > + * dev:0 <- delete an i/o limiting rule
> > > + * dev:io-limit:0 <- set a leaky bucket throttling rule
> > > + * dev:io-limit:1:bucket-size <- set a token bucket throttling rule
> > > + * dev:io-limit:1 <- set a token bucket throttling rule using
> > > + * bucket-size == io-limit
> > > + */
> > > +static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
> > > + dev_t *dev, unsigned long long *iolimit,
> > > + unsigned long long *strategy,
> > > + unsigned long long *bucket_size)
> > > +{
> > > + char *p;
> > > + int count = 0;
> > > + char *s[4];
> > > + int ret;
> > > +
> > > + memset(s, 0, sizeof(s));
> > > + *dev = 0;
> > > + *iolimit = 0;
> > > + *strategy = 0;
> > > + *bucket_size = 0;
> > > +
> > > + /* split the colon-delimited input string into its elements */
> > > + while (count < ARRAY_SIZE(s)) {
> > > + p = strsep(&buf, ":");
> > > + if (!p)
> > > + break;
> > > + if (!*p)
> > > + continue;
> > > + s[count++] = p;
> > > + }
> > > +
> > > + /* i/o limit */
> > > + if (!s[1])
> > > + return -EINVAL;
> > > + ret = strict_strtoull(s[1], 10, iolimit);
> > > + if (ret < 0)
> > > + return ret;
> > > + if (!*iolimit)
> > > + goto out;
> > > + /* throttling strategy (leaky bucket / token bucket) */
> > > + if (!s[2])
> > > + return -EINVAL;
> > > + ret = strict_strtoull(s[2], 10, strategy);
> > > + if (ret < 0)
> > > + return ret;
> > > + switch (*strategy) {
> > > + case RATELIMIT_LEAKY_BUCKET:
> > > + goto out;
> > > + case RATELIMIT_TOKEN_BUCKET:
> > > + break;
> > > + default:
> > > + return -EINVAL;
> > > + }
> > > + /* bucket size */
> > > + if (!s[3])
> > > + *bucket_size = *iolimit;
> > > + else {
> > > + ret = strict_strtoll(s[3], 10, bucket_size);
> > > + if (ret < 0)
> > > + return ret;
> > > + }
> > > + if (*bucket_size <= 0)
> > > + return -EINVAL;
> > > +out:
> > > + /* block device number */
> > > + *dev = devname2dev_t(s[0]);
> > > + return *dev ? 0 : -EINVAL;
> > > +}
> > > +
> > > +static int iothrottle_write(struct cgroup *cgrp, struct cftype *cft,
> > > + const char *buffer)
> > > +{
> > > + struct iothrottle *iot;
> > > + struct iothrottle_node *n, *newn = NULL;
> > > + dev_t dev;
> > > + unsigned long long iolimit, strategy, bucket_size;
> > > + char *buf;
> > > + size_t nbytes = strlen(buffer);
> > > + int ret = 0;
> > > +
> > > + /*
> > > + * We need to allocate a new buffer here, because
> > > + * iothrottle_parse_args() can modify it and the buffer provided by
> > > + * write_string is supposed to be const.
> > > + */
> > > + buf = kmalloc(nbytes + 1, GFP_KERNEL);
> > > + if (!buf)
> > > + return -ENOMEM;
> > > + memcpy(buf, buffer, nbytes + 1);
> > > +
> > > + ret = iothrottle_parse_args(buf, nbytes, cft->private, &dev, &iolimit,
> > > + &strategy, &bucket_size);
> > > + if (ret)
> > > + goto out1;
> > > + newn = kzalloc(sizeof(*newn), GFP_KERNEL);
> > > + if (!newn) {
> > > + ret = -ENOMEM;
> > > + goto out1;
> > > + }
> > > + newn->dev = dev;
> > > + res_counter_init(&newn->bw, NULL);
> > > + res_counter_init(&newn->iops, NULL);
> > > +
> > > + switch (cft->private) {
> > > + case IOTHROTTLE_BANDWIDTH:
> > > + res_counter_ratelimit_set_limit(&newn->iops, 0, 0, 0);
> > > + res_counter_ratelimit_set_limit(&newn->bw, strategy,
> > > + ALIGN(iolimit, 1024), ALIGN(bucket_size, 1024));
> > > + break;
> > > + case IOTHROTTLE_IOPS:
> > > + res_counter_ratelimit_set_limit(&newn->bw, 0, 0, 0);
> > > + /*
> > > + * scale up iops cost by a factor of 1000, this allows to apply
> > > + * a more fine grained sleeps, and throttling results more
> > > + * precise this way.
> > > + */
> > > + res_counter_ratelimit_set_limit(&newn->iops, strategy,
> > > + iolimit * 1000, bucket_size * 1000);
> > > + break;
> > > + default:
> > > + WARN_ON(1);
> > > + break;
> > > + }
> > > +
> > > + if (!cgroup_lock_live_group(cgrp)) {
> > > + ret = -ENODEV;
> > > + goto out1;
> > > + }
> > > + iot = cgroup_to_iothrottle(cgrp);
> > > +
> > > + n = iothrottle_search_node(iot, dev);
> > > + if (!n) {
> > > + if (iolimit) {
> > > + /* Add a new block device limiting rule */
> > > + iothrottle_insert_node(iot, newn);
> > > + newn = NULL;
> > > + }
> > > + goto out2;
> > > + }
> > > + switch (cft->private) {
> > > + case IOTHROTTLE_BANDWIDTH:
> > > + if (!iolimit && !n->iops.limit) {
> > > + /* Delete a block device limiting rule */
> > > + iothrottle_delete_node(iot, n);
> > > + goto out2;
> > > + }
> > > + if (!n->iops.limit)
> > > + break;
> > > + /* Update a block device limiting rule */
> > > + newn->iops = n->iops;
> > > + break;
> > > + case IOTHROTTLE_IOPS:
> > > + if (!iolimit && !n->bw.limit) {
> > > + /* Delete a block device limiting rule */
> > > + iothrottle_delete_node(iot, n);
> > > + goto out2;
> > > + }
> > > + if (!n->bw.limit)
> > > + break;
> > > + /* Update a block device limiting rule */
> > > + newn->bw = n->bw;
> > > + break;
> > > + }
> > > + iothrottle_replace_node(iot, n, newn);
> > > + newn = NULL;
> > > +out2:
> > > + cgroup_unlock();
> >
> > How does the above lock relate to the iot->lock called out in the comment
> > headers in the earlier functions? Hmmm... Come to think of it, I don't
> > see an acquisition of iot->lock anywhere.
> >
> > So, what is the story here?
>
> As said before, only the comment in struct iothrottle is correct, we use
> cgroup_lock() to protect iot->list, so there's no need to introduce
> another lock inside struct iothrottle.
>
> And the other comments about iot->lock must be fixed.
Sounds good!
So this code is compiled into the kernel only when cgroups are defined,
correct? Otherwise, cgroup_lock() seems to be an empty function.
> > > + if (n) {
> > > + synchronize_rcu();
> > > + kfree(n);
> > > + }
> > > +out1:
> > > + kfree(newn);
> > > + kfree(buf);
> > > + return ret;
> > > +}
> > > +
> > > +static struct cftype files[] = {
> > > + {
> > > + .name = "bandwidth-max",
> > > + .read_seq_string = iothrottle_read,
> > > + .write_string = iothrottle_write,
> > > + .max_write_len = 256,
> > > + .private = IOTHROTTLE_BANDWIDTH,
> > > + },
> > > + {
> > > + .name = "iops-max",
> > > + .read_seq_string = iothrottle_read,
> > > + .write_string = iothrottle_write,
> > > + .max_write_len = 256,
> > > + .private = IOTHROTTLE_IOPS,
> > > + },
> > > + {
> > > + .name = "throttlecnt",
> > > + .read_seq_string = iothrottle_read,
> > > + .private = IOTHROTTLE_FAILCNT,
> > > + },
> > > + {
> > > + .name = "stat",
> > > + .read_seq_string = iothrottle_read,
> > > + .private = IOTHROTTLE_STAT,
> > > + },
> > > +};
> > > +
> > > +static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> > > +{
> > > + return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
> > > +}
> > > +
> > > +struct cgroup_subsys iothrottle_subsys = {
> > > + .name = "blockio",
> > > + .create = iothrottle_create,
> > > + .destroy = iothrottle_destroy,
> > > + .populate = iothrottle_populate,
> > > + .subsys_id = iothrottle_subsys_id,
> > > + .early_init = 1,
> > > + .use_id = 1,
> > > +};
> > > +
> > > +/*
> > > + * NOTE: called with rcu_read_lock() held.
> > > + */
> > > +static void iothrottle_evaluate_sleep(struct iothrottle_sleep *sleep,
> > > + struct iothrottle *iot,
> > > + struct block_device *bdev, ssize_t bytes)
> > > +{
> > > + struct iothrottle_node *n;
> > > + dev_t dev;
> > > +
> > > + BUG_ON(!iot);
> > > +
> > > + /* accounting and throttling is done only on entire block devices */
> > > + dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev), bdev->bd_disk->first_minor);
> > > + n = iothrottle_search_node(iot, dev);
> > > + if (!n)
> > > + return;
> > > +
> > > + /* Update statistics */
> > > + iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_BYTES_TOT, bytes);
> > > + if (bytes)
> > > + iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_IOPS_TOT, 1);
> > > +
> > > + /* Evaluate sleep values */
> > > + sleep->bw_sleep = res_counter_ratelimit_sleep(&n->bw, bytes);
> > > + /*
> > > + * scale up iops cost by a factor of 1000, this allows to apply
> > > + * a more fine grained sleeps, and throttling works better in
> > > + * this way.
> > > + *
> > > + * Note: do not account any i/o operation if bytes is negative or zero.
> > > + */
> > > + sleep->iops_sleep = res_counter_ratelimit_sleep(&n->iops,
> > > + bytes ? 1000 : 0);
> > > +}
> > > +
> > > +/*
> > > + * NOTE: called with rcu_read_lock() held.
> > > + */
> > > +static void iothrottle_acct_stat(struct iothrottle *iot,
> > > + struct block_device *bdev, int type,
> > > + unsigned long long sleep)
> > > +{
> > > + struct iothrottle_node *n;
> > > + dev_t dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev),
> > > + bdev->bd_disk->first_minor);
> > > +
> > > + n = iothrottle_search_node(iot, dev);
> > > + if (!n)
> > > + return;
> > > + iothrottle_stat_add_sleep(&n->stat, type, sleep);
> > > +}
> > > +
> > > +static void iothrottle_acct_task_stat(int type, unsigned long long sleep)
> > > +{
> > > + /*
> > > + * XXX: per-task statistics may be inaccurate (this is not a
> > > + * critical issue, anyway, respect to introduce locking
> > > + * overhead or increase the size of task_struct).
> > > + */
> > > + switch (type) {
> > > + case IOTHROTTLE_BANDWIDTH:
> > > + current->io_throttle_bw_cnt++;
> > > + current->io_throttle_bw_sleep += sleep;
> > > + break;
> > > +
> > > + case IOTHROTTLE_IOPS:
> > > + current->io_throttle_iops_cnt++;
> > > + current->io_throttle_iops_sleep += sleep;
> > > + break;
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * A helper function to get iothrottle from css id.
> > > + *
> > > + * NOTE: must be called under rcu_read_lock(). The caller must check
> > > + * css_is_removed() or some if it's concern.
> > > + */
> > > +static struct iothrottle *iothrottle_lookup(unsigned long id)
> > > +{
> > > + struct cgroup_subsys_state *css;
> > > +
> > > + if (!id)
> > > + return NULL;
> > > + css = css_lookup(&iothrottle_subsys, id);
> > > + if (!css)
> > > + return NULL;
> > > + return container_of(css, struct iothrottle, css);
> > > +}
> > > +
> > > +static struct iothrottle *get_iothrottle_from_page(struct page *page)
> > > +{
> > > + struct iothrottle *iot;
> > > + unsigned long id;
> > > +
> > > + BUG_ON(!page);
> > > + id = page_cgroup_get_owner(page);
> > > +
> > > + rcu_read_lock();
> > > + iot = iothrottle_lookup(id);
> > > + if (!iot)
> > > + goto out;
> > > + css_get(&iot->css);
> > > +out:
> > > + rcu_read_unlock();
> > > + return iot;
> > > +}
> > > +
> > > +static struct iothrottle *get_iothrottle_from_bio(struct bio *bio)
> > > +{
> > > + if (!bio)
> > > + return NULL;
> > > + return get_iothrottle_from_page(bio_page(bio));
> > > +}
> > > +
> > > +int iothrottle_set_page_owner(struct page *page, struct mm_struct *mm)
> > > +{
> > > + struct iothrottle *iot;
> > > + unsigned short id = 0;
> > > +
> > > + if (iothrottle_disabled())
> > > + return 0;
> > > + if (!mm)
> > > + goto out;
> > > + rcu_read_lock();
> > > + iot = task_to_iothrottle(rcu_dereference(mm->owner));
> >
> > Given that task_to_iothrottle() calls task_subsys_state(), which contains
> > an rcu_dereference(), why is the rcu_dereference() above required?
> > (There might well be a good reason, just cannot see it right offhand.)
>
> The first rcu_dereference() is required to safely get a task_struct from
> mm_struct. The second rcu_dereference() inside task_to_iothrottle() is
> required to safely get the struct iothrottle from task_struct.
Why not put the rcu_dereference() down inside task_to_iothrottle()?
> Thanks for your comments!
NP, thanks for working on this!
Thanx, Paul
On Tue, Apr 21, 2009 at 09:15:34AM +0900, KAMEZAWA Hiroyuki wrote:
> On Sat, 18 Apr 2009 23:38:27 +0200
> Andrea Righi <[email protected]> wrote:
>
> > Introduce attributes and functions in res_counter to implement throttling-based
> > cgroup subsystems.
> >
> > The following attributes have been added to struct res_counter:
> > * @policy: the limiting policy / algorithm
> > * @capacity: the maximum capacity of the resource
> > * @timestamp: timestamp of the last accounted resource request
> >
> > Currently the available policies are: token-bucket and leaky-bucket and the
> > attribute @capacity is only used by token-bucket policy (to represent the
> > bucket size).
> >
> > The following function has been implemented to return the amount of time a
> > cgroup should sleep to remain within the defined resource limits.
> >
> > unsigned long long
> > res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
> >
> > [ Note: only the interfaces needed by the cgroup IO controller are implemented
> > right now ]
> >
> > Signed-off-by: Andrea Righi <[email protected]>
> > ---
> > include/linux/res_counter.h | 69 +++++++++++++++++++++++++++++++----------
> > kernel/res_counter.c | 72 +++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 124 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index 4c5bcf6..9bed6af 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -14,30 +14,36 @@
> > */
> >
> > #include <linux/cgroup.h>
> > +#include <linux/jiffies.h>
> >
> > -/*
> > - * The core object. the cgroup that wishes to account for some
> > - * resource may include this counter into its structures and use
> > - * the helpers described beyond
> > - */
> > +/* The various policies that can be used for ratelimiting resources */
> > +#define RATELIMIT_LEAKY_BUCKET 0
> > +#define RATELIMIT_TOKEN_BUCKET 1
> >
> > +/**
> > + * struct res_counter - the core object to account cgroup resources
> > + *
> > + * @usage: the current resource consumption level
> > + * @max_usage: the maximal value of the usage from the counter creation
> > + * @limit: the limit that usage cannot be exceeded
> > + * @failcnt: the number of unsuccessful attempts to consume the resource
> > + * @policy: the limiting policy / algorithm
> > + * @capacity: the maximum capacity of the resource
> > + * @timestamp: timestamp of the last accounted resource request
> > + * @lock: the lock to protect all of the above.
> > + * The routines below consider this to be IRQ-safe
> > + *
> > + * The cgroup that wishes to account for some resource may include this counter
> > + * into its structures and use the helpers described beyond.
> > + */
> > struct res_counter {
> > - /*
> > - * the current resource consumption level
> > - */
> > unsigned long long usage;
> > - /*
> > - * the maximal value of the usage from the counter creation
> > - */
> > unsigned long long max_usage;
> > - /*
> > - * the limit that usage cannot exceed
> > - */
> > unsigned long long limit;
> > - /*
> > - * the number of unsuccessful attempts to consume the resource
> > - */
> > unsigned long long failcnt;
> > + unsigned long long policy;
> > + unsigned long long capacity;
> > + unsigned long long timestamp;
> >
> Andrea, sizeof(struct res_counter) is getting close to 128bytes. (maybe someone adds more)
> Then, could you check "unsigned long or unsigned int" is allowed or not, again ?
>
> It's very bad if cacheline of spinlock is different from data field, in future.
Regarding the new attributes, policy can surely be an unsigned int or
even less (only 1 bit is used right now!); maybe we can just add an
unsigned int flags and also encode potential future information there.
Moreover, are we sure we really need an unsigned long long for failcnt?
Thanks,
-Andrea
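As a purely illustrative sketch (not part of the patchset; the flags
member and the mask name are invented here), the suggested encoding
could look like this:

/* One bit of an unsigned int flags word replaces the u64 ->policy field */
#define RES_COUNTER_POLICY_TOKEN_BUCKET (1U << 0)      /* clear = leaky bucket */

static inline int res_counter_policy(unsigned int flags)
{
        return (flags & RES_COUNTER_POLICY_TOKEN_BUCKET) ?
                        RATELIMIT_TOKEN_BUCKET : RATELIMIT_LEAKY_BUCKET;
}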
* Andrea Righi <[email protected]> [2009-04-18 23:38:27]:
> Introduce attributes and functions in res_counter to implement throttling-based
> cgroup subsystems.
>
> The following attributes have been added to struct res_counter:
> * @policy: the limiting policy / algorithm
> * @capacity: the maximum capacity of the resource
> * @timestamp: timestamp of the last accounted resource request
>
Units for each of the above would be desirable; without them it is hard
to understand what you are trying to add. What is the unit of
capacity?
> Currently the available policies are: token-bucket and leaky-bucket and the
> attribute @capacity is only used by token-bucket policy (to represent the
> bucket size).
>
> The following function has been implemented to return the amount of time a
> cgroup should sleep to remain within the defined resource limits.
>
> unsigned long long
> res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
>
> [ Note: only the interfaces needed by the cgroup IO controller are implemented
> right now ]
>
This is a good RFC, but I would hold off merging it until the subsystem
gets in. Having said that, I am not convinced about the subsystem
sleeping: if the subsystem is not IO intensive, should it still sleep
because it is over its IO b/w? This might make sense for the CPU
controller, since not having CPU b/w does imply sleeping.
Could you please use the word throttle instead of sleep?
> Signed-off-by: Andrea Righi <[email protected]>
> ---
> include/linux/res_counter.h | 69 +++++++++++++++++++++++++++++++----------
> kernel/res_counter.c | 72 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 124 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 4c5bcf6..9bed6af 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -14,30 +14,36 @@
> */
>
> #include <linux/cgroup.h>
> +#include <linux/jiffies.h>
>
> -/*
> - * The core object. the cgroup that wishes to account for some
> - * resource may include this counter into its structures and use
> - * the helpers described beyond
> - */
> +/* The various policies that can be used for ratelimiting resources */
> +#define RATELIMIT_LEAKY_BUCKET 0
> +#define RATELIMIT_TOKEN_BUCKET 1
>
> +/**
> + * struct res_counter - the core object to account cgroup resources
> + *
> + * @usage: the current resource consumption level
> + * @max_usage: the maximal value of the usage from the counter creation
> + * @limit: the limit that usage cannot be exceeded
> + * @failcnt: the number of unsuccessful attempts to consume the resource
> + * @policy: the limiting policy / algorithm
> + * @capacity: the maximum capacity of the resource
> + * @timestamp: timestamp of the last accounted resource request
> + * @lock: the lock to protect all of the above.
> + * The routines below consider this to be IRQ-safe
> + *
> + * The cgroup that wishes to account for some resource may include this counter
> + * into its structures and use the helpers described beyond.
> + */
> struct res_counter {
> - /*
> - * the current resource consumption level
> - */
> unsigned long long usage;
> - /*
> - * the maximal value of the usage from the counter creation
> - */
> unsigned long long max_usage;
> - /*
> - * the limit that usage cannot exceed
> - */
> unsigned long long limit;
> - /*
> - * the number of unsuccessful attempts to consume the resource
> - */
I don't understand why this is being removed from res_counter. Am I
reading the diff correctly?
> unsigned long long failcnt;
> + unsigned long long policy;
> + unsigned long long capacity;
> + unsigned long long timestamp;
> /*
> * the lock to protect all of the above.
> * the routines below consider this to be IRQ-safe
> @@ -84,6 +90,9 @@ enum {
> RES_USAGE,
> RES_MAX_USAGE,
> RES_LIMIT,
> + RES_POLICY,
> + RES_TIMESTAMP,
> + RES_CAPACITY,
> RES_FAILCNT,
> };
>
> @@ -130,6 +139,15 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> return false;
> }
>
> +static inline unsigned long long
> +res_counter_ratelimit_delta_t(struct res_counter *res)
> +{
> + return (long long)get_jiffies_64() - (long long)res->timestamp;
> +}
> +
> +unsigned long long
> +res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
> +
> /*
> * Helper function to detect if the cgroup is within it's limit or
> * not. It's currently called from cgroup_rss_prepare()
> @@ -163,6 +181,23 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)
> spin_unlock_irqrestore(&cnt->lock, flags);
> }
>
> +static inline int
> +res_counter_ratelimit_set_limit(struct res_counter *cnt,
> + unsigned long long policy,
> + unsigned long long limit, unsigned long long max)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&cnt->lock, flags);
> + cnt->limit = limit;
> + cnt->capacity = max;
> + cnt->policy = policy;
> + cnt->timestamp = get_jiffies_64();
> + cnt->usage = 0;
> + spin_unlock_irqrestore(&cnt->lock, flags);
> + return 0;
> +}
> +
> static inline int res_counter_set_limit(struct res_counter *cnt,
> unsigned long long limit)
> {
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index bf8e753..b62319c 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -9,6 +9,7 @@
>
> #include <linux/types.h>
> #include <linux/parser.h>
> +#include <linux/jiffies.h>
> #include <linux/fs.h>
> #include <linux/slab.h>
> #include <linux/res_counter.h>
> @@ -20,6 +21,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
> spin_lock_init(&counter->lock);
> counter->limit = (unsigned long long)LLONG_MAX;
> counter->parent = parent;
> + counter->capacity = (unsigned long long)LLONG_MAX;
> + counter->timestamp = get_jiffies_64();
> }
>
> int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> @@ -99,6 +102,12 @@ res_counter_member(struct res_counter *counter, int member)
> return &counter->max_usage;
> case RES_LIMIT:
> return &counter->limit;
> + case RES_POLICY:
> + return &counter->policy;
> + case RES_TIMESTAMP:
> + return &counter->timestamp;
> + case RES_CAPACITY:
> + return &counter->capacity;
> case RES_FAILCNT:
> return &counter->failcnt;
> };
> @@ -163,3 +172,66 @@ int res_counter_write(struct res_counter *counter, int member,
> spin_unlock_irqrestore(&counter->lock, flags);
> return 0;
> }
> +
> +static unsigned long long
> +ratelimit_leaky_bucket(struct res_counter *res, ssize_t val)
> +{
> + unsigned long long delta, t;
> +
> + res->usage += val;
Is this called from a protected context (w.r.t. res)?
> + delta = res_counter_ratelimit_delta_t(res);
> + if (!delta)
> + return 0;
> + t = res->usage * USEC_PER_SEC;
> + t = usecs_to_jiffies(div_u64(t, res->limit));
> + if (t > delta)
> + return t - delta;
> + /* Reset i/o statistics */
> + res->usage = 0;
> + res->timestamp = get_jiffies_64();
> + return 0;
> +}
> +
> +static unsigned long long
> +ratelimit_token_bucket(struct res_counter *res, ssize_t val)
> +{
> + unsigned long long delta;
> + long long tok;
> +
> + res->usage -= val;
> + delta = jiffies_to_msecs(res_counter_ratelimit_delta_t(res));
> + res->timestamp = get_jiffies_64();
> + tok = (long long)res->usage * MSEC_PER_SEC;
> + if (delta) {
> + long long max = (long long)res->capacity * MSEC_PER_SEC;
> +
> + tok += delta * res->limit;
> + if (tok > max)
> + tok = max;
Use max_t() here
> + res->usage = (unsigned long long)div_s64(tok, MSEC_PER_SEC);
> + }
> + return (tok < 0) ? msecs_to_jiffies(div_u64(-tok, res->limit)) : 0;
> +}
I don't like the usage of MSEC and USEC for res->usage based on
policy.
> +
> +unsigned long long
> +res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val)
> +{
> + unsigned long long sleep = 0;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&res->lock, flags);
> + if (res->limit)
> + switch (res->policy) {
> + case RATELIMIT_LEAKY_BUCKET:
> + sleep = ratelimit_leaky_bucket(res, val);
> + break;
> + case RATELIMIT_TOKEN_BUCKET:
> + sleep = ratelimit_token_bucket(res, val);
> + break;
> + default:
> + WARN_ON(1);
> + break;
> + }
> + spin_unlock_irqrestore(&res->lock, flags);
> + return sleep;
> +}
> --
> 1.5.6.3
>
>
--
Balbir
* Andrea Righi <[email protected]> [2009-04-21 11:55:26]:
> On Tue, Apr 21, 2009 at 09:15:34AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Sat, 18 Apr 2009 23:38:27 +0200
> > Andrea Righi <[email protected]> wrote:
> >
> > > Introduce attributes and functions in res_counter to implement throttling-based
> > > cgroup subsystems.
> > >
> > > The following attributes have been added to struct res_counter:
> > > * @policy: the limiting policy / algorithm
> > > * @capacity: the maximum capacity of the resource
> > > * @timestamp: timestamp of the last accounted resource request
> > >
> > > Currently the available policies are: token-bucket and leaky-bucket and the
> > > attribute @capacity is only used by token-bucket policy (to represent the
> > > bucket size).
> > >
> > > The following function has been implemented to return the amount of time a
> > > cgroup should sleep to remain within the defined resource limits.
> > >
> > > unsigned long long
> > > res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
> > >
> > > [ Note: only the interfaces needed by the cgroup IO controller are implemented
> > > right now ]
> > >
> > > Signed-off-by: Andrea Righi <[email protected]>
> > > ---
> > > include/linux/res_counter.h | 69 +++++++++++++++++++++++++++++++----------
> > > kernel/res_counter.c | 72 +++++++++++++++++++++++++++++++++++++++++++
> > > 2 files changed, 124 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > > index 4c5bcf6..9bed6af 100644
> > > --- a/include/linux/res_counter.h
> > > +++ b/include/linux/res_counter.h
> > > @@ -14,30 +14,36 @@
> > > */
> > >
> > > #include <linux/cgroup.h>
> > > +#include <linux/jiffies.h>
> > >
> > > -/*
> > > - * The core object. the cgroup that wishes to account for some
> > > - * resource may include this counter into its structures and use
> > > - * the helpers described beyond
> > > - */
> > > +/* The various policies that can be used for ratelimiting resources */
> > > +#define RATELIMIT_LEAKY_BUCKET 0
> > > +#define RATELIMIT_TOKEN_BUCKET 1
> > >
> > > +/**
> > > + * struct res_counter - the core object to account cgroup resources
> > > + *
> > > + * @usage: the current resource consumption level
> > > + * @max_usage: the maximal value of the usage from the counter creation
> > > + * @limit: the limit that usage cannot be exceeded
> > > + * @failcnt: the number of unsuccessful attempts to consume the resource
> > > + * @policy: the limiting policy / algorithm
> > > + * @capacity: the maximum capacity of the resource
> > > + * @timestamp: timestamp of the last accounted resource request
> > > + * @lock: the lock to protect all of the above.
> > > + * The routines below consider this to be IRQ-safe
> > > + *
> > > + * The cgroup that wishes to account for some resource may include this counter
> > > + * into its structures and use the helpers described beyond.
> > > + */
> > > struct res_counter {
> > > - /*
> > > - * the current resource consumption level
> > > - */
> > > unsigned long long usage;
> > > - /*
> > > - * the maximal value of the usage from the counter creation
> > > - */
> > > unsigned long long max_usage;
> > > - /*
> > > - * the limit that usage cannot exceed
> > > - */
> > > unsigned long long limit;
> > > - /*
> > > - * the number of unsuccessful attempts to consume the resource
> > > - */
> > > unsigned long long failcnt;
> > > + unsigned long long policy;
> > > + unsigned long long capacity;
> > > + unsigned long long timestamp;
> > >
> > Andrea, sizeof(struct res_counter) is getting close to 128bytes. (maybe someone adds more)
> > Then, could you check "unsigned long or unsigned int" is allowed or not, again ?
> >
> > It's very bad if cacheline of spinlock is different from data field, in future.
>
> Regarding the new attributes, policy can be surely an unsigned int or
> even less (now only 1 bit is used!), maybe we can just add an unsigned
> int flags, and encode also potential future informations there.
>
> Moreover, are we sure we really need an unsigned long long for failcnt?
>
No, we don't. But having it helps the members align well on an 8-byte
boundary. For all you know, the compiler might do that anyway, unless
we pack the structure.
Why does policy need to be unsigned long long? Can't it be a boolean
for now, token or leaky? We could consider unioning some fields, such
as soft_limit when it is added, with the proposed fields.
--
Balbir
Andrea Righi wrote:
> On Tue, Apr 21, 2009 at 09:15:34AM +0900, KAMEZAWA Hiroyuki wrote:
>> It's very bad if cacheline of spinlock is different from data field, in
>> future.
>
> Regarding the new attributes, policy can be surely an unsigned int or
> even less (now only 1 bit is used!), maybe we can just add an unsigned
> int flags, and encode also potential future informations there.
agreed.
>
> Moreover, are we sure we really need an unsigned long long for failcnt?
>
I think "int" is enough for failcnt.
Thanks,
-Kame
On Tue, Apr 21, 2009 at 03:43:26PM +0530, Balbir Singh wrote:
> * Andrea Righi <[email protected]> [2009-04-18 23:38:27]:
>
> > Introduce attributes and functions in res_counter to implement throttling-based
> > cgroup subsystems.
> >
> > The following attributes have been added to struct res_counter:
> > * @policy: the limiting policy / algorithm
> > * @capacity: the maximum capacity of the resource
> > * @timestamp: timestamp of the last accounted resource request
> >
>
> Units of each of the above would be desirable, without them it is hard
> to understand what you are trying to add. What is the unit of
> capacity?
Theoretically it can be any unit. At the moment it is used by the
io-throttle controller only for the token-bucket strategy (@policy =
RATELIMIT_TOKEN_BUCKET), and it can be either bytes or IO operations.
Maybe I should add a comment to that effect.
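For illustration only (the values are made up; this just calls the
res_counter_ratelimit_set_limit() helper introduced by the res_counter
patch), a per-device iothrottle_node "n" could be configured along these
lines, with the bandwidth counter in bytes and the iops counter scaled by
1000 to match the 1000-unit cost charged per operation in the first
quoted hunk above (the scaling of the iops limit is an assumption here):

        /* hypothetical values: 8 MB/s with a 16 MB bucket ... */
        res_counter_ratelimit_set_limit(&n->bw, RATELIMIT_TOKEN_BUCKET,
                                        8ULL * 1024 * 1024,     /* limit: bytes/s */
                                        16ULL * 1024 * 1024);   /* capacity: bucket size */
        /* ... and 100 iops with a 200-operation burst (x1000 scaling) */
        res_counter_ratelimit_set_limit(&n->iops, RATELIMIT_TOKEN_BUCKET,
                                        100ULL * 1000,          /* limit: ops/s * 1000 */
                                        200ULL * 1000);         /* capacity: ops * 1000 */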
>
> > Currently the available policies are: token-bucket and leaky-bucket and the
> > attribute @capacity is only used by token-bucket policy (to represent the
> > bucket size).
> >
> > The following function has been implemented to return the amount of time a
> > cgroup should sleep to remain within the defined resource limits.
> >
> > unsigned long long
> > res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
> >
> > [ Note: only the interfaces needed by the cgroup IO controller are implemented
> > right now ]
> >
>
> This is a good RFC, but I would hold off merging till the subsystem
> gets in. Having said that I am not convinced about the subsystem
> sleeping, if the subsystem is not IO intensive, should it still sleep
> because it is over its IO b/w? This might make sense for the CPU
> controller, since not having CPU b/w does imply sleeping.
>
> Could you please use the word throttle instead of sleep.
OK, will do in the next version.
>
>
> > Signed-off-by: Andrea Righi <[email protected]>
> > ---
> > include/linux/res_counter.h | 69 +++++++++++++++++++++++++++++++----------
> > kernel/res_counter.c | 72 +++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 124 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > index 4c5bcf6..9bed6af 100644
> > --- a/include/linux/res_counter.h
> > +++ b/include/linux/res_counter.h
> > @@ -14,30 +14,36 @@
> > */
> >
> > #include <linux/cgroup.h>
> > +#include <linux/jiffies.h>
> >
> > -/*
> > - * The core object. the cgroup that wishes to account for some
> > - * resource may include this counter into its structures and use
> > - * the helpers described beyond
> > - */
> > +/* The various policies that can be used for ratelimiting resources */
> > +#define RATELIMIT_LEAKY_BUCKET 0
> > +#define RATELIMIT_TOKEN_BUCKET 1
> >
> > +/**
> > + * struct res_counter - the core object to account cgroup resources
> > + *
> > + * @usage: the current resource consumption level
> > + * @max_usage: the maximal value of the usage from the counter creation
> > + * @limit: the limit that usage cannot be exceeded
> > + * @failcnt: the number of unsuccessful attempts to consume the resource
> > + * @policy: the limiting policy / algorithm
> > + * @capacity: the maximum capacity of the resource
> > + * @timestamp: timestamp of the last accounted resource request
> > + * @lock: the lock to protect all of the above.
> > + * The routines below consider this to be IRQ-safe
> > + *
> > + * The cgroup that wishes to account for some resource may include this counter
> > + * into its structures and use the helpers described beyond.
> > + */
> > struct res_counter {
> > - /*
> > - * the current resource consumption level
> > - */
> > unsigned long long usage;
> > - /*
> > - * the maximal value of the usage from the counter creation
> > - */
> > unsigned long long max_usage;
> > - /*
> > - * the limit that usage cannot exceed
> > - */
> > unsigned long long limit;
> > - /*
> > - * the number of unsuccessful attempts to consume the resource
> > - */
>
> Don't understand why res_counter is removed? Am I reading the diff
> correctly?
It is not removed. I've just converted it to a kernel-doc style comment
(Documentation/kernel-doc-nano-HOWTO.txt). I think Randy suggested this
in the past.
>
> > unsigned long long failcnt;
> > + unsigned long long policy;
> > + unsigned long long capacity;
> > + unsigned long long timestamp;
> > /*
> > * the lock to protect all of the above.
> > * the routines below consider this to be IRQ-safe
> > @@ -84,6 +90,9 @@ enum {
> > RES_USAGE,
> > RES_MAX_USAGE,
> > RES_LIMIT,
> > + RES_POLICY,
> > + RES_TIMESTAMP,
> > + RES_CAPACITY,
> > RES_FAILCNT,
> > };
> >
> > @@ -130,6 +139,15 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> > return false;
> > }
> >
> > +static inline unsigned long long
> > +res_counter_ratelimit_delta_t(struct res_counter *res)
> > +{
> > + return (long long)get_jiffies_64() - (long long)res->timestamp;
> > +}
> > +
> > +unsigned long long
> > +res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
> > +
> > /*
> > * Helper function to detect if the cgroup is within it's limit or
> > * not. It's currently called from cgroup_rss_prepare()
> > @@ -163,6 +181,23 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)
> > spin_unlock_irqrestore(&cnt->lock, flags);
> > }
> >
> > +static inline int
> > +res_counter_ratelimit_set_limit(struct res_counter *cnt,
> > + unsigned long long policy,
> > + unsigned long long limit, unsigned long long max)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&cnt->lock, flags);
> > + cnt->limit = limit;
> > + cnt->capacity = max;
> > + cnt->policy = policy;
> > + cnt->timestamp = get_jiffies_64();
> > + cnt->usage = 0;
> > + spin_unlock_irqrestore(&cnt->lock, flags);
> > + return 0;
> > +}
> > +
> > static inline int res_counter_set_limit(struct res_counter *cnt,
> > unsigned long long limit)
> > {
> > diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> > index bf8e753..b62319c 100644
> > --- a/kernel/res_counter.c
> > +++ b/kernel/res_counter.c
> > @@ -9,6 +9,7 @@
> >
> > #include <linux/types.h>
> > #include <linux/parser.h>
> > +#include <linux/jiffies.h>
> > #include <linux/fs.h>
> > #include <linux/slab.h>
> > #include <linux/res_counter.h>
> > @@ -20,6 +21,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
> > spin_lock_init(&counter->lock);
> > counter->limit = (unsigned long long)LLONG_MAX;
> > counter->parent = parent;
> > + counter->capacity = (unsigned long long)LLONG_MAX;
> > + counter->timestamp = get_jiffies_64();
> > }
> >
> > int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> > @@ -99,6 +102,12 @@ res_counter_member(struct res_counter *counter, int member)
> > return &counter->max_usage;
> > case RES_LIMIT:
> > return &counter->limit;
> > + case RES_POLICY:
> > + return &counter->policy;
> > + case RES_TIMESTAMP:
> > + return &counter->timestamp;
> > + case RES_CAPACITY:
> > + return &counter->capacity;
> > case RES_FAILCNT:
> > return &counter->failcnt;
> > };
> > @@ -163,3 +172,66 @@ int res_counter_write(struct res_counter *counter, int member,
> > spin_unlock_irqrestore(&counter->lock, flags);
> > return 0;
> > }
> > +
> > +static unsigned long long
> > +ratelimit_leaky_bucket(struct res_counter *res, ssize_t val)
> > +{
> > + unsigned long long delta, t;
> > +
> > + res->usage += val;
>
> Is this called from a protected context (w.r.t. res)?
Yes, it is called with res->lock held (look at
res_counter_ratelimit_sleep()).
I can add a comment anyway.
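For what it's worth, that comment (plus an optional cheap debug check;
assert_spin_locked() is just one possibility, not something the patch
uses) could look roughly like this:

static unsigned long long
ratelimit_leaky_bucket(struct res_counter *res, ssize_t val)
{
        unsigned long long delta, t;

        /* Called by res_counter_ratelimit_sleep() with res->lock held */
        assert_spin_locked(&res->lock);

        res->usage += val;
        delta = res_counter_ratelimit_delta_t(res);
        if (!delta)
                return 0;
        t = res->usage * USEC_PER_SEC;
        t = usecs_to_jiffies(div_u64(t, res->limit));
        if (t > delta)
                return t - delta;
        /* Reset i/o statistics */
        res->usage = 0;
        res->timestamp = get_jiffies_64();
        return 0;
}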
>
> > + delta = res_counter_ratelimit_delta_t(res);
> > + if (!delta)
> > + return 0;
> > + t = res->usage * USEC_PER_SEC;
> > + t = usecs_to_jiffies(div_u64(t, res->limit));
> > + if (t > delta)
> > + return t - delta;
> > + /* Reset i/o statistics */
> > + res->usage = 0;
> > + res->timestamp = get_jiffies_64();
> > + return 0;
> > +}
> > +
> > +static unsigned long long
> > +ratelimit_token_bucket(struct res_counter *res, ssize_t val)
> > +{
> > + unsigned long long delta;
> > + long long tok;
> > +
> > + res->usage -= val;
> > + delta = jiffies_to_msecs(res_counter_ratelimit_delta_t(res));
> > + res->timestamp = get_jiffies_64();
> > + tok = (long long)res->usage * MSEC_PER_SEC;
> > + if (delta) {
> > + long long max = (long long)res->capacity * MSEC_PER_SEC;
> > +
> > + tok += delta * res->limit;
> > + if (tok > max)
> > + tok = max;
>
> Use max_t() here
ok.
>
> > + res->usage = (unsigned long long)div_s64(tok, MSEC_PER_SEC);
> > + }
> > + return (tok < 0) ? msecs_to_jiffies(div_u64(-tok, res->limit)) : 0;
> > +}
>
> I don't like the usage of MSEC and USEC for res->usage based on
> policy.
I used a different granularity only because, in the io-throttle tests,
the leaky bucket worked better with USEC and the token bucket with MSEC
(matching the code above). But we can generalize and encode this
"granularity" information in a res_counter->flags attribute.
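To see the arithmetic at work, here is a stand-alone, user-space
illustration of the token-bucket path (not kernel code; all the numbers
are made up): a 15 MB request against a 10 MB/s limit with a 20 MB
bucket leaves a 10 MB deficit, i.e. a 1000 ms sleep.

#include <stdio.h>

#define MSEC_PER_SEC 1000LL

int main(void)
{
        long long limit = 10LL * 1024 * 1024;           /* 10 MB/s */
        long long capacity = 20LL * 1024 * 1024;        /* 20 MB bucket */
        long long usage = 0;                            /* tokens left (bytes) */
        long long val = 15LL * 1024 * 1024;             /* this request: 15 MB */
        long long delta = 500;                          /* ms since last request */
        long long tok, max, sleep;

        /* same steps as ratelimit_token_bucket(): tokens kept scaled by MSEC_PER_SEC */
        usage -= val;
        tok = usage * MSEC_PER_SEC + delta * limit;
        max = capacity * MSEC_PER_SEC;
        if (tok > max)
                tok = max;
        usage = tok / MSEC_PER_SEC;

        sleep = (tok < 0) ? -tok / limit : 0;
        printf("sleep for %lld ms\n", sleep);           /* prints: sleep for 1000 ms */
        return 0;
}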
>
> > +
> > +unsigned long long
> > +res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val)
> > +{
> > + unsigned long long sleep = 0;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&res->lock, flags);
> > + if (res->limit)
> > + switch (res->policy) {
> > + case RATELIMIT_LEAKY_BUCKET:
> > + sleep = ratelimit_leaky_bucket(res, val);
> > + break;
> > + case RATELIMIT_TOKEN_BUCKET:
> > + sleep = ratelimit_token_bucket(res, val);
> > + break;
> > + default:
> > + WARN_ON(1);
> > + break;
> > + }
> > + spin_unlock_irqrestore(&res->lock, flags);
> > + return sleep;
> > +}
> > --
> > 1.5.6.3
> >
> >
>
> --
> Balbir
Thanks for your comments!
-Andrea
On Mon, Apr 20, 2009 at 09:15:24PM -0700, Paul E. McKenney wrote:
> > > How does the above lock relate to the iot->lock called out in the comment
> > > headers in the earlier functions? Hmmm... Come to think of it, I don't
> > > see an acquisition of iot->lock anywhere.
> > >
> > > So, what is the story here?
> >
> > As said before, only the comment in struct iothrottle is correct, we use
> > cgroup_lock() to protect iot->list, so there's no need to introduce
> > another lock inside struct iothrottle.
> >
> > And the other comments about iot->lock must be fixed.
>
> Sounds good!
>
> So this code is compiled into the kernel only when cgroups are defined,
> correct? Otherwise, cgroup_lock() seems to be an empty function.
Right, from init/Kconfig:
config CGROUP_IO_THROTTLE
bool "Enable cgroup I/O throttling"
depends on CGROUPS && RESOURCE_COUNTERS && EXPERIMENTAL
...
> > > > +int iothrottle_set_page_owner(struct page *page, struct mm_struct *mm)
> > > > +{
> > > > + struct iothrottle *iot;
> > > > + unsigned short id = 0;
> > > > +
> > > > + if (iothrottle_disabled())
> > > > + return 0;
> > > > + if (!mm)
> > > > + goto out;
> > > > + rcu_read_lock();
> > > > + iot = task_to_iothrottle(rcu_dereference(mm->owner));
> > >
> > > Given that task_to_iothrottle() calls task_subsys_state(), which contains
> > > an rcu_dereference(), why is the rcu_dereference() above required?
> > > (There might well be a good reason, just cannot see it right offhand.)
> >
> > The first rcu_dereference() is required to safely get a task_struct from
> > mm_struct. The second rcu_dereference() inside task_to_iothrottle() is
> > required to safely get the struct iothrottle from task_struct.
>
> Why not put the rcu_dereference() down inside task_to_iothrottle()?
>
Mmmh... it is needed only when the task_struct is taken from mm->owner;
task_to_iothrottle(current), for example, works fine without
rcu_dereference(current).
-Andrea
On Tue, Apr 21, 2009 at 02:58:30PM +0200, Andrea Righi wrote:
> On Mon, Apr 20, 2009 at 09:15:24PM -0700, Paul E. McKenney wrote:
> > > > How does the above lock relate to the iot->lock called out in the comment
> > > > headers in the earlier functions? Hmmm... Come to think of it, I don't
> > > > see an acquisition of iot->lock anywhere.
> > > >
> > > > So, what is the story here?
> > >
> > > As said before, only the comment in struct iothrottle is correct, we use
> > > cgroup_lock() to protect iot->list, so there's no need to introduce
> > > another lock inside struct iothrottle.
> > >
> > > And the other comments about iot->lock must be fixed.
> >
> > Sounds good!
> >
> > So this code is compiled into the kernel only when cgroups are defined,
> > correct? Otherwise, cgroup_lock() seems to be an empty function.
>
> Right, from init/Kconfig:
>
> config CGROUP_IO_THROTTLE
> bool "Enable cgroup I/O throttling"
> depends on CGROUPS && RESOURCE_COUNTERS && EXPERIMENTAL
> ...
Fair enough!
> > > > > +int iothrottle_set_page_owner(struct page *page, struct mm_struct *mm)
> > > > > +{
> > > > > + struct iothrottle *iot;
> > > > > + unsigned short id = 0;
> > > > > +
> > > > > + if (iothrottle_disabled())
> > > > > + return 0;
> > > > > + if (!mm)
> > > > > + goto out;
> > > > > + rcu_read_lock();
> > > > > + iot = task_to_iothrottle(rcu_dereference(mm->owner));
> > > >
> > > > Given that task_to_iothrottle() calls task_subsys_state(), which contains
> > > > an rcu_dereference(), why is the rcu_dereference() above required?
> > > > (There might well be a good reason, just cannot see it right offhand.)
> > >
> > > The first rcu_dereference() is required to safely get a task_struct from
> > > mm_struct. The second rcu_dereference() inside task_to_iothrottle() is
> > > required to safely get the struct iothrottle from task_struct.
> >
> > Why not put the rcu_dereference() down inside task_to_iothrottle()?
>
> mmmh... it is needed only when task_struct is taken from mm->owner,
> task_to_iothrottle(current) for example works fine without the
> rcu_dereference(current).
OK... But please note that rcu_dereference() is extremely lightweight,
a couple of orders of magnitude cheaper than an uncontended lock. So
there is almost no penalty for using it on the task_to_iothrottle() path.
Thanx, Paul
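For illustration, a minimal sketch of what Paul's suggestion could look
like (this is not from the posted patch; the helper name is invented and
iothrottle_subsys_id is assumed to be the id generated for the subsystem
in cgroup_subsys.h):

/* Caller holds rcu_read_lock(); both dereferences happen in one place. */
static struct iothrottle *mm_to_iothrottle(struct mm_struct *mm)
{
        struct task_struct *task;

        task = rcu_dereference(mm->owner);
        if (!task)
                return NULL;
        return container_of(task_subsys_state(task, iothrottle_subsys_id),
                            struct iothrottle, css);
}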
On Tue, Apr 21, 2009 at 03:46:59PM +0530, Balbir Singh wrote:
> * Andrea Righi <[email protected]> [2009-04-21 11:55:26]:
>
> > On Tue, Apr 21, 2009 at 09:15:34AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Sat, 18 Apr 2009 23:38:27 +0200
> > > Andrea Righi <[email protected]> wrote:
> > >
> > > > Introduce attributes and functions in res_counter to implement throttling-based
> > > > cgroup subsystems.
> > > >
> > > > The following attributes have been added to struct res_counter:
> > > > * @policy: the limiting policy / algorithm
> > > > * @capacity: the maximum capacity of the resource
> > > > * @timestamp: timestamp of the last accounted resource request
> > > >
> > > > Currently the available policies are: token-bucket and leaky-bucket and the
> > > > attribute @capacity is only used by token-bucket policy (to represent the
> > > > bucket size).
> > > >
> > > > The following function has been implemented to return the amount of time a
> > > > cgroup should sleep to remain within the defined resource limits.
> > > >
> > > > unsigned long long
> > > > res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
> > > >
> > > > [ Note: only the interfaces needed by the cgroup IO controller are implemented
> > > > right now ]
> > > >
> > > > Signed-off-by: Andrea Righi <[email protected]>
> > > > ---
> > > > include/linux/res_counter.h | 69 +++++++++++++++++++++++++++++++----------
> > > > kernel/res_counter.c | 72 +++++++++++++++++++++++++++++++++++++++++++
> > > > 2 files changed, 124 insertions(+), 17 deletions(-)
> > > >
> > > > diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> > > > index 4c5bcf6..9bed6af 100644
> > > > --- a/include/linux/res_counter.h
> > > > +++ b/include/linux/res_counter.h
> > > > @@ -14,30 +14,36 @@
> > > > */
> > > >
> > > > #include <linux/cgroup.h>
> > > > +#include <linux/jiffies.h>
> > > >
> > > > -/*
> > > > - * The core object. the cgroup that wishes to account for some
> > > > - * resource may include this counter into its structures and use
> > > > - * the helpers described beyond
> > > > - */
> > > > +/* The various policies that can be used for ratelimiting resources */
> > > > +#define RATELIMIT_LEAKY_BUCKET 0
> > > > +#define RATELIMIT_TOKEN_BUCKET 1
> > > >
> > > > +/**
> > > > + * struct res_counter - the core object to account cgroup resources
> > > > + *
> > > > + * @usage: the current resource consumption level
> > > > + * @max_usage: the maximal value of the usage from the counter creation
> > > > + * @limit: the limit that usage cannot be exceeded
> > > > + * @failcnt: the number of unsuccessful attempts to consume the resource
> > > > + * @policy: the limiting policy / algorithm
> > > > + * @capacity: the maximum capacity of the resource
> > > > + * @timestamp: timestamp of the last accounted resource request
> > > > + * @lock: the lock to protect all of the above.
> > > > + * The routines below consider this to be IRQ-safe
> > > > + *
> > > > + * The cgroup that wishes to account for some resource may include this counter
> > > > + * into its structures and use the helpers described beyond.
> > > > + */
> > > > struct res_counter {
> > > > - /*
> > > > - * the current resource consumption level
> > > > - */
> > > > unsigned long long usage;
> > > > - /*
> > > > - * the maximal value of the usage from the counter creation
> > > > - */
> > > > unsigned long long max_usage;
> > > > - /*
> > > > - * the limit that usage cannot exceed
> > > > - */
> > > > unsigned long long limit;
> > > > - /*
> > > > - * the number of unsuccessful attempts to consume the resource
> > > > - */
> > > > unsigned long long failcnt;
> > > > + unsigned long long policy;
> > > > + unsigned long long capacity;
> > > > + unsigned long long timestamp;
> > > >
> > > Andrea, sizeof(struct res_counter) is getting close to 128bytes. (maybe someone adds more)
> > > Then, could you check "unsigned long or unsigned int" is allowed or not, again ?
> > >
> > > It's very bad if cacheline of spinlock is different from data field, in future.
> >
> > Regarding the new attributes, policy can be surely an unsigned int or
> > even less (now only 1 bit is used!), maybe we can just add an unsigned
> > int flags, and encode also potential future informations there.
> >
> > Moreover, are we sure we really need an unsigned long long for failcnt?
> >
>
> No we don't. But having it helps the members align well on a 8 byte
> boundary. For all you know the compiler might do that anyway, unless
> we pack the structure.
>
> Why does policy need to be unsigned long long? Can't it be a boolean
> for now? Token or leaky? We can consider unioning of some fields like
> soft_limit when added along with the proposed fields.
Adding an unsigned int flags to encode the policy and other potential
future stuff seems much better; Kame also agreed.
Maybe the difference between a soft_limit and a hard_limit can also be
encoded in the flags.
Moreover, max_usage is not used by ratelimited resources, and capacity
is not used in all the other cases (it has been introduced only for
token-bucket ratelimited resources).
I think we can easily union max_usage and capacity, at least.
-Andrea
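To make the discussed layout concrete, here is a sketch of the compacted
structure (only an illustration of the ideas above, not a proposed
patch):

struct res_counter {
        unsigned long long usage;
        union {
                unsigned long long max_usage;   /* plain counters */
                unsigned long long capacity;    /* token-bucket counters */
        };
        unsigned long long limit;
        unsigned long long timestamp;
        unsigned int failcnt;           /* "int is enough", per Kame */
        unsigned int flags;             /* policy bit + future options */
        spinlock_t lock;
        struct res_counter *parent;
};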
Andrea Righi wrote:
> Together with cgroup_io_throttle() the kiothrottled kernel thread
> represents the core of the io-throttle subsystem.
>
> All the writeback IO requests that need to be throttled are not
> dispatched immediately in submit_bio(). Instead, they are added into an
> rbtree by iothrottle_make_request() and processed asynchronously by
> kiothrottled.
>
> A deadline is associated with each request depending on the bandwidth
> usage of the cgroup it belongs to. When a request is inserted into the
> rbtree, kiothrottled is awakened. This thread selects all the requests
> with an expired deadline and submits the batch of selected requests to
> the underlying block devices using generic_make_request().
Hi Andrea,
What if a user issues "sync": will the bios still be buffered in the rb-tree?
Do we need to flush the whole tree?
--
Regards
Gui Jianfeng
On Thu, Apr 23, 2009 at 03:53:51PM +0800, Gui Jianfeng wrote:
> Andrea Righi wrote:
> > Together with cgroup_io_throttle() the kiothrottled kernel thread
> > represents the core of the io-throttle subsystem.
> >
> > All the writeback IO requests that need to be throttled are not
> > dispatched immediately in submit_bio(). Instead, they are added into an
> > rbtree by iothrottle_make_request() and processed asynchronously by
> > kiothrottled.
> >
> > A deadline is associated to each request depending on the bandwidth
> > usage of the cgroup it belongs. When a request is inserted into the
> > rbtree kiothrottled is awakened. This thread selects all the requests
> > with an expired deadline and submit the bunch of selected requests to
> > the underlying block devices using generic_make_request().
>
> Hi Andrea,
>
> What if an user issues "sync", will the bios still be buffered in the rb-tree?
> Do we need to flush the whole tree?
Good question. From the sync(2) man page:
According to the standard specification (e.g., POSIX.1-2001), sync()
schedules the writes, but may return before the actual writing is done.
However, since version 1.3.20 Linux does actually wait. (This
still does not guarantee data integrity: modern disks have large
caches.)
Looking at the standard, it is not completely wrong: the writes are
actually scheduled, but pending in the rbtree. Anyway, if we dispatch
them immediately, anyone can evade the IO controller simply by issuing a
lot of syncs while doing IO. OTOH, dispatching the requests while
respecting the max rate of each cgroup can cause the sync to wait on all
the other cgroups' BW limitations.
Honestly I don't have a good answer for this. Opinions?
-Andrea
Andrea Righi wrote:
> Dirty pages in the page cache can be processed asynchronously by kernel
> threads (pdflush) using a writeback policy. For this reason the real
> writes to the underlying block devices occur in a different IO context
> than that of the task that originally generated the dirty pages involved
> in the IO operation. This makes the tracking and throttling of writeback
> IO more complicated than for synchronous IO.
>
> The page_cgroup infrastructure, currently available only for the memory
> cgroup controller, can be used to store the owner of each page and
> opportunely track the writeback IO. This information is encoded in
> page_cgroup->flags.
You encode the id in page_cgroup->flags; if a cgroup gets removed, IMHO,
you should remove the corresponding id from the flags.
One more thing: if a task moves from one cgroup to another, the id in the
flags also needs to be changed.
>
> An owner can be identified using a generic ID number and the following
> interfaces are provided to store and retrieve this information:
>
> unsigned long page_cgroup_get_owner(struct page *page);
> int page_cgroup_set_owner(struct page *page, unsigned long id);
> int page_cgroup_copy_owner(struct page *npage, struct page *opage);
>
> The io-throttle controller uses the cgroup css_id() as the owner's ID
> number.
>
> A big part of this code is taken from Ryo and Hirokazu's bio-cgroup
> controller (http://people.valinux.co.jp/~ryov/bio-cgroup/).
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Hirokazu Takahashi <[email protected]>
> Signed-off-by: Ryo Tsuruta <[email protected]>
> ---
> include/linux/memcontrol.h | 6 +++
> include/linux/mmzone.h | 4 +-
> include/linux/page_cgroup.h | 33 +++++++++++++-
> init/Kconfig | 4 ++
> mm/Makefile | 3 +-
> mm/memcontrol.c | 6 +++
> mm/page_cgroup.c | 95 ++++++++++++++++++++++++++++++++++++++-----
> 7 files changed, 135 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..f3e0e64 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -37,6 +37,8 @@ struct mm_struct;
> * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> */
>
> +extern void __init_mem_page_cgroup(struct page_cgroup *pc);
> +
> extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> /* for swap handling */
> @@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> +static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
> +{
> +}
> +
> static inline int mem_cgroup_newpage_charge(struct page *page,
> struct mm_struct *mm, gfp_t gfp_mask)
> {
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 186ec6a..b178eb9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -607,7 +607,7 @@ typedef struct pglist_data {
> int nr_zones;
> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> struct page *node_mem_map;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_PAGE_TRACKING
> struct page_cgroup *node_page_cgroup;
> #endif
> #endif
> @@ -958,7 +958,7 @@ struct mem_section {
>
> /* See declaration of similar field in struct zone */
> unsigned long *pageblock_flags;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_PAGE_TRACKING
> /*
> * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
> * section. (see memcontrol.h/page_cgroup.h about this.)
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 7339c7b..f24d081 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -1,7 +1,7 @@
> #ifndef __LINUX_PAGE_CGROUP_H
> #define __LINUX_PAGE_CGROUP_H
>
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_PAGE_TRACKING
> #include <linux/bit_spinlock.h>
> /*
> * Page Cgroup can be considered as an extended mem_map.
> @@ -12,11 +12,38 @@
> */
> struct page_cgroup {
> unsigned long flags;
> - struct mem_cgroup *mem_cgroup;
> struct page *page;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + struct mem_cgroup *mem_cgroup;
> struct list_head lru; /* per cgroup LRU list */
> +#endif
> };
>
> +/*
> + * use lower 16 bits for flags and reserve the rest for the page tracking id
> + */
> +#define PAGE_TRACKING_ID_SHIFT (16)
> +#define PAGE_TRACKING_ID_BITS \
> + (8 * sizeof(unsigned long) - PAGE_TRACKING_ID_SHIFT)
> +
> +/* NOTE: must be called with page_cgroup() held */
> +static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
> +{
> + return pc->flags >> PAGE_TRACKING_ID_SHIFT;
> +}
> +
> +/* NOTE: must be called with page_cgroup() held */
> +static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
> +{
> + WARN_ON(id >= (1UL << PAGE_TRACKING_ID_BITS));
> + pc->flags &= (1UL << PAGE_TRACKING_ID_SHIFT) - 1;
> + pc->flags |= (unsigned long)(id << PAGE_TRACKING_ID_SHIFT);
> +}
> +
> +unsigned long page_cgroup_get_owner(struct page *page);
> +int page_cgroup_set_owner(struct page *page, unsigned long id);
> +int page_cgroup_copy_owner(struct page *npage, struct page *opage);
> +
> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> void __init page_cgroup_init(void);
> struct page_cgroup *lookup_page_cgroup(struct page *page);
> @@ -71,7 +98,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
> bit_spin_unlock(PCG_LOCK, &pc->flags);
> }
>
> -#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +#else /* CONFIG_PAGE_TRACKING */
> struct page_cgroup;
>
> static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> diff --git a/init/Kconfig b/init/Kconfig
> index 7be4d38..5428ac7 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -569,6 +569,7 @@ config CGROUP_MEM_RES_CTLR
> bool "Memory Resource Controller for Control Groups"
> depends on CGROUPS && RESOURCE_COUNTERS
> select MM_OWNER
> + select PAGE_TRACKING
> help
> Provides a memory resource controller that manages both anonymous
> memory and page cache. (See Documentation/cgroups/memory.txt)
> @@ -611,6 +612,9 @@ endif # CGROUPS
> config MM_OWNER
> bool
>
> +config PAGE_TRACKING
> + bool
> +
> config SYSFS_DEPRECATED
> bool
>
> diff --git a/mm/Makefile b/mm/Makefile
> index ec73c68..b94e074 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -37,4 +37,5 @@ else
> obj-$(CONFIG_SMP) += allocpercpu.o
> endif
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_PAGE_TRACKING) += page_cgroup.o
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e44fb0f..69d1c31 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2524,6 +2524,12 @@ struct cgroup_subsys mem_cgroup_subsys = {
> .use_id = 1,
> };
>
> +void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
> +{
> + pc->mem_cgroup = NULL;
> + INIT_LIST_HEAD(&pc->lru);
> +}
> +
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>
> static int __init disable_swap_account(char *s)
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 791905c..b3b394c 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -3,6 +3,7 @@
> #include <linux/bootmem.h>
> #include <linux/bit_spinlock.h>
> #include <linux/page_cgroup.h>
> +#include <linux/blk-io-throttle.h>
> #include <linux/hash.h>
> #include <linux/slab.h>
> #include <linux/memory.h>
> @@ -14,9 +15,8 @@ static void __meminit
> __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> {
> pc->flags = 0;
> - pc->mem_cgroup = NULL;
> pc->page = pfn_to_page(pfn);
> - INIT_LIST_HEAD(&pc->lru);
> + __init_mem_page_cgroup(pc);
> }
> static unsigned long total_usage;
>
> @@ -74,7 +74,7 @@ void __init page_cgroup_init(void)
>
> int nid, fail;
>
> - if (mem_cgroup_disabled())
> + if (mem_cgroup_disabled() && iothrottle_disabled())
> return;
>
> for_each_online_node(nid) {
> @@ -83,12 +83,13 @@ void __init page_cgroup_init(void)
> goto fail;
> }
> printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> - printk(KERN_INFO "please try cgroup_disable=memory option if you"
> - " don't want\n");
> + printk(KERN_INFO
> + "try cgroup_disable=memory,blockio option if you don't want\n");
> return;
> fail:
> printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
> - printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
> + printk(KERN_CRIT
> + "try cgroup_disable=memory,blockio boot option\n");
> panic("Out of memory");
> }
>
> @@ -243,12 +244,85 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
>
> #endif
>
> +/**
> + * page_cgroup_get_owner() - get the owner ID of a page
> + * @page: the page we want to find the owner
> + *
> + * Returns the owner ID of the page, 0 means that the owner cannot be
> + * retrieved.
> + **/
> +unsigned long page_cgroup_get_owner(struct page *page)
> +{
> + struct page_cgroup *pc;
> + unsigned long ret;
> +
> + pc = lookup_page_cgroup(page);
> + if (unlikely(!pc))
> + return 0;
> +
> + lock_page_cgroup(pc);
> + ret = page_cgroup_get_id(pc);
> + unlock_page_cgroup(pc);
> + return ret;
> +}
> +
> +/**
> + * page_cgroup_set_owner() - set the owner ID of a page
> + * @page: the page we want to tag
> + * @id: the ID number that will be associated to page
> + *
> + * Returns 0 if the owner is correctly associated to the page. Returns a
> + * negative value in case of failure.
> + **/
> +int page_cgroup_set_owner(struct page *page, unsigned long id)
> +{
> + struct page_cgroup *pc;
> +
> + pc = lookup_page_cgroup(page);
> + if (unlikely(!pc))
> + return -ENOENT;
> +
> + lock_page_cgroup(pc);
> + page_cgroup_set_id(pc, id);
> + unlock_page_cgroup(pc);
> + return 0;
> +}
> +
> +/**
> + * page_cgroup_copy_owner() - copy the owner ID of a page into another page
> + * @npage: the page where we want to copy the owner
> + * @opage: the page from which we want to copy the ID
> + *
> + * Returns 0 if the owner is correctly associated to npage. Returns a negative
> + * value in case of failure.
> + **/
> +int page_cgroup_copy_owner(struct page *npage, struct page *opage)
> +{
> + struct page_cgroup *npc, *opc;
> + unsigned long id;
> +
> + npc = lookup_page_cgroup(npage);
> + if (unlikely(!npc))
> + return -ENOENT;
> + opc = lookup_page_cgroup(opage);
> + if (unlikely(!opc))
> + return -ENOENT;
> + lock_page_cgroup(opc);
> + lock_page_cgroup(npc);
> + id = page_cgroup_get_id(opc);
> + page_cgroup_set_id(npc, id);
> + unlock_page_cgroup(npc);
> + unlock_page_cgroup(opc);
> +
> + return 0;
> +}
> +
> void __init page_cgroup_init(void)
> {
> unsigned long pfn;
> int fail = 0;
>
> - if (mem_cgroup_disabled())
> + if (mem_cgroup_disabled() && iothrottle_disabled())
> return;
>
> for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> @@ -257,14 +331,15 @@ void __init page_cgroup_init(void)
> fail = init_section_page_cgroup(pfn);
> }
> if (fail) {
> - printk(KERN_CRIT "try cgroup_disable=memory boot option\n");
> + printk(KERN_CRIT
> + "try cgroup_disable=memory,blockio boot option\n");
> panic("Out of memory");
> } else {
> hotplug_memory_notifier(page_cgroup_callback, 0);
> }
> printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> - printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
> - " want\n");
> + printk(KERN_INFO
> + "try cgroup_disable=memory,blockio option if you don't want\n");
> }
>
> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
--
Regards
Gui Jianfeng
Andrea Righi wrote:
> On Thu, Apr 23, 2009 at 03:53:51PM +0800, Gui Jianfeng wrote:
>> Andrea Righi wrote:
>>> Together with cgroup_io_throttle() the kiothrottled kernel thread
>>> represents the core of the io-throttle subsystem.
>>>
>>> All the writeback IO requests that need to be throttled are not
>>> dispatched immediately in submit_bio(). Instead, they are added into an
>>> rbtree by iothrottle_make_request() and processed asynchronously by
>>> kiothrottled.
>>>
>>> A deadline is associated to each request depending on the bandwidth
>>> usage of the cgroup it belongs. When a request is inserted into the
>>> rbtree kiothrottled is awakened. This thread selects all the requests
>>> with an expired deadline and submit the bunch of selected requests to
>>> the underlying block devices using generic_make_request().
>> Hi Andrea,
>>
>> What if an user issues "sync", will the bios still be buffered in the rb-tree?
>> Do we need to flush the whole tree?
>
> Good question. From The sync(2) man page:
>
> According to the standard specification (e.g., POSIX.1-2001), sync()
> schedules the writes, but may return before the actual writing is done.
> However, since version 1.3.20 Linux does actually wait. (This
> still does not guarantee data integrity: modern disks have large
> caches.)
>
> It is not completely wrong looking at the standard. The writes are
> actually scheduled, but pending in the rbtree. Anyway, if we immediately
> dispatch them anyone can evade the IO controller simply issuing a lot of
> sync while doing IO. OTOH dispatching the requests respecting the max
> rate for each cgroup can cause the sync to wait for all the others' BW
> limitations.
>
> Honestly I don't have a good answer for this. Opinions?
IMHO, buffered bios should be submitted immediately when a user issues
"sync", even if this counts as evading the controller. Restricting a user
from issuing "sync" would also help, but that does not seem easy.
--
Regards
Gui Jianfeng
On Fri, Apr 24, 2009 at 10:11:09AM +0800, Gui Jianfeng wrote:
> Andrea Righi wrote:
> > Dirty pages in the page cache can be processed asynchronously by kernel
> > threads (pdflush) using a writeback policy. For this reason the real
> > writes to the underlying block devices occur in a different IO context
> > respect to the task that originally generated the dirty pages involved
> > in the IO operation. This makes the tracking and throttling of writeback
> > IO more complicate respect to the synchronous IO.
> >
> > The page_cgroup infrastructure, currently available only for the memory
> > cgroup controller, can be used to store the owner of each page and
> > opportunely track the writeback IO. This information is encoded in
> > page_cgroup->flags.
>
> You encode id in page_cgroup->flags, if a cgroup get removed, IMHO, you
> should remove the corresponding id in flags.
OK, the same ID could be reused by another cgroup. I think this should
happen very rarely, because IDs are recycled slowly anyway.
What about simply executing a sys_sync() when an io-throttle cgroup is
removed? If we're going to remove a cgroup, no additional dirty pages
will be generated by it, because it must be empty. And the sync would
ensure that the old dirty pages are flushed back to disk (for those
pages the cgroup ID will simply be ignored).
> One more thing, if a task is moving from a cgroup to another, the id in
> flags also need to be changed.
I do not agree here. Even if a task moves from one cgroup to another,
the cgroup that generated the dirty page is always the old one. Remember
that in this case we want to preserve the cgroup's identity, not the task's.
Thanks,
-Andrea
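A minimal sketch of that idea (not part of the patchset; the callback
name is invented and the pre_destroy signature follows the 2.6.29-era
cgroup API):

static int iothrottle_pre_destroy(struct cgroup_subsys *ss,
                                  struct cgroup *cgrp)
{
        /*
         * The group is empty at this point, so no new dirty pages can be
         * tagged with its ID; write back what is already dirty so that
         * the ID can be safely recycled.
         */
        sys_sync();
        return 0;
}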
Andrea Righi wrote:
> On Fri, Apr 24, 2009 at 10:11:09AM +0800, Gui Jianfeng wrote:
>> Andrea Righi wrote:
>>> Dirty pages in the page cache can be processed asynchronously by kernel
>>> threads (pdflush) using a writeback policy. For this reason the real
>>> writes to the underlying block devices occur in a different IO context
>>> respect to the task that originally generated the dirty pages involved
>>> in the IO operation. This makes the tracking and throttling of writeback
>>> IO more complicate respect to the synchronous IO.
>>>
>>> The page_cgroup infrastructure, currently available only for the memory
>>> cgroup controller, can be used to store the owner of each page and
>>> opportunely track the writeback IO. This information is encoded in
>>> page_cgroup->flags.
>> You encode id in page_cgroup->flags, if a cgroup get removed, IMHO, you
>> should remove the corresponding id in flags.
>
> OK, the same same ID could be reused by another cgroup. I think this
> should happen very rarely because IDs are recovered slowly anyway.
>
> What about simply executing a sys_sync() when a io-throttle cgroup is
> removed? If we're going to remove a cgroup no additional dirty page will
> be generated by this cgroup, because it must be empty. And the sync
> would allow that old dirty pages will be flushed back to disk (for those
> pages the cgroup ID will be simply ignored).
>
>> One more thing, if a task is moving from a cgroup to another, the id in
>> flags also need to be changed.
>
> I do not agree here. Even if a task is moving from a cgroup to another
> the cgroup that generated the dirty page is always the old one. Remember
> that we want to save cgroup's identity in this case, and not the task.
If the task moves to a new cgroup, the dirty pages generated in the old
group still use the old id. When these dirty pages are written back to disk,
the corresponding bios will be delayed according to the old group's bandwidth
limitation. Am I right? I think we should use the new bandwidth limitation
when the actual IO happens, so we need to use the new id for these pages. But
I think the implementation of this functionality would be very complicated. :)
>
> Thanks,
> -Andrea
>
>
>
--
Regards
Gui Jianfeng
On Fri, Apr 24, 2009 at 05:14:55PM +0800, Gui Jianfeng wrote:
> Andrea Righi wrote:
> > On Fri, Apr 24, 2009 at 10:11:09AM +0800, Gui Jianfeng wrote:
> >> Andrea Righi wrote:
> >>> Dirty pages in the page cache can be processed asynchronously by kernel
> >>> threads (pdflush) using a writeback policy. For this reason the real
> >>> writes to the underlying block devices occur in a different IO context
> >>> respect to the task that originally generated the dirty pages involved
> >>> in the IO operation. This makes the tracking and throttling of writeback
> >>> IO more complicate respect to the synchronous IO.
> >>>
> >>> The page_cgroup infrastructure, currently available only for the memory
> >>> cgroup controller, can be used to store the owner of each page and
> >>> opportunely track the writeback IO. This information is encoded in
> >>> page_cgroup->flags.
> >> You encode id in page_cgroup->flags, if a cgroup get removed, IMHO, you
> >> should remove the corresponding id in flags.
> >
> > OK, the same same ID could be reused by another cgroup. I think this
> > should happen very rarely because IDs are recovered slowly anyway.
> >
> > What about simply executing a sys_sync() when a io-throttle cgroup is
> > removed? If we're going to remove a cgroup no additional dirty page will
> > be generated by this cgroup, because it must be empty. And the sync
> > would allow that old dirty pages will be flushed back to disk (for those
> > pages the cgroup ID will be simply ignored).
> >
> >> One more thing, if a task is moving from a cgroup to another, the id in
> >> flags also need to be changed.
> >
> > I do not agree here. Even if a task is moving from a cgroup to another
> > the cgroup that generated the dirty page is always the old one. Remember
> > that we want to save cgroup's identity in this case, and not the task.
>
> If the task moves to a new cgroup, the dirty page generated from the old
> group still uses the old id. When these dirty pages is writing back to disk,
> the corresponding bios will be delayed according to old group's bandwidth
> limitation. Am i right? I think we should use the new bandwidth limitation
> when actual IO happens. So we need to use new id for these pages. But i think
> the implementation for this functionality must be very complicated. :)
Right, but as I said, statistics are per cgroup, not per task. And even
if a task moves to another cgroup, it is correct IMHO to use the old
cgroup id to write back the dirty pages, because it is IO activity
actually generated before the task's move, and the old rules should
be applied. I don't see a strong motivation to change this behaviour and
complicate/slow down the current implementation.
Thanks,
-Andrea