2009-04-14 20:21:38

by Andrea Righi

[permalink] [raw]
Subject: [PATCH 0/9] cgroup: io-throttle controller (v13)

Objective
~~~~~~~~~
The objective of the io-throttle controller is to improve IO performance
predictability of different cgroups that share the same block devices.

State of the art (quick overview)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recent work by Vivek proposes a weighted BW solution, introducing fair
queuing support in the elevator layer and modifying the existing IO
schedulers to use that functionality
(https://lists.linux-foundation.org/pipermail/containers/2009-March/016129.html).

For the fair queuing part Vivek's IO controller makes use of the BFQ
code as posted by Paolo and Fabio (http://lkml.org/lkml/2008/11/11/148).

The dm-ioband controller by the valinux guys also proposes a proportional,
ticket-based solution fully implemented at the device mapper level
(http://people.valinux.co.jp/~ryov/dm-ioband/).

The bio-cgroup patch (http://people.valinux.co.jp/~ryov/bio-cgroup/) is
a BIO tracking mechanism for cgroups, implemented in the cgroup memory
subsystem. It is maintained by Ryo and it allows dm-ioband to track
writeback requests issued by kernel threads (pdflush).

Another work by Satoshi implements cgroup awareness in CFQ, mapping
per-cgroup priority to CFQ IO priorities; this also provides only
proportional BW support (http://lwn.net/Articles/306772/).

Please correct me or integrate if I missed someone or something. :)

Proposed solution
~~~~~~~~~~~~~~~~~
Compared to other priority/weight-based solutions, the approach used by
this controller is to explicitly choke applications' requests that
directly or indirectly generate IO activity in the system (this
controller addresses both synchronous IO and writeback/buffered IO).

The bandwidth and iops limiting method has the advantage of improving
performance predictability, at the cost of reducing, in general, the
overall throughput of the system.

IO throttling and accounting is performed during the submission of IO
requests and it is independent of the particular IO scheduler.
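
To give a rough idea of what throttling at submission time looks like, here
is a minimal sketch (not the actual patch code; iothrottle_account() is a
hypothetical helper) of a hook sitting above the elevator:

#include <linux/blkdev.h>
#include <linux/sched.h>

/*
 * Hypothetical sketch: charge "bytes" of IO on "bdev" to the cgroup of the
 * current task and, if the configured limit is exceeded, sleep for the
 * returned amount of jiffies.  Everything happens at submission time, so no
 * IO scheduler changes are required.
 */
static void throttle_on_submit(struct block_device *bdev, size_t bytes)
{
	unsigned long sleep;

	sleep = iothrottle_account(current, bdev, bytes); /* hypothetical hook */
	if (sleep)
		schedule_timeout_killable(sleep);
}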

Detailed information about the design, goals and usage is provided in the
documentation (see [PATCH 1/9]).

Implementation
~~~~~~~~~~~~~~
Patchset against latest Linus' git:

[PATCH 0/9] cgroup: block device IO controller (v13)
[PATCH 1/9] io-throttle documentation
[PATCH 2/9] res_counter: introduce ratelimiting attributes
[PATCH 3/9] bio-cgroup controller
[PATCH 4/9] support checking of cgroup subsystem dependencies
[PATCH 5/9] io-throttle controller infrastructure
[PATCH 6/9] kiothrottled: throttle buffered (writeback) IO
[PATCH 7/9] io-throttle instrumentation
[PATCH 8/9] export per-task io-throttle statistics to userspace
[PATCH 9/9] ext3: do not throttle metadata and journal IO

The v13 all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/

This patchset contains substantial changes with respect to the previous
version.

Thanks to Gui Jianfeng's contribution, the io-throttle controller now uses
bio-cgroup, instead of the memory cgroup controller, to track buffered
(writeback) IO, and it is also possible to mount memcg, bio-cgroup and
io-throttle at different mount points (see also
http://lwn.net/Articles/308108/).

Moreover, a kernel thread (kiothrottled) has been introduced to schedule
throttled writeback requests asynchronously. This allows smoothing the
bursty IO generated by the bunch of pdflush's writeback requests. All
those requests are added to an rbtree and dispatched asynchronously by
kiothrottled using a deadline-based policy.
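
For reference, here is a rough sketch of the dispatch loop described above
(structure and helper names are illustrative, not the actual kiothrottled
code):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/jiffies.h>
#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Illustrative only: a throttled writeback request queued by deadline. */
struct throttled_bio {
	struct rb_node node;
	unsigned long deadline;		/* expiration time, in jiffies */
	struct bio *bio;
};

/* Submit all the queued bios whose deadline has expired. */
static void kiothrottled_dispatch(struct rb_root *root, spinlock_t *lock)
{
	struct rb_node *n;
	struct throttled_bio *tb;

	spin_lock_irq(lock);
	while ((n = rb_first(root)) != NULL) {
		tb = rb_entry(n, struct throttled_bio, node);
		if (time_before(jiffies, tb->deadline))
			break;	/* earliest deadline not expired yet */
		rb_erase(n, root);
		spin_unlock_irq(lock);
		generic_make_request(tb->bio);	/* dispatch to the block layer */
		kfree(tb);
		spin_lock_irq(lock);
	}
	spin_unlock_irq(lock);
}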

The kiothrottled scheduler can be improved in future versions to
implement proportional/weighted IO scheduling, preferably with feedback
from the existing IO schedulers.

Experimental results
~~~~~~~~~~~~~~~~~~~~
Following are a few quick experimental results with writeback IO. Results
with synchronous IO (read and write) are more or less the same as those
obtained with the previous io-throttle version.

Two cgroups:

cgroup-a: 4MB BW limit on /dev/sda
cgroup-b: 2MB BW limit on /dev/sda

Run 2 concurrent "dd"s (1 in cgroup-a, 1 in cgroup-b) to simulate a
large write stream and generate many writeback IO requests.

Expected results: 6MB/s from the disk's point of view, 4MB/s and 2MB/s
from the application's point of view.

Experimental results:

* From the disk's point of view (dstat -d -D sda1):

  with kiothrottled      without kiothrottled
   --dsk/sda1-            --dsk/sda1-
    read  writ             read  writ
       0  6252k               0  9688k
       0  6904k               0  6488k
       0  6320k               0  2320k
       0  6144k               0  8192k
       0  6220k               0    10M
       0  6212k               0  5208k
       0  6228k               0  1940k
       0  6212k               0  1300k
       0  6312k               0  8100k
       0  6216k               0  8640k
       0  6228k               0  6584k
       0  6648k               0  2440k
     ...                    ...
     -----                  -----
  avg: 6325k             avg: 5928k

* From the application's point of view:

- with kiothrottled -
cgroup-a)
$ dd if=/dev/zero of=4m-bw.out bs=1M
196+0 records in
196+0 records out
205520896 bytes (206 MB) copied, 40.762 s, 5.0 MB/s

cgroup-b)
$ dd if=/dev/zero of=2m-bw.out bs=1M
97+0 records in
97+0 records out
101711872 bytes (102 MB) copied, 37.3826 s, 2.7 MB/s

- without kiothrottled -
cgroup-a)
$ dd if=/dev/zero of=4m-bw.out bs=1M
133+0 records in
133+0 records out
139460608 bytes (139 MB) copied, 39.1345 s, 3.6 MB/s

cgroup-b)
$ dd if=/dev/zero of=2m-bw.out bs=1M
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 39.0422 s, 1.9 MB/s

Changelog (v12 -> v13)
~~~~~~~~~~~~~~~~~~~~~~
* rewritten on top of bio-cgroup to track writeback IO
* now it is possible to mount memory, bio-cgroup and io-throttle cgroups in
different mount points
* introduce a dedicated kernel thread (kiothrottled) to throttle writeback IO
* updated documentation

-Andrea


2009-04-14 20:21:58

by Andrea Righi

[permalink] [raw]
Subject: [PATCH 1/9] io-throttle documentation

Documentation of the block device I/O controller: description, usage,
advantages and design.

Signed-off-by: Andrea Righi <[email protected]>
---
Documentation/cgroups/io-throttle.txt | 451 +++++++++++++++++++++++++++++++++
1 files changed, 451 insertions(+), 0 deletions(-)
create mode 100644 Documentation/cgroups/io-throttle.txt

diff --git a/Documentation/cgroups/io-throttle.txt b/Documentation/cgroups/io-throttle.txt
new file mode 100644
index 0000000..7650601
--- /dev/null
+++ b/Documentation/cgroups/io-throttle.txt
@@ -0,0 +1,451 @@
+
+ Block device I/O bandwidth controller
+
+----------------------------------------------------------------------
+1. DESCRIPTION
+
+This controller allows limiting the I/O bandwidth of specific block devices for
+specific process containers (cgroups [1]), imposing additional delays on I/O
+requests for those processes that exceed the limits defined in the control
+group filesystem.
+
+Bandwidth limiting rules offer better control over QoS with respect to priority
+or weight-based solutions, which only give information about applications'
+relative performance requirements. Moreover, priority-based solutions are
+affected by performance bursts when only low-priority requests are submitted
+to a general purpose resource dispatcher.
+
+The goal of the I/O bandwidth controller is to improve performance
+predictability from the applications' point of view and provide performance
+isolation of different control groups sharing the same block devices.
+
+NOTE #1: If you're looking for a way to improve the overall throughput of the
+system, you should probably use a different solution.
+
+NOTE #2: The current implementation does not guarantee minimum bandwidth
+levels; QoS is implemented only by slowing down I/O "traffic" that exceeds the
+limits specified by the user. Minimum I/O rate thresholds are supposed to be
+guaranteed if the user configures a proper I/O bandwidth partitioning of the
+block devices shared among the different cgroups (theoretically, if the sum of
+all the single limits defined for a block device doesn't exceed the total I/O
+bandwidth of that device).
+
+----------------------------------------------------------------------
+2. USER INTERFACE
+
+A new I/O limitation rule is described using the files:
+- blockio.bandwidth-max
+- blockio.iops-max
+
+The I/O bandwidth file (blockio.bandwidth-max) can be used to limit the
+throughput of a certain cgroup, while blockio.iops-max can be used to throttle
+cgroups containing applications doing a sparse/seeky I/O workload. Any
+combination of them can be used to define more complex I/O limiting rules,
+expressed both in terms of I/O operations per second and bandwidth.
+
+The same files can be used to set multiple rules for different block devices
+relative to the same cgroup.
+
+The following syntax can be used to configure any limiting rule:
+
+# /bin/echo DEV:LIMIT:STRATEGY:BUCKET_SIZE > CGROUP/FILE
+
+- DEV is the name of the device the limiting rule is applied to.
+
+- LIMIT is the maximum I/O activity allowed on DEV by CGROUP; LIMIT can
+  represent a bandwidth limitation (expressed in bytes/s) when writing to
+  blockio.bandwidth-max, or a limitation on the maximum I/O operations per
+  second (expressed in operations/s) issued by CGROUP when writing to
+  blockio.iops-max.
+
+  A generic I/O limiting rule for a block device DEV can be removed by setting
+  LIMIT to 0.
+
+- STRATEGY is the throttling strategy used to throttle the applications' I/O
+ requests from/to device DEV. At the moment two different strategies can be
+ used [2][3]:
+
+  0 = leaky bucket: the controller accepts at most B bytes (B = LIMIT * time)
+                    or O operations (O = LIMIT * time); further I/O requests
+                    are delayed by scheduling a timeout for the tasks that
+                    made those requests.
+
+                Different I/O flow
+                   | | |
+                   | v |
+                   |   v
+                   v
+                  .......
+                  \     /
+                   \   /  leaky-bucket
+                    ---
+                    |||
+                    vvv
+                Smoothed I/O flow
+
+  1 = token bucket: LIMIT tokens are added to the bucket every second; the
+                    bucket can hold at most BUCKET_SIZE tokens; I/O requests
+                    are accepted if there are available tokens in the bucket;
+                    when a request of N bytes arrives, N tokens are removed
+                    from the bucket; if fewer than N tokens are available, the
+                    request is delayed until a sufficient amount of tokens is
+                    available in the bucket.
+
+                      Tokens (I/O rate)
+                              o
+                              o
+                              o
+                           .......   <--.
+                           \     /      |  Bucket size (burst limit)
+                            \ooo/       |
+                             ---     <--'
+                              |ooo
+        Incoming        ----->|-----> Conforming
+        I/O                   |oo     I/O
+        requests        ----->|-----> requests
+                              |
+                        ----->|
+
+  Leaky bucket is more precise than token bucket in respecting the limits,
+  because bursty workloads are always smoothed. Token bucket, instead, allows
+  a small degree of irregularity in the I/O flows (burst limit), and, for this
+  reason, it is better in terms of efficiency (bursty workloads are not
+  smoothed when there are sufficient tokens in the bucket).
+
+- BUCKET_SIZE is used only with token bucket (STRATEGY == 1) and defines the
+ size of the bucket in bytes (blockio.bandwidth-max) or in I/O operations
+ (blockio.iops-max).
+
+- CGROUP is the name of the limited process container.
+
+The following shorthand forms are also allowed:
+
+- remove an I/O bandwidth limiting rule
+# /bin/echo DEV:0 > CGROUP/blockio.bandwidth-max
+
+- configure a limiting rule using leaky bucket throttling (ignore bucket size):
+# /bin/echo DEV:LIMIT:0 > CGROUP/blockio.bandwidth-max
+
+- configure a limiting rule using token bucket throttling
+ (with bucket size == LIMIT):
+# /bin/echo DEV:LIMIT:1 > CGROUP/blockio.bandwidth-max
+
+2.2. Show I/O limiting rules
+
+All the defined rules and statistics for a specific cgroup can be shown by
+reading the files blockio.bandwidth-max for bandwidth constraints and
+blockio.iops-max for I/O operations per second constraints.
+
+The following syntax is used:
+
+$ cat CGROUP/blockio.bandwidth-max
+MAJOR MINOR LIMIT STRATEGY LEAKY_STAT BUCKET_SIZE BUCKET_FILL TIME_DELTA
+
+- MAJOR is the major device number of DEV (defined above)
+
+- MINOR is the minor device number of DEV (defined above)
+
+- LIMIT, STRATEGY and BUCKET_SIZE are the same parameters defined above
+
+- LEAKY_STAT is the amount of bytes (blockio.bandwidth-max) or I/O operations
+ (blockio.iops-max) currently allowed by the I/O controller (only used with
+ leaky bucket strategy - STRATEGY == 0)
+
+- BUCKET_FILL represents the amount of tokens present in the bucket (only used
+ with token bucket strategy - STRATEGY == 1)
+
+- TIME_DELTA can be one of the following:
+  - the amount of jiffies elapsed since the last I/O request (token bucket)
+  - the amount of jiffies during which the bytes or the number of I/O
+    operations given by LEAKY_STAT have been accumulated (leaky bucket)
+
+Multiple per-block device rules are reported in multiple rows
+(DEVi, i = 1 .. n):
+
+$ cat CGROUP/blockio.bandwidth-max
+MAJOR1 MINOR1 BW1 STRATEGY1 LEAKY_STAT1 BUCKET_SIZE1 BUCKET_FILL1 TIME_DELTA1
+MAJOR1 MINOR1 BW2 STRATEGY2 LEAKY_STAT2 BUCKET_SIZE2 BUCKET_FILL2 TIME_DELTA2
+...
+MAJORn MINORn BWn STRATEGYn LEAKY_STATn BUCKET_SIZEn BUCKET_FILLn TIME_DELTAn
+
+The same fields are used to describe I/O operations/sec rules. The only
+difference is that the cost of each I/O operation is scaled up by a factor of
+1000. This allows applying finer-grained sleeps and provides more precise
+throttling.
+
+$ cat CGROUP/blockio.iops-max
+MAJOR MINOR LIMITx1000 STRATEGY LEAKY_STATx1000 BUCKET_SIZEx1000 BUCKET_FILLx1000 TIME_DELTA
+...
+
+2.3. Additional I/O statistics
+
+Additional cgroup I/O throttling statistics are reported in
+blockio.throttlecnt:
+
+$ cat CGROUP/blockio.throttlecnt
+MAJOR MINOR BW_COUNTER BW_SLEEP IOPS_COUNTER IOPS_SLEEP
+
+ - MAJOR, MINOR are respectively the major and the minor number of the device
+   the following statistics refer to
+ - BW_COUNTER gives the number of times that the cgroup bandwidth limit of
+   this particular device was exceeded
+ - BW_SLEEP is the amount of sleep time measured in clock ticks (divide
+   by sysconf(_SC_CLK_TCK)) imposed on the processes of this cgroup that
+   exceeded the bandwidth limit for this particular device
+ - IOPS_COUNTER gives the number of times that the cgroup I/O operations per
+   second limit of this particular device was exceeded
+ - IOPS_SLEEP is the amount of sleep time measured in clock ticks (divide
+   by sysconf(_SC_CLK_TCK)) imposed on the processes of this cgroup that
+   exceeded the I/O operations per second limit for this particular device
+
+Example:
+$ cat CGROUP/blockio.throttlecnt
+8 0 0 0 0 0
+^ ^ ^ ^ ^ ^
+ \ \ \ \ \ \___iops sleep (in clock ticks)
+  \ \ \ \ \____iops throttle counter
+   \ \ \ \_____bandwidth sleep (in clock ticks)
+    \ \ \______bandwidth throttle counter
+     \ \_______minor dev. number
+      \________major dev. number
+
+Distinct statistics for each process are reported in
+/proc/PID/io-throttle-stat:
+
+$ cat /proc/PID/io-throttle-stat
+BW_COUNTER BW_SLEEP IOPS_COUNTER IOPS_SLEEP
+
+Example:
+$ cat /proc/$$/io-throttle-stat
+0 0 0 0
+^ ^ ^ ^
+ \ \ \ \_____global iops sleep (in clock ticks)
+  \ \ \______global iops counter
+   \ \_______global bandwidth sleep (clock ticks)
+    \________global bandwidth counter
+
+2.5. Generic usage examples
+
+* Mount the cgroup filesystem (blockio subsystem):
+ # mkdir /mnt/cgroup
+ # mount -t cgroup -oblockio blockio /mnt/cgroup
+
+* Instantiate the new cgroup "foo":
+ # mkdir /mnt/cgroup/foo
+ --> the cgroup foo has been created
+
+* Add the current shell process to the cgroup "foo":
+ # /bin/echo $$ > /mnt/cgroup/foo/tasks
+ --> the current shell has been added to the cgroup "foo"
+
+* Give maximum 1MiB/s of I/O bandwidth on /dev/sda for the cgroup "foo", using
+ leaky bucket throttling strategy:
+ # /bin/echo /dev/sda:$((1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda
+
+* Give maximum 8MiB/s of I/O bandwidth on /dev/sdb for the cgroup "foo", using
+ token bucket throttling strategy, bucket size = 8MiB:
+ # /bin/echo /dev/sdb:$((8 * 1024 * 1024)):1:$((8 * 1024 * 1024)) > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # sh
+ --> the subshell 'sh' is running in cgroup "foo" and it can use a maximum I/O
+ bandwidth of 1MiB/s on /dev/sda (controlled by leaky bucket throttling)
+ and 8MiB/s on /dev/sdb (controlled by token bucket throttling)
+
+* Run a benchmark doing I/O on /dev/sda and /dev/sdb; I/O limits and usage
+  defined for cgroup "foo" can be shown as follows:
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 16 8388608 1 0 8388608 -522560 48
+ 8 0 1048576 0 737280 0 0 216
+
+* Extend the maximum I/O bandwidth for the cgroup "foo" to 16MiB/s on /dev/sda:
+ # /bin/echo /dev/sda:$((16 * 1024 * 1024)):0:0 > \
+ > /mnt/cgroup/foo/blockio.bandwidth-max
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 16 8388608 1 0 8388608 -84432 206436
+ 8 0 16777216 0 0 0 0 15212
+
+* Remove limiting rule on /dev/sdb for cgroup "foo":
+ # /bin/echo /dev/sdb:0:0:0 > /mnt/cgroup/foo/blockio.bandwidth-max
+ # cat /mnt/cgroup/foo/blockio.bandwidth-max
+ 8 0 16777216 0 0 0 0 110388
+
+* Set a maximum of 100 I/O operations/sec (leaky bucket strategy) to /dev/sdc
+ for cgroup "foo":
+ # /bin/echo /dev/sdc:100:0 > /mnt/cgroup/foo/blockio.iops-max
+ # cat /mnt/cgroup/foo/blockio.iops-max
+ 8 32 100000 0 846000 0 2113
+          ^       ^
+           \_____/
+              |
+  Remember: these values are scaled up by a factor of 1000 to apply a
+  fine-grained throttling (i.e. LIMIT == 100000 means a maximum of 100 I/O
+  operations per second)
+
+* Remove limiting rule for I/O operations from /dev/sdc for cgroup "foo":
+ # /bin/echo /dev/sdc:0 > /mnt/cgroup/foo/blockio.iops-max
+
+----------------------------------------------------------------------
+3. ADVANTAGES OF PROVIDING THIS FEATURE
+
+* Allow I/O traffic shaping for block devices shared among different cgroups
+* Improve I/O performance predictability on block devices shared between
+  different cgroups
+* Limiting rules do not depend on the particular I/O scheduler (anticipatory,
+  deadline, CFQ, noop) and/or the type of the underlying block devices
+* The bandwidth limitations are guaranteed both for synchronous and
+  asynchronous operations, even for I/O passing through the page cache or
+  buffers, and not only for direct I/O (see below for details)
+* It is possible to implement a simple user-space application to dynamically
+ adjust the I/O workload of different process containers at run-time,
+ according to the particular users' requirements and applications' performance
+ constraints
+
+----------------------------------------------------------------------
+4. DESIGN
+
+I/O throttling is performed by imposing an explicit timeout on the processes
+that exceed the I/O limits dedicated to the cgroup they belong to. I/O
+accounting happens per cgroup.
+
+Only the actual I/O that flows to the block devices is considered. Multiple
+re-reads of pages already present in the page cache, as well as re-writes of
+dirty pages, are not considered for accounting and throttling the I/O
+activity, since they don't actually generate any real I/O operation.
+
+This means that a process that re-reads or re-writes the same blocks of a
+file multiple times is affected by the I/O limitations only for the actual
+I/O performed from/to the underlying block devices.
+
+4.1. Synchronous I/O tracking and throttling
+
+The io-throttle controller just works as expected for synchronous (read and
+write) operations: the real I/O activity is reduced synchronously according to
+the defined limitations.
+
+If the operation is synchronous we automatically know that the context of the
+request is the current task, so we can charge the cgroup the current task
+belongs to, and throttle the current task as well if it exceeded the cgroup
+limitations.
+
+4.2. Buffered I/O (write-back) tracking
+
+For buffered writes the scenario is a bit more complex, because the writes in
+the page cache are processed asynchronously by kernel threads (pdflush), using
+a write-back policy. So the real writes to the underlying block devices occur
+in a different I/O context with respect to the task that originally generated
+the dirty pages.
+
+The I/O bandwidth controller uses the following solution to resolve this
+problem.
+
+If the operation is a buffered write, we can charge the right cgroup by
+looking at the owner of the first page involved in the I/O operation, which
+gives the context that generated the I/O activity at the source. This
+information can be retrieved using the page_cgroup functionality originally
+provided by the cgroup memory controller [4], and now provided specifically by
+the bio-cgroup controller [5].
+
+In this way we can correctly account the I/O cost to the right cgroup, but we
+cannot throttle the current task at this stage, because, in general, it is a
+different task (e.g., pdflush, which is asynchronously processing the dirty
+pages).
+
+For this reason, all the write-back requests that are not directly submitted by
+the real owner and that need to be throttled are not dispatched immediately in
+submit_bio(). Instead, they are added into an rbtree and processed
+asynchronously by a dedicated kernel thread: kiothrottled.
+
+A deadline is associated with each throttled write-back request, depending on
+the bandwidth usage of the cgroup it belongs to. When a request is inserted
+into the rbtree, kiothrottled is awakened. This thread periodically selects
+all the requests with an expired deadline and submits the selected requests to
+the underlying block devices using generic_make_request().
+
+4.3. Usage of bio-cgroup controller
+
+The bio-cgroup controller can be used to track buffered I/O (in write-back
+conditions) and to properly apply throttling. The simplest way is to mount
+io-throttle (blockio) and bio-cgroup (bio) together to track buffered I/O.
+That's it.
+
+An alternative way is to make use of the bio-cgroup id. An association between
+a given io-throttle group and a given bio-cgroup group can be built by writing
+a bio-cgroup id to the file blockio.bio_id.
+
+This file is exported for the purpose of associating io-throttle and bio-cgroup
+groups. If you'd like to create an association, you must ensure the io-throttle
+group is empty, that is, there are no tasks in this group; otherwise, creating
+the association will fail. If an association is successfully built, moving
+tasks into this group will be denied. Of course, you can remove an association
+by echoing a negative number into blockio.bio_id.
+
+In this way we don't necessarily have to mount io-throttle and bio-cgroup
+together. It is also friendlier to other subsystems that want to use
+bio-cgroup.
+
+Example:
+* Create an association between an io-throttle group and a bio-cgroup group
+ with "bio" and "blockio" subsystems mounted in different mount points:
+ # mount -t cgroup -o bio bio-cgroup /mnt/bio-cgroup/
+ # cd /mnt/bio-cgroup/
+ # mkdir bio-grp
+ # cat bio-grp/bio.id
+ 1
+ # mount -t cgroup -o blockio blockio /mnt/io-throttle
+ # cd /mnt/io-throttle
+ # mkdir foo
+ # echo 1 > foo/blockio.bio_id
+
+* Now move the current shell in the new io-throttle/bio-cgroup group:
+ # echo $$ > /mnt/bio-cgroup/bio-grp/tasks
+
+The task will also be present in /mnt/io-throttle/foo/tasks, due to the
+previous blockio/bio association.
+
+4.4. Per-block device IO limiting rules
+
+Multiple rules for different block devices are stored in a linked list, using
+the dev_t number of each block device as the key to uniquely identify each
+element of the list. RCU synchronization is used to protect the whole list
+structure, since the elements in the list are not supposed to change frequently
+(they change only when a new rule is defined or an old rule is removed or
+updated), while reads of the list occur at each operation that generates I/O.
+This provides zero overhead for cgroups that do not use any limitation.
+
+WARNING: per-block device limiting rules always refer to the dev_t device
+number. If a block device is unplugged (e.g. a USB device) the limiting rules
+defined for that device persist and they are still valid if a new device is
+plugged into the system and it uses the same major and minor numbers.
+
+4.5. Asynchronous I/O (AIO) handling
+
+Explicit sleeps are *not* imposed on tasks doing asynchronous I/O (AIO)
+operations; AIO throttling is performed by returning -EAGAIN from
+sys_io_submit(). Userspace applications must be able to handle this error code
+appropriately.
+
+----------------------------------------------------------------------
+5. TODO
+
+* Support proportional I/O bandwidth for an optimal bandwidth usage. For
+  example, use the kiothrottled rbtree: all the requests queued to the I/O
+  subsystem first go into the rbtree; then, based on a per-cgroup I/O priority
+  and feedback from the I/O schedulers, dispatch the requests to the elevator.
+  This would allow providing both bandwidth limiting and proportional
+  bandwidth functionalities using a generic approach.
+
+* Implement a fair throttling policy: distribute the time to sleep equally
+  among all the tasks of a cgroup that exceeded the I/O limits, e.g. depending
+  on the amount of I/O activity previously generated by each task
+  (see task_io_accounting).
+
+----------------------------------------------------------------------
+6. REFERENCES
+
+[1] Documentation/cgroups/cgroups.txt
+[2] http://en.wikipedia.org/wiki/Leaky_bucket
+[3] http://en.wikipedia.org/wiki/Token_bucket
+[4] Documentation/controllers/memory.txt
+[5] http://people.valinux.co.jp/~ryov/bio-cgroup
--
1.5.6.3

2009-04-14 20:22:36

by Andrea Righi

[permalink] [raw]
Subject: [PATCH 3/9] bio-cgroup controller

From: Ryo Tsuruta <[email protected]>

With writeback IO processed asynchronously by kernel threads (pdflush),
the real writes to the underlying block devices can occur in a different
IO context with respect to the task that originally generated the dirty
pages involved in the IO operation.

The bio-cgroup controller is used by io-throttle to track writeback IO
and to properly apply throttling.
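
As a usage sketch (not part of this patch), a throttling policy in the
submission path can recover the cgroup that originally dirtied the pages
behind a writeback bio with the interfaces exported here:

#include <linux/bio.h>
#include <linux/biotrack.h>
#include <linux/cgroup.h>

/*
 * Sketch only: map a writeback bio back to the cgroup that dirtied its
 * first page, using the bio-cgroup id recorded in the page_cgroup.
 */
static struct cgroup *bio_to_blame(struct bio *bio)
{
	int id = get_bio_cgroup_id(bio);  /* 0 means default/untracked */

	return bio_id_to_cgroup(id);      /* may be NULL if the group is gone */
}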

Also apply a patch by Gui Jianfeng to announce when tasks move between
bio-cgroup groups.

See also: http://people.valinux.co.jp/~ryov/bio-cgroup

Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>
Signed-off-by: Hirokazu Takahashi <[email protected]>
---
block/blk-ioc.c | 30 ++--
fs/buffer.c | 2 +
fs/direct-io.c | 2 +
include/linux/biotrack.h | 95 +++++++++++
include/linux/cgroup_subsys.h | 6 +
include/linux/iocontext.h | 1 +
include/linux/memcontrol.h | 6 +
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 13 ++-
init/Kconfig | 15 ++
mm/Makefile | 4 +-
mm/biotrack.c | 349 +++++++++++++++++++++++++++++++++++++++++
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 5 +
mm/memory.c | 5 +
mm/page-writeback.c | 2 +
mm/page_cgroup.c | 17 ++-
mm/swap_state.c | 2 +
19 files changed, 536 insertions(+), 26 deletions(-)
create mode 100644 include/linux/biotrack.h
create mode 100644 mm/biotrack.c

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..ef8cac0 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,24 +84,28 @@ void exit_io_context(void)
}
}

+void init_io_context(struct io_context *ioc)
+{
+ atomic_set(&ioc->refcount, 1);
+ atomic_set(&ioc->nr_tasks, 1);
+ spin_lock_init(&ioc->lock);
+ ioc->ioprio_changed = 0;
+ ioc->ioprio = 0;
+ ioc->last_waited = jiffies; /* doesn't matter... */
+ ioc->nr_batch_requests = 0; /* because this is 0 */
+ ioc->aic = NULL;
+ INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+ INIT_HLIST_HEAD(&ioc->cic_list);
+ ioc->ioc_data = NULL;
+}
+
struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
{
struct io_context *ret;

ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
- if (ret) {
- atomic_set(&ret->refcount, 1);
- atomic_set(&ret->nr_tasks, 1);
- spin_lock_init(&ret->lock);
- ret->ioprio_changed = 0;
- ret->ioprio = 0;
- ret->last_waited = jiffies; /* doesn't matter... */
- ret->nr_batch_requests = 0; /* because this is 0 */
- ret->aic = NULL;
- INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
- INIT_HLIST_HEAD(&ret->cic_list);
- ret->ioc_data = NULL;
- }
+ if (ret)
+ init_io_context(ret);

return ret;
}
diff --git a/fs/buffer.c b/fs/buffer.c
index 13edf7a..bc72150 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/biotrack.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -655,6 +656,7 @@ static void __set_page_dirty(struct page *page,
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ bio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index da258e7..ec42362 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
+#include <linux/biotrack.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
@@ -799,6 +800,7 @@ static int do_direct_IO(struct dio *dio)
ret = PTR_ERR(page);
goto out;
}
+ bio_cgroup_reset_owner(page, current->mm);

while (block_in_page < blocks_per_page) {
unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..25b8810
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,95 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BIO
+
+struct tsk_move_msg {
+ int old_id;
+ int new_id;
+ struct task_struct *tsk;
+};
+
+extern int register_biocgroup_notifier(struct notifier_block *nb);
+extern int unregister_biocgroup_notifier(struct notifier_block *nb);
+
+struct io_context;
+struct block_device;
+
+struct bio_cgroup {
+ struct cgroup_subsys_state css;
+ int id;
+ struct io_context *io_context; /* default io_context */
+/* struct radix_tree_root io_context_root; per device io_context */
+};
+
+static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
+{
+ pc->bio_cgroup_id = 0;
+}
+
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+extern void put_cgroup_from_page(struct page *page);
+extern struct cgroup *bio_id_to_cgroup(int id);
+
+static inline int bio_cgroup_disabled(void)
+{
+ return bio_cgroup_subsys.disabled;
+}
+
+extern void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void bio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm);
+extern void bio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_bio_cgroup_iocontext(struct bio *bio);
+extern int get_bio_cgroup_id(struct bio *bio);
+
+#else /* CONFIG_CGROUP_BIO */
+
+struct bio_cgroup;
+
+static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline int bio_cgroup_disabled(void)
+{
+ return 1;
+}
+
+static inline void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void bio_cgroup_reset_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void bio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void bio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
+{
+ return NULL;
+}
+
+static inline int get_bio_cgroup_id(struct bio *bio)
+{
+ return 0;
+}
+
+#endif /* CONFIG_CGROUP_BIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..5df23f8 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)

/* */

+#ifdef CONFIG_CGROUP_BIO
+SUBSYS(bio_cgroup)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..be37c27 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -104,6 +104,7 @@ int put_io_context(struct io_context *ioc);
void exit_io_context(void);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
void copy_io_context(struct io_context **pdst, struct io_context **psrc);
#else
static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 18146c9..f3e0e64 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
* (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
*/

+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
@@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..47a6f55 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -607,7 +607,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -958,7 +958,7 @@ struct mem_section {

/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..a7249bb 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H

-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,16 @@
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct mem_cgroup *mem_cgroup;
+#endif
+#ifdef CONFIG_CGROUP_BIO
+ int bio_cgroup_id;
+#endif
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR) || defined(CONFIG_CGROUP_BIO)
struct list_head lru; /* per cgroup LRU list */
+#endif
};

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -71,7 +78,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
bit_spin_unlock(PCG_LOCK, &pc->flags);
}

-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
struct page_cgroup;

static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..8f7b23c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -606,8 +606,23 @@ config CGROUP_MEM_RES_CTLR_SWAP
Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
size is 4096bytes, 512k per 1Gbytes of swap.

+config CGROUP_BIO
+ bool "Block I/O cgroup subsystem"
+ depends on CGROUPS && BLOCK
+ select MM_OWNER
+ help
+ Provides a Resource Controller which enables tracking the owner
+ of every Block I/O request.
+ The information this subsystem provides can be used from any
+ kind of module such as dm-ioband device mapper modules or
+ the cfq-scheduler.
+
endif # CGROUPS

+config CGROUP_PAGE
+ def_bool y
+ depends on CGROUP_MEM_RES_CTLR || CGROUP_BIO
+
config MM_OWNER
bool

diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..a78a437 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,4 +37,6 @@ else
obj-$(CONFIG_SMP) += allocpercpu.o
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BIO) += biotrack.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..d3a35f1
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,349 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008
+ * Developed by Hirokazu Takahashi <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/idr.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+
+#define MOVETASK 0
+static BLOCKING_NOTIFIER_HEAD(biocgroup_chain);
+
+int register_biocgroup_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&biocgroup_chain, nb);
+}
+EXPORT_SYMBOL(register_biocgroup_notifier);
+
+int unregister_biocgroup_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_unregister(&biocgroup_chain, nb);
+}
+EXPORT_SYMBOL(unregister_biocgroup_notifier);
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the bio_cgroup that associates with a cgroup. */
+static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, bio_cgroup_subsys_id),
+ struct bio_cgroup, css);
+}
+
+/* Return the bio_cgroup that associates with a process. */
+static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
+ struct bio_cgroup, css);
+}
+
+static struct idr bio_cgroup_id;
+static DEFINE_SPINLOCK(bio_cgroup_idr_lock);
+static struct io_context default_bio_io_context;
+static struct bio_cgroup default_bio_cgroup = {
+ .id = 0,
+ .io_context = &default_bio_io_context,
+};
+
+/*
+ * This function is used to make a given page have the bio-cgroup id of
+ * the owner of this page.
+ */
+void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+ struct bio_cgroup *biog;
+ struct page_cgroup *pc;
+
+ if (bio_cgroup_disabled())
+ return;
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
+ if (!mm)
+ return;
+ /*
+ * Locking "pc" isn't necessary here since the current process is
+ * the only one that can access the members related to bio_cgroup.
+ */
+ rcu_read_lock();
+ biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!biog))
+ goto out;
+ /*
+ * css_get(&bio->css) isn't called to increment the reference
+ * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
+ * invalid even if this page is still active.
+ * This approach is chosen to minimize the overhead.
+ */
+ pc->bio_cgroup_id = biog->id;
+out:
+ rcu_read_unlock();
+}
+
+/*
+ * Change the owner of a given page if necessary.
+ */
+void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+ /*
+ * A little trick:
+ * Just call bio_cgroup_set_owner() for pages which are already
+ * active since the bio_cgroup_id member of page_cgroup can be
+ * updated without any locks. This is because an integer type of
+ * variable can be set a new value at once on modern cpus.
+ */
+ bio_cgroup_set_owner(page, mm);
+}
+
+/*
+ * Change the owner of a given page. This function is only effective for
+ * pages in the pagecache.
+ */
+void bio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+ if (PageSwapCache(page) || PageAnon(page))
+ return;
+ if (current->flags & PF_MEMALLOC)
+ return;
+
+ bio_cgroup_reset_owner(page, mm);
+}
+
+/*
+ * Assign "page" the same owner as "opage."
+ */
+void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+
+ if (bio_cgroup_disabled())
+ return;
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return;
+
+ /*
+ * Do this without any locks. The reason is the same as
+ * bio_cgroup_reset_owner().
+ */
+ npc->bio_cgroup_id = opc->bio_cgroup_id;
+}
+
+/* Create a new bio-cgroup. */
+static struct cgroup_subsys_state *
+bio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct bio_cgroup *biog;
+ struct io_context *ioc;
+ int ret;
+
+ if (!cgrp->parent) {
+ biog = &default_bio_cgroup;
+ init_io_context(biog->io_context);
+ /* Increment the reference count so it is never released. */
+ atomic_inc(&biog->io_context->refcount);
+ idr_init(&bio_cgroup_id);
+ return &biog->css;
+ }
+
+ biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+ ioc = alloc_io_context(GFP_KERNEL, -1);
+ if (!ioc || !biog) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
+ biog->io_context = ioc;
+retry:
+ if (!idr_pre_get(&bio_cgroup_id, GFP_KERNEL)) {
+ ret = -EAGAIN;
+ goto out_err;
+ }
+ spin_lock_irq(&bio_cgroup_idr_lock);
+ ret = idr_get_new_above(&bio_cgroup_id, (void *)biog, 1, &biog->id);
+ spin_unlock_irq(&bio_cgroup_idr_lock);
+ if (ret == -EAGAIN)
+ goto retry;
+ else if (ret)
+ goto out_err;
+
+ return &biog->css;
+out_err:
+ kfree(biog);
+ if (ioc)
+ put_io_context(ioc);
+ return ERR_PTR(ret);
+}
+
+/* Delete the bio-cgroup. */
+static void bio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct bio_cgroup *biog = cgroup_bio(cgrp);
+
+ put_io_context(biog->io_context);
+
+ spin_lock_irq(&bio_cgroup_idr_lock);
+ idr_remove(&bio_cgroup_id, biog->id);
+ spin_unlock_irq(&bio_cgroup_idr_lock);
+
+ kfree(biog);
+}
+
+static struct bio_cgroup *find_bio_cgroup(int id)
+{
+ struct bio_cgroup *biog;
+ spin_lock_irq(&bio_cgroup_idr_lock);
+ /*
+ * It might fail to find a bio-cgroup associated with "id" since it
+ * is allowed to remove the bio-cgroup even when some of the I/O requests
+ * this group issued haven't completed yet.
+ */
+ biog = (struct bio_cgroup *)idr_find(&bio_cgroup_id, id);
+ spin_unlock_irq(&bio_cgroup_idr_lock);
+ return biog;
+}
+
+struct cgroup *bio_id_to_cgroup(int id)
+{
+ struct bio_cgroup *biog;
+
+ biog = find_bio_cgroup(id);
+ if (biog)
+ return biog->css.cgroup;
+
+ return NULL;
+}
+
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+ struct page_cgroup *pc;
+ struct bio_cgroup *biog;
+ struct cgroup *cgrp = NULL;
+
+ pc = lookup_page_cgroup(page);
+ if (!pc)
+ return NULL;
+ lock_page_cgroup(pc);
+ biog = find_bio_cgroup(pc->bio_cgroup_id);
+ if (biog) {
+ css_get(&biog->css);
+ cgrp = biog->css.cgroup;
+ }
+ unlock_page_cgroup(pc);
+ return cgrp;
+}
+
+void put_cgroup_from_page(struct page *page)
+{
+ struct bio_cgroup *biog;
+ struct page_cgroup *pc;
+
+ pc = lookup_page_cgroup(page);
+ if (!pc)
+ return;
+ lock_page_cgroup(pc);
+ biog = find_bio_cgroup(pc->bio_cgroup_id);
+ if (biog)
+ css_put(&biog->css);
+ unlock_page_cgroup(pc);
+}
+
+/* Determine the bio-cgroup id of a given bio. */
+int get_bio_cgroup_id(struct bio *bio)
+{
+ struct page_cgroup *pc;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ int id = 0;
+
+ pc = lookup_page_cgroup(page);
+ if (pc)
+ id = pc->bio_cgroup_id;
+ return id;
+}
+EXPORT_SYMBOL(get_bio_cgroup_id);
+
+/* Determine the iocontext of the bio-cgroup that issued a given bio. */
+struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
+{
+ struct bio_cgroup *biog = NULL;
+ struct io_context *ioc;
+ int id = 0;
+
+ id = get_bio_cgroup_id(bio);
+ if (id)
+ biog = find_bio_cgroup(id);
+ if (!biog)
+ biog = &default_bio_cgroup;
+ ioc = biog->io_context; /* default io_context for this cgroup */
+ atomic_inc(&ioc->refcount);
+ return ioc;
+}
+EXPORT_SYMBOL(get_bio_cgroup_iocontext);
+
+static u64 bio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct bio_cgroup *biog = cgroup_bio(cgrp);
+ return (u64) biog->id;
+}
+
+
+static struct cftype bio_files[] = {
+ {
+ .name = "id",
+ .read_u64 = bio_id_read,
+ },
+};
+
+static int bio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, bio_files, ARRAY_SIZE(bio_files));
+}
+
+static void bio_cgroup_attach(struct cgroup_subsys *ss,
+ struct cgroup *cont, struct cgroup *oldcont,
+ struct task_struct *tsk)
+{
+ struct tsk_move_msg tmm;
+ struct bio_cgroup *old_biog, *new_biog;
+
+ old_biog = cgroup_bio(oldcont);
+ new_biog = cgroup_bio(cont);
+ tmm.old_id = old_biog->id;
+ tmm.new_id = new_biog->id;
+ tmm.tsk = tsk;
+ blocking_notifier_call_chain(&biocgroup_chain, MOVETASK, &tmm);
+}
+
+struct cgroup_subsys bio_cgroup_subsys = {
+ .name = "bio",
+ .create = bio_cgroup_create,
+ .destroy = bio_cgroup_destroy,
+ .populate = bio_cgroup_populate,
+ .attach = bio_cgroup_attach,
+ .subsys_id = bio_cgroup_subsys_id,
+};
+
diff --git a/mm/bounce.c b/mm/bounce.c
index e590272..1a01905 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
#include <linux/hash.h>
#include <linux/highmem.h>
#include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
#include <trace/block.h>
#include <asm/tlbflush.h>

@@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ bio_cgroup_copy_owner(to->bv_page, page);

if (rw == WRITE) {
char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 8bd4980..1ab32a2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
#include "internal.h"

@@ -463,6 +464,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
+ bio_cgroup_set_owner(page, current->mm);

error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..c25eb63 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2524,6 +2524,11 @@ struct cgroup_subsys mem_cgroup_subsys = {
.use_id = 1,
};

+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+}
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP

static int __init disable_swap_account(char *s)
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..7779e12 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mmu_notifier.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
@@ -2052,6 +2053,7 @@ gotten:
* thread doing COW.
*/
ptep_clear_flush_notify(vma, address, page_table);
+ bio_cgroup_set_owner(new_page, mm);
page_add_new_anon_rmap(new_page, vma, address);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
@@ -2497,6 +2499,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ bio_cgroup_reset_owner(page, mm);
/* It's better to call commit-charge after rmap is established */
mem_cgroup_commit_charge_swapin(page, ptr);

@@ -2559,6 +2562,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (!pte_none(*page_table))
goto release;
inc_mm_counter(mm, anon_rss);
+ bio_cgroup_set_owner(page, mm);
page_add_new_anon_rmap(page, vma, address);
set_pte_at(mm, address, page_table, entry);

@@ -2711,6 +2715,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (anon) {
inc_mm_counter(mm, anon_rss);
+ bio_cgroup_set_owner(page, mm);
page_add_new_anon_rmap(page, vma, address);
} else {
inc_mm_counter(mm, file_rss);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..1379eb0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -26,6 +26,7 @@
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
+#include <linux/biotrack.h>
#include <linux/percpu.h>
#include <linux/notifier.h>
#include <linux/smp.h>
@@ -1243,6 +1244,7 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ bio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 791905c..f692ee2 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,13 +9,16 @@
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
#include <linux/swapops.h>
+#include <linux/memcontrol.h>
+#include <linux/biotrack.h>

static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
+ __init_mem_page_cgroup(pc);
+ __init_bio_page_cgroup(pc);
INIT_LIST_HEAD(&pc->lru);
}
static unsigned long total_usage;
@@ -74,7 +77,7 @@ void __init page_cgroup_init(void)

int nid, fail;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && bio_cgroup_disabled())
return;

for_each_online_node(nid) {
@@ -83,12 +86,12 @@ void __init page_cgroup_init(void)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you"
+ printk(KERN_INFO "please try cgroup_disable=memory,bio option if you"
" don't want\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
- printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+ printk(KERN_CRIT "please try cgroup_disable=memory,bio boot options\n");
panic("Out of memory");
}

@@ -248,7 +251,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int fail = 0;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && bio_cgroup_disabled())
return;

for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -263,8 +266,8 @@ void __init page_cgroup_init(void)
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
- " want\n");
+ printk(KERN_INFO
+ "try cgroup_disable=memory,bio option if you don't want\n");
}

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..c7ad256 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,6 +17,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/migrate.h>
+#include <linux/biotrack.h>
#include <linux/page_cgroup.h>

#include <asm/pgtable.h>
@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
+ bio_cgroup_set_owner(new_page, current->mm);
err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (likely(!err)) {
/*
--
1.5.6.3

2009-04-14 20:22:21

by Andrea Righi

[permalink] [raw]
Subject: [PATCH 2/9] res_counter: introduce ratelimiting attributes

Introduce attributes and functions in res_counter to implement throttling-based
cgroup subsystems.

The following attributes have been added to struct res_counter:
* @policy: the limiting policy / algorithm
* @capacity: the maximum capacity of the resource
* @timestamp: timestamp of the last accounted resource request

Currently the available policies are token-bucket and leaky-bucket; the
attribute @capacity is only used by the token-bucket policy (to represent
the bucket size).

The following function has been implemented to return the amount of time a
cgroup should sleep to remain within the defined resource limits.

unsigned long long
res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);

[ Note: only the interfaces needed by the cgroup IO controller are implemented
right now ]

Signed-off-by: Andrea Righi <[email protected]>
---
include/linux/res_counter.h | 69 +++++++++++++++++++++++++++++++----------
kernel/res_counter.c | 72 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 124 insertions(+), 17 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 4c5bcf6..9bed6af 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -14,30 +14,36 @@
*/

#include <linux/cgroup.h>
+#include <linux/jiffies.h>

-/*
- * The core object. the cgroup that wishes to account for some
- * resource may include this counter into its structures and use
- * the helpers described beyond
- */
+/* The various policies that can be used for ratelimiting resources */
+#define RATELIMIT_LEAKY_BUCKET 0
+#define RATELIMIT_TOKEN_BUCKET 1

+/**
+ * struct res_counter - the core object to account cgroup resources
+ *
+ * @usage: the current resource consumption level
+ * @max_usage: the maximal value of the usage from the counter creation
+ * @limit: the limit that usage cannot be exceeded
+ * @failcnt: the number of unsuccessful attempts to consume the resource
+ * @policy: the limiting policy / algorithm
+ * @capacity: the maximum capacity of the resource
+ * @timestamp: timestamp of the last accounted resource request
+ * @lock: the lock to protect all of the above.
+ * The routines below consider this to be IRQ-safe
+ *
+ * The cgroup that wishes to account for some resource may include this counter
+ * into its structures and use the helpers described beyond.
+ */
struct res_counter {
- /*
- * the current resource consumption level
- */
unsigned long long usage;
- /*
- * the maximal value of the usage from the counter creation
- */
unsigned long long max_usage;
- /*
- * the limit that usage cannot exceed
- */
unsigned long long limit;
- /*
- * the number of unsuccessful attempts to consume the resource
- */
unsigned long long failcnt;
+ unsigned long long policy;
+ unsigned long long capacity;
+ unsigned long long timestamp;
/*
* the lock to protect all of the above.
* the routines below consider this to be IRQ-safe
@@ -84,6 +90,9 @@ enum {
RES_USAGE,
RES_MAX_USAGE,
RES_LIMIT,
+ RES_POLICY,
+ RES_TIMESTAMP,
+ RES_CAPACITY,
RES_FAILCNT,
};

@@ -130,6 +139,15 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
return false;
}

+static inline unsigned long long
+res_counter_ratelimit_delta_t(struct res_counter *res)
+{
+ return (long long)get_jiffies_64() - (long long)res->timestamp;
+}
+
+unsigned long long
+res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val);
+
/*
* Helper function to detect if the cgroup is within it's limit or
* not. It's currently called from cgroup_rss_prepare()
@@ -163,6 +181,23 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)
spin_unlock_irqrestore(&cnt->lock, flags);
}

+static inline int
+res_counter_ratelimit_set_limit(struct res_counter *cnt,
+ unsigned long long policy,
+ unsigned long long limit, unsigned long long max)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->limit = limit;
+ cnt->capacity = max;
+ cnt->policy = policy;
+ cnt->timestamp = get_jiffies_64();
+ cnt->usage = 0;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
static inline int res_counter_set_limit(struct res_counter *cnt,
unsigned long long limit)
{
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index bf8e753..b62319c 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -9,6 +9,7 @@

#include <linux/types.h>
#include <linux/parser.h>
+#include <linux/jiffies.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/res_counter.h>
@@ -20,6 +21,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
counter->parent = parent;
+ counter->capacity = (unsigned long long)LLONG_MAX;
+ counter->timestamp = get_jiffies_64();
}

int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -99,6 +102,12 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->max_usage;
case RES_LIMIT:
return &counter->limit;
+ case RES_POLICY:
+ return &counter->policy;
+ case RES_TIMESTAMP:
+ return &counter->timestamp;
+ case RES_CAPACITY:
+ return &counter->capacity;
case RES_FAILCNT:
return &counter->failcnt;
};
@@ -163,3 +172,66 @@ int res_counter_write(struct res_counter *counter, int member,
spin_unlock_irqrestore(&counter->lock, flags);
return 0;
}
+
+static unsigned long long
+ratelimit_leaky_bucket(struct res_counter *res, ssize_t val)
+{
+ unsigned long long delta, t;
+
+ res->usage += val;
+ delta = res_counter_ratelimit_delta_t(res);
+ if (!delta)
+ return 0;
+ t = res->usage * USEC_PER_SEC;
+ t = usecs_to_jiffies(div_u64(t, res->limit));
+ if (t > delta)
+ return t - delta;
+ /* Reset i/o statistics */
+ res->usage = 0;
+ res->timestamp = get_jiffies_64();
+ return 0;
+}
+
+static unsigned long long
+ratelimit_token_bucket(struct res_counter *res, ssize_t val)
+{
+ unsigned long long delta;
+ long long tok;
+
+ res->usage -= val;
+ delta = jiffies_to_msecs(res_counter_ratelimit_delta_t(res));
+ res->timestamp = get_jiffies_64();
+ tok = (long long)res->usage * MSEC_PER_SEC;
+ if (delta) {
+ long long max = (long long)res->capacity * MSEC_PER_SEC;
+
+ tok += delta * res->limit;
+ if (tok > max)
+ tok = max;
+ res->usage = (unsigned long long)div_s64(tok, MSEC_PER_SEC);
+ }
+ return (tok < 0) ? msecs_to_jiffies(div_u64(-tok, res->limit)) : 0;
+}
+
+unsigned long long
+res_counter_ratelimit_sleep(struct res_counter *res, ssize_t val)
+{
+ unsigned long long sleep = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&res->lock, flags);
+ if (res->limit)
+ switch (res->policy) {
+ case RATELIMIT_LEAKY_BUCKET:
+ sleep = ratelimit_leaky_bucket(res, val);
+ break;
+ case RATELIMIT_TOKEN_BUCKET:
+ sleep = ratelimit_token_bucket(res, val);
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+ spin_unlock_irqrestore(&res->lock, flags);
+ return sleep;
+}
--
1.5.6.3

2009-04-14 20:22:53

by Andrea Righi

[permalink] [raw]
Subject: [PATCH 5/9] io-throttle controller infrastructure

This is the core of the io-throttle kernel infrastructure. It creates
the basic interfaces to the cgroup subsystem and implements the I/O
measurement and throttling functionality.
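
For reference, a minimal sketch of how a kernel code path is expected to use
this interface follows (illustration only, not part of the patch; it mirrors
the submit_bio() hook added later in this series and relies on the
cgroup_io_throttle() and iothrottle_make_request() prototypes introduced by
this patchset):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/jiffies.h>
#include <linux/blk-io-throttle.h>

/*
 * Illustration only: account a write bio against the owner's cgroup and,
 * if the cgroup exceeded its limits, defer the bio to kiothrottled
 * instead of dispatching it immediately.
 */
static void dispatch_with_throttle(struct bio *bio)
{
	unsigned long long sleep;

	/* account the i/o and get the required sleep time (in jiffies) */
	sleep = cgroup_io_throttle(bio, bio->bi_bdev, bio->bi_size);

	/* hand the bio to kiothrottled with a deadline in the future */
	if (sleep && !iothrottle_make_request(bio, jiffies + sleep))
		return;

	generic_make_request(bio);
}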

Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
---
block/Makefile | 1 +
block/blk-io-throttle.c | 1052 +++++++++++++++++++++++++++++++++++++++
include/linux/blk-io-throttle.h | 110 ++++
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 10 +
5 files changed, 1179 insertions(+), 0 deletions(-)
create mode 100644 block/blk-io-throttle.c
create mode 100644 include/linux/blk-io-throttle.h

diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..42b6a46 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,5 +13,6 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o

+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
new file mode 100644
index 0000000..36db803
--- /dev/null
+++ b/block/blk-io-throttle.c
@@ -0,0 +1,1052 @@
+/*
+ * blk-io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <[email protected]>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/res_counter.h>
+#include <linux/memcontrol.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/genhd.h>
+#include <linux/hardirq.h>
+#include <linux/list.h>
+#include <linux/seq_file.h>
+#include <linux/spinlock.h>
+#include <linux/biotrack.h>
+#include <linux/blk-io-throttle.h>
+#include <linux/biotrack.h>
+#include <linux/sched.h>
+#include <linux/bio.h>
+
+/*
+ * Statistics for I/O bandwidth controller.
+ */
+enum iothrottle_stat_index {
+ /* # of times the cgroup has been throttled for bw limit */
+ IOTHROTTLE_STAT_BW_COUNT,
+ /* # of jiffies spent to sleep for throttling for bw limit */
+ IOTHROTTLE_STAT_BW_SLEEP,
+ /* # of times the cgroup has been throttled for iops limit */
+ IOTHROTTLE_STAT_IOPS_COUNT,
+ /* # of jiffies spent to sleep for throttling for iops limit */
+ IOTHROTTLE_STAT_IOPS_SLEEP,
+ /* total number of bytes read and written */
+ IOTHROTTLE_STAT_BYTES_TOT,
+ /* total number of I/O operations */
+ IOTHROTTLE_STAT_IOPS_TOT,
+
+ IOTHROTTLE_STAT_NSTATS,
+};
+
+struct iothrottle_stat_cpu {
+ unsigned long long count[IOTHROTTLE_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct iothrottle_stat {
+ struct iothrottle_stat_cpu cpustat[NR_CPUS];
+};
+
+static void iothrottle_stat_add(struct iothrottle_stat *stat,
+ enum iothrottle_stat_index type, unsigned long long val)
+{
+ int cpu = get_cpu();
+
+ stat->cpustat[cpu].count[type] += val;
+ put_cpu();
+}
+
+static void iothrottle_stat_add_sleep(struct iothrottle_stat *stat,
+ int type, unsigned long long sleep)
+{
+ int cpu = get_cpu();
+
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_COUNT]++;
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_BW_SLEEP] += sleep;
+ break;
+ case IOTHROTTLE_IOPS:
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_COUNT]++;
+ stat->cpustat[cpu].count[IOTHROTTLE_STAT_IOPS_SLEEP] += sleep;
+ break;
+ }
+ put_cpu();
+}
+
+static unsigned long long iothrottle_read_stat(struct iothrottle_stat *stat,
+ enum iothrottle_stat_index idx)
+{
+ int cpu;
+ unsigned long long ret = 0;
+
+ for_each_possible_cpu(cpu)
+ ret += stat->cpustat[cpu].count[idx];
+ return ret;
+}
+
+struct iothrottle_sleep {
+ unsigned long long bw_sleep;
+ unsigned long long iops_sleep;
+};
+
+/*
+ * struct iothrottle_node - throttling rule of a single block device
+ * @node: list of per block device throttling rules
+ * @dev: block device number, used as key in the list
+ * @bw: max i/o bandwidth (in bytes/s)
+ * @iops: max i/o operations per second
+ * @stat: throttling statistics
+ *
+ * Define an i/o throttling rule for a single block device.
+ *
+ * NOTE: limiting rules always refer to dev_t; if a block device is unplugged
+ * the limiting rules defined for that device persist and they are still valid
+ * if a new device is plugged and it uses the same dev_t number.
+ */
+struct iothrottle_node {
+ struct list_head node;
+ dev_t dev;
+ struct res_counter bw;
+ struct res_counter iops;
+ struct iothrottle_stat stat;
+};
+
+/* A list of iothrottle which associate with a bio_cgroup */
+static LIST_HEAD(bio_group_list);
+static DECLARE_MUTEX(bio_group_list_sem);
+
+enum {
+ MOVING_FORBIDDEN,
+};
+/**
+ * struct iothrottle - throttling rules for a cgroup
+ * @css: pointer to the cgroup state
+ * @list: list of iothrottle_node elements
+ *
+ * Define multiple per-block device i/o throttling rules.
+ * Note: the list of the throttling rules is protected by RCU locking:
+ * - hold cgroup_lock() for update.
+ * - hold rcu_read_lock() for read.
+ */
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ struct list_head list;
+ struct list_head bio_node;
+ int bio_id;
+ unsigned long flags;
+};
+static struct iothrottle init_iothrottle;
+
+static inline int is_bind_biocgroup(void)
+{
+ if (init_iothrottle.css.cgroup->subsys[bio_cgroup_subsys_id])
+ return 1;
+ return 0;
+}
+
+static inline int is_moving_forbidden(const struct iothrottle *iot)
+{
+ return test_bit(MOVING_FORBIDDEN, &iot->flags);
+}
+
+/* NOTE: must be called with rcu_read_lock() or bio_group_list_sem held */
+static struct iothrottle *get_bioid_to_iothrottle(int id)
+{
+ struct iothrottle *iot;
+
+ list_for_each_entry_rcu(iot, &bio_group_list, bio_node) {
+ if (iot->bio_id == id) {
+ css_get(&iot->css);
+ return iot;
+ }
+ }
+ return NULL;
+}
+
+static int is_bio_group(struct iothrottle *iot)
+{
+ if (iot && iot->bio_id > 0)
+ return 0;
+ return -1;
+}
+
+static int synchronize_bio_cgroup(int old_id, int new_id,
+ struct task_struct *tsk)
+{
+ struct iothrottle *old_group, *new_group;
+ int ret = 0;
+
+ old_group = get_bioid_to_iothrottle(old_id);
+ new_group = get_bioid_to_iothrottle(new_id);
+
+ /* no need to hold cgroup_lock() for bio_cgroup holding it already */
+ get_task_struct(tsk);
+
+ /* This has nothing to do with us! */
+ if (is_bio_group(old_group) && is_bio_group(new_group))
+ goto out;
+
+ /*
+ * If moving from an associated one to an unassociated one,
+ * just move it to root.
+ */
+ if (!is_bio_group(old_group) && is_bio_group(new_group)) {
+ BUG_ON(is_moving_forbidden(&init_iothrottle));
+ clear_bit(MOVING_FORBIDDEN, &old_group->flags);
+ ret = cgroup_attach_task(init_iothrottle.css.cgroup, tsk);
+ set_bit(MOVING_FORBIDDEN, &old_group->flags);
+ goto out;
+ }
+
+ if (!is_bio_group(new_group) && is_bio_group(old_group)) {
+ BUG_ON(!is_moving_forbidden(new_group));
+ clear_bit(MOVING_FORBIDDEN, &new_group->flags);
+ ret = cgroup_attach_task(new_group->css.cgroup, tsk);
+ set_bit(MOVING_FORBIDDEN, &new_group->flags);
+ goto out;
+ }
+
+ if (!is_bio_group(new_group) && !is_bio_group(old_group)) {
+ BUG_ON(!is_moving_forbidden(new_group));
+ clear_bit(MOVING_FORBIDDEN, &new_group->flags);
+ clear_bit(MOVING_FORBIDDEN, &old_group->flags);
+ ret = cgroup_attach_task(new_group->css.cgroup, tsk);
+ set_bit(MOVING_FORBIDDEN, &old_group->flags);
+ set_bit(MOVING_FORBIDDEN, &new_group->flags);
+ goto out;
+ }
+out:
+ put_task_struct(tsk);
+ if (new_group)
+ css_put(&new_group->css);
+ if (old_group)
+ css_put(&old_group->css);
+ return ret;
+}
+
+static int iothrottle_notifier_call(struct notifier_block *this,
+ unsigned long event, void *ptr)
+{
+ struct tsk_move_msg *tmm;
+ int old_id, new_id;
+ struct task_struct *tsk;
+
+ if (is_bind_biocgroup())
+ return NOTIFY_OK;
+
+ tmm = (struct tsk_move_msg *)ptr;
+ old_id = tmm->old_id;
+ new_id = tmm->new_id;
+ if (old_id == new_id)
+ return NOTIFY_OK;
+ tsk = tmm->tsk;
+ down(&bio_group_list_sem);
+ synchronize_bio_cgroup(old_id, new_id, tsk);
+ up(&bio_group_list_sem);
+
+ return NOTIFY_OK;
+}
+
+
+static struct notifier_block iothrottle_notifier = {
+ .notifier_call = iothrottle_notifier_call,
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() held.
+ */
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Note: called with rcu_read_lock() or iot->lock held.
+ */
+static struct iothrottle_node *
+iothrottle_search_node(const struct iothrottle *iot, dev_t dev)
+{
+ struct iothrottle_node *n;
+
+ if (list_empty(&iot->list))
+ return NULL;
+ list_for_each_entry_rcu(n, &iot->list, node)
+ if (n->dev == dev)
+ return n;
+ return NULL;
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void iothrottle_insert_node(struct iothrottle *iot,
+ struct iothrottle_node *n)
+{
+ list_add_rcu(&n->node, &iot->list);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_replace_node(struct iothrottle *iot, struct iothrottle_node *old,
+ struct iothrottle_node *new)
+{
+ list_replace_rcu(&old->node, &new->node);
+}
+
+/*
+ * Note: called with iot->lock held.
+ */
+static inline void
+iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
+{
+ list_del_rcu(&n->node);
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *
+iothrottle_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle *iot;
+
+ if (unlikely((cgrp->parent) == NULL)) {
+ iot = &init_iothrottle;
+ /* where should we release? */
+ register_biocgroup_notifier(&iothrottle_notifier);
+ } else {
+ iot = kzalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+ }
+ INIT_LIST_HEAD(&iot->list);
+ INIT_LIST_HEAD(&iot->bio_node);
+ iot->bio_id = -1;
+ clear_bit(MOVING_FORBIDDEN, &iot->flags);
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct iothrottle_node *n, *p;
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+
+ if (unlikely((cgrp->parent) == NULL))
+ unregister_biocgroup_notifier(&iothrottle_notifier);
+
+ /*
+ * don't worry about locking here; at this point there must not be any
+ * references to the list.
+ */
+ if (!list_empty(&iot->list))
+ list_for_each_entry_safe(n, p, &iot->list, node)
+ kfree(n);
+ kfree(iot);
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ *
+ * do not care too much about locking for single res_counter values here.
+ */
+static void iothrottle_show_limit(struct seq_file *m, dev_t dev,
+ struct res_counter *res)
+{
+ if (!res->limit)
+ return;
+ seq_printf(m, "%u %u %llu %llu %lli %llu %li\n",
+ MAJOR(dev), MINOR(dev),
+ res->limit, res->policy,
+ (long long)res->usage, res->capacity,
+ jiffies_to_clock_t(res_counter_ratelimit_delta_t(res)));
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ *
+ */
+static void iothrottle_show_failcnt(struct seq_file *m, dev_t dev,
+ struct iothrottle_stat *stat)
+{
+ unsigned long long bw_count, bw_sleep, iops_count, iops_sleep;
+
+ bw_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_COUNT);
+ bw_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BW_SLEEP);
+ iops_count = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_COUNT);
+ iops_sleep = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_SLEEP);
+
+ seq_printf(m, "%u %u %llu %li %llu %li\n", MAJOR(dev), MINOR(dev),
+ bw_count, jiffies_to_clock_t(bw_sleep),
+ iops_count, jiffies_to_clock_t(iops_sleep));
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_show_stat(struct seq_file *m, dev_t dev,
+ struct iothrottle_stat *stat)
+{
+ unsigned long long bytes, iops;
+
+ bytes = iothrottle_read_stat(stat, IOTHROTTLE_STAT_BYTES_TOT);
+ iops = iothrottle_read_stat(stat, IOTHROTTLE_STAT_IOPS_TOT);
+
+ seq_printf(m, "%u %u %llu %llu\n", MAJOR(dev), MINOR(dev), bytes, iops);
+}
+
+static int iothrottle_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct iothrottle *iot = cgroup_to_iothrottle(cgrp);
+ struct iothrottle_node *n;
+
+ rcu_read_lock();
+ if (list_empty(&iot->list))
+ goto unlock_and_return;
+ list_for_each_entry_rcu(n, &iot->list, node) {
+ BUG_ON(!n->dev);
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ iothrottle_show_limit(m, n->dev, &n->bw);
+ break;
+ case IOTHROTTLE_IOPS:
+ iothrottle_show_limit(m, n->dev, &n->iops);
+ break;
+ case IOTHROTTLE_FAILCNT:
+ iothrottle_show_failcnt(m, n->dev, &n->stat);
+ break;
+ case IOTHROTTLE_STAT:
+ iothrottle_show_stat(m, n->dev, &n->stat);
+ break;
+ }
+ }
+unlock_and_return:
+ rcu_read_unlock();
+ return 0;
+}
+
+static dev_t devname2dev_t(const char *buf)
+{
+ struct block_device *bdev;
+ dev_t dev = 0;
+ struct gendisk *disk;
+ int part;
+
+ /* use a lookup to validate the block device */
+ bdev = lookup_bdev(buf);
+ if (IS_ERR(bdev))
+ return 0;
+ /* only entire devices are allowed, not single partitions */
+ disk = get_gendisk(bdev->bd_dev, &part);
+ if (disk && !part) {
+ BUG_ON(!bdev->bd_inode);
+ dev = bdev->bd_inode->i_rdev;
+ }
+ bdput(bdev);
+
+ return dev;
+}
+
+/*
+ * The userspace input string must use one of the following syntaxes:
+ *
+ * dev:0 <- delete an i/o limiting rule
+ * dev:io-limit:0 <- set a leaky bucket throttling rule
+ * dev:io-limit:1:bucket-size <- set a token bucket throttling rule
+ * dev:io-limit:1 <- set a token bucket throttling rule using
+ * bucket-size == io-limit
+ */
+static int iothrottle_parse_args(char *buf, size_t nbytes, int filetype,
+ dev_t *dev, unsigned long long *iolimit,
+ unsigned long long *strategy,
+ unsigned long long *bucket_size)
+{
+ char *p;
+ int count = 0;
+ char *s[4];
+ int ret;
+
+ memset(s, 0, sizeof(s));
+ *dev = 0;
+ *iolimit = 0;
+ *strategy = 0;
+ *bucket_size = 0;
+
+ /* split the colon-delimited input string into its elements */
+ while (count < ARRAY_SIZE(s)) {
+ p = strsep(&buf, ":");
+ if (!p)
+ break;
+ if (!*p)
+ continue;
+ s[count++] = p;
+ }
+
+ /* i/o limit */
+ if (!s[1])
+ return -EINVAL;
+ ret = strict_strtoull(s[1], 10, iolimit);
+ if (ret < 0)
+ return ret;
+ if (!*iolimit)
+ goto out;
+ /* throttling strategy (leaky bucket / token bucket) */
+ if (!s[2])
+ return -EINVAL;
+ ret = strict_strtoull(s[2], 10, strategy);
+ if (ret < 0)
+ return ret;
+ switch (*strategy) {
+ case RATELIMIT_LEAKY_BUCKET:
+ goto out;
+ case RATELIMIT_TOKEN_BUCKET:
+ break;
+ default:
+ return -EINVAL;
+ }
+ /* bucket size */
+ if (!s[3])
+ *bucket_size = *iolimit;
+ else {
+ ret = strict_strtoll(s[3], 10, bucket_size);
+ if (ret < 0)
+ return ret;
+ }
+ if (*bucket_size <= 0)
+ return -EINVAL;
+out:
+ /* block device number */
+ *dev = devname2dev_t(s[0]);
+ return *dev ? 0 : -EINVAL;
+}
+
+static int iothrottle_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct iothrottle *iot;
+ struct iothrottle_node *n, *newn = NULL;
+ dev_t dev;
+ unsigned long long iolimit, strategy, bucket_size;
+ char *buf;
+ size_t nbytes = strlen(buffer);
+ int ret = 0;
+
+ /*
+ * We need to allocate a new buffer here, because
+ * iothrottle_parse_args() can modify it and the buffer provided by
+ * write_string is supposed to be const.
+ */
+ buf = kmalloc(nbytes + 1, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+ memcpy(buf, buffer, nbytes + 1);
+
+ ret = iothrottle_parse_args(buf, nbytes, cft->private, &dev, &iolimit,
+ &strategy, &bucket_size);
+ if (ret)
+ goto out1;
+ newn = kzalloc(sizeof(*newn), GFP_KERNEL);
+ if (!newn) {
+ ret = -ENOMEM;
+ goto out1;
+ }
+ newn->dev = dev;
+ res_counter_init(&newn->bw, NULL);
+ res_counter_init(&newn->iops, NULL);
+
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ res_counter_ratelimit_set_limit(&newn->iops, 0, 0, 0);
+ res_counter_ratelimit_set_limit(&newn->bw, strategy,
+ ALIGN(iolimit, 1024), ALIGN(bucket_size, 1024));
+ break;
+ case IOTHROTTLE_IOPS:
+ res_counter_ratelimit_set_limit(&newn->bw, 0, 0, 0);
+ /*
+ * scale up the iops cost by a factor of 1000; this allows us to
+ * apply more fine-grained sleeps, and makes the throttling more
+ * precise.
+ */
+ res_counter_ratelimit_set_limit(&newn->iops, strategy,
+ iolimit * 1000, bucket_size * 1000);
+ break;
+ default:
+ WARN_ON(1);
+ break;
+ }
+
+ if (!cgroup_lock_live_group(cgrp)) {
+ ret = -ENODEV;
+ goto out1;
+ }
+ iot = cgroup_to_iothrottle(cgrp);
+
+ n = iothrottle_search_node(iot, dev);
+ if (!n) {
+ if (iolimit) {
+ /* Add a new block device limiting rule */
+ iothrottle_insert_node(iot, newn);
+ newn = NULL;
+ }
+ goto out2;
+ }
+ switch (cft->private) {
+ case IOTHROTTLE_BANDWIDTH:
+ if (!iolimit && !n->iops.limit) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, n);
+ goto out2;
+ }
+ if (!n->iops.limit)
+ break;
+ /* Update a block device limiting rule */
+ newn->iops = n->iops;
+ break;
+ case IOTHROTTLE_IOPS:
+ if (!iolimit && !n->bw.limit) {
+ /* Delete a block device limiting rule */
+ iothrottle_delete_node(iot, n);
+ goto out2;
+ }
+ if (!n->bw.limit)
+ break;
+ /* Update a block device limiting rule */
+ newn->bw = n->bw;
+ break;
+ }
+ iothrottle_replace_node(iot, n, newn);
+ newn = NULL;
+out2:
+ cgroup_unlock();
+ if (n) {
+ synchronize_rcu();
+ kfree(n);
+ }
+out1:
+ kfree(newn);
+ kfree(buf);
+ return ret;
+}
+
+static s64 read_bio_id(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct iothrottle *iot;
+
+ iot = cgroup_to_iothrottle(cgrp);
+ return iot->bio_id;
+}
+
+/**
+ * iothrottle_do_move_task - move a given task to another iothrottle cgroup
+ * @tsk: pointer to the task_struct of the task to move
+ * @scan: struct cgroup_scanner
+ *
+ * Called by cgroup_scan_tasks() for each task in a cgroup.
+ */
+static void iothrottle_do_move_task(struct task_struct *tsk,
+ struct cgroup_scanner *scan)
+{
+ struct cgroup *new_cgroup = scan->data;
+
+ cgroup_attach_task(new_cgroup, tsk);
+}
+
+/**
+ * move_tasks_to_init_cgroup - move tasks from one cgroup to another iothrottle
+ * cgroup
+ * @from: iothrottle in which the tasks currently reside
+ * @to: iothrottle to which the tasks will be moved
+ *
+ * NOTE: called with cgroup_mutex held
+ *
+ * The cgroup_scan_tasks() function will scan all the tasks in a cgroup
+ * calling callback functions for each.
+ */
+static void move_tasks_to_init_cgroup(struct cgroup *from, struct cgroup *to)
+{
+ struct cgroup_scanner scan;
+
+ scan.cg = from;
+ scan.test_task = NULL; /* select all tasks in cgroup */
+ scan.process_task = iothrottle_do_move_task;
+ scan.heap = NULL;
+ scan.data = to;
+
+ if (cgroup_scan_tasks(&scan))
+ printk(KERN_ERR "%s: cgroup_scan_tasks failed\n", __func__);
+}
+
+static int write_bio_id(struct cgroup *cgrp, struct cftype *cft, s64 val)
+{
+ struct cgroup *bio_cgroup;
+ struct iothrottle *iot, *pos;
+ int id;
+
+ if (is_bind_biocgroup())
+ return -EPERM;
+
+ iot = cgroup_to_iothrottle(cgrp);
+
+ /* No more operation if it's a root cgroup */
+ if (!cgrp->parent)
+ return 0;
+ id = val;
+
+ /* De-associate from a bio-cgroup */
+ if (id < 0) {
+ if (is_bio_group(iot))
+ return 0;
+
+ clear_bit(MOVING_FORBIDDEN, &iot->flags);
+ cgroup_lock();
+ move_tasks_to_init_cgroup(cgrp, init_iothrottle.css.cgroup);
+ cgroup_unlock();
+
+ down(&bio_group_list_sem);
+ list_del_rcu(&iot->bio_node);
+ up(&bio_group_list_sem);
+
+ iot->bio_id = -1;
+ return 0;
+ }
+
+ /* Not allowed if there're tasks in the iothrottle cgroup */
+ if (cgroup_task_count(cgrp))
+ return -EPERM;
+
+ bio_cgroup = bio_id_to_cgroup(id);
+ if (!bio_cgroup)
+ return 0;
+ /*
+ * Go through the bio_group_list; if this id is not there yet, add it to
+ * the list.
+ */
+ rcu_read_lock();
+ list_for_each_entry_rcu(pos, &bio_group_list, bio_node) {
+ if (pos->bio_id == id) {
+ rcu_read_unlock();
+ return -EEXIST;
+ }
+ }
+ rcu_read_unlock();
+
+ /* Synchronize tasks with bio_cgroup */
+ cgroup_lock();
+ move_tasks_to_init_cgroup(bio_cgroup, cgrp);
+ cgroup_unlock();
+
+ down(&bio_group_list_sem);
+ list_add_rcu(&iot->bio_node, &bio_group_list);
+ up(&bio_group_list_sem);
+
+ iot->bio_id = id;
+ set_bit(MOVING_FORBIDDEN, &iot->flags);
+
+ return 0;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth-max",
+ .read_seq_string = iothrottle_read,
+ .write_string = iothrottle_write,
+ .max_write_len = 256,
+ .private = IOTHROTTLE_BANDWIDTH,
+ },
+ {
+ .name = "iops-max",
+ .read_seq_string = iothrottle_read,
+ .write_string = iothrottle_write,
+ .max_write_len = 256,
+ .private = IOTHROTTLE_IOPS,
+ },
+ {
+ .name = "throttlecnt",
+ .read_seq_string = iothrottle_read,
+ .private = IOTHROTTLE_FAILCNT,
+ },
+ {
+ .name = "stat",
+ .read_seq_string = iothrottle_read,
+ .private = IOTHROTTLE_STAT,
+ },
+ {
+ .name = "bio_id",
+ .write_s64 = write_bio_id,
+ .read_s64 = read_bio_id,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, files, ARRAY_SIZE(files));
+}
+
+static int iothrottle_can_attach(struct cgroup_subsys *ss,
+ struct cgroup *cont, struct task_struct *tsk)
+{
+ struct iothrottle *new_iot, *old_iot;
+
+ new_iot = cgroup_to_iothrottle(cont);
+ old_iot = task_to_iothrottle(tsk);
+
+ if (!is_moving_forbidden(new_iot) && !is_moving_forbidden(old_iot))
+ return 0;
+ else
+ return -EPERM;
+}
+
+static int iothrottle_subsys_depend(struct cgroup_subsys *ss,
+ unsigned long subsys_bits)
+{
+ unsigned long allow_subsys_bits;
+
+ allow_subsys_bits = 0;
+ allow_subsys_bits |= 1ul << bio_cgroup_subsys_id;
+ allow_subsys_bits |= 1ul << iothrottle_subsys_id;
+ if (subsys_bits & ~allow_subsys_bits)
+ return -1;
+ return 0;
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .can_attach = iothrottle_can_attach,
+ .subsys_depend = iothrottle_subsys_depend,
+ .subsys_id = iothrottle_subsys_id,
+ .early_init = 1,
+};
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_evaluate_sleep(struct iothrottle_sleep *sleep,
+ struct iothrottle *iot,
+ struct block_device *bdev, ssize_t bytes)
+{
+ struct iothrottle_node *n;
+ dev_t dev;
+
+ if (unlikely(!iot))
+ return;
+
+ /* accounting and throttling is done only on entire block devices */
+ dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev), bdev->bd_disk->first_minor);
+ n = iothrottle_search_node(iot, dev);
+ if (!n)
+ return;
+
+ /* Update statistics */
+ iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_BYTES_TOT, bytes);
+ if (bytes)
+ iothrottle_stat_add(&n->stat, IOTHROTTLE_STAT_IOPS_TOT, 1);
+
+ /* Evaluate sleep values */
+ sleep->bw_sleep = res_counter_ratelimit_sleep(&n->bw, bytes);
+ /*
+ * scale up the iops cost by a factor of 1000; this allows us to
+ * apply more fine-grained sleeps, and the throttling works better
+ * this way.
+ *
+ * Note: do not account any i/o operation if bytes is negative or zero.
+ */
+ sleep->iops_sleep = res_counter_ratelimit_sleep(&n->iops,
+ bytes ? 1000 : 0);
+}
+
+/*
+ * NOTE: called with rcu_read_lock() held.
+ */
+static void iothrottle_acct_stat(struct iothrottle *iot,
+ struct block_device *bdev, int type,
+ unsigned long long sleep)
+{
+ struct iothrottle_node *n;
+ dev_t dev = MKDEV(MAJOR(bdev->bd_inode->i_rdev),
+ bdev->bd_disk->first_minor);
+
+ n = iothrottle_search_node(iot, dev);
+ if (!n)
+ return;
+ iothrottle_stat_add_sleep(&n->stat, type, sleep);
+}
+
+static void iothrottle_acct_task_stat(int type, unsigned long long sleep)
+{
+ /*
+ * XXX: per-task statistics may be inaccurate (this is not a
+ * critical issue anyway, compared to introducing locking
+ * overhead or increasing the size of task_struct).
+ */
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ current->io_throttle_bw_cnt++;
+ current->io_throttle_bw_sleep += sleep;
+ break;
+
+ case IOTHROTTLE_IOPS:
+ current->io_throttle_iops_cnt++;
+ current->io_throttle_iops_sleep += sleep;
+ break;
+ }
+}
+
+static struct iothrottle *get_iothrottle_from_page(struct page *page)
+{
+ struct cgroup *cgrp;
+ struct iothrottle *iot;
+
+ if (!page)
+ return NULL;
+ cgrp = get_cgroup_from_page(page);
+ if (!cgrp)
+ return NULL;
+ iot = cgroup_to_iothrottle(cgrp);
+ if (!iot)
+ return NULL;
+ css_get(&iot->css);
+ put_cgroup_from_page(page);
+
+ return iot;
+}
+
+static struct iothrottle *get_iothrottle_from_bio(struct bio *bio)
+{
+ struct iothrottle *iot;
+ struct page *page;
+ int id;
+
+ if (!bio)
+ return NULL;
+ page = bio_iovec_idx(bio, 0)->bv_page;
+ iot = get_iothrottle_from_page(page);
+ if (iot)
+ return iot;
+ id = get_bio_cgroup_id(bio);
+ rcu_read_lock();
+ iot = get_bioid_to_iothrottle(id);
+ rcu_read_unlock();
+
+ return iot;
+}
+
+static inline int is_kthread_io(void)
+{
+ return current->flags & (PF_KTHREAD | PF_FLUSHER | PF_KSWAPD);
+}
+
+/**
+ * cgroup_io_throttle() - account and throttle synchronous i/o activity
+ * @bio: the bio structure used to retrieve the owner of the i/o
+ * operation.
+ * @bdev: block device involved for the i/o.
+ * @bytes: size in bytes of the i/o operation.
+ *
+ * This is the core of the block device i/o bandwidth controller. This function
+ * must be called by any function that generates i/o activity (directly or
+ * indirectly). It provides both i/o accounting and throttling functionality;
+ * throttling is skipped in contexts that must not sleep (kernel threads, AIO).
+ *
+ * Returns the value of sleep in jiffies if it was not possible to schedule the
+ * timeout.
+ **/
+unsigned long long
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
+{
+ struct iothrottle *iot = NULL;
+ struct iothrottle_sleep s = {};
+ unsigned long long sleep;
+ int can_sleep = 1;
+
+ if (unlikely(!bdev))
+ return 0;
+ BUG_ON(!bdev->bd_inode || !bdev->bd_disk);
+ /*
+ * Never throttle kernel threads directly, since they may completely
+ * block other cgroups, the i/o on other block devices or even the
+ * whole system.
+ *
+ * And never sleep if we're inside an AIO context; just account the i/o
+ * activity. Throttling is performed in io_submit_one() returning
+ * -EAGAIN when the limits are exceeded.
+ */
+ if (is_kthread_io() || is_in_aio())
+ can_sleep = 0;
+ /*
+ * WARNING: in_atomic() does not know about held spinlocks in
+ * non-preemptible kernels, but we want to check it here to raise
+ * potential bugs when a preemptible kernel is used.
+ */
+ WARN_ON_ONCE(can_sleep &&
+ (irqs_disabled() || in_interrupt() || in_atomic()));
+
+ /* Apply IO throttling */
+ iot = get_iothrottle_from_bio(bio);
+ rcu_read_lock();
+ if (!iot) {
+ iot = task_to_iothrottle(current);
+ css_get(&iot->css);
+ }
+ iothrottle_evaluate_sleep(&s, iot, bdev, bytes);
+ sleep = max(s.bw_sleep, s.iops_sleep);
+ if (unlikely(sleep && can_sleep)) {
+ int type = (s.bw_sleep < s.iops_sleep) ?
+ IOTHROTTLE_IOPS : IOTHROTTLE_BANDWIDTH;
+
+ iothrottle_acct_stat(iot, bdev, type, sleep);
+ css_put(&iot->css);
+ rcu_read_unlock();
+
+ pr_debug("io-throttle: task %p (%s) must sleep %llu jiffies\n",
+ current, current->comm, sleep);
+ iothrottle_acct_task_stat(type, sleep);
+ schedule_timeout_killable(sleep);
+ return 0;
+ }
+ css_put(&iot->css);
+ rcu_read_unlock();
+
+ /*
+ * Account, but do not delay filesystems' metadata IO or IO that is
+ * explicitly marked as not needing to wait or be anticipated, i.e. writes with
+ * wbc->sync_mode set to WBC_SYNC_ALL - fsync() - or journal activity.
+ */
+ if (bio && (bio_rw_meta(bio) || bio_noidle(bio)))
+ sleep = 0;
+ return sleep;
+}
diff --git a/include/linux/blk-io-throttle.h b/include/linux/blk-io-throttle.h
new file mode 100644
index 0000000..d3c6e86
--- /dev/null
+++ b/include/linux/blk-io-throttle.h
@@ -0,0 +1,110 @@
+#ifndef BLK_IO_THROTTLE_H
+#define BLK_IO_THROTTLE_H
+
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/sched.h>
+#include <linux/cgroup.h>
+#include <asm/atomic.h>
+#include <asm/current.h>
+
+#define IOTHROTTLE_BANDWIDTH 0
+#define IOTHROTTLE_IOPS 1
+#define IOTHROTTLE_FAILCNT 2
+#define IOTHROTTLE_STAT 3
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+
+extern unsigned long long
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes);
+
+extern int iothrottle_make_request(struct bio *bio, unsigned long deadline);
+
+extern int iothrottle_sync(void);
+
+static inline void set_in_aio(void)
+{
+ atomic_set(&current->in_aio, 1);
+}
+
+static inline void unset_in_aio(void)
+{
+ atomic_set(&current->in_aio, 0);
+}
+
+static inline int is_in_aio(void)
+{
+ return atomic_read(&current->in_aio);
+}
+
+static inline unsigned long long
+get_io_throttle_cnt(struct task_struct *t, int type)
+{
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ return t->io_throttle_bw_cnt;
+ case IOTHROTTLE_IOPS:
+ return t->io_throttle_iops_cnt;
+ }
+ BUG();
+}
+
+static inline unsigned long long
+get_io_throttle_sleep(struct task_struct *t, int type)
+{
+ switch (type) {
+ case IOTHROTTLE_BANDWIDTH:
+ return jiffies_to_clock_t(t->io_throttle_bw_sleep);
+ case IOTHROTTLE_IOPS:
+ return jiffies_to_clock_t(t->io_throttle_iops_sleep);
+ }
+ BUG();
+}
+#else /* CONFIG_CGROUP_IO_THROTTLE */
+
+static inline unsigned long long
+cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
+{
+ return 0;
+}
+
+static inline int
+iothrottle_make_request(struct bio *bio, unsigned long deadline)
+{
+ return 0;
+}
+
+static inline int iothrottle_sync(void)
+{
+ return 0;
+}
+
+static inline void set_in_aio(void) { }
+
+static inline void unset_in_aio(void) { }
+
+static inline int is_in_aio(void)
+{
+ return 0;
+}
+
+static inline unsigned long long
+get_io_throttle_cnt(struct task_struct *t, int type)
+{
+ return 0;
+}
+
+static inline unsigned long long
+get_io_throttle_sleep(struct task_struct *t, int type)
+{
+ return 0;
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+static inline struct block_device *as_to_bdev(struct address_space *mapping)
+{
+ return (mapping->host && mapping->host->i_sb->s_bdev) ?
+ mapping->host->i_sb->s_bdev : NULL;
+}
+
+#endif /* BLK_IO_THROTTLE_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 5df23f8..3ea63f3 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -49,6 +49,12 @@ SUBSYS(bio_cgroup)

/* */

+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
diff --git a/init/Kconfig b/init/Kconfig
index 8f7b23c..045f7c5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -617,6 +617,16 @@ config CGROUP_BIO
kind of module such as dm-ioband device mapper modules or
the cfq-scheduler.

+config CGROUP_IO_THROTTLE
+ bool "Enable cgroup I/O throttling"
+ depends on CGROUPS && CGROUP_BIO && RESOURCE_COUNTERS && EXPERIMENTAL
+ help
+ This allows limiting the maximum I/O bandwidth for specific
+ cgroup(s).
+ See Documentation/cgroups/io-throttle.txt for more information.
+
+ If unsure, say N.
+
endif # CGROUPS

config CGROUP_PAGE
--
1.5.6.3

2009-04-14 20:23:40

by Andrea Righi

[permalink] [raw]
Subject: [PATCH 6/9] kiothrottled: throttle buffered (writeback) IO

Together with cgroup_io_throttle() the kiothrottled kernel thread
represents the core of the io-throttle subsystem.

All the writeback IO requests that need to be throttled are not
dispatched immediately in submit_bio(). Instead, they are added into an
rbtree by iothrottle_make_request() and processed asynchronously by
kiothrottled.

A deadline is associated with each request, depending on the bandwidth
usage of the cgroup it belongs to. When a request is inserted into the
rbtree, kiothrottled is awakened. This thread selects all the requests
with an expired deadline and submits the batch of selected requests to
the underlying block devices using generic_make_request().
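
As a usage note, this patch also exports iothrottle_sync() (see below), which
stops kiothrottled, flushes every pending request and restarts the thread. A
hypothetical in-kernel caller could use it roughly as in the following sketch
(illustration only, not part of the patch):

#include <linux/kernel.h>
#include <linux/blk-io-throttle.h>

/*
 * Illustration only: force an immediate flush of all throttled writeback
 * requests, e.g. from an emergency or shutdown path.
 */
static void flush_throttled_writeback(void)
{
	int err = iothrottle_sync();

	if (err)
		printk(KERN_WARNING "io-throttle: flush failed (%d)\n", err);
}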

Signed-off-by: Andrea Righi <[email protected]>
---
block/Makefile | 2 +-
block/kiothrottled.c | 341 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 342 insertions(+), 1 deletions(-)
create mode 100644 block/kiothrottled.c

diff --git a/block/Makefile b/block/Makefile
index 42b6a46..5f10a45 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -13,6 +13,6 @@ obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o

-obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += blk-io-throttle.o kiothrottled.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
diff --git a/block/kiothrottled.c b/block/kiothrottled.c
new file mode 100644
index 0000000..3df22c1
--- /dev/null
+++ b/block/kiothrottled.c
@@ -0,0 +1,341 @@
+/*
+ * kiothrottled.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <[email protected]>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/kthread.h>
+#include <linux/jiffies.h>
+#include <linux/ioprio.h>
+#include <linux/rbtree.h>
+#include <linux/blkdev.h>
+
+/* io-throttle bio element */
+struct iot_bio {
+ struct rb_node node;
+ unsigned long deadline;
+ struct bio *bio;
+};
+
+/* io-throttle bio tree */
+struct iot_bio_tree {
+ /* Protect the iothrottle rbtree */
+ spinlock_t lock;
+ struct rb_root tree;
+};
+
+/*
+ * TODO: create one iothrottle rbtree per block device and many kiothrottled
+ * threads per rbtree, instead of this poorly scalable single-rbtree / single-thread
+ * solution.
+ */
+static struct iot_bio_tree *iot;
+static struct task_struct *kiothrottled_thread;
+
+/* Timer used to periodically wake-up kiothrottled */
+static struct timer_list kiothrottled_timer;
+
+/* Insert a new iot_bio element in the iot_bio_tree */
+static void iot_bio_insert(struct rb_root *root, struct iot_bio *data)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ while (*new) {
+ struct iot_bio *this = container_of(*new, struct iot_bio, node);
+ parent = *new;
+ if (data->deadline < this->deadline)
+ new = &((*new)->rb_left);
+ else
+ new = &((*new)->rb_right);
+ }
+ rb_link_node(&data->node, parent, new);
+ rb_insert_color(&data->node, root);
+}
+
+/*
+ * NOTE: no need to care about locking here, we're flushing all the pending
+ * requests, kiothrottled has been stopped and no additional request will be
+ * submitted in the tree.
+ */
+static void iot_bio_cleanup(struct rb_root *root)
+{
+ struct iot_bio *data;
+ struct rb_node *next;
+
+ next = rb_first(root);
+ while (next) {
+ data = rb_entry(next, struct iot_bio, node);
+ pr_debug("%s: dispatching element: %p (%lu)\n",
+ __func__, data->bio, data->deadline);
+ generic_make_request(data->bio);
+ next = rb_next(&data->node);
+ rb_erase(&data->node, root);
+ kfree(data);
+ }
+}
+
+/**
+ * iothrottle_make_request() - submit a delayed IO request that will be
+ * processed asynchronously by kiothrottled.
+ *
+ * @bio: the bio structure that contains the IO request's information
+ * @deadline: the request will actually be dispatched only when this deadline
+ * expires
+ *
+ * Returns 0 if the request is successfully submitted and inserted into the
+ * iot_bio_tree. Returns a negative value in case of failure.
+ **/
+int iothrottle_make_request(struct bio *bio, unsigned long deadline)
+{
+ struct iot_bio *data;
+
+ BUG_ON(!iot);
+
+ if (unlikely(!kiothrottled_thread))
+ return -ENOENT;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
+ if (unlikely(!data))
+ return -ENOMEM;
+ data->deadline = deadline;
+ data->bio = bio;
+
+ spin_lock_irq(&iot->lock);
+ iot_bio_insert(&iot->tree, data);
+ spin_unlock_irq(&iot->lock);
+
+ wake_up_process(kiothrottled_thread);
+ return 0;
+}
+EXPORT_SYMBOL(iothrottle_make_request);
+
+static void kiothrottled_timer_expired(unsigned long __unused)
+{
+ wake_up_process(kiothrottled_thread);
+}
+
+static void kiothrottled_sleep(void)
+{
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule();
+}
+
+/**
+ * kiothrottled() - throttle buffered (writeback) i/o activity
+ *
+ * Together with cgroup_io_throttle() this kernel thread represents the core of
+ * the cgroup-io-throttle subsystem.
+ *
+ * All the writeback IO requests that need to be throttled are not dispatched
+ * immediately in submit_bio(). Instead, they are added into the iot_bio_tree
+ * rbtree by iothrottle_make_request() and processed asynchronously by
+ * kiothrottled.
+ *
+ * A deadline is associated with each request, depending on the bandwidth usage
+ * of the cgroup it belongs to. When a request is inserted into the rbtree,
+ * kiothrottled is awakened. This thread selects all the requests with an
+ * expired deadline and submits the batch of selected requests to the underlying
+ * block devices using generic_make_request().
+ **/
+static int kiothrottled(void *__unused)
+{
+ /*
+ * kiothrottled is responsible for dispatching all the writeback IO
+ * requests with an expired deadline. To dispatch those requests as
+ * soon as possible and to avoid priority inversion problems set
+ * maximum IO real-time priority for this thread.
+ */
+ set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0));
+
+ while (!kthread_should_stop()) {
+ struct iot_bio *data;
+ struct rb_node *req;
+ struct rb_root staging_tree = RB_ROOT;
+ unsigned long now = jiffies;
+ long delta_t = 0;
+
+ /* Select requests to dispatch */
+ spin_lock_irq(&iot->lock);
+ req = rb_first(&iot->tree);
+ while (req) {
+ data = rb_entry(req, struct iot_bio, node);
+ delta_t = (long)data->deadline - (long)now;
+ if (delta_t > 0)
+ break;
+ req = rb_next(&data->node);
+ rb_erase(&data->node, &iot->tree);
+ iot_bio_insert(&staging_tree, data);
+ }
+ spin_unlock_irq(&iot->lock);
+
+ /* Dispatch requests */
+ req = rb_first(&staging_tree);
+ while (req) {
+ data = rb_entry(req, struct iot_bio, node);
+ req = rb_next(&data->node);
+ rb_erase(&data->node, &staging_tree);
+ pr_debug("%s: dispatching request: %p (%lu)\n",
+ __func__, data->bio, data->deadline);
+ generic_make_request(data->bio);
+ kfree(data);
+ }
+
+ /* Wait for new requests ready to be dispatched */
+ if (delta_t > 0)
+ mod_timer(&kiothrottled_timer, jiffies + HZ);
+ kiothrottled_sleep();
+ }
+ return 0;
+}
+
+/* TODO: handle concurrent startup and shutdown */
+static void kiothrottle_shutdown(void)
+{
+ if (!kiothrottled_thread)
+ return;
+ del_timer(&kiothrottled_timer);
+ printk(KERN_INFO "%s: stopping kiothrottled\n", __func__);
+ kthread_stop(kiothrottled_thread);
+ printk(KERN_INFO "%s: flushing pending requests\n", __func__);
+ spin_lock_irq(&iot->lock);
+ kiothrottled_thread = NULL;
+ spin_unlock_irq(&iot->lock);
+ iot_bio_cleanup(&iot->tree);
+}
+
+static int kiothrottle_startup(void)
+{
+ init_timer(&kiothrottled_timer);
+ kiothrottled_timer.function = kiothrottled_timer_expired;
+
+ printk(KERN_INFO "%s: starting kiothrottled\n", __func__);
+ kiothrottled_thread = kthread_run(kiothrottled, NULL, "kiothrottled");
+ if (IS_ERR(kiothrottled_thread))
+ return -PTR_ERR(kiothrottled_thread);
+ return 0;
+}
+
+/*
+ * NOTE: provide this interface only for emergency situations, when we need to
+ * force the immediate flush of pending (writeback) IO throttled requests.
+ */
+int iothrottle_sync(void)
+{
+ kiothrottle_shutdown();
+ return kiothrottle_startup();
+}
+EXPORT_SYMBOL(iothrottle_sync);
+
+/*
+ * Writing in /proc/kiothrottled_debug enforces an immediate flush of throttled
+ * IO requests.
+ */
+static ssize_t kiothrottle_write(struct file *filp, const char __user *buffer,
+ size_t count, loff_t *data)
+{
+ int ret;
+
+ ret = iothrottle_sync();
+ if (ret)
+ return ret;
+ return count;
+}
+
+/*
+ * Export to userspace the list of pending IO throttled requests.
+ * TODO: this is useful only for debugging; maybe we should make this
+ * interface optional, depending on a suitable compile-time config option.
+ */
+static int kiothrottle_show(struct seq_file *m, void *v)
+{
+ struct iot_bio *data;
+ struct rb_node *next;
+ unsigned long now = jiffies;
+ long delta_t;
+
+ spin_lock_irq(&iot->lock);
+ next = rb_first(&iot->tree);
+ while (next) {
+ data = rb_entry(next, struct iot_bio, node);
+ delta_t = (long)data->deadline - (long)now;
+ seq_printf(m, "%p %lu %lu %li\n", data->bio,
+ data->deadline, now, delta_t);
+ next = rb_next(&data->node);
+ }
+ spin_unlock_irq(&iot->lock);
+
+ return 0;
+}
+
+static int kiothrottle_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, kiothrottle_show, NULL);
+}
+
+static const struct file_operations kiothrottle_ops = {
+ .open = kiothrottle_open,
+ .read = seq_read,
+ .write = kiothrottle_write,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+int __init kiothrottled_init(void)
+{
+ struct proc_dir_entry *pe;
+ int ret;
+
+ iot = kzalloc(sizeof(*iot), GFP_KERNEL);
+ if (unlikely(!iot))
+ return -ENOMEM;
+ spin_lock_init(&iot->lock);
+ iot->tree = RB_ROOT;
+
+ pe = create_proc_entry("kiothrottled_debug", 0644, NULL);
+ if (!pe) {
+ kfree(iot);
+ return -ENOMEM;
+ }
+ pe->proc_fops = &kiothrottle_ops;
+
+ ret = kiothrottle_startup();
+ if (ret) {
+ remove_proc_entry("kiothrottled_debug", NULL);
+ kfree(iot);
+ return ret;
+ }
+ printk(KERN_INFO "%s: initialized\n", __func__);
+ return 0;
+}
+
+void __exit kiothrottled_exit(void)
+{
+ kiothrottle_shutdown();
+ remove_proc_entry("kiothrottled_debug", NULL);
+ kfree(iot);
+ printk(KERN_INFO "%s: unloaded\n", __func__);
+}
+
+module_init(kiothrottled_init);
+module_exit(kiothrottled_exit);
+MODULE_LICENSE("GPL");
--
1.5.6.3

2009-04-14 20:24:01

by Andrea Righi

[permalink] [raw]
Subject: [PATCH 7/9] io-throttle instrumentation

Hook the io-throttle controller into the appropriate kernel functions.

Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-core.c | 8 ++++++++
fs/aio.c | 12 ++++++++++++
include/linux/sched.h | 7 +++++++
kernel/fork.c | 7 +++++++
mm/readahead.c | 3 +++
5 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 07ab754..4d7f9f6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,6 +26,7 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/blktrace_api.h>
#include <linux/fault-inject.h>
#include <trace/block.h>
@@ -1547,11 +1548,16 @@ void submit_bio(int rw, struct bio *bio)
* go through the normal accounting stuff before submission.
*/
if (bio_has_data(bio)) {
+ unsigned long sleep = 0;
+
if (rw & WRITE) {
count_vm_events(PGPGOUT, count);
+ sleep = cgroup_io_throttle(bio,
+ bio->bi_bdev, bio->bi_size);
} else {
task_io_account_read(bio->bi_size);
count_vm_events(PGPGIN, count);
+ cgroup_io_throttle(NULL, bio->bi_bdev, bio->bi_size);
}

if (unlikely(block_dump)) {
@@ -1562,6 +1568,8 @@ void submit_bio(int rw, struct bio *bio)
(unsigned long long)bio->bi_sector,
bdevname(bio->bi_bdev, b));
}
+ if (sleep && !iothrottle_make_request(bio, jiffies + sleep))
+ return;
}

generic_make_request(bio);
diff --git a/fs/aio.c b/fs/aio.c
index 76da125..ab6c457 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -22,6 +22,7 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/blk-io-throttle.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/slab.h>
@@ -1587,6 +1588,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
{
struct kiocb *req;
struct file *file;
+ struct block_device *bdev;
ssize_t ret;

/* enforce forwards compatibility on users */
@@ -1609,6 +1611,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
if (unlikely(!file))
return -EBADF;

+ /* check if we're exceeding the IO throttling limits */
+ bdev = as_to_bdev(file->f_mapping);
+ ret = cgroup_io_throttle(NULL, bdev, 0);
+ if (unlikely(ret)) {
+ fput(file);
+ return -EAGAIN;
+ }
+
req = aio_get_req(ctx); /* returns with 2 references to req */
if (unlikely(!req)) {
fput(file);
@@ -1652,12 +1662,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
goto out_put_req;

spin_lock_irq(&ctx->ctx_lock);
+ set_in_aio();
aio_run_iocb(req);
if (!list_empty(&ctx->run_list)) {
/* drain the run list */
while (__aio_run_iocbs(ctx))
;
}
+ unset_in_aio();
spin_unlock_irq(&ctx->ctx_lock);
aio_put_req(req); /* drop extra ref to req */
return 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..e0cd710 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1356,6 +1356,13 @@ struct task_struct {
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
struct task_io_accounting ioac;
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_t in_aio;
+ unsigned long long io_throttle_bw_cnt;
+ unsigned long long io_throttle_bw_sleep;
+ unsigned long long io_throttle_iops_cnt;
+ unsigned long long io_throttle_iops_sleep;
+#endif
#if defined(CONFIG_TASK_XACCT)
u64 acct_rss_mem1; /* accumulated rss usage */
u64 acct_vm_mem1; /* accumulated virtual memory usage */
diff --git a/kernel/fork.c b/kernel/fork.c
index b9e2edd..272c461 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1043,6 +1043,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
task_io_accounting_init(&p->ioac);
acct_clear_integrals(p);

+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ atomic_set(&p->in_aio, 0);
+ p->io_throttle_bw_cnt = 0;
+ p->io_throttle_bw_sleep = 0;
+ p->io_throttle_iops_cnt = 0;
+ p->io_throttle_iops_sleep = 0;
+#endif
posix_cpu_timers_init(p);

p->lock_depth = -1; /* -1 = no lock */
diff --git a/mm/readahead.c b/mm/readahead.c
index 133b6d5..25cae4c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>

@@ -81,6 +82,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
+ struct block_device *bdev = as_to_bdev(mapping);
int ret = 0;

while (!list_empty(pages)) {
@@ -99,6 +101,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
+ cgroup_io_throttle(NULL, bdev, PAGE_CACHE_SIZE);
}
return ret;
}
--
1.5.6.3

2009-04-14 20:23:22

by Andrea Righi

[permalink] [raw]
Subject: [PATCH 4/9] support checking of cgroup subsystem dependencies

From: Li Zefan <[email protected]>

This allows a subsystem to require that it be mounted only when certain other
subsystems are also present in (or absent from) the proposed hierarchy.
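
As an illustration (not part of this patch), a subsystem that may only be
co-mounted with bio-cgroup could implement the new hook roughly as below;
"example_subsys_id" is a made-up identifier, and the real in-tree user is the
io-throttle controller in PATCH 5/9:

#include <linux/cgroup.h>

/* Illustration only: reject the mount if any subsystem other than ourselves
 * and bio-cgroup is present in the proposed hierarchy. */
static int example_subsys_depend(struct cgroup_subsys *ss,
				 unsigned long subsys_bits)
{
	unsigned long allowed = (1ul << example_subsys_id) |
				(1ul << bio_cgroup_subsys_id);

	if (subsys_bits & ~allowed)
		return -EINVAL;	/* the cgroup filesystem mount will fail */
	return 0;
}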

Signed-off-by: Li Zefan <[email protected]>
---
Documentation/cgroups/cgroups.txt | 5 +++++
include/linux/cgroup.h | 2 ++
kernel/cgroup.c | 19 ++++++++++++++++++-
3 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 6eb1a97..6938025 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -552,6 +552,11 @@ and root cgroup. Currently this will only involve movement between
the default hierarchy (which never has sub-cgroups) and a hierarchy
that is being created/destroyed (and hence has no sub-cgroups).

+int subsys_depend(struct cgroup_subsys *ss, unsigned long subsys_bits)
+Called when a cgroup subsystem wants to check if some other subsystems
+are also present in the proposed hierarchy. If this method returns an error,
+the mount of the cgroup filesystem will fail.
+
4. Questions
============

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 665fa70..37ace23 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -385,6 +385,8 @@ struct cgroup_subsys {
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
+ int (*subsys_depend)(struct cgroup_subsys *ss,
+ unsigned long subsys_bits);

int subsys_id;
int active;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 382109b..fad3f08 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -830,6 +830,23 @@ static int cgroup_show_options(struct seq_file *seq, struct vfsmount *vfs)
return 0;
}

+static int check_subsys_dependency(unsigned long subsys_bits)
+{
+ int i;
+ int ret;
+ struct cgroup_subsys *ss;
+
+ for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ ss = subsys[i];
+ if (test_bit(i, &subsys_bits) && ss->subsys_depend) {
+ ret = ss->subsys_depend(ss, subsys_bits);
+ if (ret)
+ return ret;
+ }
+ }
+ return 0;
+}
+
struct cgroup_sb_opts {
unsigned long subsys_bits;
unsigned long flags;
@@ -890,7 +907,7 @@ static int parse_cgroupfs_options(char *data,
if (!opts->subsys_bits)
return -EINVAL;

- return 0;
+ return check_subsys_dependency(opts->subsys_bits);
}

static int cgroup_remount(struct super_block *sb, int *flags, char *data)
--
1.5.6.3

2009-04-14 20:24:24

by Andrea Righi

[permalink] [raw]
Subject: [PATCH 8/9] export per-task io-throttle statistics to userspace

Export the throttling statistics collected for each task through
/proc/PID/io-throttle-stat.

Example:
$ cat /proc/$$/io-throttle-stat
0 0 0 0
^ ^ ^ ^
\ \ \ \_____global iops sleep (in clock ticks)
\ \ \______global iops counter
\ \_______global bandwidth sleep (in clock ticks)
\________global bandwidth counter
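
A trivial user-space reader for this file could look like the following sketch
(illustration only; it just parses the four counters in the order shown above):

#include <stdio.h>

int main(void)
{
	unsigned long long bw_cnt, bw_sleep, iops_cnt, iops_sleep;
	FILE *f = fopen("/proc/self/io-throttle-stat", "r");

	if (!f)
		return 1;
	if (fscanf(f, "%llu %llu %llu %llu",
		   &bw_cnt, &bw_sleep, &iops_cnt, &iops_sleep) == 4)
		printf("bw throttled %llu times (%llu ticks), "
		       "iops throttled %llu times (%llu ticks)\n",
		       bw_cnt, bw_sleep, iops_cnt, iops_sleep);
	fclose(f);
	return 0;
}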

Signed-off-by: Andrea Righi <[email protected]>
---
fs/proc/base.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f715597..c07ee00 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -54,6 +54,7 @@
#include <linux/proc_fs.h>
#include <linux/stat.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/blk-io-throttle.h>
#include <linux/init.h>
#include <linux/capability.h>
#include <linux/file.h>
@@ -2453,6 +2454,17 @@ static int proc_tgid_io_accounting(struct task_struct *task, char *buffer)
}
#endif /* CONFIG_TASK_IO_ACCOUNTING */

+#ifdef CONFIG_CGROUP_IO_THROTTLE
+static int proc_iothrottle_stat(struct task_struct *task, char *buffer)
+{
+ return sprintf(buffer, "%llu %llu %llu %llu\n",
+ get_io_throttle_cnt(task, IOTHROTTLE_BANDWIDTH),
+ get_io_throttle_sleep(task, IOTHROTTLE_BANDWIDTH),
+ get_io_throttle_cnt(task, IOTHROTTLE_IOPS),
+ get_io_throttle_sleep(task, IOTHROTTLE_IOPS));
+}
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -2539,6 +2551,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, proc_tgid_io_accounting),
#endif
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ INF("io-throttle-stat", S_IRUGO, proc_iothrottle_stat),
+#endif
};

static int proc_tgid_base_readdir(struct file * filp,
@@ -2874,6 +2889,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, proc_tid_io_accounting),
#endif
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+ INF("io-throttle-stat", S_IRUGO, proc_iothrottle_stat),
+#endif
};

static int proc_tid_base_readdir(struct file * filp,
--
1.5.6.3

2009-04-14 20:24:43

by Andrea Righi

[permalink] [raw]
Subject: [PATCH 9/9] ext3: do not throttle metadata and journal IO

Delaying journal IO can unnecessarily delay other independent IO
operations from different cgroups.

Add the BIO_RW_META flag to the ext3 journal IO: this informs the io-throttle
subsystem to account the journal IO but not delay it, avoiding potential
priority inversion problems.
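
For reference, on the io-throttle side the flag simply makes cgroup_io_throttle()
zero the sleep for metadata bios (see PATCH 5/9). Any other journalling code
could opt out of delays the same way; a rough sketch follows (illustration only,
submit_journal_bh() is a made-up helper, not part of this patch):

#include <linux/fs.h>
#include <linux/bio.h>
#include <linux/buffer_head.h>

/* Illustration only: submit a journal buffer tagged as metadata, so it is
 * accounted by io-throttle but never delayed. */
static void submit_journal_bh(struct buffer_head *bh)
{
	submit_bh(WRITE | (1 << BIO_RW_META), bh);
}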

Signed-off-by: Andrea Righi <[email protected]>
---
fs/jbd/commit.c | 4 ++--
fs/jbd2/commit.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index a8e8513..2e444af 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -318,7 +318,7 @@ void journal_commit_transaction(journal_t *journal)
int first_tag = 0;
int tag_flag;
int i;
- int write_op = WRITE;
+ int write_op = WRITE | (1 << BIO_RW_META);

/*
* First job: lock down the current transaction and wait for
@@ -357,7 +357,7 @@ void journal_commit_transaction(journal_t *journal)
* instead we rely on sync_buffer() doing the unplug for us.
*/
if (commit_transaction->t_synchronous_commit)
- write_op = WRITE_SYNC_PLUG;
+ write_op = WRITE_SYNC_PLUG | (1 << BIO_RW_META);
spin_lock(&commit_transaction->t_handle_lock);
while (commit_transaction->t_updates) {
DEFINE_WAIT(wait);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 073c8c3..61484d0 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -367,7 +367,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
int tag_bytes = journal_tag_bytes(journal);
struct buffer_head *cbh = NULL; /* For transactional checksums */
__u32 crc32_sum = ~0;
- int write_op = WRITE;
+ int write_op = WRITE | (1 << BIO_RW_META);

/*
* First job: lock down the current transaction and wait for
@@ -408,7 +408,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
* instead we rely on sync_buffer() doing the unplug for us.
*/
if (commit_transaction->t_synchronous_commit)
- write_op = WRITE_SYNC_PLUG;
+ write_op = WRITE_SYNC_PLUG | (1 << BIO_RW_META);
stats.u.run.rs_wait = commit_transaction->t_max_wait;
stats.u.run.rs_locked = jiffies;
stats.u.run.rs_running = jbd2_time_diff(commit_transaction->t_start,
--
1.5.6.3

2009-04-15 02:17:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Tue, 14 Apr 2009 22:21:14 +0200
Andrea Righi <[email protected]> wrote:

> From: Ryo Tsuruta <[email protected]>
>
> With writeback IO processed asynchronously by kernel threads (pdflush),
> the real writes to the underlying block devices can occur in a different
> IO context with respect to the task that originally generated the dirty
> pages involved in the IO operation.
>
> The bio-cgroup controller is used by io-throttle to track writeback IO
> and to properly apply throttling.
>
> Also apply a patch by Gui Jianfeng to announce tasks moving in
> bio-cgroup groups.
>
> See also: http://people.valinux.co.jp/~ryov/bio-cgroup
>
> Signed-off-by: Gui Jianfeng <[email protected]>
> Signed-off-by: Ryo Tsuruta <[email protected]>
> Signed-off-by: Hirokazu Takahashi <[email protected]>
> ---
> block/blk-ioc.c | 30 ++--
> fs/buffer.c | 2 +
> fs/direct-io.c | 2 +
> include/linux/biotrack.h | 95 +++++++++++
> include/linux/cgroup_subsys.h | 6 +
> include/linux/iocontext.h | 1 +
> include/linux/memcontrol.h | 6 +
> include/linux/mmzone.h | 4 +-
> include/linux/page_cgroup.h | 13 ++-
> init/Kconfig | 15 ++
> mm/Makefile | 4 +-
> mm/biotrack.c | 349 +++++++++++++++++++++++++++++++++++++++++
> mm/bounce.c | 2 +
> mm/filemap.c | 2 +
> mm/memcontrol.c | 5 +
> mm/memory.c | 5 +
> mm/page-writeback.c | 2 +
> mm/page_cgroup.c | 17 ++-
> mm/swap_state.c | 2 +
> 19 files changed, 536 insertions(+), 26 deletions(-)
> create mode 100644 include/linux/biotrack.h
> create mode 100644 mm/biotrack.c
>
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 012f065..ef8cac0 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -84,24 +84,28 @@ void exit_io_context(void)
> }
> }
>
> +void init_io_context(struct io_context *ioc)
> +{
> + atomic_set(&ioc->refcount, 1);
> + atomic_set(&ioc->nr_tasks, 1);
> + spin_lock_init(&ioc->lock);
> + ioc->ioprio_changed = 0;
> + ioc->ioprio = 0;
> + ioc->last_waited = jiffies; /* doesn't matter... */
> + ioc->nr_batch_requests = 0; /* because this is 0 */
> + ioc->aic = NULL;
> + INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
> + INIT_HLIST_HEAD(&ioc->cic_list);
> + ioc->ioc_data = NULL;
> +}
> +
> struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
> {
> struct io_context *ret;
>
> ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
> - if (ret) {
> - atomic_set(&ret->refcount, 1);
> - atomic_set(&ret->nr_tasks, 1);
> - spin_lock_init(&ret->lock);
> - ret->ioprio_changed = 0;
> - ret->ioprio = 0;
> - ret->last_waited = jiffies; /* doesn't matter... */
> - ret->nr_batch_requests = 0; /* because this is 0 */
> - ret->aic = NULL;
> - INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
> - INIT_HLIST_HEAD(&ret->cic_list);
> - ret->ioc_data = NULL;
> - }
> + if (ret)
> + init_io_context(ret);
>
> return ret;
> }
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 13edf7a..bc72150 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -36,6 +36,7 @@
> #include <linux/buffer_head.h>
> #include <linux/task_io_accounting_ops.h>
> #include <linux/bio.h>
> +#include <linux/biotrack.h>
> #include <linux/notifier.h>
> #include <linux/cpu.h>
> #include <linux/bitops.h>
> @@ -655,6 +656,7 @@ static void __set_page_dirty(struct page *page,
> if (page->mapping) { /* Race with truncate? */
> WARN_ON_ONCE(warn && !PageUptodate(page));
> account_page_dirtied(page, mapping);
> + bio_cgroup_reset_owner_pagedirty(page, current->mm);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> }
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index da258e7..ec42362 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -33,6 +33,7 @@
> #include <linux/err.h>
> #include <linux/blkdev.h>
> #include <linux/buffer_head.h>
> +#include <linux/biotrack.h>
> #include <linux/rwsem.h>
> #include <linux/uio.h>
> #include <asm/atomic.h>
> @@ -799,6 +800,7 @@ static int do_direct_IO(struct dio *dio)
> ret = PTR_ERR(page);
> goto out;
> }
> + bio_cgroup_reset_owner(page, current->mm);
>
> while (block_in_page < blocks_per_page) {
> unsigned offset_in_page = block_in_page << blkbits;
> diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
> new file mode 100644
> index 0000000..25b8810
> --- /dev/null
> +++ b/include/linux/biotrack.h
> @@ -0,0 +1,95 @@
> +#include <linux/cgroup.h>
> +#include <linux/mm.h>
> +#include <linux/page_cgroup.h>
> +
> +#ifndef _LINUX_BIOTRACK_H
> +#define _LINUX_BIOTRACK_H
> +
> +#ifdef CONFIG_CGROUP_BIO
> +
> +struct tsk_move_msg {
> + int old_id;
> + int new_id;
> + struct task_struct *tsk;
> +};
> +
> +extern int register_biocgroup_notifier(struct notifier_block *nb);
> +extern int unregister_biocgroup_notifier(struct notifier_block *nb);
> +
> +struct io_context;
> +struct block_device;
> +
> +struct bio_cgroup {
> + struct cgroup_subsys_state css;
> + int id;
> + struct io_context *io_context; /* default io_context */
> +/* struct radix_tree_root io_context_root; per device io_context */
> +};
> +
> +static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
> +{
> + pc->bio_cgroup_id = 0;
> +}
> +
> +extern struct cgroup *get_cgroup_from_page(struct page *page);
> +extern void put_cgroup_from_page(struct page *page);
> +extern struct cgroup *bio_id_to_cgroup(int id);
> +
> +static inline int bio_cgroup_disabled(void)
> +{
> + return bio_cgroup_subsys.disabled;
> +}
> +
> +extern void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
> +extern void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
> +extern void bio_cgroup_reset_owner_pagedirty(struct page *page,
> + struct mm_struct *mm);
> +extern void bio_cgroup_copy_owner(struct page *page, struct page *opage);
> +
> +extern struct io_context *get_bio_cgroup_iocontext(struct bio *bio);
> +extern int get_bio_cgroup_id(struct bio *bio);
> +
> +#else /* CONFIG_CGROUP_BIO */
> +
> +struct bio_cgroup;
> +
> +static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
> +{
> +}
> +
> +static inline int bio_cgroup_disabled(void)
> +{
> + return 1;
> +}
> +
> +static inline void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> +{
> +}
> +
> +static inline void bio_cgroup_reset_owner(struct page *page,
> + struct mm_struct *mm)
> +{
> +}
> +
> +static inline void bio_cgroup_reset_owner_pagedirty(struct page *page,
> + struct mm_struct *mm)
> +{
> +}
> +
> +static inline void bio_cgroup_copy_owner(struct page *page, struct page *opage)
> +{
> +}
> +
> +static inline struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
> +{
> + return NULL;
> +}
> +
> +static inline int get_bio_cgroup_id(struct bio *bio)
> +{
> + return 0;
> +}
> +
> +#endif /* CONFIG_CGROUP_BIO */
> +
> +#endif /* _LINUX_BIOTRACK_H */
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 9c8d31b..5df23f8 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
>
> /* */
>
> +#ifdef CONFIG_CGROUP_BIO
> +SUBSYS(bio_cgroup)
> +#endif
> +
> +/* */
> +
> #ifdef CONFIG_CGROUP_DEVICE
> SUBSYS(devices)
> #endif
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index 08b987b..be37c27 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -104,6 +104,7 @@ int put_io_context(struct io_context *ioc);
> void exit_io_context(void);
> struct io_context *get_io_context(gfp_t gfp_flags, int node);
> struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
> +void init_io_context(struct io_context *ioc);
> void copy_io_context(struct io_context **pdst, struct io_context **psrc);
> #else
> static inline void exit_io_context(void)
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..f3e0e64 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -37,6 +37,8 @@ struct mm_struct;
> * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> */
>
> +extern void __init_mem_page_cgroup(struct page_cgroup *pc);
> +
> extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> /* for swap handling */
> @@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> +static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
> +{
> +}
> +
> static inline int mem_cgroup_newpage_charge(struct page *page,
> struct mm_struct *mm, gfp_t gfp_mask)
> {
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 186ec6a..47a6f55 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -607,7 +607,7 @@ typedef struct pglist_data {
> int nr_zones;
> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> struct page *node_mem_map;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> struct page_cgroup *node_page_cgroup;
> #endif
> #endif
> @@ -958,7 +958,7 @@ struct mem_section {
>
> /* See declaration of similar field in struct zone */
> unsigned long *pageblock_flags;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> /*
> * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
> * section. (see memcontrol.h/page_cgroup.h about this.)
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 7339c7b..a7249bb 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -1,7 +1,7 @@
> #ifndef __LINUX_PAGE_CGROUP_H
> #define __LINUX_PAGE_CGROUP_H
>
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> #include <linux/bit_spinlock.h>
> /*
> * Page Cgroup can be considered as an extended mem_map.
> @@ -12,9 +12,16 @@
> */
> struct page_cgroup {
> unsigned long flags;
> - struct mem_cgroup *mem_cgroup;
> struct page *page;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + struct mem_cgroup *mem_cgroup;
> +#endif
> +#ifdef CONFIG_CGROUP_BIO
> + int bio_cgroup_id;
> +#endif
> +#if defined(CONFIG_CGROUP_MEM_RES_CTLR) || defined(CONFIG_CGROUP_BIO)
> struct list_head lru; /* per cgroup LRU list */
> +#endif
> };
>
This #if is unnecessary..

And, now, CSS_ID is supported. I think it can be used and your own id is not
necessary. (plz see swap accounting in memcontrol.c/page_cgroup.c if unsure.)

And... I don't like to increase the size of struct page_cgroup.
Could you find a way to encode bio_cgroup_id into "flags" ?
unsigned long gives us more bits than the flags need now.
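
For illustration, a sketch of the CSS_ID approach (not a posted patch; it
assumes .use_id is set to 1 in bio_cgroup_subsys so that the generic CSS ID
machinery is enabled, and the helper names are made up):

#include <linux/cgroup.h>
#include <linux/biotrack.h>

/* Use the generic CSS ID instead of a private idr-based id. */
static inline int bio_cgroup_css_id(struct bio_cgroup *biog)
{
	return css_id(&biog->css);
}

/* Look up a bio_cgroup from its CSS ID; call under rcu_read_lock(). */
static inline struct bio_cgroup *bio_cgroup_lookup(int id)
{
	struct cgroup_subsys_state *css;

	css = css_lookup(&bio_cgroup_subsys, id);
	return css ? container_of(css, struct bio_cgroup, css) : NULL;
}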



> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> @@ -71,7 +78,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
> bit_spin_unlock(PCG_LOCK, &pc->flags);
> }
>
> -#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +#else /* CONFIG_CGROUP_PAGE */
> struct page_cgroup;
>
> static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> diff --git a/init/Kconfig b/init/Kconfig
> index 7be4d38..8f7b23c 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -606,8 +606,23 @@ config CGROUP_MEM_RES_CTLR_SWAP
> Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
> size is 4096bytes, 512k per 1Gbytes of swap.
>
> +config CGROUP_BIO
> + bool "Block I/O cgroup subsystem"
> + depends on CGROUPS && BLOCK
> + select MM_OWNER
> + help
> + Provides a Resource Controller which enables tracking of the owner
> + of every Block I/O request.
> + The information this subsystem provides can be used from any
> + kind of module such as dm-ioband device mapper modules or
> + the cfq-scheduler.
> +
> endif # CGROUPS
>
> +config CGROUP_PAGE
> + def_bool y
> + depends on CGROUP_MEM_RES_CTLR || CGROUP_BIO
> +
> config MM_OWNER
> bool
>
> diff --git a/mm/Makefile b/mm/Makefile
> index ec73c68..a78a437 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -37,4 +37,6 @@ else
> obj-$(CONFIG_SMP) += allocpercpu.o
> endif
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
> +obj-$(CONFIG_CGROUP_BIO) += biotrack.o
> diff --git a/mm/biotrack.c b/mm/biotrack.c
> new file mode 100644
> index 0000000..d3a35f1
> --- /dev/null
> +++ b/mm/biotrack.c
> @@ -0,0 +1,349 @@
> +/* biotrack.c - Block I/O Tracking
> + *
> + * Copyright (C) VA Linux Systems Japan, 2008
> + * Developed by Hirokazu Takahashi <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/smp.h>
> +#include <linux/bit_spinlock.h>
> +#include <linux/idr.h>
> +#include <linux/blkdev.h>
> +#include <linux/biotrack.h>
> +
> +#define MOVETASK 0
> +static BLOCKING_NOTIFIER_HEAD(biocgroup_chain);
> +
> +int register_biocgroup_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_register(&biocgroup_chain, nb);
> +}
> +EXPORT_SYMBOL(register_biocgroup_notifier);
> +
> +int unregister_biocgroup_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_unregister(&biocgroup_chain, nb);
> +}
> +EXPORT_SYMBOL(unregister_biocgroup_notifier);
> +
> +/*
> + * The block I/O tracking mechanism is implemented on top of the cgroup
> + * memory controller framework. It helps to find the owner of an I/O
> + * request, because every I/O request has a target page and the owner of
> + * the page can easily be determined within that framework.
> + */
> +
> +/* Return the bio_cgroup that associates with a cgroup. */
> +static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp)
> +{
> + return container_of(cgroup_subsys_state(cgrp, bio_cgroup_subsys_id),
> + struct bio_cgroup, css);
> +}
> +
> +/* Return the bio_cgroup that associates with a process. */
> +static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
> +{
> + return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
> + struct bio_cgroup, css);
> +}
> +
> +static struct idr bio_cgroup_id;
> +static DEFINE_SPINLOCK(bio_cgroup_idr_lock);
> +static struct io_context default_bio_io_context;
> +static struct bio_cgroup default_bio_cgroup = {
> + .id = 0,
> + .io_context = &default_bio_io_context,
> +};
> +
> +/*
> + * This function is used to make a given page have the bio-cgroup id of
> + * the owner of this page.
> + */
> +void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> +{
> + struct bio_cgroup *biog;
> + struct page_cgroup *pc;
> +
> + if (bio_cgroup_disabled())
> + return;
> + pc = lookup_page_cgroup(page);
> + if (unlikely(!pc))
> + return;
> +
> + pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
> + if (!mm)
> + return;
> + /*
> + * Locking "pc" isn't necessary here since the current process is
> + * the only one that can access the members related to bio_cgroup.
> + */
> + rcu_read_lock();
> + biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
> + if (unlikely(!biog))
> + goto out;
> + /*
> + * css_get(&bio->css) isn't called to increment the reference
> + * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
> + * invalid even if this page is still active.
> + * This approach is chosen to minimize the overhead.
> + */
> + pc->bio_cgroup_id = biog->id;
> +out:
> + rcu_read_unlock();
> +}
> +
> +/*
> + * Change the owner of a given page if necessary.
> + */
> +void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
> +{
> + /*
> + * A little trick:
> + * Just call bio_cgroup_set_owner() for pages which are already
> + * active since the bio_cgroup_id member of page_cgroup can be
> + * updated without any locks. This is because an integer type of
> + * variable can be set a new value at once on modern cpus.
> + */
> + bio_cgroup_set_owner(page, mm);
> +}
Hmm ? I think all operations are under lock_page() and there are no races.
Isn't it ?


> +
> +/*
> + * Change the owner of a given page. This function is only effective for
> + * pages in the pagecache.
> + */
> +void bio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
> +{
> + if (PageSwapCache(page) || PageAnon(page))
> + return;
> + if (current->flags & PF_MEMALLOC)
> + return;
> +
> + bio_cgroup_reset_owner(page, mm);
> +}
> +
> +/*
> + * Assign "page" the same owner as "opage."
> + */
> +void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
> +{
> + struct page_cgroup *npc, *opc;
> +
> + if (bio_cgroup_disabled())
> + return;
> + npc = lookup_page_cgroup(npage);
> + if (unlikely(!npc))
> + return;
> + opc = lookup_page_cgroup(opage);
> + if (unlikely(!opc))
> + return;
> +
> + /*
> + * Do this without any locks. The reason is the same as
> + * bio_cgroup_reset_owner().
> + */
> + npc->bio_cgroup_id = opc->bio_cgroup_id;
> +}
> +
> +/* Create a new bio-cgroup. */
> +static struct cgroup_subsys_state *
> +bio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + struct bio_cgroup *biog;
> + struct io_context *ioc;
> + int ret;
> +
> + if (!cgrp->parent) {
> + biog = &default_bio_cgroup;
> + init_io_context(biog->io_context);
> + /* Increment the reference count so that it is never released. */
> + atomic_inc(&biog->io_context->refcount);
> + idr_init(&bio_cgroup_id);
> + return &biog->css;
> + }
> +
> + biog = kzalloc(sizeof(*biog), GFP_KERNEL);
> + ioc = alloc_io_context(GFP_KERNEL, -1);
> + if (!ioc || !biog) {
> + ret = -ENOMEM;
> + goto out_err;
> + }
> + biog->io_context = ioc;
> +retry:
> + if (!idr_pre_get(&bio_cgroup_id, GFP_KERNEL)) {
> + ret = -EAGAIN;
> + goto out_err;
> + }
> + spin_lock_irq(&bio_cgroup_idr_lock);
> + ret = idr_get_new_above(&bio_cgroup_id, (void *)biog, 1, &biog->id);
> + spin_unlock_irq(&bio_cgroup_idr_lock);
> + if (ret == -EAGAIN)
> + goto retry;
> + else if (ret)
> + goto out_err;
> +
> + return &biog->css;
> +out_err:
> + kfree(biog);
> + if (ioc)
> + put_io_context(ioc);
> + return ERR_PTR(ret);
> +}
> +
> +/* Delete the bio-cgroup. */
> +static void bio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + struct bio_cgroup *biog = cgroup_bio(cgrp);
> +
> + put_io_context(biog->io_context);
> +
> + spin_lock_irq(&bio_cgroup_idr_lock);
> + idr_remove(&bio_cgroup_id, biog->id);
> + spin_unlock_irq(&bio_cgroup_idr_lock);
> +
> + kfree(biog);
> +}
> +
> +static struct bio_cgroup *find_bio_cgroup(int id)
> +{
> + struct bio_cgroup *biog;
> + spin_lock_irq(&bio_cgroup_idr_lock);
> + /*
> + * It might fail to find a bio-cgroup associated with "id", since a
> + * bio-cgroup is allowed to be removed even when some of the I/O
> + * requests this group issued haven't completed yet.
> + */
> + biog = (struct bio_cgroup *)idr_find(&bio_cgroup_id, id);
> + spin_unlock_irq(&bio_cgroup_idr_lock);
> + return biog;
> +}
> +
> +struct cgroup *bio_id_to_cgroup(int id)
> +{
> + struct bio_cgroup *biog;
> +
> + biog = find_bio_cgroup(id);
> + if (biog)
> + return biog->css.cgroup;
> +
> + return NULL;
> +}
> +
> +struct cgroup *get_cgroup_from_page(struct page *page)
> +{
> + struct page_cgroup *pc;
> + struct bio_cgroup *biog;
> + struct cgroup *cgrp = NULL;
> +
> + pc = lookup_page_cgroup(page);
> + if (!pc)
> + return NULL;
> + lock_page_cgroup(pc);
> + biog = find_bio_cgroup(pc->bio_cgroup_id);
> + if (biog) {
> + css_get(&biog->css);
> + cgrp = biog->css.cgroup;
> + }
> + unlock_page_cgroup(pc);
> + return cgrp;
> +}
> +
> +void put_cgroup_from_page(struct page *page)
> +{
> + struct bio_cgroup *biog;
> + struct page_cgroup *pc;
> +
> + pc = lookup_page_cgroup(page);
> + if (!pc)
> + return;
> + lock_page_cgroup(pc);
> + biog = find_bio_cgroup(pc->bio_cgroup_id);
> + if (biog)
> + css_put(&biog->css);
> + unlock_page_cgroup(pc);
> +}
> +
> +/* Determine the bio-cgroup id of a given bio. */
> +int get_bio_cgroup_id(struct bio *bio)
> +{
> + struct page_cgroup *pc;
> + struct page *page = bio_iovec_idx(bio, 0)->bv_page;
> + int id = 0;
> +
> + pc = lookup_page_cgroup(page);
> + if (pc)
> + id = pc->bio_cgroup_id;
> + return id;
> +}
> +EXPORT_SYMBOL(get_bio_cgroup_id);
> +
> +/* Determine the iocontext of the bio-cgroup that issued a given bio. */
> +struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
> +{
> + struct bio_cgroup *biog = NULL;
> + struct io_context *ioc;
> + int id = 0;
> +
> + id = get_bio_cgroup_id(bio);
> + if (id)
> + biog = find_bio_cgroup(id);
> + if (!biog)
> + biog = &default_bio_cgroup;
> + ioc = biog->io_context; /* default io_context for this cgroup */
> + atomic_inc(&ioc->refcount);
> + return ioc;
> +}
> +EXPORT_SYMBOL(get_bio_cgroup_iocontext);
> +
> +static u64 bio_id_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> + struct bio_cgroup *biog = cgroup_bio(cgrp);
> + return (u64) biog->id;
> +}
> +
> +
> +static struct cftype bio_files[] = {
> + {
> + .name = "id",
> + .read_u64 = bio_id_read,
> + },
> +};
> +
> +static int bio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + return cgroup_add_files(cgrp, ss, bio_files, ARRAY_SIZE(bio_files));
> +}
> +
> +static void bio_cgroup_attach(struct cgroup_subsys *ss,
> + struct cgroup *cont, struct cgroup *oldcont,
> + struct task_struct *tsk)
> +{
> + struct tsk_move_msg tmm;
> + struct bio_cgroup *old_biog, *new_biog;
> +
> + old_biog = cgroup_bio(oldcont);
> + new_biog = cgroup_bio(cont);
> + tmm.old_id = old_biog->id;
> + tmm.new_id = new_biog->id;
> + tmm.tsk = tsk;
> + blocking_notifier_call_chain(&biocgroup_chain, MOVETASK, &tmm);
> +}
> +
> +struct cgroup_subsys bio_cgroup_subsys = {
> + .name = "bio",
> + .create = bio_cgroup_create,
> + .destroy = bio_cgroup_destroy,
> + .populate = bio_cgroup_populate,
> + .attach = bio_cgroup_attach,
> + .subsys_id = bio_cgroup_subsys_id,
> +};
> +
> diff --git a/mm/bounce.c b/mm/bounce.c
> index e590272..1a01905 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -14,6 +14,7 @@
> #include <linux/hash.h>
> #include <linux/highmem.h>
> #include <linux/blktrace_api.h>
> +#include <linux/biotrack.h>
> #include <trace/block.h>
> #include <asm/tlbflush.h>
>
> @@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
> to->bv_len = from->bv_len;
> to->bv_offset = from->bv_offset;
> inc_zone_page_state(to->bv_page, NR_BOUNCE);
> + bio_cgroup_copy_owner(to->bv_page, page);
>
> if (rw == WRITE) {
> char *vto, *vfrom;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8bd4980..1ab32a2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -33,6 +33,7 @@
> #include <linux/cpuset.h>
> #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
> #include <linux/memcontrol.h>
> +#include <linux/biotrack.h>
> #include <linux/mm_inline.h> /* for page_is_file_cache() */
> #include "internal.h"
>
> @@ -463,6 +464,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> gfp_mask & GFP_RECLAIM_MASK);
> if (error)
> goto out;
> + bio_cgroup_set_owner(page, current->mm);
>
> error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> if (error == 0) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e44fb0f..c25eb63 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2524,6 +2524,11 @@ struct cgroup_subsys mem_cgroup_subsys = {
> .use_id = 1,
> };
>
> +void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
> +{
> + pc->mem_cgroup = NULL;
> +}
> +
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>
> static int __init disable_swap_account(char *s)
> diff --git a/mm/memory.c b/mm/memory.c
> index cf6873e..7779e12 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -51,6 +51,7 @@
> #include <linux/init.h>
> #include <linux/writeback.h>
> #include <linux/memcontrol.h>
> +#include <linux/biotrack.h>
> #include <linux/mmu_notifier.h>
> #include <linux/kallsyms.h>
> #include <linux/swapops.h>
> @@ -2052,6 +2053,7 @@ gotten:
> * thread doing COW.
> */
> ptep_clear_flush_notify(vma, address, page_table);
> + bio_cgroup_set_owner(new_page, mm);
> page_add_new_anon_rmap(new_page, vma, address);
> set_pte_at(mm, address, page_table, entry);
> update_mmu_cache(vma, address, entry);
> @@ -2497,6 +2499,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> flush_icache_page(vma, page);
> set_pte_at(mm, address, page_table, pte);
> page_add_anon_rmap(page, vma, address);
> + bio_cgroup_reset_owner(page, mm);
> /* It's better to call commit-charge after rmap is established */
> mem_cgroup_commit_charge_swapin(page, ptr);
>
> @@ -2559,6 +2562,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> if (!pte_none(*page_table))
> goto release;
> inc_mm_counter(mm, anon_rss);
> + bio_cgroup_set_owner(page, mm);
> page_add_new_anon_rmap(page, vma, address);
> set_pte_at(mm, address, page_table, entry);
>
> @@ -2711,6 +2715,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> if (anon) {
> inc_mm_counter(mm, anon_rss);
> + bio_cgroup_set_owner(page, mm);
> page_add_new_anon_rmap(page, vma, address);
> } else {
> inc_mm_counter(mm, file_rss);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 30351f0..1379eb0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -26,6 +26,7 @@
> #include <linux/blkdev.h>
> #include <linux/mpage.h>
> #include <linux/rmap.h>
> +#include <linux/biotrack.h>
> #include <linux/percpu.h>
> #include <linux/notifier.h>
> #include <linux/smp.h>
> @@ -1243,6 +1244,7 @@ int __set_page_dirty_nobuffers(struct page *page)
> BUG_ON(mapping2 != mapping);
> WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> account_page_dirtied(page, mapping);
> + bio_cgroup_reset_owner_pagedirty(page, current->mm);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> }
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 791905c..f692ee2 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -9,13 +9,16 @@
> #include <linux/vmalloc.h>
> #include <linux/cgroup.h>
> #include <linux/swapops.h>
> +#include <linux/memcontrol.h>
> +#include <linux/biotrack.h>
>
> static void __meminit
> __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> {
> pc->flags = 0;
> - pc->mem_cgroup = NULL;
> pc->page = pfn_to_page(pfn);
> + __init_mem_page_cgroup(pc);
> + __init_bio_page_cgroup(pc);
> INIT_LIST_HEAD(&pc->lru);
> }
> static unsigned long total_usage;
> @@ -74,7 +77,7 @@ void __init page_cgroup_init(void)
>
> int nid, fail;
>
> - if (mem_cgroup_disabled())
> + if (mem_cgroup_disabled() && bio_cgroup_disabled())
> return;
>
> for_each_online_node(nid) {
> @@ -83,12 +86,12 @@ void __init page_cgroup_init(void)
> goto fail;
> }
> printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> - printk(KERN_INFO "please try cgroup_disable=memory option if you"
> + printk(KERN_INFO "please try cgroup_disable=memory,bio option if you"
> " don't want\n");
> return;
> fail:
> printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
> - printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
> + printk(KERN_CRIT "please try cgroup_disable=memory,bio boot options\n");
> panic("Out of memory");
> }
>
> @@ -248,7 +251,7 @@ void __init page_cgroup_init(void)
> unsigned long pfn;
> int fail = 0;
>
> - if (mem_cgroup_disabled())
> + if (mem_cgroup_disabled() && bio_cgroup_disabled())
> return;
>
> for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> @@ -263,8 +266,8 @@ void __init page_cgroup_init(void)
> hotplug_memory_notifier(page_cgroup_callback, 0);
> }
> printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> - printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
> - " want\n");
> + printk(KERN_INFO
> + "try cgroup_disable=memory,bio option if you don't want\n");
> }
>
> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3ecea98..c7ad256 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -17,6 +17,7 @@
> #include <linux/backing-dev.h>
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> +#include <linux/biotrack.h>
> #include <linux/page_cgroup.h>
>
> #include <asm/pgtable.h>
> @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> */
> __set_page_locked(new_page);
> SetPageSwapBacked(new_page);
> + bio_cgroup_set_owner(new_page, current->mm);
> err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> if (likely(!err)) {

I bet this is dangerous. You can't guarantee current->mm is owner of this swap cache
because this is "readahead". You can't find the owner of this swap-cache until
it's mapped. I recommend you to ignore swap-in here because

- swapin-readahead just read (1 << page_cluster) pages at once.
- until the end of swap-in, the process will make no progress.

I wonder whether it's better to delay attaching the bio-cgroup to anon pages
until swap-out or direct-io (add a hook to try_to_unmap and catch the owner
there). Most anon pages will probably see no I/O if swap-out never occurs.
BTW, it seems DIO from HugeTLB is not handled.
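
As a sketch of this idea (illustrative only, not code from the thread; the
helper name is made up), the owner could be recorded from the swap-out path,
e.g. from try_to_unmap_one(), where the mapping vma is known:

#include <linux/mm.h>
#include <linux/biotrack.h>

/*
 * Sketch: tag an anonymous page with its bio-cgroup owner only when it is
 * about to be unmapped for swap-out, instead of at fault time.
 */
static inline void bio_cgroup_note_swapout_owner(struct page *page,
						 struct vm_area_struct *vma)
{
	/* vma->vm_mm identifies the owning address space at unmap time */
	bio_cgroup_set_owner(page, vma->vm_mm);
}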


Thanks,
-Kame

2009-04-15 09:37:35

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Wed, Apr 15, 2009 at 11:15:28AM +0900, KAMEZAWA Hiroyuki wrote:
> > /*
> > * Page Cgroup can be considered as an extended mem_map.
> > @@ -12,9 +12,16 @@
> > */
> > struct page_cgroup {
> > unsigned long flags;
> > - struct mem_cgroup *mem_cgroup;
> > struct page *page;
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > + struct mem_cgroup *mem_cgroup;
> > +#endif
> > +#ifdef CONFIG_CGROUP_BIO
> > + int bio_cgroup_id;
> > +#endif
> > +#if defined(CONFIG_CGROUP_MEM_RES_CTLR) || defined(CONFIG_CGROUP_BIO)
> > struct list_head lru; /* per cgroup LRU list */
> > +#endif
> > };
> >
> This #if is unnecessary..

OK.

>
> And, now, CSS_ID is supported. I think it can be used and your own id is not
> necessary. (plz see swap accounting in memcontrol.c/page_cgroup.c if unsure.)

Agree. We can use css_id(&bio_cgroup->css), instead of introducing a new
custom id in the bio_cgroup structure.

>
> And... I don't like to increase the size of struct page_cgroup.
> Could you find a way to encode bio_cgroup_id into "flags" ?
> unsigned long is too much now.

And I also agree here. Maybe the lower 16 bits are enough for the flags:
only 3 bits are used now, so we can reserve the rest (the upper 16 bits
on 32-bit archs and 48 bits on 64-bit archs) for the bio_cgroup id. Or
do you think that is more than enough for any reasonable number of
cgroups we could have in a system?
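
To make the proposed split concrete (a sketch of the layout only, not code
from a posted patch; the widths follow the 16-bit-flags proposal above):

/*
 * page_cgroup->flags with the lower 16 bits kept for flag/lock bits:
 *
 *   32-bit: [ 31 .......... 16 | 15 .......... 0 ]  id = 16 bits,
 *              bio-cgroup id        flag bits        up to 65535 bio-cgroups
 *
 *   64-bit: [ 63 .......... 16 | 15 .......... 0 ]  id = 48 bits
 *              bio-cgroup id        flag bits
 */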

> > +/*
> > + * This function is used to make a given page have the bio-cgroup id of
> > + * the owner of this page.
> > + */
> > +void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> > +{
> > + struct bio_cgroup *biog;
> > + struct page_cgroup *pc;
> > +
> > + if (bio_cgroup_disabled())
> > + return;
> > + pc = lookup_page_cgroup(page);
> > + if (unlikely(!pc))
> > + return;
> > +
> > + pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
> > + if (!mm)
> > + return;
> > + /*
> > + * Locking "pc" isn't necessary here since the current process is
> > + * the only one that can access the members related to bio_cgroup.
> > + */
> > + rcu_read_lock();
> > + biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
> > + if (unlikely(!biog))
> > + goto out;
> > + /*
> > + * css_get(&bio->css) isn't called to increment the reference
> > + * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
> > + * invalid even if this page is still active.
> > + * This approach is chosen to minimize the overhead.
> > + */
> > + pc->bio_cgroup_id = biog->id;
> > +out:
> > + rcu_read_unlock();
> > +}
> > +
> > +/*
> > + * Change the owner of a given page if necessary.
> > + */
> > +void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
> > +{
> > + /*
> > + * A little trick:
> > + * Just call bio_cgroup_set_owner() for pages which are already
> > + * active since the bio_cgroup_id member of page_cgroup can be
> > + * updated without any locks. This is because an integer type of
> > + * variable can be set a new value at once on modern cpus.
> > + */
> > + bio_cgroup_set_owner(page, mm);
> > +}
> Hmm ? I think all operations are under lock_page() and there are no races.
> Isn't it ?
>

We can check this with:

WARN_ON_ONCE(test_bit(PG_locked, &page->flags));

> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 3ecea98..c7ad256 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -17,6 +17,7 @@
> > #include <linux/backing-dev.h>
> > #include <linux/pagevec.h>
> > #include <linux/migrate.h>
> > +#include <linux/biotrack.h>
> > #include <linux/page_cgroup.h>
> >
> > #include <asm/pgtable.h>
> > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > */
> > __set_page_locked(new_page);
> > SetPageSwapBacked(new_page);
> > + bio_cgroup_set_owner(new_page, current->mm);
> > err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > if (likely(!err)) {
>
> I bet this is dangerous. You can't guarantee current->mm is owner of this swap cache
> because this is "readahead". You can't find the owner of this swap-cache until
> it's mapped. I recommend you to ignore swap-in here because
>
> - swapin-readahead just read (1 << page_cluster) pages at once.
> - until the end of swap-in, the process will make no progress.

OK.

>
> I wonder whether it's better to delay attaching the bio-cgroup to anon pages
> until swap-out or direct-io (add a hook to try_to_unmap and catch the owner
> there). Most anon pages will probably see no I/O if swap-out never occurs.
> BTW, it seems DIO from HugeTLB is not handled.

Ryo, it would be great if you can look at this and fix/integrate into
the mainstream bio-cgroup. Otherwise I can try to schedule this in my
work.

Thanks for your suggestions Kame!
-Andrea

2009-04-15 12:39:00

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

Hi Andrea and Kamezawa-san,

> Ryo, it would be great if you can look at this and fix/integrate into
> the mainstream bio-cgroup. Otherwise I can try to schedule this in my
> work.

O.K. I'll apply those fixes and post patches as soon as I can.

> Thanks for your suggestions Kame!

I thank you too.

Thanks,
Ryo Tsuruta

2009-04-15 13:07:51

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Wed, Apr 15, 2009 at 11:15:28AM +0900, KAMEZAWA Hiroyuki wrote:
> > +/*
> > + * This function is used to make a given page have the bio-cgroup id of
> > + * the owner of this page.
> > + */
> > +void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> > +{
> > + struct bio_cgroup *biog;
> > + struct page_cgroup *pc;
> > +
> > + if (bio_cgroup_disabled())
> > + return;
> > + pc = lookup_page_cgroup(page);
> > + if (unlikely(!pc))
> > + return;
> > +
> > + pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
> > + if (!mm)
> > + return;
> > + /*
> > + * Locking "pc" isn't necessary here since the current process is
> > + * the only one that can access the members related to bio_cgroup.
> > + */
> > + rcu_read_lock();
> > + biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
> > + if (unlikely(!biog))
> > + goto out;
> > + /*
> > + * css_get(&bio->css) isn't called to increment the reference
> > + * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
> > + * invalid even if this page is still active.
> > + * This approach is chosen to minimize the overhead.
> > + */
> > + pc->bio_cgroup_id = biog->id;
> > +out:
> > + rcu_read_unlock();
> > +}
> > +
> > +/*
> > + * Change the owner of a given page if necessary.
> > + */
> > +void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
> > +{
> > + /*
> > + * A little trick:
> > + * Just call bio_cgroup_set_owner() for pages which are already
> > + * active since the bio_cgroup_id member of page_cgroup can be
> > + * updated without any locks. This is because an integer type of
> > + * variable can be set a new value at once on modern cpus.
> > + */
> > + bio_cgroup_set_owner(page, mm);
> > +}
> Hmm ? I think all operations are under lock_page() and there are no races.
> Isn't it ?

ehm.. no. Adding this in bio_cgroup_set_owner():

WARN_ON_ONCE(!test_bit(PG_locked, &page->flags));

produces the following:

[ 1.641186] WARNING: at mm/biotrack.c:77 bio_cgroup_set_owner+0xe2/0x100()
[ 1.644534] Hardware name:
[ 1.646955] Modules linked in:
[ 1.650526] Pid: 1, comm: swapper Not tainted 2.6.30-rc2 #77
[ 1.653499] Call Trace:
[ 1.656004] [<ffffffff80269370>] warn_slowpath+0xd0/0x120
[ 1.659062] [<ffffffff8023d69a>] ? save_stack_trace+0x2a/0x50
[ 1.662357] [<ffffffff80291f7f>] ? save_trace+0x3f/0xb0
[ 1.670214] [<ffffffff802e3abd>] ? handle_mm_fault+0x40d/0x8b0
[ 1.673321] [<ffffffff8029586b>] ? __lock_acquire+0x63b/0x1de0
[ 1.676446] [<ffffffff802921ba>] ? get_lock_stats+0x2a/0x60
[ 1.679657] [<ffffffff802921fe>] ? put_lock_stats+0xe/0x30
[ 1.682673] [<ffffffff802e3abd>] ? handle_mm_fault+0x40d/0x8b0
[ 1.685706] [<ffffffff80300e72>] bio_cgroup_set_owner+0xe2/0x100
[ 1.688852] [<ffffffff802e3abd>] ? handle_mm_fault+0x40d/0x8b0
[ 1.692280] [<ffffffff802e3ae2>] handle_mm_fault+0x432/0x8b0
[ 1.695261] [<ffffffff802e408f>] __get_user_pages+0x12f/0x430
[ 1.703507] [<ffffffff802e43c2>] get_user_pages+0x32/0x40
[ 1.706947] [<ffffffff80308bab>] get_arg_page+0x4b/0xb0
[ 1.710287] [<ffffffff80308e3d>] copy_strings+0xfd/0x200
[ 1.714028] [<ffffffff80308f69>] copy_strings_kernel+0x29/0x40
[ 1.717058] [<ffffffff8030a651>] do_execve+0x2c1/0x400
[ 1.720291] [<ffffffff8022d739>] sys_execve+0x49/0x80
[ 1.723209] [<ffffffff802300b8>] kernel_execve+0x68/0xd0
[ 1.726309] [<ffffffff8020930b>] ? init_post+0x18b/0x1b0
[ 1.729585] [<ffffffff80af069b>] kernel_init+0x198/0x1b0
[ 1.735754] [<ffffffff8023003a>] child_rip+0xa/0x20
[ 1.738690] [<ffffffff8022fa00>] ? restore_args+0x0/0x30
[ 1.741663] [<ffffffff80af0503>] ? kernel_init+0x0/0x1b0
[ 1.744683] [<ffffffff80230030>] ? child_rip+0x0/0x20
[ 1.747820] ---[ end trace b9f530261e455c85 ]---

In do_anonymous_page(), bio_cgroup_set_owner() seems to be called
without lock_page() held.

-Andrea

2009-04-15 13:24:19

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Wed, Apr 15, 2009 at 09:38:50PM +0900, Ryo Tsuruta wrote:
> Hi Andrea and Kamezawa-san,
>
> > Ryo, it would be great if you can look at this and fix/integrate into
> > the mainstream bio-cgroup. Otherwise I can try to schedule this in my
> > work.
>
> O.K. I'll apply those fixes and post patches as soon as I can.
>

Very good! I've just tested the bio_cgroup_id inclusion in
page_cgroup->flags. I'm posting the patch on-top-of my patchset.

If you're interested, it should apply cleanly to the original
bio-cgroup, except for the get/put_cgroup_from_page() part.

Thanks,
-Andrea
---
bio-cgroup: encode bio_cgroup_id in page_cgroup->flags

Encode the bio_cgroup_id into the flags argument of page_cgroup as
suggested by Kamezawa.

Lower 16-bits of the flags attribute are used for the actual page_cgroup
flags. The rest is reserved to store the bio-cgroup id.

This saves 4 bytes (on 32-bit architectures) or 8 bytes (on 64-bit)
for each page_cgroup element.

Signed-off-by: Andrea Righi <[email protected]>
---
include/linux/biotrack.h | 2 +-
include/linux/page_cgroup.h | 24 +++++++++++++++++++++---
mm/biotrack.c | 26 ++++++++++++--------------
3 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
index 25b8810..4bd0242 100644
--- a/include/linux/biotrack.h
+++ b/include/linux/biotrack.h
@@ -28,7 +28,7 @@ struct bio_cgroup {

static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
{
- pc->bio_cgroup_id = 0;
+ page_cgroup_set_bio_id(pc, 0);
}

extern struct cgroup *get_cgroup_from_page(struct page *page);
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 00a49c5..af780a4 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -16,12 +16,30 @@ struct page_cgroup {
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
#endif
-#ifdef CONFIG_CGROUP_BIO
- int bio_cgroup_id;
-#endif
struct list_head lru; /* per cgroup LRU list */
};

+#ifdef CONFIG_CGROUP_BIO
+/*
+ * use lower 16 bits for flags and reserve the rest for the bio-cgroup id
+ */
+#define BIO_CGROUP_ID_SHIFT (16)
+#define BIO_CGROUP_ID_BITS (8 * sizeof(unsigned long) - BIO_CGROUP_ID_SHIFT)
+
+static inline unsigned long page_cgroup_get_bio_id(struct page_cgroup *pc)
+{
+ return pc->flags >> BIO_CGROUP_ID_SHIFT;
+}
+
+static inline void page_cgroup_set_bio_id(struct page_cgroup *pc,
+ unsigned long id)
+{
+ WARN_ON(id >= (1UL << BIO_CGROUP_ID_BITS));
+ pc->flags &= (1UL << BIO_CGROUP_ID_SHIFT) - 1;
+ pc->flags |= (unsigned long)(id << BIO_CGROUP_ID_SHIFT);
+}
+#endif
+
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
void __init page_cgroup_init(void);
struct page_cgroup *lookup_page_cgroup(struct page *page);
diff --git a/mm/biotrack.c b/mm/biotrack.c
index 431056c..4cca3bf 100644
--- a/mm/biotrack.c
+++ b/mm/biotrack.c
@@ -74,7 +74,7 @@ void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
struct bio_cgroup *biog;
struct page_cgroup *pc;

- WARN_ON_ONCE(test_bit(PG_locked, &page->flags));
+ WARN_ON_ONCE(!test_bit(PG_locked, &page->flags));

if (bio_cgroup_disabled())
return;
@@ -82,7 +82,7 @@ void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
if (unlikely(!pc))
return;

- pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
+ __init_bio_page_cgroup(pc);
if (!mm)
return;
/*
@@ -95,11 +95,11 @@ void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
goto out;
/*
* css_get(&bio->css) isn't called to increment the reference
- * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
- * invalid even if this page is still active.
- * This approach is chosen to minimize the overhead.
+ * count of this bio_cgroup "biog" so the bio-cgroup id might turn
+ * invalid even if this page is still active. This approach is chosen
+ * to minimize the overhead.
*/
- pc->bio_cgroup_id = biog->id;
+ page_cgroup_set_bio_id(pc, biog->id);
out:
rcu_read_unlock();
}
@@ -112,7 +112,7 @@ void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
/*
* A little trick:
* Just call bio_cgroup_set_owner() for pages which are already
- * active since the bio_cgroup_id member of page_cgroup can be
+ * active since the bio-cgroup id member of page_cgroup can be
* updated without any locks. This is because an integer type of
* variable can be set a new value at once on modern cpus.
*/
@@ -148,12 +148,10 @@ void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
opc = lookup_page_cgroup(opage);
if (unlikely(!opc))
return;
-
/*
- * Do this without any locks. The reason is the same as
- * bio_cgroup_reset_owner().
+ * XXX: is it safe to do this without locking?
*/
- npc->bio_cgroup_id = opc->bio_cgroup_id;
+ page_cgroup_set_bio_id(npc, page_cgroup_get_bio_id(opc));
}

/* Create a new bio-cgroup. */
@@ -250,7 +248,7 @@ struct cgroup *get_cgroup_from_page(struct page *page)
if (!pc)
return NULL;
lock_page_cgroup(pc);
- biog = find_bio_cgroup(pc->bio_cgroup_id);
+ biog = find_bio_cgroup(page_cgroup_get_bio_id(pc));
if (biog) {
css_get(&biog->css);
cgrp = biog->css.cgroup;
@@ -268,7 +266,7 @@ void put_cgroup_from_page(struct page *page)
if (!pc)
return;
lock_page_cgroup(pc);
- biog = find_bio_cgroup(pc->bio_cgroup_id);
+ biog = find_bio_cgroup(page_cgroup_get_bio_id(pc));
if (biog)
css_put(&biog->css);
unlock_page_cgroup(pc);
@@ -283,7 +281,7 @@ int get_bio_cgroup_id(struct bio *bio)

pc = lookup_page_cgroup(page);
if (pc)
- id = pc->bio_cgroup_id;
+ id = page_cgroup_get_bio_id(pc);
return id;
}
EXPORT_SYMBOL(get_bio_cgroup_id);

2009-04-16 00:00:01

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Wed, 15 Apr 2009 15:23:57 +0200
Andrea Righi <[email protected]> wrote:

> On Wed, Apr 15, 2009 at 09:38:50PM +0900, Ryo Tsuruta wrote:
> > Hi Andrea and Kamezawa-san,
> >
> > > Ryo, it would be great if you can look at this and fix/integrate into
> > > the mainstream bio-cgroup. Otherwise I can try to schedule this in my
> > > work.
> >
> > O.K. I'll apply those fixes and post patches as soon as I can.
> >
>
> Very good! I've just tested the bio_cgroup_id inclusion in
> page_cgroup->flags. I'm posting the patch on-top-of my patchset.
>
> If you're interested, it should apply cleanly to the original
> bio-cgroup, except for the get/put_cgroup_from_page() part.
>
> Thanks,
> -Andrea
> ---
> bio-cgroup: encode bio_cgroup_id in page_cgroup->flags
>
> Encode the bio_cgroup_id into the flags argument of page_cgroup as
> suggested by Kamezawa.
>
> Lower 16-bits of the flags attribute are used for the actual page_cgroup
> flags. The rest is reserved to store the bio-cgroup id.
>
> This saves 4 bytes (on 32-bit architectures) or 8 bytes (on 64-bit)
> for each page_cgroup element.
>
> Signed-off-by: Andrea Righi <[email protected]>
> ---
> include/linux/biotrack.h | 2 +-
> include/linux/page_cgroup.h | 24 +++++++++++++++++++++---
> mm/biotrack.c | 26 ++++++++++++--------------
> 3 files changed, 34 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
> index 25b8810..4bd0242 100644
> --- a/include/linux/biotrack.h
> +++ b/include/linux/biotrack.h
> @@ -28,7 +28,7 @@ struct bio_cgroup {
>
> static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
> {
> - pc->bio_cgroup_id = 0;
> + page_cgroup_set_bio_id(pc, 0);
> }
>
> extern struct cgroup *get_cgroup_from_page(struct page *page);
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 00a49c5..af780a4 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -16,12 +16,30 @@ struct page_cgroup {
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> struct mem_cgroup *mem_cgroup;
> #endif
> -#ifdef CONFIG_CGROUP_BIO
> - int bio_cgroup_id;
> -#endif
> struct list_head lru; /* per cgroup LRU list */
> };
>
> +#ifdef CONFIG_CGROUP_BIO
> +/*
> + * use lower 16 bits for flags and reserve the rest for the bio-cgroup id
> + */
> +#define BIO_CGROUP_ID_SHIFT (16)
> +#define BIO_CGROUP_ID_BITS (8 * sizeof(unsigned long) - BIO_CGROUP_ID_SHIFT)
> +
> +static inline unsigned long page_cgroup_get_bio_id(struct page_cgroup *pc)
> +{
> + return pc->flags >> BIO_CGROUP_ID_SHIFT;
> +}
> +
> +static inline void page_cgroup_set_bio_id(struct page_cgroup *pc,
> + unsigned long id)
> +{
> + WARN_ON(id >= (1UL << BIO_CGROUP_ID_BITS));
> + pc->flags &= (1UL << BIO_CGROUP_ID_SHIFT) - 1;
> + pc->flags |= (unsigned long)(id << BIO_CGROUP_ID_SHIFT);
> +}
> +#endif
> +
Ah, there is "Lock" bit in pc->flags and above "set" code does read-modify-write
without lock_page_cgroup().

Could you use lock_page_cgroup() or cmpxchg ? (or using something magical technique ?)
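
For illustration, a minimal sketch of the cmpxchg variant (not code posted in
this thread; it assumes the BIO_CGROUP_ID_SHIFT layout from the patch above,
with the id stored in the bits above the flag/lock bits):

/*
 * Update only the id bits of pc->flags. The cmpxchg() loop guarantees that
 * a racing atomic update of the low flag bits (e.g. the Lock bit) is never
 * lost by this read-modify-write.
 */
static inline void page_cgroup_set_bio_id_atomic(struct page_cgroup *pc,
						 unsigned long id)
{
	unsigned long old, new;

	do {
		old = pc->flags;
		new = (old & ((1UL << BIO_CGROUP_ID_SHIFT) - 1)) |
		      (id << BIO_CGROUP_ID_SHIFT);
	} while (cmpxchg(&pc->flags, old, new) != old);
}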

Thanks,
-Kame

> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> void __init page_cgroup_init(void);
> struct page_cgroup *lookup_page_cgroup(struct page *page);
> diff --git a/mm/biotrack.c b/mm/biotrack.c
> index 431056c..4cca3bf 100644
> --- a/mm/biotrack.c
> +++ b/mm/biotrack.c
> @@ -74,7 +74,7 @@ void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> struct bio_cgroup *biog;
> struct page_cgroup *pc;
>
> - WARN_ON_ONCE(test_bit(PG_locked, &page->flags));
> + WARN_ON_ONCE(!test_bit(PG_locked, &page->flags));
>
> if (bio_cgroup_disabled())
> return;
> @@ -82,7 +82,7 @@ void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> if (unlikely(!pc))
> return;
>
> - pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
> + __init_bio_page_cgroup(pc);
> if (!mm)
> return;
> /*
> @@ -95,11 +95,11 @@ void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> goto out;
> /*
> * css_get(&bio->css) isn't called to increment the reference
> - * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
> - * invalid even if this page is still active.
> - * This approach is chosen to minimize the overhead.
> + * count of this bio_cgroup "biog" so the bio-cgroup id might turn
> + * invalid even if this page is still active. This approach is chosen
> + * to minimize the overhead.
> */
> - pc->bio_cgroup_id = biog->id;
> + page_cgroup_set_bio_id(pc, biog->id);
> out:
> rcu_read_unlock();
> }
> @@ -112,7 +112,7 @@ void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
> /*
> * A little trick:
> * Just call bio_cgroup_set_owner() for pages which are already
> - * active since the bio_cgroup_id member of page_cgroup can be
> + * active since the bio-cgroup id member of page_cgroup can be
> * updated without any locks. This is because an integer type of
> * variable can be set a new value at once on modern cpus.
> */
> @@ -148,12 +148,10 @@ void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
> opc = lookup_page_cgroup(opage);
> if (unlikely(!opc))
> return;
> -
> /*
> - * Do this without any locks. The reason is the same as
> - * bio_cgroup_reset_owner().
> + * XXX: is it safe to do this without locking?
> */
> - npc->bio_cgroup_id = opc->bio_cgroup_id;
> + page_cgroup_set_bio_id(npc, page_cgroup_get_bio_id(opc));
> }
>
> /* Create a new bio-cgroup. */
> @@ -250,7 +248,7 @@ struct cgroup *get_cgroup_from_page(struct page *page)
> if (!pc)
> return NULL;
> lock_page_cgroup(pc);
> - biog = find_bio_cgroup(pc->bio_cgroup_id);
> + biog = find_bio_cgroup(page_cgroup_get_bio_id(pc));
> if (biog) {
> css_get(&biog->css);
> cgrp = biog->css.cgroup;
> @@ -268,7 +266,7 @@ void put_cgroup_from_page(struct page *page)
> if (!pc)
> return;
> lock_page_cgroup(pc);
> - biog = find_bio_cgroup(pc->bio_cgroup_id);
> + biog = find_bio_cgroup(page_cgroup_get_bio_id(pc));
> if (biog)
> css_put(&biog->css);
> unlock_page_cgroup(pc);
> @@ -283,7 +281,7 @@ int get_bio_cgroup_id(struct bio *bio)
>
> pc = lookup_page_cgroup(page);
> if (pc)
> - id = pc->bio_cgroup_id;
> + id = page_cgroup_get_bio_id(pc);
> return id;
> }
> EXPORT_SYMBOL(get_bio_cgroup_id);

2009-04-16 10:42:54

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Thu, Apr 16, 2009 at 08:58:14AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 15 Apr 2009 15:23:57 +0200
> Andrea Righi <[email protected]> wrote:
>
> > On Wed, Apr 15, 2009 at 09:38:50PM +0900, Ryo Tsuruta wrote:
> > > Hi Andrea and Kamezawa-san,
> > >
> > > > Ryo, it would be great if you can look at this and fix/integrate into
> > > > the mainstream bio-cgroup. Otherwise I can try to schedule this in my
> > > > work.
> > >
> > > O.K. I'll apply those fixes and post patches as soon as I can.
> > >
> >
> > Very good! I've just tested the bio_cgroup_id inclusion in
> > page_cgroup->flags. I'm posting the patch on-top-of my patchset.
> >
> > If you're interested, it should apply cleanly to the original
> > bio-cgroup, except for the get/put_cgroup_from_page() part.
> >
> > Thanks,
> > -Andrea
> > ---
> > bio-cgroup: encode bio_cgroup_id in page_cgroup->flags
> >
> > Encode the bio_cgroup_id into the flags argument of page_cgroup as
> > suggested by Kamezawa.
> >
> > Lower 16-bits of the flags attribute are used for the actual page_cgroup
> > flags. The rest is reserved to store the bio-cgroup id.
> >
> > This saves 4 bytes (on 32-bit architectures) or 8 bytes (on 64-bit)
> > for each page_cgroup element.
> >
> > Signed-off-by: Andrea Righi <[email protected]>
> > ---
> > include/linux/biotrack.h | 2 +-
> > include/linux/page_cgroup.h | 24 +++++++++++++++++++++---
> > mm/biotrack.c | 26 ++++++++++++--------------
> > 3 files changed, 34 insertions(+), 18 deletions(-)
> >
> > diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
> > index 25b8810..4bd0242 100644
> > --- a/include/linux/biotrack.h
> > +++ b/include/linux/biotrack.h
> > @@ -28,7 +28,7 @@ struct bio_cgroup {
> >
> > static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
> > {
> > - pc->bio_cgroup_id = 0;
> > + page_cgroup_set_bio_id(pc, 0);
> > }
> >
> > extern struct cgroup *get_cgroup_from_page(struct page *page);
> > diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> > index 00a49c5..af780a4 100644
> > --- a/include/linux/page_cgroup.h
> > +++ b/include/linux/page_cgroup.h
> > @@ -16,12 +16,30 @@ struct page_cgroup {
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > struct mem_cgroup *mem_cgroup;
> > #endif
> > -#ifdef CONFIG_CGROUP_BIO
> > - int bio_cgroup_id;
> > -#endif
> > struct list_head lru; /* per cgroup LRU list */
> > };
> >
> > +#ifdef CONFIG_CGROUP_BIO
> > +/*
> > + * use lower 16 bits for flags and reserve the rest for the bio-cgroup id
> > + */
> > +#define BIO_CGROUP_ID_SHIFT (16)
> > +#define BIO_CGROUP_ID_BITS (8 * sizeof(unsigned long) - BIO_CGROUP_ID_SHIFT)
> > +
> > +static inline unsigned long page_cgroup_get_bio_id(struct page_cgroup *pc)
> > +{
> > + return pc->flags >> BIO_CGROUP_ID_SHIFT;
> > +}
> > +
> > +static inline void page_cgroup_set_bio_id(struct page_cgroup *pc,
> > + unsigned long id)
> > +{
> > + WARN_ON(id >= (1UL << BIO_CGROUP_ID_BITS));
> > + pc->flags &= (1UL << BIO_CGROUP_ID_SHIFT) - 1;
> > + pc->flags |= (unsigned long)(id << BIO_CGROUP_ID_SHIFT);
> > +}
> > +#endif
> > +
> Ah, there is "Lock" bit in pc->flags and above "set" code does read-modify-write
> without lock_page_cgroup().
>
> Could you use lock_page_cgroup() or cmpxchg ? (or using something magical technique ?)

If I'm not wrong this should guarantee atomicity without using
lock_page_cgroup().

Thanks,
-Andrea
---
bio-cgroup: encode bio_cgroup_id in page_cgroup->flags

Encode the bio_cgroup_id into the flags argument of page_cgroup as
suggested by Kamezawa.

Lower 16 bits (in 32-bit archs) or lower 32 bits (in 64-bit archs) of
the flags attribute are used for the actual page_cgroup flags. The upper
bits are reserved to store the bio-cgroup id.

This saves 4 bytes (on 32-bit architectures) or 8 bytes (on 64-bit)
for each page_cgroup element.

Signed-off-by: Andrea Righi <[email protected]>
---
include/linux/page_cgroup.h | 42 +++++++++++++++++++++++++++++++++++++-----
mm/biotrack.c | 24 +++++++++++-------------
2 files changed, 48 insertions(+), 18 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index a7249bb..864ad6f 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -16,14 +16,46 @@ struct page_cgroup {
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
#endif
-#ifdef CONFIG_CGROUP_BIO
- int bio_cgroup_id;
-#endif
-#if defined(CONFIG_CGROUP_MEM_RES_CTLR) || defined(CONFIG_CGROUP_BIO)
struct list_head lru; /* per cgroup LRU list */
-#endif
};

+#ifdef CONFIG_CGROUP_BIO
+/*
+ * Use the lower 16 bits (in 32-bit archs) or lower 32 bits (in 64-bit archs)
+ * of page_cgroup->flags for the actual flags and reserve the rest for the
+ * bio-cgroup id.
+ *
+ * This allows to atomically read and write the bio-cgroup id without using
+ * lock/unlock_page_cgroup().
+ */
+#if defined(CONFIG_64BIT)
+typedef uint32_t bio_cgroup_id_t;
+#elif defined(CONFIG_32BIT)
+typedef uint16_t bio_cgroup_id_t;
+#else
+#error "unsupported architecture"
+#endif
+
+#define BIO_CGROUP_ID_SHIFT (sizeof(bio_cgroup_id_t) * 8)
+#define BIO_CGROUP_ID_BITS (sizeof(bio_cgroup_id_t) * 8)
+#define BIO_CGROUP_ID_MASK ((1UL << BIO_CGROUP_ID_BITS) - 1)
+
+static inline unsigned long page_cgroup_get_bio_id(struct page_cgroup *pc)
+{
+ return (pc->flags >> BIO_CGROUP_ID_SHIFT) & BIO_CGROUP_ID_MASK;
+}
+
+static inline void page_cgroup_set_bio_id(struct page_cgroup *pc,
+ unsigned long id)
+{
+ bio_cgroup_id_t *ptr = (bio_cgroup_id_t *)((unsigned char *)&pc->flags +
+ (BIO_CGROUP_ID_SHIFT / 8));
+
+ WARN_ON(id >= (1UL << BIO_CGROUP_ID_BITS));
+ *ptr = (bio_cgroup_id_t)id;
+}
+#endif /* CONFIG_CGROUP_BIO */
+
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
void __init page_cgroup_init(void);
struct page_cgroup *lookup_page_cgroup(struct page *page);
diff --git a/mm/biotrack.c b/mm/biotrack.c
index d3a35f1..01f83ba 100644
--- a/mm/biotrack.c
+++ b/mm/biotrack.c
@@ -80,7 +80,7 @@ void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
if (unlikely(!pc))
return;

- pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
+ __init_bio_page_cgroup(pc);
if (!mm)
return;
/*
@@ -93,11 +93,11 @@ void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
goto out;
/*
* css_get(&bio->css) isn't called to increment the reference
- * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
- * invalid even if this page is still active.
- * This approach is chosen to minimize the overhead.
+ * count of this bio_cgroup "biog" so the bio-cgroup id might turn
+ * invalid even if this page is still active. This approach is chosen
+ * to minimize the overhead.
*/
- pc->bio_cgroup_id = biog->id;
+ page_cgroup_set_bio_id(pc, biog->id);
out:
rcu_read_unlock();
}
@@ -110,7 +110,7 @@ void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
/*
* A little trick:
* Just call bio_cgroup_set_owner() for pages which are already
- * active since the bio_cgroup_id member of page_cgroup can be
+ * active since the bio-cgroup id member of page_cgroup can be
* updated without any locks. This is because an integer type of
* variable can be set a new value at once on modern cpus.
*/
@@ -146,12 +146,10 @@ void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
opc = lookup_page_cgroup(opage);
if (unlikely(!opc))
return;
-
/*
- * Do this without any locks. The reason is the same as
- * bio_cgroup_reset_owner().
+ * XXX: is it safe to do this without locking?
*/
- npc->bio_cgroup_id = opc->bio_cgroup_id;
+ page_cgroup_set_bio_id(npc, page_cgroup_get_bio_id(opc));
}

/* Create a new bio-cgroup. */
@@ -248,7 +246,7 @@ struct cgroup *get_cgroup_from_page(struct page *page)
if (!pc)
return NULL;
lock_page_cgroup(pc);
- biog = find_bio_cgroup(pc->bio_cgroup_id);
+ biog = find_bio_cgroup(page_cgroup_get_bio_id(pc));
if (biog) {
css_get(&biog->css);
cgrp = biog->css.cgroup;
@@ -266,7 +264,7 @@ void put_cgroup_from_page(struct page *page)
if (!pc)
return;
lock_page_cgroup(pc);
- biog = find_bio_cgroup(pc->bio_cgroup_id);
+ biog = find_bio_cgroup(page_cgroup_get_bio_id(pc));
if (biog)
css_put(&biog->css);
unlock_page_cgroup(pc);
@@ -281,7 +279,7 @@ int get_bio_cgroup_id(struct bio *bio)

pc = lookup_page_cgroup(page);
if (pc)
- id = pc->bio_cgroup_id;
+ id = page_cgroup_get_bio_id(pc);
return id;
}
EXPORT_SYMBOL(get_bio_cgroup_id);
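
To make the trick in the patch above concrete, here is a minimal stand-alone
sketch (not kernel code; the structure and helper names are made up for
illustration). Because the id occupies the upper half of the flags word, a
store through a naturally aligned 16/32-bit pointer never reads or rewrites
the flag bits in the lower half, whereas the read-modify-write version could
overwrite a concurrent flag update. The aligned-store variant assumes a
little-endian layout.

#include <stdint.h>

/* Illustrative stand-in for struct page_cgroup, not the real structure. */
struct fake_page_cgroup {
	unsigned long flags;			/* flag bits live in the lower half */
};

#define ID_SHIFT	(sizeof(uint16_t) * 8)	/* 32-bit case: low 16 bits = flags */

/* Racy variant: reads flags, masks them and writes them back. */
static void set_id_rmw(struct fake_page_cgroup *pc, unsigned long id)
{
	pc->flags &= (1UL << ID_SHIFT) - 1;	/* may lose a concurrent flag update */
	pc->flags |= id << ID_SHIFT;
}

/* Patch's variant: a single aligned halfword store, flag bits never touched. */
static void set_id_store(struct fake_page_cgroup *pc, unsigned long id)
{
	uint16_t *p = (uint16_t *)((unsigned char *)&pc->flags + ID_SHIFT / 8);

	*p = (uint16_t)id;			/* offset is correct only on little-endian */
}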

2009-04-16 12:00:19

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

Hi Andrea and Kamezawa-san,

> > > +#ifdef CONFIG_CGROUP_BIO
> > > +/*
> > > + * use lower 16 bits for flags and reserve the rest for the bio-cgroup id
> > > + */
> > > +#define BIO_CGROUP_ID_SHIFT (16)
> > > +#define BIO_CGROUP_ID_BITS (8 * sizeof(unsigned long) - BIO_CGROUP_ID_SHIFT)
> > > +
> > > +static inline unsigned long page_cgroup_get_bio_id(struct page_cgroup *pc)
> > > +{
> > > + return pc->flags >> BIO_CGROUP_ID_SHIFT;
> > > +}
> > > +
> > > +static inline void page_cgroup_set_bio_id(struct page_cgroup *pc,
> > > + unsigned long id)
> > > +{
> > > + WARN_ON(id >= (1UL << BIO_CGROUP_ID_BITS));
> > > + pc->flags &= (1UL << BIO_CGROUP_ID_SHIFT) - 1;
> > > + pc->flags |= (unsigned long)(id << BIO_CGROUP_ID_SHIFT);
> > > +}
> > > +#endif
> > > +
> > Ah, there is "Lock" bit in pc->flags and above "set" code does read-modify-write
> > without lock_page_cgroup().
> >
> > Could you use lock_page_cgroup() or cmpxchg ? (or using something magical technique ?)
>
> If I'm not wrong this should guarantee atomicity without using
> lock_page_cgroup().

I'll consider carefully what the best way is to minimize the overhead
as far as possible.
First, I'll soon post new bio-cgroup patches that use css_id as the
bio_cgroup_id.

Thanks,
Ryo Tsuruta

2009-04-16 22:41:54

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/9] cgroup: io-throttle controller (v13)

On Tue, 14 Apr 2009 22:21:11 +0200
Andrea Righi <[email protected]> wrote:

> Objective
> ~~~~~~~~~
> The objective of the io-throttle controller is to improve IO performance
> predictability of different cgroups that share the same block devices.

We should get an IO controller into Linux. Does anyone have a reason
why it shouldn't be this one?

> Respect to other priority/weight-based solutions the approach used by
> this controller is to explicitly choke applications' requests

Yes, blocking the offending application at a high level has always
seemed to me to be the best way of implementing the controller.

> that
> directly or indirectly generate IO activity in the system (this
> controller addresses both synchronous IO and writeback/buffered IO).

The problem I've seen with some of the proposed controllers was that
they didn't handle delayed writeback very well, if at all.

Can you explain at a high level but in some detail how this works? If
an application is doing a huge write(), how is that detected and how is
the application made to throttle?

Does it add new metadata to `struct page' for this?

I assume that the write throttling is also wired up into the MAP_SHARED
write-fault path?



Does this patchset provide a path by which we can implement IO control
for (say) NFS mounts?

2009-04-16 22:46:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Tue, 14 Apr 2009 22:21:14 +0200
Andrea Righi <[email protected]> wrote:

> Subject: [PATCH 3/9] bio-cgroup controller

Sorry, but I have to register extreme distress at the name of this.
The term "bio" is well-established in the kernel and here we have a new
definition for the same term: "block I/O".

"bio" was a fine term for you to have chosen from the user's
perspective, but from the kernel developer perspective it is quite
horrid. The patch adds a vast number of new symbols all into the
existing "bio_" namespace, many of which aren't related to `struct bio'
at all.

At least, I think that's what's happening. Perhaps the controller
really _is_ designed to track `struct bio'? If so, that's an odd thing
to tell userspace about.


> The controller bio-cgroup is used by io-throttle to track writeback IO
> and to properly apply throttling.

Presumably it tracks all forms of block-based I/O and not just delayed
writeback.

2009-04-17 00:07:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Thu, 16 Apr 2009 12:42:36 +0200
Andrea Righi <[email protected]> wrote:

> On Thu, Apr 16, 2009 at 08:58:14AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 15 Apr 2009 15:23:57 +0200
> > Andrea Righi <[email protected]> wrote:
> >
> > > On Wed, Apr 15, 2009 at 09:38:50PM +0900, Ryo Tsuruta wrote:
> > > > Hi Andrea and Kamezawa-san,
> > > >
> > > > > Ryo, it would be great if you can look at this and fix/integrate into
> > > > > the mainstream bio-cgroup. Otherwise I can try to to schedule this in my
> > > > > work.
> > > >
> > > > O.K. I'll apply those fixes and post patches as soon as I can.
> > > >
> > >
> > > Very good! I've just tested the bio_cgroup_id inclusion in
> > > page_cgroup->flags. I'm posting the patch on-top-of my patchset.
> > >
> > > If you're interested, it should apply cleanly to the original
> > > bio-cgroup, except for the get/put_cgroup_from_page() part.
> > >
> > > Thanks,
> > > -Andrea
> > > ---
> > > bio-cgroup: encode bio_cgroup_id in page_cgroup->flags
> > >
> > > Encode the bio_cgroup_id into the flags argument of page_cgroup as
> > > suggested by Kamezawa.
> > >
> > > Lower 16-bits of the flags attribute are used for the actual page_cgroup
> > > flags. The rest is reserved to store the bio-cgroup id.
> > >
> > > This allows to save 4 bytes (in 32-bit architectures) or 8 bytes (in
> > > 64-bit) for each page_cgroup element.
> > >
> > > Signed-off-by: Andrea Righi <[email protected]>
> > > ---
> > > include/linux/biotrack.h | 2 +-
> > > include/linux/page_cgroup.h | 24 +++++++++++++++++++++---
> > > mm/biotrack.c | 26 ++++++++++++--------------
> > > 3 files changed, 34 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
> > > index 25b8810..4bd0242 100644
> > > --- a/include/linux/biotrack.h
> > > +++ b/include/linux/biotrack.h
> > > @@ -28,7 +28,7 @@ struct bio_cgroup {
> > >
> > > static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
> > > {
> > > - pc->bio_cgroup_id = 0;
> > > + page_cgroup_set_bio_id(pc, 0);
> > > }
> > >
> > > extern struct cgroup *get_cgroup_from_page(struct page *page);
> > > diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> > > index 00a49c5..af780a4 100644
> > > --- a/include/linux/page_cgroup.h
> > > +++ b/include/linux/page_cgroup.h
> > > @@ -16,12 +16,30 @@ struct page_cgroup {
> > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > struct mem_cgroup *mem_cgroup;
> > > #endif
> > > -#ifdef CONFIG_CGROUP_BIO
> > > - int bio_cgroup_id;
> > > -#endif
> > > struct list_head lru; /* per cgroup LRU list */
> > > };
> > >
> > > +#ifdef CONFIG_CGROUP_BIO
> > > +/*
> > > + * use lower 16 bits for flags and reserve the rest for the bio-cgroup id
> > > + */
> > > +#define BIO_CGROUP_ID_SHIFT (16)
> > > +#define BIO_CGROUP_ID_BITS (8 * sizeof(unsigned long) - BIO_CGROUP_ID_SHIFT)
> > > +
> > > +static inline unsigned long page_cgroup_get_bio_id(struct page_cgroup *pc)
> > > +{
> > > + return pc->flags >> BIO_CGROUP_ID_SHIFT;
> > > +}
> > > +
> > > +static inline void page_cgroup_set_bio_id(struct page_cgroup *pc,
> > > + unsigned long id)
> > > +{
> > > + WARN_ON(id >= (1UL << BIO_CGROUP_ID_BITS));
> > > + pc->flags &= (1UL << BIO_CGROUP_ID_SHIFT) - 1;
> > > + pc->flags |= (unsigned long)(id << BIO_CGROUP_ID_SHIFT);
> > > +}
> > > +#endif
> > > +
> > Ah, there is "Lock" bit in pc->flags and above "set" code does read-modify-write
> > without lock_page_cgroup().
> >
> > Could you use lock_page_cgroup() or cmpxchg ? (or using something magical technique ?)
>
> If I'm not wrong this should guarantee atomicity without using
> lock_page_cgroup().

thread A                          thread B
=================                 ======================
val = pc->flags
                                  lock_page_cgroup()
                                  pc->flags |= hogehoge
                                  unlock_page_cgroup()


*And* we may add more flags to page_cgroup. Please avoid corner cases.

Thanks,
-Kame
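
For reference, the window shown above could also be closed without taking
lock_page_cgroup(), along the lines of the cmpxchg Kamezawa suggested. A rough
sketch, not part of the posted patches, reusing the names from the patch:

static inline void page_cgroup_set_bio_id(struct page_cgroup *pc,
					  unsigned long id)
{
	unsigned long old, new;

	do {
		old = pc->flags;	/* snapshot the flag bits and the old id */
		new = (old & ((1UL << BIO_CGROUP_ID_SHIFT) - 1)) |
		      (id << BIO_CGROUP_ID_SHIFT);
	} while (cmpxchg(&pc->flags, old, new) != old);	/* retry if flags changed */
}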


2009-04-17 00:22:28

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Thu, 16 Apr 2009 15:29:37 -0700
Andrew Morton <[email protected]> wrote:

> On Tue, 14 Apr 2009 22:21:14 +0200
> Andrea Righi <[email protected]> wrote:
>
> > Subject: [PATCH 3/9] bio-cgroup controller
>
> Sorry, but I have to register extreme distress at the name of this.
> The term "bio" is well-established in the kernel and here we have a new
> definition for the same term: "block I/O".
>
> "bio" was a fine term for you to have chosen from the user's
> perspective, but from the kernel developer perspective it is quite
> horrid. The patch adds a vast number of new symbols all into the
> existing "bio_" namespace, many of which aren't related to `struct bio'
> at all.
>
> At least, I think that's what's happening. Perhaps the controller
> really _is_ designed to track `struct bio'? If so, that's an odd thing
> to tell userspace about.
>
Hmm, how about iotrack-cgroup ?

Thanks,
-Kame


>
> > The controller bio-cgroup is used by io-throttle to track writeback IO
> > and to properly apply throttling.
>
> Presumably it tracks all forms of block-based I/O and not just delayed
> writeback.
>

2009-04-17 01:01:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Fri, 17 Apr 2009 09:20:40 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Thu, 16 Apr 2009 15:29:37 -0700
> Andrew Morton <[email protected]> wrote:
>
> > On Tue, 14 Apr 2009 22:21:14 +0200
> > Andrea Righi <[email protected]> wrote:
> >
> > > Subject: [PATCH 3/9] bio-cgroup controller
> >
> > Sorry, but I have to register extreme distress at the name of this.
> > The term "bio" is well-established in the kernel and here we have a new
> > definition for the same term: "block I/O".
> >
> > "bio" was a fine term for you to have chosen from the user's
> > perspective, but from the kernel developer perspective it is quite
> > horrid. The patch adds a vast number of new symbols all into the
> > existing "bio_" namespace, many of which aren't related to `struct bio'
> > at all.
> >
> > At least, I think that's what's happening. Perhaps the controller
> > really _is_ designed to track `struct bio'? If so, that's an odd thing
> > to tell userspace about.
> >
> Hmm, how about iotrack-cgroup ?
>

Well. blockio_cgroup has the same character count and is more specific.

2009-04-17 01:26:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Tue, 14 Apr 2009 22:21:12 +0200
Andrea Righi <[email protected]> wrote:

> +Example:
> +* Create an association between an io-throttle group and a bio-cgroup group
> + with "bio" and "blockio" subsystems mounted in different mount points:
> + # mount -t cgroup -o bio bio-cgroup /mnt/bio-cgroup/
> + # cd /mnt/bio-cgroup/
> + # mkdir bio-grp
> + # cat bio-grp/bio.id
> + 1
> + # mount -t cgroup -o blockio blockio /mnt/io-throttle
> + # cd /mnt/io-throttle
> + # mkdir foo
> + # echo 1 > foo/blockio.bio_id

Why do we need multiple cgroups at once to track I/O ?
Seems complicated to me.

Thanks,
-Kame

2009-04-17 01:44:44

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

From: Andrew Morton <[email protected]>
Subject: Re: [PATCH 3/9] bio-cgroup controller
Date: Thu, 16 Apr 2009 17:44:28 -0700

> On Fri, 17 Apr 2009 09:20:40 +0900 KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Thu, 16 Apr 2009 15:29:37 -0700
> > Andrew Morton <[email protected]> wrote:
> >
> > > On Tue, 14 Apr 2009 22:21:14 +0200
> > > Andrea Righi <[email protected]> wrote:
> > >
> > > > Subject: [PATCH 3/9] bio-cgroup controller
> > >
> > > Sorry, but I have to register extreme distress at the name of this.
> > > The term "bio" is well-established in the kernel and here we have a new
> > > definition for the same term: "block I/O".
> > >
> > > "bio" was a fine term for you to have chosen from the user's
> > > perspective, but from the kernel developer perspective it is quite
> > > horrid. The patch adds a vast number of new symbols all into the
> > > existing "bio_" namespace, many of which aren't related to `struct bio'
> > > at all.
> > >
> > > At least, I think that's what's happening. Perhaps the controller
> > > really _is_ designed to track `struct bio'? If so, that's an odd thing
> > > to tell userspace about.
> > >
> > Hmm, how about iotrack-cgroup ?
> >
>
> Well. blockio_cgroup has the same character count and is more specific.

How about blkio_cgroup ?

Thanks,
Ryo Tsuruta

2009-04-17 01:44:59

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

Hi,

I have a few questions.
- I have not yet fully understood how your controller is using
bio_cgroup. If my view is wrong, please tell me.

o In my view, bio_cgroup's implementation strongly depends on
page_cgroup's. Could you explain for what purpose this functionality
itself should be implemented as a cgroup subsystem?
Isn't using page_cgroup and implementing tracking APIs enough?


> +config CGROUP_BIO
> + bool "Block I/O cgroup subsystem"
> + depends on CGROUPS && BLOCK
> + select MM_OWNER
> + help
> + Provides a Resource Controller which enables tracking the owner
> + of every Block I/O request.
> + The information this subsystem provides can be used from any
> + kind of module such as dm-ioband device mapper modules or
> + the cfq-scheduler.

o I can understand that this kind of information will be useful for IO
controllers, but how about the cfq scheduler? Don't we need some changes to
make cfq use this kind of information?


Thanks,
Takuya Yoshikawa






Andrea Righi wrote:
> From: Ryo Tsuruta <[email protected]>
>
> With writeback IO processed asynchronously by kernel threads (pdflush)
> the real writes to the underlying block devices can occur in a different
> IO context with respect to the task that originally generated the dirty
> pages involved in the IO operation.
>
> The controller bio-cgroup is used by io-throttle to track writeback IO
> and to properly apply throttling.
>
> Also apply a patch by Gui Jianfeng to announce tasks moving in
> bio-cgroup groups.
>
> See also: http://people.valinux.co.jp/~ryov/bio-cgroup
>
> Signed-off-by: Gui Jianfeng <[email protected]>
> Signed-off-by: Ryo Tsuruta <[email protected]>
> Signed-off-by: Hirokazu Takahashi <[email protected]>
> ---
> block/blk-ioc.c | 30 ++--
> fs/buffer.c | 2 +
> fs/direct-io.c | 2 +
> include/linux/biotrack.h | 95 +++++++++++
> include/linux/cgroup_subsys.h | 6 +
> include/linux/iocontext.h | 1 +
> include/linux/memcontrol.h | 6 +
> include/linux/mmzone.h | 4 +-
> include/linux/page_cgroup.h | 13 ++-
> init/Kconfig | 15 ++
> mm/Makefile | 4 +-
> mm/biotrack.c | 349 +++++++++++++++++++++++++++++++++++++++++
> mm/bounce.c | 2 +
> mm/filemap.c | 2 +
> mm/memcontrol.c | 5 +
> mm/memory.c | 5 +
> mm/page-writeback.c | 2 +
> mm/page_cgroup.c | 17 ++-
> mm/swap_state.c | 2 +
> 19 files changed, 536 insertions(+), 26 deletions(-)
> create mode 100644 include/linux/biotrack.h
> create mode 100644 mm/biotrack.c
>
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 012f065..ef8cac0 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -84,24 +84,28 @@ void exit_io_context(void)
> }
> }
>
> +void init_io_context(struct io_context *ioc)
> +{
> + atomic_set(&ioc->refcount, 1);
> + atomic_set(&ioc->nr_tasks, 1);
> + spin_lock_init(&ioc->lock);
> + ioc->ioprio_changed = 0;
> + ioc->ioprio = 0;
> + ioc->last_waited = jiffies; /* doesn't matter... */
> + ioc->nr_batch_requests = 0; /* because this is 0 */
> + ioc->aic = NULL;
> + INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
> + INIT_HLIST_HEAD(&ioc->cic_list);
> + ioc->ioc_data = NULL;
> +}
> +
> struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
> {
> struct io_context *ret;
>
> ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
> - if (ret) {
> - atomic_set(&ret->refcount, 1);
> - atomic_set(&ret->nr_tasks, 1);
> - spin_lock_init(&ret->lock);
> - ret->ioprio_changed = 0;
> - ret->ioprio = 0;
> - ret->last_waited = jiffies; /* doesn't matter... */
> - ret->nr_batch_requests = 0; /* because this is 0 */
> - ret->aic = NULL;
> - INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
> - INIT_HLIST_HEAD(&ret->cic_list);
> - ret->ioc_data = NULL;
> - }
> + if (ret)
> + init_io_context(ret);
>
> return ret;
> }
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 13edf7a..bc72150 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -36,6 +36,7 @@
> #include <linux/buffer_head.h>
> #include <linux/task_io_accounting_ops.h>
> #include <linux/bio.h>
> +#include <linux/biotrack.h>
> #include <linux/notifier.h>
> #include <linux/cpu.h>
> #include <linux/bitops.h>
> @@ -655,6 +656,7 @@ static void __set_page_dirty(struct page *page,
> if (page->mapping) { /* Race with truncate? */
> WARN_ON_ONCE(warn && !PageUptodate(page));
> account_page_dirtied(page, mapping);
> + bio_cgroup_reset_owner_pagedirty(page, current->mm);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> }
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index da258e7..ec42362 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -33,6 +33,7 @@
> #include <linux/err.h>
> #include <linux/blkdev.h>
> #include <linux/buffer_head.h>
> +#include <linux/biotrack.h>
> #include <linux/rwsem.h>
> #include <linux/uio.h>
> #include <asm/atomic.h>
> @@ -799,6 +800,7 @@ static int do_direct_IO(struct dio *dio)
> ret = PTR_ERR(page);
> goto out;
> }
> + bio_cgroup_reset_owner(page, current->mm);
>
> while (block_in_page < blocks_per_page) {
> unsigned offset_in_page = block_in_page << blkbits;
> diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
> new file mode 100644
> index 0000000..25b8810
> --- /dev/null
> +++ b/include/linux/biotrack.h
> @@ -0,0 +1,95 @@
> +#include <linux/cgroup.h>
> +#include <linux/mm.h>
> +#include <linux/page_cgroup.h>
> +
> +#ifndef _LINUX_BIOTRACK_H
> +#define _LINUX_BIOTRACK_H
> +
> +#ifdef CONFIG_CGROUP_BIO
> +
> +struct tsk_move_msg {
> + int old_id;
> + int new_id;
> + struct task_struct *tsk;
> +};
> +
> +extern int register_biocgroup_notifier(struct notifier_block *nb);
> +extern int unregister_biocgroup_notifier(struct notifier_block *nb);
> +
> +struct io_context;
> +struct block_device;
> +
> +struct bio_cgroup {
> + struct cgroup_subsys_state css;
> + int id;
> + struct io_context *io_context; /* default io_context */
> +/* struct radix_tree_root io_context_root; per device io_context */
> +};
> +
> +static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
> +{
> + pc->bio_cgroup_id = 0;
> +}
> +
> +extern struct cgroup *get_cgroup_from_page(struct page *page);
> +extern void put_cgroup_from_page(struct page *page);
> +extern struct cgroup *bio_id_to_cgroup(int id);
> +
> +static inline int bio_cgroup_disabled(void)
> +{
> + return bio_cgroup_subsys.disabled;
> +}
> +
> +extern void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
> +extern void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
> +extern void bio_cgroup_reset_owner_pagedirty(struct page *page,
> + struct mm_struct *mm);
> +extern void bio_cgroup_copy_owner(struct page *page, struct page *opage);
> +
> +extern struct io_context *get_bio_cgroup_iocontext(struct bio *bio);
> +extern int get_bio_cgroup_id(struct bio *bio);
> +
> +#else /* CONFIG_CGROUP_BIO */
> +
> +struct bio_cgroup;
> +
> +static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
> +{
> +}
> +
> +static inline int bio_cgroup_disabled(void)
> +{
> + return 1;
> +}
> +
> +static inline void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> +{
> +}
> +
> +static inline void bio_cgroup_reset_owner(struct page *page,
> + struct mm_struct *mm)
> +{
> +}
> +
> +static inline void bio_cgroup_reset_owner_pagedirty(struct page *page,
> + struct mm_struct *mm)
> +{
> +}
> +
> +static inline void bio_cgroup_copy_owner(struct page *page, struct page *opage)
> +{
> +}
> +
> +static inline struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
> +{
> + return NULL;
> +}
> +
> +static inline int get_bio_cgroup_id(struct bio *bio)
> +{
> + return 0;
> +}
> +
> +#endif /* CONFIG_CGROUP_BIO */
> +
> +#endif /* _LINUX_BIOTRACK_H */
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 9c8d31b..5df23f8 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
>
> /* */
>
> +#ifdef CONFIG_CGROUP_BIO
> +SUBSYS(bio_cgroup)
> +#endif
> +
> +/* */
> +
> #ifdef CONFIG_CGROUP_DEVICE
> SUBSYS(devices)
> #endif
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index 08b987b..be37c27 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -104,6 +104,7 @@ int put_io_context(struct io_context *ioc);
> void exit_io_context(void);
> struct io_context *get_io_context(gfp_t gfp_flags, int node);
> struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
> +void init_io_context(struct io_context *ioc);
> void copy_io_context(struct io_context **pdst, struct io_context **psrc);
> #else
> static inline void exit_io_context(void)
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..f3e0e64 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -37,6 +37,8 @@ struct mm_struct;
> * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> */
>
> +extern void __init_mem_page_cgroup(struct page_cgroup *pc);
> +
> extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> /* for swap handling */
> @@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> +static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
> +{
> +}
> +
> static inline int mem_cgroup_newpage_charge(struct page *page,
> struct mm_struct *mm, gfp_t gfp_mask)
> {
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 186ec6a..47a6f55 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -607,7 +607,7 @@ typedef struct pglist_data {
> int nr_zones;
> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> struct page *node_mem_map;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> struct page_cgroup *node_page_cgroup;
> #endif
> #endif
> @@ -958,7 +958,7 @@ struct mem_section {
>
> /* See declaration of similar field in struct zone */
> unsigned long *pageblock_flags;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> /*
> * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
> * section. (see memcontrol.h/page_cgroup.h about this.)
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 7339c7b..a7249bb 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -1,7 +1,7 @@
> #ifndef __LINUX_PAGE_CGROUP_H
> #define __LINUX_PAGE_CGROUP_H
>
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> #include <linux/bit_spinlock.h>
> /*
> * Page Cgroup can be considered as an extended mem_map.
> @@ -12,9 +12,16 @@
> */
> struct page_cgroup {
> unsigned long flags;
> - struct mem_cgroup *mem_cgroup;
> struct page *page;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + struct mem_cgroup *mem_cgroup;
> +#endif
> +#ifdef CONFIG_CGROUP_BIO
> + int bio_cgroup_id;
> +#endif
> +#if defined(CONFIG_CGROUP_MEM_RES_CTLR) || defined(CONFIG_CGROUP_BIO)
> struct list_head lru; /* per cgroup LRU list */
> +#endif
> };
>
> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> @@ -71,7 +78,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
> bit_spin_unlock(PCG_LOCK, &pc->flags);
> }
>
> -#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +#else /* CONFIG_CGROUP_PAGE */
> struct page_cgroup;
>
> static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> diff --git a/init/Kconfig b/init/Kconfig
> index 7be4d38..8f7b23c 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -606,8 +606,23 @@ config CGROUP_MEM_RES_CTLR_SWAP
> Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
> size is 4096bytes, 512k per 1Gbytes of swap.
>
> +config CGROUP_BIO
> + bool "Block I/O cgroup subsystem"
> + depends on CGROUPS && BLOCK
> + select MM_OWNER
> + help
> + Provides a Resource Controller which enables tracking the owner
> + of every Block I/O request.
> + The information this subsystem provides can be used from any
> + kind of module such as dm-ioband device mapper modules or
> + the cfq-scheduler.
> +
> endif # CGROUPS
>
> +config CGROUP_PAGE
> + def_bool y
> + depends on CGROUP_MEM_RES_CTLR || CGROUP_BIO
> +
> config MM_OWNER
> bool
>
> diff --git a/mm/Makefile b/mm/Makefile
> index ec73c68..a78a437 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -37,4 +37,6 @@ else
> obj-$(CONFIG_SMP) += allocpercpu.o
> endif
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
> +obj-$(CONFIG_CGROUP_BIO) += biotrack.o
> diff --git a/mm/biotrack.c b/mm/biotrack.c
> new file mode 100644
> index 0000000..d3a35f1
> --- /dev/null
> +++ b/mm/biotrack.c
> @@ -0,0 +1,349 @@
> +/* biotrack.c - Block I/O Tracking
> + *
> + * Copyright (C) VA Linux Systems Japan, 2008
> + * Developed by Hirokazu Takahashi <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/smp.h>
> +#include <linux/bit_spinlock.h>
> +#include <linux/idr.h>
> +#include <linux/blkdev.h>
> +#include <linux/biotrack.h>
> +
> +#define MOVETASK 0
> +static BLOCKING_NOTIFIER_HEAD(biocgroup_chain);
> +
> +int register_biocgroup_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_register(&biocgroup_chain, nb);
> +}
> +EXPORT_SYMBOL(register_biocgroup_notifier);
> +
> +int unregister_biocgroup_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_unregister(&biocgroup_chain, nb);
> +}
> +EXPORT_SYMBOL(unregister_biocgroup_notifier);
> +
> +/*
> + * The block I/O tracking mechanism is implemented on the cgroup memory
> + * controller framework. It helps to find the owner of an I/O request
> + * because every I/O request has a target page and the owner of the page
> + * can be easily determined on the framework.
> + */
> +
> +/* Return the bio_cgroup that associates with a cgroup. */
> +static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp)
> +{
> + return container_of(cgroup_subsys_state(cgrp, bio_cgroup_subsys_id),
> + struct bio_cgroup, css);
> +}
> +
> +/* Return the bio_cgroup that associates with a process. */
> +static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
> +{
> + return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
> + struct bio_cgroup, css);
> +}
> +
> +static struct idr bio_cgroup_id;
> +static DEFINE_SPINLOCK(bio_cgroup_idr_lock);
> +static struct io_context default_bio_io_context;
> +static struct bio_cgroup default_bio_cgroup = {
> + .id = 0,
> + .io_context = &default_bio_io_context,
> +};
> +
> +/*
> + * This function is used to make a given page have the bio-cgroup id of
> + * the owner of this page.
> + */
> +void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> +{
> + struct bio_cgroup *biog;
> + struct page_cgroup *pc;
> +
> + if (bio_cgroup_disabled())
> + return;
> + pc = lookup_page_cgroup(page);
> + if (unlikely(!pc))
> + return;
> +
> + pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
> + if (!mm)
> + return;
> + /*
> + * Locking "pc" isn't necessary here since the current process is
> + * the only one that can access the members related to bio_cgroup.
> + */
> + rcu_read_lock();
> + biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
> + if (unlikely(!biog))
> + goto out;
> + /*
> + * css_get(&bio->css) isn't called to increment the reference
> + * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
> + * invalid even if this page is still active.
> + * This approach is chosen to minimize the overhead.
> + */
> + pc->bio_cgroup_id = biog->id;
> +out:
> + rcu_read_unlock();
> +}
> +
> +/*
> + * Change the owner of a given page if necessary.
> + */
> +void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
> +{
> + /*
> + * A little trick:
> + * Just call bio_cgroup_set_owner() for pages which are already
> + * active since the bio_cgroup_id member of page_cgroup can be
> + * updated without any locks. This is because an integer type of
> + * variable can be set a new value at once on modern cpus.
> + */
> + bio_cgroup_set_owner(page, mm);
> +}
> +
> +/*
> + * Change the owner of a given page. This function is only effective for
> + * pages in the pagecache.
> + */
> +void bio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
> +{
> + if (PageSwapCache(page) || PageAnon(page))
> + return;
> + if (current->flags & PF_MEMALLOC)
> + return;
> +
> + bio_cgroup_reset_owner(page, mm);
> +}
> +
> +/*
> + * Assign "page" the same owner as "opage."
> + */
> +void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
> +{
> + struct page_cgroup *npc, *opc;
> +
> + if (bio_cgroup_disabled())
> + return;
> + npc = lookup_page_cgroup(npage);
> + if (unlikely(!npc))
> + return;
> + opc = lookup_page_cgroup(opage);
> + if (unlikely(!opc))
> + return;
> +
> + /*
> + * Do this without any locks. The reason is the same as
> + * bio_cgroup_reset_owner().
> + */
> + npc->bio_cgroup_id = opc->bio_cgroup_id;
> +}
> +
> +/* Create a new bio-cgroup. */
> +static struct cgroup_subsys_state *
> +bio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + struct bio_cgroup *biog;
> + struct io_context *ioc;
> + int ret;
> +
> + if (!cgrp->parent) {
> + biog = &default_bio_cgroup;
> + init_io_context(biog->io_context);
> + /* Increment the reference count so that it is never released. */
> + atomic_inc(&biog->io_context->refcount);
> + idr_init(&bio_cgroup_id);
> + return &biog->css;
> + }
> +
> + biog = kzalloc(sizeof(*biog), GFP_KERNEL);
> + ioc = alloc_io_context(GFP_KERNEL, -1);
> + if (!ioc || !biog) {
> + ret = -ENOMEM;
> + goto out_err;
> + }
> + biog->io_context = ioc;
> +retry:
> + if (!idr_pre_get(&bio_cgroup_id, GFP_KERNEL)) {
> + ret = -EAGAIN;
> + goto out_err;
> + }
> + spin_lock_irq(&bio_cgroup_idr_lock);
> + ret = idr_get_new_above(&bio_cgroup_id, (void *)biog, 1, &biog->id);
> + spin_unlock_irq(&bio_cgroup_idr_lock);
> + if (ret == -EAGAIN)
> + goto retry;
> + else if (ret)
> + goto out_err;
> +
> + return &biog->css;
> +out_err:
> + kfree(biog);
> + if (ioc)
> + put_io_context(ioc);
> + return ERR_PTR(ret);
> +}
> +
> +/* Delete the bio-cgroup. */
> +static void bio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + struct bio_cgroup *biog = cgroup_bio(cgrp);
> +
> + put_io_context(biog->io_context);
> +
> + spin_lock_irq(&bio_cgroup_idr_lock);
> + idr_remove(&bio_cgroup_id, biog->id);
> + spin_unlock_irq(&bio_cgroup_idr_lock);
> +
> + kfree(biog);
> +}
> +
> +static struct bio_cgroup *find_bio_cgroup(int id)
> +{
> + struct bio_cgroup *biog;
> + spin_lock_irq(&bio_cgroup_idr_lock);
> + /*
> + * It might fail to find a bio-cgroup associated with "id" since it
> + * is allowed to remove the bio-cgroup even when some of the I/O
> + * requests this group issued haven't completed yet.
> + */
> + biog = (struct bio_cgroup *)idr_find(&bio_cgroup_id, id);
> + spin_unlock_irq(&bio_cgroup_idr_lock);
> + return biog;
> +}
> +
> +struct cgroup *bio_id_to_cgroup(int id)
> +{
> + struct bio_cgroup *biog;
> +
> + biog = find_bio_cgroup(id);
> + if (biog)
> + return biog->css.cgroup;
> +
> + return NULL;
> +}
> +
> +struct cgroup *get_cgroup_from_page(struct page *page)
> +{
> + struct page_cgroup *pc;
> + struct bio_cgroup *biog;
> + struct cgroup *cgrp = NULL;
> +
> + pc = lookup_page_cgroup(page);
> + if (!pc)
> + return NULL;
> + lock_page_cgroup(pc);
> + biog = find_bio_cgroup(pc->bio_cgroup_id);
> + if (biog) {
> + css_get(&biog->css);
> + cgrp = biog->css.cgroup;
> + }
> + unlock_page_cgroup(pc);
> + return cgrp;
> +}
> +
> +void put_cgroup_from_page(struct page *page)
> +{
> + struct bio_cgroup *biog;
> + struct page_cgroup *pc;
> +
> + pc = lookup_page_cgroup(page);
> + if (!pc)
> + return;
> + lock_page_cgroup(pc);
> + biog = find_bio_cgroup(pc->bio_cgroup_id);
> + if (biog)
> + css_put(&biog->css);
> + unlock_page_cgroup(pc);
> +}
> +
> +/* Determine the bio-cgroup id of a given bio. */
> +int get_bio_cgroup_id(struct bio *bio)
> +{
> + struct page_cgroup *pc;
> + struct page *page = bio_iovec_idx(bio, 0)->bv_page;
> + int id = 0;
> +
> + pc = lookup_page_cgroup(page);
> + if (pc)
> + id = pc->bio_cgroup_id;
> + return id;
> +}
> +EXPORT_SYMBOL(get_bio_cgroup_id);
> +
> +/* Determine the iocontext of the bio-cgroup that issued a given bio. */
> +struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
> +{
> + struct bio_cgroup *biog = NULL;
> + struct io_context *ioc;
> + int id = 0;
> +
> + id = get_bio_cgroup_id(bio);
> + if (id)
> + biog = find_bio_cgroup(id);
> + if (!biog)
> + biog = &default_bio_cgroup;
> + ioc = biog->io_context; /* default io_context for this cgroup */
> + atomic_inc(&ioc->refcount);
> + return ioc;
> +}
> +EXPORT_SYMBOL(get_bio_cgroup_iocontext);
> +
> +static u64 bio_id_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> + struct bio_cgroup *biog = cgroup_bio(cgrp);
> + return (u64) biog->id;
> +}
> +
> +
> +static struct cftype bio_files[] = {
> + {
> + .name = "id",
> + .read_u64 = bio_id_read,
> + },
> +};
> +
> +static int bio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + return cgroup_add_files(cgrp, ss, bio_files, ARRAY_SIZE(bio_files));
> +}
> +
> +static void bio_cgroup_attach(struct cgroup_subsys *ss,
> + struct cgroup *cont, struct cgroup *oldcont,
> + struct task_struct *tsk)
> +{
> + struct tsk_move_msg tmm;
> + struct bio_cgroup *old_biog, *new_biog;
> +
> + old_biog = cgroup_bio(oldcont);
> + new_biog = cgroup_bio(cont);
> + tmm.old_id = old_biog->id;
> + tmm.new_id = new_biog->id;
> + tmm.tsk = tsk;
> + blocking_notifier_call_chain(&biocgroup_chain, MOVETASK, &tmm);
> +}
> +
> +struct cgroup_subsys bio_cgroup_subsys = {
> + .name = "bio",
> + .create = bio_cgroup_create,
> + .destroy = bio_cgroup_destroy,
> + .populate = bio_cgroup_populate,
> + .attach = bio_cgroup_attach,
> + .subsys_id = bio_cgroup_subsys_id,
> +};
> +
> diff --git a/mm/bounce.c b/mm/bounce.c
> index e590272..1a01905 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -14,6 +14,7 @@
> #include <linux/hash.h>
> #include <linux/highmem.h>
> #include <linux/blktrace_api.h>
> +#include <linux/biotrack.h>
> #include <trace/block.h>
> #include <asm/tlbflush.h>
>
> @@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
> to->bv_len = from->bv_len;
> to->bv_offset = from->bv_offset;
> inc_zone_page_state(to->bv_page, NR_BOUNCE);
> + bio_cgroup_copy_owner(to->bv_page, page);
>
> if (rw == WRITE) {
> char *vto, *vfrom;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8bd4980..1ab32a2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -33,6 +33,7 @@
> #include <linux/cpuset.h>
> #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
> #include <linux/memcontrol.h>
> +#include <linux/biotrack.h>
> #include <linux/mm_inline.h> /* for page_is_file_cache() */
> #include "internal.h"
>
> @@ -463,6 +464,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> gfp_mask & GFP_RECLAIM_MASK);
> if (error)
> goto out;
> + bio_cgroup_set_owner(page, current->mm);
>
> error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> if (error == 0) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e44fb0f..c25eb63 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2524,6 +2524,11 @@ struct cgroup_subsys mem_cgroup_subsys = {
> .use_id = 1,
> };
>
> +void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
> +{
> + pc->mem_cgroup = NULL;
> +}
> +
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>
> static int __init disable_swap_account(char *s)
> diff --git a/mm/memory.c b/mm/memory.c
> index cf6873e..7779e12 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -51,6 +51,7 @@
> #include <linux/init.h>
> #include <linux/writeback.h>
> #include <linux/memcontrol.h>
> +#include <linux/biotrack.h>
> #include <linux/mmu_notifier.h>
> #include <linux/kallsyms.h>
> #include <linux/swapops.h>
> @@ -2052,6 +2053,7 @@ gotten:
> * thread doing COW.
> */
> ptep_clear_flush_notify(vma, address, page_table);
> + bio_cgroup_set_owner(new_page, mm);
> page_add_new_anon_rmap(new_page, vma, address);
> set_pte_at(mm, address, page_table, entry);
> update_mmu_cache(vma, address, entry);
> @@ -2497,6 +2499,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> flush_icache_page(vma, page);
> set_pte_at(mm, address, page_table, pte);
> page_add_anon_rmap(page, vma, address);
> + bio_cgroup_reset_owner(page, mm);
> /* It's better to call commit-charge after rmap is established */
> mem_cgroup_commit_charge_swapin(page, ptr);
>
> @@ -2559,6 +2562,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> if (!pte_none(*page_table))
> goto release;
> inc_mm_counter(mm, anon_rss);
> + bio_cgroup_set_owner(page, mm);
> page_add_new_anon_rmap(page, vma, address);
> set_pte_at(mm, address, page_table, entry);
>
> @@ -2711,6 +2715,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> if (anon) {
> inc_mm_counter(mm, anon_rss);
> + bio_cgroup_set_owner(page, mm);
> page_add_new_anon_rmap(page, vma, address);
> } else {
> inc_mm_counter(mm, file_rss);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 30351f0..1379eb0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -26,6 +26,7 @@
> #include <linux/blkdev.h>
> #include <linux/mpage.h>
> #include <linux/rmap.h>
> +#include <linux/biotrack.h>
> #include <linux/percpu.h>
> #include <linux/notifier.h>
> #include <linux/smp.h>
> @@ -1243,6 +1244,7 @@ int __set_page_dirty_nobuffers(struct page *page)
> BUG_ON(mapping2 != mapping);
> WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> account_page_dirtied(page, mapping);
> + bio_cgroup_reset_owner_pagedirty(page, current->mm);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> }
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 791905c..f692ee2 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -9,13 +9,16 @@
> #include <linux/vmalloc.h>
> #include <linux/cgroup.h>
> #include <linux/swapops.h>
> +#include <linux/memcontrol.h>
> +#include <linux/biotrack.h>
>
> static void __meminit
> __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> {
> pc->flags = 0;
> - pc->mem_cgroup = NULL;
> pc->page = pfn_to_page(pfn);
> + __init_mem_page_cgroup(pc);
> + __init_bio_page_cgroup(pc);
> INIT_LIST_HEAD(&pc->lru);
> }
> static unsigned long total_usage;
> @@ -74,7 +77,7 @@ void __init page_cgroup_init(void)
>
> int nid, fail;
>
> - if (mem_cgroup_disabled())
> + if (mem_cgroup_disabled() && bio_cgroup_disabled())
> return;
>
> for_each_online_node(nid) {
> @@ -83,12 +86,12 @@ void __init page_cgroup_init(void)
> goto fail;
> }
> printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> - printk(KERN_INFO "please try cgroup_disable=memory option if you"
> + printk(KERN_INFO "please try cgroup_disable=memory,bio option if you"
> " don't want\n");
> return;
> fail:
> printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
> - printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
> + printk(KERN_CRIT "please try cgroup_disable=memory,bio boot options\n");
> panic("Out of memory");
> }
>
> @@ -248,7 +251,7 @@ void __init page_cgroup_init(void)
> unsigned long pfn;
> int fail = 0;
>
> - if (mem_cgroup_disabled())
> + if (mem_cgroup_disabled() && bio_cgroup_disabled())
> return;
>
> for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> @@ -263,8 +266,8 @@ void __init page_cgroup_init(void)
> hotplug_memory_notifier(page_cgroup_callback, 0);
> }
> printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> - printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
> - " want\n");
> + printk(KERN_INFO
> + "try cgroup_disable=memory,bio option if you don't want\n");
> }
>
> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3ecea98..c7ad256 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -17,6 +17,7 @@
> #include <linux/backing-dev.h>
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> +#include <linux/biotrack.h>
> #include <linux/page_cgroup.h>
>
> #include <asm/pgtable.h>
> @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> */
> __set_page_locked(new_page);
> SetPageSwapBacked(new_page);
> + bio_cgroup_set_owner(new_page, current->mm);
> err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> if (likely(!err)) {
> /*

2009-04-17 01:51:15

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

* Andrew Morton <[email protected]> [2009-04-16 17:44:28]:

> > Hmm, how about iotrack-cgroup ?
> >
>
> Well. blockio_cgroup has the same character count and is more specific.

Sounds good to me.

--
Balbir

2009-04-17 01:55:54

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

KAMEZAWA Hiroyuki wrote:
> On Tue, 14 Apr 2009 22:21:12 +0200
> Andrea Righi <[email protected]> wrote:
>
>> +Example:
>> +* Create an association between an io-throttle group and a bio-cgroup group
>> + with "bio" and "blockio" subsystems mounted in different mount points:
>> + # mount -t cgroup -o bio bio-cgroup /mnt/bio-cgroup/
>> + # cd /mnt/bio-cgroup/
>> + # mkdir bio-grp
>> + # cat bio-grp/bio.id
>> + 1
>> + # mount -t cgroup -o blockio blockio /mnt/io-throttle
>> + # cd /mnt/io-throttle
>> + # mkdir foo
>> + # echo 1 > foo/blockio.bio_id
>
> Why do we need multiple cgroups at once to track I/O ?
> Seems complicated to me.
>

IIUC, it also prevents other subsystems from being bound together with the blockio subsys:
# mount -t cgroup -o blockio,cpuset xxx /mnt
(failed)

and if a task is moved from cg1(id=1) to cg2(id=2) in bio subsys, this task
will be moved from CG1(id=1) to CG2(id=2) automatically in blockio subsys.

All of this is odd, unexpected, complex and bug-prone, I think.

2009-04-17 02:26:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Fri, 17 Apr 2009 10:49:43 +0900
Takuya Yoshikawa <[email protected]> wrote:

> Hi,
>
> I have a few questions.
> - I have not yet fully understood how your controller is using
> bio_cgroup. If my view is wrong, please tell me.
>
> o In my view, bio_cgroup's implementation strongly depends on
> page_cgroup's. Could you explain for what purpose this functionality
> itself should be implemented as a cgroup subsystem?
> Isn't using page_cgroup and implementing tracking APIs enough?

I'll definitely do "Nack" to add full bio-cgroup members to page_cgroup.
Now, page_cgroup is 40bytes(in 64bit arch.) And all of them are allocated at
boot time as memmap. (and add member to struct page is much harder ;)

IIUC, feature for "tracking bio" is just necesary for pages for I/O.
So, I think it's much better to add misc. information to struct bio not to the page.
But, if people want to add "small hint" to struct page or struct page_cgroup
for tracking buffered I/O, I'll give you help as much as I can.
Maybe using "unused bits" in page_cgroup->flags is a choice with no overhead.

Thanks,
-Kame

2009-04-17 04:31:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Fri, 17 Apr 2009 10:44:32 +0900 (JST) Ryo Tsuruta <[email protected]> wrote:

> > > Hmm, how about iotrack-cgroup ?
> > >
> >
> > Well. blockio_cgroup has the same character count and is more specific.
>
> How about blkio_cgroup ?

Sounds good.

2009-04-17 07:22:19

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

Hi,

From: KAMEZAWA Hiroyuki <[email protected]>
Date: Fri, 17 Apr 2009 11:24:33 +0900

> On Fri, 17 Apr 2009 10:49:43 +0900
> Takuya Yoshikawa <[email protected]> wrote:
>
> > Hi,
> >
> > I have a few questions.
> > - I have not yet fully understood how your controller is using
> > bio_cgroup. If my view is wrong, please tell me.
> >
> > o In my view, bio_cgroup's implementation strongly depends on
> > page_cgroup's. Could you explain for what purpose this functionality
> > itself should be implemented as a cgroup subsystem?
> > Isn't using page_cgroup and implementing tracking APIs enough?
>
> I'll definitely say "Nack" to adding full bio-cgroup members to page_cgroup.
> Right now, page_cgroup is 40 bytes (on 64-bit archs) and all of them are
> allocated at boot time as memmap. (And adding a member to struct page is much harder ;)
>
> IIUC, the "tracking bio" feature is only necessary for pages under I/O.
> So, I think it's much better to add misc. information to struct bio, not to the page.
> But, if people want to add a "small hint" to struct page or struct page_cgroup
> for tracking buffered I/O, I'll help as much as I can.
> Maybe using "unused bits" in page_cgroup->flags is a choice with no overhead.

In the case where the bio-cgroup data is allocated dynamically,
- Sometimes quite a large amount of memory gets marked dirty.
In this case it requires more kernel memory than the
current implementation does.
- The operation is expensive due to memory allocations and exclusive
control such as spinlocks.

In the case where the bio-cgroup data is allocated with delayed allocation,
- It makes the operation complicated and expensive, because
sometimes a bio has to be created in the context of other
processes, such as aio and swap-out operations.

I'd prefer a simple and lightweight implementation. bio-cgroup only
needs 4 bytes, unlike the memory controller. The reason why bio-cgroup
chose this approach is to minimize the overhead.

Thanks,
Ryo Tsuruta
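
To put those numbers in perspective (rough arithmetic, assuming 4 KB pages):
1 GB of RAM is 262,144 pages, so a separate 4-byte id costs about 1 MB per GB
of RAM on top of the roughly 10 MB per GB already taken by the 40-byte
page_cgroup array, while folding the id into the existing flags word, as in
the patch discussed earlier, adds nothing.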

2009-04-17 07:32:22

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

Hi,

From: Takuya Yoshikawa <[email protected]>
Subject: Re: [PATCH 3/9] bio-cgroup controller
Date: Fri, 17 Apr 2009 10:49:43 +0900

> > +config CGROUP_BIO
> > + bool "Block I/O cgroup subsystem"
> > + depends on CGROUPS && BLOCK
> > + select MM_OWNER
> > + help
> > + Provides a Resource Controller which enables tracking the owner
> > + of every Block I/O request.
> > + The information this subsystem provides can be used from any
> > + kind of module such as dm-ioband device mapper modules or
> > + the cfq-scheduler.
>
>o I can understand this kind of information will be effective for io
> controllers but how about cfq-scheduler? Don't we need some changes to
> make cfq use this kind of information?

You need to modify the cfq scheduler to make it use bio-cgroup.

Thanks,
Ryo Tsuruta
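
As a rough illustration of what such a consumer could look like, a hypothetical
hook (not existing cfq code) could use the helpers exported by biotrack.c to
find out which cgroup is behind a given bio:

#include <linux/bio.h>
#include <linux/iocontext.h>
#include <linux/biotrack.h>

/* Hypothetical consumer hook: classify a bio by the cgroup owning its pages. */
static void classify_bio_by_owner(struct bio *bio)
{
	struct io_context *ioc;
	int id;

	id = get_bio_cgroup_id(bio);		/* 0 means the default bio-cgroup */
	ioc = get_bio_cgroup_iocontext(bio);	/* takes a reference on the ioc */

	/* ... pick a per-cgroup queue based on "id" or "ioc" here ... */

	put_io_context(ioc);			/* drop the reference */
}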

2009-04-17 07:35:28

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

KAMEZAWA Hiroyuki wrote:
> On Tue, 14 Apr 2009 22:21:12 +0200
> Andrea Righi <[email protected]> wrote:
>
>> +Example:
>> +* Create an association between an io-throttle group and a bio-cgroup group
>> + with "bio" and "blockio" subsystems mounted in different mount points:
>> + # mount -t cgroup -o bio bio-cgroup /mnt/bio-cgroup/
>> + # cd /mnt/bio-cgroup/
>> + # mkdir bio-grp
>> + # cat bio-grp/bio.id
>> + 1
>> + # mount -t cgroup -o blockio blockio /mnt/io-throttle
>> + # cd /mnt/io-throttle
>> + # mkdir foo
>> + # echo 1 > foo/blockio.bio_id
>
> Why do we need multiple cgroups at once to track I/O ?
> Seems complicated to me.

Hi Kamezawa-san,

The original idea behind implementing this function is to share a bio-cgroup
with other subsystems, such as dm-ioband. If the bio-cgroup is already mounted
and used by dm-ioband or others, we just need to create an association between
io-throttle and bio-cgroup by echoing a bio-cgroup id, just like dm-ioband does.

>
> Thanks,
> -Kame
>
>
>
>

--
Regards
Gui Jianfeng

2009-04-17 07:45:45

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Fri, 17 Apr 2009 15:34:53 +0800
Gui Jianfeng <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Tue, 14 Apr 2009 22:21:12 +0200
> > Andrea Righi <[email protected]> wrote:
> >
> >> +Example:
> >> +* Create an association between an io-throttle group and a bio-cgroup group
> >> + with "bio" and "blockio" subsystems mounted in different mount points:
> >> + # mount -t cgroup -o bio bio-cgroup /mnt/bio-cgroup/
> >> + # cd /mnt/bio-cgroup/
> >> + # mkdir bio-grp
> >> + # cat bio-grp/bio.id
> >> + 1
> >> + # mount -t cgroup -o blockio blockio /mnt/io-throttle
> >> + # cd /mnt/io-throttle
> >> + # mkdir foo
> >> + # echo 1 > foo/blockio.bio_id
> >
> > Why do we need multiple cgroups at once to track I/O ?
> > Seems complicated to me.
>
> Hi Kamezawa-san,
>
> The original thought to implement this function is for sharing a bio-cgroup
> with other subsystems, such as dm-ioband. If the bio-cgroup is already mounted,
> and used by dm-ioband or others, we just need to create a association between
> io-throttle and bio-cgroup by echo a bio-cgroup id, just like what dm-ioband does.
>

- Why do we need multiple I/O controllers?
- Why can't bio-cgroup be a _pure_ infrastructure like page_cgroup?
- Why do we need an extra mount?

I have no answer but, IMHO,
- only one I/O controller should be enabled at once.
- bio-cgroup should be tightly coupled with the I/O controller and should work as
infrastructure, i.e. naming/tagging of I/O should be done automatically by the
I/O controller, not by the user's hand.

Thanks,
-Kame

2009-04-17 07:48:55

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

Hi,

From: Andrew Morton <[email protected]>
Subject: Re: [PATCH 3/9] bio-cgroup controller
Date: Thu, 16 Apr 2009 21:15:14 -0700

> On Fri, 17 Apr 2009 10:44:32 +0900 (JST) Ryo Tsuruta <[email protected]> wrote:
>
> > > > Hmm, how about iotrack-cgroup ?
> > > >
> > >
> > > Well. blockio_cgroup has the same character count and is more specific.
> >
> > How about blkio_cgroup ?
>
> Sounds good.

I'll rename bio-cgroup to blkio_cgroup and post the patches to this
list next week.

Thanks,
Ryo Tsuruta

2009-04-17 08:02:00

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Fri, 17 Apr 2009 16:22:01 +0900 (JST)
Ryo Tsuruta <[email protected]> wrote:

> In the case where the bio-cgroup data is allocated dynamically,
> - Sometimes quite a large amount of memory get marked dirty.
> In this case it requires more kernel memory than that of the
> current implementation.
> - The operation is expensive due to memory allocations and exclusive
> controls by such as spinlocks.
>
> In the case where the bio-cgroup data is allocated by delayed allocation,
> - It makes the operation complicated and expensive, because
> sometimes a bio has to be created in the context of other
> processes, such as aio and swap-out operation.
>
> I'd prefer a simple and lightweight implementation. bio-cgroup only
> needs 4bytes unlike memory controller. The reason why bio-cgroup chose
> this approach is to minimize the overhead.
>
My point is, please do your best to reduce memory usage here. You increase
the size of page_cgroup just because you cannot increase the size of struct page.
That is not a sane reason to increase the size of this object.
It's a cheat, in my point of view.


Thanks,
-Kame

2009-04-17 08:50:38

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Fri, 17 Apr 2009 17:00:16 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Fri, 17 Apr 2009 16:22:01 +0900 (JST)
> Ryo Tsuruta <[email protected]> wrote:
>
> > In the case where the bio-cgroup data is allocated dynamically,
> > - Sometimes quite a large amount of memory get marked dirty.
> > In this case it requires more kernel memory than that of the
> > current implementation.
> > - The operation is expensive due to memory allocations and exclusive
> > controls by such as spinlocks.
> >
> > In the case where the bio-cgroup data is allocated by delayed allocation,
> > - It makes the operation complicated and expensive, because
> > sometimes a bio has to be created in the context of other
> > processes, such as aio and swap-out operation.
> >
> > I'd prefer a simple and lightweight implementation. bio-cgroup only
> > needs 4bytes unlike memory controller. The reason why bio-cgroup chose
> > this approach is to minimize the overhead.
> >
> My point is, plz do your best to reduce memory usage here. You increase
> size of page_cgroup just because you cannot increase size of struct page.
> It's not be sane reason to increase size of this object.
> It's a cheat in my point of view.
>

Can't this work sanely?
Hmm, is endianness an obstacle?
==
struct page_cgroup {
	union {
		struct {
			unsigned long memcg_field:16;
			unsigned long blockio_field:16;
		} field;
		unsigned long flags;	/* unsigned long is not 32 bits */
	} flags;
};
==

Thanks,
-Kame
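
The endianness concern applies to both variants: with a bit-field union, which
16 bits come first in memory depends on the architecture, and the byte-offset
store used by page_cgroup_set_bio_id() in the earlier patch likewise only hits
the upper bits on little-endian machines. A sketch of an endian-aware offset
for the latter (just an illustration, not posted code):

#include <asm/byteorder.h>

/*
 * The half of the unsigned long that holds the bio-cgroup id sits at the end
 * of the word on little-endian and at the beginning on big-endian.
 */
#ifdef __BIG_ENDIAN
#define BIO_CGROUP_ID_BYTE_OFFSET	0
#else
#define BIO_CGROUP_ID_BYTE_OFFSET	(BIO_CGROUP_ID_SHIFT / 8)
#endif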




2009-04-17 08:53:23

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Fri, 17 Apr 2009 17:48:54 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Fri, 17 Apr 2009 17:00:16 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Fri, 17 Apr 2009 16:22:01 +0900 (JST)
> > Ryo Tsuruta <[email protected]> wrote:
> >
> > > In the case where the bio-cgroup data is allocated dynamically,
> > > - Sometimes quite a large amount of memory get marked dirty.
> > > In this case it requires more kernel memory than that of the
> > > current implementation.
> > > - The operation is expensive due to memory allocations and exclusive
> > > controls by such as spinlocks.
> > >
> > > In the case where the bio-cgroup data is allocated by delayed allocation,
> > > - It makes the operation complicated and expensive, because
> > > sometimes a bio has to be created in the context of other
> > > processes, such as aio and swap-out operation.
> > >
> > > I'd prefer a simple and lightweight implementation. bio-cgroup only
> > > needs 4bytes unlike memory controller. The reason why bio-cgroup chose
> > > this approach is to minimize the overhead.
> > >
> > My point is, plz do your best to reduce memory usage here. You increase
> > size of page_cgroup just because you cannot increase size of struct page.
> > It's not be sane reason to increase size of this object.
> > It's a cheat in my point of view.
> >
>
> Can't this work sanely ?
> Hmm, endian is obstacle ?
> ==
> sturct page_cgroup {
> union {
> struct {
> unsigned long memcg_field:16;
> unsigned long blockio_field:16;
> } field;
> unsigned long flags; /* unsigned long is not 32bits */
> } flags;
> }
> ==
>
....sorry plz ignore.
-Kame

2009-04-17 09:30:00

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

KAMEZAWA Hiroyuki wrote:
> On Fri, 17 Apr 2009 15:34:53 +0800
> Gui Jianfeng <[email protected]> wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> On Tue, 14 Apr 2009 22:21:12 +0200
>>> Andrea Righi <[email protected]> wrote:
>>>
>>>> +Example:
>>>> +* Create an association between an io-throttle group and a bio-cgroup group
>>>> + with "bio" and "blockio" subsystems mounted in different mount points:
>>>> + # mount -t cgroup -o bio bio-cgroup /mnt/bio-cgroup/
>>>> + # cd /mnt/bio-cgroup/
>>>> + # mkdir bio-grp
>>>> + # cat bio-grp/bio.id
>>>> + 1
>>>> + # mount -t cgroup -o blockio blockio /mnt/io-throttle
>>>> + # cd /mnt/io-throttle
>>>> + # mkdir foo
>>>> + # echo 1 > foo/blockio.bio_id
>>> Why do we need multiple cgroups at once to track I/O ?
>>> Seems complicated to me.
>> Hi Kamezawa-san,
>>
>> The original thought to implement this function is for sharing a bio-cgroup
>> with other subsystems, such as dm-ioband. If the bio-cgroup is already mounted,
>> and used by dm-ioband or others, we just need to create a association between
>> io-throttle and bio-cgroup by echo a bio-cgroup id, just like what dm-ioband does.
>>
>
> - Why we need multiple I/O controller ?
> - Why bio-cgroup cannot be a _pure_ infrastructe as page_cgroup ?
> - Why we need extra mount ?
>
> I have no answer but, IMHO,
> - only one I/O controller should be enabled at once.
> - bio cgroup should be tightly coupled with I/O controller and should work as
> infrastructure i.e. naming/tagging I/O should be automatically done by
> I/O controller. not by the user's hand.

It seems dm-ioband has to make use of bio-cgroup by the user's hand, because dm-ioband
is not cgroup based. :(

Is it possible that another subsystem (not cgroup based, and not an IO controller) would
also like to use bio-cgroup in the future? There's no such case at least now, so I don't
object to getting rid of this part. :)

>
> Thanks,
> -Kame
>
>
>
>
>

--
Regards
Gui Jianfeng

2009-04-17 09:40:56

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Thu, Apr 16, 2009 at 03:29:37PM -0700, Andrew Morton wrote:
> On Tue, 14 Apr 2009 22:21:14 +0200
> Andrea Righi <[email protected]> wrote:
>
> > Subject: [PATCH 3/9] bio-cgroup controller
>
> Sorry, but I have to register extreme distress at the name of this.
> The term "bio" is well-established in the kernel and here we have a new
> definition for the same term: "block I/O".
>
> "bio" was a fine term for you to have chosen from the user's
> perspective, but from the kernel developer perspective it is quite
> horrid. The patch adds a vast number of new symbols all into the
> existing "bio_" namespace, many of which aren't related to `struct bio'
> at all.
>
> At least, I think that's what's happening. Perhaps the controller
> really _is_ designed to track `struct bio'? If so, that's an odd thing
> to tell userspace about.
>
>
> > The controller bio-cgroup is used by io-throttle to track writeback IO
> > and for properly apply throttling.
>
> Presumably it tracks all forms of block-based I/O and not just delayed
> writeback.

In the general case bio-cgroup tracks all forms of block IO; in this
particular case (only for the io-throttle controller) I used bio-cgroup
to track writeback IO. Synchronous IO is accounted directly in
submit_bio() and throttled as well, imposing explicit sleeps via
schedule_timeout_killable().

-Andrea

2009-04-17 09:44:35

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Fri, Apr 17, 2009 at 09:04:51AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 16 Apr 2009 12:42:36 +0200
> Andrea Righi <[email protected]> wrote:
>
> > On Thu, Apr 16, 2009 at 08:58:14AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Wed, 15 Apr 2009 15:23:57 +0200
> > > Andrea Righi <[email protected]> wrote:
> > >
> > > > On Wed, Apr 15, 2009 at 09:38:50PM +0900, Ryo Tsuruta wrote:
> > > > > Hi Andrea and Kamezawa-san,
> > > > >
> > > > > > Ryo, it would be great if you can look at this and fix/integrate into
> > > > > > the mainstream bio-cgroup. Otherwise I can try to to schedule this in my
> > > > > > work.
> > > > >
> > > > > O.K. I'll apply those fixes and post patches as soon as I can.
> > > > >
> > > >
> > > > Very good! I've just tested the bio_cgroup_id inclusion in
> > > > page_cgroup->flags. I'm posting the patch on-top-of my patchset.
> > > >
> > > > If you're interested, it should apply cleanly to the original
> > > > bio-cgroup, except for the get/put_cgroup_from_page() part.
> > > >
> > > > Thanks,
> > > > -Andrea
> > > > ---
> > > > bio-cgroup: encode bio_cgroup_id in page_cgroup->flags
> > > >
> > > > Encode the bio_cgroup_id into the flags argument of page_cgroup as
> > > > suggested by Kamezawa.
> > > >
> > > > Lower 16-bits of the flags attribute are used for the actual page_cgroup
> > > > flags. The rest is reserved to store the bio-cgroup id.
> > > >
> > > > This allows to save 4 bytes (in 32-bit architectures) or 8 bytes (in
> > > > 64-bit) for each page_cgroup element.
> > > >
> > > > Signed-off-by: Andrea Righi <[email protected]>
> > > > ---
> > > > include/linux/biotrack.h | 2 +-
> > > > include/linux/page_cgroup.h | 24 +++++++++++++++++++++---
> > > > mm/biotrack.c | 26 ++++++++++++--------------
> > > > 3 files changed, 34 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
> > > > index 25b8810..4bd0242 100644
> > > > --- a/include/linux/biotrack.h
> > > > +++ b/include/linux/biotrack.h
> > > > @@ -28,7 +28,7 @@ struct bio_cgroup {
> > > >
> > > > static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
> > > > {
> > > > - pc->bio_cgroup_id = 0;
> > > > + page_cgroup_set_bio_id(pc, 0);
> > > > }
> > > >
> > > > extern struct cgroup *get_cgroup_from_page(struct page *page);
> > > > diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> > > > index 00a49c5..af780a4 100644
> > > > --- a/include/linux/page_cgroup.h
> > > > +++ b/include/linux/page_cgroup.h
> > > > @@ -16,12 +16,30 @@ struct page_cgroup {
> > > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > > struct mem_cgroup *mem_cgroup;
> > > > #endif
> > > > -#ifdef CONFIG_CGROUP_BIO
> > > > - int bio_cgroup_id;
> > > > -#endif
> > > > struct list_head lru; /* per cgroup LRU list */
> > > > };
> > > >
> > > > +#ifdef CONFIG_CGROUP_BIO
> > > > +/*
> > > > + * use lower 16 bits for flags and reserve the rest for the bio-cgroup id
> > > > + */
> > > > +#define BIO_CGROUP_ID_SHIFT (16)
> > > > +#define BIO_CGROUP_ID_BITS (8 * sizeof(unsigned long) - BIO_CGROUP_ID_SHIFT)
> > > > +
> > > > +static inline unsigned long page_cgroup_get_bio_id(struct page_cgroup *pc)
> > > > +{
> > > > + return pc->flags >> BIO_CGROUP_ID_SHIFT;
> > > > +}
> > > > +
> > > > +static inline void page_cgroup_set_bio_id(struct page_cgroup *pc,
> > > > + unsigned long id)
> > > > +{
> > > > + WARN_ON(id >= (1UL << BIO_CGROUP_ID_BITS));
> > > > + pc->flags &= (1UL << BIO_CGROUP_ID_SHIFT) - 1;
> > > > + pc->flags |= (unsigned long)(id << BIO_CGROUP_ID_SHIFT);
> > > > +}
> > > > +#endif
> > > > +
> > > Ah, there is "Lock" bit in pc->flags and above "set" code does read-modify-write
> > > without lock_page_cgroup().
> > >
> > > Could you use lock_page_cgroup() or cmpxchg ? (or using something magical technique ?)
> >
> > If I'm not wrong this should guarantee atomicity without using
> > lock_page_cgroup().
>
> thread A thread B
> ================= ======================
> val = pc->flags
> lock_page_cgroup()
> pc->flags |= hogehoge
> unlock_page_cgroup()
>
>
> *And* we may add another flags to page_cgroup. plz avoid corner cases.

argh! right. So, better to use lock/unlock_page_cgroup(). I'll fix it, or
wait for Ryo if he decides to apply this to the mainstream bio-cgroup
(..or whatever name, I vote for blkio_cgroup BTW).
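
For illustration, a minimal sketch of the locked variant, assuming the
existing lock_page_cgroup()/unlock_page_cgroup() helpers (not the final
patch):

static inline void page_cgroup_set_bio_id(struct page_cgroup *pc,
					  unsigned long id)
{
	WARN_ON(id >= (1UL << BIO_CGROUP_ID_BITS));
	/* serialize against other read-modify-write updates of pc->flags */
	lock_page_cgroup(pc);
	pc->flags &= (1UL << BIO_CGROUP_ID_SHIFT) - 1;
	pc->flags |= id << BIO_CGROUP_ID_SHIFT;
	unlock_page_cgroup(pc);
}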

Thanks,
-Andrea

2009-04-17 09:45:28

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 0/9] cgroup: io-throttle controller (v13)

On Thu, Apr 16, 2009 at 03:24:33PM -0700, Andrew Morton wrote:
> On Tue, 14 Apr 2009 22:21:11 +0200
> Andrea Righi <[email protected]> wrote:
>
> > Objective
> > ~~~~~~~~~
> > The objective of the io-throttle controller is to improve IO performance
> > predictability of different cgroups that share the same block devices.
>
> We should get an IO controller into Linux. Does anyone have a reason
> why it shouldn't be this one?
>
> > Respect to other priority/weight-based solutions the approach used by
> > this controller is to explicitly choke applications' requests
>
> Yes, blocking the offending application at a high level has always
> seemed to me to be the best way of implementing the controller.
>
> > that
> > directly or indirectly generate IO activity in the system (this
> > controller addresses both synchronous IO and writeback/buffered IO).
>
> The problem I've seen with some of the proposed controllers was that
> they didn't handle delayed writeback very well, if at all.
>
> Can you explain at a high level but in some detail how this works? If
> an application is doing a huge write(), how is that detected and how is
> the application made to throttle?

The writeback writes are handled in three steps:

1) track the owner of the dirty pages
2) detect writeback IO
3) delay writeback IO that exceeds the cgroup limits

For 1) I basically used the bio-cgroup functionality as-is. bio-cgroup uses
the page_cgroup structure to store the owner of each dirty page when the
page is dirtied. At this point the actual owner of the page can be
retrieved by looking at current->mm->owner (i.e. in __set_page_dirty()),
and its bio_cgroup id is stored into the page_cgroup structure.

Then for 2) we can detect writeback IO by placing a hook,
cgroup_io_throttle(), in submit_bio():

unsigned long long
cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes);

If the IO operation is a write we look at the owner of the pages
involved (from bio) and we check if we must throttle the operation. If
the owner of that page is "current", we throttle the current task
directly (via schedule_timeout_killable()) and we just return 0 from
cgroup_io_throttle() after the sleep.

3) If the owner of the page must be throttled and the current task is
not the same task, e.g., it's a kernel thread (current->flags &
(PF_KTHREAD | PF_FLUSHER | PF_KSWAPD)), then we assume it's writeback
IO and we immediately return the number of jiffies that the real owner
should sleep.

void submit_bio(int rw, struct bio *bio)
{
	...
	if (bio_has_data(bio)) {
		unsigned long sleep = 0;

		if (rw & WRITE) {
			count_vm_events(PGPGOUT, count);
			sleep = cgroup_io_throttle(bio,
					bio->bi_bdev, bio->bi_size);
		} else {
			task_io_account_read(bio->bi_size);
			count_vm_events(PGPGIN, count);
			cgroup_io_throttle(NULL, bio->bi_bdev, bio->bi_size);
		}
		...

		if (sleep && !iothrottle_make_request(bio, jiffies + sleep))
			return;
	}

	generic_make_request(bio);
	...
}

Since the current task must not be throttled here, we set a deadline of
jiffies + sleep and add the request to an rbtree via
iothrottle_make_request().

This request will be dispatched asynchronously by a kernel thread -
kiothrottled() - using generic_make_request() when the deadline
expires. There's a lot of room for optimizations here, e.g. using multiple
threads per block device, a workqueue, slow-work, ...

In the old version (v12) I simply throttled writeback IO in
balance_dirty_pages_ratelimited_nr(), but this obviously leads to bursty
writebacks. In v13 the writeback IO is much smoother.
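
Just to make the idea concrete, a rough sketch of the dispatch loop
described above (iothrottle_pop_expired() is a hypothetical helper name,
not the actual io-throttle symbol):

static int kiothrottled(void *unused)
{
	while (!kthread_should_stop()) {
		struct bio *bio;

		/* dispatch all deferred requests whose deadline has expired */
		while ((bio = iothrottle_pop_expired(jiffies)) != NULL)
			generic_make_request(bio);

		/* sleep briefly before checking the rbtree again */
		schedule_timeout_interruptible(HZ / 100);
	}
	return 0;
}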

>
> Does it add new metadata to `struct page' for this?

The new metadata goes into struct page_cgroup, not struct page.

>
> I assume that the write throttling is also wired up into the MAP_SHARED
> write-fault path?
>

mmmh.. in case of writeback IO we account and throttle requests for
mm->owner. In case of synchronous IO (read/write) we always throttle the
current task in submit_bio().

>
>
> Does this patchset provide a path by which we can implement IO control
> for (say) NFS mounts?

Honestly, I haven't looked at this at all. :) I'll check, but in principle
adding the cgroup_io_throttle() hook in the appropriate NFS path should be
enough to provide IO control for NFS mounts as well.

-Andrea

2009-04-17 09:56:18

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Fri, Apr 17, 2009 at 10:24:17AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 14 Apr 2009 22:21:12 +0200
> Andrea Righi <[email protected]> wrote:
>
> > +Example:
> > +* Create an association between an io-throttle group and a bio-cgroup group
> > + with "bio" and "blockio" subsystems mounted in different mount points:
> > + # mount -t cgroup -o bio bio-cgroup /mnt/bio-cgroup/
> > + # cd /mnt/bio-cgroup/
> > + # mkdir bio-grp
> > + # cat bio-grp/bio.id
> > + 1
> > + # mount -t cgroup -o blockio blockio /mnt/io-throttle
> > + # cd /mnt/io-throttle
> > + # mkdir foo
> > + # echo 1 > foo/blockio.bio_id
>
> Why do we need multiple cgroups at once to track I/O ?
> Seems complicated to me.
>
> Thanks,
> -Kame

I totally agree. I could easily merge the bio-cgroup functionality into
io-throttle, or implement this as an infrastructure framework, using a
single controller and removing this complication.

For now, since the decisions on IO controllers are not definitive at
all, I favored flexibility and simply decided to be a plain user of
bio-cgroup, to quickly adapt my patch to future bio-cgroup
development.

Thanks,
-Andrea

2009-04-17 10:25:54

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Fri, Apr 17, 2009 at 09:56:31AM +0800, Li Zefan wrote:
> KAMEZAWA Hiroyuki wrote:
> > On Tue, 14 Apr 2009 22:21:12 +0200
> > Andrea Righi <[email protected]> wrote:
> >
> >> +Example:
> >> +* Create an association between an io-throttle group and a bio-cgroup group
> >> + with "bio" and "blockio" subsystems mounted in different mount points:
> >> + # mount -t cgroup -o bio bio-cgroup /mnt/bio-cgroup/
> >> + # cd /mnt/bio-cgroup/
> >> + # mkdir bio-grp
> >> + # cat bio-grp/bio.id
> >> + 1
> >> + # mount -t cgroup -o blockio blockio /mnt/io-throttle
> >> + # cd /mnt/io-throttle
> >> + # mkdir foo
> >> + # echo 1 > foo/blockio.bio_id
> >
> > Why do we need multiple cgroups at once to track I/O ?
> > Seems complicated to me.
> >
>
> IIUC, it also disallows other subsystems to be binded with blockio subsys:
> # mount -t cgroup -o blockio cpuset xxx /mnt
> (failed)
>
> and if a task is moved from cg1(id=1) to cg2(id=2) in bio subsys, this task
> will be moved from CG1(id=1) to CG2(id=2) automatically in blockio subsys.
>
> All these are odd, unexpected, complex and bug-prone I think..

Implementing the bio-cgroup functionality as a pure infrastructure framework
instead of a cgroup subsystem would remove all this oddity and
complexity.

For example, the actual functionality that I need for the io-throttle
controller is just an interface to set and get the cgroup owner of a
page. I think it should be the same also for other potential users of
bio-cgroup.

So, what about implementing the bio-cgroup functionality as cgroup "page
tracking" infrastructure and provide the following interfaces:

/*
 * Encode the cgrp->css.id in page_cgroup->flags
 */
void set_cgroup_page_owner(struct page *page, struct cgroup *cgrp);

/*
 * Returns the cgroup owner of a page, decoding the cgroup id from
 * page_cgroup->flags.
 */
struct cgroup *get_cgroup_page_owner(struct page *page);

This also wouldn't increase the size of page_cgroup because we can
encode the cgroup id in the unused bits of page_cgroup->flags, as
originally suggested by Kame.

And I think it could also be used by dm-ioband, even if it's not a
cgroup-based subsystem... but I may be wrong. Ryo, what's your opinion?

-Andrea

2009-04-17 10:31:19

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

* Andrea Righi <[email protected]> [2009-04-14 22:21:14]:

> From: Ryo Tsuruta <[email protected]>
>
> From: Ryo Tsuruta <[email protected]>
>
> With writeback IO processed asynchronously by kernel threads (pdflush)
> the real writes to the underlying block devices can occur in a different
> IO context respect to the task that originally generated the dirty
> pages involved in the IO operation.
>
> The controller bio-cgroup is used by io-throttle to track writeback IO
> and for properly apply throttling.
>
> Also apply a patch by Gui Jianfeng to announce tasks moving in
> bio-cgroup groups.
>
> See also: http://people.valinux.co.jp/~ryov/bio-cgroup
>
> Signed-off-by: Gui Jianfeng <[email protected]>
> Signed-off-by: Ryo Tsuruta <[email protected]>
> Signed-off-by: Hirokazu Takahashi <[email protected]>
> ---
> block/blk-ioc.c | 30 ++--
> fs/buffer.c | 2 +
> fs/direct-io.c | 2 +
> include/linux/biotrack.h | 95 +++++++++++
> include/linux/cgroup_subsys.h | 6 +
> include/linux/iocontext.h | 1 +
> include/linux/memcontrol.h | 6 +
> include/linux/mmzone.h | 4 +-
> include/linux/page_cgroup.h | 13 ++-
> init/Kconfig | 15 ++
> mm/Makefile | 4 +-
> mm/biotrack.c | 349 +++++++++++++++++++++++++++++++++++++++++
> mm/bounce.c | 2 +
> mm/filemap.c | 2 +
> mm/memcontrol.c | 5 +
> mm/memory.c | 5 +
> mm/page-writeback.c | 2 +
> mm/page_cgroup.c | 17 ++-
> mm/swap_state.c | 2 +
> 19 files changed, 536 insertions(+), 26 deletions(-)
> create mode 100644 include/linux/biotrack.h
> create mode 100644 mm/biotrack.c
>
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 012f065..ef8cac0 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -84,24 +84,28 @@ void exit_io_context(void)
> }
> }
>
> +void init_io_context(struct io_context *ioc)
> +{
> + atomic_set(&ioc->refcount, 1);
> + atomic_set(&ioc->nr_tasks, 1);
> + spin_lock_init(&ioc->lock);
> + ioc->ioprio_changed = 0;
> + ioc->ioprio = 0;
> + ioc->last_waited = jiffies; /* doesn't matter... */
> + ioc->nr_batch_requests = 0; /* because this is 0 */
> + ioc->aic = NULL;
> + INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
> + INIT_HLIST_HEAD(&ioc->cic_list);
> + ioc->ioc_data = NULL;
> +}
> +
> struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
> {
> struct io_context *ret;
>
> ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
> - if (ret) {
> - atomic_set(&ret->refcount, 1);
> - atomic_set(&ret->nr_tasks, 1);
> - spin_lock_init(&ret->lock);
> - ret->ioprio_changed = 0;
> - ret->ioprio = 0;
> - ret->last_waited = jiffies; /* doesn't matter... */
> - ret->nr_batch_requests = 0; /* because this is 0 */
> - ret->aic = NULL;
> - INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
> - INIT_HLIST_HEAD(&ret->cic_list);
> - ret->ioc_data = NULL;
> - }
> + if (ret)
> + init_io_context(ret);
>

Can you split this part of the patch out as a refactoring patch?

> return ret;
> }
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 13edf7a..bc72150 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -36,6 +36,7 @@
> #include <linux/buffer_head.h>
> #include <linux/task_io_accounting_ops.h>
> #include <linux/bio.h>
> +#include <linux/biotrack.h>
> #include <linux/notifier.h>
> #include <linux/cpu.h>
> #include <linux/bitops.h>
> @@ -655,6 +656,7 @@ static void __set_page_dirty(struct page *page,
> if (page->mapping) { /* Race with truncate? */
> WARN_ON_ONCE(warn && !PageUptodate(page));
> account_page_dirtied(page, mapping);
> + bio_cgroup_reset_owner_pagedirty(page, current->mm);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> }
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index da258e7..ec42362 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -33,6 +33,7 @@
> #include <linux/err.h>
> #include <linux/blkdev.h>
> #include <linux/buffer_head.h>
> +#include <linux/biotrack.h>
> #include <linux/rwsem.h>
> #include <linux/uio.h>
> #include <asm/atomic.h>
> @@ -799,6 +800,7 @@ static int do_direct_IO(struct dio *dio)
> ret = PTR_ERR(page);
> goto out;
> }
> + bio_cgroup_reset_owner(page, current->mm);
>
> while (block_in_page < blocks_per_page) {
> unsigned offset_in_page = block_in_page << blkbits;
> diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
> new file mode 100644
> index 0000000..25b8810
> --- /dev/null
> +++ b/include/linux/biotrack.h
> @@ -0,0 +1,95 @@
> +#include <linux/cgroup.h>
> +#include <linux/mm.h>
> +#include <linux/page_cgroup.h>
> +
> +#ifndef _LINUX_BIOTRACK_H
> +#define _LINUX_BIOTRACK_H
> +
> +#ifdef CONFIG_CGROUP_BIO
> +
> +struct tsk_move_msg {
> + int old_id;
> + int new_id;
> + struct task_struct *tsk;
> +};
> +
> +extern int register_biocgroup_notifier(struct notifier_block *nb);
> +extern int unregister_biocgroup_notifier(struct notifier_block *nb);
> +
> +struct io_context;
> +struct block_device;
> +
> +struct bio_cgroup {
> + struct cgroup_subsys_state css;
> + int id;

Can't css_id be used here?

> + struct io_context *io_context; /* default io_context */
> +/* struct radix_tree_root io_context_root; per device io_context */

Commented-out code? Do you want to remove this?

> +};
> +
> +static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
> +{
> + pc->bio_cgroup_id = 0;
> +}
> +
> +extern struct cgroup *get_cgroup_from_page(struct page *page);
> +extern void put_cgroup_from_page(struct page *page);
> +extern struct cgroup *bio_id_to_cgroup(int id);
> +
> +static inline int bio_cgroup_disabled(void)
> +{
> + return bio_cgroup_subsys.disabled;
> +}
> +
> +extern void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
> +extern void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
> +extern void bio_cgroup_reset_owner_pagedirty(struct page *page,
> + struct mm_struct *mm);
> +extern void bio_cgroup_copy_owner(struct page *page, struct page *opage);
> +
> +extern struct io_context *get_bio_cgroup_iocontext(struct bio *bio);
> +extern int get_bio_cgroup_id(struct bio *bio);
> +
> +#else /* CONFIG_CGROUP_BIO */
> +
> +struct bio_cgroup;
> +

Comments? Docbook style would be nice for the functions below.

> +static inline void __init_bio_page_cgroup(struct page_cgroup *pc)
> +{
> +}
> +
> +static inline int bio_cgroup_disabled(void)
> +{
> + return 1;
> +}
> +
> +static inline void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> +{
> +}
> +
> +static inline void bio_cgroup_reset_owner(struct page *page,
> + struct mm_struct *mm)
> +{
> +}
> +
> +static inline void bio_cgroup_reset_owner_pagedirty(struct page *page,
> + struct mm_struct *mm)
> +{
> +}
> +
> +static inline void bio_cgroup_copy_owner(struct page *page, struct page *opage)
> +{
> +}
> +
> +static inline struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
> +{
> + return NULL;
> +}
> +
> +static inline int get_bio_cgroup_id(struct bio *bio)
> +{
> + return 0;
> +}
> +
> +#endif /* CONFIG_CGROUP_BIO */
> +
> +#endif /* _LINUX_BIOTRACK_H */
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 9c8d31b..5df23f8 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
>
> /* */
>
> +#ifdef CONFIG_CGROUP_BIO
> +SUBSYS(bio_cgroup)
> +#endif
> +
> +/* */
> +
> #ifdef CONFIG_CGROUP_DEVICE
> SUBSYS(devices)
> #endif
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index 08b987b..be37c27 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -104,6 +104,7 @@ int put_io_context(struct io_context *ioc);
> void exit_io_context(void);
> struct io_context *get_io_context(gfp_t gfp_flags, int node);
> struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
> +void init_io_context(struct io_context *ioc);
> void copy_io_context(struct io_context **pdst, struct io_context **psrc);
> #else
> static inline void exit_io_context(void)
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..f3e0e64 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -37,6 +37,8 @@ struct mm_struct;
> * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> */
>
> +extern void __init_mem_page_cgroup(struct page_cgroup *pc);
> +
> extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> /* for swap handling */
> @@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> +static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
> +{
> +}
> +
> static inline int mem_cgroup_newpage_charge(struct page *page,
> struct mm_struct *mm, gfp_t gfp_mask)
> {
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 186ec6a..47a6f55 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -607,7 +607,7 @@ typedef struct pglist_data {
> int nr_zones;
> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> struct page *node_mem_map;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> struct page_cgroup *node_page_cgroup;
> #endif
> #endif
> @@ -958,7 +958,7 @@ struct mem_section {
>
> /* See declaration of similar field in struct zone */
> unsigned long *pageblock_flags;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> /*
> * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
> * section. (see memcontrol.h/page_cgroup.h about this.)
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 7339c7b..a7249bb 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -1,7 +1,7 @@
> #ifndef __LINUX_PAGE_CGROUP_H
> #define __LINUX_PAGE_CGROUP_H
>
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> #include <linux/bit_spinlock.h>
> /*
> * Page Cgroup can be considered as an extended mem_map.
> @@ -12,9 +12,16 @@
> */
> struct page_cgroup {
> unsigned long flags;
> - struct mem_cgroup *mem_cgroup;
> struct page *page;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + struct mem_cgroup *mem_cgroup;
> +#endif
> +#ifdef CONFIG_CGROUP_BIO
> + int bio_cgroup_id;
> +#endif
> +#if defined(CONFIG_CGROUP_MEM_RES_CTLR) || defined(CONFIG_CGROUP_BIO)
> struct list_head lru; /* per cgroup LRU list */

Do we need the #if defined clause? Anyone using page_cgroup, but not
list_head LRU needs to be explicitly covered when they come up.

> +#endif
> };
>
> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> @@ -71,7 +78,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
> bit_spin_unlock(PCG_LOCK, &pc->flags);
> }
>
> -#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +#else /* CONFIG_CGROUP_PAGE */
> struct page_cgroup;
>
> static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> diff --git a/init/Kconfig b/init/Kconfig
> index 7be4d38..8f7b23c 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -606,8 +606,23 @@ config CGROUP_MEM_RES_CTLR_SWAP
> Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
> size is 4096bytes, 512k per 1Gbytes of swap.
>
> +config CGROUP_BIO
> + bool "Block I/O cgroup subsystem"
> + depends on CGROUPS && BLOCK
> + select MM_OWNER
> + help
> + Provides a Resource Controller which enables to track the onwner
> + of every Block I/O requests.
> + The information this subsystem provides can be used from any
> + kind of module such as dm-ioband device mapper modules or
> + the cfq-scheduler.
> +
> endif # CGROUPS
>
> +config CGROUP_PAGE
> + def_bool y
> + depends on CGROUP_MEM_RES_CTLR || CGROUP_BIO
> +
> config MM_OWNER
> bool
>
> diff --git a/mm/Makefile b/mm/Makefile
> index ec73c68..a78a437 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -37,4 +37,6 @@ else
> obj-$(CONFIG_SMP) += allocpercpu.o
> endif
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
> +obj-$(CONFIG_CGROUP_BIO) += biotrack.o
> diff --git a/mm/biotrack.c b/mm/biotrack.c
> new file mode 100644
> index 0000000..d3a35f1
> --- /dev/null
> +++ b/mm/biotrack.c
> @@ -0,0 +1,349 @@
> +/* biotrack.c - Block I/O Tracking
> + *
> + * Copyright (C) VA Linux Systems Japan, 2008
> + * Developed by Hirokazu Takahashi <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/smp.h>
> +#include <linux/bit_spinlock.h>
> +#include <linux/idr.h>
> +#include <linux/blkdev.h>
> +#include <linux/biotrack.h>
> +
> +#define MOVETASK 0
> +static BLOCKING_NOTIFIER_HEAD(biocgroup_chain);
> +
> +int register_biocgroup_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_register(&biocgroup_chain, nb);
> +}
> +EXPORT_SYMBOL(register_biocgroup_notifier);
> +
> +int unregister_biocgroup_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_unregister(&biocgroup_chain, nb);
> +}
> +EXPORT_SYMBOL(unregister_biocgroup_notifier);
> +
> +/*
> + * The block I/O tracking mechanism is implemented on the cgroup memory
> + * controller framework. It helps to find the the owner of an I/O request
> + * because every I/O request has a target page and the owner of the page
> + * can be easily determined on the framework.
> + */
> +
> +/* Return the bio_cgroup that associates with a cgroup. */
> +static inline struct bio_cgroup *cgroup_bio(struct cgroup *cgrp)
> +{
> + return container_of(cgroup_subsys_state(cgrp, bio_cgroup_subsys_id),
> + struct bio_cgroup, css);
> +}
> +
> +/* Return the bio_cgroup that associates with a process. */
> +static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
> +{
> + return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
> + struct bio_cgroup, css);
> +}
> +
> +static struct idr bio_cgroup_id;
> +static DEFINE_SPINLOCK(bio_cgroup_idr_lock);
> +static struct io_context default_bio_io_context;
> +static struct bio_cgroup default_bio_cgroup = {
> + .id = 0,
> + .io_context = &default_bio_io_context,
> +};
> +
> +/*
> + * This function is used to make a given page have the bio-cgroup id of
> + * the owner of this page.
> + */
> +void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> +{
> + struct bio_cgroup *biog;
> + struct page_cgroup *pc;
> +
> + if (bio_cgroup_disabled())
> + return;
> + pc = lookup_page_cgroup(page);
> + if (unlikely(!pc))
> + return;
> +

Is this routine called with lock_page_cgroup() taken? Otherwise
what protects pc->bio_cgroup_id?

> + pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
> + if (!mm)
> + return;
> + /*
> + * Locking "pc" isn't necessary here since the current process is
> + * the only one that can access the members related to bio_cgroup.
> + */
> + rcu_read_lock();
> + biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
> + if (unlikely(!biog))
> + goto out;
> + /*
> + * css_get(&bio->css) isn't called to increment the reference
> + * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
> + * invalid even if this page is still active.
> + * This approach is chosen to minimize the overhead.
> + */
> + pc->bio_cgroup_id = biog->id;

What happens if the cgroup or css is deleted without the refcount increase?

> +out:
> + rcu_read_unlock();
> +}
> +
> +/*
> + * Change the owner of a given page if necessary.
> + */
> +void bio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
> +{
> + /*
> + * A little trick:
> + * Just call bio_cgroup_set_owner() for pages which are already
> + * active since the bio_cgroup_id member of page_cgroup can be
> + * updated without any locks. This is because an integer type of
> + * variable can be set a new value at once on modern cpus.
> + */
> + bio_cgroup_set_owner(page, mm);
> +}
> +
> +/*
> + * Change the owner of a given page. This function is only effective for
> + * pages in the pagecache.

Could you clarify pagecache? mapped/unmapped or both?

> + */
> +void bio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
> +{
> + if (PageSwapCache(page) || PageAnon(page))
> + return;

Look at page_is_file_cache() depending on the answer above

> + if (current->flags & PF_MEMALLOC)
> + return;
> +
> + bio_cgroup_reset_owner(page, mm);
> +}
> +
> +/*
> + * Assign "page" the same owner as "opage."
> + */
> +void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
> +{
> + struct page_cgroup *npc, *opc;
> +
> + if (bio_cgroup_disabled())
> + return;
> + npc = lookup_page_cgroup(npage);
> + if (unlikely(!npc))
> + return;
> + opc = lookup_page_cgroup(opage);
> + if (unlikely(!opc))
> + return;
> +
> + /*
> + * Do this without any locks. The reason is the same as
> + * bio_cgroup_reset_owner().
> + */
> + npc->bio_cgroup_id = opc->bio_cgroup_id;

What protects npc and opc?

> +}
> +
> +/* Create a new bio-cgroup. */
> +static struct cgroup_subsys_state *
> +bio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + struct bio_cgroup *biog;
> + struct io_context *ioc;
> + int ret;
> +
> + if (!cgrp->parent) {
> + biog = &default_bio_cgroup;
> + init_io_context(biog->io_context);
> + /* Increment the referrence count not to be released ever. */
> + atomic_inc(&biog->io_context->refcount);
> + idr_init(&bio_cgroup_id);
> + return &biog->css;
> + }
> +
> + biog = kzalloc(sizeof(*biog), GFP_KERNEL);
> + ioc = alloc_io_context(GFP_KERNEL, -1);
> + if (!ioc || !biog) {
> + ret = -ENOMEM;
> + goto out_err;
> + }
> + biog->io_context = ioc;
> +retry:
> + if (!idr_pre_get(&bio_cgroup_id, GFP_KERNEL)) {
> + ret = -EAGAIN;
> + goto out_err;
> + }
> + spin_lock_irq(&bio_cgroup_idr_lock);
> + ret = idr_get_new_above(&bio_cgroup_id, (void *)biog, 1, &biog->id);
> + spin_unlock_irq(&bio_cgroup_idr_lock);
> + if (ret == -EAGAIN)
> + goto retry;
> + else if (ret)
> + goto out_err;
> +
> + return &biog->css;
> +out_err:
> + kfree(biog);
> + if (ioc)
> + put_io_context(ioc);
> + return ERR_PTR(ret);
> +}
> +
> +/* Delete the bio-cgroup. */
> +static void bio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + struct bio_cgroup *biog = cgroup_bio(cgrp);
> +
> + put_io_context(biog->io_context);
> +
> + spin_lock_irq(&bio_cgroup_idr_lock);
> + idr_remove(&bio_cgroup_id, biog->id);
> + spin_unlock_irq(&bio_cgroup_idr_lock);
> +
> + kfree(biog);
> +}
> +
> +static struct bio_cgroup *find_bio_cgroup(int id)
> +{
> + struct bio_cgroup *biog;
> + spin_lock_irq(&bio_cgroup_idr_lock);
> + /*
> + * It might fail to find A bio-group associated with "id" since it
> + * is allowed to remove the bio-cgroup even when some of I/O requests
> + * this group issued haven't completed yet.
> + */
> + biog = (struct bio_cgroup *)idr_find(&bio_cgroup_id, id);
> + spin_unlock_irq(&bio_cgroup_idr_lock);
> + return biog;
> +}
> +
> +struct cgroup *bio_id_to_cgroup(int id)
> +{
> + struct bio_cgroup *biog;
> +
> + biog = find_bio_cgroup(id);
> + if (biog)
> + return biog->css.cgroup;
> +
> + return NULL;
> +}
> +
> +struct cgroup *get_cgroup_from_page(struct page *page)
> +{
> + struct page_cgroup *pc;
> + struct bio_cgroup *biog;
> + struct cgroup *cgrp = NULL;
> +
> + pc = lookup_page_cgroup(page);
> + if (!pc)
> + return NULL;
> + lock_page_cgroup(pc);
> + biog = find_bio_cgroup(pc->bio_cgroup_id);
> + if (biog) {
> + css_get(&biog->css);
> + cgrp = biog->css.cgroup;
> + }
> + unlock_page_cgroup(pc);
> + return cgrp;
> +}
> +
> +void put_cgroup_from_page(struct page *page)
> +{
> + struct bio_cgroup *biog;
> + struct page_cgroup *pc;
> +
> + pc = lookup_page_cgroup(page);
> + if (!pc)
> + return;
> + lock_page_cgroup(pc);
> + biog = find_bio_cgroup(pc->bio_cgroup_id);
> + if (biog)
> + css_put(&biog->css);
> + unlock_page_cgroup(pc);
> +}
> +
> +/* Determine the bio-cgroup id of a given bio. */
> +int get_bio_cgroup_id(struct bio *bio)
> +{
> + struct page_cgroup *pc;
> + struct page *page = bio_iovec_idx(bio, 0)->bv_page;
> + int id = 0;
> +
> + pc = lookup_page_cgroup(page);
> + if (pc)
> + id = pc->bio_cgroup_id;
> + return id;
> +}
> +EXPORT_SYMBOL(get_bio_cgroup_id);
> +
> +/* Determine the iocontext of the bio-cgroup that issued a given bio. */
> +struct io_context *get_bio_cgroup_iocontext(struct bio *bio)
> +{
> + struct bio_cgroup *biog = NULL;
> + struct io_context *ioc;
> + int id = 0;
> +
> + id = get_bio_cgroup_id(bio);
> + if (id)
> + biog = find_bio_cgroup(id);
> + if (!biog)
> + biog = &default_bio_cgroup;
> + ioc = biog->io_context; /* default io_context for this cgroup */
> + atomic_inc(&ioc->refcount);
> + return ioc;
> +}
> +EXPORT_SYMBOL(get_bio_cgroup_iocontext);
> +
> +static u64 bio_id_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> + struct bio_cgroup *biog = cgroup_bio(cgrp);
> + return (u64) biog->id;
> +}
> +
> +
> +static struct cftype bio_files[] = {
> + {
> + .name = "id",
> + .read_u64 = bio_id_read,
> + },
> +};
> +
> +static int bio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +{
> + return cgroup_add_files(cgrp, ss, bio_files, ARRAY_SIZE(bio_files));
> +}
> +
> +static void bio_cgroup_attach(struct cgroup_subsys *ss,
> + struct cgroup *cont, struct cgroup *oldcont,
> + struct task_struct *tsk)
> +{
> + struct tsk_move_msg tmm;
> + struct bio_cgroup *old_biog, *new_biog;
> +
> + old_biog = cgroup_bio(oldcont);
> + new_biog = cgroup_bio(cont);
> + tmm.old_id = old_biog->id;
> + tmm.new_id = new_biog->id;
> + tmm.tsk = tsk;
> + blocking_notifier_call_chain(&biocgroup_chain, MOVETASK, &tmm);
> +}
> +
> +struct cgroup_subsys bio_cgroup_subsys = {
> + .name = "bio",
> + .create = bio_cgroup_create,
> + .destroy = bio_cgroup_destroy,
> + .populate = bio_cgroup_populate,
> + .attach = bio_cgroup_attach,
> + .subsys_id = bio_cgroup_subsys_id,
> +};
> +
> diff --git a/mm/bounce.c b/mm/bounce.c
> index e590272..1a01905 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -14,6 +14,7 @@
> #include <linux/hash.h>
> #include <linux/highmem.h>
> #include <linux/blktrace_api.h>
> +#include <linux/biotrack.h>
> #include <trace/block.h>
> #include <asm/tlbflush.h>
>
> @@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
> to->bv_len = from->bv_len;
> to->bv_offset = from->bv_offset;
> inc_zone_page_state(to->bv_page, NR_BOUNCE);
> + bio_cgroup_copy_owner(to->bv_page, page);
>
> if (rw == WRITE) {
> char *vto, *vfrom;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8bd4980..1ab32a2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -33,6 +33,7 @@
> #include <linux/cpuset.h>
> #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
> #include <linux/memcontrol.h>
> +#include <linux/biotrack.h>
> #include <linux/mm_inline.h> /* for page_is_file_cache() */
> #include "internal.h"
>
> @@ -463,6 +464,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> gfp_mask & GFP_RECLAIM_MASK);
> if (error)
> goto out;
> + bio_cgroup_set_owner(page, current->mm);
>
> error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> if (error == 0) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e44fb0f..c25eb63 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2524,6 +2524,11 @@ struct cgroup_subsys mem_cgroup_subsys = {
> .use_id = 1,
> };
>
> +void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
> +{
> + pc->mem_cgroup = NULL;
> +}
> +
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>
> static int __init disable_swap_account(char *s)
> diff --git a/mm/memory.c b/mm/memory.c
> index cf6873e..7779e12 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -51,6 +51,7 @@
> #include <linux/init.h>
> #include <linux/writeback.h>
> #include <linux/memcontrol.h>
> +#include <linux/biotrack.h>
> #include <linux/mmu_notifier.h>
> #include <linux/kallsyms.h>
> #include <linux/swapops.h>
> @@ -2052,6 +2053,7 @@ gotten:
> * thread doing COW.
> */
> ptep_clear_flush_notify(vma, address, page_table);
> + bio_cgroup_set_owner(new_page, mm);
> page_add_new_anon_rmap(new_page, vma, address);
> set_pte_at(mm, address, page_table, entry);
> update_mmu_cache(vma, address, entry);
> @@ -2497,6 +2499,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> flush_icache_page(vma, page);
> set_pte_at(mm, address, page_table, pte);
> page_add_anon_rmap(page, vma, address);
> + bio_cgroup_reset_owner(page, mm);
> /* It's better to call commit-charge after rmap is established */
> mem_cgroup_commit_charge_swapin(page, ptr);
>
> @@ -2559,6 +2562,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> if (!pte_none(*page_table))
> goto release;
> inc_mm_counter(mm, anon_rss);
> + bio_cgroup_set_owner(page, mm);
> page_add_new_anon_rmap(page, vma, address);
> set_pte_at(mm, address, page_table, entry);
>
> @@ -2711,6 +2715,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> if (anon) {
> inc_mm_counter(mm, anon_rss);
> + bio_cgroup_set_owner(page, mm);
> page_add_new_anon_rmap(page, vma, address);
> } else {
> inc_mm_counter(mm, file_rss);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 30351f0..1379eb0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -26,6 +26,7 @@
> #include <linux/blkdev.h>
> #include <linux/mpage.h>
> #include <linux/rmap.h>
> +#include <linux/biotrack.h>
> #include <linux/percpu.h>
> #include <linux/notifier.h>
> #include <linux/smp.h>
> @@ -1243,6 +1244,7 @@ int __set_page_dirty_nobuffers(struct page *page)
> BUG_ON(mapping2 != mapping);
> WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> account_page_dirtied(page, mapping);
> + bio_cgroup_reset_owner_pagedirty(page, current->mm);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> }
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 791905c..f692ee2 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -9,13 +9,16 @@
> #include <linux/vmalloc.h>
> #include <linux/cgroup.h>
> #include <linux/swapops.h>
> +#include <linux/memcontrol.h>
> +#include <linux/biotrack.h>
>
> static void __meminit
> __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> {
> pc->flags = 0;
> - pc->mem_cgroup = NULL;
> pc->page = pfn_to_page(pfn);
> + __init_mem_page_cgroup(pc);
> + __init_bio_page_cgroup(pc);
> INIT_LIST_HEAD(&pc->lru);
> }
> static unsigned long total_usage;
> @@ -74,7 +77,7 @@ void __init page_cgroup_init(void)
>
> int nid, fail;
>
> - if (mem_cgroup_disabled())
> + if (mem_cgroup_disabled() && bio_cgroup_disabled())
> return;
>
> for_each_online_node(nid) {
> @@ -83,12 +86,12 @@ void __init page_cgroup_init(void)
> goto fail;
> }
> printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> - printk(KERN_INFO "please try cgroup_disable=memory option if you"
> + printk(KERN_INFO "please try cgroup_disable=memory,bio option if you"
> " don't want\n");
> return;
> fail:
> printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
> - printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
> + printk(KERN_CRIT "please try cgroup_disable=memory,bio boot options\n");
> panic("Out of memory");
> }
>
> @@ -248,7 +251,7 @@ void __init page_cgroup_init(void)
> unsigned long pfn;
> int fail = 0;
>
> - if (mem_cgroup_disabled())
> + if (mem_cgroup_disabled() && bio_cgroup_disabled())
> return;
>
> for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
> @@ -263,8 +266,8 @@ void __init page_cgroup_init(void)
> hotplug_memory_notifier(page_cgroup_callback, 0);
> }
> printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> - printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
> - " want\n");
> + printk(KERN_INFO
> + "try cgroup_disable=memory,bio option if you don't want\n");
> }
>
> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3ecea98..c7ad256 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -17,6 +17,7 @@
> #include <linux/backing-dev.h>
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> +#include <linux/biotrack.h>
> #include <linux/page_cgroup.h>
>
> #include <asm/pgtable.h>
> @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> */
> __set_page_locked(new_page);
> SetPageSwapBacked(new_page);
> + bio_cgroup_set_owner(new_page, current->mm);
> err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> if (likely(!err)) {
> /*
> --
> 1.5.6.3
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers
>

--
Balbir

2009-04-17 10:41:53

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Fri, Apr 17, 2009 at 12:25:40PM +0200, Andrea Righi wrote:
> So, what about implementing the bio-cgroup functionality as cgroup "page
> tracking" infrastructure and provide the following interfaces:
>
> /*
> * Encode the cgrp->css.id in page_group->flags

sorry, I meant css_id(struct cgroup_subsys_state *css) here.

> */
> void set_cgroup_page_owner(struct page *page, struct cgroup *cgrp);
>
> /*
> * Returns the cgroup owner of a page, decoding the cgroup id from
> * page_cgroup->flags.
> */
> struct cgroup *get_cgroup_page_owner(struct page *page);

Or better, even more generic:

/*
 * Encode id in page_cgroup->flags.
 */
void set_page_id(struct page *page, unsigned short id);

/*
 * Returns the id of a page, decoding it from page_cgroup->flags.
 */
unsigned short get_page_id(struct page *page);

Then we can use css_id() for cgroups, or any kind of ID for other
potential users (dm-ioband, etc.).
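
A minimal sketch of how these two helpers could look on top of
page_cgroup->flags, reusing the 16-bit split discussed earlier
(illustrative only, untested):

#define PAGE_ID_SHIFT	16	/* lower 16 bits stay reserved for flags */

void set_page_id(struct page *page, unsigned short id)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	if (!pc)
		return;
	/* take the page_cgroup lock to avoid racing with flag updates */
	lock_page_cgroup(pc);
	pc->flags &= (1UL << PAGE_ID_SHIFT) - 1;
	pc->flags |= (unsigned long)id << PAGE_ID_SHIFT;
	unlock_page_cgroup(pc);
}

unsigned short get_page_id(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	/* ids are at most 16 bits, so the truncation is safe */
	return pc ? pc->flags >> PAGE_ID_SHIFT : 0;
}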

-Andrea

>
> This also wouldn't increase the size of page_cgroup because we can
> encode the cgroup id in the unused bits of page_cgroup->flags, as
> originally suggested by Kame.
>
> And I think it could be used also by dm-ioband, even if it's not a
> cgroup-based subsystem... but I may be wrong. Ryo what's your opinion?

Subject: Block I/O tracking (was Re: [PATCH 3/9] bio-cgroup controller)

Ryo Tsuruta wrote:
> Hi,
>
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Fri, 17 Apr 2009 11:24:33 +0900
>
>> On Fri, 17 Apr 2009 10:49:43 +0900
>> Takuya Yoshikawa <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a few question.
>>> - I have not yet fully understood how your controller are using
>>> bio_cgroup. If my view is wrong please tell me.
>>>
>>> o In my view, bio_cgroup's implementation strongly depends on
>>> page_cgoup's. Could you explain for what purpose does this
>>> functionality itself should be implemented as cgroup subsystem?
>>> Using page_cgoup and implementing tracking APIs is not enough?
>> I'll definitely do "Nack" to add full bio-cgroup members to page_cgroup.
>> Now, page_cgroup is 40bytes(in 64bit arch.) And all of them are allocated at
>> boot time as memmap. (and add member to struct page is much harder ;)
>>
>> IIUC, feature for "tracking bio" is just necesary for pages for I/O.
>> So, I think it's much better to add misc. information to struct bio not to the page.
>> But, if people want to add "small hint" to struct page or struct page_cgroup
>> for tracking buffered I/O, I'll give you help as much as I can.
>> Maybe using "unused bits" in page_cgroup->flags is a choice with no overhead.
>
> In the case where the bio-cgroup data is allocated dynamically,
> - Sometimes quite a large amount of memory get marked dirty.
> In this case it requires more kernel memory than that of the
> current implementation.
> - The operation is expansive due to memory allocations and exclusive
> controls by such as spinlocks.
>
> In the case where the bio-cgroup data is allocated by delayed allocation,
> - It makes the operation complicated and expensive, because
> sometimes a bio has to be created in the context of other
> processes, such as aio and swap-out operation.
>
> I'd prefer a simple and lightweight implementation. bio-cgroup only
> needs 4bytes unlike memory controller. The reason why bio-cgroup chose
> this approach is to minimize the overhead.

Elaborating on Yoshikawa-san's comment, I would like to propose a
generic I/O tracking mechanism that is not tied to all the cgroup
paraphernalia. This approach has several advantages:

- By using this functionality, existing I/O schedulers (well, some
relatively minor changes would be needed) would be able to schedule
buffered I/O properly.

- The amount of memory consumed to do the tracking could be
optimized according to the kernel configuration (do we really
need struct page_cgroup when the cgroup memory controller or all
of the cgroup infrastructure has been configured out?).

The I/O tracking functionality would look something like the following:

- Create an API to acquire the I/O context of a certain page, which is
cgroup independent. For discussion purposes, I will assume that the
I/O context of a page is the io_context of the task that dirtied the
page (this can be changed if deemed necessary, though).

- When cgroups are not being used, pages would be tracked using a
pfn-indexed array of struct io_context (à la memcg's array of
struct page_cgroup).

- When cgroups are activated but the memory controller is not, we
would have a pfn-indexed array of struct blkio_cgroup, which would
have both a pointer to the corresponding io_context of the page and a
reference to the cgroup it belongs to (most likely using css_id). The
API offered by the I/O tracking mechanism would be extended so that
the kernel can easily obtain not only the per-task io_context but also
the cgroup a certain page belongs to. Please notice that by doing this
we have all the information we need to schedule buffered I/O both at
the cgroup-level and the task-level. From the memory usage point of
view, memory controller-specific bits would be gone and to top it all
we save one indirection level (since struct page_cgroup would be out
of the picture).

- When the memory controller is active we would have the
pfn-indexed array of struct page_cgroup we have now plus a
reference to the corresponding cgroup and io_context (yes, I
still want to do proper scheduling of buffered I/O within a
cgroup).

- Finally, since a bio entering the block layer can generate additional
bios, it is necessary to pass the I/O context information of the original
bio down to the new bios. For that, stacking devices such as dm and
those of that ilk will have to be modified. To improve performance I/O
context information would be cached in bios (to achieve this we have
to ensure that all bios that enter the block layer have the right I/O
context information attached to it).
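
As a purely illustrative sketch of the kind of per-page tracking data
being proposed here (all names below are hypothetical, not from an
actual patch):

struct blkio_track_info {
	struct io_context *ioc;		/* io_context of the task that dirtied the page */
#ifdef CONFIG_CGROUPS
	unsigned short blkio_css_id;	/* owning cgroup, obtained via css_id() */
#endif
};

/* pfn-indexed lookup, analogous to lookup_page_cgroup() */
struct blkio_track_info *blkio_track_lookup(struct page *page);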

Yoshikawa-san and I have been working on a patch set that
implements just this, and we have reached the point where the kernel
does not panic right after booting :), so we will be sending patches soon
(hopefully this weekend).

Any thoughts?

Regards,

Fernando

Subject: Re: [PATCH 1/9] io-throttle documentation

Andrea Righi wrote:
> On Fri, Apr 17, 2009 at 09:56:31AM +0800, Li Zefan wrote:
>> KAMEZAWA Hiroyuki wrote:
>>> On Tue, 14 Apr 2009 22:21:12 +0200
>>> Andrea Righi <[email protected]> wrote:
>>>
>>>> +Example:
>>>> +* Create an association between an io-throttle group and a bio-cgroup group
>>>> + with "bio" and "blockio" subsystems mounted in different mount points:
>>>> + # mount -t cgroup -o bio bio-cgroup /mnt/bio-cgroup/
>>>> + # cd /mnt/bio-cgroup/
>>>> + # mkdir bio-grp
>>>> + # cat bio-grp/bio.id
>>>> + 1
>>>> + # mount -t cgroup -o blockio blockio /mnt/io-throttle
>>>> + # cd /mnt/io-throttle
>>>> + # mkdir foo
>>>> + # echo 1 > foo/blockio.bio_id
>>> Why do we need multiple cgroups at once to track I/O ?
>>> Seems complicated to me.
>>>
>> IIUC, it also disallows other subsystems to be binded with blockio subsys:
>> # mount -t cgroup -o blockio cpuset xxx /mnt
>> (failed)
>>
>> and if a task is moved from cg1(id=1) to cg2(id=2) in bio subsys, this task
>> will be moved from CG1(id=1) to CG2(id=2) automatically in blockio subsys.
>>
>> All these are odd, unexpected, complex and bug-prone I think..
>
> Implementing bio-cgroup functionality as pure infrastructure framework
> instead of a cgroup subsystem would remove all this oddity and
> complexity.

Andrea, I agree with you completely. In fact, we have been working on that for
a while and have just proposed doing exactly that on a different mail thread
(you are CC'ed). It would be great if you could comment on that proposal.

Thanks,

Fernando

> For example, the actual functionality that I need for the io-throttle
> controller is just an interface to set and get the cgroup owner of a
> page. I think it should be the same also for other potential users of
> bio-cgroup.
>
> So, what about implementing the bio-cgroup functionality as cgroup "page
> tracking" infrastructure and provide the following interfaces:
>
> /*
> * Encode the cgrp->css.id in page_group->flags
> */
> void set_cgroup_page_owner(struct page *page, struct cgroup *cgrp);
>
> /*
> * Returns the cgroup owner of a page, decoding the cgroup id from
> * page_cgroup->flags.
> */
> struct cgroup *get_cgroup_page_owner(struct page *page);
>
> This also wouldn't increase the size of page_cgroup because we can
> encode the cgroup id in the unused bits of page_cgroup->flags, as
> originally suggested by Kame.
>
> And I think it could be used also by dm-ioband, even if it's not a
> cgroup-based subsystem... but I may be wrong. Ryo what's your opinion?
>
> -Andrea

2009-04-17 12:40:31

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Tue, Apr 14, 2009 at 10:21:20PM +0200, Andrea Righi wrote:
> Delaying journal IO can unnecessarily delay other independent IO
> operations from different cgroups.
>
> Add BIO_RW_META flag to the ext3 journal IO that informs the io-throttle
> subsystem to account but not delay journal IO and avoid potential
> priority inversion problems.

So this worries me for two reasons. First of all, the meaning of
BIO_RW_META is not well defined, but I'm concerned that you are using
the flag in a manner that wasn't its original intent.
I've included Jens on the cc list so he can comment on that score.

Secondly, there are many more locations than these which can end up
generating I/O that will cause the journal commit to block until they
are completed. I've done a lot of work in the past few
weeks to make sure those writes get marked using BIO_RW_SYNC. In
data=ordered mode, the journal commit will block waiting for data
blocks to be written out, and that implies you really need to treat as
high priority all of the block writes that are marked with the
BIO_RW_SYNC flag.

The flip side of this is that it may end up making your I/O controller
leaky; that is, someone might be able to evade your I/O controller's
attempt to impose limits by using fsync() all the time. This is a
hard problem, though, because filesystem I/O is almost always
intertwined.

What sort of scenarios and workloads are you envisioning might use
this I/O controller? And can you say more about the specifics of
the priority inversion problem you are concerned about?

Regards,

- Ted

2009-04-17 12:50:22

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Fri, Apr 17 2009, Theodore Tso wrote:
> On Tue, Apr 14, 2009 at 10:21:20PM +0200, Andrea Righi wrote:
> > Delaying journal IO can unnecessarily delay other independent IO
> > operations from different cgroups.
> >
> > Add BIO_RW_META flag to the ext3 journal IO that informs the io-throttle
> > subsystem to account but not delay journal IO and avoid potential
> > priority inversion problems.
>
> So this worries me for two reasons. First of all, the meaning of
> BIO_RW_META is not well defined, but I'm concerned that you are using
> the flag in a manner that in a way that wasn't its original intent.
> I've included Jens on the cc list so he can comment on that score.

I was actually already on the cc, though with my private mail address! I
did read the patch this morning and initially thought it was a bad idea
as well, but then I thought that perhaps it's not unreasonable to view
journal IO as a form of metadata to some extent.

But still, putting any sort of value into the meta flag is a bad idea.
It's assuming that it will get you some sort of extra guarantee, which
isn't the case. If journal IO is that much more important than other IO,
it should be prioritized explicitly. I'm not sure there's a good
solution to this problem.

> Secondly, there are many more locations than these which can end up
> causing I/O which will ending up causing the journal commit to block
> until they are completed. I've done a lot of work in the past few
> weeks to make sure those writes get marked using BIO_RW_SYNC. In
> data=ordered mode, the journal commit will block waiting for data
> blocks to be written out, and that implies you really need to treat as
> high priority all of the block writes that are marked with the
> BIO_RW_SYNC flag.
>
> The flip side of this is it may end up making your I/O controller to
> leaky; that is, someone might be able to evade your I/O controller's
> attempt to impose limits by using fsync() all the time. This is a
> hard problem, though, because filesystem I/O is almost always
> intertwined.
>
> What sort of scenarios and workloads are you envisioning might use
> this I/O controller? And can you say more about the specifics about
> the priority inversion problem you are concerned about?

I'm assuming it's the "usual" problem with lower priority IO getting
access to fs exclusive data. It's quite trivial to cause problems with
higher IO priority tasks then getting stuck waiting for the low priority
process, since they also need to access that fs exclusive data.

CFQ includes a vain attempt at boosting the priority of such a low
priority process if that happens, see the get_fs_excl() stuff in
lock_super(). reiserfs also marks the process as holding fs exclusive
resources, but it was never added to any of the other file systems. But
we could improve that situation. The file system is really the only one
that can inform us of such an issue.

--
Jens Axboe

2009-04-17 14:39:21

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Fri, Apr 17, 2009 at 02:50:04PM +0200, Jens Axboe wrote:
> On Fri, Apr 17 2009, Theodore Tso wrote:
> > On Tue, Apr 14, 2009 at 10:21:20PM +0200, Andrea Righi wrote:
> > > Delaying journal IO can unnecessarily delay other independent IO
> > > operations from different cgroups.
> > >
> > > Add BIO_RW_META flag to the ext3 journal IO that informs the io-throttle
> > > subsystem to account but not delay journal IO and avoid potential
> > > priority inversion problems.
> >
> > So this worries me for two reasons. First of all, the meaning of
> > BIO_RW_META is not well defined, but I'm concerned that you are using
> > the flag in a manner that in a way that wasn't its original intent.
> > I've included Jens on the cc list so he can comment on that score.
>
> I was actually already on the cc, though with my private mail address! I
> did read the patch this morning and initially thought it was a bad idea
> as well, but then I thought that perhaps it's not that different to view
> journal IO as a form of meta data to some extent.
>
> But still, putting any sort of value into the meta flag is a bad idea.
> It's assuming that it will get you some sort of extra guarantee, which
> isn't the case. If journal IO is that much more important than other IO,
> it should be prioritized explicitly. I'm not sure there's a good
> solution to this problem.

Exactly, the purpose here is to prioritize the dispatching of journal
IO requests in the IO controller. I may have used an inappropriate flag
or a quick&dirty solution, but without this, any cgroup/process that
generates a lot of journal activity may be throttled and cause other
cgroups/processes to be incorrectly blocked when they try to write to
disk.
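
To make the intent concrete, here is a toy sketch of the "account but do not
delay" behaviour described above; the bio layout, flag bit and statistics
structure are simplified stand-ins for illustration, not the kernel's
definitions:

#include <stdio.h>

#define FAKE_BIO_META	(1u << 0)	/* stands in for the BIO_RW_META bit */

struct fake_bio {
	unsigned int rw_flags;
	unsigned int size;
};

struct fake_cgroup_stats {
	unsigned long long accounted_bytes;
	unsigned long long delayed_bytes;
};

static void throttle_hook(struct fake_bio *bio, struct fake_cgroup_stats *st)
{
	st->accounted_bytes += bio->size;	/* always charge the cgroup */
	if (bio->rw_flags & FAKE_BIO_META)
		return;				/* journal/metadata: never delayed */
	st->delayed_bytes += bio->size;		/* normal IO: eligible for delay */
}

int main(void)
{
	struct fake_cgroup_stats st = { 0, 0 };
	struct fake_bio journal = { FAKE_BIO_META, 4096 };
	struct fake_bio data    = { 0, 4096 };

	throttle_hook(&journal, &st);
	throttle_hook(&data, &st);
	printf("accounted=%llu delayed=%llu\n", st.accounted_bytes, st.delayed_bytes);
	return 0;
}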

>
> > Secondly, there are many more locations than these which can end up
> > causing I/O which will ending up causing the journal commit to block
> > until they are completed. I've done a lot of work in the past few
> > weeks to make sure those writes get marked using BIO_RW_SYNC. In
> > data=ordered mode, the journal commit will block waiting for data
> > blocks to be written out, and that implies you really need to treat as
> > high priority all of the block writes that are marked with the
> > BIO_RW_SYNC flag.
> >
> > The flip side of this is it may end up making your I/O controller to
> > leaky; that is, someone might be able to evade your I/O controller's
> > attempt to impose limits by using fsync() all the time. This is a
> > hard problem, though, because filesystem I/O is almost always
> > intertwined.
> >
> > What sort of scenarios and workloads are you envisioning might use
> > this I/O controller? And can you say more about the specifics about
> > the priority inversion problem you are concerned about?
>
> I'm assuming it's the "usual" problem with lower priority IO getting
> access to fs exclusive data. It's quite trivial to cause problems with
> higher IO priority tasks then getting stuck waiting for the low priority
> process, since they also need to access that fs exclusive data.

Right. I thought about using the BIO_RW_SYNC flag instead, but as Ted
pointed out, some cgroups/processes might be able to evade the IO
control by issuing a lot of fsync()s. We could also limit the fsync() rate
in the IO controller, but that sounds like a dirty workaround...

>
> CFQ includes a vain attempt at boosting the priority of such a low
> priority process if that happens, see the get_fs_excl() stuff in
> lock_super(). reiserfs also marks the process as holding fs exclusive
> resources, but it was never added to any of the other file systems. But
> we could improve that situation. The file system is really the only one
> that can inform us of such an issue.

What about writeback IO? get_fs_excl() only refers to the current
process. At least for the cgroup io-throttle controller we can't delay
writeback requests that hold exclusive access to resources. For this reason,
encoding this information in the IO request (or better, using a flag in
struct bio) seems to me a better solution.

-Andrea

2009-04-17 17:45:21

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Tue, Apr 14, 2009 at 10:21:12PM +0200, Andrea Righi wrote:

[..]
> +4.2. Buffered I/O (write-back) tracking
> +
> +For buffered writes the scenario is a bit more complex, because the writes in
> +the page cache are processed asynchronously by kernel threads (pdflush), using
> +a write-back policy. So the real writes to the underlying block devices occur
> +in a different I/O context respect to the task that originally generated the
> +dirty pages.
> +
> +The I/O bandwidth controller uses the following solution to resolve this
> +problem.
> +
> +If the operation is a buffered write, we can charge the right cgroup looking at
> +the owner of the first page involved in the I/O operation, that gives the
> +context that generated the I/O activity at the source. This information can be
> +retrieved using the page_cgroup functionality originally provided by the cgroup
> +memory controller [4], and now provided specifically by the bio-cgroup
> +controller [5].
> +
> +In this way we can correctly account the I/O cost to the right cgroup, but we
> +cannot throttle the current task in this stage, because, in general, it is a
> +different task (e.g., pdflush that is processing asynchronously the dirty
> +page).
> +
> +For this reason, all the write-back requests that are not directly submitted by
> +the real owner and that need to be throttled are not dispatched immediately in
> +submit_bio(). Instead, they are added into an rbtree and processed
> +asynchronously by a dedicated kernel thread: kiothrottled.
> +

Hi Andrea,

I am trying to go through your patches now and also planning to test it
out. While reading the documentation, the async write handling interested
me. IIUC, it looks like you are throttling writes once they are being
written to the disk (either by pdflush or in the context of the process
because vm_dirty_ratio was crossed, etc.).

If that's the case, will a process not see an increased rate of writes
till we hit dirty_background_ratio?

Secondly, if the above gives acceptable performance results, then we
should be able to provide max bw control at IO scheduler level (along
with proportional bw control)?

So instead of doing max bw and proportional bw implementation in two
places with the help of different controllers, I think we can do it
with the help of one controller at one place.

Please do have a look at my patches also to figure out if that's possible
or not. I think it should be possible.

Keeping both in a single place should simplify things.

Thanks
Vivek

2009-04-17 22:09:44

by Andrea Righi

[permalink] [raw]
Subject: Re: Block I/O tracking (was Re: [PATCH 3/9] bio-cgroup controller)

On Fri, Apr 17, 2009 at 08:27:25PM +0900, Fernando Luis Vázquez Cao wrote:
> Ryo Tsuruta wrote:
>> Hi,
>>
>> From: KAMEZAWA Hiroyuki <[email protected]>
>> Date: Fri, 17 Apr 2009 11:24:33 +0900
>>
>>> On Fri, 17 Apr 2009 10:49:43 +0900
>>> Takuya Yoshikawa <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a few question.
>>>> - I have not yet fully understood how your controller are using
>>>> bio_cgroup. If my view is wrong please tell me.
>>>>
>>>> o In my view, bio_cgroup's implementation strongly depends on
>>>> page_cgroup's. Could you explain for what purpose this
>>>> functionality itself should be implemented as a cgroup subsystem?
>>>> Using page_cgroup and implementing tracking APIs is not enough?
>>> I'll definitely do "Nack" to add full bio-cgroup members to page_cgroup.
>>> Now, page_cgroup is 40bytes(in 64bit arch.) And all of them are allocated at
>>> boot time as memmap. (and add member to struct page is much harder ;)
>>>
>>> IIUC, feature for "tracking bio" is just necessary for pages for I/O.
>>> So, I think it's much better to add misc. information to struct bio not to the page.
>>> But, if people want to add "small hint" to struct page or struct page_cgroup
>>> for tracking buffered I/O, I'll give you help as much as I can.
>>> Maybe using "unused bits" in page_cgroup->flags is a choice with no overhead.
>>
>> In the case where the bio-cgroup data is allocated dynamically,
>> - Sometimes quite a large amount of memory get marked dirty.
>> In this case it requires more kernel memory than that of the
>> current implementation.
>> - The operation is expensive due to memory allocations and exclusive
>> controls such as spinlocks.
>>
>> In the case where the bio-cgroup data is allocated by delayed allocation,
>> - It makes the operation complicated and expensive, because sometimes a bio
>> has to be created in the context of other processes, such as aio and
>> swap-out operation.
>>
>> I'd prefer a simple and lightweight implementation. bio-cgroup only
>> needs 4bytes unlike memory controller. The reason why bio-cgroup chose
>> this approach is to minimize the overhead.
>
> Elaborating on Yoshikawa-san's comment, I would like to propose a
> generic I/O tracking mechanism that is not tied to all the cgroup
> paraphernalia. This approach has several advantages:
>
> - By using this functionality, existing I/O schedulers (well, some
> relatively minor changes would be needed) would be able to schedule
> buffered I/O properly.
>
> - The amount of memory consumed to do the tracking could be
> optimized according to the kernel configuration (do we really
> need struct page_cgroup when the cgroup memory controller or all
> of the cgroup infrastructure has been configured out?).
>
> The I/O tracking functionality would look something like the following:
>
> - Create an API to acquire the I/O context of a certain page, which is
> cgroup independent. For discussion purposes, I will assume that the
> I/O context of a page is the io_context of the task that dirtied the
> page (this can be changed if deemed necessary, though).
>
> - When cgroups are not being used, pages would be tracked using a
> pfn-indexed array of struct io_context (à la memcg's array of
> struct page_cgroup).

mmh... thinking in terms of io_context instead of task or cgroup. This
is not suitable for memcg anyway, which will still require the page_cgroup
infrastructure, at least for the per-cgroup LRU list I think. In any
case, as suggested by Kamezawa, we should do our best to reduce the size
of page_cgroup or any equivalent structure associated with every page
descriptor.

>
> - When cgroups are activated but the memory controller is not, we
> would have a pfn-indexed array of struct blkio_cgroup, which would
> have both a pointer to the corresponding io_context of the page and a
> reference to the cgroup it belongs to (most likely using css_id). The
> API offered by the I/O tracking mechanism would be extended so that
> the kernel can easily obtain not only the per-task io_context but also
> the cgroup a certain page belongs to. Please notice that by doing this
> we have all the information we need to schedule buffered I/O both at
> the cgroup-level and the task-level. From the memory usage point of
> view, memory controller-specific bits would be gone and to top it all
> we save one indirection level (since struct page_cgroup would be out
> of the picture).
>
> - When the memory controller is active we would have the
> pfn-indexed array of struct page_cgroup we have now plus a
> reference to the corresponding cgroup and io_context (yes, I
> still want to do proper scheduling of buffered I/O within a
> cgroup).

Have you considered the case where multiple cgroup subsystems (io-throttle,
memcg, etc.) want to use this feature at the same time? How would we store a
reference to many different cgroup subsystems?

>
> - Finally, since bio entering the block layer can generate additional
> bios it is necessary to pass the I/O context information of original
> bio down to the new bios. For that stacking devices such as dm and
> those of that ilk will have to be modified. To improve performance I/O
> context information would be cached in bios (to achieve this we have
> to ensure that all bios that enter the block layer have the right I/O
> context information attached to it).

This is a very interesting feature IMHO. AFAIK at the moment only
dm-ioband, due to its dm nature, is able to define rules for logical
devices (LVM, software RAID, etc.).

>
> Yoshikawa-san and myself have been working on a patch-set that
> implements just this and we have reached that point where the kernel
> does not panic right after booting:), so we will be sending patches soon
> (hopefully this weekend).

Good! curious to see this patchset ;).

Thanks,
-Andrea
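
A rough user-space sketch of the pfn-indexed tracking table idea discussed
above; the array layout, the 16-bit owner id and all names here are
assumptions made only for illustration:

#include <stdio.h>
#include <stdlib.h>

struct io_owner {
	unsigned short owner_id;	/* e.g. a css_id or io_context key */
};

static struct io_owner *owner_table;	/* one slot per page frame */
static unsigned long max_pfn_tracked;

static int owner_table_init(unsigned long max_pfn)
{
	owner_table = calloc(max_pfn, sizeof(*owner_table));
	if (!owner_table)
		return -1;
	max_pfn_tracked = max_pfn;
	return 0;
}

static struct io_owner *pfn_to_owner(unsigned long pfn)
{
	return pfn < max_pfn_tracked ? &owner_table[pfn] : NULL;
}

int main(void)
{
	if (owner_table_init(1 << 20))		/* pretend 1M page frames */
		return 1;
	pfn_to_owner(12345)->owner_id = 3;	/* tag page frame 12345 with owner 3 */
	printf("pfn 12345 owner: %u\n", pfn_to_owner(12345)->owner_id);
	free(owner_table);
	return 0;
}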

2009-04-17 23:12:59

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Fri, Apr 17, 2009 at 01:39:55PM -0400, Vivek Goyal wrote:
> On Tue, Apr 14, 2009 at 10:21:12PM +0200, Andrea Righi wrote:
>
> [..]
> > +4.2. Buffered I/O (write-back) tracking
> > +
> > +For buffered writes the scenario is a bit more complex, because the writes in
> > +the page cache are processed asynchronously by kernel threads (pdflush), using
> > +a write-back policy. So the real writes to the underlying block devices occur
> > +in a different I/O context respect to the task that originally generated the
> > +dirty pages.
> > +
> > +The I/O bandwidth controller uses the following solution to resolve this
> > +problem.
> > +
> > +If the operation is a buffered write, we can charge the right cgroup looking at
> > +the owner of the first page involved in the I/O operation, that gives the
> > +context that generated the I/O activity at the source. This information can be
> > +retrieved using the page_cgroup functionality originally provided by the cgroup
> > +memory controller [4], and now provided specifically by the bio-cgroup
> > +controller [5].
> > +
> > +In this way we can correctly account the I/O cost to the right cgroup, but we
> > +cannot throttle the current task in this stage, because, in general, it is a
> > +different task (e.g., pdflush that is processing asynchronously the dirty
> > +page).
> > +
> > +For this reason, all the write-back requests that are not directly submitted by
> > +the real owner and that need to be throttled are not dispatched immediately in
> > +submit_bio(). Instead, they are added into an rbtree and processed
> > +asynchronously by a dedicated kernel thread: kiothrottled.
> > +
>
> Hi Andrea,

Hi Vivek,

>
> I am trying to go through your patches now and also planning to test it

thanks for trying to test first of all.

> out. While reading the documentation async write handling interested
> me. IIUC, looks like you are throttling writes once they are being
> written to the disk (either by pdflush or in the context of the process
> because vm_dirty_ratio crossed etc).

Correct, more exactly in submit_bio().

The difference between synchronous IO and writeback IO is that in the
first case the task itself is throttled via schedule_timeout_killable();
in the second case pdflush is never throttled; the IO requests instead
are simply added into an rbtree and dispatched asynchronously by another
kernel thread (kiothrottled) using an EDF-like scheduling policy. More
exactly, a deadline is evaluated for each writeback IO request looking at
the cgroup BW and iops/sec limits, then kiothrottled periodically selects
and dispatches the requests with an elapsed deadline.
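
A toy sketch of that deadline computation: each queued writeback request is
assigned a dispatch time derived from the cgroup bandwidth limit, and the
request with the earliest elapsed deadline goes out first. The names, units
and numbers below are illustrative assumptions, not the kiothrottled code:

#include <stdio.h>

struct wb_request {
	unsigned long long bytes;
	unsigned long long deadline_us;
};

/* Token-bucket style: the next request may be dispatched bytes/bw seconds
 * after the previous deadline (or after "now", whichever is later). */
static unsigned long long next_deadline(unsigned long long prev_deadline_us,
					unsigned long long now_us,
					unsigned long long bytes,
					unsigned long long bw_bytes_per_sec)
{
	unsigned long long start = prev_deadline_us > now_us ? prev_deadline_us : now_us;
	return start + (bytes * 1000000ULL) / bw_bytes_per_sec;
}

int main(void)
{
	unsigned long long now = 0, prev = 0, bw = 1 << 20;	/* 1 MiB/s limit */
	struct wb_request reqs[3] = { { 262144, 0 }, { 262144, 0 }, { 524288, 0 } };

	for (int i = 0; i < 3; i++) {
		reqs[i].deadline_us = next_deadline(prev, now, reqs[i].bytes, bw);
		prev = reqs[i].deadline_us;
		printf("req %d: %llu bytes, deadline %llu us\n",
		       i, reqs[i].bytes, reqs[i].deadline_us);
	}
	return 0;
}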

>
> If that's the case, will a process not see an increased rate of writes
> till we are not hitting dirty_background_ratio?

Correct. And this is a good behaviour IMHO. At the same time we have a
smooth BW usage (according to the cgroup limits I mean) even in the
presence of writeback IO only.

>
> Secondly, if above is giving acceptable performance results, then we
> should be able to provide max bw control at IO scheduler level (along
> with proportional bw control)?
>
> So instead of doing max bw and proportional bw implementation in two
> places with the help of different controllers, I think we can do it
> with the help of one controller at one place.
>
> Please do have a look at my patches also to figure out if that's possible
> or not. I think it should be possible.
>
> Keeping both at single place should simplify the things.

Absolutely agree to do both proportional and max BW limiting in a single
place. I still need to figure out which is the best place: the IO
scheduler in the elevator, or where the IO requests are submitted. A natural
way IMHO is to control the submission of requests; Andrew also seemed to
be convinced about this approach. Anyway, I've already scheduled to test
your patchset and I'd like to see if it's possible to merge our work,
or select the best from our patchsets.

Thanks!
-Andrea

2009-04-19 13:47:14

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Sat, Apr 18, 2009 at 01:12:45AM +0200, Andrea Righi wrote:
> On Fri, Apr 17, 2009 at 01:39:55PM -0400, Vivek Goyal wrote:
> > On Tue, Apr 14, 2009 at 10:21:12PM +0200, Andrea Righi wrote:
> >
> > [..]
> > > +4.2. Buffered I/O (write-back) tracking
> > > +
> > > +For buffered writes the scenario is a bit more complex, because the writes in
> > > +the page cache are processed asynchronously by kernel threads (pdflush), using
> > > +a write-back policy. So the real writes to the underlying block devices occur
> > > +in a different I/O context respect to the task that originally generated the
> > > +dirty pages.
> > > +
> > > +The I/O bandwidth controller uses the following solution to resolve this
> > > +problem.
> > > +
> > > +If the operation is a buffered write, we can charge the right cgroup looking at
> > > +the owner of the first page involved in the I/O operation, that gives the
> > > +context that generated the I/O activity at the source. This information can be
> > > +retrieved using the page_cgroup functionality originally provided by the cgroup
> > > +memory controller [4], and now provided specifically by the bio-cgroup
> > > +controller [5].
> > > +
> > > +In this way we can correctly account the I/O cost to the right cgroup, but we
> > > +cannot throttle the current task in this stage, because, in general, it is a
> > > +different task (e.g., pdflush that is processing asynchronously the dirty
> > > +page).
> > > +
> > > +For this reason, all the write-back requests that are not directly submitted by
> > > +the real owner and that need to be throttled are not dispatched immediately in
> > > +submit_bio(). Instead, they are added into an rbtree and processed
> > > +asynchronously by a dedicated kernel thread: kiothrottled.
> > > +
> >
> > Hi Andrea,
>
> Hi Vivek,
>
> >
> > I am trying to go through your patches now and also planning to test it
>
> thanks for trying to test first of all.
>
> > out. While reading the documentation async write handling interested
> > me. IIUC, looks like you are throttling writes once they are being
> > written to the disk (either by pdflush or in the context of the process
> > because vm_dirty_ratio crossed etc).
>
> Correct, more exactly in submit_bio().
>
> The difference between synchronous IO and writeback IO is that in the
> first case the task itself is throttled via schedule_timeout_killable();
> in the second case pdflush is never throttled, the IO requests instead
> are simply added into a rbtree and dispatched asynchronously by another
> kernel thread (kiothrottled) using a EDF-like scheduling. More exactly,
> a deadline is evaluated for each writeback IO request looking at the
> cgroup BW and iops/sec limits, then kiothrottled periodically selects
> and dispatches the requests with an elapsed deadline.
>

Ok, I will look into the logic of translating cgroup BW limits into
deadlines. But as Nauman pointed out, we will probably run into
issues with tasks within a cgroup as we lose the notion of class and prio.

> >
> > If that's the case, will a process not see an increased rate of writes
> > till we are not hitting dirty_background_ratio?
>
> Correct. And this is a good behaviour IMHO. At the same time we have a
> smooth BW usage (according to the cgroup limits I mean) even in presence
> of writeback IO only.
>

Hmm.., I am not able to understand this. The very fact that you will see
a high rate of async writes (more than specified by the cgroup max BW) till
you hit dirty_background_ratio, isn't that against the goals of the max bw
controller? You wanted to see a consistent view of the rate even if spare BW
is available, and this scenario goes against that?

Think of a hypothetical configuration of 10G RAM with dirty ratio set to,
say, 20%. Assume not much writeout is taking place in the system.
So for the first 2G of writes, the application will be able to write at CPU
speed, no throttling will kick in, and a cgroup will easily cross its
max BW?

> >
> > Secondly, if above is giving acceptable performance results, then we
> > should be able to provide max bw control at IO scheduler level (along
> > with proportional bw control)?
> >
> > So instead of doing max bw and proportional bw implementation in two
> > places with the help of different controllers, I think we can do it
> > with the help of one controller at one place.
> >
> > Please do have a look at my patches also to figure out if that's possible
> > or not. I think it should be possible.
> >
> > Keeping both at single place should simplify the things.
>
> Absolutely agree to do both proportional and max BW limiting in a single
> place. I still need to figure which is the best place, if the IO
> scheduler in the elevator, when the IO requests are submitted. A natural
> way IMHO is to control the submission of requests, also Andrew seemed to
> be convinced about this approach. Anyway, I've already scheduled to test
> your patchset and I'd like to see if it's possible to merge our works,
> or select the best from ours patchsets.
>

Are we not already controlling submission of requests (at a crude level)?
If an application is doing writeout at a high rate, then it hits
vm_dirty_ratio and is forced to do writeout itself, hence it is slowed
down and is not allowed to submit writes at a high rate.

It's just not a very fair scheme right now, as during writeout
a high prio/high weight cgroup application can start writing out some
other cgroups' pages.

For this we probably need to have some combination of solutions, like a
per-cgroup upper limit on dirty pages. Secondly, if an application
is slowed down because of hitting vm_dirty_ratio, it should probably try to
write out the inode it is dirtying first instead of picking any random
inode and associated pages. This will ensure that a high weight
application can quickly get through the writeouts and see higher
throughput from the disk.

Thanks
Vivek

2009-04-19 13:57:51

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Sat, Apr 18, 2009 at 01:12:45AM +0200, Andrea Righi wrote:
> On Fri, Apr 17, 2009 at 01:39:55PM -0400, Vivek Goyal wrote:
> > On Tue, Apr 14, 2009 at 10:21:12PM +0200, Andrea Righi wrote:
> >
> > [..]
> > > +4.2. Buffered I/O (write-back) tracking
> > > +
> > > +For buffered writes the scenario is a bit more complex, because the writes in
> > > +the page cache are processed asynchronously by kernel threads (pdflush), using
> > > +a write-back policy. So the real writes to the underlying block devices occur
> > > +in a different I/O context respect to the task that originally generated the
> > > +dirty pages.
> > > +
> > > +The I/O bandwidth controller uses the following solution to resolve this
> > > +problem.
> > > +
> > > +If the operation is a buffered write, we can charge the right cgroup looking at
> > > +the owner of the first page involved in the I/O operation, that gives the
> > > +context that generated the I/O activity at the source. This information can be
> > > +retrieved using the page_cgroup functionality originally provided by the cgroup
> > > +memory controller [4], and now provided specifically by the bio-cgroup
> > > +controller [5].
> > > +
> > > +In this way we can correctly account the I/O cost to the right cgroup, but we
> > > +cannot throttle the current task in this stage, because, in general, it is a
> > > +different task (e.g., pdflush that is processing asynchronously the dirty
> > > +page).
> > > +
> > > +For this reason, all the write-back requests that are not directly submitted by
> > > +the real owner and that need to be throttled are not dispatched immediately in
> > > +submit_bio(). Instead, they are added into an rbtree and processed
> > > +asynchronously by a dedicated kernel thread: kiothrottled.
> > > +
> >
> > Hi Andrea,
>
> Hi Vivek,
>
> >
> > I am trying to go through your patches now and also planning to test it
>
> thanks for trying to test first of all.
>
> > out. While reading the documentation async write handling interested
> > me. IIUC, looks like you are throttling writes once they are being
> > written to the disk (either by pdflush or in the context of the process
> > because vm_dirty_ratio crossed etc).
>
> Correct, more exactly in submit_bio().
>
> The difference between synchronous IO and writeback IO is that in the
> first case the task itself is throttled via schedule_timeout_killable();
> in the second case pdflush is never throttled, the IO requests instead
> are simply added into a rbtree and dispatched asynchronously by another
> kernel thread (kiothrottled) using a EDF-like scheduling. More exactly,
> a deadline is evaluated for each writeback IO request looking at the
> cgroup BW and iops/sec limits, then kiothrottled periodically selects
> and dispatches the requests with an elapsed deadline.
>
> >
> > If that's the case, will a process not see an increased rate of writes
> > till we are not hitting dirty_background_ratio?
>
> Correct. And this is a good behaviour IMHO. At the same time we have a
> smooth BW usage (according to the cgroup limits I mean) even in presence
> of writeback IO only.
>
> >
> > Secondly, if above is giving acceptable performance results, then we
> > should be able to provide max bw control at IO scheduler level (along
> > with proportional bw control)?
> >
> > So instead of doing max bw and proportional bw implementation in two
> > places with the help of different controllers, I think we can do it
> > with the help of one controller at one place.
> >
> > Please do have a look at my patches also to figure out if that's possible
> > or not. I think it should be possible.
> >
> > Keeping both at single place should simplify the things.
>
> Absolutely agree to do both proportional and max BW limiting in a single
> place. I still need to figure which is the best place, if the IO
> scheduler in the elevator, when the IO requests are submitted. A natural
> way IMHO is to control the submission of requests, also Andrew seemed to
> be convinced about this approach. Anyway, I've already scheduled to test
> your patchset and I'd like to see if it's possible to merge our works,
> or select the best from ours patchsets.
>

Hmm..., thinking more about it, I am reminded of one problem with
doing it at the IO scheduler level. It's very hard to provide max bw control
at intermediate logical devices (e.g., software raid configurations).

Thanks
Vivek

2009-04-19 15:47:35

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Sun, Apr 19, 2009 at 09:42:01AM -0400, Vivek Goyal wrote:
> > The difference between synchronous IO and writeback IO is that in the
> > first case the task itself is throttled via schedule_timeout_killable();
> > in the second case pdflush is never throttled, the IO requests instead
> > are simply added into a rbtree and dispatched asynchronously by another
> > kernel thread (kiothrottled) using a EDF-like scheduling. More exactly,
> > a deadline is evaluated for each writeback IO request looking at the
> > cgroup BW and iops/sec limits, then kiothrottled periodically selects
> > and dispatches the requests with an elapsed deadline.
> >
>
> Ok, i will look into the logic of translating cgroup BW limits into
> deadline. But as Nauman pointed out that we probably will run into
> issues of tasks with in cgroup as we loose that notion of class and prio.

Correct. I've not addressed the IO class and priority inside a cgroup, and
there is a lot of room for optimizations and tuning of this in the
io-throttle controller. In the current implementation the delay is only
imposed on the first task that hits the BW limit. This is not fair at
all.

Ideally the throttling should be distributed equally among the tasks
within the same cgroup that exhaust the available BW. By equally I
mean depending on a function of the previously generated IO, class and IO
priority.
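
As a toy illustration of that idea, a cgroup's total delay could be split
among its tasks in proportion to the IO each of them recently generated;
all names and numbers below are invented for illustration:

#include <stdio.h>

struct task_io {
	const char *name;
	unsigned long long recent_bytes;	/* IO generated in the last window */
};

int main(void)
{
	struct task_io tasks[] = { { "dd", 8 << 20 }, { "rsync", 2 << 20 } };
	unsigned long long total = 0, total_sleep_ms = 500;

	for (int i = 0; i < 2; i++)
		total += tasks[i].recent_bytes;
	for (int i = 0; i < 2; i++)		/* heavier writers sleep longer */
		printf("%s sleeps %llu ms\n", tasks[i].name,
		       total_sleep_ms * tasks[i].recent_bytes / total);
	return 0;
}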

The same concept of fairness (for ioprio and class) will be reflected to
the underlying IO scheduler (only CFQ at the moment) for the requests
that passed the BW limits.

This doesn't seem a bad idea, well.. at least in theory... :) Do you see
any evident weak points, or motivations to move in another direction?

>
> > >
> > > If that's the case, will a process not see an increased rate of writes
> > > till we are not hitting dirty_background_ratio?
> >
> > Correct. And this is a good behaviour IMHO. At the same time we have a
> > smooth BW usage (according to the cgroup limits I mean) even in presence
> > of writeback IO only.
> >
>
> Hmm.., I am not able to understand this. The very fact that you will see
> a high rate of async writes (more than specified by cgroup max BW), till
> you hit dirty_background_ratio, isn't it against the goals of max bw
> controller? You wanted to see a consistent view of rate even if spare BW
> is available, and this scenario goes against that?

The goal of the io-throttle controller is to guarantee a constant BW for
the IO to the block devices. If you write data in cache, buffers, etc.
you shouldn't be affected by any IO limitation, but you will be when the
data is written out to the disk.

OTOH if an application needs a predictable IO BW, we can always set a
max limit and use direct IO.

>
> Think of an hypothetical configuration of 10G RAM with dirty ratio say
> set to 20%. Assume not much of write out is taking place in the system.
> So for first 2G of writes, application will be able to write it at cpu
> speed and no throttling will kick in and a cgroup will easily cross it
> max BW?

Yes.

>
> > >
> > > Secondly, if above is giving acceptable performance results, then we
> > > should be able to provide max bw control at IO scheduler level (along
> > > with proportional bw control)?
> > >
> > > So instead of doing max bw and proportional bw implementation in two
> > > places with the help of different controllers, I think we can do it
> > > with the help of one controller at one place.
> > >
> > > Please do have a look at my patches also to figure out if that's possible
> > > or not. I think it should be possible.
> > >
> > > Keeping both at single place should simplify the things.
> >
> > Absolutely agree to do both proportional and max BW limiting in a single
> > place. I still need to figure which is the best place, if the IO
> > scheduler in the elevator, when the IO requests are submitted. A natural
> > way IMHO is to control the submission of requests, also Andrew seemed to
> > be convinced about this approach. Anyway, I've already scheduled to test
> > your patchset and I'd like to see if it's possible to merge our works,
> > or select the best from ours patchsets.
> >
>
> Are we not already controlling submission of request (at crude level).
> If application is doing writeout at high rate, then it hits vm_dirty_ratio
> hits and this application is forced to do write out and hence it is slowed
> down and is not allowed to submit writes at high rate.
>
> Just that it is not a very fair scheme right now as during write out
> a high prio/high weight cgroup application can start writing out some
> other cgroups' pages.
>
> For this we probably need to have some combination of solutions like
> per cgroup upper limit on dirty pages. Secondly probably if an application
> is slowed down because of hitting vm_dirty_ratio, it should try to
> write out the inode it is dirtying first instead of picking any random
> inode and associated pages. This will ensure that a high weight
> application can quickly get through the write outs and see higher
> throughput from the disk.

For the first, I submitted a patchset some months ago to provide this
feature in the memory controller:

https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html

We focused on the best interface to use for setting the dirty pages
limit, but we didn't finalize it. I can rework that and repost an
updated version. Now that we have the dirty_ratio/dirty_bytes to set the
global limit I think we can use the same interface and the same semantic
within the cgroup fs, something like:

memory.dirty_ratio
memory.dirty_bytes

For the second point something like this should be enough to force tasks
to write out only the inode they're actually dirtying when they hit the
vm_dirty_ratio limit. But it should be tested carefully and may cause
heavy performance regressions.

Signed-off-by: Andrea Righi <[email protected]>
---
mm/page-writeback.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2630937..1e07c9d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -543,7 +543,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		 * been flushed to permanent storage.
 		 */
 		if (bdi_nr_reclaimable) {
-			writeback_inodes(&wbc);
+			sync_inode(mapping->host, &wbc);
 			pages_written += write_chunk - wbc.nr_to_write;
 			get_dirty_limits(&background_thresh, &dirty_thresh,
 					&bdi_thresh, bdi);

2009-04-20 09:38:26

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

Hi Andrea,

> Implementing bio-cgroup functionality as pure infrastructure framework
> instead of a cgroup subsystem would remove all this oddity and
> complexity.
>
> For example, the actual functionality that I need for the io-throttle
> controller is just an interface to set and get the cgroup owner of a
> page. I think it should be the same also for other potential users of
> bio-cgroup.
>
> So, what about implementing the bio-cgroup functionality as cgroup "page
> tracking" infrastructure and provide the following interfaces:
>
> /*
> * Encode the cgrp->css.id in page_group->flags
> */
> void set_cgroup_page_owner(struct page *page, struct cgroup *cgrp);
>
> /*
> * Returns the cgroup owner of a page, decoding the cgroup id from
> * page_cgroup->flags.
> */
> struct cgroup *get_cgroup_page_owner(struct page *page);
>
> This also wouldn't increase the size of page_cgroup because we can
> encode the cgroup id in the unused bits of page_cgroup->flags, as
> originally suggested by Kame.
>
> And I think it could be used also by dm-ioband, even if it's not a
> cgroup-based subsystem... but I may be wrong. Ryo what's your opinion?

I looked at your page_cgroup patch in io-throttle v14; it can also be used
by dm-ioband. But I'd like to eliminate lock_page_cgroup() to minimize
overhead. I'll rearrange the bio-cgroup patch according to these functions.

Thanks,
Ryo Tsuruta

2009-04-20 11:35:54

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

Hi Balbir,

> > diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> > index 012f065..ef8cac0 100644
> > --- a/block/blk-ioc.c
> > +++ b/block/blk-ioc.c
> > @@ -84,24 +84,28 @@ void exit_io_context(void)
> > }
> > }
> >
> > +void init_io_context(struct io_context *ioc)
> > +{
> > + atomic_set(&ioc->refcount, 1);
> > + atomic_set(&ioc->nr_tasks, 1);
> > + spin_lock_init(&ioc->lock);
> > + ioc->ioprio_changed = 0;
> > + ioc->ioprio = 0;
> > + ioc->last_waited = jiffies; /* doesn't matter... */
> > + ioc->nr_batch_requests = 0; /* because this is 0 */
> > + ioc->aic = NULL;
> > + INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
> > + INIT_HLIST_HEAD(&ioc->cic_list);
> > + ioc->ioc_data = NULL;
> > +}
> > +
> > struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
> > {
> > struct io_context *ret;
> >
> > ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
> > - if (ret) {
> > - atomic_set(&ret->refcount, 1);
> > - atomic_set(&ret->nr_tasks, 1);
> > - spin_lock_init(&ret->lock);
> > - ret->ioprio_changed = 0;
> > - ret->ioprio = 0;
> > - ret->last_waited = jiffies; /* doesn't matter... */
> > - ret->nr_batch_requests = 0; /* because this is 0 */
> > - ret->aic = NULL;
> > - INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
> > - INIT_HLIST_HEAD(&ret->cic_list);
> > - ret->ioc_data = NULL;
> > - }
> > + if (ret)
> > + init_io_context(ret);
> >
>
> Can you split this part of the patch out as a refactoring patch?

Yes, I'll do it.

> > return ret;
> > }
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 13edf7a..bc72150 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -36,6 +36,7 @@
> > #include <linux/buffer_head.h>
> > #include <linux/task_io_accounting_ops.h>
> > #include <linux/bio.h>
> > +#include <linux/biotrack.h>
> > #include <linux/notifier.h>
> > #include <linux/cpu.h>
> > #include <linux/bitops.h>
> > @@ -655,6 +656,7 @@ static void __set_page_dirty(struct page *page,
> > if (page->mapping) { /* Race with truncate? */
> > WARN_ON_ONCE(warn && !PageUptodate(page));
> > account_page_dirtied(page, mapping);
> > + bio_cgroup_reset_owner_pagedirty(page, current->mm);
> > radix_tree_tag_set(&mapping->page_tree,
> > page_index(page), PAGECACHE_TAG_DIRTY);
> > }
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index da258e7..ec42362 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -33,6 +33,7 @@
> > #include <linux/err.h>
> > #include <linux/blkdev.h>
> > #include <linux/buffer_head.h>
> > +#include <linux/biotrack.h>
> > #include <linux/rwsem.h>
> > #include <linux/uio.h>
> > #include <asm/atomic.h>
> > @@ -799,6 +800,7 @@ static int do_direct_IO(struct dio *dio)
> > ret = PTR_ERR(page);
> > goto out;
> > }
> > + bio_cgroup_reset_owner(page, current->mm);
> >
> > while (block_in_page < blocks_per_page) {
> > unsigned offset_in_page = block_in_page << blkbits;
> > diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
> > new file mode 100644
> > index 0000000..25b8810
> > --- /dev/null
> > +++ b/include/linux/biotrack.h
> > @@ -0,0 +1,95 @@
> > +#include <linux/cgroup.h>
> > +#include <linux/mm.h>
> > +#include <linux/page_cgroup.h>
> > +
> > +#ifndef _LINUX_BIOTRACK_H
> > +#define _LINUX_BIOTRACK_H
> > +
> > +#ifdef CONFIG_CGROUP_BIO
> > +
> > +struct tsk_move_msg {
> > + int old_id;
> > + int new_id;
> > + struct task_struct *tsk;
> > +};
> > +
> > +extern int register_biocgroup_notifier(struct notifier_block *nb);
> > +extern int unregister_biocgroup_notifier(struct notifier_block *nb);
> > +
> > +struct io_context;
> > +struct block_device;
> > +
> > +struct bio_cgroup {
> > + struct cgroup_subsys_state css;
> > + int id;
>
> Can't css_id be used here?

The latest patch has already done it.

> > + struct io_context *io_context; /* default io_context */
> > +/* struct radix_tree_root io_context_root; per device io_context */
>
> Commented out code? Do you want to remove this.

No. This is sample code for setting io_contexts per cgroup.

> Comments? Docbook style would be nice for the functions below.

O.K. I'll do it.

> > struct page_cgroup {
> > unsigned long flags;
> > - struct mem_cgroup *mem_cgroup;
> > struct page *page;
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > + struct mem_cgroup *mem_cgroup;
> > +#endif
> > +#ifdef CONFIG_CGROUP_BIO
> > + int bio_cgroup_id;
> > +#endif
> > +#if defined(CONFIG_CGROUP_MEM_RES_CTLR) || defined(CONFIG_CGROUP_BIO)
> > struct list_head lru; /* per cgroup LRU list */
>
> Do we need the #if defined clause? Anyone using page_cgroup, but not
> list_head LRU needs to be explicitly covered when they come up.

How about adding an option like "CONFIG_CGROUP_PAGE_USE_LRU"?

> > +/*
> > + * This function is used to make a given page have the bio-cgroup id of
> > + * the owner of this page.
> > + */
> > +void bio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
> > +{
> > + struct bio_cgroup *biog;
> > + struct page_cgroup *pc;
> > +
> > + if (bio_cgroup_disabled())
> > + return;
> > + pc = lookup_page_cgroup(page);
> > + if (unlikely(!pc))
> > + return;
> > +
> > + pc->bio_cgroup_id = 0; /* 0: default bio_cgroup id */
> > + if (!mm)
> > + return;
>
>
> Is this routine called with lock_page_cgroup() taken? Otherwise
> what protects pc->bio_cgroup_id?

pc->bio_cgroup_id can be updated without any locks because an integer
variable can be assigned a new value atomically on modern CPUs.

> > + /*
> > + * css_get(&bio->css) isn't called to increment the reference
> > + * count of this bio_cgroup "biog" so pc->bio_cgroup_id might turn
> > + * invalid even if this page is still active.
> > + * This approach is chosen to minimize the overhead.
> > + */
> > + pc->bio_cgroup_id = biog->id;
>
> What happens if, without the ref count increase, we delete the cgroup or css?

I know that the same ID could be reused, but it is not a big
problem because it is recovered slowly.
I think that a mechanism which makes it difficult to reuse the same
ID should be implemented.

> > +/*
> > + * Change the owner of a given page. This function is only effective for
> > + * pages in the pagecache.
>
> Could you clarify pagecache? mapped/unmapped or both?

This function is effective for both mapped and unmapped pages. I'll
write it in the comment.

> > + */
> > +void bio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
> > +{
> > + if (PageSwapCache(page) || PageAnon(page))
> > + return;
>
> Look at page_is_file_cache() depending on the answer above

Thank you for this information. I'll use it.

> > +/*
> > + * Assign "page" the same owner as "opage."
> > + */
> > +void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
> > +{
> > + struct page_cgroup *npc, *opc;
> > +
> > + if (bio_cgroup_disabled())
> > + return;
> > + npc = lookup_page_cgroup(npage);
> > + if (unlikely(!npc))
> > + return;
> > + opc = lookup_page_cgroup(opage);
> > + if (unlikely(!opc))
> > + return;
> > +
> > + /*
> > + * Do this without any locks. The reason is the same as
> > + * bio_cgroup_reset_owner().
> > + */
> > + npc->bio_cgroup_id = opc->bio_cgroup_id;
>
> What protects npc and opc?

For the same reason mentioned above, bio_cgroup_id can be updated
without any locks, and npc and opc always point to page_cgroups.
An integer variable can be assigned a new value atomically on a system
which can use RCU locking.

Thanks,
Ryo Tsuruta

2009-04-20 14:57:21

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

On Mon, Apr 20, 2009 at 08:35:40PM +0900, Ryo Tsuruta wrote:
> > > +/*
> > > + * Assign "page" the same owner as "opage."
> > > + */
> > > +void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
> > > +{
> > > + struct page_cgroup *npc, *opc;
> > > +
> > > + if (bio_cgroup_disabled())
> > > + return;
> > > + npc = lookup_page_cgroup(npage);
> > > + if (unlikely(!npc))
> > > + return;
> > > + opc = lookup_page_cgroup(opage);
> > > + if (unlikely(!opc))
> > > + return;
> > > +
> > > + /*
> > > + * Do this without any locks. The reason is the same as
> > > + * bio_cgroup_reset_owner().
> > > + */
> > > + npc->bio_cgroup_id = opc->bio_cgroup_id;
> >
> > What protects npc and opc?
>
> As the same reason mentioned above, bio_cgroup_id can be updated
> without any locks, and npc and opc always point to page_cgroups.
> An integer variable can be set a new value at once on a system which
> can use RCU lock.

mmmh... I'm not sure about this. Actually you read opc->bio_cgroup_id
first and then write to npc->bio_cgroup_id, so the pair is not atomic at all.
So, you can read or set a wrong ID, but at least it should always be
consistent (the single read or write itself is atomic).

-Andrea
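
A tiny illustration of the point above: the copy is two independent
word-sized accesses, each atomic on its own, so the stored value is never
torn but it can already be stale by the time it is written. The struct and
names below are invented for illustration:

#include <stdio.h>

struct fake_pc { int bio_cgroup_id; };

static void copy_owner(struct fake_pc *npc, const struct fake_pc *opc)
{
	int id = opc->bio_cgroup_id;	/* single word load: never torn */
	npc->bio_cgroup_id = id;	/* single word store: never torn */
	/* Between the load and the store another CPU may have changed
	 * opc->bio_cgroup_id, so npc can end up with a stale (yet valid) id. */
}

int main(void)
{
	struct fake_pc oldpc = { .bio_cgroup_id = 7 }, newpc = { 0 };

	copy_owner(&newpc, &oldpc);
	printf("copied id: %d\n", newpc.bio_cgroup_id);
	return 0;
}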

2009-04-20 15:01:16

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Mon, Apr 20, 2009 at 06:38:15PM +0900, Ryo Tsuruta wrote:
> Hi Andrea,
>
> > Implementing bio-cgroup functionality as pure infrastructure framework
> > instead of a cgroup subsystem would remove all this oddity and
> > complexity.
> >
> > For example, the actual functionality that I need for the io-throttle
> > controller is just an interface to set and get the cgroup owner of a
> > page. I think it should be the same also for other potential users of
> > bio-cgroup.
> >
> > So, what about implementing the bio-cgroup functionality as cgroup "page
> > tracking" infrastructure and provide the following interfaces:
> >
> > /*
> > * Encode the cgrp->css.id in page_group->flags
> > */
> > void set_cgroup_page_owner(struct page *page, struct cgroup *cgrp);
> >
> > /*
> > * Returns the cgroup owner of a page, decoding the cgroup id from
> > * page_cgroup->flags.
> > */
> > struct cgroup *get_cgroup_page_owner(struct page *page);
> >
> > This also wouldn't increase the size of page_cgroup because we can
> > encode the cgroup id in the unused bits of page_cgroup->flags, as
> > originally suggested by Kame.
> >
> > And I think it could be used also by dm-ioband, even if it's not a
> > cgroup-based subsystem... but I may be wrong. Ryo what's your opinion?
>
> I looked your page_cgroup patch in io-throttle v14, It can also be used
> by dm-ioband. But I'd like to eliminate lock_page_cgroup() to minimize
> overhead. I'll rearrange the bio-cgroup patch according to the functions.

It would be great! Anyway, I don't think it's a trivial task to
completely remove lock_page_cgroup(), especially if we decide to encode
the ID in page_cgroup->flags.

Thanks,
-Andrea

2009-04-20 21:33:48

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Sun, Apr 19, 2009 at 05:47:18PM +0200, Andrea Righi wrote:
> On Sun, Apr 19, 2009 at 09:42:01AM -0400, Vivek Goyal wrote:
> > > The difference between synchronous IO and writeback IO is that in the
> > > first case the task itself is throttled via schedule_timeout_killable();
> > > in the second case pdflush is never throttled, the IO requests instead
> > > are simply added into a rbtree and dispatched asynchronously by another
> > > kernel thread (kiothrottled) using a EDF-like scheduling. More exactly,
> > > a deadline is evaluated for each writeback IO request looking at the
> > > cgroup BW and iops/sec limits, then kiothrottled periodically selects
> > > and dispatches the requests with an elapsed deadline.
> > >
> >
> > Ok, i will look into the logic of translating cgroup BW limits into
> > deadline. But as Nauman pointed out that we probably will run into
> > issues of tasks with in cgroup as we loose that notion of class and prio.
>
> Correct. I've not addressed the IO class and priority inside cgroup, and
> there is a lot of space for optimizations and tunings for this in the
> io-throttle controller. In the current implementation the delay is only
> imposed to the first task that hits the BW limit. This is not fair at
> all.
>
> Ideally the throttling should be distributed equally among the tasks
> within the same cgroup that exhaust the available BW. With equally I
> mean depending of a function of the previous generated IO, class and IO
> priority.
>
> The same concept of fairness (for ioprio and class) will be reflected to
> the underlying IO scheduler (only CFQ at the moment) for the requests
> that passed the BW limits.
>
> This doesn't seem a bad idea, well.. at least in theory... :) Do you see
> evident weak points? or motivations to move to another direction?
>
> >
> > > >
> > > > If that's the case, will a process not see an increased rate of writes
> > > > till we are not hitting dirty_background_ratio?
> > >
> > > Correct. And this is a good behaviour IMHO. At the same time we have a
> > > smooth BW usage (according to the cgroup limits I mean) even in presence
> > > of writeback IO only.
> > >
> >
> > Hmm.., I am not able to understand this. The very fact that you will see
> > a high rate of async writes (more than specified by cgroup max BW), till
> > you hit dirty_background_ratio, isn't it against the goals of max bw
> > controller? You wanted to see a consistent view of rate even if spare BW
> > is available, and this scenario goes against that?
>
> The goal of the io-throttle controller is to guarantee a constant BW for
> the IO to the block devices. If you write data in cache, buffers, etc.
> you shouldn't be affected by any IO limitation, but you will be when the
> data is written out to the disk.
>
> OTOH if an application needs a predictable IO BW, we can always set a
> max limit and use direct IO.
>
> >
> > Think of an hypothetical configuration of 10G RAM with dirty ratio say
> > set to 20%. Assume not much of write out is taking place in the system.
> > So for first 2G of writes, application will be able to write it at cpu
> > speed and no throttling will kick in and a cgroup will easily cross it
> > max BW?
>
> Yes.
>
> >
> > > >
> > > > Secondly, if above is giving acceptable performance results, then we
> > > > should be able to provide max bw control at IO scheduler level (along
> > > > with proportional bw control)?
> > > >
> > > > So instead of doing max bw and proportional bw implementation in two
> > > > places with the help of different controllers, I think we can do it
> > > > with the help of one controller at one place.
> > > >
> > > > Please do have a look at my patches also to figure out if that's possible
> > > > or not. I think it should be possible.
> > > >
> > > > Keeping both at single place should simplify the things.
> > >
> > > Absolutely agree to do both proportional and max BW limiting in a single
> > > place. I still need to figure which is the best place, if the IO
> > > scheduler in the elevator, when the IO requests are submitted. A natural
> > > way IMHO is to control the submission of requests, also Andrew seemed to
> > > be convinced about this approach. Anyway, I've already scheduled to test
> > > your patchset and I'd like to see if it's possible to merge our works,
> > > or select the best from ours patchsets.
> > >
> >
> > Are we not already controlling submission of request (at crude level).
> > If application is doing writeout at high rate, then it hits vm_dirty_ratio
> > hits and this application is forced to do write out and hence it is slowed
> > down and is not allowed to submit writes at high rate.
> >
> > Just that it is not a very fair scheme right now as during write out
> > a high prio/high weight cgroup application can start writing out some
> > other cgroups' pages.
> >
> > For this we probably need to have some combination of solutions like
> > per cgroup upper limit on dirty pages. Secondly probably if an application
> > is slowed down because of hitting vm_dirty_ratio, it should try to
> > write out the inode it is dirtying first instead of picking any random
> > inode and associated pages. This will ensure that a high weight
> > application can quickly get through the write outs and see higher
> > throughput from the disk.
>
> For the first, I submitted a patchset some months ago to provide this
> feature in the memory controller:
>
> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
>
> We focused on the best interface to use for setting the dirty pages
> limit, but we didn't finalize it. I can rework on that and repost an
> updated version. Now that we have the dirty_ratio/dirty_bytes to set the
> global limit I think we can use the same interface and the same semantic
> within the cgroup fs, something like:
>
> memory.dirty_ratio
> memory.dirty_bytes
>
> For the second point something like this should be enough to force tasks
> to write out only the inode they're actually dirtying when they hit the
> vm_dirty_ratio limit. But it should be tested carefully and may cause
> heavy performance regressions.
>
> Signed-off-by: Andrea Righi <[email protected]>
> ---
> mm/page-writeback.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2630937..1e07c9d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -543,7 +543,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> * been flushed to permanent storage.
> */
> if (bdi_nr_reclaimable) {
> - writeback_inodes(&wbc);
> + sync_inode(mapping->host, &wbc);
> pages_written += write_chunk - wbc.nr_to_write;
> get_dirty_limits(&background_thresh, &dirty_thresh,
> &bdi_thresh, bdi);

This patch seems to be helping me a bit in getting more service
differentiation between two dd writers of different weights. But strangely
it is helping only for ext3 and not ext4. Debugging is on.

Thanks
Vivek

2009-04-20 22:05:32

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Mon, Apr 20, 2009 at 05:28:27PM -0400, Vivek Goyal wrote:
> On Sun, Apr 19, 2009 at 05:47:18PM +0200, Andrea Righi wrote:
> > On Sun, Apr 19, 2009 at 09:42:01AM -0400, Vivek Goyal wrote:
> > > > The difference between synchronous IO and writeback IO is that in the
> > > > first case the task itself is throttled via schedule_timeout_killable();
> > > > in the second case pdflush is never throttled, the IO requests instead
> > > > are simply added into a rbtree and dispatched asynchronously by another
> > > > kernel thread (kiothrottled) using a EDF-like scheduling. More exactly,
> > > > a deadline is evaluated for each writeback IO request looking at the
> > > > cgroup BW and iops/sec limits, then kiothrottled periodically selects
> > > > and dispatches the requests with an elapsed deadline.
> > > >
> > >
> > > Ok, i will look into the logic of translating cgroup BW limits into
> > > deadline. But as Nauman pointed out that we probably will run into
> > > issues of tasks with in cgroup as we loose that notion of class and prio.
> >
> > Correct. I've not addressed the IO class and priority inside cgroup, and
> > there is a lot of space for optimizations and tunings for this in the
> > io-throttle controller. In the current implementation the delay is only
> > imposed to the first task that hits the BW limit. This is not fair at
> > all.
> >
> > Ideally the throttling should be distributed equally among the tasks
> > within the same cgroup that exhaust the available BW. With equally I
> > mean depending of a function of the previous generated IO, class and IO
> > priority.
> >
> > The same concept of fairness (for ioprio and class) will be reflected to
> > the underlying IO scheduler (only CFQ at the moment) for the requests
> > that passed the BW limits.
> >
> > This doesn't seem a bad idea, well.. at least in theory... :) Do you see
> > evident weak points? or motivations to move to another direction?
> >
> > >
> > > > >
> > > > > If that's the case, will a process not see an increased rate of writes
> > > > > till we are not hitting dirty_background_ratio?
> > > >
> > > > Correct. And this is a good behaviour IMHO. At the same time we have a
> > > > smooth BW usage (according to the cgroup limits I mean) even in presence
> > > > of writeback IO only.
> > > >
> > >
> > > Hmm.., I am not able to understand this. The very fact that you will see
> > > a high rate of async writes (more than specified by cgroup max BW), till
> > > you hit dirty_background_ratio, isn't it against the goals of max bw
> > > controller? You wanted to see a consistent view of rate even if spare BW
> > > is available, and this scenario goes against that?
> >
> > The goal of the io-throttle controller is to guarantee a constant BW for
> > the IO to the block devices. If you write data in cache, buffers, etc.
> > you shouldn't be affected by any IO limitation, but you will be when the
> > data is be written out to the disk.
> >
> > OTOH if an application needs a predictable IO BW, we can always set a
> > max limit and use direct IO.
> >
> > >
> > > Think of an hypothetical configuration of 10G RAM with dirty ratio say
> > > set to 20%. Assume not much of write out is taking place in the system.
> > > So for first 2G of writes, application will be able to write it at cpu
> > > speed and no throttling will kick in and a cgroup will easily cross it
> > > max BW?
> >
> > Yes.
> >
> > >
> > > > >
> > > > > Secondly, if above is giving acceptable performance resutls, then we
> > > > > should be able to provide max bw control at IO scheduler level (along
> > > > > with proportional bw control)?
> > > > >
> > > > > So instead of doing max bw and proportional bw implementation in two
> > > > > places with the help of different controllers, I think we can do it
> > > > > with the help of one controller at one place.
> > > > >
> > > > > Please do have a look at my patches also to figure out if that's possible
> > > > > or not. I think it should be possible.
> > > > >
> > > > > Keeping both at single place should simplify the things.
> > > >
> > > > Absolutely agree to do both proportional and max BW limiting in a single
> > > > place. I still need to figure which is the best place, if the IO
> > > > scheduler in the elevator, when the IO requests are submitted. A natural
> > > > way IMHO is to control the submission of requests, also Andrew seemed to
> > > > be convinced about this approach. Anyway, I've already scheduled to test
> > > > your patchset and I'd like to see if it's possible to merge our works,
> > > > or select the best from ours patchsets.
> > > >
> > >
> > > Are we not already controlling submission of request (at crude level).
> > > If application is doing writeout at high rate, then it hits vm_dirty_ratio
> > > hits and this application is forced to do write out and hence it is slowed
> > > down and is not allowed to submit writes at high rate.
> > >
> > > Just that it is not a very fair scheme right now as during right out
> > > a high prio/high weight cgroup application can start writing out some
> > > other cgroups' pages.
> > >
> > > For this we probably need to have some combination of solutions like
> > > per cgroup upper limit on dirty pages. Secondly probably if an application
> > > is slowed down because of hitting vm_drity_ratio, it should try to
> > > write out the inode it is dirtying first instead of picking any random
> > > inode and associated pages. This will ensure that a high weight
> > > application can quickly get through the write outs and see higher
> > > throughput from the disk.
> >
> > For the first, I submitted a patchset some months ago to provide this
> > feature in the memory controller:
> >
> > https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> >
> > We focused on the best interface to use for setting the dirty pages
> > limit, but we didn't finalize it. I can rework on that and repost an
> > updated version. Now that we have the dirty_ratio/dirty_bytes to set the
> > global limit I think we can use the same interface and the same semantic
> > within the cgroup fs, something like:
> >
> > memory.dirty_ratio
> > memory.dirty_bytes
> >
> > For the second point something like this should be enough to force tasks
> > to write out only the inode they're actually dirtying when they hit the
> > vm_dirty_ratio limit. But it should be tested carefully and may cause
> > heavy performance regressions.
> >
> > Signed-off-by: Andrea Righi <[email protected]>
> > ---
> > mm/page-writeback.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 2630937..1e07c9d 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -543,7 +543,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> > * been flushed to permanent storage.
> > */
> > if (bdi_nr_reclaimable) {
> > - writeback_inodes(&wbc);
> > + sync_inode(mapping->host, &wbc);
> > pages_written += write_chunk - wbc.nr_to_write;
> > get_dirty_limits(&background_thresh, &dirty_thresh,
> > &bdi_thresh, bdi);
>
> This patch seems to be helping me a bit in getting more service
> differentiation between two writer dd of different weights. But strangely
> it is helping only for ext3 and not ext4. Debugging is on.

Are you explicitly mounting ext3 with data=ordered?

-Andrea

2009-04-21 00:21:40

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Fri, Apr 17, 2009 at 04:39:05PM +0200, Andrea Righi wrote:
>
> Exactly, the purpose here is is to prioritize the dispatching of journal
> IO requests in the IO controller. I may have used an inappropriate flag
> or a quick&dirty solution, but without this, any cgroup/process that
> generates a lot of journal activity may be throttled and cause other
> cgroups/processes to be incorrectly blocked when they try to write to
> disk.

With ext3 and ext4, all journal I/O requests end up going through
kjournald. So the question is what I/O control group do you put
kjournald in? If you unrestrict it, it makes the problem go away
entirely. On the other hand, it is doing work on behalf of other
processes, and there is no real way to separate out on whose behalf
kjournald is doing said work. So I'm not sure fundamentally you'll be
able to do much with any filesystem journalling activity --- and ext3
makes life especially bad because of data=ordered mode.

> > I'm assuming it's the "usual" problem with lower priority IO getting
> > access to fs exclusive data. It's quite trivial to cause problems with
> > higher IO priority tasks then getting stuck waiting for the low priority
> > process, since they also need to access that fs exclusive data.
>
> Right. I thought about using the BIO_RW_SYNC flag instead, but as Ted
> pointed out, some cgroups/processes might be able to evade the IO
> control issuing a lot of fsync()s. We could also limit the fsync()-rate
> into the IO controller, but it sounds like a dirty workaround...

Well, if you use data=writeback or Chris Mason's proposed data=guarded
mode, then at least all of the data blocks will be written in the process
context of the application, and not in kjournald's process context. So
one solution that might be the best we have for now is to treat
kjournald as special from an I/O controller point of view (i.e., give
it its own cgroup), and then use a filesystem mode which avoids data
blocks getting written in kjournald (i.e., ext3 data=writeback or
data=guarded, ext4's delayed allocation, etc.).

One major form of leakage that you're still going to have is pdflush,
which again is more I/O happening in somebody else's process context.
Ultimately I think trying to throttle I/O at write submission time,
whether at entry into the block layer or in the elevators, is going to
be highly problematic. Suppose someone dirties a large number of pages?
That's a system resource, and delaying the writes because a particular
container has used more than its fair share will cause the entire
system to run out of memory, which is not a good thing.

Ultimately, I think what you'll need to do is write throttling: suspend
processes that are dirtying too many pages, instead of trying to
control the I/O.

- Ted

2009-04-21 01:11:18

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Tue, Apr 21, 2009 at 12:05:12AM +0200, Andrea Righi wrote:

[..]
> > > > Are we not already controlling submission of request (at crude level).
> > > > If application is doing writeout at high rate, then it hits vm_dirty_ratio
> > > > hits and this application is forced to do write out and hence it is slowed
> > > > down and is not allowed to submit writes at high rate.
> > > >
> > > > Just that it is not a very fair scheme right now as during right out
> > > > a high prio/high weight cgroup application can start writing out some
> > > > other cgroups' pages.
> > > >
> > > > For this we probably need to have some combination of solutions like
> > > > per cgroup upper limit on dirty pages. Secondly probably if an application
> > > > is slowed down because of hitting vm_drity_ratio, it should try to
> > > > write out the inode it is dirtying first instead of picking any random
> > > > inode and associated pages. This will ensure that a high weight
> > > > application can quickly get through the write outs and see higher
> > > > throughput from the disk.
> > >
> > > For the first, I submitted a patchset some months ago to provide this
> > > feature in the memory controller:
> > >
> > > https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> > >
> > > We focused on the best interface to use for setting the dirty pages
> > > limit, but we didn't finalize it. I can rework on that and repost an
> > > updated version. Now that we have the dirty_ratio/dirty_bytes to set the
> > > global limit I think we can use the same interface and the same semantic
> > > within the cgroup fs, something like:
> > >
> > > memory.dirty_ratio
> > > memory.dirty_bytes
> > >
> > > For the second point something like this should be enough to force tasks
> > > to write out only the inode they're actually dirtying when they hit the
> > > vm_dirty_ratio limit. But it should be tested carefully and may cause
> > > heavy performance regressions.
> > >
> > > Signed-off-by: Andrea Righi <[email protected]>
> > > ---
> > > mm/page-writeback.c | 2 +-
> > > 1 files changed, 1 insertions(+), 1 deletions(-)
> > >
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index 2630937..1e07c9d 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -543,7 +543,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> > > * been flushed to permanent storage.
> > > */
> > > if (bdi_nr_reclaimable) {
> > > - writeback_inodes(&wbc);
> > > + sync_inode(mapping->host, &wbc);
> > > pages_written += write_chunk - wbc.nr_to_write;
> > > get_dirty_limits(&background_thresh, &dirty_thresh,
> > > &bdi_thresh, bdi);
> >
> > This patch seems to be helping me a bit in getting more service
> > differentiation between two writer dd of different weights. But strangely
> > it is helping only for ext3 and not ext4. Debugging is on.
>
> Are you explicitly mounting ext3 with data=ordered?

Yes. Still using 29-rc8 and data=ordered was the default then.

I got two partitions on the same disk and created one ext3 filesystem on
each partition (just to take journaling interference out of the two dd
threads for the time being).

Two dd threads, each doing writes to its own partition.

Thanks
Vivek

2009-04-21 08:30:26

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Mon, Apr 20, 2009 at 08:18:22PM -0400, Theodore Tso wrote:
> On Fri, Apr 17, 2009 at 04:39:05PM +0200, Andrea Righi wrote:
> >
> > Exactly, the purpose here is is to prioritize the dispatching of journal
> > IO requests in the IO controller. I may have used an inappropriate flag
> > or a quick&dirty solution, but without this, any cgroup/process that
> > generates a lot of journal activity may be throttled and cause other
> > cgroups/processes to be incorrectly blocked when they try to write to
> > disk.
>
> With ext3 and ext4, all journal I/O requests end up going through
> kjournald. So the question is what I/O control group do you put
> kjournald in? If you unrestrict it, it makes the problem go away
> entirely. On the other hand, it is doing work on behalf of other
> processes, and there is no real way to separate out on whose behalf
> kjournald is doing said work. So I'm not sure fundamentally you'll be
> able to do much with any filesystem journalling activity --- and ext3
> makes life especially bad because of data=ordered mode.

OK, I've just removed the ext3/ext4 patch from io-throttle v14 and the
results are pretty much the same. BTW, I can't simply prioritize all
BIO_RW_SYNC requests, because that way direct IO would never be limited
at all; or at least I should add something like an is_in_direct_io()
check.
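
Just to give an idea of what I mean, here is a purely illustrative
sketch (this is not code from the patchset, and the heuristic is only an
assumption about how direct IO bios could be told apart from page cache
writeback bios):

#include <linux/bio.h>
#include <linux/mm.h>

/*
 * Illustrative only: guess whether a bio belongs to O_DIRECT IO by
 * looking at its pages.  The assumption is that direct IO bios carry
 * user pages pinned with get_user_pages() (typically anonymous), while
 * buffered writeback bios carry page cache pages with a valid ->mapping.
 */
static int is_in_direct_io(struct bio *bio)
{
        struct bio_vec *bvec;
        int i;

        bio_for_each_segment(bvec, bio, i) {
                struct page *page = bvec->bv_page;

                /* a page cache page => most likely buffered writeback */
                if (page->mapping && !PageAnon(page))
                        return 0;
        }
        return 1;
}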

Anyway, I agree and I think it's reasonable to always leave kjournald
in the root cgroup, and not set any IO limit for that cgroup.

But I wouldn't add additional checks for this; in the end we know that
"Unix gives you just enough rope to hang yourself".

>
> > > I'm assuming it's the "usual" problem with lower priority IO getting
> > > access to fs exclusive data. It's quite trivial to cause problems with
> > > higher IO priority tasks then getting stuck waiting for the low priority
> > > process, since they also need to access that fs exclusive data.
> >
> > Right. I thought about using the BIO_RW_SYNC flag instead, but as Ted
> > pointed out, some cgroups/processes might be able to evade the IO
> > control issuing a lot of fsync()s. We could also limit the fsync()-rate
> > into the IO controller, but it sounds like a dirty workaround...
>
> Well, if you use data=writeback or Chris Mason's proposed data=guarded
> mode, then at least all of the data blocks will be written process
> context of the application, and not kjournald's process context. So
> one solution that might be the best that we have for now is to treat
> kjournald as special from an I/O controller point of view (i.e., give
> it its own cgroup), and then use a filesystem mode which avoids data
> blocks getting written in kjournald (i.e., ext3 data=wirteback or
> data=guarded, ext4's delayed allocation, etc.)

Agree.

>
> One major form of leakage that you're still going to have is pdflush;
> which again, is more I/O happening in somebody else's process context.
> Ultimately I think trying to throttle I/O at write submission time
> whether at entry into block layer or in the elevators, is going to be
> highly problematic. Suppose someone dirties a large number of pages?
> That's a system resource, and delaying the writes because a particular
> container has used more than its fair share will cause the entire
> system to run out of memory, which is not a good thing.
>
> Ultimately, I think you'll need to do is write throttling, and suspend
> processes that are dirtying too many pages, instad of trying to
> control the I/O.

We're also trying to address this issue, setting a max dirty pages limit
per cgroup and forcing direct writeback when these limits are exceeded.

In this case dirty ratio throttling should happen automatically, because
the process will be throttled by the IO controller when it tries to
write back the dirty pages and submit IO requests.
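
Just to make the idea concrete, a minimal sketch (memcg_dirty_pages()
and memcg_dirty_limit() are hypothetical helpers, they don't exist in
the memory controller today):

#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/writeback.h>

/*
 * Hypothetical sketch only: when a task's cgroup exceeds its dirty page
 * limit, the dirtier falls back to direct writeback of its own inode,
 * so the IO controller can throttle it at submit_bio() time.
 */
static void memcg_balance_dirty_pages(struct address_space *mapping)
{
        unsigned long dirty = memcg_dirty_pages(current);       /* assumed */
        unsigned long limit = memcg_dirty_limit(current);       /* assumed */

        if (dirty > limit) {
                struct writeback_control wbc = {
                        .sync_mode      = WB_SYNC_NONE,
                        .nr_to_write    = dirty - limit,
                };

                /*
                 * The task writes back its own inode and is throttled by
                 * the IO controller when the resulting bios are submitted.
                 */
                sync_inode(mapping->host, &wbc);
        }
}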

What's your opinion?

Thanks,
-Andrea

2009-04-21 08:37:28

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Mon, Apr 20, 2009 at 09:08:46PM -0400, Vivek Goyal wrote:
> On Tue, Apr 21, 2009 at 12:05:12AM +0200, Andrea Righi wrote:
>
> [..]
> > > > > Are we not already controlling submission of request (at crude level).
> > > > > If application is doing writeout at high rate, then it hits vm_dirty_ratio
> > > > > hits and this application is forced to do write out and hence it is slowed
> > > > > down and is not allowed to submit writes at high rate.
> > > > >
> > > > > Just that it is not a very fair scheme right now as during right out
> > > > > a high prio/high weight cgroup application can start writing out some
> > > > > other cgroups' pages.
> > > > >
> > > > > For this we probably need to have some combination of solutions like
> > > > > per cgroup upper limit on dirty pages. Secondly probably if an application
> > > > > is slowed down because of hitting vm_drity_ratio, it should try to
> > > > > write out the inode it is dirtying first instead of picking any random
> > > > > inode and associated pages. This will ensure that a high weight
> > > > > application can quickly get through the write outs and see higher
> > > > > throughput from the disk.
> > > >
> > > > For the first, I submitted a patchset some months ago to provide this
> > > > feature in the memory controller:
> > > >
> > > > https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> > > >
> > > > We focused on the best interface to use for setting the dirty pages
> > > > limit, but we didn't finalize it. I can rework on that and repost an
> > > > updated version. Now that we have the dirty_ratio/dirty_bytes to set the
> > > > global limit I think we can use the same interface and the same semantic
> > > > within the cgroup fs, something like:
> > > >
> > > > memory.dirty_ratio
> > > > memory.dirty_bytes
> > > >
> > > > For the second point something like this should be enough to force tasks
> > > > to write out only the inode they're actually dirtying when they hit the
> > > > vm_dirty_ratio limit. But it should be tested carefully and may cause
> > > > heavy performance regressions.
> > > >
> > > > Signed-off-by: Andrea Righi <[email protected]>
> > > > ---
> > > > mm/page-writeback.c | 2 +-
> > > > 1 files changed, 1 insertions(+), 1 deletions(-)
> > > >
> > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > index 2630937..1e07c9d 100644
> > > > --- a/mm/page-writeback.c
> > > > +++ b/mm/page-writeback.c
> > > > @@ -543,7 +543,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> > > > * been flushed to permanent storage.
> > > > */
> > > > if (bdi_nr_reclaimable) {
> > > > - writeback_inodes(&wbc);
> > > > + sync_inode(mapping->host, &wbc);
> > > > pages_written += write_chunk - wbc.nr_to_write;
> > > > get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > &bdi_thresh, bdi);
> > >
> > > This patch seems to be helping me a bit in getting more service
> > > differentiation between two writer dd of different weights. But strangely
> > > it is helping only for ext3 and not ext4. Debugging is on.
> >
> > Are you explicitly mounting ext3 with data=ordered?
>
> Yes. Still using 29-rc8 and data=ordered was the default then.
>
> I got two partitions on same disk and created one ext3 filesystem on each
> partition (just to take journaling intereference out of two dd threads
> for the time being).
>
> Two dd threads doing writes to each partition.

...and if you're using data=writeback with ext4, sync_inode() should sync
only the metadata. If this is the case, could you also check data=ordered
for ext4?

-Andrea

2009-04-21 11:39:30

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

Hi Andrea,

> On Mon, Apr 20, 2009 at 08:35:40PM +0900, Ryo Tsuruta wrote:
> > > > +/*
> > > > + * Assign "page" the same owner as "opage."
> > > > + */
> > > > +void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
> > > > +{
> > > > + struct page_cgroup *npc, *opc;
> > > > +
> > > > + if (bio_cgroup_disabled())
> > > > + return;
> > > > + npc = lookup_page_cgroup(npage);
> > > > + if (unlikely(!npc))
> > > > + return;
> > > > + opc = lookup_page_cgroup(opage);
> > > > + if (unlikely(!opc))
> > > > + return;
> > > > +
> > > > + /*
> > > > + * Do this without any locks. The reason is the same as
> > > > + * bio_cgroup_reset_owner().
> > > > + */
> > > > + npc->bio_cgroup_id = opc->bio_cgroup_id;
> > >
> > > What protects npc and opc?
> >
> > As the same reason mentioned above, bio_cgroup_id can be updated
> > without any locks, and npc and opc always point to page_cgroups.
> > An integer variable can be set a new value at once on a system which
> > can use RCU lock.
>
> mmmh... I'm not sure about this. Actually you read opc->bio_cgroup_id
> first and then write to npc->bio_cgroup_id, so it is not atomic at all.
> So, you can read or set a wrong ID, but at least it should be always
> consistent (the single read or write itself is atomic).

Even if opc->bio_cgroup_id is changed just before it is copied to
npc->bio_cgroup_id, npc->bio_cgroup_id will be updated to a correct value
at some point. The implementation is not completely accurate, but it is
faster and lighter.

Thanks,
Ryo Tsuruta

2009-04-21 14:07:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Tue, Apr 21, 2009 at 10:30:02AM +0200, Andrea Righi wrote:
>
> We're trying to address also this issue, setting max dirty pages limit
> per cgroup, and force a direct writeback when these limits are exceeded.
>
> In this case dirty ratio throttling should happen automatically because
> the process will be throttled by the IO controller when it tries to
> writeback the dirty pages and submit IO requests.

The challenge here will be the accounting; consider that you may have
a file that had some of its pages in its page cache dirtied by a
process in cgroup A. Now another process in cgroup B dirties some
more pages. This could happen either via a mmap'ed file or via the
standard read/write system calls. How do you track which dirty pages
should be charged against which cgroup?

- Ted

2009-04-21 14:26:22

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Tue, Apr 21, 2009 at 10:37:03AM +0200, Andrea Righi wrote:
> On Mon, Apr 20, 2009 at 09:08:46PM -0400, Vivek Goyal wrote:
> > On Tue, Apr 21, 2009 at 12:05:12AM +0200, Andrea Righi wrote:
> >
> > [..]
> > > > > > Are we not already controlling submission of request (at crude level).
> > > > > > If application is doing writeout at high rate, then it hits vm_dirty_ratio
> > > > > > hits and this application is forced to do write out and hence it is slowed
> > > > > > down and is not allowed to submit writes at high rate.
> > > > > >
> > > > > > Just that it is not a very fair scheme right now as during right out
> > > > > > a high prio/high weight cgroup application can start writing out some
> > > > > > other cgroups' pages.
> > > > > >
> > > > > > For this we probably need to have some combination of solutions like
> > > > > > per cgroup upper limit on dirty pages. Secondly probably if an application
> > > > > > is slowed down because of hitting vm_drity_ratio, it should try to
> > > > > > write out the inode it is dirtying first instead of picking any random
> > > > > > inode and associated pages. This will ensure that a high weight
> > > > > > application can quickly get through the write outs and see higher
> > > > > > throughput from the disk.
> > > > >
> > > > > For the first, I submitted a patchset some months ago to provide this
> > > > > feature in the memory controller:
> > > > >
> > > > > https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> > > > >
> > > > > We focused on the best interface to use for setting the dirty pages
> > > > > limit, but we didn't finalize it. I can rework on that and repost an
> > > > > updated version. Now that we have the dirty_ratio/dirty_bytes to set the
> > > > > global limit I think we can use the same interface and the same semantic
> > > > > within the cgroup fs, something like:
> > > > >
> > > > > memory.dirty_ratio
> > > > > memory.dirty_bytes
> > > > >
> > > > > For the second point something like this should be enough to force tasks
> > > > > to write out only the inode they're actually dirtying when they hit the
> > > > > vm_dirty_ratio limit. But it should be tested carefully and may cause
> > > > > heavy performance regressions.
> > > > >
> > > > > Signed-off-by: Andrea Righi <[email protected]>
> > > > > ---
> > > > > mm/page-writeback.c | 2 +-
> > > > > 1 files changed, 1 insertions(+), 1 deletions(-)
> > > > >
> > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > > index 2630937..1e07c9d 100644
> > > > > --- a/mm/page-writeback.c
> > > > > +++ b/mm/page-writeback.c
> > > > > @@ -543,7 +543,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> > > > > * been flushed to permanent storage.
> > > > > */
> > > > > if (bdi_nr_reclaimable) {
> > > > > - writeback_inodes(&wbc);
> > > > > + sync_inode(mapping->host, &wbc);
> > > > > pages_written += write_chunk - wbc.nr_to_write;
> > > > > get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > > &bdi_thresh, bdi);
> > > >
> > > > This patch seems to be helping me a bit in getting more service
> > > > differentiation between two writer dd of different weights. But strangely
> > > > it is helping only for ext3 and not ext4. Debugging is on.
> > >
> > > Are you explicitly mounting ext3 with data=ordered?
> >
> > Yes. Still using 29-rc8 and data=ordered was the default then.
> >
> > I got two partitions on same disk and created one ext3 filesystem on each
> > partition (just to take journaling intereference out of two dd threads
> > for the time being).
> >
> > Two dd threads doing writes to each partition.
>
> ...and if you're using data=writeback with ext4 sync_inode() should sync
> the metadata only. If this is the case, could you check data=ordered
> also for ext4?

No, even data=ordered mode with ext4 is also not helping. It has to be
something else.

BTW, with the above patch, what happens if the address space being dirtied
does not have sufficient dirty pages to write back (more than write_chunk)?
Will the process not be stuck in a loop until the number of dirty pages
comes down (hopefully due to writeout by pdflush or by other processes)?

Thanks
Vivek

2009-04-21 14:31:46

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Tue, Apr 21, 2009 at 10:06:31AM -0400, Theodore Tso wrote:
> On Tue, Apr 21, 2009 at 10:30:02AM +0200, Andrea Righi wrote:
> >
> > We're trying to address also this issue, setting max dirty pages limit
> > per cgroup, and force a direct writeback when these limits are exceeded.
> >
> > In this case dirty ratio throttling should happen automatically because
> > the process will be throttled by the IO controller when it tries to
> > writeback the dirty pages and submit IO requests.
>
> The challenge here will be the accounting; consider that you may have
> a file that had some of its pages in its page cache dirtied by a
> process in cgroup A. Now another process in cgroup B dirties some
> more pages. This could happen either via a mmap'ed file or via the
> standard read/write system calls. How do you track which dirty pages
> should be charged against which cgroup?
>
> - Ted

Some months ago I posted a proposal to account, track and limit per
cgroup dirty pages in the memory cgroup subsystem:

https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html

At the moment I'm reworking a similar, updated version. I know that
Kamezawa is also implementing something to account per cgroup dirty
pages in the memory cgroup.

Moreover, io-throttle v14 already uses the page_cgroup structure to
encode into page_cgroup->flags the ID (the io-throttle css_id(),
actually) of the cgroup that originally dirtied the page.

This should be enough to track dirty pages and charge the right cgroup.
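
Roughly, the idea looks like this (an illustrative sketch, not the
actual io-throttle v14 code; the names and the bit layout are made up):

#include <linux/page_cgroup.h>

#define IOTHROTTLE_ID_SHIFT     16
#define IOTHROTTLE_ID_MASK      (~0UL << IOTHROTTLE_ID_SHIFT)

/*
 * Illustrative sketch: pack a small cgroup ID (a css_id value) into the
 * upper bits of page_cgroup->flags when a page is dirtied, and read it
 * back at writeback time.
 */
static void iothrottle_set_page_owner(struct page *page, unsigned long id)
{
        struct page_cgroup *pc = lookup_page_cgroup(page);

        if (unlikely(!pc))
                return;
        /* lockless on purpose, like bio-cgroup: a stale ID is tolerated */
        pc->flags = (pc->flags & ~IOTHROTTLE_ID_MASK) |
                        (id << IOTHROTTLE_ID_SHIFT);
}

static unsigned long iothrottle_page_owner(struct page *page)
{
        struct page_cgroup *pc = lookup_page_cgroup(page);

        return pc ? (pc->flags & IOTHROTTLE_ID_MASK) >> IOTHROTTLE_ID_SHIFT : 0;
}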

-Andrea

2009-04-21 15:32:44

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 3/9] bio-cgroup controller

* Andrea Righi <[email protected]> [2009-04-20 16:56:59]:

> On Mon, Apr 20, 2009 at 08:35:40PM +0900, Ryo Tsuruta wrote:
> > > > +/*
> > > > + * Assign "page" the same owner as "opage."
> > > > + */
> > > > +void bio_cgroup_copy_owner(struct page *npage, struct page *opage)
> > > > +{
> > > > + struct page_cgroup *npc, *opc;
> > > > +
> > > > + if (bio_cgroup_disabled())
> > > > + return;
> > > > + npc = lookup_page_cgroup(npage);
> > > > + if (unlikely(!npc))
> > > > + return;
> > > > + opc = lookup_page_cgroup(opage);
> > > > + if (unlikely(!opc))
> > > > + return;
> > > > +
> > > > + /*
> > > > + * Do this without any locks. The reason is the same as
> > > > + * bio_cgroup_reset_owner().
> > > > + */
> > > > + npc->bio_cgroup_id = opc->bio_cgroup_id;
> > >
> > > What protects npc and opc?
> >
> > As the same reason mentioned above, bio_cgroup_id can be updated
> > without any locks, and npc and opc always point to page_cgroups.
> > An integer variable can be set a new value at once on a system which
> > can use RCU lock.
>
> mmmh... I'm not sure about this. Actually you read opc->bio_cgroup_id
> first and then write to npc->bio_cgroup_id, so it is not atomic at all.
> So, you can read or set a wrong ID, but at least it should be always
> consistent (the single read or write itself is atomic).

Quick concern here, how long does it take for the data to become
consistent? Can we have a group misuse the bandwidth during that time?
What about conditions where you have a wrong ID, but the group
associated with the wrong ID is gone?

--
Balbir

2009-04-21 16:37:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Tue, Apr 21, 2009 at 04:31:31PM +0200, Andrea Righi wrote:
>
> Some months ago I posted a proposal to account, track and limit per
> cgroup dirty pages in the memory cgroup subsystem:
>
> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
>
> At the moment I'm reworking on a similar and updated version. I know
> that Kamezawa is also implementing something to account per cgroup dirty
> pages in memory cgroup.
>
> Moreover, io-throttle v14 already uses the page_cgroup structure to
> encode into page_cgroup->flags the cgroup ID (io-throttle css_id()
> actually) that originally dirtied the page.
>
> This should be enough to track dirty pages and charge the right cgroup.

I'm not convinced this will work that well. Right now associating
a page with a cgroup is done on a very rough basis --- basically,
whoever touches a page last "owns" the page. That means if one
process first touches the page by reading from the file, it will "own"
the page. This can get quite arbitrary for shared libraries, for example.
However, while it may be the best that you can do for RSS accounting,
it gets worse for tracking dirty pages.

Now if you have processes from one cgroup that are always reading from
some data file, and a process from another cgroup which is updating
said data file, the writes won't be charged to the correct cgroup.

So using the same data structures to assign page ownership for RSS
accounting and page dirtying accounting might not be such a great
idea. On the other hand, using a completely different set of data
structures increases your overhead.

That being said, it's not obvious to me that trying to track RSS
ownership on a per-page basis makes sense. It may not be worth the
overhead, particularly on a machine with a truly large amount of
memory. So for example, tracking on a per vm_area_struct, and
splitting the cost across cgroups, might be a better way of tracking
RSS accounting. But for dirty pages, where there will be much fewer
such pages, maybe using a per-page scheme makes more sense. The
take-home here is that using different mechanisms for tracking RSS
accounting and dirty page accounting on a per-cgroup basis, with the
understanding that this will all be horribly rough and non-exact, may
make a lot of sense.

Best,

- Ted

2009-04-21 17:24:29

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

* Theodore Tso <[email protected]> [2009-04-21 12:35:37]:

> On Tue, Apr 21, 2009 at 04:31:31PM +0200, Andrea Righi wrote:
> >
> > Some months ago I posted a proposal to account, track and limit per
> > cgroup dirty pages in the memory cgroup subsystem:
> >
> > https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> >
> > At the moment I'm reworking on a similar and updated version. I know
> > that Kamezawa is also implementing something to account per cgroup dirty
> > pages in memory cgroup.
> >
> > Moreover, io-throttle v14 already uses the page_cgroup structure to
> > encode into page_cgroup->flags the cgroup ID (io-throttle css_id()
> > actually) that originally dirtied the page.
> >
> > This should be enough to track dirty pages and charge the right cgroup.
>
> I'm not convinced this will work that well. Right now the associating
> a page with a cgroup is done on a very rough basis --- basically,
> whoever touches a page last "owns" the page. That means if one
> process first tries reading from the cgroup, it will "own" the page.
> This can get quite arbitrary for shared libraries, for example.
> However, while it may be the best that you can do for RSS accounting,
> it gets worse for tracking dirties pages.
>
> Now if you have processes from one cgroup that always reading from
> some data file, and a process from another cgroup which is updating
> said data file, the writes won't be charged to the correct cgroup.
>
> So using the same data structures to assign page ownership for RSS
> accounting and page dirtying accounting might not be such a great
> idea. On the other hand, using a completely different set of data
> structures increases your overhead.
>
> That being said, it's not obvious to me that trying to track RSS
> ownership on a per-page basis makes sense. It may not be worth the
> overhead, particularly on a machine with a truly large amount of
> memory. So for example, tracking on a per vm_area_struct, and
> splitting the cost across cgroups, might be a better way of tracking
> RSS accounting. But for dirty pages, where there will be much fewer
> such pages, maybe using a per-page scheme makes more sense. The
> take-home here is that using different mechanisms for tracking RSS
> accounting and dirty page accounting on a per-cgroup basis, with the
> understanding that this will all be horribly rough and non-exact, may
> make a lot of sense.
>

We need to do this tracking for per cgroup reclaim; we need to track
pages in their own LRU. I've been working on some optimizations to avoid
tracking LRU pages for the largest cgroup (mostly the root cgroup) to
help optimize the memory resource controller, but I've not posted them
yet. We also have a mechanism by which a page reclaimed from one cgroup
might stay in the global LRU and get reassigned depending on usage.

Coming to the dirty page tracking issue, the issue being raised here is
the same one that we have with shared page accounting. I am working on
estimates for shared page accounting and it should be possible to extend
them to dirty shared page accounting. Using the shared ratios for
decisions might be a better strategy.


--
Balbir

2009-04-21 17:47:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Tue, Apr 21, 2009 at 10:53:17PM +0530, Balbir Singh wrote:
> Coming to the dirty page tracking issue, the issue that is being
> brought about is the same issue that we have shared page accounting. I
> am working on estimates for shared page accounting and it should be
> possible to extend it to dirty shared page accounting. Using the
> shared ratios for decisions might be a better strategy.

It's the same issue, but again, consider the use case where the
readers and the writers are in different cgroups. This can happen
quite often in database workloads, where you might have many readers,
and a single process doing the database update. Or the case where you
have one process in one cgroup doing a tail -f of some log file, and
another process writing to the log file.

Using a shared ratio is certainly better than charging 100% of the
write to whichever unfortunate process happened to first read the
page, but it will still not be terribly accurate. A lot really
depends on how you expect these cgroup limits will be used, and what
the requirements actually will be with respect to accuracy. If the
requirements for accuracy are different for RSS tracking and dirty
page tracking --- which could easily be the case, since memory is
usually much cheaper than I/O bandwidth, and there are generally far
more clean memory pages than there are dirty memory pages, so a small
numerical error in dirty page accounting translates to a much larger
percentage error than in read-only RSS page accounting --- it may make
sense to use different mechanisms for tracking the two, given the
different requirements and differing overhead implications.

Anyway, something for you to think about.

Regards,

- Ted

2009-04-21 18:15:35

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

* Theodore Tso <[email protected]> [2009-04-21 13:46:20]:

> On Tue, Apr 21, 2009 at 10:53:17PM +0530, Balbir Singh wrote:
> > Coming to the dirty page tracking issue, the issue that is being
> > brought about is the same issue that we have shared page accounting. I
> > am working on estimates for shared page accounting and it should be
> > possible to extend it to dirty shared page accounting. Using the
> > shared ratios for decisions might be a better strategy.
>
> It's the same issue, but again, consider the use case where the
> readers and the writers are in different cgroups. This can happen
> quite often in database workloads, where you might have many readers,
> and a single process doing the database update. Or the case where you
> have one process in one cgroup doing a tail -f of some log file, and
> another process doing writing to the log file.
>

That would be true in general, but only the process writing to the
file will dirty it. So dirty already accounts for the read/write
split. I'd assume that the cost is only for the dirty page, since we
do IO only on write in this case, unless I am missing something very
obvious.

> Using a shared ratio is certainly better than charging 100% of the
> write to whichever unfortunate process happened to first read the
> page, but it will still not be terribly accurate. A lot really
> depends on how you expect these cgroup limits will be used, and what
> the requirements actually will be with respect to accuracy. If the
> requirements for accuracy are different for RSS tracking and dirty
> page tracking --- which could easily be the case, since memory is
> usually much cheaper than I/O bandwidth, and there is generally far
> more clean memory pages than there are dirty memory pages, so a small
> numberical error in dirty page accounting translates to a much larger
> percentage error than read-only RSS page accounting --- it may make
> sense to use different mechanisms for tracking the two, given the
> different requirements and differring overhead implications.
>
> Anyway, something for you to think about.

Yep, but I would recommend using the controller we have; if the
overheads turn out to be too large for IO, we can think about
alternatives.

--
Balbir

2009-04-21 18:32:34

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Tue, Apr 21, 2009 at 10:23:05AM -0400, Vivek Goyal wrote:
> On Tue, Apr 21, 2009 at 10:37:03AM +0200, Andrea Righi wrote:
> > On Mon, Apr 20, 2009 at 09:08:46PM -0400, Vivek Goyal wrote:
> > > On Tue, Apr 21, 2009 at 12:05:12AM +0200, Andrea Righi wrote:
> > >
> > > [..]
> > > > > > > Are we not already controlling submission of request (at crude level).
> > > > > > > If application is doing writeout at high rate, then it hits vm_dirty_ratio
> > > > > > > hits and this application is forced to do write out and hence it is slowed
> > > > > > > down and is not allowed to submit writes at high rate.
> > > > > > >
> > > > > > > Just that it is not a very fair scheme right now as during right out
> > > > > > > a high prio/high weight cgroup application can start writing out some
> > > > > > > other cgroups' pages.
> > > > > > >
> > > > > > > For this we probably need to have some combination of solutions like
> > > > > > > per cgroup upper limit on dirty pages. Secondly probably if an application
> > > > > > > is slowed down because of hitting vm_drity_ratio, it should try to
> > > > > > > write out the inode it is dirtying first instead of picking any random
> > > > > > > inode and associated pages. This will ensure that a high weight
> > > > > > > application can quickly get through the write outs and see higher
> > > > > > > throughput from the disk.
> > > > > >
> > > > > > For the first, I submitted a patchset some months ago to provide this
> > > > > > feature in the memory controller:
> > > > > >
> > > > > > https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> > > > > >
> > > > > > We focused on the best interface to use for setting the dirty pages
> > > > > > limit, but we didn't finalize it. I can rework on that and repost an
> > > > > > updated version. Now that we have the dirty_ratio/dirty_bytes to set the
> > > > > > global limit I think we can use the same interface and the same semantic
> > > > > > within the cgroup fs, something like:
> > > > > >
> > > > > > memory.dirty_ratio
> > > > > > memory.dirty_bytes
> > > > > >
> > > > > > For the second point something like this should be enough to force tasks
> > > > > > to write out only the inode they're actually dirtying when they hit the
> > > > > > vm_dirty_ratio limit. But it should be tested carefully and may cause
> > > > > > heavy performance regressions.
> > > > > >
> > > > > > Signed-off-by: Andrea Righi <[email protected]>
> > > > > > ---
> > > > > > mm/page-writeback.c | 2 +-
> > > > > > 1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > > > index 2630937..1e07c9d 100644
> > > > > > --- a/mm/page-writeback.c
> > > > > > +++ b/mm/page-writeback.c
> > > > > > @@ -543,7 +543,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> > > > > > * been flushed to permanent storage.
> > > > > > */
> > > > > > if (bdi_nr_reclaimable) {
> > > > > > - writeback_inodes(&wbc);
> > > > > > + sync_inode(mapping->host, &wbc);
> > > > > > pages_written += write_chunk - wbc.nr_to_write;
> > > > > > get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > > > &bdi_thresh, bdi);
> > > > >
> > > > > This patch seems to be helping me a bit in getting more service
> > > > > differentiation between two writer dd of different weights. But strangely
> > > > > it is helping only for ext3 and not ext4. Debugging is on.
> > > >
> > > > Are you explicitly mounting ext3 with data=ordered?
> > >
> > > Yes. Still using 29-rc8 and data=ordered was the default then.
> > >
> > > I got two partitions on same disk and created one ext3 filesystem on each
> > > partition (just to take journaling intereference out of two dd threads
> > > for the time being).
> > >
> > > Two dd threads doing writes to each partition.
> >
> > ...and if you're using data=writeback with ext4 sync_inode() should sync
> > the metadata only. If this is the case, could you check data=ordered
> > also for ext4?
>
> No, even data=ordered mode with ext4 is also not helping. It has to be
> something else.
>

Ok, with data=ordered mode with ext4, now I can get significant service
differentiation between two dd processes. I had to tweak cfq a bit.

- Instead of a 40ms slice for the async queue, do 20ms at a time (tunable).
- Change the cfq quantum from 4 to 1, so as not to dispatch a bunch of
requests in one go.

The above changes help keep two continuously backlogged queues at the IO
scheduler, so that the IO scheduler can offer more disk time to the
higher weight process.

Thanks
Vivek

> BTW, with the above patch, what happens if the address space being dirtied
> does not have sufficient dirty pages to write back (more than write_chunk).
> Will the process not be in loop until the number of dirty pages come down
> (hopefully due to writeout by pdflush or by other processes?)
>
> Thanks
> Vivek

2009-04-21 19:15:40

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Tue, Apr 21, 2009 at 11:44:29PM +0530, Balbir Singh wrote:
>
> That would be true in general, but only the process writing to the
> file will dirty it. So dirty already accounts for the read/write
> split. I'd assume that the cost is only for the dirty page, since we
> do IO only on write in this case, unless I am missing something very
> obvious.

Maybe I'm missing something, but the (in development) patches I saw
seemed to use the existing infrastructure designed for RSS cost
tracking (which is also not yet in mainline, unless I'm mistaken ---
but I didn't see page_get_page_cgroup() in the mainline tree yet).

Right? So if process A in cgroup A touches the file first by
reading from it, then the pages read by process A will be assigned as
being "owned" by cgroup A. Then when the patch described at

http://lkml.org/lkml/2008/9/9/245

... tries to charge a write done by process B in cgroup B, the code
will call page_get_page_cgroup(), see that it is "owned" by cgroup A,
and charge the dirty page to cgroup A. If process A and all of the
other processes in cgroup A only access this file read-only, and
process B is updating this file very heavily --- and it is a large
file --- then cgroup B will get a completely free pass as far as
dirtying pages to this file, since it will be all charged 100% to
cgroup A, incorrectly.

So what am I missing?

- Ted

2009-04-21 20:49:23

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Tue, Apr 21, 2009 at 03:14:01PM -0400, Theodore Tso wrote:
> On Tue, Apr 21, 2009 at 11:44:29PM +0530, Balbir Singh wrote:
> >
> > That would be true in general, but only the process writing to the
> > file will dirty it. So dirty already accounts for the read/write
> > split. I'd assume that the cost is only for the dirty page, since we
> > do IO only on write in this case, unless I am missing something very
> > obvious.
>
> Maybe I'm missing something, but the (in development) patches I saw
> seemed to use the existing infrastructure designed for RSS cost
> tracking (which is also not yet in mainline, unless I'm mistaken ---
> but I didn't see page_get_page_cgroup() in the mainline tree yet).

page_get_page_cgroup() is the old page_cgroup interface; it has now been
replaced by lookup_page_cgroup(), which is in mainline.

>
> Right? So if process A in cgroup A reads touches the file first by
> reading from it, then the pages read by process A will be assigned as
> being "owned" by cgroup A. Then when the patch described at
>
> http://lkml.org/lkml/2008/9/9/245

And this patch must be completely reworked.

>
> ... tries to charge a write done by process B in cgroup B, the code
> will call page_get_page_cgroup(), see that it is "owned" by cgroup A,
> and charge the dirty page to cgroup A. If process A and all of the
> other processes in cgroup A only access this file read-only, and
> process B is updating this file very heavily --- and it is a large
> file --- then cgroup B will get a completely free pass as far as
> dirtying pages to this file, since it will be all charged 100% to
> cgroup A, incorrectly.

Yep, right. Anyway, it's not completely wrong to account dirty pages in
this way. The dirty pages actually belong to cgroup A, and providing per
cgroup upper limits on dirty pages could help to distribute dirty pages,
which are hard/slow to reclaim, equally among cgroups.

But this is definitely another problem.

And it doesn't help with the problem described by Ted, especially for the
IO controller. The only way I see to correctly handle that case is to
limit the rate of dirty pages per cgroup, accounting the dirty activity
to the cgroup that first touched the page (and not the owner as
intended by the memory controller).

And this should probably be strictly connected to the IO controller. If
we throttle or delay the dispatching/submission of some IO requests
without throttling the dirty page rate, a cgroup could completely waste
its own available memory with dirty (hard and slow to reclaim) pages.

That is in part the approach I used in io-throttle v12, adding a hook in
balance_dirty_pages_ratelimited_nr() to throttle the current task when
the cgroup's IO limits are exceeded. Argh!

So, another proposal could be to re-add the old hook in
balance_dirty_pages_ratelimited_nr() in io-throttle v14.

In this way io-throttle would:

- use the page_cgroup infrastructure and page_cgroup->flags to encode the
id of the cgroup that first dirtied a given page
- account, and throttle when needed, sync and writeback IO requests in
submit_bio()
- at the same time throttle the tasks in
balance_dirty_pages_ratelimited_nr() if the cgroup they belong to has
exhausted the IO BW (or quota, share, etc. in the case of a proportional
BW limit); a rough sketch of this hook follows below
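
A minimal sketch of what this hook could look like (cgroup_io_throttle()
is just a placeholder name for the io-throttle entry point, not an
existing function):

#include <linux/pagemap.h>
#include <linux/writeback.h>

/*
 * Illustrative sketch, not the io-throttle v12/v14 code: charge the
 * dirtied bytes to the current task's cgroup for this block device and
 * sleep if the cgroup's bandwidth budget is exhausted.
 */
void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
                                        unsigned long nr_pages_dirtied)
{
        struct block_device *bdev = mapping->host->i_sb->s_bdev;

        if (bdev)
                cgroup_io_throttle(bdev, nr_pages_dirtied << PAGE_CACHE_SHIFT);

        /* ... existing ratelimiting logic and balance_dirty_pages() ... */
}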

-Andrea

2009-04-21 21:28:34

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Tue, Apr 21, 2009 at 10:23:05AM -0400, Vivek Goyal wrote:
> On Tue, Apr 21, 2009 at 10:37:03AM +0200, Andrea Righi wrote:
> > On Mon, Apr 20, 2009 at 09:08:46PM -0400, Vivek Goyal wrote:
> > > On Tue, Apr 21, 2009 at 12:05:12AM +0200, Andrea Righi wrote:
> > >
> > > [..]
> > > > > > > Are we not already controlling submission of request (at crude level).
> > > > > > > If application is doing writeout at high rate, then it hits vm_dirty_ratio
> > > > > > > hits and this application is forced to do write out and hence it is slowed
> > > > > > > down and is not allowed to submit writes at high rate.
> > > > > > >
> > > > > > > Just that it is not a very fair scheme right now as during right out
> > > > > > > a high prio/high weight cgroup application can start writing out some
> > > > > > > other cgroups' pages.
> > > > > > >
> > > > > > > For this we probably need to have some combination of solutions like
> > > > > > > per cgroup upper limit on dirty pages. Secondly probably if an application
> > > > > > > is slowed down because of hitting vm_drity_ratio, it should try to
> > > > > > > write out the inode it is dirtying first instead of picking any random
> > > > > > > inode and associated pages. This will ensure that a high weight
> > > > > > > application can quickly get through the write outs and see higher
> > > > > > > throughput from the disk.
> > > > > >
> > > > > > For the first, I submitted a patchset some months ago to provide this
> > > > > > feature in the memory controller:
> > > > > >
> > > > > > https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> > > > > >
> > > > > > We focused on the best interface to use for setting the dirty pages
> > > > > > limit, but we didn't finalize it. I can rework on that and repost an
> > > > > > updated version. Now that we have the dirty_ratio/dirty_bytes to set the
> > > > > > global limit I think we can use the same interface and the same semantic
> > > > > > within the cgroup fs, something like:
> > > > > >
> > > > > > memory.dirty_ratio
> > > > > > memory.dirty_bytes
> > > > > >
> > > > > > For the second point something like this should be enough to force tasks
> > > > > > to write out only the inode they're actually dirtying when they hit the
> > > > > > vm_dirty_ratio limit. But it should be tested carefully and may cause
> > > > > > heavy performance regressions.
> > > > > >
> > > > > > Signed-off-by: Andrea Righi <[email protected]>
> > > > > > ---
> > > > > > mm/page-writeback.c | 2 +-
> > > > > > 1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > > > index 2630937..1e07c9d 100644
> > > > > > --- a/mm/page-writeback.c
> > > > > > +++ b/mm/page-writeback.c
> > > > > > @@ -543,7 +543,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> > > > > > * been flushed to permanent storage.
> > > > > > */
> > > > > > if (bdi_nr_reclaimable) {
> > > > > > - writeback_inodes(&wbc);
> > > > > > + sync_inode(mapping->host, &wbc);
> > > > > > pages_written += write_chunk - wbc.nr_to_write;
> > > > > > get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > > > &bdi_thresh, bdi);
> > > > >
> > > > > This patch seems to be helping me a bit in getting more service
> > > > > differentiation between two writer dd of different weights. But strangely
> > > > > it is helping only for ext3 and not ext4. Debugging is on.
> > > >
> > > > Are you explicitly mounting ext3 with data=ordered?
> > >
> > > Yes. Still using 29-rc8 and data=ordered was the default then.
> > >
> > > I got two partitions on same disk and created one ext3 filesystem on each
> > > partition (just to take journaling intereference out of two dd threads
> > > for the time being).
> > >
> > > Two dd threads doing writes to each partition.
> >
> > ...and if you're using data=writeback with ext4 sync_inode() should sync
> > the metadata only. If this is the case, could you check data=ordered
> > also for ext4?
>
> No, even data=ordered mode with ext4 is also not helping. It has to be
> something else.

Mmmh... maybe you could also try adding wbc.sync_mode = WB_SYNC_ALL in
the if below.

>
> BTW, with the above patch, what happens if the address space being dirtied
> does not have sufficient dirty pages to write back (more than write_chunk).
> Will the process not be in loop until the number of dirty pages come down
> (hopefully due to writeout by pdflush or by other processes?)

Right! At the very least we could try something like this, stopping the
loop if the address space we've dirtied doesn't have enough dirty pages.

Signed-off-by: Andrea Righi <[email protected]>
---
mm/page-writeback.c | 12 +++++++++++-
1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..e71a164 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -542,7 +542,17 @@ static void balance_dirty_pages(struct address_space *mapping)
 		 * been flushed to permanent storage.
 		 */
 		if (bdi_nr_reclaimable) {
-			writeback_inodes(&wbc);
+			wbc.more_io = 0;
+			wbc.encountered_congestion = 0;
+			sync_inode(mapping->host, &wbc);
+			if (wbc.nr_to_write <= 0)
+				break;
+			/*
+			 * Wrote less than expected, check if the inode has
+			 * enough dirty pages to write back
+			 */
+			if (!wbc.encountered_congestion && !wbc.more_io)
+				break;
 			pages_written += write_chunk - wbc.nr_to_write;
 			get_dirty_limits(&background_thresh, &dirty_thresh,
 					&bdi_thresh, bdi);

2009-04-21 21:36:25

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Tue, Apr 21, 2009 at 02:29:58PM -0400, Vivek Goyal wrote:
> On Tue, Apr 21, 2009 at 10:23:05AM -0400, Vivek Goyal wrote:
> > On Tue, Apr 21, 2009 at 10:37:03AM +0200, Andrea Righi wrote:
> > > On Mon, Apr 20, 2009 at 09:08:46PM -0400, Vivek Goyal wrote:
> > > > On Tue, Apr 21, 2009 at 12:05:12AM +0200, Andrea Righi wrote:
> > > >
> > > > [..]
> > > > > > > > Are we not already controlling the submission of requests (at a crude
> > > > > > > > level)? If an application is doing writeout at a high rate, then it hits
> > > > > > > > the vm_dirty_ratio limit, is forced to do writeout itself, and hence is
> > > > > > > > slowed down and not allowed to submit writes at a high rate.
> > > > > > > >
> > > > > > > > Just that it is not a very fair scheme right now, as during writeout
> > > > > > > > a high prio/high weight cgroup application can start writing out some
> > > > > > > > other cgroups' pages.
> > > > > > > >
> > > > > > > > For this we probably need to have some combination of solutions, like a
> > > > > > > > per cgroup upper limit on dirty pages. Secondly, if an application
> > > > > > > > is slowed down because of hitting vm_dirty_ratio, it should try to
> > > > > > > > write out the inode it is dirtying first instead of picking any random
> > > > > > > > inode and associated pages. This will ensure that a high weight
> > > > > > > > application can quickly get through the writeouts and see higher
> > > > > > > > throughput from the disk.
> > > > > > >
> > > > > > > For the first, I submitted a patchset some months ago to provide this
> > > > > > > feature in the memory controller:
> > > > > > >
> > > > > > > https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> > > > > > >
> > > > > > > We focused on the best interface to use for setting the dirty pages
> > > > > > > limit, but we didn't finalize it. I can rework that and repost an
> > > > > > > updated version. Now that we have dirty_ratio/dirty_bytes to set the
> > > > > > > global limit, I think we can use the same interface and the same semantics
> > > > > > > within the cgroup fs, something like:
> > > > > > >
> > > > > > > memory.dirty_ratio
> > > > > > > memory.dirty_bytes
> > > > > > >
> > > > > > > For the second point something like this should be enough to force tasks
> > > > > > > to write out only the inode they're actually dirtying when they hit the
> > > > > > > vm_dirty_ratio limit. But it should be tested carefully and may cause
> > > > > > > heavy performance regressions.
> > > > > > >
> > > > > > > Signed-off-by: Andrea Righi <[email protected]>
> > > > > > > ---
> > > > > > > mm/page-writeback.c | 2 +-
> > > > > > > 1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > > >
> > > > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > > > > index 2630937..1e07c9d 100644
> > > > > > > --- a/mm/page-writeback.c
> > > > > > > +++ b/mm/page-writeback.c
> > > > > > > @@ -543,7 +543,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> > > > > > > * been flushed to permanent storage.
> > > > > > > */
> > > > > > > if (bdi_nr_reclaimable) {
> > > > > > > - writeback_inodes(&wbc);
> > > > > > > + sync_inode(mapping->host, &wbc);
> > > > > > > pages_written += write_chunk - wbc.nr_to_write;
> > > > > > > get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > > > > &bdi_thresh, bdi);
> > > > > >
> > > > > > This patch seems to be helping me a bit in getting more service
> > > > > > differentiation between two writer dd of different weights. But strangely
> > > > > > it is helping only for ext3 and not ext4. Debugging is on.
> > > > >
> > > > > Are you explicitly mounting ext3 with data=ordered?
> > > >
> > > > Yes. Still using 29-rc8 and data=ordered was the default then.
> > > >
> > > > I got two partitions on same disk and created one ext3 filesystem on each
> > > > partition (just to take journaling intereference out of two dd threads
> > > > for the time being).
> > > >
> > > > Two dd threads doing writes to each partition.
> > >
> > > ...and if you're using data=writeback with ext4 sync_inode() should sync
> > > the metadata only. If this is the case, could you check data=ordered
> > > also for ext4?
> >
> > No, even data=ordered mode with ext4 is also not helping. It has to be
> > something else.
> >
>
> Ok, with data=ordered mode with ext4, now I can get significant service
> differentiation between two dd processes. I had to tweak cfq a bit.
>
> - Instead of a 40ms slice for the async queue, do 20ms at a time (tunable).
> - Change the cfq quantum from 4 to 1 so as not to dispatch a bunch of requests
> in one go.
>
> The above changes help a bit in keeping two continuously backlogged queues
> at the IO scheduler, so that the IO scheduler can offer more disk time to the
> higher weight process.

Good, also testing WB_SYNC_ALL would be interesting, I think.

-Andrea

2009-04-22 00:35:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Tue, 21 Apr 2009 22:49:06 +0200
Andrea Righi <[email protected]> wrote:
> yep! right. Anyway, it's not completely wrong to account dirty pages in
> this way. The dirty pages actually belong to cgroup A, and providing per
> cgroup upper limits on dirty pages could help to equally distribute
> dirty pages, which are hard/slow to reclaim, among cgroups.
>
> But this is definitely another problem.
>
Hmm, my motivation for dirty accounting in memcg is to support dirty_ratio,
to do smooth page reclaiming and to kick background writeout.


> And it doesn't help with the problem described by Ted, especially for the
> IO controller. The only way I see to correctly handle that case is to
> limit the rate of dirty pages per cgroup, accounting the dirty activity
> to the cgroup that first touched the page (and not to the owner as
> intended by the memory controller).
>
The owner of the page should know the dirty ratio, too.

> And this should be probably strictly connected to the IO controller. If
> we throttle or delay the dispatching/submission of some IO requests
> without throttling the dirty pages rate a cgroup could completely waste
> its own available memory with dirty (hard and slow to reclaim) pages.
>
> That is in part the approach I used in io-throttle v12, adding a hook in
> balance_dirty_pages_ratelimited_nr() to throttle the current task when
> the cgroup's IO limits are exceeded. Argh!
>
> So, another proposal could be to re-add in io-throttle v14 the old hook
> also in balance_dirty_pages_ratelimited_nr().
>
> In this way io-throttle would:
>
> - use the page_cgroup infrastructure and page_cgroup->flags to encode the
> cgroup id that first dirtied a given page
> - account and appropriately throttle sync and writeback IO requests in
> submit_bio()
> - at the same time throttle the tasks in
> balance_dirty_pages_ratelimited_nr() if the cgroup they belong to has
> exhausted the IO BW (or quota, share, etc. in case of proportional BW
> limits)
>

IMHO, the io-controller should just work as an I/O subsystem, like a bdi. Now,
per-bdi dirty_ratio is supported and it seems to work well.

Can't we write a function like bdi_writeout_fraction()?
It would be a simple choice.

Thanks,
-Kame

2009-04-22 01:23:53

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Wed, 22 Apr 2009 09:33:49 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:


> > And this should be probably strictly connected to the IO controller. If
> > we throttle or delay the dispatching/submission of some IO requests
> > without throttling the dirty pages rate a cgroup could completely waste
> > its own available memory with dirty (hard and slow to reclaim) pages.
> >
> > That is in part the approach I used in io-throttle v12, adding a hook in
> > balance_dirty_pages_ratelimited_nr() to throttle the current task when
> > cgroup's IO limit are exceeded. Argh!
> >
> > So, another proposal could be to re-add in io-throttle v14 the old hook
> > also in balance_dirty_pages_ratelimited_nr().
> >
> > In this way io-throttle would:
> >
> > - use page_cgroup infrastructure and page_cgroup->flags to encode the
> > cgroup id that firstly dirtied a generic page
> > - account and opportunely throttle sync and writeback IO requests in
> > submit_bio()
> > - at the same time throttle the tasks in
> > balance_dirty_pages_ratelimited_nr() if the cgroup they belong has
> > exhausted the IO BW (or quota, share, etc. in case of proportional BW
> > limit)
> >
>
> IMHO, io-controller should just work as I/O subsystem as bdi. Now, per-bdi dirty_ratio
> is suppoted and it seems to work well.
>
> Can't we write a function like bdi_writeout_fraction() ?;
> It will be a simple choice.
>
One more thing: if you want dirty_ratio for throttling I/O and not for supporting
page reclaim, something like task_dirty_limit() would be appropriate.

Thanks,
-Kame

2009-04-22 03:31:39

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

* Theodore Tso <[email protected]> [2009-04-21 15:14:01]:

> On Tue, Apr 21, 2009 at 11:44:29PM +0530, Balbir Singh wrote:
> >
> > That would be true in general, but only the process writing to the
> > file will dirty it. So dirty already accounts for the read/write
> > split. I'd assume that the cost is only for the dirty page, since we
> > do IO only on write in this case, unless I am missing something very
> > obvious.
>
> Maybe I'm missing something, but the (in development) patches I saw
> seemed to use the existing infrastructure designed for RSS cost
> tracking (which is also not yet in mainline, unless I'm mistaken ---
> but I didn't see page_get_page_cgroup() in the mainline tree yet).
>
> Right? So if process A in cgroup A touches the file first by
> reading from it, then the pages read by process A will be assigned as
> being "owned" by cgroup A. Then when the patch described at
>
> http://lkml.org/lkml/2008/9/9/245

That is correct, but on reclaim (hitting the limit) a page that is frequently
used by B and not by A can get reclaimed from A and move to B if B is
using it heavily.

>
> ... tries to charge a write done by process B in cgroup B, the code
> will call page_get_page_cgroup(), see that it is "owned" by cgroup A,
> and charge the dirty page to cgroup A. If process A and all of the
> other processes in cgroup A only access this file read-only, and
> process B is updating this file very heavily --- and it is a large
> file --- then cgroup B will get a completely free pass as far as
> dirtying pages to this file, since it will be all charged 100% to
> cgroup A, incorrectly.
>
> So what am I missing?

You are right. As long as A is not exceeding its limit, B will get a
free pass at the page. The page will be inactive on A's LRU and active
on the global LRU though from the memory controller perspective. We'll
need to find a way to fix this, if this is a very common scenario for
the IO controller.

--
Balbir

2009-04-22 10:22:56

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Wed, Apr 22, 2009 at 10:21:53AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 22 Apr 2009 09:33:49 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>
> > > And this should be probably strictly connected to the IO controller. If
> > > we throttle or delay the dispatching/submission of some IO requests
> > > without throttling the dirty pages rate a cgroup could completely waste
> > > its own available memory with dirty (hard and slow to reclaim) pages.
> > >
> > > That is in part the approach I used in io-throttle v12, adding a hook in
> > > balance_dirty_pages_ratelimited_nr() to throttle the current task when
> > > cgroup's IO limit are exceeded. Argh!
> > >
> > > So, another proposal could be to re-add in io-throttle v14 the old hook
> > > also in balance_dirty_pages_ratelimited_nr().
> > >
> > > In this way io-throttle would:
> > >
> > > - use page_cgroup infrastructure and page_cgroup->flags to encode the
> > > cgroup id that firstly dirtied a generic page
> > > - account and opportunely throttle sync and writeback IO requests in
> > > submit_bio()
> > > - at the same time throttle the tasks in
> > > balance_dirty_pages_ratelimited_nr() if the cgroup they belong has
> > > exhausted the IO BW (or quota, share, etc. in case of proportional BW
> > > limit)
> > >
> >
> > IMHO, io-controller should just work as I/O subsystem as bdi. Now, per-bdi dirty_ratio
> > is suppoted and it seems to work well.
> >
> > Can't we write a function like bdi_writeout_fraction() ?;
> > It will be a simple choice.
> >
> One more thing, if you want dirty_ratio for throttoling I/O not for supporing page reclaim,
> Something like task_dirty_limit() will be apporpriate.
>
> Thanks,
> -Kame

Actually I was proposing something quite similar, if I've understood
correctly: just add a hook in balance_dirty_pages() to throttle tasks in
cgroups that have exhausted their IO BW.

The way to do so would be similar to the per-bdi write throttling: take
into account the IO requests previously submitted per cgroup and the pages
dirtied per cgroup (considering that they are not necessarily dirtied by
the owner of the page), and apply something like congestion_wait() to
throttle the tasks in the cgroups that exceeded the BW limit.

Maybe we can just introduce cgroup_dirty_limit(), simply replicating what
we're doing for task_dirty_limit(), but using per-cgroup statistics of
course.

I can change the io-throttle controller to do so. This feature should also
be valid for the proportional BW approach.
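
Just to make it concrete, here is a rough and untested sketch of the hook I
have in mind (all the cgroup_* names below are placeholders, only
congestion_wait() is the real thing):

	/*
	 * Hypothetical hook for balance_dirty_pages(): if the current
	 * task's cgroup exceeded its IO BW or its dirty-page share, wait
	 * a bit, exactly as we already do for a congested bdi.
	 */
	static void cgroup_dirty_throttle(struct backing_dev_info *bdi)
	{
		unsigned long dirty, limit;

		/* per-cgroup analogue of task_dirty_limit() */
		cgroup_dirty_stats(current, &dirty, &limit);

		while (dirty > limit || cgroup_io_bw_exceeded(current, bdi)) {
			congestion_wait(WRITE, HZ / 10);
			cgroup_dirty_stats(current, &dirty, &limit);
		}
	}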

BTW, Vivek's proposal to also dispatch IO requests according to cgroup
proportional BW limits can still be valid and is worth testing IMHO. But
we must also find a way to say to the right cgroup: hey! stop wasting
memory with dirty pages, because you've directly or indirectly generated
too much IO in the system and I'm throttling and/or not scheduling your
IO requests.

Objections?

-Andrea

2009-04-23 00:07:28

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Wed, 22 Apr 2009 12:22:41 +0200
Andrea Righi <[email protected]> wrote:

> Actually I was proposing something quite similar, if I've understood
> well. Just add a hook in balance_dirty_pages() to throttle tasks in
> cgroups that exhausted their IO BW.
>
> The way to do so will be similar to the per-bdi write throttling, taking
> in account the IO requests previously submitted per cgroup, the pages
> dirtied per cgroup (considering that are not necessarily dirtied by the
> owner of the page) and apply something like congestion_wait() to
> throttle the tasks in the cgroups that exceeded the BW limit.
>
> Maybe we can just introduce cgroup_dirty_limit() simply replicating what
> we're doing for task_dirty_limit(), but using per cgroup statistics of
> course.
>
> I can change the io-throttle controller to do so. This feature should be
> valid also for the proportional BW approach.
>
> BTW Vivek's proposal to also dispatch IO requests according to cgroup
> proportional BW limits can be still valid and it is worth to be tested
> IMHO. But we must also find a way to say to the right cgroup: hey! stop
> to waste the memory with dirty pages, because you've directly or
> indirectly generated too much IO in the system and I'm throttling and/or
> not scheduling your IO requests.
>
> Objections?
>
No objections. Please let me know if my following understanding is right.

1. dirty_ratio should be supported per cgroup.
- The memory cgroup should support dirty_ratio, or a dirty_ratio cgroup should be
implemented. For doing this, we can make use of page_cgroup.

One good point of a dirty-ratio cgroup is that dirty-ratio accounting is done
against the cgroup which made the pages dirty, not against the owner of the page.
But if the dirty_ratio cgroup is completely independent of mem_cgroup, it cannot
help memory reclaiming.
Then,
- memcg itself should have a dirty_ratio check.
- like bdi/task_dirty_limit(), a cgroup (which is not memcg) can be used as
another filter for dirty_ratio.

2. dirty_ratio is not I/O BW control.

3. An I/O BW (limit) control cgroup should be implemented and it should exist
in the I/O scheduling layer or somewhere around it. But it's not easy.

4. To track buffered I/O, we have to add a "tag" to pages which tells us who
generated the I/O. Now it's called blockio-cgroup and the implementation details
are still under discussion.

So, current status is.

A. memcg should support dirty_ratio for its own memory reclaim.
in plan.

B. another cgroup can be implemented to support cgroup_dirty_limit().
But the relationship with "A" should be discussed.
no plan yet.

C. I/O cgroup and buffered I/O tracking system.
Now under patch review.

And this I/O throttle is mainly for "C" discussion.

Right ?

-Kame



Regards,
-Kame



2009-04-23 01:26:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Thu, Apr 23, 2009 at 09:05:35AM +0900, KAMEZAWA Hiroyuki wrote:
> So, current status is.
>
> A. memcg should support dirty_ratio for its own memory reclaim.
> in plan.
>
> B. another cgroup can be implemnted to support cgroup_dirty_limit().
> But relationship with "A" should be discussed.
> no plan yet.
>
> C. I/O cgroup and bufferred I/O tracking system.
> Now under patch review.
>
> And this I/O throttle is mainly for "C" discussion.

How much testing has been done in terms of whether the I/O throttling
actually works? Not just "the kernel doesn't crash", but whether, when
you have one process generating a large amount of I/O load in various
different ways, the right thing happens? If so, how has
this been measured?

I'm really concerned that, given some of the ways that I/O will "leak"
out --- via pdflush, swap writeout, etc. --- without the rest of
the pieces in place, I/O throttling by itself might not prove to be
very effective. Sure, if the workload is only doing direct I/O, life
is pretty easy and it shouldn't be hard to throttle the cgroup.

But in the case where there is buffered I/O, without write
throttling, it's hard to see how well the I/O controller will work in
practice. In fact, I wouldn't be that surprised if it's possible to
trigger the OOM killer.......

Regards,

- Ted

2009-04-23 02:56:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Wed, 22 Apr 2009 21:22:54 -0400
Theodore Tso <[email protected]> wrote:

> On Thu, Apr 23, 2009 at 09:05:35AM +0900, KAMEZAWA Hiroyuki wrote:
> > So, current status is.
> >
> > A. memcg should support dirty_ratio for its own memory reclaim.
> > in plan.
> >
> > B. another cgroup can be implemnted to support cgroup_dirty_limit().
> > But relationship with "A" should be discussed.
> > no plan yet.
> >
> > C. I/O cgroup and bufferred I/O tracking system.
> > Now under patch review.
> >
> > And this I/O throttle is mainly for "C" discussion.
>
> How much testing has been done in terms of whether the I/O throttling
> actually works? Not just, "the kernel doesn't crash", but that where
> you have one process generating a large amount of I/O load, in various
> different ways, and whether the right things happens? If so, how has
> this been measured?

I/O control people should prove it. And they do, I think.

>
> I'm really concerned that given some of the ways that I/O will "leak"
> out --- the via pdflush, swap writeout, etc., that without the rest of
> the pieces in place, I/O throttling by itself might not prove to be
> very effective. Sure, if the workload is only doing direct I/O, life
> is pretty easy and it shouldn't be hard to throttle the cgroup.
>
It's just a problem of "what we do and what we don't, now".
Andrea, Vivek, could you clarify? As with other projects, the I/O controller will
not be 100% complete in its first implementation.

> But in the case where there is bufferred I/O, without write
> throttling, it's hard to see how well the I/O controller will work in
> practice. In fact, I wouldn't be that surprised if it's possible to
> trigger the OOM killer.......
>

yes, then memcg should have a dirty_ratio handler. And we may have to
implement a dirty-ratio controller. So, please don't merge the memcg discussion
and I/O BW throttling. They are related to each other but are different problems.

Thanks,
-Kame

2009-04-23 04:37:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Thu, Apr 23, 2009 at 11:54:19AM +0900, KAMEZAWA Hiroyuki wrote:
> > How much testing has been done in terms of whether the I/O throttling
> > actually works? Not just, "the kernel doesn't crash", but that where
> > you have one process generating a large amount of I/O load, in various
> > different ways, and whether the right things happens? If so, how has
> > this been measured?
>
> I/O control people should prove it. And they do, I think.
>

Well, with all due respect, the fact that they only tested removing
the ext3 patch to fs/jbd2/commit.c, and discovered it had no effect,
only after I asked some questions about how it could possibly work
from a theoretical basis, makes me wonder exactly how much testing has
actually been done to date. Which is why I asked the question....

> > I'm really concerned that given some of the ways that I/O will "leak"
> > out --- the via pdflush, swap writeout, etc., that without the rest of
> > the pieces in place, I/O throttling by itself might not prove to be
> > very effective. Sure, if the workload is only doing direct I/O, life
> > is pretty easy and it shouldn't be hard to throttle the cgroup.
>
> It's just a problem of "what we do and what we don't, now".
> Andrea, Vivek, could you clarify ? As other project, I/O controller
> will not be 100% at first implementation.

Yeah, but if the design hasn't been fully validated, maybe the
implementation isn't ready for merging yet. I only came across these
patch series because of the ext3 patch, and when I started looking at
it just from a high level point of view, I'm concerned about the
design gaps and exactly how much high level thinking has gone into the
patches. This isn't a NACK per se, because I haven't spent the time
to look at this code very closely (nor do I have the time).

Consider this more of a yellow flag being thrown on the field, in the
hope that the block layer and VM experts will take a much closer
look at these patches. I have a vague sense of disquiet that the
container patches are touching a very large number of subsystems
across the kernel, and it's not clear to me that the maintainers of all
of the subsystems have been paying very close attention and doing a
proper high-level review of the design.

Simply on the strength of a very cursory review and asking a few
questions, it seems to me that the I/O controller was implemented
apparently without even thinking about the write throttling problems,
and this is just making me.... very, very nervous.

I hope someone like akpm is paying very close attention and auditing
these patches both from a low-level patch cleanliness point of view
as well as a high-level design review. Or at least that *someone* is
doing so and can perhaps document how all of these knobs interact.
After all, if they are going to be separate, and someone turns the I/O
throttling knob without bothering to turn the write throttling knob
--- what's going to happen? An OOM? That's not going to be very safe
or friendly for the sysadmin who plans to be configuring the system.

Maybe this high-level design consideration is happening, and I just
haven't seen it. I sure hope so.

- Ted

2009-04-23 05:11:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Thu, 23 Apr 2009 00:35:48 -0400 Theodore Tso <[email protected]> wrote:

> I hope someone like akpm is paying very close attention and auditing
> these patches both from an low-level patch cleanliness point of view
> as well as a high-level design review.

Not yet, really. But I intend to. Largely because I've always been
very skeptical that anyone has found a good solution to...

> Or at least that *someone* is
> doing so and can perhaps document how all of these knobs interact.
> After all, if they are going to be separate, and someone turns the I/O
> throttling knob without bothering to turn the write throttling knob
> --- what's going to happen? An OOM? That's not going to be very safe
> or friendly for the sysadmin who plans to be configuring the system.

... this problem.

2009-04-23 05:39:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Wed, 22 Apr 2009 21:58:25 -0700
Andrew Morton <[email protected]> wrote:
> > Or at least that *someone* is
> > doing so and can perhaps document how all of these knobs interact.
> > After all, if they are going to be separate, and someone turns the I/O
> > throttling knob without bothering to turn the write throttling knob
> > --- what's going to happen? An OOM? That's not going to be very safe
> > or friendly for the sysadmin who plans to be configuring the system.
>
> ... this problem.
>
Considering a low-io-limit cgroup as a very slow device,
the problem itself is not far from the one the current kernel already has.
If the per-bdi dirty ratio works well, we can write a per-cgroup dirty ratio, I think.
(I'll do it if I find time.)

But yes, a configuration how-to should be documented, eventually.
I hope sysadmins will not use some acrobatic configuration ;)


Thanks,
-Kame

2009-04-23 09:44:49

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Thu, Apr 23, 2009 at 12:35:48AM -0400, Theodore Tso wrote:
> On Thu, Apr 23, 2009 at 11:54:19AM +0900, KAMEZAWA Hiroyuki wrote:
> > > How much testing has been done in terms of whether the I/O throttling
> > > actually works? Not just, "the kernel doesn't crash", but that where
> > > you have one process generating a large amount of I/O load, in various
> > > different ways, and whether the right things happens? If so, how has
> > > this been measured?
> >
> > I/O control people should prove it. And they do, I think.
> >
>
> Well, with all due respect, the fact that they only tested removing
> the ext3 patch to fs/jbd2/commit.c, and discovered it had no effect,
> only after I asked some questions about how it could possibly work
> from a theoretical basis, makes me wonder exactly how much testing has
> actually been done to date. Which is why I asked the question....

This is true in part. Actually io-throttle v12 has been tested extensively,
also in production environments (Matt and David in cc can confirm
this), with quite interesting results.

I usually tested the previous versions with many parallel iozone and dd
runs, using many different configurations.

In v12 writeback IO is not actually limited; what io-throttle did was to
account and limit reads and direct IO in submit_bio(), and to limit and
account page cache writes in balance_dirty_pages_ratelimited_nr().
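
Roughly, the v12 hooks looked like this (simplified; cgroup_io_throttle()
here is just a placeholder for the io-throttle entry point, not necessarily
the real signature):

	/* in submit_bio(): account and throttle reads and direct IO */
	if (bio_has_data(bio))
		cgroup_io_throttle(current, bio->bi_bdev, bio->bi_size);

	/* in balance_dirty_pages_ratelimited_nr(): account and throttle
	 * page cache writes before they turn into writeback */
	cgroup_io_throttle(current, mapping->backing_dev_info,
			   nr_pages_dirtied << PAGE_CACHE_SHIFT);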

This seems to work quite well for the cases where we want to avoid a
single cgroup eating all the IO BW, but in this way, in presence of a large
write stream, we periodically get bunches of writeback IO that can
disrupt the other cgroups' BW requirements, from the QoS perspective.

The point is that in the new versions (v13 and v14) I merged the
bio-cgroup stuff to track and properly handle writeback IO in a
"smoother" way, actually changing some core components of the
io-throttle controller.

And this means it surely needs additional testing before being merged
into mainline.

I'll reproduce all the tests and publish the results ASAP using the new
implementation. I was just waiting to reach a stable point in the
implementation decisions before doing that.

>
> > > I'm really concerned that given some of the ways that I/O will "leak"
> > > out --- the via pdflush, swap writeout, etc., that without the rest of
> > > the pieces in place, I/O throttling by itself might not prove to be
> > > very effective. Sure, if the workload is only doing direct I/O, life
> > > is pretty easy and it shouldn't be hard to throttle the cgroup.
> >
> > It's just a problem of "what we do and what we don't, now".
> > Andrea, Vivek, could you clarify ? As other project, I/O controller
> > will not be 100% at first implementation.
>
> Yeah, but if the design hasn't been fully validated, maybe the
> implementation isn't ready for merging yet. I only came across these
> patch series because of the ext3 patch, and when I started looking at
> it just from a high level point of view, I'm concerned about the
> design gaps and exactly how much high level thinking has gone into the
> patches. This isn't a NACK per se, because I haven't spent the time
> to look at this code very closely (nor do I have the time).

And the ext3 patch BTW was just an experimental test, which has been
useful in the end, because now I have the attention and some feedback
also from the fs experts... :)

Anyway, as said above, at least for io-throttle this is not a totally
new implementation. It's a quite old and tested cgroup subsystem, but
some core components have been redesigned. For this reason it surely
needs more testing, and we're still discussing some implementation
details. I'd say the basic interface is stable and, as Kamezawa said, we
just need to decide what we do, what we don't, which problems the IO
controller should address and which should be considered by other cgroup
subsystems (like the dirty ratio issue).

>
> Consider this more of a yellow flag being thrown on the field, in the
> hopes that the block layer and VM experts will take a much closer
> review of these patches. I have a vague sense of disquiet that the
> container patches are touching a very large number of subsystems
> across the kernels, and it's not clear to me the maintainers of all of
> the subsystems have been paying very close attention and doing a
> proper high-level review of the design.

Agreed that the IO controller touches a lot of critical kernel components.
Feedback from VM and block layer experts would be really welcome.

>
> Simply on the strength of a very cursory reivew and asking a few
> questions, it seems to me that the I/O controller was implemented,
> apparently without even thinking about the write throttling problems,
> and this just making me.... very, very, nervous.

Actually we discussed the write throttling problems a lot. I have been
addressing the problem at least since io-throttle RFC v2 (posted in June 2008).

>
> I hope someone like akpm is paying very close attention and auditing
> these patches both from an low-level patch cleanliness point of view
> as well as a high-level design review. Or at least that *someone* is
> doing so and can perhaps document how all of these knobs interact.
> After all, if they are going to be separate, and someone turns the I/O
> throttling knob without bothering to turn the write throttling knob
> --- what's going to happen? An OOM? That's not going to be very safe
> or friendly for the sysadmin who plans to be configuring the system.

>
> Maybe this high level design considerations is happening, and I just
> haven't have seen it. I sure hope so.

In a previous discussion (http://lkml.org/lkml/2008/11/4/565) we decided
to split the problems: the decision was that the IO controller should
consider only IO requests and the memory controller should take care of
the OOM / dirty pages problems. A distinct memcg dirty_ratio seemed to be
a good start. Anyway, I think we're not so far from having an acceptable
solution, also looking at the recent thoughts and discussions in this
thread. For the implementation part, as pointed out by Kamezawa, the per-bdi /
per-task dirty ratio is a very similar problem. Probably we can simply
replicate the same concepts per cgroup.

-Andrea

2009-04-23 10:03:49

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Thu, Apr 23, 2009 at 09:05:35AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 22 Apr 2009 12:22:41 +0200
> Andrea Righi <[email protected]> wrote:
>
> > Actually I was proposing something quite similar, if I've understood
> > well. Just add a hook in balance_dirty_pages() to throttle tasks in
> > cgroups that exhausted their IO BW.
> >
> > The way to do so will be similar to the per-bdi write throttling, taking
> > in account the IO requests previously submitted per cgroup, the pages
> > dirtied per cgroup (considering that are not necessarily dirtied by the
> > owner of the page) and apply something like congestion_wait() to
> > throttle the tasks in the cgroups that exceeded the BW limit.
> >
> > Maybe we can just introduce cgroup_dirty_limit() simply replicating what
> > we're doing for task_dirty_limit(), but using per cgroup statistics of
> > course.
> >
> > I can change the io-throttle controller to do so. This feature should be
> > valid also for the proportional BW approach.
> >
> > BTW Vivek's proposal to also dispatch IO requests according to cgroup
> > proportional BW limits can be still valid and it is worth to be tested
> > IMHO. But we must also find a way to say to the right cgroup: hey! stop
> > to waste the memory with dirty pages, because you've directly or
> > indirectly generated too much IO in the system and I'm throttling and/or
> > not scheduling your IO requests.
> >
> > Objections?
> >
> No objections. plz let me know my following understanding is right.
>
> 1. dirty_ratio should be supported per cgroup.
> - Memory cgroup should support dirty_ratio or dirty_ratio cgroup should be implemented.
> For doing this, we can make use of page_cgroup.
>
> One good point of dirty-ratio cgroup is that dirty-ratio accounting is done
> against a cgroup which made pages dirty not against a owner of the page. But
> if dirty_ratio cgroup is completely independent from mem_cgroup, it cannot
> be a help for memory reclaiming.
> Then,
> - memcg itself should have dirty_ratio check.
> - like bdi/task_dirty_limit(), a cgroup (which is not memcg) can be used
> another filter for dirty_ratio.

Agreed. We probably need two different dirty_ratio statistics: one to
check the dirty pages inside a memcg for memory reclaim, and another to
check how many dirty pages a cgroup has generated in the system.
Something similar to the task_struct->dirties and global dirty
statistics.
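
Conceptually something like this (just a sketch, the names are made up):

	/* two distinct per-cgroup dirty statistics */
	struct cgroup_dirty_stats {
		/* dirty pages currently charged to this memcg: used by
		 * the memcg dirty_ratio check for its own reclaim */
		unsigned long nr_dirty;
		/* pages dirtied system-wide by tasks of this cgroup,
		 * analogous to task_struct->dirties vs. the global
		 * counters: used by the dirty/IO throttling side */
		unsigned long nr_dirtied;
	};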

>
> 2. dirty_ratio is not I/O BW control.

Agreed. They are two different problems. Maybe they could be connected,
but the connection can be made in userspace by mounting the dirty_ratio
cgroup and blockio subsystems together.

For example: give 10MB/s of IO BW to cgroup A and also set an upper limit on
the dirty pages this cgroup can generate in the system, i.e. 10% of the
system-wide reclaimable memory. If the dirty limit is exceeded, the tasks
in this cgroup will start to actively write back system-wide dirty pages
at the rate defined by the IO controller.

>
> 3. I/O BW(limit) control cgroup should be implemented and it should be exsiting
> in I/O scheduling layer or somewhere around. But it's not easy.

Agreed. Especially for the "it's not easy" part. :)

>
> 4. To track bufferred I/O, we have to add "tag" to pages which tell us who
> generated the I/O. Now it's called blockio-cgroup and implementation details
> are still under discussion.

OK.

>
> So, current status is.
>
> A. memcg should support dirty_ratio for its own memory reclaim.
> in plan.
>
> B. another cgroup can be implemnted to support cgroup_dirty_limit().
> But relationship with "A" should be discussed.
> no plan yet.
>
> C. I/O cgroup and bufferred I/O tracking system.
> Now under patch review.

D. The I/O tracking system must be implemented as a common
infrastructure and not as a separate cgroup subsystem. This would
allow it to be easily reused by other potential cgroup
controllers, and avoid introducing oddities and complexity in
userspace (separate mount points, etc.); see the sketch below.
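
To be concrete, the kind of minimal interface I have in mind for D is
something like this (the names are only placeholders, not necessarily the
current bio-cgroup API):

	/* record that the current task's cgroup dirtied this page */
	void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);

	/* return the id of the cgroup that dirtied the page backing this
	 * bio, usable by any IO controller at submit_bio() time */
	unsigned long get_blkio_cgroup_id(struct bio *bio);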

>
> And this I/O throttle is mainly for "C" discussion.
>
> Right ?

Right. In io-throttle v14 I also merged some of the blockio-cgroup
functionality, so io-throttle is mainly for C and D, but D should
probably be considered as a separate patchset.

-Andrea

2009-04-23 12:19:21

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Thu, Apr 23, 2009 at 11:44:24AM +0200, Andrea Righi wrote:
> This is true in part. Actually io-throttle v12 has been largely tested,
> also in production environments (Matt and David in cc can confirm
> this) with quite interesting results.
>
> I tested the previous versions usually with many parallel iozone, dd,
> using many different configurations.
>
> In v12 writeback IO is not actually limited, what io-throttle did was to
> account and limit reads and direct IO in submit_bio() and limit and
> account page cache writes in balance_dirty_pages_ratelimited_nr().

Did the testing include what happened if the system was also
simultaneously under memory pressure? What you might find happening
then is that the cgroups which have lots of dirty pages, which are not
getting written out, have their memory usage "protected", while
cgroups that have lots of clean pages have more of their pages
(unfairly) evicted from memory. The worst case, of course, would be
if the memory pressure is coming from an uncapped cgroup.

> In a previous discussion (http://lkml.org/lkml/2008/11/4/565) we decided
> to split the problems: the decision was that IO controller should
> consider only IO requests and the memory controller should take care of
> the OOM / dirty pages problems. Distinct memcg dirty_ratio seemed to be
> a good start. Anyway, I think we're not so far from having an acceptable
> solution, also looking at the recent thoughts and discussions in this
> thread. For the implementation part, as pointed by Kamezawa per bdi /
> task dirty ratio is a very similar problem. Probably we can simply
> replicate the same concepts per cgroup.

I looked at that discussion, and it doesn't seem to be about splitting
the problem between the IO controller and the memory controller at
all. Instead, Andrew is talking about how throttling dirty memory page
writeback on a per-cpuset basis (which is what Christoph Lameter
wanted for large SGI systems) made sense as compared to controlling
the rate at which pages got dirty, which is considered much higher
priority:

Generally, I worry that this is a specific fix to a specific problem
encountered on specific machines with specific setups and specific
workloads, and that it's just all too low-level and myopic.

And now we're back in the usual position where there's existing code and
everyone says it's terribly wonderful and everyone is reluctant to step
back and look at the big picture. Am I wrong?

Plus: we need per-memcg dirty-memory throttling, and this is more
important than per-cpuset, I suspect. How will the (already rather
buggy) code look once we've stuffed both of them in there?

So that's basically the same worry I have; which is we're looking at
things at a too-low-level basis, and not at the big picture.

There wasn't discussion about the I/O controller on this thread at
all, at least as far as I could find; nor that splitting the problem
was the right way to solve the problem. Maybe somewhere there was a
call for someone to step back and take a look at the "big picture"
(what I've been calling the high level design), but I didn't see it in
the thread.

It would seem to be much simpler if there was a single tuning knob for
the I/O controller and for dirty page writeback --- after all, why
*else* would you be trying to control the rate at which pages get
dirty? And if you have a cgroup which sometimes does a lot of writes
via direct I/O, and sometimes does a lot of writes through the page
cache, and sometimes does *both*, it would seem to me that if you want
to be able to smoothly limit the amount of I/O it does, you would want
to account and charge for direct I/O and page cache I/O under the same
"bucket". Is that what the user would want?

Suppose you only have 200 MB/sec worth of disk bandwidth, and you
parcel it out in 50 MB/sec chunks to 4 cgroups. But you also parcel
out 50MB/sec of dirty writepages quota to each of the 4 cgroups. Now
suppose one of the cgroups, which was normally doing not much of
anything, suddenly starts doing a database backup which does 50 MB/sec
of direct I/O reading from the database file, and 50 MB/sec dirtying
pages in the page cache as it writes the backup file. Suddenly that
one cgroup is using half of the system's I/O bandwidth!

And before you say this is "correct" from a definitional point of
view, is it "correct" from what a system administrator would want to
control? Is it the right __feature__? If you just say, well, we
defined the problem that way, and we're doing things the way we
defined it, that's a case of garbage in, garbage out. You also have
to ask the question, "did we define the _problem_ in the right way?"
What does the user of this feature really want to do?

It would seem to me that the system administrator would want a single
knob, saying "I don't know or care how the processes in a cgroup does
its I/O; I just want to limit things so that the cgroup can only hog
25% of the I/O bandwidth."

And note this is completely separate from the question of what happens
if you throttle I/O in the page cache writeback loop, and you end up
with an imbalance in the clean/dirty ratios of the cgroups. And
looking at this thread, life gets even *more* amusing on NUMA machines
if you do this; what if you end up starving a cpuset as a result of
this I/O balancing decision, so a particular cpuset doesn't have
enough memory? That's when you'll *definitely* start having OOM
problems.

So maybe someone has thought about all of these issues --- if so, may
I gently suggest that someone write all of this down? The design
issues here are subtle, at least to my little brain, and relying on
people remembering that something was discussed on LKML six months ago
doesn't seem like a good long-term strategy. Eventually this code
will need to be maintained, and maybe some of the engineers working on
it will have moved on to other projects. So this is something that
rather definitely deserves to be written up and dropped into
Documentation/ or into ample code comments discussing how the
various subsystems interact.

Best regards,

- Ted

2009-04-23 12:32:13

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

P.S. I'm not saying that all of these problems need to be solved in a
single subsystem, or even in a single patch series. Just that someone
is thinking about how these ideas all fit together in a high-level
plan, which is written down. It could be wrong, but it really seems
to me that each time someone says, "so what about X", another piece gets
bolted on, as opposed to thinking about what the whole thing will look
like from the very beginning.

- Ted

2009-04-23 21:13:23

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Thu, Apr 23, 2009 at 08:17:45AM -0400, Theodore Tso wrote:
> On Thu, Apr 23, 2009 at 11:44:24AM +0200, Andrea Righi wrote:
> > This is true in part. Actually io-throttle v12 has been largely tested,
> > also in production environments (Matt and David in cc can confirm
> > this) with quite interesting results.
> >
> > I tested the previous versions usually with many parallel iozone, dd,
> > using many different configurations.
> >
> > In v12 writeback IO is not actually limited, what io-throttle did was to
> > account and limit reads and direct IO in submit_bio() and limit and
> > account page cache writes in balance_dirty_pages_ratelimited_nr().
>
> Did the testing include what happened if the system was also
> simultaneously under memory pressure? What you might find happening
> then is that the cgroups which have lots of dirty pages, which are not
> getting written out, have their memory usage "protected", while
> cgroups that have lots of clean pages have more of their pages
> (unfairly) evicted from memory. The worst case, of course, would be
> if the memory pressure is coming from an uncapped cgroup.

This is an interesting case that should be considered, of course. The
tests I did were mainly focused on distinct environments where each
cgroup writes its own files and dirties its own memory. I'll add this
case to the next tests I'll do with io-throttle.

But it's a general problem IMHO and doesn't depend only on the presence
of an IO controller. The same issue can happen if a cgroup reads a file
from a slow device and another cgroup writes to all the pages of the
other cgroup.

Maybe this kind of cgroup unfairness should be addressed by the memory
controller; the IO controller should just look like another slow device
in this particular case.

>
> > In a previous discussion (http://lkml.org/lkml/2008/11/4/565) we decided
> > to split the problems: the decision was that IO controller should
> > consider only IO requests and the memory controller should take care of
> > the OOM / dirty pages problems. Distinct memcg dirty_ratio seemed to be
> > a good start. Anyway, I think we're not so far from having an acceptable
> > solution, also looking at the recent thoughts and discussions in this
> > thread. For the implementation part, as pointed by Kamezawa per bdi /
> > task dirty ratio is a very similar problem. Probably we can simply
> > replicate the same concepts per cgroup.
>
> I looked at that discussion, and it doesn't seem to be about splitting
> the problem between the IO controller and the memory controller at
> all. Instead, Andrew is talking about how thottling dirty memory page
> writeback on a per-cpuset basis (which is what Christoph Lamaeter
> wanted for large SGI systems) made sense as compared to controlling
> the rate at which pages got dirty, which is considered much higher
> priority:
>
> Generally, I worry that this is a specific fix to a specific problem
> encountered on specific machines with specific setups and specific
> workloads, and that it's just all too low-level and myopic.
>
> And now we're back in the usual position where there's existing code and
> everyone says it's terribly wonderful and everyone is reluctant to step
> back and look at the big picture. Am I wrong?
>
> Plus: we need per-memcg dirty-memory throttling, and this is more
> important than per-cpuset, I suspect. How will the (already rather
> buggy) code look once we've stuffed both of them in there?

You're right. That thread was mainly focused on the dirty-page issue. My
fault, sorry.

I've looked back in my old mail archives to find other old discussions
about the dirty page and IO controller issue. I report some of them here
for completeness:

https://lists.linux-foundation.org/pipermail/virtualization/2008-August/011474.html
https://lists.linux-foundation.org/pipermail/virtualization/2008-August/011466.html
https://lists.linux-foundation.org/pipermail/virtualization/2008-August/011482.html
https://lists.linux-foundation.org/pipermail/virtualization/2008-August/011472.html

>
> So that's basically the same worry I have; which is we're looking at
> things at a too-low-level basis, and not at the big picture.
>
> There wasn't discussion about the I/O controller on this thread at
> all, at least as far as I could find; nor that splitting the problem
> was the right way to solve the problem. Maybe somewhere there was a
> call for someone to step back and take a look at the "big picture"
> (what I've been calling the high level design), but I didn't see it in
> the thread.
>
> It would seem to be much simpler if there was a single tuning knob for
> the I/O controller and for dirty page writeback --- after all, why
> *else* would you be trying to control the rate at which pages get
> dirty? And if you have a cgroup which sometimes does a lot of writes

Actually we already control the rate at which dirty pages are
generated: in balance_dirty_pages() we call congestion_wait() when the
bdi is congested.

We do that when we write to a slow device, for example: slow because it
is intrinsically slow, or because it is limited by some IO controlling
rules.

It is a very similar issue IMHO.
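
For reference, the tail of the balance_dirty_pages() loop already does
this pacing (quoted from memory, simplified); a BW-limited cgroup would
just look like one more congested/slow bdi at this point:

		if (pages_written >= write_chunk)
			break;		/* We've done our duty */

		congestion_wait(WRITE, HZ/10);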

> via direct I/O, and sometimes does a lot of writes through the page
> cache, and sometimes does *both*, it would seem to me that if you want
> to be able to smoothly limit the amount of I/O it does, you would want
> to account and charge for direct I/O and page cache I/O under the same
> "bucket". Is that what the user would want?
>
> Suppose you only have 200 MB/sec worth of disk bandwidth, and you
> parcel it out in 50 MB/sec chunks to 4 cgroups. But you also parcel
> out 50MB/sec of dirty writepages quota to each of the 4 cgroups. Now
> suppose one of the cgroups, which was normally doing not much of
> anything, suddenly starts doing a database backup which does 50 MB/sec
> of direct I/O reading from the database file, and 50 MB/sec dirtying
> pages in the page cache as it writes the backup file. Suddenly that
> one cgroup is using half of the system's I/O bandwidth!

Agreed. The bucket should be the same. The dirty memory should probably
be limited only in terms of "space" in this case, instead of BW.

And we should guarantee that a cgroup doesn't unfairly fill the memory
with dirty pages (system-wide or in other cgroups).

>
> And before you say this is "correct" from a definitional point of
> view, is it "correct" from what a system administrator would want to
> control? Is it the right __feature__? If you just say, well, we
> defined the problem that way, and we're doing things the way we
> defined it, that's a case of garbage in, garbage out. You also have
> to ask the question, "did we define the _problem_ in the right way?"
> What does the user of this feature really want to do?
>
> It would seem to me that the system administrator would want a single
> knob, saying "I don't know or care how the processes in a cgroup does
> its I/O; I just want to limit things so that the cgroup can only hog
> 25% of the I/O bandwidth."

Agreed.

>
> And note this is completely separate from the question of what happens
> if you throttle I/O in the page cache writeback loop, and you end up
> with an imbalance in the clean/dirty ratios of the cgroups. And
> looking at this thread, life gets even *more* amusing on NUMA machines
> if you do this; what if you end up starving a cpuset as a result of
> this I/O balancing decision, so a particular cpuset doesn't have
> enough memory? That's when you'll *definitely* start having OOM
> problems.
>
> So maybe someone has thought about all of these issues --- if so, may
> I gently suggest that someone write all of this down? The design
> issues here are subtle, at least to my little brain, and relying on
> people remembering that something was discussed on LKML six months ago
> doesn't seem like a good long-term strategy. Eventually this code
> will need to be maintained, and maybe some of the engineers working on
> it will have moved on to other projects. So this is something that is
> rather definitely deserves to be written up and dropped into
> Documentation/ or in ample code code comments discussing on the
> various subsystems interact.

I agree about the documentation. As also suggested by Balbir, we should
definitely start to write something in a common place (a wiki?) to collect
all the concepts and objectives we defined in the past and propose a
coherent solution.

Otherwise the risk is to keep going around in circles, discussing the
same issues and each proposing a different solution for specific
problems.

I can start extending the io-throttle documentation and
collect/integrate some concepts we've discussed in the past, but first
of all we really need to define all the possible use cases IMHO.

Honestly, I had never considered the cgroup "interactions" and the
unfair distribution of dirty pages among cgroups, for example, as
correctly pointed out by Ted.

Thanks,
-Andrea

2009-04-24 00:27:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Thu, 23 Apr 2009 23:13:04 +0200
Andrea Righi <[email protected]> wrote:

> On Thu, Apr 23, 2009 at 08:17:45AM -0400, Theodore Tso wrote:
> > On Thu, Apr 23, 2009 at 11:44:24AM +0200, Andrea Righi wrote:
> > > This is true in part. Actually io-throttle v12 has been largely tested,
> > > also in production environments (Matt and David in cc can confirm
> > > this) with quite interesting results.
> > >
> > > I tested the previous versions usually with many parallel iozone, dd,
> > > using many different configurations.
> > >
> > > In v12 writeback IO is not actually limited, what io-throttle did was to
> > > account and limit reads and direct IO in submit_bio() and limit and
> > > account page cache writes in balance_dirty_pages_ratelimited_nr().
> >
> > Did the testing include what happened if the system was also
> > simultaneously under memory pressure? What you might find happening
> > then is that the cgroups which have lots of dirty pages, which are not
> > getting written out, have their memory usage "protected", while
> > cgroups that have lots of clean pages have more of their pages
> > (unfairly) evicted from memory. The worst case, of course, would be
> > if the memory pressure is coming from an uncapped cgroup.
>
> This is an interesting case that should be considered of course. The
> tests I did were mainly focused in distinct environment where each
> cgroup writes its own files and dirties its own memory. I'll add this
> case to the next tests I'll do with io-throttle.
>
> But it's a general problem IMHO and doesn't depend only on the presence
> of an IO controller. The same issue can happen if a cgroup reads a file
> from a slow device and another cgroup writes to all the pages of the
> other cgroup.
>
> Maybe this kind of cgroup unfairness should be addressed by the memory
> controller, the IO controller should be just like another slow device in
> this particular case.
>
"soft limit"...for selecting victim at memory shortage is under development.


> >
> > So that's basically the same worry I have; which is we're looking at
> > things at a too-low-level basis, and not at the big picture.
> >
> > There wasn't discussion about the I/O controller on this thread at
> > all, at least as far as I could find; nor that splitting the problem
> > was the right way to solve the problem. Maybe somewhere there was a
> > call for someone to step back and take a look at the "big picture"
> > (what I've been calling the high level design), but I didn't see it in
> > the thread.
> >
> > It would seem to be much simpler if there was a single tuning knob for
> > the I/O controller and for dirty page writeback --- after all, why
> > *else* would you be trying to control the rate at which pages get
> > dirty? And if you have a cgroup which sometimes does a lot of writes
>
> Actually we do already control the rate at which dirty pages are
> generated. In balance_dirty_pages() we add a congestion_wait() when the
> bdi is congested.
>
> We do that when we write to a slow device for example. Slow because it
> is intrinsically slow or because it is limited by some IO controlling
> rules.
>
> It is a very similar issue IMHO.
>
I think so, too.

> > via direct I/O, and sometimes does a lot of writes through the page
> > cache, and sometimes does *both*, it would seem to me that if you want
> > to be able to smoothly limit the amount of I/O it does, you would want
> > to account and charge for direct I/O and page cache I/O under the same
> > "bucket". Is that what the user would want?
> >
> > Suppose you only have 200 MB/sec worth of disk bandwidth, and you
> > parcel it out in 50 MB/sec chunks to 4 cgroups. But you also parcel
> > out 50MB/sec of dirty writepages quota to each of the 4 cgroups.

50MB/sec of dirty writepages sounds strange. It's just a "50MB of dirty pages" limit,
not 50MB/sec, if we use a logic like dirty_ratio.


> > Now suppose one of the cgroups, which was normally doing not much of
> > anything, suddenly starts doing a database backup which does 50 MB/sec
> > of direct I/O reading from the database file, and 50 MB/sec dirtying
> > pages in the page cache as it writes the backup file. Suddenly that
> > one cgroup is using half of the system's I/O bandwidth!
>
Hmm? Can't buffered I/O tracking be a help? Of course, the I/O controller
should chase this. And dirty_ratio is not 50MB/sec but 50MB. Then,
reads will slow down very soon if the read/write is done by one thread.
(I'm not sure about the case of 2 threads, one only reading and another only writing.)

BTW, can read B/W and write B/W be handled under one limit?


> Agreed. The bucket should be the same. The dirty memory should be
> probably limited only in terms of "space" for this case instead of BW.
>
> And we should guarantee that a cgroup doesn't fill unfairly the memory
> with dirty pages (system-wide or in other cgroups).
>
> >
> > And before you say this is "correct" from a definitional point of
> > view, is it "correct" from what a system administrator would want to
> > control? Is it the right __feature__? If you just say, well, we
> > defined the problem that way, and we're doing things the way we
> > defined it, that's a case of garbage in, garbage out. You also have
> > to ask the question, "did we define the _problem_ in the right way?"
> > What does the user of this feature really want to do?
> >
> > It would seem to me that the system administrator would want a single
> > knob, saying "I don't know or care how the processes in a cgroup does
> > its I/O; I just want to limit things so that the cgroup can only hog
> > 25% of the I/O bandwidth."
>
> Agreed.
>
Agreed. That would be the best.

> >
> > And note this is completely separate from the question of what happens
> > if you throttle I/O in the page cache writeback loop, and you end up
> > with an imbalance in the clean/dirty ratios of the cgroups.
A dirty_ratio for memcg is planned, just delayed.

> > And
> > looking at this thread, life gets even *more* amusing on NUMA machines
> > if you do this; what if you end up starving a cpuset as a result of
> > this I/O balancing decision, so a particular cpuset doesn't have
> > enough memory? That's when you'll *definitely* start having OOM
> > problems.
> >
cpuset users shouldn't use I/O limiting, in general.
Or the I/O controller should have a switch such as "don't apply the I/O limit
if the I/O comes from kswapd/vmscan.c" (or categorize it as kernel I/O), as
sketched below.


> Honestly, I've never considered the cgroups "interactions" and the
> unfair distribution of dirty pages among cgroups, for example, as
> correctly pointed out by Ted.
>

If we really want that, the scheduler cgroup should be considered, too.

Looking at it optimistically, 99% of cgroup users will use a "container" and
all resource-control cgroups will be set up at once. Then, user-land
container tools can tell users whether the container has a good balance (of cpu, memory, I/O, etc.)
or not.

_Interactions_ are important. But cgroups are designed to have many independent subsystems
because they are considered generic infrastructure.
I didn't read the cgroup design discussion, but it's strange to say "we need
balance across subsystems in the kernel" _now_.

A container, the user interface of cgroups which most people think of, should
know that. If we can't do it in user land, we should of course find a way to
handle the _interactions_ in the kernel.

Thanks,
-Kame


2009-04-24 05:15:30

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

* Theodore Tso <[email protected]> [2009-04-23 00:35:48]:

> On Thu, Apr 23, 2009 at 11:54:19AM +0900, KAMEZAWA Hiroyuki wrote:
> > > How much testing has been done in terms of whether the I/O throttling
> > > actually works? Not just, "the kernel doesn't crash", but that where
> > > you have one process generating a large amount of I/O load, in various
> > > different ways, and whether the right thing happens? If so, how has
> > > this been measured?
> >
> > I/O control people should prove it. And they do, I think.
> >
>
> Well, with all due respect, the fact that they only tested removing
> the ext3 patch to fs/jbd2/commit.c, and discovered it had no effect,
> only after I asked some questions about how it could possibly work
> from a theoretical basis, makes me wonder exactly how much testing has
> actually been done to date. Which is why I asked the question....
>

The IO controller patches (now 3 in total) are undergoing review and
comparison. I had suggested we set up a wiki to track requirements and
design. I'll try to get that set up.

> > > I'm really concerned that given some of the ways that I/O will "leak"
> > > out --- via pdflush, swap writeout, etc. --- that without the rest of
> > > the pieces in place, I/O throttling by itself might not prove to be
> > > very effective. Sure, if the workload is only doing direct I/O, life
> > > is pretty easy and it shouldn't be hard to throttle the cgroup.
> >
> > It's just a problem of "what we do and what we don't, now".
> > Andrea, Vivek, could you clarify? As with other projects, the I/O controller
> > will not be 100% complete in its first implementation.
>
> Yeah, but if the design hasn't been fully validated, maybe the
> implementation isn't ready for merging yet. I only came across these
> patch series because of the ext3 patch, and when I started looking at
> it just from a high level point of view, I'm concerned about the
> design gaps and exactly how much high level thinking has gone into the
> patches. This isn't a NACK per se, because I haven't spent the time
> to look at this code very closely (nor do I have the time).
>
> Consider this more of a yellow flag being thrown on the field, in the
> hopes that the block layer and VM experts will take a much closer
> review of these patches. I have a vague sense of disquiet that the
> container patches are touching a very large number of subsystems
> across the kernel, and it's not clear to me that the maintainers of all of
> the subsystems have been paying very close attention and doing a
> proper high-level review of the design.
>
> Simply on the strength of a very cursory review and asking a few
> questions, it seems to me that the I/O controller was implemented,
> apparently without even thinking about the write throttling problems,
> and this is just making me... very, very, nervous.
>
> I hope someone like akpm is paying very close attention and auditing
> these patches both from a low-level patch cleanliness point of view
> as well as a high-level design review. Or at least that *someone* is
> doing so and can perhaps document how all of these knobs interact.
> After all, if they are going to be separate, and someone turns the I/O
> throttling knob without bothering to turn the write throttling knob
> --- what's going to happen? An OOM? That's not going to be very safe
> or friendly for the sysadmin who plans to be configuring the system.
>
> Maybe these high level design considerations are happening, and I just
> haven't seen it. I sure hope so.

My understanding is that it will happen; the patches are
undergoing several iterations, some of them being design iterations.
Maybe a larger RFC with requirements would help grab more attention.

--
Balbir

2009-04-24 15:12:00

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

* Theodore Tso <[email protected]> [2009-04-21 10:06:31]:

> On Tue, Apr 21, 2009 at 10:30:02AM +0200, Andrea Righi wrote:
> >
> > We're trying to address this issue too, setting a max dirty pages limit
> > per cgroup and forcing a direct writeback when these limits are exceeded.
> >
> > In this case dirty ratio throttling should happen automatically, because
> > the process will be throttled by the IO controller when it tries to
> > write back the dirty pages and submit IO requests.
>
> The challenge here will be the accounting; consider that you may have
> a file that had some of its pages in its page cache dirtied by a
> process in cgroup A. Now another process in cgroup B dirties some
> more pages. This could happen either via a mmap'ed file or via the
> standard read/write system calls. How do you track which dirty pages
> should be charged against which cgroup?

We have ways to track the cgroup from either the "mm_struct" or the
"page". Given the context, vma or mm, we should be able to charge the
page to the cgroup. Undoing the charge might be a challenge, since we'll
need to figure out whom to uncharge from the write context. This needs
some investigation. We could even decay the charge or use similar
techniques; we don't know yet.
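
(To make the "track from the page" part concrete, a minimal sketch using
the existing page_cgroup infrastructure -- locking and reference counting
are deliberately omitted, so treat it as an illustration only:)

#include <linux/page_cgroup.h>
#include <linux/memcontrol.h>

/*
 * Illustration only: find the memory cgroup a page was charged to.
 * The mm_struct path would instead start from mm->owner and resolve
 * that task's cgroup.
 */
static struct mem_cgroup *page_charge_owner(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	if (!pc || !PageCgroupUsed(pc))
		return NULL;
	return pc->mem_cgroup;
}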

--
Balbir

2009-04-27 10:45:45

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

Hi Andrea,

From: Andrea Righi <[email protected]>
Subject: Re: [PATCH 1/9] io-throttle documentation
Date: Mon, 20 Apr 2009 17:00:53 +0200

> On Mon, Apr 20, 2009 at 06:38:15PM +0900, Ryo Tsuruta wrote:
> > Hi Andrea,
> >
> > > Implementing bio-cgroup functionality as pure infrastructure framework
> > > instead of a cgroup subsystem would remove all this oddity and
> > > complexity.
> > >
> > > For example, the actual functionality that I need for the io-throttle
> > > controller is just an interface to set and get the cgroup owner of a
> > > page. I think it should be the same also for other potential users of
> > > bio-cgroup.
> > >
> > > So, what about implementing the bio-cgroup functionality as cgroup "page
> > > tracking" infrastructure and provide the following interfaces:
> > >
> > > /*
> > > * Encode the cgrp->css.id in page_cgroup->flags
> > > */
> > > void set_cgroup_page_owner(struct page *page, struct cgroup *cgrp);
> > >
> > > /*
> > > * Returns the cgroup owner of a page, decoding the cgroup id from
> > > * page_cgroup->flags.
> > > */
> > > struct cgroup *get_cgroup_page_owner(struct page *page);
> > >
> > > This also wouldn't increase the size of page_cgroup because we can
> > > encode the cgroup id in the unused bits of page_cgroup->flags, as
> > > originally suggested by Kame.
> > >
> > > And I think it could be used also by dm-ioband, even if it's not a
> > > cgroup-based subsystem... but I may be wrong. Ryo what's your opinion?

I've come up with an idea to let blkio-cgroup and io-throttle coexist.
blkio-cgroup provides a function to get a cgroup with the specified ID.

/* Should be called under rcu_read_lock() */
struct cgroup *blkio_cgroup_lookup(int id)
{
	struct cgroup *cgrp;
	struct cgroup_subsys_state *css;

	if (blkio_cgroup_disabled())
		return NULL;

	css = css_lookup(&blkio_cgroup_subsys, id);
	if (!css)
		return NULL;
	cgrp = css->cgroup;
	return cgrp;
}

Then io-throttle can get a struct iothrottle which belongs to the
cgroup by using the above function.

static struct iothrottle *iothrottle_lookup(int id)
{
	struct cgroup *grp;
	struct iothrottle *iot;

	...
	grp = blkio_cgroup_lookup(id);
	if (!grp)
		return NULL;
	iot = cgroup_to_iothrottle(grp);
	...
}
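
For example, a caller in the submit path could then resolve its
per-cgroup state roughly like this (get_blkio_cgroup_id() stands for
whatever helper blkio-cgroup exposes to read the owner ID of a bio; the
name here is just a placeholder):

static void iothrottle_account_bio(struct bio *bio, size_t bytes)
{
	struct iothrottle *iot;
	int id;

	id = get_blkio_cgroup_id(bio);	/* placeholder helper */

	rcu_read_lock();
	iot = iothrottle_lookup(id);
	if (iot) {
		/* charge 'bytes' against the group's BW/iops limits */
		...
	}
	rcu_read_unlock();
}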

What do you think about this way?

Thanks,
Ryo Tsuruta

2009-04-27 12:15:44

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

Hi Kamezawa-san,

> I've come up with an idea to coexist blkio-cgroup and io-throttle.
> blkio-cgroup provides a function to get a cgroup with the specified ID.
>
> /* Should be called under rcu_read_lock() */
> struct cgroup *blkio_cgroup_lookup(int id)
> {
> 	struct cgroup *cgrp;
> 	struct cgroup_subsys_state *css;
>
> 	if (blkio_cgroup_disabled())
> 		return NULL;
>
> 	css = css_lookup(&blkio_cgroup_subsys, id);
> 	if (!css)
> 		return NULL;
> 	cgrp = css->cgroup;
> 	return cgrp;
> }
>
> Then io-throttle can get a struct iothrottle which belongs to the
> cgroup by using the above function.
>
> static struct iothrottle *iothrottle_lookup(int id)
> {
> 	struct cgroup *grp;
> 	struct iothrottle *iot;
>
> 	...
> 	grp = blkio_cgroup_lookup(id);
> 	if (!grp)
> 		return NULL;
> 	iot = cgroup_to_iothrottle(grp);
> 	...
> }
>
> What do you think about this way?

I have some questions.
- How about using the same numbering scheme as process IDs for css_id
instead of idr? It would prevent the same ID from being reused quickly.
- Why are css_ids assigned per css? If each cgroup had a unique ID that
the subsystems could refer to, I could make the above code simpler.

Thanks,
Ryo Tsuruta

2009-04-27 21:58:59

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH 1/9] io-throttle documentation

On Mon, Apr 27, 2009 at 07:45:33PM +0900, Ryo Tsuruta wrote:
> Hi Andrea,
>
> From: Andrea Righi <[email protected]>
> Subject: Re: [PATCH 1/9] io-throttle documentation
> Date: Mon, 20 Apr 2009 17:00:53 +0200
>
> > On Mon, Apr 20, 2009 at 06:38:15PM +0900, Ryo Tsuruta wrote:
> > > Hi Andrea,
> > >
> > > > Implementing bio-cgroup functionality as pure infrastructure framework
> > > > instead of a cgroup subsystem would remove all this oddity and
> > > > complexity.
> > > >
> > > > For example, the actual functionality that I need for the io-throttle
> > > > controller is just an interface to set and get the cgroup owner of a
> > > > page. I think it should be the same also for other potential users of
> > > > bio-cgroup.
> > > >
> > > > So, what about implementing the bio-cgroup functionality as cgroup "page
> > > > tracking" infrastructure and provide the following interfaces:
> > > >
> > > > /*
> > > > * Encode the cgrp->css.id in page_cgroup->flags
> > > > */
> > > > void set_cgroup_page_owner(struct page *page, struct cgroup *cgrp);
> > > >
> > > > /*
> > > > * Returns the cgroup owner of a page, decoding the cgroup id from
> > > > * page_cgroup->flags.
> > > > */
> > > > struct cgroup *get_cgroup_page_owner(struct page *page);
> > > >
> > > > This also wouldn't increase the size of page_cgroup because we can
> > > > encode the cgroup id in the unused bits of page_cgroup->flags, as
> > > > originally suggested by Kame.
> > > >
> > > > And I think it could be used also by dm-ioband, even if it's not a
> > > > cgroup-based subsystem... but I may be wrong. Ryo what's your opinion?
>
> I've come up with an idea to coexist blkio-cgroup and io-throttle.
> blkio-cgroup provides a function to get a cgroup with the specified ID.
>
> /* Should be called under rcu_read_lock() */
> struct cgroup *blkio_cgroup_lookup(int id)
> {
> 	struct cgroup *cgrp;
> 	struct cgroup_subsys_state *css;
>
> 	if (blkio_cgroup_disabled())
> 		return NULL;
>
> 	css = css_lookup(&blkio_cgroup_subsys, id);
> 	if (!css)
> 		return NULL;
> 	cgrp = css->cgroup;
> 	return cgrp;
> }
>
> Then io-throttle can get a struct iothrottle which belongs to the
> cgroup by using the above function.
>
> static struct iothrottle *iothrottle_lookup(int id)
> {
> 	struct cgroup *grp;
> 	struct iothrottle *iot;
>
> 	...
> 	grp = blkio_cgroup_lookup(id);
> 	if (!grp)
> 		return NULL;
> 	iot = cgroup_to_iothrottle(grp);
> 	...
> }
>
> What do you think about this way?

Hi Ryo,

this should be ok for io-throttle. But I'd still prefer to see
blkio-cgroup implemented as an infrastructure, instead of a cgroup. This
would avoid (at least for io-throttle) the need to mount io-throttle
together with blkio-cgroup or to provide complicated ways to associate
io-throttle groups with blkio-cgroup groups.

Thanks,
-Andrea

2009-04-30 13:21:23

by Alan D. Brunelle

[permalink] [raw]
Subject: Re: [PATCH 0/9] cgroup: io-throttle controller (v13)

Hi Andrea -

FYI: I ran a simple test using this code to try and gauge the overhead
incurred by enabling this technology. Using a single 400GB volume split
into two 200GB partitions, I ran two processes in parallel, each performing
a mkfs (ext2) on its own partition: first w/out cgroup io-throttle, and then
with it enabled (with each task's throttle set to 400MB/second - much,
much more than the device is actually capable of). The idea here is to
see the base overhead of just having the io-throttle code in the paths.

Doing 30 runs of each (w/out & w/ io-throttle enabled) shows very little
difference (times in seconds):

w/out: min=80.196 avg=80.585 max=81.030 sdev=0.215 spread=0.834
with:  min=80.402 avg=80.836 max=81.623 sdev=0.327 spread=1.221

So only around 0.3% overhead - and even that may not be conclusive given
the standard deviations seen.
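
For reference, the figure follows directly from the averages above:

    (80.836 - 80.585) / 80.585 = 0.0031, i.e. about 0.3% extra elapsed time.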

--

FYI: The test was run on 2.6.30-rc1+your patches on a 16-way x86_64 box
(128GB RAM) plus a single FC volume off of a 1Gb FC RAID controller.

Regards,
Alan D. Brunelle
Hewlett-Packard