2009-07-21 14:09:28

by Ryo Tsuruta

[permalink] [raw]
Subject: [PATCH 1/9] I/O bandwidth controller and BIO tracking

Hi all,

These are new releases of dm-ioband and blkio-cgroup. The major
changes of these releases are:
- dm-ioband can be configured through the cgroup interface. The
bandwidth can be assigned on a per-cgroup, per-block-device basis.
- Event tracing is supported, which helps in debugging and
monitoring dm-ioband.
- A document for blkio-cgroup is available at
Documentation/cgroup/blkio.txt.

This series of patches consists of two parts:
dm-ioband v1.12.1
dm-ioband is an I/O bandwidth controller implemented as a
device-mapper driver that can control bandwidth on a per-partition,
per-user, per-process, or per-virtual-machine (such as KVM or Xen) basis.

blkio-cgroup v9
blkio-cgroup is a block I/O tracking mechanism implemented on the
cgroup memory subsystem. Using this feature, the owner of any type
of I/O can be determined. This allows dm-ioband to control block I/O
bandwidth even when it is accepting delayed write requests, since
dm-ioband can find the cgroup of each request. Others working on I/O
bandwidth throttling could also use this functionality to control
asynchronous I/O with a little enhancement.

The patches can be applied to both the current device-mapper development
tree and 2.6.31-rc3.

The list of the patches:
[PATCH 1/9] I/O bandwidth controller and BIO tracking
[PATCH 2/9] dm-ioband-1.12.1: All-in-one patch
[PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework
[PATCH 4/9] blkio-cgroup-v9: Refactoring io-context initialization
[PATCH 5/9] blkio-cgroup-v9: The body of blkio-cgroup
[PATCH 6/9] blkio-cgroup-v9: The document of blkio-cgroup
[PATCH 7/9] blkio-cgroup-v9: Page tracking hooks
[PATCH 8/9] blkio-cgroup-v9: Fast page tracking
[PATCH 9/9] blkio-cgroup-v9: Add a cgroup support to dm-ioband

Please visit our website; the patches and more information are available there.
Linux Block I/O Bandwidth Control Project
http://sourceforge.net/apps/trac/ioband/

I'd like to get some feedback from the list. Any comments are
appreciated.

Thanks,
Ryo Tsuruta


2009-07-21 14:11:36

by Ryo Tsuruta

[permalink] [raw]
Subject: [PATCH 2/9] dm-ioband-1.12.1: All-in-one patch

This is the body of dm-ioband. It is an all-in-one patch that replaces
dm-add-ioband.patch in the device-mapper development tree.

Signed-off-by: Ryo Tsuruta <[email protected]>
Signed-off-by: Hirokazu Takahashi <[email protected]>

---
Documentation/device-mapper/ioband.txt | 1113 ++++++++++++++++++++++++++
Documentation/device-mapper/range-bw.txt | 99 ++
drivers/md/Kconfig | 13
drivers/md/Makefile | 3
drivers/md/dm-ioband-ctl.c | 1313 +++++++++++++++++++++++++++++++
drivers/md/dm-ioband-policy.c | 459 ++++++++++
drivers/md/dm-ioband-rangebw.c | 673 +++++++++++++++
drivers/md/dm-ioband-type.c | 76 +
drivers/md/dm-ioband.h | 228 +++++
include/trace/events/dm-ioband.h | 242 +++++
10 files changed, 4219 insertions(+)

Index: linux-2.6.31-rc3/Documentation/device-mapper/ioband.txt
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/Documentation/device-mapper/ioband.txt
@@ -0,0 +1,1113 @@
+ Block I/O bandwidth control: dm-ioband
+
+ -------------------------------------------------------
+
+ Table of Contents
+
+ [1]What's dm-ioband all about?
+
+ [2]Differences from the CFQ I/O scheduler
+
+ [3]How dm-ioband works.
+
+ [4]Setup and Installation
+
+ [5]Getting started
+
+ [6]Command Reference
+
+ [7]Examples
+
+What's dm-ioband all about?
+
+ dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+ driver. Several jobs using the same block device have to share the
+ bandwidth of the device. dm-ioband gives bandwidth to each job according
+ to bandwidth control policies.
+
+ A job is a group of processes with the same pid or pgrp or uid or a
+ virtual machine such as KVM or Xen. A job can also be a cgroup by applying
+ the blkio-cgroup patch, which can be found at
+ http://sourceforge.net/apps/trac/ioband/.
+
+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
+ |cgroup | |cgroup | | the | | pid | | pid | | the | jobs
+ | A | | B | |others | | X | | Y | |others |
+ +---|---+ +---|---+ +---|---+ +---|---+ +---|---+ +---|---+
+ | | | | | |
+ +-----|---------|---------|----+----|---------|---------|-----+
+ | | /dev/mapper/disk1 | | | /dev/mapper/disk2 | |
+ |-----|---------|---------|----+----|---------|---------|-----|
+ | +---V---+ +---V---+ +---V---+ +---V---+ +---V---+ +---V---+ |
+ | | ioband| | ioband| |default| | ioband| | ioband| |default| |
+ | | group | | group | | group | | group | | group | | group | | dm-ioband
+ | |-------+-+-------+-+-------+-+-------+-+-------+-+-------| |
+ | | bandwidth control | |
+ | +-------------|-----------------------------|-------------+ |
+ ---------------|-----------------------------|---------------
+ | |
+ +---------------V--------------+--------------V---------------+
+ | /dev/sdb1 | /dev/sdb2 | partitions
+ +------------------------------+------------------------------+
+
+
+ --------------------------------------------------------------------------
+
+Differences from the CFQ I/O scheduler
+
+ dm-ioband allows flexible configuration of bandwidth settings.
+
+ dm-ioband can work with any type of I/O scheduler, such as the NOOP
+ scheduler, which is often chosen for high-end storage, because dm-ioband
+ is implemented outside the I/O scheduling layer. It allows both
+ partition-based bandwidth control and job-based --- a group of
+ processes --- control. In addition, a different configuration can be set
+ on each block device to control its bandwidth.
+
+ Meanwhile, the current implementation of the CFQ scheduler has 8 I/O
+ priority levels, and all jobs whose processes have the same I/O priority
+ share the bandwidth assigned to that level. Moreover, I/O priority is an
+ attribute of a process, so it affects all block devices equally.
+
+ --------------------------------------------------------------------------
+
+How dm-ioband works.
+
+ The bandwidth of each job is determined by a bandwidth control policy.
+ dm-ioband provides three kinds of policies, "weight", "weight-iosize" and
+ "range-bw", and a user can select one of them at setup time.
+
+ --------------------------------------------------------------------------
+
+ weight and weight-iosize policy
+
+ Every ioband device has one ioband group, which by default is called the
+ default group, and can also have extra ioband groups. Each ioband group
+ has its own weight and tokens. The number of tokens is determined in
+ proportion to the weight of each ioband group.
+
+ An ioband group can pass the I/O requests that its job issues on to the
+ underlying layer as long as it has tokens left, while requests are blocked
+ if there aren't any tokens left in the ioband group. The tokens are
+ refilled once all of the ioband groups that have requests on a given
+ underlying block device have used up their tokens.
+
+ The weight policy lets dm-ioband consume one token per I/O request.
+ The weight-iosize policy lets dm-ioband consume one token per I/O
+ sector; for example, a 4Kbyte read (512 bytes * 8 sectors) consumes
+ 8 tokens.
+
+ With this approach, a job running in an ioband group with a large weight
+ is guaranteed a wide I/O bandwidth.
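+
+ As a worked illustration (the numbers are chosen for this example and are
+ not defaults): with a token_base of 2048 and three groups with weights 100,
+ 80 and 20 sharing one device, each refill gives the groups roughly
+
+ 2048 * 100 / (100 + 80 + 20) = 1024 tokens
+ 2048 * 80 / (100 + 80 + 20) = 819 tokens
+ 2048 * 20 / (100 + 80 + 20) = 204 tokens
+
+ so under the weight policy the first group can pass about 1024 I/O
+ requests per refill cycle, and under weight-iosize about 1024 sectors.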
+
+ --------------------------------------------------------------------------
+
+ range-bw policy
+
+ range-bw means predictable I/O bandwidth with minimum and maximum
+ values defined by the administrator. It is also possible to set only the
+ maximum value, for pure I/O limiting. This lets you define a specific,
+ fixed bandwidth that satisfies an I/O requirement regardless of the total
+ I/O bandwidth.
+
+ The minimum I/O bandwidth guarantees the stable performance or
+ reliability of a specific process group, and the maximum bandwidth
+ throttles unnecessary I/O usage or reserves I/O bandwidth for other uses.
+ So range-bw provides adequate and predictable I/O bandwidth between the
+ minimum and maximum values.
+
+ The setting unit is Kbytes/sec. For example, if you want to allocate
+ 3M~5Mbytes/sec of I/O bandwidth to group X, set min-bw to 3000 and
+ max-bw to 5000.
+
+ Attention
+
+ Although range-bw provides predictable I/O bandwidth, it should be
+ configured within the total I/O bandwidth of the I/O system so that the
+ minimum I/O requirements can be guaranteed. For example, if the total I/O
+ bandwidth is 40Mbytes/sec, the sum of the I/O bandwidths configured for
+ the process groups should be equal to or smaller than 40Mbytes/sec. So,
+ check the total I/O bandwidth before setting it up.
+
+ --------------------------------------------------------------------------
+
+Setup and Installation
+
+ Build a kernel with these options enabled:
+
+ CONFIG_MD
+ CONFIG_BLK_DEV_DM
+ CONFIG_DM_IOBAND
+
+
+ If compiled as a module, use modprobe to load dm-ioband.
+
+ # make modules
+ # make modules_install
+ # depmod -a
+ # modprobe dm-ioband
+
+
+ "dmsetup targets" command shows all available device-mapper targets.
+ "ioband" and the version number are displayed when dm-ioband has been
+ loaded.
+
+ # dmsetup targets | grep ioband
+ ioband v1.0.0
+
+
+ --------------------------------------------------------------------------
+
+Getting started
+
+ The following is a brief description of how to control the I/O bandwidth
+ of disks. In this description, we'll take one disk with two partitions as
+ an example target.
+
+ --------------------------------------------------------------------------
+
+ Create and map ioband devices
+
+ Create two ioband devices "ioband1" and "ioband2". "ioband1" is mapped
+ to "/dev/sda1" and has a weight of 40. "ioband2" is mapped to "/dev/sda2"
+ and has a weight of 10. "ioband1" can use 80% --- 40/(40+10)*100 --- of
+ the bandwidth of "/dev/sda" while "ioband2" can use 20%.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
+ "weight 0 :40" | dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \
+ "weight 0 :10" | dmsetup create ioband2
+
+
+ If the commands are successful then the device files
+ "/dev/mapper/ioband1" and "/dev/mapper/ioband2" will have been created.
+
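+ One way to confirm the mappings (an optional check, not part of the
+ original steps) is to list the devices whose target type is "ioband":
+
+ # dmsetup ls --target ioband
+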
+ --------------------------------------------------------------------------
+
+ Additional bandwidth control
+
+ In this example two extra ioband groups are created on "ioband1."
+
+ First, set the ioband group type to "user." Next, create two ioband
+ groups that have IDs 1000 and 2000. Then, give weights of 30 and 20 to
+ the ioband groups respectively.
+
+ # dmsetup message ioband1 0 type user
+ # dmsetup message ioband1 0 attach 1000
+ # dmsetup message ioband1 0 attach 2000
+ # dmsetup message ioband1 0 weight 1000:30
+ # dmsetup message ioband1 0 weight 2000:20
+
+
+ Now the processes owned by uid 1000 can use 30% --- 30/(30+20+40+10)*100
+ --- of the bandwidth of "/dev/sda" when the processes issue I/O requests
+ through "ioband1." The processes owned by uid 2000 can use 20% of the
+ bandwidth likewise.
+
+ Table 1. Weight assignments
+
+ +----------------------------------------------------------------+
+ | ioband device | ioband group | ioband weight |
+ |---------------+--------------------------------+---------------|
+ | ioband1 | user id 1000 | 30 |
+ |---------------+--------------------------------+---------------|
+ | ioband1 | user id 2000 | 20 |
+ |---------------+--------------------------------+---------------|
+ | ioband1 | default group(the other users) | 40 |
+ |---------------+--------------------------------+---------------|
+ | ioband2 | default group | 10 |
+ +----------------------------------------------------------------+
+
+ --------------------------------------------------------------------------
+
+ Remove the ioband devices
+
+ Remove the ioband devices when no longer used.
+
+ # dmsetup remove ioband1
+ # dmsetup remove ioband2
+
+
+ --------------------------------------------------------------------------
+
+Command Reference
+
+ Create an ioband device
+
+ SYNOPSIS
+
+ dmsetup create IOBAND_DEVICE
+
+ DESCRIPTION
+
+ Create an ioband device with the given name IOBAND_DEVICE.
+ Generally, dmsetup reads a table from standard input. Each line of
+ the table specifies a single target and is of the form:
+
+ start_sector num_sectors "ioband" device_file ioband_device_id \
+ io_throttle io_limit ioband_group_type policy policy_args...
+
+
+ start_sector, num_sectors
+
+ The sector range of the underlying device where
+ dm-ioband maps.
+
+ ioband
+
+ Specify the string "ioband" as a target type.
+
+ device_file
+
+ Underlying device name.
+
+ ioband_device_id
+
+ The ID for an ioband device can be symbolic,
+ numeric, or mixed. The same ID must be set on the
+ ioband devices that share the same bandwidth. This is
+ useful for grouping devices carved out of a single
+ drive, such as a RAID drive or an LVM striped logical
+ volume.
+
+ io_throttle
+
+ When a device has a lot of tokens, and the number
+ of in-flight I/Os in dm-ioband exceeds io_throttle,
+ dm-ioband gives priority to the device and issues
+ I/Os to the device until no tokens of the device are
+ left. If 0 is specified, the default value is used.
+ This setting applies to all ioband devices that have
+ the same ioband device ID as specified by
+ "ioband_device_id."
+
+ io_limit
+
+ Dm-ioband blocks all I/O requests for IOBAND_DEVICE
+ when the number of BIOs in progress exceeds this
+ value. If 0 is specified, the default value is used.
+ This setting applies to all ioband devices that have
+ the same ioband device ID as specified by
+ "ioband_device_id."
+
+ ioband_group_type
+
+ Specify how to evaluate the ioband group ID. The
+ selectable group types are "none", "user", "gid",
+ "pid" or "pgrp." The type "cgroup" is enabled by
+ applying the blkio-cgroup patch. Specify "none" if
+ you don't need any ioband groups other than the
+ default ioband group.
+
+ policy and policy_args
+
+ Specify a bandwidth control policy. The selectable
+ policies are "weight", "weight-iosize" and "range-bw."
+ This setting applies to all ioband devices that have
+ the same ioband device ID as specified by
+ "ioband_device_id."
+
+ policy_args are specific for each policy. See below
+ for information on each policy.
+
+ WEIGHT AND WEIGHT-IOSIZE POLICIES
+
+ The "weight" and "weight-iosize" policies distribute bandwidth
+ proportional to the weight of each ioband group. Each ioband group
+ is charged on an I/O count basis when the "weight" policy is used
+ and an I/O size basis when the "weight-iosize" policy is used. The
+ arguments are of the form:
+
+ token_base :weight [ioband_group_id:weight...]
+
+
+ token_base
+
+ The number of tokens specified by token_base will be
+ distributed to all ioband groups in proportion to the
+ weight of each ioband group. If 0 is specified, the
+ default value is used. This setting applies to all
+ ioband devices that have the same ioband device ID as
+ specified by "ioband_device_id."
+
+ :weight
+
+ Set the weight of the default ioband group.
+
+ ioband_group_id:weight
+
+ Create an extra ioband group with an
+ ioband_group_id and set its weight. The
+ ioband_group_id is an identification number and
+ corresponds to a pid, pgrp, uid and so on, depending
+ on the ioband group type setting.
+
+ RANGE-BW POLICY
+
+ The "range-bw" policy distributes the predicable bandwidth to
+ each group according to the values of minimum and maximum
+ bandwidth value. And range-bw is not based on I/O token which is
+ usually grant for I/O authority.
+
+ So, "0" value is used for token_base parameter in range-bw
+ policy. And both parameters, min-bw and max-bw, are generally used
+ together, but, max-bw can be used alone for only limitation. The
+ arguments are of the form:
+
+ token_base :min-bw:max-bw [ioband_group_id:min-bw:max-bw...]
+
+
+ token_base
+
+ "0" is used, because it is not meaningful in this
+ policy
+
+ min-bw
+
+ Set the minimum bandwidth of the default ioband
+ group. This parameter can't be used alone.
+
+ max-bw
+
+ Set the maximum bandwidth of the default ioband
+ group.
+
+ ioband_group_id:min-bw:max-bw
+
+ Create an extra ioband group with an
+ ioband_group_id and set its minimum and maximum
+ bandwidth. The ioband_group_id is an identification
+ number and corresponds to a pid, pgrp, uid and so on,
+ depending on the ioband group type setting.
+
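+ For reference, a "dmsetup create" line using the range-bw policy
+ follows the same table format as above. This is only a sketch (the
+ device name and bandwidth values are arbitrary); it gives the
+ default group 1000~2000 Kbytes/sec:
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
+ "range-bw 0 :1000:2000" | dmsetup create ioband1
+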
+ EXAMPLE
+
+ Create an ioband device with the following parameters:
+
+ * Starting sector = "0"
+
+ * The number of sectors = "$(blockdev --getsize /dev/sda1)"
+
+ * Target type = "ioband"
+
+ * Underlying device name = "/dev/sda1"
+
+ * Ioband device ID = "share1"
+
+ * I/O throttle = "10"
+
+ * I/O limit = "400"
+
+ * Ioband group type = "user"
+
+ * Bandwidth control policy = "weight"
+
+ * Token base = "2048"
+
+ * Weight for the default ioband group = "100"
+
+ * Weight for the ioband group 1000 = "80"
+
+ * Weight for the ioband group 2000 = "20"
+
+ * Ioband device name = "ioband1"
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+ "share1 10 400 user weight 2048 :100 1000:80 2000:20" \
+ | dmsetup create ioband1
+
+
+ Create two device groups (ID=1,2). The bandwidths of these
+ device groups will be individually controlled.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1" \
+ "0 0 none weight 0 :80" | dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1" \
+ "0 0 none weight 0 :20" | dmsetup create ioband2
+ # echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 2" \
+ "0 0 none weight 0 :60" | dmsetup create ioband3
+ # echo "0 $(blockdev --getsize /dev/sdb4) ioband /dev/sdb4 2" \
+ "0 0 none weight 0 :40" | dmsetup create ioband4
+
+
+ --------------------------------------------------------------------------
+
+ Remove the ioband device
+
+ SYNOPSIS
+
+ dmsetup remove IOBAND_DEVICE
+
+ DESCRIPTION
+
+ Remove the specified ioband device IOBAND_DEVICE. All the band
+ groups attached to the ioband device are also removed
+ automatically.
+
+ EXAMPLE
+
+ Remove ioband device "ioband1."
+
+ # dmsetup remove ioband1
+
+
+ --------------------------------------------------------------------------
+
+ Set an ioband group type
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 type TYPE
+
+ DESCRIPTION
+
+ Set an ioband group type of IOBAND_DEVICE. TYPE must be one of
+ "none", "user", "gid", "pid" or "pgrp." The type "cgroup" is
+ enabled by applying the blkio-cgroup patch. Once the type is set,
+ new ioband groups can be created on IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Set the ioband group type of ioband device "ioband1" to "user."
+
+ # dmsetup message ioband1 0 type user
+
+
+ --------------------------------------------------------------------------
+
+ Create an ioband group
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 attach ID
+
+ DESCRIPTION
+
+ Create an ioband group and attach it to IOBAND_DEVICE. ID
+ specifies a user-id, group-id, process-id or process-group-id
+ depending on the ioband group type of IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Create an ioband group which consists of all processes with
+ user-id 1000 and attach it to ioband device "ioband1."
+
+ # dmsetup message ioband1 0 type user
+ # dmsetup message ioband1 0 attach 1000
+
+
+ --------------------------------------------------------------------------
+
+ Detach the ioband group
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 detach ID
+
+ DESCRIPTION
+
+ Detach the ioband group specified by ID from ioband device
+ IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Detach the ioband group with ID "2000" from ioband device
+ "ioband2."
+
+ # dmsetup message ioband2 0 detach 2000
+
+
+ --------------------------------------------------------------------------
+
+ Set bandwidth control policy
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 policy POLICY
+
+ DESCRIPTION
+
+ Set the bandwidth control policy to POLICY. The selectable
+ policies are "weight", "weight-iosize" and "range-bw." This
+ setting applies to all ioband devices that have the same
+ ioband device ID as IOBAND_DEVICE.
+
+ weight
+
+ This policy distributes bandwidth proportional to
+ the weight of each ioband group. Each ioband group is
+ charged on an I/O count basis.
+
+ weight-iosize
+
+ This policy distributes bandwidth proportional to
+ the weight of each ioband group. Each ioband group is
+ charged on an I/O size basis.
+
+ range-bw
+
+ This policy guarantees minimum bandwidth and limits
+ maximum bandwidth for each ioband group.
+
+ EXAMPLE
+
+ Set bandwidth control policy of ioband devices which have the
+ same ioband device ID as "ioband1" to "weight-iosize."
+
+ # dmsetup message ioband1 0 policy weight-iosize
+
+
+ --------------------------------------------------------------------------
+
+ Set the weight of an ioband group
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 weight VAL
+
+ dmsetup message IOBAND_DEVICE 0 weight ID:VAL
+
+ DESCRIPTION
+
+ Set the weight of the ioband group which belongs to
+ IOBAND_DEVICE. The group is determined by ID. If ID: is omitted,
+ the default ioband group is chosen.
+
+ The following example means that "ioband1" can use 80% ---
+ 40/(40+10)*100 --- of the bandwidth of the underlying block device
+ while "ioband2" can use 20%.
+
+ # dmsetup message ioband1 0 weight 40
+ # dmsetup message ioband2 0 weight 10
+
+
+ The following lines have the same effect as the above:
+
+ # dmsetup message ioband1 0 weight 4
+ # dmsetup message ioband2 0 weight 1
+
+
+ VAL must be an integer larger than 0. The default value, which
+ is assigned to newly created ioband groups, is 100.
+
+ EXAMPLE
+
+ Set the weight of the default ioband group of "ioband1" to 40.
+
+ # dmsetup message ioband1 0 weight 40
+
+
+ Set the weight of the ioband group of "ioband1" with ID "1000"
+ to 10.
+
+ # dmsetup message ioband1 0 weight 1000:10
+
+
+ --------------------------------------------------------------------------
+
+ Set the range-bw of an ioband group
+
+ SYNOPSIS
+
+ dmsetup -- message IOBAND_DEVICE 0 range-bw -1:MIN:MAX
+
+ dmsetup message IOBAND_DEVICE 0 range-bw ID:MIN-BW:MAX-BW
+
+ DESCRIPTION
+
+ Set the range-bw of the ioband group which belongs to
+ IOBAND_DEVICE. The group is determined by ID. If -1 is specified
+ as ID, the default ioband group is chosen.
+
+ The following example means that "ioband1" can use
+ 5M~6Mbytes/sec bandwidth of the underlying block device while
+ "ioband2" can use 900K~1Mbytes/sec bandwidth.
+
+ # dmsetup message -- ioband1 0 range-bw -1:5000:6000
+
+ # dmsetup message -- ioband2 0 range-bw -1:900:1000
+
+
+ MIN-BW and MAX-BW must be integers larger than 0; the
+ unit is Kbytes/sec.
+
+ EXAMPLE
+
+ Set the range-bw of the default ioband group of "ioband1" to
+ 200K~300Kbytes/sec of I/O bandwidth.
+
+ # dmsetup -- message ioband1 0 range-bw -1:200:300
+
+
+ Set the range-bw of the ioband group of "ioband1" with ID
+ "1000" to 10M~12Mbytes/sec of I/O bandwidth.
+
+ # dmsetup message ioband1 0 range-bw 1000:10000:12000
+
+
+ --------------------------------------------------------------------------
+
+ Set the number of tokens
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 token VAL
+
+ DESCRIPTION
+
+ Set the number of tokens to VAL. The tokens will be
+ distributed to all ioband groups in proportion to the weight
+ of each ioband group. If 0 is specified, the default value is
+ used. This setting applies to all ioband devices that have
+ the same ioband device ID as IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Set the number of tokens to 256.
+
+ # dmsetup message ioband1 0 token 256
+
+
+ --------------------------------------------------------------------------
+
+ Set a limit of how many tokens are carried over
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 carryover VAL
+
+ DESCRIPTION
+
+ When dm-ioband refills an ioband group with tokens after
+ another ioband group has already been refilled several times,
+ it determines the number of tokens to grant by multiplying
+ the number of tokens refilled at once by the smaller of how
+ many times the other group has already been refilled and this
+ limit. If 0 is specified, the default value is used. This
+ setting applies to all ioband devices that have the same
+ ioband device ID as IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Set a limit for "ioband1" to 2.
+
+ # dmsetup message ioband1 0 carryover 2
+
+
+ --------------------------------------------------------------------------
+
+ Set I/O throttling
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 io_throttle VAL
+
+ DESCRIPTION
+
+ When a device has a lot of tokens, and the number of in-flight
+ I/Os in dm-ioband exceeds io_throttle, dm-ioband gives priority to
+ the device and issues I/Os to the device until no tokens of the
+ device are left. If 0 is specified, the default value is used.
+ This setting applies to all ioband devices that have the same
+ ioband device ID as IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Set the I/O throttling value of "ioband1" to 16.
+
+ # dmsetup message ioband1 0 io_throttle 16
+
+
+ --------------------------------------------------------------------------
+
+ Set I/O limiting
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 io_limit VAL
+
+ DESCRIPTION
+
+ Dm-ioband blocks all I/O requests for IOBAND_DEVICE when the
+ number of BIOs in progress exceeds this value. If 0 is specified,
+ the default value is used. This setting applies to all ioband
+ devices that have the same ioband device ID as IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Set the I/O limiting value of "ioband1" to 128.
+
+ # dmsetup message ioband1 0 io_limit 128
+
+
+ --------------------------------------------------------------------------
+
+ Display settings
+
+ SYNOPSIS
+
+ dmsetup table --target ioband
+
+ DESCRIPTION
+
+ Display the current table for the ioband device. See the
+ "dmsetup create" command for information on the table format.
+
+ EXAMPLE
+
+ The following output shows the current table of "ioband1."
+
+ # dmsetup table --target ioband
+ ioband: 0 32129937 ioband1 8:29 128 10 400 user weight \
+ 2048 :100 1000:80 2000:20
+
+
+ --------------------------------------------------------------------------
+
+ Display Statistics
+
+ SYNOPSIS
+
+ dmsetup status --target ioband
+
+ DESCRIPTION
+
+ Display the statistics of all the ioband devices whose target
+ type is "ioband."
+
+ The output format is shown below. The first five columns show:
+
+ * ioband device name
+
+ * logical start sector of the device (must be 0)
+
+ * device size in sectors
+
+ * target type (must be "ioband")
+
+ * device group ID
+
+ The remaining columns show the statistics of each ioband group
+ on the ioband device. Each group uses seven columns for its
+ statistics.
+
+ * ioband group ID (-1 means default)
+
+ * total read requests
+
+ * delayed read requests
+
+ * total read sectors
+
+ * total write requests
+
+ * delayed write requests
+
+ * total write sectors
+
+ EXAMPLE
+
+ The following output shows the statistics of two ioband devices.
+ Ioband2 only has the default ioband group and ioband1 has three
+ (default, 1001, 1002) ioband groups.
+
+ # dmsetup status
+ ioband2: 0 44371467 ioband 128 -1 143 90 424 122 78 352
+ ioband1: 0 44371467 ioband 128 -1 223 172 408 211 136 600 1001 \
+ 166 107 472 139 95 352 1002 211 146 520 210 147 504
+
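+ The per-group columns can be separated with a small awk snippet (a
+ convenience sketch, not part of the original documentation; it relies
+ only on the five fixed leading columns described above). Each output
+ line is: device, group ID, reads, delayed reads, read sectors, writes,
+ delayed writes, write sectors.
+
+ # dmsetup status --target ioband | awk '
+ { for (i = 6; i + 6 <= NF; i += 7)
+ print $1, $i, $(i+1), $(i+2), $(i+3), $(i+4), $(i+5), $(i+6) }'
+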
+
+ --------------------------------------------------------------------------
+
+ Reset status counter
+
+ SYNOPSIS
+
+ dmsetup message IOBAND_DEVICE 0 reset
+
+ DESCRIPTION
+
+ Reset the statistics of ioband device IOBAND_DEVICE.
+
+ EXAMPLE
+
+ Reset the statistics of "ioband1."
+
+ # dmsetup message ioband1 0 reset
+
+
+ --------------------------------------------------------------------------
+
+Examples
+
+ Example #1: Bandwidth control on Partitions
+
+ This example describes how to control the bandwidth with disk
+ partitions. The following diagram illustrates the configuration of this
+ example. You may want to run a database on /dev/mapper/ioband1 and web
+ applications on /dev/mapper/ioband2.
+
+ /mnt1 /mnt2 mount points
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +--------------------------+ +--------------------------+
+ | default group | | default group | ioband groups
+ | (80) | | (40) | (weight)
+ +-------------|------------+ +-------------|------------+
+ | |
+ +-------------V-------------+--------------V------------+
+ | /dev/sda1 | /dev/sda2 | partitions
+ +---------------------------+---------------------------+
+
+
+ To setup the above configuration, follow these steps:
+
+ 1. Create ioband devices with the same device group ID and assign
+ weights of 80 and 40 to the default ioband groups respectively.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0" \
+ "none weight 0 :80" | dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0" \
+ "none weight 0 :40" | dmsetup create ioband2
+
+
+ 2. Create filesystems on the ioband devices and mount them.
+
+ # mkfs.ext3 /dev/mapper/ioband1
+ # mount /dev/mapper/ioband1 /mnt1
+
+ # mkfs.ext3 /dev/mapper/ioband2
+ # mount /dev/mapper/ioband2 /mnt2
+
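+ To observe the effect (an optional check, not part of the original
+ steps; it assumes dd and iostat are installed), start concurrent
+ writers on both mount points and watch the per-partition throughput.
+ With weights of 80 and 40, /dev/sda1 should get roughly twice the
+ write throughput of /dev/sda2 while both writers are running.
+
+ # dd if=/dev/zero of=/mnt1/dummy bs=1M count=1024 oflag=direct &
+ # dd if=/dev/zero of=/mnt2/dummy bs=1M count=1024 oflag=direct &
+ # iostat -k 5 /dev/sda1 /dev/sda2
+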
+
+ --------------------------------------------------------------------------
+
+ Example #2: Bandwidth control on Logical Volumes
+
+ This example is similar to example #1 but uses LVM logical volumes
+ instead of disk partitions. It shows how to configure ioband devices
+ on two striped logical volumes.
+
+ /mnt1 /mnt2 mount points
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +--------------------------+ +--------------------------+
+ | default group | | default group | ioband groups
+ | (80) | | (40) | (weight)
+ +-------------|------------+ +-------------|------------+
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/lv0 | | /dev/mapper/lv1 | striped logical
+ | | | | volumes
+ +-------------------------------------------------------+
+ | vg0 | volume group
+ +-------------|----------------------------|------------+
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/sdb | | /dev/sdc | physical disks
+ +--------------------------+ +--------------------------+
+
+
+ To setup the above configuration, follow these steps:
+
+ 1. Initialize the disks for use by LVM.
+
+ # pvcreate /dev/sdb
+ # pvcreate /dev/sdc
+
+
+ 2. Create a new volume group named "vg0" with /dev/sdb and /dev/sdc.
+
+ # vgcreate vg0 /dev/sdb /dev/sdc
+
+
+ 3. Create two logical volumes in "vg0." The volumes have to be striped.
+
+ # lvcreate -n lv0 -i 2 -I 64 vg0 -L 1024M
+ # lvcreate -n lv1 -i 2 -I 64 vg0 -L 1024M
+
+
+ The rest is the same as in example #1.
+
+ 4. Create ioband devices corresponding to each logical volume and
+ assign weights of 80 and 40 to the default ioband groups respectively.
+
+ # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv0)" \
+ "ioband /dev/mapper/vg0-lv0 1 0 0 none weight 0 :80" | \
+ dmsetup create ioband1
+ # echo "0 $(blockdev --getsize /dev/mapper/vg0-lv1)" \
+ "ioband /dev/mapper/vg0-lv1 1 0 0 none weight 0 :40" | \
+ dmsetup create ioband2
+
+
+ 5. Create filesystems on the ioband devices and mount them.
+
+ # mkfs.ext3 /dev/mapper/ioband1
+ # mount /dev/mapper/ioband1 /mnt1
+
+ # mkfs.ext3 /dev/mapper/ioband2
+ # mount /dev/mapper/ioband2 /mnt2
+
+
+ --------------------------------------------------------------------------
+
+ Example #3: Bandwidth control on processes
+
+ This example describes how to control the bandwidth with groups of
+ processes. You may also want to run an additional application on the
+ same machine as described in example #1. This example shows how to add
+ a new ioband group for this application.
+
+ /mnt1 /mnt2 mount points
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +-------------+------------+ +-------------+------------+
+ | default | | user=1000 | default | ioband groups
+ | (80) | | (20) | (40) | (weight)
+ +-------------+------------+ +-------------+------------+
+ | |
+ +-------------V-------------+--------------V------------+
+ | /dev/sda1 | /dev/sda2 | partitions
+ +---------------------------+---------------------------+
+
+
+ The following shows how to set up a new ioband group on the machine
+ that is already configured as in example #1. The application will have
+ a weight of 20 and run with user-id 1000 on /dev/mapper/ioband2.
+
+ 1. Set the type of ioband2 to "user."
+
+ # dmsetup message ioband2 0 type user
+
+
+ 2. Create a new ioband group on ioband2.
+
+ # dmsetup message ioband2 0 attach 1000
+
+
+ 3. Assign a weight of 20 to this newly created ioband group.
+
+ # dmsetup message ioband2 0 weight 1000:20
+
+
+ --------------------------------------------------------------------------
+
+ Example #4: Bandwidth control for Xen virtual block devices
+
+ This example describes how to control the bandwidth for Xen virtual
+ block devices. The following diagram illustrates the configuration of this
+ example.
+
+ Virtual Machine 1 Virtual Machine 2 virtual machines
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/xvda1 | | /dev/xvda1 | virtual block
+ +-------------|------------+ +-------------|------------+ devices
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/mapper/ioband1 | | /dev/mapper/ioband2 | ioband devices
+ +--------------------------+ +--------------------------+
+ | default group | | default group | ioband groups
+ | (80) | | (40) | (weight)
+ +-------------|------------+ +-------------|------------+
+ | |
+ +-------------V-------------+--------------V------------+
+ | /dev/sda1 | /dev/sda2 | partitions
+ +---------------------------+---------------------------+
+
+
+ The following shows how to map ioband devices "ioband1" and "ioband2"
+ to virtual block device "/dev/xvda1 on Virtual Machine 1" and
+ "/dev/xvda1 on Virtual Machine 2" respectively on the machine configured
+ as in example #1. Add the following lines to the configuration files
+ that are referenced when creating "Virtual Machine 1" and "Virtual
+ Machine 2."
+
+ For "Virtual Machine 1"
+ disk = [ 'phy:/dev/mapper/ioband1,xvda,w' ]
+
+ For "Virtual Machine 2"
+ disk = [ 'phy:/dev/mapper/ioband2,xvda,w' ]
+
+
+ --------------------------------------------------------------------------
+
+ Example #5: Bandwidth control for Xen blktap devices
+
+ This example describes how to control the bandwidth for Xen virtual
+ block devices when Xen blktap devices are used. The following diagram
+ illustrates the configuration of this example.
+
+ Virtual Machine 1 Virtual Machine 2 virtual machines
+ | |
+ +-------------V------------+ +-------------V------------+
+ | /dev/xvda1 | | /dev/xvda1 | virtual block
+ +-------------|------------+ +-------------|------------+ devices
+ | |
+ +----------V----------+ +-----------V---------+
+ | tapdisk | | tapdisk | tapdisk daemons
+ | (15011) | | (15276) | (daemon's pid)
+ +----------|----------+ +-----------|---------+
+ | |
+ +-------------|----------------------------|------------+
+ | | /dev/mapper/ioband1 | | ioband device
+ | | mount on /vmdisk | |
+ +-------------V-------------+--------------V------------+
+ | group for PID=15011 | group for PID=15276 | ioband groups
+ | (80) | (40) | (weight)
+ +-------------|----------------------------|------------+
+ | |
+ +-------------|----------------------------|------------+
+ | +----------V----------+ +-----------V---------+ |
+ | | vm1.img | | vm2.img | | disk image files
+ | +---------------------+ +---------------------+ |
+ | /dev/sda1 | partition
+ +-------------------------------------------------------+
+
+
+ To setup the above configuration, follow these steps:
+
+ 1. Create an ioband device.
+
+ # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1" \
+ "1 0 0 none weight 0 :100" | dmsetup create ioband1
+
+
+ 2. Add the following lines to the configuration files that are
+ referenced when creating "Virtual Machine 1" and "Virtual Machine 2."
+ Disk image files "/vmdisk/vm1.img" and "/vmdisk/vm2.img" will be used.
+
+ For "Virtual Machine 1"
+ disk = [ 'tap:aio:/vmdisk/vm1.img,xvda,w', ]
+
+ For "Virtual Machine 1"
+ disk = [ 'tap:aio:/vmdisk/vm2.img,xvda,w', ]
+
+
+ 3. Run the virtual machines.
+
+ # xm create vm1
+ # xm create vm2
+
+
+ 4. Find out the process IDs of the daemons which control the blktap
+ devices.
+
+ # lsof /vmdisk/vm[12].img
+ COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
+ tapdisk 15011 root 11u REG 253,0 2147483648 48961 /vmdisk/vm1.img
+ tapdisk 15276 root 13u REG 253,0 2147483648 48962 /vmdisk/vm2.img
+
+
+ 5. Create new ioband groups for pid 15011 and pid 15276, which are the
+ process IDs of the tapdisk daemons, and assign weights of 80 and 40 to
+ the groups respectively.
+
+ # dmsetup message ioband1 0 type pid
+ # dmsetup message ioband1 0 attach 15011
+ # dmsetup message ioband1 0 weight 15011:80
+ # dmsetup message ioband1 0 attach 15276
+ # dmsetup message ioband1 0 weight 15276:40
Index: linux-2.6.31-rc3/drivers/md/Kconfig
===================================================================
--- linux-2.6.31-rc3.orig/drivers/md/Kconfig
+++ linux-2.6.31-rc3/drivers/md/Kconfig
@@ -294,4 +294,17 @@ config DM_UEVENT
---help---
Generate udev events for DM events.

+config DM_IOBAND
+ tristate "I/O bandwidth control (EXPERIMENTAL)"
+ depends on BLK_DEV_DM && EXPERIMENTAL
+ ---help---
+ This device-mapper target allows you to define how the
+ available bandwidth of a storage device should be
+ shared between processes, cgroups, partitions or LUNs.
+
+ Information on how to use dm-ioband is available in:
+ <file:Documentation/device-mapper/ioband.txt>.
+
+ If unsure, say N.
+
endif # MD
Index: linux-2.6.31-rc3/drivers/md/Makefile
===================================================================
--- linux-2.6.31-rc3.orig/drivers/md/Makefile
+++ linux-2.6.31-rc3/drivers/md/Makefile
@@ -8,6 +8,8 @@ dm-multipath-y += dm-path-selector.o dm-
dm-snapshot-y += dm-snap.o dm-exception-store.o dm-snap-transient.o \
dm-snap-persistent.o
dm-mirror-y += dm-raid1.o
+dm-ioband-y += dm-ioband-ctl.o dm-ioband-policy.o dm-ioband-rangebw.o \
+ dm-ioband-type.o
dm-log-userspace-y \
+= dm-log-userspace-base.o dm-log-userspace-transfer.o
md-mod-y += md.o bitmap.o
@@ -37,6 +39,7 @@ obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
+obj-$(CONFIG_DM_IOBAND) += dm-ioband.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
Index: linux-2.6.31-rc3/drivers/md/dm-ioband-ctl.c
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/drivers/md/dm-ioband-ctl.c
@@ -0,0 +1,1313 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ * Authors: Hirokazu Takahashi <[email protected]>
+ * Ryo Tsuruta <[email protected]>
+ *
+ * I/O bandwidth control
+ *
+ * Some blktrace messages were added by Alan D. Brunelle <[email protected]>
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "md.h"
+#include "dm-ioband.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/dm-ioband.h>
+
+#define num_issued(dp) \
+ ((dp)->g_issued[BLK_RW_SYNC] + (dp)->g_issued[BLK_RW_ASYNC])
+
+static LIST_HEAD(ioband_device_list);
+/* held during configuration changes */
+static DEFINE_MUTEX(ioband_lock);
+
+static void suspend_ioband_device(struct ioband_device *, unsigned long, int);
+static void resume_ioband_device(struct ioband_device *);
+static void ioband_conduct(struct work_struct *);
+static void ioband_hold_bio(struct ioband_group *, struct bio *);
+static struct bio *ioband_pop_bio(struct ioband_group *);
+static int ioband_set_param(struct ioband_group *, const char *, const char *);
+static int ioband_group_attach(struct ioband_group *, int, const char *);
+static int ioband_group_type_select(struct ioband_group *, const char *);
+
+static void do_nothing(void) {}
+
+static int policy_init(struct ioband_device *dp, const char *name,
+ int argc, char **argv)
+{
+ const struct ioband_policy_type *p;
+ struct ioband_group *gp;
+ unsigned long flags;
+ int r;
+
+ for (p = dm_ioband_policy_type; p->p_name; p++) {
+ if (!strcmp(name, p->p_name))
+ break;
+ }
+ if (!p->p_name)
+ return -EINVAL;
+ /* do nothing if the same policy is already set */
+ if (dp->g_policy == p)
+ return 0;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ suspend_ioband_device(dp, flags, 1);
+ list_for_each_entry(gp, &dp->g_groups, c_list)
+ dp->g_group_dtr(gp);
+
+ /* switch to the new policy */
+ dp->g_policy = p;
+ r = p->p_policy_init(dp, argc, argv);
+ if (!r) {
+ if (!dp->g_hold_bio)
+ dp->g_hold_bio = ioband_hold_bio;
+ if (!dp->g_pop_bio)
+ dp->g_pop_bio = ioband_pop_bio;
+
+ list_for_each_entry(gp, &dp->g_groups, c_list)
+ dp->g_group_ctr(gp, NULL);
+ }
+ resume_ioband_device(dp);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+}
+
+static struct ioband_device *alloc_ioband_device(const char *name,
+ int io_throttle, int io_limit)
+{
+ struct ioband_device *dp, *new_dp;
+
+ new_dp = kzalloc(sizeof(struct ioband_device), GFP_KERNEL);
+ if (!new_dp)
+ return NULL;
+
+ /*
+ * Prepare its own workqueue as generic_make_request() may
+ * potentially block the workqueue when submitting BIOs.
+ */
+ new_dp->g_ioband_wq = create_workqueue("kioband");
+ if (!new_dp->g_ioband_wq) {
+ kfree(new_dp);
+ return NULL;
+ }
+
+ list_for_each_entry(dp, &ioband_device_list, g_list) {
+ if (!strcmp(dp->g_name, name)) {
+ dp->g_ref++;
+ destroy_workqueue(new_dp->g_ioband_wq);
+ kfree(new_dp);
+ return dp;
+ }
+ }
+
+ INIT_DELAYED_WORK(&new_dp->g_conductor, ioband_conduct);
+ INIT_LIST_HEAD(&new_dp->g_groups);
+ INIT_LIST_HEAD(&new_dp->g_list);
+ spin_lock_init(&new_dp->g_lock);
+ bio_list_init(&new_dp->g_urgent_bios);
+ new_dp->g_io_throttle = io_throttle;
+ new_dp->g_io_limit = io_limit;
+ new_dp->g_issued[BLK_RW_SYNC] = 0;
+ new_dp->g_issued[BLK_RW_ASYNC] = 0;
+ new_dp->g_blocked = 0;
+ new_dp->g_ref = 1;
+ new_dp->g_flags = 0;
+ strlcpy(new_dp->g_name, name, sizeof(new_dp->g_name));
+ new_dp->g_policy = NULL;
+ new_dp->g_hold_bio = NULL;
+ new_dp->g_pop_bio = NULL;
+ init_waitqueue_head(&new_dp->g_waitq);
+ init_waitqueue_head(&new_dp->g_waitq_suspend);
+ init_waitqueue_head(&new_dp->g_waitq_flush);
+ list_add_tail(&new_dp->g_list, &ioband_device_list);
+ return new_dp;
+}
+
+static void release_ioband_device(struct ioband_device *dp)
+{
+ dp->g_ref--;
+ if (dp->g_ref > 0)
+ return;
+ list_del(&dp->g_list);
+ destroy_workqueue(dp->g_ioband_wq);
+ kfree(dp);
+}
+
+static int is_ioband_device_flushed(struct ioband_device *dp,
+ int wait_completion)
+{
+ struct ioband_group *gp;
+
+ if (wait_completion && num_issued(dp) > 0)
+ return 0;
+ if (dp->g_blocked || waitqueue_active(&dp->g_waitq))
+ return 0;
+ list_for_each_entry(gp, &dp->g_groups, c_list)
+ if (waitqueue_active(&gp->c_waitq))
+ return 0;
+ return 1;
+}
+
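+/*
+ * Note: called with dp->g_lock held.  The lock is dropped while the
+ * workqueue is flushed and re-acquired before waiting until all blocked
+ * and in-flight BIOs have been released.
+ */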
+static void suspend_ioband_device(struct ioband_device *dp,
+ unsigned long flags, int wait_completion)
+{
+ struct ioband_group *gp;
+
+ /* block incoming bios */
+ set_device_suspended(dp);
+
+ /* wake up all blocked processes and go down all ioband groups */
+ wake_up_all(&dp->g_waitq);
+ list_for_each_entry(gp, &dp->g_groups, c_list) {
+ if (!is_group_down(gp)) {
+ set_group_down(gp);
+ set_group_need_up(gp);
+ }
+ wake_up_all(&gp->c_waitq);
+ }
+
+ /* flush the already mapped bios */
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ flush_workqueue(dp->g_ioband_wq);
+
+ /* wait for all processes to wake up and bios to release */
+ spin_lock_irqsave(&dp->g_lock, flags);
+ wait_event_lock_irq(dp->g_waitq_flush,
+ is_ioband_device_flushed(dp, wait_completion),
+ dp->g_lock, do_nothing());
+}
+
+static void resume_ioband_device(struct ioband_device *dp)
+{
+ struct ioband_group *gp;
+
+ /* go up ioband groups */
+ list_for_each_entry(gp, &dp->g_groups, c_list) {
+ if (group_need_up(gp)) {
+ clear_group_need_up(gp);
+ clear_group_down(gp);
+ }
+ }
+
+ /* accept incoming bios */
+ wake_up_all(&dp->g_waitq_suspend);
+ clear_device_suspended(dp);
+}
+
+static struct ioband_group *ioband_group_find(struct ioband_group *head, int id)
+{
+ struct rb_node *node = head->c_group_root.rb_node;
+
+ while (node) {
+ struct ioband_group *p =
+ rb_entry(node, struct ioband_group, c_group_node);
+
+ if (p->c_id == id || id == IOBAND_ID_ANY)
+ return p;
+ node = (id < p->c_id) ? node->rb_left : node->rb_right;
+ }
+ return NULL;
+}
+
+static void ioband_group_add_node(struct rb_root *root, struct ioband_group *gp)
+{
+ struct rb_node **node = &root->rb_node, *parent = NULL;
+ struct ioband_group *p;
+
+ while (*node) {
+ p = rb_entry(*node, struct ioband_group, c_group_node);
+ parent = *node;
+ node = (gp->c_id < p->c_id) ?
+ &(*node)->rb_left : &(*node)->rb_right;
+ }
+
+ rb_link_node(&gp->c_group_node, parent, node);
+ rb_insert_color(&gp->c_group_node, root);
+}
+
+static int ioband_group_init(struct ioband_group *gp,
+ struct ioband_group *head,
+ struct ioband_device *dp,
+ int id, const char *param)
+{
+ unsigned long flags;
+ int r;
+
+ INIT_LIST_HEAD(&gp->c_list);
+ bio_list_init(&gp->c_blocked_bios);
+ bio_list_init(&gp->c_prio_bios);
+ gp->c_id = id; /* should be verified */
+ gp->c_blocked = 0;
+ gp->c_prio_blocked = 0;
+ memset(gp->c_stat, 0, sizeof(gp->c_stat));
+ init_waitqueue_head(&gp->c_waitq);
+ gp->c_flags = 0;
+ gp->c_group_root = RB_ROOT;
+ gp->c_banddev = dp;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (head && ioband_group_find(head, id)) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ DMWARN("%s: id=%d already exists.", __func__, id);
+ return -EEXIST;
+ }
+
+ list_add_tail(&gp->c_list, &dp->g_groups);
+ r = dp->g_group_ctr(gp, param);
+ if (r) {
+ list_del(&gp->c_list);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+ }
+
+ if (head) {
+ ioband_group_add_node(&head->c_group_root, gp);
+ gp->c_dev = head->c_dev;
+ gp->c_target = head->c_target;
+ }
+
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return 0;
+}
+
+static void ioband_group_release(struct ioband_group *head,
+ struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ list_del(&gp->c_list);
+ if (head)
+ rb_erase(&gp->c_group_node, &head->c_group_root);
+ dp->g_group_dtr(gp);
+ kfree(gp);
+}
+
+static void ioband_group_destroy_all(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ while ((p = ioband_group_find(gp, IOBAND_ID_ANY)))
+ ioband_group_release(gp, p);
+ ioband_group_release(NULL, gp);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static void ioband_group_stop_all(struct ioband_group *head, int suspend)
+{
+ struct ioband_device *dp = head->c_banddev;
+ struct ioband_group *p;
+ struct rb_node *node;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ set_group_down(p);
+ if (suspend)
+ set_group_suspended(p);
+ }
+ set_group_down(head);
+ if (suspend)
+ set_group_suspended(head);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ flush_workqueue(dp->g_ioband_wq);
+}
+
+static void ioband_group_resume_all(struct ioband_group *head)
+{
+ struct ioband_device *dp = head->c_banddev;
+ struct ioband_group *p;
+ struct rb_node *node;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ for (node = rb_first(&head->c_group_root); node; node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ clear_group_down(p);
+ clear_group_suspended(p);
+ }
+ clear_group_down(head);
+ clear_group_suspended(head);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+}
+
+static int parse_group_param(const char *param, long *id, char const **value)
+{
+ char *s, *endp;
+ long n;
+
+ s = strpbrk(param, POLICY_PARAM_DELIM);
+
+ if (!s) {
+ *id = IOBAND_ID_ANY;
+ *value = param;
+ return 0;
+ }
+
+ n = simple_strtol(param, &endp, 0);
+ if (endp != s)
+ return -EINVAL;
+
+ *id = n;
+ *value = s + 1;
+ return 0;
+}
+
+/*
+ * Create a new band device:
+ * parameters: <device> <device-group-id> <io_throttle> <io_limit>
+ * <type> <policy> <policy-param...> <group-id:group-param...>
+ */
+static int ioband_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+ struct ioband_group *gp;
+ struct ioband_device *dp;
+ struct dm_dev *dev;
+ int io_throttle;
+ int io_limit;
+ int i, r, start;
+ long val, id;
+ const char *param;
+ char *s;
+
+ if (argc < POLICY_PARAM_START) {
+ ti->error = "Requires " __stringify(POLICY_PARAM_START)
+ " or more arguments";
+ return -EINVAL;
+ }
+
+ if (strlen(argv[1]) > IOBAND_NAME_MAX) {
+ ti->error = "Ioband device name is too long";
+ return -EINVAL;
+ }
+
+ r = strict_strtol(argv[2], 0, &val);
+ if (r || val < 0 || val > SHORT_MAX) {
+ ti->error = "Invalid io_throttle";
+ return -EINVAL;
+ }
+ io_throttle = (val == 0) ? DEFAULT_IO_THROTTLE : val;
+
+ r = strict_strtol(argv[3], 0, &val);
+ if (r || val < 0 || val > SHORT_MAX) {
+ ti->error = "Invalid io_limit";
+ return -EINVAL;
+ }
+ io_limit = val;
+
+ r = dm_get_device(ti, argv[0], 0, ti->len,
+ dm_table_get_mode(ti->table), &dev);
+ if (r) {
+ ti->error = "Device lookup failed";
+ return r;
+ }
+
+ if (io_limit == 0) {
+ struct request_queue *q;
+
+ q = bdev_get_queue(dev->bdev);
+ if (!q) {
+ ti->error = "Can't get queue size";
+ r = -ENXIO;
+ goto release_dm_device;
+ }
+ /*
+ * The block layer accepts I/O requests up to 50% over
+ * nr_requests when the requests are issued from a
+ * "batcher" process.
+ */
+ io_limit = (3 * q->nr_requests / 2);
+ }
+
+ if (io_limit < io_throttle)
+ io_limit = io_throttle;
+
+ mutex_lock(&ioband_lock);
+ dp = alloc_ioband_device(argv[1], io_throttle, io_limit);
+ if (!dp) {
+ ti->error = "Cannot create ioband device";
+ r = -EINVAL;
+ mutex_unlock(&ioband_lock);
+ goto release_dm_device;
+ }
+
+ r = policy_init(dp, argv[POLICY_PARAM_START - 1],
+ argc - POLICY_PARAM_START, &argv[POLICY_PARAM_START]);
+ if (r) {
+ ti->error = "Invalid policy parameter";
+ goto release_ioband_device;
+ }
+
+ gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+ if (!gp) {
+ ti->error = "Cannot allocate memory for ioband group";
+ r = -ENOMEM;
+ goto release_ioband_device;
+ }
+
+ ti->num_flush_requests = 1;
+ ti->private = gp;
+ gp->c_target = ti;
+ gp->c_dev = dev;
+
+ /* Find a default group parameter */
+ for (start = POLICY_PARAM_START; start < argc; start++) {
+ s = strpbrk(argv[start], POLICY_PARAM_DELIM);
+ if (s == argv[start])
+ break;
+ }
+ param = (start < argc) ? &argv[start][1] : NULL;
+
+ /* Create a default ioband group */
+ r = ioband_group_init(gp, NULL, dp, IOBAND_ID_ANY, param);
+ if (r) {
+ kfree(gp);
+ ti->error = "Cannot create default ioband group";
+ goto release_ioband_device;
+ }
+
+ r = ioband_group_type_select(gp, argv[4]);
+ if (r) {
+ ti->error = "Cannot set ioband group type";
+ goto release_ioband_group;
+ }
+
+ /* Create sub ioband groups */
+ for (i = start + 1; i < argc; i++) {
+ r = parse_group_param(argv[i], &id, &param);
+ if (r) {
+ ti->error = "Invalid ioband group parameter";
+ goto release_ioband_group;
+ }
+ r = ioband_group_attach(gp, id, param);
+ if (r) {
+ ti->error = "Cannot create ioband group";
+ goto release_ioband_group;
+ }
+ }
+ mutex_unlock(&ioband_lock);
+ return 0;
+
+release_ioband_group:
+ ioband_group_destroy_all(gp);
+release_ioband_device:
+ release_ioband_device(dp);
+ mutex_unlock(&ioband_lock);
+release_dm_device:
+ dm_put_device(ti, dev);
+ return r;
+}
+
+static void ioband_dtr(struct dm_target *ti)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+ struct dm_dev *dev = gp->c_dev;
+
+ mutex_lock(&ioband_lock);
+
+ ioband_group_stop_all(gp, 0);
+ cancel_delayed_work_sync(&dp->g_conductor);
+ ioband_group_destroy_all(gp);
+
+ release_ioband_device(dp);
+ mutex_unlock(&ioband_lock);
+
+ dm_put_device(ti, dev);
+}
+
+static void ioband_hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+ /* Todo: The list should be split into a sync list and an async list */
+ bio_list_add(&gp->c_blocked_bios, bio);
+}
+
+static struct bio *ioband_pop_bio(struct ioband_group *gp)
+{
+ return bio_list_pop(&gp->c_blocked_bios);
+}
+
+static int is_urgent_bio(struct bio *bio)
+{
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ /*
+ * ToDo: A new flag should be added to struct bio, which indicates
+ * it contains urgent I/O requests.
+ */
+ if (!PageReclaim(page))
+ return 0;
+ if (PageSwapCache(page))
+ return 2;
+ return 1;
+}
+
+static inline int device_should_block(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (is_group_down(gp))
+ return 0;
+ if (is_device_blocked(dp))
+ return 1;
+ if (dp->g_blocked >= dp->g_io_limit * 2) {
+ set_device_blocked(dp);
+ return 1;
+ }
+ return 0;
+}
+
+static inline int group_should_block(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (is_group_down(gp))
+ return 0;
+ if (is_group_blocked(gp))
+ return 1;
+ if (dp->g_should_block(gp)) {
+ set_group_blocked(gp);
+ return 1;
+ }
+ return 0;
+}
+
+static void prevent_burst_bios(struct ioband_group *gp, struct bio *bio)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (current->flags & PF_KTHREAD || is_urgent_bio(bio)) {
+ /*
+ * Kernel threads shouldn't be blocked easily since each of
+ * them may handle BIOs for several groups on several
+ * partitions.
+ */
+ wait_event_lock_irq(dp->g_waitq, !device_should_block(gp),
+ dp->g_lock, do_nothing());
+ } else {
+ wait_event_lock_irq(gp->c_waitq, !group_should_block(gp),
+ dp->g_lock, do_nothing());
+ }
+}
+
+static inline int should_pushback_bio(struct ioband_group *gp)
+{
+ return is_group_suspended(gp) && dm_noflush_suspending(gp->c_target);
+}
+
+static inline bool bio_is_sync(struct bio *bio)
+{
+ /* Must be the same condition as rw_is_sync() in blkdev.h */
+ return !bio_data_dir(bio) || bio_sync(bio);
+}
+
+static inline int prepare_to_issue(struct ioband_group *gp, struct bio *bio)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ dp->g_issued[bio_is_sync(bio)]++;
+ return dp->g_prepare_bio(gp, bio, 0);
+}
+
+static inline int room_for_bio(struct ioband_device *dp)
+{
+ return dp->g_issued[BLK_RW_SYNC] < dp->g_io_limit
+ || dp->g_issued[BLK_RW_ASYNC] < dp->g_io_limit;
+}
+
+static void hold_bio(struct ioband_group *gp, struct bio *bio)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ dp->g_blocked++;
+ if (is_urgent_bio(bio)) {
+ dp->g_prepare_bio(gp, bio, IOBAND_URGENT);
+ bio_list_add(&dp->g_urgent_bios, bio);
+ trace_ioband_hold_urgent_bio(gp, bio);
+ } else {
+ gp->c_blocked++;
+ dp->g_hold_bio(gp, bio);
+ trace_ioband_hold_bio(gp, bio);
+ }
+}
+
+static inline int room_for_bio_sync(struct ioband_device *dp, int sync)
+{
+ return dp->g_issued[sync] < dp->g_io_limit;
+}
+
+static void push_prio_bio(struct ioband_group *gp, struct bio *bio, int sync)
+{
+ if (bio_list_empty(&gp->c_prio_bios))
+ set_prio_queue(gp, sync);
+ bio_list_add(&gp->c_prio_bios, bio);
+ gp->c_prio_blocked++;
+}
+
+static struct bio *pop_prio_bio(struct ioband_group *gp)
+{
+ struct bio *bio = bio_list_pop(&gp->c_prio_bios);
+
+ if (bio_list_empty(&gp->c_prio_bios))
+ clear_prio_queue(gp);
+
+ if (bio)
+ gp->c_prio_blocked--;
+ return bio;
+}
+
+static int make_issue_list(struct ioband_group *gp, struct bio *bio,
+ struct bio_list *issue_list,
+ struct bio_list *pushback_list)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ dp->g_blocked--;
+ gp->c_blocked--;
+ if (!gp->c_blocked && is_group_blocked(gp)) {
+ clear_group_blocked(gp);
+ wake_up_all(&gp->c_waitq);
+ }
+ if (should_pushback_bio(gp)) {
+ bio_list_add(pushback_list, bio);
+ trace_ioband_make_pback_list(gp, bio);
+ } else {
+ int rw = bio_data_dir(bio);
+
+ gp->c_stat[rw].deferred++;
+ gp->c_stat[rw].sectors += bio_sectors(bio);
+ bio_list_add(issue_list, bio);
+ trace_ioband_make_issue_list(gp, bio);
+ }
+ return prepare_to_issue(gp, bio);
+}
+
+static void release_urgent_bios(struct ioband_device *dp,
+ struct bio_list *issue_list,
+ struct bio_list *pushback_list)
+{
+ struct bio *bio;
+
+ if (bio_list_empty(&dp->g_urgent_bios))
+ return;
+ while (room_for_bio_sync(dp, BLK_RW_ASYNC)) {
+ bio = bio_list_pop(&dp->g_urgent_bios);
+ if (!bio)
+ return;
+ dp->g_blocked--;
+ dp->g_issued[bio_is_sync(bio)]++;
+ bio_list_add(issue_list, bio);
+ trace_ioband_release_urgent_bios(dp, bio);
+ }
+}
+
+static int release_prio_bios(struct ioband_group *gp,
+ struct bio_list *issue_list,
+ struct bio_list *pushback_list)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct bio *bio;
+ int sync;
+ int ret;
+
+ if (bio_list_empty(&gp->c_prio_bios))
+ return R_OK;
+ sync = prio_queue_sync(gp);
+ while (gp->c_prio_blocked) {
+ if (!dp->g_can_submit(gp))
+ return R_BLOCK;
+ if (!room_for_bio_sync(dp, sync))
+ return R_OK;
+ bio = pop_prio_bio(gp);
+ if (!bio)
+ return R_OK;
+ ret = make_issue_list(gp, bio, issue_list, pushback_list);
+ if (ret)
+ return ret;
+ }
+ return R_OK;
+}
+
+static int release_norm_bios(struct ioband_group *gp,
+ struct bio_list *issue_list,
+ struct bio_list *pushback_list)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct bio *bio;
+ int sync, ret;
+
+ while (gp->c_blocked - gp->c_prio_blocked) {
+ if (!dp->g_can_submit(gp))
+ return R_BLOCK;
+ if (!room_for_bio(dp))
+ return R_OK;
+ bio = dp->g_pop_bio(gp);
+ if (!bio)
+ return R_OK;
+
+ sync = bio_is_sync(bio);
+ if (!room_for_bio_sync(dp, sync)) {
+ push_prio_bio(gp, bio, sync);
+ continue;
+ }
+ ret = make_issue_list(gp, bio, issue_list, pushback_list);
+ if (ret)
+ return ret;
+ }
+ return R_OK;
+}
+
+static inline int release_bios(struct ioband_group *gp,
+ struct bio_list *issue_list,
+ struct bio_list *pushback_list)
+{
+ int ret = release_prio_bios(gp, issue_list, pushback_list);
+ if (ret)
+ return ret;
+ return release_norm_bios(gp, issue_list, pushback_list);
+}
+
+static struct ioband_group *ioband_group_get(struct ioband_group *head,
+ struct bio *bio)
+{
+ struct ioband_group *gp;
+
+ if (!head->c_type->t_getid)
+ return head;
+
+ gp = ioband_group_find(head, head->c_type->t_getid(bio));
+
+ if (!gp)
+ gp = head;
+ return gp;
+}
+
+/*
+ * Start to control the bandwidth once the number of uncompleted BIOs
+ * exceeds the value of "io_throttle".
+ */
+static int ioband_map(struct dm_target *ti, struct bio *bio,
+ union map_info *map_context)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+ unsigned long flags;
+ int direct;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+
+ /*
+ * The device is suspended while some of the ioband device
+ * configurations are being changed.
+ */
+ if (is_device_suspended(dp))
+ wait_event_lock_irq(dp->g_waitq_suspend,
+ !is_device_suspended(dp), dp->g_lock,
+ do_nothing());
+
+ gp = ioband_group_get(gp, bio);
+ prevent_burst_bios(gp, bio);
+ if (should_pushback_bio(gp)) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return DM_MAPIO_REQUEUE;
+ }
+
+ bio->bi_bdev = gp->c_dev->bdev;
+ if (bio_sectors(bio))
+ bio->bi_sector -= ti->begin;
+ direct = bio_data_dir(bio);
+
+ if (!gp->c_blocked && room_for_bio_sync(dp, bio_is_sync(bio))) {
+ if (dp->g_can_submit(gp)) {
+ prepare_to_issue(gp, bio);
+ gp->c_stat[direct].immediate++;
+ gp->c_stat[direct].sectors += bio_sectors(bio);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return DM_MAPIO_REMAPPED;
+ } else if (!dp->g_blocked && num_issued(dp) == 0) {
+ DMDEBUG("%s: token expired gp:%p", __func__, gp);
+ queue_delayed_work(dp->g_ioband_wq,
+ &dp->g_conductor, 1);
+ }
+ }
+ hold_bio(gp, bio);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+
+ return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * Select the best group to resubmit its BIOs.
+ */
+static struct ioband_group *choose_best_group(struct ioband_device *dp)
+{
+ struct ioband_group *gp;
+ struct ioband_group *best = NULL;
+ int highest = 0;
+ int pri;
+
+ /* Todo: The algorithm should be optimized.
+ * It would be better to use rbtree.
+ */
+ list_for_each_entry(gp, &dp->g_groups, c_list) {
+ if (!gp->c_blocked || !room_for_bio(dp))
+ continue;
+ if (gp->c_blocked == gp->c_prio_blocked &&
+ !room_for_bio_sync(dp, prio_queue_sync(gp))) {
+ continue;
+ }
+ pri = dp->g_can_submit(gp);
+ if (pri > highest) {
+ highest = pri;
+ best = gp;
+ }
+ }
+
+ return best;
+}
+
+/*
+ * This function is called right after it becomes able to resubmit BIOs.
+ * It selects the best BIOs and passes them to the underlying layer.
+ */
+static void ioband_conduct(struct work_struct *work)
+{
+ struct ioband_device *dp =
+ container_of(work, struct ioband_device, g_conductor.work);
+ struct ioband_group *gp = NULL;
+ struct bio *bio;
+ unsigned long flags;
+ struct bio_list issue_list, pushback_list;
+
+ bio_list_init(&issue_list);
+ bio_list_init(&pushback_list);
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ release_urgent_bios(dp, &issue_list, &pushback_list);
+ if (dp->g_blocked) {
+ gp = choose_best_group(dp);
+ if (gp &&
+ release_bios(gp, &issue_list, &pushback_list) == R_YIELD)
+ queue_delayed_work(dp->g_ioband_wq,
+ &dp->g_conductor, 0);
+ }
+
+ if (is_device_blocked(dp) && dp->g_blocked < dp->g_io_limit * 2) {
+ clear_device_blocked(dp);
+ wake_up_all(&dp->g_waitq);
+ }
+
+ if (dp->g_blocked &&
+ room_for_bio_sync(dp, BLK_RW_SYNC) &&
+ room_for_bio_sync(dp, BLK_RW_ASYNC) &&
+ bio_list_empty(&issue_list) && bio_list_empty(&pushback_list) &&
+ dp->g_restart_bios(dp)) {
+ DMDEBUG("%s: token expired dp:%p issued(%d,%d) g_blocked(%d)",
+ __func__, dp,
+ dp->g_issued[BLK_RW_SYNC], dp->g_issued[BLK_RW_ASYNC],
+ dp->g_blocked);
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ }
+
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+
+ while ((bio = bio_list_pop(&issue_list))) {
+ trace_ioband_make_request(dp, bio);
+ generic_make_request(bio);
+ }
+
+ while ((bio = bio_list_pop(&pushback_list))) {
+ trace_ioband_pushback_bio(dp, bio);
+ bio_endio(bio, -EIO);
+ }
+}
+
+static int ioband_end_io(struct dm_target *ti, struct bio *bio,
+ int error, union map_info *map_context)
+{
+ struct ioband_group *gp = ti->private;
+ struct ioband_device *dp = gp->c_banddev;
+ unsigned long flags;
+ int r = error;
+
+ /*
+ * XXX: A new error code for device mapper devices should be used
+ * rather than EIO.
+ */
+ if (error == -EIO && should_pushback_bio(gp)) {
+ /* This ioband device is suspending */
+ r = DM_ENDIO_REQUEUE;
+ }
+ /*
+ * Todo: The algorithm should be optimized to eliminate the spinlock.
+ */
+ spin_lock_irqsave(&dp->g_lock, flags);
+ dp->g_issued[bio_is_sync(bio)]--;
+
+ /*
+ * Todo: It would be better to introduce high/low water marks here
+ * not to kick the workqueues so often.
+ */
+ if (dp->g_blocked)
+ queue_delayed_work(dp->g_ioband_wq, &dp->g_conductor, 0);
+ else if (is_device_suspended(dp) && num_issued(dp) == 0)
+ wake_up_all(&dp->g_waitq_flush);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+}
+
+static void ioband_presuspend(struct dm_target *ti)
+{
+ struct ioband_group *gp = ti->private;
+
+ ioband_group_stop_all(gp, 1);
+}
+
+static void ioband_resume(struct dm_target *ti)
+{
+ struct ioband_group *gp = ti->private;
+
+ ioband_group_resume_all(gp);
+}
+
+static void ioband_group_status(struct ioband_group *gp, int *szp,
+ char *result, unsigned maxlen)
+{
+ struct ioband_group_stat *stat;
+ int i, sz = *szp; /* used in DMEMIT() */
+
+ DMEMIT(" %d", gp->c_id);
+ for (i = 0; i < 2; i++) {
+ stat = &gp->c_stat[i];
+ DMEMIT(" %lu %lu %lu",
+ stat->immediate + stat->deferred, stat->deferred,
+ stat->sectors);
+ }
+ *szp = sz;
+}
+
+static int ioband_status(struct dm_target *ti, status_type_t type,
+ char *result, unsigned maxlen)
+{
+ struct ioband_group *gp = ti->private, *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ int sz = 0; /* used in DMEMIT() */
+ unsigned long flags;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ DMEMIT("%s", dp->g_name);
+ ioband_group_status(gp, &sz, result, maxlen);
+ for (node = rb_first(&gp->c_group_root); node;
+ node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ ioband_group_status(p, &sz, result, maxlen);
+ }
+ break;
+
+ case STATUSTYPE_TABLE:
+ DMEMIT("%s %s %d %d %s %s",
+ gp->c_dev->name, dp->g_name,
+ dp->g_io_throttle, dp->g_io_limit,
+ gp->c_type->t_name, dp->g_policy->p_name);
+ dp->g_show(gp, &sz, result, maxlen);
+ break;
+ }
+
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return 0;
+}
+
+static int ioband_group_type_select(struct ioband_group *gp, const char *name)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ const struct ioband_group_type *t;
+ unsigned long flags;
+
+ for (t = dm_ioband_group_type; (t->t_name); t++) {
+ if (!strcmp(name, t->t_name))
+ break;
+ }
+ if (!t->t_name) {
+ DMWARN("%s: %s isn't supported.", __func__, name);
+ return -EINVAL;
+ }
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (!RB_EMPTY_ROOT(&gp->c_group_root)) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -EBUSY;
+ }
+ gp->c_type = t;
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+
+ return 0;
+}
+
+static int ioband_set_param(struct ioband_group *gp,
+ const char *cmd, const char *value)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ const char *val_str;
+ long id;
+ unsigned long flags;
+ int r;
+
+ r = parse_group_param(value, &id, &val_str);
+ if (r)
+ return r;
+
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (id != IOBAND_ID_ANY) {
+ gp = ioband_group_find(gp, id);
+ if (!gp) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ DMWARN("%s: id=%ld not found.", __func__, id);
+ return -EINVAL;
+ }
+ }
+ r = dp->g_set_param(gp, cmd, val_str);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return r;
+}
+
+static int ioband_group_attach(struct ioband_group *gp,
+ int id, const char *param)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *sub_gp;
+ int r;
+
+ if (id < 0) {
+ DMWARN("%s: invalid id:%d", __func__, id);
+ return -EINVAL;
+ }
+ if (!gp->c_type->t_getid) {
+ DMWARN("%s: no ioband group type is specified", __func__);
+ return -EINVAL;
+ }
+
+ sub_gp = kzalloc(sizeof(struct ioband_group), GFP_KERNEL);
+ if (!sub_gp)
+ return -ENOMEM;
+
+ r = ioband_group_init(sub_gp, gp, dp, id, param);
+ if (r < 0) {
+ kfree(sub_gp);
+ return r;
+ }
+ return 0;
+}
+
+static int ioband_group_detach(struct ioband_group *gp, int id)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *sub_gp;
+ unsigned long flags;
+
+ if (id < 0) {
+ DMWARN("%s: invalid id:%d", __func__, id);
+ return -EINVAL;
+ }
+ spin_lock_irqsave(&dp->g_lock, flags);
+ sub_gp = ioband_group_find(gp, id);
+ if (!sub_gp) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ DMWARN("%s: invalid id:%d", __func__, id);
+ return -EINVAL;
+ }
+
+ /*
+ * Todo: Calling suspend_ioband_device() before releasing the
+ * ioband group has a large overhead. Need improvement.
+ */
+ suspend_ioband_device(dp, flags, 0);
+ ioband_group_release(gp, sub_gp);
+ resume_ioband_device(dp);
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return 0;
+}
+
+/*
+ * Message parameters:
+ * "policy" <name>
+ * ex)
+ * "policy" "weight"
+ * "type" "none"|"pid"|"pgrp"|"node"|"cpuset"|"cgroup"|"user"|"gid"
+ * "io_throttle" <value>
+ * "io_limit" <value>
+ * "attach" <group id>
+ * "detach" <group id>
+ * "any-command" <group id>:<value>
+ * ex)
+ * "weight" 0:<value>
+ * "token" 24:<value>
+ */
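+/*
+ * For illustration only (not part of the message handling below): with a
+ * hypothetical ioband device named "ioband1", these messages would be
+ * sent as, e.g.:
+ *   # dmsetup message ioband1 0 type user
+ *   # dmsetup message ioband1 0 weight 1000:80
+ * The device name and values are made-up examples; see
+ * Documentation/device-mapper/ioband.txt for the real command reference.
+ */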
+static int __ioband_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+ struct ioband_group *gp = ti->private, *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ long val;
+ int r = 0;
+ unsigned long flags;
+
+ if (argc == 1 && !strcmp(argv[0], "reset")) {
+ spin_lock_irqsave(&dp->g_lock, flags);
+ memset(gp->c_stat, 0, sizeof(gp->c_stat));
+ for (node = rb_first(&gp->c_group_root); node;
+ node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ memset(p->c_stat, 0, sizeof(p->c_stat));
+ }
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return 0;
+ }
+
+ if (argc != 2) {
+ DMWARN("Unrecognised band message received.");
+ return -EINVAL;
+ }
+ if (!strcmp(argv[0], "io_throttle")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r || val < 0 || val > SHORT_MAX)
+ return -EINVAL;
+ if (val == 0)
+ val = DEFAULT_IO_THROTTLE;
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (val > dp->g_io_limit) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -EINVAL;
+ }
+ dp->g_io_throttle = val;
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ ioband_set_param(gp, argv[0], argv[1]);
+ return 0;
+ } else if (!strcmp(argv[0], "io_limit")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r || val < 0 || val > SHORT_MAX)
+ return -EINVAL;
+ spin_lock_irqsave(&dp->g_lock, flags);
+ if (val == 0) {
+ struct request_queue *q;
+
+ q = bdev_get_queue(gp->c_dev->bdev);
+ if (!q) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -ENXIO;
+ }
+ /*
+ * The block layer accepts I/O requests up to
+ * 50% over nr_requests when the requests are
+ * issued from a "batcher" process.
+ */
+ val = (3 * q->nr_requests / 2);
+ }
+ if (val < dp->g_io_throttle) {
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ return -EINVAL;
+ }
+ dp->g_io_limit = val;
+ spin_unlock_irqrestore(&dp->g_lock, flags);
+ ioband_set_param(gp, argv[0], argv[1]);
+ return 0;
+ } else if (!strcmp(argv[0], "type")) {
+ return ioband_group_type_select(gp, argv[1]);
+ } else if (!strcmp(argv[0], "attach")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r)
+ return r;
+ return ioband_group_attach(gp, val, NULL);
+ } else if (!strcmp(argv[0], "detach")) {
+ r = strict_strtol(argv[1], 0, &val);
+ if (r)
+ return r;
+ return ioband_group_detach(gp, val);
+ } else if (!strcmp(argv[0], "policy")) {
+ r = policy_init(dp, argv[1], 0, &argv[2]);
+ return r;
+ } else {
+ /* message anycommand <group-id>:<value> */
+ r = ioband_set_param(gp, argv[0], argv[1]);
+ if (r < 0)
+ DMWARN("Unrecognised band message received.");
+ return r;
+ }
+ return 0;
+}
+
+static int ioband_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+ int r;
+
+ mutex_lock(&ioband_lock);
+ r = __ioband_message(ti, argc, argv);
+ mutex_unlock(&ioband_lock);
+ return r;
+}
+
+static int ioband_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+ struct bio_vec *biovec, int max_size)
+{
+ struct ioband_group *gp = ti->private;
+ struct request_queue *q = bdev_get_queue(gp->c_dev->bdev);
+
+ if (!q->merge_bvec_fn)
+ return max_size;
+
+ bvm->bi_bdev = gp->c_dev->bdev;
+ bvm->bi_sector -= ti->begin;
+
+ return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
+static struct target_type ioband_target = {
+ .name = "ioband",
+ .module = THIS_MODULE,
+ .version = {1, 12, 1},
+ .ctr = ioband_ctr,
+ .dtr = ioband_dtr,
+ .map = ioband_map,
+ .end_io = ioband_end_io,
+ .presuspend = ioband_presuspend,
+ .resume = ioband_resume,
+ .status = ioband_status,
+ .message = ioband_message,
+ .merge = ioband_merge,
+};
+
+static int __init dm_ioband_init(void)
+{
+ int r;
+
+ r = dm_register_target(&ioband_target);
+ if (r < 0)
+ DMERR("register failed %d", r);
+ return r;
+}
+
+static void __exit dm_ioband_exit(void)
+{
+ dm_unregister_target(&ioband_target);
+}
+
+module_init(dm_ioband_init);
+module_exit(dm_ioband_exit);
+
+MODULE_DESCRIPTION(DM_NAME " I/O bandwidth control");
+MODULE_AUTHOR("Hirokazu Takahashi, Ryo Tsuruta, Dong-Jae Kang");
+MODULE_LICENSE("GPL");
Index: linux-2.6.31-rc3/drivers/md/dm-ioband-policy.c
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/drivers/md/dm-ioband-policy.c
@@ -0,0 +1,459 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ * I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include "dm.h"
+#include "dm-ioband.h"
+
+/*
+ * The following functions determine when and which BIOs should
+ * be submitted to control the I/O flow.
+ * It is possible to add a new BIO scheduling policy with it.
+ */
+
+/*
+ * Functions for weight balancing policy based on the number of I/Os.
+ */
+#define DEFAULT_WEIGHT 100
+#define DEFAULT_TOKENPOOL 2048
+#define DEFAULT_BUCKET 2
+#define IOBAND_IOPRIO_BASE 100
+#define TOKEN_BATCH_UNIT 20
+#define PROCEED_THRESHOLD 8
+#define LOCAL_ACTIVE_RATIO 8
+#define GLOBAL_ACTIVE_RATIO 16
+#define OVERCOMMIT_RATE 4
+
+/*
+ * Calculate the effective number of tokens this group has.
+ */
+static int get_token(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int token = gp->c_token;
+ int allowance = dp->g_epoch - gp->c_my_epoch;
+
+ if (allowance) {
+ if (allowance > dp->g_carryover)
+ allowance = dp->g_carryover;
+ token += gp->c_token_initial * allowance;
+ }
+ if (is_group_down(gp))
+ token += gp->c_token_initial * dp->g_carryover * 2;
+
+ return token;
+}
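+
+/*
+ * Illustrative numbers (not from the patch): a group with
+ * c_token_initial = 800 that is one epoch behind (allowance = 1,
+ * g_carryover >= 1) and has 50 tokens left is treated as having
+ * 50 + 800 * 1 = 850 effective tokens.
+ */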
+
+/*
+ * Calculate the priority of a given group.
+ */
+static int iopriority(struct ioband_group *gp)
+{
+ return get_token(gp) * IOBAND_IOPRIO_BASE / gp->c_token_initial + 1;
+}
+
+/*
+ * This function is called when all the active groups on the same ioband
+ * device have used up their tokens. It makes a new global epoch so that
+ * all groups on this device will get freshly assigned tokens.
+ */
+static int make_global_epoch(struct ioband_device *dp)
+{
+ struct ioband_group *gp = dp->g_dominant;
+
+ /*
+ * Don't make a new epoch if the dominant group still has a lot of
+ * tokens, except when the I/O load is low.
+ */
+ if (gp) {
+ int iopri = iopriority(gp);
+ if (iopri * PROCEED_THRESHOLD > IOBAND_IOPRIO_BASE &&
+ dp->g_issued[READ] + dp->g_issued[WRITE] >=
+ dp->g_io_throttle)
+ return 0;
+ }
+
+ dp->g_epoch++;
+ DMDEBUG("make_epoch %d", dp->g_epoch);
+
+ /* The leftover tokens will be used in the next epoch. */
+ dp->g_token_extra = dp->g_token_left;
+ if (dp->g_token_extra < 0)
+ dp->g_token_extra = 0;
+ dp->g_token_left = dp->g_token_bucket;
+
+ dp->g_expired = NULL;
+ dp->g_dominant = NULL;
+
+ return 1;
+}
+
+/*
+ * This function is called when this group has used up its own tokens.
+ * It will check whether it's possible to make a new epoch of this group.
+ */
+static inline int make_epoch(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int allowance = dp->g_epoch - gp->c_my_epoch;
+
+ if (!allowance)
+ return 0;
+ if (allowance > dp->g_carryover)
+ allowance = dp->g_carryover;
+ gp->c_my_epoch = dp->g_epoch;
+ return allowance;
+}
+
+/*
+ * Check whether this group has tokens to issue an I/O. Return 0 if it
+ * doesn't have any, otherwise return the priority of this group.
+ */
+static int is_token_left(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int allowance;
+ int delta;
+ int extra;
+
+ if (gp->c_token > 0)
+ return iopriority(gp);
+
+ if (is_group_down(gp)) {
+ gp->c_token = gp->c_token_initial;
+ return iopriority(gp);
+ }
+ allowance = make_epoch(gp);
+ if (!allowance)
+ return 0;
+ /*
+ * If this group has the right to get tokens for several epochs,
+ * give all of them to the group here.
+ */
+ delta = gp->c_token_initial * allowance;
+ dp->g_token_left -= delta;
+ /*
+ * Give some extra tokens to this group when unused tokens are left
+ * on this ioband device from the previous epoch.
+ */
+ extra = dp->g_token_extra * gp->c_token_initial /
+ (dp->g_token_bucket - dp->g_token_extra / 2);
+ delta += extra;
+ gp->c_token += delta;
+ gp->c_consumed = 0;
+
+ if (gp == dp->g_current)
+ dp->g_yield_mark += delta;
+ DMDEBUG("refill token: gp:%p token:%d->%d extra(%d) allowance(%d)",
+ gp, gp->c_token - delta, gp->c_token, extra, allowance);
+ if (gp->c_token > 0)
+ return iopriority(gp);
+ DMDEBUG("refill token: yet empty gp:%p token:%d", gp, gp->c_token);
+ return 0;
+}
+
+/*
+ * Use tokens to issue an I/O. After the operation, the number of tokens left
+ * on this group may become a negative value, which will be treated as debt.
+ */
+static int consume_token(struct ioband_group *gp, int count, int flag)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ if (gp->c_consumed * LOCAL_ACTIVE_RATIO < gp->c_token_initial &&
+ gp->c_consumed * GLOBAL_ACTIVE_RATIO < dp->g_token_bucket) {
+ ; /* Do nothing unless this group is really active. */
+ } else if (!dp->g_dominant ||
+ get_token(gp) > get_token(dp->g_dominant)) {
+ /*
+ * Regard this group as the dominant group on this
+ * ioband device when it has a larger number of tokens
+ * than the previous dominant group.
+ */
+ dp->g_dominant = gp;
+ }
+ if (dp->g_epoch == gp->c_my_epoch &&
+ gp->c_token > 0 && gp->c_token - count <= 0) {
+ /* Remember the last group which used up its own tokens. */
+ dp->g_expired = gp;
+ if (dp->g_dominant == gp)
+ dp->g_dominant = NULL;
+ }
+
+ if (gp != dp->g_current) {
+ /* Make this group the current group. */
+ dp->g_current = gp;
+ dp->g_yield_mark =
+ gp->c_token - (TOKEN_BATCH_UNIT << dp->g_token_unit);
+ }
+ gp->c_token -= count;
+ gp->c_consumed += count;
+ if (gp->c_token <= dp->g_yield_mark && !(flag & IOBAND_URGENT)) {
+ /*
+ * Return-value 1 means that this policy requests dm-ioband
+ * to give a chance to another group to be selected since
+ * this group has already issued enough amount of I/Os.
+ */
+ dp->g_current = NULL;
+ return R_YIELD;
+ }
+ /*
+ * Return-value 0 means that this policy allows dm-ioband to select
+ * this group to issue I/Os without a break.
+ */
+ return R_OK;
+}
+
+/*
+ * Consume one token on each I/O.
+ */
+static int prepare_token(struct ioband_group *gp, struct bio *bio, int flag)
+{
+ return consume_token(gp, 1, flag);
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ */
+static int is_queue_full(struct ioband_group *gp)
+{
+ return gp->c_blocked >= gp->c_limit;
+}
+
+static void set_weight(struct ioband_group *gp, int new)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+
+ dp->g_weight_total += (new - gp->c_weight);
+ gp->c_weight = new;
+
+ if (dp->g_weight_total == 0) {
+ list_for_each_entry(p, &dp->g_groups, c_list)
+ p->c_token = p->c_token_initial = p->c_limit = 1;
+ } else {
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ p->c_token = p->c_token_initial =
+ dp->g_token_bucket * p->c_weight /
+ dp->g_weight_total + 1;
+ p->c_limit = dp->g_io_limit * 2 * p->c_weight /
+ dp->g_weight_total / OVERCOMMIT_RATE + 1;
+ }
+ }
+}
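+
+/*
+ * A rough example with assumed values: if g_token_bucket = 2048 and two
+ * groups have weights 40 and 60 (g_weight_total = 100), their
+ * c_token_initial become 2048 * 40 / 100 + 1 = 820 and
+ * 2048 * 60 / 100 + 1 = 1229 respectively.
+ */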
+
+static void init_token_bucket(struct ioband_device *dp,
+ int token_bucket, int carryover)
+{
+ if (!token_bucket)
+ dp->g_token_bucket = (dp->g_io_limit * 2 * DEFAULT_BUCKET) <<
+ dp->g_token_unit;
+ else
+ dp->g_token_bucket = token_bucket;
+ if (!carryover)
+ dp->g_carryover = (DEFAULT_TOKENPOOL << dp->g_token_unit) /
+ dp->g_token_bucket;
+ else
+ dp->g_carryover = carryover;
+ if (dp->g_carryover < 1)
+ dp->g_carryover = 1;
+ dp->g_token_left = 0;
+}
+
+static int policy_weight_param(struct ioband_group *gp,
+ const char *cmd, const char *value)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ long val = 0;
+ int r = 0, err = 0;
+
+ if (value)
+ err = strict_strtol(value, 0, &val);
+
+ if (!strcmp(cmd, "weight")) {
+ if (!value)
+ set_weight(gp, DEFAULT_WEIGHT);
+ else if (!err && 0 < val && val <= SHORT_MAX)
+ set_weight(gp, val);
+ else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "token")) {
+ if (!err && 0 <= val && val <= INT_MAX) {
+ init_token_bucket(dp, val, 0);
+ set_weight(gp, gp->c_weight);
+ dp->g_token_extra = 0;
+ } else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "carryover")) {
+ if (!err && 0 <= val && val <= INT_MAX) {
+ init_token_bucket(dp, dp->g_token_bucket, val);
+ set_weight(gp, gp->c_weight);
+ dp->g_token_extra = 0;
+ } else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "io_limit")) {
+ init_token_bucket(dp, 0, 0);
+ set_weight(gp, gp->c_weight);
+ } else {
+ r = -EINVAL;
+ }
+ return r;
+}
+
+static int policy_weight_ctr(struct ioband_group *gp, const char *arg)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ gp->c_my_epoch = dp->g_epoch;
+ gp->c_weight = 0;
+ gp->c_consumed = 0;
+ return policy_weight_param(gp, "weight", arg);
+}
+
+static void policy_weight_dtr(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ set_weight(gp, 0);
+ dp->g_dominant = NULL;
+ dp->g_expired = NULL;
+}
+
+static void policy_weight_show(struct ioband_group *gp, int *szp,
+ char *result, unsigned maxlen)
+{
+ struct ioband_group *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ int sz = *szp; /* used in DMEMIT() */
+
+ DMEMIT(" %d :%d", dp->g_token_bucket, gp->c_weight);
+
+ for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ DMEMIT(" %d:%d", p->c_id, p->c_weight);
+ }
+ *szp = sz;
+}
+
+/*
+ * <Method> <description>
+ * g_can_submit : To determine whether a given group has the right to
+ * submit BIOs. The larger the return value the higher the
+ * priority to submit. Zero means it has no right.
+ * g_prepare_bio : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ * of them can be submitted now. This method has to
+ * reinitialize the data to restart to submit BIOs and return
+ * 0 or 1.
+ * The return value 0 means that it has become able to submit
+ * them now so that this ioband device will continue its work.
+ * The return value 1 means that it is still unable to submit
+ * them so that this device will stop its work. And this
+ * policy module has to reactivate the device when it gets
+ * to be able to submit BIOs.
+ * g_hold_bio : To hold a given BIO until it is submitted.
+ * The default function is used when this method is undefined.
+ * g_pop_bio : To select and get the best BIO to submit.
+ * g_group_ctr : To initialize the policy's own members of struct ioband_group.
+ * g_group_dtr : Called when struct ioband_group is removed.
+ * g_set_param : To update the policy's own data.
+ * The parameters can be passed through "dmsetup message"
+ * command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ * Return 1 if a given group can't receive any more BIOs,
+ * otherwise return 0.
+ * g_show : Show the configuration.
+ */
+static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
+{
+ long val;
+ int r = 0;
+
+ if (argc < 1)
+ val = 0;
+ else {
+ r = strict_strtol(argv[0], 0, &val);
+ if (r || val < 0 || val > INT_MAX)
+ return -EINVAL;
+ }
+
+ dp->g_can_submit = is_token_left;
+ dp->g_prepare_bio = prepare_token;
+ dp->g_restart_bios = make_global_epoch;
+ dp->g_group_ctr = policy_weight_ctr;
+ dp->g_group_dtr = policy_weight_dtr;
+ dp->g_set_param = policy_weight_param;
+ dp->g_should_block = is_queue_full;
+ dp->g_show = policy_weight_show;
+
+ dp->g_epoch = 0;
+ dp->g_weight_total = 0;
+ dp->g_current = NULL;
+ dp->g_dominant = NULL;
+ dp->g_expired = NULL;
+ dp->g_token_extra = 0;
+ dp->g_token_unit = 0;
+ init_token_bucket(dp, val, 0);
+ dp->g_token_left = dp->g_token_bucket;
+
+ return 0;
+}
+
+/* weight balancing policy based on the number of I/Os. --- End --- */
+
+/*
+ * Functions for weight balancing policy based on I/O size.
+ * It just borrows a lot of functions from the regular weight balancing policy.
+ */
+static int iosize_prepare_token(struct ioband_group *gp,
+ struct bio *bio, int flag)
+{
+ /* Consume tokens depending on the size of a given bio. */
+ return consume_token(gp, bio_sectors(bio), flag);
+}
+
+static int policy_weight_iosize_init(struct ioband_device *dp,
+ int argc, char **argv)
+{
+ long val;
+ int r = 0;
+
+ if (argc < 1)
+ val = 0;
+ else {
+ r = strict_strtol(argv[0], 0, &val);
+ if (r || val < 0 || val > INT_MAX)
+ return -EINVAL;
+ }
+
+ r = policy_weight_init(dp, argc, argv);
+ if (r < 0)
+ return r;
+
+ dp->g_prepare_bio = iosize_prepare_token;
+ dp->g_token_unit = PAGE_SHIFT - 9;
+ init_token_bucket(dp, val, 0);
+ dp->g_token_left = dp->g_token_bucket;
+ return 0;
+}
+
+/* weight balancing policy based on I/O size. --- End --- */
+
+static int policy_default_init(struct ioband_device *dp, int argc, char **argv)
+{
+ return policy_weight_init(dp, argc, argv);
+}
+
+const struct ioband_policy_type dm_ioband_policy_type[] = {
+ { "default", policy_default_init },
+ { "weight", policy_weight_init },
+ { "weight-iosize", policy_weight_iosize_init },
+ { "range-bw", policy_range_bw_init },
+ { NULL, policy_default_init }
+};
Index: linux-2.6.31-rc3/drivers/md/dm-ioband-type.c
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/drivers/md/dm-ioband-type.c
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ * I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/bio.h>
+#include "dm.h"
+#include "dm-ioband.h"
+
+/*
+ * Any I/O bandwidth can be divided into several bandwidth groups, each of which
+ * has its own unique ID. The following functions are called to determine
+ * which group a given BIO belongs to and return the ID of the group.
+ */
+
+/* ToDo: unsigned long value would be better for group ID */
+
+static int ioband_process_id(struct bio *bio)
+{
+ /*
+ * This function will work for KVM and Xen.
+ */
+ return (int)current->tgid;
+}
+
+static int ioband_process_group(struct bio *bio)
+{
+ return (int)task_pgrp_nr(current);
+}
+
+static int ioband_uid(struct bio *bio)
+{
+ return (int)current_uid();
+}
+
+static int ioband_gid(struct bio *bio)
+{
+ return (int)current_gid();
+}
+
+static int ioband_cpuset(struct bio *bio)
+{
+ return 0; /* not implemented yet */
+}
+
+static int ioband_node(struct bio *bio)
+{
+ return 0; /* not implemented yet */
+}
+
+static int ioband_cgroup(struct bio *bio)
+{
+ /*
+ * This function should return the ID of the cgroup which
+ * issued "bio". The ID of the cgroup which the current
+ * process belongs to won't be a suitable ID for this purpose,
+ * since some BIOs will be handled by kernel threads like aio
+ * or pdflush on behalf of the process requesting the BIOs.
+ */
+ return 0; /* not implemented yet */
+}
+
+const struct ioband_group_type dm_ioband_group_type[] = {
+ { "none", NULL },
+ { "pgrp", ioband_process_group },
+ { "pid", ioband_process_id },
+ { "node", ioband_node },
+ { "cpuset", ioband_cpuset },
+ { "cgroup", ioband_cgroup },
+ { "user", ioband_uid },
+ { "uid", ioband_uid },
+ { "gid", ioband_gid },
+ { NULL, NULL}
+};
Index: linux-2.6.31-rc3/drivers/md/dm-ioband.h
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/drivers/md/dm-ioband.h
@@ -0,0 +1,228 @@
+/*
+ * Copyright (C) 2008-2009 VA Linux Systems Japan K.K.
+ *
+ * I/O bandwidth control
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef DM_IOBAND_H
+#define DM_IOBAND_H
+
+#include <linux/version.h>
+#include <linux/wait.h>
+
+#define DM_MSG_PREFIX "ioband"
+
+#define DEFAULT_IO_THROTTLE 4
+#define IOBAND_NAME_MAX 31
+#define IOBAND_ID_ANY (-1)
+#define POLICY_PARAM_START 6
+#define POLICY_PARAM_DELIM "=:,"
+
+#define MAX_BW_OVER 1
+#define MAX_BW_UNDER 0
+#define NO_IO_MODE 4
+
+#define TIME_COMPENSATOR 10
+
+struct ioband_group;
+
+struct ioband_device {
+ struct list_head g_groups;
+ struct delayed_work g_conductor;
+ struct workqueue_struct *g_ioband_wq;
+ struct bio_list g_urgent_bios;
+ int g_io_throttle;
+ int g_io_limit;
+ int g_issued[2];
+ int g_blocked;
+ spinlock_t g_lock;
+ wait_queue_head_t g_waitq;
+ wait_queue_head_t g_waitq_suspend;
+ wait_queue_head_t g_waitq_flush;
+
+ int g_ref;
+ struct list_head g_list;
+ int g_flags;
+ char g_name[IOBAND_NAME_MAX + 1];
+ const struct ioband_policy_type *g_policy;
+
+ /* policy dependent */
+ int (*g_can_submit) (struct ioband_group *);
+ int (*g_prepare_bio) (struct ioband_group *, struct bio *, int);
+ int (*g_restart_bios) (struct ioband_device *);
+ void (*g_hold_bio) (struct ioband_group *, struct bio *);
+ struct bio *(*g_pop_bio) (struct ioband_group *);
+ int (*g_group_ctr) (struct ioband_group *, const char *);
+ void (*g_group_dtr) (struct ioband_group *);
+ int (*g_set_param) (struct ioband_group *, const char *, const char *);
+ int (*g_should_block) (struct ioband_group *);
+ void (*g_show) (struct ioband_group *, int *, char *, unsigned);
+
+ /* members for weight balancing policy */
+ int g_epoch;
+ int g_weight_total;
+ /* the number of tokens which can be used in every epoch */
+ int g_token_bucket;
+ /* how many epochs tokens can be carried over */
+ int g_carryover;
+ /* how many tokens should be used for one page-sized I/O */
+ int g_token_unit;
+ /* the last group which used a token */
+ struct ioband_group *g_current;
+ /* give another group a chance to be scheduled when the rest
+ of tokens of the current group reaches this mark */
+ int g_yield_mark;
+ /* the latest group which used up its tokens */
+ struct ioband_group *g_expired;
+ /* the group which has the largest number of tokens in the
+ active groups */
+ struct ioband_group *g_dominant;
+ /* the number of unused tokens in this epoch */
+ int g_token_left;
+ /* left-over tokens from the previous epoch */
+ int g_token_extra;
+
+ /* members for range-bw policy */
+ int g_min_bw_total;
+ int g_max_bw_total;
+ unsigned long g_next_time_period;
+ int g_time_period_expired;
+ struct ioband_group *g_running_gp;
+ int g_total_min_bw_token;
+ int g_consumed_min_bw_token;
+ int g_io_mode;
+
+};
+
+struct ioband_group_stat {
+ unsigned long sectors;
+ unsigned long immediate;
+ unsigned long deferred;
+};
+
+struct ioband_group {
+ struct list_head c_list;
+ struct ioband_device *c_banddev;
+ struct dm_dev *c_dev;
+ struct dm_target *c_target;
+ struct bio_list c_blocked_bios;
+ struct bio_list c_prio_bios;
+ struct rb_root c_group_root;
+ struct rb_node c_group_node;
+ int c_id; /* should be unsigned long or unsigned long long */
+ char c_name[IOBAND_NAME_MAX + 1]; /* rfu */
+ int c_blocked;
+ int c_prio_blocked;
+ wait_queue_head_t c_waitq;
+ int c_flags;
+ struct ioband_group_stat c_stat[2]; /* hold rd/wr status */
+ const struct ioband_group_type *c_type;
+
+ /* members for weight balancing policy */
+ int c_weight;
+ int c_my_epoch;
+ int c_token;
+ int c_token_initial;
+ int c_limit;
+ int c_consumed;
+
+ /* rfu */
+ /* struct bio_list c_ordered_tag_bios; */
+
+ /* members for range-bw policy */
+ wait_queue_head_t c_max_bw_over_waitq;
+ struct timer_list *c_timer;
+ int timer_set;
+ int c_min_bw;
+ int c_max_bw;
+ int c_time_slice_expired;
+ int c_min_bw_token;
+ int c_max_bw_token;
+ int c_consumed_min_bw_token;
+ int c_is_over_max_bw;
+ int c_io_mode;
+ unsigned long c_time_slice;
+ unsigned long c_time_slice_start;
+ unsigned long c_time_slice_end;
+ int c_wait_p_count;
+
+};
+
+#define IOBAND_URGENT 1
+
+#define DEV_BIO_BLOCKED 1
+#define DEV_SUSPENDED 2
+
+#define set_device_blocked(dp) ((dp)->g_flags |= DEV_BIO_BLOCKED)
+#define clear_device_blocked(dp) ((dp)->g_flags &= ~DEV_BIO_BLOCKED)
+#define is_device_blocked(dp) ((dp)->g_flags & DEV_BIO_BLOCKED)
+
+#define set_device_suspended(dp) ((dp)->g_flags |= DEV_SUSPENDED)
+#define clear_device_suspended(dp) ((dp)->g_flags &= ~DEV_SUSPENDED)
+#define is_device_suspended(dp) ((dp)->g_flags & DEV_SUSPENDED)
+
+#define IOG_PRIO_BIO_SYNC 1
+#define IOG_PRIO_QUEUE 2
+#define IOG_BIO_BLOCKED 4
+#define IOG_GOING_DOWN 8
+#define IOG_SUSPENDED 16
+#define IOG_NEED_UP 32
+
+#define R_OK 0
+#define R_BLOCK 1
+#define R_YIELD 2
+
+#define set_group_blocked(gp) ((gp)->c_flags |= IOG_BIO_BLOCKED)
+#define clear_group_blocked(gp) ((gp)->c_flags &= ~IOG_BIO_BLOCKED)
+#define is_group_blocked(gp) ((gp)->c_flags & IOG_BIO_BLOCKED)
+
+#define set_group_down(gp) ((gp)->c_flags |= IOG_GOING_DOWN)
+#define clear_group_down(gp) ((gp)->c_flags &= ~IOG_GOING_DOWN)
+#define is_group_down(gp) ((gp)->c_flags & IOG_GOING_DOWN)
+
+#define set_group_suspended(gp) ((gp)->c_flags |= IOG_SUSPENDED)
+#define clear_group_suspended(gp) ((gp)->c_flags &= ~IOG_SUSPENDED)
+#define is_group_suspended(gp) ((gp)->c_flags & IOG_SUSPENDED)
+
+#define set_group_need_up(gp) ((gp)->c_flags |= IOG_NEED_UP)
+#define clear_group_need_up(gp) ((gp)->c_flags &= ~IOG_NEED_UP)
+#define group_need_up(gp) ((gp)->c_flags & IOG_NEED_UP)
+
+#define set_prio_async(gp) ((gp)->c_flags |= IOG_PRIO_QUEUE)
+#define clear_prio_async(gp) ((gp)->c_flags &= ~IOG_PRIO_QUEUE)
+#define is_prio_async(gp) \
+ (((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC)) == IOG_PRIO_QUEUE)
+
+#define set_prio_sync(gp) \
+ ((gp)->c_flags |= (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+#define clear_prio_sync(gp) \
+ ((gp)->c_flags &= ~(IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+#define is_prio_sync(gp) \
+ (((gp)->c_flags & (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC)) == \
+ (IOG_PRIO_QUEUE|IOG_PRIO_BIO_SYNC))
+
+#define set_prio_queue(gp, sync) \
+ ((gp)->c_flags |= (IOG_PRIO_QUEUE|sync))
+#define clear_prio_queue(gp) clear_prio_sync(gp)
+#define is_prio_queue(gp) ((gp)->c_flags & IOG_PRIO_QUEUE)
+#define prio_queue_sync(gp) ((gp)->c_flags & IOG_PRIO_BIO_SYNC)
+
+struct ioband_policy_type {
+ const char *p_name;
+ int (*p_policy_init) (struct ioband_device *, int, char **);
+};
+
+extern const struct ioband_policy_type dm_ioband_policy_type[];
+
+struct ioband_group_type {
+ const char *t_name;
+ int (*t_getid) (struct bio *);
+};
+
+extern const struct ioband_group_type dm_ioband_group_type[];
+
+extern int policy_range_bw_init(struct ioband_device *, int, char **);
+
+#endif /* DM_IOBAND_H */
Index: linux-2.6.31-rc3/drivers/md/dm-ioband-rangebw.c
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/drivers/md/dm-ioband-rangebw.c
@@ -0,0 +1,673 @@
+/*
+ * dm-ioband-rangebw.c
+ *
+ * This is an I/O control policy to support range bandwidth control for
+ * disk I/O. This policy is for the dm-ioband controller by Ryo Tsuruta
+ * and Hirokazu Takahashi.
+ *
+ * Copyright (C) 2008 - 2011
+ * Electronics and Telecommunications Research Institute(ETRI)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License(GPL) as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Contact Information:
+ * Dong-Jae, Kang <[email protected]>, Chei-Yol,Kim <[email protected]>,
+ * Sung-In,Jung <[email protected]>
+ */
+
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/rbtree.h>
+#include <linux/jiffies.h>
+#include <linux/random.h>
+#include <linux/time.h>
+#include <linux/timer.h>
+#include "dm.h"
+#include "md.h"
+#include "dm-ioband.h"
+
+static void range_bw_timeover(unsigned long);
+static void range_bw_timer_register(struct timer_list *,
+ unsigned long, unsigned long);
+
+/*
+ * Functions for the Range Bandwidth (range-bw) policy based on
+ * time slices and tokens.
+ */
+#define DEFAULT_BUCKET 2
+#define DEFAULT_TOKENPOOL 2048
+
+#define TIME_SLICE_EXPIRED 1
+#define TIME_SLICE_NOT_EXPIRED 0
+
+#define MINBW_IO_MODE 0
+#define LEFTOVER_IO_MODE 1
+#define RANGE_IO_MODE 2
+#define DEFAULT_IO_MODE 3
+#define NO_IO_MODE 4
+
+#define MINBW_PRIO_BASE 10
+#define OVER_IO_RATE 4
+
+#define DEFAULT_RANGE_BW "0:0"
+#define DEFAULT_MIN_BW 0
+#define DEFAULT_MAX_BW 0
+
+static const int time_slice_base = HZ / 10;
+static const int range_time_slice_base = HZ / 50;
+static void do_nothing(void) {}
+/*
+ * g_restart_bios function for range-bw policy
+ */
+static int range_bw_restart_bios(struct ioband_device *dp)
+{
+ return 1;
+}
+
+/*
+ * Allocate the time slice when IO mode is MINBW_IO_MODE,
+ * RANGE_IO_MODE or LEFTOVER_IO_MODE
+ */
+static int set_time_slice(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int dp_io_mode, gp_io_mode;
+ unsigned long now = jiffies;
+
+ dp_io_mode = dp->g_io_mode;
+ gp_io_mode = gp->c_io_mode;
+
+ gp->c_time_slice_start = now;
+
+ if (dp_io_mode == LEFTOVER_IO_MODE) {
+ gp->c_time_slice_end = now + gp->c_time_slice;
+ return 0;
+ }
+
+ if (gp_io_mode == MINBW_IO_MODE)
+ gp->c_time_slice_end = now + gp->c_time_slice;
+ else if (gp_io_mode == RANGE_IO_MODE)
+ gp->c_time_slice_end = now + range_time_slice_base;
+ else if (gp_io_mode == DEFAULT_IO_MODE)
+ gp->c_time_slice_end = now + time_slice_base;
+ else if (gp_io_mode == NO_IO_MODE) {
+ gp->c_time_slice_end = 0;
+ gp->c_time_slice_expired = TIME_SLICE_EXPIRED;
+ return 0;
+ }
+
+ gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+
+ return 0;
+}
+
+/*
+ * Calculate the priority of the given ioband_group
+ */
+static int range_bw_priority(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int prio = 0;
+
+ if (dp->g_io_mode == LEFTOVER_IO_MODE) {
+ prio = random32() % MINBW_PRIO_BASE;
+ if (prio == 0)
+ prio = 1;
+ } else if (gp->c_io_mode == MINBW_IO_MODE) {
+ prio = (gp->c_min_bw_token - gp->c_consumed_min_bw_token) *
+ MINBW_PRIO_BASE;
+ } else if (gp->c_io_mode == DEFAULT_IO_MODE) {
+ prio = MINBW_PRIO_BASE;
+ } else if (gp->c_io_mode == RANGE_IO_MODE) {
+ prio = MINBW_PRIO_BASE / 2;
+ } else {
+ prio = 0;
+ }
+
+ return prio;
+}
+
+/*
+ * Check whether this group has the right to issue an I/O in range-bw policy
+ * mode. Return 0 if it doesn't, otherwise return a non-zero value.
+ */
+static int has_right_to_issue(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int prio;
+
+ if (gp->c_prio_blocked > 0 || gp->c_blocked - gp->c_prio_blocked > 0) {
+ prio = range_bw_priority(gp);
+ if (prio <= 0)
+ return 1;
+ return prio;
+ }
+
+ if (gp == dp->g_running_gp) {
+
+ if (gp->c_time_slice_expired == TIME_SLICE_EXPIRED) {
+
+ gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+ gp->c_time_slice_end = 0;
+
+ return 0;
+ }
+
+ if (gp->c_time_slice_end == 0)
+ set_time_slice(gp);
+
+ return range_bw_priority(gp);
+
+ }
+
+ dp->g_running_gp = gp;
+ set_time_slice(gp);
+
+ return range_bw_priority(gp);
+}
+
+/*
+ * Reset all variables related to the range-bw tokens and time slices
+ */
+static int reset_range_bw_token(struct ioband_group *gp, unsigned long now)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ p->c_consumed_min_bw_token = 0;
+ p->c_is_over_max_bw = MAX_BW_UNDER;
+ if (p->c_io_mode != DEFAULT_IO_MODE)
+ p->c_io_mode = MINBW_IO_MODE;
+ }
+
+ dp->g_consumed_min_bw_token = 0;
+
+ dp->g_next_time_period = now + HZ;
+ dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+ dp->g_io_mode = MINBW_IO_MODE;
+
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ if (waitqueue_active(&p->c_max_bw_over_waitq))
+ wake_up_all(&p->c_max_bw_over_waitq);
+ }
+ return 0;
+}
+
+/*
+ * Use tokens (increase the number of consumed tokens) to issue an I/O
+ * while guaranteeing the range-bw, and check for expiration of the local
+ * and global time slices and for overflow of the max bandwidth.
+ */
+static int range_bw_consume_token(struct ioband_group *gp, int count, int flag)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+ unsigned long now = jiffies;
+ int io_mode;
+
+ dp->g_current = gp;
+
+ if (dp->g_next_time_period == 0) {
+ dp->g_next_time_period = now + HZ;
+ dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+ }
+
+ if (time_after(now, dp->g_next_time_period)) {
+ reset_range_bw_token(gp, now);
+ } else {
+ gp->c_consumed_min_bw_token += count;
+ dp->g_consumed_min_bw_token += count;
+
+ if (gp->c_max_bw > 0 && gp->c_consumed_min_bw_token >=
+ gp->c_max_bw_token) {
+ gp->c_is_over_max_bw = MAX_BW_OVER;
+ gp->c_io_mode = NO_IO_MODE;
+ return R_YIELD;
+ }
+
+ if (gp->c_io_mode != RANGE_IO_MODE && gp->c_min_bw_token <=
+ gp->c_consumed_min_bw_token) {
+ gp->c_io_mode = RANGE_IO_MODE;
+
+ if (dp->g_total_min_bw_token <=
+ dp->g_consumed_min_bw_token) {
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ if (p->c_io_mode == RANGE_IO_MODE ||
+ p->c_io_mode == DEFAULT_IO_MODE) {
+ io_mode = 1;
+ } else {
+ io_mode = 0;
+ break;
+ }
+ }
+
+ if (io_mode && dp->g_io_mode == MINBW_IO_MODE)
+ dp->g_io_mode = LEFTOVER_IO_MODE;
+ }
+ }
+ }
+
+ if (gp->c_time_slice_end != 0 &&
+ time_after(now, gp->c_time_slice_end)) {
+ gp->c_time_slice_expired = TIME_SLICE_EXPIRED;
+ return R_YIELD;
+ }
+
+ return R_OK;
+}
+
+static int is_no_io_mode(struct ioband_group *gp)
+{
+ if (gp->c_io_mode == NO_IO_MODE)
+ return 1;
+
+ return 0;
+}
+
+/*
+ * Check if this group is able to receive a new bio.
+ * In the range-bw policy, we only check whether the ioband device should
+ * be blocked.
+ */
+static int range_bw_queue_full(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ unsigned long now, time_step;
+
+ if (is_no_io_mode(gp)) {
+ now = jiffies;
+ if (time_after(dp->g_next_time_period, now)) {
+ time_step = dp->g_next_time_period - now;
+ range_bw_timer_register(gp->c_timer,
+ (time_step + TIME_COMPENSATOR),
+ (unsigned long)gp);
+ wait_event_lock_irq(gp->c_max_bw_over_waitq,
+ !is_no_io_mode(gp),
+ dp->g_lock, do_nothing());
+ }
+ }
+
+ return (gp->c_blocked >= gp->c_limit);
+}
+
+/*
+ * Convert the bw value to a number of bw tokens
+ * bw : bandwidth in Kbytes
+ * token_base : the number of tokens used for one 1Kbyte-sized I/O
+ * -- Attention : currently, we support 512 bytes or 1 Kbyte per token
+ */
+static int convert_bw_to_token(int bw, int token_unit)
+{
+ int token;
+ int token_base;
+
+ token_base = (1 << token_unit) / 4;
+ token = bw * token_base;
+
+ return token;
+}
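+
+/*
+ * A worked example under the assumption of 4KB pages: the range-bw
+ * policy sets token_unit = PAGE_SHIFT - 9 = 3, so
+ * token_base = (1 << 3) / 4 = 2 and a 500 Kbytes/sec setting converts
+ * to 500 * 2 = 1000 tokens.
+ */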
+
+
+/*
+ * Allocate the time slice for MINBW_IO_MODE to each group
+ */
+static void range_bw_time_slice_init(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+
+ if (dp->g_min_bw_total == 0)
+ p->c_time_slice = time_slice_base;
+ else
+ p->c_time_slice = time_slice_base +
+ ((time_slice_base *
+ ((p->c_min_bw + p->c_max_bw) / 2)) /
+ dp->g_min_bw_total);
+ }
+}
+
+/*
+ * Allocate the range_bw and range_bw_token to the given group
+ */
+static void set_range_bw(struct ioband_group *gp, int new_min, int new_max)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+ int token_unit;
+
+ dp->g_min_bw_total += (new_min - gp->c_min_bw);
+ gp->c_min_bw = new_min;
+
+ dp->g_max_bw_total += (new_max - gp->c_max_bw);
+ gp->c_max_bw = new_max;
+
+ if (new_min)
+ gp->c_io_mode = MINBW_IO_MODE;
+ else
+ gp->c_io_mode = DEFAULT_IO_MODE;
+
+ range_bw_time_slice_init(gp);
+
+ token_unit = dp->g_token_unit;
+ gp->c_min_bw_token = convert_bw_to_token(new_min, token_unit);
+ dp->g_total_min_bw_token =
+ convert_bw_to_token(dp->g_min_bw_total, token_unit);
+
+ gp->c_max_bw_token = convert_bw_to_token(new_max, token_unit);
+
+ if (dp->g_min_bw_total == 0) {
+ list_for_each_entry(p, &dp->g_groups, c_list)
+ p->c_limit = 1;
+ } else {
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ p->c_limit = dp->g_io_limit * 2 * p->c_min_bw /
+ dp->g_min_bw_total / OVER_IO_RATE + 1;
+ }
+ }
+
+ return;
+}
+
+/*
+ * Allocate the min_bw and min_bw_token to the given group
+ */
+static void set_min_bw(struct ioband_group *gp, int new)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ struct ioband_group *p;
+ int token_unit;
+
+ dp->g_min_bw_total += (new - gp->c_min_bw);
+ gp->c_min_bw = new;
+
+ if (new)
+ gp->c_io_mode = MINBW_IO_MODE;
+ else
+ gp->c_io_mode = DEFAULT_IO_MODE;
+
+ range_bw_time_slice_init(gp);
+
+ token_unit = dp->g_token_unit;
+ gp->c_min_bw_token = convert_bw_to_token(gp->c_min_bw, token_unit);
+ dp->g_total_min_bw_token =
+ convert_bw_to_token(dp->g_min_bw_total, token_unit);
+
+ if (dp->g_min_bw_total == 0) {
+ list_for_each_entry(p, &dp->g_groups, c_list)
+ p->c_limit = 1;
+ } else {
+ list_for_each_entry(p, &dp->g_groups, c_list) {
+ p->c_limit = dp->g_io_limit * 2 * p->c_min_bw /
+ dp->g_min_bw_total / OVER_IO_RATE + 1;
+ }
+ }
+
+ return;
+}
+
+/*
+ * Allocate the max_bw and max_bw_token to the given group
+ */
+static void set_max_bw(struct ioband_group *gp, int new)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int token_unit;
+
+ token_unit = dp->g_token_unit;
+
+ dp->g_max_bw_total += (new - gp->c_max_bw);
+ gp->c_max_bw = new;
+ gp->c_max_bw_token = convert_bw_to_token(new, token_unit);
+
+ range_bw_time_slice_init(gp);
+
+ return;
+
+}
+
+static void init_range_bw_token_bucket(struct ioband_device *dp, int val)
+{
+ dp->g_token_bucket = (dp->g_io_limit * 2 * DEFAULT_BUCKET) <<
+ dp->g_token_unit;
+ if (!val)
+ val = DEFAULT_TOKENPOOL << dp->g_token_unit;
+ if (val < dp->g_token_bucket)
+ val = dp->g_token_bucket;
+ dp->g_carryover = val/dp->g_token_bucket;
+ dp->g_token_left = 0;
+}
+
+static int policy_range_bw_param(struct ioband_group *gp,
+ const char *cmd, const char *value)
+{
+ long val = 0, min_val = DEFAULT_MIN_BW, max_val = DEFAULT_MAX_BW;
+ int r = 0, err = 0;
+ char *endp;
+
+ if (value) {
+ min_val = simple_strtol(value, &endp, 0);
+ if (strchr(POLICY_PARAM_DELIM, *endp)) {
+ max_val = simple_strtol(endp + 1, &endp, 0);
+ if (*endp != '\0')
+ err++;
+ } else
+ err++;
+ }
+
+ if (!strcmp(cmd, "range-bw")) {
+ if (!err && 0 <= min_val &&
+ min_val <= (INT_MAX / 2) && 0 <= max_val &&
+ max_val <= (INT_MAX / 2) && min_val <= max_val)
+ set_range_bw(gp, min_val, max_val);
+ else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "min-bw")) {
+ if (!err && 0 <= val && val <= (INT_MAX / 2))
+ set_min_bw(gp, val);
+ else
+ r = -EINVAL;
+ } else if (!strcmp(cmd, "max-bw")) {
+ if ((!err && 0 <= val && val <= (INT_MAX / 2) &&
+ gp->c_min_bw <= val) || val == 0)
+ set_max_bw(gp, val);
+ else
+ r = -EINVAL;
+ } else {
+ r = -EINVAL;
+ }
+ return r;
+}
+
+static int policy_range_bw_ctr(struct ioband_group *gp, const char *arg)
+{
+ int ret;
+
+ init_waitqueue_head(&gp->c_max_bw_over_waitq);
+
+ gp->c_min_bw = 0;
+ gp->c_max_bw = 0;
+ gp->c_io_mode = DEFAULT_IO_MODE;
+ gp->c_time_slice_expired = TIME_SLICE_NOT_EXPIRED;
+ gp->c_min_bw_token = 0;
+ gp->c_max_bw_token = 0;
+ gp->c_consumed_min_bw_token = 0;
+ gp->c_is_over_max_bw = MAX_BW_UNDER;
+ gp->c_time_slice_start = 0;
+ gp->c_time_slice_end = 0;
+ gp->c_wait_p_count = 0;
+
+ gp->c_time_slice = time_slice_base;
+
+ gp->c_timer = kmalloc(sizeof(struct timer_list), GFP_KERNEL);
+ if (gp->c_timer == NULL)
+ return -EINVAL;
+ memset(gp->c_timer, 0, sizeof(struct timer_list));
+ gp->timer_set = 0;
+
+ ret = policy_range_bw_param(gp, "range-bw", arg);
+
+ return ret;
+}
+
+static void policy_range_bw_dtr(struct ioband_group *gp)
+{
+ struct ioband_device *dp = gp->c_banddev;
+
+ gp->c_time_slice = 0;
+ set_range_bw(gp, 0, 0);
+
+ dp->g_running_gp = NULL;
+
+ if (gp->c_timer != NULL) {
+ del_timer(gp->c_timer);
+ kfree(gp->c_timer);
+ }
+}
+
+static void policy_range_bw_show(struct ioband_group *gp, int *szp,
+ char *result, unsigned int maxlen)
+{
+ struct ioband_group *p;
+ struct ioband_device *dp = gp->c_banddev;
+ struct rb_node *node;
+ int sz = *szp; /* used in DMEMIT() */
+
+ DMEMIT(" %d :%d:%d", dp->g_token_bucket * dp->g_carryover,
+ gp->c_min_bw, gp->c_max_bw);
+
+ for (node = rb_first(&gp->c_group_root); node; node = rb_next(node)) {
+ p = rb_entry(node, struct ioband_group, c_group_node);
+ DMEMIT(" %d:%d:%d", p->c_id, p->c_min_bw, p->c_max_bw);
+ }
+ *szp = sz;
+}
+
+static int range_bw_prepare_token(struct ioband_group *gp,
+ struct bio *bio, int flag)
+{
+ struct ioband_device *dp = gp->c_banddev;
+ int unit;
+ int bio_count;
+ int token_count = 0;
+
+ unit = (1 << dp->g_token_unit);
+ bio_count = bio_sectors(bio);
+
+ if (unit == 8)
+ token_count = bio_count;
+ else if (unit == 4)
+ token_count = bio_count / 2;
+ else if (unit == 2)
+ token_count = bio_count / 4;
+ else if (unit == 1)
+ token_count = bio_count / 8;
+
+ return range_bw_consume_token(gp, token_count, flag);
+}
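+
+/*
+ * Example with assumed 4KB pages (g_token_unit = 3, unit = 8): a 4KB bio
+ * of 8 sectors consumes 8 tokens, which matches the 2-tokens-per-Kbyte
+ * rate used by convert_bw_to_token() above.
+ */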
+
+void range_bw_timer_register(struct timer_list *ptimer,
+ unsigned long timeover, unsigned long gp)
+{
+ struct ioband_group *group = (struct ioband_group *)gp;
+
+ if (group->timer_set == 0) {
+ init_timer(ptimer);
+ ptimer->expires = get_jiffies_64() + timeover;
+ ptimer->data = gp;
+ ptimer->function = range_bw_timeover;
+ add_timer(ptimer);
+ group->timer_set = 1;
+ }
+}
+
+/*
+ * Timer handler function to prevent processes from hanging under a
+ * low min-bw configuration
+ */
+void range_bw_timeover(unsigned long gp)
+{
+ struct ioband_group *group = (struct ioband_group *)gp;
+
+ if (group->c_is_over_max_bw == MAX_BW_OVER)
+ group->c_is_over_max_bw = MAX_BW_UNDER;
+
+ if (group->c_io_mode == NO_IO_MODE)
+ group->c_io_mode = MINBW_IO_MODE;
+
+ if (waitqueue_active(&group->c_max_bw_over_waitq))
+ wake_up_all(&group->c_max_bw_over_waitq);
+
+ group->timer_set = 0;
+}
+
+/*
+ * <Method> <description>
+ * g_can_submit : To determine whether a given group has the right to
+ * submit BIOs. The larger the return value the higher the
+ * priority to submit. Zero means it has no right.
+ * g_prepare_bio : Called right before submitting each BIO.
+ * g_restart_bios : Called if this ioband device has some BIOs blocked but none
+ * of them can be submitted now. This method has to
+ * reinitialize the data to restart to submit BIOs and return
+ * 0 or 1.
+ * The return value 0 means that it has become able to submit
+ * them now so that this ioband device will continue its work.
+ * The return value 1 means that it is still unable to submit
+ * them so that this device will stop its work. And this
+ * policy module has to reactivate the device when it gets
+ * to be able to submit BIOs.
+ * g_hold_bio : To hold a given BIO until it is submitted.
+ * The default function is used when this method is undefined.
+ * g_pop_bio : To select and get the best BIO to submit.
+ * g_group_ctr : To initialize the policy's own members of struct ioband_group.
+ * g_group_dtr : Called when struct ioband_group is removed.
+ * g_set_param : To update the policy's own data.
+ * The parameters can be passed through "dmsetup message"
+ * command.
+ * g_should_block : Called every time this ioband device receives a BIO.
+ * Return 1 if a given group can't receive any more BIOs,
+ * otherwise return 0.
+ * g_show : Show the configuration.
+ */
+
+int policy_range_bw_init(struct ioband_device *dp, int argc, char **argv)
+{
+ long val;
+ int r = 0;
+
+ if (argc < 1)
+ val = 0;
+ else {
+ r = strict_strtol(argv[0], 0, &val);
+ if (r || val < 0)
+ return -EINVAL;
+ }
+
+ dp->g_can_submit = has_right_to_issue;
+ dp->g_prepare_bio = range_bw_prepare_token;
+ dp->g_restart_bios = range_bw_restart_bios;
+ dp->g_group_ctr = policy_range_bw_ctr;
+ dp->g_group_dtr = policy_range_bw_dtr;
+ dp->g_set_param = policy_range_bw_param;
+ dp->g_should_block = range_bw_queue_full;
+ dp->g_show = policy_range_bw_show;
+
+ dp->g_min_bw_total = 0;
+ dp->g_running_gp = NULL;
+ dp->g_total_min_bw_token = 0;
+ dp->g_io_mode = MINBW_IO_MODE;
+ dp->g_consumed_min_bw_token = 0;
+ dp->g_current = NULL;
+ dp->g_next_time_period = 0;
+ dp->g_time_period_expired = TIME_SLICE_NOT_EXPIRED;
+
+ dp->g_token_unit = PAGE_SHIFT - 9;
+ init_range_bw_token_bucket(dp, val);
+
+ return 0;
+}
Index: linux-2.6.31-rc3/Documentation/device-mapper/range-bw.txt
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/Documentation/device-mapper/range-bw.txt
@@ -0,0 +1,99 @@
+Range-BW I/O controller by Dong-Jae Kang <[email protected]>
+
+
+1. Introduction
+===============
+
+The design of Range-BW is related to three other components: cgroup,
+bio-cgroup (or blkio-cgroup) and dm-ioband, and it is implemented as
+an additional controller for dm-ioband.
+The cgroup framework provides the process grouping mechanism,
+bio-cgroup is used to control delayed or non-direct I/O, and
+dm-ioband is an I/O controller that allocates proportional I/O
+bandwidth to process groups based on their priority.
+The proposed controller supports process group-based range
+bandwidth according to the priority or importance of the group. Range
+bandwidth means a predictable I/O bandwidth with minimum and maximum
+values defined by the administrator.
+
+The minimum I/O bandwidth should be guaranteed for stable performance
+or reliability of a specific service, and I/O bandwidth over the
+maximum should be throttled to protect the limited I/O resource from
+over-provisioning by unnecessary usage, or to reserve I/O bandwidth
+for another use.
+So Range-BW was implemented to cover both concepts: guaranteeing a
+minimum I/O requirement and limiting unnecessary bandwidth, depending
+on the group's priority.
+It is implemented as a device-mapper driver, like dm-ioband, so it is
+independent of the underlying I/O scheduler, for example CFQ, AS,
+NOOP, deadline and so on.
+
+* Attention
+Range-BW supports predictable I/O bandwidth, but it should be
+configured within the total I/O bandwidth of the I/O system to
+guarantee the minimum I/O requirement. For example, if the total I/O
+bandwidth is 40Mbytes/sec, the sum of the I/O bandwidths configured
+for the process groups should be equal to or smaller than
+40Mbytes/sec.
+So, we need to check the total I/O bandwidth before setting it up.
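+
+For example, with the configuration shown in section 3 below, the sum of
+the minimum bandwidths is 100 + 700 + 30000 = 30800 Kbytes/sec, so the
+underlying device must be able to sustain at least that much.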
+
+2. Setup and Installation
+=========================
+
+This part is the same as for dm-ioband; see
+../../Documentation/device-mapper/ioband.txt or
+http://sourceforge.net/apps/trac/ioband/wiki/dm-ioband/man/setup,
+except for the allocation of the range-bw values.
+
+3. Usage
+========
+
+It is useful to refer to the documentation for dm-ioband in
+../../Documentation/device-mapper/ioband.txt or
+http://sourceforge.net/apps/trac/ioband/wiki/dm-ioband, because
+Range-BW follows the basic semantics of dm-ioband.
+The following example shows a range-bw configuration.
+
+# mount the cgroup
+mount -t cgroup -o blkio none /root/cgroup/blkio
+
+# create the process groups (3 groups)
+mkdir /root/cgroup/blkio/bgroup1
+mkdir /root/cgroup/blkio/bgroup2
+mkdir /root/cgroup/blkio/bgroup3
+
+# create the ioband device ( name : ioband1 )
+echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none
+range-bw 0 :0:0" | dmsetup create ioband1
+: Attention - device name (/dev/sdb2) should be modified depending on
+your system
+
+# init ioband device ( type and policy )
+dmsetup message ioband1 0 type cgroup
+dmsetup message ioband1 0 policy range-bw
+
+# attach the groups to the ioband device
+dmsetup message ioband1 0 attach 2
+dmsetup message ioband1 0 attach 3
+dmsetup message ioband1 0 attach 4
+: the group numbers can be found in the blkio.id file of each group,
+e.g. /root/cgroup/blkio/bgroup1/blkio.id
+
+# allocate the values ( range-bw ) : XXX Kbytes
+: the sum of the minimum I/O bandwidths of all groups should be equal
+to or smaller than the total bandwidth supported by your system
+
+# range : about 100~500 Kbytes
+dmsetup message ioband1 0 range-bw 2:100:500
+
+# range : about 700~1000 Kbytes
+dmsetup message ioband1 0 range-bw 3:700:1000
+
+# range : about 30~35Mbytes
+dmsetup message ioband1 0 range-bw 4:30000:35000
+
+You can confirm the range-bw configuration by using the following command:
+[root@localhost range-bw]# dmsetup table --target ioband
+ioband1: 0 305235000 ioband 8:18 1 4 128 cgroup \
+ range-bw 16384 :0:0 2:100:500 3:700:1000 4:30000:35000
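+
+Finally, classify the processes to be controlled into the cgroups so
+that their I/O is charged to the intended group (a sketch; the tasks
+file is the standard cgroup interface and <pid> is a placeholder):
+
+# echo <pid> > /root/cgroup/blkio/bgroup1/tasks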
Index: linux-2.6.31-rc3/include/trace/events/dm-ioband.h
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/include/trace/events/dm-ioband.h
@@ -0,0 +1,242 @@
+#if !defined(_TRACE_DM_IOBAND_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DM_IOBAND_H
+
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dm-ioband
+
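+/*
+ * Note: once this header is built into the kernel, these events can
+ * typically be enabled at run time through the generic ftrace event
+ * interface, for example (a sketch; the exact path depends on where
+ * debugfs is mounted and on the TRACE_SYSTEM name above):
+ *
+ *   # echo 1 > /sys/kernel/debug/tracing/events/dm-ioband/enable
+ *   # cat /sys/kernel/debug/tracing/trace_pipe
+ */
+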
+TRACE_EVENT(ioband_hold_urgent_bio,
+
+ TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+ TP_ARGS(gp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, gp->c_banddev->g_name )
+ __field( int, c_id )
+ __field( int, g_blocked )
+ __field( int, c_blocked )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, gp->c_banddev->g_name);
+ __entry->c_id = gp->c_id;
+ __entry->g_blocked = gp->c_banddev->g_blocked;
+ __entry->c_blocked = gp->c_blocked;
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+ __get_str(g_name), __entry->c_id,
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_hold_bio,
+
+ TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+ TP_ARGS(gp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, gp->c_banddev->g_name )
+ __field( int, c_id )
+ __field( int, g_blocked )
+ __field( int, c_blocked )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, gp->c_banddev->g_name);
+ __entry->c_id = gp->c_id;
+ __entry->g_blocked = gp->c_banddev->g_blocked;
+ __entry->c_blocked = gp->c_blocked;
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+ __get_str(g_name), __entry->c_id,
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_make_pback_list,
+
+ TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+ TP_ARGS(gp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, gp->c_banddev->g_name )
+ __field( int, c_id )
+ __field( int, g_blocked )
+ __field( int, c_blocked )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, gp->c_banddev->g_name);
+ __entry->c_id = gp->c_id;
+ __entry->g_blocked = gp->c_banddev->g_blocked;
+ __entry->c_blocked = gp->c_blocked;
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+ __get_str(g_name), __entry->c_id,
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_make_issue_list,
+
+ TP_PROTO(struct ioband_group *gp, struct bio *bio),
+
+ TP_ARGS(gp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, gp->c_banddev->g_name )
+ __field( int, c_id )
+ __field( int, g_blocked )
+ __field( int, c_blocked )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, gp->c_banddev->g_name);
+ __entry->c_id = gp->c_id;
+ __entry->g_blocked = gp->c_banddev->g_blocked;
+ __entry->c_blocked = gp->c_blocked;
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s,%d: %d,%d %c %llu + %u %d %d",
+ __get_str(g_name), __entry->c_id,
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->g_blocked, __entry->c_blocked)
+);
+
+TRACE_EVENT(ioband_release_urgent_bios,
+
+ TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+ TP_ARGS(dp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, dp->g_name )
+ __field( int, g_blocked )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, dp->g_name);
+ __entry->g_blocked = dp->g_blocked;
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s: %d,%d %c %llu + %u %d",
+ __get_str(g_name),
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector, __entry->g_blocked)
+);
+
+TRACE_EVENT(ioband_make_request,
+
+ TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+ TP_ARGS(dp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, dp->g_name )
+ __field( int, c_id )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, dp->g_name);
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s: %d,%d %c %llu + %u",
+ __get_str(g_name),
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector)
+);
+
+TRACE_EVENT(ioband_pushback_bio,
+
+ TP_PROTO(struct ioband_device *dp, struct bio *bio),
+
+ TP_ARGS(dp, bio),
+
+ TP_STRUCT__entry(
+ __string( g_name, dp->g_name )
+ __field( dev_t, dev )
+ __field( sector_t, sector )
+ __field( unsigned int, nr_sector )
+ __field( char, rw )
+ ),
+
+ TP_fast_assign(
+ __assign_str(g_name, dp->g_name);
+ __entry->dev = bio->bi_bdev->bd_dev;
+ __entry->sector = bio->bi_sector;
+ __entry->nr_sector = bio->bi_size >> 9;
+ __entry->rw = (bio_data_dir(bio) == READ) ? 'R' : 'W';
+ ),
+
+ TP_printk("%s: %d,%d %c %llu + %u",
+ __get_str(g_name),
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rw,
+ (unsigned long long)__entry->sector,
+ __entry->nr_sector)
+);
+
+#endif /* _TRACE_DM_IOBAND_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

2009-07-21 14:12:28

by Ryo Tsuruta

[permalink] [raw]
Subject: [PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework

This patch makes the page_cgroup framework usable even when the
cgroup memory controller is not compiled in, so that blkio-cgroup
can use this framework without the memory controller.

Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>

---
include/linux/memcontrol.h | 6 ++++++
include/linux/mmzone.h | 4 ++--
include/linux/page_cgroup.h | 8 +++++---
init/Kconfig | 4 ++++
mm/Makefile | 3 ++-
mm/memcontrol.c | 6 ++++++
mm/page_cgroup.c | 3 +--
7 files changed, 26 insertions(+), 8 deletions(-)

Index: linux-2.6.31-rc3/include/linux/memcontrol.h
===================================================================
--- linux-2.6.31-rc3.orig/include/linux/memcontrol.h
+++ linux-2.6.31-rc3/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
* (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
*/

+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
@@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
Index: linux-2.6.31-rc3/include/linux/mmzone.h
===================================================================
--- linux-2.6.31-rc3.orig/include/linux/mmzone.h
+++ linux-2.6.31-rc3/include/linux/mmzone.h
@@ -605,7 +605,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -956,7 +956,7 @@ struct mem_section {

/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
Index: linux-2.6.31-rc3/include/linux/page_cgroup.h
===================================================================
--- linux-2.6.31-rc3.orig/include/linux/page_cgroup.h
+++ linux-2.6.31-rc3/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H

-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,11 @@
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct mem_cgroup *mem_cgroup;
struct list_head lru; /* per cgroup LRU list */
+#endif
};

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -83,7 +85,7 @@ static inline void unlock_page_cgroup(st
bit_spin_unlock(PCG_LOCK, &pc->flags);
}

-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
struct page_cgroup;

static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
Index: linux-2.6.31-rc3/init/Kconfig
===================================================================
--- linux-2.6.31-rc3.orig/init/Kconfig
+++ linux-2.6.31-rc3/init/Kconfig
@@ -614,6 +614,10 @@ config CGROUP_MEM_RES_CTLR_SWAP

endif # CGROUPS

+config CGROUP_PAGE
+ def_bool y
+ depends on CGROUP_MEM_RES_CTLR
+
config MM_OWNER
bool

Index: linux-2.6.31-rc3/mm/Makefile
===================================================================
--- linux-2.6.31-rc3.orig/mm/Makefile
+++ linux-2.6.31-rc3/mm/Makefile
@@ -39,6 +39,7 @@ else
obj-$(CONFIG_SMP) += allocpercpu.o
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
Index: linux-2.6.31-rc3/mm/memcontrol.c
===================================================================
--- linux-2.6.31-rc3.orig/mm/memcontrol.c
+++ linux-2.6.31-rc3/mm/memcontrol.c
@@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};

+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+ INIT_LIST_HEAD(&pc->lru);
+}
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
Index: linux-2.6.31-rc3/mm/page_cgroup.c
===================================================================
--- linux-2.6.31-rc3.orig/mm/page_cgroup.c
+++ linux-2.6.31-rc3/mm/page_cgroup.c
@@ -14,9 +14,8 @@ static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
- INIT_LIST_HEAD(&pc->lru);
+ __init_mem_page_cgroup(pc);
}
static unsigned long total_usage;

2009-07-21 14:13:28

by Ryo Tsuruta

[permalink] [raw]
Subject: [PATCH 4/9] blkio-cgroup-v9: Refactoring io-context initialization

This patch refactors io_context initialization.

Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>

---
block/blk-ioc.c | 30 +++++++++++++++++-------------
include/linux/iocontext.h | 1 +
2 files changed, 18 insertions(+), 13 deletions(-)

Index: linux-2.6.31-rc3/block/blk-ioc.c
===================================================================
--- linux-2.6.31-rc3.orig/block/blk-ioc.c
+++ linux-2.6.31-rc3/block/blk-ioc.c
@@ -84,24 +84,28 @@ void exit_io_context(void)
}
}

+void init_io_context(struct io_context *ioc)
+{
+ atomic_set(&ioc->refcount, 1);
+ atomic_set(&ioc->nr_tasks, 1);
+ spin_lock_init(&ioc->lock);
+ ioc->ioprio_changed = 0;
+ ioc->ioprio = 0;
+ ioc->last_waited = jiffies; /* doesn't matter... */
+ ioc->nr_batch_requests = 0; /* because this is 0 */
+ ioc->aic = NULL;
+ INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+ INIT_HLIST_HEAD(&ioc->cic_list);
+ ioc->ioc_data = NULL;
+}
+
struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
{
struct io_context *ret;

ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
- if (ret) {
- atomic_long_set(&ret->refcount, 1);
- atomic_set(&ret->nr_tasks, 1);
- spin_lock_init(&ret->lock);
- ret->ioprio_changed = 0;
- ret->ioprio = 0;
- ret->last_waited = jiffies; /* doesn't matter... */
- ret->nr_batch_requests = 0; /* because this is 0 */
- ret->aic = NULL;
- INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
- INIT_HLIST_HEAD(&ret->cic_list);
- ret->ioc_data = NULL;
- }
+ if (ret)
+ init_io_context(ret);

return ret;
}
Index: linux-2.6.31-rc3/include/linux/iocontext.h
===================================================================
--- linux-2.6.31-rc3.orig/include/linux/iocontext.h
+++ linux-2.6.31-rc3/include/linux/iocontext.h
@@ -104,6 +104,7 @@ int put_io_context(struct io_context *io
void exit_io_context(void);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
void copy_io_context(struct io_context **pdst, struct io_context **psrc);
#else
static inline void exit_io_context(void)

2009-07-21 14:14:10

by Ryo Tsuruta

[permalink] [raw]
Subject: [PATCH 5/9] blkio-cgroup-v9: The body of blkio-cgroup

The body of blkio-cgroup.

Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>

---
include/linux/biotrack.h | 97 +++++++++++++
include/linux/cgroup_subsys.h | 6
include/linux/page_cgroup.h | 23 +++
init/Kconfig | 13 +
mm/Makefile | 1
mm/biotrack.c | 300 ++++++++++++++++++++++++++++++++++++++++++
mm/page_cgroup.c | 20 +-
7 files changed, 451 insertions(+), 9 deletions(-)

Index: linux-2.6.31-rc3/include/linux/biotrack.h
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/include/linux/biotrack.h
@@ -0,0 +1,97 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+ struct cgroup_subsys_state css;
+ struct io_context *io_context; /* default io_context */
+/* struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc: page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, 0);
+ unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_disabled() - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+ if (blkio_cgroup_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *blkio_cgroup_lookup(int id);
+
+#else /* !CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+ return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+ return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ return 0;
+}
+
+#endif /* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
Index: linux-2.6.31-rc3/include/linux/cgroup_subsys.h
===================================================================
--- linux-2.6.31-rc3.orig/include/linux/cgroup_subsys.h
+++ linux-2.6.31-rc3/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)

/* */

+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
Index: linux-2.6.31-rc3/include/linux/page_cgroup.h
===================================================================
--- linux-2.6.31-rc3.orig/include/linux/page_cgroup.h
+++ linux-2.6.31-rc3/include/linux/page_cgroup.h
@@ -140,4 +140,27 @@ static inline void swap_cgroup_swapoff(i
}

#endif
+
+#ifdef CONFIG_CGROUP_BLKIO
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PCG_TRACKING_ID_SHIFT (16)
+#define PCG_TRACKING_ID_BITS \
+ (8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with page_cgroup() held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+ return pc->flags >> PCG_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with page_cgroup() held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+ WARN_ON(id >= (1UL << PCG_TRACKING_ID_BITS));
+ pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;
+ pc->flags |= (unsigned long)(id << PCG_TRACKING_ID_SHIFT);
+}
+#endif
#endif
Index: linux-2.6.31-rc3/init/Kconfig
===================================================================
--- linux-2.6.31-rc3.orig/init/Kconfig
+++ linux-2.6.31-rc3/init/Kconfig
@@ -614,9 +614,20 @@ config CGROUP_MEM_RES_CTLR_SWAP

endif # CGROUPS

+config CGROUP_BLKIO
+ bool "Block I/O cgroup subsystem"
+ depends on CGROUPS && BLOCK
+ select MM_OWNER
+ help
+ Provides a resource controller which makes it possible to track
+ the owner of every block I/O request.
+ The information this subsystem provides can be used by any
+ kind of module, such as the dm-ioband device-mapper module or
+ the CFQ scheduler.
+
config CGROUP_PAGE
def_bool y
- depends on CGROUP_MEM_RES_CTLR
+ depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO

config MM_OWNER
bool
Index: linux-2.6.31-rc3/mm/biotrack.c
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/mm/biotrack.c
@@ -0,0 +1,300 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <[email protected]>
+ *
+ * Copyright (C) 2008 Andrea Righi <[email protected]>
+ * Use part of page_cgroup->flags to store blkio-cgroup ID.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+ .io_context = &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+ struct blkio_cgroup *biog;
+ struct page_cgroup *pc;
+ unsigned long id;
+
+ if (blkio_cgroup_disabled())
+ return;
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, 0); /* 0: default blkio_cgroup id */
+ unlock_page_cgroup(pc);
+ if (!mm)
+ return;
+
+ rcu_read_lock();
+ biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!biog)) {
+ rcu_read_unlock();
+ return;
+ }
+ /*
+ * css_get(&bio->css) isn't called to increment the reference
+ * count of this blkio_cgroup "biog" so the css_id might turn
+ * invalid even if this page is still active.
+ * This approach is chosen to minimize the overhead.
+ */
+ id = css_id(&biog->css);
+ rcu_read_unlock();
+ lock_page_cgroup(pc);
+ page_cgroup_set_id(pc, id);
+ unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+ blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+ if (!page_is_file_cache(page))
+ return;
+ if (current->flags & PF_MEMALLOC)
+ return;
+
+ blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage: the page where we want to copy the owner
+ * @opage: the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+ unsigned long id;
+
+ if (blkio_cgroup_disabled())
+ return;
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return;
+
+ lock_page_cgroup(opc);
+ lock_page_cgroup(npc);
+ id = page_cgroup_get_id(opc);
+ page_cgroup_set_id(npc, id);
+ unlock_page_cgroup(npc);
+ unlock_page_cgroup(opc);
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio_cgroup *biog;
+ struct io_context *ioc;
+
+ if (!cgrp->parent) {
+ biog = &default_blkio_cgroup;
+ init_io_context(biog->io_context);
+ /* Increment the reference count so that it is never released. */
+ atomic_inc(&biog->io_context->refcount);
+ return &biog->css;
+ }
+
+ biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+ if (!biog)
+ return ERR_PTR(-ENOMEM);
+ ioc = alloc_io_context(GFP_KERNEL, -1);
+ if (!ioc) {
+ kfree(biog);
+ return ERR_PTR(-ENOMEM);
+ }
+ biog->io_context = ioc;
+ return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+ put_io_context(biog->io_context);
+ free_css_id(&blkio_cgroup_subsys, &biog->css);
+ kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ struct page_cgroup *pc;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ unsigned long id = 0;
+
+ pc = lookup_page_cgroup(page);
+ if (pc) {
+ lock_page_cgroup(pc);
+ id = page_cgroup_get_id(pc);
+ unlock_page_cgroup(pc);
+ }
+ return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio: the &struct bio which describe the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+ struct cgroup_subsys_state *css;
+ struct blkio_cgroup *biog;
+ struct io_context *ioc;
+ unsigned long id;
+
+ id = get_blkio_cgroup_id(bio);
+ rcu_read_lock();
+ css = css_lookup(&blkio_cgroup_subsys, id);
+ if (css)
+ biog = container_of(css, struct blkio_cgroup, css);
+ else
+ biog = &default_blkio_cgroup;
+ ioc = biog->io_context; /* default io_context for this cgroup */
+ atomic_inc(&ioc->refcount);
+ rcu_read_unlock();
+ return ioc;
+}
+
+/**
+ * blkio_cgroup_lookup() - lookup a cgroup by blkio-cgroup ID
+ * @id: blkio-cgroup ID
+ *
+ * Returns the cgroup associated with the specified ID, or NULL if lookup
+ * fails.
+ *
+ * Note:
+ * This function should be called under rcu_read_lock().
+ */
+struct cgroup *blkio_cgroup_lookup(int id)
+{
+ struct cgroup *cgrp;
+ struct cgroup_subsys_state *css;
+
+ if (blkio_cgroup_disabled())
+ return NULL;
+
+ css = css_lookup(&blkio_cgroup_subsys, id);
+ if (!css)
+ return NULL;
+ cgrp = css->cgroup;
+ return cgrp;
+}
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(blkio_cgroup_lookup);
+
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+ unsigned long id;
+
+ rcu_read_lock();
+ id = css_id(&biog->css);
+ rcu_read_unlock();
+ return (u64)id;
+}
+
+
+static struct cftype blkio_files[] = {
+ {
+ .name = "id",
+ .read_u64 = blkio_id_read,
+ },
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, blkio_files,
+ ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+ .name = "blkio",
+ .create = blkio_cgroup_create,
+ .destroy = blkio_cgroup_destroy,
+ .populate = blkio_cgroup_populate,
+ .subsys_id = blkio_cgroup_subsys_id,
+ .use_id = 1,
+};
Index: linux-2.6.31-rc3/mm/page_cgroup.c
===================================================================
--- linux-2.6.31-rc3.orig/mm/page_cgroup.c
+++ linux-2.6.31-rc3/mm/page_cgroup.c
@@ -9,6 +9,7 @@
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
#include <linux/swapops.h>
+#include <linux/biotrack.h>

static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -16,6 +17,7 @@ __init_page_cgroup(struct page_cgroup *p
pc->flags = 0;
pc->page = pfn_to_page(pfn);
__init_mem_page_cgroup(pc);
+ __init_blkio_page_cgroup(pc);
}
static unsigned long total_usage;

@@ -73,7 +75,7 @@ void __init page_cgroup_init_flatmem(voi

int nid, fail;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;

for_each_online_node(nid) {
@@ -82,12 +84,13 @@ void __init page_cgroup_init_flatmem(voi
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
- " don't want memory cgroups\n");
+ printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+ " if you don't want memory and blkio cgroups\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup failed.\n");
- printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+ printk(KERN_CRIT
+ "please try 'cgroup_disable=memory,blkio' boot option\n");
panic("Out of memory");
}

@@ -244,7 +247,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int fail = 0;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;

for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -253,14 +256,15 @@ void __init page_cgroup_init(void)
fail = init_section_page_cgroup(pfn);
}
if (fail) {
- printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+ printk(KERN_CRIT
+ "try 'cgroup_disable=memory,blkio' boot option\n");
panic("Out of memory");
} else {
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
- " want memory cgroups\n");
+ printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+ " if you don't want memory and blkio cgroups\n");
}

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
Index: linux-2.6.31-rc3/mm/Makefile
===================================================================
--- linux-2.6.31-rc3.orig/mm/Makefile
+++ linux-2.6.31-rc3/mm/Makefile
@@ -41,5 +41,6 @@ endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o

2009-07-21 14:22:17

by Ryo Tsuruta

[permalink] [raw]
Subject: [PATCH 6/9] blkio-cgroup-v9: The document of blkio-cgroup

The document of blkio-cgroup.

Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>

---
Documentation/cgroups/00-INDEX | 2
Documentation/cgroups/blkio.txt | 289 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 291 insertions(+)

Index: linux-2.6.31-rc3/Documentation/cgroups/00-INDEX
===================================================================
--- linux-2.6.31-rc3.orig/Documentation/cgroups/00-INDEX
+++ linux-2.6.31-rc3/Documentation/cgroups/00-INDEX
@@ -16,3 +16,5 @@ memory.txt
- Memory Resource Controller; design, accounting, interface, testing.
resource_counter.txt
- Resource Counter API.
+blkio.txt
+ - Block I/O Tracking; description, interface and examples.
Index: linux-2.6.31-rc3/Documentation/cgroups/blkio.txt
===================================================================
--- /dev/null
+++ linux-2.6.31-rc3/Documentation/cgroups/blkio.txt
@@ -0,0 +1,289 @@
+Block I/O Cgroup
+
+1. Overview
+
+Using this feature the owners of any type of I/O can be determined.
+This allows dm-ioband to control block I/O bandwidth even when it is
+accepting delayed write requests. dm-ioband can find the cgroup of
+each request. It is also possible for others working on I/O
+bandwidth throttling to use this functionality to control asynchronous
+I/O with a little enhancement.
+
+2. Setting up blkio-cgroup
+
+Note: If dm-ioband is to be used with blkio-cgroup, then the dm-ioband
+patch needs to be applied first.
+
+The following kernel config options are required.
+
+CONFIG_CGROUPS=y
+CONFIG_CGROUP_BLKIO=y
+
+Selecting the options for the cgroup memory subsystem is also recommended
+as it makes it possible to give some I/O bandwidth and memory to a selected
+cgroup to control delayed write requests. The amount of dirty pages is
+limited within the cgroup even if the allocated bandwidth is narrow.
+
+CONFIG_RESOURCE_COUNTERS=y
+CONFIG_CGROUP_MEM_RES_CTLR=y
+
+3. User interface
+
+3.1 Mounting the cgroup filesystem
+
+First, mount the cgroup filesystem in order to enable observation and
+modification of the blkio-cgroup settings.
+
+# mount -t cgroup -o blkio none /cgroup
+
+3.2 The blkio.id file
+
+After mounting the cgroup filesystem the blkio.id file will be visible
+in the cgroup directory. This file contains a unique ID number for
+each cgroup. When an I/O operation starts, blkio-cgroup records the
+ID number of the owning cgroup in the page cgroup of the page. The
+cgroup of an I/O can then be determined by retrieving the ID number
+from the page cgroup, because the page cgroup is associated with the
+page involved in the I/O.
+
+If the dm-ioband support patch was applied then the blkio.devices and
+blkio.settings files will also be present.
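+
+A minimal sketch of creating a cgroup and classifying a task into it
+(assuming the cgroup filesystem is mounted on /cgroup as in 3.1; the
+tasks file is the standard cgroup interface):
+
+# mkdir /cgroup/1
+# cat /cgroup/1/blkio.id
+# echo $$ > /cgroup/1/tasks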
+
+4. Using dm-ioband and blkio-cgroup
+
+This section describes how to set up dm-ioband and blkio-cgroup in
+order to control bandwidth on a per cgroup per logical volume basis.
+The example used in this section assumes that there are two LVM volume
+groups on individual hard disks and two logical volumes on each volume
+group.
+
+ Table. LVM configurations
+
+ --------------------------------------------------------------
+ | LVM volume groups | vg0 on /dev/sda | vg1 on /dev/sdb |
+ |----------------------|-------------------|-------------------|
+ | LVM logical volume | lv0 | lv1 | lv0 | lv1 |
+ --------------------------------------------------------------
+
+4.1. Creating a dm-ioband logical device
+
+A dm-ioband logical device needs to be created and stacked on the
+device that is to bandwidth controlled. In this example the dm-ioband
+logical devices are stacked on each of the existing LVM logical
+volumes. By using the LVM facilities there is no need to unmount any
+logical volumes, even in the case of a volume being used as the root
+device. The following script is an example of how to stack and remove
+dm-ioband devices.
+
+==================== cut here (ioband.sh) ====================
+#!/bin/sh
+#
+# NOTE: You must run "ioband.sh stop" to restore the device-mapper
+# settings before changing logical volume settings, such as activate,
+# rename, resize and so on. These constraints would be eliminated by
+# enhancing LVM tools to support dm-ioband.
+
+logvols="vg0-lv0 vg0-lv1 vg1-lv0 vg1-lv1"
+
+start()
+{
+ for lv in $logvols; do
+ volgrp=${lv%%-*}
+ orig=${lv}-orig
+
+ # clone an existing logical volume.
+ /sbin/dmsetup table $lv | /sbin/dmsetup create $orig
+
+ # stack a dm-ioband device on the clone.
+ size=$(/sbin/blockdev --getsize /dev/mapper/$orig)
+ cat<<-EOM | /sbin/dmsetup load ${lv}
+ 0 $size ioband /dev/mapper/${orig} ${volgrp} 0 0 cgroup weight 0 :100
+ EOM
+
+ # activate the new setting.
+ /sbin/dmsetup resume $lv
+ done
+}
+
+stop()
+{
+ for lv in $logvols; do
+ orig=${lv}-orig
+
+ # restore the original setting.
+ /sbin/dmsetup table $orig | /sbin/dmsetup load $lv
+
+ # activate the new setting.
+ /sbin/dmsetup resume $lv
+
+ # remove the clone.
+ /sbin/dmsetup remove $orig
+ done
+}
+
+case "$1" in
+ start)
+ start
+ ;;
+ stop)
+ stop
+ ;;
+esac
+exit 0
+==================== cut here (ioband.sh) ====================
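+
+Assuming the script above is saved as ioband.sh, it is run as root,
+for example:
+
+# sh ioband.sh start
+
+and the original device-mapper settings can be restored later with
+"sh ioband.sh stop".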
+
+The following diagram shows how dm-ioband devices are stacked on and
+removed from the logical volumes.
+
+ Figure. stacking and removing dm-ioband devices
+
+ run "ioband.sh start"
+ ===>
+
+ ----------------------- -----------------------
+ | lv0 | lv1 | | lv0 | lv1 |
+ |(dm-linear)|(dm-linear)| |(dm-ioband)|(dm-ioband)|
+ |-----------------------| |-----------------------|
+ | vg0 | | lv0-orig | lv1-orig |
+ ----------------------- |(dm-linear)|(dm-linear)|
+ |-----------------------|
+ | vg0 |
+ -----------------------
+ <===
+ run "ioband.sh stop"
+
+After creating the dm-ioband devices, the settings can be observed by
+reading the blkio.devices file.
+
+# cat /cgroup/blkio.devices
+vg0 policy=weight io_throttle=4 io_limit=192 token=768 carryover=2
+ vg0-lv0
+ vg0-lv1
+vg1 policy=weight io_throttle=4 io_limit=192 token=768 carryover=2
+ vg1-lv0
+ vg1-lv1
+
+The first field in the first line is the symbolic name for an ioband
+device group, and the subsequent fields are settings for the ioband
+device group. The settings can be changed by writing to the
+blkio.devices, for example:
+
+# echo vg1 policy range-bw > /cgroup/blkio.devices
+
+Please refer to Documentation/device-mapper/ioband.txt which describes the
+details of the ioband device group settings.
+
+The second and the third indented lines "vg0-lv0" and "vg0-lv1" are
+the names of the dm-ioband devices that belong to the ioband device
+group. Typically, dm-ioband devices that reside on the same hard disk
+should belong to the same ioband device group in order to share the
+bandwidth of the hard disk.
+
+dm-ioband is not restricted to working with LVM, it may work in
+conjunction with any type of block device. Please refer to
+Documentation/device-mapper/ioband.txt for more details.
+
+4.2 Setting up dm-ioband through the blkio-cgroup interface
+
+The following table shows the given settings for this example. The
+bandwidth will be assigned on a per cgroup per logical volume basis.
+
+ Table. Settings for each cgroup
+
+ --------------------------------------------------------------
+ | LVM volume groups | vg0 on /dev/sda | vg1 on /dev/sdb |
+ |----------------------|-------------------|-------------------|
+ | LVM logical volume | lv0 | lv1 | lv0 | lv1 |
+ |----------------------|-------------------|-------------------|
+ | bandwidth control | relative | absolute |
+ | policy | weight | bandwidth limit |
+ |----------------------|-------------------|-------------------|
+ | unit | weight value (*1) | throughput [KB/s] |
+ |----------------------|-------------------|-------------------|
+ | settings for cgroup1 | 40 (16) | 90 (36) | 400 | 900 |
+ |----------------------|---------|---------|---------|---------|
+ | settings for cgroup2 | 20 (8) | 60 (24) | 200 | 600 |
+ |----------------------|---------|---------|---------|---------|
+ | for other cgroups | 10 (4) | 30 (12) | 100 | 300 |
+ --------------------------------------------------------------
+
+ *1: The values enclosed in () denote the preceding weight
+ as a percentage of the total weight. The bandwidth of
+ vg0 is distributed proportional to the total weight.
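+ For example, the total weight on vg0 is 40 + 90 + 20 + 60 + 10
+ + 30 = 250, so cgroup1's weight of 40 for lv0 corresponds to
+ 40/250 = 16%.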
+
+The set-up is described step-by-step below.
+
+1) Create new cgroups using the mkdir command
+
+# mkdir /cgroup/1
+# mkdir /cgroup/2
+
+2) Set bandwidth control policy on each ioband device group
+
+The set-up of bandwidth control policy is done by writing to
+blkio.devices file.
+
+# echo vg0 policy weight > /cgroup/blkio.devices
+# echo vg1 policy range-bw > /cgroup/blkio.devices
+
+3) Set up the root cgroup
+
+The root cgroup represents the default blkio-cgroup. If an I/O is
+performed by a process in a cgroup and the cgroup is not set up by
+blkio-cgroup, the I/O is charged to the root cgroup.
+
+The set-up of the root cgroup is done by writing to blkio.settings
+file in the cgroup's root directory. The following commands write
+the settings of each logical volume to that file.
+
+# echo vg0-lv0 10 > /cgroup/blkio.settings
+# echo vg0-lv1 30 > /cgroup/blkio.settings
+# echo vg1-lv0 100:100 > /cgroup/blkio.settings
+# echo vg1-lv1 300:300 > /cgroup/blkio.settings
+
+The settings can be verified by reading the blkio.settings file.
+
+# cat /cgroup/blkio.settings
+vg0-lv0 weight=10
+vg0-lv1 weight=30
+vg1-lv0 range-bw=100:100
+vg1-lv1 range-bw=300:300
+
+4) Set up cgroup1 and cgroup2
+
+New cgroups are set up in the same manner as the root cgroup.
+
+Settings for cgroup1
+# echo vg0-lv0 40 > /cgroup/1/blkio.settings
+# echo vg0-lv1 90 > /cgroup/1/blkio.settings
+# echo vg1-lv0 400:400 > /cgroup/1/blkio.settings
+# echo vg1-lv1 900:900 > /cgroup/1/blkio.settings
+
+Settings for cgroup2
+# echo vg0-lv0 20 > /cgroup/2/blkio.settings
+# echo vg0-lv1 60 > /cgroup/2/blkio.settings
+# echo vg1-lv0 200:200 > /cgroup/2/blkio.settings
+# echo vg1-lv1 600:600 > /cgroup/2/blkio.settings
+
+Again, the settings can be verified by reading the appropriate
+blkio.settings file.
+
+# cat /cgroup/1/blkio.settings
+vg0-lv0 weight=40
+vg0-lv1 weight=90
+vg1-lv0 range-bw=400:400
+vg1-lv1 range-bw=900:900
+
+If only the logical volume name is specified, the entry for the
+logical volume is removed.
+
+# echo vg1-lv1 > /cgroup/1/blkio.settings
+# cat /cgroup/1/blkio.settings
+vg0-lv0 weight=40
+vg0-lv1 weight=90
+vg1-lv0 range-bw=400:400
+
+5. Contact
+
+Linux Block I/O Bandwidth Control Project
+http://sourceforge.net/projects/ioband/

2009-07-21 14:23:30

by Ryo Tsuruta

[permalink] [raw]
Subject: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

This patch contains several hooks that let the blkio-cgroup framework know
which blkio-cgroup is the owner of a page before starting I/O against the page.

Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>

---
fs/buffer.c | 2 ++
fs/direct-io.c | 2 ++
mm/bounce.c | 2 ++
mm/filemap.c | 2 ++
mm/memory.c | 5 +++++
mm/page-writeback.c | 2 ++
mm/swap_state.c | 2 ++
7 files changed, 17 insertions(+)

Index: linux-2.6.31-rc3/fs/buffer.c
===================================================================
--- linux-2.6.31-rc3.orig/fs/buffer.c
+++ linux-2.6.31-rc3/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/biotrack.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
Index: linux-2.6.31-rc3/fs/direct-io.c
===================================================================
--- linux-2.6.31-rc3.orig/fs/direct-io.c
+++ linux-2.6.31-rc3/fs/direct-io.c
@@ -33,6 +33,7 @@
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
+#include <linux/biotrack.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
ret = PTR_ERR(page);
goto out;
}
+ blkio_cgroup_reset_owner(page, current->mm);

while (block_in_page < blocks_per_page) {
unsigned offset_in_page = block_in_page << blkbits;
Index: linux-2.6.31-rc3/mm/bounce.c
===================================================================
--- linux-2.6.31-rc3.orig/mm/bounce.c
+++ linux-2.6.31-rc3/mm/bounce.c
@@ -13,6 +13,7 @@
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
+#include <linux/biotrack.h>
#include <asm/tlbflush.h>

#include <trace/events/block.h>
@@ -210,6 +211,7 @@ static void __blk_queue_bounce(struct re
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ blkio_cgroup_copy_owner(to->bv_page, page);

if (rw == WRITE) {
char *vto, *vfrom;
Index: linux-2.6.31-rc3/mm/filemap.c
===================================================================
--- linux-2.6.31-rc3.orig/mm/filemap.c
+++ linux-2.6.31-rc3/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
#include "internal.h"

@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
+ blkio_cgroup_set_owner(page, current->mm);

error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
Index: linux-2.6.31-rc3/mm/memory.c
===================================================================
--- linux-2.6.31-rc3.orig/mm/memory.c
+++ linux-2.6.31-rc3/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mmu_notifier.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
@@ -2115,6 +2116,7 @@ gotten:
*/
ptep_clear_flush_notify(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
+ blkio_cgroup_set_owner(new_page, mm);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
if (old_page) {
@@ -2580,6 +2582,7 @@ static int do_swap_page(struct mm_struct
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ blkio_cgroup_reset_owner(page, mm);
/* It's better to call commit-charge after rmap is established */
mem_cgroup_commit_charge_swapin(page, ptr);

@@ -2644,6 +2647,7 @@ static int do_anonymous_page(struct mm_s
goto release;
inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
set_pte_at(mm, address, page_table, entry);

/* No need to invalidate - it was non-present before */
@@ -2791,6 +2795,7 @@ static int __do_fault(struct mm_struct *
if (anon) {
inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(page);
Index: linux-2.6.31-rc3/mm/page-writeback.c
===================================================================
--- linux-2.6.31-rc3.orig/mm/page-writeback.c
+++ linux-2.6.31-rc3/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1247,6 +1248,7 @@ int __set_page_dirty_nobuffers(struct pa
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
Index: linux-2.6.31-rc3/mm/swap_state.c
===================================================================
--- linux-2.6.31-rc3.orig/mm/swap_state.c
+++ linux-2.6.31-rc3/mm/swap_state.c
@@ -18,6 +18,7 @@
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
+#include <linux/biotrack.h>

#include <asm/pgtable.h>

@@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_e
*/
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
+ blkio_cgroup_set_owner(new_page, current->mm);
err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (likely(!err)) {
/*

2009-07-21 14:24:07

by Ryo Tsuruta

[permalink] [raw]
Subject: [PATCH 8/9] blkio-cgroup-v9: Fast page tracking

This is an extra patch which reduces the overhead of IO tracking but
increases the size of struct page_cgroup.

Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>

---
include/linux/biotrack.h | 5 -
include/linux/page_cgroup.h | 26 --------
mm/biotrack.c | 138 ++++++++++++++++++++++++++------------------
3 files changed, 87 insertions(+), 82 deletions(-)

Index: linux-2.6.31-rc3/mm/biotrack.c
===================================================================
--- linux-2.6.31-rc3.orig/mm/biotrack.c
+++ linux-2.6.31-rc3/mm/biotrack.c
@@ -3,9 +3,6 @@
* Copyright (C) VA Linux Systems Japan, 2008-2009
* Developed by Hirokazu Takahashi <[email protected]>
*
- * Copyright (C) 2008 Andrea Righi <[email protected]>
- * Use part of page_cgroup->flags to store blkio-cgroup ID.
- *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -20,6 +17,7 @@
#include <linux/module.h>
#include <linux/smp.h>
#include <linux/bit_spinlock.h>
+#include <linux/idr.h>
#include <linux/blkdev.h>
#include <linux/biotrack.h>
#include <linux/mm_inline.h>
@@ -45,8 +43,11 @@ static inline struct blkio_cgroup *blkio
struct blkio_cgroup, css);
}

+static struct idr blkio_cgroup_id;
+static DEFINE_SPINLOCK(blkio_cgroup_idr_lock);
static struct io_context default_blkio_io_context;
static struct blkio_cgroup default_blkio_cgroup = {
+ .id = 0,
.io_context = &default_blkio_io_context,
};

@@ -61,7 +62,6 @@ void blkio_cgroup_set_owner(struct page
{
struct blkio_cgroup *biog;
struct page_cgroup *pc;
- unsigned long id;

if (blkio_cgroup_disabled())
return;
@@ -69,29 +69,27 @@ void blkio_cgroup_set_owner(struct page
if (unlikely(!pc))
return;

- lock_page_cgroup(pc);
- page_cgroup_set_id(pc, 0); /* 0: default blkio_cgroup id */
- unlock_page_cgroup(pc);
+ pc->blkio_cgroup_id = 0; /* 0: default blkio_cgroup id */
if (!mm)
return;

+ /*
+ * Locking "pc" isn't necessary here since the current process is
+ * the only one that can access the members related to blkio_cgroup.
+ */
rcu_read_lock();
biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
- if (unlikely(!biog)) {
- rcu_read_unlock();
- return;
- }
+ if (unlikely(!biog))
+ goto out;
/*
* css_get(&bio->css) isn't called to increment the reference
* count of this blkio_cgroup "biog" so the css_id might turn
* invalid even if this page is still active.
* This approach is chosen to minimize the overhead.
*/
- id = css_id(&biog->css);
+ pc->blkio_cgroup_id = biog->id;
+out:
rcu_read_unlock();
- lock_page_cgroup(pc);
- page_cgroup_set_id(pc, id);
- unlock_page_cgroup(pc);
}

/**
@@ -103,6 +101,13 @@ void blkio_cgroup_set_owner(struct page
*/
void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
{
+ /*
+ * A little trick:
+ * Just call blkio_cgroup_set_owner() for pages which are already
+ * active since the blkio_cgroup_id member of page_cgroup can be
+ * updated without any locks. This is because an integer type of
+ * variable can be set a new value at once on modern cpus.
+ */
blkio_cgroup_set_owner(page, mm);
}

@@ -133,7 +138,6 @@ void blkio_cgroup_reset_owner_pagedirty(
void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
{
struct page_cgroup *npc, *opc;
- unsigned long id;

if (blkio_cgroup_disabled())
return;
@@ -144,12 +148,11 @@ void blkio_cgroup_copy_owner(struct page
if (unlikely(!opc))
return;

- lock_page_cgroup(opc);
- lock_page_cgroup(npc);
- id = page_cgroup_get_id(opc);
- page_cgroup_set_id(npc, id);
- unlock_page_cgroup(npc);
- unlock_page_cgroup(opc);
+ /*
+ * Do this without any locks. The reason is the same as
+ * blkio_cgroup_reset_owner().
+ */
+ npc->blkio_cgroup_id = opc->blkio_cgroup_id;
}

/* Create a new blkio-cgroup. */
@@ -158,25 +161,44 @@ blkio_cgroup_create(struct cgroup_subsys
{
struct blkio_cgroup *biog;
struct io_context *ioc;
+ int ret;

if (!cgrp->parent) {
biog = &default_blkio_cgroup;
init_io_context(biog->io_context);
 /* Increment the reference count so that it is never released. */
atomic_inc(&biog->io_context->refcount);
+ idr_init(&blkio_cgroup_id);
return &biog->css;
}

biog = kzalloc(sizeof(*biog), GFP_KERNEL);
- if (!biog)
- return ERR_PTR(-ENOMEM);
ioc = alloc_io_context(GFP_KERNEL, -1);
- if (!ioc) {
- kfree(biog);
- return ERR_PTR(-ENOMEM);
+ if (!ioc || !biog) {
+ ret = -ENOMEM;
+ goto out_err;
}
biog->io_context = ioc;
+retry:
+ if (!idr_pre_get(&blkio_cgroup_id, GFP_KERNEL)) {
+ ret = -EAGAIN;
+ goto out_err;
+ }
+ spin_lock_irq(&blkio_cgroup_idr_lock);
+ ret = idr_get_new_above(&blkio_cgroup_id, (void *)biog, 1, &biog->id);
+ spin_unlock_irq(&blkio_cgroup_idr_lock);
+ if (ret == -EAGAIN)
+ goto retry;
+ else if (ret)
+ goto out_err;
+
return &biog->css;
+out_err:
+ if (biog)
+ kfree(biog);
+ if (ioc)
+ put_io_context(ioc);
+ return ERR_PTR(ret);
}

/* Delete the blkio-cgroup. */
@@ -185,10 +207,28 @@ static void blkio_cgroup_destroy(struct
struct blkio_cgroup *biog = cgroup_blkio(cgrp);

put_io_context(biog->io_context);
- free_css_id(&blkio_cgroup_subsys, &biog->css);
+
+ spin_lock_irq(&blkio_cgroup_idr_lock);
+ idr_remove(&blkio_cgroup_id, biog->id);
+ spin_unlock_irq(&blkio_cgroup_idr_lock);
+
kfree(biog);
}

+static struct blkio_cgroup *find_blkio_cgroup(int id)
+{
+ struct blkio_cgroup *biog;
+ spin_lock_irq(&blkio_cgroup_idr_lock);
+ /*
+ * It might fail to find A bio-group associated with "id" since it
+ * is allowed to remove the bio-cgroup even when some of I/O requests
+ * this group issued haven't completed yet.
+ */
+ biog = (struct blkio_cgroup *)idr_find(&blkio_cgroup_id, id);
+ spin_unlock_irq(&blkio_cgroup_idr_lock);
+ return biog;
+}
+
/**
* get_blkio_cgroup_id() - determine the blkio-cgroup ID
* @bio: the &struct bio which describes the I/O
@@ -200,14 +240,11 @@ unsigned long get_blkio_cgroup_id(struct
{
struct page_cgroup *pc;
struct page *page = bio_iovec_idx(bio, 0)->bv_page;
- unsigned long id = 0;
+ int id = 0;

pc = lookup_page_cgroup(page);
- if (pc) {
- lock_page_cgroup(pc);
- id = page_cgroup_get_id(pc);
- unlock_page_cgroup(pc);
- }
+ if (pc)
+ id = pc->blkio_cgroup_id;
return id;
}

@@ -219,21 +256,17 @@ unsigned long get_blkio_cgroup_id(struct
*/
struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
{
- struct cgroup_subsys_state *css;
- struct blkio_cgroup *biog;
+ struct blkio_cgroup *biog = NULL;
struct io_context *ioc;
- unsigned long id;
+ int id = 0;

id = get_blkio_cgroup_id(bio);
- rcu_read_lock();
- css = css_lookup(&blkio_cgroup_subsys, id);
- if (css)
- biog = container_of(css, struct blkio_cgroup, css);
- else
+ if (id)
+ biog = find_blkio_cgroup(id);
+ if (!biog)
biog = &default_blkio_cgroup;
ioc = biog->io_context; /* default io_context for this cgroup */
atomic_inc(&ioc->refcount);
- rcu_read_unlock();
return ioc;
}

@@ -249,17 +282,15 @@ struct io_context *get_blkio_cgroup_ioco
*/
struct cgroup *blkio_cgroup_lookup(int id)
{
- struct cgroup *cgrp;
- struct cgroup_subsys_state *css;
+ struct blkio_cgroup *biog = NULL;

if (blkio_cgroup_disabled())
return NULL;
-
- css = css_lookup(&blkio_cgroup_subsys, id);
- if (!css)
+ if (id)
+ biog = find_blkio_cgroup(id);
+ if (!biog)
return NULL;
- cgrp = css->cgroup;
- return cgrp;
+ return biog->css.cgroup;
}
EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
EXPORT_SYMBOL(get_blkio_cgroup_id);
@@ -268,12 +299,8 @@ EXPORT_SYMBOL(blkio_cgroup_lookup);
static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
{
struct blkio_cgroup *biog = cgroup_blkio(cgrp);
- unsigned long id;

- rcu_read_lock();
- id = css_id(&biog->css);
- rcu_read_unlock();
- return (u64)id;
+ return (u64) biog->id;
}


@@ -296,5 +323,4 @@ struct cgroup_subsys blkio_cgroup_subsys
.destroy = blkio_cgroup_destroy,
.populate = blkio_cgroup_populate,
.subsys_id = blkio_cgroup_subsys_id,
- .use_id = 1,
};
Index: linux-2.6.31-rc3/include/linux/biotrack.h
===================================================================
--- linux-2.6.31-rc3.orig/include/linux/biotrack.h
+++ linux-2.6.31-rc3/include/linux/biotrack.h
@@ -12,6 +12,7 @@ struct block_device;

struct blkio_cgroup {
struct cgroup_subsys_state css;
+ int id;
struct io_context *io_context; /* default io_context */
/* struct radix_tree_root io_context_root; per device io_context */
};
@@ -24,9 +25,7 @@ struct blkio_cgroup {
*/
static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
{
- lock_page_cgroup(pc);
- page_cgroup_set_id(pc, 0);
- unlock_page_cgroup(pc);
+ pc->blkio_cgroup_id = 0;
}

/**
Index: linux-2.6.31-rc3/include/linux/page_cgroup.h
===================================================================
--- linux-2.6.31-rc3.orig/include/linux/page_cgroup.h
+++ linux-2.6.31-rc3/include/linux/page_cgroup.h
@@ -17,6 +17,9 @@ struct page_cgroup {
struct mem_cgroup *mem_cgroup;
struct list_head lru; /* per cgroup LRU list */
#endif
+#ifdef CONFIG_CGROUP_BLKIO
+ int blkio_cgroup_id;
+#endif
};

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -140,27 +143,4 @@ static inline void swap_cgroup_swapoff(i
}

#endif
-
-#ifdef CONFIG_CGROUP_BLKIO
-/*
- * use lower 16 bits for flags and reserve the rest for the page tracking id
- */
-#define PCG_TRACKING_ID_SHIFT (16)
-#define PCG_TRACKING_ID_BITS \
- (8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)
-
-/* NOTE: must be called with page_cgroup() held */
-static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
-{
- return pc->flags >> PCG_TRACKING_ID_SHIFT;
-}
-
-/* NOTE: must be called with page_cgroup() held */
-static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
-{
- WARN_ON(id >= (1UL << PCG_TRACKING_ID_BITS));
- pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;
- pc->flags |= (unsigned long)(id << PCG_TRACKING_ID_SHIFT);
-}
-#endif
#endif

2009-07-21 14:25:16

by Ryo Tsuruta

[permalink] [raw]
Subject: [PATCH 9/9] blkio-cgroup-v9: Add a cgroup support to dm-ioband

With this patch, dm-ioband can work with the blkio-cgroup.
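
The integration is done through a small ops table: dm-ioband registers a
struct ioband_cgroup_ops with blkio-cgroup at module init and unregisters it
at exit, and blkio-cgroup's new "devices" and "settings" cgroup files call
back into those ops. The following is only an illustrative sketch of that
calling convention; every my_* name is a placeholder and not part of the
patch (the real implementations are the ioband_cgroup_* functions added to
dm-ioband-ctl.c below).

#include <linux/module.h>
#include <linux/seq_file.h>
#include <linux/biotrack.h>
#include "dm-ioband.h"		/* declares struct ioband_cgroup_ops */

/* placeholder callbacks, illustrative only */
static void my_show_device(struct seq_file *m) { }
static int my_config_device(int argc, char **argv) { return 0; }
static void my_show_group(struct seq_file *m, struct blkio_cgroup *biog) { }
static int my_config_group(int argc, char **argv, struct blkio_cgroup *biog)
{
	return 0;
}
static void my_remove_group(struct blkio_cgroup *biog) { }

static const struct ioband_cgroup_ops my_ops = {
	.show_device	= my_show_device,
	.config_device	= my_config_device,
	.show_group	= my_show_group,
	.config_group	= my_config_group,
	.remove_group	= my_remove_group,
};

static int __init my_init(void)
{
	/* blkio_cgroup_register_ioband() returns -1 if blkio-cgroup is disabled */
	return blkio_cgroup_register_ioband(&my_ops);
}

static void __exit my_exit(void)
{
	/* equivalent to blkio_cgroup_register_ioband(NULL) */
	blkio_cgroup_unregister_ioband();
}

module_init(my_init);
module_exit(my_exit);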

Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>

---
drivers/md/dm-ioband-ctl.c | 211 ++++++++++++++++++++++++++++++++++++++++-
drivers/md/dm-ioband-policy.c | 20 +++
drivers/md/dm-ioband-rangebw.c | 13 ++
drivers/md/dm-ioband-type.c | 10 -
drivers/md/dm-ioband.h | 14 ++
drivers/md/dm-ioctl.c | 1
include/linux/biotrack.h | 7 +
mm/biotrack.c | 119 ++++++++++++++++++++++-
8 files changed, 382 insertions(+), 13 deletions(-)

Index: linux-2.6.31-rc3/include/linux/biotrack.h
===================================================================
--- linux-2.6.31-rc3.orig/include/linux/biotrack.h
+++ linux-2.6.31-rc3/include/linux/biotrack.h
@@ -9,6 +9,7 @@

struct io_context;
struct block_device;
+struct ioband_cgroup_ops;

struct blkio_cgroup {
struct cgroup_subsys_state css;
@@ -49,6 +50,12 @@ extern void blkio_cgroup_copy_owner(stru
extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
extern unsigned long get_blkio_cgroup_id(struct bio *bio);
extern struct cgroup *blkio_cgroup_lookup(int id);
+extern int blkio_cgroup_register_ioband(const struct ioband_cgroup_ops *ops);
+
+static inline int blkio_cgroup_unregister_ioband(void)
+{
+ return blkio_cgroup_register_ioband(NULL);
+}

#else /* !CONFIG_CGROUP_BLKIO */

Index: linux-2.6.31-rc3/mm/biotrack.c
===================================================================
--- linux-2.6.31-rc3.orig/mm/biotrack.c
+++ linux-2.6.31-rc3/mm/biotrack.c
@@ -21,6 +21,9 @@
#include <linux/blkdev.h>
#include <linux/biotrack.h>
#include <linux/mm_inline.h>
+#include <linux/seq_file.h>
+#include <linux/dm-ioctl.h>
+#include <../drivers/md/dm-ioband.h>

/*
* The block I/O tracking mechanism is implemented on the cgroup memory
@@ -50,6 +53,8 @@ static struct blkio_cgroup default_blkio
.id = 0,
.io_context = &default_blkio_io_context,
};
+static DEFINE_MUTEX(ioband_ops_lock);
+static const struct ioband_cgroup_ops *ioband_ops = NULL;

/**
* blkio_cgroup_set_owner() - set the owner ID of a page.
@@ -206,6 +211,11 @@ static void blkio_cgroup_destroy(struct
{
struct blkio_cgroup *biog = cgroup_blkio(cgrp);

+ mutex_lock(&ioband_ops_lock);
+ if (ioband_ops)
+ ioband_ops->remove_group(biog);
+ mutex_unlock(&ioband_ops_lock);
+
put_io_context(biog->io_context);

spin_lock_irq(&blkio_cgroup_idr_lock);
@@ -292,23 +302,128 @@ struct cgroup *blkio_cgroup_lookup(int i
return NULL;
return biog->css.cgroup;
}
+
+/**
+ * blkio_cgroup_register_ioband() - register ioband
+ * @p: a pointer to struct ioband_cgroup_ops
+ *
+ * Calling with NULL means unregistration.
+ * Returns 0 on success.
+ */
+int blkio_cgroup_register_ioband(const struct ioband_cgroup_ops *p)
+{
+ if (blkio_cgroup_disabled())
+ return -1;
+
+ mutex_lock(&ioband_ops_lock);
+ ioband_ops = p;
+ mutex_unlock(&ioband_ops_lock);
+ return 0;
+}
EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
EXPORT_SYMBOL(get_blkio_cgroup_id);
EXPORT_SYMBOL(blkio_cgroup_lookup);
+EXPORT_SYMBOL(blkio_cgroup_register_ioband);

+/* Read the ID of the specified blkio cgroup. */
static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
{
- struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+ struct blkio_cgroup *biog;
+ int id;
+
+ biog = cgroup_blkio(cgrp);
+ id = biog->id;
+
+ return (u64) id;
+}
+
+/* Show all ioband devices and their settings. */
+static int blkio_devs_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ mutex_lock(&ioband_ops_lock);
+ if (ioband_ops)
+ ioband_ops->show_device(m);
+ mutex_unlock(&ioband_ops_lock);
+ return 0;
+}
+
+/* Configure the ioband device specified by a device or share name */
+static int blkio_devs_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ char **argv;
+ int argc, r = 0;

- return (u64) biog->id;
+ if (cgrp != cgrp->top_cgroup)
+ return -EACCES;
+
+ argv = argv_split(GFP_KERNEL, buffer, &argc);
+ if (!argv)
+ return -ENOMEM;
+
+ mutex_lock(&ioband_ops_lock);
+ if (ioband_ops)
+ r = ioband_ops->config_device(argc, argv);
+ mutex_unlock(&ioband_ops_lock);
+
+ argv_free(argv);
+ return r;
}

+/* Show the settings of the specified blkio cgroup. */
+static int blkio_settings_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct blkio_cgroup *biog;
+
+ mutex_lock(&ioband_ops_lock);
+ if (ioband_ops) {
+ biog = cgroup_blkio(cgrp);
+ ioband_ops->show_group(m, biog);
+ }
+ mutex_unlock(&ioband_ops_lock);
+ return 0;
+}
+
+/* Configure the specified blkio cgroup. */
+static int blkio_settings_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct blkio_cgroup *biog;
+ char **argv;
+ int argc, r = 0;
+
+ argv = argv_split(GFP_KERNEL, buffer, &argc);
+ if (!argv)
+ return -ENOMEM;
+
+ mutex_lock(&ioband_ops_lock);
+ if (ioband_ops) {
+ biog = cgroup_blkio(cgrp);
+ r = ioband_ops->config_group(argc, argv, biog);
+ }
+ mutex_unlock(&ioband_ops_lock);
+
+ argv_free(argv);
+ return r;
+}

static struct cftype blkio_files[] = {
{
.name = "id",
.read_u64 = blkio_id_read,
},
+ {
+ .name = "devices",
+ .read_seq_string = blkio_devs_read,
+ .write_string = blkio_devs_write,
+ },
+ {
+ .name = "settings",
+ .read_seq_string = blkio_settings_read,
+ .write_string = blkio_settings_write,
+ },
};

static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
Index: linux-2.6.31-rc3/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6.31-rc3.orig/drivers/md/dm-ioctl.c
+++ linux-2.6.31-rc3/drivers/md/dm-ioctl.c
@@ -1601,3 +1601,4 @@ out:

return r;
}
+EXPORT_SYMBOL(dm_copy_name_and_uuid);
Index: linux-2.6.31-rc3/drivers/md/dm-ioband-policy.c
===================================================================
--- linux-2.6.31-rc3.orig/drivers/md/dm-ioband-policy.c
+++ linux-2.6.31-rc3/drivers/md/dm-ioband-policy.c
@@ -8,6 +8,7 @@
#include <linux/bio.h>
#include <linux/workqueue.h>
#include <linux/rbtree.h>
+#include <linux/seq_file.h>
#include "dm.h"
#include "dm-ioband.h"

@@ -276,7 +277,7 @@ static int policy_weight_param(struct io
if (value)
err = strict_strtol(value, 0, &val);

- if (!strcmp(cmd, "weight")) {
+ if (!cmd || !strcmp(cmd, "weight")) {
if (!value)
set_weight(gp, DEFAULT_WEIGHT);
else if (!err && 0 < val && val <= SHORT_MAX)
@@ -341,6 +342,19 @@ static void policy_weight_show(struct io
*szp = sz;
}

+static void policy_weight_show_device(struct seq_file *m,
+ struct ioband_device *dp)
+{
+ seq_printf(m, " token=%d carryover=%d",
+ dp->g_token_bucket, dp->g_carryover);
+}
+
+static void policy_weight_show_group(struct seq_file *m,
+ struct ioband_group *gp)
+{
+ seq_printf(m, " weight=%d", gp->c_weight);
+}
+
/*
* <Method> <description>
* g_can_submit : To determine whether a given group has the right to
@@ -369,6 +383,8 @@ static void policy_weight_show(struct io
* Return 1 if a given group can't receive any more BIOs,
* otherwise return 0.
* g_show : Show the configuration.
+ * g_show_device : Show the configuration of the specified ioband device.
+ * g_show_group : Show the configuration of the specified ioband group.
*/
static int policy_weight_init(struct ioband_device *dp, int argc, char **argv)
{
@@ -391,6 +407,8 @@ static int policy_weight_init(struct iob
dp->g_set_param = policy_weight_param;
dp->g_should_block = is_queue_full;
dp->g_show = policy_weight_show;
+ dp->g_show_device = policy_weight_show_device;
+ dp->g_show_group = policy_weight_show_group;

dp->g_epoch = 0;
dp->g_weight_total = 0;
Index: linux-2.6.31-rc3/drivers/md/dm-ioband-rangebw.c
===================================================================
--- linux-2.6.31-rc3.orig/drivers/md/dm-ioband-rangebw.c
+++ linux-2.6.31-rc3/drivers/md/dm-ioband-rangebw.c
@@ -25,6 +25,7 @@
#include <linux/random.h>
#include <linux/time.h>
#include <linux/timer.h>
+#include <linux/seq_file.h>
#include "dm.h"
#include "md.h"
#include "dm-ioband.h"
@@ -459,7 +460,7 @@ static int policy_range_bw_param(struct
err++;
}

- if (!strcmp(cmd, "range-bw")) {
+ if (!cmd || !strcmp(cmd, "range-bw")) {
if (!err && 0 <= min_val &&
min_val <= (INT_MAX / 2) && 0 <= max_val &&
max_val <= (INT_MAX / 2) && min_val <= max_val)
@@ -547,6 +548,12 @@ static void policy_range_bw_show(struct
*szp = sz;
}

+static void policy_range_bw_show_group(struct seq_file *m,
+ struct ioband_group *gp)
+{
+ seq_printf(m, " range-bw=%d:%d", gp->c_min_bw, gp->c_max_bw);
+}
+
static int range_bw_prepare_token(struct ioband_group *gp,
struct bio *bio, int flag)
{
@@ -633,6 +640,8 @@ void range_bw_timeover(unsigned long gp)
* Return 1 if a given group can't receive any more BIOs,
* otherwise return 0.
* g_show : Show the configuration.
+ * g_show_device : Show the configuration of the specified ioband device.
+ * g_show_group : Show the configuration of the specified ioband group.
*/

int policy_range_bw_init(struct ioband_device *dp, int argc, char **argv)
@@ -656,6 +665,8 @@ int policy_range_bw_init(struct ioband_d
dp->g_set_param = policy_range_bw_param;
dp->g_should_block = range_bw_queue_full;
dp->g_show = policy_range_bw_show;
+ dp->g_show_device = NULL;
+ dp->g_show_group = policy_range_bw_show_group;

dp->g_min_bw_total = 0;
dp->g_running_gp = NULL;
Index: linux-2.6.31-rc3/drivers/md/dm-ioband-ctl.c
===================================================================
--- linux-2.6.31-rc3.orig/drivers/md/dm-ioband-ctl.c
+++ linux-2.6.31-rc3/drivers/md/dm-ioband-ctl.c
@@ -15,6 +15,8 @@
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <linux/rbtree.h>
+#include <linux/biotrack.h>
+#include <linux/dm-ioctl.h>
#include "dm.h"
#include "md.h"
#include "dm-ioband.h"
@@ -111,6 +113,7 @@ static struct ioband_device *alloc_ioban
INIT_DELAYED_WORK(&new_dp->g_conductor, ioband_conduct);
INIT_LIST_HEAD(&new_dp->g_groups);
INIT_LIST_HEAD(&new_dp->g_list);
+ INIT_LIST_HEAD(&new_dp->g_heads);
spin_lock_init(&new_dp->g_lock);
bio_list_init(&new_dp->g_urgent_bios);
new_dp->g_io_throttle = io_throttle;
@@ -243,6 +246,7 @@ static int ioband_group_init(struct ioba
int r;

INIT_LIST_HEAD(&gp->c_list);
+ INIT_LIST_HEAD(&gp->c_heads);
bio_list_init(&gp->c_blocked_bios);
bio_list_init(&gp->c_prio_bios);
gp->c_id = id; /* should be verified */
@@ -273,7 +277,8 @@ static int ioband_group_init(struct ioba
ioband_group_add_node(&head->c_group_root, gp);
gp->c_dev = head->c_dev;
gp->c_target = head->c_target;
- }
+ } else
+ list_add_tail(&gp->c_heads, &dp->g_heads);

spin_unlock_irqrestore(&dp->g_lock, flags);
return 0;
@@ -287,6 +292,8 @@ static void ioband_group_release(struct
list_del(&gp->c_list);
if (head)
rb_erase(&gp->c_group_node, &head->c_group_root);
+ else
+ list_del(&gp->c_heads);
dp->g_group_dtr(gp);
kfree(gp);
}
@@ -1290,6 +1297,201 @@ static struct target_type ioband_target
.merge = ioband_merge,
};

+#ifdef CONFIG_CGROUP_BLKIO
+/* Copy the name of the device-mapper device that the ioband group belongs to. */
+static void ioband_copy_name(struct ioband_group *gp, char *name)
+{
+ struct mapped_device *md;
+
+ md = dm_table_get_md(gp->c_target->table);
+ dm_copy_name_and_uuid(md, name, NULL);
+ dm_put(md);
+}
+
+/* Show all ioband devices and their settings. */
+static void ioband_cgroup_show_device(struct seq_file *m)
+{
+ struct ioband_device *dp;
+ struct ioband_group *gp;
+ char name[DM_NAME_LEN];
+
+ mutex_lock(&ioband_lock);
+
+ list_for_each_entry(dp, &ioband_device_list, g_list) {
+ seq_printf(m, "%s policy=%s io_throttle=%d io_limit=%d",
+ dp->g_name, dp->g_policy->p_name,
+ dp->g_io_throttle, dp->g_io_limit);
+ if (dp->g_show_device)
+ dp->g_show_device(m, dp);
+ seq_putc(m, '\n');
+
+ list_for_each_entry(gp, &dp->g_heads, c_heads) {
+ if (strcmp(gp->c_type->t_name, "cgroup"))
+ continue;
+ ioband_copy_name(gp, name);
+ seq_printf(m, " %s\n", name);
+ }
+ }
+
+ mutex_unlock(&ioband_lock);
+}
+
+/* Configure the ioband device specified by a device or share name */
+static int ioband_cgroup_config_device(int argc, char **argv)
+{
+ struct ioband_device *dp;
+ struct ioband_group *gp;
+ char name[DM_NAME_LEN];
+ int r;
+
+ if (argc < 1)
+ return -EINVAL;
+
+ mutex_lock(&ioband_lock);
+
+ /* look up the ioband device */
+ list_for_each_entry(dp, &ioband_device_list, g_list) {
+ /* assuming argv[0] is a share name */
+ if (!strcmp(dp->g_name, argv[0])) {
+ gp = list_first_entry(&dp->g_heads,
+ struct ioband_group, c_heads);
+ goto found;
+ }
+
+ /* assuming argv[0] is a device name */
+ list_for_each_entry(gp, &dp->g_heads, c_heads) {
+ ioband_copy_name(gp, name);
+ if (!strcmp(name, argv[0]))
+ goto found;
+ }
+ }
+
+ mutex_unlock(&ioband_lock);
+ return -ENODEV;
+
+found:
+ if (!strcmp(gp->c_type->t_name, "cgroup"))
+ r = __ioband_message(gp->c_target, --argc, &argv[1]);
+ else
+ r = -ENODEV;
+
+ mutex_unlock(&ioband_lock);
+ return r;
+}
+
+/* Show the settings of the specified blkio cgroup. */
+static void ioband_cgroup_show_group(struct seq_file *m,
+ struct blkio_cgroup *biog)
+{
+ struct ioband_device *dp;
+ struct ioband_group *head, *gp;
+ struct cgroup *cgrp = biog->css.cgroup;
+ char name[DM_NAME_LEN];
+
+ mutex_lock(&ioband_lock);
+
+ list_for_each_entry(dp, &ioband_device_list, g_list) {
+ list_for_each_entry(head, &dp->g_heads, c_heads) {
+ if (strcmp(head->c_type->t_name, "cgroup"))
+ continue;
+
+ if (cgrp == cgrp->top_cgroup)
+ gp = head;
+ else {
+ gp = ioband_group_find(head, biog->id);
+ if (!gp)
+ continue;
+ }
+
+ ioband_copy_name(head, name);
+ seq_puts(m, name);
+ if (dp->g_show_group)
+ dp->g_show_group(m, gp);
+ seq_putc(m, '\n');
+ }
+ }
+
+ mutex_unlock(&ioband_lock);
+}
+
+/* Configure the specified blkio cgroup. */
+static int ioband_cgroup_config_group(int argc, char **argv,
+ struct blkio_cgroup *biog)
+{
+ struct ioband_device *dp;
+ struct ioband_group *head, *gp;
+ struct cgroup *cgrp = biog->css.cgroup;
+ char name[DM_NAME_LEN];
+ int r;
+
+ if (argc != 1 && argc != 2)
+ return -EINVAL;
+
+ mutex_lock(&ioband_lock);
+
+ list_for_each_entry(dp, &ioband_device_list, g_list) {
+ list_for_each_entry(head, &dp->g_heads, c_heads) {
+ if (strcmp(head->c_type->t_name, "cgroup"))
+ continue;
+ ioband_copy_name(head, name);
+ if (!strcmp(name, argv[0]))
+ goto found;
+ }
+ }
+
+ mutex_unlock(&ioband_lock);
+ return -ENODEV;
+
+found:
+ if (argc == 1) {
+ if (cgrp == cgrp->top_cgroup)
+ r = -EINVAL;
+ else
+ r = ioband_group_detach(head, biog->id);
+ } else {
+ if (cgrp == cgrp->top_cgroup)
+ gp = head;
+ else
+ gp = ioband_group_find(head, biog->id);
+
+ if (!gp)
+ r = ioband_group_attach(head, biog->id, argv[1]);
+ else
+ r = gp->c_banddev->g_set_param(gp, NULL, argv[1]);
+ }
+
+ mutex_unlock(&ioband_lock);
+ return r;
+}
+
+/* Remove the specified blkio cgroup. */
+static void ioband_cgroup_remove_group(struct blkio_cgroup *biog)
+{
+ struct ioband_device *dp;
+ struct ioband_group *head;
+
+ mutex_lock(&ioband_lock);
+
+ list_for_each_entry(dp, &ioband_device_list, g_list) {
+ list_for_each_entry(head, &dp->g_heads, c_heads) {
+ if (strcmp(head->c_type->t_name, "cgroup"))
+ continue;
+ ioband_group_detach(head, biog->id);
+ }
+ }
+
+ mutex_unlock(&ioband_lock);
+}
+
+static const struct ioband_cgroup_ops ioband_ops = {
+ .show_device = ioband_cgroup_show_device,
+ .config_device = ioband_cgroup_config_device,
+ .show_group = ioband_cgroup_show_group,
+ .config_group = ioband_cgroup_config_group,
+ .remove_group = ioband_cgroup_remove_group,
+};
+#endif
+
static int __init dm_ioband_init(void)
{
int r;
@@ -1297,11 +1499,18 @@ static int __init dm_ioband_init(void)
r = dm_register_target(&ioband_target);
if (r < 0)
DMERR("register failed %d", r);
+#ifdef CONFIG_CGROUP_BLKIO
+ else
+ r = blkio_cgroup_register_ioband(&ioband_ops);
+#endif
return r;
}

static void __exit dm_ioband_exit(void)
{
+#ifdef CONFIG_CGROUP_BLKIO
+ blkio_cgroup_unregister_ioband();
+#endif
dm_unregister_target(&ioband_target);
}

Index: linux-2.6.31-rc3/drivers/md/dm-ioband.h
===================================================================
--- linux-2.6.31-rc3.orig/drivers/md/dm-ioband.h
+++ linux-2.6.31-rc3/drivers/md/dm-ioband.h
@@ -44,6 +44,7 @@ struct ioband_device {

int g_ref;
struct list_head g_list;
+ struct list_head g_heads;
int g_flags;
char g_name[IOBAND_NAME_MAX + 1];
const struct ioband_policy_type *g_policy;
@@ -59,6 +60,8 @@ struct ioband_device {
int (*g_set_param) (struct ioband_group *, const char *, const char *);
int (*g_should_block) (struct ioband_group *);
void (*g_show) (struct ioband_group *, int *, char *, unsigned);
+ void (*g_show_device) (struct seq_file *, struct ioband_device *);
+ void (*g_show_group) (struct seq_file *, struct ioband_group *);

/* members for weight balancing policy */
int g_epoch;
@@ -104,6 +107,7 @@ struct ioband_group_stat {

struct ioband_group {
struct list_head c_list;
+ struct list_head c_heads;
struct ioband_device *c_banddev;
struct dm_dev *c_dev;
struct dm_target *c_target;
@@ -150,6 +154,16 @@ struct ioband_group {

};

+struct blkio_cgroup;
+
+struct ioband_cgroup_ops {
+ void (*show_device)(struct seq_file *);
+ int (*config_device)(int, char **);
+ void (*show_group)(struct seq_file *, struct blkio_cgroup *);
+ int (*config_group)(int, char **, struct blkio_cgroup *);
+ void (*remove_group)(struct blkio_cgroup *);
+};
+
#define IOBAND_URGENT 1

#define DEV_BIO_BLOCKED 1
Index: linux-2.6.31-rc3/drivers/md/dm-ioband-type.c
===================================================================
--- linux-2.6.31-rc3.orig/drivers/md/dm-ioband-type.c
+++ linux-2.6.31-rc3/drivers/md/dm-ioband-type.c
@@ -6,6 +6,7 @@
* This file is released under the GPL.
*/
#include <linux/bio.h>
+#include <linux/biotrack.h>
#include "dm.h"
#include "dm-ioband.h"

@@ -52,14 +53,7 @@ static int ioband_node(struct bio *bio)

static int ioband_cgroup(struct bio *bio)
{
- /*
- * This function should return the ID of the cgroup which
- * issued "bio". The ID of the cgroup which the current
- * process belongs to won't be suitable ID for this purpose,
- * since some BIOs will be handled by kernel threads like aio
- * or pdflush on behalf of the process requesting the BIOs.
- */
- return 0; /* not implemented yet */
+ return get_blkio_cgroup_id(bio);
}

const struct ioband_group_type dm_ioband_group_type[] = {

2009-07-21 15:56:41

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework

* Ryo Tsuruta <[email protected]> [2009-07-21 23:12:11]:

> This patch makes the page_cgroup framework be able to be used even if
> the compile option of the cgroup memory controller is off.
> So blkio-cgroup can use this framework without the memory controller.
>
> Signed-off-by: Hirokazu Takahashi <[email protected]>
> Signed-off-by: Ryo Tsuruta <[email protected]>
>
> ---
> include/linux/memcontrol.h | 6 ++++++
> include/linux/mmzone.h | 4 ++--
> include/linux/page_cgroup.h | 8 +++++---
> init/Kconfig | 4 ++++
> mm/Makefile | 3 ++-
> mm/memcontrol.c | 6 ++++++
> mm/page_cgroup.c | 3 +--
> 7 files changed, 26 insertions(+), 8 deletions(-)
>
> Index: linux-2.6.31-rc3/include/linux/memcontrol.h
> ===================================================================
> --- linux-2.6.31-rc3.orig/include/linux/memcontrol.h
> +++ linux-2.6.31-rc3/include/linux/memcontrol.h
> @@ -37,6 +37,8 @@ struct mm_struct;
> * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> */
>
> +extern void __init_mem_page_cgroup(struct page_cgroup *pc);
> +
> extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask);
> /* for swap handling */
> @@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> +static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
> +{
> +}
> +
> static inline int mem_cgroup_newpage_charge(struct page *page,
> struct mm_struct *mm, gfp_t gfp_mask)
> {
> Index: linux-2.6.31-rc3/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.31-rc3.orig/include/linux/mmzone.h
> +++ linux-2.6.31-rc3/include/linux/mmzone.h
> @@ -605,7 +605,7 @@ typedef struct pglist_data {
> int nr_zones;
> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> struct page *node_mem_map;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> struct page_cgroup *node_page_cgroup;
> #endif
> #endif
> @@ -956,7 +956,7 @@ struct mem_section {
>
> /* See declaration of similar field in struct zone */
> unsigned long *pageblock_flags;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> /*
> * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
> * section. (see memcontrol.h/page_cgroup.h about this.)
> Index: linux-2.6.31-rc3/include/linux/page_cgroup.h
> ===================================================================
> --- linux-2.6.31-rc3.orig/include/linux/page_cgroup.h
> +++ linux-2.6.31-rc3/include/linux/page_cgroup.h
> @@ -1,7 +1,7 @@
> #ifndef __LINUX_PAGE_CGROUP_H
> #define __LINUX_PAGE_CGROUP_H
>
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_CGROUP_PAGE
> #include <linux/bit_spinlock.h>
> /*
> * Page Cgroup can be considered as an extended mem_map.
> @@ -12,9 +12,11 @@
> */
> struct page_cgroup {
> unsigned long flags;
> - struct mem_cgroup *mem_cgroup;
> struct page *page;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + struct mem_cgroup *mem_cgroup;
> struct list_head lru; /* per cgroup LRU list */
> +#endif
> };

If CONFIG_CGROUP_MEM_RES_CTLR is not enabled but CGROUP_PAGE is
(assuming that the "depends on" below is refactored), what would this
change buy us? What is page_cgroup helping us track? With mem_cgroup
factored out, are we only interested in the flags?


>
> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> @@ -83,7 +85,7 @@ static inline void unlock_page_cgroup(st
> bit_spin_unlock(PCG_LOCK, &pc->flags);
> }
>
> -#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +#else /* CONFIG_CGROUP_PAGE */
> struct page_cgroup;
>
> static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> Index: linux-2.6.31-rc3/init/Kconfig
> ===================================================================
> --- linux-2.6.31-rc3.orig/init/Kconfig
> +++ linux-2.6.31-rc3/init/Kconfig
> @@ -614,6 +614,10 @@ config CGROUP_MEM_RES_CTLR_SWAP
>
> endif # CGROUPS
>
> +config CGROUP_PAGE
> + def_bool y

Should def_bool be "y"? Shouldn't CGROUP_MEM_RES_CTLR select it?

> + depends on CGROUP_MEM_RES_CTLR
> +
> config MM_OWNER
> bool
>
> Index: linux-2.6.31-rc3/mm/Makefile
> ===================================================================
> --- linux-2.6.31-rc3.orig/mm/Makefile
> +++ linux-2.6.31-rc3/mm/Makefile
> @@ -39,6 +39,7 @@ else
> obj-$(CONFIG_SMP) += allocpercpu.o
> endif
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
> obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> Index: linux-2.6.31-rc3/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.31-rc3.orig/mm/memcontrol.c
> +++ linux-2.6.31-rc3/mm/memcontrol.c
> @@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
> struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
> };
>
> +void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
> +{
> + pc->mem_cgroup = NULL;
> + INIT_LIST_HEAD(&pc->lru);
> +}
> +
> /*
> * The memory controller data structure. The memory controller controls both
> * page cache and RSS per cgroup. We would eventually like to provide
> Index: linux-2.6.31-rc3/mm/page_cgroup.c
> ===================================================================
> --- linux-2.6.31-rc3.orig/mm/page_cgroup.c
> +++ linux-2.6.31-rc3/mm/page_cgroup.c
> @@ -14,9 +14,8 @@ static void __meminit
> __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> {
> pc->flags = 0;
> - pc->mem_cgroup = NULL;
> pc->page = pfn_to_page(pfn);
> - INIT_LIST_HEAD(&pc->lru);
> + __init_mem_page_cgroup(pc);
> }
> static unsigned long total_usage;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Balbir

2009-07-22 01:22:47

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework

On Tue, 21 Jul 2009 21:26:36 +0530
Balbir Singh <[email protected]> wrote:

> * Ryo Tsuruta <[email protected]> [2009-07-21 23:12:11]:
>
> > This patch makes the page_cgroup framework be able to be used even if
> > the compile option of the cgroup memory controller is off.
> > So blkio-cgroup can use this framework without the memory controller.
> >
> > Signed-off-by: Hirokazu Takahashi <[email protected]>
> > Signed-off-by: Ryo Tsuruta <[email protected]>
> >
> > ---
> > include/linux/memcontrol.h | 6 ++++++
> > include/linux/mmzone.h | 4 ++--
> > include/linux/page_cgroup.h | 8 +++++---
> > init/Kconfig | 4 ++++
> > mm/Makefile | 3 ++-
> > mm/memcontrol.c | 6 ++++++
> > mm/page_cgroup.c | 3 +--
> > 7 files changed, 26 insertions(+), 8 deletions(-)
> >
> > Index: linux-2.6.31-rc3/include/linux/memcontrol.h
> > ===================================================================
> > --- linux-2.6.31-rc3.orig/include/linux/memcontrol.h
> > +++ linux-2.6.31-rc3/include/linux/memcontrol.h
> > @@ -37,6 +37,8 @@ struct mm_struct;
> > * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> > */
> >
> > +extern void __init_mem_page_cgroup(struct page_cgroup *pc);
> > +
> > extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> > gfp_t gfp_mask);
> > /* for swap handling */
> > @@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(
> > #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> > struct mem_cgroup;
> >
> > +static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
> > +{
> > +}
> > +
> > static inline int mem_cgroup_newpage_charge(struct page *page,
> > struct mm_struct *mm, gfp_t gfp_mask)
> > {
> > Index: linux-2.6.31-rc3/include/linux/mmzone.h
> > ===================================================================
> > --- linux-2.6.31-rc3.orig/include/linux/mmzone.h
> > +++ linux-2.6.31-rc3/include/linux/mmzone.h
> > @@ -605,7 +605,7 @@ typedef struct pglist_data {
> > int nr_zones;
> > #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> > struct page *node_mem_map;
> > -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +#ifdef CONFIG_CGROUP_PAGE
> > struct page_cgroup *node_page_cgroup;
> > #endif
> > #endif
> > @@ -956,7 +956,7 @@ struct mem_section {
> >
> > /* See declaration of similar field in struct zone */
> > unsigned long *pageblock_flags;
> > -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +#ifdef CONFIG_CGROUP_PAGE
> > /*
> > * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
> > * section. (see memcontrol.h/page_cgroup.h about this.)
> > Index: linux-2.6.31-rc3/include/linux/page_cgroup.h
> > ===================================================================
> > --- linux-2.6.31-rc3.orig/include/linux/page_cgroup.h
> > +++ linux-2.6.31-rc3/include/linux/page_cgroup.h
> > @@ -1,7 +1,7 @@
> > #ifndef __LINUX_PAGE_CGROUP_H
> > #define __LINUX_PAGE_CGROUP_H
> >
> > -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +#ifdef CONFIG_CGROUP_PAGE
> > #include <linux/bit_spinlock.h>
> > /*
> > * Page Cgroup can be considered as an extended mem_map.
> > @@ -12,9 +12,11 @@
> > */
> > struct page_cgroup {
> > unsigned long flags;
> > - struct mem_cgroup *mem_cgroup;
> > struct page *page;
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > + struct mem_cgroup *mem_cgroup;
> > struct list_head lru; /* per cgroup LRU list */
> > +#endif
> > };
>
> If CONFIG_CGROUP_MEM_RES_CTLR is not enabled and CGROUP_PAGE is
> (assuming that the depends on below is refactored), what would this
> change buy us? What is page_cgroup helping us track, the mem_cgroup is
> factored out, so we are interested in the flags only?
>
Please remove the CONFIG; this just makes the code complicated.
Or please use your own infrastructure that does not depend on page_cgroup.

Thanks,
-Kame

2009-07-22 02:09:32

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework

On Wed, 22 Jul 2009 10:20:58 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 21 Jul 2009 21:26:36 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * Ryo Tsuruta <[email protected]> [2009-07-21 23:12:11]:
> >
> > > This patch makes the page_cgroup framework be able to be used even if
> > > the compile option of the cgroup memory controller is off.
> > > So blkio-cgroup can use this framework without the memory controller.
> > >
> > > Signed-off-by: Hirokazu Takahashi <[email protected]>
> > > Signed-off-by: Ryo Tsuruta <[email protected]>
> > >
> > > ---
> > > include/linux/memcontrol.h | 6 ++++++
> > > include/linux/mmzone.h | 4 ++--
> > > include/linux/page_cgroup.h | 8 +++++---
> > > init/Kconfig | 4 ++++
> > > mm/Makefile | 3 ++-
> > > mm/memcontrol.c | 6 ++++++
> > > mm/page_cgroup.c | 3 +--
> > > 7 files changed, 26 insertions(+), 8 deletions(-)
> > >
> > > Index: linux-2.6.31-rc3/include/linux/memcontrol.h
> > > ===================================================================
> > > --- linux-2.6.31-rc3.orig/include/linux/memcontrol.h
> > > +++ linux-2.6.31-rc3/include/linux/memcontrol.h
> > > @@ -37,6 +37,8 @@ struct mm_struct;
> > > * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> > > */
> > >
> > > +extern void __init_mem_page_cgroup(struct page_cgroup *pc);
> > > +
> > > extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> > > gfp_t gfp_mask);
> > > /* for swap handling */
> > > @@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(
> > > #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> > > struct mem_cgroup;
> > >
> > > +static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
> > > +{
> > > +}
> > > +
> > > static inline int mem_cgroup_newpage_charge(struct page *page,
> > > struct mm_struct *mm, gfp_t gfp_mask)
> > > {
> > > Index: linux-2.6.31-rc3/include/linux/mmzone.h
> > > ===================================================================
> > > --- linux-2.6.31-rc3.orig/include/linux/mmzone.h
> > > +++ linux-2.6.31-rc3/include/linux/mmzone.h
> > > @@ -605,7 +605,7 @@ typedef struct pglist_data {
> > > int nr_zones;
> > > #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> > > struct page *node_mem_map;
> > > -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > +#ifdef CONFIG_CGROUP_PAGE
> > > struct page_cgroup *node_page_cgroup;
> > > #endif
> > > #endif
> > > @@ -956,7 +956,7 @@ struct mem_section {
> > >
> > > /* See declaration of similar field in struct zone */
> > > unsigned long *pageblock_flags;
> > > -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > +#ifdef CONFIG_CGROUP_PAGE
> > > /*
> > > * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
> > > * section. (see memcontrol.h/page_cgroup.h about this.)
> > > Index: linux-2.6.31-rc3/include/linux/page_cgroup.h
> > > ===================================================================
> > > --- linux-2.6.31-rc3.orig/include/linux/page_cgroup.h
> > > +++ linux-2.6.31-rc3/include/linux/page_cgroup.h
> > > @@ -1,7 +1,7 @@
> > > #ifndef __LINUX_PAGE_CGROUP_H
> > > #define __LINUX_PAGE_CGROUP_H
> > >
> > > -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > +#ifdef CONFIG_CGROUP_PAGE
> > > #include <linux/bit_spinlock.h>
> > > /*
> > > * Page Cgroup can be considered as an extended mem_map.
> > > @@ -12,9 +12,11 @@
> > > */
> > > struct page_cgroup {
> > > unsigned long flags;
> > > - struct mem_cgroup *mem_cgroup;
> > > struct page *page;
> > > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > + struct mem_cgroup *mem_cgroup;
> > > struct list_head lru; /* per cgroup LRU list */
> > > +#endif
> > > };
> >
> > If CONFIG_CGROUP_MEM_RES_CTLR is not enabled and CGROUP_PAGE is
> > (assuming that the depends on below is refactored), what would this
> > change buy us? What is page_cgroup helping us track, the mem_cgroup is
> > factored out, so we are interested in the flags only?
> >
> plz remove CONFIG. This jsut makes code complicated.
> or plz use your own infrastructure, not depends on page_cgroup.
>

BTW, you can't modify bits of page_cgroup->flags without cmpxchg.
So patch [5/9] is completely broken now, because the new bits are used
with atomic bit ops but without lock_page_cgroup(). (see mmotm)

The only reason struct page's flags can include the zone id etc. is that
they are initialized before use. Anyway, this is a "flags" field; if you
want to modify multiple bits at once, please use cmpxchg.
Then I'll buy patch [8/9] and just skip this patch.
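
For illustration, the kind of cmpxchg loop meant here might look like the
sketch below. It is not from any posted patch; the PCG_TRACKING_ID_SHIFT
layout is borrowed from the ID-in-flags scheme used elsewhere in this
series, so treat the exact names as assumptions.

/* Sketch: store a tracking ID in the upper bits of pc->flags without
 * losing concurrent atomic bit operations on the low flag bits. */
static void page_cgroup_set_tracking_id(struct page_cgroup *pc,
					unsigned long id)
{
	unsigned long old, new;

	do {
		old = pc->flags;
		new = (old & ((1UL << PCG_TRACKING_ID_SHIFT) - 1)) |
		      (id << PCG_TRACKING_ID_SHIFT);
	} while (cmpxchg(&pc->flags, old, new) != old);
}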

But, following is more straightforward. (and what you do is not different
from this.)
==
struct page {
.....
#ifdef CONFIG_BLOCKIO_CGROUP
void *blockio_cgroup;
#endif
}
==


Regards,
-Kame


2009-07-22 02:13:10

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 5/9] blkio-cgroup-v9: The body of blkio-cgroup

On Tue, 21 Jul 2009 23:14:05 +0900 (JST)
Ryo Tsuruta <[email protected]> wrote:

> The body of blkio-cgroup.
> + * blkio_cgroup_set_owner() - set the owner ID of a page.
> + * @page: the page we want to tag
> + * @mm: the mm_struct of a page owner
> + *
> + * Make a given page have the blkio-cgroup ID of the owner of this page.
> + */
> +void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)


> + * blkio_cgroup_reset_owner() - reset the owner ID of a page
> + * @page: the page we want to tag
> + * @mm: the mm_struct of a page owner
> + *
> + * Change the owner of a given page if necessary.
> + */
> +void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
> +{
> + blkio_cgroup_set_owner(page, mm);
> +}


> +void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
> +{
> + if (!page_is_file_cache(page))
> + return;
> + if (current->flags & PF_MEMALLOC)
> + return;
> +
> + blkio_cgroup_reset_owner(page, mm);
> +}
> +

Hmm, why pass "mm" and not the thread? Do we need to take care of mm->owner?
Why is "current" bad?

Thanks,
-Kame


2009-07-22 02:19:05

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

On Tue, 21 Jul 2009 23:23:16 +0900 (JST)
Ryo Tsuruta <[email protected]> wrote:

> This patch contains several hooks that let the blkio-cgroup framework to know
> which blkio-cgroup is the owner of a page before starting I/O against the page.

> @@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page
> gfp_mask & GFP_RECLAIM_MASK);
> if (error)
> goto out;
> + blkio_cgroup_set_owner(page, current->mm);
>

This part is doubtful... is this necessary?
I recommend that the caller attach the owner by itself.


> error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> if (error == 0) {
> Index: linux-2.6.31-rc3/mm/memory.c
> ===================================================================
> --- linux-2.6.31-rc3.orig/mm/memory.c
> +++ linux-2.6.31-rc3/mm/memory.c
> @@ -51,6 +51,7 @@
> #include <linux/init.h>
> #include <linux/writeback.h>
> #include <linux/memcontrol.h>
> +#include <linux/biotrack.h>
> #include <linux/mmu_notifier.h>
> #include <linux/kallsyms.h>
> #include <linux/swapops.h>
> @@ -2115,6 +2116,7 @@ gotten:
> */
> ptep_clear_flush_notify(vma, address, page_table);
> page_add_new_anon_rmap(new_page, vma, address);
> + blkio_cgroup_set_owner(new_page, mm);

Please do this in the swap-out code.

> set_pte_at(mm, address, page_table, entry);
> update_mmu_cache(vma, address, entry);
> if (old_page) {
> @@ -2580,6 +2582,7 @@ static int do_swap_page(struct mm_struct
> flush_icache_page(vma, page);
> set_pte_at(mm, address, page_table, pte);
> page_add_anon_rmap(page, vma, address);
> + blkio_cgroup_reset_owner(page, mm);

and this.


> /* It's better to call commit-charge after rmap is established */
> mem_cgroup_commit_charge_swapin(page, ptr);
>
> @@ -2644,6 +2647,7 @@ static int do_anonymous_page(struct mm_s
> goto release;
> inc_mm_counter(mm, anon_rss);
> page_add_new_anon_rmap(page, vma, address);
> + blkio_cgroup_set_owner(page, mm);
> set_pte_at(mm, address, page_table, entry);
>
and this.

> /* No need to invalidate - it was non-present before */
> @@ -2791,6 +2795,7 @@ static int __do_fault(struct mm_struct *
> if (anon) {
> inc_mm_counter(mm, anon_rss);
> page_add_new_anon_rmap(page, vma, address);
> + blkio_cgroup_set_owner(page, mm);
and this.

IMHO, the later I/O for swap-out is caused by the caller of the swap-out, not
by the page's owner. Please charge it to the caller, or
- add a special blkio-cgroup ID for the kernel's swap-out (a rough sketch follows).
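
Purely as an illustration of that second option, one shape it could take is
sketched below. Everything in the sketch (the reserved ID value, the helper,
and where it would be called from) is an assumption and exists in no patch of
this series; only lookup_page_cgroup() and the blkio_cgroup_id field are
taken from the posted code.

#include <linux/page_cgroup.h>

#define BLKIO_CGROUP_ID_KERNEL_SWAPOUT	1	/* assumed reserved, never allocated */

/* hypothetical helper, called from the swap-out path instead of
 * blkio_cgroup_set_owner(page, mm) */
static void blkio_cgroup_charge_swapout(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	/* tag the page with the reserved ID so the resulting I/O is
	 * accounted to the kernel rather than to the page's last owner */
	if (pc)
		pc->blkio_cgroup_id = BLKIO_CGROUP_ID_KERNEL_SWAPOUT;
}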

Bye,
-Kame

> } else {
> inc_mm_counter(mm, file_rss);
> page_add_file_rmap(page);
> Index: linux-2.6.31-rc3/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.31-rc3.orig/mm/page-writeback.c
> +++ linux-2.6.31-rc3/mm/page-writeback.c
> @@ -23,6 +23,7 @@
> #include <linux/init.h>
> #include <linux/backing-dev.h>
> #include <linux/task_io_accounting_ops.h>
> +#include <linux/biotrack.h>
> #include <linux/blkdev.h>
> #include <linux/mpage.h>
> #include <linux/rmap.h>
> @@ -1247,6 +1248,7 @@ int __set_page_dirty_nobuffers(struct pa
> BUG_ON(mapping2 != mapping);
> WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> account_page_dirtied(page, mapping);
> + blkio_cgroup_reset_owner_pagedirty(page, current->mm);
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> }
> Index: linux-2.6.31-rc3/mm/swap_state.c
> ===================================================================
> --- linux-2.6.31-rc3.orig/mm/swap_state.c
> +++ linux-2.6.31-rc3/mm/swap_state.c
> @@ -18,6 +18,7 @@
> #include <linux/pagevec.h>
> #include <linux/migrate.h>
> #include <linux/page_cgroup.h>
> +#include <linux/biotrack.h>
>
> #include <asm/pgtable.h>
>
> @@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_e
> */
> __set_page_locked(new_page);
> SetPageSwapBacked(new_page);
> + blkio_cgroup_set_owner(new_page, current->mm);
> err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> if (likely(!err)) {
> /*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-07-22 08:28:45

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework

KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > Index: linux-2.6.31-rc3/include/linux/page_cgroup.h
> > > > ===================================================================
> > > > --- linux-2.6.31-rc3.orig/include/linux/page_cgroup.h
> > > > +++ linux-2.6.31-rc3/include/linux/page_cgroup.h
> > > > @@ -1,7 +1,7 @@
> > > > #ifndef __LINUX_PAGE_CGROUP_H
> > > > #define __LINUX_PAGE_CGROUP_H
> > > >
> > > > -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > > +#ifdef CONFIG_CGROUP_PAGE
> > > > #include <linux/bit_spinlock.h>
> > > > /*
> > > > * Page Cgroup can be considered as an extended mem_map.
> > > > @@ -12,9 +12,11 @@
> > > > */
> > > > struct page_cgroup {
> > > > unsigned long flags;
> > > > - struct mem_cgroup *mem_cgroup;
> > > > struct page *page;
> > > > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > > + struct mem_cgroup *mem_cgroup;
> > > > struct list_head lru; /* per cgroup LRU list */
> > > > +#endif
> > > > };
> > >
> > > If CONFIG_CGROUP_MEM_RES_CTLR is not enabled and CGROUP_PAGE is
> > > (assuming that the depends on below is refactored), what would this
> > > change buy us? What is page_cgroup helping us track, the mem_cgroup is
> > > factored out, so we are interested in the flags only?
> > >
> > plz remove CONFIG. This jsut makes code complicated.
> > or plz use your own infrastructure, not depends on page_cgroup.

Thanks for reviewing the patch.
Do you mean removing only the CONFIG_CGROUP_MEM_RES_CTLR part of struct
page_cgroup? Is it OK to define CONFIG_CGROUP_PAGE?

> BTW, you can't modify page_cgroup->flags bit without cmpxchg.
> Then, patch [5/9] is completely broken, now because new bit is used
> with atomic bit ops but without lock_page_cgroup(). (see mmotm)
>
> Why struct page's flags bit can includes zone id etc...is just because
> it's initalized before using. Anyway, this is "flags" bit. If you want
> to modify multiple bit at once, plz use cmpxchg.
> Then, I buy patch [8/9] and just skip this patch.

O.K. I'll use cmpxchg.

> But, following is more straightforward. (and what you do is not different
> from this.)
> ==
> struct page {
> .....
> #ifdef CONFIG_BLOCKIO_CGROUP
> void *blockio_cgroup;
> #endif
> }
> ==

This increases the size of struct page. Could I get a consensus on
this approach?

Thanks,
Ryo Tsuruta

> Regards,
> -Kame
>
>

2009-07-22 08:37:27

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework

* Ryo Tsuruta <[email protected]> [2009-07-22 17:28:43]:

> KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > > Index: linux-2.6.31-rc3/include/linux/page_cgroup.h
> > > > > ===================================================================
> > > > > --- linux-2.6.31-rc3.orig/include/linux/page_cgroup.h
> > > > > +++ linux-2.6.31-rc3/include/linux/page_cgroup.h
> > > > > @@ -1,7 +1,7 @@
> > > > > #ifndef __LINUX_PAGE_CGROUP_H
> > > > > #define __LINUX_PAGE_CGROUP_H
> > > > >
> > > > > -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > > > +#ifdef CONFIG_CGROUP_PAGE
> > > > > #include <linux/bit_spinlock.h>
> > > > > /*
> > > > > * Page Cgroup can be considered as an extended mem_map.
> > > > > @@ -12,9 +12,11 @@
> > > > > */
> > > > > struct page_cgroup {
> > > > > unsigned long flags;
> > > > > - struct mem_cgroup *mem_cgroup;
> > > > > struct page *page;
> > > > > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > > > + struct mem_cgroup *mem_cgroup;
> > > > > struct list_head lru; /* per cgroup LRU list */
> > > > > +#endif
> > > > > };
> > > >
> > > > If CONFIG_CGROUP_MEM_RES_CTLR is not enabled and CGROUP_PAGE is
> > > > (assuming that the depends on below is refactored), what would this
> > > > change buy us? What is page_cgroup helping us track, the mem_cgroup is
> > > > factored out, so we are interested in the flags only?
> > > >
> > > plz remove CONFIG. This jsut makes code complicated.
> > > or plz use your own infrastructure, not depends on page_cgroup.
>
> Thanks for reviewing the patch.
> Do you mean that remove only CONFIG_CGROUP_MEM_RES_CTR in struct
> page_cgroup? Is it OK to define CONFIG_CGROUP_PAGE?
>
> > BTW, you can't modify page_cgroup->flags bit without cmpxchg.
> > Then, patch [5/9] is completely broken, now because new bit is used
> > with atomic bit ops but without lock_page_cgroup(). (see mmotm)
> >
> > Why struct page's flags bit can includes zone id etc...is just because
> > it's initalized before using. Anyway, this is "flags" bit. If you want
> > to modify multiple bit at once, plz use cmpxchg.
> > Then, I buy patch [8/9] and just skip this patch.
>
> O.K. I'll use cmpxchg.
>
> > But, following is more straightforward. (and what you do is not different
> > from this.)
> > ==
> > struct page {
> > .....
> > #ifdef CONFIG_BLOCKIO_CGROUP
> > void *blockio_cgroup;
> > #endif
> > }
> > ==
>
> This increases the size of struct page. Could I get a consensus on
> this approach?
>


This defeats the entire purpose of page_cgroup, IMHO. You need to add
the cgroup pointer to page_cgroup or use css id's there.

> Thanks,
> Ryo Tsuruta
>
> > Regards,
> > -Kame
> >
> >

--
Balbir

2009-07-22 08:41:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework

On Wed, 22 Jul 2009 17:28:43 +0900 (JST)
Ryo Tsuruta <[email protected]> wrote:
> > But, following is more straightforward. (and what you do is not different
> > from this.)
> > ==
> > struct page {
> > .....
> > #ifdef CONFIG_BLOCKIO_CGROUP
> > void *blockio_cgroup;
> > #endif
> > }
> > ==
>
> This increases the size of struct page. Could I get a consensus on
> this approach?
>
Just God knows ;)

To be honest, what I have been expecting from the blkio-cgroup people recently
is something like the following, to make room for themselves.

I'm now thinking of doing this myself and offering that room to you, because
the terrible bugs are gone now and I have time.

Balbir, if you have no concerns, I'll clean this up and send it to mmotm.
(The softlimit code may access pc->page, so I may have to update this.)

Note: This is _not_ tested at all.

Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <[email protected]>

page_cgroup has a pointer to the memmap entry it stands for.
But page_cgroup->page is not accessed in the fast path, is never modified,
and is not really necessary, so it does not have to be maintained as a pointer.

This patch removes "page" from page_cgroup and adds a
page_cgroup_to_page() function. It uses some of the FLAGS bits,
as struct page does.
As a side effect, the nid and zid can be obtained from the page_cgroup itself.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page_cgroup.h | 19 ++++++++++++++++---
mm/page_cgroup.c | 42 ++++++++++++++++++++++++++++++++----------
2 files changed, 48 insertions(+), 13 deletions(-)

Index: mmotm-2.6.31-Jul16/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.31-Jul16.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.31-Jul16/include/linux/page_cgroup.h
@@ -13,7 +13,7 @@
struct page_cgroup {
unsigned long flags;
struct mem_cgroup *mem_cgroup;
- struct page *page;
+ /* block io tracking will use extra unsigned long bytes */
struct list_head lru; /* per cgroup LRU list */
};

@@ -32,7 +32,12 @@ static inline void __init page_cgroup_in
#endif

struct page_cgroup *lookup_page_cgroup(struct page *page);
+struct page *page_cgroup_to_page(struct page_cgroup *page);

+/*
+ * TOP MOST (NODE_SHIFT+ZONE_SHIFT or SECTION_SHIFT bits of "flags" are used
+ * for detecting pfn as struct page does.
+ */
enum {
/* flags for mem_cgroup */
PCG_LOCK, /* page cgroup is locked */
@@ -71,14 +76,22 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
TESTPCGFLAG(AcctLRU, ACCT_LRU)
TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)

+#ifdef NODE_NOT_IN_PAGE_FLAGS
static inline int page_cgroup_nid(struct page_cgroup *pc)
{
- return page_to_nid(pc->page);
+ struct page *page= page_cgroup_to_page(pc);
+ return page_to_nid(page);
}
+#else
+static inline int page_cgroup_nid(struct page_cgroup *pc)
+{
+ return (pc->flags >> NODES_PGSHIFT) & NODES_MASK;
+}
+#endif

static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
{
- return page_zonenum(pc->page);
+ return (pc->flags >> ZONEID_PGSHIFT) & ZONEID_MASK;
}

static inline void lock_page_cgroup(struct page_cgroup *pc)
Index: mmotm-2.6.31-Jul16/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.31-Jul16.orig/mm/page_cgroup.c
+++ mmotm-2.6.31-Jul16/mm/page_cgroup.c
@@ -13,9 +13,12 @@
static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
- pc->flags = 0;
+ unsigned long flags;
+
pc->mem_cgroup = NULL;
- pc->page = pfn_to_page(pfn);
+ /* Copy NODE/ZONE/SECTION information from struct page */
+ flags = pfn_to_page(pfn)->flags;
+ pc->flags = flags & ~((1 << __NR_PAGEFLAGS) - 1);
INIT_LIST_HEAD(&pc->lru);
}
static unsigned long total_usage;
@@ -42,6 +45,18 @@ struct page_cgroup *lookup_page_cgroup(s
return base + offset;
}

+struct page *page_cgroup_to_page(struct page_cgroup *pc)
+{
+ int nid = (pc->flags >> NODES_PGSHIFT) & NODES_MASK;
+ unsigned long pfn, offset;
+
+ offset = pc - NODE_DATA(nid)->node_page_cgroup;
+ pfn = NODE_DATA(nid)->node_start_pfn + offset;
+
+ return pfn_to_page(pfn);
+}
+
+
static int __init alloc_node_page_cgroup(int nid)
{
struct page_cgroup *base, *pc;
@@ -104,6 +119,18 @@ struct page_cgroup *lookup_page_cgroup(s
return section->page_cgroup + pfn;
}

+struct page *page_cgroup_to_page(struct page_cgroup *pc)
+{
+ unsigned long pfn, sectionid;
+ struct mem_section *section;
+
+ sectionid = (pc->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
+ section = __nr_to_section(sectionid);
+
+ pfn = pc - section->page_cgroup;
+ return pfn_to_page(pfn);
+}
+
/* __alloc_bootmem...() is protected by !slab_available() */
static int __init_refok init_section_page_cgroup(unsigned long pfn)
{
@@ -128,15 +155,10 @@ static int __init_refok init_section_pag
}
} else {
/*
- * We don't have to allocate page_cgroup again, but
- * address of memmap may be changed. So, we have to initialize
- * again.
+ * We don't have to allocate page_cgroup again, and we don't
+ * take care of address of memmap.
*/
- base = section->page_cgroup + pfn;
- table_size = 0;
- /* check address of memmap is changed or not. */
- if (base->page == pfn_to_page(pfn))
- return 0;
+ return 0;
}

if (!base) {

2009-07-22 08:48:01

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework

On Wed, 22 Jul 2009 14:07:22 +0530
Balbir Singh <[email protected]> wrote:

> * Ryo Tsuruta <[email protected]> [2009-07-22 17:28:43]:
>
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > > > Index: linux-2.6.31-rc3/include/linux/page_cgroup.h
> > > > > > ===================================================================
> > > > > > --- linux-2.6.31-rc3.orig/include/linux/page_cgroup.h
> > > > > > +++ linux-2.6.31-rc3/include/linux/page_cgroup.h
> > > > > > @@ -1,7 +1,7 @@
> > > > > > #ifndef __LINUX_PAGE_CGROUP_H
> > > > > > #define __LINUX_PAGE_CGROUP_H
> > > > > >
> > > > > > -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > > > > +#ifdef CONFIG_CGROUP_PAGE
> > > > > > #include <linux/bit_spinlock.h>
> > > > > > /*
> > > > > > * Page Cgroup can be considered as an extended mem_map.
> > > > > > @@ -12,9 +12,11 @@
> > > > > > */
> > > > > > struct page_cgroup {
> > > > > > unsigned long flags;
> > > > > > - struct mem_cgroup *mem_cgroup;
> > > > > > struct page *page;
> > > > > > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > > > > + struct mem_cgroup *mem_cgroup;
> > > > > > struct list_head lru; /* per cgroup LRU list */
> > > > > > +#endif
> > > > > > };
> > > > >
> > > > > If CONFIG_CGROUP_MEM_RES_CTLR is not enabled and CGROUP_PAGE is
> > > > > (assuming that the depends on below is refactored), what would this
> > > > > change buy us? What is page_cgroup helping us track, the mem_cgroup is
> > > > > factored out, so we are interested in the flags only?
> > > > >
> > > > plz remove CONFIG. This jsut makes code complicated.
> > > > or plz use your own infrastructure, not depends on page_cgroup.
> >
> > Thanks for reviewing the patch.
> > Do you mean that remove only CONFIG_CGROUP_MEM_RES_CTR in struct
> > page_cgroup? Is it OK to define CONFIG_CGROUP_PAGE?
> >
> > > BTW, you can't modify page_cgroup->flags bit without cmpxchg.
> > > Then, patch [5/9] is completely broken, now because new bit is used
> > > with atomic bit ops but without lock_page_cgroup(). (see mmotm)
> > >
> > > Why struct page's flags bit can includes zone id etc...is just because
> > > it's initalized before using. Anyway, this is "flags" bit. If you want
> > > to modify multiple bit at once, plz use cmpxchg.
> > > Then, I buy patch [8/9] and just skip this patch.
> >
> > O.K. I'll use cmpxchg.
> >
> > > But, following is more straightforward. (and what you do is not different
> > > from this.)
> > > ==
> > > struct page {
> > > .....
> > > #ifdef CONFIG_BLOCKIO_CGROUP
> > > void *blockio_cgroup;
> > > #endif
> > > }
> > > ==
> >
> > This increases the size of struct page. Could I get a consensus on
> > this approach?
> >
>
>
> This defeats the entire purpose of page_cgroup, IMHO. You need to add
> the cgroup pointer to page_cgroup or use css id's there.
>
My point is:
- increasing the size of struct page is very difficult.
- increasing the size of struct page_cgroup should be just as difficult.
Any difference?

So please don't go this way without a sufficient amount of effort.
Please see my patch to reduce the size of struct page_cgroup; it is _an_ effort.

Thanks,
-Kame


2009-07-22 09:30:14

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 3/9] blkio-cgroup-v9: The new page_cgroup framework

KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Wed, 22 Jul 2009 17:28:43 +0900 (JST)
> Ryo Tsuruta <[email protected]> wrote:
> > > But, following is more straightforward. (and what you do is not different
> > > from this.)
> > > ==
> > > struct page {
> > > .....
> > > #ifdef CONFIG_BLOCKIO_CGROUP
> > > void *blockio_cgroup;
> > > #endif
> > > }
> > > ==
> >
> > This increases the size of struct page. Could I get a consensus on
> > this approach?
> >
> Just God knows ;)
>
> To be honest, what I expected in these days for people of blockio cgroup is like
> following for getting room for themselves.
>
> I'm now thinking to do this by myself and offer a room for you because
> terrible bugs have been gone now and I have time.

That would be very nice for blkio-cgroup; it would make blkio-cgroup simpler
and faster at tracking down the owner of an I/O request.

Thanks,
Ryo Tsuruta

> Balbir, if you have no concerns, I'll clean up and send this to mmotm.
> (maybe softlimit accesses pc->page and I have to update this.)
>
> Note: This is _not_ tested at all.
>
> Thanks,
> -Kame
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> page cgroup has pointer to memmap it stands for.
> But, page_cgroup->page is not accessed in fast path and not necessary
> and not modified. Then, it's not to be maintained as pointer.
>
> This patch removes "page" from page_cgroup and add
> page_cgroup_to_page() function. This uses some amount of FLAGS bit
> as struct page does.
> As side effect, nid, zid can be obtaind from page_cgroup itself.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/page_cgroup.h | 19 ++++++++++++++++---
> mm/page_cgroup.c | 42 ++++++++++++++++++++++++++++++++----------
> 2 files changed, 48 insertions(+), 13 deletions(-)
>
> Index: mmotm-2.6.31-Jul16/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-2.6.31-Jul16.orig/include/linux/page_cgroup.h
> +++ mmotm-2.6.31-Jul16/include/linux/page_cgroup.h
> @@ -13,7 +13,7 @@
> struct page_cgroup {
> unsigned long flags;
> struct mem_cgroup *mem_cgroup;
> - struct page *page;
> + /* block io tracking will use extra unsigned long bytes */
> struct list_head lru; /* per cgroup LRU list */
> };
>
> @@ -32,7 +32,12 @@ static inline void __init page_cgroup_in
> #endif
>
> struct page_cgroup *lookup_page_cgroup(struct page *page);
> +struct page *page_cgroup_to_page(struct page_cgroup *page);
>
> +/*
> + * The top-most (NODE_SHIFT+ZONE_SHIFT or SECTION_SHIFT) bits of "flags" are
> + * used for deriving the pfn, as struct page does.
> + */
> enum {
> /* flags for mem_cgroup */
> PCG_LOCK, /* page cgroup is locked */
> @@ -71,14 +76,22 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
> TESTPCGFLAG(AcctLRU, ACCT_LRU)
> TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>
> +#ifdef NODE_NOT_IN_PAGE_FLAGS
> static inline int page_cgroup_nid(struct page_cgroup *pc)
> {
> - return page_to_nid(pc->page);
> + struct page *page= page_cgroup_to_page(pc);
> + return page_to_nid(page);
> }
> +#else
> +static inline int page_cgroup_nid(struct page_cgroup *pc)
> +{
> + return (pc->flags >> NODES_PGSHIFT) & NODES_MASK;
> +}
> +#endif
>
> static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
> {
> - return page_zonenum(pc->page);
> + return (pc->flags >> ZONEID_PGSHIFT) & ZONEID_MASK;
> }
>
> static inline void lock_page_cgroup(struct page_cgroup *pc)
> Index: mmotm-2.6.31-Jul16/mm/page_cgroup.c
> ===================================================================
> --- mmotm-2.6.31-Jul16.orig/mm/page_cgroup.c
> +++ mmotm-2.6.31-Jul16/mm/page_cgroup.c
> @@ -13,9 +13,12 @@
> static void __meminit
> __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
> {
> - pc->flags = 0;
> + unsigned long flags;
> +
> pc->mem_cgroup = NULL;
> - pc->page = pfn_to_page(pfn);
> + /* Copy NODE/ZONE/SECTION information from struct page */
> + flags = pfn_to_page(pfn)->flags;
> + pc->flags = flags & ~((1 << __NR_PAGEFLAGS) - 1);
> INIT_LIST_HEAD(&pc->lru);
> }
> static unsigned long total_usage;
> @@ -42,6 +45,18 @@ struct page_cgroup *lookup_page_cgroup(s
> return base + offset;
> }
>
> +struct page *page_cgroup_to_page(struct page_cgroup *pc)
> +{
> + int nid = (pc->flags >> NODES_PGSHIFT) & NODES_MASK;
> + unsigned long pfn, offset;
> +
> + offset = pc - NODE_DATA(nid)->node_page_cgroup;
> + pfn = NODE_DATA(nid)->node_start_pfn + offset;
> +
> + return pfn_to_page(pfn);
> +}
> +
> +
> static int __init alloc_node_page_cgroup(int nid)
> {
> struct page_cgroup *base, *pc;
> @@ -104,6 +119,18 @@ struct page_cgroup *lookup_page_cgroup(s
> return section->page_cgroup + pfn;
> }
>
> +struct page *page_cgroup_to_page(struct page_cgroup *pc)
> +{
> + unsigned long pfn, sectionid;
> + struct mem_section *section;
> +
> + sectionid = (pc->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
> + section = __nr_to_section(sectionid);
> +
> + pfn = pc - section->page_cgroup;
> + return pfn_to_page(pfn);
> +}
> +
> /* __alloc_bootmem...() is protected by !slab_available() */
> static int __init_refok init_section_page_cgroup(unsigned long pfn)
> {
> @@ -128,15 +155,10 @@ static int __init_refok init_section_pag
> }
> } else {
> /*
> - * We don't have to allocate page_cgroup again, but
> - * address of memmap may be changed. So, we have to initialize
> - * again.
> + * We don't have to allocate page_cgroup again, and we don't
> + * take care of address of memmap.
> */
> - base = section->page_cgroup + pfn;
> - table_size = 0;
> - /* check address of memmap is changed or not. */
> - if (base->page == pfn_to_page(pfn))
> - return 0;
> + return 0;
> }
>
> if (!base) {
>

2009-07-22 09:40:57

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Tue, 21 Jul 2009 23:23:16 +0900 (JST)
> Ryo Tsuruta <[email protected]> wrote:
>
> > This patch contains several hooks that let the blkio-cgroup framework to know
> > which blkio-cgroup is the owner of a page before starting I/O against the page.
>
> > @@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page
> > gfp_mask & GFP_RECLAIM_MASK);
> > if (error)
> > goto out;
> > + blkio_cgroup_set_owner(page, current->mm);
> >
>
> This part is doubtful...Is this necessary ?
> I recommend you that the caller should attach owner by itself.

I think that it is reasonable to add the hook right here rather than
to add many hooks to a variety of places.

> > error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> > if (error == 0) {
> > Index: linux-2.6.31-rc3/mm/memory.c
> > ===================================================================
> > --- linux-2.6.31-rc3.orig/mm/memory.c
> > +++ linux-2.6.31-rc3/mm/memory.c
> > @@ -51,6 +51,7 @@
> > #include <linux/init.h>
> > #include <linux/writeback.h>
> > #include <linux/memcontrol.h>
> > +#include <linux/biotrack.h>
> > #include <linux/mmu_notifier.h>
> > #include <linux/kallsyms.h>
> > #include <linux/swapops.h>
> > @@ -2115,6 +2116,7 @@ gotten:
> > */
> > ptep_clear_flush_notify(vma, address, page_table);
> > page_add_new_anon_rmap(new_page, vma, address);
> > + blkio_cgroup_set_owner(new_page, mm);
>
> plz do this in swap-out code.
>
> > set_pte_at(mm, address, page_table, entry);
> > update_mmu_cache(vma, address, entry);
> > if (old_page) {
> > @@ -2580,6 +2582,7 @@ static int do_swap_page(struct mm_struct
> > flush_icache_page(vma, page);
> > set_pte_at(mm, address, page_table, pte);
> > page_add_anon_rmap(page, vma, address);
> > + blkio_cgroup_reset_owner(page, mm);
>
> and this.
>
>
> > /* It's better to call commit-charge after rmap is established */
> > mem_cgroup_commit_charge_swapin(page, ptr);
> >
> > @@ -2644,6 +2647,7 @@ static int do_anonymous_page(struct mm_s
> > goto release;
> > inc_mm_counter(mm, anon_rss);
> > page_add_new_anon_rmap(page, vma, address);
> > + blkio_cgroup_set_owner(page, mm);
> > set_pte_at(mm, address, page_table, entry);
> >
> and this.
>
> > /* No need to invalidate - it was non-present before */
> > @@ -2791,6 +2795,7 @@ static int __do_fault(struct mm_struct *
> > if (anon) {
> > inc_mm_counter(mm, anon_rss);
> > page_add_new_anon_rmap(page, vma, address);
> > + blkio_cgroup_set_owner(page, mm);
> and this.
>
> IMHO, later io for swap-out is caused by the caller of swapout, not page's
> owner. plz charge to them or,
> - add special BLOCK CGROUP ID for the kernel's swap out.

I think that it is not too bad to charge the owner of a page for
swap-out. From another perspective, it can be considered that swap-out
is caused by a process which uses a large amount of memory.

Thanks,
Ryo Tsuruta

>
> Bye,
> -Kame
>
> > } else {
> > inc_mm_counter(mm, file_rss);
> > page_add_file_rmap(page);
> > Index: linux-2.6.31-rc3/mm/page-writeback.c
> > ===================================================================
> > --- linux-2.6.31-rc3.orig/mm/page-writeback.c
> > +++ linux-2.6.31-rc3/mm/page-writeback.c
> > @@ -23,6 +23,7 @@
> > #include <linux/init.h>
> > #include <linux/backing-dev.h>
> > #include <linux/task_io_accounting_ops.h>
> > +#include <linux/biotrack.h>
> > #include <linux/blkdev.h>
> > #include <linux/mpage.h>
> > #include <linux/rmap.h>
> > @@ -1247,6 +1248,7 @@ int __set_page_dirty_nobuffers(struct pa
> > BUG_ON(mapping2 != mapping);
> > WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> > account_page_dirtied(page, mapping);
> > + blkio_cgroup_reset_owner_pagedirty(page, current->mm);
> > radix_tree_tag_set(&mapping->page_tree,
> > page_index(page), PAGECACHE_TAG_DIRTY);
> > }
> > Index: linux-2.6.31-rc3/mm/swap_state.c
> > ===================================================================
> > --- linux-2.6.31-rc3.orig/mm/swap_state.c
> > +++ linux-2.6.31-rc3/mm/swap_state.c
> > @@ -18,6 +18,7 @@
> > #include <linux/pagevec.h>
> > #include <linux/migrate.h>
> > #include <linux/page_cgroup.h>
> > +#include <linux/biotrack.h>
> >
> > #include <asm/pgtable.h>
> >
> > @@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_e
> > */
> > __set_page_locked(new_page);
> > SetPageSwapBacked(new_page);
> > + blkio_cgroup_set_owner(new_page, current->mm);
> > err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > if (likely(!err)) {
> > /*

2009-07-23 00:20:32

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

On Wed, 22 Jul 2009 18:40:55 +0900 (JST)
Ryo Tsuruta <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> wrote:
> > On Tue, 21 Jul 2009 23:23:16 +0900 (JST)
> > Ryo Tsuruta <[email protected]> wrote:
> >
> > > This patch contains several hooks that let the blkio-cgroup framework to know
> > > which blkio-cgroup is the owner of a page before starting I/O against the page.
> >
> > > @@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page
> > > gfp_mask & GFP_RECLAIM_MASK);
> > > if (error)
> > > goto out;
> > > + blkio_cgroup_set_owner(page, current->mm);
> > >
> >
> > This part is doubtful...Is this necessary ?
> > I recommend you that the caller should attach owner by itself.
>
> I think that it is reasonable to add the hook right here rather than
> to add many hooks to a variety of places.
>
Why? At write time, it will be overwritten soon, IIUC. So this information
is misleading. Please add a hook like this where it means something; in this
case, in the read/write callers.
IMO, you just increase the patch's readability but decrease ease of maintenance.


> > IMHO, later io for swap-out is caused by the caller of swapout, not page's
> > owner. plz charge to them or,
> > - add special BLOCK CGROUP ID for the kernel's swap out.
>
> I think that it is not too bad to charge the owner of a page for
> swap-out. From another perspective, it can be considered that swap-out
> is caused by a process which uses a large amount of memory.
>
No. Swap-out is caused by a thread that requests memory while memory is
short. IMHO, I/O for memory reclaim should run at the priority of the memory
requester.

Consider the following situation.

- A process "A" has big memory. When several threads request memory, all
of them are caught by the blockio cgroup of "A".
- A process "B" has read big file caches. When several threads request memory,
all of them are caught by the blockio cgroup of "B".

If "A" and "B"'s thresholds are small, you'll see a big slowdown.
But it's not _planned_ behavior in many cases.

If you charge against the memory owner, the admin has to give a _big_ I/O
controller priority to "A" and "B" if they use much memory. I think the admin
can't design his system that way. It's nonsense to say "please set the I/O
limit proportional to the memory usage of your apps even if they never do I/O
in usual operation."

If this blockio cgroup is introduced, people will see *unexpected*, very
terrible slowdowns, and users will see heartbeat warnings/failover from
cluster management software. Please do I/O at the priority of the memory
reclaim requester.


Thanks,
-Kame



2009-07-23 06:38:55

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Wed, 22 Jul 2009 18:40:55 +0900 (JST)
> Ryo Tsuruta <[email protected]> wrote:
>
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > On Tue, 21 Jul 2009 23:23:16 +0900 (JST)
> > > Ryo Tsuruta <[email protected]> wrote:
> > >
> > > > This patch contains several hooks that let the blkio-cgroup framework to know
> > > > which blkio-cgroup is the owner of a page before starting I/O against the page.
> > >
> > > > @@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page
> > > > gfp_mask & GFP_RECLAIM_MASK);
> > > > if (error)
> > > > goto out;
> > > > + blkio_cgroup_set_owner(page, current->mm);
> > > >
> > >
> > > This part is doubtful...Is this necessary ?
> > > I recommend you that the caller should attach owner by itself.
> >
> > I think that it is reasonable to add the hook right here rather than
> > to add many hooks to a variety of places.
> >
> Why ? at writing, it's will be overwriten soon, IIUC. Then, this information
> is misleading. plz add a hook like this when it means something. In this case,
> read/write callers.
> IMO, you just increase patch's readbility but decrease easiness of maintaince.

Even though the owner is overwritten soon at write time, I'm not sure why
inserting the hook here is misleading. I think it makes it easy to
understand when and where the owner is set by blkio-cgroup, and it does
not decrease maintainability compared with putting many hooks in each
caller.

> > > IMHO, later io for swap-out is caused by the caller of swapout, not page's
> > > owner. plz charge to them or,
> > > - add special BLOCK CGROUP ID for the kernel's swap out.
> >
> > I think that it is not too bad to charge the owner of a page for
> > swap-out. From another perspective, it can be considered that swap-out
> > is caused by a process which uses a large amount of memory.
> >
> No. swap-out is caused by a thread who requests memory even while memory is
> in short. IMHO, I/O by memory reqraim should work in priority of memory requester.


>
> Consider following situation.
>
> - A process "A" has big memory. When several threads requests memory, all
> of them are caught by a blockio cgroup of "A".
> - A process "B" has read big file caches. When several threads requests memory,
> all of them are caught by a blockio cgroup of "B".
>
> If "A" and "B" 's threshold is small, you'll see big slow down.
> But it's not _planned_ behavior in many cases.
>
> If you charges agaisnt memory owner, the admin has to set _big_ priority of I/O
> controller to "A" and "B" if it uses much memory. I think the admin can't design
> his system. It's nonsense to say "plz set I/O limit propotional to memory usage of
> your apps even if it never do I/O in usual."
>
> If this blockio cgroup is introduced, people will see *unexpected* very
> terrible slow down and the user will see heartbeat warnings/failover by cluster
> management software. Please do I/O at the priority of memory reclaiming requester.

dm-ioband gives high priority to I/O for swap-out by checking whether
PG_swapcache flag is set on the I/O page, regardless of the assigned
I/O bandwidth, and the bandwidth consumed for swap-out is charged to
the owner of the pages as a debt.
How about this approach?
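
Roughly, the check is along the following lines (a simplified sketch, not
the actual dm-ioband code; the helper name is made up for illustration):
==
#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/mm.h>

/*
 * Treat a write whose page sits in the swap cache as urgent swap-out I/O.
 * Simplified: only the first bio_vec of the bio is examined.
 */
static int ioband_bio_is_swapout(struct bio *bio)
{
	struct page *page = bio_iovec_idx(bio, 0)->bv_page;

	return bio_data_dir(bio) == WRITE && page && PageSwapCache(page);
}
==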

Thanks,
Ryo Tsuruta

>
>
> Thanks,
> -Kame
>

2009-07-23 07:51:22

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

On Thu, 23 Jul 2009 15:38:43 +0900 (JST)
Ryo Tsuruta <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> wrote:
> > On Wed, 22 Jul 2009 18:40:55 +0900 (JST)
> > Ryo Tsuruta <[email protected]> wrote:
> >
> > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > On Tue, 21 Jul 2009 23:23:16 +0900 (JST)
> > > > Ryo Tsuruta <[email protected]> wrote:
> > > >
> > > > > This patch contains several hooks that let the blkio-cgroup framework to know
> > > > > which blkio-cgroup is the owner of a page before starting I/O against the page.
> > > >
> > > > > @@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page
> > > > > gfp_mask & GFP_RECLAIM_MASK);
> > > > > if (error)
> > > > > goto out;
> > > > > + blkio_cgroup_set_owner(page, current->mm);
> > > > >
> > > >
> > > > This part is doubtful...Is this necessary ?
> > > > I recommend you that the caller should attach owner by itself.
> > >
> > > I think that it is reasonable to add the hook right here rather than
> > > to add many hooks to a variety of places.
> > >
> > Why ? at writing, it's will be overwriten soon, IIUC. Then, this information
> > is misleading. plz add a hook like this when it means something. In this case,
> > read/write callers.
> > IMO, you just increase patch's readbility but decrease easiness of maintaince.
>
> Even though the owner is overwritten soon at writing, I'm not sure why
> inserting the hook here causes the misleading. I think that it is easy
> to understand when and where the owner is set by blkio-cgroup, and it
> does not decrease maintainability, rather than put many hooks to each
> caller.
>
Are there _many_ callers? I don't think so.
But okay, I won't raise strong objections any more if others say it's ok.

BTW, some sad news for you:
you can't call lock_page_cgroup() under radix_tree->lock.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e767e0561d7fd2333df1921f1ab4176211f9036b

Please update.

> >
> > Consider following situation.
> >
> > - A process "A" has big memory. When several threads requests memory, all
> > of them are caught by a blockio cgroup of "A".
> > - A process "B" has read big file caches. When several threads requests memory,
> > all of them are caught by a blockio cgroup of "B".
> >
> > If "A" and "B" 's threshold is small, you'll see big slow down.
> > But it's not _planned_ behavior in many cases.
> >
> > If you charges agaisnt memory owner, the admin has to set _big_ priority of I/O
> > controller to "A" and "B" if it uses much memory. I think the admin can't design
> > his system. It's nonsense to say "plz set I/O limit propotional to memory usage of
> > your apps even if it never do I/O in usual."
> >
> > If this blockio cgroup is introduced, people will see *unexpected* very
> > terrible slow down and the user will see heartbeat warnings/failover by cluster
> > management software. Please do I/O at the priority of memory reclaiming requester.
>
> dm-ioband gives high priority to I/O for swap-out by checking whether
> PG_swapcache flag is set on the I/O page, regardless of the assigned
> I/O bandwidth, and the bandwidth consumed for swap-out is charged to
> the owner of the pages as a debt.
> How about this approach?

I don't think it's reasonable. Why should an I/O device or scheduler know
about such mm-related information? I think the layering is wrong.
And your approach cannot even be a workaround.

In the following _typical_ case:

- A process does small logging to /var/log/mylog, once a second,
but it uses some amount of cold memory or shmem.

This process's logging will be delayed _unexpectedly_ by some buggy process
which has a memory leak.


Thanks,
-Kame

2009-07-23 10:02:55

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Thu, 23 Jul 2009 15:38:43 +0900 (JST)
> Ryo Tsuruta <[email protected]> wrote:
>
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > On Wed, 22 Jul 2009 18:40:55 +0900 (JST)
> > > Ryo Tsuruta <[email protected]> wrote:
> > >
> > > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > > On Tue, 21 Jul 2009 23:23:16 +0900 (JST)
> > > > > Ryo Tsuruta <[email protected]> wrote:
> > > > >
> > > > > > This patch contains several hooks that let the blkio-cgroup framework to know
> > > > > > which blkio-cgroup is the owner of a page before starting I/O against the page.
> > > > >
> > > > > > @@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page
> > > > > > gfp_mask & GFP_RECLAIM_MASK);
> > > > > > if (error)
> > > > > > goto out;
> > > > > > + blkio_cgroup_set_owner(page, current->mm);
> > > > > >
> > > > >
> > > > > This part is doubtful...Is this necessary ?
> > > > > I recommend you that the caller should attach owner by itself.
> > > >
> > > > I think that it is reasonable to add the hook right here rather than
> > > > to add many hooks to a variety of places.
> > > >
> > > Why ? at writing, it's will be overwriten soon, IIUC. Then, this information
> > > is misleading. plz add a hook like this when it means something. In this case,
> > > read/write callers.
> > > IMO, you just increase patch's readbility but decrease easiness of maintaince.
> >
> > Even though the owner is overwritten soon at writing, I'm not sure why
> > inserting the hook here causes the misleading. I think that it is easy
> > to understand when and where the owner is set by blkio-cgroup, and it
> > does not decrease maintainability, rather than put many hooks to each
> > caller.
> >
> Are there _many_ callers ? I don't think so.
> But okay, I don't say strong objections more if other ones say ok.
>
> BTW, a sad information for you.
> you can't call lock_page_cgroup() under radix_tree->lock.
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e767e0561d7fd2333df1921f1ab4176211f9036b
>
> plz update.

Thank you for letting me know. I'll fix it.

>
> > >
> > > Consider following situation.
> > >
> > > - A process "A" has big memory. When several threads requests memory, all
> > > of them are caught by a blockio cgroup of "A".
> > > - A process "B" has read big file caches. When several threads requests memory,
> > > all of them are caught by a blockio cgroup of "B".
> > >
> > > If "A" and "B" 's threshold is small, you'll see big slow down.
> > > But it's not _planned_ behavior in many cases.
> > >
> > > If you charges agaisnt memory owner, the admin has to set _big_ priority of I/O
> > > controller to "A" and "B" if it uses much memory. I think the admin can't design
> > > his system. It's nonsense to say "plz set I/O limit propotional to memory usage of
> > > your apps even if it never do I/O in usual."
> > >
> > > If this blockio cgroup is introduced, people will see *unexpected* very
> > > terrible slow down and the user will see heartbeat warnings/failover by cluster
> > > management software. Please do I/O at the priority of memory reclaiming requester.
> >
> > dm-ioband gives high priority to I/O for swap-out by checking whether
> > PG_swapcache flag is set on the I/O page, regardless of the assigned
> > I/O bandwidth, and the bandwidth consumed for swap-out is charged to
> > the owner of the pages as a debt.
> > How about this approach?
>
> I don't think it's reasonable. Why I/O device, scheduler should know about
> such mm-related information ? I think layering is wrong.

I think that urgent I/O requests such as swap-out should be marked by
setting a special flag in the struct bio, but there is no such mechanism
at this time. That is why dm-ioband uses this approach.
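
Such a mechanism might look roughly like this (purely hypothetical -- no such
flag exists in the kernel today; the bit name and value are made up only to
illustrate the idea):
==
#include <linux/bio.h>

/* Hypothetical flag: a bi_rw bit that the kernel does not currently use. */
#define BIO_RW_URGENT	20

/* Set by the swap-out path when the bio is built. */
static inline void bio_mark_urgent(struct bio *bio)
{
	bio->bi_rw |= (1UL << BIO_RW_URGENT);
}

/* Checked by the bandwidth controller to let urgent I/O bypass throttling. */
static inline int bio_is_urgent(struct bio *bio)
{
	return (bio->bi_rw & (1UL << BIO_RW_URGENT)) != 0;
}
==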

> And your approatch cannot be a workaround.
>
> In follwing _typical_ case,
>
> - A process does small logging to /var/log/mylog, once in a sec.
> but it uses some amount of cold memory or shmem.
>
> This process's logging will be delayed _unexpectedly_ by some buggy process
> which does memory leak.

Do you mean that the delay in logging is caused since the small process
is swapped out unexpectedly by the buggy processes?
How about using memory cgroup to prevent the small process from swap-out?
I would appreciate it if you could tell me more about this.

Thanks,
Ryo Tsuruta

>
> Thanks,
> -Kame

2009-07-23 12:02:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

Ryo Tsuruta wrote:
> KAMEZAWA Hiroyuki <[email protected]> wrote:

>> > dm-ioband gives high priority to I/O for swap-out by checking whether
>> > PG_swapcache flag is set on the I/O page, regardless of the assigned
>> > I/O bandwidth, and the bandwidth consumed for swap-out is charged to
>> > the owner of the pages as a debt.
>> > How about this approach?
>>
>> I don't think it's reasonable. Why I/O device, scheduler should know
>> about
>> such mm-related information ? I think layering is wrong.
>
> I think that urgent I/O requests such as swap-out should be notified
> by setting a special flag in the struct bio, but there is no such
> mechanism at this time. That is why dm-ioband uses this approach.
>
>> And your approatch cannot be a workaround.
>>
>> In follwing _typical_ case,
>>
>> - A process does small logging to /var/log/mylog, once in a sec.
>> but it uses some amount of cold memory or shmem.
>>
>> This process's logging will be delayed _unexpectedly_ by some buggy
>> process
>> which does memory leak.
>
> Do you mean that the delay in logging is caused since the small process
> is swapped out unexpectedly by the buggy processes?
I didn't write "small process" or "small logging".
The buggy process causes swap-out and consumes someone else's bandwidth, and
then the logging will be delayed. What is important here is to throttle the
bandwidth consumed by the buggy process, not others'.

> How about using memory cgroup to prevent the small process from swap-out?
It will never help if memcg is not configured.

My point is "don't allow anyone to use bandwidth of others."
Considering job isolation, a thread who requests swap-out should be charged
against bandwidth.

Thanks,
-Kame


2009-07-24 05:44:19

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

"KAMEZAWA Hiroyuki" <[email protected]> wrote:
> Ryo Tsuruta wrote:
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> >> > dm-ioband gives high priority to I/O for swap-out by checking whether
> >> > PG_swapcache flag is set on the I/O page, regardless of the assigned
> >> > I/O bandwidth, and the bandwidth consumed for swap-out is charged to
> >> > the owner of the pages as a debt.
> >> > How about this approach?
> >>
> >> I don't think it's reasonable. Why I/O device, scheduler should know
> >> about
> >> such mm-related information ? I think layering is wrong.
> >
> > I think that urgent I/O requests such as swap-out should be notified
> > by setting a special flag in the struct bio, but there is no such
> > mechanism at this time. That is why dm-ioband uses this approach.
> >
> >> And your approatch cannot be a workaround.
> >>
> >> In follwing _typical_ case,
> >>
> >> - A process does small logging to /var/log/mylog, once in a sec.
> >> but it uses some amount of cold memory or shmem.
> >>
> >> This process's logging will be delayed _unexpectedly_ by some buggy
> >> process
> >> which does memory leak.
> >
> > Do you mean that the delay in logging is caused since the small process
> > is swapped out unexpectedly by the buggy processes?
> I don't write "small process", "small logging".
> Buggy process does swap-out and cosumes someone else's bandwidth, then,
> loggind will be delayed. Important here is throttle bandwidth consumed by
> buggy prorcess, not other's.

Thank you for explaining it.

> > How about using memory cgroup to prevent the small process from swap-out?
> It never be help if memcg is not configured.

blkio-cgroup is recommended to be used together with memcg. I think that
can be a good solution to resolve such a problem.

> My point is "don't allow anyone to use bandwidth of others."
> Considering job isolation, a thread who requests swap-out should be charged
> against bandwidth.

From another perspective, the swap-out happens because the buggy process
uses a large amount of memory, so it can be considered that the logging
process's bandwidth is used because of the buggy process.

Please consider the following case. If the thread that requests swap-out
is charged, that thread gets charged for other threads' I/O.

     (1)                      --------      (2)
  Process A                  |        |   Process B
  mmaps a large area in -->  | memory |   <-- tries to allocate a page.
  the memory and writes      |        |
  data to there.              --------      (3)
                                  |         To get a free page,
                                  |         the data written by Proc.A
                                  |         is written out to the disk.
                                  V         The I/O is done by using
                               --------     Proc.B's bandwidth.
                              |  disk  |
                               --------

Thus I think that page owners should be charged against bandwidth.

Thanks,
Ryo Tsuruta

2009-07-24 06:21:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

On Fri, 24 Jul 2009 14:44:16 +0900 (JST)
Ryo Tsuruta <[email protected]> wrote:
good solution to resolve such problem.
>
> > My point is "don't allow anyone to use bandwidth of others."
> > Considering job isolation, a thread who requests swap-out should be charged
> > against bandwidth.
>
> From another perspective, the swap-out is caused since the buggy
> process uses a large amount of memory, so it can be considered as
> the bandwidth of logging process is used due to the buggy process.
>
> Please consider the following case. If a thread who requests swap-out
> is charged, the thread is charged other threads' I/O.
>
> (1) -------- (2)
> Process A | | Process B
> mmaps a large area in --> | memory | <-- tries to allocate a page.
> the memory and writes | |
> data to there. -------- (3)
> | To get a free page,
> | the data written by Proc.A
> | is written out to the disk.
> V The I/O is done by using
> --------- Proc.B's bandwidth.
> | disk |
> ---------
>
> Thus I think that page owners should be charged against bandwidth.
>
Ok, there is no good way. Yours is wrong, and mine is wrong, too.
Please find a reasonable third way.

Below is my brief thinking.

"Why should process A be charged for I/O when it just maps anon memory?"
I can't answer this.

Even in your case, process B requests memory and gets the penalty. That's
very natural, I think.

In the usual case,
- if process A maps ANON, there will be no I/O.
- if process A maps FILE, it will be charged to process A.
ok?

Under memory pressure,
- if process A maps ANON, swap I/O should be charged to process B.
- if process A maps FILE, I/O should be charged to process A.
maybe.

Anyway, there will be interaction with the dirty_ratio of memcg (not
implemented yet), and the _owner should be charged_ issue will be handled in
that dirty_ratio layer.
More consideration is necessary, I think.


Bye,
-Kame

2009-07-24 08:48:55

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [PATCH 7/9] blkio-cgroup-v9: Page tracking hooks

KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Fri, 24 Jul 2009 14:44:16 +0900 (JST)
> Ryo Tsuruta <[email protected]> wrote:
> good solution to resolve such problem.
> >
> > > My point is "don't allow anyone to use bandwidth of others."
> > > Considering job isolation, a thread who requests swap-out should be charged
> > > against bandwidth.
> >
> > From another perspective, the swap-out is caused since the buggy
> > process uses a large amount of memory, so it can be considered as
> > the bandwidth of logging process is used due to the buggy process.
> >
> > Please consider the following case. If a thread who requests swap-out
> > is charged, the thread is charged other threads' I/O.
> >
> > (1) -------- (2)
> > Process A | | Process B
> > mmaps a large area in --> | memory | <-- tries to allocate a page.
> > the memory and writes | |
> > data to there. -------- (3)
> > | To get a free page,
> > | the data written by Proc.A
> > | is written out to the disk.
> > V The I/O is done by using
> > --------- Proc.B's bandwidth.
> > | disk |
> > ---------
> >
> > Thus I think that page owners should be charged against bandwidth.
> >
> Ok, no good way. yours is wrong, mine is wrong, too.
> plz find 3rd way, reasonable.
>
> Below is brief thinking.
>
> "Why process A should be charged to I/O when it just maps anon memory ?"
> I can't answer this.
>
> Even in yorr case, Process B requests memory and get penalty. It's
> very natural, I think.
>
> In usual case,
> - if process A maps ANON, there will be no I/O.
> - if process A maps FILE, it will be charged to process A.
> ok ?
>
> Under memory pressure,
> - if process A maps ANON, swap I/O should be charged to process B.
> - if process A maps FILE, I/O should be charged to process A.
> maybe.

I think that even if process A maps ANON, the I/O should be charged to
process A, because the memory pressure is caused by process A. It seems
natural to me that a process which consumes more resources is more likely
to get the penalty.

> Anyway, there will be ineraction with dirty_ratio of memcg (not implemeted yet)
> and _Owner should be charged_ issue will be handled in this dirty_ratio layer.
> More consideration is necessary, I think.

I'll keep thinking about how it should be done.

Thanks,
Ryo Tsuruta