2009-09-24 19:27:52

by Vivek Goyal

Subject: IO scheduler based IO controller V10


Hi All,

Here is V10 of the IO controller patches, generated on top of 2.6.31.

For ease of patching, a consolidated patch is available here.

http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v10.patch

Changes from V9
===============
- Brought back the mechanism of idle trees (a cache of recently served io
queues). BFQ had originally implemented it and I had gotten rid of it. Later
I realized that it helps provide fairness when io queues and io groups are
running at the same level, hence I brought the mechanism back.

This cache helps in determining whether a task getting back into the tree
is a streaming reader that just consumed its full slice length, a new process
(if not in the cache), or a random reader that just got a small slice length
and now got backlogged again.

- Implemented "wait busy" for sequential reader queues. So we wait for one
extra idle period for these queues to become busy so that group does not
loose fairness. This works even if group_idle=0.

- Fixed an issue where readers don't preempt writers within a group when
readers get backlogged (implemented late preemption).

- Fixed the issue reported by Gui where Anticipatory was not expiring the
queue.

- Did more modifications to AS so that it lets the common layer know that it
is anticipating on the next request and the common fair queuing layer does
not try to do excessive queue expirations.

- Started charging the queue only for the allocated slice length (if "fairness"
is not set) even if it consumed more than the allocated slice. Otherwise that
queue can miss a dispatch round, doubling the max latencies. This idea is
also borrowed from BFQ.

- Allowed preemption so that a reader can preempt a writer running in a
sibling group, or a metadata reader can preempt a non-metadata reader in a
sibling group.

- Fixed freed_request() issue pointed out by Nauman.

What problem are we trying to solve
===================================
Provide a group IO scheduling feature in Linux along the lines of other resource
controllers like cpu.

IOW, provide a facility so that a user can group applications using cgroups and
control the amount of disk time/bandwidth received by a group based on its
weight.

How to solve the problem
=========================

Different people have solved the issue differently. So far it looks like
there are the following two core requirements when it comes to
fairness at the group level.

- Control of the bandwidth seen by groups.
- Control of latencies when a request gets backlogged in a group.

There are now at least three patchsets available (including this one).

IO throttling
-------------
This is a bandwidth controller which keeps track of the IO rate of a group and
throttles processes in the group if the group exceeds the user specified limit.

dm-ioband
---------
This is a proportional bandwidth controller implemented as a device mapper
driver, and it provides fair access in terms of the amount of IO done (not in
terms of disk time, as CFQ does).

So one sets up one or more dm-ioband devices on top of a physical/logical
block device, configures the ioband device, and passes in information like
grouping. This device then keeps track of the bios flowing through it and
controls the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here we have viewed the problem of the IO controller as a hierarchical group
scheduling issue (along the lines of CFS group scheduling). Currently one can
view Linux IO schedulers as flat, where there is one root group and all the IO
belongs to that group.

This patchset basically modifies IO schedulers to also support hierarchical
group scheduling. CFQ already provides fairness among different processes; I
have extended it to support group IO scheduling. I also took some of the code
out of CFQ and put it in a common layer so that the same group scheduling code
can be used by noop, deadline and AS to support group scheduling.

Pros/Cons
=========
There are pros and cons to each of the approaches. Following are some
thoughts.

Max bandwidth vs proportional bandwidth
---------------------------------------
IO throttling is a max bandwidth controller and not a proportional one.
Additionally, it provides fairness in terms of the amount of IO done (and not
in terms of disk time, as CFQ does).

Personally, I think that a proportional weight controller is useful to more
people than just a max bandwidth controller. In addition, the IO scheduler
based controller can also be enhanced to do max bandwidth control, so it can
satisfy a wider set of requirements.

Fairness in terms of disk time vs size of IO
---------------------------------------------
A higher level controller will most likely be limited to providing fairness
in terms of the size/number of IOs done and will find it hard to provide fairness
in terms of disk time used (as CFQ provides between various prio levels). This
is because only the IO scheduler knows how much disk time a queue has used, and
information about queues and disk time used is not exported to higher
layers.

So a seeky application will still run away with a lot of disk time and bring
down the overall throughput of the disk.

Currently dm-ioband provides fairness in terms of number/size of IO.

Latencies and isolation between groups
--------------------------------------
A higher level controller generally implements a bandwidth throttling
solution: if a group exceeds either the max bandwidth or its proportional
share, that group is throttled.

This kind of approach will probably not help in controlling latencies, as that
will depend on the underlying IO scheduler. Consider the following scenario.

Assume there are two groups. One group is running multiple sequential readers
and the other group has a random reader. The sequential readers will get a nice
100ms slice each, and only then will the random reader from group2 get to
dispatch its request. So the latency of this random reader depends on how many
sequential readers are running in the other group, which is weak isolation
between groups.

When we control things at the IO scheduler level, we assign one time slice to
one group and then pick the next entity to run. So effectively after one time
slice (max 180ms, if a prio 0 sequential reader is running), the random reader
in the other group will get to run. Hence we achieve better isolation between
groups, as the response time of a process in a different group is generally not
dependent on the number of processes running in a competing group.

So a higher level solution is most likely limited to only shaping bandwidth
without any control over latencies.

Stacking group scheduler on top of CFQ can lead to issues
---------------------------------------------------------
IO throttling and dm-ioband are both second level controllers, i.e. these
controllers are implemented in layers higher than the io schedulers. They
control the IO at the higher layer based on group policies, and the IO
schedulers later take care of dispatching these bios to disk.

Implementing a second level controller has the advantage of being able to
provide bandwidth control even on logical block devices in the IO stack
which don't have any IO scheduler attached to them. But it can also
interfere with the IO scheduling policy of the underlying IO scheduler and
change the effective behavior. Following are some of the issues which I think
should be visible in a second level controller in one form or another.

Prio within group
-----------------
A second level controller can potentially interfere with the behavior of
different prio processes within a group. Bios are buffered at the higher layer
in a single queue, and the release of bios is FIFO and not proportionate to the
ioprio of the process. This can result in a particular prio level not
getting its fair share.

Buffering at the higher layer can delay read requests for more than the slice
idle period of CFQ (default 8 ms). That means it is possible that we are waiting
for a request from the queue but it is buffered at the higher layer, and then
the idle timer fires. The queue loses its share, and at the same time overall
throughput is impacted as we lost those 8 ms.

Read Vs Write
-------------
Writes can overwhelm readers, so a second level controller's FIFO release
will run into issues here. If a single queue is maintained, then reads
will suffer large latencies. If there are separate queues for reads and writes,
then it will be hard to decide in what ratio to dispatch reads and writes, as
it is the IO scheduler's job to decide when and how much read/write to
dispatch. This is another place where a higher level controller will not be in
sync with the lower level io scheduler and can change the effective policies of
the underlying io scheduler.

CFQ IO context Issues
---------------------
Buffering at the higher layer means bios are submitted later with the help of
a worker thread. This changes the io context information at the CFQ layer, which
assigns the request to the submitting thread. The change of io context info
again leads to issues of idle timer expiry, a process not getting its fair
share, and reduced throughput.

Throughput with noop, deadline and AS
---------------------------------------------
I think a higher level controller will result in reduced overall throughput
(as compared to an io scheduler based io controller) and more seeks with noop,
deadline and AS.

The reason is that IO within a group is likely to be related and
relatively close together compared to IO across groups; for example, the
thread pool of kvm-qemu doing IO for a virtual machine. In case of higher level
control, IO from various groups will go into a single queue at the lower level
and it might happen that IO is now interleaved (G1, G2, G1, G3,
G4....), causing more seeks and reduced throughput. (Agreed that merging will
help up to some extent, but still....)

Instead, in case of a lower level controller, the IO scheduler maintains one
queue per group, hence there is no interleaving of IO between groups. And if IO
is related within a group, then we should get a reduced number/amount of seeks
and higher throughput.

Latency can be a concern but that can be controlled by reducing the time
slice length of the queue.

Fairness at logical device level vs at physical device level
------------------------------------------------------------

The IO scheduler based controller has the limitation that it works only with
the bottom-most devices in the IO stack, where the IO scheduler is attached.

For example, assume a user has created a logical device lv0 using three
underlying disks sda, sdb and sdc. Also assume there are two tasks, T1 and T2,
in two groups doing IO on lv0, and that the weights of the groups are in the
ratio of 2:1, so T1 should get double the BW of T2 on the lv0 device.

                 T1    T2
                   \   /
                    lv0
                  /  |  \
               sda  sdb  sdc


Now resource control will take place only on devices sda, sdb and sdc and
not at the lv0 level. So if IO from the two tasks is relatively uniformly
distributed across the disks, then T1 and T2 will see throughput in
proportion to the weights specified. But if IO from T1 and T2 is going to
different disks and there is no contention, then at the higher level they both
will see the same BW.

Here a second level controller can produce better fairness numbers at the
logical device, but most likely at reduced overall throughput of the system,
because it will try to control IO even if there is no contention at the
physical device, possibly leaving disks unused in the system.

Hence, the question is how important it is to also control bandwidth at
higher level logical devices. The actual contention for resources is
at the leaf block device, so it probably makes sense to do any kind of
control there and not at the intermediate devices. Secondly, it probably
also means better use of available resources.

Limited Fairness
----------------
Currently CFQ idles on a sequential reader queue to make sure it gets its
fair share. A second level controller will find it tricky to anticipate.
Either it will not have any anticipation logic, and in that case it will not
provide fairness to single readers in a group (as dm-ioband does), or if it
starts anticipating then we can run into strange situations where the
second level controller is anticipating on one queue/group and the underlying
IO scheduler might be anticipating on something else.

Need of device mapper tools
---------------------------
A device mapper based solution will require the creation of an ioband device
on each physical/logical device one wants to control. So it requires usage
of device mapper tools even for people who are not otherwise using device
mapper. At the same time, creation of an ioband device on each partition in the
system to control the IO can be cumbersome and overwhelming if the system has
lots of disks and partitions.


IMHO, an IO scheduler based IO controller is a reasonable approach to solve the
problem of group bandwidth control, and can do hierarchical IO scheduling
more tightly and efficiently.

But I am all ears for alternative approaches and suggestions on how things
can be done better, and will be glad to implement them.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Testing
=======

Environment
==========
A 7200 RPM SATA drive with a queue depth of 31, using an ext3 filesystem. I am
mostly running fio jobs which have been limited to 30 second runs, and then
monitoring the throughput and latency.

Test1: Random Reader Vs Random Writers
======================================
Launched a random reader and then an increasing number of random writers to see
the effect on the random reader's BW and max latencies.

[fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]

[Vanilla CFQ, No groups]
<--------------random writers--------------------> <------random reader-->
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 5737KiB/s 5737KiB/s 5737KiB/s 164K usec 503KiB/s 159K usec
2 2055KiB/s 1984KiB/s 4039KiB/s 1459K usec 150KiB/s 170K usec
4 1238KiB/s 932KiB/s 4419KiB/s 4332K usec 153KiB/s 225K usec
8 1059KiB/s 929KiB/s 7901KiB/s 1260K usec 118KiB/s 377K usec
16 604KiB/s 483KiB/s 8519KiB/s 3081K usec 47KiB/s 756K usec
32 367KiB/s 222KiB/s 9643KiB/s 5940K usec 22KiB/s 923K usec

Created two cgroups, group1 and group2, with a weight of 500 each. Launched an
increasing number of random writers in group1 and one random reader in group2
using fio.
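
For reference, a minimal sketch of the cgroup setup behind such a run is below
(the mount options and io.weight/tasks interface are the ones described in the
documentation patch; the fio invocations are the same as listed above, only
launched from inside the respective cgroups):

  mount -t cgroup -o io,blkio none /cgroup
  mkdir -p /cgroup/group1 /cgroup/group2
  echo 500 > /cgroup/group1/io.weight
  echo 500 > /cgroup/group2/io.weight

  # Move a shell into each group first so that fio and its forked jobs
  # inherit the cgroup membership.
  bash -c 'echo $$ > /cgroup/group1/tasks; exec fio --rw=randwrite --bs=64K \
      --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs=4' &
  bash -c 'echo $$ > /cgroup/group2/tasks; exec fio --rw=randread --bs=4K \
      --size=2G --runtime=30 --direct=1' &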

[IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
<--------------random writers(group1)-------------> <-random reader(group2)->
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 18115KiB/s 18115KiB/s 18115KiB/s 604K usec 345KiB/s 176K usec
2 3752KiB/s 3676KiB/s 7427KiB/s 4367K usec 402KiB/s 187K usec
4 1951KiB/s 1863KiB/s 7642KiB/s 1989K usec 384KiB/s 181K usec
8 755KiB/s 629KiB/s 5683KiB/s 2133K usec 366KiB/s 319K usec
16 418KiB/s 369KiB/s 6276KiB/s 1323K usec 352KiB/s 287K usec
32 236KiB/s 191KiB/s 6518KiB/s 1910K usec 337KiB/s 273K usec

Also ran the same test with IO controller CFQ in flat mode to see if there
are any major deviations from Vanilla CFQ. Does not look like any.

[IO controller CFQ; No groups ]
<--------------random writers--------------------> <------random reader-->
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 5696KiB/s 5696KiB/s 5696KiB/s 259K usec 500KiB/s 194K usec
2 2483KiB/s 2197KiB/s 4680KiB/s 887K usec 150KiB/s 159K usec
4 1471KiB/s 1433KiB/s 5817KiB/s 962K usec 126KiB/s 189K usec
8 691KiB/s 580KiB/s 5159KiB/s 2752K usec 197KiB/s 246K usec
16 781KiB/s 698KiB/s 11892KiB/s 943K usec 61KiB/s 529K usec
32 415KiB/s 324KiB/s 12461KiB/s 4614K usec 17KiB/s 737K usec

Notes:
- With vanilla CFQ, random writers can overwhelm a random reader, bringing down
its throughput and bumping up its latencies significantly.

- With the IO controller, one can provide isolation to the random reader group
and maintain a consistent view of bandwidth and latencies.

Test2: Random Reader Vs Sequential Reader
========================================
Launched a random reader and then an increasing number of sequential readers to
see the effect on the BW and latencies of the random reader.

[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]

[ Vanilla CFQ, No groups ]
<---------------seq readers----------------------> <------random reader-->
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 23318KiB/s 23318KiB/s 23318KiB/s 55940 usec 36KiB/s 247K usec
2 14732KiB/s 11406KiB/s 26126KiB/s 142K usec 20KiB/s 446K usec
4 9417KiB/s 5169KiB/s 27338KiB/s 404K usec 10KiB/s 993K usec
8 3360KiB/s 3041KiB/s 25850KiB/s 954K usec 60KiB/s 956K usec
16 1888KiB/s 1457KiB/s 26763KiB/s 1871K usec 28KiB/s 1868K usec

Created two cgroups, group1 and group2, with a weight of 500 each. Launched an
increasing number of sequential readers in group1 and one random reader in
group2 using fio.

[IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
<---------------group1---------------------------> <------group2--------->
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 13733KiB/s 13733KiB/s 13733KiB/s 247K usec 330KiB/s 154K usec
2 8553KiB/s 4963KiB/s 13514KiB/s 472K usec 322KiB/s 174K usec
4 5045KiB/s 1367KiB/s 13134KiB/s 947K usec 318KiB/s 178K usec
8 1774KiB/s 1420KiB/s 13035KiB/s 1871K usec 323KiB/s 233K usec
16 959KiB/s 518KiB/s 12691KiB/s 3809K usec 324KiB/s 208K usec

Also ran the same test with IO controller CFQ in flat mode to see if there
are any major deviations from Vanilla CFQ. Does not look like any.

[IO controller CFQ; No groups ]
<---------------seq readers----------------------> <------random reader-->
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 23028KiB/s 23028KiB/s 23028KiB/s 47460 usec 36KiB/s 253K usec
2 14452KiB/s 11176KiB/s 25628KiB/s 145K usec 20KiB/s 447K usec
4 8815KiB/s 5720KiB/s 27121KiB/s 396K usec 10KiB/s 968K usec
8 3335KiB/s 2827KiB/s 24866KiB/s 960K usec 62KiB/s 955K usec
16 1784KiB/s 1311KiB/s 26537KiB/s 1883K usec 26KiB/s 1866K usec

Notes:
- The BW and latencies of the random reader in group2 seem to be stable and
bounded and do not get impacted much as the number of sequential readers
increases in group1, hence providing good isolation.

- Throughput of sequential readers comes down and latencies go up as half
of disk bandwidth (in terms of time) has been reserved for random reader
group.

Test3: Sequential Reader Vs Sequential Reader
============================================
Created two cgroups, group1 and group2, of weights 500 and 1000 respectively.
Launched an increasing number of sequential readers in group1 and one sequential
reader in group2 using fio and monitored how bandwidth was distributed
between the two groups.
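
(The fio job lines are not reproduced in this test; presumably they followed
the pattern of Test2, so the exact parameters below are an assumption:

[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1]

with group1 at weight 500 running the first job and group2 at weight 1000
running the single reader.)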

First 5 columns give stats about job in group1 and last two columns give
stats about job in group2.

<---------------group1---------------------------> <------group2--------->
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 8970KiB/s 8970KiB/s 8970KiB/s 230K usec 20681KiB/s 124K usec
2 6783KiB/s 3202KiB/s 9984KiB/s 546K usec 19682KiB/s 139K usec
4 4641KiB/s 1029KiB/s 9280KiB/s 1185K usec 19235KiB/s 172K usec
8 1435KiB/s 1079KiB/s 9926KiB/s 2461K usec 19501KiB/s 153K usec
16 764KiB/s 398KiB/s 9395KiB/s 4986K usec 19367KiB/s 172K usec

Note: group2 is getting double the bandwidth of group1 even in the face
of an increasing number of readers in group1.

Test4 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on the host into two
partitions and gave one partition to each virtual machine. Put the two virtual
machines in two different cgroups of weight 1000 and 500 respectively. The
virtual machines created ext3 file systems on the partitions exported from the
host and did buffered writes. The host sees the writes as synchronous, and the
virtual machine with the higher weight got double the disk time of the virtual
machine with the lower weight. Used the deadline scheduler in this test case.

Some more details about configuration are in documentation patch.
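
A rough sketch of the cgroup side of this setup (the cgroup names are
illustrative and the pids are placeholders; the interface files are the ones
from the documentation patch):

  mkdir -p /cgroup/vm1 /cgroup/vm2
  echo 1000 > /cgroup/vm1/io.weight
  echo 500  > /cgroup/vm2/io.weight

  # move the already-running qemu-kvm processes into the two groups
  echo <pid of guest1 qemu-kvm> > /cgroup/vm1/tasks
  echo <pid of guest2 qemu-kvm> > /cgroup/vm2/tasks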

Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky, and the biggest reason is that async writes
are cached in higher layers (page cache), as well as possibly in the file system
layer (btrfs, xfs etc.), and are not necessarily dispatched to lower layers
in a proportional manner.

For example, consider two dd threads reading /dev/zero as the input file and
doing writes of huge files. Very soon we will cross vm_dirty_ratio, and a dd
thread will be forced to write out some pages to disk before more pages can be
dirtied. But it is not necessarily the dirty pages of the same thread that are
picked. It can very well pick the inode of the lower priority dd thread and do
some writeout. So effectively the higher weight dd is doing writeouts of the
lower weight dd's pages and we don't see service differentiation.
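
(As an illustration, the kind of dd pair meant here would be something like
the following, one launched in each cgroup; the paths and counts are made up:

  dd if=/dev/zero of=/mnt/sdb/bigfile1 bs=4K count=1000000 &
  dd if=/dev/zero of=/mnt/sdb/bigfile2 bs=4K count=1000000 &

Both are buffered writes, so the actual writeout happens later via writeback
and not necessarily in the context of the dd threads that dirtied the pages.)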

IOW, the core problem with buffered write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep the queue
continuously backlogged. In my testing, there are many 0.2 to 0.8 second
intervals where the higher weight queue is empty, and in that duration the
lower weight queue gets lots of work done, giving the impression that there was
no service differentiation.

In summary, from the IO controller point of view, async write support is there.
But because the page cache has not been designed in such a manner that a higher
prio/weight writer can do more writeout than a lower prio/weight
writer, getting service differentiation is hard, and it is visible in some
cases and not visible in others.

Vanilla CFQ Vs IO Controller CFQ
================================
We have not fundamentally changed CFQ; instead we enhanced it to also support
hierarchical io scheduling. In the process there are invariably small changes
here and there as new scenarios come up. I ran some tests and compared
both CFQs to see if there is any major deviation in behavior.

Test1: Sequential Readers
=========================
[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]

IO scheduler: Vanilla CFQ

nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1 35499KiB/s 35499KiB/s 35499KiB/s 19195 usec
2 17089KiB/s 13600KiB/s 30690KiB/s 118K usec
4 9165KiB/s 5421KiB/s 29411KiB/s 380K usec
8 3815KiB/s 3423KiB/s 29312KiB/s 830K usec
16 1911KiB/s 1554KiB/s 28921KiB/s 1756K usec

IO scheduler: IO controller CFQ

nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1 34494KiB/s 34494KiB/s 34494KiB/s 14482 usec
2 16983KiB/s 13632KiB/s 30616KiB/s 123K usec
4 9237KiB/s 5809KiB/s 29631KiB/s 372K usec
8 3901KiB/s 3505KiB/s 29162KiB/s 822K usec
16 1895KiB/s 1653KiB/s 28945KiB/s 1778K usec

Test2: Sequential Writers
=========================
[fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]

IO scheduler: Vanilla CFQ

nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1 22669KiB/s 22669KiB/s 22669KiB/s 401K usec
2 14760KiB/s 7419KiB/s 22179KiB/s 571K usec
4 5862KiB/s 5746KiB/s 23174KiB/s 444K usec
8 3377KiB/s 2199KiB/s 22427KiB/s 1057K usec
16 2229KiB/s 556KiB/s 20601KiB/s 5099K usec

IO scheduler: IO Controller CFQ

nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1 22911KiB/s 22911KiB/s 22911KiB/s 37319 usec
2 11752KiB/s 11632KiB/s 23383KiB/s 245K usec
4 6663KiB/s 5409KiB/s 23207KiB/s 384K usec
8 3161KiB/s 2460KiB/s 22566KiB/s 935K usec
16 1888KiB/s 795KiB/s 21349KiB/s 3009K usec

Test3: Random Readers
=========================
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]

IO scheduler: Vanilla CFQ

nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1 484KiB/s 484KiB/s 484KiB/s 22596 usec
2 229KiB/s 196KiB/s 425KiB/s 51111 usec
4 119KiB/s 73KiB/s 405KiB/s 2344 msec
8 93KiB/s 23KiB/s 399KiB/s 2246 msec
16 38KiB/s 8KiB/s 328KiB/s 3965 msec

IO scheduler: IO Controller CFQ

nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1 483KiB/s 483KiB/s 483KiB/s 29391 usec
2 229KiB/s 196KiB/s 426KiB/s 51625 usec
4 132KiB/s 88KiB/s 417KiB/s 2313 msec
8 79KiB/s 18KiB/s 389KiB/s 2298 msec
16 43KiB/s 9KiB/s 327KiB/s 3905 msec

Test4: Random Writers
=====================
[fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]

IO scheduler: Vanilla CFQ

nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1 14641KiB/s 14641KiB/s 14641KiB/s 93045 usec
2 7896KiB/s 1348KiB/s 9245KiB/s 82778 usec
4 2657KiB/s 265KiB/s 6025KiB/s 216K usec
8 951KiB/s 122KiB/s 3386KiB/s 1148K usec
16 66KiB/s 22KiB/s 829KiB/s 1308 msec

IO scheduler: IO Controller CFQ

nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1 14454KiB/s 14454KiB/s 14454KiB/s 74623 usec
2 4595KiB/s 4104KiB/s 8699KiB/s 135K usec
4 3113KiB/s 334KiB/s 5782KiB/s 200K usec
8 1146KiB/s 95KiB/s 3832KiB/s 593K usec
16 71KiB/s 29KiB/s 814KiB/s 1457 msec

Notes:
- It does not look like anything has changed significantly.

Previous versions of the patches were posted here.
------------------------------------------------

(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279
(V6) http://lkml.org/lkml/2009/7/2/369
(V7) http://lkml.org/lkml/2009/7/24/253
(V8) http://lkml.org/lkml/2009/8/16/204
(V9) http://lkml.org/lkml/2009/8/28/327

Thanks
Vivek


2009-09-24 19:32:52

by Vivek Goyal

Subject: [PATCH 01/28] io-controller: Documentation

o Documentation for io-controller.

Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
Documentation/block/00-INDEX | 2 +
Documentation/block/io-controller.txt | 464 +++++++++++++++++++++++++++++++++
2 files changed, 466 insertions(+), 0 deletions(-)
create mode 100644 Documentation/block/io-controller.txt

diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
- Generic Block Device Capability (/sys/block/<disk>/capability)
deadline-iosched.txt
- Deadline IO scheduler tunables
+io-controller.txt
+ - IO controller for providing hierarchical IO scheduling
ioprio.txt
- Block io priorities (in CFQ scheduler)
request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..f2bfce6
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,464 @@
+ IO Controller
+ =============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups and assign prio/weights to those cgroups, and a task group
+will get access to the disk proportionate to the weight of the group.
+
+These patches modify the elevator layer and individual IO schedulers to do
+IO control, hence this io controller works only on block devices which use
+one of the standard io schedulers; it can not be used with an arbitrary
+logical block device.
+
+The assumption/thought behind modifying the IO scheduler is that resource
+control is primarily needed on the leaf nodes, where the actual contention for
+resources is present, and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three
+physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have
+been created on top of these. Some part of sdb is in lv0 and some part is in lv1.
+
+                  lv0     lv1
+                 /   \   /   \
+               sda    sdb     sdc
+
+Also consider the following cgroup hierarchy:
+
+                    root
+                   /    \
+                  A      B
+                 / \    / \
+                T1 T2  T3 T4
+
+A and B are two cgroups, and T1, T2, T3 and T4 are tasks within those cgroups.
+Assume T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on the intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contention for resources between groups A and B if
+the IO is going to sda or sdc. But if the actual IO gets translated to disk
+sdb, then the IO scheduler associated with sdb will distribute disk bandwidth
+to groups A and B proportionate to their weights.
+
+CFQ already has the notion of fairness and provides differential disk
+access based on the priority and class of the task. It is just that it is flat,
+and with the cgroup stuff it needs to be made hierarchical to achieve good
+hierarchical control on IO.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split into read and write queues
+for deadline and AS). With this patchset, we now maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code, which has been moved to a common layer (the elevator
+layer). Hence we don't end up replicating code across IO schedulers. The
+following diagram depicts the concept.
+
+            --------------------------------
+            | Elevator Layer + Fair Queuing |
+            --------------------------------
+              |        |         |      |
+            NOOP   DEADLINE      AS    CFQ
+
+Design
+======
+This patchset takes its inspiration from the CFS cpu scheduler, CFQ and BFQ to
+come up with the core of hierarchical scheduling. Like CFQ, we give time slices
+to every queue based on its priority. Like CFS, the disk time given to a
+queue is converted to virtual disk time (vdisktime) based on the queue's
+weight, and based on this vdisktime we decide which queue is to be
+dispatched next. And like BFQ, we maintain a cache of recently served queues
+and derive the new vdisktime of a queue from the cache if the queue was
+recently served.
+
+From a data structure point of view, one can think of a tree per device, where
+io groups and io queues hang and are being scheduled. An io_queue is the end
+queue where requests are actually stored and dispatched from (like cfqq).
+
+These io queues are primarily created and managed by the end io schedulers
+depending on their semantics. For example, the noop, deadline and AS
+ioschedulers keep one io queue per cgroup, and cfq keeps one io queue per
+io_context in a cgroup (apart from async queues).
+
+A request is mapped to an io group by the elevator layer, and which io queue it
+is mapped to within the group depends on the ioscheduler. Noop, deadline and AS
+don't maintain separate queues per task, hence there is only one io_queue per
+group. So once we find the right group, we have also found the right queue. CFQ
+maintains multiple io queues within a group based on task context and maps the
+request to the right queue in the group.
+
+Sync requests are mapped to the right group and queue based on the "current"
+task. Async requests can be mapped using either the "current" task or the owner
+of the page (the blkio cgroup subsystem provides this bio/page tracking
+mechanism). This is controlled by the config option "CONFIG_TRACK_ASYNC_CONTEXT".
+
+Going back to old behavior
+==========================
+In the new scheme of things we are essentially creating hierarchical fair
+queuing logic in the elevator layer and changing the IO schedulers to make use
+of that logic, so that the end IO schedulers start supporting hierarchical
+scheduling.
+
+The elevator layer continues to support the old interfaces. So even if fair
+queuing is enabled at the elevator layer, one can have both the new
+hierarchical scheduler as well as the old non-hierarchical scheduler operating.
+
+Also, noop, deadline and AS have the option of enabling hierarchical
+scheduling. If it is selected, fair queuing is done in a hierarchical manner.
+If hierarchical scheduling is disabled, noop, deadline and AS should retain
+their existing behavior.
+
+CFQ is the only exception where one can not disable fair queuing, as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode. So CFQ has to use the fair queuing logic from the common layer, but it
+can choose to enable only flat support and not enable hierarchical (group
+scheduling) support.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+ - Enables hierarchical fair queuing in noop. Not selecting this option
+ leads to the old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+ - Enables hierarchical fair queuing in deadline. Not selecting this
+ option leads to the old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+ - Enables hierarchical fair queuing in AS. Not selecting this option
+ leads to the old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+ - Enables hierarchical fair queuing in CFQ. Not selecting this option
+ still does fair queuing among the various queues but it is flat and not
+ hierarchical.
+
+CGROUP_BLKIO
+ - This option enables blkio-cgroup controller for IO tracking
+ purposes. That means, by this controller one can attribute a write
+ to the original cgroup and not assume that it belongs to submitting
+ thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+ - Currently CFQ attributes the writes to the submitting thread and
+ caches the async queue pointer in the io context of the process.
+ If this option is set, it tells cfq and the elevator fair queuing logic
+ that for async writes it should make use of the IO tracking patches and
+ attribute writes to the original cgroup and not to the submitting thread.
+
+ This should be primarily useful when lots of asynchronous writes
+ are being submitted by pdflush threads and we need to assign the
+ writes to the right group.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+ - Throws extra debug messages in the blktrace output, helpful in
+ debugging a hierarchical setup.
+
+ - Also allows for export of extra debug statistics like group queue
+ and dequeue statistics on device through cgroup interface.
+
+CONFIG_DEBUG_ELV_FAIR_QUEUING
+ - Enables some vdisktime related debugging messages.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+ - Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+ - Enables/Disables hierarchical queuing and associated cgroup bits.
+
+HOWTO
+=====
+You can do a very simple test by running two dd threads in two different
+cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
+ CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+ CONFIG_TRACK_ASYNC_CONTEXT=y
+
+ (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into kernel and mount IO controller and blkio io tracking
+ controller.
+
+ mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+ mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+ echo 1000 > /cgroup/test1/io.weight
+ echo 500 > /cgroup/test2/io.weight
+
+- Set "fairness" parameter to 1 at the disk you are testing.
+
+ echo 1 > /sys/block/<disk>/queue/iosched/fairness
+
+- Create two files of the same size (say 512MB each) on the same disk (file1,
+ file2) and launch two dd threads in different cgroups to read those files.
+ Make sure the right io scheduler is being used for the block device where the
+ files are present (the one you compiled in hierarchical mode).
+
+ sync
+ echo 3 > /proc/sys/vm/drop_caches
+
+ dd if=/mnt/sdb/zerofile1 of=/dev/null &
+ echo $! > /cgroup/test1/tasks
+ cat /cgroup/test1/tasks
+
+ dd if=/mnt/sdb/zerofile2 of=/dev/null &
+ echo $! > /cgroup/test2/tasks
+ cat /cgroup/test2/tasks
+
+- At a macro level, the first dd should finish first. To get more precise data,
+ keep looking (with the help of a script) at the io.disk_time and
+ io.disk_sectors files of both the test1 and test2 groups. These will tell how
+ much disk time (in milliseconds) each group got and how many sectors each
+ group dispatched to the disk. We provide fairness in terms of disk time, so
+ ideally io.disk_time of the cgroups should be in proportion to the weights.
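+
+ A trivial watch loop for this monitoring could look like the following
+ (illustrative only; the io.disk_time/io.disk_sectors files are described
+ in the "Details of cgroup files" section below):
+
+   while true; do
+           cat /cgroup/test1/io.disk_time /cgroup/test2/io.disk_time
+           cat /cgroup/test1/io.disk_sectors /cgroup/test2/io.disk_sectors
+           sleep 2
+   done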
+
+What Works and What Does not
+============================
+Service differentiation at the application level can be noticed only if
+completely parallel IO paths are created from the application to the IO
+scheduler and there is no serialization introduced by any intermediate layer.
+For example, in some cases the file system and page cache layers introduce
+serialization and we don't see a service difference between higher weight and
+lower weight process groups.
+
+For example, when I start an O_SYNC write out on an ext3 file system (the file
+is being newly created), I see lots of activity from kjournald. I have not
+gone into the details yet, but my understanding is that there are a lot more
+journal commits and kjournald kind of introduces serialization between the two
+processes. So even if you put these two processes in two different cgroups
+with different weights, the higher weight process will not see more IO done.
+
+It does work very well when we bypass the filesystem layer and the IO is raw.
+For example, in the above virtual machine case, the host sees raw synchronous
+writes coming from the two guest machines, and the filesystem layer at the host
+is not introducing any kind of serialization, hence we can see the service
+difference.
+
+It also works very well for reads, even on the same file system, as for reads
+the file system journalling activity does not kick in and we can create
+parallel IO paths from the application all the way down to the IO scheduler and
+get more IO done on the IO path with higher weight.
+
+Details of new ioscheduler tunables
+===================================
+
+group_idle
+-----------
+
+"group_idle" specifies the duration one should wait for new request before
+group is expired. This is very similiar to "slice_idle" parameter of cfq. The
+difference is that slice_idle specifies queue idling period and group_idle
+specifies group idling period. Another difference is that cfq idling is
+dynamically updated based on traffic pattern. group idling is currently
+static.
+
+Group idling takes place when a group is empty at the time it is being expired.
+If an empty group is expired and gets a request a little later (say 1 ms), it
+loses its fair share, as upon expiry it will be deleted from the service tree,
+a new queue will be selected to run, and min_vdisktime will be updated on the
+service tree.
+
+There are both advantages and disadvantages to enabling group_idle. If
+enabled, it ensures that a group gets its fair share of disk time (as long
+as the group gets a new request within the group_idle period). So even if a
+single sequential reader is running in a group, it will get the disk time
+its group weight entitles it to. IOW, enabling it provides very strong
+isolation between groups.
+
+The flip side is that it makes the group a heavier entity with slow switching
+between groups. There are many cases where CFQ disables idling on the
+queue, and hence the queue gets expired as soon as the requests in the queue
+are done and CFQ moves to a new queue. This way it achieves faster switching
+and in many cases better throughput (most of the time seeky processes will not
+have idling enabled and will get very limited access to the disk).
+
+If group idling is disabled, a group will get fairness only if it is
+continuously backlogged. So this weakens the fairness guarantees and isolation
+between the groups, but can help achieve faster switching between queues/groups
+and better throughput.
+
+So one should set "group_idle" depending on one's use case and based on need.
+
+For the time being it is enabled by default.
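+
+For example, to trade some isolation for faster switching on a given disk, one
+could disable group idling there. The exact sysfs path below is an assumption,
+parallel to the "fairness" tunable shown in the HOWTO section:
+
+	echo 0 > /sys/block/<disk>/queue/iosched/group_idle	# path assumed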
+
+"fairness"
+----------
+IO controller has introduced a "fairness" tunable for every io scheduler.
+Currently this tunable can assume values 0, 1.
+
+If fairness is set to 1, then the IO controller waits for requests from the
+previous queue to finish before requests from the new queue are dispatched.
+This helps in doing better accounting of the disk time consumed by a queue. If
+this is not done, then on queuing hardware there can be requests from multiple
+queues in flight and we will not have any idea which queue consumed how much
+disk time.
+
+So if "fairness" is set, it can help achive better time accounting. But the
+flip side is that it can slow down switching between queues and also lower the
+throughput.
+
+Again, this parameter should be set/reset based on the need. For the time
+being it is disabled by default.
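+
+As with the other iosched tunables, it is set per device, for example:
+
+	echo 1 > /sys/block/<disk>/queue/iosched/fairness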
+
+Details of cgroup files
+=======================
+- io.ioprio_class
+ - Specifies class of the cgroup (RT, BE, IDLE). This is default io
+ class of the group on all the devices until and unless overridden by
+ per device rule. (See io.policy).
+
+ 1 = RT; 2 = BE, 3 = IDLE
+
+- io.weight
+ - Specifies per cgroup weight. This is default weight of the group
+ on all the devices until and unless overridden by per device rule.
+ (See io.policy).
+
+ Currently allowed range of weights is from 100 to 1000.
+
+- io.disk_time
+ - disk time allocated to cgroup per device in milliseconds. First
+ two fields specify the major and minor number of the device and
+ third field specifies the disk time allocated to group in
+ milliseconds.
+
+- io.disk_sectors
+ - number of sectors transferred to/from disk by the group. First
+ two fields specify the major and minor number of the device and
+ third field specifies the number of sectors transferred by the
+ group to/from the device.
+
+- io.disk_queue
+ - Debugging aid, only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+ gives statistics about how many times a group was queued
+ on the service tree of the device. The first two fields specify the major
+ and minor number of the device and the third field specifies the number
+ of times the group was queued on that particular device.
+
+- io.disk_dequeue
+ - Debugging aid, only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+ gives statistics about how many times a group was de-queued
+ or removed from the service tree of the device. This basically gives
+ an idea whether we can generate enough IO to create continuously
+ backlogged groups. The first two fields specify the major and minor
+ number of the device and the third field specifies the number
+ of times the group was de-queued on that particular device.
+
+- io.policy
+ - One can specify per cgroup per device rules using this interface.
+ These rules override the default value of group weight and class as
+ specified by io.weight and io.ioprio_class.
+
+ Following is the format.
+
+ # echo dev_maj:dev_minor weight ioprio_class > /path/to/cgroup/io.policy
+
+ weight=0 means removing a policy.
+
+ Examples:
+
+ Configure weight=300 ioprio_class=2 on /dev/hdb (8:16) in this cgroup
+ # echo 8:16 300 2 > io.policy
+ # cat io.policy
+ dev weight class
+ 8:16 300 2
+
+ Configure weight=500 ioprio_class=1 on /dev/hda (8:0) in this cgroup
+ # echo 8:0 500 1 > io.policy
+ # cat io.policy
+ dev weight class
+ 8:0 500 1
+ 8:16 300 2
+
+ Remove the policy for /dev/hda in this cgroup
+ # echo 8:0 0 1 > io.policy
+ # cat io.policy
+ dev weight class
+ 8:16 300 2
+
+About configuring request descriptors
+=====================================
+Traditionally there are 128 request descriptors allocated per request queue
+where the io scheduler is operating (/sys/block/<disk>/queue/nr_requests). If
+these request descriptors are exhausted, processes will be put to sleep and
+woken up once request descriptors are available.
+
+With the io controller and cgroup stuff, one can not afford to allocate
+requests from a single pool, as one group might allocate lots of requests and
+then tasks from other groups might be put to sleep, and those other groups
+might be higher weight groups. Hence, to make sure that a group can always get
+the request descriptors it is entitled to, one needs to make the request
+descriptor limit per group on every queue.
+
+A new parameter /sys/block/<disk>/queue/nr_group_requests has been introduced,
+and this parameter controls the maximum number of requests per group.
+nr_requests still continues to control the total number of request descriptors
+on the queue.
+
+Ideally one should set nr_requests as follows:
+
+nr_requests = number_of_cgroups * nr_group_requests
+
+This will make sure that at any point of time nr_group_requests number of
+request descriptors will be available for any of the cgroups.
+
+Currently the defaults are nr_requests=512 and nr_group_requests=128. This will
+make sure that apart from the root group one can create 3 more groups without
+running into any issues. If one decides to create more cgroups, nr_requests and
+nr_group_requests should be adjusted accordingly.
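+
+For example, if one plans for 8 groups (root plus 7 more), something along
+these lines keeps the above relation; the numbers are only an illustration:
+
+	echo 128  > /sys/block/<disk>/queue/nr_group_requests
+	echo 1024 > /sys/block/<disk>/queue/nr_requests		# 8 * 128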
+
+Some High Level Test setups
+===========================
+One of the use cases of IO controller is to provide some kind of IO isolation
+between multiple virtual machines on the same host. Following is one
+example setup which worked for me.
+
+
+          KVM                      KVM
+         Guest1                   Guest2
+      -------------            -------------
+      |  -------  |            |  -------  |
+      |  | vdb |  |            |  | vdb |  |
+      |  -------  |            |  -------  |
+      -------------            -------------
+
+      ---------------------------------------
+      |                Host                 |
+      |           ---------------           |
+      |           | sdb1 | sdb2 |           |
+      |           ---------------           |
+      ---------------------------------------
+
+On the host machine, I had a spare SATA disk. I created two partitions, sdb1
+and sdb2, and gave these partitions as additional storage to the kvm guests:
+sdb1 to KVM guest1 and sdb2 to KVM guest2. This storage appeared as /dev/vdb in
+both guests. I formatted /dev/vdb, created an ext3 file system, and
+started a 1G file writeout in both guests. Before the writeout I had created
+two cgroups of weight 1000 and 500 and put the virtual machines in the two
+different groups.
+
+Following is the write I started in both guests.
+
+dd if=/dev/zero of=/mnt/vdc/zerofile1 bs=4K count=262144 conv=fdatasync
+
+Following are the results on host with "deadline" scheduler.
+
+group1 time=8:16 17254 group1 sectors=8:16 2104288
+group2 time=8:16 8498 group2 sectors=8:16 1007040
+
+The virtual machine with cgroup weight 1000 got almost double the time of the
+virtual machine with weight 500.
--
1.6.0.6

2009-09-24 19:27:08

by Vivek Goyal

Subject: [PATCH 02/28] io-controller: Core of the elevator fair queuing

o This is the core of the io scheduler implemented at the elevator layer. It is
a mix of the cpu CFS scheduler and the CFQ IO scheduler. Some of the bits from
CFS had to be derived so that we can support hierarchical scheduling. Without
cgroups, or within a group, we should essentially get the same behavior as CFQ.

o This patch only shows the non-hierarchical bits. Hierarchical code comes in
later patches.

o This code is the building base for introducing fair queuing logic in the
common elevator layer so that it can be used by all four IO schedulers.

Signed-off-by: Fabio Checconi <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/Makefile | 2 +-
block/elevator-fq.c | 406 +++++++++++++++++++++++++++++++++++++++++++++++++++
block/elevator-fq.h | 148 +++++++++++++++++++
3 files changed, 555 insertions(+), 1 deletions(-)
create mode 100644 block/elevator-fq.c
create mode 100644 block/elevator-fq.h

diff --git a/block/Makefile b/block/Makefile
index 6c54ed0..19ff1e8 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
- ioctl.o genhd.o scsi_ioctl.o
+ ioctl.o genhd.o scsi_ioctl.o elevator-fq.o

obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..d98fa42
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,406 @@
+/*
+ * elevator fair queuing Layer.
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <[email protected]>
+ *
+ * Copyright (C) 2008 Fabio Checconi <[email protected]>
+ * Paolo Valente <[email protected]>
+ *
+ * Copyright (C) 2009 Vivek Goyal <[email protected]>
+ * Nauman Rafique <[email protected]>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+
+/*
+ * offset from end of service tree
+ */
+#define ELV_IDLE_DELAY (HZ / 5)
+#define ELV_SLICE_SCALE (500)
+#define ELV_SERVICE_SHIFT 20
+
+static inline struct io_queue *ioq_of(struct io_entity *entity)
+{
+ if (entity->my_sd == NULL)
+ return container_of(entity, struct io_queue, entity);
+ return NULL;
+}
+
+static inline int io_entity_class_rt(struct io_entity *entity)
+{
+ return entity->ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int io_entity_class_idle(struct io_entity *entity)
+{
+ return entity->ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline s64
+entity_key(struct io_service_tree *st, struct io_entity *entity)
+{
+ return entity->vdisktime - st->min_vdisktime;
+}
+
+static inline u64
+elv_delta(u64 service, unsigned int numerator_wt, unsigned int denominator_wt)
+{
+ if (numerator_wt != denominator_wt) {
+ service = service * numerator_wt;
+ do_div(service, denominator_wt);
+ }
+
+ return service;
+}
+
+static inline u64 elv_delta_fair(unsigned long delta, struct io_entity *entity)
+{
+ u64 d = delta << ELV_SERVICE_SHIFT;
+
+ return elv_delta(d, IO_WEIGHT_DEFAULT, entity->weight);
+}
+
+static inline int
+elv_weight_slice(struct elv_fq_data *efqd, int sync, unsigned int weight)
+{
+ const int base_slice = efqd->elv_slice[sync];
+
+ WARN_ON(weight > IO_WEIGHT_MAX);
+
+ return elv_delta(base_slice, weight, IO_WEIGHT_DEFAULT);
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+ return elv_weight_slice(efqd, elv_ioq_sync(ioq), ioq->entity.weight);
+}
+
+static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta > 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta < 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static void update_min_vdisktime(struct io_service_tree *st)
+{
+ u64 vdisktime;
+
+ if (st->active_entity)
+ vdisktime = st->active_entity->vdisktime;
+
+ if (st->rb_leftmost) {
+ struct io_entity *entity = rb_entry(st->rb_leftmost,
+ struct io_entity, rb_node);
+
+ if (!st->active_entity)
+ vdisktime = entity->vdisktime;
+ else
+ vdisktime = min_vdisktime(vdisktime, entity->vdisktime);
+ }
+
+ st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
+static inline struct io_entity *parent_entity(struct io_entity *entity)
+{
+ return entity->parent;
+}
+
+static inline struct io_group *iog_of(struct io_entity *entity)
+{
+ if (entity->my_sd)
+ return container_of(entity, struct io_group, entity);
+ return NULL;
+}
+
+static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
+{
+ return ioq_of(entity)->efqd;
+}
+
+static inline struct io_sched_data *
+io_entity_sched_data(struct io_entity *entity)
+{
+ struct elv_fq_data *efqd = efqd_of(entity);
+
+ return &efqd->root_group->sched_data;
+}
+
+static inline void
+init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
+{
+ struct io_group *parent_iog = iog_of(parent);
+ unsigned short idx = entity->ioprio_class - 1;
+
+ BUG_ON(idx >= IO_IOPRIO_CLASSES);
+
+ entity->st = &parent_iog->sched_data.service_tree[idx];
+}
+
+static void
+entity_served(struct io_entity *entity, unsigned long served,
+ unsigned long queue_charge, unsigned long nr_sectors)
+{
+ entity->vdisktime += elv_delta_fair(queue_charge, entity);
+ update_min_vdisktime(entity->st);
+}
+
+static void place_entity(struct io_service_tree *st, struct io_entity *entity,
+ int add_front)
+{
+ u64 vdisktime = st->min_vdisktime;
+ struct rb_node *parent;
+ struct io_entity *entry;
+ int nr_active = st->nr_active - 1;
+
+ /*
+ * Currently put entity at the end of last entity. This probably will
+ * require adjustments as we move along
+ */
+ if (io_entity_class_idle(entity)) {
+ vdisktime = elv_delta_fair(ELV_IDLE_DELAY, entity);
+ parent = rb_last(&st->active);
+ if (parent) {
+ entry = rb_entry(parent, struct io_entity, rb_node);
+ vdisktime += entry->vdisktime;
+ }
+ } else if (!add_front && nr_active) {
+ parent = rb_last(&st->active);
+ if (parent) {
+ entry = rb_entry(parent, struct io_entity, rb_node);
+ vdisktime = entry->vdisktime;
+ }
+ } else
+ vdisktime = st->min_vdisktime;
+
+ entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
+static inline void io_entity_update_prio(struct io_entity *entity)
+{
+ if (unlikely(entity->ioprio_changed)) {
+ /*
+ * Re-initialize the service tree as ioprio class of the
+ * entity might have changed.
+ */
+ init_io_entity_service_tree(entity, parent_entity(entity));
+ entity->ioprio_changed = 0;
+ }
+}
+
+static void
+__dequeue_io_entity(struct io_service_tree *st, struct io_entity *entity)
+{
+ /*
+ * This can happen when, during put_prev_io_entity(), we detect that the
+ * ioprio of the queue has changed and decide to dequeue_entity() and
+ * requeue it back. In this case the entity is on the service tree but has
+ * already been removed from the rb tree.
+ */
+ if (RB_EMPTY_NODE(&entity->rb_node))
+ return;
+
+ if (st->rb_leftmost == &entity->rb_node) {
+ struct rb_node *next_node;
+
+ next_node = rb_next(&entity->rb_node);
+ st->rb_leftmost = next_node;
+ }
+
+ rb_erase(&entity->rb_node, &st->active);
+ RB_CLEAR_NODE(&entity->rb_node);
+}
+
+static void dequeue_io_entity(struct io_entity *entity)
+{
+ struct io_service_tree *st = entity->st;
+ struct io_sched_data *sd = io_entity_sched_data(entity);
+
+ __dequeue_io_entity(st, entity);
+ entity->on_st = 0;
+ st->nr_active--;
+ sd->nr_active--;
+}
+
+static void
+__enqueue_io_entity(struct io_service_tree *st, struct io_entity *entity,
+ int add_front)
+{
+ struct rb_node **node = &st->active.rb_node;
+ struct rb_node *parent = NULL;
+ struct io_entity *entry;
+ s64 key = entity_key(st, entity);
+ int leftmost = 1;
+
+ while (*node != NULL) {
+ parent = *node;
+ entry = rb_entry(parent, struct io_entity, rb_node);
+
+ if (key < entity_key(st, entry) ||
+ (add_front && (key == entity_key(st, entry)))) {
+ node = &parent->rb_left;
+ } else {
+ node = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ /*
+ * Maintain a cache of leftmost tree entries (it is frequently
+ * used)
+ */
+ if (leftmost)
+ st->rb_leftmost = &entity->rb_node;
+
+ rb_link_node(&entity->rb_node, parent, node);
+ rb_insert_color(&entity->rb_node, &st->active);
+}
+
+static void enqueue_io_entity(struct io_entity *entity)
+{
+ struct io_service_tree *st;
+ struct io_sched_data *sd = io_entity_sched_data(entity);
+
+ io_entity_update_prio(entity);
+ st = entity->st;
+ st->nr_active++;
+ sd->nr_active++;
+ entity->on_st = 1;
+ place_entity(st, entity, 0);
+ __enqueue_io_entity(st, entity, 0);
+}
+
+static struct io_entity *__lookup_next_io_entity(struct io_service_tree *st)
+{
+ struct rb_node *left = st->rb_leftmost;
+
+ if (!left)
+ return NULL;
+
+ return rb_entry(left, struct io_entity, rb_node);
+}
+
+static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
+{
+ struct io_service_tree *st = sd->service_tree;
+ struct io_entity *entity = NULL;
+ int i;
+
+ BUG_ON(sd->active_entity != NULL);
+
+ if (!sd->nr_active)
+ return NULL;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+ entity = __lookup_next_io_entity(st);
+ if (entity) {
+ __dequeue_io_entity(st, entity);
+ st->active_entity = entity;
+ sd->active_entity = entity;
+ break;
+ }
+ }
+
+ return entity;
+}
+
+static void requeue_io_entity(struct io_entity *entity, int add_front)
+{
+ struct io_service_tree *st = entity->st;
+ struct io_entity *next_entity;
+
+ if (add_front) {
+ next_entity = __lookup_next_io_entity(st);
+ /*
+ * This is to emulate cfq-like functionality where preemption
+ * can happen within the same class, e.g. a sync queue preempting
+ * an async queue.
+ *
+ * This feature is also used by the cfq close cooperator
+ * functionality, where cfq selects a queue out of order to run
+ * next based on the close cooperator.
+ */
+ if (next_entity && next_entity == entity)
+ return;
+ }
+
+ __dequeue_io_entity(st, entity);
+ place_entity(st, entity, add_front);
+ __enqueue_io_entity(st, entity, add_front);
+}
+
+/* Requeue an ioq which is already on the tree */
+static void requeue_ioq(struct io_queue *ioq, int add_front)
+{
+ requeue_io_entity(&ioq->entity, add_front);
+}
+
+static void put_prev_io_entity(struct io_entity *entity)
+{
+ struct io_service_tree *st = entity->st;
+ struct io_sched_data *sd = io_entity_sched_data(entity);
+
+ st->active_entity = NULL;
+ sd->active_entity = NULL;
+
+ if (unlikely(entity->ioprio_changed)) {
+ dequeue_io_entity(entity);
+ enqueue_io_entity(entity);
+ } else
+ __enqueue_io_entity(st, entity, 0);
+}
+
+/* Put curr ioq back into rb tree. */
+static void put_prev_ioq(struct io_queue *ioq)
+{
+ struct io_entity *entity = &ioq->entity;
+
+ put_prev_io_entity(entity);
+}
+
+static void dequeue_ioq(struct io_queue *ioq)
+{
+ struct io_entity *entity = &ioq->entity;
+
+ dequeue_io_entity(entity);
+ elv_put_ioq(ioq);
+ return;
+}
+
+/* Put a new queue on to the tree */
+static void enqueue_ioq(struct io_queue *ioq)
+{
+ struct io_entity *entity = &ioq->entity;
+
+ elv_get_ioq(ioq);
+ enqueue_io_entity(entity);
+}
+
+static inline void
+init_io_entity_parent(struct io_entity *entity, struct io_entity *parent)
+{
+ entity->parent = parent;
+ init_io_entity_service_tree(entity, parent);
+}
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+ BUG_ON(atomic_read(&ioq->ref) <= 0);
+ if (!atomic_dec_and_test(&ioq->ref))
+ return;
+}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..868e035
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,148 @@
+/*
+ * Elevator fair queuing layer. Data structures and common function prototypes.
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <[email protected]>
+ *
+ * Copyright (C) 2008 Fabio Checconi <[email protected]>
+ * Paolo Valente <[email protected]>
+ *
+ * Copyright (C) 2009 Vivek Goyal <[email protected]>
+ * Nauman Rafique <[email protected]>
+ */
+
+#ifdef CONFIG_BLOCK
+#include <linux/blkdev.h>
+
+#ifndef _ELV_SCHED_H
+#define _ELV_SCHED_H
+
+#define IO_WEIGHT_MIN 100
+#define IO_WEIGHT_MAX 1000
+#define IO_WEIGHT_DEFAULT 500
+#define IO_IOPRIO_CLASSES 3
+
+struct io_service_tree {
+ struct rb_root active;
+ struct io_entity *active_entity;
+ u64 min_vdisktime;
+ struct rb_node *rb_leftmost;
+ unsigned int nr_active;
+};
+
+struct io_sched_data {
+ struct io_entity *active_entity;
+ int nr_active;
+ struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+struct io_entity {
+ struct rb_node rb_node;
+ int on_st;
+ u64 vdisktime;
+ unsigned int weight;
+ struct io_entity *parent;
+
+ struct io_sched_data *my_sd;
+ struct io_service_tree *st;
+
+ unsigned short ioprio, ioprio_class;
+ int ioprio_changed;
+};
+
+/*
+ * A common structure representing the io queue where requests are actually
+ * queued.
+ */
+struct io_queue {
+ struct io_entity entity;
+ atomic_t ref;
+ unsigned int flags;
+
+ /* Pointer to generic elevator fair queuing data structure */
+ struct elv_fq_data *efqd;
+};
+
+struct io_group {
+ struct io_entity entity;
+ struct io_sched_data sched_data;
+};
+
+struct elv_fq_data {
+ struct io_group *root_group;
+
+ /* Base slice length for sync and async queues */
+ unsigned int elv_slice[2];
+};
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+ ELV_QUEUE_FLAG_sync, /* synchronous queue */
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name) \
+static inline void elv_mark_ioq_##name(struct io_queue *ioq) \
+{ \
+ (ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name); \
+} \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq) \
+{ \
+ (ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name); \
+} \
+static inline int elv_ioq_##name(struct io_queue *ioq) \
+{ \
+ return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0; \
+}
+
+ELV_IO_QUEUE_FLAG_FNS(sync)
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+ atomic_inc(&ioq->ref);
+}
+
+static inline unsigned int elv_ioprio_to_weight(int ioprio)
+{
+ WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+ /* Map prio 7 - 0 to weights 200 to 900 */
+ return IO_WEIGHT_DEFAULT + (IO_WEIGHT_DEFAULT/5 * (4 - ioprio));
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+ ioq->entity.ioprio = ioprio;
+ ioq->entity.weight = elv_ioprio_to_weight(ioprio);
+ ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+ int ioprio_class)
+{
+ ioq->entity.ioprio_class = ioprio_class;
+ ioq->entity.ioprio_changed = 1;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+ return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+ return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+ return ioq->entity.ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+ return ioq->entity.ioprio;
+}
+
+extern void elv_put_ioq(struct io_queue *ioq);
+#endif /* _ELV_SCHED_H */
+#endif /* CONFIG_BLOCK */
--
1.6.0.6

2009-09-24 19:26:32

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 03/28] io-controller: Keep a cache of recently expired queues

o Currently, once a queue uses its slice and is empty, we remove it from the
service tree, and when a new request comes in it is queued at the end of the
tree. This works as long as there are only queues or only groups at the same
level, but if there is a mix of queues and groups, there can be a fairness
issue. For example, consider the following case.

          root
         /    \
        T1    G1
              |
              T2

T1 and T2 are two tasks with prio 0 and 7 respectively and G1 is the group
with weight 900.

Task T1 with prio 0 is mapped to weight 900, so it will get a slice length of
180ms, then the queue will expire and be put after G1. (Note, in the case of a
reader the next request will most likely come after queue expiry, hence the
queue will be deleted, and once a request comes in again it will be added to
the tree fresh. A fresh queue is added at the end of the tree, so it will be
put after G1.)

Now G1 will get to run (effectively T2 will run). T2 has prio 7, which maps to
weight 200, so it gets a slice length of 40ms and expires after that. G1 now
gets a new vtime which is effectively a charge of 40ms.

For fairness, G1 should now run more, but instead T1 will be running again,
because we gave it a vtime that is the same as G1's.
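
For reference, a tiny userspace sketch of where the 180ms and 40ms numbers
above come from, assuming elv_weight_slice() scales the base sync slice
(elv_slice_sync = HZ/10, i.e. ~100ms with HZ=1000) linearly by
weight/IO_WEIGHT_DEFAULT (500); the exact scaling is not shown in this hunk,
so treat this as an illustration, not the patch code:

    #include <stdio.h>

    #define IO_WEIGHT_DEFAULT  500
    #define BASE_SLICE_MS      100   /* elv_slice_sync = HZ / 10, HZ = 1000 */

    /* assumed linear mapping from weight to slice length */
    static unsigned int weight_to_slice_ms(unsigned int weight)
    {
            return BASE_SLICE_MS * weight / IO_WEIGHT_DEFAULT;
    }

    int main(void)
    {
            /* prio 0 maps to weight 900, prio 7 to weight 200 */
            printf("T1 slice: %u ms\n", weight_to_slice_ms(900));  /* 180 */
            printf("T2 slice: %u ms\n", weight_to_slice_ms(200));  /*  40 */
            return 0;
    }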

The core issue here is that for readers, when the slice expires the queue is
empty and not backlogged, hence it gets deleted from the tree. Because CFQ
operates only in flat mode, it did the smart thing of not keeping track of
history. Instead it hands out slice lengths according to prio; if one gets
fairness within a single round of dispatch, fine, otherwise upon queue expiry
the queue is placed at the end of the service tree.

This does not work in hierarchical setups, where a group's slice length is
determined not by the group's weight but by the weight of the queue which will
run under the group.

Hence we need to keep track of history and assign a new vtime based on the
disk time used by the current queue at the time of expiry.

But here the io scheduler is a little different from CFS in that, at the time
of expiry, the reader's queue is usually empty. So one ends up deleting it from
the service tree; the next request comes within 1 ms and the queue gets into
the tree again like a new process.

So we need to keep track of a process io queue's vdisktime even after it got
deleted from the io scheduler's service tree, and use that same vdisktime if
the queue gets backlogged again. But blindly trusting an ioq's vdisktime is bad
because it can lead to issues if a service tree min_vdisktime wraparound takes
place between two requests of the queue. (Agreed, that is not easy to hit, but
it is possible.)

Hence, keep a cache of io queues serviced recently; when a queue gets
backlogged and is found in the cache, use that vdisktime, otherwise assign a
fresh one. This cache of io queues (the idle tree) is basically the idea
implemented by the BFQ guys. I had gotten rid of idle trees in V9 and now I am
bringing them back. (Now I understand them better. :-))
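
As a rough illustration (a simplified userspace model, not the patch code
itself - the real logic lives in enqueue_io_entity()/place_entity() below,
with rb-trees and wraparound-safe comparisons), the decision on re-enqueue
boils down to this:

    #include <stdio.h>

    struct model_ioq {
            unsigned long long vdisktime;
            int on_idle_tree;   /* "cached": recently served and remembered */
    };

    /* which vdisktime to use when the queue gets backlogged again */
    static unsigned long long
    requeue_vdisktime(struct model_ioq *ioq, unsigned long long min_vdisktime)
    {
            if (ioq->on_idle_tree && ioq->vdisktime > min_vdisktime) {
                    /* streaming reader found in the cache: keep its history */
                    ioq->on_idle_tree = 0;
                    return ioq->vdisktime;
            }
            /* genuinely new (or long-idle) queue: start at min_vdisktime */
            return min_vdisktime;
    }

    int main(void)
    {
            struct model_ioq reader = { .vdisktime = 1500, .on_idle_tree = 1 };
            struct model_ioq fresh  = { .vdisktime = 0,    .on_idle_tree = 0 };

            printf("reader: %llu\n", requeue_vdisktime(&reader, 1000)); /* 1500 */
            printf("fresh:  %llu\n", requeue_vdisktime(&fresh,  1000)); /* 1000 */
            return 0;
    }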

Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/elevator-fq.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-----
block/elevator-fq.h | 7 ++
2 files changed, 179 insertions(+), 16 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index d98fa42..8343397 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -21,6 +21,8 @@
#define ELV_SLICE_SCALE (500)
#define ELV_SERVICE_SHIFT 20

+static void check_idle_tree_release(struct io_service_tree *st);
+
static inline struct io_queue *ioq_of(struct io_entity *entity)
{
if (entity->my_sd == NULL)
@@ -78,6 +80,11 @@ elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
return elv_weight_slice(efqd, elv_ioq_sync(ioq), ioq->entity.weight);
}

+static inline int vdisktime_gt(u64 a, u64 b)
+{
+ return (s64)(a - b) > 0;
+}
+
static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
{
s64 delta = (s64)(vdisktime - min_vdisktime);
@@ -114,6 +121,7 @@ static void update_min_vdisktime(struct io_service_tree *st)
}

st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+ check_idle_tree_release(st);
}

static inline struct io_entity *parent_entity(struct io_entity *entity)
@@ -167,27 +175,46 @@ static void place_entity(struct io_service_tree *st, struct io_entity *entity,
struct rb_node *parent;
struct io_entity *entry;
int nr_active = st->nr_active - 1;
+ struct io_queue *ioq = ioq_of(entity);
+ int sync = 1;
+
+ if (ioq)
+ sync = elv_ioq_sync(ioq);
+
+ if (add_front || !nr_active) {
+ vdisktime = st->min_vdisktime;
+ goto done;
+ }
+
+ if (sync && entity->vdisktime
+ && vdisktime_gt(entity->vdisktime, st->min_vdisktime)) {
+ /* vdisktime still in future. Use old vdisktime */
+ vdisktime = entity->vdisktime;
+ goto done;
+ }

/*
- * Currently put entity at the end of last entity. This probably will
- * require adjustments as we move along
+ * Effectively a new queue. Assign a sync queue a lower vdisktime so
+ * we can achieve better latencies for small-file readers. For async
+ * queues, put them at the end of the existing tree.
+ * Group entities are always considered sync.
*/
- if (io_entity_class_idle(entity)) {
- vdisktime = elv_delta_fair(ELV_IDLE_DELAY, entity);
- parent = rb_last(&st->active);
- if (parent) {
- entry = rb_entry(parent, struct io_entity, rb_node);
- vdisktime += entry->vdisktime;
- }
- } else if (!add_front && nr_active) {
- parent = rb_last(&st->active);
- if (parent) {
- entry = rb_entry(parent, struct io_entity, rb_node);
- vdisktime = entry->vdisktime;
- }
- } else
+ if (sync) {
vdisktime = st->min_vdisktime;
+ goto done;
+ }

+ /*
+ * Put entity at the end of the tree. Effectively async queues use
+ * this path.
+ */
+ parent = rb_last(&st->active);
+ if (parent) {
+ entry = rb_entry(parent, struct io_entity, rb_node);
+ vdisktime = entry->vdisktime;
+ } else
+ vdisktime = st->min_vdisktime;
+done:
entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
}

@@ -200,6 +227,122 @@ static inline void io_entity_update_prio(struct io_entity *entity)
*/
init_io_entity_service_tree(entity, parent_entity(entity));
entity->ioprio_changed = 0;
+
+ /*
+ * Assign this entity a fresh vdisktime instead of using the
+ * previous one, as a prio class change leads to a service tree
+ * change and the old vdisktime will not be valid on the new
+ * service tree.
+ *
+ * TODO: Handle the case of only prio change.
+ */
+ entity->vdisktime = 0;
+ }
+}
+
+static void
+__dequeue_io_entity_idle(struct io_service_tree *st, struct io_entity *entity)
+{
+ if (st->rb_leftmost_idle == &entity->rb_node) {
+ struct rb_node *next_node;
+
+ next_node = rb_next(&entity->rb_node);
+ st->rb_leftmost_idle = next_node;
+ }
+
+ rb_erase(&entity->rb_node, &st->idle);
+ RB_CLEAR_NODE(&entity->rb_node);
+}
+
+static void dequeue_io_entity_idle(struct io_entity *entity)
+{
+ struct io_queue *ioq = ioq_of(entity);
+
+ __dequeue_io_entity_idle(entity->st, entity);
+ entity->on_idle_st = 0;
+ if (ioq)
+ elv_put_ioq(ioq);
+}
+
+static void
+__enqueue_io_entity_idle(struct io_service_tree *st, struct io_entity *entity)
+{
+ struct rb_node **node = &st->idle.rb_node;
+ struct rb_node *parent = NULL;
+ struct io_entity *entry;
+ int leftmost = 1;
+
+ while (*node != NULL) {
+ parent = *node;
+ entry = rb_entry(parent, struct io_entity, rb_node);
+
+ if (vdisktime_gt(entry->vdisktime, entity->vdisktime))
+ node = &parent->rb_left;
+ else {
+ node = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ /*
+ * Maintain a cache of leftmost tree entries (it is frequently
+ * used)
+ */
+ if (leftmost)
+ st->rb_leftmost_idle = &entity->rb_node;
+
+ rb_link_node(&entity->rb_node, parent, node);
+ rb_insert_color(&entity->rb_node, &st->idle);
+}
+
+static void enqueue_io_entity_idle(struct io_entity *entity)
+{
+ struct io_queue *ioq = ioq_of(entity);
+ struct io_group *parent_iog;
+
+ /*
+ * Don't put an entity on idle tree if it has been marked for deletion.
+ * We are not expecting more io from this entity. No need to cache it
+ */
+
+ if (entity->exiting)
+ return;
+
+ /*
+ * If the parent group is exiting, don't put this entity on the idle tree.
+ * Maybe the task got moved to a different cgroup and the original cgroup
+ * got deleted.
+ */
+ parent_iog = iog_of(parent_entity(entity));
+ if (parent_iog->entity.exiting)
+ return;
+
+ if (ioq)
+ elv_get_ioq(ioq);
+ __enqueue_io_entity_idle(entity->st, entity);
+ entity->on_idle_st = 1;
+}
+
+static void check_idle_tree_release(struct io_service_tree *st)
+{
+ struct io_entity *leftmost;
+
+ if (!st->rb_leftmost_idle)
+ return;
+
+ leftmost = rb_entry(st->rb_leftmost_idle, struct io_entity, rb_node);
+
+ if (vdisktime_gt(st->min_vdisktime, leftmost->vdisktime))
+ dequeue_io_entity_idle(leftmost);
+}
+
+static void flush_idle_tree(struct io_service_tree *st)
+{
+ struct io_entity *entity;
+
+ while (st->rb_leftmost_idle) {
+ entity = rb_entry(st->rb_leftmost_idle, struct io_entity,
+ rb_node);
+ dequeue_io_entity_idle(entity);
}
}

@@ -235,6 +378,9 @@ static void dequeue_io_entity(struct io_entity *entity)
entity->on_st = 0;
st->nr_active--;
sd->nr_active--;
+
+ if (vdisktime_gt(entity->vdisktime, st->min_vdisktime))
+ enqueue_io_entity_idle(entity);
}

static void
@@ -276,6 +422,16 @@ static void enqueue_io_entity(struct io_entity *entity)
struct io_service_tree *st;
struct io_sched_data *sd = io_entity_sched_data(entity);

+ if (entity->on_idle_st)
+ dequeue_io_entity_idle(entity);
+ else
+ /*
+ * This entity was not in idle tree cache. Zero out vdisktime
+ * so that we don't rely on old vdisktime instead assign a
+ * fresh one.
+ */
+ entity->vdisktime = 0;
+
io_entity_update_prio(entity);
st = entity->st;
st->nr_active++;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 868e035..ee46a47 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -28,6 +28,10 @@ struct io_service_tree {
u64 min_vdisktime;
struct rb_node *rb_leftmost;
unsigned int nr_active;
+
+ /* A cache of io entities which were served and expired */
+ struct rb_root idle;
+ struct rb_node *rb_leftmost_idle;
};

struct io_sched_data {
@@ -39,9 +43,12 @@ struct io_sched_data {
struct io_entity {
struct rb_node rb_node;
int on_st;
+ int on_idle_st;
u64 vdisktime;
unsigned int weight;
struct io_entity *parent;
+ /* This io entity (queue or group) has been marked for deletion */
+ unsigned int exiting;

struct io_sched_data *my_sd;
struct io_service_tree *st;
--
1.6.0.6

2009-09-24 19:27:47

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 04/28] io-controller: Common flat fair queuing code in elevator layer

This is the common fair queuing code in the elevator layer, controlled by the
config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces only
flat fair queuing support, where there is a single group, the "root group",
and all tasks belong to it.

These elevator layer changes are backward compatible; any io scheduler using
the old interfaces will continue to work.

This is essentially a lot of CFQ logic moved into the common layer so that
other IO schedulers can make use of it in a hierarchical scheduling setup.
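
As a hedged sketch of the intended usage (this is not how any in-tree
scheduler is converted in this series; struct example_queue and
example_dispatch_one() are hypothetical placeholders, while elv_select_ioq()
and elv_ioq_sched_queue() are the helpers introduced by this patch), an io
scheduler built on top of this layer can let the common code pick the next
io_queue and then dispatch from its own private queue hanging off it:

    /* dispatch path of a hypothetical io scheduler using the common layer */
    static int example_dispatch_requests(struct request_queue *q, int force)
    {
            struct example_queue *exq;

            /* the fair queuing core selects which io_queue to serve next */
            exq = elv_ioq_sched_queue(elv_select_ioq(q, force));
            if (!exq)
                    return 0;

            /* dispatch one request from the scheduler-private queue */
            return example_dispatch_one(q, exq);
    }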

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Fabio Checconi <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Aristeu Rozanski <[email protected]>
Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/Kconfig.iosched | 13 +
block/Makefile | 3 +-
block/as-iosched.c | 2 +-
block/blk.h | 6 +
block/cfq-iosched.c | 2 +-
block/deadline-iosched.c | 3 +-
block/elevator-fq.c | 1025 +++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 230 +++++++++++
block/elevator.c | 63 +++-
block/noop-iosched.c | 2 +-
include/linux/blkdev.h | 14 +
include/linux/elevator.h | 50 +++-
12 files changed, 1394 insertions(+), 19 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK

menu "IO Schedulers"

+config ELV_FAIR_QUEUING
+ bool "Elevator Fair Queuing Support"
+ default n
+ ---help---
+ Traditionally only cfq had a notion of multiple queues and did
+ fair queuing on its own. With cgroups and the need to control
+ IO, even the simple io schedulers like noop, deadline and as will
+ have one queue per cgroup and will need hierarchical fair queuing.
+ Instead of every io scheduler implementing its own fair queuing
+ logic, this option enables fair queuing in the elevator layer so
+ that other io schedulers can make use of it.
+ If unsure, say N.
+
config IOSCHED_NOOP
bool
default y
diff --git a/block/Makefile b/block/Makefile
index 19ff1e8..d545323 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
- ioctl.o genhd.o scsi_ioctl.o elevator-fq.o
+ ioctl.o genhd.o scsi_ioctl.o

obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o

obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING) += elevator-fq.o
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7a12cf6..b90acbe 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1351,7 +1351,7 @@ static void as_exit_queue(struct elevator_queue *e)
/*
* initialize elevator private data (as_data).
*/
-static void *as_init_queue(struct request_queue *q)
+static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
{
struct as_data *ad;

diff --git a/block/blk.h b/block/blk.h
index 3fae6ad..d05b4cf 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -1,6 +1,8 @@
#ifndef BLK_INTERNAL_H
#define BLK_INTERNAL_H

+#include "elevator-fq.h"
+
/* Amount of time in which a process may batch requests */
#define BLK_BATCH_TIME (HZ/50UL)

@@ -71,6 +73,8 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

+ elv_activate_rq_fair(q, rq);
+
if (e->ops->elevator_activate_req_fn)
e->ops->elevator_activate_req_fn(q, rq);
}
@@ -79,6 +83,8 @@ static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq
{
struct elevator_queue *e = q->elevator;

+ elv_deactivate_rq_fair(q, rq);
+
if (e->ops->elevator_deactivate_req_fn)
e->ops->elevator_deactivate_req_fn(q, rq);
}
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fd7080e..5a67ec0 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2448,7 +2448,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
kfree(cfqd);
}

-static void *cfq_init_queue(struct request_queue *q)
+static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
{
struct cfq_data *cfqd;
int i;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index b547cbc..25af8b9 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -347,7 +347,8 @@ static void deadline_exit_queue(struct elevator_queue *e)
/*
* initialize elevator private data (deadline_data).
*/
-static void *deadline_init_queue(struct request_queue *q)
+static void *
+deadline_init_queue(struct request_queue *q, struct elevator_queue *eq)
{
struct deadline_data *dd;

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 8343397..629ddaa 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,14 +12,23 @@
*/

#include <linux/blkdev.h>
+#include <linux/blktrace_api.h>
#include "elevator-fq.h"

+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+static struct kmem_cache *elv_ioq_pool;
+
/*
* offset from end of service tree
*/
#define ELV_IDLE_DELAY (HZ / 5)
#define ELV_SLICE_SCALE (500)
#define ELV_SERVICE_SHIFT 20
+#define ELV_HW_QUEUE_MIN (5)
+#define ELV_SERVICE_TREE_INIT ((struct io_service_tree) \
+ { RB_ROOT, NULL, 0, NULL, 0})

static void check_idle_tree_release(struct io_service_tree *st);

@@ -105,7 +114,7 @@ static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)

static void update_min_vdisktime(struct io_service_tree *st)
{
- u64 vdisktime;
+ u64 vdisktime = st->min_vdisktime;

if (st->active_entity)
vdisktime = st->active_entity->vdisktime;
@@ -141,6 +150,12 @@ static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
return ioq_of(entity)->efqd;
}

+struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+ return ioq->efqd->root_group;
+}
+EXPORT_SYMBOL(ioq_to_io_group);
+
static inline struct io_sched_data *
io_entity_sched_data(struct io_entity *entity)
{
@@ -468,6 +483,7 @@ static struct io_entity *lookup_next_io_entity(struct io_sched_data *sd)
__dequeue_io_entity(st, entity);
st->active_entity = entity;
sd->active_entity = entity;
+ update_min_vdisktime(entity->st);
break;
}
}
@@ -556,7 +572,1014 @@ init_io_entity_parent(struct io_entity *entity, struct io_entity *parent)

void elv_put_ioq(struct io_queue *ioq)
{
+ struct elv_fq_data *efqd = ioq->efqd;
+ struct elevator_queue *e = efqd->eq;
+
BUG_ON(atomic_read(&ioq->ref) <= 0);
if (!atomic_dec_and_test(&ioq->ref))
return;
+ BUG_ON(ioq->nr_queued);
+ BUG_ON(elv_ioq_busy(ioq));
+ BUG_ON(efqd->active_queue == ioq);
+
+ /* Can be called by outgoing elevator. Don't use q */
+ BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+ e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+ elv_log_ioq(efqd, ioq, "put_queue");
+ elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
+{
+ unsigned long allocated_slice, queue_charge;
+
+ allocated_slice = elv_prio_to_slice(ioq->efqd, ioq);
+
+ /*
+ * We don't want to charge more than the allocated slice, otherwise this
+ * queue can miss one dispatch round, doubling max latencies. On the
+ * other hand we don't want to charge less than the allocated slice, as
+ * we stick to the CFQ theme of a queue losing its share if it does not
+ * use the slice and moves to the back of the service tree (almost).
+ */
+ queue_charge = allocated_slice;
+ entity_served(&ioq->entity, served, queue_charge, ioq->nr_sectors);
+}
+
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+elv_var_show(unsigned int var, char *page)
+{
+ return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+elv_var_store(unsigned int *var, const char *page, size_t count)
+{
+ char *p = (char *) page;
+
+ *var = simple_strtoul(p, &p, 10);
+ return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
+ssize_t __FUNC(struct elevator_queue *e, char *page) \
+{ \
+ struct elv_fq_data *efqd = e->efqd; \
+ unsigned int __data = __VAR; \
+ if (__CONV) \
+ __data = jiffies_to_msecs(__data); \
+ return elv_var_show(__data, (page)); \
+}
+SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
+EXPORT_SYMBOL(elv_slice_sync_show);
+SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
+EXPORT_SYMBOL(elv_slice_async_show);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
+ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
+{ \
+ struct elv_fq_data *efqd = e->efqd; \
+ unsigned int __data; \
+ int ret = elv_var_store(&__data, (page), count); \
+ if (__data < (MIN)) \
+ __data = (MIN); \
+ else if (__data > (MAX)) \
+ __data = (MAX); \
+ if (__CONV) \
+ *(__PTR) = msecs_to_jiffies(__data); \
+ else \
+ *(__PTR) = __data; \
+ return ret; \
+}
+STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_sync_store);
+STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_async_store);
+#undef STORE_FUNCTION
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = q->elevator->efqd;
+
+ if (elv_nr_busy_ioq(q->elevator)) {
+ elv_log(efqd, "schedule dispatch");
+ kblockd_schedule_work(q, &efqd->unplug_work);
+ }
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+static void elv_kick_queue(struct work_struct *work)
+{
+ struct elv_fq_data *efqd =
+ container_of(work, struct elv_fq_data, unplug_work);
+ struct request_queue *q = efqd->queue;
+
+ spin_lock_irq(q->queue_lock);
+ __blk_run_queue(q);
+ spin_unlock_irq(q->queue_lock);
+}
+
+static void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+ del_timer_sync(&e->efqd->idle_slice_timer);
+ cancel_work_sync(&e->efqd->unplug_work);
+}
+
+static void elv_set_prio_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+ ioq->slice_start = jiffies;
+ ioq->slice_end = elv_prio_to_slice(efqd, ioq) + jiffies;
+ elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->slice_end - jiffies);
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+ struct io_queue *ioq = NULL;
+
+ ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+ return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+ kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq, pid_t pid,
+ int is_sync)
+{
+ RB_CLEAR_NODE(&ioq->entity.rb_node);
+ atomic_set(&ioq->ref, 0);
+ ioq->efqd = eq->efqd;
+ ioq->pid = pid;
+
+ elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+ elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
+ return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+static void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+ struct io_queue *ioq = *ioq_ptr;
+
+ if (ioq != NULL) {
+ /* Drop the reference taken by the io group */
+ elv_put_ioq(ioq);
+ *ioq_ptr = NULL;
+ }
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+static void
+put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+ int i, j;
+
+ for (i = 0; i < 2; i++)
+ for (j = 0; j < IOPRIO_BE_NR; j++)
+ elv_release_ioq(e, &iog->async_queue[i][j]);
+
+ /* Free up async idle queue */
+ elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+void *elv_io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+ int ioprio)
+{
+ struct io_queue *ioq = NULL;
+
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ ioq = iog->async_queue[0][ioprio];
+ break;
+ case IOPRIO_CLASS_BE:
+ ioq = iog->async_queue[1][ioprio];
+ break;
+ case IOPRIO_CLASS_IDLE:
+ ioq = iog->async_idle_queue;
+ break;
+ default:
+ BUG();
+ }
+
+ if (ioq)
+ return ioq->sched_queue;
+ return NULL;
+}
+EXPORT_SYMBOL(elv_io_group_async_queue_prio);
+
+void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+ int ioprio, struct io_queue *ioq)
+{
+ switch (ioprio_class) {
+ case IOPRIO_CLASS_RT:
+ iog->async_queue[0][ioprio] = ioq;
+ break;
+ case IOPRIO_CLASS_BE:
+ iog->async_queue[1][ioprio] = ioq;
+ break;
+ case IOPRIO_CLASS_IDLE:
+ iog->async_idle_queue = ioq;
+ break;
+ default:
+ BUG();
+ }
+
+ /*
+ * Take the group reference and pin the queue. Group exit will
+ * clean it up
+ */
+ elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_io_group_set_async_queue);
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ iog->entity.parent = NULL;
+ iog->entity.my_sd = &iog->sched_data;
+ iog->key = key;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
+
+ return iog;
+}
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_group *iog = e->efqd->root_group;
+ struct io_service_tree *st;
+ int i;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ flush_idle_tree(st);
+ }
+
+ put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+/*
+ * Should be called after the ioq prio and class have been initialized, as the
+ * prio class data is used to determine which service tree in the group the
+ * entity should be attached to.
+ */
+void elv_init_ioq_io_group(struct io_queue *ioq, struct io_group *iog)
+{
+ init_io_entity_parent(&ioq->entity, &iog->entity);
+}
+EXPORT_SYMBOL(elv_init_ioq_io_group);
+
+/* Get next queue for service. */
+static struct io_queue *elv_get_next_ioq(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = q->elevator->efqd;
+ struct io_entity *entity = NULL;
+ struct io_queue *ioq = NULL;
+ struct io_sched_data *sd;
+
+ BUG_ON(efqd->active_queue != NULL);
+
+ if (!efqd->busy_queues)
+ return NULL;
+
+ sd = &efqd->root_group->sched_data;
+ entity = lookup_next_io_entity(sd);
+ if (!entity)
+ return NULL;
+
+ ioq = ioq_of(entity);
+ return ioq;
+}
+
+/*
+ * coop (cooperating queue) indicates that the io scheduler selected a queue
+ * for us and that we did not select the next queue based on fairness.
+ */
+static void
+__elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
+{
+ struct request_queue *q = efqd->queue;
+ struct elevator_queue *eq = q->elevator;
+
+ if (ioq) {
+ elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+ efqd->busy_queues);
+ ioq->slice_start = ioq->slice_end = 0;
+ ioq->dispatch_start = jiffies;
+
+ elv_clear_ioq_wait_request(ioq);
+ elv_clear_ioq_must_dispatch(ioq);
+ elv_mark_ioq_slice_new(ioq);
+
+ del_timer(&efqd->idle_slice_timer);
+ }
+
+ efqd->active_queue = ioq;
+
+ /* Let iosched know if it wants to take some action */
+ if (ioq && eq->ops->elevator_active_ioq_set_fn)
+ eq->ops->elevator_active_ioq_set_fn(q, ioq->sched_queue, coop);
+}
+
+static inline int ioq_is_idling(struct io_queue *ioq)
+{
+ return (elv_ioq_wait_request(ioq) ||
+ timer_pending(&ioq->efqd->idle_slice_timer));
+}
+
+/* Get and set a new active queue for service. */
+static struct
+io_queue *elv_set_active_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = q->elevator->efqd;
+ int coop = 0;
+
+ if (ioq) {
+ requeue_ioq(ioq, 1);
+ /*
+ * The io scheduler selected the next queue for us. Pass this
+ * info back to the io scheduler; cfq currently uses it to
+ * reset the coop flag on the queue.
+ */
+ coop = 1;
+ }
+
+ ioq = elv_get_next_ioq(q);
+ __elv_set_active_ioq(efqd, ioq, coop);
+ return ioq;
+}
+
+static void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+ struct request_queue *q = efqd->queue;
+ struct elevator_queue *eq = q->elevator;
+ struct io_queue *ioq = elv_active_ioq(eq);
+
+ if (eq->ops->elevator_active_ioq_reset_fn)
+ eq->ops->elevator_active_ioq_reset_fn(q, ioq->sched_queue);
+
+ efqd->active_queue = NULL;
+ del_timer(&efqd->idle_slice_timer);
+}
+
+/* Called when an inactive queue receives a new request. */
+static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+ BUG_ON(elv_ioq_busy(ioq));
+ BUG_ON(ioq == efqd->active_queue);
+ elv_log_ioq(efqd, ioq, "add to busy");
+ enqueue_ioq(ioq);
+ elv_mark_ioq_busy(ioq);
+ efqd->busy_queues++;
+}
+
+static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = e->efqd;
+
+ BUG_ON(!elv_ioq_busy(ioq));
+ BUG_ON(ioq->nr_queued);
+ elv_log_ioq(efqd, ioq, "del from busy");
+ elv_clear_ioq_busy(ioq);
+ BUG_ON(efqd->busy_queues == 0);
+ efqd->busy_queues--;
+ dequeue_ioq(ioq);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * current queue used and adjust the start, finish time of queue and vtime
+ * of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when underlying device supports command queuing
+ * and requests from multiple queues can be there at same time, then it
+ * is not clear which queue consumed how much of disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of the queue only
+ * after the first request from the queue has completed. This does not work
+ * very well if we expire the queue before waiting for the first (and further)
+ * requests from the queue to finish. For seeky queues, we will expire the
+ * queue after dispatching a few requests without waiting, and start
+ * dispatching from the next queue.
+ *
+ * Currently one should set fairness = 1 to force completion of requests
+ * from a queue before dispatch from the next queue starts. This should help
+ * with better time accounting, at the expense of throughput.
+ */
+void elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = q->elevator->efqd;
+ long slice_used = 0, slice_overshoot = 0;
+
+ assert_spin_locked(q->queue_lock);
+ elv_log_ioq(efqd, ioq, "slice expired");
+
+ if (elv_ioq_wait_request(ioq))
+ del_timer(&efqd->idle_slice_timer);
+
+ elv_clear_ioq_wait_request(ioq);
+
+ /*
+ * Queue got expired before even a single request completed or
+ * got expired immediately after first request completion. Use
+ * the time elapsed since queue was scheduled in.
+ */
+ if (!ioq->slice_end || ioq->slice_start == jiffies) {
+ slice_used = jiffies - ioq->dispatch_start;
+ if (!slice_used)
+ slice_used = 1;
+ goto done;
+ }
+
+ slice_used = jiffies - ioq->slice_start;
+ if (time_after(jiffies, ioq->slice_end))
+ slice_overshoot = jiffies - ioq->slice_end;
+
+done:
+ elv_log_ioq(efqd, ioq, "disp_start = %lu sl_start= %lu sl_end=%lu,"
+ " jiffies=%lu", ioq->dispatch_start, ioq->slice_start,
+ ioq->slice_end, jiffies);
+ elv_log_ioq(efqd, ioq, "sl_used=%ld, overshoot=%ld sect=%lu",
+ slice_used, slice_overshoot, ioq->nr_sectors);
+ elv_ioq_served(ioq, slice_used);
+
+ BUG_ON(ioq != efqd->active_queue);
+ elv_reset_active_ioq(efqd);
+ /* Queue is being expired. Reset the number of sectors dispatched */
+ ioq->nr_sectors = 0;
+
+ put_prev_ioq(ioq);
+
+ if (!ioq->nr_queued)
+ elv_del_ioq_busy(q->elevator, ioq);
+ else if (!elv_ioq_sync(ioq)) {
+ /*
+ * Requeue async ioq so that these will be again placed at
+ * the end of service tree giving a chance to sync queues.
+ */
+ requeue_ioq(ioq, 0);
+ }
+}
+EXPORT_SYMBOL(elv_ioq_slice_expired);
+
+/* Expire the ioq. */
+void elv_slice_expired(struct request_queue *q)
+{
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+ if (ioq)
+ elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no, or if we aren't sure; a 1 will cause a preemption attempt.
+ */
+static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+ struct request *rq)
+{
+ struct io_queue *active_ioq;
+ struct elevator_queue *eq = q->elevator;
+ struct io_entity *entity, *new_entity;
+
+ active_ioq = elv_active_ioq(eq);
+
+ if (!active_ioq)
+ return 0;
+
+ entity = &active_ioq->entity;
+ new_entity = &new_ioq->entity;
+
+ /*
+ * Allow an RT request to preempt an ongoing non-RT queue's timeslice.
+ */
+
+ if (new_entity->ioprio_class == IOPRIO_CLASS_RT
+ && entity->ioprio_class != IOPRIO_CLASS_RT)
+ return 1;
+ /*
+ * Allow a BE request to preempt an ongoing IDLE class timeslice.
+ */
+
+ if (new_entity->ioprio_class == IOPRIO_CLASS_BE
+ && entity->ioprio_class == IOPRIO_CLASS_IDLE)
+ return 1;
+
+ /*
+ * Check with io scheduler if it has additional criterion based on
+ * which it wants to preempt existing queue.
+ */
+ if (eq->ops->elevator_should_preempt_fn) {
+ void *sched_queue = elv_ioq_sched_queue(new_ioq);
+
+ return eq->ops->elevator_should_preempt_fn(q, sched_queue, rq);
+ }
+
+ return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+ elv_log_ioq(q->elevator->efqd, ioq, "preempt");
+ elv_slice_expired(q);
+
+ /*
+ * Put the new queue at the front of the of the current list,
+ * so we know that it will be selected next.
+ */
+
+ requeue_ioq(ioq, 1);
+ elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = q->elevator->efqd;
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ BUG_ON(!efqd);
+ BUG_ON(!ioq);
+ ioq->nr_queued++;
+ elv_log_ioq(efqd, ioq, "add rq: rq_queued=%d", ioq->nr_queued);
+
+ if (!elv_ioq_busy(ioq))
+ elv_add_ioq_busy(efqd, ioq);
+
+ if (ioq == elv_active_ioq(q->elevator)) {
+ /*
+ * Remember that we saw a request from this process, but
+ * don't start queuing just yet. Otherwise we risk seeing lots
+ * of tiny requests, because we disrupt the normal plugging
+ * and merging. If the request is already larger than a single
+ * page, let it rip immediately. For that case we assume that
+ * merging is already done. Ditto for a busy system that
+ * has other work pending, don't risk delaying until the
+ * idle timer unplug to continue working.
+ */
+ if (elv_ioq_wait_request(ioq)) {
+ del_timer(&efqd->idle_slice_timer);
+ elv_clear_ioq_wait_request(ioq);
+ if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+ efqd->busy_queues > 1 || !blk_queue_plugged(q))
+ __blk_run_queue(q);
+ else
+ elv_mark_ioq_must_dispatch(ioq);
+ }
+ } else if (elv_should_preempt(q, ioq, rq)) {
+ /*
+ * not the active queue - expire the current slice if it is
+ * idle and has expired its mean thinktime, or this new queue
+ * has some old slice time left and is of higher priority, or
+ * this new queue is RT and the current one is BE
+ */
+ elv_preempt_queue(q, ioq);
+ __blk_run_queue(q);
+ }
+}
+
+static void elv_idle_slice_timer(unsigned long data)
+{
+ struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+ struct io_queue *ioq;
+ unsigned long flags;
+ struct request_queue *q = efqd->queue;
+
+ elv_log(efqd, "idle timer fired");
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ ioq = efqd->active_queue;
+
+ if (ioq) {
+
+ elv_clear_ioq_wait_request(ioq);
+
+ /*
+ * We saw a request before the queue expired, let it through
+ */
+ if (elv_ioq_must_dispatch(ioq))
+ goto out_kick;
+
+ /*
+ * expired
+ */
+ if (elv_ioq_slice_used(ioq))
+ goto expire;
+
+ /*
+ * only expire and reinvoke request handler, if there are
+ * other queues with pending requests
+ */
+ if (!elv_nr_busy_ioq(q->elevator))
+ goto out_cont;
+
+ /*
+ * not expired and it has a request pending, let it dispatch
+ */
+ if (ioq->nr_queued)
+ goto out_kick;
+ }
+expire:
+ elv_slice_expired(q);
+out_kick:
+ elv_schedule_dispatch(q);
+out_cont:
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+ struct elevator_queue *eq = q->elevator;
+ struct io_queue *ioq = elv_active_ioq(eq);
+
+ if (eq->ops->elevator_arm_slice_timer_fn)
+ eq->ops->elevator_arm_slice_timer_fn(q, ioq->sched_queue);
+}
+
+/*
+ * If the io scheduler keeps track of close cooperators, check with it
+ * whether it has a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+ struct io_queue *ioq)
+{
+ struct elevator_queue *e = q->elevator;
+ struct io_queue *new_ioq = NULL;
+ void *sched_queue = ioq->sched_queue;
+
+ if (q->elevator->ops->elevator_close_cooperator_fn)
+ new_ioq = e->ops->elevator_close_cooperator_fn(q, sched_queue);
+
+ if (new_ioq)
+ elv_log_ioq(e->efqd, ioq, "cooperating ioq=%d", new_ioq->pid);
+
+ return new_ioq;
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_select_ioq(struct request_queue *q, int force)
+{
+ struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+
+ if (!elv_nr_busy_ioq(q->elevator))
+ return NULL;
+
+ if (ioq == NULL)
+ goto new_queue;
+
+ /* There is only one active queue which is empty. Nothing to dispatch */
+ if (elv_nr_busy_ioq(q->elevator) == 1 && !ioq->nr_queued)
+ return NULL;
+
+ /*
+ * Force dispatch. Continue to dispatch from current queue as long
+ * as it has requests.
+ */
+ if (unlikely(force)) {
+ if (ioq->nr_queued)
+ goto keep_queue;
+ else
+ goto expire;
+ }
+
+ /*
+ * The active queue has run out of time, expire it and select new.
+ */
+ if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+ goto expire;
+
+ /*
+ * The active queue has requests and isn't expired, allow it to
+ * dispatch.
+ */
+
+ if (ioq->nr_queued)
+ goto keep_queue;
+
+ /*
+ * If another queue has a request waiting within our mean seek
+ * distance, let it run. The expire code will check for close
+ * cooperators and put the close queue at the front of the service
+ * tree.
+ */
+ new_ioq = elv_close_cooperator(q, ioq);
+ if (new_ioq)
+ goto expire;
+
+ /*
+ * No requests pending. If the active queue still has requests in
+ * flight or is idling for a new request, allow either of these
+ * conditions to happen (or time out) before selecting a new queue.
+ */
+
+ if (ioq_is_idling(ioq) ||
+ (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+ ioq = NULL;
+ goto keep_queue;
+ }
+
+expire:
+ elv_slice_expired(q);
+new_queue:
+ ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+ return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+ struct io_queue *ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ ioq = rq->ioq;
+ BUG_ON(!ioq);
+ ioq->nr_queued--;
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_dispatched_request_fair(struct elevator_queue *e, struct request *rq)
+{
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ BUG_ON(!ioq);
+ ioq->dispatched++;
+ ioq->nr_sectors += blk_rq_sectors(rq);
+ elv_ioq_request_removed(e, rq);
+ elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_activate_rq_fair(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ efqd->rq_in_driver++;
+ elv_log_ioq(efqd, rq->ioq, "activate rq, drv=%d",
+ efqd->rq_in_driver);
+}
+
+void elv_deactivate_rq_fair(struct request_queue *q, struct request *rq)
+{
+ struct elv_fq_data *efqd = q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ WARN_ON(!efqd->rq_in_driver);
+ efqd->rq_in_driver--;
+ elv_log_ioq(efqd, rq->ioq, "deactivate rq, drv=%d",
+ efqd->rq_in_driver);
+}
+
+/*
+ * If this is the only queue and it has completed all its requests and has
+ * nothing to dispatch, expire it. We don't want to keep it around idle;
+ * otherwise, when it is expired later, all this idle time will be added to
+ * the queue's disk time used and the queue might not get a chance to run for
+ * a long time.
+ */
+static inline void
+check_expire_last_empty_queue(struct request_queue *q, struct io_queue *ioq)
+{
+ struct elv_fq_data *efqd = q->elevator->efqd;
+
+ if (efqd->busy_queues != 1)
+ return;
+
+ if (ioq->dispatched || ioq->nr_queued)
+ return;
+
+ /*
+ * Anticipation is on. Don't expire queue. Either a new request will
+ * come or it is up to io scheduler to expire the queue once idle
+ * timer fires
+ */
+
+ if (ioq_is_idling(ioq))
+ return;
+
+ elv_log_ioq(efqd, ioq, "expire last empty queue");
+ elv_slice_expired(q);
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+ const int sync = rq_is_sync(rq);
+ struct io_queue *ioq;
+ struct elv_fq_data *efqd = q->elevator->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ ioq = rq->ioq;
+ WARN_ON(!efqd->rq_in_driver);
+ WARN_ON(!ioq->dispatched);
+ efqd->rq_in_driver--;
+ ioq->dispatched--;
+
+ elv_log_ioq(efqd, ioq, "complete rq_queued=%d drv=%d disp=%d",
+ ioq->nr_queued, efqd->rq_in_driver,
+ elv_ioq_nr_dispatched(ioq));
+ /*
+ * If this is the active queue, check if it needs to be expired,
+ * or if we want to idle in case it has no pending requests.
+ */
+
+ if (elv_active_ioq(q->elevator) == ioq) {
+ if (elv_ioq_slice_new(ioq)) {
+ elv_set_prio_slice(q->elevator->efqd, ioq);
+ elv_clear_ioq_slice_new(ioq);
+ }
+
+ /*
+ * If there are no requests waiting in this queue, and
+ * there are other queues ready to issue requests, AND
+ * those other queues are issuing requests within our
+ * mean seek distance, give them a chance to run instead
+ * of idling.
+ */
+ if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+ elv_slice_expired(q);
+ else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
+ && sync && !rq_noidle(rq))
+ elv_ioq_arm_slice_timer(q);
+
+ check_expire_last_empty_queue(q, ioq);
+ }
+
+ if (!efqd->rq_in_driver)
+ elv_schedule_dispatch(q);
}
+
+/*
+ * The process associated with the ioq (in the case of cfq) is going away.
+ * Mark it for deletion.
+ */
+void elv_exit_ioq(struct io_queue *ioq)
+{
+ struct io_entity *entity = &ioq->entity;
+
+ /*
+ * Async ioqs belong to the io group and are cleaned up once the group
+ * is deleted. No need to do any cleanup here even if cfq has dropped
+ * its reference to the queue.
+ */
+ if (!elv_ioq_sync(ioq))
+ return;
+
+ /*
+ * This queue is still under service. Just mark it so that once all
+ * the IO from queue is done, it is not put back in idle tree.
+ */
+ if (entity->on_st) {
+ entity->exiting = 1;
+ return;
+ } else if (entity->on_idle_st) {
+ /* Remove ioq from idle tree */
+ dequeue_io_entity_idle(entity);
+ }
+}
+EXPORT_SYMBOL(elv_exit_ioq);
+
+static void elv_slab_kill(void)
+{
+ /*
+ * Caller already ensured that pending RCU callbacks are completed,
+ * so we should have no busy allocations at this point.
+ */
+ if (elv_ioq_pool)
+ kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+ elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+ if (!elv_ioq_pool)
+ goto fail;
+
+ return 0;
+fail:
+ elv_slab_kill();
+ return -ENOMEM;
+}
+
+struct elv_fq_data *
+elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+ struct elv_fq_data *efqd = NULL;
+
+ efqd = kmalloc_node(sizeof(*efqd), GFP_KERNEL | __GFP_ZERO, q->node);
+ return efqd;
+}
+
+void elv_release_fq_data(struct elv_fq_data *efqd)
+{
+ kfree(efqd);
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+ struct io_group *iog;
+ struct elv_fq_data *efqd = e->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return 0;
+
+ iog = io_alloc_root_group(q, e, efqd);
+ if (iog == NULL)
+ return 1;
+
+ efqd->root_group = iog;
+
+ /*
+ * Our fallback ioq if elv_alloc_ioq() runs into OOM issues.
+ * Grab a permanent reference to it, so that the normal code flow
+ * will not attempt to free it.
+ */
+ elv_init_ioq(e, &efqd->oom_ioq, 1, 0);
+ elv_get_ioq(&efqd->oom_ioq);
+ elv_init_ioq_io_group(&efqd->oom_ioq, iog);
+
+ efqd->queue = q;
+ efqd->eq = e;
+
+ init_timer(&efqd->idle_slice_timer);
+ efqd->idle_slice_timer.function = elv_idle_slice_timer;
+ efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+ INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+ efqd->elv_slice[0] = elv_slice_async;
+ efqd->elv_slice[1] = elv_slice_sync;
+
+ return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask the elevator to clean up its queues, we do the cleanup here so
+ * that all the group and idle tree references to the ioq are dropped.
+ * Later, during elevator cleanup, the ioc reference will be dropped,
+ * which leads to removal of the io scheduler queue as well as the
+ * associated ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+ struct elv_fq_data *efqd = e->efqd;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return;
+
+ elv_shutdown_timer_wq(e);
+
+ BUG_ON(timer_pending(&efqd->idle_slice_timer));
+ io_free_root_group(e);
+}
+
+static int __init elv_fq_init(void)
+{
+ if (elv_slab_setup())
+ return -ENOMEM;
+
+ /* could be 0 on HZ < 1000 setups */
+
+ if (!elv_slice_async)
+ elv_slice_async = 1;
+
+ return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index ee46a47..6ea0d18 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -22,6 +22,10 @@
#define IO_WEIGHT_DEFAULT 500
#define IO_IOPRIO_CLASSES 3

+#ifdef CONFIG_ELV_FAIR_QUEUING
+#define ELV_ATTR(name) \
+ __ATTR(name, S_IRUGO|S_IWUSR, elv_##name##_show, elv_##name##_store)
+
struct io_service_tree {
struct rb_root active;
struct io_entity *active_entity;
@@ -68,23 +72,80 @@ struct io_queue {

/* Pointer to generic elevator fair queuing data structure */
struct elv_fq_data *efqd;
+ pid_t pid;
+
+ /* Number of requests queued on this io queue */
+ unsigned long nr_queued;
+
+ /* Requests dispatched from this queue */
+ int dispatched;
+
+ /* Number of sectors dispatched in current dispatch round */
+ unsigned long nr_sectors;
+
+ /* time when dispatch from the queue was started */
+ unsigned long dispatch_start;
+ /* time when first request from queue completed and slice started. */
+ unsigned long slice_start;
+ unsigned long slice_end;
+
+ /* Pointer to io scheduler's queue */
+ void *sched_queue;
};

struct io_group {
struct io_entity entity;
struct io_sched_data sched_data;
+ /*
+ * async queue for each priority case for RT and BE class.
+ * Used only for cfq.
+ */
+
+ struct io_queue *async_queue[2][IOPRIO_BE_NR];
+ struct io_queue *async_idle_queue;
+ void *key;
};

struct elv_fq_data {
struct io_group *root_group;

+ struct request_queue *queue;
+ struct elevator_queue *eq;
+ unsigned int busy_queues;
+
+ /* Pointer to the ioscheduler queue being served */
+ void *active_queue;
+
+ int rq_in_driver;
+
+ struct timer_list idle_slice_timer;
+ struct work_struct unplug_work;
+
/* Base slice length for sync and async queues */
unsigned int elv_slice[2];
+
+ /* Fallback dummy ioq for extreme OOM conditions */
+ struct io_queue oom_ioq;
};

+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+ blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid, \
+ elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+ blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples) ((samples) > 80)
+
/* Some shared queue flag manipulation functions among elevators */

enum elv_queue_state_flags {
+ ELV_QUEUE_FLAG_busy, /* has requests or is under service */
+ ELV_QUEUE_FLAG_wait_request, /* waiting for a request */
+ ELV_QUEUE_FLAG_must_dispatch, /* must be allowed a dispatch */
+ ELV_QUEUE_FLAG_idle_window, /* elevator slice idling enabled */
+ ELV_QUEUE_FLAG_slice_new, /* no requests dispatched in slice */
ELV_QUEUE_FLAG_sync, /* synchronous queue */
};

@@ -102,6 +163,11 @@ static inline int elv_ioq_##name(struct io_queue *ioq) \
return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0; \
}

+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
ELV_IO_QUEUE_FLAG_FNS(sync)

static inline void elv_get_ioq(struct io_queue *ioq)
@@ -150,6 +216,170 @@ static inline int elv_ioq_ioprio(struct io_queue *ioq)
return ioq->entity.ioprio;
}

+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+ if (elv_ioq_slice_new(ioq))
+ return 0;
+ if (time_before(jiffies, ioq->slice_end))
+ return 0;
+
+ return 1;
+}
+
+/* How many request are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+ return ioq->dispatched;
+}
+
+/* How many request are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+ return ioq->nr_queued;
+}
+
+static inline void *elv_ioq_sched_queue(struct io_queue *ioq)
+{
+ if (ioq)
+ return ioq->sched_queue;
+ return NULL;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+ return e->efqd->active_queue;
+}
+
+static inline void *elv_active_sched_queue(struct elevator_queue *e)
+{
+ return elv_ioq_sched_queue(elv_active_ioq(e));
+}
+
+static inline int elv_rq_in_driver(struct elevator_queue *e)
+{
+ return e->efqd->rq_in_driver;
+}
+
+static inline int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+ return e->efqd->busy_queues;
+}
+
+/* Helper functions for operating on elevator idle slice timer */
+static inline int
+elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+ return mod_timer(&eq->efqd->idle_slice_timer, expires);
+}
+
+static inline int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+ return del_timer(&eq->efqd->idle_slice_timer);
+}
+
+static inline void
+elv_init_ioq_sched_queue(struct elevator_queue *eq, struct io_queue *ioq,
+ void *sched_queue)
+{
+ ioq->sched_queue = sched_queue;
+}
+
+static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
+{
+ return &eq->efqd->oom_ioq;
+}
+
+static inline struct io_group *
+elv_io_get_io_group(struct request_queue *q, int create)
+{
+ /* In flat mode, there is only root group */
+ return q->elevator->efqd->root_group;
+}
+
+extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
+ size_t count);
+extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
+ size_t count);
+
+/* Functions used by elevator.c */
+extern struct elv_fq_data *elv_alloc_fq_data(struct request_queue *q,
+ struct elevator_queue *e);
+extern void elv_release_fq_data(struct elv_fq_data *efqd);
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+ struct request *rq);
+extern void elv_dispatched_request_fair(struct elevator_queue *e,
+ struct request *rq);
+
+extern void elv_activate_rq_fair(struct request_queue *q, struct request *rq);
+extern void elv_deactivate_rq_fair(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+ struct request *rq);
+
+extern void *elv_select_ioq(struct request_queue *q, int force);
+
+/* Functions used by io schedulers */
extern void elv_put_ioq(struct io_queue *ioq);
+extern void elv_ioq_slice_expired(struct request_queue *q,
+ struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+ pid_t pid, int is_sync);
+extern void elv_init_ioq_io_group(struct io_queue *ioq, struct io_group *iog);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern void *elv_io_group_async_queue_prio(struct io_group *iog,
+ int ioprio_class, int ioprio);
+extern void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+ int ioprio, struct io_queue *ioq);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+extern struct io_group *ioq_to_io_group(struct io_queue *ioq);
+extern void elv_exit_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+static inline struct elv_fq_data *
+elv_alloc_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+ return 0;
+}
+static inline void elv_release_fq_data(struct elv_fq_data *efqd) {}
+
+static inline int
+elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+ return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+
+static inline void
+elv_activate_rq_fair(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_deactivate_rq_fair(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_dispatched_request_fair(struct elevator_queue *e, struct request *rq) {}
+
+static inline void
+elv_ioq_request_removed(struct elevator_queue *e, struct request *rq) {}
+
+static inline void
+elv_ioq_request_add(struct request_queue *q, struct request *rq) {}
+
+static inline void
+elv_ioq_completed_request(struct request_queue *q, struct request *rq) {}
+
+static inline void *elv_ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline void *elv_select_ioq(struct request_queue *q, int force)
+{
+ return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _ELV_SCHED_H */
#endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 2d511f9..ea4042e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -53,6 +53,15 @@ static const int elv_hash_shift = 6;
#define ELV_HASH_ENTRIES (1 << elv_hash_shift)
#define rq_hash_key(rq) (blk_rq_pos(rq) + blk_rq_sectors(rq))

+static inline struct elv_fq_data *elv_efqd(struct elevator_queue *eq)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ return eq->efqd;
+#else
+ return NULL;
+#endif
+}
+
/*
* Query io scheduler to see if the current process issuing bio may be
* merged with rq.
@@ -187,7 +196,7 @@ static struct elevator_type *elevator_get(const char *name)
static void *elevator_init_queue(struct request_queue *q,
struct elevator_queue *eq)
{
- return eq->ops->elevator_init_fn(q);
+ return eq->ops->elevator_init_fn(q, eq);
}

static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
@@ -239,8 +248,21 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
for (i = 0; i < ELV_HASH_ENTRIES; i++)
INIT_HLIST_HEAD(&eq->hash[i]);

+#ifdef CONFIG_ELV_FAIR_QUEUING
+ eq->efqd = elv_alloc_fq_data(q, eq);
+
+ if (!eq->efqd)
+ goto err;
+
+ if (elv_init_fq_data(q, eq))
+ goto err;
+#endif
return eq;
err:
+ if (elv_efqd(eq))
+ elv_release_fq_data(elv_efqd(eq));
+ if (eq->hash)
+ kfree(eq->hash);
kfree(eq);
elevator_put(e);
return NULL;
@@ -252,6 +274,7 @@ static void elevator_release(struct kobject *kobj)

e = container_of(kobj, struct elevator_queue, kobj);
elevator_put(e->elevator_type);
+ elv_release_fq_data(elv_efqd(e));
kfree(e->hash);
kfree(e);
}
@@ -309,6 +332,7 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
+ elv_exit_fq_data(e);
if (e->ops->elevator_exit_fn)
e->ops->elevator_exit_fn(e);
e->ops = NULL;
@@ -438,6 +462,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
elv_rqhash_del(q, rq);

q->nr_sorted--;
+ elv_dispatched_request_fair(q->elevator, rq);

boundary = q->end_sector;
stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -478,6 +503,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
elv_rqhash_del(q, rq);

q->nr_sorted--;
+ elv_dispatched_request_fair(q->elevator, rq);

q->end_sector = rq_end_sector(rq);
q->boundary_rq = rq;
@@ -545,6 +571,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
elv_rqhash_del(q, next);

q->nr_sorted--;
+ elv_ioq_request_removed(e, next);
q->last_merge = rq;
}

@@ -651,12 +678,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
q->last_merge = rq;
}

- /*
- * Some ioscheds (cfq) run q->request_fn directly, so
- * rq cannot be accessed after calling
- * elevator_add_req_fn.
- */
q->elevator->ops->elevator_add_req_fn(q, rq);
+ elv_ioq_request_add(q, rq);
break;

case ELEVATOR_INSERT_REQUEUE:
@@ -755,13 +778,12 @@ EXPORT_SYMBOL(elv_add_request);

int elv_queue_empty(struct request_queue *q)
{
- struct elevator_queue *e = q->elevator;
-
if (!list_empty(&q->queue_head))
return 0;

- if (e->ops->elevator_queue_empty_fn)
- return e->ops->elevator_queue_empty_fn(q);
+ /* Rely on nr_sorted instead of calling queue_empty_fn */
+ if (q->nr_sorted)
+ return 0;

return 1;
}
@@ -841,8 +863,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
*/
if (blk_account_rq(rq)) {
q->in_flight[rq_is_sync(rq)]--;
- if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
- e->ops->elevator_completed_req_fn(q, rq);
+ if (blk_sorted_rq(rq)) {
+ if (e->ops->elevator_completed_req_fn)
+ e->ops->elevator_completed_req_fn(q, rq);
+ elv_ioq_completed_request(q, rq);
+ }
}

/*
@@ -1138,3 +1163,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
return NULL;
}
EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq */
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+ return elv_ioq_sched_queue(req_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+ return elv_ioq_sched_queue(elv_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..36fc210 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -65,7 +65,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
return list_entry(rq->queuelist.next, struct request, queuelist);
}

-static void *noop_init_queue(struct request_queue *q)
+static void *noop_init_queue(struct request_queue *q, struct elevator_queue *eq)
{
struct noop_data *nd;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69103e0..7cff5f2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -229,6 +229,11 @@ struct request {

/* for bidi */
struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ /* io queue request belongs to */
+ struct io_queue *ioq;
+#endif
};

static inline unsigned short req_get_ioprio(struct request *req)
@@ -236,6 +241,15 @@ static inline unsigned short req_get_ioprio(struct request *req)
return req->ioprio;
}

+static inline struct io_queue *req_ioq(struct request *req)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ return req->ioq;
+#else
+ return NULL;
+#endif
+}
+
/*
* State information carried for REQ_TYPE_PM_SUSPEND and REQ_TYPE_PM_RESUME
* requests. Some step values could eventually be made generic.
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1cb3372..4414a61 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -27,8 +27,19 @@ typedef void (elevator_put_req_fn) (struct request *);
typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);

-typedef void *(elevator_init_fn) (struct request_queue *);
+typedef void *(elevator_init_fn) (struct request_queue *,
+ struct elevator_queue *);
typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+ struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+ void*);
+#endif

struct elevator_ops
{
@@ -56,6 +67,16 @@ struct elevator_ops
elevator_init_fn *elevator_init_fn;
elevator_exit_fn *elevator_exit_fn;
void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+ elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+ elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+ elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+ elevator_should_preempt_fn *elevator_should_preempt_fn;
+ elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
};

#define ELV_NAME_MAX (16)
@@ -76,6 +97,9 @@ struct elevator_type
struct elv_fs_entry *elevator_attrs;
char elevator_name[ELV_NAME_MAX];
struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ int elevator_features;
+#endif
};

/*
@@ -89,6 +113,10 @@ struct elevator_queue
struct elevator_type *elevator_type;
struct mutex sysfs_lock;
struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+ /* fair queuing data */
+ struct elv_fq_data *efqd;
+#endif
};

/*
@@ -207,5 +235,25 @@ enum {
__val; \
})

+/* iosched can let elevator know their feature set/capability */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fair queuing logic of elevator layer */
+#define ELV_IOSCHED_NEED_FQ 1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+ return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+ return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.6

2009-09-24 19:31:50

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 05/28] io-controller: Modify cfq to make use of flat elevator fair queuing

This patch changes cfq to use the fair queuing code from the elevator layer.
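
To make the shape of the conversion clearer: after this patch cfq no longer keeps
its own service tree, idle slice timer or active queue; it only advertises that it
wants the common fair queuing layer and asks that layer which io queue to dispatch
from. The fragment below is an illustrative sketch, not part of the patch, showing
how a hypothetical single-queue ioscheduler ("myfq") would hook into the new
interface. Only ELV_IOSCHED_NEED_FQ, elv_select_sched_queue() and the elevator_type
fields come from this series; everything else is made up for illustration.

/*
 * Illustrative sketch only, not part of the patch.
 * (The iosched's Kconfig entry also needs: select ELV_FAIR_QUEUING.)
 */
static int myfq_dispatch(struct request_queue *q, int force)
{
	/* Let the elevator fair queuing layer pick the io queue to serve */
	void *sched_queue = elv_select_sched_queue(q, force);

	if (!sched_queue)
		return 0;

	/* ... dispatch one request belonging to sched_queue here ... */
	return 1;
}

static struct elevator_type iosched_myfq = {
	.ops = {
		.elevator_dispatch_fn	= myfq_dispatch,
		/* merge/add/completed hooks and the new *_fn callbacks ... */
	},
	/* ask the elevator layer to do fair queuing on our behalf */
	.elevator_features	= ELV_IOSCHED_NEED_FQ,
	.elevator_name		= "myfq",
	.elevator_owner		= THIS_MODULE,
};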

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Fabio Checconi <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/Kconfig.iosched | 3 +-
block/cfq-iosched.c | 981 +++++++++++--------------------------------------
2 files changed, 218 insertions(+), 766 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
menu "IO Schedulers"

config ELV_FAIR_QUEUING
- bool "Elevator Fair Queuing Support"
+ bool
default n
---help---
Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE

config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
+ select ELV_FAIR_QUEUING
default y
---help---
The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5a67ec0..3e24c03 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,6 +12,7 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
+#include "elevator-fq.h"

/*
* tunables
@@ -23,17 +24,10 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
static const int cfq_back_max = 16 * 1024;
/* penalty of a backwards seek */
static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;

/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY (HZ / 5)
-
-/*
* below this threshold, we consider thinktime immediate
*/
#define CFQ_MIN_TT (2)
@@ -43,7 +37,7 @@ static int cfq_slice_idle = HZ / 125;

#define RQ_CIC(rq) \
((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq) (struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq) (struct cfq_queue *) (elv_ioq_sched_queue((rq)->ioq))

static struct kmem_cache *cfq_pool;
static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +47,6 @@ static struct completion *ioc_gone;
static DEFINE_SPINLOCK(ioc_gone_lock);

#define CFQ_PRIO_LISTS IOPRIO_BE_NR
-#define cfq_class_idle(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)

#define sample_valid(samples) ((samples) > 80)

@@ -74,16 +66,11 @@ struct cfq_rb_root {
* Per process-grouping structure
*/
struct cfq_queue {
- /* reference count */
- atomic_t ref;
+ struct io_queue *ioq;
/* various state flags, see below */
unsigned int flags;
/* parent cfq_data */
struct cfq_data *cfqd;
- /* service_tree member */
- struct rb_node rb_node;
- /* service_tree key */
- unsigned long rb_key;
/* prio tree member */
struct rb_node p_node;
/* prio tree root we belong to, if any */
@@ -99,18 +86,13 @@ struct cfq_queue {
/* fifo list of requests in sort_list */
struct list_head fifo;

- unsigned long slice_end;
- long slice_resid;
unsigned int slice_dispatch;

/* pending metadata requests */
int meta_pending;
- /* number of requests that are on the dispatch list or inside driver */
- int dispatched;

/* io prio of this group */
- unsigned short ioprio, org_ioprio;
- unsigned short ioprio_class, org_ioprio_class;
+ unsigned short org_ioprio, org_ioprio_class;

pid_t pid;
};
@@ -120,12 +102,6 @@ struct cfq_queue {
*/
struct cfq_data {
struct request_queue *queue;
-
- /*
- * rr list of queues with requests and the count of them
- */
- struct cfq_rb_root service_tree;
-
/*
* Each priority tree is sorted by next_request position. These
* trees are used when determining if two or more queues are
@@ -133,14 +109,6 @@ struct cfq_data {
*/
struct rb_root prio_trees[CFQ_PRIO_LISTS];

- unsigned int busy_queues;
- /*
- * Used to track any pending rt requests so we can pre-empt current
- * non-RT cfqq in service when this value is non-zero.
- */
- unsigned int busy_rt_queues;
-
- int rq_in_driver;
int sync_flight;

/*
@@ -151,21 +119,8 @@ struct cfq_data {
int hw_tag_samples;
int rq_in_driver_peak;

- /*
- * idle window management
- */
- struct timer_list idle_slice_timer;
- struct work_struct unplug_work;
-
- struct cfq_queue *active_queue;
struct cfq_io_context *active_cic;

- /*
- * async queue for each priority case
- */
- struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
- struct cfq_queue *async_idle_cfqq;
-
sector_t last_position;

/*
@@ -175,7 +130,6 @@ struct cfq_data {
unsigned int cfq_fifo_expire[2];
unsigned int cfq_back_penalty;
unsigned int cfq_back_max;
- unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;

@@ -188,16 +142,10 @@ struct cfq_data {
};

enum cfqq_state_flags {
- CFQ_CFQQ_FLAG_on_rr = 0, /* on round-robin busy list */
- CFQ_CFQQ_FLAG_wait_request, /* waiting for a request */
- CFQ_CFQQ_FLAG_must_dispatch, /* must be allowed a dispatch */
CFQ_CFQQ_FLAG_must_alloc, /* must be allowed rq alloc */
CFQ_CFQQ_FLAG_must_alloc_slice, /* per-slice must_alloc flag */
CFQ_CFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */
- CFQ_CFQQ_FLAG_idle_window, /* slice idling enabled */
CFQ_CFQQ_FLAG_prio_changed, /* task priority has changed */
- CFQ_CFQQ_FLAG_slice_new, /* no requests dispatched in slice */
- CFQ_CFQQ_FLAG_sync, /* synchronous queue */
CFQ_CFQQ_FLAG_coop, /* has done a coop jump of the queue */
};

@@ -215,16 +163,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq) \
return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0; \
}

-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
CFQ_CFQQ_FNS(must_alloc);
CFQ_CFQQ_FNS(must_alloc_slice);
CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
CFQ_CFQQ_FNS(coop);
#undef CFQ_CFQQ_FNS

@@ -263,66 +205,27 @@ static inline int cfq_bio_sync(struct bio *bio)
return 0;
}

-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
{
- if (cfqd->busy_queues) {
- cfq_log(cfqd, "schedule dispatch");
- kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
- }
+ return ioq_to_io_group(cfqq->ioq);
}

-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
{
- struct cfq_data *cfqd = q->elevator->elevator_data;
-
- return !cfqd->busy_queues;
+ return elv_ioq_class_idle(cfqq->ioq);
}

-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
- unsigned short prio)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
{
- const int base_slice = cfqd->cfq_slice[sync];
-
- WARN_ON(prio >= IOPRIO_BE_NR);
-
- return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
+ return elv_ioq_sync(cfqq->ioq);
}

-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
{
- return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
-}
-
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
- cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
- cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
-}
-
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
-{
- if (cfq_cfqq_slice_new(cfqq))
- return 0;
- if (time_before(jiffies, cfqq->slice_end))
- return 0;
+ struct cfq_data *cfqd = cfqq->cfqd;
+ struct elevator_queue *e = cfqd->queue->elevator;

- return 1;
+ return (elv_active_sched_queue(e) == cfqq);
}

/*
@@ -421,33 +324,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
}

/*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
- if (!root->left)
- root->left = rb_first(&root->rb);
-
- if (root->left)
- return rb_entry(root->left, struct cfq_queue, rb_node);
-
- return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
- rb_erase(n, root);
- RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
- if (root->left == n)
- root->left = NULL;
- rb_erase_init(n, &root->rb);
-}
-
-/*
* would be nice to take fifo expire time into account as well
*/
static struct request *
@@ -474,95 +350,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
return cfq_choose_req(cfqd, next, prev);
}

-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- /*
- * just an approximation, should be ok.
- */
- return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
- cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- int add_front)
-{
- struct rb_node **p, *parent;
- struct cfq_queue *__cfqq;
- unsigned long rb_key;
- int left;
-
- if (cfq_class_idle(cfqq)) {
- rb_key = CFQ_IDLE_DELAY;
- parent = rb_last(&cfqd->service_tree.rb);
- if (parent && parent != &cfqq->rb_node) {
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- rb_key += __cfqq->rb_key;
- } else
- rb_key += jiffies;
- } else if (!add_front) {
- rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
- rb_key += cfqq->slice_resid;
- cfqq->slice_resid = 0;
- } else
- rb_key = 0;
-
- if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
- /*
- * same position, nothing more to do
- */
- if (rb_key == cfqq->rb_key)
- return;
-
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
- }
-
- left = 1;
- parent = NULL;
- p = &cfqd->service_tree.rb.rb_node;
- while (*p) {
- struct rb_node **n;
-
- parent = *p;
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
- /*
- * sort RT queues first, we always want to give
- * preference to them. IDLE queues goes to the back.
- * after that, sort on the next service time.
- */
- if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
- n = &(*p)->rb_right;
- else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
- n = &(*p)->rb_right;
- else if (rb_key < __cfqq->rb_key)
- n = &(*p)->rb_left;
- else
- n = &(*p)->rb_right;
-
- if (n == &(*p)->rb_right)
- left = 0;
-
- p = n;
- }
-
- if (left)
- cfqd->service_tree.left = &cfqq->rb_node;
-
- cfqq->rb_key = rb_key;
- rb_link_node(&cfqq->rb_node, parent, p);
- rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
static struct cfq_queue *
cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
sector_t sector, struct rb_node **ret_parent,
@@ -624,57 +411,43 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
cfqq->p_root = NULL;
}

-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
{
- /*
- * Resorting requires the cfqq to be on the RR list already.
- */
- if (cfq_cfqq_on_rr(cfqq)) {
- cfq_service_tree_add(cfqd, cfqq, 0);
- cfq_prio_tree_add(cfqd, cfqq);
- }
-}
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = sched_queue;

-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
- cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
- BUG_ON(cfq_cfqq_on_rr(cfqq));
- cfq_mark_cfqq_on_rr(cfqq);
- cfqd->busy_queues++;
- if (cfq_class_rt(cfqq))
- cfqd->busy_rt_queues++;
+ if (cfqd->active_cic) {
+ put_io_context(cfqd->active_cic->ioc);
+ cfqd->active_cic = NULL;
+ }

- cfq_resort_rr_list(cfqd, cfqq);
+ /* Resort the cfqq in prio tree */
+ if (cfqq)
+ cfq_prio_tree_add(cfqd, cfqq);
}

-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+ int coop)
{
- cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
- BUG_ON(!cfq_cfqq_on_rr(cfqq));
- cfq_clear_cfqq_on_rr(cfqq);
+ struct cfq_queue *cfqq = sched_queue;

- if (!RB_EMPTY_NODE(&cfqq->rb_node))
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
- if (cfqq->p_root) {
- rb_erase(&cfqq->p_node, cfqq->p_root);
- cfqq->p_root = NULL;
- }
+ cfqq->slice_dispatch = 0;
+
+ cfq_clear_cfqq_must_alloc_slice(cfqq);
+ cfq_clear_cfqq_fifo_expire(cfqq);

- BUG_ON(!cfqd->busy_queues);
- cfqd->busy_queues--;
- if (cfq_class_rt(cfqq))
- cfqd->busy_rt_queues--;
+ /*
+ * If queue was selected because it was a close cooperator, then
+ * mark it so that it is not selected again and again. Otherwise
+ * clear the coop flag so that it becomes eligible to get selected
+ * again.
+ */
+ if (coop)
+ cfq_mark_cfqq_coop(cfqq);
+ else
+ cfq_clear_cfqq_coop(cfqq);
}

/*
@@ -683,7 +456,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- struct cfq_data *cfqd = cfqq->cfqd;
const int sync = rq_is_sync(rq);

BUG_ON(!cfqq->queued[sync]);
@@ -691,8 +463,17 @@ static void cfq_del_rq_rb(struct request *rq)

elv_rb_del(&cfqq->sort_list, rq);

- if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
- cfq_del_cfqq_rr(cfqd, cfqq);
+ /*
+ * If this was the last request in the queue, remove the queue from the
+ * prio trees. nr_queued will still be 1 at this point because the
+ * elevator fair queuing layer has not yet done its accounting.
+ */
+ if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+ if (cfqq->p_root) {
+ rb_erase(&cfqq->p_node, cfqq->p_root);
+ cfqq->p_root = NULL;
+ }
+ }
}

static void cfq_add_rq_rb(struct request *rq)
@@ -710,9 +491,6 @@ static void cfq_add_rq_rb(struct request *rq)
while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
cfq_dispatch_insert(cfqd->queue, __alias);

- if (!cfq_cfqq_on_rr(cfqq))
- cfq_add_cfqq_rr(cfqd, cfqq);
-
/*
* check if this request is a better next-serve candidate
*/
@@ -720,7 +498,9 @@ static void cfq_add_rq_rb(struct request *rq)
cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);

/*
- * adjust priority tree position, if ->next_rq changes
+ * adjust priority tree position, if ->next_rq changes. This also
+ * takes care of adding a new queue to the prio tree: for the first
+ * request prev is NULL while cfqq->next_rq is not.
*/
if (prev != cfqq->next_rq)
cfq_prio_tree_add(cfqd, cfqq);
@@ -760,23 +540,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
{
struct cfq_data *cfqd = q->elevator->elevator_data;

- cfqd->rq_in_driver++;
- cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
- cfqd->rq_in_driver);
-
cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
}

-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
- struct cfq_data *cfqd = q->elevator->elevator_data;
-
- WARN_ON(!cfqd->rq_in_driver);
- cfqd->rq_in_driver--;
- cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
- cfqd->rq_in_driver);
-}
-
static void cfq_remove_request(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -861,93 +627,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
return 0;
}

-static void __cfq_set_active_queue(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- if (cfqq) {
- cfq_log_cfqq(cfqd, cfqq, "set_active");
- cfqq->slice_end = 0;
- cfqq->slice_dispatch = 0;
-
- cfq_clear_cfqq_wait_request(cfqq);
- cfq_clear_cfqq_must_dispatch(cfqq);
- cfq_clear_cfqq_must_alloc_slice(cfqq);
- cfq_clear_cfqq_fifo_expire(cfqq);
- cfq_mark_cfqq_slice_new(cfqq);
-
- del_timer(&cfqd->idle_slice_timer);
- }
-
- cfqd->active_queue = cfqq;
-}
-
/*
* current cfqq expired its slice (or was too idle), select new one
*/
static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
- if (cfq_cfqq_wait_request(cfqq))
- del_timer(&cfqd->idle_slice_timer);
-
- cfq_clear_cfqq_wait_request(cfqq);
-
- /*
- * store what was left of this slice, if the queue idled/timed out
- */
- if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
- cfqq->slice_resid = cfqq->slice_end - jiffies;
- cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
- }
-
- cfq_resort_rr_list(cfqd, cfqq);
-
- if (cfqq == cfqd->active_queue)
- cfqd->active_queue = NULL;
-
- if (cfqd->active_cic) {
- put_io_context(cfqd->active_cic->ioc);
- cfqd->active_cic = NULL;
- }
+ elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
}

-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
{
- struct cfq_queue *cfqq = cfqd->active_queue;
+ struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);

if (cfqq)
- __cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
- if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
- return NULL;
-
- return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- if (!cfqq) {
- cfqq = cfq_get_next_queue(cfqd);
- if (cfqq)
- cfq_clear_cfqq_coop(cfqq);
- }
-
- __cfq_set_active_queue(cfqd, cfqq);
- return cfqq;
+ __cfq_slice_expired(cfqd, cfqq);
}

static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1024,11 +718,11 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
* associated with the I/O issued by cur_cfqq. I'm not sure this is a valid
* assumption.
*/
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
- struct cfq_queue *cur_cfqq,
- int probe)
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+ void *cur_sched_queue)
{
- struct cfq_queue *cfqq;
+ struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+ struct cfq_data *cfqd = q->elevator->elevator_data;

/*
* A valid cfq_io_context is necessary to compare requests against
@@ -1049,14 +743,13 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
if (cfq_cfqq_coop(cfqq))
return NULL;

- if (!probe)
- cfq_mark_cfqq_coop(cfqq);
- return cfqq;
+ return cfqq->ioq;
}

-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
{
- struct cfq_queue *cfqq = cfqd->active_queue;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = sched_queue;
struct cfq_io_context *cic;
unsigned long sl;

@@ -1069,18 +762,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
return;

WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
- WARN_ON(cfq_cfqq_slice_new(cfqq));
+ WARN_ON(elv_ioq_slice_new(cfqq->ioq));

/*
* idle is disabled, either manually or by past process history
*/
- if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
+ if (!cfqd->cfq_slice_idle || !elv_ioq_idle_window(cfqq->ioq))
return;

/*
* still requests with the driver, don't idle
*/
- if (cfqd->rq_in_driver)
+ if (elv_rq_in_driver(q->elevator))
return;

/*
@@ -1090,7 +783,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
if (!cic || !atomic_read(&cic->ioc->nr_tasks))
return;

- cfq_mark_cfqq_wait_request(cfqq);
+ elv_mark_ioq_wait_request(cfqq->ioq);

/*
* we don't want to idle for seeks, but we do want to allow
@@ -1101,7 +794,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));

- mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+ elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
}

@@ -1113,10 +806,9 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq = RQ_CFQQ(rq);

- cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+ cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%d", blk_rq_sectors(rq));

cfq_remove_request(rq);
- cfqq->dispatched++;
elv_dispatch_sort(q, rq);

if (cfq_cfqq_sync(cfqq))
@@ -1154,78 +846,11 @@ static inline int
cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
const int base_rq = cfqd->cfq_slice_async_rq;
+ unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);

- WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
+ WARN_ON(ioprio >= IOPRIO_BE_NR);

- return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
- struct cfq_queue *cfqq, *new_cfqq = NULL;
-
- cfqq = cfqd->active_queue;
- if (!cfqq)
- goto new_queue;
-
- /*
- * The active queue has run out of time, expire it and select new.
- */
- if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
- goto expire;
-
- /*
- * If we have a RT cfqq waiting, then we pre-empt the current non-rt
- * cfqq.
- */
- if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
- /*
- * We simulate this as cfqq timed out so that it gets to bank
- * the remaining of its time slice.
- */
- cfq_log_cfqq(cfqd, cfqq, "preempt");
- cfq_slice_expired(cfqd, 1);
- goto new_queue;
- }
-
- /*
- * The active queue has requests and isn't expired, allow it to
- * dispatch.
- */
- if (!RB_EMPTY_ROOT(&cfqq->sort_list))
- goto keep_queue;
-
- /*
- * If another queue has a request waiting within our mean seek
- * distance, let it run. The expire code will check for close
- * cooperators and put the close queue at the front of the service
- * tree.
- */
- new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
- if (new_cfqq)
- goto expire;
-
- /*
- * No requests pending. If the active queue still has requests in
- * flight or is idling for a new request, allow either of these
- * conditions to happen (or time out) before selecting a new queue.
- */
- if (timer_pending(&cfqd->idle_slice_timer) ||
- (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
- cfqq = NULL;
- goto keep_queue;
- }
-
-expire:
- cfq_slice_expired(cfqd, 0);
-new_queue:
- cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
- return cfqq;
+ return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
}

static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1250,12 +875,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
struct cfq_queue *cfqq;
int dispatched = 0;

- while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+ while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);

- cfq_slice_expired(cfqd, 0);
+ /* This is probably redundant now. The above loop should make sure
+ * that all the busy queues have expired. */
+ cfq_slice_expired(cfqd);

- BUG_ON(cfqd->busy_queues);
+ BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));

cfq_log(cfqd, "forced_dispatch=%d", dispatched);
return dispatched;
@@ -1301,13 +928,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
struct cfq_queue *cfqq;
unsigned int max_dispatch;

- if (!cfqd->busy_queues)
- return 0;
-
if (unlikely(force))
return cfq_forced_dispatch(cfqd);

- cfqq = cfq_select_queue(cfqd);
+ cfqq = elv_select_sched_queue(q, 0);
if (!cfqq)
return 0;

@@ -1324,7 +948,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
/*
* Does this cfqq already have too much IO in flight?
*/
- if (cfqq->dispatched >= max_dispatch) {
+ if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
/*
* idle queue must always only have a single IO in flight
*/
@@ -1334,13 +958,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
/*
* We have other queues, don't allow more IO from this one
*/
- if (cfqd->busy_queues > 1)
+ if (elv_nr_busy_ioq(q->elevator) > 1)
return 0;

/*
* we are the only queue, allow up to 4 times of 'quantum'
*/
- if (cfqq->dispatched >= 4 * max_dispatch)
+ if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
return 0;
}

@@ -1349,51 +973,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
*/
cfq_dispatch_request(cfqd, cfqq);
cfqq->slice_dispatch++;
- cfq_clear_cfqq_must_dispatch(cfqq);

/*
* expire an async queue immediately if it has used up its slice. idle
* queue always expire after 1 dispatch round.
*/
- if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+ if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
cfq_class_idle(cfqq))) {
- cfqq->slice_end = jiffies + 1;
- cfq_slice_expired(cfqd, 0);
+ cfq_slice_expired(cfqd);
}

cfq_log(cfqd, "dispatched a request");
return 1;
}

-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
{
+ struct cfq_queue *cfqq = sched_queue;
struct cfq_data *cfqd = cfqq->cfqd;

- BUG_ON(atomic_read(&cfqq->ref) <= 0);
-
- if (!atomic_dec_and_test(&cfqq->ref))
- return;
+ BUG_ON(!cfqq);

- cfq_log_cfqq(cfqd, cfqq, "put_queue");
+ cfq_log_cfqq(cfqd, cfqq, "free_queue");
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
- BUG_ON(cfq_cfqq_on_rr(cfqq));

- if (unlikely(cfqd->active_queue == cfqq)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
- cfq_schedule_dispatch(cfqd);
+ if (unlikely(cfqq_is_active_queue(cfqq))) {
+ __cfq_slice_expired(cfqd, cfqq);
+ elv_schedule_dispatch(cfqd->queue);
}

kmem_cache_free(cfq_pool, cfqq);
}

+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+ elv_put_ioq(cfqq->ioq);
+}
+
/*
* Must always be called with the rcu_read_lock() held
*/
@@ -1481,11 +1099,12 @@ static void cfq_free_io_context(struct io_context *ioc)

static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- if (unlikely(cfqq == cfqd->active_queue)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
- cfq_schedule_dispatch(cfqd);
+ if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+ __cfq_slice_expired(cfqd, cfqq);
+ elv_schedule_dispatch(cfqd->queue);
}

+ elv_exit_ioq(cfqq->ioq);
cfq_put_queue(cfqq);
}

@@ -1571,7 +1190,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
{
struct task_struct *tsk = current;
- int ioprio_class;
+ int ioprio_class, ioprio;

if (!cfq_cfqq_prio_changed(cfqq))
return;
@@ -1584,30 +1203,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
/*
* no prio set, inherit CPU scheduling settings
*/
- cfqq->ioprio = task_nice_ioprio(tsk);
- cfqq->ioprio_class = task_nice_ioclass(tsk);
+ ioprio = task_nice_ioprio(tsk);
+ ioprio_class = task_nice_ioclass(tsk);
break;
case IOPRIO_CLASS_RT:
- cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_RT;
+ ioprio = task_ioprio(ioc);
+ ioprio_class = IOPRIO_CLASS_RT;
break;
case IOPRIO_CLASS_BE:
- cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
+ ioprio = task_ioprio(ioc);
+ ioprio_class = IOPRIO_CLASS_BE;
break;
case IOPRIO_CLASS_IDLE:
- cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
- cfqq->ioprio = 7;
- cfq_clear_cfqq_idle_window(cfqq);
+ ioprio_class = IOPRIO_CLASS_IDLE;
+ ioprio = 7;
+ elv_clear_ioq_idle_window(cfqq->ioq);
break;
}

+ elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+ elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
/*
* keep track of original prio settings in case we have to temporarily
* elevate the priority of this queue
*/
- cfqq->org_ioprio = cfqq->ioprio;
- cfqq->org_ioprio_class = cfqq->ioprio_class;
+ cfqq->org_ioprio = ioprio;
+ cfqq->org_ioprio_class = ioprio_class;
cfq_clear_cfqq_prio_changed(cfqq);
}

@@ -1649,19 +1271,17 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
pid_t pid, int is_sync)
{
- RB_CLEAR_NODE(&cfqq->rb_node);
RB_CLEAR_NODE(&cfqq->p_node);
INIT_LIST_HEAD(&cfqq->fifo);

- atomic_set(&cfqq->ref, 0);
cfqq->cfqd = cfqd;

cfq_mark_cfqq_prio_changed(cfqq);

if (is_sync) {
if (!cfq_class_idle(cfqq))
- cfq_mark_cfqq_idle_window(cfqq);
- cfq_mark_cfqq_sync(cfqq);
+ elv_mark_ioq_idle_window(cfqq->ioq);
+ elv_mark_ioq_sync(cfqq->ioq);
}
cfqq->pid = pid;
}
@@ -1672,8 +1292,13 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
struct cfq_io_context *cic;
+ struct request_queue *q = cfqd->queue;
+ struct io_queue *ioq = NULL, *new_ioq = NULL;
+ struct io_group *iog = NULL;

retry:
+ iog = elv_io_get_io_group(q, 0);
+
cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);
@@ -1683,8 +1308,29 @@ retry:
* originally, since it should just be a temporary situation.
*/
if (!cfqq || cfqq == &cfqd->oom_cfqq) {
+ /* Allocate ioq object first and then cfqq */
+ if (new_ioq) {
+ goto alloc_cfqq;
+ } else if (gfp_mask & __GFP_WAIT) {
+ spin_unlock_irq(cfqd->queue->queue_lock);
+ new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+ spin_lock_irq(cfqd->queue->queue_lock);
+ if (new_ioq)
+ goto retry;
+ } else
+ ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+
+alloc_cfqq:
+ if (!ioq && !new_ioq) {
+ /* ioq allocation failed. Default to oom_cfqq */
+ cfqq = &cfqd->oom_cfqq;
+ goto out;
+ }
+
cfqq = NULL;
if (new_cfqq) {
+ ioq = new_ioq;
+ new_ioq = NULL;
cfqq = new_cfqq;
new_cfqq = NULL;
} else if (gfp_mask & __GFP_WAIT) {
@@ -1702,60 +1348,59 @@ retry:
}

if (cfqq) {
+ elv_init_ioq(q->elevator, ioq, current->pid, is_sync);
+ elv_init_ioq_sched_queue(q->elevator, ioq, cfqq);
+
+ cfqq->ioq = ioq;
cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
cfq_init_prio_data(cfqq, ioc);
+
+ /* call it after cfq has initialized queue prio */
+ elv_init_ioq_io_group(ioq, iog);
cfq_log_cfqq(cfqd, cfqq, "alloced");
- } else
+ } else {
cfqq = &cfqd->oom_cfqq;
+ /* If ioq allocation was successful, free it up */
+ if (ioq)
+ elv_free_ioq(ioq);
+ }
}

+ if (new_ioq)
+ elv_free_ioq(new_ioq);
+
if (new_cfqq)
kmem_cache_free(cfq_pool, new_cfqq);

+out:
return cfqq;
}

-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
- switch (ioprio_class) {
- case IOPRIO_CLASS_RT:
- return &cfqd->async_cfqq[0][ioprio];
- case IOPRIO_CLASS_BE:
- return &cfqd->async_cfqq[1][ioprio];
- case IOPRIO_CLASS_IDLE:
- return &cfqd->async_idle_cfqq;
- default:
- BUG();
- }
-}
-
static struct cfq_queue *
cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
- struct cfq_queue **async_cfqq = NULL;
+ struct cfq_queue *async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
+ struct io_group *iog = elv_io_get_io_group(cfqd->queue, 0);

if (!is_sync) {
- async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
- cfqq = *async_cfqq;
+ async_cfqq = elv_io_group_async_queue_prio(iog, ioprio_class,
+ ioprio);
+ cfqq = async_cfqq;
}

if (!cfqq)
cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);

- /*
- * pin the queue now that it's allocated, scheduler exit will prune it
- */
- if (!is_sync && !(*async_cfqq)) {
- atomic_inc(&cfqq->ref);
- *async_cfqq = cfqq;
- }
+ if (!is_sync && !async_cfqq)
+ elv_io_group_set_async_queue(iog, ioprio_class, ioprio,
+ cfqq->ioq);

- atomic_inc(&cfqq->ref);
+ /* ioc reference */
+ elv_get_ioq(cfqq->ioq);
return cfqq;
}

@@ -1960,7 +1605,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
return;

- enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
+ enable_idle = old_idle = elv_ioq_idle_window(cfqq->ioq);

if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
(cfqd->hw_tag && CIC_SEEKY(cic)))
@@ -1975,9 +1620,9 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (old_idle != enable_idle) {
cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
if (enable_idle)
- cfq_mark_cfqq_idle_window(cfqq);
+ elv_mark_ioq_idle_window(cfqq->ioq);
else
- cfq_clear_cfqq_idle_window(cfqq);
+ elv_clear_ioq_idle_window(cfqq->ioq);
}
}

@@ -1986,16 +1631,15 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
* no or if we aren't sure, a 1 will cause a preempt.
*/
static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
- struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
{
- struct cfq_queue *cfqq;
+ struct cfq_data *cfqd = q->elevator->elevator_data;
+ struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);

- cfqq = cfqd->active_queue;
if (!cfqq)
return 0;

- if (cfq_slice_used(cfqq))
+ if (elv_ioq_slice_used(cfqq->ioq))
return 1;

if (cfq_class_idle(new_cfqq))
@@ -2018,13 +1662,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
if (rq_is_meta(rq) && !cfqq->meta_pending)
return 1;

- /*
- * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
- */
- if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
- return 1;
-
- if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+ if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
return 0;

/*
@@ -2038,27 +1676,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
}

/*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
- cfq_log_cfqq(cfqd, cfqq, "preempt");
- cfq_slice_expired(cfqd, 1);
-
- /*
- * Put the new queue at the front of the of the current list,
- * so we know that it will be selected next.
- */
- BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
- cfq_service_tree_add(cfqd, cfqq, 1);
-
- cfqq->slice_end = 0;
- cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
* Called when a new fs request (rq) is added (to cfqq). Check if there's
* something we should do about it
*/
@@ -2077,36 +1694,6 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_update_idle_window(cfqd, cfqq, cic);

cic->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
-
- if (cfqq == cfqd->active_queue) {
- /*
- * Remember that we saw a request from this process, but
- * don't start queuing just yet. Otherwise we risk seeing lots
- * of tiny requests, because we disrupt the normal plugging
- * and merging. If the request is already larger than a single
- * page, let it rip immediately. For that case we assume that
- * merging is already done. Ditto for a busy system that
- * has other work pending, don't risk delaying until the
- * idle timer unplug to continue working.
- */
- if (cfq_cfqq_wait_request(cfqq)) {
- if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
- cfqd->busy_queues > 1) {
- del_timer(&cfqd->idle_slice_timer);
- __blk_run_queue(cfqd->queue);
- }
- cfq_mark_cfqq_must_dispatch(cfqq);
- }
- } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
- /*
- * not the active queue - expire current slice if it is
- * idle and has expired it's mean thinktime or this new queue
- * has some old slice time left and is of higher priority or
- * this new queue is RT and the current one is BE
- */
- cfq_preempt_queue(cfqd, cfqq);
- __blk_run_queue(cfqd->queue);
- }
}

static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2130,11 +1717,13 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
*/
static void cfq_update_hw_tag(struct cfq_data *cfqd)
{
- if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
- cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
+ struct elevator_queue *eq = cfqd->queue->elevator;
+
+ if (elv_rq_in_driver(eq) > cfqd->rq_in_driver_peak)
+ cfqd->rq_in_driver_peak = elv_rq_in_driver(eq);

if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
- cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
+ elv_rq_in_driver(eq) <= CFQ_HW_QUEUE_MIN)
return;

if (cfqd->hw_tag_samples++ < 50)
@@ -2161,44 +1750,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)

cfq_update_hw_tag(cfqd);

- WARN_ON(!cfqd->rq_in_driver);
- WARN_ON(!cfqq->dispatched);
- cfqd->rq_in_driver--;
- cfqq->dispatched--;
-
if (cfq_cfqq_sync(cfqq))
cfqd->sync_flight--;
-
if (sync)
RQ_CIC(rq)->last_end_request = now;
-
- /*
- * If this is the active queue, check if it needs to be expired,
- * or if we want to idle in case it has no pending requests.
- */
- if (cfqd->active_queue == cfqq) {
- const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
- if (cfq_cfqq_slice_new(cfqq)) {
- cfq_set_prio_slice(cfqd, cfqq);
- cfq_clear_cfqq_slice_new(cfqq);
- }
- /*
- * If there are no requests waiting in this queue, and
- * there are other queues ready to issue requests, AND
- * those other queues are issuing requests within our
- * mean seek distance, give them a chance to run instead
- * of idling.
- */
- if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
- cfq_slice_expired(cfqd, 1);
- else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
- sync && !rq_noidle(rq))
- cfq_arm_slice_timer(cfqd);
- }
-
- if (!cfqd->rq_in_driver)
- cfq_schedule_dispatch(cfqd);
}

/*
@@ -2207,29 +1762,32 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
*/
static void cfq_prio_boost(struct cfq_queue *cfqq)
{
+ struct io_queue *ioq = cfqq->ioq;
+
if (has_fs_excl()) {
/*
* boost idle prio on transactions that would lock out other
* users of the filesystem
*/
if (cfq_class_idle(cfqq))
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
- if (cfqq->ioprio > IOPRIO_NORM)
- cfqq->ioprio = IOPRIO_NORM;
+ elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+ if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+ elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
} else {
/*
* check if we need to unboost the queue
*/
- if (cfqq->ioprio_class != cfqq->org_ioprio_class)
- cfqq->ioprio_class = cfqq->org_ioprio_class;
- if (cfqq->ioprio != cfqq->org_ioprio)
- cfqq->ioprio = cfqq->org_ioprio;
+ if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+ elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+ if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+ elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
}
}

static inline int __cfq_may_queue(struct cfq_queue *cfqq)
{
- if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
+ if ((elv_ioq_wait_request(cfqq->ioq) || cfq_cfqq_must_alloc(cfqq)) &&
!cfq_cfqq_must_alloc_slice(cfqq)) {
cfq_mark_cfqq_must_alloc_slice(cfqq);
return ELV_MQUEUE_MUST;
@@ -2282,7 +1840,7 @@ static void cfq_put_request(struct request *rq)
put_io_context(RQ_CIC(rq)->ioc);

rq->elevator_private = NULL;
- rq->elevator_private2 = NULL;
+ rq->ioq = NULL;

cfq_put_queue(cfqq);
}
@@ -2318,119 +1876,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)

cfqq->allocated[rw]++;
cfq_clear_cfqq_must_alloc(cfqq);
- atomic_inc(&cfqq->ref);
+ elv_get_ioq(cfqq->ioq);

spin_unlock_irqrestore(q->queue_lock, flags);

rq->elevator_private = cic;
- rq->elevator_private2 = cfqq;
+ rq->ioq = cfqq->ioq;
return 0;

queue_fail:
if (cic)
put_io_context(cic->ioc);

- cfq_schedule_dispatch(cfqd);
+ elv_schedule_dispatch(cfqd->queue);
spin_unlock_irqrestore(q->queue_lock, flags);
cfq_log(cfqd, "set_request fail");
return 1;
}

-static void cfq_kick_queue(struct work_struct *work)
-{
- struct cfq_data *cfqd =
- container_of(work, struct cfq_data, unplug_work);
- struct request_queue *q = cfqd->queue;
-
- spin_lock_irq(q->queue_lock);
- __blk_run_queue(cfqd->queue);
- spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
- struct cfq_data *cfqd = (struct cfq_data *) data;
- struct cfq_queue *cfqq;
- unsigned long flags;
- int timed_out = 1;
-
- cfq_log(cfqd, "idle timer fired");
-
- spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
- cfqq = cfqd->active_queue;
- if (cfqq) {
- timed_out = 0;
-
- /*
- * We saw a request before the queue expired, let it through
- */
- if (cfq_cfqq_must_dispatch(cfqq))
- goto out_kick;
-
- /*
- * expired
- */
- if (cfq_slice_used(cfqq))
- goto expire;
-
- /*
- * only expire and reinvoke request handler, if there are
- * other queues with pending requests
- */
- if (!cfqd->busy_queues)
- goto out_cont;
-
- /*
- * not expired and it has a request pending, let it dispatch
- */
- if (!RB_EMPTY_ROOT(&cfqq->sort_list))
- goto out_kick;
- }
-expire:
- cfq_slice_expired(cfqd, timed_out);
-out_kick:
- cfq_schedule_dispatch(cfqd);
-out_cont:
- spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
- del_timer_sync(&cfqd->idle_slice_timer);
- cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
- int i;
-
- for (i = 0; i < IOPRIO_BE_NR; i++) {
- if (cfqd->async_cfqq[0][i])
- cfq_put_queue(cfqd->async_cfqq[0][i]);
- if (cfqd->async_cfqq[1][i])
- cfq_put_queue(cfqd->async_cfqq[1][i]);
- }
-
- if (cfqd->async_idle_cfqq)
- cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
static void cfq_exit_queue(struct elevator_queue *e)
{
struct cfq_data *cfqd = e->elevator_data;
struct request_queue *q = cfqd->queue;

- cfq_shutdown_timer_wq(cfqd);
-
spin_lock_irq(q->queue_lock);

- if (cfqd->active_queue)
- __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
while (!list_empty(&cfqd->cic_list)) {
struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
struct cfq_io_context,
@@ -2439,12 +1909,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
__cfq_exit_single_io_context(cfqd, cic);
}

- cfq_put_async_queues(cfqd);
-
spin_unlock_irq(q->queue_lock);
-
- cfq_shutdown_timer_wq(cfqd);
-
kfree(cfqd);
}

@@ -2457,8 +1922,6 @@ static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
if (!cfqd)
return NULL;

- cfqd->service_tree = CFQ_RB_ROOT;
-
/*
* Not strictly needed (since RB_ROOT just clears the node and we
* zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2473,25 +1936,20 @@ static void *cfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
* will not attempt to free it.
*/
cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
- atomic_inc(&cfqd->oom_cfqq.ref);
+
+ /* Link up oom_ioq and oom_cfqq */
+ cfqd->oom_cfqq.ioq = elv_get_oom_ioq(eq);
+ elv_init_ioq_sched_queue(eq, elv_get_oom_ioq(eq), &cfqd->oom_cfqq);

INIT_LIST_HEAD(&cfqd->cic_list);

cfqd->queue = q;

- init_timer(&cfqd->idle_slice_timer);
- cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
- cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
- INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
cfqd->cfq_quantum = cfq_quantum;
cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
cfqd->cfq_back_max = cfq_back_max;
cfqd->cfq_back_penalty = cfq_back_penalty;
- cfqd->cfq_slice[0] = cfq_slice_async;
- cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
cfqd->hw_tag = 1;
@@ -2560,8 +2018,6 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
#undef SHOW_FUNCTION

@@ -2590,8 +2046,6 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
UINT_MAX, 0);
STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
#undef STORE_FUNCTION
@@ -2605,10 +2059,10 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(fifo_expire_async),
CFQ_ATTR(back_seek_max),
CFQ_ATTR(back_seek_penalty),
- CFQ_ATTR(slice_sync),
- CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
+ ELV_ATTR(slice_sync),
+ ELV_ATTR(slice_async),
__ATTR_NULL
};

@@ -2621,8 +2075,6 @@ static struct elevator_type iosched_cfq = {
.elevator_dispatch_fn = cfq_dispatch_requests,
.elevator_add_req_fn = cfq_insert_request,
.elevator_activate_req_fn = cfq_activate_request,
- .elevator_deactivate_req_fn = cfq_deactivate_request,
- .elevator_queue_empty_fn = cfq_queue_empty,
.elevator_completed_req_fn = cfq_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
@@ -2632,7 +2084,14 @@ static struct elevator_type iosched_cfq = {
.elevator_init_fn = cfq_init_queue,
.elevator_exit_fn = cfq_exit_queue,
.trim = cfq_free_io_context,
+ .elevator_free_sched_queue_fn = cfq_free_cfq_queue,
+ .elevator_active_ioq_set_fn = cfq_active_ioq_set,
+ .elevator_active_ioq_reset_fn = cfq_active_ioq_reset,
+ .elevator_arm_slice_timer_fn = cfq_arm_slice_timer,
+ .elevator_should_preempt_fn = cfq_should_preempt,
+ .elevator_close_cooperator_fn = cfq_close_cooperator,
},
+ .elevator_features = ELV_IOSCHED_NEED_FQ,
.elevator_attrs = cfq_attrs,
.elevator_name = "cfq",
.elevator_owner = THIS_MODULE,
@@ -2640,14 +2099,6 @@ static struct elevator_type iosched_cfq = {

static int __init cfq_init(void)
{
- /*
- * could be 0 on HZ < 1000 setups
- */
- if (!cfq_slice_async)
- cfq_slice_async = 1;
- if (!cfq_slice_idle)
- cfq_slice_idle = 1;
-
if (cfq_slab_setup())
return -ENOMEM;

--
1.6.0.6

2009-09-24 19:32:50

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 06/28] io-controller: Core scheduler changes to support hierarchical scheduling

o This patch introduces core changes in the fair queuing scheduler to support
hierarchical/group scheduling. It is enabled by CONFIG_GROUP_IOSCHED.
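
The crux of the hierarchical support is that every io queue and io group embeds an
io_entity, and the enqueue/dequeue/charge operations now walk up the parent chain so
that every level of the hierarchy is updated. The stand-alone toy program below is
not kernel code and is not part of the patch; the entity/parent/vdisktime names only
mirror the ones used in this series. It models the upward walk and the distinction
between the queue charge and the group charge.

#include <stdio.h>

/* Toy model of the hierarchy walk, not kernel code. */
struct entity {
	const char *name;
	unsigned long vdisktime;
	struct entity *parent;	/* NULL for the root group */
};

/* Mirrors for_each_entity(): the root group itself is never charged. */
#define for_each_entity(e)	for (; (e) && (e)->parent; (e) = (e)->parent)

static void entity_served(struct entity *e, unsigned long queue_charge,
			  unsigned long group_charge)
{
	unsigned long charge = queue_charge;

	for_each_entity(e) {
		e->vdisktime += charge;		/* queue level first ... */
		printf("charged %s by %lu\n", e->name, charge);
		charge = group_charge;		/* ... then its parent groups */
	}
}

int main(void)
{
	struct entity root = { "root", 0, NULL };
	struct entity grp  = { "grp1", 0, &root };
	struct entity ioq  = { "ioq",  0, &grp };

	/* the queue is charged its own slice, its parent group a group charge */
	entity_served(&ioq, 100, 80);
	return 0;
}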

Signed-off-by: Fabio Checconi <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/elevator-fq.c | 190 +++++++++++++++++++++++++++++++++++++++++++++++----
block/elevator-fq.h | 19 +++++
init/Kconfig | 8 ++
3 files changed, 204 insertions(+), 13 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 629ddaa..0e3d58c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -145,6 +145,88 @@ static inline struct io_group *iog_of(struct io_entity *entity)
return NULL;
}

+#ifdef CONFIG_GROUP_IOSCHED
+/* check for entity->parent so that loop is not executed for root entity. */
+#define for_each_entity(entity) \
+ for (; entity && entity->parent; entity = entity->parent)
+
+/* Do the two (enqueued) entities belong to the same group ? */
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+ if (parent_entity(entity) == parent_entity(new_entity))
+ return 1;
+
+ return 0;
+}
+
+/* return depth at which an io entity is present in the hierarchy */
+static inline int depth_entity(struct io_entity *entity)
+{
+ int depth = 0;
+
+ for_each_entity(entity)
+ depth++;
+
+ return depth;
+}
+
+static void find_matching_io_entity(struct io_entity **entity,
+ struct io_entity **new_entity)
+{
+ int entity_depth, new_entity_depth;
+
+ /*
+ * preemption test can be made between sibling entities who are in the
+ * same group i.e who have a common parent. Walk up the hierarchy of
+ * both entities until we find their ancestors who are siblings of
+ * common parent.
+ */
+
+ /* First walk up until both entities are at same depth */
+ entity_depth = depth_entity(*entity);
+ new_entity_depth = depth_entity(*new_entity);
+
+ while (entity_depth > new_entity_depth) {
+ entity_depth--;
+ *entity = parent_entity(*entity);
+ }
+
+ while (new_entity_depth > entity_depth) {
+ new_entity_depth--;
+ *new_entity = parent_entity(*new_entity);
+ }
+
+ while (!is_same_group(*entity, *new_entity)) {
+ *entity = parent_entity(*entity);
+ *new_entity = parent_entity(*new_entity);
+ }
+}
+struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+ return iog_of(parent_entity(&ioq->entity));
+}
+EXPORT_SYMBOL(ioq_to_io_group);
+
+static inline struct io_sched_data *
+io_entity_sched_data(struct io_entity *entity)
+{
+ return &iog_of(parent_entity(entity))->sched_data;
+}
+
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity) \
+ for (; entity != NULL; entity = NULL)
+
+static void find_matching_io_entity(struct io_entity **entity,
+ struct io_entity **new_entity) { }
+
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+ return 1;
+}
+
static inline struct elv_fq_data *efqd_of(struct io_entity *entity)
{
return ioq_of(entity)->efqd;
@@ -163,6 +245,7 @@ io_entity_sched_data(struct io_entity *entity)

return &efqd->root_group->sched_data;
}
+#endif /* GROUP_IOSCHED */

static inline void
init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
@@ -175,12 +258,18 @@ init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
entity->st = &parent_iog->sched_data.service_tree[idx];
}

-static void
-entity_served(struct io_entity *entity, unsigned long served,
- unsigned long queue_charge, unsigned long nr_sectors)
+static void entity_served(struct io_entity *entity, unsigned long served,
+ unsigned long queue_charge, unsigned long group_charge,
+ unsigned long nr_sectors)
{
- entity->vdisktime += elv_delta_fair(queue_charge, entity);
- update_min_vdisktime(entity->st);
+ unsigned long charge = queue_charge;
+
+ for_each_entity(entity) {
+ entity->vdisktime += elv_delta_fair(queue_charge, entity);
+ update_min_vdisktime(entity->st);
+ /* Group charge can be different from queue charge */
+ charge = group_charge;
+ }
}

static void place_entity(struct io_service_tree *st, struct io_entity *entity,
@@ -542,14 +631,23 @@ static void put_prev_ioq(struct io_queue *ioq)
{
struct io_entity *entity = &ioq->entity;

- put_prev_io_entity(entity);
+ for_each_entity(entity) {
+ put_prev_io_entity(entity);
+ }
}

static void dequeue_ioq(struct io_queue *ioq)
{
struct io_entity *entity = &ioq->entity;

- dequeue_io_entity(entity);
+ for_each_entity(entity) {
+ struct io_sched_data *sd = io_entity_sched_data(entity);
+
+ dequeue_io_entity(entity);
+ /* Don't dequeue parent if it has other entities besides us */
+ if (sd->nr_active)
+ break;
+ }
elv_put_ioq(ioq);
return;
}
@@ -560,7 +658,12 @@ static void enqueue_ioq(struct io_queue *ioq)
struct io_entity *entity = &ioq->entity;

elv_get_ioq(ioq);
- enqueue_io_entity(entity);
+
+ for_each_entity(entity) {
+ if (entity->on_st)
+ break;
+ enqueue_io_entity(entity);
+ }
}

static inline void
@@ -592,7 +695,7 @@ EXPORT_SYMBOL(elv_put_ioq);

static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
{
- unsigned long allocated_slice, queue_charge;
+ unsigned long allocated_slice, queue_charge, group_charge;

allocated_slice = elv_prio_to_slice(ioq->efqd, ioq);

@@ -604,7 +707,18 @@ static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
* use the slice and moves to the back of service tree (almost).
*/
queue_charge = allocated_slice;
- entity_served(&ioq->entity, served, queue_charge, ioq->nr_sectors);
+
+ /*
+ * Group is charged the real time consumed so that it does not lose
+ * fair share.
+ */
+ if (served > allocated_slice)
+ group_charge = allocated_slice;
+ else
+ group_charge = served;
+
+ entity_served(&ioq->entity, served, queue_charge, group_charge,
+ ioq->nr_sectors);
}

/*
@@ -804,6 +918,45 @@ void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
}
EXPORT_SYMBOL(elv_io_group_set_async_queue);

+#ifdef CONFIG_GROUP_IOSCHED
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+ struct io_group *iog = e->efqd->root_group;
+ struct io_service_tree *st;
+ int i;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ flush_idle_tree(st);
+ }
+
+ put_io_group_queues(e, iog);
+ kfree(iog);
+}
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+ struct elevator_queue *e, void *key)
+{
+ struct io_group *iog;
+ int i;
+
+ iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+ if (iog == NULL)
+ return NULL;
+
+ iog->entity.parent = NULL;
+ iog->entity.my_sd = &iog->sched_data;
+ iog->key = key;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
+
+ return iog;
+}
+
+#else /* CONFIG_GROUP_IOSCHED */
+
static struct io_group *io_alloc_root_group(struct request_queue *q,
struct elevator_queue *e, void *key)
{
@@ -839,6 +992,8 @@ static void io_free_root_group(struct elevator_queue *e)
kfree(iog);
}

+#endif /* CONFIG_GROUP_IOSCHED */
+
/*
* Should be called after ioq prio and class has been initialized as prio
* class data will be used to determine which service tree in the group
@@ -864,9 +1019,11 @@ static struct io_queue *elv_get_next_ioq(struct request_queue *q)
return NULL;

sd = &efqd->root_group->sched_data;
- entity = lookup_next_io_entity(sd);
- if (!entity)
- return NULL;
+ for (; sd != NULL; sd = entity->my_sd) {
+ entity = lookup_next_io_entity(sd);
+ if (!entity)
+ return NULL;
+ }

ioq = ioq_of(entity);
return ioq;
@@ -1073,6 +1230,13 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
new_entity = &new_ioq->entity;

/*
+ * In hierarchical setup, one need to traverse up the hierarchy
+ * till both the queues are children of same parent to make a
+ * decision whether to do the preemption or not.
+ */
+ find_matching_io_entity(&entity, &new_entity);
+
+ /*
* Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
*/

diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 6ea0d18..068f240 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -93,6 +93,23 @@ struct io_queue {
void *sched_queue;
};

+#ifdef CONFIG_GROUP_IOSCHED /* CONFIG_GROUP_IOSCHED */
+struct io_group {
+ struct io_entity entity;
+ atomic_t ref;
+ struct io_sched_data sched_data;
+ /*
+ * async queue for each priority case for RT and BE class.
+ * Used only for cfq.
+ */
+
+ struct io_queue *async_queue[2][IOPRIO_BE_NR];
+ struct io_queue *async_idle_queue;
+ void *key;
+};
+
+#else /* CONFIG_GROUP_IOSCHED */
+
struct io_group {
struct io_entity entity;
struct io_sched_data sched_data;
@@ -106,6 +123,8 @@ struct io_group {
void *key;
};

+#endif /* CONFIG_GROUP_IOSCHED */
+
struct elv_fq_data {
struct io_group *root_group;

diff --git a/init/Kconfig b/init/Kconfig
index 3f7e609..29f701d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -612,6 +612,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
size is 4096bytes, 512k per 1Gbytes of swap.

+config GROUP_IOSCHED
+ bool "Group IO Scheduler"
+ depends on CGROUPS && ELV_FAIR_QUEUING
+ default n
+ ---help---
+ This feature lets IO scheduler recognize task groups and control
+ disk bandwidth allocation to such task groups.
+
endif # CGROUPS

config MM_OWNER
--
1.6.0.6

2009-09-24 19:31:29

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 07/28] io-controller: cgroup related changes for hierarchical group support

o This patch introduces some of the cgroup-related code for the io controller.
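
As a rough sketch of the knob semantics being added (stand-alone user-space
code with made-up toy_* names; the bounds and values used below are arbitrary,
the real limits are the IO_WEIGHT_MIN/IO_WEIGHT_MAX and ioprio class checks in
the STORE_FUNCTION macro): a write to the cgroup file validates the value,
stores it in the cgroup, and flags each attached io group so the scheduler
picks the change up lazily.

        #include <stdio.h>

        struct toy_group { unsigned int weight; int changed; };

        struct toy_cgroup {
                unsigned int weight;
                struct toy_group *groups[4];    /* one io_group per disk, say */
                int nr_groups;
        };

        /* Mirror of the cgroup file write: validate, store, propagate lazily. */
        static int toy_weight_write(struct toy_cgroup *cg, unsigned long val,
                                    unsigned long min, unsigned long max)
        {
                int i;

                if (val < min || val > max)
                        return -1;              /* -EINVAL in the real code */
                cg->weight = val;
                for (i = 0; i < cg->nr_groups; i++) {
                        cg->groups[i]->weight = val;
                        cg->groups[i]->changed = 1;     /* like ioprio_changed */
                }
                return 0;
        }

        int main(void)
        {
                struct toy_group g = { .weight = 500 };
                struct toy_cgroup cg = { .weight = 500, .groups = { &g },
                                         .nr_groups = 1 };
                int ret = toy_weight_write(&cg, 1000, 100, 1000);

                printf("write 1000 -> %d, group weight now %u, changed=%d\n",
                       ret, g.weight, g.changed);
                return 0;
        }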

Signed-off-by: Fabio Checconi <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/blk-ioc.c | 3 +
block/elevator-fq.c | 169 ++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 14 ++++
include/linux/cgroup_subsys.h | 6 ++
include/linux/iocontext.h | 5 +
5 files changed, 196 insertions(+), 1 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index d4ed600..0d56336 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
spin_lock_init(&ret->lock);
ret->ioprio_changed = 0;
ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+ ret->cgroup_changed = 0;
+#endif
ret->last_waited = jiffies; /* doesn't matter... */
ret->nr_batch_requests = 0; /* because this is 0 */
ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 0e3d58c..0c060a6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -265,7 +265,7 @@ static void entity_served(struct io_entity *entity, unsigned long served,
unsigned long charge = queue_charge;

for_each_entity(entity) {
- entity->vdisktime += elv_delta_fair(queue_charge, entity);
+ entity->vdisktime += elv_delta_fair(charge, entity);
update_min_vdisktime(entity->st);
/* Group charge can be different from queue charge */
charge = group_charge;
@@ -920,6 +920,173 @@ EXPORT_SYMBOL(elv_io_group_set_async_queue);

#ifdef CONFIG_GROUP_IOSCHED

+struct io_cgroup io_root_cgroup = {
+ .weight = IO_WEIGHT_DEFAULT,
+ .ioprio_class = IOPRIO_CLASS_BE,
+};
+
+static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+ return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+ struct io_cgroup, css);
+}
+
+#define SHOW_FUNCTION(__VAR) \
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype) \
+{ \
+ struct io_cgroup *iocg; \
+ u64 ret; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ iocg = cgroup_to_io_cgroup(cgroup); \
+ spin_lock_irq(&iocg->lock); \
+ ret = iocg->__VAR; \
+ spin_unlock_irq(&iocg->lock); \
+ \
+ cgroup_unlock(); \
+ \
+ return ret; \
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
+ struct cftype *cftype, \
+ u64 val) \
+{ \
+ struct io_cgroup *iocg; \
+ struct io_group *iog; \
+ struct hlist_node *n; \
+ \
+ if (val < (__MIN) || val > (__MAX)) \
+ return -EINVAL; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ iocg = cgroup_to_io_cgroup(cgroup); \
+ \
+ spin_lock_irq(&iocg->lock); \
+ iocg->__VAR = (unsigned long)val; \
+ hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
+ iog->entity.__VAR = (unsigned long)val; \
+ smp_wmb(); \
+ iog->entity.ioprio_changed = 1; \
+ } \
+ spin_unlock_irq(&iocg->lock); \
+ \
+ cgroup_unlock(); \
+ \
+ return 0; \
+}
+
+STORE_FUNCTION(weight, IO_WEIGHT_MIN, IO_WEIGHT_MAX);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+struct cftype io_files[] = {
+ {
+ .name = "weight",
+ .read_u64 = io_cgroup_weight_read,
+ .write_u64 = io_cgroup_weight_write,
+ },
+ {
+ .name = "ioprio_class",
+ .read_u64 = io_cgroup_ioprio_class_read,
+ .write_u64 = io_cgroup_ioprio_class_write,
+ },
+};
+
+static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ return cgroup_add_files(cgroup, subsys, io_files, ARRAY_SIZE(io_files));
+}
+
+static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+ struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg;
+
+ if (cgroup->parent != NULL) {
+ iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+ if (iocg == NULL)
+ return ERR_PTR(-ENOMEM);
+ } else
+ iocg = &io_root_cgroup;
+
+ spin_lock_init(&iocg->lock);
+ INIT_HLIST_HEAD(&iocg->group_data);
+ iocg->weight = IO_WEIGHT_DEFAULT;
+ iocg->ioprio_class = IOPRIO_CLASS_BE;
+
+ return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no mean to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic data structures. By now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+static int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct task_struct *tsk)
+{
+ struct io_context *ioc;
+ int ret = 0;
+
+ /* task_lock() is needed to avoid races with exit_io_context() */
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+ /*
+ * ioc == NULL means that the task is either too young or
+ * exiting: if it has still no ioc the ioc can't be shared,
+ * if the task is exiting the attach will fail anyway, no
+ * matter what we return here.
+ */
+ ret = -EINVAL;
+ task_unlock(tsk);
+
+ return ret;
+}
+
+static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct cgroup *prev, struct task_struct *tsk)
+{
+ struct io_context *ioc;
+
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc != NULL)
+ ioc->cgroup_changed = 1;
+ task_unlock(tsk);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+
+ /* Implemented in later patch */
+}
+
+struct cgroup_subsys io_subsys = {
+ .name = "io",
+ .create = iocg_create,
+ .can_attach = iocg_can_attach,
+ .attach = iocg_attach,
+ .destroy = iocg_destroy,
+ .populate = iocg_populate,
+ .subsys_id = io_subsys_id,
+ .use_id = 1,
+};
+
static void io_free_root_group(struct elevator_queue *e)
{
struct io_group *iog = e->efqd->root_group;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 068f240..f343841 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -13,6 +13,7 @@

#ifdef CONFIG_BLOCK
#include <linux/blkdev.h>
+#include <linux/cgroup.h>

#ifndef _ELV_SCHED_H
#define _ELV_SCHED_H
@@ -98,6 +99,8 @@ struct io_group {
struct io_entity entity;
atomic_t ref;
struct io_sched_data sched_data;
+ struct hlist_node group_node;
+ unsigned short iocg_id;
/*
* async queue for each priority case for RT and BE class.
* Used only for cfq.
@@ -108,6 +111,17 @@ struct io_group {
void *key;
};

+struct io_cgroup {
+ struct cgroup_subsys_state css;
+
+ unsigned int weight;
+ unsigned short ioprio_class;
+
+ spinlock_t lock;
+ struct hlist_head group_data;
+};
+
+
#else /* CONFIG_GROUP_IOSCHED */

struct io_group {
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..baf544f 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,9 @@ SUBSYS(net_cls)
#endif

/* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 4da4a75..b343594 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,11 @@ struct io_context {
unsigned short ioprio;
unsigned short ioprio_changed;

+#ifdef CONFIG_GROUP_IOSCHED
+ /* If task changes the cgroup, elevator processes it asynchronously */
+ unsigned short cgroup_changed;
+#endif
+
/*
* For request batching
*/
--
1.6.0.6

2009-09-24 19:27:28

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 08/28] io-controller: Common hierarchical fair queuing code in elevator layer

o This patch enables hierarchical fair queuing in the common layer. It is
controlled by the config option CONFIG_GROUP_IOSCHED.

o Requests keep a reference on the ioq and the ioq keeps a reference
on its io group. For async queues in CFQ, and the single ioq in other
schedulers, the io_group also keeps a reference on the io_queue. This
reference on the ioq is dropped when the queue is released
(elv_release_ioq), so the queue can then be freed.

When a queue is released, it drops its reference to the io_group, and
the io_group is released after all its queues have been released.
Child groups also take a reference on their parent group and release
it when they are destroyed.

o Reads of iocg->group_data do not always happen under iocg->lock, so all
operations on that list are also protected by RCU. All modifications to
iocg->group_data must be done under iocg->lock.

Whenever iocg->lock and queue_lock must both be held, queue_lock is taken
first; this avoids deadlocks. To avoid a race between cgroup deletion and
elevator switch, the following algorithm is used:

- The cgroup deletion path holds iocg->lock and removes the iog entry
from the iocg->group_data list. Then it drops iocg->lock, takes
queue_lock and destroys the iog. So in this path we never hold
iocg->lock and queue_lock at the same time. Also, since we remove
the iog from iocg->group_data under iocg->lock, we cannot race with
the elevator switch.

- The elevator switch path does not remove the iog from the
iocg->group_data list directly. It first holds iocg->lock and
scans iocg->group_data to see if the iog is still there; it removes
the iog only if it finds it there. Otherwise, cgroup deletion must
already have removed it from the list, and cgroup deletion is then
responsible for destroying the iog.

So whichever path removes the iog from the iocg->group_data list does
the final teardown of that iog by calling the __io_destroy_group()
function.
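
The protocol can be modelled with a small stand-alone sketch (user-space
pthreads with made-up names; the real code additionally relies on RCU and
css ids, as iocg_destroy() and io_group_check_and_destroy() below show).
Whichever path takes the group off the list under iocg->lock is the one
that destroys it, so it is destroyed exactly once:

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t iocg_lock  = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
        static int iog_on_list = 1;     /* stands in for group_data membership */

        /* cgroup deletion: remove under iocg_lock, then destroy under queue_lock. */
        static void *cgroup_delete(void *unused)
        {
                int removed = 0;

                pthread_mutex_lock(&iocg_lock);
                if (iog_on_list) {
                        iog_on_list = 0;
                        removed = 1;
                }
                pthread_mutex_unlock(&iocg_lock);

                if (removed) {
                        pthread_mutex_lock(&queue_lock);
                        printf("cgroup deletion destroys the io group\n");
                        pthread_mutex_unlock(&queue_lock);
                }
                return NULL;
        }

        /* elevator exit: queue_lock first, then re-check under iocg_lock. */
        static void *elevator_exit(void *unused)
        {
                pthread_mutex_lock(&queue_lock);
                pthread_mutex_lock(&iocg_lock);
                if (iog_on_list) {
                        iog_on_list = 0;
                        printf("elevator exit destroys the io group\n");
                }
                pthread_mutex_unlock(&iocg_lock);
                pthread_mutex_unlock(&queue_lock);
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                pthread_create(&a, NULL, cgroup_delete, NULL);
                pthread_create(&b, NULL, elevator_exit, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                return 0;
        }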

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Fabio Checconi <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Aristeu Rozanski <[email protected]>
Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/cfq-iosched.c | 2 +
block/elevator-fq.c | 500 +++++++++++++++++++++++++++++++++++++++++++++++++--
block/elevator-fq.h | 35 ++++
block/elevator.c | 4 +
4 files changed, 530 insertions(+), 11 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3e24c03..79ac161 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1357,6 +1357,8 @@ alloc_cfqq:

/* call it after cfq has initialized queue prio */
elv_init_ioq_io_group(ioq, iog);
+ /* ioq reference on iog */
+ elv_get_iog(iog);
cfq_log_cfqq(cfqd, cfqq, "alloced");
} else {
cfqq = &cfqd->oom_cfqq;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 0c060a6..d59ac50 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -677,6 +677,7 @@ void elv_put_ioq(struct io_queue *ioq)
{
struct elv_fq_data *efqd = ioq->efqd;
struct elevator_queue *e = efqd->eq;
+ struct io_group *iog;

BUG_ON(atomic_read(&ioq->ref) <= 0);
if (!atomic_dec_and_test(&ioq->ref))
@@ -684,12 +685,14 @@ void elv_put_ioq(struct io_queue *ioq)
BUG_ON(ioq->nr_queued);
BUG_ON(elv_ioq_busy(ioq));
BUG_ON(efqd->active_queue == ioq);
+ iog = ioq_to_io_group(ioq);

/* Can be called by outgoing elevator. Don't use q */
BUG_ON(!e->ops->elevator_free_sched_queue_fn);
e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
elv_log_ioq(efqd, ioq, "put_queue");
elv_free_ioq(ioq);
+ elv_put_iog(iog);
}
EXPORT_SYMBOL(elv_put_ioq);

@@ -919,6 +922,27 @@ void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
EXPORT_SYMBOL(elv_io_group_set_async_queue);

#ifdef CONFIG_GROUP_IOSCHED
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);
+
+static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+ struct io_entity *entity = &iog->entity;
+
+ entity->weight = iocg->weight;
+ entity->ioprio_class = iocg->ioprio_class;
+ entity->ioprio_changed = 1;
+ entity->my_sd = &iog->sched_data;
+}
+
+static void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+ struct io_entity *entity = &iog->entity;
+
+ init_io_entity_parent(entity, &parent->entity);
+
+ /* Child group reference on parent group. */
+ elv_get_iog(parent);
+}

struct io_cgroup io_root_cgroup = {
.weight = IO_WEIGHT_DEFAULT,
@@ -931,6 +955,27 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
struct io_cgroup, css);
}

+/*
+ * Search for the io_group of efqd in the hash table (for now just a list)
+ * of the io cgroup. Must be called under rcu_read_lock().
+ */
+static struct io_group *
+io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+ struct io_group *iog;
+ struct hlist_node *n;
+ void *__key;
+
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ __key = rcu_dereference(iog->key);
+ if (__key == key)
+ return iog;
+ }
+
+ return NULL;
+}
+
+
#define SHOW_FUNCTION(__VAR) \
static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup, \
struct cftype *cftype) \
@@ -1070,12 +1115,6 @@ static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
task_unlock(tsk);
}

-static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
-
- /* Implemented in later patch */
-}
-
struct cgroup_subsys io_subsys = {
.name = "io",
.create = iocg_create,
@@ -1087,11 +1126,196 @@ struct cgroup_subsys io_subsys = {
.use_id = 1,
};

+static inline unsigned int iog_weight(struct io_group *iog)
+{
+ return iog->entity.weight;
+}
+
+static struct io_group *
+io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg;
+ struct io_group *iog, *leaf = NULL, *prev = NULL;
+ gfp_t flags = GFP_ATOMIC | __GFP_ZERO;
+
+ for (; cgroup != NULL; cgroup = cgroup->parent) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ if (iog != NULL) {
+ /*
+ * All the cgroups in the path from there to the
+ * root must have a io_group for efqd, so we don't
+ * need any more allocations.
+ */
+ break;
+ }
+
+ iog = kzalloc_node(sizeof(*iog), flags, q->node);
+ if (!iog)
+ goto cleanup;
+
+ iog->iocg_id = css_id(&iocg->css);
+
+ io_group_init_entity(iocg, iog);
+
+ atomic_set(&iog->ref, 0);
+
+ /*
+ * Take the initial reference that will be released on destroy
+ * This can be thought of a joint reference by cgroup and
+ * elevator which will be dropped by either elevator exit
+ * or cgroup deletion path depending on who is exiting first.
+ */
+ elv_get_iog(iog);
+
+ if (leaf == NULL) {
+ leaf = iog;
+ prev = leaf;
+ } else {
+ io_group_set_parent(prev, iog);
+ /*
+ * Build a list of allocated nodes using the efqd
+ * field, which is still unused and will be initialized
+ * only after the node is connected.
+ */
+ prev->key = iog;
+ prev = iog;
+ }
+ }
+
+ return leaf;
+
+cleanup:
+ while (leaf != NULL) {
+ prev = leaf;
+ leaf = leaf->key;
+ kfree(prev);
+ }
+
+ return NULL;
+}
+
+static void io_group_chain_link(struct request_queue *q, void *key,
+ struct cgroup *cgroup, struct io_group *leaf,
+ struct elv_fq_data *efqd)
+{
+ struct io_cgroup *iocg;
+ struct io_group *iog, *next, *prev = NULL;
+ unsigned long flags;
+
+ assert_spin_locked(q->queue_lock);
+
+ for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+ next = leaf->key;
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ BUG_ON(iog != NULL);
+
+ spin_lock_irqsave(&iocg->lock, flags);
+
+ rcu_assign_pointer(leaf->key, key);
+ hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+ hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+ spin_unlock_irqrestore(&iocg->lock, flags);
+
+ prev = leaf;
+ leaf = next;
+ }
+
+ BUG_ON(cgroup == NULL && leaf != NULL);
+
+ /*
+ * This connects the topmost element of the allocated chain to the
+ * parent group.
+ */
+ if (cgroup != NULL && prev != NULL) {
+ iocg = cgroup_to_io_cgroup(cgroup);
+ iog = io_cgroup_lookup_group(iocg, key);
+ io_group_set_parent(prev, iog);
+ }
+}
+
+static struct io_group *io_find_alloc_group(struct request_queue *q,
+ struct cgroup *cgroup, struct elv_fq_data *efqd,
+ int create)
+{
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct io_group *iog = NULL;
+ /* Note: Use efqd as key */
+ void *key = efqd;
+
+ /*
+ * Take a reference to the css object. Don't want to map a bio to
+ * a group if it has been marked for deletion
+ */
+
+ if (!iocg || !css_tryget(&iocg->css))
+ return iog;
+
+ iog = io_cgroup_lookup_group(iocg, key);
+ if (iog != NULL || !create)
+ goto end;
+
+ iog = io_group_chain_alloc(q, key, cgroup);
+ if (iog != NULL)
+ io_group_chain_link(q, key, cgroup, iog, efqd);
+
+end:
+ css_put(&iocg->css);
+ return iog;
+}
+
+/*
+ * Search for the io group current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ *
+ * Note: This function should be called with queue lock held. It returns
+ * a pointer to io group without taking any reference. That group will
+ * be around as long as queue lock is not dropped (as group reclaim code
+ * needs to get hold of queue lock). So if somebody needs to use group
+ * pointer even after dropping queue lock, take a reference to the group
+ * before dropping queue lock.
+ */
+struct io_group *elv_io_get_io_group(struct request_queue *q, int create)
+{
+ struct cgroup *cgroup;
+ struct io_group *iog;
+ struct elv_fq_data *efqd = q->elevator->efqd;
+
+ assert_spin_locked(q->queue_lock);
+
+ rcu_read_lock();
+ cgroup = task_cgroup(current, io_subsys_id);
+ iog = io_find_alloc_group(q, cgroup, efqd, create);
+ if (!iog) {
+ if (create)
+ iog = efqd->root_group;
+ else
+ /*
+ * bio merge functions doing lookup don't want to
+ * map bio to root group by default
+ */
+ iog = NULL;
+ }
+ rcu_read_unlock();
+ return iog;
+}
+EXPORT_SYMBOL(elv_io_get_io_group);
+
+
static void io_free_root_group(struct elevator_queue *e)
{
struct io_group *iog = e->efqd->root_group;
struct io_service_tree *st;
int i;
+ struct io_cgroup *iocg = &io_root_cgroup;
+
+ spin_lock_irq(&iocg->lock);
+ hlist_del_rcu(&iog->group_node);
+ spin_unlock_irq(&iocg->lock);

for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
st = iog->sched_data.service_tree + i;
@@ -1099,19 +1323,21 @@ static void io_free_root_group(struct elevator_queue *e)
}

put_io_group_queues(e, iog);
- kfree(iog);
+ elv_put_iog(iog);
}

static struct io_group *io_alloc_root_group(struct request_queue *q,
struct elevator_queue *e, void *key)
{
struct io_group *iog;
+ struct io_cgroup *iocg = &io_root_cgroup;
int i;

iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
if (iog == NULL)
return NULL;

+ elv_get_iog(iog);
iog->entity.parent = NULL;
iog->entity.my_sd = &iog->sched_data;
iog->key = key;
@@ -1119,11 +1345,235 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
for (i = 0; i < IO_IOPRIO_CLASSES; i++)
iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;

+ spin_lock_irq(&iocg->lock);
+ rcu_assign_pointer(iog->key, key);
+ hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+ iog->iocg_id = css_id(&iocg->css);
+ spin_unlock_irq(&iocg->lock);
+
return iog;
}

+static void io_group_free_rcu(struct rcu_head *head)
+{
+ struct io_group *iog;
+
+ iog = container_of(head, struct io_group, rcu_head);
+ kfree(iog);
+}
+
+/*
+ * This cleanup function does the last bit of things to destroy cgroup.
+ * It should only get called after io_destroy_group has been invoked.
+ */
+static void io_group_cleanup(struct io_group *iog)
+{
+ struct io_service_tree *st;
+ int i;
+
+ BUG_ON(iog->sched_data.active_entity != NULL);
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ BUG_ON(!RB_EMPTY_ROOT(&st->active));
+ BUG_ON(st->active_entity != NULL);
+ }
+
+ /*
+ * Wait for any rcu readers to exit before freeing up the group.
+ * Primarily useful when elv_io_get_io_group() is called without queue
+ * lock to access some group data from bdi_congested_group() path.
+ */
+ call_rcu(&iog->rcu_head, io_group_free_rcu);
+}
+
+void elv_put_iog(struct io_group *iog)
+{
+ struct io_group *parent_iog = NULL;
+ struct io_entity *parent;
+
+ BUG_ON(atomic_read(&iog->ref) <= 0);
+ if (!atomic_dec_and_test(&iog->ref))
+ return;
+
+ parent = parent_entity(&iog->entity);
+ if (parent)
+ parent_iog = iog_of(parent);
+
+ io_group_cleanup(iog);
+
+ if (parent_iog)
+ elv_put_iog(parent_iog);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
+/*
+ * After the group is destroyed, no new sync IO should come to the group.
+ * It might still have pending IOs in some busy queues. It should be able to
+ * send those IOs down to the disk. The async IOs (due to dirty page writeback)
+ * would go in the root group queues after this, as the group does not exist
+ * anymore.
+ */
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+ struct io_service_tree *st;
+ int i;
+ struct io_entity *entity = &iog->entity;
+
+ /*
+ * Mark io group for deletion so that no new entry goes in
+ * idle tree. Any active queue which is removed from active
+ * tree will not be put in to idle tree.
+ */
+ entity->exiting = 1;
+
+ /* We flush idle tree now, and don't put things in there any more. */
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = iog->sched_data.service_tree + i;
+ flush_idle_tree(st);
+ }
+
+ hlist_del(&iog->elv_data_node);
+ put_io_group_queues(efqd->eq, iog);
+
+ if (entity->on_idle_st)
+ dequeue_io_entity_idle(entity);
+
+ /*
+ * Put the reference taken at the time of creation so that when all
+ * queues are gone, group can be destroyed.
+ */
+ elv_put_iog(iog);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+ struct io_group *iog;
+ struct elv_fq_data *efqd;
+ unsigned long uninitialized_var(flags);
+
+ /*
+ * io groups are linked in two lists. One list is maintained
+ * in elevator (efqd->group_list) and other is maintained
+ * per cgroup structure (iocg->group_data).
+ *
+ * While a cgroup is being deleted, elevator also might be
+ * exiting and both might try to clean up the same io group,
+ * so we need to be a little careful.
+ *
+ * (iocg->group_data) is protected by iocg->lock. To avoid deadlock,
+ * we can't hold the queue lock while holding iocg->lock. So we first
+ * remove iog from iocg->group_data under iocg->lock. Whoever removes
+ * iog from iocg->group_data should call __io_destroy_group to remove
+ * iog.
+ */
+
+ rcu_read_lock();
+
+remove_entry:
+ spin_lock_irqsave(&iocg->lock, flags);
+
+ if (hlist_empty(&iocg->group_data)) {
+ spin_unlock_irqrestore(&iocg->lock, flags);
+ goto done;
+ }
+ iog = hlist_entry(iocg->group_data.first, struct io_group,
+ group_node);
+ efqd = rcu_dereference(iog->key);
+ hlist_del_rcu(&iog->group_node);
+ iog->iocg_id = 0;
+ spin_unlock_irqrestore(&iocg->lock, flags);
+
+ spin_lock_irqsave(efqd->queue->queue_lock, flags);
+ __io_destroy_group(efqd, iog);
+ spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+ goto remove_entry;
+
+done:
+ free_css_id(&io_subsys, &iocg->css);
+ rcu_read_unlock();
+ BUG_ON(!hlist_empty(&iocg->group_data));
+ kfree(iocg);
+}
+
+/*
+ * This function checks if iog is still in iocg->group_data, and removes it.
+ * If iog is not in that list, then cgroup destroy path has removed it, and
+ * we do not need to remove it.
+ */
+static void
+io_group_check_and_destroy(struct elv_fq_data *efqd, struct io_group *iog)
+{
+ struct io_cgroup *iocg;
+ unsigned long flags;
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+
+ css = css_lookup(&io_subsys, iog->iocg_id);
+
+ if (!css)
+ goto out;
+
+ iocg = container_of(css, struct io_cgroup, css);
+
+ spin_lock_irqsave(&iocg->lock, flags);
+
+ if (iog->iocg_id) {
+ hlist_del_rcu(&iog->group_node);
+ __io_destroy_group(efqd, iog);
+ }
+
+ spin_unlock_irqrestore(&iocg->lock, flags);
+out:
+ rcu_read_unlock();
+}
+
+static void release_elv_io_groups(struct elevator_queue *e)
+{
+ struct hlist_node *pos, *n;
+ struct io_group *iog;
+ struct elv_fq_data *efqd = e->efqd;
+
+ hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+ elv_data_node) {
+ io_group_check_and_destroy(efqd, iog);
+ }
+}
+
+/*
+ * If the bio-submitting task and the rq don't belong to the same io_group,
+ * the bio can't be merged
+ */
+int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+ struct request_queue *q = rq->q;
+ struct io_queue *ioq = rq->ioq;
+ struct io_group *iog, *__iog;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return 1;
+
+ /* Determine the io group of the bio submitting task */
+ iog = elv_io_get_io_group(q, 0);
+ if (!iog) {
+ /* Maybe the task belongs to a different cgroup for which the io
+ * group has not been set up yet. */
+ return 0;
+ }
+
+ /* Determine the io group of the ioq, rq belongs to*/
+ __iog = ioq_to_io_group(ioq);
+
+ return (iog == __iog);
+}
+
#else /* CONFIG_GROUP_IOSCHED */

+static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
+static inline void release_elv_io_groups(struct elevator_queue *e) {}
+
static struct io_group *io_alloc_root_group(struct request_queue *q,
struct elevator_queue *e, void *key)
{
@@ -1207,8 +1657,13 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
struct elevator_queue *eq = q->elevator;

if (ioq) {
- elv_log_ioq(efqd, ioq, "set_active, busy=%d",
- efqd->busy_queues);
+ struct io_group *iog = ioq_to_io_group(ioq);
+ elv_log_ioq(efqd, ioq, "set_active, busy=%d class=%hu prio=%hu"
+ " weight=%u group_weight=%u qued=%d",
+ efqd->busy_queues, ioq->entity.ioprio_class,
+ ioq->entity.ioprio, ioq->entity.weight,
+ iog_weight(iog), ioq->nr_queued);
+
ioq->slice_start = ioq->slice_end = 0;
ioq->dispatch_start = jiffies;

@@ -1387,6 +1842,7 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
struct io_queue *active_ioq;
struct elevator_queue *eq = q->elevator;
struct io_entity *entity, *new_entity;
+ struct io_group *iog = NULL, *new_iog = NULL;

active_ioq = elv_active_ioq(eq);

@@ -1419,9 +1875,16 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
return 1;

/*
- * Check with io scheduler if it has additional criterion based on
- * which it wants to preempt existing queue.
+ * If both the queues belong to same group, check with io scheduler
+ * if it has additional criterion based on which it wants to
+ * preempt existing queue.
*/
+ iog = ioq_to_io_group(active_ioq);
+ new_iog = ioq_to_io_group(new_ioq);
+
+ if (iog != new_iog)
+ return 0;
+
if (eq->ops->elevator_should_preempt_fn) {
void *sched_queue = elv_ioq_sched_queue(new_ioq);

@@ -1569,6 +2032,10 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
if (new_ioq)
elv_log_ioq(e->efqd, ioq, "cooperating ioq=%d", new_ioq->pid);

+ /* Only select co-operating queue if it belongs to same group as ioq */
+ if (new_ioq && !is_same_group(&ioq->entity, &new_ioq->entity))
+ return NULL;
+
return new_ioq;
}

@@ -1873,6 +2340,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
efqd->idle_slice_timer.data = (unsigned long) efqd;

INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+ INIT_HLIST_HEAD(&efqd->group_list);

efqd->elv_slice[0] = elv_slice_async;
efqd->elv_slice[1] = elv_slice_sync;
@@ -1890,12 +2358,22 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
void elv_exit_fq_data(struct elevator_queue *e)
{
struct elv_fq_data *efqd = e->efqd;
+ struct request_queue *q = efqd->queue;

if (!elv_iosched_fair_queuing_enabled(e))
return;

elv_shutdown_timer_wq(e);

+ spin_lock_irq(q->queue_lock);
+ release_elv_io_groups(e);
+ spin_unlock_irq(q->queue_lock);
+
+ elv_shutdown_timer_wq(e);
+
+ /* Wait for iog->key accessors to exit their grace periods. */
+ synchronize_rcu();
+
BUG_ON(timer_pending(&efqd->idle_slice_timer));
io_free_root_group(e);
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index f343841..769798b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -100,6 +100,7 @@ struct io_group {
atomic_t ref;
struct io_sched_data sched_data;
struct hlist_node group_node;
+ struct hlist_node elv_data_node;
unsigned short iocg_id;
/*
* async queue for each priority case for RT and BE class.
@@ -109,6 +110,7 @@ struct io_group {
struct io_queue *async_queue[2][IOPRIO_BE_NR];
struct io_queue *async_idle_queue;
void *key;
+ struct rcu_head rcu_head;
};

struct io_cgroup {
@@ -142,6 +144,9 @@ struct io_group {
struct elv_fq_data {
struct io_group *root_group;

+ /* List of io groups hanging on this elevator */
+ struct hlist_head group_list;
+
struct request_queue *queue;
struct elevator_queue *eq;
unsigned int busy_queues;
@@ -322,6 +327,28 @@ static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
return &eq->efqd->oom_ioq;
}

+#ifdef CONFIG_GROUP_IOSCHED
+
+extern int elv_io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void elv_put_iog(struct io_group *iog);
+extern struct io_group *elv_io_get_io_group(struct request_queue *q,
+ int create);
+
+static inline void elv_get_iog(struct io_group *iog)
+{
+ atomic_inc(&iog->ref);
+}
+
+#else /* !GROUP_IOSCHED */
+
+static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+ return 1;
+}
+
+static inline void elv_get_iog(struct io_group *iog) {}
+static inline void elv_put_iog(struct io_group *iog) {}
+
static inline struct io_group *
elv_io_get_io_group(struct request_queue *q, int create)
{
@@ -329,6 +356,8 @@ elv_io_get_io_group(struct request_queue *q, int create)
return q->elevator->efqd->root_group;
}

+#endif /* GROUP_IOSCHED */
+
extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
size_t count);
@@ -413,6 +442,12 @@ static inline void *elv_select_ioq(struct request_queue *q, int force)
{
return NULL;
}
+
+static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+
+{
+ return 1;
+}
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _ELV_SCHED_H */
#endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index ea4042e..b2725cd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -122,6 +122,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
!bio_failfast_driver(bio) != !blk_failfast_driver(rq))
return 0;

+ /* If rq and bio belong to different groups, don't allow merging */
+ if (!elv_io_group_allow_merge(rq, bio))
+ return 0;
+
if (!elv_iosched_allow_merge(rq, bio))
return 0;

--
1.6.0.6

2009-09-24 19:30:28

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 09/28] io-controller: cfq changes to use hierarchical fair queuing code in elevator layer

Make cfq hierarchical.

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Fabio Checconi <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Aristeu Rozanski <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/Kconfig.iosched | 8 ++++++
block/cfq-iosched.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++--
init/Kconfig | 2 +-
3 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index dd5224d..a91a807 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -54,6 +54,14 @@ config IOSCHED_CFQ
working environment, suitable for desktop systems.
This is the default I/O scheduler.

+config IOSCHED_CFQ_HIER
+ bool "CFQ Hierarchical Scheduling support"
+ depends on IOSCHED_CFQ && CGROUPS
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in cfq.
+
choice
prompt "Default I/O scheduler"
default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 79ac161..0e665a9 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1286,6 +1286,61 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->pid = pid;
}

+#ifdef CONFIG_IOSCHED_CFQ_HIER
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+ struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
+ struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+ struct cfq_data *cfqd = cic->key;
+ struct io_group *iog, *__iog;
+ unsigned long flags;
+ struct request_queue *q;
+
+ if (unlikely(!cfqd))
+ return;
+
+ q = cfqd->queue;
+
+ spin_lock_irqsave(q->queue_lock, flags);
+
+ iog = elv_io_get_io_group(q, 0);
+
+ if (async_cfqq != NULL) {
+ __iog = cfqq_to_io_group(async_cfqq);
+ if (iog != __iog) {
+ /* cgroup changed, drop the reference to async queue */
+ cic_set_cfqq(cic, NULL, 0);
+ cfq_put_queue(async_cfqq);
+ }
+ }
+
+ if (sync_cfqq != NULL) {
+ __iog = cfqq_to_io_group(sync_cfqq);
+
+ /*
+ * Drop reference to sync queue. A new sync queue will
+ * be assigned in new group upon arrival of a fresh request.
+ * If the old queue still has requests, those requests will be
+ * dispatched over a period of time and queue will be freed
+ * automatically.
+ */
+ if (iog != __iog) {
+ cic_set_cfqq(cic, NULL, 1);
+ elv_exit_ioq(sync_cfqq->ioq);
+ cfq_put_queue(sync_cfqq);
+ }
+ }
+
+ spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+ call_for_each_cic(ioc, changed_cgroup);
+ ioc->cgroup_changed = 0;
+}
+#endif /* CONFIG_IOSCHED_CFQ_HIER */
+
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
struct io_context *ioc, gfp_t gfp_mask)
@@ -1297,7 +1352,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
struct io_group *iog = NULL;

retry:
- iog = elv_io_get_io_group(q, 0);
+ iog = elv_io_get_io_group(q, 1);

cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
@@ -1386,7 +1441,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue *async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct io_group *iog = elv_io_get_io_group(cfqd->queue, 0);
+ struct io_group *iog = elv_io_get_io_group(cfqd->queue, 1);

if (!is_sync) {
async_cfqq = elv_io_group_async_queue_prio(iog, ioprio_class,
@@ -1541,7 +1596,10 @@ out:
smp_read_barrier_depends();
if (unlikely(ioc->ioprio_changed))
cfq_ioc_set_ioprio(ioc);
-
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+ if (unlikely(ioc->cgroup_changed))
+ cfq_ioc_set_cgroup(ioc);
+#endif
return cic;
err_free:
cfq_cic_free(cic);
diff --git a/init/Kconfig b/init/Kconfig
index 29f701d..afcaa86 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -613,7 +613,7 @@ config CGROUP_MEM_RES_CTLR_SWAP
size is 4096bytes, 512k per 1Gbytes of swap.

config GROUP_IOSCHED
- bool "Group IO Scheduler"
+ bool
depends on CGROUPS && ELV_FAIR_QUEUING
default n
---help---
--
1.6.0.6

2009-09-24 19:27:20

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 10/28] io-controller: Export disk time used and nr sectors dispatched through cgroups

o This patch exports some statistics through the cgroup interface. The two
statistics currently exported are the actual disk time assigned to the cgroup
and the actual number of sectors dispatched to disk on behalf of the cgroup.
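
The files can be read like any other cgroup file. A stand-alone sketch is
below; the mount point /cgroup/io and the group name test1 are assumptions,
while the "major:minor value" line format matches the seq_printf() calls in
the patch.

        #include <stdio.h>

        static void dump(const char *path)
        {
                char line[128];
                FILE *f = fopen(path, "r");

                if (!f) {
                        perror(path);
                        return;
                }
                /* One "<major>:<minor> <value>" line per disk the group used. */
                while (fgets(line, sizeof(line), f))
                        printf("%s: %s", path, line);
                fclose(f);
        }

        int main(void)
        {
                dump("/cgroup/io/test1/io.disk_time");
                dump("/cgroup/io/test1/io.disk_sectors");
                return 0;
        }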

Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/elevator-fq.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++
block/elevator-fq.h | 10 +++++++
2 files changed, 86 insertions(+), 0 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index d59ac50..a57ca9d 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -13,6 +13,7 @@

#include <linux/blkdev.h>
#include <linux/blktrace_api.h>
+#include <linux/seq_file.h>
#include "elevator-fq.h"

const int elv_slice_sync = HZ / 10;
@@ -267,6 +268,8 @@ static void entity_served(struct io_entity *entity, unsigned long served,
for_each_entity(entity) {
entity->vdisktime += elv_delta_fair(charge, entity);
update_min_vdisktime(entity->st);
+ entity->total_time += served;
+ entity->total_sectors += nr_sectors;
/* Group charge can be different from queue charge */
charge = group_charge;
}
@@ -1035,6 +1038,66 @@ STORE_FUNCTION(weight, IO_WEIGHT_MIN, IO_WEIGHT_MAX);
STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
#undef STORE_FUNCTION

+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq_file *m)
+{
+ struct io_cgroup *iocg;
+ struct io_group *iog;
+ struct hlist_node *n;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgroup deletion.
+ */
+ if (iog->key) {
+ seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev),
+ iog->entity.total_time);
+ }
+ }
+ rcu_read_unlock();
+ cgroup_unlock();
+
+ return 0;
+}
+
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq_file *m)
+{
+ struct io_cgroup *iocg;
+ struct io_group *iog;
+ struct hlist_node *n;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgroup deletion.
+ */
+ if (iog->key) {
+ seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev),
+ iog->entity.total_sectors);
+ }
+ }
+ rcu_read_unlock();
+ cgroup_unlock();
+
+ return 0;
+}
+
struct cftype io_files[] = {
{
.name = "weight",
@@ -1046,6 +1109,14 @@ struct cftype io_files[] = {
.read_u64 = io_cgroup_ioprio_class_read,
.write_u64 = io_cgroup_ioprio_class_write,
},
+ {
+ .name = "disk_time",
+ .read_seq_string = io_cgroup_disk_time_read,
+ },
+ {
+ .name = "disk_sectors",
+ .read_seq_string = io_cgroup_disk_sectors_read,
+ },
};

static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1137,6 +1208,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
struct io_cgroup *iocg;
struct io_group *iog, *leaf = NULL, *prev = NULL;
gfp_t flags = GFP_ATOMIC | __GFP_ZERO;
+ unsigned int major, minor;
+ struct backing_dev_info *bdi = &q->backing_dev_info;

for (; cgroup != NULL; cgroup = cgroup->parent) {
iocg = cgroup_to_io_cgroup(cgroup);
@@ -1157,6 +1230,9 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)

iog->iocg_id = css_id(&iocg->css);

+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ iog->dev = MKDEV(major, minor);
+
io_group_init_entity(iocg, iog);

atomic_set(&iog->ref, 0);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 769798b..256f71a 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -60,6 +60,13 @@ struct io_entity {

unsigned short ioprio, ioprio_class;
int ioprio_changed;
+
+ /*
+ * Keep track of total service received by this entity. Keep the
+ * stats both for time slices and number of sectors dispatched
+ */
+ unsigned long total_time;
+ unsigned long total_sectors;
};

/*
@@ -111,6 +118,9 @@ struct io_group {
struct io_queue *async_idle_queue;
void *key;
struct rcu_head rcu_head;
+
+ /* The device MKDEV(major, minor), this group has been created for */
+ dev_t dev;
};

struct io_cgroup {
--
1.6.0.6

2009-09-24 19:26:44

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 11/28] io-controller: Debug hierarchical IO scheduling

o A little debugging aid for hierarchical IO scheduling.

o Enabled under CONFIG_DEBUG_GROUP_IOSCHED

o Currently it outputs more debug messages in the blktrace output, which
helps a great deal when debugging a hierarchical setup. It also creates the
additional cgroup interfaces io.disk_queue and io.disk_dequeue to export
some more debugging data.

Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/Kconfig.iosched | 8 +++
block/elevator-fq.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 29 +++++++++
3 files changed, 202 insertions(+), 3 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a91a807..a7d0bf8 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -90,6 +90,14 @@ config DEFAULT_IOSCHED
default "cfq" if DEFAULT_CFQ
default "noop" if DEFAULT_NOOP

+config DEBUG_GROUP_IOSCHED
+ bool "Debug Hierarchical Scheduling support"
+ depends on CGROUPS && GROUP_IOSCHED
+ default n
+ ---help---
+ Enable some debugging hooks for hierarchical scheduling support.
+ Currently it just outputs more information in blktrace output.
+
endmenu

endif
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index a57ca9d..6020406 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -259,6 +259,91 @@ init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
entity->st = &parent_iog->sched_data.service_tree[idx];
}

+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static void io_group_path(struct io_group *iog)
+{
+ unsigned short id = iog->iocg_id;
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+
+ if (!id)
+ goto out;
+
+ css = css_lookup(&io_subsys, id);
+ if (!css)
+ goto out;
+
+ if (!css_tryget(css))
+ goto out;
+
+ cgroup_path(css->cgroup, iog->path, sizeof(iog->path));
+
+ css_put(css);
+
+ rcu_read_unlock();
+ return;
+out:
+ rcu_read_unlock();
+ iog->path[0] = '\0';
+ return;
+}
+
+static inline void debug_update_stats_enqueue(struct io_entity *entity)
+{
+ struct io_group *iog = iog_of(entity);
+
+ if (iog) {
+ struct elv_fq_data *efqd;
+
+ /*
+ * Keep track of how many times a group has been added
+ * to active tree.
+ */
+ iog->queue++;
+
+ rcu_read_lock();
+ efqd = rcu_dereference(iog->key);
+ if (efqd)
+ elv_log_iog(efqd, iog, "add group weight=%u",
+ iog->entity.weight);
+ rcu_read_unlock();
+ }
+}
+
+static inline void debug_update_stats_dequeue(struct io_entity *entity)
+{
+ struct io_group *iog = iog_of(entity);
+
+ if (iog) {
+ struct elv_fq_data *efqd;
+
+ iog->dequeue++;
+ rcu_read_lock();
+ efqd = rcu_dereference(iog->key);
+ if (efqd)
+ elv_log_iog(efqd, iog, "del group weight=%u",
+ iog->entity.weight);
+ rcu_read_unlock();
+ }
+}
+
+static inline void print_ioq_service_stats(struct io_queue *ioq)
+{
+ struct io_group *iog = ioq_to_io_group(ioq);
+
+ elv_log_ioq(ioq->efqd, ioq, "service: QTt=%lu QTs=%lu GTt=%lu GTs=%lu",
+ ioq->entity.total_time, ioq->entity.total_sectors,
+ iog->entity.total_time, iog->entity.total_sectors);
+}
+
+#else /* DEBUG_GROUP_IOSCHED */
+static inline void io_group_path(struct io_group *iog) {}
+static inline void print_ioq_service_stats(struct io_queue *ioq) {}
+static inline void debug_update_stats_enqueue(struct io_entity *entity) {}
+static inline void debug_update_stats_dequeue(struct io_entity *entity) {}
+#endif /* DEBUG_GROUP_IOSCHED */
+
static void entity_served(struct io_entity *entity, unsigned long served,
unsigned long queue_charge, unsigned long group_charge,
unsigned long nr_sectors)
@@ -485,6 +570,7 @@ static void dequeue_io_entity(struct io_entity *entity)
entity->on_st = 0;
st->nr_active--;
sd->nr_active--;
+ debug_update_stats_dequeue(entity);

if (vdisktime_gt(entity->vdisktime, st->min_vdisktime))
enqueue_io_entity_idle(entity);
@@ -546,6 +632,7 @@ static void enqueue_io_entity(struct io_entity *entity)
entity->on_st = 1;
place_entity(st, entity, 0);
__enqueue_io_entity(st, entity, 0);
+ debug_update_stats_enqueue(entity);
}

static struct io_entity *__lookup_next_io_entity(struct io_service_tree *st)
@@ -725,6 +812,9 @@ static void elv_ioq_served(struct io_queue *ioq, unsigned long served)

entity_served(&ioq->entity, served, queue_charge, group_charge,
ioq->nr_sectors);
+ elv_log_ioq(ioq->efqd, ioq, "ioq served: QSt=%lu QSs=%lu qued=%lu",
+ served, ioq->nr_sectors, ioq->nr_queued);
+ print_ioq_service_stats(ioq);
}

/*
@@ -978,7 +1068,6 @@ io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
return NULL;
}

-
#define SHOW_FUNCTION(__VAR) \
static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup, \
struct cftype *cftype) \
@@ -1098,6 +1187,64 @@ static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
return 0;
}

+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static int io_cgroup_disk_queue_read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq_file *m)
+{
+ struct io_cgroup *iocg = NULL;
+ struct io_group *iog = NULL;
+ struct hlist_node *n;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
+ rcu_read_lock();
+ /* Loop through all the io groups and print statistics */
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgroup deletion.
+ */
+ if (iog->key)
+ seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev), iog->queue);
+ }
+ rcu_read_unlock();
+ cgroup_unlock();
+
+ return 0;
+}
+
+static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq_file *m)
+{
+ struct io_cgroup *iocg = NULL;
+ struct io_group *iog = NULL;
+ struct hlist_node *n;
+
+ if (!cgroup_lock_live_group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup_to_io_cgroup(cgroup);
+ spin_lock_irq(&iocg->lock);
+ /* Loop through all the io groups and print statistics */
+ hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgroup deletion.
+ */
+ if (iog->key)
+ seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev), iog->dequeue);
+ }
+ spin_unlock_irq(&iocg->lock);
+ cgroup_unlock();
+
+ return 0;
+}
+#endif
+
struct cftype io_files[] = {
{
.name = "weight",
@@ -1117,6 +1264,16 @@ struct cftype io_files[] = {
.name = "disk_sectors",
.read_seq_string = io_cgroup_disk_sectors_read,
},
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ {
+ .name = "disk_queue",
+ .read_seq_string = io_cgroup_disk_queue_read,
+ },
+ {
+ .name = "disk_dequeue",
+ .read_seq_string = io_cgroup_disk_dequeue_read,
+ },
+#endif
};

static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1244,6 +1401,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
* or cgroup deletion path depending on who is exiting first.
*/
elv_get_iog(iog);
+ io_group_path(iog);

if (leaf == NULL) {
leaf = iog;
@@ -1426,6 +1584,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
iog->iocg_id = css_id(&iocg->css);
spin_unlock_irq(&iocg->lock);
+ io_group_path(iog);

return iog;
}
@@ -1739,6 +1898,7 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
efqd->busy_queues, ioq->entity.ioprio_class,
ioq->entity.ioprio, ioq->entity.weight,
iog_weight(iog), ioq->nr_queued);
+ print_ioq_service_stats(ioq);

ioq->slice_start = ioq->slice_end = 0;
ioq->dispatch_start = jiffies;
@@ -1803,10 +1963,11 @@ static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
{
BUG_ON(elv_ioq_busy(ioq));
BUG_ON(ioq == efqd->active_queue);
- elv_log_ioq(efqd, ioq, "add to busy");
enqueue_ioq(ioq);
elv_mark_ioq_busy(ioq);
efqd->busy_queues++;
+ elv_log_ioq(efqd, ioq, "add to busy: qued=%d", ioq->nr_queued);
+ print_ioq_service_stats(ioq);
}

static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)
@@ -1815,7 +1976,8 @@ static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)

BUG_ON(!elv_ioq_busy(ioq));
BUG_ON(ioq->nr_queued);
- elv_log_ioq(efqd, ioq, "del from busy");
+ elv_log_ioq(efqd, ioq, "del from busy: qued=%d", ioq->nr_queued);
+ print_ioq_service_stats(ioq);
elv_clear_ioq_busy(ioq);
BUG_ON(efqd->busy_queues == 0);
efqd->busy_queues--;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 256f71a..2ea746b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -121,6 +121,16 @@ struct io_group {

/* The device MKDEV(major, minor), this group has been created for */
dev_t dev;
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+ /* How many times this group has been added to active tree */
+ unsigned long queue;
+
+ /* How many times this group has been removed from active tree */
+ unsigned long dequeue;
+
+ /* Store cgroup path */
+ char path[128];
+#endif
};

struct io_cgroup {
@@ -177,10 +187,29 @@ struct elv_fq_data {
};

/* Logging facilities. */
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+{ \
+ blk_add_trace_msg((efqd)->queue, "elv%d%c %s " fmt, (ioq)->pid, \
+ elv_ioq_sync(ioq) ? 'S' : 'A', \
+ ioq_to_io_group(ioq)->path, ##args); \
+}
+
+#define elv_log_iog(efqd, iog, fmt, args...) \
+{ \
+ blk_add_trace_msg((efqd)->queue, "elv %s " fmt, (iog)->path, ##args); \
+}
+
+#else
#define elv_log_ioq(efqd, ioq, fmt, args...) \
blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid, \
elv_ioq_sync(ioq) ? 'S' : 'A', ##args)

+#define elv_log_iog(efqd, iog, fmt, args...) \
+ blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#endif
+
#define elv_log(efqd, fmt, args...) \
blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)

--
1.6.0.6

2009-09-24 19:27:39

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 12/28] io-controller: Introduce group idling

o It is not always the case that IO from a process or group is continuous. There
are cases of dependent reads where the next read is not issued until the previous
read has finished. For such cases, CFQ introduced the notion of slice_idle,
where we idle on the queue for some time hoping the next request will come;
that is how fairness is provided. Otherwise the queue would be deleted
immediately from the service tree and the process would not get its
fair share.

o This patch introduces a similar concept at the group level. Idle on the group
for a period of "group_idle", which is tunable through the sysfs interface. So
if a group is empty and about to be deleted, we idle waiting for the next request.

o This patch also introduces the notion of wait busy, where we wait for one
extra group_idle period even if the queue has consumed its time slice. The
reason is that the group would otherwise lose its share upon removal from the
service tree, as some other entity would be picked for dispatch and a vtime
jump would take place (see the sketch below).
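
Below is a minimal user-space sketch of the group-idle decision the points above
describe: idle (or wait busy) on a group only when expiring this queue would leave
the group with no backlogged entity. The struct layout and names are simplified
stand-ins, not the kernel code added by this patch.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the io_group / io_queue state used below */
struct iog_model {
	int nr_active;		/* backlogged entities in this group */
	bool idle_window;	/* group idling enabled for this group */
};

struct ioq_model {
	struct iog_model *iog;
	int nr_queued;		/* requests still queued in this io queue */
};

/*
 * Idle on the group only if group idling is enabled and this queue is
 * the last backlogged entity of its group with nothing left queued,
 * i.e. expiring it now would delete the group from the service tree.
 */
static bool iog_should_idle(const struct ioq_model *ioq, unsigned int group_idle)
{
	if (!group_idle || !ioq->iog->idle_window)
		return false;
	return ioq->iog->nr_active <= 1 && ioq->nr_queued == 0;
}

int main(void)
{
	struct iog_model g = { .nr_active = 1, .idle_window = true };
	struct ioq_model q = { .iog = &g, .nr_queued = 0 };

	printf("idle on group before expiry: %s\n",
	       iog_should_idle(&q, 8) ? "yes" : "no");
	return 0;
}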

Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/cfq-iosched.c | 10 ++-
block/elevator-fq.c | 218 ++++++++++++++++++++++++++++++++++++++++++++++++---
block/elevator-fq.h | 44 ++++++++++-
3 files changed, 258 insertions(+), 14 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 0e665a9..878cf76 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -981,7 +981,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
cfq_class_idle(cfqq))) {
- cfq_slice_expired(cfqd);
+ /*
+ * If this queue deletion will cause the group to loose its
+ * fairness, hold off expiry.
+ */
+ if (!elv_iog_should_idle(cfqq->ioq))
+ cfq_slice_expired(cfqd);
}

cfq_log(cfqd, "dispatched a request");
@@ -2123,6 +2128,9 @@ static struct elv_fs_entry cfq_attrs[] = {
CFQ_ATTR(slice_idle),
ELV_ATTR(slice_sync),
ELV_ATTR(slice_async),
+#ifdef CONFIG_GROUP_IOSCHED
+ ELV_ATTR(group_idle),
+#endif
__ATTR_NULL
};

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 6020406..5511256 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -19,6 +19,7 @@
const int elv_slice_sync = HZ / 10;
int elv_slice_async = HZ / 25;
const int elv_slice_async_rq = 2;
+int elv_group_idle = HZ / 125;
static struct kmem_cache *elv_ioq_pool;

/*
@@ -259,6 +260,17 @@ init_io_entity_service_tree(struct io_entity *entity, struct io_entity *parent)
entity->st = &parent_iog->sched_data.service_tree[idx];
}

+/*
+ * Returns the number of active entities a particular io group has. This
+ * includes number of active entities on service trees as well as the active
+ * entity which is being served currently, if any.
+ */
+
+static inline int elv_iog_nr_active(struct io_group *iog)
+{
+ return iog->sched_data.nr_active;
+}
+
#ifdef CONFIG_DEBUG_GROUP_IOSCHED
static void io_group_path(struct io_group *iog)
{
@@ -844,6 +856,8 @@ ssize_t __FUNC(struct elevator_queue *e, char *page) \
__data = jiffies_to_msecs(__data); \
return elv_var_show(__data, (page)); \
}
+SHOW_FUNCTION(elv_group_idle_show, efqd->elv_group_idle, 1);
+EXPORT_SYMBOL(elv_group_idle_show);
SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
EXPORT_SYMBOL(elv_slice_sync_show);
SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
@@ -866,6 +880,8 @@ ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
*(__PTR) = __data; \
return ret; \
}
+STORE_FUNCTION(elv_group_idle_store, &efqd->elv_group_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_group_idle_store);
STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
EXPORT_SYMBOL(elv_slice_sync_store);
STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
@@ -1027,6 +1043,31 @@ static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
entity->my_sd = &iog->sched_data;
}

+/* Check if we plan to idle on the group associated with this queue or not */
+int elv_iog_should_idle(struct io_queue *ioq)
+{
+ struct io_group *iog = ioq_to_io_group(ioq);
+ struct elv_fq_data *efqd = ioq->efqd;
+
+ /*
+ * No idling on group if group idle is disabled or idling is disabled
+ * for this group. Currently for root group idling is disabled.
+ */
+ if (!efqd->elv_group_idle || !elv_iog_idle_window(iog))
+ return 0;
+
+ /*
+ * If this is last active queue in group with no request queued, we
+ * need to idle on group before expiring the queue to make sure group
+ * does not loose its share.
+ */
+ if ((elv_iog_nr_active(iog) <= 1) && !ioq->nr_queued)
+ return 1;
+
+ return 0;
+}
+EXPORT_SYMBOL(elv_iog_should_idle);
+
static void io_group_set_parent(struct io_group *iog, struct io_group *parent)
{
struct io_entity *entity = &iog->entity;
@@ -1394,6 +1435,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)

atomic_set(&iog->ref, 0);

+ elv_mark_iog_idle_window(iog);
/*
* Take the initial reference that will be released on destroy
* This can be thought of a joint reference by cgroup and
@@ -1844,6 +1886,10 @@ static void io_free_root_group(struct elevator_queue *e)
kfree(iog);
}

+/* No group idling in flat mode */
+int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
+EXPORT_SYMBOL(elv_iog_should_idle);
+
#endif /* CONFIG_GROUP_IOSCHED */

/*
@@ -1904,7 +1950,9 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
ioq->dispatch_start = jiffies;

elv_clear_ioq_wait_request(ioq);
+ elv_clear_iog_wait_request(iog);
elv_clear_ioq_must_dispatch(ioq);
+ elv_clear_iog_wait_busy_done(iog);
elv_mark_ioq_slice_new(ioq);

del_timer(&efqd->idle_slice_timer);
@@ -2009,14 +2057,19 @@ void elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
{
struct elv_fq_data *efqd = q->elevator->efqd;
long slice_used = 0, slice_overshoot = 0;
+ struct io_group *iog = ioq_to_io_group(ioq);

assert_spin_locked(q->queue_lock);
elv_log_ioq(efqd, ioq, "slice expired");

- if (elv_ioq_wait_request(ioq))
+ if (elv_ioq_wait_request(ioq) || elv_iog_wait_request(iog)
+ || elv_iog_wait_busy(iog))
del_timer(&efqd->idle_slice_timer);

elv_clear_ioq_wait_request(ioq);
+ elv_clear_iog_wait_request(iog);
+ elv_clear_iog_wait_busy(iog);
+ elv_clear_iog_wait_busy_done(iog);

/*
* Queue got expired before even a single request completed or
@@ -2075,7 +2128,7 @@ void elv_slice_expired(struct request_queue *q)
* no or if we aren't sure, a 1 will cause a preemption attempt.
*/
static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
- struct request *rq)
+ struct request *rq, int group_wait_req)
{
struct io_queue *active_ioq;
struct elevator_queue *eq = q->elevator;
@@ -2123,6 +2176,14 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
if (iog != new_iog)
return 0;

+ /*
+ * New queue belongs to same group as active queue. If we are just
+ * idling on the group (not queue), then let this new queue preempt
+ * the active queue.
+ */
+ if (group_wait_req)
+ return 1;
+
if (eq->ops->elevator_should_preempt_fn) {
void *sched_queue = elv_ioq_sched_queue(new_ioq);

@@ -2150,8 +2211,11 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
{
struct elv_fq_data *efqd = q->elevator->efqd;
struct io_queue *ioq = rq->ioq;
+ struct io_group *iog = ioq_to_io_group(ioq);
+ int group_wait_req = 0;
+ struct elevator_queue *eq = q->elevator;

- if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ if (!elv_iosched_fair_queuing_enabled(eq))
return;

BUG_ON(!efqd);
@@ -2162,7 +2226,25 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
if (!elv_ioq_busy(ioq))
elv_add_ioq_busy(efqd, ioq);

- if (ioq == elv_active_ioq(q->elevator)) {
+ if (elv_iog_wait_request(iog)) {
+ del_timer(&efqd->idle_slice_timer);
+ elv_clear_iog_wait_request(iog);
+ group_wait_req = 1;
+ }
+
+ /*
+ * If we were waiting for a request on this group, wait is
+ * done. Schedule the next dispatch
+ */
+ if (elv_iog_wait_busy(iog)) {
+ del_timer(&efqd->idle_slice_timer);
+ elv_clear_iog_wait_busy(iog);
+ elv_mark_iog_wait_busy_done(iog);
+ elv_schedule_dispatch(q);
+ return;
+ }
+
+ if (ioq == elv_active_ioq(eq)) {
/*
* Remember that we saw a request from this process, but
* don't start queuing just yet. Otherwise we risk seeing lots
@@ -2173,7 +2255,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
* has other work pending, don't risk delaying until the
* idle timer unplug to continue working.
*/
- if (elv_ioq_wait_request(ioq)) {
+ if (group_wait_req || elv_ioq_wait_request(ioq)) {
del_timer(&efqd->idle_slice_timer);
elv_clear_ioq_wait_request(ioq);
if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
@@ -2182,7 +2264,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
else
elv_mark_ioq_must_dispatch(ioq);
}
- } else if (elv_should_preempt(q, ioq, rq)) {
+ } else if (elv_should_preempt(q, ioq, rq, group_wait_req)) {
/*
* not the active queue - expire current slice if it is
* idle and has expired it's mean thinktime or this new queue
@@ -2208,8 +2290,15 @@ static void elv_idle_slice_timer(unsigned long data)
ioq = efqd->active_queue;

if (ioq) {
+ struct io_group *iog = ioq_to_io_group(ioq);

elv_clear_ioq_wait_request(ioq);
+ elv_clear_iog_wait_request(iog);
+
+ if (elv_iog_wait_busy(iog)) {
+ elv_clear_iog_wait_busy(iog);
+ goto expire;
+ }

/*
* We saw a request before the queue expired, let it through
@@ -2253,6 +2342,32 @@ static void elv_ioq_arm_slice_timer(struct request_queue *q)
eq->ops->elevator_arm_slice_timer_fn(q, ioq->sched_queue);
}

+static void elv_iog_arm_slice_timer(struct request_queue *q,
+ struct io_group *iog, int wait_for_busy)
+{
+ struct elv_fq_data *efqd = q->elevator->efqd;
+ unsigned long sl;
+
+ if (!efqd->elv_group_idle || !elv_iog_idle_window(iog))
+ return;
+ /*
+ * This queue has consumed its time slice. We are waiting only for
+ * it to become busy before we select next queue for dispatch.
+ */
+ if (wait_for_busy) {
+ elv_mark_iog_wait_busy(iog);
+ sl = efqd->elv_group_idle;
+ mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+ elv_log_iog(efqd, iog, "arm idle group: %lu wait busy=1", sl);
+ return;
+ }
+
+ elv_mark_iog_wait_request(iog);
+ sl = efqd->elv_group_idle;
+ mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+ elv_log_iog(efqd, iog, "arm_idle group: %lu", sl);
+}
+
/*
* If io scheduler has functionality of keeping track of close cooperator, check
* with it if it has got a closely co-operating queue.
@@ -2281,6 +2396,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
void *elv_select_ioq(struct request_queue *q, int force)
{
struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+ struct io_group *iog;

if (!elv_nr_busy_ioq(q->elevator))
return NULL;
@@ -2292,6 +2408,8 @@ void *elv_select_ioq(struct request_queue *q, int force)
if (elv_nr_busy_ioq(q->elevator) == 1 && !ioq->nr_queued)
return NULL;

+ iog = ioq_to_io_group(ioq);
+
/*
* Force dispatch. Continue to dispatch from current queue as long
* as it has requests.
@@ -2303,11 +2421,47 @@ void *elv_select_ioq(struct request_queue *q, int force)
goto expire;
}

+ /* We are waiting for this group to become busy before it expires.*/
+ if (elv_iog_wait_busy(iog)) {
+ ioq = NULL;
+ goto keep_queue;
+ }
+
/*
* The active queue has run out of time, expire it and select new.
*/
- if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
- goto expire;
+ if ((elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+ && !elv_ioq_must_dispatch(ioq)) {
+ /*
+ * Queue has used up its slice. Wait busy is not on otherwise
+ * we wouldn't have been here. If this group will be deleted
+ * after the queue expiry, then make sure we have onece
+ * done wait busy on the group in an attempt to make it
+ * backlogged.
+ *
+ * Following check helps in two conditions.
+ * - If there are requests dispatched from the queue and
+ * select_ioq() comes before a request completed from the
+ * queue and got a chance to arm any of the idle timers.
+ *
+ * - If at request completion time slice had not expired and
+ * we armed either a ioq timer or group timer but when
+ * select_ioq() hits, slice has expired and it will expire
+ * the queue without doing busy wait on group.
+ *
+ * In similar situations cfq lets delte the queue even if
+ * idle timer is armed. That does not impact fairness in non
+ * hierarhical setup due to weighted slice lengths. But in
+ * hierarchical setup where group slice lengths are derived
+ * from queue and is not proportional to group's weight, it
+ * harms the fairness of the group.
+ */
+ if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog)) {
+ ioq = NULL;
+ goto keep_queue;
+ } else
+ goto expire;
+ }

/*
* The active queue has requests and isn't expired, allow it to
@@ -2339,6 +2493,12 @@ void *elv_select_ioq(struct request_queue *q, int force)
goto keep_queue;
}

+ /* Check for group idling */
+ if (elv_iog_should_idle(ioq) && elv_ioq_nr_dispatched(ioq)) {
+ ioq = NULL;
+ goto keep_queue;
+ }
+
expire:
elv_slice_expired(q);
new_queue:
@@ -2436,11 +2596,13 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
const int sync = rq_is_sync(rq);
struct io_queue *ioq;
struct elv_fq_data *efqd = q->elevator->efqd;
+ struct io_group *iog;

if (!elv_iosched_fair_queuing_enabled(q->elevator))
return;

ioq = rq->ioq;
+ iog = ioq_to_io_group(ioq);
WARN_ON(!efqd->rq_in_driver);
WARN_ON(!ioq->dispatched);
efqd->rq_in_driver--;
@@ -2467,15 +2629,46 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
* mean seek distance, give them a chance to run instead
* of idling.
*/
- if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+ if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq)) {
+ /*
+ * This is the last empty queue in the group and it
+ * has consumed its slice. If we expire it right away
+ * group might loose its share. Wait for an extra
+ * group_idle period for a request before queue
+ * expires.
+ */
+ if (elv_iog_should_idle(ioq)) {
+ elv_iog_arm_slice_timer(q, iog, 1);
+ goto done;
+ }
+
+ /* Expire the queue */
elv_slice_expired(q);
- else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
- && sync && !rq_noidle(rq))
+ goto done;
+ } else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
+ && sync && !rq_noidle(rq))
elv_ioq_arm_slice_timer(q);
+ /*
+ * If this is the last queue in the group and we did not
+ * decide to idle on queue, idle on group.
+ */
+ if (elv_iog_should_idle(ioq) && !ioq->dispatched
+ && !ioq_is_idling(ioq)) {
+ /*
+ * If queue has used up its slice, wait for the
+ * one extra group_idle period to let the group
+ * backlogged again. This is to avoid a group loosing
+ * its fair share.
+ */
+ if (elv_ioq_slice_used(ioq))
+ elv_iog_arm_slice_timer(q, iog, 1);
+ else
+ elv_iog_arm_slice_timer(q, iog, 0);
+ }

check_expire_last_empty_queue(q, ioq);
}
-
+done:
if (!efqd->rq_in_driver)
elv_schedule_dispatch(q);
}
@@ -2582,6 +2775,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)

efqd->elv_slice[0] = elv_slice_async;
efqd->elv_slice[1] = elv_slice_sync;
+ efqd->elv_group_idle = elv_group_idle;

return 0;
}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 2ea746b..7b73f11 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -105,6 +105,7 @@ struct io_queue {
struct io_group {
struct io_entity entity;
atomic_t ref;
+ unsigned int flags;
struct io_sched_data sched_data;
struct hlist_node group_node;
struct hlist_node elv_data_node;
@@ -179,6 +180,8 @@ struct elv_fq_data {
struct timer_list idle_slice_timer;
struct work_struct unplug_work;

+ unsigned int elv_group_idle;
+
/* Base slice length for sync and async queues */
unsigned int elv_slice[2];

@@ -247,6 +250,42 @@ ELV_IO_QUEUE_FLAG_FNS(idle_window)
ELV_IO_QUEUE_FLAG_FNS(slice_new)
ELV_IO_QUEUE_FLAG_FNS(sync)

+#ifdef CONFIG_GROUP_IOSCHED
+
+enum elv_group_state_flags {
+ ELV_GROUP_FLAG_idle_window, /* elevator group idling enabled */
+ ELV_GROUP_FLAG_wait_request, /* waiting for a request */
+ ELV_GROUP_FLAG_wait_busy, /* wait for this queue to get busy */
+ ELV_GROUP_FLAG_wait_busy_done, /* Have already waited on this group*/
+};
+
+#define ELV_IO_GROUP_FLAG_FNS(name) \
+static inline void elv_mark_iog_##name(struct io_group *iog) \
+{ \
+ (iog)->flags |= (1 << ELV_GROUP_FLAG_##name); \
+} \
+static inline void elv_clear_iog_##name(struct io_group *iog) \
+{ \
+ (iog)->flags &= ~(1 << ELV_GROUP_FLAG_##name); \
+} \
+static inline int elv_iog_##name(struct io_group *iog) \
+{ \
+ return ((iog)->flags & (1 << ELV_GROUP_FLAG_##name)) != 0; \
+}
+
+#else /* GROUP_IOSCHED */
+
+#define ELV_IO_GROUP_FLAG_FNS(name) \
+static inline void elv_mark_iog_##name(struct io_group *iog) {} \
+static inline void elv_clear_iog_##name(struct io_group *iog) {} \
+static inline int elv_iog_##name(struct io_group *iog) { return 0; }
+#endif /* GROUP_IOSCHED */
+
+ELV_IO_GROUP_FLAG_FNS(idle_window)
+ELV_IO_GROUP_FLAG_FNS(wait_request)
+ELV_IO_GROUP_FLAG_FNS(wait_busy)
+ELV_IO_GROUP_FLAG_FNS(wait_busy_done)
+
static inline void elv_get_ioq(struct io_queue *ioq)
{
atomic_inc(&ioq->ref);
@@ -372,7 +411,9 @@ extern int elv_io_group_allow_merge(struct request *rq, struct bio *bio);
extern void elv_put_iog(struct io_group *iog);
extern struct io_group *elv_io_get_io_group(struct request_queue *q,
int create);
-
+extern ssize_t elv_group_idle_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_group_idle_store(struct elevator_queue *q, const char *name,
+ size_t count);
static inline void elv_get_iog(struct io_group *iog)
{
atomic_inc(&iog->ref);
@@ -441,6 +482,7 @@ extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
extern void elv_free_ioq(struct io_queue *ioq);
extern struct io_group *ioq_to_io_group(struct io_queue *ioq);
extern void elv_exit_ioq(struct io_queue *ioq);
+extern int elv_iog_should_idle(struct io_queue *ioq);

#else /* CONFIG_ELV_FAIR_QUEUING */
static inline struct elv_fq_data *
--
1.6.0.6

2009-09-24 19:27:33

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 13/28] io-controller: Implement wait busy for io queues

o CFQ enables idling only on very selective queues (sequential readers). That's why
we implemented the concept of group idling, where irrespective of the workload
in the group, one can idle on the group and provide its fair share before moving
on to the next queue or group. This provides stronger isolation but also slows
down switching between groups.

One can disable "group_idle" to make group switching faster, but then we also
lose fairness for sequential readers, because once a queue has consumed its
slice we delete it and move on to the next queue.

o This patch implements the concept of wait busy (similar to groups) on queues.
So once a CFQ queue has consumed its slice, we idle for one extra period
for it to get busy again, and then expire it and move on to the next queue. This
makes sure that sequential readers don't lose fairness (no vtime jump), even
if group idling is disabled (see the sketch below).
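
Here is a rough user-space model of the per-queue wait-busy test described
above. The field names are illustrative only; the actual logic is the
elv_ioq_should_wait_busy() check added in the patch below.

#include <stdbool.h>
#include <stdio.h>

/* Simplified per-queue state; field names are illustrative only */
struct ioq_model {
	bool group_idle_window;	/* group-level idling enabled for this group */
	bool queue_idle_window;	/* CFQ idling enabled for this queue */
	bool slice_used;	/* queue has consumed its time slice */
	int  nr_queued;		/* requests still queued */
};

/*
 * Wait one extra period for the queue to become busy again only for
 * queues CFQ already considers idle-worthy (sequential readers), so a
 * random reader/writer does not hold up group switching.
 */
static bool ioq_should_wait_busy(const struct ioq_model *q)
{
	if (!q->group_idle_window)
		return false;		/* e.g. root group: no wait busy */
	return q->nr_queued == 0 && q->queue_idle_window && q->slice_used;
}

int main(void)
{
	struct ioq_model seq_reader = {
		.group_idle_window = true, .queue_idle_window = true,
		.slice_used = true, .nr_queued = 0,
	};

	printf("wait busy on queue: %s\n",
	       ioq_should_wait_busy(&seq_reader) ? "yes" : "no");
	return 0;
}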

Signed-off-by: Vivek Goyal <[email protected]>
---
block/elevator-fq.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 5511256..b8862d3 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -21,6 +21,7 @@ int elv_slice_async = HZ / 25;
const int elv_slice_async_rq = 2;
int elv_group_idle = HZ / 125;
static struct kmem_cache *elv_ioq_pool;
+static int elv_ioq_wait_busy = HZ / 125;

/*
* offset from end of service tree
@@ -1043,6 +1044,36 @@ static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
entity->my_sd = &iog->sched_data;
}

+/* If group_idling is enabled then group takes care of doing idling and wait
+ * busy on a queue. But this happens on all queues, even if we are running
+ * a random reader or random writer. This has its own advantage that group
+ * gets to run continuously for a period of time and provides strong isolation
+ * but too strong isolation can also slow down group switching.
+ *
+ * Hence provide this alternate mode where we do wait busy on the queues for
+ * which CFQ has idle_window enabled. This is useful in ensuring the fairness
+ * of sequential readers in group at the same time we don't do group idling
+ * on all the queues hence faster switching.
+ */
+int elv_ioq_should_wait_busy(struct io_queue *ioq)
+{
+ struct io_group *iog = ioq_to_io_group(ioq);
+
+ /* Idle window is disabled for root group */
+ if (!elv_iog_idle_window(iog))
+ return 0;
+
+ /*
+ * if CFQ has got idling enabled on this queue, wait for this queue
+ * to get backlogged again.
+ */
+ if (!ioq->nr_queued && elv_ioq_idle_window(ioq)
+ && elv_ioq_slice_used(ioq))
+ return 1;
+
+ return 0;
+}
+
/* Check if we plan to idle on the group associated with this queue or not */
int elv_iog_should_idle(struct io_queue *ioq)
{
@@ -1889,6 +1920,7 @@ static void io_free_root_group(struct elevator_queue *e)
/* No group idling in flat mode */
int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
EXPORT_SYMBOL(elv_iog_should_idle);
+static int elv_ioq_should_wait_busy(struct io_queue *ioq) { return 0; }

#endif /* CONFIG_GROUP_IOSCHED */

@@ -2368,6 +2400,24 @@ static void elv_iog_arm_slice_timer(struct request_queue *q,
elv_log_iog(efqd, iog, "arm_idle group: %lu", sl);
}

+static void
+elv_ioq_arm_wait_busy_timer(struct request_queue *q, struct io_queue *ioq)
+{
+ struct io_group *iog = ioq_to_io_group(ioq);
+ struct elv_fq_data *efqd = q->elevator->efqd;
+ unsigned long sl = 8;
+
+ /*
+ * This queue has consumed its time slice. We are waiting only for
+ * it to become busy before we select next queue for dispatch.
+ */
+ elv_mark_iog_wait_busy(iog);
+ sl = elv_ioq_wait_busy;
+ mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+ elv_log_ioq(efqd, ioq, "arm wait busy ioq: %lu", sl);
+ return;
+}
+
/*
* If io scheduler has functionality of keeping track of close cooperator, check
* with it if it has got a closely co-operating queue.
@@ -2456,7 +2506,8 @@ void *elv_select_ioq(struct request_queue *q, int force)
* from queue and is not proportional to group's weight, it
* harms the fairness of the group.
*/
- if (elv_iog_should_idle(ioq) && !elv_iog_wait_busy_done(iog)) {
+ if ((elv_iog_should_idle(ioq) || elv_ioq_should_wait_busy(ioq))
+ && !elv_iog_wait_busy_done(iog)) {
ioq = NULL;
goto keep_queue;
} else
@@ -2640,6 +2691,9 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
if (elv_iog_should_idle(ioq)) {
elv_iog_arm_slice_timer(q, iog, 1);
goto done;
+ } else if (elv_ioq_should_wait_busy(ioq)) {
+ elv_ioq_arm_wait_busy_timer(q, ioq);
+ goto done;
}

/* Expire the queue */
--
1.6.0.6

2009-09-24 19:28:46

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 14/28] io-controller: Keep track of late preemptions

o Found another issue during testing. Consider the following hierarchy.

root
/ \
R1 G1
/\
R2 W

Generally in CFQ, when readers and writers are running, a reader immediately
preempts writers and hence the reader gets the better bandwidth. In a
hierarchical setup it becomes a little more tricky. In the above diagram, G1
is a group, R1 and R2 are readers and W is a writer task.

Now assume W runs, then R1 runs and then R2 runs. After R2 has used its
time slice, R1 is scheduled in. After a couple of ms, R2 (a streaming reader)
gets backlogged again in group G1. But it will not preempt R1, both because R1
is also a reader and because preemption across groups is not allowed for
isolation reasons. Hence R2 gets backlogged in G1 with a vdisktime much higher
than W. So when G1 gets scheduled again, W gets to run its full slice length
despite the fact that R2 is queued on the same service tree.

The core issue here is that apart from regular preemptions (preemption
across classes), CFQ also has a special notion of preemption within a
class, and that can lead to problems when the active task is running in a
different group than the one where the new queue gets backlogged.

To solve the issue, keep track of this event (I am calling it late
preemption). When a group becomes eligible to run again, if late_preemption
is set, check if there are sync readers backlogged, and if yes, expire the
writer after one round of dispatch (see the sketch below).

This solves the issue of a reader not getting enough bandwidth in hierarchical
setups.
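
The bookkeeping can be modeled roughly as follows. This is only a sketch with
invented names; the patch itself tracks the event with iog->late_preemption, a
per-group count of backlogged sync queues and a must_expire queue flag.

#include <stdbool.h>
#include <stdio.h>

struct group_model {
	bool late_preemption;	/* a sync queue was denied cross-group preemption */
	int  nr_sync;		/* sync queues backlogged in this group */
};

/* Called when a new sync queue gets backlogged while another group is active */
static void note_late_preemption(struct group_model *new_iog, bool new_is_sync,
				 bool active_in_other_group)
{
	if (new_is_sync && active_in_other_group)
		new_iog->late_preemption = true;
}

/*
 * Called when a queue from this group is selected: if a late preemption
 * was recorded and sync readers are waiting behind an async queue, let
 * the async queue do one dispatch round and then force its expiry.
 */
static bool must_expire_after_one_round(const struct group_model *iog,
					bool selected_is_sync)
{
	return iog->late_preemption && !selected_is_sync && iog->nr_sync > 0;
}

int main(void)
{
	struct group_model g1 = { .late_preemption = false, .nr_sync = 1 };

	note_late_preemption(&g1, true, true);	/* R2 backlogged while R1 ran */
	printf("expire writer after one round: %s\n",
	       must_expire_after_one_round(&g1, false) ? "yes" : "no");
	return 0;
}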

Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/elevator-fq.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++
block/elevator-fq.h | 4 ++
2 files changed, 104 insertions(+), 0 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b8862d3..25beaf7 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -217,6 +217,62 @@ io_entity_sched_data(struct io_entity *entity)
return &iog_of(parent_entity(entity))->sched_data;
}

+static inline void set_late_preemption(struct elevator_queue *eq,
+ struct io_queue *active_ioq, struct io_queue *new_ioq)
+{
+ struct io_group *new_iog;
+
+ if (!active_ioq)
+ return;
+
+ /* For the time being, set late preempt only if new queue is sync */
+ if (!elv_ioq_sync(new_ioq))
+ return;
+
+ new_iog = ioq_to_io_group(new_ioq);
+
+ if (ioq_to_io_group(active_ioq) != new_iog
+ && !new_iog->late_preemption) {
+ new_iog->late_preemption = 1;
+ elv_log_ioq(eq->efqd, new_ioq, "set late preempt");
+ }
+}
+
+static inline void reset_late_preemption(struct elevator_queue *eq,
+ struct io_group *iog, struct io_queue *ioq)
+{
+ if (iog->late_preemption) {
+ iog->late_preemption = 0;
+ elv_log_ioq(eq->efqd, ioq, "reset late preempt");
+ }
+}
+
+static inline void
+check_late_preemption(struct elevator_queue *eq, struct io_queue *ioq)
+{
+ struct io_group *iog = ioq_to_io_group(ioq);
+
+ if (!iog->late_preemption)
+ return;
+
+ /*
+ * If a sync queue got queued in a group where other writers are are
+ * queued and at the time of queuing some other reader was running
+ * in anohter group, then this reader will not preempt the reader in
+ * another group. Side affect of this is that once this group gets
+ * scheduled, writer will start running and will not get preempted,
+ * as it should have been.
+ *
+ * Don't expire the writer right now otherwise writers might get
+ * completely starved. Let it just do one dispatch round and then
+ * expire. Mark the queue for expiry.
+ */
+ if (!elv_ioq_sync(ioq) && iog->sched_data.nr_sync) {
+ elv_mark_ioq_must_expire(ioq);
+ elv_log_ioq(eq->efqd, ioq, "late preempt, must expire");
+ }
+}
+
#else /* GROUP_IOSCHED */
#define for_each_entity(entity) \
for (; entity != NULL; entity = NULL)
@@ -248,6 +304,20 @@ io_entity_sched_data(struct io_entity *entity)

return &efqd->root_group->sched_data;
}
+
+static inline void set_late_preemption(struct elevator_queue *eq,
+ struct io_queue *active_ioq, struct io_queue *new_ioq)
+{
+}
+
+static inline void reset_late_preemption(struct elevator_queue *eq,
+ struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline void
+check_late_preemption(struct elevator_queue *eq, struct io_queue *ioq) { }
+
#endif /* GROUP_IOSCHED */

static inline void
@@ -578,11 +648,14 @@ static void dequeue_io_entity(struct io_entity *entity)
{
struct io_service_tree *st = entity->st;
struct io_sched_data *sd = io_entity_sched_data(entity);
+ struct io_queue *ioq = ioq_of(entity);

__dequeue_io_entity(st, entity);
entity->on_st = 0;
st->nr_active--;
sd->nr_active--;
+ if (ioq && elv_ioq_sync(ioq) && !elv_ioq_class_idle(ioq))
+ sd->nr_sync--;
debug_update_stats_dequeue(entity);

if (vdisktime_gt(entity->vdisktime, st->min_vdisktime))
@@ -627,6 +700,7 @@ static void enqueue_io_entity(struct io_entity *entity)
{
struct io_service_tree *st;
struct io_sched_data *sd = io_entity_sched_data(entity);
+ struct io_queue *ioq = ioq_of(entity);

if (entity->on_idle_st)
dequeue_io_entity_idle(entity);
@@ -642,6 +716,9 @@ static void enqueue_io_entity(struct io_entity *entity)
st = entity->st;
st->nr_active++;
sd->nr_active++;
+ /* Keep a track of how many sync queues are backlogged on this group */
+ if (ioq && elv_ioq_sync(ioq) && !elv_ioq_class_idle(ioq))
+ sd->nr_sync++;
entity->on_st = 1;
place_entity(st, entity, 0);
__enqueue_io_entity(st, entity, 0);
@@ -1986,6 +2063,7 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
elv_clear_ioq_must_dispatch(ioq);
elv_clear_iog_wait_busy_done(iog);
elv_mark_ioq_slice_new(ioq);
+ elv_clear_ioq_must_expire(ioq);

del_timer(&efqd->idle_slice_timer);
}
@@ -2102,6 +2180,10 @@ void elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
elv_clear_iog_wait_request(iog);
elv_clear_iog_wait_busy(iog);
elv_clear_iog_wait_busy_done(iog);
+ elv_clear_ioq_must_expire(ioq);
+
+ if (elv_ioq_sync(ioq))
+ reset_late_preemption(q->elevator, iog, ioq);

/*
* Queue got expired before even a single request completed or
@@ -2305,6 +2387,15 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
*/
elv_preempt_queue(q, ioq);
__blk_run_queue(q);
+ } else {
+ /*
+ * Request came in a queue which is not active and we did not
+ * decide to preempt the active queue. It is possible that
+ * active queue belonged to a different group and we did not
+ * allow preemption. Keep a track of this event so that once
+ * this group is ready to dispatch, we can do some more checks
+ */
+ set_late_preemption(eq, elv_active_ioq(eq), ioq);
}
}

@@ -2447,6 +2538,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
{
struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
struct io_group *iog;
+ struct elv_fq_data *efqd = q->elevator->efqd;

if (!elv_nr_busy_ioq(q->elevator))
return NULL;
@@ -2471,6 +2563,12 @@ void *elv_select_ioq(struct request_queue *q, int force)
goto expire;
}

+ /* This queue has been marked for expiry. Try to expire it */
+ if (elv_ioq_must_expire(ioq)) {
+ elv_log_ioq(efqd, ioq, "select: ioq must_expire. expire");
+ goto expire;
+ }
+
/* We are waiting for this group to become busy before it expires.*/
if (elv_iog_wait_busy(iog)) {
ioq = NULL;
@@ -2555,6 +2653,8 @@ expire:
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
keep_queue:
+ if (ioq)
+ check_late_preemption(q->elevator, ioq);
return ioq;
}

diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 7b73f11..2992d93 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -43,6 +43,7 @@ struct io_sched_data {
struct io_entity *active_entity;
int nr_active;
struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+ int nr_sync;
};

struct io_entity {
@@ -132,6 +133,7 @@ struct io_group {
/* Store cgroup path */
char path[128];
#endif
+ int late_preemption;
};

struct io_cgroup {
@@ -227,6 +229,7 @@ enum elv_queue_state_flags {
ELV_QUEUE_FLAG_idle_window, /* elevator slice idling enabled */
ELV_QUEUE_FLAG_slice_new, /* no requests dispatched in slice */
ELV_QUEUE_FLAG_sync, /* synchronous queue */
+ ELV_QUEUE_FLAG_must_expire, /* expire queue even slice is left */
};

#define ELV_IO_QUEUE_FLAG_FNS(name) \
@@ -249,6 +252,7 @@ ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
ELV_IO_QUEUE_FLAG_FNS(idle_window)
ELV_IO_QUEUE_FLAG_FNS(slice_new)
ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(must_expire)

#ifdef CONFIG_GROUP_IOSCHED

--
1.6.0.6

2009-09-24 19:27:03

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 15/28] io-controller: Allow CFQ specific extra preemptions

o CFQ allows a reader to preempt a writer. So far we allow this within a group
but not across groups. But there seems to be the following special case where
this preemption might make sense.

root
/ \
R Group
|
W

Here the reader should be able to preempt the writer. Think of 10 groups,
each running a writer, and an admin trying to do "ls" who suddenly
experiences high latencies for ls.

The same is true for metadata requests. If there is a metadata request and
a reader is running inside a sibling group, preemption will be allowed.
Note, the following is not allowed.
root
/ \
group1 group2
| |
R W

Here the reader can't preempt the writer.

o Put queues with pending metadata requests at the front of the service tree.
Generally such queues will preempt the currently running queue, but not in the
following case.
root
/ \
group1 group2
| / \
R1 R3 R2 (meta data)

Here R2 has a metadata request but it will not preempt R1. We need
to make sure that R2 gets queued ahead of R3 so that once group2 gets
going, we first service R2 and then R3, not vice versa (see the sketch below).
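
A condensed model of the two extra preemption rules above is shown below. It is
a sketch with made-up names; the real check is added to elv_should_preempt() in
the patch.

#include <stdbool.h>
#include <stdio.h>

struct q_model {
	bool sync;		/* reader (sync) vs writer (async) */
	bool meta_pending;	/* has a pending metadata request */
	bool in_group;		/* queue lives inside a child group */
};

/*
 * Extra preemptions are allowed only when the new queue is attached
 * directly at the level of the active queue's parent (the "R vs group/W"
 * case above): a reader may preempt a writer running in a sibling group,
 * and a metadata request may preempt plain IO.
 */
static bool extra_preempt(const struct q_model *newq, const struct q_model *active)
{
	if (newq->in_group || !active->in_group)
		return false;		/* not the special case pictured above */
	if (newq->sync && !active->sync)
		return true;		/* reader preempts sibling-group writer */
	if (newq->meta_pending && !active->meta_pending)
		return true;		/* metadata preempts regular IO */
	return false;
}

int main(void)
{
	struct q_model reader = { .sync = true,  .meta_pending = false, .in_group = false };
	struct q_model writer = { .sync = false, .meta_pending = false, .in_group = true };

	printf("reader preempts sibling-group writer: %s\n",
	       extra_preempt(&reader, &writer) ? "yes" : "no");
	return 0;
}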

Signed-off-by: Vivek Goyal <[email protected]>
---
block/elevator-fq.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
block/elevator-fq.h | 3 +++
2 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 25beaf7..8ff8a19 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -701,6 +701,7 @@ static void enqueue_io_entity(struct io_entity *entity)
struct io_service_tree *st;
struct io_sched_data *sd = io_entity_sched_data(entity);
struct io_queue *ioq = ioq_of(entity);
+ int add_front = 0;

if (entity->on_idle_st)
dequeue_io_entity_idle(entity);
@@ -716,12 +717,22 @@ static void enqueue_io_entity(struct io_entity *entity)
st = entity->st;
st->nr_active++;
sd->nr_active++;
+
/* Keep a track of how many sync queues are backlogged on this group */
if (ioq && elv_ioq_sync(ioq) && !elv_ioq_class_idle(ioq))
sd->nr_sync++;
entity->on_st = 1;
- place_entity(st, entity, 0);
- __enqueue_io_entity(st, entity, 0);
+
+ /*
+ * If a meta data request is pending in this queue, put this
+ * queue at the front so that it gets a chance to run first
+ * as soon as the associated group becomes eligbile to run.
+ */
+ if (ioq && ioq->meta_pending)
+ add_front = 1;
+
+ place_entity(st, entity, add_front);
+ __enqueue_io_entity(st, entity, add_front);
debug_update_stats_enqueue(entity);
}

@@ -2280,6 +2291,31 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
return 1;

/*
+ * Allow some additional preemptions where a reader queue gets
+ * backlogged and some writer queue is running under any of the
+ * sibling groups.
+ *
+ * root
+ * / \
+ * R group
+ * |
+ * W
+ */
+
+ if (ioq_of(new_entity) == new_ioq && iog_of(entity)) {
+ /* Let reader queue preempt writer in sibling group */
+ if (elv_ioq_sync(new_ioq) && !elv_ioq_sync(active_ioq))
+ return 1;
+ /*
+ * So both queues are sync. Let the new request get disk time if
+ * it's a metadata request and the current queue is doing
+ * regular IO.
+ */
+ if (new_ioq->meta_pending && !active_ioq->meta_pending)
+ return 1;
+ }
+
+ /*
* If both the queues belong to same group, check with io scheduler
* if it has additional criterion based on which it wants to
* preempt existing queue.
@@ -2335,6 +2371,8 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
BUG_ON(!efqd);
BUG_ON(!ioq);
ioq->nr_queued++;
+ if (rq_is_meta(rq))
+ ioq->meta_pending++;
elv_log_ioq(efqd, ioq, "add rq: rq_queued=%d", ioq->nr_queued);

if (!elv_ioq_busy(ioq))
@@ -2669,6 +2707,11 @@ void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
ioq = rq->ioq;
BUG_ON(!ioq);
ioq->nr_queued--;
+
+ if (rq_is_meta(rq)) {
+ WARN_ON(!ioq->meta_pending);
+ ioq->meta_pending--;
+ }
}

/* A request got dispatched. Do the accounting. */
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 2992d93..27ff5c4 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -100,6 +100,9 @@ struct io_queue {

/* Pointer to io scheduler's queue */
void *sched_queue;
+
+ /* pending metadata requests */
+ int meta_pending;
};

#ifdef CONFIG_GROUP_IOSCHED /* CONFIG_GROUP_IOSCHED */
--
1.6.0.6

2009-09-24 19:29:51

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 16/28] io-controller: Wait for requests to complete from last queue before new queue is scheduled

o Currently one can dispatch requests from multiple queues to the disk. This
is true for hardware which supports queuing. So if a disk supports a queue
depth of 31, it is possible that 20 requests are dispatched from queue 1
and then the next queue is scheduled in, which dispatches more requests.

o This multiple-queue dispatch introduces issues for accurate accounting of
the disk time consumed by a particular queue. For example, if one async queue
is scheduled in, it can dispatch 31 requests to the disk and then be
expired, and a new sync queue might get scheduled in. Those 31 requests
might take a long time to finish, but this time is never accounted to the
async queue which dispatched them.

o This patch introduces the functionality where we wait for all the requests
from the previous queue to finish before the next queue is scheduled in. That
way a queue is more accurately accounted for the disk time it has consumed.
Note this still does not take care of errors introduced by disk write caching.

o Because the above behavior can result in reduced throughput, it is
enabled only if the user sets the "fairness" tunable to 1.

o This patch helps in achieving more isolation between reads and buffered
writes in different cgroups. Buffered writes typically utilize the full queue
depth and then expire the queue. On the contrary, sequential reads
typically drive a queue depth of 1. So despite the fact that writes are
using more disk time, it is never accounted to the write queue because we don't
wait for requests to finish after dispatching them. This patch helps
do more accurate accounting of disk time, especially for buffered writes,
hence providing better fairness and better isolation between two cgroups
running read and write workloads (see the sketch below).
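
Conceptually, when "fairness" is set the selection path keeps the expiring
queue until its in-flight requests drain. Below is a minimal model of that
gate (hypothetical names, not the patch code).

#include <stdbool.h>
#include <stdio.h>

struct ioq_model {
	int dispatched;	/* requests sent to the disk, not yet completed */
};

/*
 * With fairness enabled, do not switch to a new queue (unless forced)
 * while the old one still has requests in flight; the completion time of
 * those requests is then charged to the queue that actually issued them.
 */
static bool keep_old_queue(const struct ioq_model *q, bool fairness, bool force)
{
	return fairness && !force && q->dispatched > 0;
}

int main(void)
{
	struct ioq_model buffered_writer = { .dispatched = 31 };

	printf("wait for drain before switching: %s\n",
	       keep_old_queue(&buffered_writer, true, false) ? "yes" : "no");
	return 0;
}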

Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/cfq-iosched.c | 1 +
block/elevator-fq.c | 30 +++++++++++++++++++++++++-----
block/elevator-fq.h | 10 +++++++++-
3 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 878cf76..37a4832 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2130,6 +2130,7 @@ static struct elv_fs_entry cfq_attrs[] = {
ELV_ATTR(slice_async),
#ifdef CONFIG_GROUP_IOSCHED
ELV_ATTR(group_idle),
+ ELV_ATTR(fairness),
#endif
__ATTR_NULL
};
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 8ff8a19..bac45fe 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -893,6 +893,8 @@ static void elv_ioq_served(struct io_queue *ioq, unsigned long served)

allocated_slice = elv_prio_to_slice(ioq->efqd, ioq);

+ queue_charge = group_charge = served;
+
/*
* We don't want to charge more than allocated slice otherwise this
* queue can miss one dispatch round doubling max latencies. On the
@@ -900,16 +902,15 @@ static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
* we stick to CFQ theme of queue loosing its share if it does not
* use the slice and moves to the back of service tree (almost).
*/
- queue_charge = allocated_slice;
+ if (!ioq->efqd->fairness)
+ queue_charge = allocated_slice;

/*
* Group is charged the real time consumed so that it does not loose
* fair share.
*/
- if (served > allocated_slice)
+ if (!ioq->efqd->fairness && group_charge > allocated_slice)
group_charge = allocated_slice;
- else
- group_charge = served;

entity_served(&ioq->entity, served, queue_charge, group_charge,
ioq->nr_sectors);
@@ -951,6 +952,8 @@ SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
EXPORT_SYMBOL(elv_slice_sync_show);
SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
EXPORT_SYMBOL(elv_slice_async_show);
+SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
+EXPORT_SYMBOL(elv_fairness_show);
#undef SHOW_FUNCTION

#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -975,6 +978,8 @@ STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
EXPORT_SYMBOL(elv_slice_sync_store);
STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
EXPORT_SYMBOL(elv_slice_async_store);
+STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
+EXPORT_SYMBOL(elv_fairness_store);
#undef STORE_FUNCTION

void elv_schedule_dispatch(struct request_queue *q)
@@ -2687,6 +2692,17 @@ void *elv_select_ioq(struct request_queue *q, int force)
}

expire:
+ if (efqd->fairness && !force && ioq && ioq->dispatched) {
+ /*
+ * If there are request dispatched from this queue, don't
+ * dispatch requests from new queue till all the requests from
+ * this queue have completed.
+ */
+ elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
+ " disp=%lu", ioq->dispatched);
+ ioq = NULL;
+ goto keep_queue;
+ }
elv_slice_expired(q);
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
@@ -2839,6 +2855,10 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
goto done;
}

+ /* Wait for requests to finish from this queue */
+ if (efqd->fairness && elv_ioq_nr_dispatched(ioq))
+ goto done;
+
/* Expire the queue */
elv_slice_expired(q);
goto done;
@@ -2849,7 +2869,7 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
* If this is the last queue in the group and we did not
* decide to idle on queue, idle on group.
*/
- if (elv_iog_should_idle(ioq) && !ioq->dispatched
+ if (elv_iog_should_idle(ioq) && !elv_ioq_nr_dispatched(ioq)
&& !ioq_is_idling(ioq)) {
/*
* If queue has used up its slice, wait for the
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 27ff5c4..68c6d16 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -192,6 +192,12 @@ struct elv_fq_data {

/* Fallback dummy ioq for extreme OOM conditions */
struct io_queue oom_ioq;
+
+ /*
+ * If set to 1, waits for all request completions from current
+ * queue before new queue is scheduled in
+ */
+ unsigned int fairness;
};

/* Logging facilities. */
@@ -451,7 +457,9 @@ extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
size_t count);
-
+extern ssize_t elv_fairness_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_fairness_store(struct elevator_queue *q, const char *name,
+ size_t count);
/* Functions used by elevator.c */
extern struct elv_fq_data *elv_alloc_fq_data(struct request_queue *q,
struct elevator_queue *e);
--
1.6.0.6

2009-09-24 19:31:07

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 17/28] io-controller: Separate out queue and data

o So far noop, deadline and AS had one common structure called *_data which
contained both the queue information, where requests are queued, and the
common data used for scheduling. This patch breaks this common structure
down into two parts, *_queue and *_data (see the sketch below). This is along
the lines of cfq, where all the requests are queued in the queue and the common
data and tunables are part of data.

o It does not change the functionality but this re-organization helps once
noop, deadline and AS are changed to use hierarchical fair queuing.

o It looks like the queue_empty function is not required; we can check
q->nr_sorted in the elevator layer to see whether the io scheduler queues are
empty or not.
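
The split can be pictured roughly like this. The field names are illustrative
only; the real as_queue/as_data layout is in the patch below.

#include <stdio.h>

/*
 * Per-queue state: where requests actually live. Once hierarchical fair
 * queuing is enabled there can be one of these per io group.
 */
struct sched_queue_model {
	int nr_queued[2];		/* queued sync/async requests */
};

/*
 * Per-device scheduler state: tunables and bookkeeping shared by all
 * queues, plus a pointer to whichever queue is being served.
 */
struct sched_data_model {
	unsigned int fifo_expire[2];	/* sync/async fifo expiry tunables */
	struct sched_queue_model *active;
};

int main(void)
{
	struct sched_queue_model q = { { 0, 0 } };
	struct sched_data_model d = { { 125, 250 }, &q };

	d.active->nr_queued[0]++;	/* a sync request arrives on the active queue */
	printf("sync requests queued: %d\n", d.active->nr_queued[0]);
	return 0;
}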

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/as-iosched.c | 208 ++++++++++++++++++++++++++--------------------
block/deadline-iosched.c | 117 ++++++++++++++++----------
block/elevator.c | 111 +++++++++++++++++++++----
block/noop-iosched.c | 59 ++++++-------
include/linux/elevator.h | 9 ++-
5 files changed, 320 insertions(+), 184 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index b90acbe..ec6b940 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
* or timed out */
};

-struct as_data {
- /*
- * run time data
- */
-
- struct request_queue *q; /* the "owner" queue */
-
+struct as_queue {
/*
* requests (as_rq s) are present on both sort_list and fifo_list
*/
@@ -90,6 +84,14 @@ struct as_data {
struct list_head fifo_list[2];

struct request *next_rq[2]; /* next in sort order */
+ unsigned long last_check_fifo[2];
+ int write_batch_count; /* max # of reqs in a write batch */
+ int current_write_count; /* how many requests left this batch */
+ int write_batch_idled; /* has the write batch gone idle? */
+};
+
+struct as_data {
+ struct request_queue *q; /* the "owner" queue */
sector_t last_sector[2]; /* last SYNC & ASYNC sectors */

unsigned long exit_prob; /* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
sector_t new_seek_mean;

unsigned long current_batch_expires;
- unsigned long last_check_fifo[2];
int changed_batch; /* 1: waiting for old batch to end */
int new_batch; /* 1: waiting on first read complete */
- int batch_data_dir; /* current batch SYNC / ASYNC */
- int write_batch_count; /* max # of reqs in a write batch */
- int current_write_count; /* how many requests left this batch */
- int write_batch_idled; /* has the write batch gone idle? */

enum anticipation_status antic_status;
unsigned long antic_start; /* jiffies: when it started */
struct timer_list antic_timer; /* anticipatory scheduling timer */
- struct work_struct antic_work; /* Deferred unplugging */
+ struct work_struct antic_work; /* Deferred unplugging */
struct io_context *io_context; /* Identify the expected process */
int ioc_finished; /* IO associated with io_context is finished */
int nr_dispatched;
+ int batch_data_dir; /* current batch SYNC / ASYNC */

/*
* settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
/*
* rb tree support functions
*/
-#define RQ_RB_ROOT(ad, rq) (&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq) (&(asq)->sort_list[rq_is_sync((rq))])

static void as_add_rq_rb(struct as_data *ad, struct request *rq)
{
struct request *alias;
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);

- while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+ while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
as_move_to_dispatch(ad, alias);
as_antic_stop(ad);
}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)

static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
{
- elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+ elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
}

/*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
* what request to process next. Anticipation works on top of this.
*/
static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
{
struct rb_node *rbnext = rb_next(&last->rb_node);
struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
else {
const int data_dir = rq_is_sync(last);

- rbnext = rb_first(&ad->sort_list[data_dir]);
+ rbnext = rb_first(&asq->sort_list[data_dir]);
if (rbnext && rbnext != &last->rb_node)
next = rb_entry_rq(rbnext);
}
@@ -789,9 +790,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
static void as_update_rq(struct as_data *ad, struct request *rq)
{
const int data_dir = rq_is_sync(rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);

/* keep the next_rq cache up to date */
- ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+ asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);

/*
* have we been anticipating this request?
@@ -812,25 +814,26 @@ static void update_write_batch(struct as_data *ad)
{
unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
long write_time;
+ struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);

write_time = (jiffies - ad->current_batch_expires) + batch;
if (write_time < 0)
write_time = 0;

- if (write_time > batch && !ad->write_batch_idled) {
+ if (write_time > batch && !asq->write_batch_idled) {
if (write_time > batch * 3)
- ad->write_batch_count /= 2;
+ asq->write_batch_count /= 2;
else
- ad->write_batch_count--;
- } else if (write_time < batch && ad->current_write_count == 0) {
+ asq->write_batch_count--;
+ } else if (write_time < batch && asq->current_write_count == 0) {
if (batch > write_time * 3)
- ad->write_batch_count *= 2;
+ asq->write_batch_count *= 2;
else
- ad->write_batch_count++;
+ asq->write_batch_count++;
}

- if (ad->write_batch_count < 1)
- ad->write_batch_count = 1;
+ if (asq->write_batch_count < 1)
+ asq->write_batch_count = 1;
}

/*
@@ -901,6 +904,7 @@ static void as_remove_queued_request(struct request_queue *q,
const int data_dir = rq_is_sync(rq);
struct as_data *ad = q->elevator->elevator_data;
struct io_context *ioc;
+ struct as_queue *asq = elv_get_sched_queue(q, rq);

WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);

@@ -914,8 +918,8 @@ static void as_remove_queued_request(struct request_queue *q,
* Update the "next_rq" cache if we are about to remove its
* entry
*/
- if (ad->next_rq[data_dir] == rq)
- ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+ if (asq->next_rq[data_dir] == rq)
+ asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);

rq_fifo_clear(rq);
as_del_rq_rb(ad, rq);
@@ -929,23 +933,23 @@ static void as_remove_queued_request(struct request_queue *q,
*
* See as_antic_expired comment.
*/
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
{
struct request *rq;
long delta_jif;

- delta_jif = jiffies - ad->last_check_fifo[adir];
+ delta_jif = jiffies - asq->last_check_fifo[adir];
if (unlikely(delta_jif < 0))
delta_jif = -delta_jif;
if (delta_jif < ad->fifo_expire[adir])
return 0;

- ad->last_check_fifo[adir] = jiffies;
+ asq->last_check_fifo[adir] = jiffies;

- if (list_empty(&ad->fifo_list[adir]))
+ if (list_empty(&asq->fifo_list[adir]))
return 0;

- rq = rq_entry_fifo(ad->fifo_list[adir].next);
+ rq = rq_entry_fifo(asq->fifo_list[adir].next);

return time_after(jiffies, rq_fifo_time(rq));
}
@@ -954,7 +958,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
* as_batch_expired returns true if the current batch has expired. A batch
* is a set of reads or a set of writes.
*/
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
{
if (ad->changed_batch || ad->new_batch)
return 0;
@@ -964,7 +968,7 @@ static inline int as_batch_expired(struct as_data *ad)
return time_after(jiffies, ad->current_batch_expires);

return time_after(jiffies, ad->current_batch_expires)
- || ad->current_write_count == 0;
+ || asq->current_write_count == 0;
}

/*
@@ -973,6 +977,7 @@ static inline int as_batch_expired(struct as_data *ad)
static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
{
const int data_dir = rq_is_sync(rq);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);

BUG_ON(RB_EMPTY_NODE(&rq->rb_node));

@@ -995,12 +1000,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
ad->io_context = NULL;
}

- if (ad->current_write_count != 0)
- ad->current_write_count--;
+ if (asq->current_write_count != 0)
+ asq->current_write_count--;
}
ad->ioc_finished = 0;

- ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+ asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);

/*
* take it off the sort and fifo list, add to dispatch queue
@@ -1024,9 +1029,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
static int as_dispatch_request(struct request_queue *q, int force)
{
struct as_data *ad = q->elevator->elevator_data;
- const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
- const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
struct request *rq;
+ struct as_queue *asq = elv_select_sched_queue(q, force);
+ int reads, writes;
+
+ if (!asq)
+ return 0;
+
+ reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+ writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+

if (unlikely(force)) {
/*
@@ -1042,25 +1054,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
ad->changed_batch = 0;
ad->new_batch = 0;

- while (ad->next_rq[BLK_RW_SYNC]) {
- as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+ while (asq->next_rq[BLK_RW_SYNC]) {
+ as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
dispatched++;
}
- ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+ asq->last_check_fifo[BLK_RW_SYNC] = jiffies;

- while (ad->next_rq[BLK_RW_ASYNC]) {
- as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+ while (asq->next_rq[BLK_RW_ASYNC]) {
+ as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
dispatched++;
}
- ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+ asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;

return dispatched;
}

/* Signal that the write batch was uncontended, so we can't time it */
if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
- if (ad->current_write_count == 0 || !writes)
- ad->write_batch_idled = 1;
+ if (asq->current_write_count == 0 || !writes)
+ asq->write_batch_idled = 1;
}

if (!(reads || writes)
@@ -1069,14 +1081,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
|| ad->changed_batch)
return 0;

- if (!(reads && writes && as_batch_expired(ad))) {
+ if (!(reads && writes && as_batch_expired(ad, asq))) {
/*
* batch is still running or no reads or no writes
*/
- rq = ad->next_rq[ad->batch_data_dir];
+ rq = asq->next_rq[ad->batch_data_dir];

if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
- if (as_fifo_expired(ad, BLK_RW_SYNC))
+ if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
goto fifo_expired;

if (as_can_anticipate(ad, rq)) {
@@ -1100,7 +1112,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
*/

if (reads) {
- BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+ BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));

if (writes && ad->batch_data_dir == BLK_RW_SYNC)
/*
@@ -1113,8 +1125,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
ad->changed_batch = 1;
}
ad->batch_data_dir = BLK_RW_SYNC;
- rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
- ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+ rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+ asq->last_check_fifo[ad->batch_data_dir] = jiffies;
goto dispatch_request;
}

@@ -1124,7 +1136,7 @@ static int as_dispatch_request(struct request_queue *q, int force)

if (writes) {
dispatch_writes:
- BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+ BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));

if (ad->batch_data_dir == BLK_RW_SYNC) {
ad->changed_batch = 1;
@@ -1137,10 +1149,10 @@ dispatch_writes:
ad->new_batch = 0;
}
ad->batch_data_dir = BLK_RW_ASYNC;
- ad->current_write_count = ad->write_batch_count;
- ad->write_batch_idled = 0;
- rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
- ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+ asq->current_write_count = asq->write_batch_count;
+ asq->write_batch_idled = 0;
+ rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+ asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
goto dispatch_request;
}

@@ -1152,9 +1164,9 @@ dispatch_request:
* If a request has expired, service it.
*/

- if (as_fifo_expired(ad, ad->batch_data_dir)) {
+ if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
fifo_expired:
- rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+ rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
}

if (ad->changed_batch) {
@@ -1187,6 +1199,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
{
struct as_data *ad = q->elevator->elevator_data;
int data_dir;
+ struct as_queue *asq = elv_get_sched_queue(q, rq);

RQ_SET_STATE(rq, AS_RQ_NEW);

@@ -1205,7 +1218,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
* set expire time and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
- list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+ list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);

as_update_rq(ad, rq); /* keep state machine up to date */
RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1227,31 +1240,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
}

-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
- struct as_data *ad = q->elevator->elevator_data;
-
- return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
- && list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
static int
as_merge(struct request_queue *q, struct request **req, struct bio *bio)
{
- struct as_data *ad = q->elevator->elevator_data;
sector_t rb_key = bio->bi_sector + bio_sectors(bio);
struct request *__rq;
+ struct as_queue *asq = elv_get_sched_queue_current(q);
+
+ if (!asq)
+ return ELEVATOR_NO_MERGE;

/*
* check for front merge
*/
- __rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+ __rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
if (__rq && elv_rq_merge_ok(__rq, bio)) {
*req = __rq;
return ELEVATOR_FRONT_MERGE;
@@ -1334,6 +1336,41 @@ static int as_may_queue(struct request_queue *q, int rw)
return ret;
}

+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
+{
+ struct as_queue *asq;
+ struct as_data *ad = eq->elevator_data;
+
+ asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+ if (asq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+ INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+ asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+ asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+ if (ad)
+ asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+ else
+ asq->write_batch_count = default_write_batch_expire / 10;
+
+ if (asq->write_batch_count < 2)
+ asq->write_batch_count = 2;
+out:
+ return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+ struct as_queue *asq = sched_queue;
+
+ BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+ BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+ kfree(asq);
+}
+
static void as_exit_queue(struct elevator_queue *e)
{
struct as_data *ad = e->elevator_data;
@@ -1341,9 +1378,6 @@ static void as_exit_queue(struct elevator_queue *e)
del_timer_sync(&ad->antic_timer);
cancel_work_sync(&ad->antic_work);

- BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
- BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
put_io_context(ad->io_context);
kfree(ad);
}
@@ -1367,10 +1401,6 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
init_timer(&ad->antic_timer);
INIT_WORK(&ad->antic_work, as_work_handler);

- INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
- INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
- ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
- ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
ad->antic_expire = default_antic_expire;
@@ -1378,9 +1408,6 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;

ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
- ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
- if (ad->write_batch_count < 2)
- ad->write_batch_count = 2;

return ad;
}
@@ -1478,7 +1505,6 @@ static struct elevator_type iosched_as = {
.elevator_add_req_fn = as_add_request,
.elevator_activate_req_fn = as_activate_request,
.elevator_deactivate_req_fn = as_deactivate_request,
- .elevator_queue_empty_fn = as_queue_empty,
.elevator_completed_req_fn = as_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
@@ -1486,6 +1512,8 @@ static struct elevator_type iosched_as = {
.elevator_init_fn = as_init_queue,
.elevator_exit_fn = as_exit_queue,
.trim = as_trim,
+ .elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+ .elevator_free_sched_queue_fn = as_free_as_queue,
},

.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 25af8b9..5b017da 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2; /* max times reads can starve a write */
static const int fifo_batch = 16; /* # of sequential requests treated as one
by the above parameters. For throughput. */

-struct deadline_data {
- /*
- * run time data
- */
-
+struct deadline_queue {
/*
* requests (deadline_rq s) are present on both sort_list and fifo_list
*/
- struct rb_root sort_list[2];
+ struct rb_root sort_list[2];
struct list_head fifo_list[2];
-
/*
* next in sort order. read, write or both are NULL
*/
struct request *next_rq[2];
unsigned int batching; /* number of sequential requests made */
- sector_t last_sector; /* head position */
unsigned int starved; /* times reads have starved writes */
+};

+struct deadline_data {
+ struct request_queue *q;
+ sector_t last_sector; /* head position */
/*
* settings that change how the i/o scheduler behaves
*/
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
static inline struct rb_root *
deadline_rb_root(struct deadline_data *dd, struct request *rq)
{
- return &dd->sort_list[rq_data_dir(rq)];
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+ return &dq->sort_list[rq_data_dir(rq)];
}

/*
@@ -87,9 +87,10 @@ static inline void
deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
{
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);

- if (dd->next_rq[data_dir] == rq)
- dd->next_rq[data_dir] = deadline_latter_request(rq);
+ if (dq->next_rq[data_dir] == rq)
+ dq->next_rq[data_dir] = deadline_latter_request(rq);

elv_rb_del(deadline_rb_root(dd, rq), rq);
}
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
{
struct deadline_data *dd = q->elevator->elevator_data;
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(q, rq);

deadline_add_rq_rb(dd, rq);

@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
* set expire time and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
- list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+ list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
}

/*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
struct deadline_data *dd = q->elevator->elevator_data;
struct request *__rq;
int ret;
+ struct deadline_queue *dq;
+
+ dq = elv_get_sched_queue_current(q);
+ if (!dq)
+ return ELEVATOR_NO_MERGE;

/*
* check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
if (dd->front_merges) {
sector_t sector = bio->bi_sector + bio_sectors(bio);

- __rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+ __rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
if (__rq) {
BUG_ON(sector != blk_rq_pos(__rq));

@@ -207,10 +214,11 @@ static void
deadline_move_request(struct deadline_data *dd, struct request *rq)
{
const int data_dir = rq_data_dir(rq);
+ struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);

- dd->next_rq[READ] = NULL;
- dd->next_rq[WRITE] = NULL;
- dd->next_rq[data_dir] = deadline_latter_request(rq);
+ dq->next_rq[READ] = NULL;
+ dq->next_rq[WRITE] = NULL;
+ dq->next_rq[data_dir] = deadline_latter_request(rq);

dd->last_sector = rq_end_sector(rq);

@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
* deadline_check_fifo returns 0 if there are no expired requests on the fifo,
* 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
*/
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
{
- struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+ struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);

/*
* rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
static int deadline_dispatch_requests(struct request_queue *q, int force)
{
struct deadline_data *dd = q->elevator->elevator_data;
- const int reads = !list_empty(&dd->fifo_list[READ]);
- const int writes = !list_empty(&dd->fifo_list[WRITE]);
+ struct deadline_queue *dq = elv_select_sched_queue(q, force);
+ int reads, writes;
struct request *rq;
int data_dir;

+ if (!dq)
+ return 0;
+
+ reads = !list_empty(&dq->fifo_list[READ]);
+ writes = !list_empty(&dq->fifo_list[WRITE]);
+
/*
* batches are currently reads XOR writes
*/
- if (dd->next_rq[WRITE])
- rq = dd->next_rq[WRITE];
+ if (dq->next_rq[WRITE])
+ rq = dq->next_rq[WRITE];
else
- rq = dd->next_rq[READ];
+ rq = dq->next_rq[READ];

- if (rq && dd->batching < dd->fifo_batch)
+ if (rq && dq->batching < dd->fifo_batch)
/* we have a next request are still entitled to batch */
goto dispatch_request;

@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
*/

if (reads) {
- BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+ BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));

- if (writes && (dd->starved++ >= dd->writes_starved))
+ if (writes && (dq->starved++ >= dd->writes_starved))
goto dispatch_writes;

data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)

if (writes) {
dispatch_writes:
- BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+ BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));

- dd->starved = 0;
+ dq->starved = 0;

data_dir = WRITE;

@@ -299,48 +313,62 @@ dispatch_find_request:
/*
* we are not running a batch, find best request for selected data_dir
*/
- if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+ if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
/*
* A deadline has expired, the last request was in the other
* direction, or we have run out of higher-sectored requests.
* Start again from the request with the earliest expiry time.
*/
- rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+ rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
} else {
/*
* The last req was the same dir and we have a next request in
* sort order. No expired requests so continue on from here.
*/
- rq = dd->next_rq[data_dir];
+ rq = dq->next_rq[data_dir];
}

- dd->batching = 0;
+ dq->batching = 0;

dispatch_request:
/*
* rq is the selected appropriate request.
*/
- dd->batching++;
+ dq->batching++;
deadline_move_request(dd, rq);

return 1;
}

-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
{
- struct deadline_data *dd = q->elevator->elevator_data;
+ struct deadline_queue *dq;

- return list_empty(&dd->fifo_list[WRITE])
- && list_empty(&dd->fifo_list[READ]);
+ dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+ if (dq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&dq->fifo_list[READ]);
+ INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+ dq->sort_list[READ] = RB_ROOT;
+ dq->sort_list[WRITE] = RB_ROOT;
+out:
+ return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+ void *sched_queue)
+{
+ struct deadline_queue *dq = sched_queue;
+
+ kfree(dq);
}

static void deadline_exit_queue(struct elevator_queue *e)
{
struct deadline_data *dd = e->elevator_data;

- BUG_ON(!list_empty(&dd->fifo_list[READ]));
- BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
kfree(dd);
}

@@ -356,10 +384,7 @@ deadline_init_queue(struct request_queue *q, struct elevator_queue *eq)
if (!dd)
return NULL;

- INIT_LIST_HEAD(&dd->fifo_list[READ]);
- INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
- dd->sort_list[READ] = RB_ROOT;
- dd->sort_list[WRITE] = RB_ROOT;
+ dd->q = q;
dd->fifo_expire[READ] = read_expire;
dd->fifo_expire[WRITE] = write_expire;
dd->writes_starved = writes_starved;
@@ -446,13 +471,13 @@ static struct elevator_type iosched_deadline = {
.elevator_merge_req_fn = deadline_merged_requests,
.elevator_dispatch_fn = deadline_dispatch_requests,
.elevator_add_req_fn = deadline_add_request,
- .elevator_queue_empty_fn = deadline_queue_empty,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
.elevator_init_fn = deadline_init_queue,
.elevator_exit_fn = deadline_exit_queue,
+ .elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+ .elevator_free_sched_queue_fn = deadline_free_deadline_queue,
},
-
.elevator_attrs = deadline_attrs,
.elevator_name = "deadline",
.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index b2725cd..0b7c5a6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -197,17 +197,54 @@ static struct elevator_type *elevator_get(const char *name)
return e;
}

-static void *elevator_init_queue(struct request_queue *q,
- struct elevator_queue *eq)
+static void *
+elevator_init_data(struct request_queue *q, struct elevator_queue *eq)
{
- return eq->ops->elevator_init_fn(q, eq);
+ void *data = NULL;
+
+ if (eq->ops->elevator_init_fn) {
+ data = eq->ops->elevator_init_fn(q, eq);
+ if (data)
+ return data;
+ else
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* IO scheduler does not instantiate data (noop), it is not an error */
+ return NULL;
+}
+
+static void
+elevator_free_sched_queue(struct elevator_queue *eq, void *sched_queue)
+{
+ /* Not all io schedulers (cfq) store sched_queue */
+ if (!sched_queue)
+ return;
+ eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *
+elevator_alloc_sched_queue(struct request_queue *q, struct elevator_queue *eq)
+{
+ void *sched_queue = NULL;
+
+ if (eq->ops->elevator_alloc_sched_queue_fn) {
+ sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+ GFP_KERNEL);
+ if (!sched_queue)
+ return ERR_PTR(-ENOMEM);
+
+ }
+
+ return sched_queue;
}

static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
- void *data)
+ void *data, void *sched_queue)
{
q->elevator = eq;
eq->elevator_data = data;
+ eq->sched_queue = sched_queue;
}

static char chosen_elevator[16];
@@ -288,7 +325,7 @@ int elevator_init(struct request_queue *q, char *name)
struct elevator_type *e = NULL;
struct elevator_queue *eq;
int ret = 0;
- void *data;
+ void *data = NULL, *sched_queue = NULL;

INIT_LIST_HEAD(&q->queue_head);
q->last_merge = NULL;
@@ -322,13 +359,21 @@ int elevator_init(struct request_queue *q, char *name)
if (!eq)
return -ENOMEM;

- data = elevator_init_queue(q, eq);
- if (!data) {
+ data = elevator_init_data(q, eq);
+
+ if (IS_ERR(data)) {
+ kobject_put(&eq->kobj);
+ return -ENOMEM;
+ }
+
+ sched_queue = elevator_alloc_sched_queue(q, eq);
+
+ if (IS_ERR(sched_queue)) {
kobject_put(&eq->kobj);
return -ENOMEM;
}

- elevator_attach(q, eq, data);
+ elevator_attach(q, eq, data, sched_queue);
return ret;
}
EXPORT_SYMBOL(elevator_init);
@@ -336,6 +381,7 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
+ elevator_free_sched_queue(e, e->sched_queue);
elv_exit_fq_data(e);
if (e->ops->elevator_exit_fn)
e->ops->elevator_exit_fn(e);
@@ -1024,7 +1070,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
{
struct elevator_queue *old_elevator, *e;
- void *data;
+ void *data = NULL, *sched_queue = NULL;

/*
* Allocate new elevator
@@ -1033,10 +1079,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
if (!e)
return 0;

- data = elevator_init_queue(q, e);
- if (!data) {
+ data = elevator_init_data(q, e);
+
+ if (IS_ERR(data)) {
kobject_put(&e->kobj);
- return 0;
+ return -ENOMEM;
+ }
+
+ sched_queue = elevator_alloc_sched_queue(q, e);
+
+ if (IS_ERR(sched_queue)) {
+ kobject_put(&e->kobj);
+ return -ENOMEM;
}

/*
@@ -1053,7 +1107,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
/*
* attach and start new elevator
*/
- elevator_attach(q, e, data);
+ elevator_attach(q, e, data, sched_queue);

spin_unlock_irq(q->queue_lock);

@@ -1168,16 +1222,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
}
EXPORT_SYMBOL(elv_rb_latter_request);

-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
{
- return elv_ioq_sched_queue(req_ioq(rq));
+ /*
+ * io scheduler is not using fair queuing. Return sched_queue
+ * pointer stored in elevator_queue. It will be null if io
+ * scheduler never stored anything there to begin with (cfq)
+ */
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
+ /*
+ * IO scheduler is using fair queuing infrastructure. If io scheduler
+ * has passed a non null rq, retrieve sched_queue pointer from
+ * there. */
+ if (rq)
+ return elv_ioq_sched_queue(req_ioq(rq));
+
+ return NULL;
}
EXPORT_SYMBOL(elv_get_sched_queue);

/* Select an ioscheduler queue to dispatch request from. */
void *elv_select_sched_queue(struct request_queue *q, int force)
{
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
return elv_ioq_sched_queue(elv_select_ioq(q, force));
}
EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+ return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 36fc210..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
#include <linux/module.h>
#include <linux/init.h>

-struct noop_data {
+struct noop_queue {
struct list_head queue;
};

@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,

static int noop_dispatch(struct request_queue *q, int force)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_select_sched_queue(q, force);

- if (!list_empty(&nd->queue)) {
+ if (!nq)
+ return 0;
+
+ if (!list_empty(&nq->queue)) {
struct request *rq;
- rq = list_entry(nd->queue.next, struct request, queuelist);
+ rq = list_entry(nq->queue.next, struct request, queuelist);
list_del_init(&rq->queuelist);
elv_dispatch_sort(q, rq);
return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)

static void noop_add_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);

- list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
- struct noop_data *nd = q->elevator->elevator_data;
-
- return list_empty(&nd->queue);
+ list_add_tail(&rq->queuelist, &nq->queue);
}

static struct request *
noop_former_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);

- if (rq->queuelist.prev == &nd->queue)
+ if (rq->queuelist.prev == &nq->queue)
return NULL;
return list_entry(rq->queuelist.prev, struct request, queuelist);
}
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
static struct request *
noop_latter_request(struct request_queue *q, struct request *rq)
{
- struct noop_data *nd = q->elevator->elevator_data;
+ struct noop_queue *nq = elv_get_sched_queue(q, rq);

- if (rq->queuelist.next == &nd->queue)
+ if (rq->queuelist.next == &nq->queue)
return NULL;
return list_entry(rq->queuelist.next, struct request, queuelist);
}

-static void *noop_init_queue(struct request_queue *q, struct elevator_queue *eq)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+ struct elevator_queue *eq, gfp_t gfp_mask)
{
- struct noop_data *nd;
+ struct noop_queue *nq;

- nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
- if (!nd)
- return NULL;
- INIT_LIST_HEAD(&nd->queue);
- return nd;
+ nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+ if (nq == NULL)
+ goto out;
+
+ INIT_LIST_HEAD(&nq->queue);
+out:
+ return nq;
}

-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
{
- struct noop_data *nd = e->elevator_data;
+ struct noop_queue *nq = sched_queue;

- BUG_ON(!list_empty(&nd->queue));
- kfree(nd);
+ kfree(nq);
}

static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
.elevator_merge_req_fn = noop_merged_requests,
.elevator_dispatch_fn = noop_dispatch,
.elevator_add_req_fn = noop_add_request,
- .elevator_queue_empty_fn = noop_queue_empty,
.elevator_former_req_fn = noop_former_request,
.elevator_latter_req_fn = noop_latter_request,
- .elevator_init_fn = noop_init_queue,
- .elevator_exit_fn = noop_exit_queue,
+ .elevator_alloc_sched_queue_fn = noop_alloc_noop_queue,
+ .elevator_free_sched_queue_fn = noop_free_noop_queue,
},
.elevator_name = "noop",
.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 4414a61..2c6b0c7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,10 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
typedef void *(elevator_init_fn) (struct request_queue *,
struct elevator_queue *);
typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
+ struct elevator_queue *eq, gfp_t);
typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -68,8 +70,9 @@ struct elevator_ops
elevator_exit_fn *elevator_exit_fn;
void (*trim)(struct io_context *);

-#ifdef CONFIG_ELV_FAIR_QUEUING
+ elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;

@@ -109,6 +112,7 @@ struct elevator_queue
{
struct elevator_ops *ops;
void *elevator_data;
+ void *sched_queue;
struct kobject kobj;
struct elevator_type *elevator_type;
struct mutex sysfs_lock;
@@ -255,5 +259,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.6

2009-09-24 19:31:42

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 18/28] io-controller: Prepare elevator layer for single queue schedulers

Elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it and now it is time to do groundwork for
noop, deadline and AS.

noop, deadline and AS don't maintain separate queues for different processes;
there is only a single queue. Effectively, in a hierarchical setup there will
be one queue per cgroup, where requests from all the processes in the cgroup
are queued.

Generally the io scheduler takes care of creating its queues. Because there
is only one queue per group here, we have modified the common layer to take
care of queue creation and some related functionality. This special-casing
keeps the changes to noop, deadline and AS to a minimum.
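
To make the division of labor concrete, here is a condensed, illustrative
sketch of the request-setup path for single-queue schedulers. This is not
the literal patch code: the helper name is made up for illustration, and
locking, the __GFP_WAIT retry loop and the failure paths of the real
elv_set_request_ioq() (in the elevator-fq.c hunk below) are omitted.

static int single_ioq_set_request_sketch(struct request_queue *q,
                                         struct request *rq, gfp_t gfp_mask)
{
        struct elevator_queue *e = q->elevator;
        struct io_group *iog = elv_io_get_io_group(q, 1);
        struct io_queue *ioq = iog->ioq;

        if (!ioq) {
                /* First request from this cgroup: create its single queue. */
                void *sched_q;

                ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
                sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
                                                gfp_mask | __GFP_ZERO, ioq);
                elv_init_ioq(e, ioq, current->pid, 1);
                elv_init_ioq_io_group(ioq, iog);
                elv_init_ioq_sched_queue(e, ioq, sched_q);
                elv_io_group_set_ioq(iog, ioq);
                elv_mark_ioq_sync(ioq);
                elv_get_iog(iog);
        }

        elv_get_ioq(ioq);       /* reference held by the request */
        rq->ioq = ioq;
        return 0;
}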

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/as-iosched.c | 2 +-
block/deadline-iosched.c | 2 +-
block/elevator-fq.c | 186 +++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 36 +++++++++
block/elevator.c | 37 +++++++++-
block/noop-iosched.c | 2 +-
include/linux/elevator.h | 18 ++++-
7 files changed, 275 insertions(+), 8 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index ec6b940..6d2468b 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1338,7 +1338,7 @@ static int as_may_queue(struct request_queue *q, int rw)

/* Called with queue lock held */
static void *as_alloc_as_queue(struct request_queue *q,
- struct elevator_queue *eq, gfp_t gfp_mask)
+ struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
{
struct as_queue *asq;
struct as_data *ad = eq->elevator_data;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5b017da..6e69ea3 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -341,7 +341,7 @@ dispatch_request:
}

static void *deadline_alloc_deadline_queue(struct request_queue *q,
- struct elevator_queue *eq, gfp_t gfp_mask)
+ struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
{
struct deadline_queue *dq;

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index bac45fe..b08a200 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -222,6 +222,9 @@ static inline void set_late_preemption(struct elevator_queue *eq,
{
struct io_group *new_iog;

+ if (elv_iosched_single_ioq(eq))
+ return;
+
if (!active_ioq)
return;

@@ -252,6 +255,9 @@ check_late_preemption(struct elevator_queue *eq, struct io_queue *ioq)
{
struct io_group *iog = ioq_to_io_group(ioq);

+ if (elv_iosched_single_ioq(eq))
+ return;
+
if (!iog->late_preemption)
return;

@@ -1038,7 +1044,10 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq, pid_t pid,
RB_CLEAR_NODE(&ioq->entity.rb_node);
atomic_set(&ioq->ref, 0);
ioq->efqd = eq->efqd;
- ioq->pid = pid;
+ if (elv_iosched_single_ioq(eq))
+ ioq->pid = 0;
+ else
+ ioq->pid = current->pid;

elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
@@ -1072,6 +1081,12 @@ put_io_group_queues(struct elevator_queue *e, struct io_group *iog)

/* Free up async idle queue */
elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+ /* Optimization for io schedulers having single ioq */
+ if (elv_iosched_single_ioq(e))
+ elv_release_ioq(e, &iog->ioq);
+#endif
}

void *elv_io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
@@ -1970,6 +1985,164 @@ int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
return (iog == __iog);
}

+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void
+elv_io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+ /* io group reference. Will be dropped when group is destroyed. */
+ elv_get_ioq(ioq);
+ iog->ioq = ioq;
+}
+
+/*
+ * Find/Create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only single
+ * io queue per cgroup. In this case common layer can just maintain a
+ * pointer in group data structure and keeps track of it.
+ *
+ * For the io schedulers like cfq, which maintain multiple io queues per
+ * cgroup, and decide the io queue of request based on process, this
+ * function is not invoked.
+ */
+int elv_set_request_ioq(struct request_queue *q, struct request *rq,
+ gfp_t gfp_mask)
+{
+ struct elevator_queue *e = q->elevator;
+ unsigned long flags;
+ struct io_queue *ioq = NULL, *new_ioq = NULL;
+ struct io_group *iog;
+ void *sched_q = NULL, *new_sched_q = NULL;
+
+ if (!elv_iosched_fair_queuing_enabled(e))
+ return 0;
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+ spin_lock_irqsave(q->queue_lock, flags);
+
+retry:
+ /* Determine the io group request belongs to */
+ iog = elv_io_get_io_group(q, 1);
+ BUG_ON(!iog);
+
+ /* Get the iosched queue */
+ ioq = iog->ioq;
+ if (!ioq) {
+ /* io queue and sched_queue needs to be allocated */
+ BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+ if (new_ioq) {
+ goto alloc_sched_q;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+ | __GFP_ZERO);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+ if (!ioq)
+ goto queue_fail;
+ }
+
+alloc_sched_q:
+ if (new_sched_q) {
+ ioq = new_ioq;
+ new_ioq = NULL;
+ sched_q = new_sched_q;
+ new_sched_q = NULL;
+ } else if (gfp_mask & __GFP_WAIT) {
+ /*
+ * Inform the allocator of the fact that we will
+ * just repeat this allocation if it fails, to allow
+ * the allocator to do whatever it needs to attempt to
+ * free memory.
+ */
+ spin_unlock_irq(q->queue_lock);
+ /* Call io scheduler to create scheduler queue */
+ new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+ e, gfp_mask | __GFP_NOFAIL
+ | __GFP_ZERO, new_ioq);
+ spin_lock_irq(q->queue_lock);
+ goto retry;
+ } else {
+ sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+ gfp_mask | __GFP_ZERO, ioq);
+ if (!sched_q) {
+ elv_free_ioq(ioq);
+ goto queue_fail;
+ }
+ }
+
+ elv_init_ioq(e, ioq, current->pid, 1);
+ elv_init_ioq_io_group(ioq, iog);
+ elv_init_ioq_sched_queue(e, ioq, sched_q);
+
+ elv_io_group_set_ioq(iog, ioq);
+ elv_mark_ioq_sync(ioq);
+ elv_get_iog(iog);
+ }
+
+ if (new_sched_q)
+ e->ops->elevator_free_sched_queue_fn(q->elevator, new_sched_q);
+
+ if (new_ioq)
+ elv_free_ioq(new_ioq);
+
+ /* Request reference */
+ elv_get_ioq(ioq);
+ rq->ioq = ioq;
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return 0;
+
+queue_fail:
+ WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+ elv_schedule_dispatch(q);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ struct io_group *iog;
+
+ /* Determine the io group and io queue of the bio submitting task */
+ iog = elv_io_get_io_group(q, 0);
+ if (!iog) {
+ /* Maybe the task belongs to a cgroup for which the io group
+ * has not been set up yet. */
+ return NULL;
+ }
+ return iog->ioq;
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
+{
+ struct io_queue *ioq = rq->ioq;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return;
+
+ if (ioq) {
+ rq->ioq = NULL;
+ elv_put_ioq(ioq);
+ }
+}
+
#else /* CONFIG_GROUP_IOSCHED */

static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
@@ -2265,6 +2438,15 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
struct io_entity *entity, *new_entity;
struct io_group *iog = NULL, *new_iog = NULL;

+ /*
+ * Currently only CFQ has preemption logic. Other schedulers don't
+ * have any notion of preemption across classes or preemption with-in
+ * class etc.
+ */
+ if (elv_iosched_single_ioq(eq))
+ return 0;
+
+
active_ioq = elv_active_ioq(eq);

if (!active_ioq)
@@ -2703,6 +2885,7 @@ expire:
ioq = NULL;
goto keep_queue;
}
+
elv_slice_expired(q);
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
@@ -2832,6 +3015,7 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
elv_clear_ioq_slice_new(ioq);
}

+
/*
* If there are no requests waiting in this queue, and
* there are other queues ready to issue requests, AND
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 68c6d16..e60ceed 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -137,6 +137,9 @@ struct io_group {
char path[128];
#endif
int late_preemption;
+
+ /* Single ioq per group, used for noop, deadline, anticipatory */
+ struct io_queue *ioq;
};

struct io_cgroup {
@@ -432,6 +435,11 @@ static inline void elv_get_iog(struct io_group *iog)
atomic_inc(&iog->ref);
}

+extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
+ gfp_t gfp_mask);
+extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
#else /* !GROUP_IOSCHED */

static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
@@ -449,6 +457,20 @@ elv_io_get_io_group(struct request_queue *q, int create)
return q->elevator->efqd->root_group;
}

+static inline int
+elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static inline void
+elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ return NULL;
+}
+
#endif /* GROUP_IOSCHED */

extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
@@ -544,6 +566,20 @@ static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
{
return 1;
}
+static inline int
+elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+{
+ return 0;
+}
+
+static inline void
+elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+ return NULL;
+}
+
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _ELV_SCHED_H */
#endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 0b7c5a6..bc43edd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -228,9 +228,17 @@ elevator_alloc_sched_queue(struct request_queue *q, struct elevator_queue *eq)
{
void *sched_queue = NULL;

+ /*
+ * If fair queuing is enabled, then queue allocation takes place
+ * during set_request() functions when request actually comes
+ * in.
+ */
+ if (elv_iosched_fair_queuing_enabled(eq))
+ return NULL;
+
if (eq->ops->elevator_alloc_sched_queue_fn) {
sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
- GFP_KERNEL);
+ GFP_KERNEL, NULL);
if (!sched_queue)
return ERR_PTR(-ENOMEM);

@@ -861,6 +869,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;

+ /*
+ * Optimization for noop, deadline and AS which maintain only single
+ * ioq per io group
+ */
+ if (elv_iosched_single_ioq(e))
+ return elv_set_request_ioq(q, rq, gfp_mask);
+
if (e->ops->elevator_set_req_fn)
return e->ops->elevator_set_req_fn(q, rq, gfp_mask);

@@ -872,6 +887,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

+ /*
+ * Optimization for noop, deadline and AS which maintain only single
+ * ioq per io group
+ */
+ if (elv_iosched_single_ioq(e)) {
+ elv_reset_request_ioq(q, rq);
+ return;
+ }
+
if (e->ops->elevator_put_req_fn)
e->ops->elevator_put_req_fn(rq);
}
@@ -1256,9 +1280,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);

/*
* Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of the task and retrieve
+ * the ioq pointer from that. This is used only by single-queue ioschedulers
+ * for retrieving the queue associated with the group to decide whether the
+ * new bio can do a front merge or not.
*/
void *elv_get_sched_queue_current(struct request_queue *q)
{
- return q->elevator->sched_queue;
+ /* Fair queuing is not enabled. There is only one queue. */
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return q->elevator->sched_queue;
+
+ return elv_ioq_sched_queue(elv_lookup_ioq_current(q));
}
EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..731dbf2 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -62,7 +62,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
}

static void *noop_alloc_noop_queue(struct request_queue *q,
- struct elevator_queue *eq, gfp_t gfp_mask)
+ struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
{
struct noop_queue *nq;

diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 2c6b0c7..77c1fa5 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,9 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
typedef void *(elevator_init_fn) (struct request_queue *,
struct elevator_queue *);
typedef void (elevator_exit_fn) (struct elevator_queue *);
-typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
- struct elevator_queue *eq, gfp_t);
typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q,
+ struct elevator_queue *eq, gfp_t, struct io_queue *ioq);
#ifdef CONFIG_ELV_FAIR_QUEUING
typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
@@ -245,17 +245,31 @@ enum {
/* iosched wants to use fair queuing logic of elevator layer */
#define ELV_IOSCHED_NEED_FQ 1

+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ 2
+
static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
}

+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+ return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
#else /* ELV_IOSCHED_FAIR_QUEUING */

static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
return 0;
}
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+ return 0;
+}
+
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
--
1.6.0.6

2009-09-24 19:27:23

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 19/28] io-controller: Avoid expiring ioq for single ioq scheduler if only root group

o For the io schedulers noop, deadline and AS, we maintain only one ioq per
group. If people are using only the flat mode, where just the root group is
present, there will be only one ioq. In that case we can avoid expiring the
queue every 100ms (dependent on slice_sync). This patch introduces that
optimization.

o If an ioq has not been expired for a long time and somebody suddenly
decides to create a group and launch a job there, the old ioq will be
expired with a very high value of slice used and will get charged a very
high disk time. Fix it by marking the queue as "charge_one_slice" and
charging the queue for only a single time slice, not for the whole
duration the queue was running.

o Introduce the notion of "real_served" and "virtual_served". Real time is the
actual time the queue used and is visible through the cgroup interface.
Virtual time is what we actually want to charge the queue for. If a queue
has not been expired for a long time, its real_time value will probably be
high, but we charge the queue for only one slice length; a short sketch of
this charging rule follows.
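
As a rough illustration of that rule (condensed from the elv_ioq_served()
hunk below; the numbers in the comment are only an example), the virtual
charge is simply capped at one allocated slice when the queue carries the
charge_one_slice mark:

        /*
         * Example: with a 100ms slice, a queue that ran unexpired for 2
         * seconds before a second group appeared shows real_served = 2s in
         * the cgroup stats but is charged only 100ms of virtual time.
         */
        if (elv_ioq_charge_one_slice(ioq) && queue_charge > allocated_slice)
                queue_charge = group_charge = allocated_slice;

        elv_clear_ioq_charge_one_slice(ioq);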

Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
block/elevator-fq.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++
block/elevator-fq.h | 3 ++
2 files changed, 78 insertions(+), 0 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b08a200..04419cf 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -902,6 +902,19 @@ static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
queue_charge = group_charge = served;

/*
+ * For single ioq schedulers we don't expire the queue if there are
+ * no other competing groups. It might happen that once a queue has
+ * not been expired for a long time, suddenly a new group is created
+ * and IO comes in that new group. In that case, we don't want to
+ * charge the old queue for the whole period it was not expired.
+ */
+
+ if (elv_ioq_charge_one_slice(ioq) && queue_charge > allocated_slice)
+ queue_charge = group_charge = allocated_slice;
+
+ elv_clear_ioq_charge_one_slice(ioq);
+
+ /*
* We don't want to charge more than allocated slice otherwise this
* queue can miss one dispatch round doubling max latencies. On the
* other hand we don't want to charge less than allocated slice as
@@ -2143,6 +2156,37 @@ void elv_reset_request_ioq(struct request_queue *q, struct request *rq)
}
}

+static inline int is_only_root_group(void)
+{
+ if (list_empty(&io_root_cgroup.css.cgroup->children))
+ return 1;
+
+ return 0;
+}
+
+/*
+ * One can do some optimizations for single ioq scheduler, when one does
+ * not have to expire the queue after every time slice is used. This avoids
+ * some unnecessary overhead, especially in AS where we wait for requests to
+ * finish from last queue before new queue is scheduled in.
+ */
+static inline int single_ioq_no_timed_expiry(struct request_queue *q)
+{
+ struct elv_fq_data *efqd = q->elevator->efqd;
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+ if (!elv_iosched_single_ioq(q->elevator))
+ return 0;
+
+ if (!is_only_root_group())
+ return 0;
+
+ if (efqd->busy_queues == 1 && ioq == efqd->root_group->ioq)
+ return 1;
+
+ return 0;
+}
+
#else /* CONFIG_GROUP_IOSCHED */

static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
@@ -2188,6 +2232,17 @@ int elv_iog_should_idle(struct io_queue *ioq) { return 0; }
EXPORT_SYMBOL(elv_iog_should_idle);
static int elv_ioq_should_wait_busy(struct io_queue *ioq) { return 0; }

+static inline int is_only_root_group(void)
+{
+ return 1;
+}
+
+/* Never expire the single ioq in flat mode */
+static inline int single_ioq_no_timed_expiry(struct request_queue *q)
+{
+ return 1;
+};
+
#endif /* CONFIG_GROUP_IOSCHED */

/*
@@ -2794,6 +2849,16 @@ void *elv_select_ioq(struct request_queue *q, int force)
goto expire;
}

+ /*
+ * If there is only root group present, don't expire the queue for
+ * single queue ioschedulers (noop, deadline, AS).
+ */
+
+ if (single_ioq_no_timed_expiry(q)) {
+ elv_mark_ioq_charge_one_slice(ioq);
+ goto keep_queue;
+ }
+
/* We are waiting for this group to become busy before it expires.*/
if (elv_iog_wait_busy(iog)) {
ioq = NULL;
@@ -3015,6 +3080,16 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
elv_clear_ioq_slice_new(ioq);
}

+ /*
+ * If there is only root group present, don't expire the queue
+ * for single queue ioschedulers (noop, deadline, AS). It is
+ * unnecessary overhead.
+ */
+ if (single_ioq_no_timed_expiry(q)) {
+ elv_mark_ioq_charge_one_slice(ioq);
+ elv_log_ioq(efqd, ioq, "single ioq no timed expiry");
+ goto done;
+ }

/*
* If there are no requests waiting in this queue, and
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index e60ceed..4114543 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -242,6 +242,8 @@ enum elv_queue_state_flags {
ELV_QUEUE_FLAG_slice_new, /* no requests dispatched in slice */
ELV_QUEUE_FLAG_sync, /* synchronous queue */
ELV_QUEUE_FLAG_must_expire, /* expire queue even slice is left */
+ ELV_QUEUE_FLAG_charge_one_slice, /* Charge the queue for only one
+ * time slice length */
};

#define ELV_IO_QUEUE_FLAG_FNS(name) \
@@ -265,6 +267,7 @@ ELV_IO_QUEUE_FLAG_FNS(idle_window)
ELV_IO_QUEUE_FLAG_FNS(slice_new)
ELV_IO_QUEUE_FLAG_FNS(sync)
ELV_IO_QUEUE_FLAG_FNS(must_expire)
+ELV_IO_QUEUE_FLAG_FNS(charge_one_slice)

#ifdef CONFIG_GROUP_IOSCHED

--
1.6.0.6

2009-09-24 19:32:18

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 20/28] io-controller: noop changes for hierarchical fair queuing

This patch changes noop to use queue scheduling code from the elevator layer.
One can go back to old noop by deselecting CONFIG_IOSCHED_NOOP_HIER.

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/Kconfig.iosched | 11 +++++++++++
block/noop-iosched.c | 14 ++++++++++++++
2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a7d0bf8..28cd500 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
that do their own scheduling and require only minimal assistance from
the kernel.

+config IOSCHED_NOOP_HIER
+ bool "Noop Hierarchical Scheduling support"
+ depends on IOSCHED_NOOP && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in noop. In this mode noop keeps
+ one IO queue per cgroup instead of a global queue. Elevator
+ fair queuing logic ensures fairness among various queues.
+
config IOSCHED_AS
tristate "Anticipatory I/O scheduler"
default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 731dbf2..4ba496f 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -6,6 +6,7 @@
#include <linux/bio.h>
#include <linux/module.h>
#include <linux/init.h>
+#include "elevator-fq.h"

struct noop_queue {
struct list_head queue;
@@ -82,6 +83,15 @@ static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
kfree(nq);
}

+#ifdef CONFIG_IOSCHED_NOOP_HIER
+static struct elv_fs_entry noop_attrs[] = {
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_sync),
+ ELV_ATTR(group_idle),
+ __ATTR_NULL
+};
+#endif
+
static struct elevator_type elevator_noop = {
.ops = {
.elevator_merge_req_fn = noop_merged_requests,
@@ -92,6 +102,10 @@ static struct elevator_type elevator_noop = {
.elevator_alloc_sched_queue_fn = noop_alloc_noop_queue,
.elevator_free_sched_queue_fn = noop_free_noop_queue,
},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+ .elevator_attrs = noop_attrs,
+#endif
.elevator_name = "noop",
.elevator_owner = THIS_MODULE,
};
--
1.6.0.6

2009-09-24 19:26:35

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 21/28] io-controller: deadline changes for hierarchical fair queuing

This patch changes deadline to use queue scheduling code from the elevator
layer. One can go back to old deadline by deselecting
CONFIG_IOSCHED_DEADLINE_HIER.

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/Kconfig.iosched | 11 +++++++++++
block/deadline-iosched.c | 9 +++++++++
2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 28cd500..cc87c87 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
a disk at any one time, its behaviour is almost identical to the
anticipatory I/O scheduler and so is a good choice.

+config IOSCHED_DEADLINE_HIER
+ bool "Deadline Hierarchical Scheduling support"
+ depends on IOSCHED_DEADLINE && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in deadline. In this mode deadline keeps
+ one IO queue per cgroup instead of a global queue. Elevator
+ fair queuing logic ensures fairness among various queues.
+
config IOSCHED_CFQ
tristate "CFQ I/O scheduler"
select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 6e69ea3..e5bc823 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -13,6 +13,7 @@
#include <linux/init.h>
#include <linux/compiler.h>
#include <linux/rbtree.h>
+#include "elevator-fq.h"

/*
* See Documentation/block/deadline-iosched.txt
@@ -461,6 +462,11 @@ static struct elv_fs_entry deadline_attrs[] = {
DD_ATTR(writes_starved),
DD_ATTR(front_merges),
DD_ATTR(fifo_batch),
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_sync),
+ ELV_ATTR(group_idle),
+#endif
__ATTR_NULL
};

@@ -478,6 +484,9 @@ static struct elevator_type iosched_deadline = {
.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
.elevator_attrs = deadline_attrs,
.elevator_name = "deadline",
.elevator_owner = THIS_MODULE,
--
1.6.0.6

2009-09-24 19:26:49

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 22/28] io-controller: anticipatory changes for hierarchical fair queuing

This patch changes the anticipatory scheduler to use queue scheduling code
from the elevator layer. One can go back to old AS by deselecting
CONFIG_IOSCHED_AS_HIER. Even with CONFIG_IOSCHED_AS_HIER=y, without any
other cgroup created, AS behavior should remain the same as before.

o AS is a single-queue ioscheduler, which means there is one AS queue per group.

o Common layer code selects the queue to dispatch from based on fairness, and
then AS code selects the request within the group.

o AS runs read and write batches within a group. So the common layer runs
timed group queues and, within a group's time, AS runs timed batches of
reads and writes.

o Note: Previously the AS write batch length was adjusted dynamically whenever
a W->R batch direction switch took place and the first request from the
read batch completed.

Now the write batch update takes place when the last request from the write
batch has finished during the W->R transition.

o AS runs its own anticipation logic to anticipate on reads. The common layer
also does anticipation on the group if the think time of the group is within
slice_idle.

o Introduced few debugging messages in AS.

o There are cases where, with AS, excessive queue expiration will be triggered
by the elevator fair queuing layer, for a few reasons:
- AS does not anticipate on a queue if there are no competing requests.
So if only a single reader is present in a group, anticipation does
not get turned on.

- The elevator layer does not know that AS is anticipating, hence it
initiates queue expiry in select_ioq() thinking the queue is empty.

- The elevator layer tries to aggressively expire the last empty queue.
This can lead to a lot of queue expiries.

o This patch now starts ANTIC_WAIT_NEXT anticipation if the last request in
the queue has completed and the associated io context is eligible to
anticipate. AS also lets the elevator layer know that it is anticipating
(elv_ioq_wait_request()). This solves the above mentioned issues; a
condensed sketch of the resulting expiry decision follows.
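
For reference, the expiry handshake described above boils down to roughly
the following decision. This is only a condensed sketch of as_expire_ioq();
the full version in the patch below also handles logging, switch_queue
bookkeeping and resetting antic_status.

        if (force) {
                /* Forced switch: stop anticipating and give up the queue. */
                as_antic_stop(ad);
                as_save_batch_context(ad, asq);
                return 1;
        }
        if (ad->changed_batch || ad->nr_dispatched)
                return 0;       /* wait for the previous batch to drain */
        if (ad->antic_status == ANTIC_WAIT_NEXT && !slice_expired)
                return 0;       /* anticipation running, slice not over yet */

        as_save_batch_context(ad, asq); /* safe to switch queues now */
        return 1;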

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/Kconfig.iosched | 12 ++
block/as-iosched.c | 376 +++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.c | 107 ++++++++++++--
include/linux/elevator.h | 2 +
4 files changed, 477 insertions(+), 20 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index cc87c87..8ab08da 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
deadline I/O scheduler, it can also be slower in some cases
especially some database loads.

+config IOSCHED_AS_HIER
+ bool "Anticipatory Hierarchical Scheduling support"
+ depends on IOSCHED_AS && CGROUPS
+ select ELV_FAIR_QUEUING
+ select GROUP_IOSCHED
+ default n
+ ---help---
+ Enable hierarchical scheduling in anticipatory. In this mode
+ anticipatory keeps one IO queue per cgroup instead of a global
+ queue. Elevator fair queuing logic ensures fairness among various
+ queues.
+
config IOSCHED_DEADLINE
tristate "Deadline I/O scheduler"
default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 6d2468b..fed579f 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -16,6 +16,8 @@
#include <linux/compiler.h>
#include <linux/rbtree.h>
#include <linux/interrupt.h>
+#include <linux/blktrace_api.h>
+#include "elevator-fq.h"

/*
* See Documentation/block/as-iosched.txt
@@ -77,6 +79,7 @@ enum anticipation_status {
};

struct as_queue {
+ struct io_queue *ioq;
/*
* requests (as_rq s) are present on both sort_list and fifo_list
*/
@@ -84,10 +87,24 @@ struct as_queue {
struct list_head fifo_list[2];

struct request *next_rq[2]; /* next in sort order */
+
+ /*
+ * If an as_queue is switched while a batch is running, then we
+ * store the time left before current batch will expire
+ */
+ long current_batch_time_left;
+
+ /*
+ * batch data dir when queue was scheduled out. This will be used
+ * to setup ad->batch_data_dir when queue is scheduled in.
+ */
+ int saved_batch_data_dir;
+
unsigned long last_check_fifo[2];
int write_batch_count; /* max # of reqs in a write batch */
int current_write_count; /* how many requests left this batch */
int write_batch_idled; /* has the write batch gone idle? */
+ int nr_queued[2];
};

struct as_data {
@@ -123,6 +140,9 @@ struct as_data {
unsigned long fifo_expire[2];
unsigned long batch_expire[2];
unsigned long antic_expire;
+
+ /* elevator requested a queue switch. */
+ int switch_queue;
};

/*
@@ -144,12 +164,259 @@ enum arq_state {
#define RQ_STATE(rq) ((enum arq_state)(rq)->elevator_private2)
#define RQ_SET_STATE(rq, state) ((rq)->elevator_private2 = (void *) state)

+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define as_log_asq(ad, asq, fmt, args...) \
+{ \
+ blk_add_trace_msg((ad)->q, "as %s " fmt, \
+ ioq_to_io_group((asq)->ioq)->path, ##args); \
+}
+#else
+#define as_log_asq(ad, asq, fmt, args...) \
+ blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+#endif
+
+#define as_log(ad, fmt, args...) \
+ blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+
static DEFINE_PER_CPU(unsigned long, ioc_count);
static struct completion *ioc_gone;
static DEFINE_SPINLOCK(ioc_gone_lock);

static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static int as_can_anticipate(struct as_data *ad, struct request *rq);
+static void as_antic_waitnext(struct as_data *ad);
+
+static inline void as_mark_active_asq_wait_request(struct as_data *ad)
+{
+ struct as_queue *asq = elv_active_sched_queue(ad->q->elevator);
+
+ elv_mark_ioq_wait_request(asq->ioq);
+}
+
+static inline void as_clear_active_asq_wait_request(struct as_data *ad)
+{
+ struct as_queue *asq = elv_active_sched_queue(ad->q->elevator);
+
+ if (asq)
+ elv_clear_ioq_wait_request(asq->ioq);
+}
+
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+ /* Save batch data dir */
+ asq->saved_batch_data_dir = ad->batch_data_dir;
+
+ if (ad->changed_batch) {
+ /*
+ * In case of force expire, we come here. Batch changeover
+ * has been signalled but we are waiting for all the
+ * request to finish from previous batch and then start
+ * the new batch. Can't wait now. Mark that full batch time
+ * needs to be allocated when this queue is scheduled again.
+ */
+ asq->current_batch_time_left =
+ ad->batch_expire[ad->batch_data_dir];
+ ad->changed_batch = 0;
+ goto out;
+ }
+
+ if (ad->new_batch) {
+ /*
+ * We should come here only when new_batch has been set
+ * but no read request has been issued or if it is a forced
+ * expiry.
+ *
+ * In both the cases, new batch has not started yet so
+ * allocate full batch length for next scheduling opportunity.
+ * We don't do write batch size adjustment in hierarchical
+ * AS so that should not be an issue.
+ */
+ asq->current_batch_time_left =
+ ad->batch_expire[ad->batch_data_dir];
+ ad->new_batch = 0;
+ goto out;
+ }
+
+ /* Save how much time is left before current batch expires */
+ if (as_batch_expired(ad, asq))
+ asq->current_batch_time_left = 0;
+ else {
+ asq->current_batch_time_left = ad->current_batch_expires
+ - jiffies;
+ BUG_ON((asq->current_batch_time_left) < 0);
+ }
+
+ if (ad->io_context) {
+ put_io_context(ad->io_context);
+ ad->io_context = NULL;
+ }
+
+out:
+ as_log_asq(ad, asq, "save batch: dir=%c time_left=%d changed_batch=%d"
+ " new_batch=%d, antic_status=%d",
+ ad->batch_data_dir ? 'R' : 'W',
+ asq->current_batch_time_left,
+ ad->changed_batch, ad->new_batch, ad->antic_status);
+ return;
+}
+
+/*
+ * FIXME: In original AS, the read batch's time accounting started only after
+ * the first request had completed (if the last batch was a write batch). But
+ * here we might be rescheduling a read batch right away, irrespective of the
+ * disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+ /* Adjust the batch expire time */
+ if (asq->current_batch_time_left)
+ ad->current_batch_expires = jiffies +
+ asq->current_batch_time_left;
+ /* restore asq batch_data_dir info */
+ ad->batch_data_dir = asq->saved_batch_data_dir;
+ as_log_asq(ad, asq, "restore batch: dir=%c time=%d reads_q=%d"
+ " writes_q=%d ad->antic_status=%d",
+ ad->batch_data_dir ? 'R' : 'W',
+ asq->current_batch_time_left,
+ asq->nr_queued[1], asq->nr_queued[0],
+ ad->antic_status);
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue,
+ int coop)
+{
+ struct as_queue *asq = sched_queue;
+ struct as_data *ad = q->elevator->elevator_data;
+
+ as_restore_batch_context(ad, asq);
+}
+
+/*
+ * AS does not anticipate on a context if there is no other request pending.
+ * So if only a single sequential reader is running, AS will not turn on
+ * anticipation. This function turns on anticipation if an io context has
+ * think time within limits and there are no other requests to dispatch.
+ *
+ * With group scheduling, a queue is expired if it is empty, has no request
+ * dispatched and we are not idling. With a single reader we will then see
+ * a queue expiration after every request completion. Hence turn on
+ * anticipation if the io context should anticipate and there are no other
+ * requests queued in the queue.
+ */
+static inline void
+as_hier_check_start_waitnext(struct request_queue *q, struct as_queue *asq)
+{
+ struct as_data *ad = q->elevator->elevator_data;
+
+ if (!ad->nr_dispatched && !asq->nr_queued[1] && !asq->nr_queued[0] &&
+ as_can_anticipate(ad, NULL)) {
+ as_antic_waitnext(ad);
+ }
+}
+
+/*
+ * This is a notification from the common layer that it wishes to expire
+ * this io queue. AS decides whether the queue can be expired and, if so,
+ * also saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+ int slice_expired, int force)
+{
+ struct as_data *ad = q->elevator->elevator_data;
+ int status = ad->antic_status;
+ struct as_queue *asq = sched_queue;
+
+ as_log_asq(ad, asq, "as_expire_ioq slice_expired=%d, force=%d",
+ slice_expired, force);
+
+ /* Forced expiry. We don't have a choice */
+ if (force) {
+ as_antic_stop(ad);
+ /*
+ * antic_stop() sets antic_status to FINISHED which signifies
+ * that either we timed out or we found a close request but
+ * that's not the case here. Start from scratch.
+ */
+ ad->antic_status = ANTIC_OFF;
+ as_save_batch_context(ad, asq);
+ ad->switch_queue = 0;
+ return 1;
+ }
+
+ /*
+ * We are waiting for requests from the last batch to
+ * finish. Don't expire the queue now.
+ */
+ if (ad->changed_batch)
+ goto keep_queue;
+
+ /*
+ * Wait for all requests from the existing batch to finish before we
+ * switch the queue. The new queue might change the batch direction,
+ * and this keeps us consistent with the AS philosophy of not
+ * dispatching new requests to the underlying drive until requests
+ * from the previous batch have completed.
+ */
+ if (ad->nr_dispatched)
+ goto keep_queue;
+
+ /*
+ * If AS anticipation is ON, wait for it to finish if queue slice
+ * has not expired.
+ */
+ BUG_ON(status == ANTIC_WAIT_REQ);
+
+ if (status == ANTIC_WAIT_NEXT) {
+ if (!slice_expired)
+ goto keep_queue;
+ /* Slice expired. Stop anticipating. */
+ as_antic_stop(ad);
+ ad->antic_status = ANTIC_OFF;
+ }
+
+ /* We are good to expire the queue. Save batch context */
+ as_save_batch_context(ad, asq);
+ ad->switch_queue = 0;
+ return 1;
+
+keep_queue:
+ /* Mark that elevator requested for queue switch whenever possible */
+ ad->switch_queue = 1;
+ return 0;
+}
+
+static inline void as_check_expire_active_as_queue(struct request_queue *q)
+{
+ struct as_data *ad = q->elevator->elevator_data;
+ struct as_queue *asq = elv_active_sched_queue(q->elevator);
+
+ /*
+ * We anticipated on the queue and the timer fired. If the queue is
+ * empty, expire it. This makes sure an idle queue does not remain
+ * active for a very long time, as all the idle time would later be
+ * added to the queue's disk usage.
+ */
+ if (asq) {
+ if (!ad->nr_dispatched && !asq->nr_queued[1] &&
+ !asq->nr_queued[0]) {
+ ad->switch_queue = 0;
+ elv_ioq_slice_expired(q, asq->ioq);
+ }
+ }
+}
+
+#else /* CONFIG_IOSCHED_AS_HIER */
+static inline void as_mark_active_asq_wait_request(struct as_data *ad) {}
+static inline void as_clear_active_asq_wait_request(struct as_data *ad) {}
+static inline void
+as_hier_check_start_waitnext(struct request_queue *q, struct as_queue *asq) {}
+static inline void as_check_expire_active_as_queue(struct request_queue *q) {}
+#endif

/*
* IO Context helper functions
@@ -429,6 +696,8 @@ static void as_antic_waitnext(struct as_data *ad)
mod_timer(&ad->antic_timer, timeout);

ad->antic_status = ANTIC_WAIT_NEXT;
+ as_mark_active_asq_wait_request(ad);
+ as_log(ad, "antic_waitnext set");
}

/*
@@ -442,8 +711,10 @@ static void as_antic_waitreq(struct as_data *ad)
if (ad->antic_status == ANTIC_OFF) {
if (!ad->io_context || ad->ioc_finished)
as_antic_waitnext(ad);
- else
+ else {
ad->antic_status = ANTIC_WAIT_REQ;
+ as_log(ad, "antic_waitreq set");
+ }
}
}

@@ -455,9 +726,12 @@ static void as_antic_stop(struct as_data *ad)
{
int status = ad->antic_status;

+ as_log(ad, "as_antic_stop antic_status=%d", ad->antic_status);
+
if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
if (status == ANTIC_WAIT_NEXT)
del_timer(&ad->antic_timer);
+ as_clear_active_asq_wait_request(ad);
ad->antic_status = ANTIC_FINISHED;
/* see as_work_handler */
kblockd_schedule_work(ad->q, &ad->antic_work);
@@ -474,6 +748,7 @@ static void as_antic_timeout(unsigned long data)
unsigned long flags;

spin_lock_irqsave(q->queue_lock, flags);
+ as_log(ad, "as_antic_timeout. antic_status=%d", ad->antic_status);
if (ad->antic_status == ANTIC_WAIT_REQ
|| ad->antic_status == ANTIC_WAIT_NEXT) {
struct as_io_context *aic;
@@ -481,6 +756,8 @@ static void as_antic_timeout(unsigned long data)
aic = ad->io_context->aic;

ad->antic_status = ANTIC_FINISHED;
+ as_clear_active_asq_wait_request(ad);
+ as_check_expire_active_as_queue(q);
kblockd_schedule_work(q, &ad->antic_work);

if (aic->ttime_samples == 0) {
@@ -652,6 +929,21 @@ static int as_can_break_anticipation(struct as_data *ad, struct request *rq)
struct io_context *ioc;
struct as_io_context *aic;

+#ifdef CONFIG_IOSCHED_AS_HIER
+ /*
+ * If the active asq and rq's asq are not the same, then one cannot
+ * break the anticipation. This primarily becomes useful when a
+ * request is added to a queue which is not being served currently.
+ */
+ if (rq) {
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+ struct as_queue *curr_asq =
+ elv_active_sched_queue(ad->q->elevator);
+
+ if (asq != curr_asq)
+ return 0;
+ }
+#endif
ioc = ad->io_context;
BUG_ON(!ioc);
spin_lock(&ioc->lock);
@@ -810,16 +1102,21 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
/*
* Gathers timings and resizes the write batch automatically
*/
-static void update_write_batch(struct as_data *ad)
+static void update_write_batch(struct as_data *ad, struct request *rq)
{
unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
long write_time;
- struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
+ struct as_queue *asq = elv_get_sched_queue(ad->q, rq);

write_time = (jiffies - ad->current_batch_expires) + batch;
if (write_time < 0)
write_time = 0;

+ as_log_asq(ad, asq, "upd write: write_time=%d batch=%d"
+ " write_batch_idled=%d current_write_count=%d",
+ write_time, batch, asq->write_batch_idled,
+ asq->current_write_count);
+
if (write_time > batch && !asq->write_batch_idled) {
if (write_time > batch * 3)
asq->write_batch_count /= 2;
@@ -834,6 +1131,8 @@ static void update_write_batch(struct as_data *ad)

if (asq->write_batch_count < 1)
asq->write_batch_count = 1;
+
+ as_log_asq(ad, asq, "upd write count=%d", asq->write_batch_count);
}

/*
@@ -843,6 +1142,7 @@ static void update_write_batch(struct as_data *ad)
static void as_completed_request(struct request_queue *q, struct request *rq)
{
struct as_data *ad = q->elevator->elevator_data;
+ struct as_queue *asq = elv_get_sched_queue(q, rq);

WARN_ON(!list_empty(&rq->queuelist));

@@ -851,7 +1151,24 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
goto out;
}

+ as_log_asq(ad, asq, "complete: reads_q=%d writes_q=%d changed_batch=%d"
+ " new_batch=%d switch_queue=%d, dir=%c",
+ asq->nr_queued[1], asq->nr_queued[0], ad->changed_batch,
+ ad->new_batch, ad->switch_queue,
+ ad->batch_data_dir ? 'R' : 'W');
+
if (ad->changed_batch && ad->nr_dispatched == 1) {
+ /*
+ * If the write batch is finishing, adjust the write batch
+ * length.
+ *
+ * Note, the write batch length is calculated upon completion
+ * of the last write request, not upon completion of the first
+ * read request of the next batch.
+ */
+ if (ad->batch_data_dir == BLK_RW_SYNC)
+ update_write_batch(ad, rq);
+
ad->current_batch_expires = jiffies +
ad->batch_expire[ad->batch_data_dir];
kblockd_schedule_work(q, &ad->antic_work);
@@ -869,7 +1186,6 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
* and writeback caches
*/
if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
- update_write_batch(ad);
ad->current_batch_expires = jiffies +
ad->batch_expire[BLK_RW_SYNC];
ad->new_batch = 0;
@@ -884,10 +1200,18 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
* the next one
*/
as_antic_waitnext(ad);
- }
+ } else
+ as_hier_check_start_waitnext(q, asq);
}

as_put_io_context(rq);
+
+ /*
+ * If the elevator requested a queue switch, kick the queue in
+ * the hope that this is the right time to switch.
+ */
+ if (ad->switch_queue)
+ kblockd_schedule_work(q, &ad->antic_work);
out:
RQ_SET_STATE(rq, AS_RQ_POSTSCHED);
}
@@ -908,6 +1232,9 @@ static void as_remove_queued_request(struct request_queue *q,

WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);

+ BUG_ON(asq->nr_queued[data_dir] <= 0);
+ asq->nr_queued[data_dir]--;
+
ioc = RQ_IOC(rq);
if (ioc && ioc->aic) {
BUG_ON(!atomic_read(&ioc->aic->nr_queued));
@@ -1019,6 +1346,8 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
if (RQ_IOC(rq) && RQ_IOC(rq)->aic)
atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
ad->nr_dispatched++;
+ as_log_asq(ad, asq, "dispatch req dir=%c nr_dispatched = %d",
+ data_dir ? 'R' : 'W', ad->nr_dispatched);
}

/*
@@ -1066,6 +1395,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
}
asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;

+ as_log_asq(ad, asq, "forced dispatch");
return dispatched;
}

@@ -1078,8 +1408,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
if (!(reads || writes)
|| ad->antic_status == ANTIC_WAIT_REQ
|| ad->antic_status == ANTIC_WAIT_NEXT
- || ad->changed_batch)
+ || ad->changed_batch) {
+ as_log_asq(ad, asq, "no dispatch. read_q=%d, writes_q=%d"
+ " ad->antic_status=%d, changed_batch=%d,"
+ " switch_queue=%d new_batch=%d", asq->nr_queued[1],
+ asq->nr_queued[0], ad->antic_status, ad->changed_batch,
+ ad->switch_queue, ad->new_batch);
return 0;
+ }

if (!(reads && writes && as_batch_expired(ad, asq))) {
/*
@@ -1092,6 +1428,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
goto fifo_expired;

if (as_can_anticipate(ad, rq)) {
+ as_log_asq(ad, asq, "can_anticipate = 1");
as_antic_waitreq(ad);
return 0;
}
@@ -1111,6 +1448,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
* data direction (read / write)
*/

+ as_log_asq(ad, asq, "select a fresh batch and request");
+
if (reads) {
BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));

@@ -1125,6 +1464,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
ad->changed_batch = 1;
}
ad->batch_data_dir = BLK_RW_SYNC;
+ as_log_asq(ad, asq, "new batch dir is sync");
rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
asq->last_check_fifo[ad->batch_data_dir] = jiffies;
goto dispatch_request;
@@ -1149,6 +1489,7 @@ dispatch_writes:
ad->new_batch = 0;
}
ad->batch_data_dir = BLK_RW_ASYNC;
+ as_log_asq(ad, asq, "new batch dir is async");
asq->current_write_count = asq->write_batch_count;
asq->write_batch_idled = 0;
rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
@@ -1184,6 +1525,9 @@ fifo_expired:
ad->changed_batch = 0;
}

+ if (ad->switch_queue)
+ return 0;
+
/*
* rq is the selected appropriate request.
*/
@@ -1207,6 +1551,11 @@ static void as_add_request(struct request_queue *q, struct request *rq)

rq->elevator_private = as_get_io_context(q->node);

+ asq->nr_queued[data_dir]++;
+ as_log_asq(ad, asq, "add a %c request read_q=%d write_q=%d",
+ data_dir ? 'R' : 'W', asq->nr_queued[1],
+ asq->nr_queued[0]);
+
if (RQ_IOC(rq)) {
as_update_iohist(ad, RQ_IOC(rq)->aic, rq);
atomic_inc(&RQ_IOC(rq)->aic->nr_queued);
@@ -1358,6 +1707,7 @@ static void *as_alloc_as_queue(struct request_queue *q,

if (asq->write_batch_count < 2)
asq->write_batch_count = 2;
+ asq->ioq = ioq;
out:
return asq;
}
@@ -1408,6 +1758,7 @@ static void *as_init_queue(struct request_queue *q, struct elevator_queue *eq)
ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;

ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
+ ad->switch_queue = 0;

return ad;
}
@@ -1493,6 +1844,11 @@ static struct elv_fs_entry as_attrs[] = {
AS_ATTR(antic_expire),
AS_ATTR(read_batch_expire),
AS_ATTR(write_batch_expire),
+#ifdef CONFIG_IOSCHED_AS_HIER
+ ELV_ATTR(fairness),
+ ELV_ATTR(slice_sync),
+ ELV_ATTR(group_idle),
+#endif
__ATTR_NULL
};

@@ -1514,8 +1870,14 @@ static struct elevator_type iosched_as = {
.trim = as_trim,
.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+ .elevator_expire_ioq_fn = as_expire_ioq,
+ .elevator_active_ioq_set_fn = as_active_ioq_set,
},
-
+ .elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#else
+ },
+#endif
.elevator_attrs = as_attrs,
.elevator_name = "anticipatory",
.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 04419cf..149a147 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2096,6 +2096,21 @@ alloc_sched_q:
elv_init_ioq_io_group(ioq, iog);
elv_init_ioq_sched_queue(e, ioq, sched_q);

+ /*
+ * For AS, also mark the group queue idle_window. This makes
+ * sure that select_ioq() will not try to expire an AS queue
+ * which has dispatched requests but is otherwise empty. This
+ * gives the asq a chance to anticipate after request
+ * completion; otherwise select_ioq() will mark it must_expire
+ * and the asq will soon be expired.
+ *
+ * Not doing it for noop and deadline yet, as they don't have
+ * any anticipation logic and this would slow down queue
+ * switching on NCQ-capable hardware.
+ */
+ if (!strcmp(e->elevator_type->elevator_name, "anticipatory"))
+ elv_mark_ioq_idle_window(ioq);
+
elv_io_group_set_ioq(iog, ioq);
elv_mark_ioq_sync(ioq);
elv_get_iog(iog);
@@ -2387,6 +2402,46 @@ static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq)
}

/*
+ * Call into the iosched to tell it that the elevator wants to expire the
+ * queue. This gives an iosched like AS the chance to say no (if it is in
+ * the middle of a batch changeover or it is anticipating). It also allows
+ * the iosched to do some housekeeping.
+ *
+ * If force = 1, this is a forced dispatch and the iosched must clean up
+ * its state. This is useful when the elevator wants to drain the iosched
+ * and expire the current active queue.
+ * If slice_expired = 1, the ioq slice has expired and the elevator fair
+ * queuing logic wants to switch the queue. The iosched should allow the
+ * switch unless really necessary; currently AS can deny it if it is in
+ * the middle of a batch switch.
+ *
+ * If slice_expired = 0, time is still left in the slice. It is up to the
+ * iosched whether it wants to wait on this queue or expire it and move on
+ * to the next queue.
+ */
+static int
+elv_iosched_expire_ioq(struct request_queue *q, int slice_expired, int force)
+{
+ struct elevator_queue *e = q->elevator;
+ struct io_queue *ioq = elv_active_ioq(q->elevator);
+ int ret = 1;
+
+ if (e->ops->elevator_expire_ioq_fn) {
+ ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+ slice_expired, force);
+ /*
+ * The iosched (AS) denied expiration of the queue for now.
+ * Mark that the elevator layer has requested the ioscheduler
+ * to expire this queue; AS will then expire it as soon as it
+ * can. Don't dispatch from this queue any more, even if a new
+ * request arrives and time is left in the slice, until it has
+ * been expired once.
+ */
+ if (!ret)
+ elv_mark_ioq_must_expire(ioq);
+ }
+
+ return ret;
+}
+
+/*
* Do the accounting. Determine how much service (in terms of time slices)
* current queue used and adjust the start, finish time of queue and vtime
* of the tree accordingly.
@@ -2587,16 +2642,18 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,

static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
{
- elv_log_ioq(q->elevator->efqd, ioq, "preempt");
- elv_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, 0, 1)) {
+ elv_log_ioq(q->elevator->efqd, ioq, "preempt");
+ elv_slice_expired(q);

- /*
- * Put the new queue at the front of the of the current list,
- * so we know that it will be selected next.
- */
+ /*
+ * Put the new queue at the front of the of the current list,
+ * so we know that it will be selected next.
+ */

- requeue_ioq(ioq, 1);
- elv_mark_ioq_slice_new(ioq);
+ requeue_ioq(ioq, 1);
+ elv_mark_ioq_slice_new(ioq);
+ }
}

void elv_ioq_request_add(struct request_queue *q, struct request *rq)
@@ -2729,6 +2786,8 @@ static void elv_idle_slice_timer(unsigned long data)
goto out_kick;
}
expire:
+ /* Force expire the queue for AS */
+ elv_iosched_expire_ioq(q, 0, 1);
elv_slice_expired(q);
out_kick:
elv_schedule_dispatch(q);
@@ -2819,6 +2878,8 @@ void *elv_select_ioq(struct request_queue *q, int force)
struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
struct io_group *iog;
struct elv_fq_data *efqd = q->elevator->efqd;
+ struct elevator_type *e = q->elevator->elevator_type;
+ int slice_expired = 0;

if (!elv_nr_busy_ioq(q->elevator))
return NULL;
@@ -2894,6 +2955,7 @@ void *elv_select_ioq(struct request_queue *q, int force)
* from queue and is not proportional to group's weight, it
* harms the fairness of the group.
*/
+ slice_expired = 1;
if ((elv_iog_should_idle(ioq) || elv_ioq_should_wait_busy(ioq))
&& !elv_iog_wait_busy_done(iog)) {
ioq = NULL;
@@ -2939,11 +3001,15 @@ void *elv_select_ioq(struct request_queue *q, int force)
}

expire:
- if (efqd->fairness && !force && ioq && ioq->dispatched) {
+ if (efqd->fairness && !force && ioq && ioq->dispatched
+ && strcmp(e->elevator_name, "anticipatory")) {
/*
* If there are request dispatched from this queue, don't
* dispatch requests from new queue till all the requests from
* this queue have completed.
+ *
+ * Anticipatory does not allow a queue switch until requests
+ * from the previous queue have finished.
*/
elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
" disp=%lu", ioq->dispatched);
@@ -2951,7 +3017,14 @@ expire:
goto keep_queue;
}

- elv_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, slice_expired, force))
+ elv_slice_expired(q);
+ else
+ /*
+ * Not setting ioq = NULL, as AS can deny queue expiration and
+ * continue to dispatch from the same queue.
+ */
+ goto keep_queue;
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
keep_queue:
@@ -3044,8 +3117,15 @@ check_expire_last_empty_queue(struct request_queue *q, struct io_queue *ioq)
if (ioq_is_idling(ioq))
return;

- elv_log_ioq(efqd, ioq, "expire last empty queue");
- elv_slice_expired(q);
+ /*
+ * If the IO scheduler denies expiration here, it is up to the IO
+ * scheduler to expire the queue when possible. Otherwise all the
+ * idle time will be charged to the queue when it finally expires.
+ */
+ if (elv_iosched_expire_ioq(q, 0, 0)) {
+ elv_log_ioq(efqd, ioq, "expire last empty queue");
+ elv_slice_expired(q);
+ }
}

/* A request got completed from io_queue. Do the accounting. */
@@ -3119,7 +3199,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
goto done;

/* Expire the queue */
- elv_slice_expired(q);
+ if (elv_iosched_expire_ioq(q, 1, 0))
+ elv_slice_expired(q);
goto done;
} else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq)
&& sync && !rq_noidle(rq))
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 77c1fa5..3d4e31c 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -41,6 +41,7 @@ typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
struct request*);
typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
void*);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
#endif

struct elevator_ops
@@ -79,6 +80,7 @@ struct elevator_ops
elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
elevator_should_preempt_fn *elevator_should_preempt_fn;
elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+ elevator_expire_ioq_fn *elevator_expire_ioq_fn;
#endif
};

--
1.6.0.6
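
For readers following the control flow above, a compressed user-space
sketch of the expire handshake added by this patch (illustration only;
the two helpers below are simplified stand-ins for elevator_expire_ioq_fn
and elv_iosched_expire_ioq(), and the struct fields are invented for the
example):

#include <stdio.h>
#include <stdbool.h>

struct ioq {
	bool must_expire;	/* elevator has asked for a switch */
};

struct iosched {
	bool mid_batch;		/* e.g. AS waiting for a batch to drain */
};

/* Stand-in for elevator_expire_ioq_fn: may deny unless force is set. */
static bool iosched_expire(struct iosched *as, bool slice_expired, bool force)
{
	(void)slice_expired;	/* the real AS also checks anticipation state */
	if (force)
		return true;	/* forced: clean up state and let go */
	return !as->mid_batch;	/* deny while the batch is still draining */
}

/* Stand-in for elv_iosched_expire_ioq(): mark must_expire on denial. */
static bool elv_try_expire(struct iosched *as, struct ioq *ioq,
			   bool slice_expired, bool force)
{
	bool ok = iosched_expire(as, slice_expired, force);

	if (!ok)
		ioq->must_expire = true;
	return ok;
}

int main(void)
{
	struct iosched as = { .mid_batch = true };
	struct ioq ioq = { .must_expire = false };

	if (!elv_try_expire(&as, &ioq, true, false))
		printf("expiry denied, must_expire=%d\n", ioq.must_expire);

	as.mid_batch = false;	/* batch drained */
	printf("retry: %s\n",
	       elv_try_expire(&as, &ioq, true, false) ? "expired" : "kept");
	return 0;
}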

2009-09-24 19:28:04

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 23/28] io-controller: blkio_cgroup patches from Ryo to track async bios.

o blkio_cgroup patches from Ryo to track async bios.

o This functionality is used to determine the group of async IO from the
page owner instead of from the context of the submitting task (a rough
sketch of the idea follows below).
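
A rough user-space sketch of the idea (illustration only, not the kernel
code; the struct layouts and helper names below are simplified stand-ins,
while the actual patch stores blkio_cgroup_id in struct page_cgroup and
resolves it via lookup_page_cgroup()):

#include <stdio.h>

struct page {
	unsigned long blkio_cgroup_id;	/* 0 == default blkio cgroup */
};

struct bio {
	struct page *first_page;	/* page the IO operates on */
};

/* Called when a task dirties a page (cf. the __set_page_dirty hooks). */
static void set_page_owner(struct page *pg, unsigned long cgroup_id)
{
	pg->blkio_cgroup_id = cgroup_id;
}

/* Called at IO submission time (cf. get_blkio_cgroup_id()). */
static unsigned long bio_to_cgroup_id(struct bio *bio)
{
	return bio->first_page ? bio->first_page->blkio_cgroup_id : 0;
}

int main(void)
{
	struct page pg = { 0 };
	struct bio bio = { .first_page = &pg };

	set_page_owner(&pg, 3);	/* a task in cgroup 3 dirtied the page */

	/* later, a flusher thread in another cgroup submits the write */
	printf("async bio charged to cgroup %lu\n", bio_to_cgroup_id(&bio));
	return 0;
}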

Signed-off-by: Hirokazu Takahashi <[email protected]>
Signed-off-by: Ryo Tsuruta <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/blk-ioc.c | 36 +++---
fs/buffer.c | 2 +
fs/direct-io.c | 2 +
include/linux/biotrack.h | 100 ++++++++++++++
include/linux/cgroup_subsys.h | 6 +
include/linux/iocontext.h | 1 +
include/linux/memcontrol.h | 6 +
include/linux/mmzone.h | 4 +-
include/linux/page_cgroup.h | 5 +-
init/Kconfig | 16 +++
mm/Makefile | 4 +-
mm/biotrack.c | 293 +++++++++++++++++++++++++++++++++++++++++
mm/bounce.c | 2 +
mm/filemap.c | 2 +
mm/memcontrol.c | 6 +
mm/memory.c | 5 +
mm/page-writeback.c | 2 +
mm/page_cgroup.c | 23 ++--
mm/swap_state.c | 2 +
19 files changed, 486 insertions(+), 31 deletions(-)
create mode 100644 include/linux/biotrack.h
create mode 100644 mm/biotrack.c

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 0d56336..890d475 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,27 +84,31 @@ void exit_io_context(void)
}
}

+void init_io_context(struct io_context *ioc)
+{
+ atomic_long_set(&ioc->refcount, 1);
+ atomic_set(&ioc->nr_tasks, 1);
+ spin_lock_init(&ioc->lock);
+ ioc->ioprio_changed = 0;
+ ioc->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+ ioc->cgroup_changed = 0;
+#endif
+ ioc->last_waited = jiffies; /* doesn't matter... */
+ ioc->nr_batch_requests = 0; /* because this is 0 */
+ ioc->aic = NULL;
+ INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+ INIT_HLIST_HEAD(&ioc->cic_list);
+ ioc->ioc_data = NULL;
+}
+
struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
{
struct io_context *ret;

ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
- if (ret) {
- atomic_long_set(&ret->refcount, 1);
- atomic_set(&ret->nr_tasks, 1);
- spin_lock_init(&ret->lock);
- ret->ioprio_changed = 0;
- ret->ioprio = 0;
-#ifdef CONFIG_GROUP_IOSCHED
- ret->cgroup_changed = 0;
-#endif
- ret->last_waited = jiffies; /* doesn't matter... */
- ret->nr_batch_requests = 0; /* because this is 0 */
- ret->aic = NULL;
- INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
- INIT_HLIST_HEAD(&ret->cic_list);
- ret->ioc_data = NULL;
- }
+ if (ret)
+ init_io_context(ret);

return ret;
}
diff --git a/fs/buffer.c b/fs/buffer.c
index 28f320f..8efcd82 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
+#include <linux/biotrack.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8b10b87..185ba0a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
+#include <linux/biotrack.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
ret = PTR_ERR(page);
goto out;
}
+ blkio_cgroup_reset_owner(page, current->mm);

while (block_in_page < blocks_per_page) {
unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..2b8bb0b
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,100 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+ struct cgroup_subsys_state css;
+ struct io_context *io_context; /* default io_context */
+/* struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc: page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+ pc->blkio_cgroup_id = 0;
+}
+
+/**
+ * blkio_cgroup_disabled() - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+ if (blkio_cgroup_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *get_cgroup_from_page(struct page *page);
+
+#else /* !CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+ return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+ struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+ return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ return 0;
+}
+
+static inline struct cgroup *get_cgroup_from_page(struct page *page)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index baf544f..78504f3 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)

/* */

+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
#ifdef CONFIG_CGROUP_DEVICE
SUBSYS(devices)
#endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index b343594..1baa6c1 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -109,6 +109,7 @@ int put_io_context(struct io_context *ioc);
void exit_io_context(void);
struct io_context *get_io_context(gfp_t gfp_flags, int node);
struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
void copy_io_context(struct io_context **pdst, struct io_context **psrc);
#else
static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e46a073..eb45fe9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
* (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
*/

+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
/* for swap handling */
@@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;

+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
static inline int mem_cgroup_newpage_charge(struct page *page,
struct mm_struct *mm, gfp_t gfp_mask)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8895985..c9d1ed4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -605,7 +605,7 @@ typedef struct pglist_data {
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
struct page_cgroup *node_page_cgroup;
#endif
#endif
@@ -956,7 +956,7 @@ struct mem_section {

/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 13f126c..bca6c8a 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
#ifndef __LINUX_PAGE_CGROUP_H
#define __LINUX_PAGE_CGROUP_H

-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
#include <linux/bit_spinlock.h>
/*
* Page Cgroup can be considered as an extended mem_map.
@@ -14,6 +14,7 @@ struct page_cgroup {
unsigned long flags;
struct mem_cgroup *mem_cgroup;
struct page *page;
+ unsigned long blkio_cgroup_id;
struct list_head lru; /* per cgroup LRU list */
};

@@ -83,7 +84,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
bit_spin_unlock(PCG_LOCK, &pc->flags);
}

-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
struct page_cgroup;

static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/init/Kconfig b/init/Kconfig
index afcaa86..54aa85a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -622,6 +622,22 @@ config GROUP_IOSCHED

endif # CGROUPS

+config CGROUP_BLKIO
+ bool
+ depends on CGROUPS && BLOCK
+ select MM_OWNER
+ default n
+ ---help---
+ Provides a resource controller which enables tracking the owner
+ of every block I/O request.
+ The information this subsystem provides can be used by any
+ kind of module, such as the dm-ioband device-mapper module or
+ the cfq scheduler.
+
+config CGROUP_PAGE
+ def_bool y
+ depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
config MM_OWNER
bool

diff --git a/mm/Makefile b/mm/Makefile
index 5e0bd64..6208744 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,6 +39,8 @@ else
obj-$(CONFIG_SMP) += allocpercpu.o
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..1da7d1e
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,293 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request,
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined within the framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+ struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+ .io_context = &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+ struct blkio_cgroup *biog;
+ struct page_cgroup *pc;
+
+ if (blkio_cgroup_disabled())
+ return;
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ pc->blkio_cgroup_id = 0; /* 0: default blkio_cgroup id */
+ if (!mm)
+ return;
+ /*
+ * Locking "pc" isn't necessary here since the current process is
+ * the only one that can access the members related to blkio_cgroup.
+ */
+ rcu_read_lock();
+ biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!biog))
+ goto out;
+ /*
+ * css_get(&biog->css) isn't called to increment the reference
+ * count of this blkio_cgroup "biog", so pc->blkio_cgroup_id
+ * might become invalid even if this page is still active.
+ * This approach is chosen to minimize the overhead.
+ */
+ pc->blkio_cgroup_id = css_id(&biog->css);
+out:
+ rcu_read_unlock();
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+ /*
+ * A little trick:
+ * Just call blkio_cgroup_set_owner() even for pages which are already
+ * active, since the blkio_cgroup_id member of page_cgroup can be
+ * updated without any locks. This works because an integer variable
+ * can be assigned a new value in a single store on modern CPUs.
+ */
+ blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page: the page we want to tag
+ * @mm: the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+ if (!page_is_file_cache(page))
+ return;
+ if (current->flags & PF_MEMALLOC)
+ return;
+
+ blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage: the page where we want to copy the owner
+ * @opage: the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+ struct page_cgroup *npc, *opc;
+
+ if (blkio_cgroup_disabled())
+ return;
+ npc = lookup_page_cgroup(npage);
+ if (unlikely(!npc))
+ return;
+ opc = lookup_page_cgroup(opage);
+ if (unlikely(!opc))
+ return;
+
+ /*
+ * Do this without any locks. The reason is the same as
+ * blkio_cgroup_reset_owner().
+ */
+ npc->blkio_cgroup_id = opc->blkio_cgroup_id;
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio_cgroup *biog;
+ struct io_context *ioc;
+
+ if (!cgrp->parent) {
+ biog = &default_blkio_cgroup;
+ init_io_context(biog->io_context);
+ /* Take an extra reference so this io_context is never released. */
+ atomic_long_inc(&biog->io_context->refcount);
+ return &biog->css;
+ }
+
+ biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+ if (!biog)
+ return ERR_PTR(-ENOMEM);
+ ioc = alloc_io_context(GFP_KERNEL, -1);
+ if (!ioc) {
+ kfree(biog);
+ return ERR_PTR(-ENOMEM);
+ }
+ biog->io_context = ioc;
+ return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+ put_io_context(biog->io_context);
+ free_css_id(&blkio_cgroup_subsys, &biog->css);
+ kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+ struct page_cgroup *pc;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+ unsigned long id = 0;
+
+ pc = lookup_page_cgroup(page);
+ if (pc)
+ id = pc->blkio_cgroup_id;
+ return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+ struct cgroup_subsys_state *css;
+ struct blkio_cgroup *biog;
+ struct io_context *ioc;
+ unsigned long id;
+
+ id = get_blkio_cgroup_id(bio);
+ rcu_read_lock();
+ css = css_lookup(&blkio_cgroup_subsys, id);
+ if (css)
+ biog = container_of(css, struct blkio_cgroup, css);
+ else
+ biog = &default_blkio_cgroup;
+ ioc = biog->io_context; /* default io_context for this cgroup */
+ atomic_long_inc(&ioc->refcount);
+ rcu_read_unlock();
+ return ioc;
+}
+
+/**
+ * get_cgroup_from_page() - determine the cgroup from a page.
+ * @page: the page to be tracked
+ *
+ * Returns the cgroup of a given page. A NULL return value means that
+ * the page belongs to default_blkio_cgroup.
+ *
+ * Note:
+ * This function must be called under rcu_read_lock().
+ */
+struct cgroup *get_cgroup_from_page(struct page *page)
+{
+ struct page_cgroup *pc;
+ struct cgroup_subsys_state *css;
+
+ pc = lookup_page_cgroup(page);
+ if (!pc)
+ return NULL;
+
+ css = css_lookup(&blkio_cgroup_subsys, pc->blkio_cgroup_id);
+ if (!css)
+ return NULL;
+
+ return css->cgroup;
+}
+
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_cgroup_from_page);
+
+/* Read the ID of the specified blkio cgroup. */
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+ return (u64)css_id(&biog->css);
+}
+
+static struct cftype blkio_files[] = {
+ {
+ .name = "id",
+ .read_u64 = blkio_id_read,
+ },
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup_add_files(cgrp, ss, blkio_files,
+ ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+ .name = "blkio",
+ .create = blkio_cgroup_create,
+ .destroy = blkio_cgroup_destroy,
+ .populate = blkio_cgroup_populate,
+ .subsys_id = blkio_cgroup_subsys_id,
+ .use_id = 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index a2b76a5..7ad8d44 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -13,6 +13,7 @@
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
+#include <linux/biotrack.h>
#include <asm/tlbflush.h>

#include <trace/events/block.h>
@@ -210,6 +211,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
to->bv_len = from->bv_len;
to->bv_offset = from->bv_offset;
inc_zone_page_state(to->bv_page, NR_BOUNCE);
+ blkio_cgroup_copy_owner(to->bv_page, page);

if (rw == WRITE) {
char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index ccea3b6..01c47a1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
#include "internal.h"

@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
goto out;
+ blkio_cgroup_set_owner(page, current->mm);

error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fd4529d..baf4be7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};

+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+ pc->mem_cgroup = NULL;
+ INIT_LIST_HEAD(&pc->lru);
+}
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index aede2ce..346f368 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mmu_notifier.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
@@ -2116,6 +2117,7 @@ gotten:
*/
ptep_clear_flush_notify(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
+ blkio_cgroup_set_owner(new_page, mm);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
if (old_page) {
@@ -2581,6 +2583,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
flush_icache_page(vma, page);
set_pte_at(mm, address, page_table, pte);
page_add_anon_rmap(page, vma, address);
+ blkio_cgroup_reset_owner(page, mm);
/* It's better to call commit-charge after rmap is established */
mem_cgroup_commit_charge_swapin(page, ptr);

@@ -2645,6 +2648,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto release;
inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
set_pte_at(mm, address, page_table, entry);

/* No need to invalidate - it was non-present before */
@@ -2792,6 +2796,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (anon) {
inc_mm_counter(mm, anon_rss);
page_add_new_anon_rmap(page, vma, address);
+ blkio_cgroup_set_owner(page, mm);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..1df421b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1247,6 +1248,7 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(mapping2 != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
+ blkio_cgroup_reset_owner_pagedirty(page, current->mm);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index f22b4eb..29bf26c 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
#include <linux/vmalloc.h>
#include <linux/cgroup.h>
#include <linux/swapops.h>
+#include <linux/biotrack.h>

static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
pc->page = pfn_to_page(pfn);
- INIT_LIST_HEAD(&pc->lru);
+ __init_mem_page_cgroup(pc);
+ __init_blkio_page_cgroup(pc);
}
static unsigned long total_usage;

@@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)

int nid, fail;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;

for_each_online_node(nid) {
@@ -83,12 +84,13 @@ void __init page_cgroup_init_flatmem(void)
goto fail;
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
- " don't want memory cgroups\n");
+ printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+ " if you don't want memory and blkio cgroups\n");
return;
fail:
printk(KERN_CRIT "allocation of page_cgroup failed.\n");
- printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+ printk(KERN_CRIT
+ "please try 'cgroup_disable=memory,blkio' boot option\n");
panic("Out of memory");
}

@@ -245,7 +247,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int fail = 0;

- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && blkio_cgroup_disabled())
return;

for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -254,14 +256,15 @@ void __init page_cgroup_init(void)
fail = init_section_page_cgroup(pfn);
}
if (fail) {
- printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+ printk(KERN_CRIT
+ "try 'cgroup_disable=memory,blkio' boot option\n");
panic("Out of memory");
} else {
hotplug_memory_notifier(page_cgroup_callback, 0);
}
printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
- printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
- " want memory cgroups\n");
+ printk(KERN_INFO "please try 'cgroup_disable=memory,blkio' option"
+ " if you don't want memory and blkio cgroups\n");
}

void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 42cd38e..6eb96f1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
#include <linux/pagevec.h>
#include <linux/migrate.h>
#include <linux/page_cgroup.h>
+#include <linux/biotrack.h>

#include <asm/pgtable.h>

@@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
__set_page_locked(new_page);
SetPageSwapBacked(new_page);
+ blkio_cgroup_set_owner(new_page, current->mm);
err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (likely(!err)) {
/*
--
1.6.0.6

2009-09-24 19:27:14

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 24/28] io-controller: map async requests to appropriate cgroup

o So far we were assuming that a bio/rq belongs to the task submitting
it. That does not hold good in the case of async writes. This patch makes
use of the blkio_cgroup patches to attribute async writes to the right
group instead of to the task submitting the bio.

o For sync requests, we continue to assume that the io belongs to the task
submitting it. Only in the case of async requests do we make use of the io
tracking patches to track the owner cgroup.

o So far cfq always caches the async queue pointer. With async requests no
longer necessarily tied to the submitting task's io context, caching the
pointer does not help for async queues. This patch introduces a new config
option, CONFIG_TRACK_ASYNC_CONTEXT. If the option is not set, cfq retains
the old behavior where the async queue pointer is cached in the task
context. If it is set, the async queue pointer is not cached and we use the
bio tracking patches to determine the group a bio belongs to and then map
it to the async queue of that group (a compressed sketch of this decision
follows below).
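
A compressed user-space sketch of this decision (illustration only; the
TRACK_ASYNC_CONTEXT macro and the field names are simplified stand-ins
for the Kconfig option and the page_cgroup based lookup used by the real
code):

#include <stdio.h>
#include <stdbool.h>

#define TRACK_ASYNC_CONTEXT 1		/* stand-in for the Kconfig option */

struct bio {
	bool sync;
	unsigned long page_cgroup_id;	/* owner recorded when page dirtied */
};

/* Pick the group a request should be charged to. */
static unsigned long request_group(const struct bio *bio,
				   unsigned long submitter_cgroup_id)
{
	if (bio->sync)
		return submitter_cgroup_id;	/* sync: submitting task */
#if TRACK_ASYNC_CONTEXT
	return bio->page_cgroup_id;		/* async: page owner */
#else
	return submitter_cgroup_id;		/* old behaviour */
#endif
}

int main(void)
{
	struct bio rd = { .sync = true,  .page_cgroup_id = 2 };
	struct bio wr = { .sync = false, .page_cgroup_id = 2 };

	/* a flusher thread (cgroup 0) submits both on behalf of cgroup 2 */
	printf("sync  -> group %lu\n", request_group(&rd, 0));
	printf("async -> group %lu\n", request_group(&wr, 0));
	return 0;
}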

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/Kconfig.iosched | 16 +++++
block/as-iosched.c | 2 +-
block/blk-core.c | 7 +-
block/cfq-iosched.c | 152 ++++++++++++++++++++++++++++++++++++----------
block/deadline-iosched.c | 2 +-
block/elevator-fq.c | 93 +++++++++++++++++++++++-----
block/elevator-fq.h | 31 ++++++---
block/elevator.c | 15 +++--
include/linux/elevator.h | 22 ++++++-
9 files changed, 267 insertions(+), 73 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8ab08da..8b507c4 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -132,6 +132,22 @@ config DEBUG_GROUP_IOSCHED
Enable some debugging hooks for hierarchical scheduling support.
Currently it just outputs more information in blktrace output.

+config TRACK_ASYNC_CONTEXT
+ bool "Determine async request context from bio"
+ depends on GROUP_IOSCHED
+ select CGROUP_BLKIO
+ default n
+ ---help---
+ Normally an async request is attributed to the task submitting it.
+ With group ioscheduling, for accurate accounting of async writes,
+ one needs to map the request back to the task/cgroup that originated
+ it rather than to the submitter of the request.
+
+ The generic io tracking patches provide the facility to map a bio to
+ its original owner. If this option is set, the original owner of an
+ async bio is determined using the io tracking patches; otherwise the
+ request continues to be attributed to the submitting thread.
endmenu

endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index fed579f..fc2453d 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1594,7 +1594,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
{
sector_t rb_key = bio->bi_sector + bio_sectors(bio);
struct request *__rq;
- struct as_queue *asq = elv_get_sched_queue_current(q);
+ struct as_queue *asq = elv_get_sched_queue_bio(q, bio);

if (!asq)
return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index e3299a7..47cce59 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -619,7 +619,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
}

static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+ gfp_t gfp_mask)
{
struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);

@@ -631,7 +632,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
rq->cmd_flags = flags | REQ_ALLOCED;

if (priv) {
- if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+ if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
mempool_free(rq, q->rq.rq_pool);
return NULL;
}
@@ -772,7 +773,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);

- rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+ rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 37a4832..88a7275 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -176,8 +176,8 @@ CFQ_CFQQ_FNS(coop);
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)

static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
- struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio *bio,
+ int, struct io_context *, gfp_t);
static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
struct io_context *);

@@ -187,22 +187,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
return cic->cfqq[!!is_sync];
}

-static inline void cic_set_cfqq(struct cfq_io_context *cic,
- struct cfq_queue *cfqq, int is_sync)
-{
- cic->cfqq[!!is_sync] = cfqq;
-}
-
/*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue a bio should go into. This is primarily used by
+ * the front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from the task
+ * submitting the async bio. Later, save the task information in the
+ * page_cgroup and retrieve the task's ioprio and class from there.
*/
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+ struct cfq_io_context *cic, struct bio *bio, int is_sync)
{
- if (bio_data_dir(bio) == READ || bio_sync(bio))
- return 1;
+ struct cfq_queue *cfqq = NULL;

- return 0;
+ cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+ struct io_group *iog;
+ /*
+ * async bio tracking is enabled and we are not caching
+ * async queue pointer in cic.
+ */
+ iog = elv_io_get_io_group_bio(cfqd->queue, bio, 0);
+ if (!iog) {
+ /*
+ * Maybe this is the first rq/bio and the io group has
+ * not been set up yet.
+ */
+ return NULL;
+ }
+ return elv_io_group_async_queue_prio(iog, ioprio_class, ioprio);
+ }
+#endif
+ return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+ struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ /*
+ * Don't cache the async queue pointer, as one io context might
+ * now be submitting async io for several different async queues.
+ */
+ if (!is_sync)
+ return;
+#endif
+ cic->cfqq[!!is_sync] = cfqq;
}

static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -526,7 +560,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
if (!cic)
return NULL;

- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
if (cfqq) {
sector_t sector = bio->bi_sector + bio_sectors(bio);

@@ -609,7 +643,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
/*
* Disallow merge of a sync bio into an async request.
*/
- if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+ if (elv_bio_sync(bio) && !rq_is_sync(rq))
return 0;

/*
@@ -620,7 +654,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
if (!cic)
return 0;

- cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+ cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
if (cfqq == RQ_CFQQ(rq))
return 1;

@@ -1250,14 +1284,28 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
spin_lock_irqsave(cfqd->queue->queue_lock, flags);

cfqq = cic->cfqq[BLK_RW_ASYNC];
+
if (cfqq) {
struct cfq_queue *new_cfqq;
- new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+ /*
+ * Drop the reference to old queue unconditionally. Don't
+ * worry whether new async prio queue has been allocated
+ * or not.
+ */
+ cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+ cfq_put_queue(cfqq);
+
+ /*
+ * Why allocate a new queue now? Won't it be allocated
+ * automatically when another async request from the same
+ * context arrives? Keeping this for the time being because the
+ * existing cfq code allocates the new queue immediately upon a
+ * prio change.
+ */
+ new_cfqq = cfq_get_queue(cfqd, NULL, BLK_RW_ASYNC, cic->ioc,
GFP_ATOMIC);
- if (new_cfqq) {
- cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
- cfq_put_queue(cfqq);
- }
+ if (new_cfqq)
+ cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
}

cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1308,7 +1356,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)

spin_lock_irqsave(q->queue_lock, flags);

- iog = elv_io_get_io_group(q, 0);
+ iog = elv_io_get_io_group(q, NULL, 0);

if (async_cfqq != NULL) {
__iog = cfqq_to_io_group(async_cfqq);
@@ -1347,7 +1395,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
#endif /* CONFIG_IOSCHED_CFQ_HIER */

static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
struct io_context *ioc, gfp_t gfp_mask)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1357,12 +1405,28 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
struct io_group *iog = NULL;

retry:
- iog = elv_io_get_io_group(q, 1);
+ iog = elv_io_get_io_group_bio(q, bio, 1);

cic = cfq_cic_lookup(cfqd, ioc);
/* cic always exists here */
cfqq = cic_to_cfqq(cic, is_sync);

+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ if (!cfqq && !is_sync) {
+ const int ioprio = task_ioprio(cic->ioc);
+ const int ioprio_class = task_ioprio_class(cic->ioc);
+
+ /*
+ * We have not cached the async queue pointer, as bio tracking
+ * is enabled. Look into the group's async queue array using the
+ * ioc class and prio to see if somebody already allocated the
+ * queue.
+ */
+
+ cfqq = elv_io_group_async_queue_prio(iog, ioprio_class, ioprio);
+ }
+#endif
+
/*
* Always try a new alloc if we fell back to the OOM cfqq
* originally, since it should just be a temporary situation.
@@ -1439,14 +1503,14 @@ out:
}

static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
- gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
+ struct io_context *ioc, gfp_t gfp_mask)
{
const int ioprio = task_ioprio(ioc);
const int ioprio_class = task_ioprio_class(ioc);
struct cfq_queue *async_cfqq = NULL;
struct cfq_queue *cfqq = NULL;
- struct io_group *iog = elv_io_get_io_group(cfqd->queue, 1);
+ struct io_group *iog = elv_io_get_io_group_bio(cfqd->queue, bio, 1);

if (!is_sync) {
async_cfqq = elv_io_group_async_queue_prio(iog, ioprio_class,
@@ -1455,14 +1519,35 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
}

if (!cfqq)
- cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+ cfqq = cfq_find_alloc_queue(cfqd, bio, is_sync, ioc, gfp_mask);

if (!is_sync && !async_cfqq)
elv_io_group_set_async_queue(iog, ioprio_class, ioprio,
cfqq->ioq);
-
- /* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ /*
+ * ioc reference. If async request queue/group is determined from the
+ * original task/cgroup and not from submitter task, io context can
+ * not cache the pointer to the async queue and every time a request comes,
+ * it will be determined by going through the async queue array.
+ *
+ * This comes from the fact that we might be getting async requests
+ * which belong to a different cgroup altogether than the cgroup
+ * iocontext belongs to. And this thread might be submitting bios
+ * from various cgroups. So every time async queue will be different
+ * based on the cgroup of the bio/rq. Can't cache the async cfqq
+ * pointer in cic.
+ */
+ if (is_sync)
+ elv_get_ioq(cfqq->ioq);
+#else
+ /*
+ * async requests are being attributed to task submitting
+ * it, hence cic can cache async cfqq pointer. Take the
+ * queue reference even for async queue.
+ */
elv_get_ioq(cfqq->ioq);
+#endif
return cfqq;
}

@@ -1915,7 +2000,8 @@ static void cfq_put_request(struct request *rq)
* Allocate cfq data structures associated with this request.
*/
static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+ gfp_t gfp_mask)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_io_context *cic;
@@ -1935,7 +2021,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)

cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq || cfqq == &cfqd->oom_cfqq) {
- cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+ cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc, gfp_mask);
cic_set_cfqq(cic, cfqq, is_sync);
}

diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index e5bc823..cc9c8c3 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -134,7 +134,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
int ret;
struct deadline_queue *dq;

- dq = elv_get_sched_queue_current(q);
+ dq = elv_get_sched_queue_bio(q, bio);
if (!dq)
return ELEVATOR_NO_MERGE;

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 149a147..3089175 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/blktrace_api.h>
#include <linux/seq_file.h>
+#include <linux/biotrack.h>
#include "elevator-fq.h"

const int elv_slice_sync = HZ / 10;
@@ -1237,6 +1238,9 @@ struct io_cgroup io_root_cgroup = {

static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
{
+ if (!cgroup)
+ return &io_root_cgroup;
+
return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
struct io_cgroup, css);
}
@@ -1696,9 +1700,45 @@ end:
return iog;
}

+struct io_group *elv_io_get_io_group_bio(struct request_queue *q,
+ struct bio *bio, int create)
+{
+ struct page *page = NULL;
+
+ /*
+ * Determine the group from task context. Even calls from
+ * blk_get_request() which don't have any bio info will be mapped
+ * to the task's group
+ */
+ if (!bio)
+ goto sync;
+
+ if (bio_barrier(bio)) {
+ /*
+ * Map barrier requests to the root group. Maybe more special
+ * bio cases should come here
+ */
+ return q->elevator->efqd->root_group;
+ }
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+ /* Map the sync bio to the right group using task context */
+ if (elv_bio_sync(bio))
+ goto sync;
+
+ /* Determine the group from info stored in page */
+ page = bio_iovec_idx(bio, 0)->bv_page;
+ return elv_io_get_io_group(q, page, create);
+#endif
+
+sync:
+ return elv_io_get_io_group(q, page, create);
+}
+EXPORT_SYMBOL(elv_io_get_io_group_bio);
+
/*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group the page belongs to.
+ * If "create" is set, io group is created if it is not already present.
*
* Note: This function should be called with queue lock held. It returns
* a pointer to io group without taking any reference. That group will
@@ -1706,28 +1746,45 @@ end:
* needs to get hold of queue lock). So if somebody needs to use group
* pointer even after dropping queue lock, take a reference to the group
* before dropping queue lock.
+ *
+ * One can call it without queue lock with rcu read lock held for browsing
+ * through the groups.
*/
-struct io_group *elv_io_get_io_group(struct request_queue *q, int create)
+struct io_group *
+elv_io_get_io_group(struct request_queue *q, struct page *page, int create)
{
struct cgroup *cgroup;
struct io_group *iog;
struct elv_fq_data *efqd = q->elevator->efqd;

- assert_spin_locked(q->queue_lock);
+ if (create)
+ assert_spin_locked(q->queue_lock);

rcu_read_lock();
- cgroup = task_cgroup(current, io_subsys_id);
+
+ if (!page)
+ cgroup = task_cgroup(current, io_subsys_id);
+ else
+ cgroup = get_cgroup_from_page(page);
+
+ if (!cgroup) {
+ iog = efqd->root_group;
+ goto out;
+ }
+
iog = io_find_alloc_group(q, cgroup, efqd, create);
if (!iog) {
if (create)
iog = efqd->root_group;
- else
+ else {
/*
* bio merge functions doing lookup don't want to
* map bio to root group by default
*/
iog = NULL;
+ }
}
+out:
rcu_read_unlock();
return iog;
}
@@ -1985,7 +2042,7 @@ int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
return 1;

/* Determine the io group of the bio submitting task */
- iog = elv_io_get_io_group(q, 0);
+ iog = elv_io_get_io_group_bio(q, bio, 0);
if (!iog) {
/* May be task belongs to a different cgroup for which io
* group has not been setup yet. */
@@ -2018,7 +2075,7 @@ elv_io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
* function is not invoked.
*/
int elv_set_request_ioq(struct request_queue *q, struct request *rq,
- gfp_t gfp_mask)
+ struct bio *bio, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;
unsigned long flags;
@@ -2034,7 +2091,7 @@ int elv_set_request_ioq(struct request_queue *q, struct request *rq,

retry:
/* Determine the io group request belongs to */
- iog = elv_io_get_io_group(q, 1);
+ iog = elv_io_get_io_group_bio(q, bio, 1);
BUG_ON(!iog);

/* Get the iosched queue */
@@ -2136,18 +2193,20 @@ queue_fail:
}

/*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue the bio belongs to. Optimization for single ioq
* per io group io schedulers.
*/
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
{
struct io_group *iog;

/* Determine the io group and io queue of the bio submitting task */
- iog = elv_io_get_io_group(q, 0);
+ iog = elv_io_get_io_group_bio(q, bio, 0);
if (!iog) {
- /* May be task belongs to a cgroup for which io group has
- * not been setup yet. */
+ /*
+ * Maybe the bio belongs to a cgroup for which the io group has
+ * not been set up yet.
+ */
return NULL;
}
return iog->ioq;
@@ -3028,8 +3087,12 @@ expire:
new_queue:
ioq = elv_set_active_ioq(q, new_ioq);
keep_queue:
- if (ioq)
+ if (ioq) {
+ elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+ elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+ elv_ioq_nr_dispatched(ioq));
check_late_preemption(q->elevator, ioq);
+ }
return ioq;
}

diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 4114543..be66d28 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -429,7 +429,9 @@ static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
extern int elv_io_group_allow_merge(struct request *rq, struct bio *bio);
extern void elv_put_iog(struct io_group *iog);
extern struct io_group *elv_io_get_io_group(struct request_queue *q,
- int create);
+ struct page *page, int create);
+extern struct io_group *elv_io_get_io_group_bio(struct request_queue *q,
+ struct bio *bio, int create);
extern ssize_t elv_group_idle_show(struct elevator_queue *q, char *name);
extern ssize_t elv_group_idle_store(struct elevator_queue *q, const char *name,
size_t count);
@@ -439,9 +441,10 @@ static inline void elv_get_iog(struct io_group *iog)
}

extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
- gfp_t gfp_mask);
+ struct bio *bio, gfp_t gfp_mask);
extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
-extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio);

#else /* !GROUP_IOSCHED */

@@ -454,14 +457,20 @@ static inline void elv_get_iog(struct io_group *iog) {}
static inline void elv_put_iog(struct io_group *iog) {}

static inline struct io_group *
-elv_io_get_io_group(struct request_queue *q, int create)
+elv_io_get_io_group(struct request_queue *q, struct page *page, int create)
{
/* In flat mode, there is only root group */
return q->elevator->efqd->root_group;
}

-static inline int
-elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+static inline struct io_group *
+elv_io_get_io_group_bio(struct request_queue *q, struct bio *bio, int create)
+{
+ return q->elevator->efqd->root_group;
+}
+
+static inline int elv_set_request_ioq(struct request_queue *q,
+ struct request *rq, struct bio *bio, gfp_t gfp_mask)
{
return 0;
}
@@ -469,7 +478,8 @@ elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
static inline void
elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }

-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *
+elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
{
return NULL;
}
@@ -569,8 +579,8 @@ static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
{
return 1;
}
-static inline int
-elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+static inline int elv_set_request_ioq(struct request_queue *q,
+ struct request *rq, struct bio *bio, gfp_t gfp_mask)
{
return 0;
}
@@ -578,7 +588,8 @@ elv_set_request_ioq(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
static inline void
elv_reset_request_ioq(struct request_queue *q, struct request *rq) { }

-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+ struct bio *bio)
{
return NULL;
}
diff --git a/block/elevator.c b/block/elevator.c
index bc43edd..4ed37b6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -865,7 +865,8 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
return NULL;
}

-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+ struct bio *bio, gfp_t gfp_mask)
{
struct elevator_queue *e = q->elevator;

@@ -874,10 +875,10 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
* ioq per io group
*/
if (elv_iosched_single_ioq(e))
- return elv_set_request_ioq(q, rq, gfp_mask);
+ return elv_set_request_ioq(q, rq, bio, gfp_mask);

if (e->ops->elevator_set_req_fn)
- return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+ return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);

rq->elevator_private = NULL;
return 0;
@@ -1279,19 +1280,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
EXPORT_SYMBOL(elv_select_sched_queue);

/*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group bio belongs to.
*
* If fair queuing is enabled, determine the io group of task and retrieve
* the ioq pointer from that. This is used by only single queue ioschedulers
* for retrieving the queue associated with the group to decide whether the
* new bio can do a front merge or not.
*/
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
{
/* Fair queuing is not enabled. There is only one queue. */
if (!elv_iosched_fair_queuing_enabled(q->elevator))
return q->elevator->sched_queue;

- return elv_ioq_sched_queue(elv_lookup_ioq_current(q));
+ return elv_ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
}
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 3d4e31c..0ace96e 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -22,7 +22,8 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
typedef int (elevator_may_queue_fn) (struct request_queue *, int);

-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *,
+ struct bio *bio, gfp_t);
typedef void (elevator_put_req_fn) (struct request *);
typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -146,7 +147,8 @@ extern void elv_unregister_queue(struct request_queue *q);
extern int elv_may_queue(struct request_queue *, int);
extern void elv_abort_queue(struct request_queue *);
extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+ struct bio *bio, gfp_t);
extern void elv_put_request(struct request_queue *, struct request *);
extern void elv_drain_elevator(struct request_queue *);

@@ -275,6 +277,20 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
#endif /* ELV_IOSCHED_FAIR_QUEUING */
extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions, which
+ * determine whether an rq/bio is sync or not. There are cases, like during
+ * merging and during request allocation, where we don't have an rq but only
+ * a bio and need to find out if this bio will be considered as sync or async
+ * by the elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+ if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+ return 1;
+ return 0;
+}
#endif /* CONFIG_BLOCK */
#endif
--
1.6.0.6

2009-09-24 19:28:05

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 25/28] io-controller: Per cgroup request descriptor support

o Currently a request queue has a fixed number of request descriptors for
sync and async requests. Once the request descriptors are consumed, new
processes are put to sleep and effectively become serialized. Because the
sync and async pools are separate, async requests don't impact sync ones,
but fairness between async requests is not achievable once the request
descriptors become the bottleneck.

o Make request descriptors per io group so that if there is lots of IO
going on in one cgroup, it does not impact the IO of other groups.

o This patch implements per cgroup request descriptors. The request pool
per queue is still common, but every group will have its own wait list and
its own count of request descriptors allocated to that group for sync and
async queues. So effectively request_list becomes a per io group property
and not a global request queue feature.

o Currently one can set q->nr_requests to limit the request descriptors
allocated for the queue. Now there is another tunable, q->nr_group_requests,
which controls the request descriptor limit per group. q->nr_requests
supersedes q->nr_group_requests to make sure that if there are lots of
groups present, we don't end up allocating too many request descriptors on
the queue. The sketch after this description illustrates how the two limits
interact.
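
Below is a small userspace sketch of that admission logic. The types and
names are simplified stand-ins rather than the kernel's, and the batching
and congestion handling is left out; it only models the two hard cut-offs
get_request() applies with this patch.

#include <stdio.h>
#include <stdbool.h>

struct group {
	int count;		/* requests currently allocated to this group */
};

struct queue {
	int count;		/* requests allocated across all groups */
	int nr_requests;	/* queue-wide limit */
	int nr_group_requests;	/* per-group limit */
};

static bool can_allocate(struct queue *q, struct group *g)
{
	/* queue-wide hard limit, mirrors the 3/2 * nr_requests check */
	if (q->count >= (3 * q->nr_requests / 2))
		return false;
	/* per-group hard limit, mirrors the 3/2 * nr_group_requests check */
	if (g->count >= (3 * q->nr_group_requests / 2))
		return false;
	q->count++;
	g->count++;
	return true;
}

int main(void)
{
	struct queue q = { .count = 0, .nr_requests = 512,
			   .nr_group_requests = 128 };
	struct group busy = { 0 }, other = { 0 };
	int admitted = 0;

	/* A busy group hits its own limit long before the queue-wide one */
	while (can_allocate(&q, &busy))
		admitted++;
	printf("busy group admitted %d requests (queue total %d)\n",
	       admitted, q.count);

	/* Another group is unaffected and can still allocate */
	printf("other group can still allocate: %s\n",
	       can_allocate(&q, &other) ? "yes" : "no");
	return 0;
}

With the default limits used above (512 for the queue, 128 per group), the
busy group saturates at 192 outstanding requests while the other group can
still get descriptors, which is the isolation this patch is after.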

Signed-off-by: Nauman Rafique <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/blk-core.c | 323 +++++++++++++++++++++++++++++++++---------
block/blk-settings.c | 1 +
block/blk-sysfs.c | 59 ++++++--
block/elevator-fq.c | 36 +++++
block/elevator-fq.h | 29 ++++
block/elevator.c | 7 +-
include/linux/blkdev.h | 47 ++++++-
include/trace/events/block.h | 6 +-
kernel/trace/blktrace.c | 6 +-
9 files changed, 427 insertions(+), 87 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 47cce59..a84dfb7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -460,20 +460,53 @@ void blk_cleanup_queue(struct request_queue *q)
}
EXPORT_SYMBOL(blk_cleanup_queue);

-static int blk_init_free_list(struct request_queue *q)
+struct request_list *
+blk_get_request_list(struct request_queue *q, struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+ /*
+ * Determine which request list bio will be allocated from. This
+ * is dependent on which io group bio belongs to
+ */
+ return elv_get_request_list_bio(q, bio);
+#else
+ return &q->rq;
+#endif
+}
+
+static struct request_list *rq_rl(struct request_queue *q, struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+ int priv = rq->cmd_flags & REQ_ELVPRIV;
+
+ return elv_get_request_list_rq(q, rq, priv);
+#else
+ return &q->rq;
+#endif
+}
+
+void blk_init_request_list(struct request_list *rl)
{
- struct request_list *rl = &q->rq;

rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
- rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
- rl->elvpriv = 0;
init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}

- rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
- mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+ /*
+ * In case of group scheduling, request list is inside group and is
+ * initialized when group is instantiated.
+ */
+#ifndef CONFIG_GROUP_IOSCHED
+ blk_init_request_list(&q->rq);
+#endif
+ q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+ mempool_alloc_slab, mempool_free_slab,
+ request_cachep, q->node);

- if (!rl->rq_pool)
+ if (!q->rq_data.rq_pool)
return -ENOMEM;

return 0;
@@ -581,6 +614,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
q->queue_flags = QUEUE_FLAG_DEFAULT;
q->queue_lock = lock;

+ /* init starved waiter wait queue */
+ init_waitqueue_head(&q->rq_data.starved_wait);
+
/*
* This also sets hw/phys segments, boundary and size
*/
@@ -615,14 +651,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
{
if (rq->cmd_flags & REQ_ELVPRIV)
elv_put_request(q, rq);
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_data.rq_pool);
}

static struct request *
blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
gfp_t gfp_mask)
{
- struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+ struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);

if (!rq)
return NULL;
@@ -633,7 +669,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,

if (priv) {
if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
- mempool_free(rq, q->rq.rq_pool);
+ mempool_free(rq, q->rq_data.rq_pool);
return NULL;
}
rq->cmd_flags |= REQ_ELVPRIV;
@@ -676,18 +712,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
ioc->last_waited = jiffies;
}

-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
-
- if (rl->count[sync] < queue_congestion_off_threshold(q))
+ if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, sync);

- if (rl->count[sync] + 1 <= q->nr_requests) {
+ if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+ blk_clear_queue_full(q, sync);
+
+ if (rl->count[sync] + 1 <= q->nr_group_requests) {
if (waitqueue_active(&rl->wait[sync]))
wake_up(&rl->wait[sync]);
-
- blk_clear_queue_full(q, sync);
}
}

@@ -695,63 +731,130 @@ static void __freed_request(struct request_queue *q, int sync)
* A request has just been released. Account for it, update the full and
* congestion status, wake up any waiters. Called under q->queue_lock.
*/
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+ struct request_list *rl)
{
- struct request_list *rl = &q->rq;
+ /*
+ * There is a window during request allocation where request is
+ * mapped to one group but by the time a queue for the group is
+ * allocated, it is possible that original cgroup/io group has been
+ * deleted and now io queue is allocated in a different group (root)
+ * altogether.
+ *
+ * One solution to the problem is that rq should take io group
+ * reference. But it looks too much to do that to solve this issue.
+ * The only side effect of this hard-to-hit issue seems to be that
+ * we will try to decrement the rl->count for a request list which
+ * did not allocate that request. Check for rl->count going less than
+ * zero and do not decrement it if that's the case.
+ */
+
+ if (priv && rl->count[sync] > 0)
+ rl->count[sync]--;
+
+ BUG_ON(!q->rq_data.count[sync]);
+ q->rq_data.count[sync]--;

- rl->count[sync]--;
if (priv)
- rl->elvpriv--;
+ q->rq_data.elvpriv--;

- __freed_request(q, sync);
+ __freed_request(q, sync, rl);

if (unlikely(rl->starved[sync ^ 1]))
- __freed_request(q, sync ^ 1);
+ __freed_request(q, sync ^ 1, rl);
+
+ /* Wake up the starved process on global list, if any */
+ if (unlikely(q->rq_data.starved)) {
+ if (waitqueue_active(&q->rq_data.starved_wait))
+ wake_up(&q->rq_data.starved_wait);
+ q->rq_data.starved--;
+ }
+}
+
+/*
+ * Returns whether one can sleep on this request list or not. There are
+ * cases (elevator switch) where request list might not have allocated
+ * any request descriptor but we deny request allocation due to global
+ * limits. In that case one should sleep on global list as on this request
+ * list no wakeup will take place.
+ *
+ * Also sets the request list starved flag if there are no requests pending
+ * in the direction of rq.
+ *
+ * Return 1 --> sleep on request list, 0 --> sleep on global list
+ */
+static int can_sleep_on_request_list(struct request_list *rl, int is_sync)
+{
+ if (unlikely(rl->count[is_sync] == 0)) {
+ /*
+ * If there is a request pending in other direction
+ * in same io group, then set the starved flag of
+ * the group request list. Otherwise, we need to
+ * make this process sleep in global starved list
+ * to make sure it will not sleep indefinitely.
+ */
+ if (rl->count[is_sync ^ 1] != 0) {
+ rl->starved[is_sync] = 1;
+ return 1;
+ } else
+ return 0;
+ }
+
+ return 1;
}

/*
* Get a free request, queue_lock must be held.
- * Returns NULL on failure, with queue_lock held.
+ * Returns NULL on failure, with queue_lock held. Also sets the "reason" field
+ * in case of failure. This reason field helps the caller decide whether to sleep
+ * on per group list or global per queue list.
+ * reason = 0 sleep on per group list
+ * reason = 1 sleep on global list
+ *
* Returns !NULL on success, with queue_lock *not held*.
*/
static struct request *get_request(struct request_queue *q, int rw_flags,
- struct bio *bio, gfp_t gfp_mask)
+ struct bio *bio, gfp_t gfp_mask,
+ struct request_list *rl, int *reason)
{
struct request *rq = NULL;
- struct request_list *rl = &q->rq;
struct io_context *ioc = NULL;
const bool is_sync = rw_is_sync(rw_flags) != 0;
int may_queue, priv;
+ int sleep_on_global = 0;

may_queue = elv_may_queue(q, rw_flags);
if (may_queue == ELV_MQUEUE_NO)
goto rq_starved;

- if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
- if (rl->count[is_sync]+1 >= q->nr_requests) {
- ioc = current_io_context(GFP_ATOMIC, q->node);
- /*
- * The queue will fill after this allocation, so set
- * it as full, and mark this process as "batching".
- * This process will be allowed to complete a batch of
- * requests, others will be blocked.
- */
- if (!blk_queue_full(q, is_sync)) {
- ioc_set_batching(q, ioc);
- blk_set_queue_full(q, is_sync);
- } else {
- if (may_queue != ELV_MQUEUE_MUST
- && !ioc_batching(q, ioc)) {
- /*
- * The queue is full and the allocating
- * process is not a "batcher", and not
- * exempted by the IO scheduler
- */
- goto out;
- }
+ if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+ blk_set_queue_congested(q, is_sync);
+
+ /* queue full seems redundant now */
+ if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+ blk_set_queue_full(q, is_sync);
+
+ if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+ ioc = current_io_context(GFP_ATOMIC, q->node);
+ /*
+ * The queue request descriptor group will fill after this
+ * allocation, so set it as full, and mark this process as
+ * "batching". This process will be allowed to complete a
+ * batch of requests, others will be blocked.
+ */
+ if (rl->count[is_sync] <= q->nr_group_requests)
+ ioc_set_batching(q, ioc);
+ else {
+ if (may_queue != ELV_MQUEUE_MUST
+ && !ioc_batching(q, ioc)) {
+ /*
+ * The queue is full and the allocating
+ * process is not a "batcher", and not
+ * exempted by the IO scheduler
+ */
+ goto out;
}
}
- blk_set_queue_congested(q, is_sync);
}

/*
@@ -759,21 +862,60 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* limit of requests, otherwise we could have thousands of requests
* allocated with any setting of ->nr_requests
*/
- if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+ if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2)) {
+ /*
+ * Queue is too full for allocation. Which list should the task
+ * sleep on? Generally it should sleep on its
+ * request list but if elevator switch is happening, in that
+ * window, request descriptors are allocated from global
+ * pool and are not accounted against any particular request
+ * list as group is going away.
+ *
+ * So it might happen that request list does not have any
+ * requests allocated at all and if process sleeps on per
+ * group request list, it will not be woken up. In such case,
+ * make it sleep on global starved list.
+ */
+ if (test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)
+ || !can_sleep_on_request_list(rl, is_sync))
+ sleep_on_global = 1;
+ goto out;
+ }
+
+ /*
+ * Allocation of request is allowed from queue perspective. Now check
+ * from per group request list
+ */
+
+ if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
goto out;

- rl->count[is_sync]++;
rl->starved[is_sync] = 0;

+ q->rq_data.count[is_sync]++;
+
priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
- if (priv)
- rl->elvpriv++;
+ if (priv) {
+ q->rq_data.elvpriv++;
+ /*
+ * Account the request to request list only if request is
+ * going to elevator. During elevator switch, there will
+ * be small window where group is going away and new group
+ * will not be allocated till elevator switch is complete.
+ * So till then instead of slowing down the application,
+ * we will continue to allocate request from total common
+ * pool instead of per group limit
+ */
+ rl->count[is_sync]++;
+ }

if (blk_queue_io_stat(q))
rw_flags |= REQ_IO_STAT;
spin_unlock_irq(q->queue_lock);

rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
@@ -783,7 +925,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* wait queue, but this is pretty rare.
*/
spin_lock_irq(q->queue_lock);
- freed_request(q, is_sync, priv);
+ freed_request(q, is_sync, priv, rl);

/*
* in the very unlikely event that allocation failed and no
@@ -793,9 +935,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
* rq mempool into READ and WRITE
*/
rq_starved:
- if (unlikely(rl->count[is_sync] == 0))
- rl->starved[is_sync] = 1;
-
+ if (!can_sleep_on_request_list(rl, is_sync))
+ sleep_on_global = 1;
goto out;
}

@@ -810,6 +951,8 @@ rq_starved:

trace_block_getrq(q, bio, rw_flags & 1);
out:
+ if (reason && sleep_on_global)
+ *reason = 1;
return rq;
}

@@ -823,16 +966,39 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
struct bio *bio)
{
const bool is_sync = rw_is_sync(rw_flags) != 0;
+ int sleep_on_global = 0;
struct request *rq;
+ struct request_list *rl = blk_get_request_list(q, bio);

- rq = get_request(q, rw_flags, bio, GFP_NOIO);
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl, &sleep_on_global);
while (!rq) {
DEFINE_WAIT(wait);
struct io_context *ioc;
- struct request_list *rl = &q->rq;

- prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
- TASK_UNINTERRUPTIBLE);
+ if (sleep_on_global) {
+ /*
+ * Task failed allocation and needs to wait and
+ * try again. There are no requests pending from
+ * the io group hence need to sleep on global
+ * wait queue. Most likely the allocation failed
+ * because of memory issues.
+ */
+
+ q->rq_data.starved++;
+ prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+ &wait, TASK_UNINTERRUPTIBLE);
+ } else {
+ /*
+ * We are about to sleep on a request list and we
+ * drop queue lock. After waking up, we will do
+ * finish_wait() on request list and in the mean
+ * time group might be gone. Take a reference to
+ * the group now.
+ */
+ prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+ TASK_UNINTERRUPTIBLE);
+ elv_get_rl_iog(rl);
+ }

trace_block_sleeprq(q, bio, rw_flags & 1);

@@ -850,9 +1016,25 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
ioc_set_batching(q, ioc);

spin_lock_irq(q->queue_lock);
- finish_wait(&rl->wait[is_sync], &wait);

- rq = get_request(q, rw_flags, bio, GFP_NOIO);
+ if (sleep_on_global) {
+ finish_wait(&q->rq_data.starved_wait, &wait);
+ sleep_on_global = 0;
+ } else {
+ /*
+ * We had taken a reference to the rl/iog. Put that now
+ */
+ finish_wait(&rl->wait[is_sync], &wait);
+ elv_put_rl_iog(rl);
+ }
+
+ /*
+ * After the sleep check the rl again in case the cgroup the bio
+ * belonged to is gone and it is mapped to root group now
+ */
+ rl = blk_get_request_list(q, bio);
+ rq = get_request(q, rw_flags, bio, GFP_NOIO, rl,
+ &sleep_on_global);
};

return rq;
@@ -861,14 +1043,16 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
{
struct request *rq;
+ struct request_list *rl;

BUG_ON(rw != READ && rw != WRITE);

spin_lock_irq(q->queue_lock);
+ rl = blk_get_request_list(q, NULL);
if (gfp_mask & __GFP_WAIT) {
rq = get_request_wait(q, rw, NULL);
} else {
- rq = get_request(q, rw, NULL, gfp_mask);
+ rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
if (!rq)
spin_unlock_irq(q->queue_lock);
}
@@ -1085,12 +1269,19 @@ void __blk_put_request(struct request_queue *q, struct request *req)
if (req->cmd_flags & REQ_ALLOCED) {
int is_sync = rq_is_sync(req) != 0;
int priv = req->cmd_flags & REQ_ELVPRIV;
+ struct request_list *rl = rq_rl(q, req);

BUG_ON(!list_empty(&req->queuelist));
BUG_ON(!hlist_unhashed(&req->hash));

+ /*
+ * Call freed_request() before actually freeing the request.
+ * Freeing the request might cause freeing up of the io queue,
+ * and in turn the io group. That means the rl pointer will no
+ * longer be valid.
+ */
+ freed_request(q, is_sync, priv, rl);
blk_free_request(q, req);
- freed_request(q, is_sync, priv);
}
}
EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 476d870..c3102c7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -149,6 +149,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
* set defaults
*/
q->nr_requests = BLKDEV_MAX_RQ;
+ q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;

q->make_request_fn = mfn;
blk_queue_dma_alignment(q, 511);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d3aa2aa..0ddf245 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,42 +38,67 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
static ssize_t
queue_requests_store(struct request_queue *q, const char *page, size_t count)
{
- struct request_list *rl = &q->rq;
+ struct request_list *rl;
unsigned long nr;
int ret = queue_var_store(&nr, page, count);
if (nr < BLKDEV_MIN_RQ)
nr = BLKDEV_MIN_RQ;

spin_lock_irq(q->queue_lock);
+ rl = blk_get_request_list(q, NULL);
q->nr_requests = nr;
blk_queue_congestion_threshold(q);

- if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+ if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, BLK_RW_SYNC);
- else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+ else if (q->rq_data.count[BLK_RW_SYNC] <
+ queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, BLK_RW_SYNC);

- if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+ if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, BLK_RW_ASYNC);
- else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+ else if (q->rq_data.count[BLK_RW_ASYNC] <
+ queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, BLK_RW_ASYNC);

- if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+ if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
blk_set_queue_full(q, BLK_RW_SYNC);
- } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+ } else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
blk_clear_queue_full(q, BLK_RW_SYNC);
wake_up(&rl->wait[BLK_RW_SYNC]);
}

- if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+ if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
blk_set_queue_full(q, BLK_RW_ASYNC);
- } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+ } else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
blk_clear_queue_full(q, BLK_RW_ASYNC);
wake_up(&rl->wait[BLK_RW_ASYNC]);
}
spin_unlock_irq(q->queue_lock);
return ret;
}
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long nr;
+ int ret = queue_var_store(&nr, page, count);
+
+ if (nr < BLKDEV_MIN_RQ)
+ nr = BLKDEV_MIN_RQ;
+
+ spin_lock_irq(q->queue_lock);
+ q->nr_group_requests = nr;
+ spin_unlock_irq(q->queue_lock);
+ return ret;
+}
+#endif

static ssize_t queue_ra_show(struct request_queue *q, char *page)
{
@@ -240,6 +265,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
.store = queue_requests_store,
};

+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+ .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_group_requests_show,
+ .store = queue_group_requests_store,
+};
+#endif
+
static struct queue_sysfs_entry queue_ra_entry = {
.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
.show = queue_ra_show,
@@ -314,6 +347,9 @@ static struct queue_sysfs_entry queue_iostats_entry = {

static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+ &queue_group_requests_entry.attr,
+#endif
&queue_ra_entry.attr,
&queue_max_hw_sectors_entry.attr,
&queue_max_sectors_entry.attr,
@@ -393,12 +429,11 @@ static void blk_release_queue(struct kobject *kobj)
{
struct request_queue *q =
container_of(kobj, struct request_queue, kobj);
- struct request_list *rl = &q->rq;

blk_sync_queue(q);

- if (rl->rq_pool)
- mempool_destroy(rl->rq_pool);
+ if (q->rq_data.rq_pool)
+ mempool_destroy(q->rq_data.rq_pool);

if (q->queue_tags)
__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 3089175..5ecc519 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1245,6 +1245,39 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
struct io_cgroup, css);
}

+struct request_list *
+elv_get_request_list_bio(struct request_queue *q, struct bio *bio)
+{
+ struct io_group *iog;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ iog = q->elevator->efqd->root_group;
+ else
+ iog = elv_io_get_io_group_bio(q, bio, 1);
+
+ BUG_ON(!iog);
+ return &iog->rl;
+}
+
+struct request_list *
+elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv)
+{
+ struct io_group *iog;
+
+ if (!elv_iosched_fair_queuing_enabled(q->elevator))
+ return &q->elevator->efqd->root_group->rl;
+
+ BUG_ON(priv && !rq->ioq);
+
+ if (priv)
+ iog = ioq_to_io_group(rq->ioq);
+ else
+ iog = q->elevator->efqd->root_group;
+
+ BUG_ON(!iog);
+ return &iog->rl;
+}
+
/*
* Search the io_group for efqd into the hash table (by now only a list)
* of bgrp. Must be called under rcu_read_lock().
@@ -1601,6 +1634,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
elv_get_iog(iog);
io_group_path(iog);

+ blk_init_request_list(&iog->rl);
+
if (leaf == NULL) {
leaf = iog;
prev = leaf;
@@ -1830,6 +1865,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
for (i = 0; i < IO_IOPRIO_CLASSES; i++)
iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;

+ blk_init_request_list(&iog->rl);
spin_lock_irq(&iocg->lock);
rcu_assign_pointer(iog->key, key);
hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index be66d28..c9ea0a1 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -140,6 +140,9 @@ struct io_group {

/* Single ioq per group, used for noop, deadline, anticipatory */
struct io_queue *ioq;
+
+ /* request list associated with the group */
+ struct request_list rl;
};

struct io_cgroup {
@@ -440,11 +443,31 @@ static inline void elv_get_iog(struct io_group *iog)
atomic_inc(&iog->ref);
}

+static inline struct io_group *rl_iog(struct request_list *rl)
+{
+ return container_of(rl, struct io_group, rl);
+}
+
+static inline void elv_get_rl_iog(struct request_list *rl)
+{
+ elv_get_iog(rl_iog(rl));
+}
+
+static inline void elv_put_rl_iog(struct request_list *rl)
+{
+ elv_put_iog(rl_iog(rl));
+}
+
extern int elv_set_request_ioq(struct request_queue *q, struct request *rq,
struct bio *bio, gfp_t gfp_mask);
extern void elv_reset_request_ioq(struct request_queue *q, struct request *rq);
extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
struct bio *bio);
+struct request_list *
+elv_get_request_list_bio(struct request_queue *q, struct bio *bio);
+
+struct request_list *
+elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv);

#else /* !GROUP_IOSCHED */

@@ -484,6 +507,9 @@ elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
return NULL;
}

+static inline void elv_get_rl_iog(struct request_list *rl) { }
+static inline void elv_put_rl_iog(struct request_list *rl) { }
+
#endif /* GROUP_IOSCHED */

extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
@@ -594,6 +620,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
return NULL;
}

+static inline void elv_get_rl_iog(struct request_list *rl) { }
+static inline void elv_put_rl_iog(struct request_list *rl) { }
+
#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _ELV_SCHED_H */
#endif /* CONFIG_BLOCK */
diff --git a/block/elevator.c b/block/elevator.c
index 4ed37b6..b23db03 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -678,7 +678,7 @@ void elv_quiesce_start(struct request_queue *q)
* make sure we don't have any requests in flight
*/
elv_drain_elevator(q);
- while (q->rq.elvpriv) {
+ while (q->rq_data.elvpriv) {
__blk_run_queue(q);
spin_unlock_irq(q->queue_lock);
msleep(10);
@@ -777,8 +777,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
}

if (unplug_it && blk_queue_plugged(q)) {
- int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
- - queue_in_flight(q);
+ int nrq = q->rq_data.count[BLK_RW_SYNC] +
+ q->rq_data.count[BLK_RW_ASYNC] -
+ queue_in_flight(q);

if (nrq >= q->unplug_thresh)
__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7cff5f2..74deb17 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
struct sg_io_hdr;

#define BLKDEV_MIN_RQ 4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ 512 /* Default maximum for queue */
+#define BLKDEV_MAX_GROUP_RQ 128 /* Default maximum per group*/
+#else
#define BLKDEV_MAX_RQ 128 /* Default maximum */
+/*
+ * This is equivalent to the case of only one group (the root group) being
+ * present. Let it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ BLKDEV_MAX_RQ /* Default maximum */
+#endif

struct request;
typedef void (rq_end_io_fn)(struct request *, int);

struct request_list {
/*
- * count[], starved[], and wait[] are indexed by
+ * count[], starved and wait[] are indexed by
* BLK_RW_SYNC/BLK_RW_ASYNC
*/
int count[2];
int starved[2];
+ wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+ /*
+ * Per queue request descriptor count. This is in addition to per
+ * cgroup count
+ */
+ int count[2];
int elvpriv;
mempool_t *rq_pool;
- wait_queue_head_t wait[2];
+ int starved;
+ /*
+ * Global list for starved tasks. A task will be queued here if
+ * it could not allocate request descriptor and the associated
+ * group request list does not have any requests pending.
+ */
+ wait_queue_head_t starved_wait;
};

/*
@@ -339,10 +369,17 @@ struct request_queue
struct request *last_merge;
struct elevator_queue *elevator;

+#ifndef CONFIG_GROUP_IOSCHED
/*
* the queue request freelist, one for reads and one for writes
+ * In case of group io scheduling, this request list is per group
+ * and is present in group data structure.
*/
struct request_list rq;
+#endif
+
+ /* Contains request pool and other data like starved data */
+ struct request_data rq_data;

request_fn_proc *request_fn;
make_request_fn *make_request_fn;
@@ -405,6 +442,8 @@ struct request_queue
* queue settings
*/
unsigned long nr_requests; /* Max # of requests */
+ /* Max # of per io group requests */
+ unsigned long nr_group_requests;
unsigned int nr_congestion_on;
unsigned int nr_congestion_off;
unsigned int nr_batching;
@@ -784,6 +823,10 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
struct scsi_ioctl_command __user *);

+extern void blk_init_request_list(struct request_list *rl);
+
+extern struct request_list *blk_get_request_list(struct request_queue *q,
+ struct bio *bio);
/*
* A queue has just exitted congestion. Note this in the global counter of
* congested queues, and wake up anyone who was waiting for requests to be
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 9a74b46..af6c9e5 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -397,7 +397,8 @@ TRACE_EVENT(block_unplug_timer,
),

TP_fast_assign(
- __entry->nr_rq = q->rq.count[READ] + q->rq.count[WRITE];
+ __entry->nr_rq = q->rq_data.count[READ] +
+ q->rq_data.count[WRITE];
memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
),

@@ -416,7 +417,8 @@ TRACE_EVENT(block_unplug_io,
),

TP_fast_assign(
- __entry->nr_rq = q->rq.count[READ] + q->rq.count[WRITE];
+ __entry->nr_rq = q->rq_data.count[READ] +
+ q->rq_data.count[WRITE];
memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
),

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 7a34cb5..9a03980 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -786,7 +786,8 @@ static void blk_add_trace_unplug_io(struct request_queue *q)
struct blk_trace *bt = q->blk_trace;

if (bt) {
- unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
+ unsigned int pdu = q->rq_data.count[READ] +
+ q->rq_data.count[WRITE];
__be64 rpdu = cpu_to_be64(pdu);

__blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_IO, 0,
@@ -799,7 +800,8 @@ static void blk_add_trace_unplug_timer(struct request_queue *q)
struct blk_trace *bt = q->blk_trace;

if (bt) {
- unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE];
+ unsigned int pdu = q->rq_data.count[READ] +
+ q->rq_data.count[WRITE];
__be64 rpdu = cpu_to_be64(pdu);

__blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_TIMER, 0,
--
1.6.0.6

2009-09-24 19:26:57

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 26/28] io-controller: Per io group bdi congestion interface

o So far there used to be only one pair of request descriptor queues
(one for sync and one for async) per device, and the number of requests
allocated used to decide whether the associated bdi is congested or not.

Now with the per io group request descriptor infrastructure, there is a
pair of request descriptor queues per io group per device. So it might
happen that the overall request queue is not congested but the particular
io group a bio belongs to is congested.

Or it could be the other way around: the group is not congested but the
overall queue is congested. This can happen if the user has not properly
set the request descriptor limits for the queue and groups
(q->nr_requests < nr_groups * q->nr_group_requests).

o Hence there is a need for a new interface which can query device
congestion status per group. This group is determined by the "struct page"
the IO will be done for. If the page is NULL, then the group is determined
from the current task context.

o This patch introduces a new set of functions, bdi_*_congested_group(),
which take "struct page" as an additional argument. These functions call
into the block layer and in turn the elevator to find out whether the io
group the page will go into is congested or not. A sketch of the per-group
congestion thresholds these checks rely on follows this description.

o Currently I have introduced the core functions and migrated most of the
users, but there might still be some left. This is an ongoing TODO item.
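
As a rough model of the per-group congestion state these functions end up
consulting, the sketch below reproduces the on/off thresholds computed by
elv_io_group_congestion_threshold() and the resulting hysteresis. It is
plain userspace code with simplified names; the kernel keeps this state in
the iog->state bits and wakes congestion waiters when it clears.

#include <stdio.h>
#include <stdbool.h>

struct iog_model {
	int count;		/* requests currently held by the group */
	int nr_congestion_on;	/* mark congested at/above this count */
	int nr_congestion_off;	/* clear congestion below this count */
	bool congested;
};

/* Same arithmetic as elv_io_group_congestion_threshold() in this patch */
static void set_thresholds(struct iog_model *iog, int nr_group_requests)
{
	int nr;

	nr = nr_group_requests - (nr_group_requests / 8) + 1;
	if (nr > nr_group_requests)
		nr = nr_group_requests;
	iog->nr_congestion_on = nr;

	nr = nr_group_requests - (nr_group_requests / 8)
		- (nr_group_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	iog->nr_congestion_off = nr;
}

static void alloc_request(struct iog_model *iog)
{
	if (++iog->count >= iog->nr_congestion_on)
		iog->congested = true;
}

static void free_request(struct iog_model *iog)
{
	if (--iog->count < iog->nr_congestion_off)
		iog->congested = false;
}

int main(void)
{
	struct iog_model iog = { 0 };

	set_thresholds(&iog, 128);	/* default BLKDEV_MAX_GROUP_RQ */
	printf("congested at %d, uncongested below %d\n",
	       iog.nr_congestion_on, iog.nr_congestion_off);

	while (!iog.congested)
		alloc_request(&iog);
	printf("group congested with %d requests outstanding\n", iog.count);

	while (iog.congested)
		free_request(&iog);
	printf("group clears again at %d requests\n", iog.count);
	return 0;
}

With the default nr_group_requests of 128 this marks the group congested
around 113 outstanding requests and clears it again once the count drops
below 103, so callers of the congested_group interfaces back off per group
rather than per device.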

Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/blk-core.c | 26 ++++++++
block/blk-sysfs.c | 6 +-
block/elevator-fq.c | 135 +++++++++++++++++++++++++++++++++++++++++++
block/elevator-fq.h | 24 +++++++-
drivers/md/dm-table.c | 11 ++-
drivers/md/dm.c | 7 +-
drivers/md/dm.h | 3 +-
drivers/md/linear.c | 7 ++-
drivers/md/multipath.c | 7 ++-
drivers/md/raid0.c | 6 +-
drivers/md/raid1.c | 9 ++-
drivers/md/raid10.c | 6 +-
drivers/md/raid5.c | 2 +-
fs/afs/write.c | 8 ++-
fs/btrfs/disk-io.c | 6 +-
fs/btrfs/extent_io.c | 12 ++++
fs/btrfs/volumes.c | 8 ++-
fs/cifs/file.c | 11 ++++
fs/ext2/ialloc.c | 2 +-
fs/gfs2/aops.c | 12 ++++
fs/nilfs2/segbuf.c | 3 +-
fs/xfs/linux-2.6/xfs_aops.c | 2 +-
fs/xfs/linux-2.6/xfs_buf.c | 2 +-
include/linux/backing-dev.h | 63 +++++++++++++++++++-
include/linux/blkdev.h | 5 ++
mm/backing-dev.c | 74 ++++++++++++++++++++++-
mm/page-writeback.c | 11 ++++
mm/readahead.c | 2 +-
28 files changed, 430 insertions(+), 40 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a84dfb7..83ba5a0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -90,6 +90,27 @@ void blk_queue_congestion_threshold(struct request_queue *q)
q->nr_congestion_off = nr;
}

+#ifdef CONFIG_GROUP_IOSCHED
+int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
+ struct page *page)
+{
+ int ret = 0;
+ struct request_queue *q = bdi->unplug_io_data;
+
+ if (!q || !q->elevator)
+ return bdi_congested(bdi, bdi_bits);
+
+ /* Do we need to hold queue lock? */
+ if (bdi_bits & (1 << BDI_sync_congested))
+ ret |= elv_page_io_group_congested(q, page, 1);
+
+ if (bdi_bits & (1 << BDI_async_congested))
+ ret |= elv_page_io_group_congested(q, page, 0);
+
+ return ret;
+}
+#endif
+
/**
* blk_get_backing_dev_info - get the address of a queue's backing_dev_info
* @bdev: device
@@ -721,6 +742,8 @@ static void __freed_request(struct request_queue *q, int sync,
if (q->rq_data.count[sync] + 1 <= q->nr_requests)
blk_clear_queue_full(q, sync);

+ elv_freed_request(rl, sync);
+
if (rl->count[sync] + 1 <= q->nr_group_requests) {
if (waitqueue_active(&rl->wait[sync]))
wake_up(&rl->wait[sync]);
@@ -830,6 +853,9 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, is_sync);

+ /* check if io group will get congested after this allocation */
+ elv_get_request(rl, is_sync);
+
/* queue full seems redundant now */
if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
blk_set_queue_full(q, is_sync);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0ddf245..3419e1a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -83,9 +83,8 @@ static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
return queue_var_show(q->nr_group_requests, (page));
}

-static ssize_t
-queue_group_requests_store(struct request_queue *q, const char *page,
- size_t count)
+static ssize_t queue_group_requests_store(struct request_queue *q,
+ const char *page, size_t count)
{
unsigned long nr;
int ret = queue_var_store(&nr, page, count);
@@ -95,6 +94,7 @@ queue_group_requests_store(struct request_queue *q, const char *page,

spin_lock_irq(q->queue_lock);
q->nr_group_requests = nr;
+ elv_updated_nr_group_requests(q);
spin_unlock_irq(q->queue_lock);
return ret;
}
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 5ecc519..fd0a40f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1278,6 +1278,139 @@ elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv)
return &iog->rl;
}

+/* Set io group congestion on and off thresholds */
+void elv_io_group_congestion_threshold(struct request_queue *q,
+ struct io_group *iog)
+{
+ int nr;
+
+ nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
+ if (nr > q->nr_group_requests)
+ nr = q->nr_group_requests;
+ iog->nr_congestion_on = nr;
+
+ nr = q->nr_group_requests - (q->nr_group_requests / 8)
+ - (q->nr_group_requests / 16) - 1;
+ if (nr < 1)
+ nr = 1;
+ iog->nr_congestion_off = nr;
+}
+
+void elv_clear_iog_congested(struct io_group *iog, int sync)
+{
+ enum io_group_state bit;
+
+ bit = sync ? IOG_sync_congested : IOG_async_congested;
+ clear_bit(bit, &iog->state);
+ smp_mb__after_clear_bit();
+ congestion_wake_up(sync);
+}
+
+void elv_set_iog_congested(struct io_group *iog, int sync)
+{
+ enum io_group_state bit;
+
+ bit = sync ? IOG_sync_congested : IOG_async_congested;
+ set_bit(bit, &iog->state);
+}
+
+static inline int elv_iog_congested(struct io_group *iog, int iog_bits)
+{
+ return iog->state & iog_bits;
+}
+
+/* Determine if the io group the page maps to is congested or not */
+int elv_page_io_group_congested(struct request_queue *q, struct page *page,
+ int sync)
+{
+ struct io_group *iog;
+ int ret = 0;
+
+ rcu_read_lock();
+
+ iog = elv_io_get_io_group(q, page, 0);
+
+ if (!iog) {
+ /*
+ * Either cgroup got deleted or this is first request in the
+ * group and associated io group object has not been created
+ * yet. Map it to root group.
+ *
+ * TODO: Fix the case of group not created yet.
+ */
+ iog = q->elevator->efqd->root_group;
+ }
+
+ if (sync)
+ ret = elv_iog_congested(iog, 1 << IOG_sync_congested);
+ else
+ ret = elv_iog_congested(iog, 1 << IOG_async_congested);
+
+ if (ret)
+ elv_log_iog(q->elevator->efqd, iog, "iog congested=%d sync=%d"
+ " rl.count[sync]=%d nr_group_requests=%d",
+ ret, sync, iog->rl.count[sync], q->nr_group_requests);
+ rcu_read_unlock();
+ return ret;
+}
+
+static inline int
+elv_iog_congestion_on_threshold(struct io_group *iog)
+{
+ return iog->nr_congestion_on;
+}
+
+static inline int
+elv_iog_congestion_off_threshold(struct io_group *iog)
+{
+ return iog->nr_congestion_off;
+}
+
+void elv_freed_request(struct request_list *rl, int sync)
+{
+ struct io_group *iog = rl_iog(rl);
+
+ if (iog->rl.count[sync] < elv_iog_congestion_off_threshold(iog))
+ elv_clear_iog_congested(iog, sync);
+}
+
+void elv_get_request(struct request_list *rl, int sync)
+{
+ struct io_group *iog = rl_iog(rl);
+
+ if (iog->rl.count[sync]+1 >= elv_iog_congestion_on_threshold(iog))
+ elv_set_iog_congested(iog, sync);
+}
+
+static void iog_nr_requests_updated(struct io_group *iog)
+{
+ if (iog->rl.count[BLK_RW_SYNC] >= elv_iog_congestion_on_threshold(iog))
+ elv_set_iog_congested(iog, BLK_RW_SYNC);
+ else if (iog->rl.count[BLK_RW_SYNC] <
+ elv_iog_congestion_off_threshold(iog))
+ elv_clear_iog_congested(iog, BLK_RW_SYNC);
+
+ if (iog->rl.count[BLK_RW_ASYNC] >= elv_iog_congestion_on_threshold(iog))
+ elv_set_iog_congested(iog, BLK_RW_ASYNC);
+ else if (iog->rl.count[BLK_RW_ASYNC] <
+ elv_iog_congestion_off_threshold(iog))
+ elv_clear_iog_congested(iog, BLK_RW_ASYNC);
+}
+
+void elv_updated_nr_group_requests(struct request_queue *q)
+{
+ struct elv_fq_data *efqd;
+ struct hlist_node *n;
+ struct io_group *iog;
+
+ efqd = q->elevator->efqd;
+
+ hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
+ elv_io_group_congestion_threshold(q, iog);
+ iog_nr_requests_updated(iog);
+ }
+}
+
/*
* Search the io_group for efqd into the hash table (by now only a list)
* of bgrp. Must be called under rcu_read_lock().
@@ -1635,6 +1768,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
io_group_path(iog);

blk_init_request_list(&iog->rl);
+ elv_io_group_congestion_threshold(q, iog);

if (leaf == NULL) {
leaf = iog;
@@ -1866,6 +2000,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;

blk_init_request_list(&iog->rl);
+ elv_io_group_congestion_threshold(q, iog);
spin_lock_irq(&iocg->lock);
rcu_assign_pointer(iog->key, key);
hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index c9ea0a1..203250a 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -106,6 +106,13 @@ struct io_queue {
};

#ifdef CONFIG_GROUP_IOSCHED /* CONFIG_GROUP_IOSCHED */
+
+enum io_group_state {
+ IOG_async_congested, /* The async queue of group is getting full */
+ IOG_sync_congested, /* The sync queue of group is getting full */
+ IOG_unused, /* Available bits start here */
+};
+
struct io_group {
struct io_entity entity;
atomic_t ref;
@@ -141,6 +148,11 @@ struct io_group {
/* Single ioq per group, used for noop, deadline, anticipatory */
struct io_queue *ioq;

+ /* io group congestion on and off threshold for request descriptors */
+ unsigned int nr_congestion_on;
+ unsigned int nr_congestion_off;
+
+ unsigned long state;
/* request list associated with the group */
struct request_list rl;
};
@@ -468,6 +480,11 @@ elv_get_request_list_bio(struct request_queue *q, struct bio *bio);

struct request_list *
elv_get_request_list_rq(struct request_queue *q, struct request *rq, int priv);
+extern int elv_page_io_group_congested(struct request_queue *q,
+ struct page *page, int sync);
+extern void elv_freed_request(struct request_list *rl, int sync);
+extern void elv_get_request(struct request_list *rl, int sync);
+extern void elv_updated_nr_group_requests(struct request_queue *q);

#else /* !GROUP_IOSCHED */

@@ -506,9 +523,11 @@ elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
{
return NULL;
}
-
static inline void elv_get_rl_iog(struct request_list *rl) { }
static inline void elv_put_rl_iog(struct request_list *rl) { }
+static inline void elv_updated_nr_group_requests(struct request_queue *q) { }
+static inline void elv_freed_request(struct request_list *rl, int sync) { }
+static inline void elv_get_request(struct request_list *rl, int sync) { }

#endif /* GROUP_IOSCHED */

@@ -622,6 +641,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,

static inline void elv_get_rl_iog(struct request_list *rl) { }
static inline void elv_put_rl_iog(struct request_list *rl) { }
+static inline void elv_updated_nr_group_requests(struct request_queue *q) { }
+static inline void elv_freed_request(struct request_list *rl, int sync) { }
+static inline void elv_get_request(struct request_list *rl, int sync) { }

#endif /* CONFIG_ELV_FAIR_QUEUING */
#endif /* _ELV_SCHED_H */
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 1a6cb3c..bfca5c1 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1185,7 +1185,8 @@ int dm_table_resume_targets(struct dm_table *t)
return 0;
}

-int dm_table_any_congested(struct dm_table *t, int bdi_bits)
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+ int group)
{
struct dm_dev_internal *dd;
struct list_head *devices = dm_table_get_devices(t);
@@ -1195,9 +1196,11 @@ int dm_table_any_congested(struct dm_table *t, int bdi_bits)
struct request_queue *q = bdev_get_queue(dd->dm_dev.bdev);
char b[BDEVNAME_SIZE];

- if (likely(q))
- r |= bdi_congested(&q->backing_dev_info, bdi_bits);
- else
+ if (likely(q)) {
+ struct backing_dev_info *bdi = &q->backing_dev_info;
+ r |= group ? bdi_congested_group(bdi, bdi_bits, page)
+ : bdi_congested(bdi, bdi_bits);
+ } else
DMWARN_LIMIT("%s: any_congested: nonexistent device %s",
dm_device_name(t->md),
bdevname(dd->dm_dev.bdev, b));
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b4845b1..45ca047 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1613,7 +1613,8 @@ static void dm_unplug_all(struct request_queue *q)
}
}

-static int dm_any_congested(void *congested_data, int bdi_bits)
+static int dm_any_congested(void *congested_data, int bdi_bits,
+ struct page *page, int group)
{
int r = bdi_bits;
struct mapped_device *md = congested_data;
@@ -1630,8 +1631,8 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
r = md->queue->backing_dev_info.state &
bdi_bits;
else
- r = dm_table_any_congested(map, bdi_bits);
-
+ r = dm_table_any_congested(map, bdi_bits, page,
+ group);
dm_table_put(map);
}
}
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index a7663eb..bf533a9 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -57,7 +57,8 @@ struct list_head *dm_table_get_devices(struct dm_table *t);
void dm_table_presuspend_targets(struct dm_table *t);
void dm_table_postsuspend_targets(struct dm_table *t);
int dm_table_resume_targets(struct dm_table *t);
-int dm_table_any_congested(struct dm_table *t, int bdi_bits);
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+ int group);
int dm_table_any_busy_target(struct dm_table *t);
int dm_table_set_type(struct dm_table *t);
unsigned dm_table_get_type(struct dm_table *t);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 5fe39c2..10765da 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -102,7 +102,7 @@ static void linear_unplug(struct request_queue *q)
rcu_read_unlock();
}

-static int linear_congested(void *data, int bits)
+static int linear_congested(void *data, int bits, struct page *page, int group)
{
mddev_t *mddev = data;
linear_conf_t *conf;
@@ -113,7 +113,10 @@ static int linear_congested(void *data, int bits)

for (i = 0; i < mddev->raid_disks && !ret ; i++) {
struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
- ret |= bdi_congested(&q->backing_dev_info, bits);
+ struct backing_dev_info *bdi = &q->backing_dev_info;
+
+ ret |= group ? bdi_congested_group(bdi, bits, page) :
+ bdi_congested(bdi, bits);
}

rcu_read_unlock();
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 7140909..52a54c7 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -192,7 +192,8 @@ static void multipath_status (struct seq_file *seq, mddev_t *mddev)
seq_printf (seq, "]");
}

-static int multipath_congested(void *data, int bits)
+static int multipath_congested(void *data, int bits, struct page *page,
+ int group)
{
mddev_t *mddev = data;
multipath_conf_t *conf = mddev->private;
@@ -203,8 +204,10 @@ static int multipath_congested(void *data, int bits)
mdk_rdev_t *rdev = rcu_dereference(conf->multipaths[i].rdev);
if (rdev && !test_bit(Faulty, &rdev->flags)) {
struct request_queue *q = bdev_get_queue(rdev->bdev);
+ struct backing_dev_info *bdi = &q->backing_dev_info;

- ret |= bdi_congested(&q->backing_dev_info, bits);
+ ret |= group ? bdi_congested_group(bdi, bits, page)
+ : bdi_congested(bdi, bits);
/* Just like multipath_map, we just check the
* first available device
*/
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 898e2bd..915a95f 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -37,7 +37,7 @@ static void raid0_unplug(struct request_queue *q)
}
}

-static int raid0_congested(void *data, int bits)
+static int raid0_congested(void *data, int bits, struct page *page, int group)
{
mddev_t *mddev = data;
raid0_conf_t *conf = mddev->private;
@@ -46,8 +46,10 @@ static int raid0_congested(void *data, int bits)

for (i = 0; i < mddev->raid_disks && !ret ; i++) {
struct request_queue *q = bdev_get_queue(devlist[i]->bdev);
+ struct backing_dev_info *bdi = &q->backing_dev_info;

- ret |= bdi_congested(&q->backing_dev_info, bits);
+ ret |= group ? bdi_congested_group(bdi, bits, page)
+ : bdi_congested(bdi, bits);
}
return ret;
}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8726fd7..0f0c6ac 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -570,7 +570,7 @@ static void raid1_unplug(struct request_queue *q)
md_wakeup_thread(mddev->thread);
}

-static int raid1_congested(void *data, int bits)
+static int raid1_congested(void *data, int bits, struct page *page, int group)
{
mddev_t *mddev = data;
conf_t *conf = mddev->private;
@@ -581,14 +581,17 @@ static int raid1_congested(void *data, int bits)
mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
if (rdev && !test_bit(Faulty, &rdev->flags)) {
struct request_queue *q = bdev_get_queue(rdev->bdev);
+ struct backing_dev_info *bdi = &q->backing_dev_info;

/* Note the '|| 1' - when read_balance prefers
* non-congested targets, it can be removed
*/
if ((bits & (1<<BDI_async_congested)) || 1)
- ret |= bdi_congested(&q->backing_dev_info, bits);
+ ret |= group ? bdi_congested_group(bdi, bits,
+ page) : bdi_congested(bdi, bits);
else
- ret &= bdi_congested(&q->backing_dev_info, bits);
+ ret &= group ? bdi_congested_group(bdi, bits,
+ page) : bdi_congested(bdi, bits);
}
}
rcu_read_unlock();
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3d9020c..d85351f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -625,7 +625,7 @@ static void raid10_unplug(struct request_queue *q)
md_wakeup_thread(mddev->thread);
}

-static int raid10_congested(void *data, int bits)
+static int raid10_congested(void *data, int bits, struct page *page, int group)
{
mddev_t *mddev = data;
conf_t *conf = mddev->private;
@@ -636,8 +636,10 @@ static int raid10_congested(void *data, int bits)
mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
if (rdev && !test_bit(Faulty, &rdev->flags)) {
struct request_queue *q = bdev_get_queue(rdev->bdev);
+ struct backing_dev_info *bdi = &q->backing_dev_info;

- ret |= bdi_congested(&q->backing_dev_info, bits);
+ ret |= group ? bdi_congested_group(bdi, bits, page)
+ : bdi_congested(bdi, bits);
}
}
rcu_read_unlock();
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b8a2c5d..b6cc455 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3323,7 +3323,7 @@ static void raid5_unplug_device(struct request_queue *q)
unplug_slaves(mddev);
}

-static int raid5_congested(void *data, int bits)
+static int raid5_congested(void *data, int bits, struct page *page, int group)
{
mddev_t *mddev = data;
raid5_conf_t *conf = mddev->private;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index c2e7a7f..aa8b359 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -455,7 +455,7 @@ int afs_writepage(struct page *page, struct writeback_control *wbc)
}

wbc->nr_to_write -= ret;
- if (wbc->nonblocking && bdi_write_congested(bdi))
+ if (wbc->nonblocking && bdi_or_group_write_congested(bdi, page))
wbc->encountered_congestion = 1;

_leave(" = 0");
@@ -491,6 +491,12 @@ static int afs_writepages_region(struct address_space *mapping,
return 0;
}

+ if (wbc->nonblocking && bdi_write_congested_group(bdi, page)) {
+ wbc->encountered_congestion = 1;
+ page_cache_release(page);
+ break;
+ }
+
/* at this point we hold neither mapping->tree_lock nor lock on
* the page itself: the page may be truncated or invalidated
* (changing page->mapping to NULL), or even swizzled back from
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e83be2e..35cd95a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1249,7 +1249,8 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
return root;
}

-static int btrfs_congested_fn(void *congested_data, int bdi_bits)
+static int btrfs_congested_fn(void *congested_data, int bdi_bits,
+ struct page *page, int group)
{
struct btrfs_fs_info *info = (struct btrfs_fs_info *)congested_data;
int ret = 0;
@@ -1260,7 +1261,8 @@ static int btrfs_congested_fn(void *congested_data, int bdi_bits)
if (!device->bdev)
continue;
bdi = blk_get_backing_dev_info(device->bdev);
- if (bdi && bdi_congested(bdi, bdi_bits)) {
+ if (bdi && (group ? bdi_congested_group(bdi, bdi_bits, page) :
+ bdi_congested(bdi, bdi_bits))) {
ret = 1;
break;
}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6826018..fd7d53f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2368,6 +2368,18 @@ retry:
unsigned i;

scanned = 1;
+
+ /*
+ * If the io group page will go into is congested, bail out.
+ */
+ if (wbc->nonblocking
+ && bdi_write_congested_group(bdi, pvec.pages[0])) {
+ wbc->encountered_congestion = 1;
+ done = 1;
+ pagevec_release(&pvec);
+ break;
+ }
+
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5dbefd1..ed2d100 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -165,6 +165,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
unsigned long limit;
unsigned long last_waited = 0;
int force_reg = 0;
+ struct page *page;

bdi = blk_get_backing_dev_info(device->bdev);
fs_info = device->dev_root->fs_info;
@@ -276,8 +277,11 @@ loop_lock:
* is now congested. Back off and let other work structs
* run instead
*/
- if (pending && bdi_write_congested(bdi) && batch_run > 32 &&
- fs_info->fs_devices->open_devices > 1) {
+ if (pending)
+ page = bio_iovec_idx(pending, 0)->bv_page;
+
+ if (pending && bdi_or_group_write_congested(bdi, page) &&
+ num_run > 32 && fs_info->fs_devices->open_devices > 1) {
struct io_context *ioc;

ioc = current->io_context;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c34b7f8..33d0339 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1470,6 +1470,17 @@ retry:
n_iov = 0;
bytes_to_write = 0;

+ /*
+ * If the io group page will go into is congested, bail out.
+ */
+ if (wbc->nonblocking &&
+ bdi_write_congested_group(bdi, pvec.pages[0])) {
+ wbc->encountered_congestion = 1;
+ done = 1;
+ pagevec_release(&pvec);
+ break;
+ }
+
for (i = 0; i < nr_pages; i++) {
page = pvec.pages[i];
/*
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index 15387c9..090a961 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -179,7 +179,7 @@ static void ext2_preread_inode(struct inode *inode)
struct backing_dev_info *bdi;

bdi = inode->i_mapping->backing_dev_info;
- if (bdi_read_congested(bdi))
+ if (bdi_or_group_read_congested(bdi, NULL))
return;
if (bdi_write_congested(bdi))
return;
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 7ebae9a..f5fba6c 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -371,6 +371,18 @@ retry:
PAGECACHE_TAG_DIRTY,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
scanned = 1;
+
+ /*
+ * If io group page belongs to is congested. bail out.
+ */
+ if (wbc->nonblocking
+ && bdi_write_congested_group(bdi, pvec.pages[0])) {
+ wbc->encountered_congestion = 1;
+ done = 1;
+ pagevec_release(&pvec);
+ break;
+ }
+
ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end);
if (ret)
done = 1;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 9e3fe17..aa29612 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -266,8 +266,9 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode)
{
struct bio *bio = wi->bio;
int err;
+ struct page *page = bio_iovec_idx(bio, 0)->bv_page;

- if (wi->nbio > 0 && bdi_write_congested(wi->bdi)) {
+ if (wi->nbio > 0 && bdi_or_group_write_congested(wi->bdi, page)) {
wait_for_completion(&wi->bio_event);
wi->nbio--;
if (unlikely(atomic_read(&wi->err))) {
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index aecf251..5835a2e 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -891,7 +891,7 @@ xfs_convert_page(

bdi = inode->i_mapping->backing_dev_info;
wbc->nr_to_write--;
- if (bdi_write_congested(bdi)) {
+ if (bdi_or_group_write_congested(bdi, page)) {
wbc->encountered_congestion = 1;
done = 1;
} else if (wbc->nr_to_write <= 0) {
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 965df12..473223a 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -714,7 +714,7 @@ xfs_buf_readahead(
struct backing_dev_info *bdi;

bdi = target->bt_mapping->backing_dev_info;
- if (bdi_read_congested(bdi))
+ if (bdi_or_group_read_congested(bdi, NULL))
return;

flags |= (XBF_TRYLOCK|XBF_ASYNC|XBF_READ_AHEAD);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 1d52425..1b13539 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,7 +29,7 @@ enum bdi_state {
BDI_unused, /* Available bits start here */
};

-typedef int (congested_fn)(void *, int);
+typedef int (congested_fn)(void *, int, struct page *, int);

enum bdi_stat_item {
BDI_RECLAIMABLE,
@@ -209,7 +209,7 @@ int writeback_in_progress(struct backing_dev_info *bdi);
static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
{
if (bdi->congested_fn)
- return bdi->congested_fn(bdi->congested_data, bdi_bits);
+ return bdi->congested_fn(bdi->congested_data, bdi_bits, NULL, 0);
return (bdi->state & bdi_bits);
}

@@ -229,6 +229,63 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
(1 << BDI_async_congested));
}

+#ifdef CONFIG_GROUP_IOSCHED
+extern int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+ struct page *page);
+
+extern int bdi_read_congested_group(struct backing_dev_info *bdi,
+ struct page *page);
+
+extern int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+ struct page *page);
+
+extern int bdi_write_congested_group(struct backing_dev_info *bdi,
+ struct page *page);
+
+extern int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+ struct page *page);
+
+extern int bdi_rw_congested_group(struct backing_dev_info *bdi,
+ struct page *page);
+#else /* CONFIG_GROUP_IOSCHED */
+static inline int bdi_congested_group(struct backing_dev_info *bdi,
+ int bdi_bits, struct page *page)
+{
+ return bdi_congested(bdi, bdi_bits);
+}
+
+static inline int bdi_read_congested_group(struct backing_dev_info *bdi,
+ struct page *page)
+{
+ return bdi_read_congested(bdi);
+}
+
+static inline int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+ struct page *page)
+{
+ return bdi_read_congested(bdi);
+}
+
+static inline int bdi_write_congested_group(struct backing_dev_info *bdi,
+ struct page *page)
+{
+ return bdi_write_congested(bdi);
+}
+
+static inline int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+ struct page *page)
+{
+ return bdi_write_congested(bdi);
+}
+
+static inline int bdi_rw_congested_group(struct backing_dev_info *bdi,
+ struct page *page)
+{
+ return bdi_rw_congested(bdi);
+}
+
+#endif /* CONFIG_GROUP_IOSCHED */
+
enum {
BLK_RW_ASYNC = 0,
BLK_RW_SYNC = 1,
@@ -237,7 +294,7 @@ enum {
void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
void set_bdi_congested(struct backing_dev_info *bdi, int sync);
long congestion_wait(int sync, long timeout);
-
+extern void congestion_wake_up(int sync);

static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
{
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 74deb17..247e237 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -846,6 +846,11 @@ static inline void blk_set_queue_congested(struct request_queue *q, int sync)
set_bdi_congested(&q->backing_dev_info, sync);
}

+#ifdef CONFIG_GROUP_IOSCHED
+extern int blk_queue_io_group_congested(struct backing_dev_info *bdi,
+ int bdi_bits, struct page *page);
+#endif
+
extern void blk_start_queue(struct request_queue *q);
extern void blk_stop_queue(struct request_queue *q);
extern void blk_sync_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c86edd2..60c91e4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -7,6 +7,7 @@
#include <linux/module.h>
#include <linux/writeback.h>
#include <linux/device.h>
+#include "../block/elevator-fq.h"

void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
@@ -283,16 +284,22 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};

+void congestion_wake_up(int sync)
+{
+ wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+ if (waitqueue_active(wqh))
+ wake_up(wqh);
+}
+
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
enum bdi_state bit;
- wait_queue_head_t *wqh = &congestion_wqh[sync];

bit = sync ? BDI_sync_congested : BDI_async_congested;
clear_bit(bit, &bdi->state);
smp_mb__after_clear_bit();
- if (waitqueue_active(wqh))
- wake_up(wqh);
+ congestion_wake_up(sync);
}
EXPORT_SYMBOL(clear_bdi_congested);

@@ -327,3 +334,64 @@ long congestion_wait(int sync, long timeout)
}
EXPORT_SYMBOL(congestion_wait);

+/*
+ * With group IO scheduling, there are request descriptors per io group per
+ * queue. So generic notion of whether queue is congested or not is not
+ * very accurate. Queue might not be congested but the io group in which
+ * request will go might actually be congested.
+ *
+ * Hence to get the correct idea about congestion level, one should query
+ * the io group congestion status on the queue. Pass in the page information
+ * which can be used to determine the io group of the page and congestion
+ * status can be determined accordingly.
+ *
+ * If page info is not passed, io group is determined from the current task
+ * context.
+ */
+#ifdef CONFIG_GROUP_IOSCHED
+int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+ struct page *page)
+{
+ if (bdi->congested_fn)
+ return bdi->congested_fn(bdi->congested_data, bdi_bits, page, 1);
+
+ return blk_queue_io_group_congested(bdi, bdi_bits, page);
+}
+EXPORT_SYMBOL(bdi_congested_group);
+
+int bdi_read_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+ return bdi_congested_group(bdi, 1 << BDI_sync_congested, page);
+}
+EXPORT_SYMBOL(bdi_read_congested_group);
+
+/* Checks if either bdi or associated group is read congested */
+int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+ struct page *page)
+{
+ return bdi_read_congested(bdi) || bdi_read_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_read_congested);
+
+int bdi_write_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+ return bdi_congested_group(bdi, 1 << BDI_async_congested, page);
+}
+EXPORT_SYMBOL(bdi_write_congested_group);
+
+/* Checks if either bdi or associated group is write congested */
+int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+ struct page *page)
+{
+ return bdi_write_congested(bdi) || bdi_write_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_write_congested);
+
+int bdi_rw_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+ return bdi_congested_group(bdi, (1 << BDI_sync_congested) |
+ (1 << BDI_async_congested), page);
+}
+EXPORT_SYMBOL(bdi_rw_congested_group);
+
+#endif /* CONFIG_GROUP_IOSCHED */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1df421b..f924e05 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -985,6 +985,17 @@ retry:
if (nr_pages == 0)
break;

+ /*
+ * If the io group page will go into is congested, bail out.
+ */
+ if (wbc->nonblocking
+ && bdi_write_congested_group(bdi, pvec.pages[0])) {
+ wbc->encountered_congestion = 1;
+ done = 1;
+ pagevec_release(&pvec);
+ break;
+ }
+
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];

diff --git a/mm/readahead.c b/mm/readahead.c
index aa1aa23..22e0639 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -542,7 +542,7 @@ page_cache_async_readahead(struct address_space *mapping,
/*
* Defer asynchronous read-ahead on IO congestion.
*/
- if (bdi_read_congested(mapping->backing_dev_info))
+ if (bdi_or_group_read_congested(mapping->backing_dev_info, NULL))
return;

/* do read-ahead */
--
1.6.0.6

2009-09-24 19:26:40

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 27/28] io-controller: Support per cgroup per device weights and io class

This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If you
don't do any special configuration for a particular device, "weight" and
"ioprio_class" are used as the default values for that device.

You can use the following format to play with the new interface.
#echo dev_major:dev_minor weight ioprio_class > /path/to/cgroup/policy
weight=0 removes the policy for that device.

Examples:
Configure weight=300 ioprio_class=2 on /dev/sdb (8:16) in this cgroup
# echo "8:16 300 2" > io.policy
# cat io.policy
dev weight class
8:16 300 2

Configure weight=500 ioprio_class=1 on /dev/sda (8:0) in this cgroup
# echo "8:0 500 1" > io.policy
# cat io.policy
dev weight class
8:0 500 1
8:16 300 2

Remove the policy for /dev/sda in this cgroup
# echo 8:0 0 1 > io.policy
# cat io.policy
dev weight class
8:16 300 2

Changelog (v1 -> v2)
- Rename some structures
- Use the spin_lock_irqsave() and spin_unlock_irqrestore() variants to avoid
enabling interrupts unconditionally.
- Fix a policy setup bug when switching to another io scheduler.
- If a policy is available for a specific device, don't update weight and
io class when writing "weight" and "ioprio_class".
- Fix a bug when parsing the policy string.

Signed-off-by: Gui Jianfeng <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/elevator-fq.c | 263 ++++++++++++++++++++++++++++++++++++++++++++++++++-
block/elevator-fq.h | 10 ++
2 files changed, 269 insertions(+), 4 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index fd0a40f..e69de98 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -15,6 +15,7 @@
#include <linux/blktrace_api.h>
#include <linux/seq_file.h>
#include <linux/biotrack.h>
+#include <linux/genhd.h>
#include "elevator-fq.h"

const int elv_slice_sync = HZ / 10;
@@ -1156,12 +1157,26 @@ EXPORT_SYMBOL(elv_io_group_set_async_queue);
#ifdef CONFIG_GROUP_IOSCHED
static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);

-static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+ dev_t dev);
+static void
+io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog, dev_t dev)
{
struct io_entity *entity = &iog->entity;
+ struct io_policy_node *pn;
+ unsigned long flags;
+
+ spin_lock_irqsave(&iocg->lock, flags);
+ pn = policy_search_node(iocg, dev);
+ if (pn) {
+ entity->weight = pn->weight;
+ entity->ioprio_class = pn->ioprio_class;
+ } else {
+ entity->weight = iocg->weight;
+ entity->ioprio_class = iocg->ioprio_class;
+ }
+ spin_unlock_irqrestore(&iocg->lock, flags);

- entity->weight = iocg->weight;
- entity->ioprio_class = iocg->ioprio_class;
entity->ioprio_changed = 1;
entity->my_sd = &iog->sched_data;
}
@@ -1431,6 +1446,229 @@ io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
return NULL;
}

+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct io_cgroup *iocg;
+ struct io_policy_node *pn;
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+
+ if (list_empty(&iocg->policy_list))
+ goto out;
+
+ seq_printf(m, "dev\tweight\tclass\n");
+
+ spin_lock_irq(&iocg->lock);
+ list_for_each_entry(pn, &iocg->policy_list, node) {
+ seq_printf(m, "%u:%u\t%u\t%hu\n", MAJOR(pn->dev),
+ MINOR(pn->dev), pn->weight, pn->ioprio_class);
+ }
+ spin_unlock_irq(&iocg->lock);
+out:
+ return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+ struct io_policy_node *pn)
+{
+ list_add(&pn->node, &iocg->policy_list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct io_policy_node *pn)
+{
+ list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+ dev_t dev)
+{
+ struct io_policy_node *pn;
+
+ if (list_empty(&iocg->policy_list))
+ return NULL;
+
+ list_for_each_entry(pn, &iocg->policy_list, node) {
+ if (pn->dev == dev)
+ return pn;
+ }
+
+ return NULL;
+}
+
+static int check_dev_num(dev_t dev)
+{
+ int part = 0;
+ struct gendisk *disk;
+
+ disk = get_gendisk(dev, &part);
+ if (!disk || part)
+ return -ENODEV;
+
+ return 0;
+}
+
+static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
+{
+ char *s[4], *p, *major_s = NULL, *minor_s = NULL;
+ int ret;
+ unsigned long major, minor, temp;
+ int i = 0;
+ dev_t dev;
+
+ memset(s, 0, sizeof(s));
+ while ((p = strsep(&buf, " ")) != NULL) {
+ if (!*p)
+ continue;
+ s[i++] = p;
+
+ /* Prevent from inputing too many things */
+ if (i == 4)
+ break;
+ }
+
+ if (i != 3)
+ return -EINVAL;
+
+ p = strsep(&s[0], ":");
+ if (p != NULL)
+ major_s = p;
+ else
+ return -EINVAL;
+
+ minor_s = s[0];
+ if (!minor_s)
+ return -EINVAL;
+
+ ret = strict_strtoul(major_s, 10, &major);
+ if (ret)
+ return -EINVAL;
+
+ ret = strict_strtoul(minor_s, 10, &minor);
+ if (ret)
+ return -EINVAL;
+
+ dev = MKDEV(major, minor);
+
+ ret = check_dev_num(dev);
+ if (ret)
+ return ret;
+
+ newpn->dev = dev;
+
+ if (s[1] == NULL)
+ return -EINVAL;
+
+ ret = strict_strtoul(s[1], 10, &temp);
+ if (ret || temp > IO_WEIGHT_MAX)
+ return -EINVAL;
+
+ newpn->weight = temp;
+
+ if (s[2] == NULL)
+ return -EINVAL;
+
+ ret = strict_strtoul(s[2], 10, &temp);
+ if (ret || temp < IOPRIO_CLASS_RT || temp > IOPRIO_CLASS_IDLE)
+ return -EINVAL;
+ newpn->ioprio_class = temp;
+
+ return 0;
+}
+
+static void update_iog_weight_prio(struct io_group *iog, struct io_cgroup *iocg,
+ struct io_policy_node *pn)
+{
+ if (pn->weight) {
+ iog->entity.weight = pn->weight;
+ iog->entity.ioprio_class = pn->ioprio_class;
+ /*
+ * iog weight and ioprio_class updating actually happens if
+ * ioprio_changed is set. So ensure ioprio_changed is not set
+ * until new weight and new ioprio_class are updated.
+ */
+ smp_wmb();
+ iog->entity.ioprio_changed = 1;
+ } else {
+ iog->entity.weight = iocg->weight;
+ iog->entity.ioprio_class = iocg->ioprio_class;
+
+ /* The same as above */
+ smp_wmb();
+ iog->entity.ioprio_changed = 1;
+ }
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct io_cgroup *iocg;
+ struct io_policy_node *newpn, *pn;
+ char *buf;
+ int ret = 0;
+ int keep_newpn = 0;
+ struct hlist_node *n;
+ struct io_group *iog;
+
+ buf = kstrdup(buffer, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+ if (!newpn) {
+ ret = -ENOMEM;
+ goto free_buf;
+ }
+
+ ret = policy_parse_and_set(buf, newpn);
+ if (ret)
+ goto free_newpn;
+
+ if (!cgroup_lock_live_group(cgrp)) {
+ ret = -ENODEV;
+ goto free_newpn;
+ }
+
+ iocg = cgroup_to_io_cgroup(cgrp);
+ spin_lock_irq(&iocg->lock);
+
+ pn = policy_search_node(iocg, newpn->dev);
+ if (!pn) {
+ if (newpn->weight != 0) {
+ policy_insert_node(iocg, newpn);
+ keep_newpn = 1;
+ }
+ goto update_io_group;
+ }
+
+ if (newpn->weight == 0) {
+ /* weight == 0 means deleteing a policy */
+ policy_delete_node(pn);
+ goto update_io_group;
+ }
+
+ pn->weight = newpn->weight;
+ pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+ hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+ if (iog->dev == newpn->dev)
+ update_iog_weight_prio(iog, iocg, newpn);
+ }
+ spin_unlock_irq(&iocg->lock);
+
+ cgroup_unlock();
+
+free_newpn:
+ if (!keep_newpn)
+ kfree(newpn);
+free_buf:
+ kfree(buf);
+ return ret;
+}
+
#define SHOW_FUNCTION(__VAR) \
static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup, \
struct cftype *cftype) \
@@ -1463,6 +1701,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
struct io_cgroup *iocg; \
struct io_group *iog; \
struct hlist_node *n; \
+ struct io_policy_node *pn; \
\
if (val < (__MIN) || val > (__MAX)) \
return -EINVAL; \
@@ -1475,6 +1714,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup, \
spin_lock_irq(&iocg->lock); \
iocg->__VAR = (unsigned long)val; \
hlist_for_each_entry(iog, n, &iocg->group_data, group_node) { \
+ pn = policy_search_node(iocg, iog->dev); \
+ if (pn) \
+ continue; \
iog->entity.__VAR = (unsigned long)val; \
smp_wmb(); \
iog->entity.ioprio_changed = 1; \
@@ -1610,6 +1852,12 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,

struct cftype io_files[] = {
{
+ .name = "policy",
+ .read_seq_string = io_cgroup_policy_read,
+ .write_string = io_cgroup_policy_write,
+ .max_write_len = 256,
+ },
+ {
.name = "weight",
.read_u64 = io_cgroup_weight_read,
.write_u64 = io_cgroup_weight_write,
@@ -1660,6 +1908,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
INIT_HLIST_HEAD(&iocg->group_data);
iocg->weight = IO_WEIGHT_DEFAULT;
iocg->ioprio_class = IOPRIO_CLASS_BE;
+ INIT_LIST_HEAD(&iocg->policy_list);

return &iocg->css;
}
@@ -1753,7 +2002,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
iog->dev = MKDEV(major, minor);

- io_group_init_entity(iocg, iog);
+ io_group_init_entity(iocg, iog, iog->dev);

atomic_set(&iog->ref, 0);

@@ -2109,6 +2358,7 @@ static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
struct io_group *iog;
struct elv_fq_data *efqd;
unsigned long uninitialized_var(flags);
+ struct io_policy_node *pn, *pntmp;

/*
* io groups are linked in two lists. One list is maintained
@@ -2148,6 +2398,11 @@ remove_entry:
goto remove_entry;

done:
+ list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
+ policy_delete_node(pn);
+ kfree(pn);
+ }
+
free_css_id(&io_subsys, &iocg->css);
rcu_read_unlock();
BUG_ON(!hlist_empty(&iocg->group_data));
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 203250a..e0d5e54 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -157,12 +157,22 @@ struct io_group {
struct request_list rl;
};

+struct io_policy_node {
+ struct list_head node;
+ dev_t dev;
+ unsigned int weight;
+ unsigned short ioprio_class;
+};
+
struct io_cgroup {
struct cgroup_subsys_state css;

unsigned int weight;
unsigned short ioprio_class;

+ /* list of io_policy_node */
+ struct list_head policy_list;
+
spinlock_t lock;
struct hlist_head group_data;
};
--
1.6.0.6

2009-09-24 19:32:57

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 28/28] io-controller: debug elevator fair queuing support

o More debugging help for the elevator fair queuing support. Enabled under
CONFIG_DEBUG_ELV_FAIR_QUEUING. Currently it emits vdisktime-related
trace messages via blktrace.

Signed-off-by: Vivek Goyal <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
block/Kconfig.iosched | 9 +++++++++
block/elevator-fq.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8b507c4..edcd317 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -15,6 +15,15 @@ config ELV_FAIR_QUEUING
other ioschedulers can make use of it.
If unsure, say N.

+config DEBUG_ELV_FAIR_QUEUING
+ bool "Debug elevator fair queuing"
+ depends on ELV_FAIR_QUEUING
+ default n
+ ---help---
+ Enable some debugging hooks for elevator fair queuing support.
+ Currently it just outputs more information about vdisktime in
+ blktrace output .
+
config IOSCHED_NOOP
bool
default y
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index e69de98..62691c6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -37,6 +37,24 @@ static int elv_ioq_wait_busy = HZ / 125;

static void check_idle_tree_release(struct io_service_tree *st);

+#ifdef CONFIG_DEBUG_ELV_FAIR_QUEUING
+#define elv_log_entity(entity, fmt, args...) \
+{ \
+{ \
+ struct io_queue *ioq = ioq_of(entity); \
+ struct io_group *iog = iog_of(entity); \
+ \
+ if (ioq) { \
+ elv_log_ioq(ioq->efqd, ioq, fmt, ##args); \
+ } else { \
+ elv_log_iog((struct elv_fq_data *)iog->key, iog, fmt, ##args);\
+ } \
+} \
+}
+#else
+#define elv_log_entity(entity, fmt, args...)
+#endif
+
static inline struct io_queue *ioq_of(struct io_entity *entity)
{
if (entity->my_sd == NULL)
@@ -435,6 +453,26 @@ static inline void debug_update_stats_enqueue(struct io_entity *entity) {}
static inline void debug_update_stats_dequeue(struct io_entity *entity) {}
#endif /* DEBUG_GROUP_IOSCHED */

+#ifdef CONFIG_DEBUG_ELV_FAIR_QUEUING
+static inline void debug_entity_vdisktime(struct io_entity *entity,
+ unsigned long served, u64 delta)
+{
+ struct elv_fq_data *efqd;
+ struct io_group *iog;
+
+ elv_log_entity(entity, "vdisktime=%llu service=%lu delta=%llu"
+ " entity->weight=%u", entity->vdisktime,
+ served, delta, entity->weight);
+
+ iog = iog_of(parent_entity(entity));
+ efqd = iog->key;
+ elv_log_iog(efqd, iog, "min_vdisktime=%llu", entity->st->min_vdisktime);
+}
+#else /* DEBUG_ELV_FAIR_QUEUING */
+static inline void debug_entity_vdisktime(struct io_entity *entity,
+ unsigned long served, u64 delta) {}
+#endif /* DEBUG_ELV_FAIR_QUEUING */
+
static void entity_served(struct io_entity *entity, unsigned long served,
unsigned long queue_charge, unsigned long group_charge,
unsigned long nr_sectors)
@@ -442,10 +480,14 @@ static void entity_served(struct io_entity *entity, unsigned long served,
unsigned long charge = queue_charge;

for_each_entity(entity) {
- entity->vdisktime += elv_delta_fair(charge, entity);
+ u64 delta;
+
+ delta = elv_delta_fair(charge, entity);
+ entity->vdisktime += delta;
update_min_vdisktime(entity->st);
entity->total_time += served;
entity->total_sectors += nr_sectors;
+ debug_entity_vdisktime(entity, charge, delta);
/* Group charge can be different from queue charge */
charge = group_charge;
}
@@ -499,6 +541,9 @@ static void place_entity(struct io_service_tree *st, struct io_entity *entity,
vdisktime = st->min_vdisktime;
done:
entity->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+ elv_log_entity(entity, "place_entity: vdisktime=%llu"
+ " min_vdisktime=%llu", entity->vdisktime,
+ st->min_vdisktime);
}

static inline void io_entity_update_prio(struct io_entity *entity)
--
1.6.0.6

2009-09-24 21:35:11

by Andrew Morton

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Thu, 24 Sep 2009 15:25:04 -0400
Vivek Goyal <[email protected]> wrote:

>
> Hi All,
>
> Here is the V10 of the IO controller patches generated on top of 2.6.31.
>

Thanks for the writeup. It really helps and is most worthwhile for a
project of this importance, size and complexity.


>
> What problem are we trying to solve
> ===================================
> Provide group IO scheduling feature in Linux along the lines of other resource
> controllers like cpu.
>
> IOW, provide facility so that a user can group applications using cgroups and
> control the amount of disk time/bandwidth received by a group based on its
> weight.
>
> How to solve the problem
> =========================
>
> Different people have solved the issue differetnly. So far looks it looks
> like we seem to have following two core requirements when it comes to
> fairness at group level.
>
> - Control bandwidth seen by groups.
> - Control on latencies when a request gets backlogged in group.
>
> At least there are now three patchsets available (including this one).
>
> IO throttling
> -------------
> This is a bandwidth controller which keeps track of IO rate of a group and
> throttles the process in the group if it exceeds the user specified limit.
>
> dm-ioband
> ---------
> This is a proportional bandwidth controller implemented as device mapper
> driver and provides fair access in terms of amount of IO done (not in terms
> of disk time as CFQ does).
>
> So one will setup one or more dm-ioband devices on top of physical/logical
> block device, configure the ioband device and pass information like grouping
> etc. Now this device will keep track of bios flowing through it and control
> the flow of bios based on group policies.
>
> IO scheduler based IO controller
> --------------------------------
> Here we have viewed the problem of IO contoller as hierarchical group
> scheduling (along the lines of CFS group scheduling) issue. Currently one can
> view linux IO schedulers as flat where there is one root group and all the IO
> belongs to that group.
>
> This patchset basically modifies IO schedulers to also support hierarchical
> group scheduling. CFQ already provides fairness among different processes. I
> have extended it to support group IO scheduling. Also took some of the code out
> of CFQ and put it in a common layer so that the same group scheduling code can be
> used by noop, deadline and AS to support group scheduling.
>
> Pros/Cons
> =========
> There are pros and cons to each of the approach. Following are some of the
> thoughts.
>
> Max bandwidth vs proportional bandwidth
> ---------------------------------------
> IO throttling is a max bandwidth controller and not a proportional one.
> Additionally it provides fairness in terms of amount of IO done (and not in
> terms of disk time as CFQ does).
>
> Personally, I think that a proportional weight controller is useful to more
> people than just a max bandwidth controller. In addition, an IO scheduler based
> controller can also be enhanced to do max bandwidth control. So it can
> satisfy a wider set of requirements.
>
> Fairness in terms of disk time vs size of IO
> ---------------------------------------------
> A higher level controller will most likely be limited to providing fairness
> in terms of size/number of IO done and will find it hard to provide fairness
> in terms of disk time used (as CFQ provides between various prio levels). This
> is because only the IO scheduler knows how much disk time a queue has used and
> information about queues and disk time used is not exported to higher
> layers.
>
> So a seeky application will still run away with a lot of disk time and bring
> down the overall throughput of the disk.

But that's only true if the thing is poorly implemented.

A high-level controller will need some view of the busyness of the
underlying device(s). That could be "proportion of idle time", or
"average length of queue" or "average request latency" or some mix of
these or something else altogether.

But these things are simple to calculate, and are simple to feed back
to the higher-level controller and probably don't require any changes
to the IO scheduler at all, which is a great advantage.
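
As a rough illustration of the sort of cheap feedback loop being described
here (this is only a sketch, not part of any patch in this series), a
userspace monitor could estimate the "proportion of busy time" of a disk by
sampling the io_ticks column of /sys/block/<dev>/stat over an interval,
assuming that file and its usual field layout are available on the kernel
in question:

/* sketch: estimate disk busy fraction from /sys/block/<dev>/stat */
#include <stdio.h>
#include <unistd.h>

static long long read_io_ticks(const char *dev)
{
	char path[64];
	long long v[11] = {0};
	FILE *fp;

	snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	/* the 10th field is io_ticks: total ms the device spent doing I/O */
	if (fscanf(fp, "%lld %lld %lld %lld %lld %lld %lld %lld %lld %lld %lld",
		   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
		   &v[6], &v[7], &v[8], &v[9], &v[10]) != 11) {
		fclose(fp);
		return -1;
	}
	fclose(fp);
	return v[9];
}

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "sda";
	long long t1, t2;

	t1 = read_io_ticks(dev);
	sleep(1);
	t2 = read_io_ticks(dev);
	if (t1 < 0 || t2 < 0)
		return 1;
	/* over a ~1000ms window, (t2 - t1) ms were busy; /10 gives a percentage */
	printf("%s was ~%lld%% busy over the last second\n", dev, (t2 - t1) / 10);
	return 0;
}

A throttling decision in a higher-level controller could then be driven by a
number like this rather than by anything inside the IO scheduler.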


And I must say that high-level throttling based upon feedback from
lower layers seems like a much better model to me than hacking away in
the IO scheduler layer. Both from an implementation point of view and
from a "we can get it to work on things other than block devices" point
of view.

> Currently dm-ioband provides fairness in terms of number/size of IO.
>
> Latencies and isolation between groups
> --------------------------------------
> A higher level controller generally implements a bandwidth throttling
> solution where, if a group exceeds either the max bandwidth or its proportional
> share, that group gets throttled.
>
> This kind of approach will probably not help in controlling latencies as it
> will depend on underlying IO scheduler. Consider following scenario.
>
> Assume there are two groups. One group is running multiple sequential readers
> and the other group has a random reader. The sequential readers will get a nice 100ms
> slice

Do you refer to each reader within group1, or to all readers? It would be
daft if each reader in group1 were to get 100ms.

> each and then a random reader from group2 will get to dispatch the
> request. So the latency of this random reader will depend on how many sequential
> readers are running in the other group, and that is weak isolation between groups.

And yet that is what you appear to mean.

But surely nobody would do that - the 100ms would be assigned to and
distributed amongst all readers in group1?

> When we control things at the IO scheduler level, we assign one time slice to one
> group and then pick the next entity to run. So effectively after one time slice
> (max 180ms, if a prio 0 sequential reader is running), the random reader in the other
> group will get to run. Hence we achieve better isolation between groups, as the
> response time of a process in a different group is generally not dependent on the
> number of processes running in the competing group.

I don't understand why you're comparing this implementation with such
an obviously dumb competing design!

> So a higher level solution is most likely limited to only shaping bandwidth
> without any control on latencies.
>
> Stacking group scheduler on top of CFQ can lead to issues
> ---------------------------------------------------------
> IO throttling and dm-ioband are both second level controllers. That is, these
> controllers are implemented in layers higher than the io schedulers. So they
> control the IO at a higher layer based on group policies and later the IO
> schedulers take care of dispatching these bios to disk.
>
> Implementing a second level controller has the advantage of being able to
> provide bandwidth control even on logical block devices in the IO stack
> which don't have any IO scheduler attached to them. But it can also
> interfere with the IO scheduling policy of the underlying IO scheduler and change
> the effective behavior. Following are some of the issues which I think
> should be visible in a second level controller in one form or another.
>
> Prio with-in group
> ------------------
> A second level controller can potentially interfere with the behavior of
> different prio processes within a group. bios are buffered at a higher layer
> in a single queue and the release of bios is FIFO and not proportionate to the
> ioprio of the process. This can result in a particular prio level not
> getting its fair share.

That's an administrator error, isn't it? Should have put the
different-priority processes into different groups.

> Buffering at a higher layer can delay read requests for more than the slice idle
> period of CFQ (default 8 ms). That means it is possible that we are waiting
> for a request from the queue but it is buffered at a higher layer and then the idle
> timer will fire. It means that the queue will lose its share, and at the same time
> overall throughput will be impacted as we lost those 8 ms.

That sounds like a bug.

> Read Vs Write
> -------------
> Writes can overwhelm readers, hence a second level controller's FIFO release
> will run into issues here. If there is a single queue maintained then reads
> will suffer large latencies. If there are separate queues for reads and writes
> then it will be hard to decide in what ratio to dispatch reads and writes, as
> it is the IO scheduler's decision when and how much read/write to
> dispatch. This is another place where a higher level controller will not be in
> sync with the lower level io scheduler and can change the effective policies of
> the underlying io scheduler.

The IO schedulers already take care of read-vs-write and already take
care of preventing large writes-starve-reads latencies (or at least,
they're supposed to).

> CFQ IO context Issues
> ---------------------
> Buffering at higher layer means submission of bios later with the help of
> a worker thread.

Why?

If it's a read, we just block the userspace process.

If it's a delayed write, the IO submission already happens in a kernel thread.

If it's a synchronous write, we have to block the userspace caller
anyway.

Async reads might be an issue, dunno.

> This changes the io context information at the CFQ layer, which
> assigns the request to the submitting thread. The change of io context info again
> leads to issues of idle timer expiry, a process not getting its fair
> share, and reduced throughput.

But we already have that problem with delayed writeback, which is a
huge thing - often it's the majority of IO.

> Throughput with noop, deadline and AS
> ---------------------------------------------
> I think a higher level controller will result in reduced overall throughput
> (as compared to an io scheduler based io controller) and more seeks with noop,
> deadline and AS.
>
> The reason being that it is likely that IO within a group will be related
> and will be relatively close as compared to IO across the groups. For example,
> the thread pool of kvm-qemu doing IO for a virtual machine. In case of higher level
> control, IO from various groups will go into a single queue at the lower level
> controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
> G4....) causing more seeks and reduced throughput. (Agreed that merging will
> help up to some extent but still....).
>
> Instead, in case of a lower level controller, the IO scheduler maintains one queue
> per group, hence there is no interleaving of IO between groups. And if IO is
> related within a group, then we should get a reduced number/amount of seeks and
> higher throughput.
>
> Latency can be a concern but that can be controlled by reducing the time
> slice length of the queue.

Well maybe, maybe not. If a group is throttled, it isn't submitting
new IO. The unthrottled group is doing the IO submitting and that IO
will have decent locality.

> Fairness at logical device level vs at physical device level
> ------------------------------------------------------------
>
> An IO scheduler based controller has the limitation that it works only with the
> bottom-most devices in the IO stack where an IO scheduler is attached.
>
> For example, assume a user has created a logical device lv0 using three
> underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> in two groups doing IO on lv0. Also assume that the weights of the groups are in the
> ratio of 2:1, so T1 should get double the BW of T2 on the lv0 device.
>
>        T1   T2
>          \  /
>          lv0
>         / | \
>      sda sdb sdc
>
>
> Now resource control will take place only on devices sda, sdb and sdc and
> not at the lv0 level. So if IO from the two tasks is relatively uniformly
> distributed across the disks then T1 and T2 will see the throughput ratio
> in proportion to the weights specified. But if IO from T1 and T2 is going to
> different disks and there is no contention then at the higher level they both
> will see the same BW.
>
> Here a second level controller can produce better fairness numbers at the
> logical device but most likely at reduced overall throughput of the system,
> because it will try to control IO even if there is no contention at the physical
> level, possibly leaving disks unused in the system.
>
> Hence, the question is how important it is to control bandwidth at
> higher level logical devices as well. The actual contention for resources is
> at the leaf block device, so it probably makes sense to do any kind of
> control there and not at the intermediate devices. Secondly, it probably
> also means better use of available resources.

hm. What will be the effects of this limitation in real-world use?

> Limited Fairness
> ----------------
> Currently CFQ idles on a sequential reader queue to make sure it gets its
> fair share. A second level controller will find it tricky to anticipate.
> Either it will not have any anticipation logic and in that case it will not
> provide fairness to single readers in a group (as dm-ioband does) or if it
> starts anticipating then we could run into strange situations where the
> second level controller is anticipating on one queue/group and the underlying
> IO scheduler might be anticipating on something else.

It depends on the size of the inter-group timeslices. If the amount of
time for which a group is unthrottled is "large" compared to the
typical anticipation times, this issue fades away.

And those timeslices _should_ be large. Because as you mentioned
above, different groups are probably working different parts of the
disk.
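
(For scale, using the numbers quoted earlier in this thread: CFQ's idle window
is about 8 ms, while the queue slices being discussed run 100-180 ms, so an
unthrottled period of even one such slice would leave the anticipation window
at well under a tenth of the interval.)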

> Need of device mapper tools
> ---------------------------
> A device mapper based solution will require creation of an ioband device
> on each physical/logical device one wants to control. So it requires usage
> of device mapper tools even for people who are not otherwise using device mapper.
> At the same time, creating an ioband device on each partition in the system to
> control the IO can be cumbersome and overwhelming if the system has lots of
> disks and partitions.
>
>
> IMHO, an IO scheduler based IO controller is a reasonable approach to solve the
> problem of group bandwidth control, and can do hierarchical IO scheduling
> more tightly and efficiently.
>
> But I am all ears to alternative approaches and suggestions how doing things
> can be done better and will be glad to implement it.
>
> TODO
> ====
> - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> - More testing to make sure there are no regressions in CFQ.
>
> Testing
> =======
>
> Environment
> ==========
> A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.

That's a bit of a toy.

Do we have testing results for more enterprisey hardware? Big storage
arrays? SSD? Infiniband? iscsi? nfs? (lol, gotcha)


> I am mostly
> running fio jobs which have been limited to 30 second runs, and then monitoring
> the throughput and latency.
>
> Test1: Random Reader Vs Random Writers
> ======================================
> Launched a random reader and then an increasing number of random writers to see
> the effect on the random reader's BW and max latencies.
>
> [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
>
> [Vanilla CFQ, No groups]
> <--------------random writers--------------------> <------random reader-->
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1 5737KiB/s 5737KiB/s 5737KiB/s 164K usec 503KiB/s 159K usec
> 2 2055KiB/s 1984KiB/s 4039KiB/s 1459K usec 150KiB/s 170K usec
> 4 1238KiB/s 932KiB/s 4419KiB/s 4332K usec 153KiB/s 225K usec
> 8 1059KiB/s 929KiB/s 7901KiB/s 1260K usec 118KiB/s 377K usec
> 16 604KiB/s 483KiB/s 8519KiB/s 3081K usec 47KiB/s 756K usec
> 32 367KiB/s 222KiB/s 9643KiB/s 5940K usec 22KiB/s 923K usec
>
> Created two cgroups group1 and group2 of weights 500 each. Launched an increasing
> number of random writers in group1 and one random reader in group2 using fio.
>
> [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> <--------------random writers(group1)-------------> <-random reader(group2)->
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1 18115KiB/s 18115KiB/s 18115KiB/s 604K usec 345KiB/s 176K usec
> 2 3752KiB/s 3676KiB/s 7427KiB/s 4367K usec 402KiB/s 187K usec
> 4 1951KiB/s 1863KiB/s 7642KiB/s 1989K usec 384KiB/s 181K usec
> 8 755KiB/s 629KiB/s 5683KiB/s 2133K usec 366KiB/s 319K usec
> 16 418KiB/s 369KiB/s 6276KiB/s 1323K usec 352KiB/s 287K usec
> 32 236KiB/s 191KiB/s 6518KiB/s 1910K usec 337KiB/s 273K usec

That's a good result.

> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
>
> [IO controller CFQ; No groups ]
> <--------------random writers--------------------> <------random reader-->
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1 5696KiB/s 5696KiB/s 5696KiB/s 259K usec 500KiB/s 194K usec
> 2 2483KiB/s 2197KiB/s 4680KiB/s 887K usec 150KiB/s 159K usec
> 4 1471KiB/s 1433KiB/s 5817KiB/s 962K usec 126KiB/s 189K usec
> 8 691KiB/s 580KiB/s 5159KiB/s 2752K usec 197KiB/s 246K usec
> 16 781KiB/s 698KiB/s 11892KiB/s 943K usec 61KiB/s 529K usec
> 32 415KiB/s 324KiB/s 12461KiB/s 4614K usec 17KiB/s 737K usec
>
> Notes:
> - With vanilla CFQ, random writers can overwhelm a random reader, bringing down
> its throughput and bumping up latencies significantly.

Isn't that a CFQ shortcoming which we should address separately? If
so, the comparisons aren't presently valid because we're comparing with
a CFQ which has known, should-be-fixed problems.

> - With the IO controller, one can provide isolation to the random reader group and
> maintain a consistent view of bandwidth and latencies.
>
> Test2: Random Reader Vs Sequential Reader
> ========================================
> Launched a random reader and then an increasing number of sequential readers to
> see the effect on the BW and latencies of the random reader.
>
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
>
> [ Vanilla CFQ, No groups ]
> <---------------seq readers----------------------> <------random reader-->
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1 23318KiB/s 23318KiB/s 23318KiB/s 55940 usec 36KiB/s 247K usec
> 2 14732KiB/s 11406KiB/s 26126KiB/s 142K usec 20KiB/s 446K usec
> 4 9417KiB/s 5169KiB/s 27338KiB/s 404K usec 10KiB/s 993K usec
> 8 3360KiB/s 3041KiB/s 25850KiB/s 954K usec 60KiB/s 956K usec
> 16 1888KiB/s 1457KiB/s 26763KiB/s 1871K usec 28KiB/s 1868K usec
>
> Created two cgroups group1 and group2 of weights 500 each. Launched an increasing
> number of sequential readers in group1 and one random reader in group2 using
> fio.
>
> [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> <---------------group1---------------------------> <------group2--------->
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1 13733KiB/s 13733KiB/s 13733KiB/s 247K usec 330KiB/s 154K usec
> 2 8553KiB/s 4963KiB/s 13514KiB/s 472K usec 322KiB/s 174K usec
> 4 5045KiB/s 1367KiB/s 13134KiB/s 947K usec 318KiB/s 178K usec
> 8 1774KiB/s 1420KiB/s 13035KiB/s 1871K usec 323KiB/s 233K usec
> 16 959KiB/s 518KiB/s 12691KiB/s 3809K usec 324KiB/s 208K usec
>
> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
>
> [IO controller CFQ; No groups ]
> <---------------seq readers----------------------> <------random reader-->
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1 23028KiB/s 23028KiB/s 23028KiB/s 47460 usec 36KiB/s 253K usec
> 2 14452KiB/s 11176KiB/s 25628KiB/s 145K usec 20KiB/s 447K usec
> 4 8815KiB/s 5720KiB/s 27121KiB/s 396K usec 10KiB/s 968K usec
> 8 3335KiB/s 2827KiB/s 24866KiB/s 960K usec 62KiB/s 955K usec
> 16 1784KiB/s 1311KiB/s 26537KiB/s 1883K usec 26KiB/s 1866K usec
>
> Notes:
> - The BW and latencies of the random reader in group2 seem to be stable and
> bounded and do not get impacted much as the number of sequential readers
> increases in group1. Hence providing good isolation.
>
> - Throughput of sequential readers comes down and latencies go up as half
> of the disk bandwidth (in terms of time) has been reserved for the random reader
> group.
>
> Test3: Sequential Reader Vs Sequential Reader
> ============================================
> Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> Launched an increasing number of sequential readers in group1 and one sequential
> reader in group2 using fio and monitored how bandwidth is being distributed
> between the two groups.
>
> First 5 columns give stats about job in group1 and last two columns give
> stats about job in group2.
>
> <---------------group1---------------------------> <------group2--------->
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1 8970KiB/s 8970KiB/s 8970KiB/s 230K usec 20681KiB/s 124K usec
> 2 6783KiB/s 3202KiB/s 9984KiB/s 546K usec 19682KiB/s 139K usec
> 4 4641KiB/s 1029KiB/s 9280KiB/s 1185K usec 19235KiB/s 172K usec
> 8 1435KiB/s 1079KiB/s 9926KiB/s 2461K usec 19501KiB/s 153K usec
> 16 764KiB/s 398KiB/s 9395KiB/s 4986K usec 19367KiB/s 172K usec
>
> Note: group2 is getting double the bandwidth of group1 even in the face
> of increasing number of readers in group1.
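
The same kind of check for Test3: with weights 500 and 1000, group2 should get
about 1000/1500 = 2/3 of the disk time, i.e. roughly twice group1's aggregate
bandwidth, which is what the table shows. A small sketch using the figures above:

# Aggregate bandwidth (KiB/s) of each group, from the Test3 table above.
w1, w2 = 500, 1000
group1_agg = {1: 8970, 2: 9984, 4: 9280, 8: 9926, 16: 9395}
group2_agg = {1: 20681, 2: 19682, 4: 19235, 8: 19501, 16: 19367}

for nr in sorted(group1_agg):
    measured = group2_agg[nr] / float(group1_agg[nr])
    print("nr=%2d  expected %.1f:1, measured %.2f:1" % (nr, w2 / float(w1), measured))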
>
> Test4 (Isolation between two KVM virtual machines)
> ==================================================
> Created two KVM virtual machines. Partitioned a disk on host in two partitions
> and gave one partition to each virtual machine. Put both the virtual machines
> in two different cgroups of weights 1000 and 500. Virtual machines created
> ext3 file system on the partitions exported from host and did buffered writes.
> Host sees the writes as synchronous and virtual machine with higher weight gets
> double the disk time of virtual machine of lower weight. Used deadline
> scheduler in this test case.
>
> Some more details about configuration are in documentation patch.
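
For reference, the kind of setup Test4 describes boils down to creating two
cgroups, assigning the weights and moving each qemu process into one of them.
A minimal sketch only; the mount point, the io.weight file name and the PIDs
are assumptions here, the documentation patch has the authoritative interface:

import os

CGROUP_ROOT = "/cgroup/io"              # assumed mount point of the io controller
QEMU_PIDS = {"vm1": 1234, "vm2": 5678}  # replace with the real qemu-kvm PIDs
WEIGHTS = {"vm1": 1000, "vm2": 500}

for vm, pid in QEMU_PIDS.items():
    path = os.path.join(CGROUP_ROOT, vm)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "io.weight"), "w") as f:   # assumed knob name
        f.write(str(WEIGHTS[vm]))
    with open(os.path.join(path, "tasks"), "w") as f:       # move the VM into the group
        f.write(str(pid))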
>
> Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> ===================================================================
> Fairness for async writes is tricky and biggest reason is that async writes
> are cached in higher layers (page cache) as well as possibly in file system
> layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> in a proportional manner.
>
> For example, consider two dd threads reading /dev/zero as input file and doing
> writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> be forced to write out some pages to disk before more pages can be dirtied. But
> not necessarily dirty pages of same thread are picked. It can very well pick
> the inode of lesser priority dd thread and do some writeout. So effectively
> higher weight dd is doing writeouts of lower weight dd pages and we don't see
> service differentiation.
>
> IOW, the core problem with buffered write fairness is that higher weight thread
> does not throw enough IO traffic at IO controller to keep the queue
> continuously backlogged. In my testing, there are many .2 to .8 second
> intervals where higher weight queue is empty and in that duration lower weight
> queue gets lots of work done, giving the impression that there was no service
> differentiation.
>
> In summary, from IO controller point of view async writes support is there.
> Because page cache has not been designed in such a manner that higher
> prio/weight writer can do more write out as compared to lower prio/weight
> writer, getting service differentiation is hard and it is visible in some
> cases and not visible in others.

Here's where it all falls to pieces.

For async writeback we just don't care about IO priorities. Because
from the point of view of the userspace task, the write was async! It
occurred at memory bandwidth speed.

It's only when the kernel's dirty memory thresholds start to get
exceeded that we start to care about prioritisation. And at that time,
all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
consumes just as much memory as a low-ioprio dirty page.

So when balance_dirty_pages() hits, what do we want to do?

I suppose that all we can do is to block low-ioprio processes more
aggressively at the VFS layer, to reduce the rate at which they're
dirtying memory so as to give high-ioprio processes more of the disk
bandwidth.

But you've gone and implemented all of this stuff at the io-controller
level and not at the VFS level so you're, umm, screwed.

Importantly screwed! It's a very common workload pattern, and one
which causes tremendous amounts of IO to be generated very quickly,
traditionally causing bad latency effects all over the place. And we
have no answer to this.
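
To make the balance_dirty_pages() idea above concrete, here is a toy model of
weight-scaled dirty budgets: each writer dirties pages at memory speed up to
its own budget, and a priority-unaware flusher cleans whatever is dirty. The
effective (application-visible) write rate then splits by weight even though
the flusher knows nothing about priorities. All numbers are invented; this is
only a sketch of the behaviour being argued about, not anything the kernel
does today.

def simulate(weights, disk_rate=100, total_budget=4000, ticks=1000):
    # weight-scaled share of the global dirty threshold
    total_w = sum(weights.values())
    budget = dict((t, total_budget * w / float(total_w)) for t, w in weights.items())
    dirty = dict((t, 0.0) for t in weights)
    app_progress = dict((t, 0.0) for t in weights)   # pages the task managed to dirty

    for _ in range(ticks):
        # each task dirties at memory speed until it hits its own budget,
        # i.e. low-ioprio tasks are blocked earlier and harder
        for t in weights:
            room = budget[t] - dirty[t]
            dirty[t] += room
            app_progress[t] += room
        # priority-unaware flusher: cleans pages in proportion to who owns
        # the dirty pool (the behaviour described in Test5 above)
        pool = sum(dirty.values())
        for t in weights:
            dirty[t] -= min(dirty[t], disk_rate * dirty[t] / pool)

    return dict((t, round(app_progress[t] / ticks, 1)) for t in weights)

print(simulate({"high_ioprio": 2, "low_ioprio": 1}))
# roughly a 2:1 split of effective write throughput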

> Vanilla CFQ Vs IO Controller CFQ
> ================================
> We have not fundamentally changed CFQ, instead enhanced it to also support
> hierarchical io scheduling. In the process invariably there are small changes
> here and there as new scenarios come up. Running some tests here and comparing
> both the CFQ's to see if there is any major deviation in behavior.
>
> Test1: Sequential Readers
> =========================
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
>
> IO scheduler: Vanilla CFQ
>
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1 35499KiB/s 35499KiB/s 35499KiB/s 19195 usec
> 2 17089KiB/s 13600KiB/s 30690KiB/s 118K usec
> 4 9165KiB/s 5421KiB/s 29411KiB/s 380K usec
> 8 3815KiB/s 3423KiB/s 29312KiB/s 830K usec
> 16 1911KiB/s 1554KiB/s 28921KiB/s 1756K usec
>
> IO scheduler: IO controller CFQ
>
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1 34494KiB/s 34494KiB/s 34494KiB/s 14482 usec
> 2 16983KiB/s 13632KiB/s 30616KiB/s 123K usec
> 4 9237KiB/s 5809KiB/s 29631KiB/s 372K usec
> 8 3901KiB/s 3505KiB/s 29162KiB/s 822K usec
> 16 1895KiB/s 1653KiB/s 28945KiB/s 1778K usec
>
> Test2: Sequential Writers
> =========================
> [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
>
> IO scheduler: Vanilla CFQ
>
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1 22669KiB/s 22669KiB/s 22669KiB/s 401K usec
> 2 14760KiB/s 7419KiB/s 22179KiB/s 571K usec
> 4 5862KiB/s 5746KiB/s 23174KiB/s 444K usec
> 8 3377KiB/s 2199KiB/s 22427KiB/s 1057K usec
> 16 2229KiB/s 556KiB/s 20601KiB/s 5099K usec
>
> IO scheduler: IO Controller CFQ
>
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1 22911KiB/s 22911KiB/s 22911KiB/s 37319 usec
> 2 11752KiB/s 11632KiB/s 23383KiB/s 245K usec
> 4 6663KiB/s 5409KiB/s 23207KiB/s 384K usec
> 8 3161KiB/s 2460KiB/s 22566KiB/s 935K usec
> 16 1888KiB/s 795KiB/s 21349KiB/s 3009K usec
>
> Test3: Random Readers
> =========================
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
>
> IO scheduler: Vanilla CFQ
>
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1 484KiB/s 484KiB/s 484KiB/s 22596 usec
> 2 229KiB/s 196KiB/s 425KiB/s 51111 usec
> 4 119KiB/s 73KiB/s 405KiB/s 2344 msec
> 8 93KiB/s 23KiB/s 399KiB/s 2246 msec
> 16 38KiB/s 8KiB/s 328KiB/s 3965 msec
>
> IO scheduler: IO Controller CFQ
>
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1 483KiB/s 483KiB/s 483KiB/s 29391 usec
> 2 229KiB/s 196KiB/s 426KiB/s 51625 usec
> 4 132KiB/s 88KiB/s 417KiB/s 2313 msec
> 8 79KiB/s 18KiB/s 389KiB/s 2298 msec
> 16 43KiB/s 9KiB/s 327KiB/s 3905 msec
>
> Test4: Random Writers
> =====================
> [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
>
> IO scheduler: Vanilla CFQ
>
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1 14641KiB/s 14641KiB/s 14641KiB/s 93045 usec
> 2 7896KiB/s 1348KiB/s 9245KiB/s 82778 usec
> 4 2657KiB/s 265KiB/s 6025KiB/s 216K usec
> 8 951KiB/s 122KiB/s 3386KiB/s 1148K usec
> 16 66KiB/s 22KiB/s 829KiB/s 1308 msec
>
> IO scheduler: IO Controller CFQ
>
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1 14454KiB/s 14454KiB/s 14454KiB/s 74623 usec
> 2 4595KiB/s 4104KiB/s 8699KiB/s 135K usec
> 4 3113KiB/s 334KiB/s 5782KiB/s 200K usec
> 8 1146KiB/s 95KiB/s 3832KiB/s 593K usec
> 16 71KiB/s 29KiB/s 814KiB/s 1457 msec
>
> Notes:
> - Does not look like anything has changed significantly.
>
> Previous versions of the patches were posted here.
> ------------------------------------------------
>
> (V1) http://lkml.org/lkml/2009/3/11/486
> (V2) http://lkml.org/lkml/2009/5/5/275
> (V3) http://lkml.org/lkml/2009/5/26/472
> (V4) http://lkml.org/lkml/2009/6/8/580
> (V5) http://lkml.org/lkml/2009/6/19/279
> (V6) http://lkml.org/lkml/2009/7/2/369
> (V7) http://lkml.org/lkml/2009/7/24/253
> (V8) http://lkml.org/lkml/2009/8/16/204
> (V9) http://lkml.org/lkml/2009/8/28/327
>
> Thanks
> Vivek

2009-09-25 01:12:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Thu, 24 Sep 2009 14:33:15 -0700
Andrew Morton <[email protected]> wrote:
> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky and biggest reason is that async writes
> > are cached in higher layers (page cache) as well as possibly in file system
> > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > in a proportional manner.
> >
> > For example, consider two dd threads reading /dev/zero as input file and doing
> > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > be forced to write out some pages to disk before more pages can be dirtied. But
> > not necessarily dirty pages of same thread are picked. It can very well pick
> > the inode of lesser priority dd thread and do some writeout. So effectively
> > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > service differentiation.
> >
> > IOW, the core problem with buffered write fairness is that higher weight thread
> > does not throw enough IO traffic at IO controller to keep the queue
> > continuously backlogged. In my testing, there are many .2 to .8 second
> > intervals where higher weight queue is empty and in that duration lower weight
> > queue gets lots of work done, giving the impression that there was no service
> > differentiation.
> >
> > In summary, from IO controller point of view async writes support is there.
> > Because page cache has not been designed in such a manner that higher
> > prio/weight writer can do more write out as compared to lower prio/weight
> > writer, getting service differentiation is hard and it is visible in some
> > cases and not visible in others.
>
> Here's where it all falls to pieces.
>
> For async writeback we just don't care about IO priorities. Because
> from the point of view of the userspace task, the write was async! It
> occurred at memory bandwidth speed.
>
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation. And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
>
> So when balance_dirty_pages() hits, what do we want to do?
>
> I suppose that all we can do is to block low-ioprio processes more
> aggressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
>
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.
>

I think I must support dirty-ratio in memcg layer. But not yet.
I can't easily imagine how the system will work if both dirty-ratio and
io-controller cgroup are supported. But considering using them as a set of
cgroups, called containers (zone?), it will not be bad, I think.

The final bottleneck queue for fairness in usual workload on usual (small)
server will be ext3's journal, I wonder ;)

Thanks,
-Kame


> Importantly screwed! It's a very common workload pattern, and one
> which causes tremendous amounts of IO to be generated very quickly,
> traditionally causing bad latency effects all over the place. And we
> have no answer to this.
>
> > Vanilla CFQ Vs IO Controller CFQ
> > ================================
> > We have not fundamentally changed CFQ, instead enhanced it to also support
> > hierarchical io scheduling. In the process invariably there are small changes
> > here and there as new scenarios come up. Running some tests here and comparing
> > both the CFQ's to see if there is any major deviation in behavior.
> >
> > Test1: Sequential Readers
> > =========================
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 35499KiB/s 35499KiB/s 35499KiB/s 19195 usec
> > 2 17089KiB/s 13600KiB/s 30690KiB/s 118K usec
> > 4 9165KiB/s 5421KiB/s 29411KiB/s 380K usec
> > 8 3815KiB/s 3423KiB/s 29312KiB/s 830K usec
> > 16 1911KiB/s 1554KiB/s 28921KiB/s 1756K usec
> >
> > IO scheduler: IO controller CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 34494KiB/s 34494KiB/s 34494KiB/s 14482 usec
> > 2 16983KiB/s 13632KiB/s 30616KiB/s 123K usec
> > 4 9237KiB/s 5809KiB/s 29631KiB/s 372K usec
> > 8 3901KiB/s 3505KiB/s 29162KiB/s 822K usec
> > 16 1895KiB/s 1653KiB/s 28945KiB/s 1778K usec
> >
> > Test2: Sequential Writers
> > =========================
> > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 22669KiB/s 22669KiB/s 22669KiB/s 401K usec
> > 2 14760KiB/s 7419KiB/s 22179KiB/s 571K usec
> > 4 5862KiB/s 5746KiB/s 23174KiB/s 444K usec
> > 8 3377KiB/s 2199KiB/s 22427KiB/s 1057K usec
> > 16 2229KiB/s 556KiB/s 20601KiB/s 5099K usec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 22911KiB/s 22911KiB/s 22911KiB/s 37319 usec
> > 2 11752KiB/s 11632KiB/s 23383KiB/s 245K usec
> > 4 6663KiB/s 5409KiB/s 23207KiB/s 384K usec
> > 8 3161KiB/s 2460KiB/s 22566KiB/s 935K usec
> > 16 1888KiB/s 795KiB/s 21349KiB/s 3009K usec
> >
> > Test3: Random Readers
> > =========================
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 484KiB/s 484KiB/s 484KiB/s 22596 usec
> > 2 229KiB/s 196KiB/s 425KiB/s 51111 usec
> > 4 119KiB/s 73KiB/s 405KiB/s 2344 msec
> > 8 93KiB/s 23KiB/s 399KiB/s 2246 msec
> > 16 38KiB/s 8KiB/s 328KiB/s 3965 msec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 483KiB/s 483KiB/s 483KiB/s 29391 usec
> > 2 229KiB/s 196KiB/s 426KiB/s 51625 usec
> > 4 132KiB/s 88KiB/s 417KiB/s 2313 msec
> > 8 79KiB/s 18KiB/s 389KiB/s 2298 msec
> > 16 43KiB/s 9KiB/s 327KiB/s 3905 msec
> >
> > Test4: Random Writers
> > =====================
> > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 14641KiB/s 14641KiB/s 14641KiB/s 93045 usec
> > 2 7896KiB/s 1348KiB/s 9245KiB/s 82778 usec
> > 4 2657KiB/s 265KiB/s 6025KiB/s 216K usec
> > 8 951KiB/s 122KiB/s 3386KiB/s 1148K usec
> > 16 66KiB/s 22KiB/s 829KiB/s 1308 msec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 14454KiB/s 14454KiB/s 14454KiB/s 74623 usec
> > 2 4595KiB/s 4104KiB/s 8699KiB/s 135K usec
> > 4 3113KiB/s 334KiB/s 5782KiB/s 200K usec
> > 8 1146KiB/s 95KiB/s 3832KiB/s 593K usec
> > 16 71KiB/s 29KiB/s 814KiB/s 1457 msec
> >
> > Notes:
> > - Does not look like anything has changed significantly.
> >
> > Previous versions of the patches were posted here.
> > ------------------------------------------------
> >
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > (V6) http://lkml.org/lkml/2009/7/2/369
> > (V7) http://lkml.org/lkml/2009/7/24/253
> > (V8) http://lkml.org/lkml/2009/8/16/204
> > (V9) http://lkml.org/lkml/2009/8/28/327
> >
> > Thanks
> > Vivek

2009-09-25 01:20:50

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 25 Sep 2009 10:09:52 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Thu, 24 Sep 2009 14:33:15 -0700
> Andrew Morton <[email protected]> wrote:
> > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > ===================================================================
> > > Fairness for async writes is tricky and biggest reason is that async writes
> > > are cached in higher layers (page cache) as well as possibly in file system
> > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > in a proportional manner.
> > >
> > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > service differentiation.
> > >
> > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > does not throw enough IO traffic at IO controller to keep the queue
> > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > intervals where higher weight queue is empty and in that duration lower weight
> > > queue gets lots of work done, giving the impression that there was no service
> > > differentiation.
> > >
> > > In summary, from IO controller point of view async writes support is there.
> > > Because page cache has not been designed in such a manner that higher
> > > prio/weight writer can do more write out as compared to lower prio/weight
> > > writer, getting service differentiation is hard and it is visible in some
> > > cases and not visible in others.
> >
> > Here's where it all falls to pieces.
> >
> > For async writeback we just don't care about IO priorities. Because
> > from the point of view of the userspace task, the write was async! It
> > occurred at memory bandwidth speed.
> >
> > It's only when the kernel's dirty memory thresholds start to get
> > exceeded that we start to care about prioritisation. And at that time,
> > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > consumes just as much memory as a low-ioprio dirty page.
> >
> > So when balance_dirty_pages() hits, what do we want to do?
> >
> > I suppose that all we can do is to block low-ioprio processes more
> > aggressively at the VFS layer, to reduce the rate at which they're
> > dirtying memory so as to give high-ioprio processes more of the disk
> > bandwidth.
> >
> > But you've gone and implemented all of this stuff at the io-controller
> > level and not at the VFS level so you're, umm, screwed.
> >
>
> I think I must support dirty-ratio in memcg layer. But not yet.

OR...I'll add a buffered-write cgroup to track buffered writebacks.
And add a control knob as
buffered_write.nr_dirty_thresh
to limit the number of dirty pages generated via a cgroup.

Because memcg just records an owner of pages but does not record who makes them
dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
cgroup code.

But I'm not sure how I should treat I/Os generated by kswapd.

Thanks,
-Kame

2009-09-25 02:20:15

by Ulrich Lukas

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Vivek Goyal wrote:
> Notes:
> - With vanilla CFQ, random writers can overwhelm a random reader,
> bringing down its throughput and bumping up latencies significantly.


IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
too.

I'm basing this assumption on the observations I made on both OpenSuse
11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
titled: "Poor desktop responsiveness with background I/O-operations" of
2009-09-20.
(Message ID: [email protected])


Thus, I'm posting this to show that your work is greatly appreciated,
given the rather disappointing status quo of Linux's fairness when it
comes to disk IO time.

I hope that your efforts lead to a change in performance of current
userland applications, the sooner, the better.


Thanks
Ulrich

2009-09-25 04:16:01

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Sep 25, 2009 at 10:09:52AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 24 Sep 2009 14:33:15 -0700
> Andrew Morton <[email protected]> wrote:
> > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > ===================================================================
> > > Fairness for async writes is tricky and biggest reason is that async writes
> > > are cached in higher layers (page cache) as well as possibly in file system
> > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > in a proportional manner.
> > >
> > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > service differentiation.
> > >
> > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > does not throw enough IO traffic at IO controller to keep the queue
> > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > intervals where higher weight queue is empty and in that duration lower weight
> > > queue gets lots of work done, giving the impression that there was no service
> > > differentiation.
> > >
> > > In summary, from IO controller point of view async writes support is there.
> > > Because page cache has not been designed in such a manner that higher
> > > prio/weight writer can do more write out as compared to lower prio/weight
> > > writer, getting service differentiation is hard and it is visible in some
> > > cases and not visible in others.
> >
> > Here's where it all falls to pieces.
> >
> > For async writeback we just don't care about IO priorities. Because
> > from the point of view of the userspace task, the write was async! It
> > occurred at memory bandwidth speed.
> >
> > It's only when the kernel's dirty memory thresholds start to get
> > exceeded that we start to care about prioritisation. And at that time,
> > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > consumes just as much memory as a low-ioprio dirty page.
> >
> > So when balance_dirty_pages() hits, what do we want to do?
> >
> > I suppose that all we can do is to block low-ioprio processes more
> > aggressively at the VFS layer, to reduce the rate at which they're
> > dirtying memory so as to give high-ioprio processes more of the disk
> > bandwidth.
> >
> > But you've gone and implemented all of this stuff at the io-controller
> > level and not at the VFS level so you're, umm, screwed.
> >
>
> I think I must support dirty-ratio in memcg layer. But not yet.
> I can't easily imagine how the system will work if both dirty-ratio and
> io-controller cgroup are supported.

IIUC, you are suggesting per memory cgroup dirty ratio and the writer will be
throttled if the dirty ratio is crossed. Makes sense to me. Just that the io
controller and memory controller shall have to be mounted together.
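
A sketch of what that per memory cgroup dirty ratio could look like from the
writer's side: charge dirty pages to the cgroup and throttle a writer once its
own cgroup crosses dirty_ratio of its memory limit. Names and numbers are
illustrative only; no such interface exists at this point.

class MemCgroup(object):
    def __init__(self, limit_pages, dirty_ratio=0.20):
        self.limit_pages = limit_pages
        self.dirty_ratio = dirty_ratio
        self.dirty_pages = 0

    def over_dirty_limit(self):
        return self.dirty_pages > self.limit_pages * self.dirty_ratio

    def writeback(self, pages):
        # stand-in for waiting on real writeback of this cgroup's pages
        self.dirty_pages = max(0, self.dirty_pages - pages)

def buffered_write(cg, pages):
    # throttle the writer of *this* cgroup once its own dirty ratio is
    # crossed, independent of what other cgroups are doing
    while cg.over_dirty_limit():
        cg.writeback(64)
    cg.dirty_pages += pages

cg = MemCgroup(limit_pages=25600)          # 100MB memory limit at 4KB pages
for _ in range(10000):
    buffered_write(cg, 8)
print(cg.dirty_pages, "dirty pages, soft limit", int(cg.limit_pages * cg.dirty_ratio))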

Thanks
Vivek

> But considering using them as a set of
> cgroups, called containers (zone?), it will not be bad, I think.
>
> The final bottleneck queue for fairness in usual workload on usual (small)
> server will be ext3's journal, I wonder ;)
>
> Thanks,
> -Kame
>
>
> > Importantly screwed! It's a very common workload pattern, and one
> > which causes tremendous amounts of IO to be generated very quickly,
> > traditionally causing bad latency effects all over the place. And we
> > have no answer to this.
> >
> > > Vanilla CFQ Vs IO Controller CFQ
> > > ================================
> > > We have not fundamentally changed CFQ, instead enhanced it to also support
> > > hierarchical io scheduling. In the process invariably there are small changes
> > > here and there as new scenarios come up. Running some tests here and comparing
> > > both the CFQ's to see if there is any major deviation in behavior.
> > >
> > > Test1: Sequential Readers
> > > =========================
> > > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1 35499KiB/s 35499KiB/s 35499KiB/s 19195 usec
> > > 2 17089KiB/s 13600KiB/s 30690KiB/s 118K usec
> > > 4 9165KiB/s 5421KiB/s 29411KiB/s 380K usec
> > > 8 3815KiB/s 3423KiB/s 29312KiB/s 830K usec
> > > 16 1911KiB/s 1554KiB/s 28921KiB/s 1756K usec
> > >
> > > IO scheduler: IO controller CFQ
> > >
> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1 34494KiB/s 34494KiB/s 34494KiB/s 14482 usec
> > > 2 16983KiB/s 13632KiB/s 30616KiB/s 123K usec
> > > 4 9237KiB/s 5809KiB/s 29631KiB/s 372K usec
> > > 8 3901KiB/s 3505KiB/s 29162KiB/s 822K usec
> > > 16 1895KiB/s 1653KiB/s 28945KiB/s 1778K usec
> > >
> > > Test2: Sequential Writers
> > > =========================
> > > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1 22669KiB/s 22669KiB/s 22669KiB/s 401K usec
> > > 2 14760KiB/s 7419KiB/s 22179KiB/s 571K usec
> > > 4 5862KiB/s 5746KiB/s 23174KiB/s 444K usec
> > > 8 3377KiB/s 2199KiB/s 22427KiB/s 1057K usec
> > > 16 2229KiB/s 556KiB/s 20601KiB/s 5099K usec
> > >
> > > IO scheduler: IO Controller CFQ
> > >
> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1 22911KiB/s 22911KiB/s 22911KiB/s 37319 usec
> > > 2 11752KiB/s 11632KiB/s 23383KiB/s 245K usec
> > > 4 6663KiB/s 5409KiB/s 23207KiB/s 384K usec
> > > 8 3161KiB/s 2460KiB/s 22566KiB/s 935K usec
> > > 16 1888KiB/s 795KiB/s 21349KiB/s 3009K usec
> > >
> > > Test3: Random Readers
> > > =========================
> > > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1 484KiB/s 484KiB/s 484KiB/s 22596 usec
> > > 2 229KiB/s 196KiB/s 425KiB/s 51111 usec
> > > 4 119KiB/s 73KiB/s 405KiB/s 2344 msec
> > > 8 93KiB/s 23KiB/s 399KiB/s 2246 msec
> > > 16 38KiB/s 8KiB/s 328KiB/s 3965 msec
> > >
> > > IO scheduler: IO Controller CFQ
> > >
> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1 483KiB/s 483KiB/s 483KiB/s 29391 usec
> > > 2 229KiB/s 196KiB/s 426KiB/s 51625 usec
> > > 4 132KiB/s 88KiB/s 417KiB/s 2313 msec
> > > 8 79KiB/s 18KiB/s 389KiB/s 2298 msec
> > > 16 43KiB/s 9KiB/s 327KiB/s 3905 msec
> > >
> > > Test4: Random Writers
> > > =====================
> > > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1 14641KiB/s 14641KiB/s 14641KiB/s 93045 usec
> > > 2 7896KiB/s 1348KiB/s 9245KiB/s 82778 usec
> > > 4 2657KiB/s 265KiB/s 6025KiB/s 216K usec
> > > 8 951KiB/s 122KiB/s 3386KiB/s 1148K usec
> > > 16 66KiB/s 22KiB/s 829KiB/s 1308 msec
> > >
> > > IO scheduler: IO Controller CFQ
> > >
> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1 14454KiB/s 14454KiB/s 14454KiB/s 74623 usec
> > > 2 4595KiB/s 4104KiB/s 8699KiB/s 135K usec
> > > 4 3113KiB/s 334KiB/s 5782KiB/s 200K usec
> > > 8 1146KiB/s 95KiB/s 3832KiB/s 593K usec
> > > 16 71KiB/s 29KiB/s 814KiB/s 1457 msec
> > >
> > > Notes:
> > > - Does not look like anything has changed significantly.
> > >
> > > Previous versions of the patches were posted here.
> > > ------------------------------------------------
> > >
> > > (V1) http://lkml.org/lkml/2009/3/11/486
> > > (V2) http://lkml.org/lkml/2009/5/5/275
> > > (V3) http://lkml.org/lkml/2009/5/26/472
> > > (V4) http://lkml.org/lkml/2009/6/8/580
> > > (V5) http://lkml.org/lkml/2009/6/19/279
> > > (V6) http://lkml.org/lkml/2009/7/2/369
> > > (V7) http://lkml.org/lkml/2009/7/24/253
> > > (V8) http://lkml.org/lkml/2009/8/16/204
> > > (V9) http://lkml.org/lkml/2009/8/28/327
> > >
> > > Thanks
> > > Vivek

2009-09-25 05:05:22

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:
> On Thu, 24 Sep 2009 15:25:04 -0400
> Vivek Goyal <[email protected]> wrote:
>
> >
> > Hi All,
> >
> > Here is the V10 of the IO controller patches generated on top of 2.6.31.
> >
>
> Thanks for the writeup. It really helps and is most worthwhile for a
> project of this importance, size and complexity.
>
>
> >
> > What problem are we trying to solve
> > ===================================
> > Provide group IO scheduling feature in Linux along the lines of other resource
> > controllers like cpu.
> >
> > IOW, provide facility so that a user can group applications using cgroups and
> > control the amount of disk time/bandwidth received by a group based on its
> > weight.
> >
> > How to solve the problem
> > =========================
> >
> > Different people have solved the issue differetnly. So far looks it looks
> > like we seem to have following two core requirements when it comes to
> > fairness at group level.
> >
> > - Control bandwidth seen by groups.
> > - Control on latencies when a request gets backlogged in group.
> >
> > At least there are now three patchsets available (including this one).
> >
> > IO throttling
> > -------------
> > This is a bandwidth controller which keeps track of IO rate of a group and
> > throttles the process in the group if it exceeds the user specified limit.
> >
> > dm-ioband
> > ---------
> > This is a proportional bandwidth controller implemented as device mapper
> > driver and provides fair access in terms of amount of IO done (not in terms
> > of disk time as CFQ does).
> >
> > So one will setup one or more dm-ioband devices on top of physical/logical
> > block device, configure the ioband device and pass information like grouping
> > etc. Now this device will keep track of bios flowing through it and control
> > the flow of bios based on group policies.
> >
> > IO scheduler based IO controller
> > --------------------------------
> > Here we have viewed the problem of IO contoller as hierarchical group
> > scheduling (along the lines of CFS group scheduling) issue. Currently one can
> > view linux IO schedulers as flat where there is one root group and all the IO
> > belongs to that group.
> >
> > This patchset basically modifies IO schedulers to also support hierarchical
> > group scheduling. CFQ already provides fairness among different processes. I
> > have extended it to support group IO scheduling. Also took some of the code out
> > of CFQ and put in a common layer so that same group scheduling code can be
> > used by noop, deadline and AS to support group scheduling.
> >
> > Pros/Cons
> > =========
> > There are pros and cons to each of the approach. Following are some of the
> > thoughts.
> >
> > Max bandwidth vs proportional bandwidth
> > ---------------------------------------
> > IO throttling is a max bandwidth controller and not a proportional one.
> > Additionally it provides fairness in terms of amount of IO done (and not in
> > terms of disk time as CFQ does).
> >
> > Personally, I think that proportional weight controller is useful to more
> > people than just max bandwidth controller. In addition, IO scheduler based
> > controller can also be enhanced to do max bandwidth control. So it can
> > satisfy wider set of requirements.
> >
> > Fairness in terms of disk time vs size of IO
> > ---------------------------------------------
> > A higher level controller will most likely be limited to providing fairness
> > in terms of size/number of IO done and will find it hard to provide fairness
> > in terms of disk time used (as CFQ provides between various prio levels). This
> > is because only IO scheduler knows how much disk time a queue has used and
> > information about queues and disk time used is not exported to higher
> > layers.
> >
> > So a seeky application will still run away with a lot of disk time and bring
> > down the overall throughput of the disk.
>
> But that's only true if the thing is poorly implemented.
>
> A high-level controller will need some view of the busyness of the
> underlying device(s). That could be "proportion of idle time", or
> "average length of queue" or "average request latency" or some mix of
> these or something else altogether.
>
> But these things are simple to calculate, and are simple to feed back
> to the higher-level controller and probably don't require any changes
> to the IO scheduler at all, which is a great advantage.
>
>
> And I must say that high-level throttling based upon feedback from
> lower layers seems like a much better model to me than hacking away in
> the IO scheduler layer. Both from an implementation point of view and
> from a "we can get it to work on things other than block devices" point
> of view.
>

Hi Andrew,

Few thoughts.

- A higher level throttling approach suffers from the issue of unfair
throttling. So if there are multiple tasks in the group, who do we
throttle and how do we make sure that we throttle in proportion
to the prio of the tasks? Andrea's IO throttling implementation suffered
from these issues. I had run some tests where RT and BW tasks were
getting the same BW with-in a group, or tasks of different prio were getting
the same BW.

Even if we figure out a way to do fair throttling with-in a group, the underlying
IO scheduler might not be CFQ at all, in which case we should not have done so.

https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html

- Higher level throttling does not know where the IO is actually going in the
physical layer. So we might unnecessarily be throttling IO which is
going to the same logical device but at the end of the day to different
physical devices.

Agreed that some people will want that behavior, especially in the case
of max bandwidth control where one does not want to give you the BW
because you did not pay for it.

So a higher level controller is good for max bw control, but when it comes
to optimal usage of resources and doing control only if needed, it
probably is not the best thing.

About the feedback thing, I am not very sure. Are you saying that we will
run timed groups in the higher layer and take feedback from the underlying IO
scheduler about how much time a group consumed, or something like that, and
not do accounting in terms of size of IO?
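
One way to picture that feedback model (purely a sketch of the control loop;
how the disk-time feedback would actually be exported by the IO scheduler is
exactly the open question): account disk time per group as reported from
below, and throttle a group once it runs ahead of its weighted share.

def over_time_share(used_time, weights, group, slack=1.05):
    # used_time: group -> disk time consumed so far, as reported by the
    # lower layer (hypothetical feedback); weights: group -> weight
    total_time = sum(used_time.values())
    if total_time == 0:
        return False
    fair_share = total_time * weights[group] / float(sum(weights.values()))
    return used_time[group] > fair_share * slack

weights = {"grp1": 500, "grp2": 1000}
used = {"grp1": 40, "grp2": 55}            # ms of disk time, invented numbers
for g in sorted(weights):
    action = "throttle" if over_time_share(used, weights, g) else "dispatch"
    print(g, action)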


> > Currently dm-ioband provides fairness in terms of number/size of IO.
> >
> > Latencies and isolation between groups
> > --------------------------------------
> > A higher level controller is generally implementing a bandwidth throttling
> > solution where if a group exceeds either the max bandwidth or the proportional
> > share then throttle that group.
> >
> > This kind of approach will probably not help in controlling latencies as it
> > will depend on underlying IO scheduler. Consider following scenario.
> >
> > Assume there are two groups. One group is running multiple sequential readers
> > and other group has a random reader. sequential readers will get a nice 100ms
> > slice
>
> Do you refer to each reader within group1, or to all readers? It would be
> daft if each reader in group1 were to get 100ms.
>

All readers in the group should get 100ms each, both in IO throttling and
dm-ioband solution.

Higher level solutions are not keeping track of time slices. Time slices will
be allocated by CFQ, which does not have any idea about grouping. The higher
level controller just keeps track of the size of IO done at group level and
then runs either a leaky bucket or token bucket algorithm.
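
For concreteness, the size-based accounting typically looks like a textbook
token bucket per group, along these lines (not code from dm-ioband or
io-throttle, just the shape of the algorithm):

import time

class TokenBucket(object):
    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.burst = burst_bytes
        self.tokens = burst_bytes
        self.last = time.time()

    def try_dispatch(self, nr_bytes):
        now = time.time()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nr_bytes:
            self.tokens -= nr_bytes
            return True        # bio goes down to the IO scheduler right away
        return False           # bio gets buffered/delayed by the controller

group = TokenBucket(rate_bytes_per_sec=10 * 1024 * 1024, burst_bytes=1024 * 1024)
print(group.try_dispatch(512 * 1024))      # True: within the group's byte budget

Note that nothing in this accounting knows about disk time or seek cost, which
is the contrast with CFQ being drawn here.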

IO throttling is a max BW controller, so it will not even care about what is
happening in other group. It will just be concerned with rate of IO in one
particular group and if we exceed specified limit, throttle it. So until and
unless the sequential reader group hits its max bw limit, it will keep sending
reads down to CFQ, and CFQ will happily assign 100ms slices to readers.

dm-ioband will not try to choke the high throughput sequential reader group
for the slow random reader group because that would just kill the throughput
of rotational media. Every sequential reader will run for few ms and then
be throttled and this goes on. Disk will soon be seek bound.

> > each and then a random reader from group2 will get to dispatch the
> > request. So latency of this random reader will depend on how many sequential
> > readers are running in other group and that is a weak isolation between groups.
>
> And yet that is what you appear to mean.
>
> But surely nobody would do that - the 100ms would be assigned to and
> distributed amongst all readers in group1?

Dividing 100ms among all the sequential readers might not be very good on
rotational media as each reader runs for a small time and then a seek happens.
This will increase the number of seeks in the system. Think of 32 sequential
readers in the group and then each getting less than 3ms to run.

A better way probably is to give each queue 100ms in one run of the group and
then switch groups. Something like the following.

SR1 RR SR2 RR SR3 RR SR4 RR...

Now each sequential reader gets 100ms and the disk is not seek bound; at the
same time, random reader latency is limited by the number of competing groups
and not by the number of processes in the group. This is what the IO scheduler
based IO controller is effectively doing currently.
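
The latency effect of that ordering can be put in numbers: worst-case gap
between two turns of the random reader, flat CFQ vs the group round robin
described above (slice length illustrative).

SLICE_MS = 100

def flat_cfq_gap(nr_seq_readers):
    # flat CFQ: every sequential queue gets its slice before the random
    # reader is serviced again
    return nr_seq_readers * SLICE_MS

def grouped_gap(nr_seq_readers):
    # groups alternate (SR1 RR SR2 RR ...), so the random reader waits for
    # at most one sequential slice regardless of how many readers group1 has
    return SLICE_MS

for nr in (1, 4, 16, 32):
    print("%2d seq readers: flat wait %4d ms, grouped wait %d ms"
          % (nr, flat_cfq_gap(nr), grouped_gap(nr)))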

>
> > When we control things at IO scheduler level, we assign one time slice to one
> > group and then pick next entity to run. So effectively after one time slice
> > (max 180ms, if prio 0 sequential reader is running), random reader in other
> > group will get to run. Hence we achieve better isolation between groups as
> > response time of process in a different group is generally not dependent on
> > number of processes running in competing group.
>
> I don't understand why you're comparing this implementation with such
> an obviously dumb competing design!
>
> > So a higher level solution is most likely limited to only shaping bandwidth
> > without any control on latencies.
> >
> > Stacking group scheduler on top of CFQ can lead to issues
> > ---------------------------------------------------------
> > IO throttling and dm-ioband both are second level controller. That is these
> > controllers are implemented in higher layers than io schedulers. So they
> > control the IO at higher layer based on group policies and later IO
> > schedulers take care of dispatching these bios to disk.
> >
> > Implementing a second level controller has the advantage of being able to
> > provide bandwidth control even on logical block devices in the IO stack
> > which don't have any IO schedulers attached to these. But they can also
> > interfere with IO scheduling policy of underlying IO scheduler and change
> > the effective behavior. Following are some of the issues which I think
> > should be visible in second level controller in one form or other.
> >
> > Prio with-in group
> > ------------------
> > A second level controller can potentially interfere with behavior of
> > different prio processes with-in a group. bios are buffered at higher layer
> > in single queue and release of bios is FIFO and not proportionate to the
> > ioprio of the process. This can result in a particular prio level not
> > getting fair share.
>
> That's an administrator error, isn't it? Should have put the
> different-priority processes into different groups.
>

I am thinking in practice it probably will be a mix of priority in each
group. For example, consider a hypothetical scenario where two students
on a university server are given two cgroups of certain weights so that IO
done by these students are limited in case of contention. Now these students
might want to throw in a mix of priority workloads in their respective cgroups.
The admin would not have any idea what priority processes the students are
running in their respective cgroups.

> > Buffering at higher layer can delay read requests for more than slice idle
> > period of CFQ (default 8 ms). That means, it is possible that we are waiting
> > for a request from the queue but it is buffered at higher layer and then idle
> > timer will fire. It means that the queue will lose its share; at the same time
> > overall throughput will be impacted as we lost those 8 ms.
>
> That sounds like a bug.
>

Actually this probably is a limitation of a higher level controller. It most
likely is sitting so high in the IO stack that it has no idea what the underlying
IO scheduler is and what the IO scheduler's policies are. So it can't keep up
with the IO scheduler's policies. Secondly, it might be a low weight group and
tokens might not be available fast enough to release the request.

> > Read Vs Write
> > -------------
> > Writes can overwhelm readers hence second level controller FIFO release
> > will run into issue here. If there is a single queue maintained then reads
> > will suffer large latencies. If there are separate queues for reads and writes
> > then it will be hard to decide in what ratio to dispatch reads and writes as
> > it is IO scheduler's decision to decide when and how much read/write to
> > dispatch. This is another place where higher level controller will not be in
> > sync with lower level io scheduler and can change the effective policies of
> > underlying io scheduler.
>
> The IO schedulers already take care of read-vs-write and already take
> care of preventing large writes-starve-reads latencies (or at least,
> they're supposed to).

True. Actually this is a limitation of a higher level controller. A higher
level controller will most likely implement some kind of queuing/buffering
mechanism where it will buffer requests when it decides to throttle the
group. Now once a fair number of read and write requests are buffered, and the
controller is ready to dispatch some requests from the group, which
requests/bios should it dispatch? Reads first, or writes first, or reads and
writes in a certain ratio?

In what ratio reads and writes are dispatched is the property and decision of
the IO scheduler. Now the higher level controller will be taking this decision
and changing the behavior of the underlying io scheduler.

>
> > CFQ IO context Issues
> > ---------------------
> > Buffering at higher layer means submission of bios later with the help of
> > a worker thread.
>
> Why?
>
> If it's a read, we just block the userspace process.
>
> If it's a delayed write, the IO submission already happens in a kernel thread.

Is it ok to block pdflush on a group? Some low weight group might block it
for a long time and hence not allow flushing out other pages. Probably that's
the reason pdflush used to check if the underlying device is congested or not,
and if it is congested, we don't go ahead with submission of the request.
With per bdi flusher threads things will change.

I think btrfs also has some threads which don't want to block, and if the
underlying device is congested, it bails out. That's the reason I
implemented a per group congestion interface where, if a thread does not want
to block, it can check whether the group the IO is going into is congested and
whether it will block. So for such threads, probably a higher level
controller shall have to implement a per group congestion interface so that
threads which don't want to block can check with the controller whether
it has sufficient BW to let them through without blocking, or maybe start
buffering writes in the group queue.
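
The shape of that per group congestion check, seen from a thread that must not
block (the names and the queue-length threshold are made up; the real interface
is a per group analogue of the device congestion check mentioned above):

def group_congested(nr_queued, queue_limit=128):
    # hypothetical predicate: the group is "congested" once it already has
    # more than queue_limit requests waiting
    return nr_queued >= queue_limit

def try_submit(bio, group_queue, queue_limit=128):
    if group_congested(len(group_queue), queue_limit):
        return False           # caller backs off and retries later, never blocks
    group_queue.append(bio)    # enough room: submit without risk of blocking
    return True

queue = []
print(try_submit("bio-1", queue))          # True while the group queue has room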

>
> If it's a synchronous write, we have to block the userspace caller
> anyway.
>
> Async reads might be an issue, dunno.
>

I think async IO is one of the reasons. IIRC, Andrea Righi implemented the
policy of returning an error for async IO if the group did not have sufficient
tokens to dispatch the async IO and expected the application to retry
later. I am not sure if that is ok.

So yes, if we are not buffering any of the read requests and either
blocking the caller or returning an error (async IO), then the CFQ io context is
not an issue.

> > This changes the io context information at CFQ layer which
> > assigns the request to submitting thread. Change of io context info again
> > leads to issues of idle timer expiry and issue of a process not getting fair
> > share and reduced throughput.
>
> But we already have that problem with delayed writeback, which is a
> huge thing - often it's the majority of IO.
>

For delayed writes CFQ will not anticipate, so increased anticipation timer
expiry is not an issue with writes. But it probably will be an issue with
reads, where the higher level controller decides to block the next read while
CFQ is anticipating on that read. I am wondering whether such issues
must appear with all the higher level device mapper/software raid devices
also. How do they handle it? Maybe it is more theoretical and in practice
the impact is not significant.

> > Throughput with noop, deadline and AS
> > ---------------------------------------------
> > I think a higher level controller will result in reduced overall throughput
> > (as compared to io scheduler based io controller) and more seeks with noop,
> > deadline and AS.
> >
> > The reason being, that it is likely that IO with-in a group will be related
> > and will be relatively close as compared to IO across the groups. For example,
> > thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
> > control, IO from various groups will go into a single queue at lower level
> > controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
> > G4....) causing more seeks and reduced throughput. (Agreed that merging will
> > help up to some extent but still....).
> >
> > Instead, in case of lower level controller, IO scheduler maintains one queue
> > per group hence there is no interleaving of IO between groups. And if IO is
> > related with-in group, then we should get reduced number/amount of seeks and
> > higher throughput.
> >
> > Latency can be a concern but that can be controlled by reducing the time
> > slice length of the queue.
>
> Well maybe, maybe not. If a group is throttled, it isn't submitting
> new IO. The unthrottled group is doing the IO submitting and that IO
> will have decent locality.

But throttling will kick in occasionally. The rest of the time both groups
will be dispatching bios at the same time. So for the most part the IO
scheduler will probably see IO from both groups, and there will be
small intervals where one group is completely throttled and the IO scheduler
is busy dispatching requests only from a single group.
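
The seek argument is easy to see with made-up sector numbers: each group's IO
is sequential within the group but the groups sit in different areas of the
disk, so interleaving the two streams makes the head bounce between the areas
while per-group dispatch keeps it in one place.

g1 = [1000 + 8 * i for i in range(16)]        # group1: sequential around sector 1000
g2 = [900000 + 8 * i for i in range(16)]      # group2: sequential, far away on disk

def seek_distance(order):
    # total distance the head travels between consecutive requests
    return sum(abs(b - a) for a, b in zip(order, order[1:]))

interleaved = [s for pair in zip(g1, g2) for s in pair]   # G1, G2, G1, G2, ...
per_group = g1 + g2                                       # one group's batch, then the other

print("interleaved dispatch:", seek_distance(interleaved))
print("per-group dispatch  :", seek_distance(per_group))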

>
> > Fairness at logical device level vs at physical device level
> > ------------------------------------------------------------
> >
> > IO scheduler based controller has the limitation that it works only with the
> > bottom most devices in the IO stack where IO scheduler is attached.
> >
> > For example, assume a user has created a logical device lv0 using three
> > underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> > in two groups doing IO on lv0. Also assume that weights of groups are in the
> > ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
> >
> > T1 T2
> > \ /
> > lv0
> > / | \
> > sda sdb sdc
> >
> >
> > Now resource control will take place only on devices sda, sdb and sdc and
> > not at lv0 level. So if IO from two tasks is relatively uniformly
> > distributed across the disks then T1 and T2 will see the throughput ratio
> > in proportion to weight specified. But if IO from T1 and T2 is going to
> > different disks and there is no contention then at higher level they both
> > will see same BW.
> >
> > Here a second level controller can produce better fairness numbers at the
> > logical device but most likely at reduced overall throughput of the system,
> > because it will try to control IO even if there is no contention at the physical
> > level, possibly leaving disks unused in the system.
> >
> > Hence, question comes that how important it is to control bandwidth at
> > higher level logical devices also. The actual contention for resources is
> > at the leaf block device so it probably makes sense to do any kind of
> > control there and not at the intermediate devices. Secondly probably it
> > also means better use of available resources.
>
> hm. What will be the effects of this limitation in real-world use?

In some cases the user/application will not see the bandwidth ratio between
two groups in the same proportion as the assigned weights, and the primary
reason for that will be that this workload did not create enough contention
for the physical resources underneath.

So it all depends on what kind of bandwidth guarantees we are offering. If
we are saying that we provide good fairness numbers at logical devices,
irrespective of whether resources are used optimally or not, then it will be
irritating for the user.

I think it also might become an issue once we implement max bandwidth
control. We will not be able to define max bandwidth on a logical device
and an application will get more than max bandwidth if it is doing IO to
different underlying devices.

I would say that leaf node control is good for optimal resource usage and
for proportional BW control, but not a good fit for max bandwidth control.
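
Putting the lv0 example in numbers makes the trade-off explicit (the per-disk
bandwidth figure is invented): leaf-level proportional control only changes
what T1 and T2 see when they actually contend on the same leaf device.

DISK_BW = 100.0        # MB/s per leaf disk, illustrative
w1, w2 = 2, 1          # T1:T2 weights

# Case 1: T1 and T2 happen to hit different leaf disks -> no contention
# anywhere, both get a full disk and the ratio observed at lv0 is 1:1.
t1, t2 = DISK_BW, DISK_BW
print("different leaves: %.0f vs %.0f MB/s (%.1f:1)" % (t1, t2, t1 / t2))

# Case 2: both hammer the same leaf disk -> leaf-level control gives 2:1,
# and no disk is left idle because of throttling.
t1 = DISK_BW * w1 / (w1 + w2)
t2 = DISK_BW * w2 / (w1 + w2)
print("same leaf       : %.1f vs %.1f MB/s (%.1f:1)" % (t1, t2, t1 / t2))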

>
> > Limited Fairness
> > ----------------
> > Currently CFQ idles on a sequential reader queue to make sure it gets its
> > fair share. A second level controller will find it tricky to anticipate.
> > Either it will not have any anticipation logic and in that case it will not
> > provide fairness to single readers in a group (as dm-ioband does) or if it
> > starts anticipating then we should run into these strange situations where
> > second level controller is anticipating on one queue/group and underlying
> > IO scheduler might be anticipating on something else.
>
> It depends on the size of the inter-group timeslices. If the amount of
> time for which a group is unthrottled is "large" comapred to the
> typical anticipation times, this issue fades away.
>
> And those timeslices _should_ be large. Because as you mentioned
> above, different groups are probably working different parts of the
> disk.
>
> > Need of device mapper tools
> > ---------------------------
> > A device mapper based solution will require creation of a ioband device
> > on each physical/logical device one wants to control. So it requires usage
> > of device mapper tools even for the people who are not using device mapper.
> > At the same time creation of ioband device on each partition in the system to
> > control the IO can be cumbersome and overwhelming if system has got lots of
> > disks and partitions with-in.
> >
> >
> > IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> > problem of group bandwidth control, and can do hierarchical IO scheduling
> > more tightly and efficiently.
> >
> > But I am all ears to alternative approaches and suggestions how doing things
> > can be done better and will be glad to implement it.
> >
> > TODO
> > ====
> > - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > - More testing to make sure there are no regressions in CFQ.
> >
> > Testing
> > =======
> >
> > Environment
> > ==========
> > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
>
> That's a bit of a toy.

Yes it is. :-)

>
> Do we have testing results for more enterprisey hardware? Big storage
> arrays? SSD? Infiniband? iscsi? nfs? (lol, gotcha)

Not yet. I will try to get hold of some storage arrays and run some tests.

>
>
> > I am mostly
> > running fio jobs which have been limited to 30 seconds run and then monitored
> > the throughput and latency.
> >
> > Test1: Random Reader Vs Random Writers
> > ======================================
> > Launched a random reader and then increasing number of random writers to see
> > the effect on random reader BW and max latencies.
> >
> > [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> >
> > [Vanilla CFQ, No groups]
> > <--------------random writers--------------------> <------random reader-->
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1 5737KiB/s 5737KiB/s 5737KiB/s 164K usec 503KiB/s 159K usec
> > 2 2055KiB/s 1984KiB/s 4039KiB/s 1459K usec 150KiB/s 170K usec
> > 4 1238KiB/s 932KiB/s 4419KiB/s 4332K usec 153KiB/s 225K usec
> > 8 1059KiB/s 929KiB/s 7901KiB/s 1260K usec 118KiB/s 377K usec
> > 16 604KiB/s 483KiB/s 8519KiB/s 3081K usec 47KiB/s 756K usec
> > 32 367KiB/s 222KiB/s 9643KiB/s 5940K usec 22KiB/s 923K usec
> >
> > Created two cgroups group1 and group2 of weights 500 each. Launched increasing
> > number of random writers in group1 and one random reader in group2 using fio.
> >
> > [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> > <--------------random writers(group1)-------------> <-random reader(group2)->
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1 18115KiB/s 18115KiB/s 18115KiB/s 604K usec 345KiB/s 176K usec
> > 2 3752KiB/s 3676KiB/s 7427KiB/s 4367K usec 402KiB/s 187K usec
> > 4 1951KiB/s 1863KiB/s 7642KiB/s 1989K usec 384KiB/s 181K usec
> > 8 755KiB/s 629KiB/s 5683KiB/s 2133K usec 366KiB/s 319K usec
> > 16 418KiB/s 369KiB/s 6276KiB/s 1323K usec 352KiB/s 287K usec
> > 32 236KiB/s 191KiB/s 6518KiB/s 1910K usec 337KiB/s 273K usec
>
> That's a good result.
>
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> >
> > [IO controller CFQ; No groups ]
> > <--------------random writers--------------------> <------random reader-->
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1 5696KiB/s 5696KiB/s 5696KiB/s 259K usec 500KiB/s 194K usec
> > 2 2483KiB/s 2197KiB/s 4680KiB/s 887K usec 150KiB/s 159K usec
> > 4 1471KiB/s 1433KiB/s 5817KiB/s 962K usec 126KiB/s 189K usec
> > 8 691KiB/s 580KiB/s 5159KiB/s 2752K usec 197KiB/s 246K usec
> > 16 781KiB/s 698KiB/s 11892KiB/s 943K usec 61KiB/s 529K usec
> > 32 415KiB/s 324KiB/s 12461KiB/s 4614K usec 17KiB/s 737K usec
> >
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader, bringing down
> > its throughput and bumping up latencies significantly.
>
> Isn't that a CFQ shortcoming which we should address separately? If
> so, the comparisons aren't presently valid because we're comparing with
> a CFQ which has known, should-be-fixed problems.

I am not sure if it is a CFQ issue. These are synchronous random writes.
They are just as important as the random reader. So now CFQ has 33 synchronous
queues to serve. Because it does not know about groups, it has no choice but
to serve them in round robin manner. So it does not sound like a CFQ issue.
Just that CFQ can give the random reader an advantage if it knows that the
random reader is in a different group, and that's where the IO controller comes
into the picture.

>
> > - With IO controller, one can provide isolation to the random reader group and
> > maintain consitent view of bandwidth and latencies.
> >
> > Test2: Random Reader Vs Sequential Reader
> > ========================================
> > Launched a random reader and then increasing number of sequential readers to
> > see the effect on BW and latencies of random reader.
> >
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> >
> > [ Vanilla CFQ, No groups ]
> > <---------------seq readers----------------------> <------random reader-->
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1 23318KiB/s 23318KiB/s 23318KiB/s 55940 usec 36KiB/s 247K usec
> > 2 14732KiB/s 11406KiB/s 26126KiB/s 142K usec 20KiB/s 446K usec
> > 4 9417KiB/s 5169KiB/s 27338KiB/s 404K usec 10KiB/s 993K usec
> > 8 3360KiB/s 3041KiB/s 25850KiB/s 954K usec 60KiB/s 956K usec
> > 16 1888KiB/s 1457KiB/s 26763KiB/s 1871K usec 28KiB/s 1868K usec
> >
> > Created two cgroups group1 and group2 of weights 500 each. Launched increasing
> > number of sequential readers in group1 and one random reader in group2 using
> > fio.
> >
> > [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> > <---------------group1---------------------------> <------group2--------->
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1 13733KiB/s 13733KiB/s 13733KiB/s 247K usec 330KiB/s 154K usec
> > 2 8553KiB/s 4963KiB/s 13514KiB/s 472K usec 322KiB/s 174K usec
> > 4 5045KiB/s 1367KiB/s 13134KiB/s 947K usec 318KiB/s 178K usec
> > 8 1774KiB/s 1420KiB/s 13035KiB/s 1871K usec 323KiB/s 233K usec
> > 16 959KiB/s 518KiB/s 12691KiB/s 3809K usec 324KiB/s 208K usec
> >
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> >
> > [IO controller CFQ; No groups ]
> > <---------------seq readers----------------------> <------random reader-->
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1 23028KiB/s 23028KiB/s 23028KiB/s 47460 usec 36KiB/s 253K usec
> > 2 14452KiB/s 11176KiB/s 25628KiB/s 145K usec 20KiB/s 447K usec
> > 4 8815KiB/s 5720KiB/s 27121KiB/s 396K usec 10KiB/s 968K usec
> > 8 3335KiB/s 2827KiB/s 24866KiB/s 960K usec 62KiB/s 955K usec
> > 16 1784KiB/s 1311KiB/s 26537KiB/s 1883K usec 26KiB/s 1866K usec
> >
> > Notes:
> > - The BW and latencies of random reader in group 2 seems to be stable and
> > bounded and does not get impacted much as number of sequential readers
> > increase in group1. Hence provding good isolation.
> >
> > - Throughput of sequential readers comes down and latencies go up as half
> > of disk bandwidth (in terms of time) has been reserved for random reader
> > group.
> >
> > Test3: Sequential Reader Vs Sequential Reader
> > ============================================
> > Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> > Launched increasing number of sequential readers in group1 and one sequential
> > reader in group2 using fio and monitored how bandwidth is being distributed
> > between two groups.
> >
> > First 5 columns give stats about job in group1 and last two columns give
> > stats about job in group2.
> >
> > <---------------group1---------------------------> <------group2--------->
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1 8970KiB/s 8970KiB/s 8970KiB/s 230K usec 20681KiB/s 124K usec
> > 2 6783KiB/s 3202KiB/s 9984KiB/s 546K usec 19682KiB/s 139K usec
> > 4 4641KiB/s 1029KiB/s 9280KiB/s 1185K usec 19235KiB/s 172K usec
> > 8 1435KiB/s 1079KiB/s 9926KiB/s 2461K usec 19501KiB/s 153K usec
> > 16 764KiB/s 398KiB/s 9395KiB/s 4986K usec 19367KiB/s 172K usec
> >
> > Note: group2 is getting double the bandwidth of group1 even in the face
> > of increasing number of readers in group1.
> >
> > Test4 (Isolation between two KVM virtual machines)
> > ==================================================
> > Created two KVM virtual machines. Partitioned a disk on host in two partitions
> > and gave one partition to each virtual machine. Put both the virtual machines
> > in two different cgroup of weight 1000 and 500 each. Virtual machines created
> > ext3 file system on the partitions exported from host and did buffered writes.
> > Host seems writes as synchronous and virtual machine with higher weight gets
> > double the disk time of virtual machine of lower weight. Used deadline
> > scheduler in this test case.
> >
> > Some more details about configuration are in documentation patch.
> >
> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky and biggest reason is that async writes
> > are cached in higher layers (page cahe) as well as possibly in file system
> > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > in proportional manner.
> >
> > For example, consider two dd threads reading /dev/zero as input file and doing
> > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > be forced to write out some pages to disk before more pages can be dirtied. But
> > not necessarily dirty pages of same thread are picked. It can very well pick
> > the inode of lesser priority dd thread and do some writeout. So effectively
> > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > service differentation.
> >
> > IOW, the core problem with buffered write fairness is that higher weight thread
> > does not throw enought IO traffic at IO controller to keep the queue
> > continuously backlogged. In my testing, there are many .2 to .8 second
> > intervals where higher weight queue is empty and in that duration lower weight
> > queue get lots of job done giving the impression that there was no service
> > differentiation.
> >
> > In summary, from IO controller point of view async writes support is there.
> > Because page cache has not been designed in such a manner that higher
> > prio/weight writer can do more write out as compared to lower prio/weight
> > writer, gettting service differentiation is hard and it is visible in some
> > cases and not visible in some cases.
>
> Here's where it all falls to pieces.
>
> For async writeback we just don't care about IO priorities. Because
> from the point of view of the userspace task, the write was async! It
> occurred at memory bandwidth speed.
>
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation. And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
>
> So when balance_dirty_pages() hits, what do we want to do?
>
> I suppose that all we can do is to block low-ioprio processes more
> agressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
>
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.

True, that's an issue. For async writes we don't create parallel IO paths
from user space down to the IO scheduler, hence it is hard to provide fairness
in all cases. I think part of the problem is the page cache, and some
serialization also comes from kjournald.

How about coming up with another cgroup controller for buffered writes, or
clubbing it with the memory controller as KAMEZAWA Hiroyuki suggested, and
co-mounting it with the io controller? This should help control buffered
writes per cgroup.
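
To make that concrete, a co-mounted setup might look like the sketch below.
This is illustrative only: the io.weight knob name, paths and weights are
examples, and any per-cgroup dirty limit is hypothetical at this point, so the
memory controller side is shown just for the co-mount itself.

# Co-mount the io and memory controllers so one cgroup covers both.
mount -t cgroup -o io,memory none /cgroup/iomem
mkdir /cgroup/iomem/fast /cgroup/iomem/slow
echo 1000 > /cgroup/iomem/fast/io.weight	# example knob name
echo 500 > /cgroup/iomem/slow/io.weight

# Two buffered writers, one per group, as in the Test5 scenario.
dd if=/dev/zero of=/mnt/test/zerofile1 bs=1M count=4096 &
echo $! > /cgroup/iomem/fast/tasks
dd if=/dev/zero of=/mnt/test/zerofile2 bs=1M count=4096 &
echo $! > /cgroup/iomem/slow/tasks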

>
> Importantly screwed! It's a very common workload pattern, and one
> which causes tremendous amounts of IO to be generated very quickly,
> traditionally causing bad latency effects all over the place. And we
> have no answer to this.
>
> > Vanilla CFQ Vs IO Controller CFQ
> > ================================
> > We have not fundamentally changed CFQ, instead enhanced it to also support
> > hierarchical io scheduling. In the process invariably there are small changes
> > here and there as new scenarios come up. Running some tests here and comparing
> > both the CFQ's to see if there is any major deviation in behavior.
> >
> > Test1: Sequential Readers
> > =========================
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 35499KiB/s 35499KiB/s 35499KiB/s 19195 usec
> > 2 17089KiB/s 13600KiB/s 30690KiB/s 118K usec
> > 4 9165KiB/s 5421KiB/s 29411KiB/s 380K usec
> > 8 3815KiB/s 3423KiB/s 29312KiB/s 830K usec
> > 16 1911KiB/s 1554KiB/s 28921KiB/s 1756K usec
> >
> > IO scheduler: IO controller CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 34494KiB/s 34494KiB/s 34494KiB/s 14482 usec
> > 2 16983KiB/s 13632KiB/s 30616KiB/s 123K usec
> > 4 9237KiB/s 5809KiB/s 29631KiB/s 372K usec
> > 8 3901KiB/s 3505KiB/s 29162KiB/s 822K usec
> > 16 1895KiB/s 1653KiB/s 28945KiB/s 1778K usec
> >
> > Test2: Sequential Writers
> > =========================
> > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 22669KiB/s 22669KiB/s 22669KiB/s 401K usec
> > 2 14760KiB/s 7419KiB/s 22179KiB/s 571K usec
> > 4 5862KiB/s 5746KiB/s 23174KiB/s 444K usec
> > 8 3377KiB/s 2199KiB/s 22427KiB/s 1057K usec
> > 16 2229KiB/s 556KiB/s 20601KiB/s 5099K usec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 22911KiB/s 22911KiB/s 22911KiB/s 37319 usec
> > 2 11752KiB/s 11632KiB/s 23383KiB/s 245K usec
> > 4 6663KiB/s 5409KiB/s 23207KiB/s 384K usec
> > 8 3161KiB/s 2460KiB/s 22566KiB/s 935K usec
> > 16 1888KiB/s 795KiB/s 21349KiB/s 3009K usec
> >
> > Test3: Random Readers
> > =========================
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 484KiB/s 484KiB/s 484KiB/s 22596 usec
> > 2 229KiB/s 196KiB/s 425KiB/s 51111 usec
> > 4 119KiB/s 73KiB/s 405KiB/s 2344 msec
> > 8 93KiB/s 23KiB/s 399KiB/s 2246 msec
> > 16 38KiB/s 8KiB/s 328KiB/s 3965 msec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 483KiB/s 483KiB/s 483KiB/s 29391 usec
> > 2 229KiB/s 196KiB/s 426KiB/s 51625 usec
> > 4 132KiB/s 88KiB/s 417KiB/s 2313 msec
> > 8 79KiB/s 18KiB/s 389KiB/s 2298 msec
> > 16 43KiB/s 9KiB/s 327KiB/s 3905 msec
> >
> > Test4: Random Writers
> > =====================
> > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 14641KiB/s 14641KiB/s 14641KiB/s 93045 usec
> > 2 7896KiB/s 1348KiB/s 9245KiB/s 82778 usec
> > 4 2657KiB/s 265KiB/s 6025KiB/s 216K usec
> > 8 951KiB/s 122KiB/s 3386KiB/s 1148K usec
> > 16 66KiB/s 22KiB/s 829KiB/s 1308 msec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1 14454KiB/s 14454KiB/s 14454KiB/s 74623 usec
> > 2 4595KiB/s 4104KiB/s 8699KiB/s 135K usec
> > 4 3113KiB/s 334KiB/s 5782KiB/s 200K usec
> > 8 1146KiB/s 95KiB/s 3832KiB/s 593K usec
> > 16 71KiB/s 29KiB/s 814KiB/s 1457 msec
> >
> > Notes:
> > - Does not look like that anything has changed significantly.
> >
> > Previous versions of the patches were posted here.
> > ------------------------------------------------
> >
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > (V6) http://lkml.org/lkml/2009/7/2/369
> > (V7) http://lkml.org/lkml/2009/7/24/253
> > (V8) http://lkml.org/lkml/2009/8/16/204
> > (V9) http://lkml.org/lkml/2009/8/28/327
> >
> > Thanks
> > Vivek

2009-09-25 05:29:17

by Balbir Singh

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

* KAMEZAWA Hiroyuki <[email protected]> [2009-09-25 10:18:21]:

> On Fri, 25 Sep 2009 10:09:52 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Thu, 24 Sep 2009 14:33:15 -0700
> > Andrew Morton <[email protected]> wrote:
> > > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > > ===================================================================
> > > > Fairness for async writes is tricky and biggest reason is that async writes
> > > > are cached in higher layers (page cahe) as well as possibly in file system
> > > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > > in proportional manner.
> > > >
> > > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > > service differentation.
> > > >
> > > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > > does not throw enought IO traffic at IO controller to keep the queue
> > > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > > intervals where higher weight queue is empty and in that duration lower weight
> > > > queue get lots of job done giving the impression that there was no service
> > > > differentiation.
> > > >
> > > > In summary, from IO controller point of view async writes support is there.
> > > > Because page cache has not been designed in such a manner that higher
> > > > prio/weight writer can do more write out as compared to lower prio/weight
> > > > writer, gettting service differentiation is hard and it is visible in some
> > > > cases and not visible in some cases.
> > >
> > > Here's where it all falls to pieces.
> > >
> > > For async writeback we just don't care about IO priorities. Because
> > > from the point of view of the userspace task, the write was async! It
> > > occurred at memory bandwidth speed.
> > >
> > > It's only when the kernel's dirty memory thresholds start to get
> > > exceeded that we start to care about prioritisation. And at that time,
> > > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > > consumes just as much memory as a low-ioprio dirty page.
> > >
> > > So when balance_dirty_pages() hits, what do we want to do?
> > >
> > > I suppose that all we can do is to block low-ioprio processes more
> > > agressively at the VFS layer, to reduce the rate at which they're
> > > dirtying memory so as to give high-ioprio processes more of the disk
> > > bandwidth.
> > >
> > > But you've gone and implemented all of this stuff at the io-controller
> > > level and not at the VFS level so you're, umm, screwed.
> > >
> >
> > I think I must support dirty-ratio in memcg layer. But not yet.
>

We need to add this to the TODO list.

> OR...I'll add a bufferred-write-cgroup to track bufferred writebacks.
> And add a control knob as
> bufferred_write.nr_dirty_thresh
> to limit the number of dirty pages generetad via a cgroup.
>
> Because memcg just records a owner of pages but not records who makes them
> dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
> cgroup code.

Very good point, this is crucial for shared pages.
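
If that materializes, usage could be as simple as the sketch below. All of
this is hypothetical (the controller does not exist yet), and BACKUP_PID is
just a placeholder for the job being limited.

# Hypothetical: cap a backup job's dirty pages via the proposed knob.
mount -t cgroup -o bufferred_write none /cgroup/bw
mkdir /cgroup/bw/backup
echo 16384 > /cgroup/bw/backup/bufferred_write.nr_dirty_thresh	# 64MB in 4K pages
echo $BACKUP_PID > /cgroup/bw/backup/tasks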

>
> But I'm not sure how I should treat I/Os generated out by kswapd.
>

Account them to process 0 :)

--
Balbir

2009-09-25 06:26:52

by Gui, Jianfeng/归 剑峰

[permalink] [raw]
Subject: Re: [PATCH 15/28] io-controller: Allow CFQ specific extra preemptions

Vivek Goyal wrote:
> o CFQ allows a reader preemting a writer. So far we allow this with-in group
> but not across groups. But there seems to be following special case where
> this preemption might make sense.
>
> root
> / \
> R Group
> |
> W
>
> Now here reader should be able to preempt the writer. Think of there are
> 10 groups each running a writer and an admin trying to do "ls" and he
> experiences suddenly high latencies for ls.

Hi Vivek,

This preemption might be unfair to readers that stay in the same group as the
writer. Consider the following:

root
/ \
R1 Group
/ \
R2 W

Say W is running and late preemption is enabled. When a request goes into R1,
R1 will preempt W immediately, regardless of R2. Now R2 doesn't get a chance to
be scheduled even if R1 has a very high vdisktime. That seems unfair to R2.
So I suggest that the number of readers in the group be taken into account when
making this preemption decision: R1 should only preempt W when there are no
readers in that group.

Thanks,
Gui Jianfeng

>
> Same is true for meta data requests. If there is a meta data request and
> a reader is running inside a sibling group, preemption will be allowed.
> Note, following is not allowed.
> root
> / \
> group1 group2
> | |
> R W
>
> Here reader can't preempt writer.
>
> o Put meta data requesting queues at the front of the service tree. Generally
> such queues will preempt currently running queue but not in following case.
> root
> / \
> group1 group2
> | / \
> R1 R3 R2 (meta data)
>
> Here R2 is having a meta data request but it will not preempt R1. We need
> to make sure that R2 gets queued ahead of R3 so taht once group2 gets
> going, we first service R2 and then R3 and not vice versa.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/elevator-fq.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
> block/elevator-fq.h | 3 +++
> 2 files changed, 48 insertions(+), 2 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 25beaf7..8ff8a19 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -701,6 +701,7 @@ static void enqueue_io_entity(struct io_entity *entity)
> struct io_service_tree *st;
> struct io_sched_data *sd = io_entity_sched_data(entity);
> struct io_queue *ioq = ioq_of(entity);
> + int add_front = 0;
>
> if (entity->on_idle_st)
> dequeue_io_entity_idle(entity);
> @@ -716,12 +717,22 @@ static void enqueue_io_entity(struct io_entity *entity)
> st = entity->st;
> st->nr_active++;
> sd->nr_active++;
> +
> /* Keep a track of how many sync queues are backlogged on this group */
> if (ioq && elv_ioq_sync(ioq) && !elv_ioq_class_idle(ioq))
> sd->nr_sync++;
> entity->on_st = 1;
> - place_entity(st, entity, 0);
> - __enqueue_io_entity(st, entity, 0);
> +
> + /*
> + * If a meta data request is pending in this queue, put this
> + * queue at the front so that it gets a chance to run first
> + * as soon as the associated group becomes eligbile to run.
> + */
> + if (ioq && ioq->meta_pending)
> + add_front = 1;
> +
> + place_entity(st, entity, add_front);
> + __enqueue_io_entity(st, entity, add_front);
> debug_update_stats_enqueue(entity);
> }
>
> @@ -2280,6 +2291,31 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
> return 1;
>
> /*
> + * Allow some additional preemptions where a reader queue gets
> + * backlogged and some writer queue is running under any of the
> + * sibling groups.
> + *
> + * root
> + * / \
> + * R group
> + * |
> + * W
> + */
> +
> + if (ioq_of(new_entity) == new_ioq && iog_of(entity)) {
> + /* Let reader queue preempt writer in sibling group */
> + if (elv_ioq_sync(new_ioq) && !elv_ioq_sync(active_ioq))
> + return 1;
> + /*
> + * So both queues are sync. Let the new request get disk time if
> + * it's a metadata request and the current queue is doing
> + * regular IO.
> + */
> + if (new_ioq->meta_pending && !active_ioq->meta_pending)
> + return 1;
> + }
> +
> + /*
> * If both the queues belong to same group, check with io scheduler
> * if it has additional criterion based on which it wants to
> * preempt existing queue.
> @@ -2335,6 +2371,8 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
> BUG_ON(!efqd);
> BUG_ON(!ioq);
> ioq->nr_queued++;
> + if (rq_is_meta(rq))
> + ioq->meta_pending++;
> elv_log_ioq(efqd, ioq, "add rq: rq_queued=%d", ioq->nr_queued);
>
> if (!elv_ioq_busy(ioq))
> @@ -2669,6 +2707,11 @@ void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
> ioq = rq->ioq;
> BUG_ON(!ioq);
> ioq->nr_queued--;
> +
> + if (rq_is_meta(rq)) {
> + WARN_ON(!ioq->meta_pending);
> + ioq->meta_pending--;
> + }
> }
>
> /* A request got dispatched. Do the accounting. */
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index 2992d93..27ff5c4 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -100,6 +100,9 @@ struct io_queue {
>
> /* Pointer to io scheduler's queue */
> void *sched_queue;
> +
> + /* pending metadata requests */
> + int meta_pending;
> };
>
> #ifdef CONFIG_GROUP_IOSCHED /* CONFIG_GROUP_IOSCHED */

--

2009-09-25 07:09:12

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi,

Balbir Singh <[email protected]> wrote:
> > > I think I must support dirty-ratio in memcg layer. But not yet.
> >
>
> We need to add this to the TODO list.
>
> > OR...I'll add a bufferred-write-cgroup to track bufferred writebacks.
> > And add a control knob as
> > bufferred_write.nr_dirty_thresh
> > to limit the number of dirty pages generetad via a cgroup.
> >
> > Because memcg just records a owner of pages but not records who makes them
> > dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
> > cgroup code.
>
> Very good point, this is crucial for shared pages.
>
> >
> > But I'm not sure how I should treat I/Os generated out by kswapd.
> >
>
> Account them to process 0 :)

How about accounting them to the processes that made the pages dirty? I think
that a process which consumes more memory should pay a penalty. This does
allow the process that requested the pages to use another's bandwidth, but if
a user doesn't want the memory swapped out, the user should allocate
enough memory for the process via memcg in advance.

Thanks,
Ryo Tsuruta

2009-09-25 09:07:24

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek,

Vivek Goyal <[email protected]> wrote:
> Higher level solutions are not keeping track of time slices. Time slices will
> be allocated by CFQ which does not have any idea about grouping. Higher
> level controller just keeps track of size of IO done at group level and
> then run either a leaky bucket or token bucket algorithm.
>
> IO throttling is a max BW controller, so it will not even care about what is
> happening in other group. It will just be concerned with rate of IO in one
> particular group and if we exceed specified limit, throttle it. So until and
> unless sequential reader group hits it max bw limit, it will keep sending
> reads down to CFQ, and CFQ will happily assign 100ms slices to readers.
>
> dm-ioband will not try to choke the high throughput sequential reader group
> for the slow random reader group because that would just kill the throughput
> of rotational media. Every sequential reader will run for few ms and then
> be throttled and this goes on. Disk will soon be seek bound.

Because dm-ioband provides fairness in terms of how many IO requests
are issued or how many bytes are transferred, this behaviour is to
be expected. Do you think that fairness in terms of IO requests and size is
not fair?

> > > Buffering at higher layer can delay read requests for more than slice idle
> > > period of CFQ (default 8 ms). That means, it is possible that we are waiting
> > > for a request from the queue but it is buffered at higher layer and then idle
> > > timer will fire. It means that queue will losse its share at the same time
> > > overall throughput will be impacted as we lost those 8 ms.
> >
> > That sounds like a bug.
> >
>
> Actually this probably is a limitation of higher level controller. It most
> likely is sitting so high in IO stack that it has no idea what underlying
> IO scheduler is and what are IO scheduler's policies. So it can't keep up
> with IO scheduler's policies. Secondly, it might be a low weight group and
> tokens might not be available fast enough to release the request.
>
> > > Read Vs Write
> > > -------------
> > > Writes can overwhelm readers hence second level controller FIFO release
> > > will run into issue here. If there is a single queue maintained then reads
> > > will suffer large latencies. If there separate queues for reads and writes
> > > then it will be hard to decide in what ratio to dispatch reads and writes as
> > > it is IO scheduler's decision to decide when and how much read/write to
> > > dispatch. This is another place where higher level controller will not be in
> > > sync with lower level io scheduler and can change the effective policies of
> > > underlying io scheduler.
> >
> > The IO schedulers already take care of read-vs-write and already take
> > care of preventing large writes-starve-reads latencies (or at least,
> > they're supposed to).
>
> True. Actually this is a limitation of higher level controller. A higher
> level controller will most likely implement some of kind of queuing/buffering
> mechanism where it will buffer requeuests when it decides to throttle the
> group. Now once a fair number read and requests are buffered, and if
> controller is ready to dispatch some requests from the group, which
> requests/bio should it dispatch? reads first or writes first or reads and
> writes in certain ratio?

The write-starves-reads issue on dm-ioband that you pointed out before was
not caused by FIFO release; it was caused by IO flow control in
dm-ioband. When I turned off the flow control, the read
throughput improved considerably.

Now I'm considering separating dm-ioband's internal queue into sync
and async and giving a certain priority of dispatch to async IOs.

Thanks,
Ryo Tsuruta

2009-09-25 14:35:20

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Sep 25, 2009 at 06:07:24PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <[email protected]> wrote:
> > Higher level solutions are not keeping track of time slices. Time slices will
> > be allocated by CFQ which does not have any idea about grouping. Higher
> > level controller just keeps track of size of IO done at group level and
> > then run either a leaky bucket or token bucket algorithm.
> >
> > IO throttling is a max BW controller, so it will not even care about what is
> > happening in other group. It will just be concerned with rate of IO in one
> > particular group and if we exceed specified limit, throttle it. So until and
> > unless sequential reader group hits it max bw limit, it will keep sending
> > reads down to CFQ, and CFQ will happily assign 100ms slices to readers.
> >
> > dm-ioband will not try to choke the high throughput sequential reader group
> > for the slow random reader group because that would just kill the throughput
> > of rotational media. Every sequential reader will run for few ms and then
> > be throttled and this goes on. Disk will soon be seek bound.
>
> Because dm-ioband provides faireness in terms of how many IO requests
> are issued or how many bytes are transferred, so this behaviour is to
> be expected. Do you think fairness in terms of IO requests and size is
> not fair?
>

Hi Ryo,

Fairness in terms of size of IO or number of requests is probably not the
best thing to do on rotational media where seek latencies are significant.

It should work just fine on media with very low seek latencies,
like SSDs.

So on rotational media, either you will not provide fairness to random
readers because they are too slow, or you will choke the sequential readers
in the other group and also bring down the overall disk throughput.

If you don't choke/throttle the sequential reader group for the sake
of the random reader in the other group, then you will not have good control
over random reader latencies, because now the IO scheduler sees the IO from
both the sequential readers and the random reader, and the sequential readers
have not been throttled. So the dispatch pattern/time slices will again look like..

SR1 SR2 SR3 SR4 SR5 RR.....

instead of

SR1 RR SR2 RR SR3 RR SR4 RR ....

SR --> sequential reader, RR --> random reader

> > > > Buffering at higher layer can delay read requests for more than slice idle
> > > > period of CFQ (default 8 ms). That means, it is possible that we are waiting
> > > > for a request from the queue but it is buffered at higher layer and then idle
> > > > timer will fire. It means that queue will losse its share at the same time
> > > > overall throughput will be impacted as we lost those 8 ms.
> > >
> > > That sounds like a bug.
> > >
> >
> > Actually this probably is a limitation of higher level controller. It most
> > likely is sitting so high in IO stack that it has no idea what underlying
> > IO scheduler is and what are IO scheduler's policies. So it can't keep up
> > with IO scheduler's policies. Secondly, it might be a low weight group and
> > tokens might not be available fast enough to release the request.
> >
> > > > Read Vs Write
> > > > -------------
> > > > Writes can overwhelm readers hence second level controller FIFO release
> > > > will run into issue here. If there is a single queue maintained then reads
> > > > will suffer large latencies. If there separate queues for reads and writes
> > > > then it will be hard to decide in what ratio to dispatch reads and writes as
> > > > it is IO scheduler's decision to decide when and how much read/write to
> > > > dispatch. This is another place where higher level controller will not be in
> > > > sync with lower level io scheduler and can change the effective policies of
> > > > underlying io scheduler.
> > >
> > > The IO schedulers already take care of read-vs-write and already take
> > > care of preventing large writes-starve-reads latencies (or at least,
> > > they're supposed to).
> >
> > True. Actually this is a limitation of higher level controller. A higher
> > level controller will most likely implement some of kind of queuing/buffering
> > mechanism where it will buffer requeuests when it decides to throttle the
> > group. Now once a fair number read and requests are buffered, and if
> > controller is ready to dispatch some requests from the group, which
> > requests/bio should it dispatch? reads first or writes first or reads and
> > writes in certain ratio?
>
> The write-starve-reads on dm-ioband, that you pointed out before, was
> not caused by FIFO release, it was caused by IO flow control in
> dm-ioband. When I turned off the flow control, then the read
> throughput was quite improved.

What was flow control doing?

>
> Now I'm considering separating dm-ioband's internal queue into sync
> and async and giving a certain priority of dispatch to async IOs.

Even if you maintain separate queues for sync and async, in what ratio will
you dispatch reads and writes to the underlying layer once fresh tokens become
available to the group and you decide to unthrottle it?

Whatever policy you adopt for read and write dispatch, it might not match
the policy of the underlying IO scheduler, because every IO scheduler seems to
have its own way of determining how reads and writes should be dispatched.

Now somebody might start complaining that their job inside the group is not
getting the same reader/writer ratio as it was getting outside the group.

Thanks
Vivek

2009-09-25 15:06:55

by Rik van Riel

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Ryo Tsuruta wrote:

> Because dm-ioband provides faireness in terms of how many IO requests
> are issued or how many bytes are transferred, so this behaviour is to
> be expected. Do you think fairness in terms of IO requests and size is
> not fair?

When there are two workloads competing for the same
resources, I would expect each of the workloads to
run at about 50% of the speed at which it would run
on an uncontended system.

Having one of the workloads run at 95% of the
uncontended speed and the other workload at 5%
is "not fair" (to put it diplomatically).

--
All rights reversed.

2009-09-25 20:27:28

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> Vivek Goyal wrote:
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader.
> > Bring down its throughput and bump up latencies significantly.
>
>
> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> too.
>
> I'm basing this assumption on the observations I made on both OpenSuse
> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> titled: "Poor desktop responsiveness with background I/O-operations" of
> 2009-09-20.
> (Message ID: [email protected])
>
>
> Thus, I'm posting this to show that your work is greatly appreciated,
> given the rather disappointig status quo of Linux's fairness when it
> comes to disk IO time.
>
> I hope that your efforts lead to a change in performance of current
> userland applications, the sooner, the better.
>
[Please don't remove people from original CC list. I am putting them back.]

Hi Ulrich,

I quickly went through that mail thread and tried the following on my
desktop.

##########################################
dd if=/home/vgoyal/4G-file of=/dev/null &
sleep 5
time firefox
# close firefox once gui pops up.
##########################################

It was taking close to 1 minute 30 seconds to launch firefox, and dd got the
following.

4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s

(Results do vary across runs, especially if system is booted fresh. Don't
know why...).


Then I tried putting both the applications in separate groups and assign
them weights 200 each.

##########################################
dd if=/home/vgoyal/4G-file of=/dev/null &
echo $! > /cgroup/io/test1/tasks
sleep 5
echo $$ > /cgroup/io/test2/tasks
time firefox
# close firefox once gui pops up.
##########################################

Now firefox pops up in 27 seconds. So it cut down the time by 2/3.

4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s

Notice that throughput of dd also improved.

I ran the block trace and noticed that in many cases firefox threads
immediately preempted the "dd", probably because it was a file system
request. In those cases the latency comes from seek time.

In some other cases, threads had to wait for up to 100ms because dd was
not preempted. In those cases the latency comes both from waiting in the
queue and from seek time.

With the cgroup setup, we will run a 100ms slice for the group in which firefox
is being launched and then give a 100ms uninterrupted time slice to dd. So
it should cut down on the number of seeks happening, and that's probably why
we see this improvement.

So grouping can help in such cases. Maybe you can move your X session into
one group and launch the big IO in another group. Most likely you will
have a better desktop experience without compromising on dd throughput.

Thanks
Vivek

2009-09-26 14:51:25

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-09-25 at 16:26 -0400, Vivek Goyal wrote:
> On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> > Vivek Goyal wrote:
> > > Notes:
> > > - With vanilla CFQ, random writers can overwhelm a random reader.
> > > Bring down its throughput and bump up latencies significantly.
> >
> >
> > IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> > too.
> >
> > I'm basing this assumption on the observations I made on both OpenSuse
> > 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> > titled: "Poor desktop responsiveness with background I/O-operations" of
> > 2009-09-20.
> > (Message ID: [email protected])
> >
> >
> > Thus, I'm posting this to show that your work is greatly appreciated,
> > given the rather disappointig status quo of Linux's fairness when it
> > comes to disk IO time.
> >
> > I hope that your efforts lead to a change in performance of current
> > userland applications, the sooner, the better.
> >
> [Please don't remove people from original CC list. I am putting them back.]
>
> Hi Ulrich,
>
> I quicky went through that mail thread and I tried following on my
> desktop.
>
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> sleep 5
> time firefox
> # close firefox once gui pops up.
> ##########################################
>
> It was taking close to 1 minute 30 seconds to launch firefox and dd got
> following.
>
> 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>
> (Results do vary across runs, especially if system is booted fresh. Don't
> know why...).
>
>
> Then I tried putting both the applications in separate groups and assign
> them weights 200 each.
>
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> echo $! > /cgroup/io/test1/tasks
> sleep 5
> echo $$ > /cgroup/io/test2/tasks
> time firefox
> # close firefox once gui pops up.
> ##########################################
>
> Now I firefox pops up in 27 seconds. So it cut down the time by 2/3.
>
> 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>
> Notice that throughput of dd also improved.
>
> I ran the block trace and noticed in many a cases firefox threads
> immediately preempted the "dd". Probably because it was a file system
> request. So in this case latency will arise from seek time.
>
> In some other cases, threads had to wait for up to 100ms because dd was
> not preempted. In this case latency will arise both from waiting on queue
> as well as seek time.

Hm, with tip, I see ~10ms max wakeup latency running scriptlet below.

> With cgroup thing, We will run 100ms slice for the group in which firefox
> is being launched and then give 100ms uninterrupted time slice to dd. So
> it should cut down on number of seeks happening and that's why we probably
> see this improvement.

I'm not testing with group IO/CPU, but my numbers kinda agree that it's
seek latency that's THE killer. What the numbers compiled below from the
cheezy script _seem_ to be telling me is that the default setting of CFQ
quantum is allowing too many write requests through, inflicting too much
read latency... for the disk where my binaries live.
The longer the seeky burst, the more it hurts both reader and writer, so
cutting down the max number of queueable requests helps the reader (which I
think can't queue anywhere near as much per unit time as the writer can)
finish and get out of the writer's way sooner.

'nuff possibly useless words, onward to possibly useless numbers :)

dd pre == number dd emits upon receiving USR1 before execing perf.
perf stat == time to load/execute perf stat konsole -e exit.
dd post == same number from dd, taken after perf finishes.

quantum = 1 Avg
dd pre 58.4 52.5 56.1 61.6 52.3 56.1 MB/s
perf stat 2.87 0.91 1.64 1.41 0.90 1.5 Sec
dd post 56.6 61.0 66.3 64.7 60.9 61.9

quantum = 2
dd pre 59.7 62.4 58.9 65.3 60.3 61.3
perf stat 5.81 6.09 6.24 10.13 6.21 6.8
dd post 64.0 62.6 64.2 60.4 61.1 62.4

quantum = 3
dd pre 65.5 57.7 54.5 51.1 56.3 57.0
perf stat 14.01 13.71 8.35 5.35 8.57 9.9
dd post 59.2 49.1 58.8 62.3 62.1 58.3

quantum = 4
dd pre 57.2 52.1 56.8 55.2 61.6 56.5
perf stat 11.98 1.61 9.63 16.21 11.13 10.1
dd post 57.2 52.6 62.2 49.3 50.2 54.3

Nothing pinned btw, 4 cores available, but only 1 drive.

#!/bin/sh

DISK=sdb
QUANTUM=/sys/block/$DISK/queue/iosched/quantum
END=$(cat $QUANTUM)

# Sweep CFQ's quantum from 1 up to its current value; for each setting,
# measure konsole load time (perf stat) while a streaming dd write runs.
for q in `seq 1 $END`; do
	echo $q > $QUANTUM
	LOGFILE=quantum_log_$q
	rm -f $LOGFILE
	for i in `seq 1 5`; do
		echo 2 > /proc/sys/vm/drop_caches
		sh -c "dd if=/dev/zero of=./deleteme.dd 2>&1|tee -a $LOGFILE" &
		sleep 30
		sh -c "echo quantum $(cat $QUANTUM) loop $i" 2>&1|tee -a $LOGFILE
		perf stat -- killlall -q get_stuf_into_ram >/dev/null 2>&1
		sleep 1
		# dd prints its throughput so far on USR1 (the "dd pre" number).
		killall -q -USR1 dd &
		sleep 1
		# Time to load/execute konsole (the "perf stat" number).
		sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE
		sleep 1
		# Throughput again after perf finishes (the "dd post" number).
		killall -q -USR1 dd &
		sleep 5
		killall -qw dd
		rm -f ./deleteme.dd
		sync
		sh -c "echo" 2>&1|tee -a $LOGFILE
	done;
done;

2009-09-27 06:55:10

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

My dd vs load non-cached binary woes seem to be coming from backmerge.

#if 0 /*MIKEDIDIT sand in gearbox?*/
/*
* See if our hash lookup can find a potential backmerge.
*/
__rq = elv_rqhash_find(q, bio->bi_sector);
if (__rq && elv_rq_merge_ok(__rq, bio)) {
*req = __rq;
return ELEVATOR_BACK_MERGE;
}
#endif

- = stock = 0
+ = /sys/block/sdb/queue/nomerges = 1
x = backmerge disabled

quantum = 1 Avg
dd pre 58.4 52.5 56.1 61.6 52.3 56.1- MB/s virgin/foo
59.6 54.4 53.0 56.1 58.6 56.3+ 1.003
53.8 56.6 54.7 50.7 59.3 55.0x .980
perf stat 2.87 0.91 1.64 1.41 0.90 1.5- Sec
2.61 1.14 1.45 1.43 1.47 1.6+ 1.066
1.07 1.19 1.20 1.24 1.37 1.2x .800
dd post 56.6 61.0 66.3 64.7 60.9 61.9-
54.0 59.3 61.1 58.3 58.9 58.3+ .941
54.3 60.2 59.6 60.6 60.3 59.0x .953

quantum = 2
dd pre 59.7 62.4 58.9 65.3 60.3 61.3-
49.4 51.9 58.7 49.3 52.4 52.3+ .853
58.3 52.8 53.1 50.4 59.9 54.9x .895
perf stat 5.81 6.09 6.24 10.13 6.21 6.8-
2.48 2.10 3.23 2.29 2.31 2.4+ .352
2.09 2.73 1.72 1.96 1.83 2.0x .294
dd post 64.0 62.6 64.2 60.4 61.1 62.4-
52.9 56.2 49.6 51.3 51.2 52.2+ .836
54.7 60.9 56.0 54.0 55.4 56.2x .900

quantum = 3
dd pre 65.5 57.7 54.5 51.1 56.3 57.0-
58.1 53.9 52.2 58.2 51.8 54.8+ .961
60.5 56.5 56.7 55.3 54.6 56.7x .994
perf stat 14.01 13.71 8.35 5.35 8.57 9.9-
1.84 2.30 2.14 2.10 2.45 2.1+ .212
2.12 1.63 2.54 2.23 2.29 2.1x .212
dd post 59.2 49.1 58.8 62.3 62.1 58.3-
59.8 53.2 55.2 50.9 53.7 54.5+ .934
56.1 61.9 51.9 54.3 53.1 55.4x .950

quantum = 4
dd pre 57.2 52.1 56.8 55.2 61.6 56.5-
48.7 55.4 51.3 49.7 54.5 51.9+ .918
55.8 54.5 50.3 56.4 49.3 53.2x .941
perf stat 11.98 1.61 9.63 16.21 11.13 10.1-
2.29 1.94 2.68 2.46 2.45 2.3+ .227
3.01 1.84 2.11 2.27 2.30 2.3x .227
dd post 57.2 52.6 62.2 49.3 50.2 54.3-
50.1 54.5 58.4 54.1 49.0 53.2+ .979
52.9 53.2 50.6 53.2 50.5 52.0x .957

2009-09-27 16:42:34

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sun, Sep 27 2009, Mike Galbraith wrote:
> My dd vs load non-cached binary woes seem to be coming from backmerge.
>
> #if 0 /*MIKEDIDIT sand in gearbox?*/
> /*
> * See if our hash lookup can find a potential backmerge.
> */
> __rq = elv_rqhash_find(q, bio->bi_sector);
> if (__rq && elv_rq_merge_ok(__rq, bio)) {
> *req = __rq;
> return ELEVATOR_BACK_MERGE;
> }
> #endif

It's a given that not merging will provide better latency. We can't
disable that or performance will suffer A LOT on some systems. There are
ways to make it better, though. One would be to make the max request
size smaller, but that would also hurt for streamed workloads. Can you
try whether the below patch makes a difference? It will basically
disallow merges to a request that isn't the last one.

We should probably make the merging logic a bit more clever, since the
below won't work well for two (or more) streamed cases. I'll think a bit
about that.

Note this is totally untested!

diff --git a/block/elevator.c b/block/elevator.c
index 1975b61..d00a72b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
* See if our hash lookup can find a potential backmerge.
*/
__rq = elv_rqhash_find(q, bio->bi_sector);
- if (__rq && elv_rq_merge_ok(__rq, bio)) {
- *req = __rq;
- return ELEVATOR_BACK_MERGE;
+ if (__rq) {
+ /*
+ * If requests are queued behind this one, disallow merge. This
+ * prevents streaming IO from continually passing new IO.
+ */
+ if (elv_latter_request(q, __rq))
+ return ELEVATOR_NO_MERGE;
+ if (elv_rq_merge_ok(__rq, bio)) {
+ *req = __rq;
+ return ELEVATOR_BACK_MERGE;
+ }
}

if (e->ops->elevator_merge_fn)

--
Jens Axboe

2009-09-27 17:00:07

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek,
On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <[email protected]> wrote:
> On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> Vivek Goyal wrote:
>> > Notes:
>> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >   Bring down its throughput and bump up latencies significantly.
>>
>>
>> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> too.
>>
>> I'm basing this assumption on the observations I made on both OpenSuse
>> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> titled: "Poor desktop responsiveness with background I/O-operations" of
>> 2009-09-20.
>> (Message ID: [email protected])
>>
>>
>> Thus, I'm posting this to show that your work is greatly appreciated,
>> given the rather disappointig status quo of Linux's fairness when it
>> comes to disk IO time.
>>
>> I hope that your efforts lead to a change in performance of current
>> userland applications, the sooner, the better.
>>
> [Please don't remove people from original CC list. I am putting them back.]
>
> Hi Ulrich,
>
> I quicky went through that mail thread and I tried following on my
> desktop.
>
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> sleep 5
> time firefox
> # close firefox once gui pops up.
> ##########################################
>
> It was taking close to 1 minute 30 seconds to launch firefox and dd got
> following.
>
> 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>
> (Results do vary across runs, especially if system is booted fresh. Don't
>  know why...).
>
>
> Then I tried putting both the applications in separate groups and assign
> them weights 200 each.
>
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> echo $! > /cgroup/io/test1/tasks
> sleep 5
> echo $$ > /cgroup/io/test2/tasks
> time firefox
> # close firefox once gui pops up.
> ##########################################
>
> Now I firefox pops up in 27 seconds. So it cut down the time by 2/3.
>
> 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>
> Notice that throughput of dd also improved.
>
> I ran the block trace and noticed in many a cases firefox threads
> immediately preempted the "dd". Probably because it was a file system
> request. So in this case latency will arise from seek time.
>
> In some other cases, threads had to wait for up to 100ms because dd was
> not preempted. In this case latency will arise both from waiting on queue
> as well as seek time.

I think cfq should already be doing something similar, i.e. giving
100ms slices to firefox, that alternate with dd, unless:
* firefox is too seeky (in this case, the idle window will be too small)
* firefox has too much think time.

To rule out the first case, what happens if you run the test with your
"fairness for seeky processes" patch?
To rule out the second case, what happens if you increase the slice_idle?
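
For the second case, the knob lives with the other cfq tunables; sdb is just
an example device here:

cat /sys/block/sdb/queue/scheduler			# check that cfq is the active scheduler
cat /sys/block/sdb/queue/iosched/slice_idle		# default is 8
echo 12 > /sys/block/sdb/queue/iosched/slice_idle	# try a larger idle window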

Thanks,
Corrado

>
> With cgroup thing, We will run 100ms slice for the group in which firefox
> is being launched and then give 100ms uninterrupted time slice to dd. So
> it should cut down on number of seeks happening and that's why we probably
> see this improvement.
>
> So grouping can help in such cases. May be you can move your X session in
> one group and launch the big IO in other group. Most likely you should
> have better desktop experience without compromising on dd thread output.

> Thanks
> Vivek



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-09-27 18:16:08

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:
> On Sun, Sep 27 2009, Mike Galbraith wrote:
> > My dd vs load non-cached binary woes seem to be coming from backmerge.
> >
> > #if 0 /*MIKEDIDIT sand in gearbox?*/
> > /*
> > * See if our hash lookup can find a potential backmerge.
> > */
> > __rq = elv_rqhash_find(q, bio->bi_sector);
> > if (__rq && elv_rq_merge_ok(__rq, bio)) {
> > *req = __rq;
> > return ELEVATOR_BACK_MERGE;
> > }
> > #endif
>
> It's a given that not merging will provide better latency.

Yeah, absolutely everything I've diddled that reduces the size of queued
data improves the situation, which makes perfect sense. This one was a
bit unexpected. Front merges didn't hurt at all, back merges did, and
lots. After diddling the code a bit, I had the "well _duh_" moment.

> We can't
> disable that or performance will suffer A LOT on some systems. There are
> ways to make it better, though. One would be to make the max request
> size smaller, but that would also hurt for streamed workloads. Can you
> try whether the below patch makes a difference? It will basically
> disallow merges to a request that isn't the last one.

That's what all the looking I've done ends up at. Either you let the
disk be all it can be, and you pay in latency, or you don't, and you pay
in throughput.

> below wont work well for two (or more) streamed cases. I'll think a bit
> about that.

Cool, think away. I've been eyeballing and pondering how to know when
latency is going to become paramount. Absolutely nothing is happening,
even for "it's my root".

> Note this is totally untested!

I'll give it a shot first thing in the A.M.

Note: I tested my stable of kernels today (22->), and we are better off
dd vs read today than ever in this time span at least.

(i can't recall ever seeing a system where beating snot outta root
didn't hurt really bad... would be very nice though;)

> diff --git a/block/elevator.c b/block/elevator.c
> index 1975b61..d00a72b 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
> * See if our hash lookup can find a potential backmerge.
> */
> __rq = elv_rqhash_find(q, bio->bi_sector);
> - if (__rq && elv_rq_merge_ok(__rq, bio)) {
> - *req = __rq;
> - return ELEVATOR_BACK_MERGE;
> + if (__rq) {
> + /*
> + * If requests are queued behind this one, disallow merge. This
> + * prevents streaming IO from continually passing new IO.
> + */
> + if (elv_latter_request(q, __rq))
> + return ELEVATOR_NO_MERGE;
> + if (elv_rq_merge_ok(__rq, bio)) {
> + *req = __rq;
> + return ELEVATOR_BACK_MERGE;
> + }
> }
>
> if (e->ops->elevator_merge_fn)
>

2009-09-28 04:04:13

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sun, 2009-09-27 at 20:16 +0200, Mike Galbraith wrote:
> On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:

> I'll give it a shot first thing in the A.M.

> > diff --git a/block/elevator.c b/block/elevator.c
> > index 1975b61..d00a72b 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
> > * See if our hash lookup can find a potential backmerge.
> > */
> > __rq = elv_rqhash_find(q, bio->bi_sector);
> > - if (__rq && elv_rq_merge_ok(__rq, bio)) {
> > - *req = __rq;
> > - return ELEVATOR_BACK_MERGE;
> > + if (__rq) {
> > + /*
> > + * If requests are queued behind this one, disallow merge. This
> > + * prevents streaming IO from continually passing new IO.
> > + */
> > + if (elv_latter_request(q, __rq))
> > + return ELEVATOR_NO_MERGE;
> > + if (elv_rq_merge_ok(__rq, bio)) {
> > + *req = __rq;
> > + return ELEVATOR_BACK_MERGE;
> > + }
> > }
> >
> > if (e->ops->elevator_merge_fn)

- = virgin tip v2.6.31-10215-ga3c9602
+ = with patchlet
Avg
dd pre 67.4 70.9 65.4 68.9 66.2 67.7-
65.9 68.5 69.8 65.2 65.8 67.0- Avg
70.4 70.3 65.1 66.4 70.1 68.4- 67.7-
73.1 64.6 65.3 65.3 64.9 66.6+ 65.6+ .968
63.8 67.9 65.2 65.1 64.4 65.2+
64.9 66.3 64.1 65.2 64.8 65.0+
perf stat 8.66 16.29 9.65 14.88 9.45 11.7-
15.36 9.71 15.47 10.44 12.93 12.7-
10.55 15.11 10.22 15.35 10.32 12.3- 12.2-
9.87 7.53 10.62 7.51 9.95 9.0+ 9.1+ .745
7.73 10.12 8.19 11.87 8.07 9.1+
11.04 7.62 10.14 8.13 10.23 9.4+
dd post 63.4 60.5 66.7 64.5 67.3 64.4-
64.4 66.8 64.3 61.5 62.0 63.8-
63.8 64.9 66.2 65.6 66.9 65.4- 64.5-
60.9 63.4 60.2 63.4 65.5 62.6+ 61.8+ .958
63.3 59.9 61.9 62.7 61.2 61.8+
60.1 63.7 59.5 61.5 60.6 61.0+

2009-09-28 05:55:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

P.S.

On Mon, 2009-09-28 at 06:04 +0200, Mike Galbraith wrote:

> - = virgin tip v2.6.31-10215-ga3c9602
> + = with patchlet
> Avg
> dd pre 67.4 70.9 65.4 68.9 66.2 67.7-
> 65.9 68.5 69.8 65.2 65.8 67.0- Avg
> 70.4 70.3 65.1 66.4 70.1 68.4- 67.7-
> 73.1 64.6 65.3 65.3 64.9 66.6+ 65.6+ .968
> 63.8 67.9 65.2 65.1 64.4 65.2+
> 64.9 66.3 64.1 65.2 64.8 65.0+
> perf stat 8.66 16.29 9.65 14.88 9.45 11.7-
> 15.36 9.71 15.47 10.44 12.93 12.7-
> 10.55 15.11 10.22 15.35 10.32 12.3- 12.2-
> 9.87 7.53 10.62 7.51 9.95 9.0+ 9.1+ .745
> 7.73 10.12 8.19 11.87 8.07 9.1+
> 11.04 7.62 10.14 8.13 10.23 9.4+
> dd post 63.4 60.5 66.7 64.5 67.3 64.4-
> 64.4 66.8 64.3 61.5 62.0 63.8-
> 63.8 64.9 66.2 65.6 66.9 65.4- 64.5-
> 60.9 63.4 60.2 63.4 65.5 62.6+ 61.8+ .958
> 63.3 59.9 61.9 62.7 61.2 61.8+
> 60.1 63.7 59.5 61.5 60.6 61.0+

Deadline and noop fsc^W are less than wonderful choices for this load.

perf stat 12.82 7.19 8.49 5.76 9.32 anticipatory
16.24 175.82 154.38 228.97 147.16 noop
43.23 57.39 96.13 148.25 180.09 deadline
28.65 167.40 195.95 183.69 178.61 deadline v2.6.27.35

-Mike

2009-09-28 07:30:49

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek,

Vivek Goyal <[email protected]> wrote:
> > Because dm-ioband provides faireness in terms of how many IO requests
> > are issued or how many bytes are transferred, so this behaviour is to
> > be expected. Do you think fairness in terms of IO requests and size is
> > not fair?
> >
>
> Hi Ryo,
>
> Fairness in terms of size of IO or number of requests is probably not the
> best thing to do on rotational media where seek latencies are significant.
>
> It probably should work just well on media with very low seek latencies
> like SSD.
>
> So on rotational media, either you will not provide fairness to random
> readers because they are too slow or you will choke the sequential readers
> in other group and also bring down the overall disk throughput.
>
> If you don't decide to choke/throttle sequential reader group for the sake
> of random reader in other group then you will not have a good control
> on random reader latencies. Because now IO scheduler sees the IO from both
> sequential reader as well as random reader and sequential readers have not
> been throttled. So the dispatch pattern/time slices will again look like..
>
> SR1 SR2 SR3 SR4 SR5 RR.....
>
> instead of
>
> SR1 RR SR2 RR SR3 RR SR4 RR ....
>
> SR --> sequential reader, RR --> random reader

Thank you for elaborating. However, I think that fairness in terms of
disk time has a similar problem. Below is a benchmark result of
randread vs seqread that I posted before; the rand-readers and seq-readers
ran in separate groups and their weights were equal.

                 Throughput [KiB/s]
            io-controller    dm-ioband
randread              161          314
seqread              9556          631

I know that dm-ioband needs improvement on the seqread throughput, but
io-controller does not seem quite fair either: even though the disk time
given to each group is equal, why can't randread get more bandwidth? I
think this is how users tend to think about fairness, so it would be good
to provide multiple bandwidth control policies for users to choose from.

> > The write-starve-reads on dm-ioband, that you pointed out before, was
> > not caused by FIFO release, it was caused by IO flow control in
> > dm-ioband. When I turned off the flow control, then the read
> > throughput was quite improved.
>
> What was flow control doing?

dm-ioband puts a limit on each IO group. When the number of IO
requests backlogged in a group exceeds the limit, processes which are
about to issue IO requests to the group are made to sleep until all the
backlogged requests are flushed out.

> > Now I'm considering separating dm-ioband's internal queue into sync
> > and async and giving a certain priority of dispatch to async IOs.
>
> Even if you maintain separate queues for sync and async, in what ratio will
> you dispatch reads and writes to underlying layer once fresh tokens become
> available to the group and you decide to unthrottle the group.

Now I'm thinking that it's according to the requested order, but
when the number of in-flight sync IOs exceeds io_limit (io_limit is
calculated from the nr_requests of the underlying block device), dm-ioband
dispatches only async IOs until the number of in-flight sync IOs is
below the io_limit, and vice versa. At least it could solve the
write-starves-read issue which you pointed out.

> Whatever policy you adopt for read and write dispatch, it might not match
> with policy of underlying IO scheduler because every IO scheduler seems to
> have its own way of determining how reads and writes should be dispatched.

I think this is a matter of the user's choice: whether the user wants to
give priority to bandwidth fairness or to the IO scheduler's policy.

> Now somebody might start complaining that my job inside the group is not
> getting same reader/writer ratio as it was getting outside the group.
>
> Thanks
> Vivek

Thanks,
Ryo Tsuruta

2009-09-28 07:38:09

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Rik,

Rik van Riel <[email protected]> wrote:
> Ryo Tsuruta wrote:
>
> > Because dm-ioband provides faireness in terms of how many IO requests
> > are issued or how many bytes are transferred, so this behaviour is to
> > be expected. Do you think fairness in terms of IO requests and size is
> > not fair?
>
> When there are two workloads competing for the same
> resources, I would expect each of the workloads to
> run at about 50% of the speed at which it would run
> on an uncontended system.
>
> Having one of the workloads run at 95% of the
> uncontended speed and the other workload at 5%
> is "not fair" (to put it diplomatically).

As I wrote in the mail to Vivek, I think that providing multiple
policies, such as per disk time, per IO size, maximum rate limiting and
so on, would be good for users.

Thanks,
Ryo Tsuruta

2009-09-28 14:59:17

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
> Hi Vivek,
> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <[email protected]> wrote:
> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> >> Vivek Goyal wrote:
> >> > Notes:
> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
> >> >   Bring down its throughput and bump up latencies significantly.
> >>
> >>
> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> >> too.
> >>
> >> I'm basing this assumption on the observations I made on both OpenSuse
> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> >> titled: "Poor desktop responsiveness with background I/O-operations" of
> >> 2009-09-20.
> >> (Message ID: [email protected])
> >>
> >>
> >> Thus, I'm posting this to show that your work is greatly appreciated,
> >> given the rather disappointig status quo of Linux's fairness when it
> >> comes to disk IO time.
> >>
> >> I hope that your efforts lead to a change in performance of current
> >> userland applications, the sooner, the better.
> >>
> > [Please don't remove people from original CC list. I am putting them back.]
> >
> > Hi Ulrich,
> >
> > I quicky went through that mail thread and I tried following on my
> > desktop.
> >
> > ##########################################
> > dd if=/home/vgoyal/4G-file of=/dev/null &
> > sleep 5
> > time firefox
> > # close firefox once gui pops up.
> > ##########################################
> >
> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
> > following.
> >
> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
> >
> > (Results do vary across runs, especially if system is booted fresh. Don't
> >  know why...).
> >
> >
> > Then I tried putting both the applications in separate groups and assign
> > them weights 200 each.
> >
> > ##########################################
> > dd if=/home/vgoyal/4G-file of=/dev/null &
> > echo $! > /cgroup/io/test1/tasks
> > sleep 5
> > echo $$ > /cgroup/io/test2/tasks
> > time firefox
> > # close firefox once gui pops up.
> > ##########################################
> >
> > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3.
> >
> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
> >
> > Notice that throughput of dd also improved.
> >
> > I ran the block trace and noticed in many a cases firefox threads
> > immediately preempted the "dd". Probably because it was a file system
> > request. So in this case latency will arise from seek time.
> >
> > In some other cases, threads had to wait for up to 100ms because dd was
> > not preempted. In this case latency will arise both from waiting on queue
> > as well as seek time.
>
> I think cfq should already be doing something similar, i.e. giving
> 100ms slices to firefox, that alternate with dd, unless:
> * firefox is too seeky (in this case, the idle window will be too small)
> * firefox has too much think time.
>

Hi Corrado,

"firefox" is the shell script to setup the environment and launch the
broser. It seems to be a group of threads. Some of them run in parallel
and some of these seems to be running one after the other (once previous
process or threads finished).


> To rule out the first case, what happens if you run the test with your
> "fairness for seeky processes" patch?

I applied that patch and it helps a lot.

http://lwn.net/Articles/341032/

With the above patchset applied, and fairness=1, firefox pops up in
27-28 seconds.

So it looks like not disabling the idle window for seeky processes on
hardware that supports command queuing helps in this particular case.

Thanks
Vivek



> To rule out the second case, what happens if you increase the slice_idle?
>
> Thanks,
> Corrado
>
> >
> > With cgroup thing, We will run 100ms slice for the group in which firefox
> > is being launched and then give 100ms uninterrupted time slice to dd. So
> > it should cut down on number of seeks happening and that's why we probably
> > see this improvement.
> >
> > So grouping can help in such cases. May be you can move your X session in
> > one group and launch the big IO in other group. Most likely you should
> > have better desktop experience without compromising on dd thread output.
>
> > Thanks
> > Vivek
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
>
>
>
> --
> __________________________________________________________________________
>
> dott. Corrado Zoccolo mailto:[email protected]
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------

2009-09-28 15:35:00

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <[email protected]> wrote:
> On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
>> Hi Vivek,
>> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <[email protected]> wrote:
>> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> >> Vivek Goyal wrote:
>> >> > Notes:
>> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >> >   Bring down its throughput and bump up latencies significantly.
>> >>
>> >>
>> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> >> too.
>> >>
>> >> I'm basing this assumption on the observations I made on both OpenSuse
>> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> >> titled: "Poor desktop responsiveness with background I/O-operations" of
>> >> 2009-09-20.
>> >> (Message ID: [email protected])
>> >>
>> >>
>> >> Thus, I'm posting this to show that your work is greatly appreciated,
>> >> given the rather disappointig status quo of Linux's fairness when it
>> >> comes to disk IO time.
>> >>
>> >> I hope that your efforts lead to a change in performance of current
>> >> userland applications, the sooner, the better.
>> >>
>> > [Please don't remove people from original CC list. I am putting them back.]
>> >
>> > Hi Ulrich,
>> >
>> > I quicky went through that mail thread and I tried following on my
>> > desktop.
>> >
>> > ##########################################
>> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> > sleep 5
>> > time firefox
>> > # close firefox once gui pops up.
>> > ##########################################
>> >
>> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
>> > following.
>> >
>> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>> >
>> > (Results do vary across runs, especially if system is booted fresh. Don't
>> >  know why...).
>> >
>> >
>> > Then I tried putting both the applications in separate groups and assign
>> > them weights 200 each.
>> >
>> > ##########################################
>> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> > echo $! > /cgroup/io/test1/tasks
>> > sleep 5
>> > echo $$ > /cgroup/io/test2/tasks
>> > time firefox
>> > # close firefox once gui pops up.
>> > ##########################################
>> >
>> > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3.
>> >
>> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>> >
>> > Notice that throughput of dd also improved.
>> >
>> > I ran the block trace and noticed in many a cases firefox threads
>> > immediately preempted the "dd". Probably because it was a file system
>> > request. So in this case latency will arise from seek time.
>> >
>> > In some other cases, threads had to wait for up to 100ms because dd was
>> > not preempted. In this case latency will arise both from waiting on queue
>> > as well as seek time.
>>
>> I think cfq should already be doing something similar, i.e. giving
>> 100ms slices to firefox, that alternate with dd, unless:
>> * firefox is too seeky (in this case, the idle window will be too small)
>> * firefox has too much think time.
>>
>
Hi Vivek,
> Hi Corrado,
>
> "firefox" is the shell script to setup the environment and launch the
> broser. It seems to be a group of threads. Some of them run in parallel
> and some of these seems to be running one after the other (once previous
> process or threads finished).

Ok.

>
>> To rule out the first case, what happens if you run the test with your
>> "fairness for seeky processes" patch?
>
> I applied that patch and it helps a lot.
>
> http://lwn.net/Articles/341032/
>
> With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.

Great.
Can you try the attached patch (on top of 2.6.31)?
It implements the alternative approach we discussed privately in July,
and it addresses the possible latency increase that could happen with
your patch.

To summarize for everyone, we separate sync sequential queues, sync
seeky queues and async queues into three separate RR structures, and
alternate servicing requests between them.

When servicing seeky queues (the ones that are usually penalized by
cfq, for which no fairness is usually provided), we do not idle
between them, but we do idle for the last queue (the idle can be
exited when any seeky queue has requests). This allows us to allocate
disk time globally for all seeky processes, and to reduce seeky
processes' latencies.

I tested with 'konsole -e exit', while doing a sequential write with
dd, and the start up time reduced from 37s to 7s, on an old laptop
disk.

Thanks,
Corrado

>
>> To rule out the first case, what happens if you run the test with your
>> "fairness for seeky processes" patch?
>
> I applied that patch and it helps a lot.
>
> http://lwn.net/Articles/341032/
>
> With above patchset applied, and fairness=1, firefox pops up in 27-28
> seconds.
>
> So it looks like if we don't disable idle window for seeky processes on
> hardware supporting command queuing, it helps in this particular case.
>
> Thanks
> Vivek
>


Attachments:
cfq.patch (23.65 kB)

2009-09-28 17:16:10

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, Sep 28, 2009 at 05:35:02PM +0200, Corrado Zoccolo wrote:
> On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <[email protected]> wrote:
> > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <[email protected]> wrote:
> >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> >> >> Vivek Goyal wrote:
> >> >> > Notes:
> >> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
> >> >> >   Bring down its throughput and bump up latencies significantly.
> >> >>
> >> >>
> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> >> >> too.
> >> >>
> >> >> I'm basing this assumption on the observations I made on both OpenSuse
> >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> >> >> titled: "Poor desktop responsiveness with background I/O-operations" of
> >> >> 2009-09-20.
> >> >> (Message ID: [email protected])
> >> >>
> >> >>
> >> >> Thus, I'm posting this to show that your work is greatly appreciated,
> >> >> given the rather disappointig status quo of Linux's fairness when it
> >> >> comes to disk IO time.
> >> >>
> >> >> I hope that your efforts lead to a change in performance of current
> >> >> userland applications, the sooner, the better.
> >> >>
> >> > [Please don't remove people from original CC list. I am putting them back.]
> >> >
> >> > Hi Ulrich,
> >> >
> >> > I quicky went through that mail thread and I tried following on my
> >> > desktop.
> >> >
> >> > ##########################################
> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
> >> > sleep 5
> >> > time firefox
> >> > # close firefox once gui pops up.
> >> > ##########################################
> >> >
> >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
> >> > following.
> >> >
> >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
> >> >
> >> > (Results do vary across runs, especially if system is booted fresh. Don't
> >> >  know why...).
> >> >
> >> >
> >> > Then I tried putting both the applications in separate groups and assign
> >> > them weights 200 each.
> >> >
> >> > ##########################################
> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
> >> > echo $! > /cgroup/io/test1/tasks
> >> > sleep 5
> >> > echo $$ > /cgroup/io/test2/tasks
> >> > time firefox
> >> > # close firefox once gui pops up.
> >> > ##########################################
> >> >
> >> > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3.
> >> >
> >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
> >> >
> >> > Notice that throughput of dd also improved.
> >> >
> >> > I ran the block trace and noticed in many a cases firefox threads
> >> > immediately preempted the "dd". Probably because it was a file system
> >> > request. So in this case latency will arise from seek time.
> >> >
> >> > In some other cases, threads had to wait for up to 100ms because dd was
> >> > not preempted. In this case latency will arise both from waiting on queue
> >> > as well as seek time.
> >>
> >> I think cfq should already be doing something similar, i.e. giving
> >> 100ms slices to firefox, that alternate with dd, unless:
> >> * firefox is too seeky (in this case, the idle window will be too small)
> >> * firefox has too much think time.
> >>
> >
> Hi Vivek,
> > Hi Corrado,
> >
> > "firefox" is the shell script to setup the environment and launch the
> > broser. It seems to be a group of threads. Some of them run in parallel
> > and some of these seems to be running one after the other (once previous
> > process or threads finished).
>
> Ok.
>
> >
> >> To rule out the first case, what happens if you run the test with your
> >> "fairness for seeky processes" patch?
> >
> > I applied that patch and it helps a lot.
> >
> > http://lwn.net/Articles/341032/
> >
> > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.
>
> Great.
> Can you try the attached patch (on top of 2.6.31)?
> It implements the alternative approach we discussed privately in july,
> and it addresses the possible latency increase that could happen with
> your patch.
>
> To summarize for everyone, we separate sync sequential queues, sync
> seeky queues and async queues in three separate RR strucutres, and
> alternate servicing requests between them.
>
> When servicing seeky queues (the ones that are usually penalized by
> cfq, for which no fairness is usually provided), we do not idle
> between them, but we do idle for the last queue (the idle can be
> exited when any seeky queue has requests). This allows us to allocate
> disk time globally for all seeky processes, and to reduce seeky
> processes latencies.
>

Ok, I seem to be doing the same thing at the group level (in the group
scheduling patches). I do not idle on individual sync seeky queues, but
if this is the last queue in the group, then I do idle to make sure the
group does not lose its fair share, and I exit from idle the moment
there is any busy queue in the group.
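
(For illustration, a simplified standalone sketch of this idling
decision; it is not the actual patch code and the helper names are
invented.)

#include <stdbool.h>
#include <stdio.h>

struct io_queue {
	bool seeky;		/* queue shows a random/seeky access pattern */
};

struct io_group {
	int nr_busy;		/* other queues in the group that still have requests */
};

/* Called when @q runs out of requests: should we arm the idle timer? */
static bool group_should_idle(const struct io_group *grp, const struct io_queue *q)
{
	if (!q->seeky)
		return true;		/* sequential queues idle as usual */
	/*
	 * Seeky queue: do not idle for it individually, but if no other
	 * queue in the group is busy, idle so that the group does not
	 * lose its fair share to a sibling group.
	 */
	return grp->nr_busy == 0;
}

/* The idle is abandoned as soon as any queue in the group gets busy again. */
static bool group_idle_should_exit(const struct io_group *grp)
{
	return grp->nr_busy > 0;
}

int main(void)
{
	struct io_group grp = { .nr_busy = 2 };
	struct io_queue seeky = { .seeky = true };

	printf("seeky queue, siblings still busy -> idle? %d\n",
	       group_should_idle(&grp, &seeky));
	grp.nr_busy = 0;
	printf("seeky queue, last one in the group -> idle? %d\n",
	       group_should_idle(&grp, &seeky));
	grp.nr_busy = 1;
	printf("a queue became busy -> exit idle? %d\n",
	       group_idle_should_exit(&grp));
	return 0;
}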

So you seem to be grouping all the sync seeky queues system-wide in a
single group, so that all the sync seeky queues collectively get 100ms
in a single round of dispatch? I am wondering what happens if there are
a lot of such sync seeky queues and the 100ms time slice is consumed
before all of them have had a chance to dispatch. Does that mean that
some of the queues can completely skip a dispatch round?

Thanks
Vivek

> I tested with 'konsole -e exit', while doing a sequential write with
> dd, and the start up time reduced from 37s to 7s, on an old laptop
> disk.
>
> Thanks,
> Corrado
>
> >
> >> To rule out the first case, what happens if you run the test with your
> >> "fairness for seeky processes" patch?
> >
> > I applied that patch and it helps a lot.
> >
> > http://lwn.net/Articles/341032/
> >
> > With above patchset applied, and fairness=1, firefox pops up in 27-28
> > seconds.
> >
> > So it looks like if we don't disable idle window for seeky processes on
> > hardware supporting command queuing, it helps in this particular case.
> >
> > Thanks
> > Vivek
> >

2009-09-28 17:49:59

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, Sep 28, 2009 at 06:04:08AM +0200, Mike Galbraith wrote:
> On Sun, 2009-09-27 at 20:16 +0200, Mike Galbraith wrote:
> > On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:
>
> > I'll give it a shot first thing in the A.M.
>
> > > diff --git a/block/elevator.c b/block/elevator.c
> > > index 1975b61..d00a72b 100644
> > > --- a/block/elevator.c
> > > +++ b/block/elevator.c
> > > @@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
> > > * See if our hash lookup can find a potential backmerge.
> > > */
> > > __rq = elv_rqhash_find(q, bio->bi_sector);
> > > - if (__rq && elv_rq_merge_ok(__rq, bio)) {
> > > - *req = __rq;
> > > - return ELEVATOR_BACK_MERGE;
> > > + if (__rq) {
> > > + /*
> > > + * If requests are queued behind this one, disallow merge. This
> > > + * prevents streaming IO from continually passing new IO.
> > > + */
> > > + if (elv_latter_request(q, __rq))
> > > + return ELEVATOR_NO_MERGE;
> > > + if (elv_rq_merge_ok(__rq, bio)) {
> > > + *req = __rq;
> > > + return ELEVATOR_BACK_MERGE;
> > > + }
> > > }
> > >
> > > if (e->ops->elevator_merge_fn)
>
> - = virgin tip v2.6.31-10215-ga3c9602
> + = with patchlet
> Avg
> dd pre 67.4 70.9 65.4 68.9 66.2 67.7-
> 65.9 68.5 69.8 65.2 65.8 67.0- Avg
> 70.4 70.3 65.1 66.4 70.1 68.4- 67.7-
> 73.1 64.6 65.3 65.3 64.9 66.6+ 65.6+ .968
> 63.8 67.9 65.2 65.1 64.4 65.2+
> 64.9 66.3 64.1 65.2 64.8 65.0+
> perf stat 8.66 16.29 9.65 14.88 9.45 11.7-
> 15.36 9.71 15.47 10.44 12.93 12.7-
> 10.55 15.11 10.22 15.35 10.32 12.3- 12.2-
> 9.87 7.53 10.62 7.51 9.95 9.0+ 9.1+ .745
> 7.73 10.12 8.19 11.87 8.07 9.1+
> 11.04 7.62 10.14 8.13 10.23 9.4+
> dd post 63.4 60.5 66.7 64.5 67.3 64.4-
> 64.4 66.8 64.3 61.5 62.0 63.8-
> 63.8 64.9 66.2 65.6 66.9 65.4- 64.5-
> 60.9 63.4 60.2 63.4 65.5 62.6+ 61.8+ .958
> 63.3 59.9 61.9 62.7 61.2 61.8+
> 60.1 63.7 59.5 61.5 60.6 61.0+
>

Hmm.., so close to a 25% reduction on average in the completion time of
konsole. But this is in the presence of a writer. Does this help even in
the presence of one or more sequential readers running?

So here latency seems to be coming from three sources.

- Wait in CFQ before the request is dispatched (only in case of competing sequential readers).
- Seek latencies.
- Latencies because bigger requests have already been dispatched to the disk.

So limiting the size of requests will help with the third factor but not
with the first two, and here seek latencies seem to be the biggest
contributor.

Thanks
Vivek

2009-09-28 17:51:24

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, 2009-09-28 at 17:35 +0200, Corrado Zoccolo wrote:

> Great.
> Can you try the attached patch (on top of 2.6.31)?
> It implements the alternative approach we discussed privately in july,
> and it addresses the possible latency increase that could happen with
> your patch.
>
> To summarize for everyone, we separate sync sequential queues, sync
> seeky queues and async queues in three separate RR strucutres, and
> alternate servicing requests between them.
>
> When servicing seeky queues (the ones that are usually penalized by
> cfq, for which no fairness is usually provided), we do not idle
> between them, but we do idle for the last queue (the idle can be
> exited when any seeky queue has requests). This allows us to allocate
> disk time globally for all seeky processes, and to reduce seeky
> processes latencies.
>
> I tested with 'konsole -e exit', while doing a sequential write with
> dd, and the start up time reduced from 37s to 7s, on an old laptop
> disk.

I was fiddling around trying to get IDLE class to behave at least, and
getting a bit frustrated. Class/priority didn't seem to make much if
any difference for konsole -e exit timings, and now I know why. I saw
the reference to Vivek's patch, and gave it a shot. Makes a large
difference.
Avg
perf stat 12.82 7.19 8.49 5.76 9.32 8.7 anticipatory
16.24 175.82 154.38 228.97 147.16 144.5 noop
43.23 57.39 96.13 148.25 180.09 105.0 deadline
9.15 14.51 9.39 15.06 9.90 11.6 cfq fairness=0 dd=nice 0
12.22 9.85 12.55 9.88 15.06 11.9 cfq fairness=0 dd=nice 19
9.77 13.19 11.78 17.40 9.51 11.9 cfq fairness=0 dd=SCHED_IDLE
4.59 2.74 4.70 3.45 4.69 4.0 cfq fairness=1 dd=nice 0
3.79 4.66 2.66 5.15 3.03 3.8 cfq fairness=1 dd=nice 19
2.79 4.73 2.79 4.02 2.50 3.3 cfq fairness=1 dd=SCHED_IDLE

I'll give your patch a spin as well.

-Mike

2009-09-28 18:20:39

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:
> On Mon, 2009-09-28 at 17:35 +0200, Corrado Zoccolo wrote:
>
> > Great.
> > Can you try the attached patch (on top of 2.6.31)?
> > It implements the alternative approach we discussed privately in july,
> > and it addresses the possible latency increase that could happen with
> > your patch.
> >
> > To summarize for everyone, we separate sync sequential queues, sync
> > seeky queues and async queues in three separate RR strucutres, and
> > alternate servicing requests between them.
> >
> > When servicing seeky queues (the ones that are usually penalized by
> > cfq, for which no fairness is usually provided), we do not idle
> > between them, but we do idle for the last queue (the idle can be
> > exited when any seeky queue has requests). This allows us to allocate
> > disk time globally for all seeky processes, and to reduce seeky
> > processes latencies.
> >
> > I tested with 'konsole -e exit', while doing a sequential write with
> > dd, and the start up time reduced from 37s to 7s, on an old laptop
> > disk.
>
> I was fiddling around trying to get IDLE class to behave at least, and
> getting a bit frustrated. Class/priority didn't seem to make much if
> any difference for konsole -e exit timings, and now I know why.

You seem to be testing konsole timings against a writer. In the case of
a writer, prio will not make much of a difference, as prio only adjusts
the length of the slice given to the process and writers rarely get to
use their full slice length. The reader immediately preempts it...

I guess changing the class to IDLE should have helped a bit, as that is
equivalent to setting the quantum to 1 and, after dispatching one
request to disk, CFQ will always expire the writer once. So it might
happen that by the time the reader preempted the writer, there were
fewer requests in the disk and hence lower latency for this reader.

> I saw
> the reference to Vivek's patch, and gave it a shot. Makes a large
> difference.
> Avg
> perf stat 12.82 7.19 8.49 5.76 9.32 8.7 anticipatory
> 16.24 175.82 154.38 228.97 147.16 144.5 noop
> 43.23 57.39 96.13 148.25 180.09 105.0 deadline
> 9.15 14.51 9.39 15.06 9.90 11.6 cfq fairness=0 dd=nice 0
> 12.22 9.85 12.55 9.88 15.06 11.9 cfq fairness=0 dd=nice 19
> 9.77 13.19 11.78 17.40 9.51 11.9 cfq fairness=0 dd=SCHED_IDLE
> 4.59 2.74 4.70 3.45 4.69 4.0 cfq fairness=1 dd=nice 0
> 3.79 4.66 2.66 5.15 3.03 3.8 cfq fairness=1 dd=nice 19
> 2.79 4.73 2.79 4.02 2.50 3.3 cfq fairness=1 dd=SCHED_IDLE
>

Hmm.., looks like average latency went down only in the case of
fairness=1 and not in the case of fairness=0. (Looking at the previous
mail, average vanilla CFQ latencies were around 12 seconds.)

Are you running all of this in the root group or have you put the
writers and readers into separate cgroups?

If everything is running in the root group, then I am curious why
latency went down in the case of fairness=1. The only thing the
fairness=1 parameter does is let all the requests from the previous
queue complete before we start dispatching from the next queue. On top
of that, this matters only if no preemption took place. In your test
case, konsole should preempt the writer, so practically fairness=1
should not make much difference.

In fact Jens has now committed a patch which achieves a similar effect
to fairness=1 for async queues.

commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
Author: Jens Axboe <[email protected]>
Date: Fri Jul 3 12:57:48 2009 +0200

cfq-iosched: drain device queue before switching to a sync queue

To lessen the impact of async IO on sync IO, let the device drain of
any async IO in progress when switching to a sync cfqq that has idling
enabled.


If everything is in separate cgroups, then we should have seen latency
improvements in the fairness=0 case also. I am a little perplexed here..

Thanks
Vivek

2009-09-28 18:25:12

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, 2009-09-28 at 13:48 -0400, Vivek Goyal wrote:

> Hmm.., so close to 25% reduction on average in completion time of konsole.
> But this is in presece of writer. Does this help even in presence of 1 or
> more sequential readers going?

Dunno, I've only tested sequential writer.

> So here latency seems to be coming from three sources.
>
> - Wait in CFQ before request is dispatched (only in case of competing seq readers).
> - seek latencies
> - latencies because of bigger requests are already dispatched to disk.
>
> So limiting the size of request will help with third factor but not with first
> two factors and here seek latencies seem to be the biggest contributor.

Yeah, seek latency seems to dominate.

-Mike

2009-09-28 18:54:04

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, 2009-09-28 at 14:18 -0400, Vivek Goyal wrote:
> On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:

> I guess changing class to IDLE should have helped a bit as now this is
> equivalent to setting the quantum to 1 and after dispatching one request
> to disk, CFQ will always expire the writer once. So it might happen that
> by the the reader preempted writer, we have less number of requests in
> disk and lesser latency for this reader.

I expected SCHED_IDLE to be better than setting quantum to 1, because
max is quantum*4 if you aren't IDLE. But that's not what happened. I
just retested with all knobs set back to stock, fairness off, and
quantum set to 1 with everything running nice 0. 2.8 seconds avg :-/

> > I saw
> > the reference to Vivek's patch, and gave it a shot. Makes a large
> > difference.
> > Avg
> > perf stat 12.82 7.19 8.49 5.76 9.32 8.7 anticipatory
> > 16.24 175.82 154.38 228.97 147.16 144.5 noop
> > 43.23 57.39 96.13 148.25 180.09 105.0 deadline
> > 9.15 14.51 9.39 15.06 9.90 11.6 cfq fairness=0 dd=nice 0
> > 12.22 9.85 12.55 9.88 15.06 11.9 cfq fairness=0 dd=nice 19
> > 9.77 13.19 11.78 17.40 9.51 11.9 cfq fairness=0 dd=SCHED_IDLE
> > 4.59 2.74 4.70 3.45 4.69 4.0 cfq fairness=1 dd=nice 0
> > 3.79 4.66 2.66 5.15 3.03 3.8 cfq fairness=1 dd=nice 19
> > 2.79 4.73 2.79 4.02 2.50 3.3 cfq fairness=1 dd=SCHED_IDLE
> >
>
> Hmm.., looks like average latency went down only in case of fairness=1
> and not in case of fairness=0. (Looking at previous mail, average vanilla
> cfq latencies were around 12 seconds).

Yup.

> Are you running all this in root group or have you put writers and readers
> into separate cgroups?

No cgroups here.

> If everything is running in root group, then I am curious why latency went
> down in case of fairness=1. The only thing fairness=1 parameter does is
> that it lets complete all the requests from previous queue before start
> dispatching from next queue. On top of this is valid only if no preemption
> took place. In your test case, konsole should preempt the writer so
> practically fairness=1 might not make much difference.

fairness=1 very definitely makes a very large difference. All of those
cfq numbers were logged in back to back runs.

> In fact now Jens has committed a patch which achieves the similar effect as
> fairness=1 for async queues.

Yeah, I was there yesterday. I speculated that that would hurt my
reader, but rearranging things didn't help one bit. Playing with merge,
I managed to give dd ~7% more throughput, and injured poor reader even
more. (problem analysis via hammer/axe not always most effective;)

> commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
> Author: Jens Axboe <[email protected]>
> Date: Fri Jul 3 12:57:48 2009 +0200
>
> cfq-iosched: drain device queue before switching to a sync queue
>
> To lessen the impact of async IO on sync IO, let the device drain of
> any async IO in progress when switching to a sync cfqq that has idling
> enabled.
>
>
> If everything is in separate cgroups, then we should have seen latency
> improvements in case of fairness=0 case also. I am little perplexed here..
>
> Thanks
> Vivek

2009-09-29 00:37:44

by Nauman Rafique

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek,
Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
Jens about IO controller during Linux Plumbers Conference '09. Jens
expressed his concerns about the size and complexity of the patches. I
believe that is a reasonable concern. We talked about things that
could be done to reduce the size of the patches. The requirement that
the "solution has to work with all IO schedulers" seems like a
secondary concern at this point; and it came out as one thing that can
help to reduce the size of the patch set. Another possibility is to
use a simpler scheduling algorithm e.g. weighted round robin, instead
of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
the fact that it is complex to understand, and might be cumbersome to
maintain. Also, hierarchical scheduling is something that could be
unnecessary in the first set of patches, even though cgroups are
hierarchical in nature.

We are starting from a point where there is no cgroup based IO
scheduling in the kernel. And it is probably not reasonable to satisfy
all IO scheduling related requirements in one patch set. We can start
with something simple, and build on top of that. So a very simple
patch set that enables cgroup based proportional scheduling for CFQ
seems like the way to go at this point.

It would be great if we discuss our plans on the mailing list, so we
can get early feedback from everyone.

2009-09-29 03:23:43

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> Hi Vivek,
> Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
> Jens about IO controller during Linux Plumbers Conference '09. Jens
> expressed his concerns about the size and complexity of the patches. I
> believe that is a reasonable concern. We talked about things that
> could be done to reduce the size of the patches. The requirement that
> the "solution has to work with all IO schedulers" seems like a
> secondary concern at this point; and it came out as one thing that can
> help to reduce the size of the patch set.

Initially doing cgroup based IO control only for CFQ should help a lot in
reducing the patchset size.

> Another possibility is to
> use a simpler scheduling algorithm e.g. weighted round robin, instead
> of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
> the fact that it is complex to understand, and might be cumbersome to
> maintain.

The core of BFQ I have already gotten rid of. The remaining parts are
the idle tree and the data structures. I will see how I can simplify it
further.

> Also, hierarchical scheduling is something that could be
> unnecessary in the first set of patches, even though cgroups are
> hierarchical in nature.

Sure. Though I don't think a lot of the code is there because of the
hierarchical nature. If we solve the issue at the CFQ layer, we have to
maintain at least two levels, one for queues and the other for groups,
so even the simplest solution becomes almost hierarchical in nature. But
I will still see how to get rid of some code here too...
>
> We are starting from a point where there is no cgroup based IO
> scheduling in the kernel. And it is probably not reasonable to satisfy
> all IO scheduling related requirements in one patch set. We can start
> with something simple, and build on top of that. So a very simple
> patch set that enables cgroup based proportional scheduling for CFQ
> seems like the way to go at this point.

Sure, we can start with CFQ only. But a bigger question we need to
answer is whether CFQ is the right place to solve the issue. Jens, do
you think CFQ is the right place to solve the problem?

Andrew seems to favor a high-level approach so that the IO schedulers
stay less complex and we can provide fairness at higher-level logical
devices also.

I will again try to summarize my understanding so far of the pros/cons
of each approach, and then we can take the discussion forward.

Fairness in terms of size of IO or disk time used
=================================================
On seeky media, fairness in terms of disk time can get us better results
than fairness in terms of size or number of IOs.

If we implement some kind of time based solution at a higher layer, then
that higher layer needs to know how much time each group used. We can
probably do some kind of timestamping in the bio to get a sense of when
it reached the disk and when it finished. But on multi-queue hardware
there can be multiple requests in the disk at the same time, either from
the same queue or from different queues, and with a pure timestamping
based approach I have so far not been able to think of how, at the
higher level, we would figure out who used how much time.
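
To make the accounting problem concrete, here is a toy calculation (not
taken from any patchset): per-bio dispatch/completion timestamps are
charged to each group, and because the requests overlap inside the disk,
the charged times add up to far more than the wall-clock time the disk
was actually busy.

#include <stdio.h>

struct bio_sample {
	int group;		/* owning group of the bio */
	double dispatch_ms;	/* timestamp: bio sent to the disk */
	double complete_ms;	/* timestamp: completion seen */
};

int main(void)
{
	/* two groups whose requests overlap inside an NCQ-capable disk */
	struct bio_sample samples[] = {
		{ 0,  0.0, 12.0 },
		{ 1,  1.0, 11.0 },
		{ 0,  2.0, 14.0 },
		{ 1,  3.0, 13.0 },
	};
	double charged[2] = { 0.0, 0.0 };
	double wall_clock_ms = 14.0;	/* the disk was busy from t=0 to t=14 */
	unsigned i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		charged[samples[i].group] +=
			samples[i].complete_ms - samples[i].dispatch_ms;

	printf("group0 charged %.1f ms, group1 charged %.1f ms, total %.1f ms\n",
	       charged[0], charged[1], charged[0] + charged[1]);
	printf("but the disk was only busy for %.1f ms of wall-clock time\n",
	       wall_clock_ms);
	return 0;
}

With 44ms charged against 14ms of real disk time, a pure high-level
timestamping scheme cannot tell which group actually kept the disk busy.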

So this is the first point of contention: how do we want to provide
fairness, in terms of disk time used or in terms of size/number of IOs?

Max bandwidth Controller or Proportional bandwidth controller
=============================================================
What is our primary requirement here? A weight-based proportional
bandwidth controller, where the resources are used optimally and any
kind of throttling kicks in only if there is contention for the disk?

Or do we want max bandwidth control, where a group is not allowed to
exceed its limit even if the disk is otherwise free?

Or do we need both? I would think that at some point we will need both,
but we can start with proportional bandwidth control first.

Fairness for higher level logical devices
=========================================
Do we want good fairness numbers for higher-level logical devices also,
or is it sufficient to provide fairness at the leaf nodes? Providing
fairness at the leaf nodes helps us use the resources optimally, and in
the process we get fairness at the higher level as well in many of the
cases.

But do we want strict fairness numbers at higher-level logical devices
even if it means sub-optimal usage of the underlying physical devices?

I think that for proportional bandwidth control it should be ok to
provide fairness at the leaf nodes and let the higher-level logical
device inherit it, but for max bandwidth control it might make more
sense to enforce the limit at the higher level. Consider a case where,
on a striped device, a customer wants to limit a group to 30MB/s: with
leaf node control, if every leaf node allows 30MB/s, the total can
accumulate to much more than the specified rate at the logical device
(e.g. a stripe over four disks could let the group do up to 120MB/s).

Latency Control and strong isolation between groups
===================================================
Do we want strong isolation between groups and good control over
latencies?

I think if the problem is solved at the IO scheduler level, we can
achieve better latency control and hence stronger isolation between
groups.

Higher-level solutions should find it hard to provide the same kind of
latency control and isolation between groups as an IO scheduler based
solution.

Fairness for buffered writes
============================
Doing IO control anywhere below the page cache has the disadvantage that
the page cache might not dispatch more writes from the higher-weight
group, and hence the higher-weight group might not see more IO done.
Andrew says that we don't have a solution to this problem in the kernel
and he would like to see it handled properly.

The only way to solve this seems to be to slow down the writers before
they write into the page cache. The IO throttling patch handled it by
slowing down a writer if it crossed its specified max rate. Other
suggestions have come in the form of a per-memory-cgroup dirty_ratio, or
a separate cgroup controller altogether where some kind of per-group
write limit can be specified.

So whether the solution is implemented at the IO scheduler layer or at
the device mapper layer, both will have to rely on another controller
being co-mounted to handle buffered writes properly.
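
As an illustration of that idea, here is a toy sketch of slowing a
writer down before it dirties the page cache; this is my own
simplification, not the io-throttling patch, and all numbers are
arbitrary.

#include <stdio.h>

struct write_throttle {
	double max_bps;		/* configured max write rate for the group */
	double bytes_written;	/* bytes accepted into the page cache so far */
	double start_s;		/* when accounting began */
};

/* How long (in seconds) should the writer sleep before this write? */
static double throttle_delay(struct write_throttle *t, double now_s, double bytes)
{
	double allowed = t->max_bps * (now_s - t->start_s);
	double excess = (t->bytes_written + bytes) - allowed;

	t->bytes_written += bytes;
	return excess > 0 ? excess / t->max_bps : 0.0;
}

int main(void)
{
	/* limit the group to 10 MiB/s of buffered writes */
	struct write_throttle t = { .max_bps = 10.0 * 1024 * 1024,
				    .bytes_written = 0, .start_s = 0 };
	double now = 0.0;
	int i;

	/* a writer pushing 4 MiB buffers as fast as it can */
	for (i = 0; i < 5; i++) {
		double d = throttle_delay(&t, now, 4.0 * 1024 * 1024);
		printf("write %d at t=%.2fs: sleep for %.2fs\n", i, now, d);
		now += d;	/* pretend the writer actually slept */
	}
	return 0;
}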

Fairness with-in group
======================
One of the issues with a higher-level controller is how to do the
throttling so that fairness within the group is not impacted, especially
making sure that we don't break the notion of ioprio of the processes
within the group.

The io throttling patch in particular was very bad in terms of prio
within a group: throttling treated everyone equally and the difference
between process priorities disappeared.

Reads Vs Writes
===============
A higher-level controller will most likely change the ratio in which
reads and writes are dispatched to the disk within a group. So far that
has been decided by the IO scheduler, but with higher-level groups doing
the throttling, and possibly buffering the bios and releasing them
later, they will have to come up with their own policy on the proportion
in which reads and writes are dispatched. In the case of IO scheduler
based control, all the queuing takes place at the IO scheduler and it
retains control of the ratio in which reads and writes are dispatched.


Summary
=======

- An IO scheduler based IO controller can provide better latencies,
stronger isolation between groups and time based fairness, and will not
interfere with IO scheduler policies like class, ioprio and
reader vs writer handling.

But it cannot guarantee fairness at higher-level logical devices.
Especially in the case of max bw control, leaf node control does not
sound like the most appropriate thing.

- IO throttling provides max bw control in terms of absolute rate. It
has the advantage that it can provide control at higher-level logical
devices and can also control buffered writes without an additional
controller co-mounted.

But it does only max bw control and not proportional control, so one
might not be using the resources optimally. It loses the sense of task
prio and class within a group, as any of the tasks can be throttled
within the group. Because throttling does not kick in until you hit the
max bw limit, it should find it hard to provide the same latencies as IO
scheduler based control.

- dm-ioband also has the advantage that it can provide fairness at
higher-level logical devices.

But fairness is provided only in terms of size or number of IOs, with no
time based fairness. It is very throughput oriented and does not
throttle a fast group if another group is running a slow random reader.
This results in bad latencies for the random reader group and weaker
isolation between groups.

Also, it does not provide fairness if a group is not continuously
backlogged. So if one is running 1-2 dd/sequential readers in a group,
one does not get fairness until the workload is increased to the point
where the group becomes continuously backlogged. This also results in
poor latencies and limited fairness.

At this point it does not look like a single IO controller can cover all
the scenarios/requirements. This means a few things to me.

- Drop some of the requirements and go with one implementation which
meets the reduced set of requirements.

- Have more than one IO controller implementation in the kernel: one for
lower-level control, giving better latencies, stronger isolation and
optimal resource usage, and another one for fairness at higher-level
logical devices and max bandwidth control.

And let the user decide which one to use based on his/her needs.

- Come up with a more intelligent way of doing IO control where a single
controller covers all the cases.

At this point I am more inclined towards option 2, having more than one
implementation in the kernel. :-) (Until and unless we can brainstorm
and come up with ideas to make option 3 happen.)

>
> It would be great if we discuss our plans on the mailing list, so we
> can get early feedback from everyone.

This is what comes to my mind so far. Please add to the list if I have missed
some points. Also correct me if I am wrong about the pros/cons of the
approaches.

Thoughts/ideas/opinions are welcome...

Thanks
Vivek

2009-09-29 05:55:29

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, 2009-09-28 at 19:51 +0200, Mike Galbraith wrote:

> I'll give your patch a spin as well.

I applied it to tip, and fixed up rejects. I haven't done a line for
line verification against the original patch yet (brave or..), so add
giant economy sized pinch of salt.

In the form it ended up in, it didn't help here. I tried twiddling
knobs, but it didn't help either. Reducing latency target from 300 to
30 did nada, but dropping to 3 did... I got to poke BRB.

Plugging Vivek's fairness tweakable on top, and enabling it, my timings
return to decent numbers, so that one liner absatively posilutely is
where my write vs read woes are coming from.

FWIW, below is patch wedged into tip v2.6.31-10215-ga3c9602

---
block/cfq-iosched.c | 281 ++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 227 insertions(+), 54 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -27,6 +27,12 @@ static const int cfq_slice_sync = HZ / 1
static int cfq_slice_async = HZ / 25;
static const int cfq_slice_async_rq = 2;
static int cfq_slice_idle = HZ / 125;
+static int cfq_target_latency = HZ * 3/10; /* 300 ms */
+static int cfq_hist_divisor = 4;
+/*
+ * Number of times that other workloads can be scheduled before async
+ */
+static const unsigned int cfq_async_penalty = 4;

/*
* offset from end of service tree
@@ -36,7 +42,7 @@ static int cfq_slice_idle = HZ / 125;
/*
* below this threshold, we consider thinktime immediate
*/
-#define CFQ_MIN_TT (2)
+#define CFQ_MIN_TT (1)

#define CFQ_SLICE_SCALE (5)
#define CFQ_HW_QUEUE_MIN (5)
@@ -67,8 +73,9 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
struct cfq_rb_root {
struct rb_root rb;
struct rb_node *left;
+ unsigned count;
};
-#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, }
+#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, }

/*
* Per process-grouping structure
@@ -113,6 +120,21 @@ struct cfq_queue {
unsigned short ioprio_class, org_ioprio_class;

pid_t pid;
+
+ struct cfq_rb_root *service_tree;
+ struct cfq_io_context *cic;
+};
+
+enum wl_prio_t {
+ IDLE_WL = -1,
+ BE_WL = 0,
+ RT_WL = 1
+};
+
+enum wl_type_t {
+ ASYNC_WL = 0,
+ SYNC_NOIDLE_WL = 1,
+ SYNC_WL = 2
};

/*
@@ -124,7 +146,13 @@ struct cfq_data {
/*
* rr list of queues with requests and the count of them
*/
- struct cfq_rb_root service_tree;
+ struct cfq_rb_root service_trees[2][3];
+ struct cfq_rb_root service_tree_idle;
+
+ enum wl_prio_t serving_prio;
+ enum wl_type_t serving_type;
+ unsigned long workload_expires;
+ unsigned int async_starved;

/*
* Each priority tree is sorted by next_request position. These
@@ -134,9 +162,11 @@ struct cfq_data {
struct rb_root prio_trees[CFQ_PRIO_LISTS];

unsigned int busy_queues;
+ unsigned int busy_queues_avg[2];

int rq_in_driver[2];
int sync_flight;
+ int reads_delayed;

/*
* queue-depth detection
@@ -173,6 +203,9 @@ struct cfq_data {
unsigned int cfq_slice[2];
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
+ unsigned int cfq_target_latency;
+ unsigned int cfq_hist_divisor;
+ unsigned int cfq_async_penalty;

struct list_head cic_list;

@@ -182,6 +215,11 @@ struct cfq_data {
struct cfq_queue oom_cfqq;
};

+static struct cfq_rb_root * service_tree_for(enum wl_prio_t prio, enum wl_type_t type,
+ struct cfq_data *cfqd) {
+ return prio == IDLE_WL ? &cfqd->service_tree_idle : &cfqd->service_trees[prio][type];
+}
+
enum cfqq_state_flags {
CFQ_CFQQ_FLAG_on_rr = 0, /* on round-robin busy list */
CFQ_CFQQ_FLAG_wait_request, /* waiting for a request */
@@ -226,6 +264,17 @@ CFQ_CFQQ_FNS(coop);
#define cfq_log(cfqd, fmt, args...) \
blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)

+#define CIC_SEEK_THR 1024
+#define CIC_SEEKY(cic) ((cic)->seek_mean > CIC_SEEK_THR)
+#define CFQQ_SEEKY(cfqq) (!cfqq->cic || CIC_SEEKY(cfqq->cic))
+
+static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd) {
+ return wl==IDLE_WL? cfqd->service_tree_idle.count :
+ cfqd->service_trees[wl][ASYNC_WL].count
+ + cfqd->service_trees[wl][SYNC_NOIDLE_WL].count
+ + cfqd->service_trees[wl][SYNC_WL].count;
+}
+
static void cfq_dispatch_insert(struct request_queue *, struct request *);
static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
struct io_context *, gfp_t);
@@ -247,6 +296,7 @@ static inline void cic_set_cfqq(struct c
struct cfq_queue *cfqq, int is_sync)
{
cic->cfqq[!!is_sync] = cfqq;
+ cfqq->cic = cic;
}

/*
@@ -301,10 +351,33 @@ cfq_prio_to_slice(struct cfq_data *cfqd,
return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
}

+static inline unsigned
+cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) {
+ unsigned min_q, max_q;
+ unsigned mult = cfqd->cfq_hist_divisor - 1;
+ unsigned round = cfqd->cfq_hist_divisor / 2;
+ unsigned busy = cfq_busy_queues_wl(rt, cfqd);
+ min_q = min(cfqd->busy_queues_avg[rt], busy);
+ max_q = max(cfqd->busy_queues_avg[rt], busy);
+ cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
+ cfqd->cfq_hist_divisor;
+ return cfqd->busy_queues_avg[rt];
+}
+
static inline void
cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+ unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1];
+ unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq));
+ unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
+
+ if (iq > process_thr) {
+ unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle
+ / cfqd->cfq_slice[1];
+ slice = max(slice * process_thr / iq, min(slice, low_slice));
+ }
+
+ cfqq->slice_end = jiffies + slice;
cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
}

@@ -443,6 +516,7 @@ static void cfq_rb_erase(struct rb_node
if (root->left == n)
root->left = NULL;
rb_erase_init(n, &root->rb);
+ --root->count;
}

/*
@@ -483,46 +557,56 @@ static unsigned long cfq_slice_offset(st
}

/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
+ * The cfqd->service_trees holds all pending cfq_queue's that have
* requests waiting to be processed. It is sorted in the order that
* we will service the queues.
*/
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- int add_front)
+static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
struct rb_node **p, *parent;
struct cfq_queue *__cfqq;
unsigned long rb_key;
+ struct cfq_rb_root *service_tree;
int left;

if (cfq_class_idle(cfqq)) {
rb_key = CFQ_IDLE_DELAY;
- parent = rb_last(&cfqd->service_tree.rb);
+ service_tree = &cfqd->service_tree_idle;
+ parent = rb_last(&service_tree->rb);
if (parent && parent != &cfqq->rb_node) {
__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
rb_key += __cfqq->rb_key;
} else
rb_key += jiffies;
- } else if (!add_front) {
+ } else {
+ enum wl_prio_t prio = cfq_class_rt(cfqq) ? RT_WL : BE_WL;
+ enum wl_type_t type = cfq_cfqq_sync(cfqq) ? SYNC_WL : ASYNC_WL;
+
rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
rb_key += cfqq->slice_resid;
cfqq->slice_resid = 0;
- } else
- rb_key = 0;
+
+ if (type == SYNC_WL && (CFQQ_SEEKY(cfqq) || !cfq_cfqq_idle_window(cfqq)))
+ type = SYNC_NOIDLE_WL;
+
+ service_tree = service_tree_for(prio, type, cfqd);
+ }

if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
/*
* same position, nothing more to do
*/
- if (rb_key == cfqq->rb_key)
+ if (rb_key == cfqq->rb_key && cfqq->service_tree == service_tree)
return;

- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+ cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+ cfqq->service_tree = NULL;
}

left = 1;
parent = NULL;
- p = &cfqd->service_tree.rb.rb_node;
+ cfqq->service_tree = service_tree;
+ p = &service_tree->rb.rb_node;
while (*p) {
struct rb_node **n;

@@ -554,11 +638,12 @@ static void cfq_service_tree_add(struct
}

if (left)
- cfqd->service_tree.left = &cfqq->rb_node;
+ service_tree->left = &cfqq->rb_node;

cfqq->rb_key = rb_key;
rb_link_node(&cfqq->rb_node, parent, p);
- rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
+ rb_insert_color(&cfqq->rb_node, &service_tree->rb);
+ service_tree->count++;
}

static struct cfq_queue *
@@ -631,7 +716,7 @@ static void cfq_resort_rr_list(struct cf
* Resorting requires the cfqq to be on the RR list already.
*/
if (cfq_cfqq_on_rr(cfqq)) {
- cfq_service_tree_add(cfqd, cfqq, 0);
+ cfq_service_tree_add(cfqd, cfqq);
cfq_prio_tree_add(cfqd, cfqq);
}
}
@@ -660,8 +745,10 @@ static void cfq_del_cfqq_rr(struct cfq_d
BUG_ON(!cfq_cfqq_on_rr(cfqq));
cfq_clear_cfqq_on_rr(cfqq);

- if (!RB_EMPTY_NODE(&cfqq->rb_node))
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+ if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+ cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+ cfqq->service_tree = NULL;
+ }
if (cfqq->p_root) {
rb_erase(&cfqq->p_node, cfqq->p_root);
cfqq->p_root = NULL;
@@ -923,10 +1010,11 @@ static inline void cfq_slice_expired(str
*/
static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
{
- if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
- return NULL;
+ struct cfq_rb_root *service_tree = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd);

- return cfq_rb_first(&cfqd->service_tree);
+ if (RB_EMPTY_ROOT(&service_tree->rb))
+ return NULL;
+ return cfq_rb_first(service_tree);
}

/*
@@ -954,9 +1042,6 @@ static inline sector_t cfq_dist_from_las
return cfqd->last_position - blk_rq_pos(rq);
}

-#define CIC_SEEK_THR 8 * 1024
-#define CIC_SEEKY(cic) ((cic)->seek_mean > CIC_SEEK_THR)
-
static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq)
{
struct cfq_io_context *cic = cfqd->active_cic;
@@ -1044,6 +1129,10 @@ static struct cfq_queue *cfq_close_coope
if (cfq_cfqq_coop(cfqq))
return NULL;

+ /* we don't want to mix processes with different characteristics */
+ if (cfqq->service_tree != cur_cfqq->service_tree)
+ return NULL;
+
if (!probe)
cfq_mark_cfqq_coop(cfqq);
return cfqq;
@@ -1087,14 +1176,15 @@ static void cfq_arm_slice_timer(struct c

cfq_mark_cfqq_wait_request(cfqq);

- /*
- * we don't want to idle for seeks, but we do want to allow
- * fair distribution of slice time for a process doing back-to-back
- * seeks. so allow a little bit of time for him to submit a new rq
- */
- sl = cfqd->cfq_slice_idle;
- if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
+ sl = min_t(unsigned, cfqd->cfq_slice_idle, cfqq->slice_end - jiffies);
+
+ /* very small idle if we are serving noidle trees, and there are more trees */
+ if (cfqd->serving_type == SYNC_NOIDLE_WL &&
+ service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count > 0) {
+ if (blk_queue_nonrot(cfqd->queue))
+ return;
sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
+ }

mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
@@ -1110,6 +1200,11 @@ static void cfq_dispatch_insert(struct r

cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");

+ if (!time_before(jiffies, rq->start_time + cfqd->cfq_target_latency / 2) && rq_data_dir(rq)==READ) {
+ cfqd->reads_delayed = max_t(int, cfqd->reads_delayed,
+ (jiffies - rq->start_time) / (cfqd->cfq_target_latency / 2));
+ }
+
cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
cfq_remove_request(rq);
cfqq->dispatched++;
@@ -1156,6 +1251,16 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd,
return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
}

+enum wl_type_t cfq_choose_sync_async(struct cfq_data *cfqd, enum wl_prio_t prio) {
+ struct cfq_queue *id, *ni;
+ ni = cfq_rb_first(service_tree_for(prio, SYNC_NOIDLE_WL, cfqd));
+ id = cfq_rb_first(service_tree_for(prio, SYNC_WL, cfqd));
+ if (id && ni && id->rb_key < ni->rb_key)
+ return SYNC_WL;
+ if (!ni) return SYNC_WL;
+ return SYNC_NOIDLE_WL;
+}
+
/*
* Select a queue for service. If we have a current active queue,
* check whether to continue servicing it, or retrieve and set a new one.
@@ -1196,15 +1301,68 @@ static struct cfq_queue *cfq_select_queu
* flight or is idling for a new request, allow either of these
* conditions to happen (or time out) before selecting a new queue.
*/
- if (timer_pending(&cfqd->idle_slice_timer) ||
+ if (timer_pending(&cfqd->idle_slice_timer) ||
(cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
cfqq = NULL;
goto keep_queue;
}
-
expire:
cfq_slice_expired(cfqd, 0);
new_queue:
+ if (!new_cfqq) {
+ enum wl_prio_t previous_prio = cfqd->serving_prio;
+
+ if (cfq_busy_queues_wl(RT_WL, cfqd))
+ cfqd->serving_prio = RT_WL;
+ else if (cfq_busy_queues_wl(BE_WL, cfqd))
+ cfqd->serving_prio = BE_WL;
+ else {
+ cfqd->serving_prio = IDLE_WL;
+ cfqd->workload_expires = jiffies + 1;
+ cfqd->reads_delayed = 0;
+ }
+
+ if (cfqd->serving_prio != IDLE_WL) {
+ int counts[]={
+ service_tree_for(cfqd->serving_prio, ASYNC_WL, cfqd)->count,
+ service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count,
+ service_tree_for(cfqd->serving_prio, SYNC_WL, cfqd)->count
+ };
+ int nonzero_counts= !!counts[0] + !!counts[1] + !!counts[2];
+
+ if (previous_prio != cfqd->serving_prio || (nonzero_counts == 1)) {
+ cfqd->serving_type = counts[1] ? SYNC_NOIDLE_WL : counts[2] ? SYNC_WL : ASYNC_WL;
+ cfqd->async_starved = 0;
+ cfqd->reads_delayed = 0;
+ } else {
+ if (!counts[cfqd->serving_type] || time_after(jiffies, cfqd->workload_expires)) {
+ if (cfqd->serving_type != ASYNC_WL && counts[ASYNC_WL] &&
+ cfqd->async_starved++ > cfqd->cfq_async_penalty * (1 + cfqd->reads_delayed))
+ cfqd->serving_type = ASYNC_WL;
+ else
+ cfqd->serving_type = cfq_choose_sync_async(cfqd, cfqd->serving_prio);
+ } else
+ goto same_wl;
+ }
+
+ {
+ unsigned slice = cfqd->cfq_target_latency;
+ slice = slice * counts[cfqd->serving_type] /
+ max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio],
+ counts[SYNC_WL] + counts[SYNC_NOIDLE_WL] + counts[ASYNC_WL]);
+
+ if (cfqd->serving_type == ASYNC_WL)
+ slice = max(1U, (slice / (1 + cfqd->reads_delayed))
+ * cfqd->cfq_slice[0] / cfqd->cfq_slice[1]);
+ else
+ slice = max(slice, 2U * max(1U, cfqd->cfq_slice_idle));
+
+ cfqd->workload_expires = jiffies + slice;
+ cfqd->async_starved *= (cfqd->serving_type != ASYNC_WL);
+ }
+ }
+ }
+ same_wl:
cfqq = cfq_set_active_queue(cfqd, new_cfqq);
keep_queue:
return cfqq;
@@ -1231,8 +1389,13 @@ static int cfq_forced_dispatch(struct cf
{
struct cfq_queue *cfqq;
int dispatched = 0;
+ int i,j;
+ for (i = 0; i < 2; ++i)
+ for (j = 0; j < 3; ++j)
+ while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j])) != NULL)
+ dispatched += __cfq_forced_dispatch_cfqq(cfqq);

- while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+ while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL)
dispatched += __cfq_forced_dispatch_cfqq(cfqq);

cfq_slice_expired(cfqd, 0);
@@ -1300,6 +1463,12 @@ static int cfq_dispatch_requests(struct
return 0;

/*
+ * Drain async requests before we start sync IO
+ */
+ if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+ return 0;
+
+ /*
* If this is an async queue and we have sync IO in flight, let it wait
*/
if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
@@ -1993,18 +2162,8 @@ cfq_should_preempt(struct cfq_data *cfqd
if (cfq_class_idle(cfqq))
return 1;

- /*
- * if the new request is sync, but the currently running queue is
- * not, let the sync request have priority.
- */
- if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
- return 1;
-
- /*
- * So both queues are sync. Let the new request get disk time if
- * it's a metadata request and the current queue is doing regular IO.
- */
- if (rq_is_meta(rq) && !cfqq->meta_pending)
+ if (cfqd->serving_type == SYNC_NOIDLE_WL
+ && new_cfqq->service_tree == cfqq->service_tree)
return 1;

/*
@@ -2035,13 +2194,9 @@ static void cfq_preempt_queue(struct cfq
cfq_log_cfqq(cfqd, cfqq, "preempt");
cfq_slice_expired(cfqd, 1);

- /*
- * Put the new queue at the front of the of the current list,
- * so we know that it will be selected next.
- */
BUG_ON(!cfq_cfqq_on_rr(cfqq));

- cfq_service_tree_add(cfqd, cfqq, 1);
+ cfq_service_tree_add(cfqd, cfqq);

cfqq->slice_end = 0;
cfq_mark_cfqq_slice_new(cfqq);
@@ -2438,13 +2593,16 @@ static void cfq_exit_queue(struct elevat
static void *cfq_init_queue(struct request_queue *q)
{
struct cfq_data *cfqd;
- int i;
+ int i,j;

cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
if (!cfqd)
return NULL;

- cfqd->service_tree = CFQ_RB_ROOT;
+ for (i = 0; i < 2; ++i)
+ for (j = 0; j < 3; ++j)
+ cfqd->service_trees[i][j] = CFQ_RB_ROOT;
+ cfqd->service_tree_idle = CFQ_RB_ROOT;

/*
* Not strictly needed (since RB_ROOT just clears the node and we
@@ -2481,6 +2639,9 @@ static void *cfq_init_queue(struct reque
cfqd->cfq_slice[1] = cfq_slice_sync;
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
+ cfqd->cfq_target_latency = cfq_target_latency;
+ cfqd->cfq_hist_divisor = cfq_hist_divisor;
+ cfqd->cfq_async_penalty = cfq_async_penalty;
cfqd->hw_tag = 1;

return cfqd;
@@ -2517,6 +2678,7 @@ fail:
/*
* sysfs parts below -->
*/
+
static ssize_t
cfq_var_show(unsigned int var, char *page)
{
@@ -2550,6 +2712,9 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd-
SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
+SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
+SHOW_FUNCTION(cfq_hist_divisor_show, cfqd->cfq_hist_divisor, 0);
+SHOW_FUNCTION(cfq_async_penalty_show, cfqd->cfq_async_penalty, 0);
#undef SHOW_FUNCTION

#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -2581,6 +2746,11 @@ STORE_FUNCTION(cfq_slice_sync_store, &cf
STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
+
+STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 1000, 1);
+STORE_FUNCTION(cfq_hist_divisor_store, &cfqd->cfq_hist_divisor, 1, 100, 0);
+STORE_FUNCTION(cfq_async_penalty_store, &cfqd->cfq_async_penalty, 1, UINT_MAX, 0);
+
#undef STORE_FUNCTION

#define CFQ_ATTR(name) \
@@ -2596,6 +2766,9 @@ static struct elv_fs_entry cfq_attrs[] =
CFQ_ATTR(slice_async),
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
+ CFQ_ATTR(target_latency),
+ CFQ_ATTR(hist_divisor),
+ CFQ_ATTR(async_penalty),
__ATTR_NULL
};


2009-09-29 07:10:17

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek,
On Mon, Sep 28, 2009 at 7:14 PM, Vivek Goyal <[email protected]> wrote:
> On Mon, Sep 28, 2009 at 05:35:02PM +0200, Corrado Zoccolo wrote:
>> On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <[email protected]> wrote:
>> > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
>> >> Hi Vivek,
>> >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <[email protected]> wrote:
>> >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> >> >> Vivek Goyal wrote:
>> >> >> > Notes:
>> >> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >> >> >   Bring down its throughput and bump up latencies significantly.
>> >> >>
>> >> >>
>> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> >> >> too.
>> >> >>
>> >> >> I'm basing this assumption on the observations I made on both OpenSuse
>> >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> >> >> titled: "Poor desktop responsiveness with background I/O-operations" of
>> >> >> 2009-09-20.
>> >> >> (Message ID: [email protected])
>> >> >>
>> >> >>
>> >> >> Thus, I'm posting this to show that your work is greatly appreciated,
>> >> >> given the rather disappointig status quo of Linux's fairness when it
>> >> >> comes to disk IO time.
>> >> >>
>> >> >> I hope that your efforts lead to a change in performance of current
>> >> >> userland applications, the sooner, the better.
>> >> >>
>> >> > [Please don't remove people from original CC list. I am putting them back.]
>> >> >
>> >> > Hi Ulrich,
>> >> >
>> >> > I quicky went through that mail thread and I tried following on my
>> >> > desktop.
>> >> >
>> >> > ##########################################
>> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> >> > sleep 5
>> >> > time firefox
>> >> > # close firefox once gui pops up.
>> >> > ##########################################
>> >> >
>> >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
>> >> > following.
>> >> >
>> >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>> >> >
>> >> > (Results do vary across runs, especially if system is booted fresh. Don't
>> >> >  know why...).
>> >> >
>> >> >
>> >> > Then I tried putting both the applications in separate groups and assign
>> >> > them weights 200 each.
>> >> >
>> >> > ##########################################
>> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> >> > echo $! > /cgroup/io/test1/tasks
>> >> > sleep 5
>> >> > echo $$ > /cgroup/io/test2/tasks
>> >> > time firefox
>> >> > # close firefox once gui pops up.
>> >> > ##########################################
>> >> >
>> >> > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3.
>> >> >
>> >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>> >> >
>> >> > Notice that throughput of dd also improved.
>> >> >
>> >> > I ran the block trace and noticed in many a cases firefox threads
>> >> > immediately preempted the "dd". Probably because it was a file system
>> >> > request. So in this case latency will arise from seek time.
>> >> >
>> >> > In some other cases, threads had to wait for up to 100ms because dd was
>> >> > not preempted. In this case latency will arise both from waiting on queue
>> >> > as well as seek time.
>> >>
>> >> I think cfq should already be doing something similar, i.e. giving
>> >> 100ms slices to firefox, that alternate with dd, unless:
>> >> * firefox is too seeky (in this case, the idle window will be too small)
>> >> * firefox has too much think time.
>> >>
>> >
>> Hi Vivek,
>> > Hi Corrado,
>> >
>> > "firefox" is the shell script to setup the environment and launch the
>> > broser. It seems to be a group of threads. Some of them run in parallel
>> > and some of these seems to be running one after the other (once previous
>> > process or threads finished).
>>
>> Ok.
>>
>> >
>> >> To rule out the first case, what happens if you run the test with your
>> >> "fairness for seeky processes" patch?
>> >
>> > I applied that patch and it helps a lot.
>> >
>> > http://lwn.net/Articles/341032/
>> >
>> > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.
>>
>> Great.
>> Can you try the attached patch (on top of 2.6.31)?
>> It implements the alternative approach we discussed privately in july,
>> and it addresses the possible latency increase that could happen with
>> your patch.
>>
>> To summarize for everyone, we separate sync sequential queues, sync
>> seeky queues and async queues in three separate RR strucutres, and
>> alternate servicing requests between them.
>>
>> When servicing seeky queues (the ones that are usually penalized by
>> cfq, for which no fairness is usually provided), we do not idle
>> between them, but we do idle for the last queue (the idle can be
>> exited when any seeky queue has requests). This allows us to allocate
>> disk time globally for all seeky processes, and to reduce seeky
>> processes latencies.
>>
>
> Ok, I seem to be doing same thing at group level (In group scheduling
> patches). I do not idle on individual sync seeky queues but if this is
> last queue in the group, then I do idle to make sure group does not loose
> its fair share and exit from idle the moment there is any busy queue in
> the group.
>
> So you seem to be grouping all the sync seeky queues system wide in a
> single group. So all the sync seeky queues collectively get 100ms in a
> single round of dispatch?

A round of dispatch (defined by tunable target_latency, default 300ms)
is subdivided between the three groups, proportionally to how many
queues are waiting in each, so if we have 1 sequential and 2 seeky
(and 0 async), we get 100ms for seq and 200ms for seeky.
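
To make that arithmetic concrete, here is a tiny userspace sketch of the
proportional split. It is only an illustration: the names are made up, and
the actual patch additionally applies busy_queues_avg smoothing, a minimum
sync slice of 2 * slice_idle and a scaled-down async slice.

/*
 * Sketch: split the target_latency "round" across the three service
 * trees in proportion to the number of busy queues in each.
 */
#include <stdio.h>

enum wl_type { SYNC_WL, SYNC_NOIDLE_WL, ASYNC_WL, WL_TYPES };

static unsigned int workload_slice(unsigned int target_latency,
                                   const unsigned int counts[WL_TYPES],
                                   enum wl_type type)
{
    unsigned int total = counts[SYNC_WL] + counts[SYNC_NOIDLE_WL] +
                         counts[ASYNC_WL];

    if (!total || !counts[type])
        return 0;
    return target_latency * counts[type] / total;
}

int main(void)
{
    /* the example above: 1 sequential reader, 2 seeky readers, 0 async */
    unsigned int counts[WL_TYPES] = {
        [SYNC_WL] = 1,
        [SYNC_NOIDLE_WL] = 2,
        [ASYNC_WL] = 0,
    };
    unsigned int target_latency = 300;  /* ms, the tunable's default */

    printf("sequential slice: %u ms\n",
           workload_slice(target_latency, counts, SYNC_WL));
    printf("seeky slice:      %u ms\n",
           workload_slice(target_latency, counts, SYNC_NOIDLE_WL));
    return 0;
}

Run as-is, this prints 100 ms for the sequential tree and 200 ms for the
seeky one, matching the example above.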

> I am wondering what happens if there are lot
> of such sync seeky queues this 100ms time slice is consumed before all the
> sync seeky queues got a chance to dispatch. Does that mean that some of
> the queues can completely skip the one dispatch round?
It can happen: if each seek costs 10ms and you have more than 30
seeky processes, then you are guaranteed that they cannot all issue
in the same round.
When this happens, the ones that did not issue before will be the
first ones to be issued in the next round.

Thanks,
Corrado

>
> Thanks
> Vivek
>
>> I tested with 'konsole -e exit', while doing a sequential write with
>> dd, and the start up time reduced from 37s to 7s, on an old laptop
>> disk.
>>
>> Thanks,
>> Corrado
>>
>> >
>> >> To rule out the first case, what happens if you run the test with your
>> >> "fairness for seeky processes" patch?
>> >
>> > I applied that patch and it helps a lot.
>> >
>> > http://lwn.net/Articles/341032/
>> >
>> > With above patchset applied, and fairness=1, firefox pops up in 27-28
>> > seconds.
>> >
>> > So it looks like if we don't disable idle window for seeky processes on
>> > hardware supporting command queuing, it helps in this particular case.
>> >
>> > Thanks
>> > Vivek
>> >
>
>
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

2009-09-29 07:14:45

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Mike,
On Mon, Sep 28, 2009 at 8:53 PM, Mike Galbraith <[email protected]> wrote:
> On Mon, 2009-09-28 at 14:18 -0400, Vivek Goyal wrote:
>> On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:
>
>> I guess changing class to IDLE should have helped a bit as now this is
>> equivalent to setting the quantum to 1 and after dispatching one request
>> to disk, CFQ will always expire the writer once. So it might happen that
>> by the the reader preempted writer, we have less number of requests in
>> disk and lesser latency for this reader.
>
> I expected SCHED_IDLE to be better than setting quantum to 1, because
> max is quantum*4 if you aren't IDLE.  But that's not what happened.  I
> just retested with all knobs set back to stock, fairness off, and
> quantum set to 1 with everything running nice 0.  2.8 seconds avg :-/

Idle doesn't work very well for async writes, since the writer process
will just send its writes to the page cache.
The real writeback will happen in the context of a kernel thread, which
runs in the best-effort scheduling class.

>
>> > I saw
>> > the reference to Vivek's patch, and gave it a shot.  Makes a large
>> > difference.
>> >                                                            Avg
>> > perf stat     12.82     7.19     8.49     5.76     9.32    8.7     anticipatory
>> >               16.24   175.82   154.38   228.97   147.16  144.5     noop
>> >               43.23    57.39    96.13   148.25   180.09  105.0     deadline
>> >                9.15    14.51     9.39    15.06     9.90   11.6     cfq fairness=0 dd=nice 0
>> >               12.22     9.85    12.55     9.88    15.06   11.9     cfq fairness=0 dd=nice 19
>> >                9.77    13.19    11.78    17.40     9.51   11.9     cfq fairness=0 dd=SCHED_IDLE
>> >                4.59     2.74     4.70     3.45     4.69    4.0     cfq fairness=1 dd=nice 0
>> >                3.79     4.66     2.66     5.15     3.03    3.8     cfq fairness=1 dd=nice 19
>> >                2.79     4.73     2.79     4.02     2.50    3.3     cfq fairness=1 dd=SCHED_IDLE
>> >
>>
>> Hmm.., looks like average latency went down only in  case of fairness=1
>> and not in case of fairness=0. (Looking at previous mail, average vanilla
>> cfq latencies were around 12 seconds).
>
> Yup.
>
>> Are you running all this in root group or have you put writers and readers
>> into separate cgroups?
>
> No cgroups here.
>
>> If everything is running in root group, then I am curious why latency went
>> down in case of fairness=1. The only thing fairness=1 parameter does is
>> that it lets complete all the requests from previous queue before start
>> dispatching from next queue. On top of this is valid only if no preemption
>> took place. In your test case, konsole should preempt the writer so
>> practically fairness=1 might not make much difference.
>
> fairness=1 very definitely makes a very large difference.  All of those
> cfq numbers were logged in back to back runs.
>
>> In fact now Jens has committed a patch which achieves the similar effect as
>> fairness=1 for async queues.
>
> Yeah, I was there yesterday.  I speculated that that would hurt my
> reader, but rearranging things didn't help one bit.  Playing with merge,
> I managed to give dd ~7% more throughput, and injured poor reader even
> more.  (problem analysis via hammer/axe not always most effective;)
>
>> commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
>> Author: Jens Axboe <[email protected]>
>> Date:   Fri Jul 3 12:57:48 2009 +0200
>>
>>     cfq-iosched: drain device queue before switching to a sync queue
>>
>>     To lessen the impact of async IO on sync IO, let the device drain of
>>     any async IO in progress when switching to a sync cfqq that has idling
>>     enabled.
>>
>>
>> If everything is in separate cgroups, then we should have seen latency
>> improvements in case of fairness=0 case also. I am little perplexed here..
>>
>> Thanks
>> Vivek
>
>

Thanks,
Corrado


--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

2009-09-29 09:56:52

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek and all,

Vivek Goyal <[email protected]> wrote:
> On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:

> > We are starting from a point where there is no cgroup based IO
> > scheduling in the kernel. And it is probably not reasonable to satisfy
> > all IO scheduling related requirements in one patch set. We can start
> > with something simple, and build on top of that. So a very simple
> > patch set that enables cgroup based proportional scheduling for CFQ
> > seems like the way to go at this point.
>
> Sure, we can start with CFQ only. But a bigger question we need to answer
> is that is CFQ the right place to solve the issue? Jens, do you think
> that CFQ is the right place to solve the problem?
>
> Andrew seems to favor a high level approach so that IO schedulers are less
> complex and we can provide fairness at high level logical devices also.

I'm not in favor of expanding CFQ, because some enterprise storage
performs better with NOOP than with CFQ, and I think bandwidth
control is needed even more for such storage systems. Is it easy to
support other IO schedulers, even if a new IO scheduler is introduced?
I would like to know a bit more about the specifics of Nauman's scheduler design.

> I will again try to summarize my understanding so far about the pros/cons
> of each approach and then we can take the discussion forward.

Good summary. Thanks for your work.

> Fairness in terms of size of IO or disk time used
> =================================================
> On a seeky media, fairness in terms of disk time can get us better results
> instead fairness interms of size of IO or number of IO.
>
> If we implement some kind of time based solution at higher layer, then
> that higher layer should know who used how much of time each group used. We
> can probably do some kind of timestamping in bio to get a sense when did it
> get into disk and when did it finish. But on a multi queue hardware there
> can be multiple requests in the disk either from same queue or from differnet
> queues and with pure timestamping based apparoch, so far I could not think
> how at high level we will get an idea who used how much of time.

IIUC, could the overlap time be calculated from the time-stamps even on
multi-queue hardware?

> So this is the first point of contention that how do we want to provide
> fairness. In terms of disk time used or in terms of size of IO/number of
> IO.
>
> Max bandwidth Controller or Proportional bandwidth controller
> =============================================================
> What is our primary requirement here? A weight based proportional
> bandwidth controller where we can use the resources optimally and any
> kind of throttling kicks in only if there is contention for the disk.
>
> Or we want max bandwidth control where a group is not allowed to use the
> disk even if disk is free.
>
> Or we need both? I would think that at some point of time we will need
> both but we can start with proportional bandwidth control first.

How about making the throttling policy user selectable, like the IO
scheduler, and putting it in the higher layer? Then we could support
all of the policies (time-based, size-based and rate limiting); see the
sketch below. There does not seem to be a single solution which
satisfies all users. But I agree with starting with proportional
bandwidth control first.
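
To illustrate the idea only: none of these names exist in dm-ioband or
the block layer, this is purely a hypothetical sketch of what a
pluggable policy could look like, i.e. a small table of callbacks that
decide how a completed IO is charged to its group.

#include <stdio.h>
#include <string.h>

struct io_sample {
    unsigned int bytes;
    unsigned int service_time_ms;
};

struct throttle_policy {
    const char *name;
    /* charge a group for one completed IO, in policy-specific units */
    unsigned long long (*charge)(const struct io_sample *io);
};

static unsigned long long charge_time(const struct io_sample *io)
{
    return io->service_time_ms;     /* disk-time based fairness */
}

static unsigned long long charge_size(const struct io_sample *io)
{
    return io->bytes;               /* size-of-IO based fairness */
}

static const struct throttle_policy policies[] = {
    { "time-based", charge_time },
    { "size-based", charge_size },
};

static const struct throttle_policy *find_policy(const char *name)
{
    unsigned int i;

    for (i = 0; i < sizeof(policies) / sizeof(policies[0]); i++)
        if (!strcmp(policies[i].name, name))
            return &policies[i];
    return NULL;
}

int main(void)
{
    struct io_sample io = { .bytes = 128 * 1024, .service_time_ms = 9 };
    const struct throttle_policy *p = find_policy("time-based");

    if (p)
        printf("%s policy charges %llu units\n", p->name, p->charge(&io));
    return 0;
}

Selecting the policy per device would then be analogous to switching the
IO scheduler.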

BTW, I will start to reimplement dm-ioband into block layer.

> Fairness for higher level logical devices
> =========================================
> Do we want good fairness numbers for higher level logical devices also
> or it is sufficient to provide fairness at leaf nodes. Providing fairness
> at leaf nodes can help us use the resources optimally and in the process
> we can get fairness at higher level also in many of the cases.

We should also take care of block devices which provide their own
make_request_fn() and do not use an IO scheduler. We can't use the leaf
node approach for such devices.

> But do we want strict fairness numbers on higher level logical devices
> even if it means sub-optimal usage of unerlying phsical devices?
>
> I think that for proportinal bandwidth control, it should be ok to provide
> fairness at higher level logical device but for max bandwidth control it
> might make more sense to provide fairness at higher level. Consider a
> case where from a striped device a customer wants to limit a group to
> 30MB/s and in case of leaf node control, if every leaf node provides
> 30MB/s, it might accumulate to much more than specified rate at logical
> device.
>
> Latency Control and strong isolation between groups
> ===================================================
> Do we want a good isolation between groups and better latencies and
> stronger isolation between groups?
>
> I think if problem is solved at IO scheduler level, we can achieve better
> latency control and hence stronger isolation between groups.
>
> Higher level solutions should find it hard to provide same kind of latency
> control and isolation between groups as IO scheduler based solution.

Why do you think it is hard for a higher level solution to provide this?
I think it is a matter of how the throttling policy is implemented.

> Fairness for buffered writes
> ============================
> Doing io control at any place below page cache has disadvantage that page
> cache might not dispatch more writes from higher weight group hence higher
> weight group might not see more IO done. Andrew says that we don't have
> a solution to this problem in kernel and he would like to see it handled
> properly.
>
> Only way to solve this seems to be to slow down the writers before they
> write into page cache. IO throttling patch handled it by slowing down
> writer if it crossed max specified rate. Other suggestions have come in
> the form of dirty_ratio per memory cgroup or a separate cgroup controller
> al-together where some kind of per group write limit can be specified.
>
> So if solution is implemented at IO scheduler layer or at device mapper
> layer, both shall have to rely on another controller to be co-mounted
> to handle buffered writes properly.
>
> Fairness with-in group
> ======================
> One of the issues with higher level controller is that how to do fair
> throttling so that fairness with-in group is not impacted. Especially
> the case of making sure that we don't break the notion of ioprio of the
> processes with-in group.

I ran your test script to confirm that the notion of ioprio was not
broken by dm-ioband. Here are the results of the test.
https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html

I think that the time period during which dm-ioband holds IO requests
for throttling would be too short to break the notion of ioprio.

> Especially io throttling patch was very bad in terms of prio with-in
> group where throttling treated everyone equally and difference between
> process prio disappeared.
>
> Reads Vs Writes
> ===============
> A higher level control most likely will change the ratio in which reads
> and writes are dispatched to disk with-in group. It used to be decided
> by IO scheduler so far but with higher level groups doing throttling and
> possibly buffering the bios and releasing them later, they will have to
> come up with their own policy on in what proportion reads and writes
> should be dispatched. In case of IO scheduler based control, all the
> queuing takes place at IO scheduler and it still retains control of
> in what ration reads and writes should be dispatched.

I don't think it is a concern. In the current implementation of
dm-ioband, sync and async IO requests are handled separately, and the
backlogged IOs are released in order of arrival if both sync and async
requests are backlogged.

> Summary
> =======
>
> - An io scheduler based io controller can provide better latencies,
> stronger isolation between groups, time based fairness and will not
> interfere with io schedulers policies like class, ioprio and
> reader vs writer issues.
>
> But it can gunrantee fairness at higher logical level devices.
> Especially in case of max bw control, leaf node control does not sound
> to be the most appropriate thing.
>
> - IO throttling provides max bw control in terms of absolute rate. It has
> the advantage that it can provide control at higher level logical device
> and also control buffered writes without need of additional controller
> co-mounted.
>
> But it does only max bw control and not proportion control so one might
> not be using resources optimally. It looses sense of task prio and class
> with-in group as any of the task can be throttled with-in group. Because
> throttling does not kick in till you hit the max bw limit, it should find
> it hard to provide same latencies as io scheduler based control.
>
> - dm-ioband also has the advantage that it can provide fairness at higher
> level logical devices.
>
> But, fairness is provided only in terms of size of IO or number of IO.
> No time based fairness. It is very throughput oriented and does not
> throttle high speed group if other group is running slow random reader.
> This results in bad latnecies for random reader group and weaker
> isolation between groups.

A new policy can be added to dm-ioband. Actually, the range-bw policy,
which provides min and max bandwidth control, does time-based
throttling. Moreover, there is room for improvement in the existing
policies. The write-starve-read issue you pointed out will be solved
soon.

> Also it does not provide fairness if a group is not continuously
> backlogged. So if one is running 1-2 dd/sequential readers in the group,
> one does not get fairness until workload is increased to a point where
> group becomes continuously backlogged. This also results in poor
> latencies and limited fairness.

This is intended to efficiently use bandwidth of underlying devices
when IO load is low.

> At this point of time it does not look like a single IO controller all
> the scenarios/requirements. This means few things to me.
>
> - Drop some of the requirements and go with one implementation which meets
> those reduced set of requirements.
>
> - Have more than one IO controller implementation in kenrel. One for lower
> level control for better latencies, stronger isolation and optimal resource
> usage and other one for fairness at higher level logical devices and max
> bandwidth control.
>
> And let user decide which one to use based on his/her needs.
>
> - Come up with more intelligent way of doing IO control where single
> controller covers all the cases.
>
> At this point of time, I am more inclined towards option 2 of having more
> than one implementation in kernel. :-) (Until and unless we can brainstrom
> and come up with ideas to make option 3 happen).
>
> > It would be great if we discuss our plans on the mailing list, so we
> > can get early feedback from everyone.
>
> This is what comes to my mind so far. Please add to the list if I have missed
> some points. Also correct me if I am wrong about the pros/cons of the
> approaches.
>
> Thoughts/ideas/opinions are welcome...
>
> Thanks
> Vivek

Thanks,
Ryo Tsuruta

2009-09-29 10:41:47

by Takuya Yoshikawa

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi,

Ryo Tsuruta wrote:
> Hi Vivek and all,
>
> Vivek Goyal <[email protected]> wrote:
>> On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
>
>>> We are starting from a point where there is no cgroup based IO
>>> scheduling in the kernel. And it is probably not reasonable to satisfy
>>> all IO scheduling related requirements in one patch set. We can start
>>> with something simple, and build on top of that. So a very simple
>>> patch set that enables cgroup based proportional scheduling for CFQ
>>> seems like the way to go at this point.
>> Sure, we can start with CFQ only. But a bigger question we need to answer
>> is that is CFQ the right place to solve the issue? Jens, do you think
>> that CFQ is the right place to solve the problem?
>>
>> Andrew seems to favor a high level approach so that IO schedulers are less
>> complex and we can provide fairness at high level logical devices also.
>
> I'm not in favor of expansion of CFQ, because some enterprise storages
> are better performed with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage system. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specific about Namuman's scheduler design.

Nauman said "cgroup based proportional scheduling for CFQ", and we would
not need to expand CFQ itself much; is that right, Nauman?

If so, we can reuse the io controller for new schedulers similar to CFQ.

I am not sure how important it is to consider which scheduler current
enterprise storage favors.
If we introduce an io controller, the io pattern seen by the disks will
change; in that case there is no guarantee that NOOP with an io
controller will work better than CFQ with an io controller.

Of course, an io controller for NOOP may be better.

Thanks,
Takuya Yoshikawa


2009-09-29 14:12:06

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and all,
>
> Vivek Goyal <[email protected]> wrote:
> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
>
> > > We are starting from a point where there is no cgroup based IO
> > > scheduling in the kernel. And it is probably not reasonable to satisfy
> > > all IO scheduling related requirements in one patch set. We can start
> > > with something simple, and build on top of that. So a very simple
> > > patch set that enables cgroup based proportional scheduling for CFQ
> > > seems like the way to go at this point.
> >
> > Sure, we can start with CFQ only. But a bigger question we need to answer
> > is that is CFQ the right place to solve the issue? Jens, do you think
> > that CFQ is the right place to solve the problem?
> >
> > Andrew seems to favor a high level approach so that IO schedulers are less
> > complex and we can provide fairness at high level logical devices also.
>
> I'm not in favor of expansion of CFQ, because some enterprise storages
> are better performed with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage system. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specific about Namuman's scheduler design.
>

The new design is essentially the old design, except that the
suggestion is to cover only CFQ in the first step instead of all 4 IO
schedulers, and to cover the others later.

So providing fairness for NOOP is not an issue. Even if we introduce new
IO schedulers down the line, I can't think of a reason why we could not
cover those too with the common layer.

> > I will again try to summarize my understanding so far about the pros/cons
> > of each approach and then we can take the discussion forward.
>
> Good summary. Thanks for your work.
>
> > Fairness in terms of size of IO or disk time used
> > =================================================
> > On a seeky media, fairness in terms of disk time can get us better results
> > instead fairness interms of size of IO or number of IO.
> >
> > If we implement some kind of time based solution at higher layer, then
> > that higher layer should know who used how much of time each group used. We
> > can probably do some kind of timestamping in bio to get a sense when did it
> > get into disk and when did it finish. But on a multi queue hardware there
> > can be multiple requests in the disk either from same queue or from differnet
> > queues and with pure timestamping based apparoch, so far I could not think
> > how at high level we will get an idea who used how much of time.
>
> IIUC, could the overlap time be calculated from time-stamp on a multi
> queue hardware?

So far I could not think of anything clean. Do you have something in mind?

I was thinking that the elevator layer will do the merging of bios. So
the IO scheduler/elevator can timestamp the first bio in the request as
it goes into the disk and timestamp it again with the finish time once
the request finishes.

This way the higher layer can get an idea of how much disk time a group
of bios used. But on multi-queue hardware, if we dispatch say 4 requests
from the same queue, time accounting becomes an issue.

Consider the following, where four requests rq1, rq2, rq3 and rq4 are
dispatched to disk at times t0, t1, t2 and t3 respectively, and these
requests finish at times t4, t5, t6 and t7. For the sake of simplicity,
assume the time elapsed between adjacent milestones is t. Also assume
that all these requests are from the same queue/group.

dispatch:  rq1@t0  rq2@t1  rq3@t2  rq4@t3
complete:  rq1@t4  rq2@t5  rq3@t6  rq4@t7

Now higher layer will think that time consumed by group is:

(t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t

But the time elapsed is only 7t.
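
A small userspace sketch of this accounting (the names and the unit t
are illustrative only) shows the over-counting directly:

#include <stdio.h>

struct rq_times {
    unsigned int dispatch;  /* in units of t */
    unsigned int complete;
};

int main(void)
{
    struct rq_times rqs[] = {
        { 0, 4 },   /* rq1: dispatched at t0, completes at t4 */
        { 1, 5 },   /* rq2 */
        { 2, 6 },   /* rq3 */
        { 3, 7 },   /* rq4 */
    };
    unsigned int charged = 0, start = rqs[0].dispatch, end = 0;
    unsigned int i;

    for (i = 0; i < sizeof(rqs) / sizeof(rqs[0]); i++) {
        /* per-request accounting: sum of (completion - dispatch) */
        charged += rqs[i].complete - rqs[i].dispatch;
        if (rqs[i].complete > end)
            end = rqs[i].complete;
    }

    printf("time charged to group: %ut\n", charged);     /* 16t */
    printf("wall-clock time used:  %ut\n", end - start);  /* 7t */
    return 0;
}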

Secondly, if a different group is running only a single sequential
reader, CFQ will be driving a queue depth of 1 there, so its accounted
time will not be inflated in the same way, and this inaccuracy in
accounting will lead to an unfair share between the groups.

So we need something better to get a sense of which group used how much
disk time.

>
> > So this is the first point of contention that how do we want to provide
> > fairness. In terms of disk time used or in terms of size of IO/number of
> > IO.
> >
> > Max bandwidth Controller or Proportional bandwidth controller
> > =============================================================
> > What is our primary requirement here? A weight based proportional
> > bandwidth controller where we can use the resources optimally and any
> > kind of throttling kicks in only if there is contention for the disk.
> >
> > Or we want max bandwidth control where a group is not allowed to use the
> > disk even if disk is free.
> >
> > Or we need both? I would think that at some point of time we will need
> > both but we can start with proportional bandwidth control first.
>
> How about making throttling policy be user selectable like the IO
> scheduler and putting it in the higher layer? So we could support
> all of policies (time-based, size-based and rate limiting). There
> seems not to only one solution which satisfies all users. But I agree
> with starting with proportional bandwidth control first.
>

What are the cases where a time-based policy does not work and a
size-based policy works better, so that a user would choose the
size-based policy and not the time-based one?

I am not against implementing things in a higher layer as long as we can
ensure tight control on latencies and strong isolation between groups,
and not break CFQ's class and ioprio model with-in a group.

> BTW, I will start to reimplement dm-ioband into block layer.

Can you elaborate a little bit on this?

>
> > Fairness for higher level logical devices
> > =========================================
> > Do we want good fairness numbers for higher level logical devices also
> > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > at leaf nodes can help us use the resources optimally and in the process
> > we can get fairness at higher level also in many of the cases.
>
> We should also take care of block devices which provide their own
> make_request_fn() and not use a IO scheduler. We can't use the leaf
> nodes approach to such devices.
>

I am not sure how big an issue this is. It can be easily solved by
having these devices use the NOOP scheduler. What are the reasons for
these devices not to use even noop?

> > But do we want strict fairness numbers on higher level logical devices
> > even if it means sub-optimal usage of unerlying phsical devices?
> >
> > I think that for proportinal bandwidth control, it should be ok to provide
> > fairness at higher level logical device but for max bandwidth control it
> > might make more sense to provide fairness at higher level. Consider a
> > case where from a striped device a customer wants to limit a group to
> > 30MB/s and in case of leaf node control, if every leaf node provides
> > 30MB/s, it might accumulate to much more than specified rate at logical
> > device.
> >
> > Latency Control and strong isolation between groups
> > ===================================================
> > Do we want a good isolation between groups and better latencies and
> > stronger isolation between groups?
> >
> > I think if problem is solved at IO scheduler level, we can achieve better
> > latency control and hence stronger isolation between groups.
> >
> > Higher level solutions should find it hard to provide same kind of latency
> > control and isolation between groups as IO scheduler based solution.
>
> Why do you think that the higher level solution is hard to provide it?
> I think that it is a matter of how to implement throttling policy.
>

So far, in both the dm-ioband and IO throttling solutions, I have seen
the higher layer implement some kind of leaky bucket/token bucket
algorithm, which inherently allows IO from all the competing groups
until they run out of tokens, and then these groups are made to wait
till fresh tokens are issued.

That means that, most of the time, the IO scheduler will see requests
from more than one group at the same time, and that will be the source
of weak isolation between groups.
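
As a rough illustration of that point (a toy model, not dm-ioband's or
io-throttle's actual code; the token counts are arbitrary), consider a
per-group token bucket:

#include <stdio.h>

struct group {
    const char *name;
    int tokens;     /* tokens left in the current period */
    int backlog;    /* queued requests */
};

int main(void)
{
    struct group groups[] = {
        { "G1 (16 random readers)", 8, 16 },
        { "G2 (1 random reader)",   8,  1 },
    };
    int work = 1;
    unsigned int i;

    /* within one refill period, every group with tokens is admitted */
    while (work) {
        work = 0;
        for (i = 0; i < sizeof(groups) / sizeof(groups[0]); i++) {
            struct group *g = &groups[i];

            if (g->tokens > 0 && g->backlog > 0) {
                g->tokens--;
                g->backlog--;
                work = 1;
                printf("admit request from %s\n", g->name);
            }
        }
    }
    /*
     * Both groups keep being admitted until their tokens run out, so
     * the IO scheduler underneath sees a mix of G1 and G2 requests
     * instead of one group at a time.
     */
    return 0;
}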

Consider the following simple example. Assume there are two groups; one
contains 16 random readers and the other contains 1 random reader.

G1: 16 random readers (16RR)
G2: 1 random reader (1RR)

Now it might happen that the IO scheduler sees requests from all 17
random readers at the same time. (Throttling will probably kick in
later, because you would like to give one group a nice slice of 100ms;
otherwise sequential readers will suffer a lot and the disk will become
seek bound.)

So CFQ will dispatch requests (at least one) from each of the 16 random
readers first and only then from the 1 random reader in group 2, and
this increases the max latency for the application in group 2 and
provides weak isolation.

There will also be additional issues with CFQ's preemption logic. CFQ
will have no knowledge of groups and it will do cross-group preemptions.
For example, if a meta data request comes in group1, it will preempt
whichever queue is being served in another group. So somebody doing
"find . *" or "cat <small files>" in one group will keep on preempting a
sequential reader in another group. Again, this will probably lead to
higher max latencies.

Note that even if CFQ does not enable idling on random readers and
expires the queue after a single dispatch, the seek time between queues
can be significant. Similarly, if instead of 16 random readers we had 16
random synchronous writers, we would have the same seek time issue, and
writers often dump bigger requests, which also adds to latency.
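
A back-of-the-envelope estimate of that seek cost (the 8 ms figure is
just an assumed average seek plus access time for a SATA disk, not a
measured number):

#include <stdio.h>

int main(void)
{
    unsigned int other_queues = 16; /* random readers in group 1 */
    unsigned int seek_ms = 8;       /* assumed per-queue seek cost */

    /* one dispatch per queue before group 2's reader gets its turn */
    printf("added latency for group 2's reader: ~%u ms\n",
           other_queues * seek_ms);
    return 0;
}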

This latency issue can be solved if we dispatch requests from only one
group for a certain period of time and then move to the next group
(which is what the common layer is doing).

If we go for only a single group dispatching requests, then we shall
have to implement some of the preemption semantics in the higher layer
as well, because in certain cases we want to do preemption across
groups, like an RT task group preempting a non-RT task group.
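
Roughly, the higher layer would then need a group-selection step of its
own, something along these lines (an illustration of the idea only, not
the common-layer code; all names are made up):

#include <stdio.h>
#include <stdbool.h>

enum grp_class { GRP_RT, GRP_BE };

struct group {
    const char *name;
    enum grp_class class;
    bool backlogged;
};

static struct group *pick_group(struct group *grps, unsigned int n,
                                unsigned int *rr_next)
{
    unsigned int i;

    /* a backlogged RT group preempts any BE group */
    for (i = 0; i < n; i++)
        if (grps[i].class == GRP_RT && grps[i].backlogged)
            return &grps[i];

    /* otherwise round-robin over backlogged BE groups */
    for (i = 0; i < n; i++) {
        struct group *g = &grps[(*rr_next + i) % n];

        if (g->backlogged) {
            *rr_next = (*rr_next + i + 1) % n;
            return g;
        }
    }
    return NULL;
}

int main(void)
{
    struct group grps[] = {
        { "be-group-1", GRP_BE, true },
        { "be-group-2", GRP_BE, true },
        { "rt-group",   GRP_RT, false },
    };
    unsigned int rr = 0;
    struct group *g;

    g = pick_group(grps, 3, &rr);
    printf("serve %s for one period\n", g ? g->name : "none");

    grps[2].backlogged = true;  /* RT work arrives */
    g = pick_group(grps, 3, &rr);
    printf("serve %s next (RT preempts BE)\n", g ? g->name : "none");
    return 0;
}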

Once we go deeper into implementation, I think we will find more issues.

> > Fairness for buffered writes
> > ============================
> > Doing io control at any place below page cache has disadvantage that page
> > cache might not dispatch more writes from higher weight group hence higher
> > weight group might not see more IO done. Andrew says that we don't have
> > a solution to this problem in kernel and he would like to see it handled
> > properly.
> >
> > Only way to solve this seems to be to slow down the writers before they
> > write into page cache. IO throttling patch handled it by slowing down
> > writer if it crossed max specified rate. Other suggestions have come in
> > the form of dirty_ratio per memory cgroup or a separate cgroup controller
> > al-together where some kind of per group write limit can be specified.
> >
> > So if solution is implemented at IO scheduler layer or at device mapper
> > layer, both shall have to rely on another controller to be co-mounted
> > to handle buffered writes properly.
> >
> > Fairness with-in group
> > ======================
> > One of the issues with higher level controller is that how to do fair
> > throttling so that fairness with-in group is not impacted. Especially
> > the case of making sure that we don't break the notion of ioprio of the
> > processes with-in group.
>
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here is the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
>
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.

Ok, I re-ran that test. Previously the default io_limit value was 192
and now I set it to 256 as you suggested. I still see the writer
starving the reader. I have removed "conv=fdatasync" from the writer so
that the writer does pure buffered writes.

With vanilla CFQ
----------------
reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s

with dm-ioband default io_limit=192
-----------------------------------
writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s
reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

with dm-ioband default io_limit=256
-----------------------------------
reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100
ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100

Notice that with vanilla CFQ the reader takes about 10 seconds to
finish, while with dm-ioband it takes more than 40 seconds. So the
writer is still starving the reader with both io_limit 192 and 256.

On top of that, can you please give some details on how increasing the
buffered queue length reduces the impact of writers?

IO Prio issue
--------------
I ran another test where two ioband devices of weight 100 each were
created on two partitions. In the first group 4 readers were launched.
Three readers are of class BE and prio 7, the fourth one is of class BE
and prio 0. In group2, I launched a buffered writer.

One would expect the prio 0 reader to get more bandwidth than the prio 7
readers, and the prio 7 readers to get more or less the same bandwidth
as each other. Looks like that is not happening. Look how vanilla CFQ
provides much more bandwidth to the prio 0 reader compared to the prio 7
readers, and how putting them in a group reduces the difference between
the prio 0 and prio 7 readers.

Following are the results.

Vanilla CFQ
===========
set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s
578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s
578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s

set2
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s
578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s
578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s
578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s

with dm-ioband
==============
ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s
578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s
578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s
578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s

set2
---
prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s
578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s
578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s
578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s
578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s
578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s
578867200 bytes (579 MB) copied, 139.986 s, 4.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s

Note: In vanilla CFQ, prio0 reader got more than 350% BW of prio 7 reader.
With dm-ioband this ratio changed to less than 200%.

I will run more tests, but this shows how the notion of priority
with-in a group changes if we implement throttling at a higher layer and
don't keep it with CFQ.

The second thing which strikes me is that I divided the disk 50/50
between readers and writers, and in that case I would expect protection
for the writers and expect the writers to finish fast. But the writers
have been slowed down a lot, and it also kills overall disk throughput.
I think it probably became seek bound.

I think the moment I get more time, I will run some timed fio tests and
look at how the disk performed overall and how bandwidth was distributed
with-in groups and between groups.

>
> > Especially io throttling patch was very bad in terms of prio with-in
> > group where throttling treated everyone equally and difference between
> > process prio disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control most likely will change the ratio in which reads
> > and writes are dispatched to disk with-in group. It used to be decided
> > by IO scheduler so far but with higher level groups doing throttling and
> > possibly buffering the bios and releasing them later, they will have to
> > come up with their own policy on in what proportion reads and writes
> > should be dispatched. In case of IO scheduler based control, all the
> > queuing takes place at IO scheduler and it still retains control of
> > in what ration reads and writes should be dispatched.
>
> I don't think it is a concern. The current implementation of dm-ioband
> is that sync/async IO requests are handled separately and the
> backlogged IOs are released according to the order of arrival if both
> sync and async requests are backlogged.

At least the version of dm-ioband I have is not producing the desired
results. See above.

Is there a newer version? I will run some tests on that too. But I think
you will again run into the same issue where you decide the read vs
write ratio with-in a group, and as I change the IO scheduler the
results will vary.

So at this point of time I can't see how you can solve the read vs write
ratio issue at a higher layer without changing the behavior of the
underlying IO scheduler.

>
> > Summary
> > =======
> >
> > - An io scheduler based io controller can provide better latencies,
> > stronger isolation between groups, time based fairness and will not
> > interfere with io schedulers policies like class, ioprio and
> > reader vs writer issues.
> >
> > But it can gunrantee fairness at higher logical level devices.
> > Especially in case of max bw control, leaf node control does not sound
> > to be the most appropriate thing.
> >
> > - IO throttling provides max bw control in terms of absolute rate. It has
> > the advantage that it can provide control at higher level logical device
> > and also control buffered writes without need of additional controller
> > co-mounted.
> >
> > But it does only max bw control and not proportion control so one might
> > not be using resources optimally. It looses sense of task prio and class
> > with-in group as any of the task can be throttled with-in group. Because
> > throttling does not kick in till you hit the max bw limit, it should find
> > it hard to provide same latencies as io scheduler based control.
> >
> > - dm-ioband also has the advantage that it can provide fairness at higher
> > level logical devices.
> >
> > But, fairness is provided only in terms of size of IO or number of IO.
> > No time based fairness. It is very throughput oriented and does not
> > throttle high speed group if other group is running slow random reader.
> > This results in bad latnecies for random reader group and weaker
> > isolation between groups.
>
> A new policy can be added to dm-ioband. Actually, range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling. Moreover there is room for improvement for existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
>
> > Also it does not provide fairness if a group is not continuously
> > backlogged. So if one is running 1-2 dd/sequential readers in the group,
> > one does not get fairness until workload is increased to a point where
> > group becomes continuously backlogged. This also results in poor
> > latencies and limited fairness.
>
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.

But this has the following undesired results.

- A slow-moving group does not get reduced latencies. For example, random readers
in a slow-moving group get no isolation and will continue to see higher max
latencies.

- A single sequential reader in one group does not get its fair share, and
we might be pushing buffered writes in the other group thinking that we
are getting better throughput. But the fact is that we are eating away
the reader's share in group1 and giving it to the writers in group2. Also I
showed that we did not necessarily improve the overall throughput of
the system by doing so (because it increases the number of seeks).

I had sent you a mail to show that.

http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html

But you changed the test case to run 4 readers in a single group to show that
its throughput does not decrease. Please don't change test cases. With 4
sequential readers in the group, the group is continuously backlogged and you
don't steal bandwidth from the slow-moving group. So in that mail I was not
even discussing the scenario where you don't steal bandwidth from the
other group.

I specifically created one slow-moving group with one reader so that we end up
stealing bandwidth from the slow-moving group, to show that we did not achieve
higher overall throughput by stealing the BW; at the same time we did not get
fairness for the single reader and observed its throughput decreasing as the
number of writers in the other group increased.

Thanks
Vivek

>
> > At this point of time it does not look like a single IO controller all
> > the scenarios/requirements. This means few things to me.
> >
> > - Drop some of the requirements and go with one implementation which meets
> > those reduced set of requirements.
> >
> > - Have more than one IO controller implementation in kenrel. One for lower
> > level control for better latencies, stronger isolation and optimal resource
> > usage and other one for fairness at higher level logical devices and max
> > bandwidth control.
> >
> > And let user decide which one to use based on his/her needs.
> >
> > - Come up with more intelligent way of doing IO control where single
> > controller covers all the cases.
> >
> > At this point of time, I am more inclined towards option 2 of having more
> > than one implementation in kernel. :-) (Until and unless we can brainstrom
> > and come up with ideas to make option 3 happen).
> >
> > > It would be great if we discuss our plans on the mailing list, so we
> > > can get early feedback from everyone.
> >
> > This is what comes to my mind so far. Please add to the list if I have missed
> > some points. Also correct me if I am wrong about the pros/cons of the
> > approaches.
> >
> > Thoughts/ideas/opinions are welcome...
> >
> > Thanks
> > Vivek
>
> Thanks,
> Ryo Tsuruta

2009-09-29 19:53:18

by Nauman Rafique

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

We have been going around in circles for many months now on this
issue of the IO controller. I thought that we were getting closer to a
point where we agree on one approach and go with it, but apparently we
are not. I think it would be useful at this point to learn from how
similar functionality was introduced for other resources like cpu
scheduling and memory controllers.

We are starting from a point where there is no cgroup based resource
allocation for disks and there is a lot to be done. CFS has been doing
hierarchical proportional allocation for CPU scheduling for a while
now. Only recently someone has sent out patches for enforcing upper
limits. And it makes a lot of sense (more discussion on this later).
Also Fernando tells me that memory controller did not support
hierarchies in the first attempt. What I don't understand is, if we
are starting from scratch, why do we want to solve all the problems of
IO scheduling in one attempt?

Max bandwidth Controller or Proportional bandwidth controller
===============================================

Enforcing limits is applicable in the scenario where you are managing
a bunch of services in a data center and you want to either charge
them for what they use or you want a very predictable performance over
time. If we just do proportional allocation, then the actual
performance received by a user depends on other co-scheduled tasks. If
other tasks are not using the resource, you end up using their share.
But if all the other co-users become active, the 'extra' resource that
you had would be taken away. Thus without enforcing some upper limit,
predictability gets hurt. But this becomes an issue only if we are
sharing resources. The most important precondition to sharing
resources is 'the requirement to provide isolation'. And isolation
includes controlling both bandwidth AND latency, in the presence of
other sharers. As Vivek has rightly pointed out, a ticket allocation
based algorithm is good for enforcing upper limits, but it is NOT good
for providing isolation i.e. latency control and even bandwidth in
some cases (as Vivek has shown with results in the last few emails).
Moreover, a solution that is implemented in higher layers (be it VFS
or DM) has little control over what happens in IO scheduler, again
hurting the isolation goal.

In the absence of isolation, we cannot even start sharing a resource.
Predictability and billing are secondary concerns that arise only
if we are sharing resources. If there is somebody who does not care
about isolation but wants to do their billing correctly, I would like
to know about it. Needless to say, max bandwidth limits can also
be enforced at the IO scheduling layer.
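
Just to illustrate how the two controls compose, here is a toy calculation of a
weight-based share with an optional absolute cap on top. It is purely
illustrative and assumes no kernel interfaces; the group names, weights and
caps are made up.

#include <stdio.h>

/*
 * Illustrative only: a weight-based proportional share with an
 * optional absolute cap layered on top.
 */
struct group {
	const char *name;
	unsigned int weight;	/* proportional share weight */
	unsigned long max_bps;	/* 0 means "no upper limit" */
};

static unsigned long group_share(const struct group *g, unsigned int total_weight,
				 unsigned long disk_bps)
{
	unsigned long share = disk_bps * g->weight / total_weight;

	/* The upper limit (billing/predictability) only trims the share. */
	if (g->max_bps && share > g->max_bps)
		share = g->max_bps;
	return share;
}

int main(void)
{
	struct group groups[] = {
		{ "paying-customer", 200, 30UL << 20 },	/* capped at 30 MB/s */
		{ "batch",           100, 0 },
	};
	unsigned long disk_bps = 100UL << 20;		/* assume a 100 MB/s device */
	unsigned int total = groups[0].weight + groups[1].weight;

	for (int i = 0; i < 2; i++)
		printf("%s: %lu MB/s\n", groups[i].name,
		       group_share(&groups[i], total, disk_bps) >> 20);
	return 0;
}

The point is that the cap only ever trims what the proportional share would
have handed out; it never creates isolation by itself.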

Common layer vs CFQ
==================

Takuya has raised an interesting point here. If somebody wishes to use
noop, using a common layer IO controller on top of noop isn't
necessarily going to give them the same thing. In fact, with IO
controller, noop might behave much like CFQ.

Moreover at one point, if we decide that we absolutely need IO
controller to work for other schedulers too, we have this Vivek's
patch set as a proof-of-concept. For now, as Jens very rightly pointed
out in our discussion, we can have a "simple scheduler: Noop" and an
"intelligent scheduler: CFQ with cgroup based scheduling".

Class based scheduling
===================

CFQ has this notion of classes that needs to be supported in any
solution that we come up with, otherwise we break the semantics of the
existing scheduler. We have workloads which have strong latency
requirements. We have two options: either don't do resource sharing
for them OR share the resource but put them in a higher class (RT) so
that their latencies are not (or only minimally) affected by other
workloads running with them.

A solution in higher layer can try to support those semantics, but
what if somebody wants to use a Noop scheduler and does not care about
those semantics? We will end up with multiple schedulers in the upper
layers, and who knows where all this will stop.

Controlling writeback
================

It seems like the writeback path has problems, but we should not try to
solve those problems with the same patch set that is trying to do
basic cgroup based IO scheduling. Jens' patches for per-bdi pdflush are
already in. They should solve the problem of pdflush not sending down
enough IOs; at least Jens' results seem to show that. IMHO, the next
step is to use the memory controller in conjunction with the IO controller,
and per-group, per-bdi pdflush threads (only if a group is doing IO
on that bdi), something similar to the io_group that we have in Vivek's
patches. That should solve multiple problems. First, it would allow us
to obviate the need for any tracking of dirty pages. Second, we can
build feedback from the IO scheduling layer to the upper layers. If the
number of pending writes in the IO controller for a given group exceeds a
limit, we block the submitting thread (pdflush), similar to the current
congestion implementation. Then the group would start hitting its dirty
limits at some point (we would need per-group dirty limits, as has
already been pointed out by others), thus blocking the tasks that are
dirtying the pages. Thus, using a block layer IO controller, we can
achieve an effect similar to that achieved by Righi's proposal.
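
To make that feedback loop concrete, the kind of back-pressure I have in mind
looks roughly like the pseudo code below. None of these structures or
functions exist today; they only sketch the flow described above.

/*
 * Schematic sketch only: per-group, per-bdi back-pressure on the
 * writeback thread.  io_group_bdi and both functions are hypothetical,
 * not from any posted patch.
 */
struct io_group_bdi {
	atomic_t nr_pending_writes;		/* async writes queued in the IO controller */
	unsigned int max_pending_writes;	/* per-group limit */
	wait_queue_head_t congestion_wait;	/* submitters sleep here */
};

/* Called by the group's writeback thread before submitting a WRITE bio. */
static void iogrp_throttle_writeback(struct io_group_bdi *grp)
{
	/* Block the submitter, like the existing congestion logic, until
	 * the IO scheduler has drained enough of this group's backlog. */
	wait_event(grp->congestion_wait,
		   atomic_read(&grp->nr_pending_writes) < grp->max_pending_writes);
	atomic_inc(&grp->nr_pending_writes);
}

/* Called from the IO scheduler's completion path for this group. */
static void iogrp_write_completed(struct io_group_bdi *grp)
{
	if (atomic_dec_return(&grp->nr_pending_writes) < grp->max_pending_writes)
		wake_up(&grp->congestion_wait);
}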

Vivek has summarized most of the other arguments very well. In short,
what I am trying to say is: let's start with something very simple that
satisfies some of the most important requirements, and we can build
upon that.

On Tue, Sep 29, 2009 at 7:10 AM, Vivek Goyal <[email protected]> wrote:
> On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
>> Hi Vivek and all,
>>
>> Vivek Goyal <[email protected]> wrote:
>> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
>>
>> > > We are starting from a point where there is no cgroup based IO
>> > > scheduling in the kernel. And it is probably not reasonable to satisfy
>> > > all IO scheduling related requirements in one patch set. We can start
>> > > with something simple, and build on top of that. So a very simple
>> > > patch set that enables cgroup based proportional scheduling for CFQ
>> > > seems like the way to go at this point.
>> >
>> > Sure, we can start with CFQ only. But a bigger question we need to answer
>> > is that is CFQ the right place to solve the issue? Jens, do you think
>> > that CFQ is the right place to solve the problem?
>> >
>> > Andrew seems to favor a high level approach so that IO schedulers are less
>> > complex and we can provide fairness at high level logical devices also.
>>
>> I'm not in favor of expansion of CFQ, because some enterprise storages
>> are better performed with NOOP rather than CFQ, and I think bandwidth
>> control is needed much more for such storage system. Is it easy to
>> support other IO schedulers even if a new IO scheduler is introduced?
>> I would like to know a bit more specific about Namuman's scheduler design.
>>
>
> The new design is essentially the old design. Except the fact that
> suggestion is that in the first step instead of covering all the 4 IO
> schedulers, first cover only CFQ and then later others.
>
> So providing fairness for NOOP is not an issue. Even if we introduce new
> IO schedulers down the line, I can't think of a reason why can't we cover
> that too with common layer.
>
>> > I will again try to summarize my understanding so far about the pros/cons
>> > of each approach and then we can take the discussion forward.
>>
>> Good summary. Thanks for your work.
>>
>> > Fairness in terms of size of IO or disk time used
>> > =================================================
>> > On a seeky media, fairness in terms of disk time can get us better results
>> > instead fairness interms of size of IO or number of IO.
>> >
>> > If we implement some kind of time based solution at higher layer, then
>> > that higher layer should know who used how much of time each group used. We
>> > can probably do some kind of timestamping in bio to get a sense when did it
>> > get into disk and when did it finish. But on a multi queue hardware there
>> > can be multiple requests in the disk either from same queue or from differnet
>> > queues and with pure timestamping based apparoch, so far I could not think
>> > how at high level we will get an idea who used how much of time.
>>
>> IIUC, could the overlap time be calculated from time-stamp on a multi
>> queue hardware?
>
> So far could not think of anything clean. Do you have something in mind.
>
> I was thinking that elevator layer will do the merge of bios. So IO
> scheduler/elevator can time stamp the first bio in the request as it goes
> into the disk and again timestamp with finish time once request finishes.
>
> This way higher layer can get an idea how much disk time a group of bios
> used. But on multi queue, if we dispatch say 4 requests from same queue,
> then time accounting becomes an issue.
>
> Consider following where four requests rq1, rq2, rq3 and rq4 are
> dispatched to disk at time t0, t1, t2 and t3 respectively and these
> requests finish at time t4, t5, t6 and t7. For sake of simlicity assume
> time elapsed between each of milestones is t. Also assume that all these
> requests are from same queue/group.
>
> t0   t1   t2   t3   t4   t5   t6   t7
> rq1  rq2  rq3  rq4  rq1  rq2  rq3  rq4
>
> Now higher layer will think that time consumed by group is:
>
> (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
>
> But the time elapsed is only 7t.
>
> Secondly if a different group is running only single sequential reader,
> there CFQ will be driving queue depth of 1 and time will not be running
> faster and this inaccuracy in accounting will lead to unfair share between
> groups.
>
> So we need something better to get a sense which group used how much of
> disk time.
>
>>
>> > So this is the first point of contention that how do we want to provide
>> > fairness. In terms of disk time used or in terms of size of IO/number of
>> > IO.
>> >
>> > Max bandwidth Controller or Proportional bandwidth controller
>> > =============================================================
>> > What is our primary requirement here? A weight based proportional
>> > bandwidth controller where we can use the resources optimally and any
>> > kind of throttling kicks in only if there is contention for the disk.
>> >
>> > Or we want max bandwidth control where a group is not allowed to use the
>> > disk even if disk is free.
>> >
>> > Or we need both? I would think that at some point of time we will need
>> > both but we can start with proportional bandwidth control first.
>>
>> How about making throttling policy be user selectable like the IO
>> scheduler and putting it in the higher layer? So we could support
>> all of policies (time-based, size-based and rate limiting). There
>> seems not to only one solution which satisfies all users. But I agree
>> with starting with proportional bandwidth control first.
>>
>
> What are the cases where time based policy does not work and size based
> policy works better and user would choose size based policy and not timed
> based one?
>
> I am not against implementing things in higher layer as long as we can
> ensure tight control on latencies, strong isolation between groups and
> not break CFQ's class and ioprio model with-in group.
>
>> BTW, I will start to reimplement dm-ioband into block layer.
>
> Can you elaborate little bit on this?
>
>>
>> > Fairness for higher level logical devices
>> > =========================================
>> > Do we want good fairness numbers for higher level logical devices also
>> > or it is sufficient to provide fairness at leaf nodes. Providing fairness
>> > at leaf nodes can help us use the resources optimally and in the process
>> > we can get fairness at higher level also in many of the cases.
>>
>> We should also take care of block devices which provide their own
>> make_request_fn() and not use a IO scheduler. We can't use the leaf
>> nodes approach to such devices.
>>
>
> I am not sure how big an issue is this. This can be easily solved by
> making use of NOOP scheduler by these devices. What are the reasons for
> these devices to not use even noop?
>
>> > But do we want strict fairness numbers on higher level logical devices
>> > even if it means sub-optimal usage of unerlying phsical devices?
>> >
>> > I think that for proportinal bandwidth control, it should be ok to provide
>> > fairness at higher level logical device but for max bandwidth control it
>> > might make more sense to provide fairness at higher level. Consider a
>> > case where from a striped device a customer wants to limit a group to
>> > 30MB/s and in case of leaf node control, if every leaf node provides
>> > 30MB/s, it might accumulate to much more than specified rate at logical
>> > device.
>> >
>> > Latency Control and strong isolation between groups
>> > ===================================================
>> > Do we want a good isolation between groups and better latencies and
>> > stronger isolation between groups?
>> >
>> > I think if problem is solved at IO scheduler level, we can achieve better
>> > latency control and hence stronger isolation between groups.
>> >
>> > Higher level solutions should find it hard to provide same kind of latency
>> > control and isolation between groups as IO scheduler based solution.
>>
>> Why do you think that the higher level solution is hard to provide it?
>> I think that it is a matter of how to implement throttling policy.
>>
>
> So far both in dm-ioband and IO throttling solution I have seen that
> higher layer implements some of kind leaky bucket/token bucket algorithm,
> which inherently allows IO from all the competing groups until they run
> out of tokens and then these groups are made to wait till fresh tokens are
> issued.
>
> That means, most of the times, IO scheduler will see requests from more
> than one group at the same time and that will be the source of weak
> isolation between groups.
>
> Consider following simple examples. Assume there are two groups and one
> contains 16 random readers and other contains 1 random reader.
>
>       G1      G2
>      16RR    1RR
>
> Now it might happen that IO scheduler sees requests from all the 17 RR
> readers at the same time. (Throttling probably will kick in later because
> you would like to give one group a nice slice of 100ms otherwise
> sequential readers will suffer a lot and disk will become seek bound).
>
> So CFQ will dispatch requests (at least one), from each of the 16 random
> readers first and then from 1 random reader in group 2 and this increases
> the max latency for the application in group 2 and provides weak
> isolation.
>
> There will also be additional issues with CFQ preemtpion logic. CFQ will
> have no knowledge of groups and it will do cross group preemtptions. For
> example if a meta data request comes in group1, it will preempt any of
> the queue being served in other groups. So somebody doing "find . *" or
> "cat <small files>" in one group will keep on preempting a sequential
> reader in other group. Again this will probably lead to higher max
> latencies.
>
> Note, even if CFQ does not enable idling on random readers, and expires
> queue after single dispatch, seeking time between queues can be
> significant. Similarly, if instead of 16 random reders we had 16 random
> synchronous writers we will have seek time issue as well as writers can
> often dump bigger requests which also adds to latency.
>
> This latency issue can be solved if we dispatch requests only from one
> group for a certain time of time and then move to next group. (Something
> what common layer is doing).
>
> If we go for only single group dispatching requests, then we shall have
> to implemnt some of the preemption semantics also in higher layer because
> in certain cases we want to do preemption across the groups. Like RT task
> group preemting non-RT task group etc.
>
> Once we go deeper into implementation, I think we will find more issues.
>
>> > Fairness for buffered writes
>> > ============================
>> > Doing io control at any place below page cache has disadvantage that page
>> > cache might not dispatch more writes from higher weight group hence higher
>> > weight group might not see more IO done. Andrew says that we don't have
>> > a solution to this problem in kernel and he would like to see it handled
>> > properly.
>> >
>> > Only way to solve this seems to be to slow down the writers before they
>> > write into page cache. IO throttling patch handled it by slowing down
>> > writer if it crossed max specified rate. Other suggestions have come in
>> > the form of dirty_ratio per memory cgroup or a separate cgroup controller
>> > al-together where some kind of per group write limit can be specified.
>> >
>> > So if solution is implemented at IO scheduler layer or at device mapper
>> > layer, both shall have to rely on another controller to be co-mounted
>> > to handle buffered writes properly.
>> >
>> > Fairness with-in group
>> > ======================
>> > One of the issues with higher level controller is that how to do fair
>> > throttling so that fairness with-in group is not impacted. Especially
>> > the case of making sure that we don't break the notion of ioprio of the
>> > processes with-in group.
>>
>> I ran your test script to confirm that the notion of ioprio was not
>> broken by dm-ioband. Here is the results of the test.
>> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
>>
>> I think that the time period during which dm-ioband holds IO requests
>> for throttling would be too short to break the notion of ioprio.
>
> Ok, I re-ran that test. Previously default io_limit value was 192 and now
> I set it up to 256 as you suggested. I still see writer starving reader. I
> have removed "conv=fdatasync" from writer so that a writer is pure buffered
> writes.
>
> With vanilla CFQ
> ----------------
> reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s
>
> with dm-ioband default io_limit=192
> -----------------------------------
> writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s
> reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s
>
> ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
> ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100
>
> with dm-ioband default io_limit=256
> -----------------------------------
> reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s
>
> ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100
> ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100
>
> Notice that with vanilla CFQ, reader is taking 10 seconds to finish and
> with dm-ioband it takes more than 40 seconds to finish. So writer is still
> starving the reader with both io_limit 192 and 256.
>
> On top of that can you please give some details how increasing the
> buffered queue length reduces the impact of writers?
>
> IO Prio issue
> --------------
> I ran another test where two ioband devices were created of weight 100
> each on two partitions. In first group 4 readers were launched. Three
> readers are of class BE and prio 7, fourth one is of class BE prio 0. In
> group2, I launched a buffered writer.
>
> One would expect that prio0 reader gets more bandwidth as compared to
> prio 4 readers and prio 7 readers will get more or less same bw. Looks like
> that is not happening. Look how vanilla CFQ provides much more bandwidth
> to prio0 reader as compared to prio7 reader and how putting them in the
> group reduces the difference betweej prio0 and prio7 readers.
>
> Following are the results.
>
> Vanilla CFQ
> ===========
> set1
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s
> 578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s
> 578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s
>
> set2
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s
> 578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s
> 578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s
>
> set3
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s
> 578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s
>
> with dm-ioband
> ==============
> ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
> ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100
>
> set1
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s
> 578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s
> 578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s
> 578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s
>
> set2
> ---
> prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s
> 578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s
> 578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s
> 578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s
>
> set3
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s
> 578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s
> 578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s
> 578867200 bytes (579 MB) copied, 139.986 s, 4.1 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s
>
> Note: In vanilla CFQ, prio0 reader got more than 350% BW of prio 7 reader.
> With dm-ioband this ratio changed to less than 200%.
>
> I will run more tests, but this show how notion of priority with-in a
> group changes if we implement throttling at higher layer and don't
> keep it with CFQ.
>
> The second thing which strikes me is that I divided the disk 50% each
> between readers and writers and in that case would expect protection
> for writers and expect writers to finish fast. But writers have been
> slowed down like and it also kills overall disk throughput. I think
> it probably became seek bound.
>
> I think the moment I get more time, I will run some timed fio tests
> and look at how overall disk performed and how bandwidth was
> distributed with-in group and between groups.
>
>>
>> > Especially io throttling patch was very bad in terms of prio with-in
>> > group where throttling treated everyone equally and difference between
>> > process prio disappeared.
>> >
>> > Reads Vs Writes
>> > ===============
>> > A higher level control most likely will change the ratio in which reads
>> > and writes are dispatched to disk with-in group. It used to be decided
>> > by IO scheduler so far but with higher level groups doing throttling and
>> > possibly buffering the bios and releasing them later, they will have to
>> > come up with their own policy on in what proportion reads and writes
>> > should be dispatched. In case of IO scheduler based control, all the
>> > queuing takes place at IO scheduler and it still retains control of
>> > in what ration reads and writes should be dispatched.
>>
>> I don't think it is a concern. The current implementation of dm-ioband
>> is that sync/async IO requests are handled separately and the
>> backlogged IOs are released according to the order of arrival if both
>> sync and async requests are backlogged.
>
> At least the version of dm-ioband I have is not producing the desired
> results. See above.
>
> Is there a newer version? I will run some tests on that too. But I think
> you will again run into same issue where you will decide the ratio of
> read vs write with-in group and as I change the IO schedulers results
> will vary.
>
> So at this point of time I can't think how can you solve read vs write
> ratio issue at higher layer without changing the behavior or underlying
> IO scheduler.
>
>>
>> > Summary
>> > =======
>> >
>> > - An io scheduler based io controller can provide better latencies,
>> > stronger isolation between groups, time based fairness and will not
>> > interfere with io schedulers policies like class, ioprio and
>> > reader vs writer issues.
>> >
>> > But it can gunrantee fairness at higher logical level devices.
>> > Especially in case of max bw control, leaf node control does not sound
>> > to be the most appropriate thing.
>> >
>> > - IO throttling provides max bw control in terms of absolute rate. It has
>> > the advantage that it can provide control at higher level logical device
>> > and also control buffered writes without need of additional controller
>> > co-mounted.
>> >
>> > But it does only max bw control and not proportion control so one might
>> > not be using resources optimally. It looses sense of task prio and class
>> > with-in group as any of the task can be throttled with-in group. Because
>> > throttling does not kick in till you hit the max bw limit, it should find
>> > it hard to provide same latencies as io scheduler based control.
>> >
>> > - dm-ioband also has the advantage that it can provide fairness at higher
>> > level logical devices.
>> >
>> > But, fairness is provided only in terms of size of IO or number of IO.
>> > No time based fairness. It is very throughput oriented and does not
>> > throttle high speed group if other group is running slow random reader.
>> > This results in bad latnecies for random reader group and weaker
>> > isolation between groups.
>>
>> A new policy can be added to dm-ioband. Actually, range-bw policy,
>> which provides min and max bandwidth control, does time-based
>> throttling. Moreover there is room for improvement for existing
>> policies. The write-starve-read issue you pointed out will be solved
>> soon.
>>
>> > Also it does not provide fairness if a group is not continuously
>> > backlogged. So if one is running 1-2 dd/sequential readers in the group,
>> > one does not get fairness until workload is increased to a point where
>> > group becomes continuously backlogged. This also results in poor
>> > latencies and limited fairness.
>>
>> This is intended to efficiently use bandwidth of underlying devices
>> when IO load is low.
>
> But this has following undesired results.
>
> - Slow moving group does not get reduced latencies. For example, random readers
> in slow moving group get no isolation and will continue to see higher max
> latencies.
>
> - A single sequential reader in one group does not get fair share and
> we might be pushing buffered writes in other group thinking that we
> are getting better throughput. But the fact is that we are eating away
> readers share in group1 and giving it to writers in group2. Also I
> showed that we did not necessarily improve the overall throughput of
> the system by doing so. (Because it increases the number of seeks).
>
> I had sent you a mail to show that.
>
> http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html
>
> But you changed the test case to run 4 readers in a single group to show that
> it throughput does not decrease. Please don't change test cases. In case of 4
> sequential readers in the group, group is continuously backlogged and you
> don't steal bandwidth from slow moving group. So in that mail I was not
> even discussing the scenario when you don't steal the bandwidth from
> other group.
>
> I specially created one slow moving group with one reader so that we end up
> stealing bandwidth from slow moving group and show that we did not achive
> higher overall throughput by stealing the BW at the same time we did not get
> fairness for single reader and observed decreasing throughput for single
> reader as number of writers in other group increased.
>
> Thanks
> Vivek
>
>>
>> > At this point of time it does not look like a single IO controller all
>> > the scenarios/requirements. This means few things to me.
>> >
>> > - Drop some of the requirements and go with one implementation which meets
>> > those reduced set of requirements.
>> >
>> > - Have more than one IO controller implementation in kenrel. One for lower
>> > level control for better latencies, stronger isolation and optimal resource
>> > usage and other one for fairness at higher level logical devices and max
>> > bandwidth control.
>> >
>> > And let user decide which one to use based on his/her needs.
>> >
>> > - Come up with more intelligent way of doing IO control where single
>> > controller covers all the cases.
>> >
>> > At this point of time, I am more inclined towards option 2 of having more
>> > than one implementation in kernel. :-) (Until and unless we can brainstrom
>> > and come up with ideas to make option 3 happen).
>> >
>> > > It would be great if we discuss our plans on the mailing list, so we
>> > > can get early feedback from everyone.
>> >
>> > This is what comes to my mind so far. Please add to the list if I have missed
>> > some points. Also correct me if I am wrong about the pros/cons of the
>> > approaches.
>> >
>> > Thoughts/ideas/opinions are welcome...
>> >
>> > Thanks
>> > Vivek
>>
>> Thanks,
>> Ryo Tsuruta
>

2009-09-30 03:12:32

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and all,
>
> Vivek Goyal <[email protected]> wrote:
> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
>
> > > We are starting from a point where there is no cgroup based IO
> > > scheduling in the kernel. And it is probably not reasonable to satisfy
> > > all IO scheduling related requirements in one patch set. We can start
> > > with something simple, and build on top of that. So a very simple
> > > patch set that enables cgroup based proportional scheduling for CFQ
> > > seems like the way to go at this point.
> >
> > Sure, we can start with CFQ only. But a bigger question we need to answer
> > is that is CFQ the right place to solve the issue? Jens, do you think
> > that CFQ is the right place to solve the problem?
> >
> > Andrew seems to favor a high level approach so that IO schedulers are less
> > complex and we can provide fairness at high level logical devices also.
>
> I'm not in favor of expansion of CFQ, because some enterprise storages
> are better performed with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage system. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specific about Namuman's scheduler design.
>
> > I will again try to summarize my understanding so far about the pros/cons
> > of each approach and then we can take the discussion forward.
>
> Good summary. Thanks for your work.
>
> > Fairness in terms of size of IO or disk time used
> > =================================================
> > On a seeky media, fairness in terms of disk time can get us better results
> > instead fairness interms of size of IO or number of IO.
> >
> > If we implement some kind of time based solution at higher layer, then
> > that higher layer should know who used how much of time each group used. We
> > can probably do some kind of timestamping in bio to get a sense when did it
> > get into disk and when did it finish. But on a multi queue hardware there
> > can be multiple requests in the disk either from same queue or from differnet
> > queues and with pure timestamping based apparoch, so far I could not think
> > how at high level we will get an idea who used how much of time.
>
> IIUC, could the overlap time be calculated from time-stamp on a multi
> queue hardware?
>
> > So this is the first point of contention that how do we want to provide
> > fairness. In terms of disk time used or in terms of size of IO/number of
> > IO.
> >
> > Max bandwidth Controller or Proportional bandwidth controller
> > =============================================================
> > What is our primary requirement here? A weight based proportional
> > bandwidth controller where we can use the resources optimally and any
> > kind of throttling kicks in only if there is contention for the disk.
> >
> > Or we want max bandwidth control where a group is not allowed to use the
> > disk even if disk is free.
> >
> > Or we need both? I would think that at some point of time we will need
> > both but we can start with proportional bandwidth control first.
>
> How about making throttling policy be user selectable like the IO
> scheduler and putting it in the higher layer? So we could support
> all of policies (time-based, size-based and rate limiting). There
> seems not to only one solution which satisfies all users. But I agree
> with starting with proportional bandwidth control first.
>
> BTW, I will start to reimplement dm-ioband into block layer.
>
> > Fairness for higher level logical devices
> > =========================================
> > Do we want good fairness numbers for higher level logical devices also
> > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > at leaf nodes can help us use the resources optimally and in the process
> > we can get fairness at higher level also in many of the cases.
>
> We should also take care of block devices which provide their own
> make_request_fn() and not use a IO scheduler. We can't use the leaf
> nodes approach to such devices.
>
> > But do we want strict fairness numbers on higher level logical devices
> > even if it means sub-optimal usage of unerlying phsical devices?
> >
> > I think that for proportinal bandwidth control, it should be ok to provide
> > fairness at higher level logical device but for max bandwidth control it
> > might make more sense to provide fairness at higher level. Consider a
> > case where from a striped device a customer wants to limit a group to
> > 30MB/s and in case of leaf node control, if every leaf node provides
> > 30MB/s, it might accumulate to much more than specified rate at logical
> > device.
> >
> > Latency Control and strong isolation between groups
> > ===================================================
> > Do we want a good isolation between groups and better latencies and
> > stronger isolation between groups?
> >
> > I think if problem is solved at IO scheduler level, we can achieve better
> > latency control and hence stronger isolation between groups.
> >
> > Higher level solutions should find it hard to provide same kind of latency
> > control and isolation between groups as IO scheduler based solution.
>
> Why do you think that the higher level solution is hard to provide it?
> I think that it is a matter of how to implement throttling policy.
>
> > Fairness for buffered writes
> > ============================
> > Doing io control at any place below page cache has disadvantage that page
> > cache might not dispatch more writes from higher weight group hence higher
> > weight group might not see more IO done. Andrew says that we don't have
> > a solution to this problem in kernel and he would like to see it handled
> > properly.
> >
> > Only way to solve this seems to be to slow down the writers before they
> > write into page cache. IO throttling patch handled it by slowing down
> > writer if it crossed max specified rate. Other suggestions have come in
> > the form of dirty_ratio per memory cgroup or a separate cgroup controller
> > al-together where some kind of per group write limit can be specified.
> >
> > So if solution is implemented at IO scheduler layer or at device mapper
> > layer, both shall have to rely on another controller to be co-mounted
> > to handle buffered writes properly.
> >
> > Fairness with-in group
> > ======================
> > One of the issues with higher level controller is that how to do fair
> > throttling so that fairness with-in group is not impacted. Especially
> > the case of making sure that we don't break the notion of ioprio of the
> > processes with-in group.
>
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here is the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
>
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.
>

Hi Ryo,

I am doing some more tests to see how we maintain the notion of prio
within a group.

I have created two ioband devices, ioband1 and ioband2, of weight 100 each on
two disk partitions. On one partition/device (ioband1) a buffered writer is
doing writeout, and on the other partition I launch one prio 0 reader and an
increasing number of prio 4 readers using fio, let it run for 30 seconds, and
see how the BW gets distributed between the prio 0 and prio 4 processes.

Note, here readers are doing direct IO.

I did this test with vanilla CFQ and with dm-ioband + cfq.

With vanilla CFQ
----------------
<---------prio4 readers --------------------------> <---prio0 reader--->
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 12892KiB/s 12892KiB/s 12892KiB/s 409K usec 14705KiB/s 252K usec
2 5667KiB/s 5637KiB/s 11302KiB/s 717K usec 17555KiB/s 339K usec
4 4395KiB/s 4173KiB/s 17027KiB/s 933K usec 12437KiB/s 553K usec
8 2652KiB/s 2391KiB/s 20268KiB/s 1410K usec 9482KiB/s 685K usec
16 1653KiB/s 1413KiB/s 24035KiB/s 2418K usec 5860KiB/s 1027K usec

Note, as we increase the number of prio 4 readers, the prio 0 process's
aggregate bandwidth goes down (nr=2 seems to be the only exception), but it
still maintains more BW than any prio 4 process.

Also note that as we increase the number of prio 4 readers, their aggregate
bandwidth goes up, which is expected.

With dm-ioband
--------------
<---------prio4 readers --------------------------> <---prio0 reader--->
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 11242KiB/s 11242KiB/s 11242KiB/s 415K usec 3884KiB/s 244K usec
2 8110KiB/s 6236KiB/s 14345KiB/s 304K usec 320KiB/s 125K usec
4 6898KiB/s 622KiB/s 11059KiB/s 206K usec 503KiB/s 201K usec
8 345KiB/s 47KiB/s 850KiB/s 342K usec 8350KiB/s 164K usec
16 28KiB/s 28KiB/s 451KiB/s 688 msec 5092KiB/s 306K usec

Looking at the output with dm-ioband, it seems to be all over the place.
Look at aggregate bandwidth of prio0 reader and how wildly it is swinging.
It first goes down and then suddenly jumps up way high.

Similarly, look at the aggregate bandwidth of the prio 4 readers: the moment we
hit 8 readers, it suddenly tanks.

Look at the prio 4 reader and prio 0 reader BW with 16 prio 4 processes running:
a prio 4 process gets 28KiB/s and the prio 0 process gets 5MB/s.

Can you please look into it? It looks like we have serious issues w.r.t.
fairness and bandwidth distribution within a group.

Thanks
Vivek


> > Especially io throttling patch was very bad in terms of prio with-in
> > group where throttling treated everyone equally and difference between
> > process prio disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control most likely will change the ratio in which reads
> > and writes are dispatched to disk with-in group. It used to be decided
> > by IO scheduler so far but with higher level groups doing throttling and
> > possibly buffering the bios and releasing them later, they will have to
> > come up with their own policy on in what proportion reads and writes
> > should be dispatched. In case of IO scheduler based control, all the
> > queuing takes place at IO scheduler and it still retains control of
> > in what ration reads and writes should be dispatched.
>
> I don't think it is a concern. The current implementation of dm-ioband
> is that sync/async IO requests are handled separately and the
> backlogged IOs are released according to the order of arrival if both
> sync and async requests are backlogged.
>
> > Summary
> > =======
> >
> > - An io scheduler based io controller can provide better latencies,
> > stronger isolation between groups, time based fairness and will not
> > interfere with io schedulers policies like class, ioprio and
> > reader vs writer issues.
> >
> > But it can gunrantee fairness at higher logical level devices.
> > Especially in case of max bw control, leaf node control does not sound
> > to be the most appropriate thing.
> >
> > - IO throttling provides max bw control in terms of absolute rate. It has
> > the advantage that it can provide control at higher level logical device
> > and also control buffered writes without need of additional controller
> > co-mounted.
> >
> > But it does only max bw control and not proportion control so one might
> > not be using resources optimally. It looses sense of task prio and class
> > with-in group as any of the task can be throttled with-in group. Because
> > throttling does not kick in till you hit the max bw limit, it should find
> > it hard to provide same latencies as io scheduler based control.
> >
> > - dm-ioband also has the advantage that it can provide fairness at higher
> > level logical devices.
> >
> > But, fairness is provided only in terms of size of IO or number of IO.
> > No time based fairness. It is very throughput oriented and does not
> > throttle high speed group if other group is running slow random reader.
> > This results in bad latnecies for random reader group and weaker
> > isolation between groups.
>
> A new policy can be added to dm-ioband. Actually, range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling. Moreover there is room for improvement for existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
>
> > Also it does not provide fairness if a group is not continuously
> > backlogged. So if one is running 1-2 dd/sequential readers in the group,
> > one does not get fairness until workload is increased to a point where
> > group becomes continuously backlogged. This also results in poor
> > latencies and limited fairness.
>
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.
>
> > At this point of time it does not look like a single IO controller all
> > the scenarios/requirements. This means few things to me.
> >
> > - Drop some of the requirements and go with one implementation which meets
> > those reduced set of requirements.
> >
> > - Have more than one IO controller implementation in kenrel. One for lower
> > level control for better latencies, stronger isolation and optimal resource
> > usage and other one for fairness at higher level logical devices and max
> > bandwidth control.
> >
> > And let user decide which one to use based on his/her needs.
> >
> > - Come up with more intelligent way of doing IO control where single
> > controller covers all the cases.
> >
> > At this point of time, I am more inclined towards option 2 of having more
> > than one implementation in kernel. :-) (Until and unless we can brainstrom
> > and come up with ideas to make option 3 happen).
> >
> > > It would be great if we discuss our plans on the mailing list, so we
> > > can get early feedback from everyone.
> >
> > This is what comes to my mind so far. Please add to the list if I have missed
> > some points. Also correct me if I am wrong about the pros/cons of the
> > approaches.
> >
> > Thoughts/ideas/opinions are welcome...
> >
> > Thanks
> > Vivek
>
> Thanks,
> Ryo Tsuruta

2009-09-30 08:43:17

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek,

Vivek Goyal <[email protected]> wrote:
> I was thinking that elevator layer will do the merge of bios. So IO
> scheduler/elevator can time stamp the first bio in the request as it goes
> into the disk and again timestamp with finish time once request finishes.
>
> This way higher layer can get an idea how much disk time a group of bios
> used. But on multi queue, if we dispatch say 4 requests from same queue,
> then time accounting becomes an issue.
>
> Consider following where four requests rq1, rq2, rq3 and rq4 are
> dispatched to disk at time t0, t1, t2 and t3 respectively and these
> requests finish at time t4, t5, t6 and t7. For sake of simlicity assume
> time elapsed between each of milestones is t. Also assume that all these
> requests are from same queue/group.
>
> t0 t1 t2 t3 t4 t5 t6 t7
> rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4
>
> Now higher layer will think that time consumed by group is:
>
> (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
>
> But the time elapsed is only 7t.

The IO controller can know how many requests have been issued and are still in
progress. Isn't it enough to accumulate the time while in-flight IOs
exist?

> Secondly if a different group is running only single sequential reader,
> there CFQ will be driving queue depth of 1 and time will not be running
> faster and this inaccuracy in accounting will lead to unfair share between
> groups.
>
> So we need something better to get a sense which group used how much of
> disk time.

It could be solved by implementing a way to pass such information
from the IO scheduler to the higher-layer controller.

> > How about making throttling policy be user selectable like the IO
> > scheduler and putting it in the higher layer? So we could support
> > all of policies (time-based, size-based and rate limiting). There
> > seems not to only one solution which satisfies all users. But I agree
> > with starting with proportional bandwidth control first.
> >
>
> What are the cases where time based policy does not work and size based
> policy works better and user would choose size based policy and not timed
> based one?

I think that disk time is not simply proportional to IO size. If there
are two groups with equal weights and they issue differently sized IOs,
the bandwidth of each group would not be distributed equally as
expected.

> I am not against implementing things in higher layer as long as we can
> ensure tight control on latencies, strong isolation between groups and
> not break CFQ's class and ioprio model with-in group.
>
> > BTW, I will start to reimplement dm-ioband into block layer.
>
> Can you elaborate little bit on this?

bios are grabbed in generic_make_request() and throttled by the same
mechanism as dm-ioband. The dmsetup command is no longer necessary.
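
Roughly, the hook would look like the following. This is only a sketch;
struct blkio_group, blkio_group_from_bio() and blkio_wait_for_slot() are
provisional names, not existing kernel or dm-ioband symbols.

/*
 * Sketch only: throttle bios in the generic_make_request() path,
 * before they reach the IO scheduler.
 */
struct blkio_group;				/* per-cgroup, per-device state */

struct blkio_group *blkio_group_from_bio(struct bio *bio);
void blkio_wait_for_slot(struct blkio_group *grp, int rw);

/* Hook called from the generic_make_request() path, so no dm device
 * or dmsetup configuration is required any more. */
static void blkio_throttle_bio(struct bio *bio)
{
	struct blkio_group *grp = blkio_group_from_bio(bio);

	blkio_wait_for_slot(grp, bio_data_dir(bio));
}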

> > > Fairness for higher level logical devices
> > > =========================================
> > > Do we want good fairness numbers for higher level logical devices also
> > > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > > at leaf nodes can help us use the resources optimally and in the process
> > > we can get fairness at higher level also in many of the cases.
> >
> > We should also take care of block devices which provide their own
> > make_request_fn() and not use a IO scheduler. We can't use the leaf
> > nodes approach to such devices.
> >
>
> I am not sure how big an issue is this. This can be easily solved by
> making use of NOOP scheduler by these devices. What are the reasons for
> these devices to not use even noop?

I'm not sure why the developers of the device driver chose their own
way, and the driver is provided in binary form, so we can't modify it.

> > > Fairness with-in group
> > > ======================
> > > One of the issues with higher level controller is that how to do fair
> > > throttling so that fairness with-in group is not impacted. Especially
> > > the case of making sure that we don't break the notion of ioprio of the
> > > processes with-in group.
> >
> > I ran your test script to confirm that the notion of ioprio was not
> > broken by dm-ioband. Here is the results of the test.
> > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> >
> > I think that the time period during which dm-ioband holds IO requests
> > for throttling would be too short to break the notion of ioprio.
>
> Ok, I re-ran that test. Previously default io_limit value was 192 and now

The default value of io_limit in the previous test was 128 (not 192),
which is equal to the default value of nr_requests.

> I set it up to 256 as you suggested. I still see writer starving reader. I
> have removed "conv=fdatasync" from writer so that a writer is pure buffered
> writes.

O.K. You removed "conv=fdatasync". The new dm-ioband handles
sync/async requests separately, and it solves this
buffered-write-starves-read problem. I would like to post it soon
after doing some more tests.

> On top of that can you please give some details how increasing the
> buffered queue length reduces the impact of writers?

When the number of in-flight IOs exceeds io_limit, processes which are
going to issue IOs are put to sleep by dm-ioband until all the in-flight
IOs have finished. But the IO scheduler layer can accept more IO requests
than the value of io_limit, so this was a throughput bottleneck.
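
In pseudo code the behaviour is roughly as follows. This is a simplified
sketch, not the actual dm-ioband code, and the names are placeholders.

/*
 * Simplified sketch of the io_limit gate described above; not the
 * actual dm-ioband implementation.
 */
struct ioband_group {
	atomic_t nr_in_flight;		/* bios issued but not yet completed */
	unsigned int io_limit;		/* e.g. 128, same as default nr_requests */
	wait_queue_head_t waitq;
};

static void ioband_submit(struct ioband_group *g)
{
	/* Once io_limit in-flight IOs are reached, new submitters sleep
	 * until the whole backlog has drained.  Since the IO scheduler
	 * below can hold far more than io_limit requests, this gate can
	 * become the throughput bottleneck mentioned above. */
	if (atomic_read(&g->nr_in_flight) >= g->io_limit)
		wait_event(g->waitq, atomic_read(&g->nr_in_flight) == 0);
	atomic_inc(&g->nr_in_flight);
}

static void ioband_complete(struct ioband_group *g)
{
	if (atomic_dec_and_test(&g->nr_in_flight))
		wake_up_all(&g->waitq);
}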

> IO Prio issue
> --------------
> I ran another test where two ioband devices were created of weight 100
> each on two partitions. In first group 4 readers were launched. Three
> readers are of class BE and prio 7, fourth one is of class BE prio 0. In
> group2, I launched a buffered writer.
>
> One would expect that prio0 reader gets more bandwidth as compared to
> prio 4 readers and prio 7 readers will get more or less same bw. Looks like
> that is not happening. Look how vanilla CFQ provides much more bandwidth
> to prio0 reader as compared to prio7 reader and how putting them in the
> group reduces the difference betweej prio0 and prio7 readers.
>
> Following are the results.

O.K. I'll try to do more tests with dm-ioband according to your
comments, especially working with CFQ. Thanks for pointing this out.

Thanks,
Ryo Tsuruta

2009-09-30 11:06:00

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <[email protected]> wrote:
> > I was thinking that elevator layer will do the merge of bios. So IO
> > scheduler/elevator can time stamp the first bio in the request as it goes
> > into the disk and again timestamp with finish time once request finishes.
> >
> > This way higher layer can get an idea how much disk time a group of bios
> > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > then time accounting becomes an issue.
> >
> > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume
> > time elapsed between each of milestones is t. Also assume that all these
> > requests are from same queue/group.
> >
> > t0 t1 t2 t3 t4 t5 t6 t7
> > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4
> >
> > Now higher layer will think that time consumed by group is:
> >
> > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> >
> > But the time elapsed is only 7t.
>
> IO controller can know how many requests are issued and still in
> progress. Is it not enough to accumulate the time while in-flight IOs
> exist?
>

That time would not reflect the disk time used. It will be the following:

(time spent waiting in CFQ queues) + (time spent in dispatch queue) +
(time spent in disk)
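
To make the over-counting concrete, here is the rq1..rq4 arithmetic from the
example above as a tiny standalone program (illustrative numbers only).

#include <stdio.h>

/*
 * Four requests dispatched at t0..t3 and completed at t4..t7, one time
 * unit apart, all from the same group.
 */
int main(void)
{
	int dispatch[4] = { 0, 1, 2, 3 };	/* t0..t3 */
	int complete[4] = { 4, 5, 6, 7 };	/* t4..t7 */
	int naive = 0;

	/* Naive per-request sum: (t4-t0)+(t5-t1)+(t6-t2)+(t7-t3) = 16t */
	for (int i = 0; i < 4; i++)
		naive += complete[i] - dispatch[i];

	/* Wall-clock time while the group had IOs in flight: t7-t0 = 7t */
	int busy = complete[3] - dispatch[0];

	printf("naive sum = %dt, actual busy time = %dt\n", naive, busy);
	return 0;
}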

> > Secondly if a different group is running only single sequential reader,
> > there CFQ will be driving queue depth of 1 and time will not be running
> > faster and this inaccuracy in accounting will lead to unfair share between
> > groups.
> >
> > So we need something better to get a sense which group used how much of
> > disk time.
>
> It could be solved by implementing the way to pass on such information
> from IO scheduler to higher layer controller.
>

How would you do that? Can you give some details on exactly how and what
information the IO scheduler will pass to the higher level IO controller so
that the IO controller can attribute the right time to the group?

> > > How about making throttling policy be user selectable like the IO
> > > scheduler and putting it in the higher layer? So we could support
> > > all of policies (time-based, size-based and rate limiting). There
> > > seems not to only one solution which satisfies all users. But I agree
> > > with starting with proportional bandwidth control first.
> > >
> >
> > What are the cases where time based policy does not work and size based
> > policy works better and user would choose size based policy and not timed
> > based one?
>
> I think that disk time is not simply proportional to IO size. If there
> are two groups whose wights are equally assigned and they issue
> different sized IOs repsectively, the bandwidth of each group would
> not distributed equally as expected.
>

If we are providing fairness in terms of time, it is fair. If we provide
equal time slots to two processes and if one got more IO done because it
was not wasting time seeking or it issued bigger size IO, it deserves that
higher BW. IO controller will make sure that process gets fair share in
terms of time and exactly how much BW one got will depend on the workload.

That's the precise reason that fairness in terms of time is better on
seeky media.

> > I am not against implementing things in higher layer as long as we can
> > ensure tight control on latencies, strong isolation between groups and
> > not break CFQ's class and ioprio model with-in group.
> >
> > > BTW, I will start to reimplement dm-ioband into block layer.
> >
> > Can you elaborate little bit on this?
>
> bio is grabbed in generic_make_request() and throttled as well as
> dm-ioband's mechanism. dmsetup command is not necessary any longer.
>

Ok, so one would not need a dm-ioband device now, but the same dm-ioband
throttling policies will apply. So until and unless we figure out a
better way, the issues I have pointed out will still exist even in the
new implementation.

> > > > Fairness for higher level logical devices
> > > > =========================================
> > > > Do we want good fairness numbers for higher level logical devices also
> > > > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > > > at leaf nodes can help us use the resources optimally and in the process
> > > > we can get fairness at higher level also in many of the cases.
> > >
> > > We should also take care of block devices which provide their own
> > > make_request_fn() and not use a IO scheduler. We can't use the leaf
> > > nodes approach to such devices.
> > >
> >
> > I am not sure how big an issue is this. This can be easily solved by
> > making use of NOOP scheduler by these devices. What are the reasons for
> > these devices to not use even noop?
>
> I'm not sure why the developers of the device driver choose their own
> way, and the driver is provided in binary form, so we can't modify it.
>
> > > > Fairness with-in group
> > > > ======================
> > > > One of the issues with higher level controller is that how to do fair
> > > > throttling so that fairness with-in group is not impacted. Especially
> > > > the case of making sure that we don't break the notion of ioprio of the
> > > > processes with-in group.
> > >
> > > I ran your test script to confirm that the notion of ioprio was not
> > > broken by dm-ioband. Here is the results of the test.
> > > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> > >
> > > I think that the time period during which dm-ioband holds IO requests
> > > for throttling would be too short to break the notion of ioprio.
> >
> > Ok, I re-ran that test. Previously default io_limit value was 192 and now
>
> The default value of io_limit on the previous test was 128 (not 192)
> which is equall to the default value of nr_request.

Hm..., I used the following commands to create two ioband devices.

echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
"weight 0 :100" | dmsetup create ioband1
echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
"weight 0 :100" | dmsetup create ioband2

Here the io_limit value is zero, so it should pick the default value.
Following is the output of the "dmsetup table" command.

ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
^^^^
IIUC, the above number 192 reflects io_limit? If yes, then the default
seems to be 192?

>
> > I set it up to 256 as you suggested. I still see writer starving reader. I
> > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > writes.
>
> O.K. You removed "conv=fdatasync", the new dm-ioband handles
> sync/async requests separately, and it solves this
> buffered-write-starves-read problem. I would like to post it soon
> after doing some more test.
>
> > On top of that can you please give some details how increasing the
> > buffered queue length reduces the impact of writers?
>
> When the number of in-flight IOs exceeds io_limit, processes which are
> going to issue IOs are made sleep by dm-ioband until all the in-flight
> IOs are finished. But IO scheduler layer can accept IO requests more
> than the value of io_limit, so it was a bottleneck of the throughput.
>

Ok, so it should have been a throughput bottleneck, but how did it solve the
issue of the writer starving the reader, as you had mentioned in the mail?

Secondly, you mentioned that processes are made to sleep once we cross
io_limit. This sounds like the request descriptor facility on the request
queue, where processes are made to sleep.

There are threads in the kernel which don't want to sleep while submitting
bios. For example, btrfs has a bio submitting thread which does not want
to sleep; hence it checks with the device whether it is congested or not and
does not submit the bio if it is congested. How would you handle such cases?
Have you implemented any per-group congestion kind of interface to make sure
such IOs don't sleep if the group is congested?

Or is this limit per ioband device, shared by every group on the device?
If yes, then how would you provide isolation between groups, because if
one group consumes the io_limit tokens, the others will simply be
serialized on that device?

> > IO Prio issue
> > --------------
> > I ran another test where two ioband devices were created of weight 100
> > each on two partitions. In first group 4 readers were launched. Three
> > readers are of class BE and prio 7, fourth one is of class BE prio 0. In
> > group2, I launched a buffered writer.
> >
> > One would expect that prio0 reader gets more bandwidth as compared to
> > prio 4 readers and prio 7 readers will get more or less same bw. Looks like
> > that is not happening. Look how vanilla CFQ provides much more bandwidth
> > to prio0 reader as compared to prio7 reader and how putting them in the
> > group reduces the difference betweej prio0 and prio7 readers.
> >
> > Following are the results.
>
> O.K. I'll try to do more test with dm-ioband according to your
> comments especially working with CFQ. Thanks for pointing out.
>
> Thanks,
> Ryo Tsuruta

2009-09-30 19:58:57

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:

> It's a given that not merging will provide better latency. We can't
> disable that or performance will suffer A LOT on some systems. There are
> ways to make it better, though. One would be to make the max request
> size smaller, but that would also hurt for streamed workloads. Can you
> try whether the below patch makes a difference? It will basically
> disallow merges to a request that isn't the last one.

Thoughts about something like the below?

The problem with the dd vs konsole -e exit type load seems to be
kjournald overloading the disk between reads. When userland is blocked,
kjournald is free to stuff 4*quantum into the queue instantly.

Taking the hint from Vivek's fairness tweakable patch, I stamped the
queue when a seeker was last seen, and disallowed overload within
CIC_SEEK_THR of that time. Worked well.

dd competing against perf stat -- konsole -e exec timings, 5 back to back runs
          run1   run2   run3   run4   run5    Avg
before    9.15  14.51   9.39  15.06   9.90   11.6
after     1.76   1.54   1.93   1.88   1.56    1.7

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e2a9b92..4a00129 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -174,6 +174,8 @@ struct cfq_data {
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;

+ unsigned long last_seeker;
+
struct list_head cic_list;

/*
@@ -1326,6 +1328,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
return 0;

/*
+ * We may have seeky queues, don't throttle up just yet.
+ */
+ if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
+ return 0;
+
+ /*
* we are the only queue, allow up to 4 times of 'quantum'
*/
if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1941,7 +1949,7 @@ static void
cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct cfq_io_context *cic)
{
- int old_idle, enable_idle;
+ int old_idle, enable_idle, seeky = 0;

/*
* Don't idle for async or idle io prio class
@@ -1951,8 +1959,12 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,

enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

- if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (cfqd->hw_tag && CIC_SEEKY(cic)))
+ if (cfqd->hw_tag && CIC_SEEKY(cic)) {
+ cfqd->last_seeker = jiffies;
+ seeky = 1;
+ }
+
+ if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || seeky)
enable_idle = 0;
else if (sample_valid(cic->ttime_samples)) {
if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2482,6 +2494,7 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
cfqd->hw_tag = 1;
+ cfqd->last_seeker = jiffies;

return cfqd;
}

2009-09-30 20:05:44

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10



> /*
> + * We may have seeky queues, don't throttle up just yet.
> + */
> + if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
> + return 0;
> +

Bzzzt. Window too large, but the thought is to let them overload, just
not instantly.

-Mike

2009-09-30 20:27:23

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Wed, Sep 30, 2009 at 10:05:39PM +0200, Mike Galbraith wrote:
>
>
> > /*
> > + * We may have seeky queues, don't throttle up just yet.
> > + */
> > + if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
> > + return 0;
> > +
>
> bzzzt. Window too large, but the though is to let them overload, but
> not instantly.
>

CIC_SEEK_THR is 8K jiffies, so that would be 8 seconds on a 1000HZ system.
Try using one "slice_idle" period of 8 ms. But it might turn out to be too
short depending on the disk speed.
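
A minimal sketch of that suggestion against Mike's patch (reusing the
last_seeker stamp it adds and the existing cfq_slice_idle tunable, both in
jiffies; illustrative only, untested):

        /* don't allow overload until one idle period after the last seeker */
        if (time_before(jiffies, cfqd->last_seeker + cfqd->cfq_slice_idle))
                return 0;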

Thanks
Vivek

2009-10-01 06:41:23

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek,

Vivek Goyal <[email protected]> wrote:
> On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> > Vivek Goyal <[email protected]> wrote:
> > > I was thinking that elevator layer will do the merge of bios. So IO
> > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > into the disk and again timestamp with finish time once request finishes.
> > >
> > > This way higher layer can get an idea how much disk time a group of bios
> > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > then time accounting becomes an issue.
> > >
> > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume
> > > time elapsed between each of milestones is t. Also assume that all these
> > > requests are from same queue/group.
> > >
> > > t0 t1 t2 t3 t4 t5 t6 t7
> > > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4
> > >
> > > Now higher layer will think that time consumed by group is:
> > >
> > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > >
> > > But the time elapsed is only 7t.
> >
> > IO controller can know how many requests are issued and still in
> > progress. Is it not enough to accumulate the time while in-flight IOs
> > exist?
> >
>
> That time would not reflect disk time used. It will be follwoing.
>
> (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> (time spent in disk)

In the case where multiple IO requests are issued from the IO controller,
the time measurement is the time from when the first IO request is
issued until the endio is called for the last IO request. Does it not
reflect disk time?
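
A toy sketch of that accounting (standalone C; the group structure and time
units are made up for illustration): wall-clock time is charged to a group
only while it has IOs in flight, i.e. from the first dispatch to the last
completion.

#include <stdio.h>

struct grp {
        int in_flight;
        long charged;        /* time units charged to the group */
        long busy_since;     /* when in_flight went 0 -> 1 */
};

static void dispatch(struct grp *g, long now)
{
        if (g->in_flight++ == 0)
                g->busy_since = now;
}

static void complete(struct grp *g, long now)
{
        if (--g->in_flight == 0)
                g->charged += now - g->busy_since;
}

int main(void)
{
        struct grp g = { 0, 0, 0 };

        /* Vivek's example: rq1..rq4 dispatched at t0..t3, finished at t4..t7 */
        dispatch(&g, 0); dispatch(&g, 1); dispatch(&g, 2); dispatch(&g, 3);
        complete(&g, 4); complete(&g, 5); complete(&g, 6); complete(&g, 7);

        printf("charged %ldt (per-request sum would be 16t)\n", g.charged);
        return 0;
}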

> > > Secondly if a different group is running only single sequential reader,
> > > there CFQ will be driving queue depth of 1 and time will not be running
> > > faster and this inaccuracy in accounting will lead to unfair share between
> > > groups.
> > >
> > > So we need something better to get a sense which group used how much of
> > > disk time.
> >
> > It could be solved by implementing the way to pass on such information
> > from IO scheduler to higher layer controller.
> >
>
> How would you do that? Can you give some details exactly how and what
> information IO scheduler will pass to higher level IO controller so that IO
> controller can attribute right time to the group.

If you would like to know when the idle timer has expired, how about
adding a function to the IO controller that the IO scheduler can call to
notify it? The IO scheduler calls the function when the timer expires.
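
A hypothetical sketch of such a hook (standalone C; all names are invented
for illustration, nothing like this exists in the posted patches):

#include <stdio.h>

/* hypothetical notification hook: IO scheduler -> IO controller */
struct io_ctrl_ops {
        void (*idle_expired)(int group_id);
};

static void ctrl_idle_expired(int group_id)
{
        printf("controller: idle timer expired for group %d\n", group_id);
}

static struct io_ctrl_ops ops = { .idle_expired = ctrl_idle_expired };

int main(void)
{
        /* the IO scheduler would invoke this when its idle timer fires */
        ops.idle_expired(1);
        return 0;
}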

> > > > How about making throttling policy be user selectable like the IO
> > > > scheduler and putting it in the higher layer? So we could support
> > > > all of policies (time-based, size-based and rate limiting). There
> > > > seems not to only one solution which satisfies all users. But I agree
> > > > with starting with proportional bandwidth control first.
> > > >
> > >
> > > What are the cases where time based policy does not work and size based
> > > policy works better and user would choose size based policy and not timed
> > > based one?
> >
> > I think that disk time is not simply proportional to IO size. If there
> > are two groups whose wights are equally assigned and they issue
> > different sized IOs repsectively, the bandwidth of each group would
> > not distributed equally as expected.
> >
>
> If we are providing fairness in terms of time, it is fair. If we provide
> equal time slots to two processes and if one got more IO done because it
> was not wasting time seeking or it issued bigger size IO, it deserves that
> higher BW. IO controller will make sure that process gets fair share in
> terms of time and exactly how much BW one got will depend on the workload.
>
> That's the precise reason that fairness in terms of time is better on
> seeky media.

If the seek time is negligible, the bandwidth would not be distributed
in proportion to the weight settings. I think that it would be
unclear for users to understand how bandwidth is distributed. And I
also think that seeky media would gradually become obsolete.

> > > I am not against implementing things in higher layer as long as we can
> > > ensure tight control on latencies, strong isolation between groups and
> > > not break CFQ's class and ioprio model with-in group.
> > >
> > > > BTW, I will start to reimplement dm-ioband into block layer.
> > >
> > > Can you elaborate little bit on this?
> >
> > bio is grabbed in generic_make_request() and throttled as well as
> > dm-ioband's mechanism. dmsetup command is not necessary any longer.
> >
>
> Ok, so one would not need dm-ioband device now, but same dm-ioband
> throttling policies will apply. So until and unless we figure out a
> better way, the issues I have pointed out will still exists even in
> new implementation.

Yes, those still exist, but somehow I would like to try to solve them.

> > The default value of io_limit on the previous test was 128 (not 192)
> > which is equall to the default value of nr_request.
>
> Hm..., I used following commands to create two ioband devices.
>
> echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> "weight 0 :100" | dmsetup create ioband1
> echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> "weight 0 :100" | dmsetup create ioband2
>
> Here io_limit value is zero so it should pick default value. Following is
> output of "dmsetup table" command.
>
> ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
> ^^^^
> IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> to be 192?

The default value has changed since v1.12.0; it increased from 128 to 192.

> > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > writes.
> >
> > O.K. You removed "conv=fdatasync", the new dm-ioband handles
> > sync/async requests separately, and it solves this
> > buffered-write-starves-read problem. I would like to post it soon
> > after doing some more test.
> >
> > > On top of that can you please give some details how increasing the
> > > buffered queue length reduces the impact of writers?
> >
> > When the number of in-flight IOs exceeds io_limit, processes which are
> > going to issue IOs are made sleep by dm-ioband until all the in-flight
> > IOs are finished. But IO scheduler layer can accept IO requests more
> > than the value of io_limit, so it was a bottleneck of the throughput.
> >
>
> Ok, so it should have been throughput bottleneck but how did it solve the
> issue of writer starving the reader as you had mentioned in the mail.

As I wrote above, I modified dm-ioband to handle sync/async requests
separately, so even if writers do a lot of buffered IOs, readers can
issue IOs regardless of the writers' busyness. Once the IOs are backlogged
for throttling, both sync and async requests are issued according
to their order of arrival.

> Secondly, you mentioned that processes are made to sleep once we cross
> io_limit. This sounds like request descriptor facility on requeust queue
> where processes are made to sleep.
>
> There are threads in kernel which don't want to sleep while submitting
> bios. For example, btrfs has bio submitting thread which does not want
> to sleep hence it checks with device if it is congested or not and not
> submit the bio if it is congested. How would you handle such cases. Have
> you implemented any per group congestion kind of interface to make sure
> such IO's don't sleep if group is congested.
>
> Or this limit is per ioband device which every group on the device is
> sharing. If yes, then how would you provide isolation between groups
> because if one groups consumes io_limit tokens, then other will simply
> be serialized on that device?

There are two kinds of limits, and both limit the number of IO requests
which can be issued simultaneously: one is per ioband device, the
other is per ioband group. The per-group limit assigned to
each group is calculated by dividing io_limit according to the groups'
proportion of weight.
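
A toy calculation of that per-group split (standalone C; the io_limit and
the group weights are example values, and integer division is an assumption
about rounding):

#include <stdio.h>

int main(void)
{
        int io_limit = 192;                 /* per ioband device (default) */
        int weights[] = { 100, 100, 200 };  /* example group weights */
        int n = 3, total = 0, i;

        for (i = 0; i < n; i++)
                total += weights[i];

        /* each group's share of io_limit is proportional to its weight */
        for (i = 0; i < n; i++)
                printf("group %d: per-group limit %d\n",
                       i, io_limit * weights[i] / total);
        return 0;
}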

A kernel thread is not made to sleep by the per-group limit, because
several kinds of kernel threads submit IOs for multiple groups and
multiple devices from a single thread. At this time, a kernel
thread is made to sleep by the per-device limit only.

Thanks,
Ryo Tsuruta

2009-10-01 07:33:32

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Wed, 2009-09-30 at 16:24 -0400, Vivek Goyal wrote:
> On Wed, Sep 30, 2009 at 10:05:39PM +0200, Mike Galbraith wrote:
> >
> >
> > > /*
> > > + * We may have seeky queues, don't throttle up just yet.
> > > + */
> > > + if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
> > > + return 0;
> > > +
> >
> > bzzzt. Window too large, but the though is to let them overload, but
> > not instantly.
> >
>
> CIC_SEEK_THR is 8K jiffies so that would be 8seconds on 1000HZ system. Try
> using one "slice_idle" period of 8 ms. But it might turn out to be too
> short depending on the disk speed.

Yeah, it is too short, as is even _400_ ms. Trouble is, by the time
some new task is determined to be seeky, the damage is already done.

The below does better, though not as well as "just say no to overload"
of course ;-)

I have a patchlet from Corrado to test, likely better time investment
than poking this darn thing with sharp sticks.

-Mike

grep elapsed testo.log
0.894345911 seconds time elapsed <== solo seeky test measurement
3.732472877 seconds time elapsed
3.208443735 seconds time elapsed
4.249776673 seconds time elapsed
2.763449260 seconds time elapsed
4.235271019 seconds time elapsed

(3.73 + 3.20 + 4.24 + 2.76 + 4.23) / 5 / 0.89 = 4... darn.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e2a9b92..44a888d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -174,6 +174,8 @@ struct cfq_data {
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;

+ unsigned long od_stamp;
+
struct list_head cic_list;

/*
@@ -1296,19 +1298,26 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
/*
* Drain async requests before we start sync IO
*/
- if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+ if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+ cfqd->od_stamp = jiffies;
return 0;
+ }

/*
* If this is an async queue and we have sync IO in flight, let it wait
*/
- if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+ if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+ cfqd->od_stamp = jiffies;
return 0;
+ }

max_dispatch = cfqd->cfq_quantum;
if (cfq_class_idle(cfqq))
max_dispatch = 1;

+ if (cfqd->busy_queues > 1)
+ cfqd->od_stamp = jiffies;
+
/*
* Does this cfqq already have too much IO in flight?
*/
@@ -1326,6 +1335,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
return 0;

/*
+ * Don't start overloading until we've been alone for a bit.
+ */
+ if (time_before(jiffies, cfqd->od_stamp + cfq_slice_sync))
+ return 0;
+
+ /*
* we are the only queue, allow up to 4 times of 'quantum'
*/
if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1941,7 +1956,7 @@ static void
cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct cfq_io_context *cic)
{
- int old_idle, enable_idle;
+ int old_idle, enable_idle, seeky = 0;

/*
* Don't idle for async or idle io prio class
@@ -1949,10 +1964,19 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
return;

+ if (cfqd->hw_tag) {
+ if (CIC_SEEKY(cic))
+ seeky = 1;
+ /*
+ * If known or incalculable seekiness, delay.
+ */
+ if (seeky || !sample_valid(cic->seek_samples))
+ cfqd->od_stamp = jiffies;
+ }
+
enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

- if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (cfqd->hw_tag && CIC_SEEKY(cic)))
+ if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || seeky)
enable_idle = 0;
else if (sample_valid(cic->ttime_samples)) {
if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2482,6 +2506,7 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
cfqd->hw_tag = 1;
+ cfqd->od_stamp = INITIAL_JIFFIES;

return cfqd;
}



2009-10-01 13:32:43

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <[email protected]> wrote:
> > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > > Hi Vivek,
> > >
> > > Vivek Goyal <[email protected]> wrote:
> > > > I was thinking that elevator layer will do the merge of bios. So IO
> > > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > > into the disk and again timestamp with finish time once request finishes.
> > > >
> > > > This way higher layer can get an idea how much disk time a group of bios
> > > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > > then time accounting becomes an issue.
> > > >
> > > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume
> > > > time elapsed between each of milestones is t. Also assume that all these
> > > > requests are from same queue/group.
> > > >
> > > > t0 t1 t2 t3 t4 t5 t6 t7
> > > > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4
> > > >
> > > > Now higher layer will think that time consumed by group is:
> > > >
> > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > >
> > > > But the time elapsed is only 7t.
> > >
> > > IO controller can know how many requests are issued and still in
> > > progress. Is it not enough to accumulate the time while in-flight IOs
> > > exist?
> > >
> >
> > That time would not reflect disk time used. It will be follwoing.
> >
> > (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> > (time spent in disk)
>
> In the case where multiple IO requests are issued from IO controller,
> that time measurement is the time from when the first IO request is
> issued until when the endio is called for the last IO request. Does
> not it reflect disk time?
>

Not accurately, as it will include the time spent in CFQ queues as
well as in the dispatch queue. I will not worry much about dispatch queue
time, but time spent in CFQ queues can be significant.

This assumes that you are using a token based scheme and will be
dispatching requests from multiple groups at the same time.

But if you figure out a way to dispatch requests from only one group
at a time, wait for all its requests to finish, and then let the next group
go, then the above can work fairly accurately. In that case it will become
like CFQ, with the only difference that effectively we have one queue per
group instead of per process.
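
A toy model of that "one group at a time" scheme (standalone C; batch sizes
and the 1-time-unit service time are made-up example values): because a
group's batch fully drains before the next group dispatches,
first-dispatch-to-last-completion per batch equals the group's actual disk
occupancy.

#include <stdio.h>

int main(void)
{
        long now = 0, charged[2] = { 0, 0 };
        int batch[2] = { 4, 2 };   /* requests per dispatch round, per group */
        int round, g;

        for (round = 0; round < 3; round++) {
                for (g = 0; g < 2; g++) {
                        long start = now;

                        /* the whole batch completes before the next group runs */
                        now += batch[g];            /* 1 time unit per request */
                        charged[g] += now - start;  /* exact disk occupancy */
                }
        }
        printf("group 0 charged %ld, group 1 charged %ld\n",
               charged[0], charged[1]);
        return 0;
}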

> > > > Secondly if a different group is running only single sequential reader,
> > > > there CFQ will be driving queue depth of 1 and time will not be running
> > > > faster and this inaccuracy in accounting will lead to unfair share between
> > > > groups.
> > > >
> > > > So we need something better to get a sense which group used how much of
> > > > disk time.
> > >
> > > It could be solved by implementing the way to pass on such information
> > > from IO scheduler to higher layer controller.
> > >
> >
> > How would you do that? Can you give some details exactly how and what
> > information IO scheduler will pass to higher level IO controller so that IO
> > controller can attribute right time to the group.
>
> If you would like to know when the idle timer is expired, how about
> adding a function to IO controller to be notified it from IO
> scheduler? IO scheduler calls the function when the timer is expired.
>

This probably can be done. So this is like syncing between the lower and
higher layers about when we start idling and when we stop it, and
both layers should stay in sync.

This is something my common layer approach does. Because it is so close to
the IO scheduler, I can do it relatively easily.

One probably can create interfaces to propagate this information up.
But all of this will probably come into the picture only if we don't use
token based schemes and instead come up with something where, at any point
of time, dispatches are from one group only.

> > > > > How about making throttling policy be user selectable like the IO
> > > > > scheduler and putting it in the higher layer? So we could support
> > > > > all of policies (time-based, size-based and rate limiting). There
> > > > > seems not to only one solution which satisfies all users. But I agree
> > > > > with starting with proportional bandwidth control first.
> > > > >
> > > >
> > > > What are the cases where time based policy does not work and size based
> > > > policy works better and user would choose size based policy and not timed
> > > > based one?
> > >
> > > I think that disk time is not simply proportional to IO size. If there
> > > are two groups whose wights are equally assigned and they issue
> > > different sized IOs repsectively, the bandwidth of each group would
> > > not distributed equally as expected.
> > >
> >
> > If we are providing fairness in terms of time, it is fair. If we provide
> > equal time slots to two processes and if one got more IO done because it
> > was not wasting time seeking or it issued bigger size IO, it deserves that
> > higher BW. IO controller will make sure that process gets fair share in
> > terms of time and exactly how much BW one got will depend on the workload.
> >
> > That's the precise reason that fairness in terms of time is better on
> > seeky media.
>
> If the seek time is negligible, the bandwidth would not be distributed
> according to a proportion of weight settings. I think that it would be
> unclear for users to understand how bandwidth is distributed. And I
> also think that seeky media would gradually become obsolete,
>

I can understand that with lower seek costs the game starts changing, and
probably a size based policy also works decently.

In that case, at some point CFQ will probably also need to support
another mode/policy where fairness is provided in terms of size of IO if
it detects an SSD with hardware queuing. Currently it seems to disable
idling in that case, but this is not very good from a fairness point of
view. I guess if CFQ wants to provide fairness in such cases, it needs to
dynamically change shape and start thinking in terms of size of IO.

So far my testing has been very limited, just hard disks connected to my
computer. I will do some testing on high end enterprise storage and see
how much seeks matter and how well both implementations work.

> > > > I am not against implementing things in higher layer as long as we can
> > > > ensure tight control on latencies, strong isolation between groups and
> > > > not break CFQ's class and ioprio model with-in group.
> > > >
> > > > > BTW, I will start to reimplement dm-ioband into block layer.
> > > >
> > > > Can you elaborate little bit on this?
> > >
> > > bio is grabbed in generic_make_request() and throttled as well as
> > > dm-ioband's mechanism. dmsetup command is not necessary any longer.
> > >
> >
> > Ok, so one would not need dm-ioband device now, but same dm-ioband
> > throttling policies will apply. So until and unless we figure out a
> > better way, the issues I have pointed out will still exists even in
> > new implementation.
>
> Yes, those still exist, but somehow I would like to try to solve them.
>
> > > The default value of io_limit on the previous test was 128 (not 192)
> > > which is equall to the default value of nr_request.
> >
> > Hm..., I used following commands to create two ioband devices.
> >
> > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> > "weight 0 :100" | dmsetup create ioband1
> > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> > "weight 0 :100" | dmsetup create ioband2
> >
> > Here io_limit value is zero so it should pick default value. Following is
> > output of "dmsetup table" command.
> >
> > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
> > ^^^^
> > IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> > to be 192?
>
> The default vaule has changed since v1.12.0 and increased from 128 to 192.
>
> > > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > > writes.
> > >
> > > O.K. You removed "conv=fdatasync", the new dm-ioband handles
> > > sync/async requests separately, and it solves this
> > > buffered-write-starves-read problem. I would like to post it soon
> > > after doing some more test.
> > >
> > > > On top of that can you please give some details how increasing the
> > > > buffered queue length reduces the impact of writers?
> > >
> > > When the number of in-flight IOs exceeds io_limit, processes which are
> > > going to issue IOs are made sleep by dm-ioband until all the in-flight
> > > IOs are finished. But IO scheduler layer can accept IO requests more
> > > than the value of io_limit, so it was a bottleneck of the throughput.
> > >
> >
> > Ok, so it should have been throughput bottleneck but how did it solve the
> > issue of writer starving the reader as you had mentioned in the mail.
>
> As wrote above, I modified dm-ioband to handle sync/async requests
> separately, so even if writers do a lot of buffered IOs, readers can
> issue IOs regardless writers' busyness. Once the IOs are backlogged
> for throttling, the both sync and async requests are issued according
> to the other of arrival.
>

Ok, so if both the readers and the writers are buffered and some tokens
become available, will these tokens be divided half and half between the
reader and writer queues?

> > Secondly, you mentioned that processes are made to sleep once we cross
> > io_limit. This sounds like request descriptor facility on requeust queue
> > where processes are made to sleep.
> >
> > There are threads in kernel which don't want to sleep while submitting
> > bios. For example, btrfs has bio submitting thread which does not want
> > to sleep hence it checks with device if it is congested or not and not
> > submit the bio if it is congested. How would you handle such cases. Have
> > you implemented any per group congestion kind of interface to make sure
> > such IO's don't sleep if group is congested.
> >
> > Or this limit is per ioband device which every group on the device is
> > sharing. If yes, then how would you provide isolation between groups
> > because if one groups consumes io_limit tokens, then other will simply
> > be serialized on that device?
>
> There are two kind of limit and both limit the number of IO requests
> which can be issued simultaneously, but one is for per ioband device,
> the other is for per ioband group. The per group limit assigned to
> each group is calculated by dividing io_limit according to their
> proportion of weight.
>
> The kernel thread is not made to sleep by the per group limit, because
> several kinds of kernel threads submit IOs from multiple groups and
> for multiple devices in a single thread. At this time, the kernel
> thread is made to sleep by the per device limit only.
>

Interesting. Actually not blocking kernel threads on the per-group limit
and instead blocking them only on the per-device limits sounds like a good idea.

I can also do something similar, and that will take away the need to
export a per-group congestion interface to higher layers and reduce
complexity. If some kernel thread does not want to block, it will
continue to use the existing per device/bdi congestion interface.

Thanks
Vivek

2009-10-01 18:58:14

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Thu, Oct 01 2009, Mike Galbraith wrote:
> > CIC_SEEK_THR is 8K jiffies so that would be 8seconds on 1000HZ system. Try
> > using one "slice_idle" period of 8 ms. But it might turn out to be too
> > short depending on the disk speed.
>
> Yeah, it is too short, as is even _400_ ms. Trouble is, by the time
> some new task is determined to be seeky, the damage is already done.
>
> The below does better, though not as well as "just say no to overload"
> of course ;-)

So this essentially takes the "avoid impact from previous slice" to a
new extreme, by idling even before dispatching requests from the new
queue. We basically do two things to prevent this already - one is to
only set the slice when the first request is actually serviced, and the
other is to drain async requests completely before starting sync ones.
I'm a bit surprised that the former doesn't solve the problem fully, I
guess what happens is that if the drive has been flooded with writes, it
may service the new read immediately and then return to finish emptying
its writeback cache. This will cause an impact for any sync IO until
that cache is flushed, and then cause that sync queue to not get as much
service as it should have.

Perhaps the "set slice on first complete" isn't working correctly? Or
perhaps we just need to be more extreme.

--
Jens Axboe

2009-10-02 02:58:18

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Thu, Oct 01, 2009 at 09:31:09AM -0400, Vivek Goyal wrote:
> On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> > Vivek Goyal <[email protected]> wrote:
> > > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > > > Hi Vivek,
> > > >
> > > > Vivek Goyal <[email protected]> wrote:
> > > > > I was thinking that elevator layer will do the merge of bios. So IO
> > > > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > > > into the disk and again timestamp with finish time once request finishes.
> > > > >
> > > > > This way higher layer can get an idea how much disk time a group of bios
> > > > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > > > then time accounting becomes an issue.
> > > > >
> > > > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > > > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume
> > > > > time elapsed between each of milestones is t. Also assume that all these
> > > > > requests are from same queue/group.
> > > > >
> > > > > t0 t1 t2 t3 t4 t5 t6 t7
> > > > > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4
> > > > >
> > > > > Now higher layer will think that time consumed by group is:
> > > > >
> > > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > > >
> > > > > But the time elapsed is only 7t.
> > > >
> > > > IO controller can know how many requests are issued and still in
> > > > progress. Is it not enough to accumulate the time while in-flight IOs
> > > > exist?
> > > >
> > >
> > > That time would not reflect disk time used. It will be follwoing.
> > >
> > > (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> > > (time spent in disk)
> >
> > In the case where multiple IO requests are issued from IO controller,
> > that time measurement is the time from when the first IO request is
> > issued until when the endio is called for the last IO request. Does
> > not it reflect disk time?
> >
>
> Not accurately as it will be including the time spent in CFQ queues as
> well as dispatch queue. I will not worry much about dispatch queue time
> but time spent CFQ queues can be significant.
>
> This is assuming that you are using token based scheme and will be
> dispatching requests from multiple groups at the same time.
>

Thinking more about it...

Does time based fairness make sense at higher level logical devices?

- Time based fairness generally helps with rotational devices, which have
high seek costs. At a higher level we don't even know the nature
of the underlying device where the IO will ultimately go.

- For time based fairness to work accurately at a higher level, it will
most likely require dispatching from a single group at a time, waiting for
requests from that group to complete, and then dispatching from the next.
Something like the CFQ queue model.

Dispatching from a single queue/group works well in the case of a single
underlying device where CFQ is operating, but for higher level devices,
which typically have multiple physical devices under them, it might not
make sense, as it makes things more linear and reduces parallel
processing further. So dispatching from a single group at a time
and waiting before we dispatch from the next group will most likely be a
throughput killer for higher level devices and might not make sense.

If we don't adopt the policy of dispatching from a single group, then we run
into all the issues of weak isolation between groups, higher latencies,
preemptions across groups, etc.

The more I think about the whole issue and the desired set of requirements,
the more I am convinced that we probably need two IO controlling mechanisms:
one which focuses purely on providing bandwidth fairness numbers on high
level devices, and another which works at low level devices with CFQ
and provides good bandwidth shaping, strong isolation, fairness
with-in a group and good control on latencies.

The higher level controller will not worry about time based policies. It can
implement max BW and proportional BW control based on the size and
number of IOs.
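
A minimal token-bucket sketch of size-based max-BW throttling, as such a
higher level controller might do it (standalone C; the rate, the tick
granularity and the 4K IO size are all assumed for illustration):

#include <stdio.h>

struct bw_group {
        long rate;     /* bytes added per tick (max BW) */
        long tokens;   /* currently available bytes */
};

static void tick(struct bw_group *g)
{
        g->tokens += g->rate;
}

/* returns 1 if the IO may be submitted now, 0 if it must be throttled */
static int may_submit(struct bw_group *g, long bytes)
{
        if (g->tokens < bytes)
                return 0;
        g->tokens -= bytes;
        return 1;
}

int main(void)
{
        struct bw_group g = { .rate = 4096, .tokens = 0 };
        int t, submitted = 0;

        for (t = 0; t < 10; t++) {
                tick(&g);
                while (may_submit(&g, 4096))   /* 4K IOs */
                        submitted++;
        }
        printf("submitted %d IOs in 10 ticks\n", submitted);
        return 0;
}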

The lower level controller at the CFQ level will implement time based group
scheduling. Keeping it at the low level has the advantage of better
utilization of hardware in various dm/md configurations (as no throttling
takes place at the higher level), but at the cost of less strict fairness
numbers at the higher level. So those who want strict fairness numbers at
higher level devices, irrespective of the shortcomings, can use the higher
level controller. Others can stick to the lower level controller.

For buffered write control we anyway have to either do something in the
memory controller or come up with another cgroup controller which throttles
IO before it goes into the cache. Or, in fact, we can have a re-look at
Andrea Righi's controller, which provided max BW and throttled buffered
writes before they got into the page cache, and try to provide proportional
BW there as well.

Basically I see the space for two IO controllers. At the moment I can't
think of a way of coming up with a single controller which satisfies all
the requirements. So instead provide two and let the user choose one based
on his needs.

Any thoughts?

Before finishing this mail, I will throw a wacky idea into the ring. I was
going through the request based dm-multipath paper. Would it make sense
to implement a request based dm-ioband? So basically we implement all the
group scheduling in CFQ and let dm-ioband implement a request function
to take the request and break it back into bios. This way we can keep
all the group control in one place and also meet most of the requirements.

So a request based dm-ioband will have a request in hand once that request
has passed group control and prio control. Because dm-ioband is a device
mapper target, one can put it on higher level devices (practically taking
CFQ to the higher level device) and provide fairness there. One can also
put it on those SSDs which don't use an IO scheduler (this is kind of
forcing them to use one).

I am sure there will be many issues, but one big issue I can think of is
that CFQ thinks there is one device beneath it and dispatches requests
from one queue (in the case of idling), and that would kill parallelism at
the higher layer, and throughput would suffer on many dm/md configurations.

Thanks
Vivek

> But if you figure out a way that you dispatch requests from one group only
> at one time and wait for all requests to finish and then let next group
> go, then above can work fairly accurately. In that case it will become
> like CFQ with the only difference that effectively we have one queue per
> group instread of per process.
>
> > > > > Secondly if a different group is running only single sequential reader,
> > > > > there CFQ will be driving queue depth of 1 and time will not be running
> > > > > faster and this inaccuracy in accounting will lead to unfair share between
> > > > > groups.
> > > > >
> > > > > So we need something better to get a sense which group used how much of
> > > > > disk time.
> > > >
> > > > It could be solved by implementing the way to pass on such information
> > > > from IO scheduler to higher layer controller.
> > > >
> > >
> > > How would you do that? Can you give some details exactly how and what
> > > information IO scheduler will pass to higher level IO controller so that IO
> > > controller can attribute right time to the group.
> >
> > If you would like to know when the idle timer is expired, how about
> > adding a function to IO controller to be notified it from IO
> > scheduler? IO scheduler calls the function when the timer is expired.
> >
>
> This probably can be done. So this is like syncing between lower layers
> and higher layers about when do we start idling and when do we stop it and
> both the layers should be in sync.
>
> This is something my common layer approach does. Becuase it is so close to
> IO scheuler, I can do it relatively easily.
>
> One probably can create interfaces to even propogate this information up.
> But this all will probably come into the picture only if we don't use
> token based schemes and come up with something where at one point of time
> dispatch are from one group only.
>
> > > > > > How about making throttling policy be user selectable like the IO
> > > > > > scheduler and putting it in the higher layer? So we could support
> > > > > > all of policies (time-based, size-based and rate limiting). There
> > > > > > seems not to only one solution which satisfies all users. But I agree
> > > > > > with starting with proportional bandwidth control first.
> > > > > >
> > > > >
> > > > > What are the cases where time based policy does not work and size based
> > > > > policy works better and user would choose size based policy and not timed
> > > > > based one?
> > > >
> > > > I think that disk time is not simply proportional to IO size. If there
> > > > are two groups whose wights are equally assigned and they issue
> > > > different sized IOs repsectively, the bandwidth of each group would
> > > > not distributed equally as expected.
> > > >
> > >
> > > If we are providing fairness in terms of time, it is fair. If we provide
> > > equal time slots to two processes and if one got more IO done because it
> > > was not wasting time seeking or it issued bigger size IO, it deserves that
> > > higher BW. IO controller will make sure that process gets fair share in
> > > terms of time and exactly how much BW one got will depend on the workload.
> > >
> > > That's the precise reason that fairness in terms of time is better on
> > > seeky media.
> >
> > If the seek time is negligible, the bandwidth would not be distributed
> > according to a proportion of weight settings. I think that it would be
> > unclear for users to understand how bandwidth is distributed. And I
> > also think that seeky media would gradually become obsolete,
> >
>
> I can understand that if lesser the seek cost game starts changing and
> probably a size based policy also work decently.
>
> In that case at some point of time probably CFQ will also need to support
> another mode/policy where fairness is provided in terms of size of IO, if
> it detects a SSD with hardware queuing. Currently it seem to be disabling
> the idling in that case. But this is not very good from fairness point of
> view. I guess if CFQ wants to provide fairness in such cases, it needs to
> dynamically change the shape and start thinking in terms of size of IO.
>
> So far my testing has been very limited to hard disks connected to my
> computer. I will do some testing on high end enterprise storage and see
> how much do seek matter and how well both the implementations work.
>
> > > > > I am not against implementing things in higher layer as long as we can
> > > > > ensure tight control on latencies, strong isolation between groups and
> > > > > not break CFQ's class and ioprio model with-in group.
> > > > >
> > > > > > BTW, I will start to reimplement dm-ioband into block layer.
> > > > >
> > > > > Can you elaborate little bit on this?
> > > >
> > > > bio is grabbed in generic_make_request() and throttled as well as
> > > > dm-ioband's mechanism. dmsetup command is not necessary any longer.
> > > >
> > >
> > > Ok, so one would not need dm-ioband device now, but same dm-ioband
> > > throttling policies will apply. So until and unless we figure out a
> > > better way, the issues I have pointed out will still exists even in
> > > new implementation.
> >
> > Yes, those still exist, but somehow I would like to try to solve them.
> >
> > > > The default value of io_limit on the previous test was 128 (not 192)
> > > > which is equall to the default value of nr_request.
> > >
> > > Hm..., I used following commands to create two ioband devices.
> > >
> > > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> > > "weight 0 :100" | dmsetup create ioband1
> > > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> > > "weight 0 :100" | dmsetup create ioband2
> > >
> > > Here io_limit value is zero so it should pick default value. Following is
> > > output of "dmsetup table" command.
> > >
> > > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> > > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
> > > ^^^^
> > > IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> > > to be 192?
> >
> > The default vaule has changed since v1.12.0 and increased from 128 to 192.
> >
> > > > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > > > writes.
> > > >
> > > > O.K. You removed "conv=fdatasync", the new dm-ioband handles
> > > > sync/async requests separately, and it solves this
> > > > buffered-write-starves-read problem. I would like to post it soon
> > > > after doing some more test.
> > > >
> > > > > On top of that can you please give some details how increasing the
> > > > > buffered queue length reduces the impact of writers?
> > > >
> > > > When the number of in-flight IOs exceeds io_limit, processes which are
> > > > going to issue IOs are made sleep by dm-ioband until all the in-flight
> > > > IOs are finished. But IO scheduler layer can accept IO requests more
> > > > than the value of io_limit, so it was a bottleneck of the throughput.
> > > >
> > >
> > > Ok, so it should have been throughput bottleneck but how did it solve the
> > > issue of writer starving the reader as you had mentioned in the mail.
> >
> > As wrote above, I modified dm-ioband to handle sync/async requests
> > separately, so even if writers do a lot of buffered IOs, readers can
> > issue IOs regardless writers' busyness. Once the IOs are backlogged
> > for throttling, the both sync and async requests are issued according
> > to the other of arrival.
> >
>
> Ok, so if both the readers and writers are buffered and some tokens become
> available then these tokens will be divided half and half between readers
> or writer queues?
>
> > > Secondly, you mentioned that processes are made to sleep once we cross
> > > io_limit. This sounds like request descriptor facility on requeust queue
> > > where processes are made to sleep.
> > >
> > > There are threads in kernel which don't want to sleep while submitting
> > > bios. For example, btrfs has bio submitting thread which does not want
> > > to sleep hence it checks with device if it is congested or not and not
> > > submit the bio if it is congested. How would you handle such cases. Have
> > > you implemented any per group congestion kind of interface to make sure
> > > such IO's don't sleep if group is congested.
> > >
> > > Or this limit is per ioband device which every group on the device is
> > > sharing. If yes, then how would you provide isolation between groups
> > > because if one groups consumes io_limit tokens, then other will simply
> > > be serialized on that device?
> >
> > There are two kind of limit and both limit the number of IO requests
> > which can be issued simultaneously, but one is for per ioband device,
> > the other is for per ioband group. The per group limit assigned to
> > each group is calculated by dividing io_limit according to their
> > proportion of weight.
> >
> > The kernel thread is not made to sleep by the per group limit, because
> > several kinds of kernel threads submit IOs from multiple groups and
> > for multiple devices in a single thread. At this time, the kernel
> > thread is made to sleep by the per device limit only.
> >
>
> Interesting. Actually not blocking kernel threads on per group limit
> and instead blocking it only on per device limts sounds like a good idea.
>
> I can also do something similar and that will take away the need of
> exporting per group congestion interface to higher layers and reduce
> complexity. If some kernel thread does not want to block, these will
> continue to use existing per device/bdi congestion interface.
>
> Thanks
> Vivek

2009-10-02 06:23:57

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Thu, 2009-10-01 at 20:58 +0200, Jens Axboe wrote:
> On Thu, Oct 01 2009, Mike Galbraith wrote:
> > > CIC_SEEK_THR is 8K jiffies so that would be 8seconds on 1000HZ system. Try
> > > using one "slice_idle" period of 8 ms. But it might turn out to be too
> > > short depending on the disk speed.
> >
> > Yeah, it is too short, as is even _400_ ms. Trouble is, by the time
> > some new task is determined to be seeky, the damage is already done.
> >
> > The below does better, though not as well as "just say no to overload"
> > of course ;-)
>
> So this essentially takes the "avoid impact from previous slice" to a
> new extreme, but idling even before dispatching requests from the new
> queue. We basically do two things to prevent this already - one is to
> only set the slice when the first request is actually serviced, and the
> other is to drain async requests completely before starting sync ones.
> I'm a bit surprised that the former doesn't solve the problem fully, I
> guess what happens is that if the drive has been flooded with writes, it
> may service the new read immediately and then return to finish emptying
> its writeback cache. This will cause an impact for any sync IO until
> that cache is flushed, and then cause that sync queue to not get as much
> service as it should have.

I based the stamping selection (other than "how long have we been solo") on
these possibly wrong speculations:

If we're in the idle window and doing the async drain thing, we're at
the spot where Vivek's patch helps a ton. It seemed like a great time to
limit the size of any IO that may land in front of my sync reader to the
plain "you are not alone" quantity.

If we've got sync io in flight, that should mean that my new or old
known seeky queue has been serviced at least once. There's likely to be
more on the way, so delay overloading then too.

The seeky bit is supposed to be the earlier "last time we saw a seeker"
thing, but known seeky is too late to help a new task at all unless you
turn off the overloading for ages, so I added the "if incalculable" check
for good measure, hoping that meant the task is new and may want to exec.

Stamping in any of these places may (see below) possibly limit the size of
the IO the reader can generate as well as the writer, but I figured what's
good for the goose is good for the gander, or it ain't really good. The
overload was causing the observed pain, and it definitely ain't good for
both at these times at least, so don't let it do that.

> Perhaps the "set slice on first complete" isn't working correctly? Or
> perhaps we just need to be more extreme.

Dunno, I was just tossing rocks and sticks at it.

I don't really understand the reasoning behind overloading: I can see
that it allows cutting thicker slabs for the disk, but in the streaming
writer vs reader case, it seems only the writers can do that. The reader
is unlikely to be alone, isn't it? Seems to me that either dd, a flusher
thread or kjournald is going to be there with it, which gives dd a huge
advantage.. it has two proxies to help it squabble over the disk, konsole
has none.

-Mike

2009-10-02 08:04:18

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Thu, 2009-10-01 at 20:58 +0200, Jens Axboe wrote:
> > On Thu, Oct 01 2009, Mike Galbraith wrote:
> > > > CIC_SEEK_THR is 8K jiffies so that would be 8seconds on 1000HZ system. Try
> > > > using one "slice_idle" period of 8 ms. But it might turn out to be too
> > > > short depending on the disk speed.
> > >
> > > Yeah, it is too short, as is even _400_ ms. Trouble is, by the time
> > > some new task is determined to be seeky, the damage is already done.
> > >
> > > The below does better, though not as well as "just say no to overload"
> > > of course ;-)
> >
> > So this essentially takes the "avoid impact from previous slice" to a
> > new extreme, but idling even before dispatching requests from the new
> > queue. We basically do two things to prevent this already - one is to
> > only set the slice when the first request is actually serviced, and the
> > other is to drain async requests completely before starting sync ones.
> > I'm a bit surprised that the former doesn't solve the problem fully, I
> > guess what happens is that if the drive has been flooded with writes, it
> > may service the new read immediately and then return to finish emptying
> > its writeback cache. This will cause an impact for any sync IO until
> > that cache is flushed, and then cause that sync queue to not get as much
> > service as it should have.
>
> I did the stamping selection other than how long have we been solo based
> on these possibly wrong speculations:
>
> If we're in the idle window and doing the async drain thing, we're at
> the spot where Vivek's patch helps a ton. Seemed like a great time to
> limit the size of any io that may land in front of my sync reader to
> plain "you are not alone" quantity.

You can't be in the idle window and doing async drain at the same time,
the idle window doesn't start until the sync queue has completed a
request. Hence my above rant on device interference.

> If we've got sync io in flight, that should mean that my new or old
> known seeky queue has been serviced at least once. There's likely to be
> more on the way, so delay overloading then too.
>
> The seeky bit is supposed to be the earlier "last time we saw a seeker"
> thing, but known seeky is too late to help a new task at all unless you
> turn off the overloading for ages, so I added the if incalculable check
> for good measure, hoping that meant the task is new, may want to exec.
>
> Stamping any place may (see below) possibly limit the size of the io the
> reader can generate as well as writer, but I figured what's good for the
> goose is good for the gander, or it ain't really good. The overload
> was causing the observed pain, definitely ain't good for both at these
> times at least, so don't let it do that.
>
> > Perhaps the "set slice on first complete" isn't working correctly? Or
> > perhaps we just need to be more extreme.
>
> Dunno, I was just tossing rocks and sticks at it.
>
> I don't really understand the reasoning behind overloading: I can see
> that allows cutting thicker slabs for the disk, but with the streaming
> writer vs reader case, seems only the writers can do that. The reader
> is unlikely to be alone isn't it? Seems to me that either dd, a flusher
> thread or kjournald is going to be there with it, which gives dd a huge
> advantage.. it has two proxies to help it squabble over disk, konsole
> has none.

That is true, async queues have a huge advantage over sync ones. But
sync vs async is only part of it; any combination of queued sync, queued
sync random, etc. has different ramifications on the behaviour of the
individual queue.

It's not hard to make the latency good, the hard bit is making sure we
also perform well for all other scenarios.

--
Jens Axboe

2009-10-02 08:53:37

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 10:04 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Mike Galbraith wrote:

> > If we're in the idle window and doing the async drain thing, we've at
> > the spot where Vivek's patch helps a ton. Seemed like a great time to
> > limit the size of any io that may land in front of my sync reader to
> > plain "you are not alone" quantity.
>
> You can't be in the idle window and doing async drain at the same time,
> the idle window doesn't start until the sync queue has completed a
> request. Hence my above rant on device interference.

I'll take your word for it.

	/*
	 * Drain async requests before we start sync IO
	 */
	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])

Looked about the same to me as..

enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

..where Vivek prevented turning 1 into 0, so I stamped it ;-)

> > Dunno, I was just tossing rocks and sticks at it.
> >
> > I don't really understand the reasoning behind overloading: I can see
> > that allows cutting thicker slabs for the disk, but with the streaming
> > writer vs reader case, seems only the writers can do that. The reader
> > is unlikely to be alone isn't it? Seems to me that either dd, a flusher
> > thread or kjournald is going to be there with it, which gives dd a huge
> > advantage.. it has two proxies to help it squabble over disk, konsole
> > has none.
>
> That is true, async queues have a huge advantage over sync ones. But
> sync vs async is only part of it, any combination of queued sync, queued
> sync random etc have different ramifications on behaviour of the
> individual queue.
>
> It's not hard to make the latency good, the hard bit is making sure we
> also perform well for all other scenarios.

Yeah, that's why I'm trying to be careful about what I say, I know full
well this ain't easy to get right. I'm not even thinking of submitting
anything, it's just diagnostic testing.

WRT my who can overload theory, I instrumented for my own edification.

Overload totally forbidden, stamps ergo disabled.

fairness=0 11.3 avg (ie == virgin source)
fairness=1 2.8 avg

Back to virgin settings, instrument who is overloading during sequences of..
echo 2 > /proc/sys/vm/drop_caches
sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE
..with dd continually running.

1 second counts for above.
...
[ 916.585880] od_sync: 0 od_async: 87 reject_sync: 0 reject_async: 37
[ 917.662585] od_sync: 0 od_async: 126 reject_sync: 0 reject_async: 53
[ 918.732872] od_sync: 0 od_async: 96 reject_sync: 0 reject_async: 22
[ 919.743730] od_sync: 0 od_async: 75 reject_sync: 0 reject_async: 15
[ 920.914549] od_sync: 0 od_async: 81 reject_sync: 0 reject_async: 17
[ 921.988198] od_sync: 0 od_async: 123 reject_sync: 0 reject_async: 30
...minutes long

(reject == cfqq->dispatched >= 4 * max_dispatch)

Doing the same with firefox, I did see the burst below one time, dunno
what triggered that. I watched 6 runs, and only saw such a burst once.
Typically, numbers are the same as konsole, with a very rare 4 or
5 for sync sneaking in.

[ 1988.177758] od_sync: 0 od_async: 104 reject_sync: 0 reject_async: 48
[ 1992.291779] od_sync: 19 od_async: 83 reject_sync: 0 reject_async: 82
[ 1993.300850] od_sync: 79 od_async: 0 reject_sync: 28 reject_async: 0
[ 1994.313327] od_sync: 147 od_async: 104 reject_sync: 90 reject_async: 16
[ 1995.378025] od_sync: 14 od_async: 45 reject_sync: 0 reject_async: 2
[ 1996.456871] od_sync: 15 od_async: 74 reject_sync: 1 reject_async: 7
[ 1997.611226] od_sync: 0 od_async: 84 reject_sync: 0 reject_async: 14

Never noticed a sync overload watching a make -j4 for a couple minutes.
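
The counter names above suggest instrumentation of roughly the following
shape; this is a hypothetical reconstruction from the log format, not the
code that was actually used:

	/*
	 * Hypothetical reconstruction of the instrumentation behind the
	 * "od_sync/od_async/reject_sync/reject_async" lines: count how often
	 * a sync or async queue was allowed to dispatch beyond its quantum
	 * ("overload") versus refused because
	 * cfqq->dispatched >= 4 * max_dispatch, and dump the counts once a
	 * second.
	 */
	static unsigned long od_cnt[2], reject_cnt[2];	/* [0] sync, [1] async */
	static unsigned long last_dump;

	static void account_overload(struct cfq_queue *cfqq, bool sync,
				     unsigned int max_dispatch)
	{
		if (cfqq->dispatched >= 4 * max_dispatch)
			reject_cnt[!sync]++;
		else
			od_cnt[!sync]++;

		if (time_after(jiffies, last_dump + HZ)) {
			printk(KERN_INFO "od_sync: %lu od_async: %lu "
			       "reject_sync: %lu reject_async: %lu\n",
			       od_cnt[0], od_cnt[1],
			       reject_cnt[0], reject_cnt[1]);
			od_cnt[0] = od_cnt[1] = 0;
			reject_cnt[0] = reject_cnt[1] = 0;
			last_dump = jiffies;
		}
	}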

2009-10-02 09:00:45

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10


> WRT my who can overload theory, I instrumented for my own edification.
>
> Overload totally forbidden, stamps ergo disabled.
>
> fairness=0 11.3 avg (ie == virgin source)
> fairness=1 2.8 avg

(oops, quantum was set to 16 as well there. not that it matters, but
for completeness)

2009-10-02 09:24:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10


* Jens Axboe <[email protected]> wrote:

> It's not hard to make the latency good, the hard bit is making sure we
> also perform well for all other scenarios.

Looking at the numbers from Mike:

| dd competing against perf stat -- konsole -e exec timings, 5 back to
| back runs
| Avg
| before 9.15 14.51 9.39 15.06 9.90 11.6
| after [+patch] 1.76 1.54 1.93 1.88 1.56 1.7

_PLEASE_ make read latencies this good - the numbers are _vastly_
better. We'll worry about the 'other' things _after_ we've reached good
latencies.

I thought this principle was a well established basic rule of Linux IO
scheduling. Why do we have to have a 'latency vs. bandwidth' discussion
again and again? I thought latency won hands down.

Ingo

2009-10-02 09:28:38

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
>
> Looking at the numbers from Mike:
>
> | dd competing against perf stat -- konsole -e exec timings, 5 back to
> | back runs
> | Avg
> | before 9.15 14.51 9.39 15.06 9.90 11.6
> | after [+patch] 1.76 1.54 1.93 1.88 1.56 1.7
>
> _PLEASE_ make read latencies this good - the numbers are _vastly_
> better. We'll worry about the 'other' things _after_ we've reached good
> latencies.
>
> I thought this principle was a well established basic rule of Linux IO
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion
> again and again? I thought latency won hands down.

It's really not that simple, if we go and do easy latency bits, then
throughput drops 30% or more. You can't say it's black and white latency
vs throughput issue, that's just not how the real world works. The
server folks would be most unpleased.

--
Jens Axboe

2009-10-02 09:37:00

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 11:24 +0200, Ingo Molnar wrote:
> * Jens Axboe <[email protected]> wrote:
>
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
>
> Looking at the numbers from Mike:
>
> | dd competing against perf stat -- konsole -e exec timings, 5 back to
> | back runs
> | Avg
> | before 9.15 14.51 9.39 15.06 9.90 11.6
> | after [+patch] 1.76 1.54 1.93 1.88 1.56 1.7
>
> _PLEASE_ make read latencies this good - the numbers are _vastly_
> better. We'll worry about the 'other' things _after_ we've reached good
> latencies.
>
> I thought this principle was a well established basic rule of Linux IO
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion
> again and again? I thought latency won hands down.

Just a note: In the testing I've done so far, we're better off today
than ever, and I can't recall beating on root ever being anything less
than agony for interactivity. IO seekers look a lot like CPU sleepers
to me. Looks like both can be as annoying as hell ;-)

-Mike

2009-10-02 09:55:54

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 10:04 +0200, Jens Axboe wrote:
> > On Fri, Oct 02 2009, Mike Galbraith wrote:
>
> > > If we're in the idle window and doing the async drain thing, we've at
> > > the spot where Vivek's patch helps a ton. Seemed like a great time to
> > > limit the size of any io that may land in front of my sync reader to
> > > plain "you are not alone" quantity.
> >
> > You can't be in the idle window and doing async drain at the same time,
> > the idle window doesn't start until the sync queue has completed a
> > request. Hence my above rant on device interference.
>
> I'll take your word for it.
>
> /*
> * Drain async requests before we start sync IO
> */
> if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
>
> Looked about the same to me as..
>
> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
> ..where Vivek prevented turning 1 into 0, so I stamped it ;-)

cfq_cfqq_idle_window(cfqq) just tells you whether this queue may enter
idling, not that it is currently idling. The actual idling happens from
cfq_completed_request(), here:

	else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
		 sync && !rq_noidle(rq))
		cfq_arm_slice_timer(cfqd);

and after that the queue will be marked as waiting, so
cfq_cfqq_wait_request(cfqq) is a better indication of whether we are
currently waiting for a request (idling) or not.
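
As a sketch of the distinction (illustrative only, not from either patch),
acting only while actually idling for the sync queue would mean checking
the wait flag instead:

	/*
	 * cfq_cfqq_idle_window() == "this queue is allowed to idle",
	 * cfq_cfqq_wait_request() == "we are idling for this queue right now".
	 */
	if (cfq_cfqq_wait_request(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
		/* ... drain/stamping logic would go here ... */;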

> > > Dunno, I was just tossing rocks and sticks at it.
> > >
> > > I don't really understand the reasoning behind overloading: I can see
> > > that allows cutting thicker slabs for the disk, but with the streaming
> > > writer vs reader case, seems only the writers can do that. The reader
> > > is unlikely to be alone isn't it? Seems to me that either dd, a flusher
> > > thread or kjournald is going to be there with it, which gives dd a huge
> > > advantage.. it has two proxies to help it squabble over disk, konsole
> > > has none.
> >
> > That is true, async queues have a huge advantage over sync ones. But
> > sync vs async is only part of it, any combination of queued sync, queued
> > sync random etc have different ramifications on behaviour of the
> > individual queue.
> >
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
>
> Yeah, that's why I'm trying to be careful about what I say, I know full
> well this ain't easy to get right. I'm not even thinking of submitting
> anything, it's just diagnostic testing.

It's much appreciated btw, if we can make this better without killing
throughput, then I'm surely interested in picking up your interesting
bits and getting them massaged into something we can include. So don't
be discouraged, I'm just being realistic :-)


--
Jens Axboe

2009-10-02 10:55:32

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Jens,
On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <[email protected]> wrote:
> On Fri, Oct 02 2009, Ingo Molnar wrote:
>>
>> * Jens Axboe <[email protected]> wrote:
>>
>
> It's really not that simple, if we go and do easy latency bits, then
> throughput drops 30% or more. You can't say it's black and white latency
> vs throughput issue, that's just not how the real world works. The
> server folks would be most unpleased.
Could we be more selective when the latency optimization is introduced?

The code that is currently touched by Vivek's patch is:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
	    (cfqd->hw_tag && CIC_SEEKY(cic)))
		enable_idle = 0;

basically, when fairness=1, it becomes just:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
		enable_idle = 0;

Note that, even if we enable idling here, the cfq_arm_slice_timer will use
a different idle window for seeky (2ms) than for normal I/O.

I think that the 2ms idle window is good for a single rotational SATA disk scenario,
even if it supports NCQ. Realistic access times for those disks are still around 8ms
(but it is proportional to seek length), and waiting 2ms to see if we get a nearby
request may pay off, not only in latency and fairness, but also in throughput.
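
For reference, the cfq_arm_slice_timer() logic being referred to is roughly
the following (paraphrased from the 2.6.31 source for context; the exact
code may differ slightly):

	/* idle for slice_idle by default, but only ~2ms for seeky queues */
	sl = cfqd->cfq_slice_idle;
	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));

	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);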

What we don't want to do is to enable idling for NCQ enabled SSDs
(and this is already taken care of in cfq_arm_slice_timer) or for hardware RAIDs.
If we agree that hardware RAIDs should be marked as non-rotational, then that
code could become:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
	    (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
		enable_idle = 0;
	else if (sample_valid(cic->ttime_samples)) {
		unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
		if (cic->ttime_mean > idle_time)
			enable_idle = 0;
		else
			enable_idle = 1;
	}

Thanks,
Corrado

>
> --
> Jens Axboe
>

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-10-02 11:04:24

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Corrado Zoccolo wrote:
> Hi Jens,
> On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <[email protected]> wrote:
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> >>
> >> * Jens Axboe <[email protected]> wrote:
> >>
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more. You can't say it's black and white latency
> > vs throughput issue, that's just not how the real world works. The
> > server folks would be most unpleased.
> Could we be more selective when the latency optimization is introduced?
>
> The code that is currently touched by Vivek's patch is:
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> (cfqd->hw_tag && CIC_SEEKY(cic)))
> enable_idle = 0;
> basically, when fairness=1, it becomes just:
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
> enable_idle = 0;
>
> Note that, even if we enable idling here, the cfq_arm_slice_timer will use
> a different idle window for seeky (2ms) than for normal I/O.
>
> I think that the 2ms idle window is good for a single rotational SATA
> disk scenario, even if it supports NCQ. Realistic access times for
> those disks are still around 8ms (but it is proportional to seek
> length), and waiting 2ms to see if we get a nearby request may pay
> off, not only in latency and fairness, but also in throughput.

I agree, that change looks good.

> What we don't want to do is to enable idling for NCQ enabled SSDs
> (and this is already taken care of in cfq_arm_slice_timer) or for hardware RAIDs.

Right, it was part of the bigger SSD optimization stuff I did a few
revisions back.

> If we agree that hardware RAIDs should be marked as non-rotational, then that
> code could become:
>
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
> enable_idle = 0;
> else if (sample_valid(cic->ttime_samples)) {
> unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
> if (cic->ttime_mean > idle_time)
> enable_idle = 0;
> else
> enable_idle = 1;
> }

Yes agree on that too. We probably should make a different flag for
hardware raids, telling the io scheduler that this device is really
composed of several others. If it's composed only of SSDs (or has a
front end similar to that), then non-rotational applies.

But yes, we should pass that information down.

--
Jens Axboe

2009-10-02 12:22:23

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 11:55 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Mike Galbraith wrote:
> >
> > /*
> > * Drain async requests before we start sync IO
> > */
> > if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> >
> > Looked about the same to me as..
> >
> > enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
> >
> > ..where Vivek prevented turning 1 into 0, so I stamped it ;-)
>
> cfq_cfqq_idle_window(cfqq) just tells you whether this queue may enter
> idling, not that it is currently idling. The actual idling happens from
> cfq_completed_request(), here:
>
> else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
> sync && !rq_noidle(rq))
> cfq_arm_slice_timer(cfqd);
>
> and after that the queue will be marked as waiting, so
> cfq_cfqq_wait_request(cfqq) is a better indication of whether we are
> currently waiting for a request (idling) or not.

Hm. Then cfq_cfqq_idle_window(cfqq) actually suits my intent better.

(If I want to reduce async's advantage, I should target specifically, ie
only stamp if this queue is a sync queue....otoh, if this queue is sync,
it is now officially too late, whereas if this queue is dd about to
inflict the wrath of kjournald on my reader's world, stamping now is a
really good idea.. scritch scritch scritch <smoke>)

I'll go tinker with it. Thanks for the clue.

-Mike

2009-10-02 12:50:22

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> Hi Jens,
> On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <[email protected]> wrote:
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> >>
> >> * Jens Axboe <[email protected]> wrote:
> >>
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more. You can't say it's black and white latency
> > vs throughput issue, that's just not how the real world works. The
> > server folks would be most unpleased.
> Could we be more selective when the latency optimization is introduced?
>
> The code that is currently touched by Vivek's patch is:
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> (cfqd->hw_tag && CIC_SEEKY(cic)))
> enable_idle = 0;
> basically, when fairness=1, it becomes just:
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
> enable_idle = 0;
>

Actually I am not touching this code. Looking at V10, I have not
changed anything in the idling code here.

I think we are seeing latency improvements with fairness=1 because CFQ
does pure round-robin and once a seeky reader expires, it is put at the
end of the queue.

I retained the same behavior if fairness=0, but if fairness=1 I don't
put the seeky reader at the end of the queue; instead it gets a vdisktime
based on the disk time it has used. So it should get placed ahead of sync readers.

I think the following code snippet in "elevator-fq.c" is what is making the
difference.

	/*
	 * We don't want to charge more than allocated slice otherwise this
	 * queue can miss one dispatch round doubling max latencies. On the
	 * other hand we don't want to charge less than allocated slice as
	 * we stick to CFQ theme of queue losing its share if it does not
	 * use the slice and moves to the back of service tree (almost).
	 */
	if (!ioq->efqd->fairness)
		queue_charge = allocated_slice;

So if a sync reader consumes 100ms and a seeky reader dispatches only
one request, then in CFQ the seeky reader gets to dispatch its next request
after another 100ms.

With fairness=1, it should get a lower vdisktime when it comes back with a
new request because its last slice usage was less (like CFS sleepers, as Mike
said). But this will make a difference only if there is more than one
process in the system; otherwise a vtime jump will take place by the time
the seeky reader gets backlogged.

Anyway, once I started timestamping the queues and keeping a cache
of expired queues, any queue which gets a new request almost
immediately should get a lower vdisktime assigned if it did not use its
full time slice in the previous dispatch round. Hence with fairness=1,
seeky readers get more of the disk (their fair share), because they
are now placed ahead of streaming readers and hence get better latencies.

In short, most likely, better latencies are being experienced because
the seeky reader is getting a lower timestamp (vdisktime), since it did not
use its full time slice in the previous dispatch round, and not because we
kept idling enabled on the seeky reader.
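
A simplified illustration of the charging difference being described
(helper name invented, weights ignored; this is not the exact
elevator-fq.c code):

	/*
	 * Simplified illustration only: with fairness=0 an expiring queue is
	 * charged the full allocated slice, with fairness=1 only what it
	 * actually used.  A seeky reader that dispatched ~1ms worth of IO
	 * therefore advances its vdisktime by ~1ms instead of 100ms, and
	 * sorts ahead of the streaming reader when it gets backlogged again.
	 */
	static unsigned long ioq_charge(unsigned long allocated_slice,
					unsigned long used_slice, int fairness)
	{
		return fairness ? used_slice : allocated_slice;
	}

	/* e.g. vdisktime += ioq_charge(100, 1, 1) for the seeky reader */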

Thanks
Vivek

> Note that, even if we enable idling here, the cfq_arm_slice_timer will use
> a different idle window for seeky (2ms) than for normal I/O.
>
> I think that the 2ms idle window is good for a single rotational SATA disk scenario,
> even if it supports NCQ. Realistic access times for those disks are still around 8ms
> (but it is proportional to seek lenght), and waiting 2ms to see if we get a nearby
> request may pay off, not only in latency and fairness, but also in throughput.
>
> What we don't want to do is to enable idling for NCQ enabled SSDs
> (and this is already taken care in cfq_arm_slice_timer) or for hardware RAIDs.
> If we agree that hardware RAIDs should be marked as non-rotational, then that
> code could become:
>
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
> enable_idle = 0;
> else if (sample_valid(cic->ttime_samples)) {
> unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
> if (cic->ttime_mean > idle_time)
> enable_idle = 0;
> else
> enable_idle = 1;
> }
>
> Thanks,
> Corrado
>
> >
> > --
> > Jens Axboe
> >
>
> --
> __________________________________________________________________________
>
> dott. Corrado Zoccolo mailto:[email protected]
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------

2009-10-02 14:26:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10



On Fri, 2 Oct 2009, Jens Axboe wrote:
>
> It's really not that simple, if we go and do easy latency bits, then
> throughput drops 30% or more.

Well, if we're talking 500-950% improvement vs 30% deprovement, I think
it's pretty clear, though. Even the server people do care about latencies.

Often they care quite a bit, in fact.

And Mike's patch didn't look big or complicated.

> You can't say it's black and white latency vs throughput issue,

Umm. Almost 1000% vs 30%. Forget latency vs throughput. That's pretty damn
black-and-white _regardless_ of what you're measuring. Plus you probably
made up the 30% - have you tested the patch?

And quite frankly, we get a _lot_ of complaints about latency. A LOT. It's
just harder to measure, so people seldom attach numbers to it. But that
again means that when people _are_ able to attach numbers to it, we should
take those numbers _more_ seriously rather than less.

So the 30% you threw out as a number is pretty much worthless.

Linus

2009-10-02 14:45:52

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 07:24 -0700, Linus Torvalds wrote:
>
> On Fri, 2 Oct 2009, Jens Axboe wrote:
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more.
>
> Well, if we're talking 500-950% improvement vs 30% deprovement, I think
> it's pretty clear, though. Even the server people do care about latencies.
>
> Often they care quite a bit, in fact.
>
> And Mike's patch didn't look big or complicated.

But it is a hack. (thought about and measured, but hack nonetheless)

I haven't tested it on much other than reader vs streaming writer. It
may well destroy the rest of the IO universe. I don't have the hw to
even test any hairy chested IO.

-Mike

2009-10-02 14:56:10

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Linus Torvalds wrote:
>
>
> On Fri, 2 Oct 2009, Jens Axboe wrote:
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more.
>
> Well, if we're talking 500-950% improvement vs 30% deprovement, I think
> it's pretty clear, though. Even the server people do care about latencies.
>
> Often they care quite a bit, in fact.

Mostly they care about throughput, and when they come running because
some favorite app/benchmark/etc of theirs is now 2% slower, I get to hear
about it all the time. So yes, latency is not ignored, but mostly they
yack about throughput.

> And Mike's patch didn't look big or complicated.

It wasn't, it was more of a hack than something mergeable though (and I
think Mike will agree on that). So I'll repeat what I said to Mike, I'm
very well prepared to get something worked out and merged and I very
much appreciate the work he's putting into this.

> > You can't say it's black and white latency vs throughput issue,
>
> Umm. Almost 1000% vs 30%. Forget latency vs throughput. That's pretty damn
> black-and-white _regardless_ of what you're measuring. Plus you probably
> made up the 30% - have you tested the patch?

The 30% is totally made up, it's based on previous latency vs throughput
tradeoffs. I haven't tested Mike's patch.

> And quite frankly, we get a _lot_ of complaints about latency. A LOT. It's
> just harder to measure, so people seldom attach numbers to it. But that
> again means that when people _are_ able to attach numbers to it, we should
> take those numbers _more_ seriously rather than less.

I agree, we can easily make CFQ be very much about latency. If you
think that is fine, then let's just do that. Then we'll get to fix the
server side up when the next RHEL/SLES/whatever cycle is homing in on a
kernel; hopefully we won't have to start over when that happens.

> So the 30% you threw out as a number is pretty much worthless.

It's hand waving, definitely. But I've been doing io scheduler tweaking
for years, and I know how hard it is to balance. If you want latency,
then you basically only ever give the device 1 thing to do. And you let
things cool down before switching over. If you do that, then your nice
big array of SSDs or rotating drives will easily drop to 1/4th of the
original performance. So we try and tweak the logic to make everybody
happy.

In some cases I wish we had a server vs desktop switch, since it would
make decisions on this easier. I know you say that servers care about
latency, but not at all to the extent that desktops do. Most desktop
users would gladly give away the top of the performance for latency,
that's not true of most server users. Depends on what the server does,
of course.

--
Jens Axboe

2009-10-02 14:57:46

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 07:24 -0700, Linus Torvalds wrote:
> >
> > On Fri, 2 Oct 2009, Jens Axboe wrote:
> > >
> > > It's really not that simple, if we go and do easy latency bits, then
> > > throughput drops 30% or more.
> >
> > Well, if we're talking 500-950% improvement vs 30% deprovement, I think
> > it's pretty clear, though. Even the server people do care about latencies.
> >
> > Often they care quite a bit, in fact.
> >
> > And Mike's patch didn't look big or complicated.
>
> But it is a hack. (thought about and measured, but hack nonetheless)
>
> I haven't tested it on much other than reader vs streaming writer. It
> may well destroy the rest of the IO universe. I don't have the hw to
> even test any hairy chested IO.

I'll get a desktop box going on this too. The plan is to make the
latency as good as we can without making too many stupid decisions in
the io scheduler, then we can care about the throughput later. Rinse
and repeat.

--
Jens Axboe

2009-10-02 15:16:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10



On Fri, 2 Oct 2009, Jens Axboe wrote:
>
> Mostly they care about throughput, and when they come running because
> some favorite app/benchmark/etc of theirs is now 2% slower, I get to hear
> about it all the time. So yes, latency is not ignored, but mostly they
> yack about throughput.

The reason they yack about it is that they can measure it.

Give them the benchmark where it goes the other way, and tell them why
they see a 2% deprovement. Give them some button they can tweak, because
they will.

But make the default be low-latency. Because everybody cares about low
latency, and the people who do so are _not_ the people who you give
buttons to tweak things with.

> I agree, we can easily make CFQ be very much about latency. If you
> think that is fine, then let's just do that. Then we'll get to fix the
> server side up when the next RHEL/SLES/whatever cycle is homing in on a
> kernel; hopefully we won't have to start over when that happens.

I really think we should do latency first, and throughput second.

It's _easy_ to get throughput. The people who care just about throughput
can always just disable all the work we do for latency. If they really
care about just throughput, they won't want fairness either - none of that
complex stuff.

Linus

2009-10-02 15:33:03

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <[email protected]> wrote:
> On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
>
> Actually I am not touching this code. Looking at the V10, I have not
> changed anything here in idling code.

I based my analysis on the original patch:
http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html

Mike, can you confirm which version of the fairness patch did you use
in your tests?

Corrado

> Thanks
> Vivek
>

2009-10-02 15:32:20

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02, 2009 at 05:27:55PM +0200, Corrado Zoccolo wrote:
> On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <[email protected]> wrote:
> > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> >
> > Actually I am not touching this code. Looking at the V10, I have not
> > changed anything here in idling code.
>
> I based my analysis on the original patch:
> http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
>

Oh, you are talking about the fairness-for-seeky-processes patch. I thought
you were talking about the current IO controller patches. Actually they both
have this notion of a "fairness=1" parameter but do different things in the
two patch sets, hence the confusion.

Thanks
Vivek


> Mike, can you confirm which version of the fairness patch did you use
> in your tests?
>
> Corrado
>
> > Thanks
> > Vivek
> >

2009-10-02 15:32:09

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote:
> On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <[email protected]> wrote:
> > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> >
> > Actually I am not touching this code. Looking at the V10, I have not
> > changed anything here in idling code.
>
> I based my analysis on the original patch:
> http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
>
> Mike, can you confirm which version of the fairness patch did you use
> in your tests?

That would be this one-liner.

o CFQ provides fair access to disk, in terms of disk time used, to processes.
Fairness is provided for the applications which have their think time within
the slice_idle (8ms default) limit.

o CFQ currently disables idling for seeky processes. So even if a process
has a think time within the slice_idle limit, it will still not get a fair share
of disk. Disabling idling for a seeky process seems good from a throughput
perspective but not necessarily from a fairness perspective.

o Do not disable idling based on the seek pattern of a process if a user has set
/sys/block/<disk>/queue/iosched/fairness = 1.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -1953,7 +1953,7 @@ cfq_update_idle_window(struct cfq_data *
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
+	    (!cfqd->cfq_fairness && cfqd->hw_tag && CIC_SEEKY(cic)))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
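
The hunk above relies on a cfqd->cfq_fairness field and the
/sys/block/<disk>/queue/iosched/fairness knob mentioned in the description;
that plumbing is not shown here, but would presumably be wired up much like
the existing CFQ tunables, along these lines (a guess, not part of the
posted diff):

	/*
	 * Guess at the plumbing the hunk depends on (not part of the quoted
	 * diff), modeled on the existing tunables in block/cfq-iosched.c:
	 */
	SHOW_FUNCTION(cfq_fairness_show, cfqd->cfq_fairness, 0);
	STORE_FUNCTION(cfq_fairness_store, &cfqd->cfq_fairness, 0, 1, 0);

	/* ...plus an entry in cfq_attrs[]: */
	CFQ_ATTR(fairness),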

2009-10-02 15:42:07

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02, 2009 at 05:32:00PM +0200, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote:
> > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <[email protected]> wrote:
> > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> > >
> > > Actually I am not touching this code. Looking at the V10, I have not
> > > changed anything here in idling code.
> >
> > I based my analisys on the original patch:
> > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
> >
> > Mike, can you confirm which version of the fairness patch did you use
> > in your tests?
>
> That would be this one-liner.
>

Ok. Thanks. Sorry, I got confused and thought that you were using the "io
controller patches" with fairness=1.

In that case, Corrado's suggestion of refining it further and disabling idling
for seeky processes only on non-rotational media (SSDs and hardware RAID) makes
sense to me.

Thanks
Vivek

> o CFQ provides fair access to disk, in terms of disk time used, to processes.
> Fairness is provided for the applications which have their think time within
> the slice_idle (8ms default) limit.
>
> o CFQ currently disables idling for seeky processes. So even if a process
> has a think time within the slice_idle limit, it will still not get a fair share
> of disk. Disabling idling for a seeky process seems good from a throughput
> perspective but not necessarily from a fairness perspective.
>
> o Do not disable idling based on the seek pattern of a process if a user has set
> /sys/block/<disk>/queue/iosched/fairness = 1.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> block/cfq-iosched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -1953,7 +1953,7 @@ cfq_update_idle_window(struct cfq_data *
> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> - (cfqd->hw_tag && CIC_SEEKY(cic)))
> + (!cfqd->cfq_fairness && cfqd->hw_tag && CIC_SEEKY(cic)))
> enable_idle = 0;
> else if (sample_valid(cic->ttime_samples)) {
> if (cic->ttime_mean > cfqd->cfq_slice_idle)
>

2009-10-02 16:12:38

by jim owens

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Linus Torvalds wrote:
>
> I really think we should do latency first, and throughput second.

Agree.

> It's _easy_ to get throughput. The people who care just about throughput
> can always just disable all the work we do for latency.

But in my experience it is not that simple...

The argument latency vs throughput or desktop vs server is wrong.

I/O can never keep up with the ability of CPUs to dirty data.

On desktops and servers (really many-user-desktops) we want
minimum latency but the enemy is dirty VM. If we ignore the
need for throughput to flush dirty pages, VM gets angry and
forced VM page cleaning I/O is bad I/O.

We want min latency with low dirty page percent but need to
switch to max write throughput at some high dirty page percent.

We can not prevent the cliff we fall off where the system
chokes because the dirty page load is too high, but if we
only worry about latency, we bring that choke point cliff in
so it happens with a lower load. A 10% lower overload point
might be fine to get 100% better latency, but would desktop
users accept a 50% lower overload point where running one
more application makes the system appear hung?

Even desktop users commonly measure "how much work can I do
before the system becomes unresponsive".

jim

2009-10-02 16:03:34

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 11:40 -0400, Vivek Goyal wrote:
> On Fri, Oct 02, 2009 at 05:32:00PM +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote:
> > > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <[email protected]> wrote:
> > > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> > > >
> > > > Actually I am not touching this code. Looking at the V10, I have not
> > > > changed anything here in idling code.
> > >
> > > I based my analisys on the original patch:
> > > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
> > >
> > > Mike, can you confirm which version of the fairness patch did you use
> > > in your tests?
> >
> > That would be this one-liner.
> >
>
> Ok. Thanks. Sorry, I got confused and thought that you are using "io
> controller patches" with fairness=1.
>
> In that case, Corrado's suggestion of refining it further and disabling idling
> for seeky processes only on non-rotational media (SSDs and hardware RAID) makes
> sense to me.

One thing that might help with that is to have new tasks start out life
meeting the seeky criteria. If there's anything going on, they will be.
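
One illustrative, untested way to express that, using the existing
seek-tracking fields of struct cfq_io_context (the helper name is
hypothetical):

	/*
	 * Untested sketch of the suggestion above: make a freshly allocated
	 * cfq_io_context look seeky until cfq_update_io_seektime() has
	 * gathered real samples, so a starting-up task is treated as a
	 * seeker by default.
	 */
	static void cic_start_out_seeky(struct cfq_io_context *cic)
	{
		cic->seek_samples = 0;			/* no real samples yet */
		cic->seek_mean = CIC_SEEK_THR + 1;	/* CIC_SEEKY() is true */
	}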

-Mike

2009-10-02 16:23:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10


* Linus Torvalds <[email protected]> wrote:

> On Fri, 2 Oct 2009, Jens Axboe wrote:
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more.
>
> Well, if we're talking 500-950% improvement vs 30% deprovement, I
> think it's pretty clear, though. Even the server people do care about
> latencies.
>
> Often they care quite a bit, in fact.

The other thing is that latency is basically a given property in any
system - as an app writer you have to live with it, there's not much you
can do to improve it.

Bandwidth on the other hand is a lot more engineerable, as it tends to
be about batching things and you can batch in user-space too. Batching
is often easier to do than getting good latencies.

Then there's also the fact that the range of apps that care about
bandwidth is a lot smaller than the range of apps which care about
latencies. The default should help more apps - i.e. latencies.

Ingo

2009-10-02 16:34:01

by Ray Lee

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 2, 2009 at 7:56 AM, Jens Axboe <[email protected]> wrote:
> In some cases I wish we had a server vs desktop switch, since it would
> make decisions on this easier. I know you say that servers care about
> latency, but not at all to the extent that desktops do. Most desktop
> users would gladly give away the top of the performance for latency,
> that's not true of most server users. Depends on what the server does,
> of course.

If most of the I/O on a system exhibits seeky tendencies, couldn't the
schedulers notice that and use that as the hint for what to optimize?

I mean, there's no switch better than the actual I/O behavior itself.

2009-10-02 16:37:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10


* Mike Galbraith <[email protected]> wrote:

> On Fri, 2009-10-02 at 11:24 +0200, Ingo Molnar wrote:
> > * Jens Axboe <[email protected]> wrote:
> >
> > > It's not hard to make the latency good, the hard bit is making sure we
> > > also perform well for all other scenarios.
> >
> > Looking at the numbers from Mike:
> >
> > | dd competing against perf stat -- konsole -e exec timings, 5 back to
> > | back runs
> > | Avg
> > | before 9.15 14.51 9.39 15.06 9.90 11.6
> > | after [+patch] 1.76 1.54 1.93 1.88 1.56 1.7
> >
> > _PLEASE_ make read latencies this good - the numbers are _vastly_
> > better. We'll worry about the 'other' things _after_ we've reached good
> > latencies.
> >
> > I thought this principle was a well established basic rule of Linux
> > IO scheduling. Why do we have to have a 'latency vs. bandwidth'
> > discussion again and again? I thought latency won hands down.
>
> Just a note: In the testing I've done so far, we're better off today
> than ever, [...]

Definitely so, and a couple of months ago i've sung praises of that
progress on the IO/fs latencies front:

http://lkml.org/lkml/2009/4/9/461

... but we are greedy bastards and dont define excellence by how far
down we have come from but by how high we can still climb ;-)

Ingo

2009-10-02 16:51:32

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:

> In that case, Corrado's suggestion of refining it further and disabling idling
> for seeky processes only on non-rotational media (SSDs and hardware RAID) makes
> sense to me.

Umm... I got petabytes of hardware RAID across the hall that very definitely
*is* rotating. Did you mean "SSD and disk systems with big honking caches
that cover up the rotation"? Because "RAID" and "big honking caches" are
not *quite* the same thing, and I can just see that corner case coming out
to bite somebody on the ass...



2009-10-02 17:11:28

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Linus Torvalds wrote:
>
>
> On Fri, 2 Oct 2009, Jens Axboe wrote:
> >
> > Mostly they care about throughput, and when they come running because
> > some their favorite app/benchmark/etc is now 2% slower, I get to hear
> > about it all the time. So yes, latency is not ignored, but mostly they
> > yack about throughput.
>
> The reason they yack about it is that they can measure it.
>
> Give them the benchmark where it goes the other way, and tell them why
> they see a 2% deprovement. Give them some button they can tweak, because
> they will.

To some extent that's true, and I didn't want to generalize. If they are
adamant that the benchmark models their real life, then no amount of
pointing in the other direction will change that.

Your point about tuning is definitely true, these people are used to
tuning things. For the desktop we care a lot more about working out of
the box.

> But make the default be low-latency. Because everybody cares about low
> latency, and the people who do so are _not_ the people who you give
> buttons to tweak things with.

Totally agree.

> > I agree, we can easily make CFQ be very about about latency. If you
> > think that is fine, then lets just do that. Then we'll get to fix the
> > server side up when the next RHEL/SLES/whatever cycle is honing in on a
> > kernel, hopefully we wont have to start over when that happens.
>
> I really think we should do latency first, and throughput second.
>
> It's _easy_ to get throughput. The people who care just about throughput
> can always just disable all the work we do for latency. If they really
> care about just throughput, they won't want fairness either - none of that
> complex stuff.

It's not _that_ easy, it depends a lot on the access patterns. A good
example of that is actually the idling that we already do. Say you have
two applications, each starting up. If you start them both at the same
time and just care for the dumb low latency, then you'll do one IO from
each of them in turn. Latency will be good, but throughput will be
awful. And this means that in 20s they are both started, while with the
slice idling and priority disk access that CFQ does, you'd hopefully
have both up and running in 2s.

So latency is good, definitely, but sometimes you have to worry about
the bigger picture too. Latency is more than single IOs, it's often for
complete operation which may involve lots of IOs. Single IO latency is
a benchmark thing, it's not a real life issue. And that's where it
becomes complex and not so black and white. Mike's test is a really good
example of that.

--
Jens Axboe

2009-10-02 17:13:52

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Ray Lee wrote:
> On Fri, Oct 2, 2009 at 7:56 AM, Jens Axboe <[email protected]> wrote:
> > In some cases I wish we had a server vs desktop switch, since it would
> > decisions on this easier. I know you say that servers care about
> > latency, but not at all to the extent that desktops do. Most desktop
> > users would gladly give away the top of the performance for latency,
> > that's not true of most server users. Depends on what the server does,
> > of course.
>
> If most of the I/O on a system exhibits seeky tendencies, couldn't the
> schedulers notice that and use that as the hint for what to optimize?
>
> I mean, there's no switch better than the actual I/O behavior itself.

Heuristics like that have a tendency to fail. What's the cut-off point?
Additionally, heuristics based on past process/system behaviour also have
a tendency to be suboptimal, since things aren't static.

We already look at seekiness of individual processes or groups. IIRC,
as-iosched also keeps a per-queue tracking.

--
Jens Axboe

2009-10-02 17:21:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10


* Jens Axboe <[email protected]> wrote:

> It's not _that_ easy, it depends a lot on the access patterns. A good
> example of that is actually the idling that we already do. Say you
> have two applications, each starting up. If you start them both at the
> same time and just care for the dumb low latency, then you'll do one
> IO from each of them in turn. Latency will be good, but throughput
> will be awful. And this means that in 20s they are both started,
> while with the slice idling and priority disk access that CFQ does,
> you'd hopefully have both up and running in 2s.
>
> So latency is good, definitely, but sometimes you have to worry about
> the bigger picture too. Latency is more than single IOs, it's often
> for complete operation which may involve lots of IOs. Single IO
> latency is a benchmark thing, it's not a real life issue. And that's
> where it becomes complex and not so black and white. Mike's test is a
> really good example of that.

To the extent of you arguing that Mike's test is artificial (i'm not
sure you are arguing that) - Mike certainly did not do an artificial
test - he tested 'konsole' cache-cold startup latency, such as:

sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE

against a streaming dd.

That is a _very_ relevant benchmark IMHO and konsole's cache footprint
is far from trivial. (In fact i'd argue it's one of the most important
IO benchmarks on a desktop system - how does your desktop hold up to
something doing streaming IO.)

Ingo

2009-10-02 17:25:53

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > It's not _that_ easy, it depends a lot on the access patterns. A good
> > example of that is actually the idling that we already do. Say you
> > have two applications, each starting up. If you start them both at the
> > same time and just care for the dumb low latency, then you'll do one
> > IO from each of them in turn. Latency will be good, but throughput
> > will be aweful. And this means that in 20s they are both started,
> > while with the slice idling and priority disk access that CFQ does,
> > you'd hopefully have both up and running in 2s.
> >
> > So latency is good, definitely, but sometimes you have to worry about
> > the bigger picture too. Latency is more than single IOs, it's often
> > for complete operation which may involve lots of IOs. Single IO
> > latency is a benchmark thing, it's not a real life issue. And that's
> > where it becomes complex and not so black and white. Mike's test is a
> > really good example of that.
>
> To the extent of you arguing that Mike's test is artificial (i'm not
> sure you are arguing that) - Mike certainly did not do an artificial
> test - he tested 'konsole' cache-cold startup latency, such as:

[snip]

I was saying the exact opposite, that Mike's test is a good example of a
valid test. It's not measuring single IO latencies, it's doing a
sequence of valid events and looking at the latency for those. It's
benchmarking the bigger picture, not a microbenchmark.

--
Jens Axboe

2009-10-02 17:29:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10


* Jens Axboe <[email protected]> wrote:

> On Fri, Oct 02 2009, Ingo Molnar wrote:
> >
> > * Jens Axboe <[email protected]> wrote:
> >
> > > It's not _that_ easy, it depends a lot on the access patterns. A
> > > good example of that is actually the idling that we already do.
> > > Say you have two applications, each starting up. If you start them
> > > both at the same time and just care for the dumb low latency, then
> > > you'll do one IO from each of them in turn. Latency will be good,
> > > but throughput will be aweful. And this means that in 20s they are
> > > both started, while with the slice idling and priority disk access
> > > that CFQ does, you'd hopefully have both up and running in 2s.
> > >
> > > So latency is good, definitely, but sometimes you have to worry
> > > about the bigger picture too. Latency is more than single IOs,
> > > it's often for complete operation which may involve lots of IOs.
> > > Single IO latency is a benchmark thing, it's not a real life
> > > issue. And that's where it becomes complex and not so black and
> > > white. Mike's test is a really good example of that.
> >
> > To the extent of you arguing that Mike's test is artificial (i'm not
> > sure you are arguing that) - Mike certainly did not do an artificial
> > test - he tested 'konsole' cache-cold startup latency, such as:
>
> [snip]
>
> I was saying the exact opposite, that Mike's test is a good example of
> a valid test. It's not measuring single IO latencies, it's doing a
> sequence of valid events and looking at the latency for those. It's
> benchmarking the bigger picture, not a microbenchmark.

Good, so we are in violent agreement :-)

Ingo

2009-10-02 17:37:31

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > >
> > > * Jens Axboe <[email protected]> wrote:
> > >
> > > > It's not _that_ easy, it depends a lot on the access patterns. A
> > > > good example of that is actually the idling that we already do.
> > > > Say you have two applications, each starting up. If you start them
> > > > both at the same time and just care for the dumb low latency, then
> > > > you'll do one IO from each of them in turn. Latency will be good,
> > > > but throughput will be aweful. And this means that in 20s they are
> > > > both started, while with the slice idling and priority disk access
> > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > >
> > > > So latency is good, definitely, but sometimes you have to worry
> > > > about the bigger picture too. Latency is more than single IOs,
> > > > it's often for complete operation which may involve lots of IOs.
> > > > Single IO latency is a benchmark thing, it's not a real life
> > > > issue. And that's where it becomes complex and not so black and
> > > > white. Mike's test is a really good example of that.
> > >
> > > To the extent of you arguing that Mike's test is artificial (i'm not
> > > sure you are arguing that) - Mike certainly did not do an artificial
> > > test - he tested 'konsole' cache-cold startup latency, such as:
> >
> > [snip]
> >
> > I was saying the exact opposite, that Mike's test is a good example of
> > a valid test. It's not measuring single IO latencies, it's doing a
> > sequence of valid events and looking at the latency for those. It's
> > benchmarking the bigger picture, not a microbenchmark.
>
> Good, so we are in violent agreement :-)

Yes, perhaps that last sentence didn't provide enough evidence of which
category I put Mike's test into :-)

So to kick things off, I added an 'interactive' knob to CFQ and
defaulted it to on, along with re-enabling slice idling for hardware
that does tagged command queuing. This is almost completely identical to
what Vivek Goyal originally posted, it's just combined into one and uses
the term 'interactive' instead of 'fairness'. I think the former is a
better umbrella under which to add further tweaks that may sacrifice
throughput slightly, in the quest for better latency.

It's queued up in the for-linus branch.
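
The branch itself is not quoted in this thread; conceptually the change
amounts to the same gate as the earlier one-liner under a different name,
roughly as sketched below (not the actual commit):

	/*
	 * Sketch only, not the actual for-linus commit: an "interactive"
	 * tunable, default on, that keeps idling enabled for seeky queues
	 * even on tagged (hw_tag) hardware.
	 */
	cfqd->cfq_interactive = 1;	/* default, set at queue init time */

	/* and in cfq_update_idle_window(): */
	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
	    (!cfqd->cfq_interactive && cfqd->hw_tag && CIC_SEEKY(cic)))
		enable_idle = 0;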

--
Jens Axboe

2009-10-02 17:57:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10


* Jens Axboe <[email protected]> wrote:

> On Fri, Oct 02 2009, Ingo Molnar wrote:
> >
> > * Jens Axboe <[email protected]> wrote:
> >
> > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > >
> > > > * Jens Axboe <[email protected]> wrote:
> > > >
> > > > > It's not _that_ easy, it depends a lot on the access patterns. A
> > > > > good example of that is actually the idling that we already do.
> > > > > Say you have two applications, each starting up. If you start them
> > > > > both at the same time and just care for the dumb low latency, then
> > > > > you'll do one IO from each of them in turn. Latency will be good,
> > > > > but throughput will be aweful. And this means that in 20s they are
> > > > > both started, while with the slice idling and priority disk access
> > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > >
> > > > > So latency is good, definitely, but sometimes you have to worry
> > > > > about the bigger picture too. Latency is more than single IOs,
> > > > > it's often for complete operation which may involve lots of IOs.
> > > > > Single IO latency is a benchmark thing, it's not a real life
> > > > > issue. And that's where it becomes complex and not so black and
> > > > > white. Mike's test is a really good example of that.
> > > >
> > > > To the extent of you arguing that Mike's test is artificial (i'm not
> > > > sure you are arguing that) - Mike certainly did not do an artificial
> > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > >
> > > [snip]
> > >
> > > I was saying the exact opposite, that Mike's test is a good example of
> > > a valid test. It's not measuring single IO latencies, it's doing a
> > > sequence of valid events and looking at the latency for those. It's
> > > benchmarking the bigger picture, not a microbenchmark.
> >
> > Good, so we are in violent agreement :-)
>
> Yes, perhaps that last sentence didn't provide enough evidence of
> which category I put Mike's test into :-)
>
> So to kick things off, I added an 'interactive' knob to CFQ and
> defaulted it to on, along with re-enabling slice idling for hardware
> that does tagged command queuing. This is almost completely identical
> to what Vivek Goyal originally posted, it's just combined into one and
> uses the term 'interactive' instead of 'fairness'. I think the former
> is a better umbrella under which to add further tweaks that may
> sacrifice throughput slightly, in the quest for better latency.
>
> It's queued up in the for-linus branch.

i'd say 'latency' describes it even better. 'interactivity' as a term is
a bit overladen.

Ingo

2009-10-02 18:04:35

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > >
> > > * Jens Axboe <[email protected]> wrote:
> > >
> > > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > > >
> > > > > * Jens Axboe <[email protected]> wrote:
> > > > >
> > > > > > It's not _that_ easy, it depends a lot on the access patterns. A
> > > > > > good example of that is actually the idling that we already do.
> > > > > > Say you have two applications, each starting up. If you start them
> > > > > > both at the same time and just care for the dumb low latency, then
> > > > > > you'll do one IO from each of them in turn. Latency will be good,
> > > > > > but throughput will be awful. And this means that in 20s they are
> > > > > > both started, while with the slice idling and priority disk access
> > > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > > >
> > > > > > So latency is good, definitely, but sometimes you have to worry
> > > > > > about the bigger picture too. Latency is more than single IOs,
> > > > > > it's often for complete operation which may involve lots of IOs.
> > > > > > Single IO latency is a benchmark thing, it's not a real life
> > > > > > issue. And that's where it becomes complex and not so black and
> > > > > > white. Mike's test is a really good example of that.
> > > > >
> > > > > To the extent of you arguing that Mike's test is artificial (i'm not
> > > > > sure you are arguing that) - Mike certainly did not do an artificial
> > > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > > >
> > > > [snip]
> > > >
> > > > I was saying the exact opposite, that Mike's test is a good example of
> > > > a valid test. It's not measuring single IO latencies, it's doing a
> > > > sequence of valid events and looking at the latency for those. It's
> > > > benchmarking the bigger picture, not a microbenchmark.
> > >
> > > Good, so we are in violent agreement :-)
> >
> > Yes, perhaps that last sentence didn't provide enough evidence of
> > which category I put Mike's test into :-)
> >
> > So to kick things off, I added an 'interactive' knob to CFQ and
> > defaulted it to on, along with re-enabling slice idling for hardware
> > that does tagged command queuing. This is almost completely identical
> > to what Vivek Goyal originally posted, it's just combined into one and
> > uses the term 'interactive' instead of 'fairness'. I think the former
> > is a better umbrella under which to add further tweaks that may
> > sacrifice throughput slightly, in the quest for better latency.
> >
> > It's queued up in the for-linus branch.
>
> i'd say 'latency' describes it even better. 'interactivity' as a term is
> a bit overladen.

I'm not too crazy about it either. How about just using 'desktop' since
this is obviously what we are really targeting? 'latency' isn't fully
descriptive either, since it may not necessarily provide the best single
IO latency (noop would).

--
Jens Axboe

2009-10-02 18:08:56

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Thu, Oct 01 2009, Mike Galbraith wrote:
> max_dispatch = cfqd->cfq_quantum;
> if (cfq_class_idle(cfqq))
> max_dispatch = 1;
>
> + if (cfqd->busy_queues > 1)
> + cfqd->od_stamp = jiffies;
> +

->busy_queues > 1 just means that they have requests ready for dispatch,
not that they are dispatched.
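
(A simplified illustration of the distinction being made here; the helpers are
invented names, see cfq-iosched.c for the real accounting.)

/*
 * busy_queues is bumped when a cfq_queue gets its first request queued in
 * CFQ and dropped when its last request leaves CFQ, so "busy_queues > 1"
 * only says that another queue has work *pending*.  Dispatched IO is what
 * rq_in_driver[] counts.
 */
static inline int other_queue_has_pending_io(struct cfq_data *cfqd)
{
        return cfqd->busy_queues > 1;           /* pending, not yet dispatched */
}

static inline int io_already_dispatched(struct cfq_data *cfqd)
{
        return cfqd->rq_in_driver[BLK_RW_SYNC] +
               cfqd->rq_in_driver[BLK_RW_ASYNC] > 0;
}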


--
Jens Axboe

2009-10-02 18:13:42

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 19:37 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Ingo Molnar wrote:
> >
> > * Jens Axboe <[email protected]> wrote:
> >
> > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > >
> > > > * Jens Axboe <[email protected]> wrote:
> > > >
> > > > > It's not _that_ easy, it depends a lot on the access patterns. A
> > > > > good example of that is actually the idling that we already do.
> > > > > Say you have two applications, each starting up. If you start them
> > > > > both at the same time and just care for the dumb low latency, then
> > > > > you'll do one IO from each of them in turn. Latency will be good,
> > > > > but throughput will be awful. And this means that in 20s they are
> > > > > both started, while with the slice idling and priority disk access
> > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > >
> > > > > So latency is good, definitely, but sometimes you have to worry
> > > > > about the bigger picture too. Latency is more than single IOs,
> > > > > it's often for complete operation which may involve lots of IOs.
> > > > > Single IO latency is a benchmark thing, it's not a real life
> > > > > issue. And that's where it becomes complex and not so black and
> > > > > white. Mike's test is a really good example of that.
> > > >
> > > > To the extent of you arguing that Mike's test is artificial (i'm not
> > > > sure you are arguing that) - Mike certainly did not do an artificial
> > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > >
> > > [snip]
> > >
> > > I was saying the exact opposite, that Mike's test is a good example of
> > > a valid test. It's not measuring single IO latencies, it's doing a
> > > sequence of valid events and looking at the latency for those. It's
> > > benchmarking the bigger picture, not a microbenchmark.
> >
> > Good, so we are in violent agreement :-)
>
> Yes, perhaps that last sentence didn't provide enough evidence of which
> category I put Mike's test into :-)
>
> So to kick things off, I added an 'interactive' knob to CFQ and
> defaulted it to on, along with re-enabling slice idling for hardware
> that does tagged command queuing. This is almost completely identical to
> what Vivek Goyal originally posted, it's just combined into one and uses
> the term 'interactive' instead of 'fairness'. I think the former is a
> better umbrella under which to add further tweaks that may sacrifice
> throughput slightly, in the quest for better latency.
>
> It's queued up in the for-linus branch.

FWIW, I did a matrix of Vivek's patch combined with my hack. Seems we
do lose a bit of dd throughput over stock with either or both.

dd pre 65.1 65.4 67.5 64.8 65.1 65.5 fairness=1 overload_delay=1
perf stat 1.70 1.94 1.32 1.89 1.87 1.7
dd post 69.4 62.3 69.7 70.3 69.6 68.2

dd pre 67.0 67.8 64.7 64.7 64.9 65.8 fairness=1 overload_delay=0
perf stat 4.89 3.13 2.98 2.71 2.17 3.1
dd post 67.2 63.3 62.6 62.8 63.1 63.8

dd pre 65.0 66.0 66.9 64.6 67.0 65.9 fairness=0 overload_delay=1
perf stat 4.66 3.81 4.23 2.98 4.23 3.9
dd post 62.0 60.8 62.4 61.4 62.2 61.7

dd pre 65.3 65.6 64.9 69.5 65.8 66.2 fairness=0 overload_delay=0
perf stat 14.79 9.11 14.16 8.44 13.67 12.0
dd post 64.1 66.5 64.0 66.5 64.4 65.1


2009-10-02 18:19:01

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 19:37 +0200, Jens Axboe wrote:
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > >
> > > * Jens Axboe <[email protected]> wrote:
> > >
> > > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > > >
> > > > > * Jens Axboe <[email protected]> wrote:
> > > > >
> > > > > > It's not _that_ easy, it depends a lot on the access patterns. A
> > > > > > good example of that is actually the idling that we already do.
> > > > > > Say you have two applications, each starting up. If you start them
> > > > > > both at the same time and just care for the dumb low latency, then
> > > > > > you'll do one IO from each of them in turn. Latency will be good,
> > > > > > but throughput will be awful. And this means that in 20s they are
> > > > > > both started, while with the slice idling and priority disk access
> > > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > > >
> > > > > > So latency is good, definitely, but sometimes you have to worry
> > > > > > about the bigger picture too. Latency is more than single IOs,
> > > > > > it's often for complete operation which may involve lots of IOs.
> > > > > > Single IO latency is a benchmark thing, it's not a real life
> > > > > > issue. And that's where it becomes complex and not so black and
> > > > > > white. Mike's test is a really good example of that.
> > > > >
> > > > > To the extent of you arguing that Mike's test is artificial (i'm not
> > > > > sure you are arguing that) - Mike certainly did not do an artificial
> > > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > > >
> > > > [snip]
> > > >
> > > > I was saying the exact opposite, that Mike's test is a good example of
> > > > a valid test. It's not measuring single IO latencies, it's doing a
> > > > sequence of valid events and looking at the latency for those. It's
> > > > benchmarking the bigger picture, not a microbenchmark.
> > >
> > > Good, so we are in violent agreement :-)
> >
> > Yes, perhaps that last sentence didn't provide enough evidence of which
> > category I put Mike's test into :-)
> >
> > So to kick things off, I added an 'interactive' knob to CFQ and
> > defaulted it to on, along with re-enabling slice idling for hardware
> > that does tagged command queuing. This is almost completely identical to
> > what Vivek Goyal originally posted, it's just combined into one and uses
> > the term 'interactive' instead of 'fairness'. I think the former is a
> > better umbrella under which to add further tweaks that may sacrifice
> > throughput slightly, in the quest for better latency.
> >
> > It's queued up in the for-linus branch.
>
> FWIW, I did a matrix of Vivek's patch combined with my hack. Seems we
> do lose a bit of dd throughput over stock with either or both.
>
> dd pre 65.1 65.4 67.5 64.8 65.1 65.5 fairness=1 overload_delay=1
> perf stat 1.70 1.94 1.32 1.89 1.87 1.7
> dd post 69.4 62.3 69.7 70.3 69.6 68.2
>
> dd pre 67.0 67.8 64.7 64.7 64.9 65.8 fairness=1 overload_delay=0
> perf stat 4.89 3.13 2.98 2.71 2.17 3.1
> dd post 67.2 63.3 62.6 62.8 63.1 63.8
>
> dd pre 65.0 66.0 66.9 64.6 67.0 65.9 fairness=0 overload_delay=1
> perf stat 4.66 3.81 4.23 2.98 4.23 3.9
> dd post 62.0 60.8 62.4 61.4 62.2 61.7
>
> dd pre 65.3 65.6 64.9 69.5 65.8 66.2 fairness=0 overload_delay=0
> perf stat 14.79 9.11 14.16 8.44 13.67 12.0
> dd post 64.1 66.5 64.0 66.5 64.4 65.1

I'm not too worried about the "single IO producer" scenarios, and it
looks like (from a quick look) that most of your numbers are within some
expected noise levels. It's the more complex mixes that are likely to
cause a bit of a stink, but lets worry about that later. One quick thing
would be to read eg 2 or more files sequentially from disk and see how
that performs.

If you could do a cleaned up version of your overload patch based on
this:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768

then lets take it from there.

--
Jens Axboe

2009-10-02 18:22:38

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 20:04 +0200, Jens Axboe wrote:

> I'm not too crazy about it either. How about just using 'desktop' since
> this is obviously what we are really targeting? 'latency' isn't fully
> descriptive either, since it may not necessarily provide the best single
> IO latency (noop would).

Grin. "Perfect is the enemy of good" :)
Avg
16.24 175.82 154.38 228.97 147.16 144.5 noop
43.23 57.39 96.13 148.25 180.09 105.0 deadline

2009-10-02 18:26:07

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 20:04 +0200, Jens Axboe wrote:
>
> > I'm not too crazy about it either. How about just using 'desktop' since
> > this is obviously what we are really targeting? 'latency' isn't fully
> > descriptive either, since it may not necessarily provide the best single
> > IO latency (noop would).
>
> Grin. "Perfect is the enemy of good" :)
> Avg
> 16.24 175.82 154.38 228.97 147.16 144.5 noop
> 43.23 57.39 96.13 148.25 180.09 105.0 deadline

Yep, that's where it falls down. Noop basically fails here because it
treats all IO as equal, which obviously isn't true for most people. But
even for pure read workloads (is the above the mixed read/write, or just
read?), latency would be excellent with noop but the desktop experience
would not.

--
Jens Axboe

2009-10-02 18:30:01

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 20:08 +0200, Jens Axboe wrote:
> On Thu, Oct 01 2009, Mike Galbraith wrote:
> > max_dispatch = cfqd->cfq_quantum;
> > if (cfq_class_idle(cfqq))
> > max_dispatch = 1;
> >
> > + if (cfqd->busy_queues > 1)
> > + cfqd->od_stamp = jiffies;
> > +
>
> ->busy_queues > 1 just means that they have requests ready for dispatch,
> not that they are dispatched.

But we're not alone, somebody else is using disk. I'm trying to make
sure we don't have someone _about_ to come back.. like a reader, so when
there's another player, stamp to give him some time to wake up/submit
before putting the pedal to the metal.

-Mike

2009-10-02 18:33:20

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 20:26 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:04 +0200, Jens Axboe wrote:
> >
> > > I'm not too crazy about it either. How about just using 'desktop' since
> > > this is obviously what we are really targeting? 'latency' isn't fully
> > > descriptive either, since it may not necessarily provide the best single
> > > IO latency (noop would).
> >
> > Grin. "Perfect is the enemy of good" :)
> > Avg
> > 16.24 175.82 154.38 228.97 147.16 144.5 noop
> > 43.23 57.39 96.13 148.25 180.09 105.0 deadline
>
> Yep, that's where it falls down. Noop basically fails here because it
> treats all IO as equal, which obviously isn't true for most people. But
> even for pure read workloads (is the above the mixed read/write, or just
> read?), latency would be excellent with noop but the desktop experience
> would not.

Yeah, it's the dd vs konsole -e exit.

-Mike

2009-10-02 18:36:00

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 20:08 +0200, Jens Axboe wrote:
> > On Thu, Oct 01 2009, Mike Galbraith wrote:
> > > max_dispatch = cfqd->cfq_quantum;
> > > if (cfq_class_idle(cfqq))
> > > max_dispatch = 1;
> > >
> > > + if (cfqd->busy_queues > 1)
> > > + cfqd->od_stamp = jiffies;
> > > +
> >
> > ->busy_queues > 1 just means that they have requests ready for dispatch,
> > not that they are dispatched.
>
> But we're not alone, somebody else is using disk. I'm trying to make
> sure we don't have someone _about_ to come back.. like a reader, so when
> there's another player, stamp to give him some time to wake up/submit
> before putting the pedal to the metal.

OK, then the check does what you want. It'll tell you that you have a
pending request, and at least one other queue has one too. And that
could dispatch right after you finish yours, depending on idling etc.
Note that this _only_ applies to queues that have requests still sitting
in CFQ; as soon as they are on the dispatch list in the block layer, they
will only be counted as busy if they still have sorted IO waiting.

But that should be OK already, since I switched CFQ to dispatch single
requests a few revisions ago. So we should not run into that anymore.

--
Jens Axboe

2009-10-02 18:37:05

by Theodore Ts'o

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > i'd say 'latency' describes it even better. 'interactivity' as a term is
> > a bit overladen.
>
> I'm not too crazy about it either. How about just using 'desktop' since
> this is obviously what we are really targeting? 'latency' isn't fully
> descriptive either, since it may not necessarily provide the best single
> IO latency (noop would).

As Linus has already pointed out, it's not necessarily "desktop"
versus "server". There will be certain high frequency transaction
database workloads (for example) that will very much care about
latency. I think "low_latency" may be the best term to use.

- Ted

2009-10-02 18:45:47

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Theodore Tso wrote:
> On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > > i'd say 'latency' describes it even better. 'interactivity' as a term is
> > > a bit overladen.
> >
> > I'm not too crazy about it either. How about just using 'desktop' since
> > this is obviously what we are really targeting? 'latency' isn't fully
> > descriptive either, since it may not necessarily provide the best single
> > IO latency (noop would).
>
> As Linus has already pointed out, it's not necessarily "desktop"
> versus "server". There will be certain high frequency transaction
> database workloads (for example) that will very much care about
> latency. I think "low_latency" may be the best term to use.

Not necessarily, but typically it will be. As already noted, I don't
think latency itself is a very descriptive term for this.

--
Jens Axboe

2009-10-02 18:57:24

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:

> I'm not too worried about the "single IO producer" scenarios, and it
> looks like (from a quick look) that most of your numbers are within some
> expected noise levels. It's the more complex mixes that are likely to
> cause a bit of a stink, but lets worry about that later. One quick thing
> would be to read eg 2 or more files sequentially from disk and see how
> that performs.

Hm. git(s) should be good for a nice repeatable load. Suggestions?

> If you could do a cleaned up version of your overload patch based on
> this:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
>
> then lets take it from there.

I'll try to find a good repeatable git beater first. At this point, I
only know it helps with one load.

-Mike

2009-10-02 19:02:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10


* Jens Axboe <[email protected]> wrote:

> On Fri, Oct 02 2009, Theodore Tso wrote:
> > On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > > > i'd say 'latency' describes it even better. 'interactivity' as a term is
> > > > a bit overladen.
> > >
> > > I'm not too crazy about it either. How about just using 'desktop'
> > > since this is obviously what we are really targeting? 'latency'
> > > isn't fully descriptive either, since it may not necessarily
> > > provide the best single IO latency (noop would).
> >
> > As Linus has already pointed out, it's not necessarily "desktop"
> > versus "server". There will be certain high frequency transaction
> > database workloads (for example) that will very much care about
> > latency. I think "low_latency" may be the best term to use.
>
> Not necessarily, but typically it will be. As already noted, I don't
> think latency itself is a very descriptive term for this.

Why not? Nobody will think of 'latency' as something that requires noop,
but as something that in practice achieves low latencies, for stuff that
people use.

Ingo

2009-10-02 19:09:34

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > On Fri, Oct 02 2009, Theodore Tso wrote:
> > > On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > > > > i'd say 'latency' describes it even better. 'interactivity' as a term is
> > > > > a bit overladen.
> > > >
> > > > I'm not too crazy about it either. How about just using 'desktop'
> > > > since this is obviously what we are really targeting? 'latency'
> > > > isn't fully descriptive either, since it may not necessarily
> > > > provide the best single IO latency (noop would).
> > >
> > > As Linus has already pointed out, it's not necessarily "desktop"
> > > versus "server". There will be certain high frequency transaction
> > > database workloads (for example) that will very much care about
> > > latency. I think "low_latency" may be the best term to use.
> >
> > Not necessarily, but typically it will be. As already noted, I don't
> > think latency itself is a very descriptive term for this.
>
> Why not? Nobody will think of 'latency' as something that requires noop,
> but as something that in practice achieves low latencies, for stuff that
> people use.

Alright, I'll acknowledge that if that's the general consensus. I may be
somewhat biased myself.

--
Jens Axboe

2009-10-02 20:00:01

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 02, 2009 at 12:50:17PM -0400, [email protected] wrote:
> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:
>
> > In that case, Corrado's suggestion of refining it further and disabling idling
> > for seeky process only on non-rotational media (SSD and hardware RAID), makes
> > sense to me.
>
> Umm... I got petabytes of hardware RAID across the hall that very definitely
> *is* rotating. Did you mean "SSD and disk systems with big honking caches
> that cover up the rotation"? Because "RAID" and "big honking caches" are
> not *quite* the same thing, and I can just see that corner case coming out
> to bite somebody on the ass...
>

I guess both. On systems which have big caches that cover up for rotation,
we probably need not idle for a seeky process. And in the case of a big
hardware RAID with multiple rotating disks, instead of idling and keeping
the rest of the disks free, we are probably better off dispatching requests
from the next queue (hoping it is going to a different disk altogether).

Thanks
Vivek

2009-10-02 20:30:11

by Munehiro Ikeda

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> Before finishing this mail, will throw a whacky idea in the ring. I was
> going through the request based dm-multipath paper. Will it make sense
> to implement request based dm-ioband? So basically we implement all the
> group scheduling in CFQ and let dm-ioband implement a request function
> to take the request and break it back into bios. This way we can keep
> all the group control at one place and also meet most of the requirements.
>
> So request based dm-ioband will have a request in hand once that request
> has passed group control and prio control. Because dm-ioband is a device
> mapper target, one can put it on higher level devices (practically taking
> CFQ at higher level device), and provide fairness there. One can also
> put it on those SSDs which don't use IO scheduler (this is kind of forcing
> them to use the IO scheduler.)
>
> I am sure there will be many issues, but one big issue I could think of is
> that CFQ thinks that there is one device beneath it and dispatches requests
> from one queue (in case of idling) and that would kill parallelism at
> higher layer and throughput will suffer on many of the dm/md configurations.
>
> Thanks
> Vivek

As long as CFQ is used, your idea seems reasonable to me. But what about
the other IO schedulers? In my understanding, one of the keys to
guaranteeing group isolation in your patch is having per-group IO scheduler
internal queues even with the as, deadline, and noop schedulers. I think
this is a great idea, and implementing generic code for all IO schedulers
is what we concluded on when we had so many IO scheduler specific proposals.
If we still need per-group IO scheduler internal queues with request-based
dm-ioband, we have to modify the elevator layer, which seems out of scope
for dm.
I might be missing something...



--
IKEDA, Munehiro
NEC Corporation of America
[email protected]

2009-10-02 20:47:33

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 20:57 +0200, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
>
> > I'm not too worried about the "single IO producer" scenarios, and it
> > looks like (from a quick look) that most of your numbers are within some
> > expected noise levels. It's the more complex mixes that are likely to
> > cause a bit of a stink, but lets worry about that later. One quick thing
> > would be to read eg 2 or more files sequentially from disk and see how
> > that performs.
>
> Hm. git(s) should be good for a nice repeatable load. Suggestions?
>
> > If you could do a cleaned up version of your overload patch based on
> > this:
> >
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> >
> > then lets take it from there.
>
> I'll try to find a good repeatable git beater first. At this point, I
> only know it helps with one load.

Seems to help mixed concurrent read/write a bit too.

perf stat testo.sh Avg
108.12 106.33 106.34 97.00 106.52 104.8 1.000 fairness=0 overload_delay=0
93.98 102.44 94.47 97.70 98.90 97.4 .929 fairness=0 overload_delay=1
90.87 95.40 95.79 93.09 94.25 93.8 .895 fairness=1 overload_delay=0
89.93 90.57 89.13 93.43 93.72 91.3 .871 fairness=1 overload_delay=1

#!/bin/sh

LOGFILE=testo.log
rm -f $LOGFILE

echo 3 > /proc/sys/vm/drop_caches
sh -c "(cd linux-2.6.23; perf stat -- git checkout -f; git archive --format=tar HEAD > ../linux-2.6.23.tar)" 2>&1|tee -a $LOGFILE &
sh -c "(cd linux-2.6.24; perf stat -- git archive --format=tar HEAD > ../linux-2.6.24.tar; git checkout -f)" 2>&1|tee -a $LOGFILE &
sh -c "(cd linux-2.6.25; perf stat -- git checkout -f; git archive --format=tar HEAD > ../linux-2.6.25.tar)" 2>&1|tee -a $LOGFILE &
sh -c "(cd linux-2.6.26; perf stat -- git archive --format=tar HEAD > ../linux-2.6.26.tar; git checkout -f)" 2>&1|tee -a $LOGFILE &
wait

2009-10-02 22:14:25

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, Oct 2, 2009 at 9:58 PM, Vivek Goyal <[email protected]> wrote:
> On Fri, Oct 02, 2009 at 12:50:17PM -0400, [email protected] wrote:
>> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:
>>
>> Umm... I got petabytes of hardware RAID across the hall that very definitely
>> *is* rotating.  Did you mean "SSD and disk systems with big honking caches
>> that cover up the rotation"?  Because "RAID" and "big honking caches" are
>> not *quite* the same thing, and I can just see that corner case coming out
>> to bite somebody on the ass...
>>
>
> I guess both. On systems which have big caches that cover up for rotation,
> we probably need not idle for a seeky process. And in the case of a big
> hardware RAID with multiple rotating disks, instead of idling and keeping
> the rest of the disks free, we are probably better off dispatching requests
> from the next queue (hoping it is going to a different disk altogether).

In fact I think that the 'rotating' flag name is misleading.
All the checks we are doing are actually checking if the device truly
supports multiple parallel operations, and this feature is shared by
hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
NCQ-enabled SATA disk.

If we really wanted a "seek is cheap" flag, we could measure seek time
in the io-scheduler itself, but in the current code base we don't have
it used in this meaning anywhere.
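
(As a rough illustration of what measuring seek cost inside the io-scheduler
could look like, along the lines of the existing per-process seek_samples
accounting; every name and threshold below is invented for the sketch.)

#define SEEK_CHEAP_THRESHOLD_NS (2ULL * 1000 * 1000)    /* 2ms, picked arbitrarily */

struct seek_estimate {
        u64 last_end_sector;    /* where the previously completed request ended */
        u64 seek_time_ns;       /* decayed average service time of seeky requests */
        unsigned int samples;
};

/* call from the completion path with the request's position and service time */
static void seek_estimate_update(struct seek_estimate *se, u64 start_sector,
                                 u64 end_sector, u64 service_ns)
{
        /* only non-contiguous requests tell us anything about seek cost */
        if (start_sector != se->last_end_sector) {
                /* exponential decay: 7/8 old estimate + 1/8 new sample */
                se->seek_time_ns = (7 * se->seek_time_ns + service_ns) / 8;
                if (se->samples < 256)
                        se->samples++;
        }
        se->last_end_sector = end_sector;
}

static int seek_is_cheap(const struct seek_estimate *se)
{
        return se->samples >= 16 && se->seek_time_ns < SEEK_CHEAP_THRESHOLD_NS;
}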

Thanks,
Corrado

>
> Thanks
> Vivek
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

2009-10-02 22:28:45

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> On Fri, Oct 2, 2009 at 9:58 PM, Vivek Goyal <[email protected]> wrote:
> > On Fri, Oct 02, 2009 at 12:50:17PM -0400, [email protected] wrote:
> >> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:
> >>
> >> Umm... I got petabytes of hardware RAID across the hall that very definitely
> >> *is* rotating. Did you mean "SSD and disk systems with big honking caches
> >> that cover up the rotation"? Because "RAID" and "big honking caches" are
> >> not *quite* the same thing, and I can just see that corner case coming out
> >> to bite somebody on the ass...
> >>
> >
> > I guess both. On systems which have big caches that cover up for rotation,
> > we probably need not idle for a seeky process. And in the case of a big
> > hardware RAID with multiple rotating disks, instead of idling and keeping
> > the rest of the disks free, we are probably better off dispatching requests
> > from the next queue (hoping it is going to a different disk altogether).
>
> In fact I think that the 'rotating' flag name is misleading.
> All the checks we are doing are actually checking if the device truly
> supports multiple parallel operations, and this feature is shared by
> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
> NCQ-enabled SATA disk.
>

While we are at it, what happens to the notion of priority of tasks on
SSDs? Without idling there is no continuous time slice and there is no
fairness. So is ioprio out of the window for SSDs?

On SSDs, would it make more sense to provide fairness in terms of number of
IOs or size of IO rather than in terms of time slices?
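
(A sketch of what service-based fairness could mean, charging a queue for
sectors served instead of jiffies of slice; none of these names exist in CFQ,
they are purely illustrative.)

struct ioq_budget {
        unsigned int weight;            /* relative weight of the queue/group */
        u64 sectors_charged;            /* service received so far */
};

/* charge the owning queue on request completion instead of on slice expiry */
static void ioq_charge(struct ioq_budget *b, unsigned int nr_sectors)
{
        b->sectors_charged += nr_sectors;
}

/* pick whichever queue has received less weighted service */
static struct ioq_budget *ioq_pick_next(struct ioq_budget *a, struct ioq_budget *b)
{
        /* a->sectors/a->weight <= b->sectors/b->weight, without the division */
        return a->sectors_charged * b->weight <= b->sectors_charged * a->weight ?
                a : b;
}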

Thanks
Vivek

> If we really wanted a "seek is cheap" flag, we could measure seek time
> in the io-scheduler itself, but in the current code base we don't have
> it used in this meaning anywhere.
>
> Thanks,
> Corrado
>
> >
> > Thanks
> > Vivek
> >
>
>
>
> --
> __________________________________________________________________________
>
> dott. Corrado Zoccolo mailto:[email protected]
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------
> The self-confidence of a warrior is not the self-confidence of the average
> man. The average man seeks certainty in the eyes of the onlooker and calls
> that self-confidence. The warrior seeks impeccability in his own eyes and
> calls that humbleness.
> Tales of Power - C. Castaneda

2009-10-03 05:49:01

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:

> If you could do a cleaned up version of your overload patch based on
> this:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
>
> then lets take it from there.

If "take it from there" ends up meaning apply, and see who squeaks, feel
free to delete the "Not", and my somewhat defective sense of humor.

Block: Delay overloading of CFQ queues to improve read latency.

Introduce a delay maximum dispatch timestamp, and stamp it when:
1. we encounter a known seeky or possibly new sync IO queue.
2. the current queue may go idle and we're draining async IO.
3. we have sync IO in flight and are servicing an async queue.
4. we are not the sole user of disk.
Disallow exceeding quantum if any of these events have occurred recently.

Protect this behavioral change with a "desktop_dispatch" knob and default
it to "on".. providing an easy means of regression verification prior to
hate-mail dispatch :) to CC list.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
... others who let somewhat hacky tweak slip by

LKML-Reference: <new-submission>

---
block/cfq-iosched.c | 45 +++++++++++++++++++++++++++++++++++++++++----
1 file changed, 41 insertions(+), 4 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -174,6 +174,9 @@ struct cfq_data {
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
unsigned int cfq_desktop;
+ unsigned int cfq_desktop_dispatch;
+
+ unsigned long desktop_dispatch_ts;

struct list_head cic_list;

@@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq;
unsigned int max_dispatch;
+ unsigned long delay;

if (!cfqd->busy_queues)
return 0;
@@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
/*
* Drain async requests before we start sync IO
*/
- if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+ if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+ cfqd->desktop_dispatch_ts = jiffies;
return 0;
+ }

/*
* If this is an async queue and we have sync IO in flight, let it wait
*/
- if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+ if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+ cfqd->desktop_dispatch_ts = jiffies;
return 0;
+ }

max_dispatch = cfqd->cfq_quantum;
if (cfq_class_idle(cfqq))
max_dispatch = 1;

+ if (cfqd->busy_queues > 1)
+ cfqd->desktop_dispatch_ts = jiffies;
+
/*
* Does this cfqq already have too much IO in flight?
*/
@@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
return 0;

/*
+ * Don't start overloading until we've been alone for a bit.
+ */
+ if (cfqd->cfq_desktop_dispatch) {
+ delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
+
+ if (time_before(jiffies, max_delay))
+ return 0;
+ }
+
+ /*
* we are the only queue, allow up to 4 times of 'quantum'
*/
if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1942,7 +1963,7 @@ static void
cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct cfq_io_context *cic)
{
- int old_idle, enable_idle;
+ int old_idle, enable_idle, seeky = 0;

/*
* Don't idle for async or idle io prio class
@@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
return;

+ if (cfqd->hw_tag) {
+ if (CIC_SEEKY(cic))
+ seeky = 1;
+ /*
+ * If seeky or incalculable seekiness, delay overloading.
+ */
+ if (seeky || !sample_valid(cic->seek_samples))
+ cfqd->desktop_dispatch_ts = jiffies;
+ }
+
enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
+ (!cfqd->cfq_desktop && seeky))
enable_idle = 0;
else if (sample_valid(cic->ttime_samples)) {
if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
cfqd->cfq_desktop = 1;
+ cfqd->cfq_desktop_dispatch = 1;
+
+ cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
cfqd->hw_tag = 1;

return cfqd;
@@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
+SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
#undef SHOW_FUNCTION

#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
+STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
#undef STORE_FUNCTION

#define CFQ_ATTR(name) \
@@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
CFQ_ATTR(desktop),
+ CFQ_ATTR(desktop_dispatch),
__ATTR_NULL
};


2009-10-03 05:56:23

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
>
> > If you could do a cleaned up version of your overload patch based on
> > this:
> >
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> >
> > then lets take it from there.

Note to self: build the darn thing after last minute changes.

Block: Delay overloading of CFQ queues to improve read latency.

Introduce a delay maximum dispatch timestamp, and stamp it when:
1. we encounter a known seeky or possibly new sync IO queue.
2. the current queue may go idle and we're draining async IO.
3. we have sync IO in flight and are servicing an async queue.
4. we are not the sole user of disk.
Disallow exceeding quantum if any of these events have occurred recently.

Protect this behavioral change with a "desktop_dispatch" knob and default
it to "on".. providing an easy means of regression verification prior to
hate-mail dispatch :) to CC list.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
... others who let somewhat hacky tweak slip by

---
block/cfq-iosched.c | 45 +++++++++++++++++++++++++++++++++++++++++----
1 file changed, 41 insertions(+), 4 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -174,6 +174,9 @@ struct cfq_data {
unsigned int cfq_slice_async_rq;
unsigned int cfq_slice_idle;
unsigned int cfq_desktop;
+ unsigned int cfq_desktop_dispatch;
+
+ unsigned long desktop_dispatch_ts;

struct list_head cic_list;

@@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq;
unsigned int max_dispatch;
+ unsigned long delay;

if (!cfqd->busy_queues)
return 0;
@@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
/*
* Drain async requests before we start sync IO
*/
- if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+ if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+ cfqd->desktop_dispatch_ts = jiffies;
return 0;
+ }

/*
* If this is an async queue and we have sync IO in flight, let it wait
*/
- if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+ if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+ cfqd->desktop_dispatch_ts = jiffies;
return 0;
+ }

max_dispatch = cfqd->cfq_quantum;
if (cfq_class_idle(cfqq))
max_dispatch = 1;

+ if (cfqd->busy_queues > 1)
+ cfqd->desktop_dispatch_ts = jiffies;
+
/*
* Does this cfqq already have too much IO in flight?
*/
@@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
return 0;

/*
+ * Don't start overloading until we've been alone for a bit.
+ */
+ if (cfqd->cfq_desktop_dispatch) {
+ delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
+
+ if (time_before(jiffies, max_delay))
+ return 0;
+ }
+
+ /*
* we are the only queue, allow up to 4 times of 'quantum'
*/
if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1942,7 +1963,7 @@ static void
cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct cfq_io_context *cic)
{
- int old_idle, enable_idle;
+ int old_idle, enable_idle, seeky = 0;

/*
* Don't idle for async or idle io prio class
@@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
return;

+ if (cfqd->hw_tag) {
+ if (CIC_SEEKY(cic))
+ seeky = 1;
+ /*
+ * If seeky or incalculable seekiness, delay overloading.
+ */
+ if (seeky || !sample_valid(cic->seek_samples))
+ cfqd->desktop_dispatch_ts = jiffies;
+ }
+
enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
+ (!cfqd->cfq_desktop && seeky))
enable_idle = 0;
else if (sample_valid(cic->ttime_samples)) {
if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
cfqd->cfq_slice_idle = cfq_slice_idle;
cfqd->cfq_desktop = 1;
+ cfqd->cfq_desktop_dispatch = 1;
+
+ cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
cfqd->hw_tag = 1;

return cfqd;
@@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
+SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
#undef SHOW_FUNCTION

#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
UINT_MAX, 0);
STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
+STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
#undef STORE_FUNCTION

#define CFQ_ATTR(name) \
@@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
CFQ_ATTR(slice_async_rq),
CFQ_ATTR(slice_idle),
CFQ_ATTR(desktop),
+ CFQ_ATTR(desktop_dispatch),
__ATTR_NULL
};


2009-10-03 06:31:17

by Mike Galbraith

[permalink] [raw]
Subject: tweaking IO latency [was Re: IO scheduler based IO controller V10]

P.S. now may be a good time to finally exit thread (and maybe trim cc?)

On Sat, 2009-10-03 at 07:56 +0200, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> >
> > > If you could do a cleaned up version of your overload patch based on
> > > this:
> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > >
> > > then lets take it from there.
>
> Note to self: build the darn thing after last minute changes.
>
> Block: Delay overloading of CFQ queues to improve read latency.
>
> Introduce a delay maximum dispatch timestamp, and stamp it when:
> 1. we encounter a known seeky or possibly new sync IO queue.
> 2. the current queue may go idle and we're draining async IO.
> 3. we have sync IO in flight and are servicing an async queue.
> 4. we are not the sole user of disk.
> Disallow exceeding quantum if any of these events have occurred recently.
>
> Protect this behavioral change with a "desktop_dispatch" knob and default
> it to "on".. providing an easy means of regression verification prior to
> hate-mail dispatch :) to CC list.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ... others who let somewhat hacky tweak slip by
>
> ---
> block/cfq-iosched.c | 45 +++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 41 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -174,6 +174,9 @@ struct cfq_data {
> unsigned int cfq_slice_async_rq;
> unsigned int cfq_slice_idle;
> unsigned int cfq_desktop;
> + unsigned int cfq_desktop_dispatch;
> +
> + unsigned long desktop_dispatch_ts;
>
> struct list_head cic_list;
>
> @@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
> struct cfq_data *cfqd = q->elevator->elevator_data;
> struct cfq_queue *cfqq;
> unsigned int max_dispatch;
> + unsigned long delay;
>
> if (!cfqd->busy_queues)
> return 0;
> @@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
> /*
> * Drain async requests before we start sync IO
> */
> - if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
> + cfqd->desktop_dispatch_ts = jiffies;
> return 0;
> + }
>
> /*
> * If this is an async queue and we have sync IO in flight, let it wait
> */
> - if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> + if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
> + cfqd->desktop_dispatch_ts = jiffies;
> return 0;
> + }
>
> max_dispatch = cfqd->cfq_quantum;
> if (cfq_class_idle(cfqq))
> max_dispatch = 1;
>
> + if (cfqd->busy_queues > 1)
> + cfqd->desktop_dispatch_ts = jiffies;
> +
> /*
> * Does this cfqq already have too much IO in flight?
> */
> @@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
> return 0;
>
> /*
> + * Don't start overloading until we've been alone for a bit.
> + */
> + if (cfqd->cfq_desktop_dispatch) {
> + delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
> +
> + if (time_before(jiffies, max_delay))
> + return 0;
> + }
> +
> + /*
> * we are the only queue, allow up to 4 times of 'quantum'
> */
> if (cfqq->dispatched >= 4 * max_dispatch)
> @@ -1942,7 +1963,7 @@ static void
> cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> struct cfq_io_context *cic)
> {
> - int old_idle, enable_idle;
> + int old_idle, enable_idle, seeky = 0;
>
> /*
> * Don't idle for async or idle io prio class
> @@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
> if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
> return;
>
> + if (cfqd->hw_tag) {
> + if (CIC_SEEKY(cic))
> + seeky = 1;
> + /*
> + * If seeky or incalculable seekiness, delay overloading.
> + */
> + if (seeky || !sample_valid(cic->seek_samples))
> + cfqd->desktop_dispatch_ts = jiffies;
> + }
> +
> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> - (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
> + (!cfqd->cfq_desktop && seeky))
> enable_idle = 0;
> else if (sample_valid(cic->ttime_samples)) {
> if (cic->ttime_mean > cfqd->cfq_slice_idle)
> @@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
> cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
> cfqd->cfq_slice_idle = cfq_slice_idle;
> cfqd->cfq_desktop = 1;
> + cfqd->cfq_desktop_dispatch = 1;
> +
> + cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
> cfqd->hw_tag = 1;
>
> return cfqd;
> @@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
> SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
> SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
> SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
> +SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
> #undef SHOW_FUNCTION
>
> #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
> @@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
> STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
> UINT_MAX, 0);
> STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
> +STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
> #undef STORE_FUNCTION
>
> #define CFQ_ATTR(name) \
> @@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
> CFQ_ATTR(slice_async_rq),
> CFQ_ATTR(slice_idle),
> CFQ_ATTR(desktop),
> + CFQ_ATTR(desktop_dispatch),
> __ATTR_NULL
> };
>
>

2009-10-03 07:20:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10


* Mike Galbraith <[email protected]> wrote:

> unsigned int cfq_desktop;
> + unsigned int cfq_desktop_dispatch;

> - if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
> + cfqd->desktop_dispatch_ts = jiffies;
> return 0;
> + }

btw., i hope all those desktop_ things will be named latency_ pretty
soon as the consensus seems to be - the word 'desktop' feels so wrong in
this context.

'desktop' is a form of use of computers and the implication of good
latencies goes far beyond that category of systems.

Ingo

2009-10-03 07:23:59

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sat, Oct 03 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> >
> > > If you could do a cleaned up version of your overload patch based on
> > > this:
> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > >
> > > then lets take it from there.
>
> Note to self: build the darn thing after last minute changes.
>
> Block: Delay overloading of CFQ queues to improve read latency.
>
> Introduce a delay maximum dispatch timestamp, and stamp it when:
> 1. we encounter a known seeky or possibly new sync IO queue.
> 2. the current queue may go idle and we're draining async IO.
> 3. we have sync IO in flight and are servicing an async queue.
> 4. we are not the sole user of disk.
> Disallow exceeding quantum if any of these events have occurred recently.
>
> Protect this behavioral change with a "desktop_dispatch" knob and default
> it to "on".. providing an easy means of regression verification prior to
> hate-mail dispatch :) to CC list.

It still doesn't build:

block/cfq-iosched.c: In function 'cfq_dispatch_requests':
block/cfq-iosched.c:1345: error: 'max_delay' undeclared (first use in
this function)

After shutting down the computer yesterday, I was thinking a bit about
this issue and how to solve it without incurring too much delay. If we
add a stricter control of the depth, that may help. So instead of
allowing up to max_quantum (or larger) depths, only allow a gradual build-up
of that depth the farther we get away from a dispatch from the sync IO
queues. For example, when switching to an async or seeky sync queue,
initially allow just 1 in flight. For the next round, if there still
hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
again, immediately drop to 1.

It could tie in with (or partly replace) the overload feature. The key
to good latency and decent throughput is knowing when to allow queue
build up and when not to.
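
(A minimal sketch of the ramp described above; "async_depth" and the helpers
are invented names, not existing cfq_data fields.)

struct depth_ramp {
        unsigned int async_depth;       /* currently allowed in-flight depth, starts at 1 */
        unsigned int max_depth;         /* upper bound, e.g. cfq_quantum */
};

/* a sync queue showed up again: collapse immediately back to depth 1 */
static void depth_ramp_saw_sync(struct depth_ramp *dr)
{
        dr->async_depth = 1;
}

/* one dispatch round went by without any sync activity: allow 2, 4, 8, ... */
static void depth_ramp_no_sync_round(struct depth_ramp *dr)
{
        dr->async_depth *= 2;
        if (dr->async_depth > dr->max_depth)
                dr->async_depth = dr->max_depth;
}

/* gate used while the active queue is async or a seeky sync queue */
static int depth_ramp_may_dispatch(struct depth_ramp *dr, unsigned int in_flight)
{
        return in_flight < dr->async_depth;
}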

--
Jens Axboe

2009-10-03 07:25:38

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sat, Oct 03 2009, Ingo Molnar wrote:
>
> * Mike Galbraith <[email protected]> wrote:
>
> > unsigned int cfq_desktop;
> > + unsigned int cfq_desktop_dispatch;
>
> > - if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> > + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
> > + cfqd->desktop_dispatch_ts = jiffies;
> > return 0;
> > + }
>
> btw., i hope all those desktop_ things will be named latency_ pretty
> soon as the consensus seems to be - the word 'desktop' feels so wrong in
> this context.
>
> 'desktop' is a form of use of computers and the implication of good
> latencies goes far beyond that category of systems.

I will rename it, for now it doesn't matter (lets not get bogged down in
bike shed colors, please).

Oh and Mike, I forgot to mention this in the previous email - no more
tunables, please. We'll keep this under a single knob.

--
Jens Axboe

2009-10-03 08:53:19

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sat, 2009-10-03 at 09:25 +0200, Jens Axboe wrote:
> On Sat, Oct 03 2009, Ingo Molnar wrote:

> Oh and Mike, I forgot to mention this in the previous email - no more
> tunables, please. We'll keep this under a single knob.

OK.

Since I don't seem to be competent to operate quilt this morning anyway,
I won't send a fixed version yet. Anyone who wants to test can easily
fix the rename booboo. With the knob in place, it's easier to see what
load is affected by what change.

Back to rummage/test.

-Mike

2009-10-03 09:00:39

by Mike Galbraith

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:

> After shutting down the computer yesterday, I was thinking a bit about
> this issue and how to solve it without incurring too much delay. If we
> add a stricter control of the depth, that may help. So instead of
> allowing up to max_quantum (or larger) depths, only allow a gradual build-up
> of that depth the farther we get away from a dispatch from the sync IO
> queues. For example, when switching to an async or seeky sync queue,
> initially allow just 1 in flight. For the next round, if there still
> hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
> again, immediately drop to 1.
>
> It could tie in with (or partly replace) the overload feature. The key
> to good latency and decent throughput is knowing when to allow queue
> build up and when not to.

Hm. Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
build/unleash any sizable IO, but that's just my gut talking.

-Mike

2009-10-03 09:01:33

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Jens,
On Sat, Oct 3, 2009 at 9:25 AM, Jens Axboe <[email protected]> wrote:
> On Sat, Oct 03 2009, Ingo Molnar wrote:
>>
>> * Mike Galbraith <[email protected]> wrote:
>>
>> >     unsigned int cfq_desktop;
>> > +   unsigned int cfq_desktop_dispatch;
>>
>> > -   if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
>> > +   if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
>> > +           cfqd->desktop_dispatch_ts = jiffies;
>> >             return 0;
>> > +   }
>>
>> btw., i hope all those desktop_ things will be named latency_ pretty
>> soon as the consensus seems to be - the word 'desktop' feels so wrong in
>> this context.
>>
>> 'desktop' is a form of use of computers and the implication of good
>> latencies goes far beyond that category of systems.
>
> I will rename it, for now it doesn't matter (lets not get bogged down in
> bike shed colors, please).
>
> Oh and Mike, I forgot to mention this in the previous email - no more
> tunables, please. We'll keep this under a single knob.

Did you have a look at my http://patchwork.kernel.org/patch/47750/ ?
It already introduces a 'target_latency' tunable, expressed in ms.

If we can quantify the benefits of each technique, we could enable
them based on the target latency requested by that single tunable.

Corrado

>
> --
> Jens Axboe
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-10-03 09:12:42

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi,
On Sat, Oct 3, 2009 at 11:00 AM, Mike Galbraith <[email protected]> wrote:
> On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
>
>> After shutting down the computer yesterday, I was thinking a bit about
>> this issue and how to solve it without incurring too much delay. If we
>> add a stricter control of the depth, that may help. So instead of
>> allowing up to max_quantum (or larger) depths, only allow a gradual build-up
>> of that depth the farther we get away from a dispatch from the sync IO
>> queues. For example, when switching to an async or seeky sync queue,
>> initially allow just 1 in flight. For the next round, if there still
>> hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
>> again, immediately drop to 1.
>>

I would limit just async I/O. Seeky sync queues are automatically
throttled by being sync, and already have high latency, so we
shouldn't increase it artificially. I think, instead, that we should
send multiple seeky requests (possibly coming from different queues)
at once. They will help especially with RAID devices, where the seeks
for requests going to different disks will happen in parallel.

>> It could tie in with (or partly replace) the overload feature. The key
>> to good latency and decent throughput is knowing when to allow queue
>> build up and when not to.
>
> Hm.  Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> build/unleash any sizable IO, but that's just my gut talking.
>
On the other hand, sending 1 write first and then waiting for it to
complete before submitting new ones will allow more merges to happen,
so the subsequent requests will be bigger and thus more efficient.

Corrado

>        -Mike
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-10-03 11:30:13

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sat, Oct 03, 2009 at 07:56:18AM +0200, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> >
> > > If you could do a cleaned up version of your overload patch based on
> > > this:
> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > >
> > > then lets take it from there.
>

> Note to self: build the darn thing after last minute changes.
>
> Block: Delay overloading of CFQ queues to improve read latency.
>
> Introduce a delay maximum dispatch timestamp, and stamp it when:
> 1. we encounter a known seeky or possibly new sync IO queue.
> 2. the current queue may go idle and we're draining async IO.
> 3. we have sync IO in flight and are servicing an async queue.
> 4 we are not the sole user of disk.
> Disallow exceeding quantum if any of these events have occurred recently.
>

So it looks like the primary issue is that we have done a lot of
dispatch from the async queue, and if some sync queue comes in now, it
will experience latencies.

For an ongoing seeky sync queue the issue will be solved to some extent,
because previously we did not choose to idle for that queue and now we
will idle; hence the async queue will not get a chance to overload the
dispatch queue.

For the sync queues where we choose not to enable idling, we will still
see the latencies. Instead of time stamping on all the above events, can
we just keep track of the last sync request completed in the system and
not allow the async queue to flood/overload the dispatch queue within a
certain time limit of that last sync request completion? This just gives
that sync queue a buffer period to come back and submit more requests
without suffering large latencies.

Thanks
Vivek


> Protect this behavioral change with a "desktop_dispatch" knob and default
> it to "on".. providing an easy means of regression verification prior to
> hate-mail dispatch :) to CC list.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ... others who let somewhat hacky tweak slip by
>
> ---
> block/cfq-iosched.c | 45 +++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 41 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -174,6 +174,9 @@ struct cfq_data {
> unsigned int cfq_slice_async_rq;
> unsigned int cfq_slice_idle;
> unsigned int cfq_desktop;
> + unsigned int cfq_desktop_dispatch;
> +
> + unsigned long desktop_dispatch_ts;
>
> struct list_head cic_list;
>
> @@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
> struct cfq_data *cfqd = q->elevator->elevator_data;
> struct cfq_queue *cfqq;
> unsigned int max_dispatch;
> + unsigned long delay;
>
> if (!cfqd->busy_queues)
> return 0;
> @@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
> /*
> * Drain async requests before we start sync IO
> */
> - if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
> + cfqd->desktop_dispatch_ts = jiffies;
> return 0;
> + }
>
> /*
> * If this is an async queue and we have sync IO in flight, let it wait
> */
> - if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> + if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
> + cfqd->desktop_dispatch_ts = jiffies;
> return 0;
> + }
>
> max_dispatch = cfqd->cfq_quantum;
> if (cfq_class_idle(cfqq))
> max_dispatch = 1;
>
> + if (cfqd->busy_queues > 1)
> + cfqd->desktop_dispatch_ts = jiffies;
> +
> /*
> * Does this cfqq already have too much IO in flight?
> */
> @@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
> return 0;
>
> /*
> + * Don't start overloading until we've been alone for a bit.
> + */
> + if (cfqd->cfq_desktop_dispatch) {
> + delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
> +
> +           if (time_before(jiffies, delay))
> + return 0;
> + }
> +
> + /*
> * we are the only queue, allow up to 4 times of 'quantum'
> */
> if (cfqq->dispatched >= 4 * max_dispatch)
> @@ -1942,7 +1963,7 @@ static void
> cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> struct cfq_io_context *cic)
> {
> - int old_idle, enable_idle;
> + int old_idle, enable_idle, seeky = 0;
>
> /*
> * Don't idle for async or idle io prio class
> @@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
> if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
> return;
>
> + if (cfqd->hw_tag) {
> + if (CIC_SEEKY(cic))
> + seeky = 1;
> + /*
> + * If seeky or incalculable seekiness, delay overloading.
> + */
> + if (seeky || !sample_valid(cic->seek_samples))
> + cfqd->desktop_dispatch_ts = jiffies;
> + }
> +
> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> - (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
> + (!cfqd->cfq_desktop && seeky))
> enable_idle = 0;
> else if (sample_valid(cic->ttime_samples)) {
> if (cic->ttime_mean > cfqd->cfq_slice_idle)
> @@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
> cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
> cfqd->cfq_slice_idle = cfq_slice_idle;
> cfqd->cfq_desktop = 1;
> + cfqd->cfq_desktop_dispatch = 1;
> +
> + cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
> cfqd->hw_tag = 1;
>
> return cfqd;
> @@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
> SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
> SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
> SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
> +SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
> #undef SHOW_FUNCTION
>
> #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
> @@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
> STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
> UINT_MAX, 0);
> STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
> +STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
> #undef STORE_FUNCTION
>
> #define CFQ_ATTR(name) \
> @@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
> CFQ_ATTR(slice_async_rq),
> CFQ_ATTR(slice_idle),
> CFQ_ATTR(desktop),
> + CFQ_ATTR(desktop_dispatch),
> __ATTR_NULL
> };
>
>

2009-10-03 12:48:21

by Vivek Goyal

[permalink] [raw]
Subject: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, Oct 03, 2009 at 07:29:15AM -0400, Vivek Goyal wrote:
> On Sat, Oct 03, 2009 at 07:56:18AM +0200, Mike Galbraith wrote:
> > On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> > >
> > > > If you could do a cleaned up version of your overload patch based on
> > > > this:
> > > >
> > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > > >
> > > > then lets take it from there.
> >
>
> > Note to self: build the darn thing after last minute changes.
> >
> > Block: Delay overloading of CFQ queues to improve read latency.
> >
> > Introduce a delay maximum dispatch timestamp, and stamp it when:
> > 1. we encounter a known seeky or possibly new sync IO queue.
> > 2. the current queue may go idle and we're draining async IO.
> > 3. we have sync IO in flight and are servicing an async queue.
> > 4 we are not the sole user of disk.
> > Disallow exceeding quantum if any of these events have occurred recently.
> >
>
> So it looks like primarily the issue seems to be that we done lot of
> dispatch from async queue and if some sync queue comes in now, it will
> experience latencies.
>
> For a ongoing seeky sync queue issue will be solved up to some extent
> because previously we did not choose to idle for that queue now we will
> idle, hence async queue will not get a chance to overload the dispatch
> queue.
>
> For the sync queues where we choose not to enable idle, we still will see
> the latencies. Instead of time stamping on all the above events, can we
> just keep track of last sync request completed in the system and don't
> allow async queue to flood/overload the dispatch queue with-in certain
> time limit of that last sync request completion. This just gives a buffer
> period to that sync queue to come back and submit more requests and
> still not suffer large latencies?
>
> Thanks
> Vivek
>

Hi Mike,

Following is a quick hack patch for the above idea. It is just compile and
boot tested. Can you please see if it helps in your scenario.

Thanks
Vivek


o Do not allow more than max_dispatch requests from an async queue, if some
sync request has finished recently. This is in the hope that sync activity
is still going on in the system and we might receive a sync request soon.
Most likely from a sync queue which finished a request and we did not enable
idling on it.

Signed-off-by: Vivek Goyal <[email protected]>
---
block/cfq-iosched.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

Index: linux22/block/cfq-iosched.c
===================================================================
--- linux22.orig/block/cfq-iosched.c 2009-10-03 08:20:26.000000000 -0400
+++ linux22/block/cfq-iosched.c 2009-10-03 08:23:24.000000000 -0400
@@ -181,6 +181,8 @@ struct cfq_data {
* Fallback dummy cfqq for extreme OOM conditions
*/
struct cfq_queue oom_cfqq;
+
+ unsigned long last_end_sync_rq;
};

enum cfqq_state_flags {
@@ -1314,6 +1316,8 @@ static int cfq_dispatch_requests(struct
* Does this cfqq already have too much IO in flight?
*/
if (cfqq->dispatched >= max_dispatch) {
+ unsigned long load_at = cfqd->last_end_sync_rq + cfq_slice_sync;
+
/*
* idle queue must always only have a single IO in flight
*/
@@ -1327,6 +1331,14 @@ static int cfq_dispatch_requests(struct
return 0;

/*
+ * If a sync request has completed recently, don't overload
+ * the dispatch queue yet with async requests.
+ */
+ if (cfqd->cfq_desktop && !cfq_cfqq_sync(cfqq)
+ && time_before(jiffies, load_at))
+ return 0;
+
+ /*
* we are the only queue, allow up to 4 times of 'quantum'
*/
if (cfqq->dispatched >= 4 * max_dispatch)
@@ -2158,8 +2170,10 @@ static void cfq_completed_request(struct
if (cfq_cfqq_sync(cfqq))
cfqd->sync_flight--;

- if (sync)
+ if (sync) {
RQ_CIC(rq)->last_end_request = now;
+ cfqd->last_end_sync_rq = now;
+ }

/*
* If this is the active queue, check if it needs to be expired,
@@ -2483,7 +2497,7 @@ static void *cfq_init_queue(struct reque
cfqd->cfq_slice_idle = cfq_slice_idle;
cfqd->cfq_desktop = 1;
cfqd->hw_tag = 1;
-
+ cfqd->last_end_sync_rq = jiffies;
return cfqd;
}

2009-10-03 12:43:12

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <[email protected]> wrote:
> On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
>> In fact I think that the 'rotating' flag name is misleading.
>> All the checks we are doing are actually checking if the device truly
>> supports multiple parallel operations, and this feature is shared by
>> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
>> NCQ-enabled SATA disk.
>>
>
> While we are at it, what happens to notion of priority of tasks on SSDs?
This is not changed by the proposed patch w.r.t. current CFQ.
> Without idling there is not continuous time slice and there is no
> fairness. So ioprio is out of the window for SSDs?
I don't have NCQ-enabled SSDs here, so I can't test it, but it seems to
me that the way in which queues are sorted in the rr tree may still
provide some sort of fairness and service differentiation for
priorities, in terms of number of IOs.
Non-NCQ SSDs, instead, will still have the idle window enabled, so it
is not an issue for them.
>
> On SSDs, will it make more sense to provide fairness in terms of number or
> IO or size of IO and not in terms of time slices.
Not on all SSDs. There are still ones that have a non-negligible
penalty for non-sequential access patterns (hopefully the ones without
NCQ, but if we find otherwise, then we will have to benchmark access
time in the I/O scheduler to select the best policy). For those,
time-based fairness may still be needed.

Thanks,
Corrado

>
> Thanks
> Vivek

2009-10-03 13:17:14

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sat, Oct 03 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
>
> > After shutting down the computer yesterday, I was thinking a bit about
> > this issue and how to solve it without incurring too much delay. If we
> > add a stricter control of the depth, that may help. So instead of
> > allowing up to max_quantum (or larger) depths, only allow gradual build
> > up of that the farther we get away from a dispatch from the sync IO
> > queues. For example, when switching to an async or seeky sync queue,
> > initially allow just 1 in flight. For the next round, if there still
> > hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
> > again, immediately drop to 1.
> >
> > It could tie in with (or partly replace) the overload feature. The key
> > to good latency and decent throughput is knowing when to allow queue
> > build up and when not to.
>
> Hm. Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> build/unleash any sizable IO, but that's just my gut talking.

Not sure, will need some testing of course. But it'll build up quickly.

--
Jens Axboe

2009-10-03 13:18:23

by Jens Axboe

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Sat, Oct 03 2009, Corrado Zoccolo wrote:
> Hi,
> On Sat, Oct 3, 2009 at 11:00 AM, Mike Galbraith <[email protected]> wrote:
> > On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
> >
> >> After shutting down the computer yesterday, I was thinking a bit about
> >> this issue and how to solve it without incurring too much delay. If we
> >> add a stricter control of the depth, that may help. So instead of
> >> allowing up to max_quantum (or larger) depths, only allow gradual build
> >> up of that the farther we get away from a dispatch from the sync IO
> >> queues. For example, when switching to an async or seeky sync queue,
> >> initially allow just 1 in flight. For the next round, if there still
> >> hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
> >> again, immediately drop to 1.
> >>
>
> I would limit just async I/O. Seeky sync queues are automatically
> throttled by being sync, and have already high latency, so we
> shouldn't increase it artificially. I think, instead, that we should
> send multiple seeky requests (possibly coming from different queues)
> at once. They will help especially with raid devices, where the seeks
> for requests going to different disks will happen in parallel.
>
Async is the prime offender, definitely.

> >> It could tie in with (or partly replace) the overload feature. The key
> >> to good latency and decent throughput is knowing when to allow queue
> >> build up and when not to.
> >
> > Hm.  Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> > build/unleash any sizable IO, but that's just my gut talking.
> >
> On the other hand, sending 1 write first and then waiting it to
> complete before submitting new ones, will help performing more merges,
> so the subsequent requests will be bigger and thus more efficient.

Usually async writes stack up very quickly, so as long as you don't
drain completely, the merging will happen automagically anyway.

--
Jens Axboe
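
As a toy illustration of the merging Jens refers to (not the block layer's actual merge path; the structure and function names here are invented for the example): a write whose start sector lines up with the end of a request still sitting in the queue can simply be folded into it, so leaving async writes queued for a while naturally produces fewer, larger requests.

#include <stdio.h>

struct toy_rq {
	unsigned long sector;	/* start sector */
	unsigned long nr;	/* number of sectors */
};

/* Back-merge 'incoming' into 'queued' if it directly follows it on disk. */
static int toy_back_merge(struct toy_rq *queued, const struct toy_rq *incoming)
{
	if (queued->sector + queued->nr != incoming->sector)
		return 0;
	queued->nr += incoming->nr;
	return 1;
}

int main(void)
{
	struct toy_rq queued = { .sector = 1000, .nr = 8 };
	struct toy_rq writes[] = {
		{ .sector = 1008, .nr = 8 },	/* contiguous: merges */
		{ .sector = 1016, .nr = 8 },	/* still contiguous: merges */
		{ .sector = 4096, .nr = 8 },	/* a seek away: stays separate */
	};
	unsigned int i;

	for (i = 0; i < sizeof(writes) / sizeof(writes[0]); i++)
		printf("write at sector %lu: %s\n", writes[i].sector,
		       toy_back_merge(&queued, &writes[i]) ?
		       "merged" : "left as a separate request");
	printf("queued request now spans %lu sectors\n", queued.nr);
	return 0;
}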

2009-10-03 13:21:13

by Jens Axboe

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, Oct 03 2009, Vivek Goyal wrote:
> On Sat, Oct 03, 2009 at 07:29:15AM -0400, Vivek Goyal wrote:
> > On Sat, Oct 03, 2009 at 07:56:18AM +0200, Mike Galbraith wrote:
> > > On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > > > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> > > >
> > > > > If you could do a cleaned up version of your overload patch based on
> > > > > this:
> > > > >
> > > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > > > >
> > > > > then lets take it from there.
> > >
> >
> > > Note to self: build the darn thing after last minute changes.
> > >
> > > Block: Delay overloading of CFQ queues to improve read latency.
> > >
> > > Introduce a delay maximum dispatch timestamp, and stamp it when:
> > > 1. we encounter a known seeky or possibly new sync IO queue.
> > > 2. the current queue may go idle and we're draining async IO.
> > > 3. we have sync IO in flight and are servicing an async queue.
> > > 4 we are not the sole user of disk.
> > > Disallow exceeding quantum if any of these events have occurred recently.
> > >
> >
> > So it looks like primarily the issue seems to be that we done lot of
> > dispatch from async queue and if some sync queue comes in now, it will
> > experience latencies.
> >
> > For a ongoing seeky sync queue issue will be solved up to some extent
> > because previously we did not choose to idle for that queue now we will
> > idle, hence async queue will not get a chance to overload the dispatch
> > queue.
> >
> > For the sync queues where we choose not to enable idle, we still will see
> > the latencies. Instead of time stamping on all the above events, can we
> > just keep track of last sync request completed in the system and don't
> > allow async queue to flood/overload the dispatch queue with-in certain
> > time limit of that last sync request completion. This just gives a buffer
> > period to that sync queue to come back and submit more requests and
> > still not suffer large latencies?
> >
> > Thanks
> > Vivek
> >
>
> Hi Mike,
>
> Following is a quick hack patch for the above idea. It is just compile and
> boot tested. Can you please see if it helps in your scenario.
>
> Thanks
> Vivek
>
>
> o Do not allow more than max_dispatch requests from an async queue, if some
> sync request has finished recently. This is in the hope that sync activity
> is still going on in the system and we might receive a sync request soon.
> Most likely from a sync queue which finished a request and we did not enable
> idling on it.

This is pretty much identical to the scheme I described, except for the
ramping of queue depth. I've applied it, it's nice and simple and I
believe this will get rid of the worst of the problem.

Things probably end up being a bit simplistic, but we can always tweak
around later.

--
Jens Axboe

2009-10-03 13:39:05

by Vivek Goyal

[permalink] [raw]
Subject: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <[email protected]> wrote:
> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> >> In fact I think that the 'rotating' flag name is misleading.
> >> All the checks we are doing are actually checking if the device truly
> >> supports multiple parallel operations, and this feature is shared by
> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
> >> NCQ-enabled SATA disk.
> >>
> >
> > While we are at it, what happens to notion of priority of tasks on SSDs?
> This is not changed by proposed patch w.r.t. current CFQ.

This is a general question irrespective of the current patch. I want to
know what our statement is w.r.t. ioprio and what it means for the user.
When do we support it and when do we not?

> > Without idling there is not continuous time slice and there is no
> > fairness. So ioprio is out of the window for SSDs?
> I haven't NCQ enabled SSDs here, so I can't test it, but it seems to
> me that the way in which queues are sorted in the rr tree may still
> provide some sort of fairness and service differentiation for
> priorities, in terms of number of IOs.

I have an NCQ-enabled SSD. Sometimes I see the difference and sometimes I
do not. I guess this happens because sometimes idling is enabled and
sometimes not, due to the dynamic nature of hw_tag.

I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
third prio7.

(prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
(prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
(prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec

Note there is almost no difference between the prio 0 and prio 4 jobs,
while the prio 7 job has been penalized heavily (it gets less than 10% of
the bandwidth of the prio 4 job).

> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
> is not an issue for them.

Agree.

> >
> > On SSDs, will it make more sense to provide fairness in terms of number or
> > IO or size of IO and not in terms of time slices.
> Not on all SSDs. There are still ones that have a non-negligible
> penalty on non-sequential access pattern (hopefully the ones without
> NCQ, but if we find otherwise, then we will have to benchmark access
> time in I/O scheduler to select the best policy). For those, time
> based may still be needed.

Ok.

So on the better SSDs out there with NCQ, we probably don't support the
notion of ioprio? Or am I missing something?

Thanks
Vivek

2009-10-03 13:57:03

by Vivek Goyal

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, Oct 03, 2009 at 03:21:15PM +0200, Jens Axboe wrote:
> On Sat, Oct 03 2009, Vivek Goyal wrote:
> > On Sat, Oct 03, 2009 at 07:29:15AM -0400, Vivek Goyal wrote:
> > > On Sat, Oct 03, 2009 at 07:56:18AM +0200, Mike Galbraith wrote:
> > > > On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > > > > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> > > > >
> > > > > > If you could do a cleaned up version of your overload patch based on
> > > > > > this:
> > > > > >
> > > > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > > > > >
> > > > > > then lets take it from there.
> > > >
> > >
> > > > Note to self: build the darn thing after last minute changes.
> > > >
> > > > Block: Delay overloading of CFQ queues to improve read latency.
> > > >
> > > > Introduce a delay maximum dispatch timestamp, and stamp it when:
> > > > 1. we encounter a known seeky or possibly new sync IO queue.
> > > > 2. the current queue may go idle and we're draining async IO.
> > > > 3. we have sync IO in flight and are servicing an async queue.
> > > > 4 we are not the sole user of disk.
> > > > Disallow exceeding quantum if any of these events have occurred recently.
> > > >
> > >
> > > So it looks like primarily the issue seems to be that we done lot of
> > > dispatch from async queue and if some sync queue comes in now, it will
> > > experience latencies.
> > >
> > > For a ongoing seeky sync queue issue will be solved up to some extent
> > > because previously we did not choose to idle for that queue now we will
> > > idle, hence async queue will not get a chance to overload the dispatch
> > > queue.
> > >
> > > For the sync queues where we choose not to enable idle, we still will see
> > > the latencies. Instead of time stamping on all the above events, can we
> > > just keep track of last sync request completed in the system and don't
> > > allow async queue to flood/overload the dispatch queue with-in certain
> > > time limit of that last sync request completion. This just gives a buffer
> > > period to that sync queue to come back and submit more requests and
> > > still not suffer large latencies?
> > >
> > > Thanks
> > > Vivek
> > >
> >
> > Hi Mike,
> >
> > Following is a quick hack patch for the above idea. It is just compile and
> > boot tested. Can you please see if it helps in your scenario.
> >
> > Thanks
> > Vivek
> >
> >
> > o Do not allow more than max_dispatch requests from an async queue, if some
> > sync request has finished recently. This is in the hope that sync activity
> > is still going on in the system and we might receive a sync request soon.
> > Most likely from a sync queue which finished a request and we did not enable
> > idling on it.
>
> This is pretty much identical to the scheme I described, except for the
> ramping of queue depth. I've applied it, it's nice and simple and I
> believe this will get rid of the worst of the problem.
>
> Things probably end up being a bit simplistic, but we can always tweak
> around later.

I have kept the overload delay period as "cfq_slice_sync", the same as Mike
had done. We shall have to experiment to find a good waiting period. Is
100ms too long if we are waiting for a request from the same process which
recently finished IO and on which we did not enable idling?

I guess we can tweak the delay period as we move along.

Thanks
Vivek

2009-10-03 13:58:11

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, 2009-10-03 at 08:40 -0400, Vivek Goyal wrote:
> On Sat, Oct 03, 2009 at 07:29:15AM -0400, Vivek Goyal wrote:
> > On Sat, Oct 03, 2009 at 07:56:18AM +0200, Mike Galbraith wrote:
> > > On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > > > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> > > >
> > > > > If you could do a cleaned up version of your overload patch based on
> > > > > this:
> > > > >
> > > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > > > >
> > > > > then lets take it from there.
> > >
> >
> > > Note to self: build the darn thing after last minute changes.
> > >
> > > Block: Delay overloading of CFQ queues to improve read latency.
> > >
> > > Introduce a delay maximum dispatch timestamp, and stamp it when:
> > > 1. we encounter a known seeky or possibly new sync IO queue.
> > > 2. the current queue may go idle and we're draining async IO.
> > > 3. we have sync IO in flight and are servicing an async queue.
> > > 4 we are not the sole user of disk.
> > > Disallow exceeding quantum if any of these events have occurred recently.
> > >
> >
> > So it looks like primarily the issue seems to be that we done lot of
> > dispatch from async queue and if some sync queue comes in now, it will
> > experience latencies.
> >
> > For a ongoing seeky sync queue issue will be solved up to some extent
> > because previously we did not choose to idle for that queue now we will
> > idle, hence async queue will not get a chance to overload the dispatch
> > queue.
> >
> > For the sync queues where we choose not to enable idle, we still will see
> > the latencies. Instead of time stamping on all the above events, can we
> > just keep track of last sync request completed in the system and don't
> > allow async queue to flood/overload the dispatch queue with-in certain
> > time limit of that last sync request completion. This just gives a buffer
> > period to that sync queue to come back and submit more requests and
> > still not suffer large latencies?
> >
> > Thanks
> > Vivek
> >
>
> Hi Mike,
>
> Following is a quick hack patch for the above idea. It is just compile and
> boot tested. Can you please see if it helps in your scenario.

Box sends hugs and kisses. s/desktop/latency and ship 'em :)

perf stat 1.70 1.94 1.32 1.89 1.87 1.7 fairness=1 overload_delay=1
1.55 1.79 1.38 1.53 1.57 1.5 desktop=1 +last_end_sync

perf stat testo.sh Avg
108.12 106.33 106.34 97.00 106.52 104.8 1.000 fairness=0 overload_delay=0
93.98 102.44 94.47 97.70 98.90 97.4 .929 fairness=0 overload_delay=1
90.87 95.40 95.79 93.09 94.25 93.8 .895 fairness=1 overload_delay=0
89.93 90.57 89.13 93.43 93.72 91.3 .871 fairness=1 overload_delay=1
89.81 88.82 91.56 96.57 89.38 91.2 .870 desktop=1 +last_end_sync

-Mike

2009-10-03 14:02:43

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, 2009-10-03 at 09:56 -0400, Vivek Goyal wrote:

> I have kept the overload delay period as "cfq_slice_sync" same as Mike had
> done. We shall have to experiment what is a good waiting perioed. Is 100ms
> too long if we are waiting for a request from same process which recently
> finished IO and we did not enable idle on it.
>
> I guess we can tweak the delay period as we move along.

I kept the delay period very short to minimize possible damage. Without
the idle thing, it wasn't enough, but with, worked a treat, as does your
patch.

-Mike

2009-10-03 14:28:38

by Jens Axboe

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, Oct 03 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 09:56 -0400, Vivek Goyal wrote:
>
> > I have kept the overload delay period as "cfq_slice_sync" same as Mike had
> > done. We shall have to experiment what is a good waiting perioed. Is 100ms
> > too long if we are waiting for a request from same process which recently
> > finished IO and we did not enable idle on it.
> >
> > I guess we can tweak the delay period as we move along.
>
> I kept the delay period very short to minimize possible damage. Without
> the idle thing, it wasn't enough, but with, worked a treat, as does your
> patch.

Can you test the current line up of patches in for-linus? It has the
ramp up I talked about included as well.

--
Jens Axboe

2009-10-03 14:33:19

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, 2009-10-03 at 16:28 +0200, Jens Axboe wrote:
> On Sat, Oct 03 2009, Mike Galbraith wrote:
> > On Sat, 2009-10-03 at 09:56 -0400, Vivek Goyal wrote:
> >
> > > I have kept the overload delay period as "cfq_slice_sync" same as Mike had
> > > done. We shall have to experiment what is a good waiting perioed. Is 100ms
> > > too long if we are waiting for a request from same process which recently
> > > finished IO and we did not enable idle on it.
> > >
> > > I guess we can tweak the delay period as we move along.
> >
> > I kept the delay period very short to minimize possible damage. Without
> > the idle thing, it wasn't enough, but with, worked a treat, as does your
> > patch.
>
> Can you test the current line up of patches in for-linus? It has the
> ramp up I talked about included as well.

Sure. I'll go find it.

-Mike

2009-10-03 14:52:32

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, 2009-10-03 at 16:28 +0200, Jens Axboe wrote:
> On Sat, Oct 03 2009, Mike Galbraith wrote:
> > On Sat, 2009-10-03 at 09:56 -0400, Vivek Goyal wrote:
> >
> > > I have kept the overload delay period as "cfq_slice_sync" same as Mike had
> > > done. We shall have to experiment what is a good waiting perioed. Is 100ms
> > > too long if we are waiting for a request from same process which recently
> > > finished IO and we did not enable idle on it.
> > >
> > > I guess we can tweak the delay period as we move along.
> >
> > I kept the delay period very short to minimize possible damage. Without
> > the idle thing, it wasn't enough, but with, worked a treat, as does your
> > patch.
>
> Can you test the current line up of patches in for-linus? It has the
> ramp up I talked about included as well.

Well, it hasn't hit git.kernel.org yet, it's at...

* block-for-linus 1d22351 cfq-iosched: add a knob for desktop interactiveness

-Mike

2009-10-03 15:15:24

by Jens Axboe

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, Oct 03 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 16:28 +0200, Jens Axboe wrote:
> > On Sat, Oct 03 2009, Mike Galbraith wrote:
> > > On Sat, 2009-10-03 at 09:56 -0400, Vivek Goyal wrote:
> > >
> > > > I have kept the overload delay period as "cfq_slice_sync" same as Mike had
> > > > done. We shall have to experiment what is a good waiting perioed. Is 100ms
> > > > too long if we are waiting for a request from same process which recently
> > > > finished IO and we did not enable idle on it.
> > > >
> > > > I guess we can tweak the delay period as we move along.
> > >
> > > I kept the delay period very short to minimize possible damage. Without
> > > the idle thing, it wasn't enough, but with, worked a treat, as does your
> > > patch.
> >
> > Can you test the current line up of patches in for-linus? It has the
> > ramp up I talked about included as well.
>
> Well, it hasn't hit git.kernel.org yet, it's at...
>
> * block-for-linus 1d22351 cfq-iosched: add a knob for desktop interactiveness

It's the top three patches here, kernel.org sync sometimes takes a
while...

http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/for-linus

--
Jens Axboe

2009-10-03 15:58:04

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, 2009-10-03 at 17:14 +0200, Jens Axboe wrote:
> On Sat, Oct 03 2009, Mike Galbraith wrote:
> > On Sat, 2009-10-03 at 16:28 +0200, Jens Axboe wrote:
> > > On Sat, Oct 03 2009, Mike Galbraith wrote:
> > > > On Sat, 2009-10-03 at 09:56 -0400, Vivek Goyal wrote:
> > > >
> > > > > I have kept the overload delay period as "cfq_slice_sync" same as Mike had
> > > > > done. We shall have to experiment what is a good waiting perioed. Is 100ms
> > > > > too long if we are waiting for a request from same process which recently
> > > > > finished IO and we did not enable idle on it.
> > > > >
> > > > > I guess we can tweak the delay period as we move along.
> > > >
> > > > I kept the delay period very short to minimize possible damage. Without
> > > > the idle thing, it wasn't enough, but with, worked a treat, as does your
> > > > patch.
> > >
> > > Can you test the current line up of patches in for-linus? It has the
> > > ramp up I talked about included as well.
> >
> > Well, it hasn't hit git.kernel.org yet, it's at...
> >
> > * block-for-linus 1d22351 cfq-iosched: add a knob for desktop interactiveness
>
> It's the top three patches here, kernel.org sync sometimes takes a
> while...
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/for-linus

Ok, already had the first two in, added the last.

Entered uncharted territory for konsole -e exit, but lost a bit of
throughput for home-brew concurrent git test.

perf stat 1.70 1.94 1.32 1.89 1.87 1.7 fairness=1 overload_delay=1
1.55 1.79 1.38 1.53 1.57 1.5 desktop=1 +last_end_sync
1.09 0.87 1.11 0.96 1.11 1.0 block-for-linus

perf stat testo.sh Avg
108.12 106.33 106.34 97.00 106.52 104.8 1.000 fairness=0 overload_delay=0
93.98 102.44 94.47 97.70 98.90 97.4 .929 fairness=0 overload_delay=1
90.87 95.40 95.79 93.09 94.25 93.8 .895 fairness=1 overload_delay=0
89.93 90.57 89.13 93.43 93.72 91.3 .871 fairness=1 overload_delay=1
89.81 88.82 91.56 96.57 89.38 91.2 .870 desktop=1 +last_end_sync
92.61 94.60 92.35 93.17 94.05 93.3 .890 block-for-linus

-Mike

2009-10-03 17:36:11

by Jens Axboe

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, Oct 03 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 17:14 +0200, Jens Axboe wrote:
> > On Sat, Oct 03 2009, Mike Galbraith wrote:
> > > On Sat, 2009-10-03 at 16:28 +0200, Jens Axboe wrote:
> > > > On Sat, Oct 03 2009, Mike Galbraith wrote:
> > > > > On Sat, 2009-10-03 at 09:56 -0400, Vivek Goyal wrote:
> > > > >
> > > > > > I have kept the overload delay period as "cfq_slice_sync" same as Mike had
> > > > > > done. We shall have to experiment what is a good waiting perioed. Is 100ms
> > > > > > too long if we are waiting for a request from same process which recently
> > > > > > finished IO and we did not enable idle on it.
> > > > > >
> > > > > > I guess we can tweak the delay period as we move along.
> > > > >
> > > > > I kept the delay period very short to minimize possible damage. Without
> > > > > the idle thing, it wasn't enough, but with, worked a treat, as does your
> > > > > patch.
> > > >
> > > > Can you test the current line up of patches in for-linus? It has the
> > > > ramp up I talked about included as well.
> > >
> > > Well, it hasn't hit git.kernel.org yet, it's at...
> > >
> > > * block-for-linus 1d22351 cfq-iosched: add a knob for desktop interactiveness
> >
> > It's the top three patches here, kernel.org sync sometimes takes a
> > while...
> >
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/for-linus
>
> Ok, already had the first two in, added the last.
>
> Entered uncharted territory for konsole -e exit, but lost a bit of
> throughput for home-brew concurrent git test.
>
> perf stat 1.70 1.94 1.32 1.89 1.87 1.7 fairness=1 overload_delay=1
> 1.55 1.79 1.38 1.53 1.57 1.5 desktop=1 +last_end_sync
> 1.09 0.87 1.11 0.96 1.11 1.0 block-for-linus

So that's pure goodness, at least.

> perf stat testo.sh Avg
> 108.12 106.33 106.34 97.00 106.52 104.8 1.000 fairness=0 overload_delay=0
> 93.98 102.44 94.47 97.70 98.90 97.4 .929 fairness=0 overload_delay=1
> 90.87 95.40 95.79 93.09 94.25 93.8 .895 fairness=1 overload_delay=0
> 89.93 90.57 89.13 93.43 93.72 91.3 .871 fairness=1 overload_delay=1
> 89.81 88.82 91.56 96.57 89.38 91.2 .870 desktop=1 +last_end_sync
> 92.61 94.60 92.35 93.17 94.05 93.3 .890 block-for-linus

Doesn't look too bad, all things considered. Apart from "stock" cfq,
it's consistent. And being consistent is a Good Thing. Performance-wise,
it's losing out to "stock" but looks pretty competitive otherwise.

So far that looks like a winner. The dictator wanted good latency, he's
getting good latency. I'll continue working on this on monday, while I'm
waiting for delivery of the Trabant.

--
Jens Axboe

2009-10-03 17:47:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)



On Sat, 3 Oct 2009, Jens Axboe wrote:
>
> Doesn't look too bad, all things considered. Apart from "stock" cfq,
> it's consistent. And being consistent is a Good Thing. Performance wise,
> it's losing out to "stock" but looks pretty competetive otherwise.

I agree. And I think the numbers for the konsole test are pretty
conclusive. That's a big improvement (on top of the already very
impressive improvement).

> So far that looks like a winner. The dictator wanted good latency, he's
> getting good latency. I'll continue working on this on monday, while I'm
> waiting for delivery of the Trabant.

Trabant?

As in the car?

Why would you _ever_ wait for delivery? The sane option would be to try to
hide, or run away?

Linus

2009-10-03 17:52:19

by Jens Axboe

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, Oct 03 2009, Linus Torvalds wrote:
>
>
> On Sat, 3 Oct 2009, Jens Axboe wrote:
> >
> > Doesn't look too bad, all things considered. Apart from "stock" cfq,
> > it's consistent. And being consistent is a Good Thing. Performance wise,
> > it's losing out to "stock" but looks pretty competetive otherwise.
>
> I agree. And I think the numbers for the kconsole test are pretty
> conclusive. That's a big improvement (on top of the already very
> impressive improvement).

Yes, very much so. The tweaks are mostly straightforward, so my
confidence in the end results from a "will this work" standpoint is
good. It will likely be somewhat slower for some things, but we can fix
those up as we continue testing.

I won't ask you to pull this yet, but I likely will next week when I've
done some benchmarks with it for the other end of the spectrum.

> > So far that looks like a winner. The dictator wanted good latency, he's
> > getting good latency. I'll continue working on this on monday, while I'm
> > waiting for delivery of the Trabant.
>
> Trabant?
>
> As in the car?
>
> Why would you _ever_ wait for delivery? The sane option would be to try to
> hide, or run away?

OK, so I'm not really waiting for a Trabant. I do have a car on order,
but not a 2-stroke :-)

--
Jens Axboe

2009-10-03 19:08:41

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, 2009-10-03 at 19:35 +0200, Jens Axboe wrote:
> On Sat, Oct 03 2009, Mike Galbraith wrote:
> > On Sat, 2009-10-03 at 17:14 +0200, Jens Axboe wrote:
> > > On Sat, Oct 03 2009, Mike Galbraith wrote:
> > > > On Sat, 2009-10-03 at 16:28 +0200, Jens Axboe wrote:
> > > > > On Sat, Oct 03 2009, Mike Galbraith wrote:
> > > > > > On Sat, 2009-10-03 at 09:56 -0400, Vivek Goyal wrote:
> > > > > >
> > > > > > > I have kept the overload delay period as "cfq_slice_sync" same as Mike had
> > > > > > > done. We shall have to experiment what is a good waiting perioed. Is 100ms
> > > > > > > too long if we are waiting for a request from same process which recently
> > > > > > > finished IO and we did not enable idle on it.
> > > > > > >
> > > > > > > I guess we can tweak the delay period as we move along.
> > > > > >
> > > > > > I kept the delay period very short to minimize possible damage. Without
> > > > > > the idle thing, it wasn't enough, but with, worked a treat, as does your
> > > > > > patch.
> > > > >
> > > > > Can you test the current line up of patches in for-linus? It has the
> > > > > ramp up I talked about included as well.
> > > >
> > > > Well, it hasn't hit git.kernel.org yet, it's at...
> > > >
> > > > * block-for-linus 1d22351 cfq-iosched: add a knob for desktop interactiveness
> > >
> > > It's the top three patches here, kernel.org sync sometimes takes a
> > > while...
> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/for-linus
> >
> > Ok, already had the first two in, added the last.
> >
> > Entered uncharted territory for konsole -e exit, but lost a bit of
> > throughput for home-brew concurrent git test.
> >
> > perf stat 1.70 1.94 1.32 1.89 1.87 1.7 fairness=1 overload_delay=1
> > 1.55 1.79 1.38 1.53 1.57 1.5 desktop=1 +last_end_sync
> > 1.09 0.87 1.11 0.96 1.11 1.0 block-for-linus
>
> So that's pure goodness, at least.

Yeah, but it's a double edged sword, _maybe_ cut too far in the other
direction. (impression)

> > perf stat testo.sh Avg
> > 108.12 106.33 106.34 97.00 106.52 104.8 1.000 fairness=0 overload_delay=0
> > 93.98 102.44 94.47 97.70 98.90 97.4 .929 fairness=0 overload_delay=1
> > 90.87 95.40 95.79 93.09 94.25 93.8 .895 fairness=1 overload_delay=0
> > 89.93 90.57 89.13 93.43 93.72 91.3 .871 fairness=1 overload_delay=1
> > 89.81 88.82 91.56 96.57 89.38 91.2 .870 desktop=1 +last_end_sync
> > 92.61 94.60 92.35 93.17 94.05 93.3 .890 block-for-linus
>
> Doesn't look too bad, all things considered. Apart from "stock" cfq,
> it's consistent. And being consistent is a Good Thing. Performance wise,
> it's losing out to "stock" but looks pretty competetive otherwise.

No, not bad at all, still a large win over stock.

> So far that looks like a winner. The dictator wanted good latency, he's
> getting good latency. I'll continue working on this on monday, while I'm
> waiting for delivery of the Trabant.

I'm unsure feel wise. Disk is sounding too seeky, which worries me.

-Mike

2009-10-03 19:12:09

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, 2009-10-03 at 21:07 +0200, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 19:35 +0200, Jens Axboe wrote:

> > So that's pure goodness, at least.
>
> Yeah, but it's a double edged sword, _maybe_ cut too far in the other
> direction. (impression)
>
> > > perf stat testo.sh Avg
> > > 108.12 106.33 106.34 97.00 106.52 104.8 1.000 fairness=0 overload_delay=0
> > > 93.98 102.44 94.47 97.70 98.90 97.4 .929 fairness=0 overload_delay=1
> > > 90.87 95.40 95.79 93.09 94.25 93.8 .895 fairness=1 overload_delay=0
> > > 89.93 90.57 89.13 93.43 93.72 91.3 .871 fairness=1 overload_delay=1
> > > 89.81 88.82 91.56 96.57 89.38 91.2 .870 desktop=1 +last_end_sync
> > > 92.61 94.60 92.35 93.17 94.05 93.3 .890 block-for-linus
> >
> > Doesn't look too bad, all things considered. Apart from "stock" cfq,
> > it's consistent. And being consistent is a Good Thing. Performance wise,
> > it's losing out to "stock" but looks pretty competetive otherwise.
>
> No, not bad at all, still a large win over stock.
>
> > So far that looks like a winner. The dictator wanted good latency, he's
> > getting good latency. I'll continue working on this on monday, while I'm
> > waiting for delivery of the Trabant.
>
> I'm unsure feel wise. Disk is sounding too seeky, which worries me.

But this is a _huge_ improvement in the dd vs reader thing, regardless
of any further tweaking that may or may not prove necessary. That
ages-old corner case seems to be defeated.

-Mike

2009-10-03 19:24:01

by Jens Axboe

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, Oct 03 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 19:35 +0200, Jens Axboe wrote:
> > On Sat, Oct 03 2009, Mike Galbraith wrote:
> > > On Sat, 2009-10-03 at 17:14 +0200, Jens Axboe wrote:
> > > > On Sat, Oct 03 2009, Mike Galbraith wrote:
> > > > > On Sat, 2009-10-03 at 16:28 +0200, Jens Axboe wrote:
> > > > > > On Sat, Oct 03 2009, Mike Galbraith wrote:
> > > > > > > On Sat, 2009-10-03 at 09:56 -0400, Vivek Goyal wrote:
> > > > > > >
> > > > > > > > I have kept the overload delay period as "cfq_slice_sync" same as Mike had
> > > > > > > > done. We shall have to experiment what is a good waiting perioed. Is 100ms
> > > > > > > > too long if we are waiting for a request from same process which recently
> > > > > > > > finished IO and we did not enable idle on it.
> > > > > > > >
> > > > > > > > I guess we can tweak the delay period as we move along.
> > > > > > >
> > > > > > > I kept the delay period very short to minimize possible damage. Without
> > > > > > > the idle thing, it wasn't enough, but with, worked a treat, as does your
> > > > > > > patch.
> > > > > >
> > > > > > Can you test the current line up of patches in for-linus? It has the
> > > > > > ramp up I talked about included as well.
> > > > >
> > > > > Well, it hasn't hit git.kernel.org yet, it's at...
> > > > >
> > > > > * block-for-linus 1d22351 cfq-iosched: add a knob for desktop interactiveness
> > > >
> > > > It's the top three patches here, kernel.org sync sometimes takes a
> > > > while...
> > > >
> > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/for-linus
> > >
> > > Ok, already had the first two in, added the last.
> > >
> > > Entered uncharted territory for konsole -e exit, but lost a bit of
> > > throughput for home-brew concurrent git test.
> > >
> > > perf stat 1.70 1.94 1.32 1.89 1.87 1.7 fairness=1 overload_delay=1
> > > 1.55 1.79 1.38 1.53 1.57 1.5 desktop=1 +last_end_sync
> > > 1.09 0.87 1.11 0.96 1.11 1.0 block-for-linus
> >
> > So that's pure goodness, at least.
>
> Yeah, but it's a double edged sword, _maybe_ cut too far in the other
> direction. (impression)

How can it be too fast? IOW, I think you'll have to quantify that
statement :-)

> > > perf stat testo.sh Avg
> > > 108.12 106.33 106.34 97.00 106.52 104.8 1.000 fairness=0 overload_delay=0
> > > 93.98 102.44 94.47 97.70 98.90 97.4 .929 fairness=0 overload_delay=1
> > > 90.87 95.40 95.79 93.09 94.25 93.8 .895 fairness=1 overload_delay=0
> > > 89.93 90.57 89.13 93.43 93.72 91.3 .871 fairness=1 overload_delay=1
> > > 89.81 88.82 91.56 96.57 89.38 91.2 .870 desktop=1 +last_end_sync
> > > 92.61 94.60 92.35 93.17 94.05 93.3 .890 block-for-linus
> >
> > Doesn't look too bad, all things considered. Apart from "stock" cfq,
> > it's consistent. And being consistent is a Good Thing. Performance wise,
> > it's losing out to "stock" but looks pretty competetive otherwise.
>
> No, not bad at all, still a large win over stock.
>
> > So far that looks like a winner. The dictator wanted good latency, he's
> > getting good latency. I'll continue working on this on monday, while I'm
> > waiting for delivery of the Trabant.
>
> I'm unsure feel wise. Disk is sounding too seeky, which worries me.

Care to elaborate on the feel? Seekiness is not good of course;
depending on timing, the async delay could cause some skipping back and
forth. But remember that when you don't hear the disk, it could well be
doing the async IO, which will make the disk very quiet (since it's
just a streamed write). The konsole test is bound to cause seeks when
it's juggling async IO too. Even alone it's likely pretty seeky. So is
the seekiness persistent, or does it only show up briefly when starting
konsole?

--
Jens Axboe

2009-10-03 19:50:41

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, 2009-10-03 at 21:23 +0200, Jens Axboe wrote:
> On Sat, Oct 03 2009, Mike Galbraith wrote:

> > > So that's pure goodness, at least.
> >
> > Yeah, but it's a double edged sword, _maybe_ cut too far in the other
> > direction. (impression)
>
> How can it be too fast? IOW, I think you'll have to quantify that
> statement :-)

Oh boy. If it were perfectly fair, it should be roughly twice the time
it takes to load seekily when running solo, which it has exceeded
considerably. I'm not complaining mind you, just being a worry wart.
Previously the reader was suffering the pains of hell, and the two
dinky changes made it match my expectations nearly perfectly (1.7 sec is
real close to 1.8, which is real close to the 0.9 it takes to get the
bugger loaded cold).

> > > So far that looks like a winner. The dictator wanted good latency, he's
> > > getting good latency. I'll continue working on this on monday, while I'm
> > > waiting for delivery of the Trabant.
> >
> > I'm unsure feel wise. Disk is sounding too seeky, which worries me.
>
> Care to elaborate on the feel? Seekiness is not good of course,
> depending on timing the async delay could cause some skipping back and
> forth. But remember that when you don't hear the disk, it could likely
> be doing the async IO which will make the disk very quiet (since it's
> just a streamed write). The konsole test is bound to cause seeks, when
> it's juggling async IO too. Even alone it's likely pretty seeky. So is
> the seekiness persistent, or just shortly when starting konsole?

It's a huge winner for sure, and there's no way to quantify. I'm just
afraid the other shoe will drop from what I see/hear. I should have
kept my trap shut and waited really, but the impression was strong.
Sorry for making unquantifiable noise. Ignore me. I've been
watching/feeling tests for too many hours and hours and hours ;-)

-Mike

2009-10-04 09:16:04

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

Hi Vivek,
On Sat, Oct 3, 2009 at 3:38 PM, Vivek Goyal <[email protected]> wrote:
> On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
>> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <[email protected]> wrote:
>> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
>> >> In fact I think that the 'rotating' flag name is misleading.
>> >> All the checks we are doing are actually checking if the device truly
>> >> supports multiple parallel operations, and this feature is shared by
>> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
>> >> NCQ-enabled SATA disk.
>> >>
>> >
>> > While we are at it, what happens to notion of priority of tasks on SSDs?
>> This is not changed by proposed patch w.r.t. current CFQ.
>
> This is a general question irrespective of current patch. Want to know
> what is our statement w.r.t ioprio and what it means for user? When do
> we support it and when do we not.
>
>> > Without idling there is not continuous time slice and there is no
>> > fairness. So ioprio is out of the window for SSDs?
>> I haven't NCQ enabled SSDs here, so I can't test it, but it seems to
>> me that the way in which queues are sorted in the rr tree may still
>> provide some sort of fairness and service differentiation for
>> priorities, in terms of number of IOs.
>
> I have a NCQ enabled SSD. Sometimes I see the difference sometimes I do
> not. I guess this happens because sometimes idling is enabled and sometmes
> not because of dyanamic nature of hw_tag.
>
My guess is that the formula that is used to handle this case is not
very stable.
The culprit code is (in cfq_service_tree_add):
	} else if (!add_front) {
		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
		rb_key += cfqq->slice_resid;
		cfqq->slice_resid = 0;
	} else

cfq_slice_offset is defined as:

static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
				      struct cfq_queue *cfqq)
{
	/*
	 * just an approximation, should be ok.
	 */
	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
		cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
}

Can you try changing the latter to something simpler (we already observed
that busy_queues is unstable, and I think it is not needed here at all):
	return -cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
and remove the 'rb_key += cfqq->slice_resid;' from the former.

This should give tasks with larger slices a higher probability of being
first in the queue, so it will work if we don't idle, but it needs some
adjustment if we do idle.
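
To see how the two keys compare, here is a small stand-alone comparison (assumptions: HZ=1000, the default 100 ms sync base slice, and a simplified stand-in for cfq_prio_slice(); the jiffies and slice_resid terms of the real rb_key are left out). Smaller keys sort earlier in the service tree. With the current formula the spacing between priorities scales with busy_queues, which fluctuates; with the proposed formula each priority gets a fixed offset.

#include <stdio.h>

#define HZ		1000		/* assumed for the example */
#define BASE_SLICE	(HZ / 10)	/* default sync slice, 100 ms */
#define SLICE_SCALE	5

/* Simplified stand-in for cfq_prio_slice() on a sync queue: higher
 * priority (lower ioprio number) gets a longer slice. */
static int prio_slice(int ioprio)
{
	return BASE_SLICE + (BASE_SLICE / SLICE_SCALE) * (4 - ioprio);
}

/* Current key, minus the jiffies/slice_resid terms: scales with busy_queues. */
static long key_current(int busy_queues, int ioprio)
{
	return (long)(busy_queues - 1) * (prio_slice(0) - prio_slice(ioprio));
}

/* Proposed key: depends only on the queue's own slice. */
static long key_proposed(int ioprio)
{
	return -prio_slice(ioprio);
}

int main(void)
{
	int prios[] = { 0, 4, 7 };
	int busy, i;

	for (busy = 2; busy <= 8; busy *= 2) {
		printf("busy_queues=%d:", busy);
		for (i = 0; i < 3; i++)
			printf("  prio%d cur=%ld new=%ld", prios[i],
			       key_current(busy, prios[i]),
			       key_proposed(prios[i]));
		printf("\n");
	}
	return 0;
}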

> I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
> third prio7.
>
> (prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
> (prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
> (prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec
>
> Note there is almost no difference between prio 0 and prio 4 job and prio7
> job has been penalized heavily (gets less than 10% BW of prio 4 job).
>
>> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
>> is not an issue for them.
>
> Agree.
>
>> >
>> > On SSDs, will it make more sense to provide fairness in terms of number or
>> > IO or size of IO and not in terms of time slices.
>> Not on all SSDs. There are still ones that have a non-negligible
>> penalty on non-sequential access pattern (hopefully the ones without
>> NCQ, but if we find otherwise, then we will have to benchmark access
>> time in I/O scheduler to select the best policy). For those, time
>> based may still be needed.
>
> Ok.
>
> So on better SSDs out there with NCQ, we probably don't support the notion of
> ioprio? Or, I am missing something.

I think we try, but the current formula is simply not good enough.

Thanks,
Corrado

>
> Thanks
> Vivek
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

2009-10-04 10:51:33

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sat, 2009-10-03 at 21:49 +0200, Mike Galbraith wrote:

> It's a huge winner for sure, and there's no way to quantify. I'm just
> afraid the other shoe will drop from what I see/hear. I should have
> kept my trap shut and waited really, but the impression was strong.

Seems there was one "other shoe" at least. For concurrent read vs
write, we're losing ~10% throughput that we weren't losing prior to that
last commit. I got it back, and the concurrent git throughput back as
well with the tweak below, _seemingly_ without significant sacrifice.

cfq-iosched: adjust async delay.

8e29675: "implement slower async initiate and queue ramp up" introduced a
throughput regression for concurrent reader vs writer. Adjusting async delay
to use cfq_slice_async, unless someone adjusts async to have more bandwidth
allocation than sync, restored throughput.

Signed-off-by: Mike Galbraith <[email protected]>

---
block/cfq-iosched.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -1343,17 +1343,19 @@ static int cfq_dispatch_requests(struct
 	 */
 	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_desktop) {
 		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
+		unsigned long slice = max(cfq_slice_sync, cfq_slice_async);
 		unsigned int depth;
 
+		slice = min(slice, cfq_slice_async);
 		/*
 		 * must wait a bit longer
 		 */
-		if (last_sync < cfq_slice_sync) {
-			cfq_schedule_dispatch(cfqd, cfq_slice_sync - last_sync);
+		if (last_sync < slice) {
+			cfq_schedule_dispatch(cfqd, slice - last_sync);
 			return 0;
 		}
 
-		depth = last_sync / cfq_slice_sync;
+		depth = last_sync / slice;
 		if (depth < max_dispatch)
 			max_dispatch = depth;
 	}

--numbers--

dd vs konsole -e exit
1.70 1.94 1.32 1.89 1.87 1.7 fairness=1 overload_delay=1
1.55 1.79 1.38 1.53 1.57 1.5 desktop=1 +last_end_sync
1.09 0.87 1.11 0.96 1.11 1.02 block-for-linus
1.10 1.13 0.98 1.11 1.13 1.09 block-for-linus + tweak

concurrent git test
Avg
108.12 106.33 106.34 97.00 106.52 104.8 1.000 virgin
89.81 88.82 91.56 96.57 89.38 91.2 .870 desktop=1 +last_end_sync
92.61 94.60 92.35 93.17 94.05 93.3 .890 blk-for-linus
89.33 88.82 89.99 88.54 89.09 89.1 .850 blk-for-linus + tweak

read vs write test

desktop=0 Avg
elapsed 98.23 91.97 91.77 93.9 sec 1.000
30s-dd-read 48.5 49.6 49.1 49.0 mb/s 1.000
30s-dd-write 23.1 27.3 31.3 27.2 1.000
dd-read-total 49.4 50.1 49.6 49.7 1.000
dd-write-total 34.5 34.9 34.9 34.7 1.000

desktop=1 pop 8e296755 Avg
elapsed 93.30 92.77 90.11 92.0 .979
30s-dd-read 50.5 50.4 51.8 50.9 1.038
30s-dd-write 22.7 26.4 27.7 25.6 .941
dd-read-total 51.2 50.1 51.6 50.9 1.024
dd-write-total 34.2 34.5 35.6 34.7 1.000

desktop=1 push 8e296755 Avg
elapsed 104.51 104.52 101.20 103.4 1.101
30s-dd-read 43.0 43.6 44.5 43.7 .891
30s-dd-write 21.4 23.9 28.9 24.7 .908
dd-read-total 42.9 43.0 43.5 43.1 .867
dd-write-total 30.4 30.3 31.5 30.7 .884

desktop=1 push 8e296755 + tweak Avg
elapsed 92.10 94.34 93.68 93.3 .993
30s-dd-read 49.7 49.3 48.8 49.2 1.004
30s-dd-write 23.7 27.1 23.1 24.6 .904
dd-read-total 50.2 50.1 48.7 49.6 .997
dd-write-total 34.7 33.9 34.0 34.2 .985

#!/bin/sh

# one-time setup: create the file that the reader below will read back
# dd if=/dev/zero of=deleteme bs=1M count=3000

# drop cached dentries and inodes
echo 2 > /proc/sys/vm/drop_caches

# concurrent streaming writer and reader
dd if=/dev/zero of=deleteme2 bs=1M count=3000 &
dd if=deleteme of=/dev/null bs=1M count=3000 &
sleep 30
# SIGUSR1 makes dd print its transfer statistics (the 30s-dd-* numbers)
killall -q -USR1 dd &
wait
rm -f deleteme2
sync


2009-10-04 11:34:33

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

Wrong one. Let's try the post java version instead.

8e29675: "implement slower async initiate and queue ramp up" introduced a
throughput regression for concurrent reader vs writer. Adjusting async delay
to use cfq_slice_async restored throughput.

Signed-off-by: Mike Galbraith <[email protected]>

---
block/cfq-iosched.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -1343,17 +1343,18 @@ static int cfq_dispatch_requests(struct
 	 */
 	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_desktop) {
 		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
+		unsigned long slice = cfq_slice_async;
 		unsigned int depth;
 
 		/*
 		 * must wait a bit longer
 		 */
-		if (last_sync < cfq_slice_sync) {
-			cfq_schedule_dispatch(cfqd, cfq_slice_sync - last_sync);
+		if (last_sync < slice) {
+			cfq_schedule_dispatch(cfqd, slice - last_sync);
 			return 0;
 		}
 
-		depth = last_sync / cfq_slice_sync;
+		depth = last_sync / slice;
 		if (depth < max_dispatch)
 			max_dispatch = depth;
 	}

2009-10-04 12:12:58

by Vivek Goyal

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
> Hi Vivek,
> On Sat, Oct 3, 2009 at 3:38 PM, Vivek Goyal <[email protected]> wrote:
> > On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
> >> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <[email protected]> wrote:
> >> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> >> >> In fact I think that the 'rotating' flag name is misleading.
> >> >> All the checks we are doing are actually checking if the device truly
> >> >> supports multiple parallel operations, and this feature is shared by
> >> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
> >> >> NCQ-enabled SATA disk.
> >> >>
> >> >
> >> > While we are at it, what happens to notion of priority of tasks on SSDs?
> >> This is not changed by proposed patch w.r.t. current CFQ.
> >
> > This is a general question irrespective of current patch. Want to know
> > what is our statement w.r.t ioprio and what it means for user? When do
> > we support it and when do we not.
> >
> >> > Without idling there is not continuous time slice and there is no
> >> > fairness. So ioprio is out of the window for SSDs?
> >> I haven't NCQ enabled SSDs here, so I can't test it, but it seems to
> >> me that the way in which queues are sorted in the rr tree may still
> >> provide some sort of fairness and service differentiation for
> >> priorities, in terms of number of IOs.
> >
> > I have a NCQ enabled SSD. Sometimes I see the difference sometimes I do
> > not. I guess this happens because sometimes idling is enabled and sometimes
> > not because of the dynamic nature of hw_tag.
> >
> My guess is that the formula that is used to handle this case is not
> very stable.

In general I agree that the formula used to calculate the slice offset is
very puzzling, as busy_queues varies and that changes the position of the
task sometimes.

> The culprit code is (in cfq_service_tree_add):
> } else if (!add_front) {
> rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
> rb_key += cfqq->slice_resid;
> cfqq->slice_resid = 0;
> } else
>
> cfq_slice_offset is defined as:
>
> static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
> struct cfq_queue *cfqq)
> {
> /*
> * just an approximation, should be ok.
> */
> return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
> cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
> }
>
> Can you try changing the latter to a simpler (we already observed that
> busy_queues is unstable, and I think that here it is not needed at
> all):
> return -cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
> and remove the 'rb_key += cfqq->slice_resid; ' from the former.
>
> This should give a higher probability of being first on the queue to
> larger slice tasks, so it will work if we don't idle, but it needs
> some adjustment if we idle.

I am not sure what's the intent here by removing busy_queues stuff. I have
got two questions though.

- Why don't we keep it simple round robin where a task is simply placed at
the end of service tree.

- Secondly, CFQ provides full slice length only to queues which are
idling (in case of a sequential reader). If we do not enable idling, as
in the case of NCQ enabled SSDs, then CFQ will expire the queue almost
immediately and put the queue at the end of the service tree (almost).

So if we don't enable idling, at max we can provide fairness; we
essentially just let every queue dispatch one request and put it at the
end of the service tree. Hence no fairness....

Thanks
Vivek

>
> > I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
> > third prio7.
> >
> > (prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
> > (prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
> > (prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec
> >
> > Note there is almost no difference between prio 0 and prio 4 job and prio7
> > job has been penalized heavily (gets less than 10% BW of prio 4 job).
> >
> >> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
> >> is not an issue for them.
> >
> > Agree.
> >
> >> >
> >> > On SSDs, will it make more sense to provide fairness in terms of number or
> >> > IO or size of IO and not in terms of time slices.
> >> Not on all SSDs. There are still ones that have a non-negligible
> >> penalty on non-sequential access pattern (hopefully the ones without
> >> NCQ, but if we find otherwise, then we will have to benchmark access
> >> time in I/O scheduler to select the best policy). For those, time
> >> based may still be needed.
> >
> > Ok.
> >
> > So on better SSDs out there with NCQ, we probably don't support the notion of
> > ioprio? Or, I am missing something.
>
> I think we try, but the current formula is simply not good enough.
>
> Thanks,
> Corrado
>
> >
> > Thanks
> > Vivek
> >
>
>
>
> --
> __________________________________________________________________________
>
> dott. Corrado Zoccolo mailto:[email protected]
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------
> The self-confidence of a warrior is not the self-confidence of the average
> man. The average man seeks certainty in the eyes of the onlooker and calls
> that self-confidence. The warrior seeks impeccability in his own eyes and
> calls that humbleness.
> Tales of Power - C. Castaneda

2009-10-04 12:47:23

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

Hi Vivek,
On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <[email protected]> wrote:
> On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
>> Hi Vivek,
>> My guess is that the formula that is used to handle this case is not
>> very stable.
>
> In general I agree that formula to calculate the slice offset is very
> puzzling as busy_queues varies and that changes the position of the task
> sometimes.
>
> I am not sure what's the intent here by removing busy_queues stuff. I have
> got two questions though.

In the ideal steady-state case, busy_queues will be constant. Since
we are just comparing the values among themselves, we can simply
remove this constant completely.

Whenever it is not constant, it seems to me that it can cause wrong
behaviour, i.e. when the number of processes with ready I/O decreases, a
later-arriving request can jump ahead of older requests.
So it seems it does more harm than good, hence I suggest removing it.

Moreover, I suggest removing the slice_resid part as well, since its
semantics don't seem consistent.
When computed, it is not the residency, but the remaining time slice.
Then it is used to postpone, instead of anticipate, the position of
the queue in the RR tree, which seems counterintuitive (it would be
intuitive, though, if it were actually a residency, not a remaining
slice, i.e. you already got your full share, so you can wait longer to
be serviced again).
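
For reference, this is roughly what the 2.6.31 code does on a timed-out
expiry: the unused part of the slice is stored and later added to rb_key
in cfq_service_tree_add(), pushing the queue further back:

	/* in __cfq_slice_expired(), roughly: */
	if (timed_out && !cfq_cfqq_slice_new(cfqq))
		cfqq->slice_resid = cfqq->slice_end - jiffies;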

>
> - Why don't we keep it simple round robin where a task is simply placed at
>  the end of service tree.

This should work for the idling case, since we provide service
differentiation by means of time slice.
For the non-idling case, though, the appropriate placement of queues in
the tree (as given by my formula) can still provide it.

>
> - Secondly, CFQ provides full slice length only to queues which are
>  idling (in case of a sequential reader). If we do not enable idling, as
>  in the case of NCQ enabled SSDs, then CFQ will expire the queue almost
>  immediately and put the queue at the end of the service tree (almost).
>
> So if we don't enable idling, at max we can provide fairness; we
> essentially just let every queue dispatch one request and put it at the
> end of the service tree. Hence no fairness....

We should distinguish the two terms fairness and service
differentiation. Fairness is when every queue gets the same amount of
service share. This is not what we want when priorities are different
(we want the service differentiation, instead), but is what we get if
we do just round robin without idling.

To fix this, we can alter the placement in the tree, so that if we
have Q1 with slice S1, and Q2 with slice S2, always ready to perform
I/O, we get that Q1 is in front of the tree with probability
S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2).
This is what my formula should achieve.
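
As a rough worked example (a toy user-space illustration of the arithmetic;
the slice values assume the default 100ms base sync slice, so prio 0 gets
about 180ms and prio 7 about 40ms, the exact numbers depending on HZ and the
tunables):

#include <stdio.h>

int main(void)
{
	double s1 = 180.0;	/* approx. sync slice for prio 0 */
	double s2 = 40.0;	/* approx. sync slice for prio 7 */

	/* expected fraction of time each queue sits at the front */
	printf("Q1 (prio 0): %.2f\n", s1 / (s1 + s2));	/* ~0.82 */
	printf("Q2 (prio 7): %.2f\n", s2 / (s1 + s2));	/* ~0.18 */
	return 0;
}

so even with one request dispatched per visit, the prio 0 queue should see
roughly 4-5 times the share of the prio 7 queue.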

Thanks,
Corrado

>
> Thanks
> Vivek
>

2009-10-04 16:17:26

by Fabio Checconi

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

> From: Corrado Zoccolo <[email protected]>
> Date: Sun, Oct 04, 2009 02:46:44PM +0200
>
> Hi Vivek,
> On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <[email protected]> wrote:
> > On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> My guess is that the formula that is used to handle this case is not
> >> very stable.
> >
> > In general I agree that formula to calculate the slice offset is very
> > puzzling as busy_queues varies and that changes the position of the task
> > sometimes.
> >
> > I am not sure what's the intent here by removing busy_queues stuff. I have
> > got two questions though.
>
> In the ideal case steady state, busy_queues will be a constant. Since
> we are just comparing the values between themselves, we can just
> remove this constant completely.
>
> Whenever it is not constant, it seems to me that it can cause wrong
> behaviour, i.e. when the number of processes with ready I/O reduces, a
> later coming request can jump before older requests.
> So it seems it does more harm than good, hence I suggest to remove it.
>
> Moreover, I suggest removing also the slice_resid part, since its
> semantics doesn't seem consistent.
> When computed, it is not the residency, but the remaining time slice.
> Then it is used to postpone, instead of anticipate, the position of
> the queue in the RR, that seems counterintuitive (it would be
> intuitive, though, if it was actually a residency, not a remaining
> slice, i.e. you already got your full share, so you can wait longer to
> be serviced again).
>
> >
> > - Why don't we keep it simple round robin where a task is simply placed at
> >  the end of service tree.
>
> This should work for the idling case, since we provide service
> differentiation by means of time slice.
> For non-idling case, though, the appropriate placement of queues in
> the tree (as given by my formula) can still provide it.
>
> >
> > - Secondly, CFQ provides full slice length only to queues which are
> >  idling (in case of a sequential reader). If we do not enable idling, as
> >  in the case of NCQ enabled SSDs, then CFQ will expire the queue almost
> >  immediately and put the queue at the end of the service tree (almost).
> >
> > So if we don't enable idling, at max we can provide fairness; we
> > essentially just let every queue dispatch one request and put it at the
> > end of the service tree. Hence no fairness....
>
> We should distinguish the two terms fairness and service
> differentiation. Fairness is when every queue gets the same amount of
> service share. This is not what we want when priorities are different
> (we want the service differentiation, instead), but is what we get if
> we do just round robin without idling.
>
> To fix this, we can alter the placement in the tree, so that if we
> have Q1 with slice S1, and Q2 with slice S2, always ready to perform
> I/O, we get that Q1 is in front of the tree with probability
> S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2).
> This is what my formula should achieve.
>

But if the ``always ready to perform I/O'' assumption held then even RR
would have provided service differentiation, always seeing backlogged
queues and serving them according to their weights.

In this case the problem is what Vivek described some time ago as the
interlocked service of sync queues, where the scheduler is trying to
differentiate between the queues, but they are not always asking for
service (as they are synchronous and they are backlogged only for short
time intervals).

2009-10-04 17:39:40

by Jens Axboe

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sun, Oct 04 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 21:49 +0200, Mike Galbraith wrote:
>
> > It's a huge winner for sure, and there's no way to quantify. I'm just
> > afraid the other shoe will drop from what I see/hear. I should have
> > kept my trap shut and waited really, but the impression was strong.
>
> Seems there was one "other shoe" at least. For concurrent read vs
> write, we're losing ~10% throughput that we weren't losing prior to that
> last commit. I got it back, and the concurrent git throughput back as
> well with the tweak below, _seemingly_ without significant sacrifice.
>
> cfq-iosched: adjust async delay.
>
> 8e29675: "implement slower async initiate and queue ramp up" introduced a
> throughput regression for concurrent reader vs writer. Adjusting async delay
> to use cfq_slice_async, unless someone adjusts async to have more bandwidth
> allocation than sync, restored throughput.

After committing it yesterday, I was thinking more about this. We cannot
do the delay thing without at least doing one dispatch, or we risk
starving async writeout completely. This is problematic, as it could
cause indefinite delays and this opens up easy DoS attacks from local
users.

So I'll commit a change that doesn't do the delay at all; basically it
just offsets the current code by one slice.
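
The shape of that change would be something like the following (a sketch of
the idea only, reusing the names from the patches above; the actual commit
is linked in the follow-up below):

	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_desktop) {
		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
		unsigned int depth;

		/*
		 * No explicit delay/reschedule any more: the ramp is simply
		 * shifted by one slice, and an async queue that has not
		 * dispatched anything yet is still allowed one request, so
		 * it cannot be starved indefinitely.
		 */
		depth = last_sync / cfq_slice_sync;
		if (!depth && !cfqq->dispatched)
			depth = 1;
		if (depth < max_dispatch)
			max_dispatch = depth;
	}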

--
Jens Axboe

2009-10-04 18:24:41

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sun, 2009-10-04 at 19:39 +0200, Jens Axboe wrote:

> So I'll commit a change that doesn't do the delay at all, basically it
> just offset the current code by one slice.

Ok, when it hits block-for-linus, I'll take it out for a spin.

-Mike

2009-10-04 18:39:01

by Jens Axboe

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sun, Oct 04 2009, Mike Galbraith wrote:
> On Sun, 2009-10-04 at 19:39 +0200, Jens Axboe wrote:
>
> > So I'll commit a change that doesn't do the delay at all, basically it
> > just offset the current code by one slice.
>
> Ok, when it hits block-for-linus, I'll take it out for a spin.

It's committed:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e00c54c36ac2024c3a8a37432e2e2698ff849594

--
Jens Axboe

2009-10-04 19:48:38

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sun, 2009-10-04 at 20:38 +0200, Jens Axboe wrote:
> On Sun, Oct 04 2009, Mike Galbraith wrote:
> > On Sun, 2009-10-04 at 19:39 +0200, Jens Axboe wrote:
> >
> > > So I'll commit a change that doesn't do the delay at all, basically it
> > > just offset the current code by one slice.
> >
> > Ok, when it hits block-for-linus, I'll take it out for a spin.
>
> It's committed:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e00c54c36ac2024c3a8a37432e2e2698ff849594

Looks like a keeper to me :)

-Mike

dd vs konsole -e exit
Avg
1.70 1.94 1.32 1.89 1.87 1.7 fairness=1 overload_delay=1
1.55 1.79 1.38 1.53 1.57 1.5 desktop=1 +last_end_sync
1.09 0.87 1.11 0.96 1.11 1.0 block-for-linus-8e29675
1.29 2.18 1.16 1.21 1.24 1.4 block-for-linus-e00c54c

concurrent git test
Avg
108.12 106.33 106.34 97.00 106.52 104.8 1.000 virgin
89.81 88.82 91.56 96.57 89.38 91.2 .870 desktop=1 +last_end_sync
92.61 94.60 92.35 93.17 94.05 93.3 .890 blk-for-linus-8e29675
88.73 89.02 88.10 90.84 88.30 88.9 .848 blk-for-linus-e00c54c

read vs write test

desktop=0 Avg
elapsed 98.23 91.97 91.77 93.9 sec 1.000
30s-dd-read 48.5 49.6 49.1 49.0 mb/s 1.000
30s-dd-write 23.1 27.3 31.3 27.2 1.000
dd-read-total 49.4 50.1 49.6 49.7 1.000
dd-write-total 34.5 34.9 34.9 34.7 1.000

desktop=1 pop 8e296755 Avg
elapsed 93.30 92.77 90.11 92.0 .979
30s-dd-read 50.5 50.4 51.8 50.9 1.038
30s-dd-write 22.7 26.4 27.7 25.6 .941
dd-read-total 51.2 50.1 51.6 50.9 1.024
dd-write-total 34.2 34.5 35.6 34.7 1.000

desktop=1 push 8e296755 Avg
elapsed 104.51 104.52 101.20 103.4 1.101
30s-dd-read 43.0 43.6 44.5 43.7 .891
30s-dd-write 21.4 23.9 28.9 24.7 .908
dd-read-total 42.9 43.0 43.5 43.1 .867
dd-write-total 30.4 30.3 31.5 30.7 .884

latency=1 push e00c54c Avg
elapsed 90.03 93.03 91.68 91.5 .974
30s-dd-read 53.1 54.8 51.3 53.0 1.081
30s-dd-write 20.0 23.3 28.2 23.8 .875
dd-read-total 53.7 54.0 52.6 53.4 1.074
dd-write-total 35.5 34.2 34.9 34.8 1.002

2009-10-04 20:17:47

by Jens Axboe

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sun, Oct 04 2009, Mike Galbraith wrote:
> On Sun, 2009-10-04 at 20:38 +0200, Jens Axboe wrote:
> > On Sun, Oct 04 2009, Mike Galbraith wrote:
> > > On Sun, 2009-10-04 at 19:39 +0200, Jens Axboe wrote:
> > >
> > > > So I'll commit a change that doesn't do the delay at all, basically it
> > > > just offset the current code by one slice.
> > >
> > > Ok, when it hits block-for-linus, I'll take it out for a spin.
> >
> > It's committed:
> >
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e00c54c36ac2024c3a8a37432e2e2698ff849594
>
> Looks like a keeper to me :)

Wonderful, so I take it this passes your 'gut feel' test too? :-)

--
Jens Axboe

2009-10-04 22:16:24

by Mike Galbraith

[permalink] [raw]
Subject: Re: Do not overload dispatch queue (Was: Re: IO scheduler based IO controller V10)

On Sun, 2009-10-04 at 22:17 +0200, Jens Axboe wrote:
> On Sun, Oct 04 2009, Mike Galbraith wrote:
> > On Sun, 2009-10-04 at 20:38 +0200, Jens Axboe wrote:
> > > On Sun, Oct 04 2009, Mike Galbraith wrote:
> > > > On Sun, 2009-10-04 at 19:39 +0200, Jens Axboe wrote:
> > > >
> > > > > So I'll commit a change that doesn't do the delay at all, basically it
> > > > > just offset the current code by one slice.
> > > >
> > > > Ok, when it hits block-for-linus, I'll take it out for a spin.
> > >
> > > It's committed:
> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e00c54c36ac2024c3a8a37432e2e2698ff849594
> >
> > Looks like a keeper to me :)
>
> Wonderful, so I take it this passes your 'gut feel' test too? :-)

Heh, yeah. Box feels good, disk sounds happy.. none of those little
indescribables that make you suddenly go alert.

-Mike

2009-10-05 10:38:47

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi,

Munehiro Ikeda <[email protected]> wrote:
> Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > Before finishing this mail, will throw a whacky idea in the ring. I was
> > going through the request based dm-multipath paper. Will it make sense
> > to implement request based dm-ioband? So basically we implement all the
> > group scheduling in CFQ and let dm-ioband implement a request function
> > to take the request and break it back into bios. This way we can keep
> > all the group control at one place and also meet most of the requirements.
> >
> > So request based dm-ioband will have a request in hand once that request
> > has passed group control and prio control. Because dm-ioband is a device
> > mapper target, one can put it on higher level devices (practically taking
> > CFQ at higher level device), and provide fairness there. One can also
> > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > them to use the IO scheduler.)
> >
> > I am sure that will be many issues but one big issue I could think of that
> > CFQ thinks that there is one device beneath it and dipsatches requests
> > from one queue (in case of idling) and that would kill parallelism at
> > higher layer and throughput will suffer on many of the dm/md configurations.
> >
> > Thanks
> > Vivek
>
> As long as using CFQ, your idea is reasonable for me. But how about for
> other IO schedulers? In my understanding, one of the keys to guarantee
> group isolation in your patch is to have per-group IO scheduler internal
> queue even with as, deadline, and noop scheduler. I think this is
> great idea, and to implement generic code for all IO schedulers was
> concluded when we had so many IO scheduler specific proposals.
> If we will still need per-group IO scheduler internal queues with
> request-based dm-ioband, we have to modify elevator layer. It seems
> out of scope of dm.
> I might miss something...

IIUC, the request based device-mapper could not break back a request
into bio, so it could not work with block devices which don't use the
IO scheduler.

How about adding a callback function to the higher level controller?
CFQ calls it when the active queue runs out of time, and the higher
level controller uses it as a trigger or a hint to switch to another IO
group, so I think a time-based controller could be implemented at a
higher level.
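
Something along these lines, for example (all of the names below are made
up, just to illustrate the shape of the hook, not an existing interface):

struct io_track_ops {
	/* invoked by CFQ when the active queue's time slice runs out */
	void (*slice_expired)(struct request_queue *q, void *private);
};

/* called from CFQ's slice-expiry path */
static void cfq_notify_slice_expired(struct cfq_data *cfqd)
{
	struct io_track_ops *ops = cfqd->track_ops;	/* hypothetical field */

	if (ops && ops->slice_expired)
		ops->slice_expired(cfqd->queue, cfqd->track_data);
}

The higher level controller would register its ops at setup time and use
the notification to decide when to move on to the next group.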

My requirements for IO controller are:
- Implemented as a higher level controller, which is located at the block
layer, and bios are grabbed in generic_make_request().
- Can work with any type of IO scheduler.
- Can work with any type of block device.
- Support multiple policies: proportional weight, max rate, time
based, and so on.

The IO controller mini-summit will be held next week, and I'm
looking forward to meeting you all and discussing the IO controller.
https://sourceforge.net/apps/trac/ioband/wiki/iosummit

Thanks,
Ryo Tsuruta

2009-10-05 12:43:20

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> Hi,
>
> Munehiro Ikeda <[email protected]> wrote:
> > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > going through the request based dm-multipath paper. Will it make sense
> > > to implement request based dm-ioband? So basically we implement all the
> > > group scheduling in CFQ and let dm-ioband implement a request function
> > > to take the request and break it back into bios. This way we can keep
> > > all the group control at one place and also meet most of the requirements.
> > >
> > > So request based dm-ioband will have a request in hand once that request
> > > has passed group control and prio control. Because dm-ioband is a device
> > > mapper target, one can put it on higher level devices (practically taking
> > > CFQ at higher level device), and provide fairness there. One can also
> > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > > them to use the IO scheduler.)
> > >
> > > I am sure that will be many issues but one big issue I could think of that
> > > CFQ thinks that there is one device beneath it and dipsatches requests
> > > from one queue (in case of idling) and that would kill parallelism at
> > > higher layer and throughput will suffer on many of the dm/md configurations.
> > >
> > > Thanks
> > > Vivek
> >
> > As long as using CFQ, your idea is reasonable for me. But how about for
> > other IO schedulers? In my understanding, one of the keys to guarantee
> > group isolation in your patch is to have per-group IO scheduler internal
> > queue even with as, deadline, and noop scheduler. I think this is
> > great idea, and to implement generic code for all IO schedulers was
> > concluded when we had so many IO scheduler specific proposals.
> > If we will still need per-group IO scheduler internal queues with
> > request-based dm-ioband, we have to modify elevator layer. It seems
> > out of scope of dm.
> > I might miss something...
>
> IIUC, the request based device-mapper could not break back a request
> into bio, so it could not work with block devices which don't use the
> IO scheduler.
>

I think the current request based multipath driver does not do it, but
can't it be implemented so that requests are broken back into bios?

Anyway, I don't feel too strongly about this approach as it might
introduce more serialization at higher layer.

> How about adding a callback function to the higher level controller?
> CFQ calls it when the active queue runs out of time, then the higer
> level controller use it as a trigger or a hint to move IO group, so
> I think a time-based controller could be implemented at higher level.
>

Adding a call back should not be a big issue. But that means you are
planning to run only one group at the higher layer at a time, and I think
that's the problem, because then we are introducing serialization at the
higher layer. So for any higher level device mapper target which has
multiple physical disks under it, we might be underutilizing those disks
even more and taking a big hit on overall throughput.

The whole design of doing proportional weight at the lower layer is optimal
usage of the system.

> My requirements for IO controller are:
> - Implement s a higher level controller, which is located at block
> layer and bio is grabbed in generic_make_request().

How are you planning to handle the issue of buffered writes Andrew raised?

> - Can work with any type of IO scheduler.
> - Can work with any type of block devices.
> - Support multiple policies, proportional wegiht, max rate, time
> based, ans so on.
>
> The IO controller mini-summit will be held in next week, and I'm
> looking forard to meet you all and discuss about IO controller.
> https://sourceforge.net/apps/trac/ioband/wiki/iosummit

Is there a new version of dm-ioband now where you have solved the issue of
sync/async dispatch within a group? Before meeting at the mini-summit, I am
trying to run some tests and come up with numbers so that we have a
clearer picture of the pros/cons.

Thanks
Vivek

2009-10-05 14:56:14

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek,

Vivek Goyal <[email protected]> wrote:
> On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> > Hi,
> >
> > Munehiro Ikeda <[email protected]> wrote:
> > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > > going through the request based dm-multipath paper. Will it make sense
> > > > to implement request based dm-ioband? So basically we implement all the
> > > > group scheduling in CFQ and let dm-ioband implement a request function
> > > > to take the request and break it back into bios. This way we can keep
> > > > all the group control at one place and also meet most of the requirements.
> > > >
> > > > So request based dm-ioband will have a request in hand once that request
> > > > has passed group control and prio control. Because dm-ioband is a device
> > > > mapper target, one can put it on higher level devices (practically taking
> > > > CFQ at higher level device), and provide fairness there. One can also
> > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > > > them to use the IO scheduler.)
> > > >
> > > > I am sure that will be many issues but one big issue I could think of that
> > > > CFQ thinks that there is one device beneath it and dipsatches requests
> > > > from one queue (in case of idling) and that would kill parallelism at
> > > > higher layer and throughput will suffer on many of the dm/md configurations.
> > > >
> > > > Thanks
> > > > Vivek
> > >
> > > As long as using CFQ, your idea is reasonable for me. But how about for
> > > other IO schedulers? In my understanding, one of the keys to guarantee
> > > group isolation in your patch is to have per-group IO scheduler internal
> > > queue even with as, deadline, and noop scheduler. I think this is
> > > great idea, and to implement generic code for all IO schedulers was
> > > concluded when we had so many IO scheduler specific proposals.
> > > If we will still need per-group IO scheduler internal queues with
> > > request-based dm-ioband, we have to modify elevator layer. It seems
> > > out of scope of dm.
> > > I might miss something...
> >
> > IIUC, the request based device-mapper could not break back a request
> > into bio, so it could not work with block devices which don't use the
> > IO scheduler.
> >
>
> I think current request based multipath drvier does not do it but can't it
> be implemented that requests are broken back into bio?

I guess it would be hard to implement: we would need to hold requests
and throttle them there, and it would break the ordering done by CFQ.

> Anyway, I don't feel too strongly about this approach as it might
> introduce more serialization at higher layer.

Yes, I know it.

> > How about adding a callback function to the higher level controller?
> > CFQ calls it when the active queue runs out of time, then the higer
> > level controller use it as a trigger or a hint to move IO group, so
> > I think a time-based controller could be implemented at higher level.
> >
>
> Adding a call back should not be a big issue. But that means you are
> planning to run only one group at higher layer at one time and I think
> that's the problem because than we are introducing serialization at higher
> layer. So any higher level device mapper target which has multiple
> physical disks under it, we might be underutilizing these even more and
> take a big hit on overall throughput.
>
> The whole design of doing proportional weight at lower layer is optimial
> usage of system.

But I think that the higher level approach makes it easy to configure
against striped software raid devices. If one would like to
combine some physical disks into one logical device like dm-linear,
I think one should map the IO controller on each physical device and
combine them into one logical device.

> > My requirements for IO controller are:
> > - Implement s a higher level controller, which is located at block
> > layer and bio is grabbed in generic_make_request().
>
> How are you planning to handle the issue of buffered writes Andrew raised?

I think that it would be better to use the higher-level controller
along with the memory controller and have it limit memory usage for each
cgroup. And as Kamezawa-san said, having limits on dirty pages would
be better, too.

> > - Can work with any type of IO scheduler.
> > - Can work with any type of block devices.
> > - Support multiple policies, proportional wegiht, max rate, time
> > based, ans so on.
> >
> > The IO controller mini-summit will be held in next week, and I'm
> > looking forard to meet you all and discuss about IO controller.
> > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
>
> Is there a new version of dm-ioband now where you have solved the issue of
> sync/async dispatch with-in group? Before meeting at mini-summit, I am
> trying to run some tests and come up with numbers so that we have more
> clear picture of pros/cons.

Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
dm-ioband handles sync/async IO requests separately and
the write-starve-read issue you pointed out is fixed. I would
appreciate it if you would try them.
http://sourceforge.net/projects/ioband/files/

Thanks,
Ryo Tsuruta

2009-10-05 15:07:26

by Jeff Moyer

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

Corrado Zoccolo <[email protected]> writes:

> Moreover, I suggest removing also the slice_resid part, since its
> semantics doesn't seem consistent.
> When computed, it is not the residency, but the remaining time slice.

It stands for residual, not residency. Make more sense?

Cheers,
Jeff

2009-10-05 17:11:57

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <[email protected]> wrote:
> > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> > > Hi,
> > >
> > > Munehiro Ikeda <[email protected]> wrote:
> > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > > > going through the request based dm-multipath paper. Will it make sense
> > > > > to implement request based dm-ioband? So basically we implement all the
> > > > > group scheduling in CFQ and let dm-ioband implement a request function
> > > > > to take the request and break it back into bios. This way we can keep
> > > > > all the group control at one place and also meet most of the requirements.
> > > > >
> > > > > So request based dm-ioband will have a request in hand once that request
> > > > > has passed group control and prio control. Because dm-ioband is a device
> > > > > mapper target, one can put it on higher level devices (practically taking
> > > > > CFQ at higher level device), and provide fairness there. One can also
> > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > > > > them to use the IO scheduler.)
> > > > >
> > > > > I am sure that will be many issues but one big issue I could think of that
> > > > > CFQ thinks that there is one device beneath it and dipsatches requests
> > > > > from one queue (in case of idling) and that would kill parallelism at
> > > > > higher layer and throughput will suffer on many of the dm/md configurations.
> > > > >
> > > > > Thanks
> > > > > Vivek
> > > >
> > > > As long as using CFQ, your idea is reasonable for me. But how about for
> > > > other IO schedulers? In my understanding, one of the keys to guarantee
> > > > group isolation in your patch is to have per-group IO scheduler internal
> > > > queue even with as, deadline, and noop scheduler. I think this is
> > > > great idea, and to implement generic code for all IO schedulers was
> > > > concluded when we had so many IO scheduler specific proposals.
> > > > If we will still need per-group IO scheduler internal queues with
> > > > request-based dm-ioband, we have to modify elevator layer. It seems
> > > > out of scope of dm.
> > > > I might miss something...
> > >
> > > IIUC, the request based device-mapper could not break back a request
> > > into bio, so it could not work with block devices which don't use the
> > > IO scheduler.
> > >
> >
> > I think current request based multipath drvier does not do it but can't it
> > be implemented that requests are broken back into bio?
>
> I guess it would be hard to implement it, and we need to hold requests
> and throttle them at there and it would break the ordering by CFQ.
>
> > Anyway, I don't feel too strongly about this approach as it might
> > introduce more serialization at higher layer.
>
> Yes, I know it.
>
> > > How about adding a callback function to the higher level controller?
> > > CFQ calls it when the active queue runs out of time, then the higer
> > > level controller use it as a trigger or a hint to move IO group, so
> > > I think a time-based controller could be implemented at higher level.
> > >
> >
> > Adding a call back should not be a big issue. But that means you are
> > planning to run only one group at higher layer at one time and I think
> > that's the problem because than we are introducing serialization at higher
> > layer. So any higher level device mapper target which has multiple
> > physical disks under it, we might be underutilizing these even more and
> > take a big hit on overall throughput.
> >
> > The whole design of doing proportional weight at lower layer is optimial
> > usage of system.
>
> But I think that the higher level approch makes easy to configure
> against striped software raid devices.

How does it make configuration easier in the case of a higher level controller?

In case of the lower level design, one just has to create cgroups and assign
weights to the cgroups. This minimum step will be required with a higher level
controller also. (Even if you get rid of the dm-ioband device setup step.)

> If one would like to
> combine some physical disks into one logical device like a dm-linear,
> I think one should map the IO controller on each physical device and
> combine them into one logical device.
>

In fact this sounds like a more complicated step where one has to set up
one dm-ioband device on top of each physical device. But I am assuming
that this will go away once you move to a per-request-queue style implementation.

I think it should be the same in principle as my initial implementation of the
IO controller on the request queue, and I stopped development on it because of
the FIFO dispatch.

So you seem to be suggesting that you will move dm-ioband to the request queue
so that the additional device setup step is gone. You will also enable
it to do a time based group policy, so that we don't run into issues on
seeky media. You will also enable dispatch from only one group at a time so
that we don't run into isolation issues and can do time accounting
accurately.

If yes, then that has the potential to solve the issue. At the higher layer one
can think of enabling size-of-IO/number-of-IO policies both for proportional
BW and max BW types of control. At the lower level one can enable pure time
based control on seeky media.
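
For the size-of-IO flavour, the bookkeeping could look something like this
(a toy sketch, not taken from any of the posted patches): charge each group
for the bytes it dispatches, scaled by its weight, and always serve the
group with the smallest charge next.

#include <stddef.h>

struct io_group {
	unsigned int		weight;
	unsigned long long	vcharge;  /* bytes dispatched, scaled by weight */
};

/* charge a dispatched request against its group */
static void iog_charge(struct io_group *iog, unsigned int bytes)
{
	iog->vcharge += (unsigned long long)bytes * 100 / iog->weight;
}

/* pick the next group to serve: smallest weighted charge first */
static struct io_group *iog_pick(struct io_group **groups, int nr)
{
	struct io_group *min = NULL;
	int i;

	for (i = 0; i < nr; i++)
		if (groups[i] && (!min || groups[i]->vcharge < min->vcharge))
			min = groups[i];
	return min;
}

A max BW policy would instead compare the per-group byte (or IO) count
against a configured limit per time window and hold back bios that exceed it.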

I think this will still be left with the issue of prio within a group, as group
control is separate and you will not be maintaining separate queues for
each process. Similarly you will also have issues with read vs write
ratios as the IO schedulers underneath change.

So I will be curious to see that implementation.

> > > My requirements for IO controller are:
> > > - Implement s a higher level controller, which is located at block
> > > layer and bio is grabbed in generic_make_request().
> >
> > How are you planning to handle the issue of buffered writes Andrew raised?
>
> I think that it would be better to use the higher-level controller
> along with the memory controller and have limits memory usage for each
> cgroup. And as Kamezawa-san said, having limits of dirty pages would
> be better, too.
>

Ok. So if we plan to co-mount the memory controller with per memory group
dirty_ratio implemented, that can work with both the higher level as well as
the low level controller. Not sure if we also require some kind of per
memory group flusher thread infrastructure to make sure a higher weight
group gets more work done.

> > > - Can work with any type of IO scheduler.
> > > - Can work with any type of block devices.
> > > - Support multiple policies, proportional wegiht, max rate, time
> > > based, ans so on.
> > >
> > > The IO controller mini-summit will be held in next week, and I'm
> > > looking forard to meet you all and discuss about IO controller.
> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> >
> > Is there a new version of dm-ioband now where you have solved the issue of
> > sync/async dispatch with-in group? Before meeting at mini-summit, I am
> > trying to run some tests and come up with numbers so that we have more
> > clear picture of pros/cons.
>
> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> dm-ioband handles sync/async IO requests separately and
> the write-starve-read issue you pointed out is fixed. I would
> appreciate it if you would try them.
> http://sourceforge.net/projects/ioband/files/

Cool. Will get to testing it.

Thanks
Vivek

2009-10-05 18:12:52

by Nauman Rafique

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Mon, Oct 5, 2009 at 10:10 AM, Vivek Goyal <[email protected]> wrote:
> On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote:
>> Hi Vivek,
>>
>> Vivek Goyal <[email protected]> wrote:
>> > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
>> > > Hi,
>> > >
>> > > Munehiro Ikeda <[email protected]> wrote:
>> > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
>> > > > > Before finishing this mail, will throw a whacky idea in the ring. I was
>> > > > > going through the request based dm-multipath paper. Will it make sense
>> > > > > to implement request based dm-ioband? So basically we implement all the
>> > > > > group scheduling in CFQ and let dm-ioband implement a request function
>> > > > > to take the request and break it back into bios. This way we can keep
>> > > > > all the group control at one place and also meet most of the requirements.
>> > > > >
>> > > > > So request based dm-ioband will have a request in hand once that request
>> > > > > has passed group control and prio control. Because dm-ioband is a device
>> > > > > mapper target, one can put it on higher level devices (practically taking
>> > > > > CFQ at higher level device), and provide fairness there. One can also
>> > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
>> > > > > them to use the IO scheduler.)
>> > > > >
>> > > > > I am sure that will be many issues but one big issue I could think of that
>> > > > > CFQ thinks that there is one device beneath it and dipsatches requests
>> > > > > from one queue (in case of idling) and that would kill parallelism at
>> > > > > higher layer and throughput will suffer on many of the dm/md configurations.
>> > > > >
>> > > > > Thanks
>> > > > > Vivek
>> > > >
>> > > > As long as using CFQ, your idea is reasonable for me. But how about for
>> > > > other IO schedulers? In my understanding, one of the keys to guarantee
>> > > > group isolation in your patch is to have per-group IO scheduler internal
>> > > > queue even with as, deadline, and noop scheduler. I think this is
>> > > > great idea, and to implement generic code for all IO schedulers was
>> > > > concluded when we had so many IO scheduler specific proposals.
>> > > > If we will still need per-group IO scheduler internal queues with
>> > > > request-based dm-ioband, we have to modify elevator layer. It seems
>> > > > out of scope of dm.
>> > > > I might miss something...
>> > >
>> > > IIUC, the request based device-mapper could not break back a request
>> > > into bio, so it could not work with block devices which don't use the
>> > > IO scheduler.
>> > >
>> >
>> > I think current request based multipath drvier does not do it but can't it
>> > be implemented that requests are broken back into bio?
>>
>> I guess it would be hard to implement it, and we need to hold requests
>> and throttle them at there and it would break the ordering by CFQ.
>>
>> > Anyway, I don't feel too strongly about this approach as it might
>> > introduce more serialization at higher layer.
>>
>> Yes, I know it.
>>
>> > > How about adding a callback function to the higher level controller?
>> > > CFQ calls it when the active queue runs out of time, then the higer
>> > > level controller use it as a trigger or a hint to move IO group, so
>> > > I think a time-based controller could be implemented at higher level.
>> > >
>> >
>> > Adding a call back should not be a big issue. But that means you are
>> > planning to run only one group at higher layer at one time and I think
>> > that's the problem because than we are introducing serialization at higher
>> > layer. So any higher level device mapper target which has multiple
>> > physical disks under it, we might be underutilizing these even more and
>> > take a big hit on overall throughput.
>> >
>> > The whole design of doing proportional weight at lower layer is optimial
>> > usage of system.
>>
>> But I think that the higher level approch makes easy to configure
>> against striped software raid devices.
>
> How does it make easier to configure in case of higher level controller?
>
> In case of lower level design, one just have to create cgroups and assign
> weights to cgroups. This mininum step will be required in higher level
> controller also. (Even if you get rid of dm-ioband device setup step).
>
>> If one would like to
>> combine some physical disks into one logical device like a dm-linear,
>> I think one should map the IO controller on each physical device and
>> combine them into one logical device.
>>
>
> In fact this sounds like a more complicated step where one has to setup
> one dm-ioband device on top of each physical device. But I am assuming
> that this will go away once you move to per reuqest queue like implementation.
>
> I think it should be same in principal as my initial implementation of IO
> controller on request queue and I stopped development on it because of FIFO
> dispatch.
>
> So you seem to be suggesting that you will move dm-ioband to request queue
> so that setting up additional device setup is gone. You will also enable
> it to do time based groups policy, so that we don't run into issues on
> seeky media. Will also enable dispatch from one group only at a time so
> that we don't run into isolation issues and can do time accounting
> accruately.

Will that approach solve the problem of doing bandwidth control on
logical devices? What would be the advantages compared to Vivek's
current patches?

>
> If yes, then that has the potential to solve the issue. At higher layer one
> can think of enabling size of IO/number of IO policy both for proportional
> BW and max BW type of control. At lower level one can enable pure time
> based control on seeky media.
>
> I think this will still left with the issue of prio with-in group as group
> control is separate and you will not be maintatinig separate queues for
> each process. Similarly you will also have isseus with read vs write
> ratios as IO schedulers underneath change.
>
> So I will be curious to see that implementation.
>
>> > > My requirements for IO controller are:
>> > > - Implement s a higher level controller, which is located at block
>> > >  layer and bio is grabbed in generic_make_request().
>> >
>> > How are you planning to handle the issue of buffered writes Andrew raised?
>>
>> I think that it would be better to use the higher-level controller
>> along with the memory controller and have limits memory usage for each
>> cgroup. And as Kamezawa-san said, having limits of dirty pages would
>> be better, too.
>>
>
> Ok. So if we plan to co-mount memory controller with per memory group
> dirty_ratio implemented, that can work with both higher level as well as
> low level controller. Not sure if we also require some kind of a per
> memory group flusher thread infrastructure also to make sure higher weight
> group gets more job done.
>
>> > > - Can work with any type of IO scheduler.
>> > > - Can work with any type of block devices.
>> > > - Support multiple policies, proportional wegiht, max rate, time
>> > >  based, ans so on.
>> > >
>> > > The IO controller mini-summit will be held in next week, and I'm
>> > > looking forard to meet you all and discuss about IO controller.
>> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
>> >
>> > Is there a new version of dm-ioband now where you have solved the issue of
>> > sync/async dispatch with-in group? Before meeting at mini-summit, I am
>> > trying to run some tests and come up with numbers so that we have more
>> > clear picture of pros/cons.
>>
>> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
>> dm-ioband handles sync/async IO requests separately and
>> the write-starve-read issue you pointed out is fixed. I would
>> appreciate it if you would try them.
>> http://sourceforge.net/projects/ioband/files/
>
> Cool. Will get to testing it.
>
> Thanks
> Vivek
>

2009-10-05 21:09:59

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <[email protected]> wrote:
> Corrado Zoccolo <[email protected]> writes:
>
>> Moreover, I suggest removing also the slice_resid part, since its
>> semantics doesn't seem consistent.
>> When computed, it is not the residency, but the remaining time slice.
>
> It stands for residual, not residency.  Make more sense?
It makes sense when computed, but not when used in the rb_key computation.
Why should we postpone queues that were preempted, instead of giving
them a boost?

Thanks,
Corrado

>
> Cheers,
> Jeff
>

2009-10-05 21:22:12

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

On Sun, Oct 4, 2009 at 6:20 PM, Fabio Checconi <[email protected]> wrote:
> But if the ``always ready to perform I/O'' assumption held then even RR
> would have provided service differentiation, always seeing backlogged
> queues and serving them according to their weights.

Right, this property is too strong. But even a weaker one, "the two queues
have think times less than the disk access time", will be enough to
achieve the same goal by means of proper placement in the RR tree.

If both think times are greater than access time, then each queue will
get a service level equivalent to it being the only queue in the
system, so in this case service differentiation will not apply (do we
need to differentiate when everyone gets exactly what he needs?).

If one think time is less, and the other is more than the access time,
then we should decide what kind of fairness we want to have,
especially if the one with larger think time has also higher priority.

> In this case the problem is what Vivek described some time ago as the
> interlocked service of sync queues, where the scheduler is trying to
> differentiate between the queues, but they are not always asking for
> service (as they are synchronous and they are backlogged only for short
> time intervals).

Corrado

2009-10-06 07:18:23

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek and Nauman,

Nauman Rafique <[email protected]> wrote:
> >> > > How about adding a callback function to the higher level controller?
> >> > > CFQ calls it when the active queue runs out of time, then the higer
> >> > > level controller use it as a trigger or a hint to move IO group, so
> >> > > I think a time-based controller could be implemented at higher level.
> >> > >
> >> >
> >> > Adding a call back should not be a big issue. But that means you are
> >> > planning to run only one group at higher layer at one time and I think
> >> > that's the problem because than we are introducing serialization at higher
> >> > layer. So any higher level device mapper target which has multiple
> >> > physical disks under it, we might be underutilizing these even more and
> >> > take a big hit on overall throughput.
> >> >
> >> > The whole design of doing proportional weight at lower layer is optimial
> >> > usage of system.
> >>
> >> But I think that the higher level approch makes easy to configure
> >> against striped software raid devices.
> >
> > How does it make easier to configure in case of higher level controller?
> >
> > In case of lower level design, one just have to create cgroups and assign
> > weights to cgroups. This mininum step will be required in higher level
> > controller also. (Even if you get rid of dm-ioband device setup step).

In the case of the lower level controller, if we need to assign weights on
a per device basis, we have to assign weights to all the devices of which
a raid device consists; in the case of the higher level controller,
we just assign a weight to the raid device only.

> >> If one would like to
> >> combine some physical disks into one logical device like a dm-linear,
> >> I think one should map the IO controller on each physical device and
> >> combine them into one logical device.
> >>
> >
> > In fact this sounds like a more complicated step where one has to setup
> > one dm-ioband device on top of each physical device. But I am assuming
> > that this will go away once you move to per reuqest queue like implementation.

I don't understand why the per request queue implementation makes it
go away. If dm-ioband is integrated into the LVM tools, it could allow
users to skip the complicated steps to configure dm-linear devices.

> > I think it should be same in principal as my initial implementation of IO
> > controller on request queue and I stopped development on it because of FIFO
> > dispatch.

I think that FIFO dispatch seldom leads to priority inversion, because
the holding period for throttling is not long enough to break the IO
priority. I did some tests to see whether priority inversion happens.

The first test ran fio sequential readers in the same group. The BE0
reader got the highest throughput as I expected.

nr_threads 16 | 16 | 1
ionice BE7 | BE7 | BE0
------------------------+------------+-------------
vanilla 10,076KiB/s | 9,779KiB/s | 32,775KiB/s
ioband 9,576KiB/s | 9,367KiB/s | 34,154KiB/s

The second test ran fio sequential readers in two different groups and
gave weights of 20 and 10 to the groups respectively. The bandwidth was
distributed according to their weights and the BE0 reader got higher
throughput than the BE7 readers in the same group. IO priority was
preserved within the IO group.

group group1 | group2
weight 20 | 10
------------------------+--------------------------
nr_threads 16 | 16 | 1
ionice BE7 | BE7 | BE0
------------------------+--------------------------
ioband 27,513KiB/s | 3,524KiB/s | 10,248KiB/s
| Total = 13,772KiB/s

Here is my test script.
-------------------------------------------------------------------------
arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
--group_reporting"

sync
echo 3 > /proc/sys/vm/drop_caches

echo $$ > /cgroup/1/tasks
ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
echo $$ > /cgroup/2/tasks
ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
echo $$ > /cgroup/tasks
wait
-------------------------------------------------------------------------

Be that as it may, I think that if every bio can point to the io_context
of the process, then it becomes possible to handle IO priority in the
higher level controller. A patchset has already been posted by
Takahashi-san. What do you think about this idea?

Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
Subject [RFC][PATCH 1/10] I/O context inheritance
From Hirokazu Takahashi <>
http://lkml.org/lkml/2008/4/22/195

> > So you seem to be suggesting that you will move dm-ioband to request queue
> > so that setting up additional device setup is gone. You will also enable
> > it to do time based groups policy, so that we don't run into issues on
> > seeky media. Will also enable dispatch from one group only at a time so
> > that we don't run into isolation issues and can do time accounting
> > accruately.
>
> Will that approach solve the problem of doing bandwidth control on
> logical devices? What would be the advantages compared to Vivek's
> current patches?

I will only move the point where dm-ioband grabs bios; the rest of
dm-ioband's mechanism and functionality will still be the same.
The advantages over scheduler based controllers are:
- it can work with any type of block device
- it can work with any type of IO scheduler and does not need big changes.

> > If yes, then that has the potential to solve the issue. At higher layer one
> > can think of enabling size of IO/number of IO policy both for proportional
> > BW and max BW type of control. At lower level one can enable pure time
> > based control on seeky media.
> >
> > I think this will still left with the issue of prio with-in group as group
> > control is separate and you will not be maintatinig separate queues for
> > each process. Similarly you will also have isseus with read vs write
> > ratios as IO schedulers underneath change.
> >
> > So I will be curious to see that implementation.
> >
> >> > > My requirements for IO controller are:
> >> > > - Implement a higher level controller, which is located at the block
> >> > >   layer and bio is grabbed in generic_make_request().
> >> >
> >> > How are you planning to handle the issue of buffered writes Andrew raised?
> >>
> >> I think that it would be better to use the higher-level controller
> >> along with the memory controller and have limits memory usage for each
> >> cgroup. And as Kamezawa-san said, having limits of dirty pages would
> >> be better, too.
> >>
> >
> > Ok. So if we plan to co-mount memory controller with per memory group
> > dirty_ratio implemented, that can work with both higher level as well as
> > low level controller. Not sure if we also require some kind of a per
> > memory group flusher thread infrastructure also to make sure higher weight
> > group gets more job done.

I'm not sure either that a per memory group flusher is necessary.
And we have to consider not only pdflush but also other threads which
issue IOs from multiple groups.

> >> > > - Can work with any type of IO scheduler.
> >> > > - Can work with any type of block devices.
> >> > > - Support multiple policies, proportional weight, max rate, time
> >> > >   based, and so on.
> >> > >
> >> > > The IO controller mini-summit will be held next week, and I'm
> >> > > looking forward to meeting you all and discussing the IO controller.
> >> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> >> >
> >> > Is there a new version of dm-ioband now where you have solved the issue of
> >> > sync/async dispatch with-in group? Before meeting at mini-summit, I am
> >> > trying to run some tests and come up with numbers so that we have more
> >> > clear picture of pros/cons.
> >>
> >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> >> dm-ioband handles sync/async IO requests separately and
> >> the write-starve-read issue you pointed out is fixed. I would
> >> appreciate it if you would try them.
> >> http://sourceforge.net/projects/ioband/files/
> >
> > Cool. Will get to testing it.

Thanks for your help in advance.

Thanks,
Ryo Tsuruta

2009-10-06 08:41:59

by Jens Axboe

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

On Mon, Oct 05 2009, Corrado Zoccolo wrote:
> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <[email protected]> wrote:
> > Corrado Zoccolo <[email protected]> writes:
> >
> >> Moreover, I suggest removing also the slice_resid part, since its
> >> semantics doesn't seem consistent.
> >> When computed, it is not the residency, but the remaining time slice.
> >
> > It stands for residual, not residency. Make more sense?
> It makes sense when computed, but not when used in rb_key computation.
> Why should we postpone queues that where preempted, instead of giving
> them a boost?

We should not. If it is/was working correctly, it should allow both for
increase/decrease of tree position (hence it's a long and can go
negative) to account for both over and under time.

--
Jens Axboe

2009-10-06 09:01:28

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

On Tue, Oct 6, 2009 at 10:41 AM, Jens Axboe <[email protected]> wrote:
> On Mon, Oct 05 2009, Corrado Zoccolo wrote:
>> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <[email protected]> wrote:
>> > It stands for residual, not residency.  Make more sense?
>> It makes sense when computed, but not when used in rb_key computation.
>> Why should we postpone queues that where preempted, instead of giving
>> them a boost?
>
> We should not, if it is/was working correctly, it should allow both for
> increase/descrease of tree position (hence it's a long and can go
> negative) to account for both over and under time.

I'm doing some tests with and without it.
How it is working now is:
definition:
if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
        cfqq->slice_resid = cfqq->slice_end - jiffies;
        cfq_log_cfqq(cfqd, cfqq, "resid=%ld",
                     cfqq->slice_resid);
}
* here resid is > 0 if there was residual time, and < 0 if the queue
overran its slice.
use:
rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
rb_key += cfqq->slice_resid;
cfqq->slice_resid = 0;
* here if residual is > 0, we postpone, i.e. penalize. If residual is
< 0 (i.e. the queue overran), we anticipate it, i.e. we boost it.

So this is likely not what we want.
I did some tests with and without it, or changing the sign, and it
doesn't matter at all for pure sync workloads.

The only case in which it matters a little, from my experiments, is
for sync vs async workload. Here, since async queues are preempted,
the current form of the code penalizes them, so they get larger
delays, and we get more bandwidth for sync.
This is, btw, the only positive outcome (I can think of) from the
current form of the code, and I think we could obtain it more easily
by unconditionally adding a delay for async queues:
rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
if (!cfq_cfqq_sync(cfqq)) {
        rb_key += CFQ_ASYNC_DELAY;
}

completely removing the resid stuff (or at least leaving us the ability
to use it with the proper sign).

Corrado
>
> --
> Jens Axboe
>
>

2009-10-06 11:24:22

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Tue, Oct 06, 2009 at 04:17:44PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and Nauman,
>
> Nauman Rafique <[email protected]> wrote:
> > >> > > How about adding a callback function to the higher level controller?
> > >> > > CFQ calls it when the active queue runs out of time, then the higer
> > >> > > level controller use it as a trigger or a hint to move IO group, so
> > >> > > I think a time-based controller could be implemented at higher level.
> > >> > >
> > >> >
> > >> > Adding a call back should not be a big issue. But that means you are
> > >> > planning to run only one group at higher layer at one time and I think
> > >> > that's the problem because than we are introducing serialization at higher
> > >> > layer. So any higher level device mapper target which has multiple
> > >> > physical disks under it, we might be underutilizing these even more and
> > >> > take a big hit on overall throughput.
> > >> >
> > >> > The whole design of doing proportional weight at lower layer is optimial
> > >> > usage of system.
> > >>
> > >> But I think that the higher level approch makes easy to configure
> > >> against striped software raid devices.
> > >
> > > How does it make easier to configure in case of higher level controller?
> > >
> > > In case of lower level design, one just have to create cgroups and assign
> > > weights to cgroups. This mininum step will be required in higher level
> > > controller also. (Even if you get rid of dm-ioband device setup step).
>
> In the case of lower level controller, if we need to assign weights on
> a per device basis, we have to assign weights to all devices of which
> a raid device consists, but in the case of higher level controller,
> we just assign weights to the raid device only.
>

This is required only if you need to assign different weights to
different devices. It is just an additional facility, not a requirement.
Normally you will not need to do that and devices will inherit the
cgroup weight automatically, so one only has to assign the cgroup weights.

> > >> If one would like to
> > >> combine some physical disks into one logical device like a dm-linear,
> > >> I think one should map the IO controller on each physical device and
> > >> combine them into one logical device.
> > >>
> > >
> > > In fact this sounds like a more complicated step where one has to setup
> > > one dm-ioband device on top of each physical device. But I am assuming
> > > that this will go away once you move to per reuqest queue like implementation.
>
> I don't understand why the per request queue implementation makes it
> go away. If dm-ioband is integrated into the LVM tools, it could allow
> users to skip the complicated steps to configure dm-linear devices.
>

Those who are not using dm-tools will be forced to use dm-tools for
bandwidth control features.

> > > I think it should be same in principal as my initial implementation of IO
> > > controller on request queue and I stopped development on it because of FIFO
> > > dispatch.
>
> I think that FIFO dispatch seldom lead to prioviry inversion, because
> holding period for throttling is not too long to break the IO priority.
> I did some tests to see whether priority inversion is happened.
>
> The first test ran fio sequential readers on the same group. The BE0
> reader got the highest throughput as I expected.
>
> nr_threads 16 | 16 | 1
> ionice BE7 | BE7 | BE0
> ------------------------+------------+-------------
> vanilla 10,076KiB/s | 9,779KiB/s | 32,775KiB/s
> ioband 9,576KiB/s | 9,367KiB/s | 34,154KiB/s
>
> The second test ran fio sequential readers on two different groups and
> give weights of 20 and 10 to each group respectively. The bandwidth
> was distributed according to their weights and the BE0 reader got
> higher throughput than the BE7 readers in the same group. IO priority
> was preserved within the IO group.
>
> group group1 | group2
> weight 20 | 10
> ------------------------+--------------------------
> nr_threads 16 | 16 | 1
> ionice BE7 | BE7 | BE0
> ------------------------+--------------------------
> ioband 27,513KiB/s | 3,524KiB/s | 10,248KiB/s
> | Total = 13,772KiB/s
>

Interesting. In all the test cases you always test with sequential
readers. I have changed the test case a bit (I have already reported the
results in another mail; now I am running the same test again with
dm-ioband version 1.14). I made all the readers do direct IO and in the
other group I put a buffered writer. So the setup looks as follows.

In group1, I launch 1 prio0 reader and an increasing number of prio4
readers. In group2 I just run a dd doing buffered writes. Weights of
both groups are 100 each.

Following are the results on 2.6.31 kernel.

With-dm-ioband
==============
<------------prio4 readers----------------------> <---prio0 reader------>
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 9992KiB/s 9992KiB/s 9992KiB/s 413K usec 4621KiB/s 369K usec
2 4859KiB/s 4265KiB/s 9122KiB/s 344K usec 4915KiB/s 401K usec
4 2238KiB/s 1381KiB/s 7703KiB/s 532K usec 3195KiB/s 546K usec
8 504KiB/s 46KiB/s 1439KiB/s 399K usec 7661KiB/s 220K usec
16 131KiB/s 26KiB/s 638KiB/s 492K usec 4847KiB/s 359K usec

With vanilla CFQ
================
<------------prio4 readers----------------------> <---prio0 reader------>
nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1 10779KiB/s 10779KiB/s 10779KiB/s 407K usec 16094KiB/s 808K usec
2 7045KiB/s 6913KiB/s 13959KiB/s 538K usec 18794KiB/s 761K usec
4 7842KiB/s 4409KiB/s 20967KiB/s 876K usec 12543KiB/s 443K usec
8 6198KiB/s 2426KiB/s 24219KiB/s 1469K usec 9483KiB/s 685K usec
16 5041KiB/s 1358KiB/s 27022KiB/s 2417K usec 6211KiB/s 1025K usec


The above results show how bandwidth got distributed between the prio4
and prio0 readers within the group as we increased the number of prio4
readers in the group. In the other group a buffered writer is
continuously running as a competitor.

Notice how bandwidth allocation is broken with dm-ioband.

With 1 prio4 reader, the prio4 reader got more bandwidth than the prio0
reader.

With 2 prio4 readers, prio4 got almost the same BW as prio0.

With 8 and 16 prio4 readers, the prio0 reader takes over and the prio4
readers starve.

As we increase the number of prio4 readers in the group, their total
aggregate BW share should increase. Instead it is decreasing.

So to me, in the face of competition with a writer in the other group,
BW is all over the place. Some of this might be dm-ioband bugs and some
of it might be coming from the fact that buffering takes place in a
higher layer and dispatch is FIFO?

> Here is my test script.
> -------------------------------------------------------------------------
> arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> --group_reporting"
>
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> echo $$ > /cgroup/1/tasks
> ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> echo $$ > /cgroup/2/tasks
> ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> echo $$ > /cgroup/tasks
> wait
> -------------------------------------------------------------------------
>
> Be that as it way, I think that if every bio can point the iocontext
> of the process, then it makes it possible to handle IO priority in the
> higher level controller. A patchse has already posted by Takhashi-san.
> What do you think about this idea?
>
> Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> Subject [RFC][PATCH 1/10] I/O context inheritance
> From Hirokazu Takahashi <>
> http://lkml.org/lkml/2008/4/22/195

So far you have been denying that there are issues with ioprio within a
group in the higher level controller. Here you seem to be saying that
there are issues with ioprio and we need to take this patch in to solve
them? I am confused.

Anyway, if you think that the above patch is needed to solve the issue
of ioprio in the higher level controller, why are you not posting it as
part of your patch series regularly, so that we can also apply this
patch along with the other patches and test its effects?

>
> > > So you seem to be suggesting that you will move dm-ioband to request queue
> > > so that setting up additional device setup is gone. You will also enable
> > > it to do time based groups policy, so that we don't run into issues on
> > > seeky media. Will also enable dispatch from one group only at a time so
> > > that we don't run into isolation issues and can do time accounting
> > > accruately.
> >
> > Will that approach solve the problem of doing bandwidth control on
> > logical devices? What would be the advantages compared to Vivek's
> > current patches?
>
> I will only move the point where dm-ioband grabs bios, other
> dm-ioband's mechanism and functionality will stll be the same.
> The advantages against to scheduler based controllers are:
> - can work with any type of block devices
> - can work with any type of IO scheduler and no need a big change.
>

Whether a big change is needed we will come to know for sure only when
the timed-groups implementation is done and shown to work as well as my
patches. There are so many subtle things with a time based approach.

[..]
> > >> > Is there a new version of dm-ioband now where you have solved the issue of
> > >> > sync/async dispatch with-in group? Before meeting at mini-summit, I am
> > >> > trying to run some tests and come up with numbers so that we have more
> > >> > clear picture of pros/cons.
> > >>
> > >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> > >> dm-ioband handles sync/async IO requests separately and
> > >> the write-starve-read issue you pointed out is fixed. I would
> > >> appreciate it if you would try them.
> > >> http://sourceforge.net/projects/ioband/files/
> > >
> > > Cool. Will get to testing it.
>
> Thanks for your help in advance.

Against what kernel version do the above patches apply? The biocgroup
patches I tried against 2.6.31 as well as 2.6.32-rc1 and they do not
apply cleanly against either of these.

So for the time being I am doing testing with biocgroup patches.

Thanks
Vivek

2009-10-06 18:54:23

by Jens Axboe

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

On Tue, Oct 06 2009, Corrado Zoccolo wrote:
> On Tue, Oct 6, 2009 at 10:41 AM, Jens Axboe <[email protected]> wrote:
> > On Mon, Oct 05 2009, Corrado Zoccolo wrote:
> >> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <[email protected]> wrote:
> >> > It stands for residual, not residency. Make more sense?
> >> It makes sense when computed, but not when used in rb_key computation.
> >> Why should we postpone queues that where preempted, instead of giving
> >> them a boost?
> >
> > We should not, if it is/was working correctly, it should allow both for
> > increase/descrease of tree position (hence it's a long and can go
> > negative) to account for both over and under time.
>
> I'm doing some tests with and without it.
> How it is working now is:
> definition:
> if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
> cfqq->slice_resid = cfqq->slice_end - jiffies;
> cfq_log_cfqq(cfqd, cfqq, "resid=%ld",
> cfqq->slice_resid);
> }
> * here resid is > 0 if there was residual time, and < 0 if the queue
> overrun its slice.
> use:
> rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
> rb_key += cfqq->slice_resid;
> cfqq->slice_resid = 0;
> * here if residual is > 0, we postpone, i.e. penalize. If residual is
> < 0 (i.e. the queue overrun), we anticipate it, i.e. we boost it.
>
> So this is likely not what we want.

Indeed, that should be -= cfqq->slice_resid.

> I did some tests with and without it, or changing the sign, and it
> doesn't matter at all for pure sync workloads.

For most cases it will not change things a lot, but it should be
technically correct.

> The only case in which it matters a little, from my experiments, is
> for sync vs async workload. Here, since async queues are preempted,
> the current form of the code penalizes them, so they get larger
> delays, and we get more bandwidth for sync.

Right

> This is, btw, the only positive outcome (I can think of) from the
> current form of the code, and I think we could obtain it more easily
> by unconditionally adding a delay for async queues:
> rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
> if (!cfq_cfqq_sync(cfqq)) {
> rb_key += CFQ_ASYNC_DELAY;
> }
>
> removing completely the resid stuff (or at least leaving us with the
> ability of using it with the proper sign).

It's more likely for the async queue to overrun, but it can happen for
others as well. I'm keeping the residual count, but making the sign
change of course.
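
Spelling out the sign change at the use site (a sketch of the fix being
discussed here, not the actual committed patch):

        rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
        /* boost a queue that still had slice left, push back one that overran */
        rb_key -= cfqq->slice_resid;
        cfqq->slice_resid = 0;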

--
Jens Axboe

2009-10-06 21:39:04

by Vivek Goyal

[permalink] [raw]
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)

On Sun, Oct 04, 2009 at 02:46:44PM +0200, Corrado Zoccolo wrote:
> Hi Vivek,
> On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <[email protected]> wrote:
> > On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> My guess is that the formula that is used to handle this case is not
> >> very stable.
> >
> > In general I agree that formula to calculate the slice offset is very
> > puzzling as busy_queues varies and that changes the position of the task
> > sometimes.
> >
> > I am not sure what's the intent here by removing busy_queues stuff. I have
> > got two questions though.
>
> In the ideal case steady state, busy_queues will be a constant. Since
> we are just comparing the values between themselves, we can just
> remove this constant completely.
>
> Whenever it is not constant, it seems to me that it can cause wrong
> behaviour, i.e. when the number of processes with ready I/O reduces, a
> later coming request can jump before older requests.
> So it seems it does more harm than good, hence I suggest to remove it.
>

I agree here. busy_queues can vary, especially given the fact that CFQ
removes the queue from the service tree immediately after the dispatch
if the queue is empty, and then waits for request completion from the
queue and idles on it.

So consider the following scenario where two thinking readers and one
writer are executing. The readers preempt the writer and the writer gets
back into the tree. When the writer gets backlogged, at that point
busy_queues=2, and when a reader gets backlogged, busy_queues=1 (most of
the time, because a reader is idling), and hence many times the readers
get placed ahead of the writer.

This is so subtle that I am not sure it was designed that way.

So the dependence on busy_queues can change queue ordering in
unpredictable ways.
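
For reference, the helper being discussed (quoted from memory of the
2.6.31-era cfq-iosched.c, so treat the exact expression as approximate):

static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
                                      struct cfq_queue *cfqq)
{
        /*
         * just an approximation, should be ok.
         */
        return (cfqd->busy_queues - 1) *
                (cfq_prio_slice(cfqd, 1, 0) -
                 cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
}

Any change in busy_queues rescales this offset, which is how the
reordering described above creeps in.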


> Moreover, I suggest removing also the slice_resid part, since its
> semantics doesn't seem consistent.
> When computed, it is not the residency, but the remaining time slice.
> Then it is used to postpone, instead of anticipate, the position of
> the queue in the RR, that seems counterintuitive (it would be
> intuitive, though, if it was actually a residency, not a remaining
> slice, i.e. you already got your full share, so you can wait longer to
> be serviced again).
>
> >
> > - Why don't we keep it simple round robin where a task is simply placed at
> >   the end of service tree.
>
> This should work for the idling case, since we provide service
> differentiation by means of time slice.
> For non-idling case, though, the appropriate placement of queues in
> the tree (as given by my formula) can still provide it.
>

So for the non-idling case, you provide service differentiation by the
number of times a queue is scheduled to run, rather than by giving the
queue a bigger slice?

This will work only to an extent and depends on the size of IO being
dispatched from each queue. If some queue has a bigger request size and
some a smaller one (this can easily be driven by changing the block
size), then again you will not see fair numbers? In that case it might
make sense to provide fairness in terms of size of IO/number of IOs.

So to me it boils down to what the seek cost of the underlying media is.
If the seek cost is high, provide fairness in terms of time slice; if
the seek cost is really low, one can afford faster switching of queues
without losing too much throughput, and in that case fairness in terms
of size of IO should be good.

Now, if on good SSDs with NCQ the seek cost is low, I am wondering if it
will make sense to tweak CFQ to change mode dynamically and start
providing fairness in terms of size of IO/number of IOs?

> >
> > - Secondly, CFQ provides full slice length only to queues which are
> >   idling (in case of a sequential reader). If we do not enable idling, as
> >   in case of NCQ enabled SSDs, then CFQ will expire the queue almost
> >   immediately and put the queue at the end of service tree (almost).
> >
> > So if we don't enable idling, at max we can provide fairness, we
> > essentially just let every queue dispatch one request and put it at the
> > end of service tree. Hence no fairness....
>
> We should distinguish the two terms fairness and service
> differentiation. Fairness is when every queue gets the same amount of
> service share.

Will it not be "proportionate amount of service share" instead of "same
amount of service share"?

> This is not what we want when priorities are different
> (we want the service differentiation, instead), but is what we get if
> we do just round robin without idling.
>
> To fix this, we can alter the placement in the tree, so that if we
> have Q1 with slice S1, and Q2 with slice S2, always ready to perform
> I/O, we get that Q1 is in front of the three with probability
> S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2).
> This is what my formula should achieve.

I have yet to get into the details, but as I said, this sounds like
fairness by frequency, i.e. by the number of times a queue is scheduled
to dispatch. So it will help to some extent on NCQ enabled SSDs, but
will become unfair if the size of IO each queue dispatches is very
different.
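
As a back-of-the-envelope model (my rough sketch, with S_i the slice of
queue Q_i and b_i its average dispatch size):

\[ \Pr[Q_1 \text{ dispatches next}] = \frac{S_1}{S_1 + S_2} \quad\Rightarrow\quad \frac{BW_{Q_1}}{BW_{Q_2}} \approx \frac{S_1\, b_1}{S_2\, b_2} \]

so with equal slices, a queue issuing 256K requests gets roughly 64
times the bandwidth of one issuing 4K requests, which is the unfairness
I am worried about.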

Thanks
Vivek

2009-10-07 14:38:44

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek,

Vivek Goyal <[email protected]> wrote:
> > > >> If one would like to
> > > >> combine some physical disks into one logical device like a dm-linear,
> > > >> I think one should map the IO controller on each physical device and
> > > >> combine them into one logical device.
> > > >>
> > > >
> > > > In fact this sounds like a more complicated step where one has to setup
> > > > one dm-ioband device on top of each physical device. But I am assuming
> > > > that this will go away once you move to per reuqest queue like implementation.
> >
> > I don't understand why the per request queue implementation makes it
> > go away. If dm-ioband is integrated into the LVM tools, it could allow
> > users to skip the complicated steps to configure dm-linear devices.
> >
>
> Those who are not using dm-tools will be forced to use dm-tools for
> bandwidth control features.

Once dm-ioband is integrated into the LVM tools and bandwidth can be
assigned per device by lvcreate, the use of dm-tools is no longer
required for users.

> Interesting. In all the test cases you always test with sequential
> readers. I have changed the test case a bit (I have already reported the
> results in another mail, now running the same test again with dm-version
> 1.14). I made all the readers doing direct IO and in other group I put
> a buffered writer. So setup looks as follows.
>
> In group1, I launch 1 prio 0 reader and increasing number of prio4
> readers. In group 2 I just run a dd doing buffered writes. Weights of
> both the groups are 100 each.
>
> Following are the results on 2.6.31 kernel.
>
> With-dm-ioband
> ==============
> <------------prio4 readers----------------------> <---prio0 reader------>
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1 9992KiB/s 9992KiB/s 9992KiB/s 413K usec 4621KiB/s 369K usec
> 2 4859KiB/s 4265KiB/s 9122KiB/s 344K usec 4915KiB/s 401K usec
> 4 2238KiB/s 1381KiB/s 7703KiB/s 532K usec 3195KiB/s 546K usec
> 8 504KiB/s 46KiB/s 1439KiB/s 399K usec 7661KiB/s 220K usec
> 16 131KiB/s 26KiB/s 638KiB/s 492K usec 4847KiB/s 359K usec
>
> With vanilla CFQ
> ================
> <------------prio4 readers----------------------> <---prio0 reader------>
> nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1 10779KiB/s 10779KiB/s 10779KiB/s 407K usec 16094KiB/s 808K usec
> 2 7045KiB/s 6913KiB/s 13959KiB/s 538K usec 18794KiB/s 761K usec
> 4 7842KiB/s 4409KiB/s 20967KiB/s 876K usec 12543KiB/s 443K usec
> 8 6198KiB/s 2426KiB/s 24219KiB/s 1469K usec 9483KiB/s 685K usec
> 16 5041KiB/s 1358KiB/s 27022KiB/s 2417K usec 6211KiB/s 1025K usec
>
>
> Above results are showing how bandwidth got distributed between prio4 and
> prio1 readers with-in group as we increased number of prio4 readers in
> the group. In another group a buffered writer is continuously going on
> as competitor.
>
> Notice, with dm-ioband how bandwidth allocation is broken.
>
> With 1 prio4 reader, prio4 reader got more bandwidth than prio1 reader.
>
> With 2 prio4 readers, looks like prio4 got almost same BW as prio1.
>
> With 8 and 16 prio4 readers, looks like prio0 readers takes over and prio4
> readers starve.
>
> As we incresae number of prio4 readers in the group, their total aggregate
> BW share should increase. Instread it is decreasing.
>
> So to me in the face of competition with a writer in other group, BW is
> all over the place. Some of these might be dm-ioband bugs and some of
> these might be coming from the fact that buffering takes place in higher
> layer and dispatch is FIFO?

Thank you for testing. I did the same test and here are the results.

with vanilla CFQ
<------------prio4 readers------------------> prio0 group2
maxbw minbw aggrbw maxlat aggrbw bufwrite
1 12,140KiB/s 12,140KiB/s 12,140KiB/s 30001msec 11,125KiB/s 1,923KiB/s
2 3,967KiB/s 3,930KiB/s 7,897KiB/s 30001msec 14,213KiB/s 1,586KiB/s
4 3,399KiB/s 3,066KiB/s 13,031KiB/s 30082msec 8,930KiB/s 1,296KiB/s
8 2,086KiB/s 1,720KiB/s 15,266KiB/s 30003msec 7,546KiB/s 517KiB/s
16 1,156KiB/s 837KiB/s 15,377KiB/s 30033msec 4,282KiB/s 600KiB/s

with dm-ioband weight-iosize policy
<------------prio4 readers------------------> prio0 group2
maxbw minbw aggrbw maxlat aggrbw bufwrite
1 107KiB/s 107KiB/s 107KiB/s 30007msec 12,242KiB/s 12,320KiB/s
2 1,259KiB/s 702KiB/s 1,961KiB/s 30037msec 9,657KiB/s 11,657KiB/s
4 2,705KiB/s 29KiB/s 5,186KiB/s 30026msec 5,927KiB/s 11,300KiB/s
8 2,428KiB/s 27KiB/s 5,629KiB/s 30054msec 5,057KiB/s 10,704KiB/s
16 2,465KiB/s 23KiB/s 4,309KiB/s 30032msec 4,750KiB/s 9,088KiB/s

The results are somewhat different from yours. The bandwidth is
distributed to each group equally, but CFQ priority is broken as you
said. I think that the reason is not the FIFO, but that some IO requests
are issued from dm-ioband's kernel thread on behalf of the processes
which originate them, and then CFQ assumes that the kernel thread is the
originator and uses its io_context.
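
Roughly what happens on the CFQ side (paraphrasing the 2.6.31-era
cfq_set_request() path from memory, not quoting it):

        /* the io_context of "current", i.e. the submitting task, is used */
        struct io_context *ioc = get_io_context(gfp_mask, cfqd->queue->node);
        struct cfq_io_context *cic = cfq_cic_lookup(cfqd, ioc);

So whichever task actually calls submit_bio() owns the request for
priority purposes; when that task is dm-ioband's helper thread, the
originator's ioprio is lost.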

> > Here is my test script.
> > -------------------------------------------------------------------------
> > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> > --group_reporting"
> >
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> >
> > echo $$ > /cgroup/1/tasks
> > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> > echo $$ > /cgroup/2/tasks
> > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> > echo $$ > /cgroup/tasks
> > wait
> > -------------------------------------------------------------------------
> >
> > Be that as it way, I think that if every bio can point the iocontext
> > of the process, then it makes it possible to handle IO priority in the
> > higher level controller. A patchse has already posted by Takhashi-san.
> > What do you think about this idea?
> >
> > Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > Subject [RFC][PATCH 1/10] I/O context inheritance
> > From Hirokazu Takahashi <>
> > http://lkml.org/lkml/2008/4/22/195
>
> So far you have been denying that there are issues with ioprio with-in
> group in higher level controller. Here you seems to be saying that there are
> issues with ioprio and we need to take this patch in to solve the issue? I am
> confused?

The true intention of this patch is to preserve the io_context of the
process which originates the IO, but I think that we could also make use
of this patch as one way to solve this issue.

> Anyway, if you think that above patch is needed to solve the issue of
> ioprio in higher level controller, why are you not posting it as part of
> your patch series regularly, so that we can also apply this patch along
> with other patches and test the effects?

I will post the patch, but I would like to find out and understand the
reason for the above test results before posting it.

> Against what kernel version above patches apply. The biocgroup patches
> I tried against 2.6.31 as well as 2.6.32-rc1 and it does not apply cleanly
> against any of these?
>
> So for the time being I am doing testing with biocgroup patches.

I created those patches against 2.6.32-rc1 and made sure the patches
can be cleanly applied to that version.

Thanks,
Ryo Tsuruta

2009-10-07 15:12:16

by Vivek Goyal

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

On Wed, Oct 07, 2009 at 11:38:05PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <[email protected]> wrote:
> > > > >> If one would like to
> > > > >> combine some physical disks into one logical device like a dm-linear,
> > > > >> I think one should map the IO controller on each physical device and
> > > > >> combine them into one logical device.
> > > > >>
> > > > >
> > > > > In fact this sounds like a more complicated step where one has to setup
> > > > > one dm-ioband device on top of each physical device. But I am assuming
> > > > > that this will go away once you move to per reuqest queue like implementation.
> > >
> > > I don't understand why the per request queue implementation makes it
> > > go away. If dm-ioband is integrated into the LVM tools, it could allow
> > > users to skip the complicated steps to configure dm-linear devices.
> > >
> >
> > Those who are not using dm-tools will be forced to use dm-tools for
> > bandwidth control features.
>
> If once dm-ioband is integrated into the LVM tools and bandwidth can
> be assigned per device by lvcreate, the use of dm-tools is no longer
> required for users.

But it is the same thing. Now the LVM tools are mandatory to use?

>
> > Interesting. In all the test cases you always test with sequential
> > readers. I have changed the test case a bit (I have already reported the
> > results in another mail, now running the same test again with dm-version
> > 1.14). I made all the readers doing direct IO and in other group I put
> > a buffered writer. So setup looks as follows.
> >
> > In group1, I launch 1 prio 0 reader and increasing number of prio4
> > readers. In group 2 I just run a dd doing buffered writes. Weights of
> > both the groups are 100 each.
> >
> > Following are the results on 2.6.31 kernel.
> >
> > With-dm-ioband
> > ==============
> > <------------prio4 readers----------------------> <---prio0 reader------>
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1 9992KiB/s 9992KiB/s 9992KiB/s 413K usec 4621KiB/s 369K usec
> > 2 4859KiB/s 4265KiB/s 9122KiB/s 344K usec 4915KiB/s 401K usec
> > 4 2238KiB/s 1381KiB/s 7703KiB/s 532K usec 3195KiB/s 546K usec
> > 8 504KiB/s 46KiB/s 1439KiB/s 399K usec 7661KiB/s 220K usec
> > 16 131KiB/s 26KiB/s 638KiB/s 492K usec 4847KiB/s 359K usec
> >
> > With vanilla CFQ
> > ================
> > <------------prio4 readers----------------------> <---prio0 reader------>
> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1 10779KiB/s 10779KiB/s 10779KiB/s 407K usec 16094KiB/s 808K usec
> > 2 7045KiB/s 6913KiB/s 13959KiB/s 538K usec 18794KiB/s 761K usec
> > 4 7842KiB/s 4409KiB/s 20967KiB/s 876K usec 12543KiB/s 443K usec
> > 8 6198KiB/s 2426KiB/s 24219KiB/s 1469K usec 9483KiB/s 685K usec
> > 16 5041KiB/s 1358KiB/s 27022KiB/s 2417K usec 6211KiB/s 1025K usec
> >
> >
> > Above results are showing how bandwidth got distributed between prio4 and
> > prio1 readers with-in group as we increased number of prio4 readers in
> > the group. In another group a buffered writer is continuously going on
> > as competitor.
> >
> > Notice, with dm-ioband how bandwidth allocation is broken.
> >
> > With 1 prio4 reader, prio4 reader got more bandwidth than prio1 reader.
> >
> > With 2 prio4 readers, looks like prio4 got almost same BW as prio1.
> >
> > With 8 and 16 prio4 readers, looks like prio0 readers takes over and prio4
> > readers starve.
> >
> > As we incresae number of prio4 readers in the group, their total aggregate
> > BW share should increase. Instread it is decreasing.
> >
> > So to me in the face of competition with a writer in other group, BW is
> > all over the place. Some of these might be dm-ioband bugs and some of
> > these might be coming from the fact that buffering takes place in higher
> > layer and dispatch is FIFO?
>
> Thank you for testing. I did the same test and here are the results.
>
> with vanilla CFQ
> <------------prio4 readers------------------> prio0 group2
> maxbw minbw aggrbw maxlat aggrbw bufwrite
> 1 12,140KiB/s 12,140KiB/s 12,140KiB/s 30001msec 11,125KiB/s 1,923KiB/s
> 2 3,967KiB/s 3,930KiB/s 7,897KiB/s 30001msec 14,213KiB/s 1,586KiB/s
> 4 3,399KiB/s 3,066KiB/s 13,031KiB/s 30082msec 8,930KiB/s 1,296KiB/s
> 8 2,086KiB/s 1,720KiB/s 15,266KiB/s 30003msec 7,546KiB/s 517KiB/s
> 16 1,156KiB/s 837KiB/s 15,377KiB/s 30033msec 4,282KiB/s 600KiB/s
>
> with dm-ioband weight-iosize policy
> <------------prio4 readers------------------> prio0 group2
> maxbw minbw aggrbw maxlat aggrbw bufwrite
> 1 107KiB/s 107KiB/s 107KiB/s 30007msec 12,242KiB/s 12,320KiB/s
> 2 1,259KiB/s 702KiB/s 1,961KiB/s 30037msec 9,657KiB/s 11,657KiB/s
> 4 2,705KiB/s 29KiB/s 5,186KiB/s 30026msec 5,927KiB/s 11,300KiB/s
> 8 2,428KiB/s 27KiB/s 5,629KiB/s 30054msec 5,057KiB/s 10,704KiB/s
> 16 2,465KiB/s 23KiB/s 4,309KiB/s 30032msec 4,750KiB/s 9,088KiB/s
>
> The results are somewhat different from yours. The bandwidth is
> distributed to each group equally, but CFQ priority is broken as you
> said. I think that the reason is not because of FIFO, but because
> some IO requests are issued from dm-ioband's kernel thread on behalf of
> processes which origirante the IO requests, then CFQ assumes that the
> kernel thread is the originator and uses its io_context.

Ok. Our numbers can vary a bit depending on fio settings like block size
and also on the underlying storage. But that's not the important thing.
With this test I just wanted to point out that the model of ioprio
within a group is currently broken with dm-ioband, and it is good that
you can reproduce that.

One minor nit: for max latency you need to look at the "clat" row and
the "max=" field in the fio output. Most of the time "max latency" will
matter most. You seem to be currently grepping for "maxt", which just
tells how long the test ran, in this case 30 seconds.

Assigning reads to the right context in CFQ and not to the dm-ioband
thread might help a bit, but I am a bit skeptical, and the following is
the reason.

CFQ relies on time, providing a longer time slice to a higher priority
process, and if a process does not use its time slice, it loses its
share. So the moment you buffer even a single bio of a process in the dm
layer, if CFQ was servicing that process at the same time, that process
will lose its share. CFQ will at most anticipate for 8 ms, and if the
buffering is longer than 8 ms, CFQ will expire the queue and move on to
the next queue. Later, if you submit the same bio via the dm-ioband
helper thread, even if CFQ attributes it to the right process, it is not
going to help much as the process has already lost its slice and a new
slice will now start.
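
Rough numbers to illustrate (2.6.31-ish defaults quoted from memory, so
approximate): the sync base slice is 100 ms, the idle window 8 ms, and
the per-priority slice is about

\[ \text{slice}(p) \approx 100\,\text{ms} + 20\,\text{ms} \times (4 - p) \]

i.e. roughly 180 ms for prio0, 100 ms for prio4, 40 ms for prio7. If
dm-ioband holds a prio0 task's next bio for more than the ~8 ms idle
window, CFQ expires the queue and whatever remained of that slice is
gone for this round; re-attributing the bio to the right io_context
later cannot bring it back.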

>
> > > Here is my test script.
> > > -------------------------------------------------------------------------
> > > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> > > --group_reporting"
> > >
> > > sync
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > echo $$ > /cgroup/1/tasks
> > > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> > > echo $$ > /cgroup/2/tasks
> > > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> > > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> > > echo $$ > /cgroup/tasks
> > > wait
> > > -------------------------------------------------------------------------
> > >
> > > Be that as it way, I think that if every bio can point the iocontext
> > > of the process, then it makes it possible to handle IO priority in the
> > > higher level controller. A patchse has already posted by Takhashi-san.
> > > What do you think about this idea?
> > >
> > > Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > > Subject [RFC][PATCH 1/10] I/O context inheritance
> > > From Hirokazu Takahashi <>
> > > http://lkml.org/lkml/2008/4/22/195
> >
> > So far you have been denying that there are issues with ioprio with-in
> > group in higher level controller. Here you seems to be saying that there are
> > issues with ioprio and we need to take this patch in to solve the issue? I am
> > confused?
>
> The true intention of this patch is to preserve the io-context of a
> process which originate it, but I think that we could also make use of
> this patch for one of the way to solve this issue.
>

Ok. Did you run the same test with this patch applied, and how do the
numbers look? Can you please forward-port it to 2.6.31? I would also
like to play with it.

I am running more tests/numbers with 2.6.31 for all the IO controllers
and planning to post them to lkml before we meet for the IO mini summit.
Numbers can help us understand the issues better.

In the first phase I am planning to post numbers for the IO scheduler
controller and dm-ioband. Then I will get to the max bw controller of
Andrea Righi.

> > Anyway, if you think that above patch is needed to solve the issue of
> > ioprio in higher level controller, why are you not posting it as part of
> > your patch series regularly, so that we can also apply this patch along
> > with other patches and test the effects?
>
> I will post the patch, but I would like to find out and understand the
> reason of above test results before posting the patch.
>

Ok. So in the mean time, I will continue to do testing with dm-ioband
version 1.14.0 and post the numbers.

> > Against what kernel version above patches apply. The biocgroup patches
> > I tried against 2.6.31 as well as 2.6.32-rc1 and it does not apply cleanly
> > against any of these?
> >
> > So for the time being I am doing testing with biocgroup patches.
>
> I created those patches against 2.6.32-rc1 and made sure the patches
> can be cleanly applied to that version.

I am applying the dm-ioband patch first and then the bio-cgroup patches.
Is this the right order? Will try again.

Anyway, there is not too much time before the IO mini summit, so I will
stick to 2.6.31 for the time being. If time permits, I will venture into
2.6.32-rc1 also.

Thanks
Vivek

2009-10-07 16:43:55

by Rik van Riel

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Ryo Tsuruta wrote:

> If once dm-ioband is integrated into the LVM tools and bandwidth can
> be assigned per device by lvcreate, the use of dm-tools is no longer
> required for users.

A lot of large data center users have a SAN, with volume management
handled SAN-side and dedicated LUNs for different applications or
groups of applications.

Because of alignment issues, they typically use filesystems directly
on top of the LUNs, without partitions or LVM layers. We cannot rely
on LVM for these systems, because people prefer not to use that.

Besides ... isn't the goal of the cgroups io bandwidth controller
to control the IO used by PROCESSES?

If we want to control processes, why would we want the configuration
to be applied to any other kind of object in the system?

--
All rights reversed.

2009-10-08 02:19:20

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Vivek,

Vivek Goyal <[email protected]> wrote:
> Ok. Our numbers can vary a bit depending on fio settings like block size
> and underlying storage also. But that's not the important thing. Currently
> with this test I just wanted to point out that model of ioprio with-in group
> is currently broken with dm-ioband and good that you can reproduce that.
>
> One minor nit, for max latency you need to look at "clat " row and "max=" field
> in fio output. Most of the time "max latency" will matter most. You seem to
> be currently grepping for "maxt" which is just seems to be telling how
> long did test run and in this case 30 seconds.
>
> Assigning reads to right context in CFQ and not to dm-ioband thread might
> help a bit, but I am bit skeptical and following is the reason.
>
> CFQ relies on time providing longer time slice length for higher priority
> process and if one does not use time slice, it looses its share. So the moment
> you buffer even single bio of a process in dm-layer, if CFQ was servicing that
> process at same time, that process will loose its share. CFQ will at max
> anticipate for 8 ms and if buffering is longer than 8ms, CFQ will expire the
> queue and move on to next queue. Later if you submit same bio and with
> dm-ioband helper thread and even if CFQ attributes it to right process, it is
> not going to help much as process already lost it slice and now a new slice
> will start.

O.K. I would like to figure this issue out.

> > > > Be that as it way, I think that if every bio can point the iocontext
> > > > of the process, then it makes it possible to handle IO priority in the
> > > > higher level controller. A patchse has already posted by Takhashi-san.
> > > > What do you think about this idea?
> > > >
> > > > Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > > > Subject [RFC][PATCH 1/10] I/O context inheritance
> > > > From Hirokazu Takahashi <>
> > > > http://lkml.org/lkml/2008/4/22/195
> > >
> > > So far you have been denying that there are issues with ioprio with-in
> > > group in higher level controller. Here you seems to be saying that there are
> > > issues with ioprio and we need to take this patch in to solve the issue? I am
> > > confused?
> >
> > The true intention of this patch is to preserve the io-context of a
> > process which originate it, but I think that we could also make use of
> > this patch for one of the way to solve this issue.
> >
>
> Ok. Did you run the same test with this patch applied and how do numbers look
> like? Can you please forward port it to 2.6.31 and I will also like to
> play with it?

I'm sorry, I have no time to do that this week. I would like to do the
forward porting and test it before the mini-summit if possible.

> I am running more tests/numbers with 2.6.31 for all the IO controllers and
> planning to post it to lkml before we meet for IO mini summit. Numbers can
> help us understand the issue better.
>
> In first phase I am planning to post numbers for IO scheudler controller
> and dm-ioband. Then will get to max bw controller of Andrea Righi.

That sounds good. Thank you for your work.

> > I created those patches against 2.6.32-rc1 and made sure the patches
> > can be cleanly applied to that version.
>
> I am applying dm-ioband patch first and then bio cgroup patches. Is this
> right order? Will try again.

Yes, the order is right. Here are the sha1sums.
9f4e50878d77922c84a29be9913a8b5c3f66e6ec linux-2.6.32-rc1.tar.bz2
15d7cc9d801805327204296a2454d6c5346dd2ae dm-ioband-1.14.0.patch
5e0626c14a40c319fb79f2f78378d2de5cc97b02 blkio-cgroup-v13.tar.bz2

Thanks,
Ryo Tsuruta

2009-10-08 04:54:24

by Vivek Goyal

[permalink] [raw]
Subject: More performance numbers (Was: Re: IO scheduler based IO controller V10)

On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:
[..]
> >
> > Testing
> > =======
> >
> > Environment
> > ==========
> > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
>
> That's a bit of a toy.
>
> Do we have testing results for more enterprisey hardware? Big storage
> arrays? SSD? Infiniband? iscsi? nfs? (lol, gotcha)
>
>

Hi Andrew,

I got hold of some relatively more enterprisey stuff. It is a storage
array with a few striped disks (I think 4 or 5). So this is not high end
hardware, but better than my single SATA disk; I guess it is maybe entry
level enterprise stuff. Still trying to get hold of a higher end
configuration...

Apart from the IO scheduler controller numbers, I also got a chance to
run the same tests with the dm-ioband controller. I am posting these
too. I am also planning to run similar numbers on Andrea's "max bw"
controller; I should be able to post those numbers in 2-3 days.

Software Environment
====================
- 2.6.31 kernel
- V10 of IO scheduler based controller
- version v1.14.0 of dm-ioband patches

Used fio jobs for 30 seconds in various configurations. All the IO is
direct IO to eliminate the effects of caches.

I have run three sets for each test and am blindly reporting the results
of set2 from each test, otherwise it is too much data to report.

Had a lun of 2500GB capacity. Used 200G partitions with an ext3 file
system for my testing. For the IO scheduler based controller patches, I
created two cgroups of weight 100 each doing IO to a single 200G
partition.

For dm-ioband, I created two partitions of 200G each and two ioband
devices of weight 100 each with the policy "weight-iosize". Ideally I
should have used cgroups with dm-ioband also but could not get the
cgroup patch going. Because this is a striped configuration, I am not
expecting any major change in the results due to that.

Sequential reader vs Random reader
==================================
Launched one random reader in one group and an increasing number of
sequential readers in the other group to see the effect on the latency
and bandwidth of the random reader.

[fio1 --rw=read --bs=4K --size=2G --runtime=30 --direct=1 ]
[fio2 --rw=randread --bs=4K --size=1G --runtime=30 --direct=1 --group_reporting]

Vanilla CFQ
-----------
[Sequential readers] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 13806KB/s 13806KB/s 13483KB/s 28672 usec 1 23KB/s 212 msec
2 6406KB/s 6268KB/s 12378KB/s 128K usec 1 10KB/s 453 msec
4 3934KB/s 2536KB/s 13103KB/s 321K usec 1 6KB/s 847 msec
8 1934KB/s 556KB/s 13009KB/s 876K usec 1 13KB/s 1632 msec
16 958KB/s 280KB/s 13761KB/s 1621K usec 1 10KB/s 3217 msec
32 512KB/s 126KB/s 13861KB/s 3241K usec 1 6KB/s 3249 msec

IO scheduler controller + CFQ
-----------------------------
[Sequential readers] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 5651KB/s 5651KB/s 5519KB/s 126K usec 1 222KB/s 130K usec
2 3144KB/s 1479KB/s 4515KB/s 347K usec 1 225KB/s 189K usec
4 1852KB/s 626KB/s 5128KB/s 775K usec 1 224KB/s 159K usec
8 971KB/s 279KB/s 6464KB/s 1666K usec 1 222KB/s 193K usec
16 454KB/s 129KB/s 6293KB/s 3356K usec 1 218KB/s 466K usec
32 239KB/s 42KB/s 5986KB/s 6753K usec 1 214KB/s 503K usec

Notes:
- The BW and latency of the random reader are fairly stable in the face
of an increasing number of sequential readers. There are a couple of
spikes in latency, which I guess come from the hardware somehow. But I
will debug more to make sure that I am not delaying the dispatch of
requests.

dm-ioband + CFQ
----------------
[Sequential readers] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 12466KB/s 12466KB/s 12174KB/s 40078 usec 1 37KB/s 221 msec
2 6240KB/s 5904KB/s 11859KB/s 134K usec 1 12KB/s 443 msec
4 3517KB/s 2529KB/s 12368KB/s 357K usec 1 6KB/s 772 msec
8 1779KB/s 594KB/s 9857KB/s 719K usec 1 60KB/s 852K usec
16 914KB/s 300KB/s 10934KB/s 1467K usec 1 40KB/s 1285K usec
32 589KB/s 187KB/s 11537KB/s 3547K usec 1 14KB/s 3228 msec

Notes:
- Does not look like we provide fairness to the random reader here.
Latencies are on the rise and BW is on the decline. This is almost like
vanilla CFQ with reduced overall throughput.

- dm-ioband claims that it does not provide fairness for a slow moving
group, and I think that is a bad idea. It leads to very weak isolation
with no benefits, especially if a buffered writer is running in the
other group. This should be fixed.

Random writers vs Random reader
================================
[fio1 --rw=randwrite --bs=64K --size=2G --runtime=30 --ioengine=libaio --iodepth=4 --direct=1 ]
[fio2 --rw=randread --bs=4K --size=1G --runtime=30 --direct=1 --group_reporting]

Vanilla CFQ
-----------
[Random Writers] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 67785KB/s 67785KB/s 66197KB/s 45499 usec 1 170KB/s 94098 usec
2 35163KB/s 35163KB/s 68678KB/s 218K usec 1 75KB/s 2335 msec
4 17759KB/s 15308KB/s 64206KB/s 2387K usec 1 85KB/s 2331 msec
8 8725KB/s 6495KB/s 57120KB/s 3761K usec 1 67KB/s 2488K usec
16 3912KB/s 3456KB/s 57121KB/s 1273K usec 1 60KB/s 1668K usec
32 2020KB/s 1503KB/s 56786KB/s 4221K usec 1 39KB/s 1101 msec

IO scheduler controller + CFQ
-----------------------------
[Random Writers] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 20919KB/s 20919KB/s 20428KB/s 288K usec 1 213KB/s 580K usec
2 14765KB/s 14674KB/s 28749KB/s 776K usec 1 203KB/s 112K usec
4 7177KB/s 7091KB/s 27839KB/s 970K usec 1 197KB/s 132K usec
8 3027KB/s 2953KB/s 23285KB/s 3145K usec 1 218KB/s 203K usec
16 1959KB/s 1750KB/s 28919KB/s 1266K usec 1 160KB/s 182K usec
32 908KB/s 753KB/s 26267KB/s 2091K usec 1 208KB/s 144K usec

Notes:
- Again the disk time has been divided half and half between the random
reader group and the random writer group. Fairly stable BW and latencies
for the random reader in the face of an increasing number of random
writers.

- The drop in aggregate bw of the random writers is expected as they now
get only half of the disk time.

dm-ioband + CFQ
----------------
[Random Writers] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 63659KB/s 63659KB/s 62167KB/s 89954 usec 1 164KB/s 72 msec
2 27109KB/s 27096KB/s 52933KB/s 674K usec 1 140KB/s 2204K usec
4 16553KB/s 16216KB/s 63946KB/s 694K usec 1 56KB/s 1871 msec
8 3907KB/s 3347KB/s 28752KB/s 2406K usec 1 226KB/s 2407K usec
16 2841KB/s 2647KB/s 42334KB/s 870K usec 1 52KB/s 3043 msec
32 738KB/s 657KB/s 21285KB/s 1529K usec 1 21KB/s 4435 msec

Notes:
- Again no fairness for the random reader. Decreasing BW, increasing
latency. No isolation in this case.

- I am curious what happened to the random writer throughput in the case
of "32" writers. We did not get higher BW for the random reader, but the
random writers are still suffering in throughput. I can see this in all
three sets.

Sequential Readers vs Sequential reader
=======================================
[fio1 --rw=read --bs=4K --size=2G --runtime=30 --direct=1]
[fio2 --rw=read --bs=4K --size=2G --runtime=30 --direct=1]

Vanilla CFQ
-----------
[Sequential Readers] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 6434KB/s 6434KB/s 6283KB/s 107K usec 1 7017KB/s 111K usec
2 4688KB/s 3284KB/s 7785KB/s 274K usec 1 4541KB/s 218K usec
4 3365KB/s 1326KB/s 9769KB/s 597K usec 1 3038KB/s 424K usec
8 1827KB/s 504KB/s 12053KB/s 813K usec 1 1389KB/s 813K usec
16 1022KB/s 301KB/s 13954KB/s 1618K usec 1 676KB/s 1617K usec
32 494KB/s 149KB/s 13611KB/s 3216K usec 1 416KB/s 3215K usec

IO scheduler controller + CFQ
-----------------------------
[Sequential Readers] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 6605KB/s 6605KB/s 6450KB/s 120K usec 1 6527KB/s 120K usec
2 3706KB/s 1985KB/s 5558KB/s 323K usec 1 6331KB/s 149K usec
4 2053KB/s 672KB/s 5731KB/s 721K usec 1 6267KB/s 148K usec
8 1013KB/s 337KB/s 6962KB/s 1525K usec 1 6136KB/s 120K usec
16 497KB/s 125KB/s 6873KB/s 3226K usec 1 5882KB/s 113K usec
32 297KB/s 48KB/s 6445KB/s 6394K usec 1 5767KB/s 116K usec

Notes:
- Stable BW and latencies for the sequential reader in the face of an
increasing number of readers in the other group.

dm-ioband + CFQ
----------------
[Sequential Readers] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 7140KB/s 7140KB/s 6972KB/s 112K usec 1 6886KB/s 165K usec
2 3965KB/s 2762KB/s 6569KB/s 479K usec 1 5887KB/s 475K usec
4 2725KB/s 1483KB/s 7999KB/s 532K usec 1 4774KB/s 500K usec
8 1610KB/s 621KB/s 9565KB/s 729K usec 1 2910KB/s 677K usec
16 904KB/s 319KB/s 10809KB/s 1431K usec 1 1970KB/s 1399K usec
32 553KB/s 8KB/s 11794KB/s 2330K usec 1 1337KB/s 2398K usec

Notes:
- Decreasing throughput and increasing latencies for sequential reader.
Hence no isolation in this case.

- Also note that in the case of 32 readers, the difference between "max-bw"
and "min-bw" is relatively large, considering that all 32 readers are of
the same prio. So BW distribution within the group is not very good. This
is the ioprio-within-group issue I have pointed out many times. Ryo is
looking into it now.

Sequential Readers vs Multiple Random Readers
=============================================
OK, because dm-ioband does not provide fairness when heavy IO activity is
not going on in a group, I decided to run a slightly different test case
where 16 sequential readers run in one group and an increasing number of
random readers run in the other group, to see when I start getting
fairness and what its effect is.

[fio1 --rw=read --bs=4K --size=2G --runtime=30 --direct=1 ]
[fio2 --rw=randread --bs=4K --size=1G --runtime=30 --direct=1 --group_reporting]

Vanilla CFQ
-----------
[Sequential Readers] [Multiple Random Readers]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
16 961KB/s 280KB/s 13978KB/s 1673K usec 1 10KB/s 3223 msec
16 903KB/s 260KB/s 12925KB/s 1770K usec 2 28KB/s 3465 msec
16 832KB/s 231KB/s 11428KB/s 2088K usec 4 57KB/s 3891K usec
16 765KB/s 187KB/s 9899KB/s 2500K usec 8 99KB/s 3937K usec
16 512KB/s 144KB/s 6759KB/s 3451K usec 16 148KB/s 5470K usec

IO scheduler controller + CFQ
-----------------------------
[Sequential Readers] [Multiple Random Readers]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
16 456KB/s 112KB/s 6380KB/s 3361K usec 1 221KB/s 503K usec
16 476KB/s 159KB/s 6040KB/s 3432K usec 2 214KB/s 549K usec
16 606KB/s 178KB/s 6052KB/s 3801K usec 4 177KB/s 1341K usec
16 589KB/s 83KB/s 6243KB/s 3394K usec 8 154KB/s 3288K usec
16 547KB/s 122KB/s 6122KB/s 3538K usec 16 145KB/s 5959K usec

Notes:
- Stable BW and latencies for sequential reader group in the face of
increasing number of random readers in other group.

- Because disk time is divided half/half, the random reader group also
gets a decent amount of work done. Not sure why BW dips a bit when the
number of random readers increases. Too seeky to handle?

dm-ioband + CFQ
----------------
[Sequential Readers] [Multiple Random Readers]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
16 926KB/s 293KB/s 10256KB/s 1634K usec 1 55KB/s 1377K usec
16 906KB/s 284KB/s 9240KB/s 1825K usec 2 71KB/s 2392K usec
16 321KB/s 18KB/s 1621KB/s 2037K usec 4 326KB/s 2054K usec
16 188KB/s 16KB/s 1188KB/s 9757K usec 8 404KB/s 3269K usec
16 167KB/s 64KB/s 1700KB/s 2859K usec 16 1064KB/s 2920K usec

Notes:
- Looks like ioband tried to provide fairness from the point where the
number of random readers reaches 4. Note the sudden increase in BW of the
random readers and the drastic drop in BW of the sequential readers.

- By the time the number of readers reaches 16, total array throughput
drops to around 2.7 MB/s. It got killed because suddenly we are trying to
provide fairness in terms of size of IO. That's why, on seeky media,
fairness in terms of disk time works better.

- There is no isolation between groups. Throughput of sequential reader
group continues to drop and latencies rise.

- I think these are serious issues which should be looked into and fixed.

Thanks
Vivek

2009-10-08 08:35:19

by Andrea Righi

[permalink] [raw]
Subject: Re: More performance numbers (Was: Re: IO scheduler based IO controller V10)

On Thu, Oct 08, 2009 at 12:42:51AM -0400, Vivek Goyal wrote:
> Apart from IO scheduler controller number, I also got a chance to run same
> tests with dm-ioband controller. I am posting these too. I am also
> planning to run similar numbers on Andrea's "max bw" controller also.
> Should be able to post those numbers also in 2-3 days.

For those who are interested (especially to help Vivek test all this
stuff), here is the all-in-one patchset of the io-throttle controller,
rebased to 2.6.31:
http://www.develer.com/~arighi/linux/patches/io-throttle/old/cgroup-io-throttle-2.6.31.patch

And this one is v18 rebased to 2.6.32-rc3:
http://www.develer.com/~arighi/linux/patches/io-throttle/cgroup-io-throttle-v18.patch

Thanks,
-Andrea

2009-10-08 10:23:22

by Ryo Tsuruta

[permalink] [raw]
Subject: Re: IO scheduler based IO controller V10

Hi Rik,

Rik van Riel <[email protected]> wrote:
> Ryo Tsuruta wrote:
>
> > If once dm-ioband is integrated into the LVM tools and bandwidth can
> > be assigned per device by lvcreate, the use of dm-tools is no longer
> > required for users.
>
> A lot of large data center users have a SAN, with volume management
> handled SAN-side and dedicated LUNs for different applications or
> groups of applications.
>
> Because of alignment issues, they typically use filesystems directly
> on top of the LUNs, without partitions or LVM layers. We cannot rely
> on LVM for these systems, because people prefer not to use that.

Thank you for your explanation. So I have a plan to reimplement
dm-ioband in the block layer to make the dm-tools no longer required.
The opinion I wrote above assumes that dm-ioband is used for a logical
volume which consists of multiple physical devices. If dm-ioband is
integrated into the LVM tools, then the use of the dm-tools is not
required and the underlying physical devices can be automatically
detected and configured to use dm-ioband.

Thanks,
Ryo Tsuruta

> Besides ... isn't the goal of the cgroups io bandwidth controller
> to control the IO used by PROCESSES?
>
> If we want to control processes, why would we want the configuration
> to be applied to any other kind of object in the system?

2009-10-10 19:54:44

by Vivek Goyal

[permalink] [raw]
Subject: Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)

On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:

[..]
> > Environment
> > ==========
> > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
>
> That's a bit of a toy.
>
> Do we have testing results for more enterprisey hardware? Big storage
> arrays? SSD? Infiniband? iscsi? nfs? (lol, gotcha)
>
>

Hi All,

Couple of days back I posted some performance number of "IO scheduler
controller" and "dm-ioband" here.

http://lkml.org/lkml/2009/10/8/9

Now I have run similar tests with Andrea Righi's IO throttling approach
of max bandwidth control. This is the exercise to understand pros/cons
of each approach and see how can we take things forward.

Environment
===========
Software
--------
- 2.6.31 kernel
- IO scheduler controller V10 on top of 2.6.31
- IO throttling patch on top of 2.6.31. Patch is available here.

http://www.develer.com/~arighi/linux/patches/io-throttle/old/cgroup-io-throttle-2.6.31.patch

Hardware
--------
A storage array of 5 striped disks of 500GB each.

Used fio jobs for 30 seconds in various configurations. Most of the IO is
direct IO to eliminate the effects of caches.

I have run three sets for each test. Blindly reporting results of set2
from each test, otherwise it is too much data to report.

Had a LUN of 2500GB capacity. Used a 200G partition with an ext3 file
system for my testing. For IO scheduler controller testing, I created two
cgroups of weight 100 each so that effectively the disk can be divided
half/half between the two groups.
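
For reference, the group setup looked roughly like the sketch below. The
controller name ("io"), the mount point and the file name ("io.weight")
are assumptions from my setup and may differ in other versions of the
patchset:

mount -t cgroup -o io none /cgroup/io       # mount the io controller (assumed name)
mkdir /cgroup/io/group1 /cgroup/io/group2   # one cgroup per workload
echo 100 > /cgroup/io/group1/io.weight      # equal weights => roughly half/half
echo 100 > /cgroup/io/group2/io.weight      #   disk time between the two groups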

For the IO throttling patches I also created two cgroups. Now the tricky
part is that it is a max BW controller and not a proportional weight
controller, so dividing the disk capacity half/half between two cgroups is
tricky. The reason being I just don't know what the BW capacity of the
underlying storage is. Throughput varies so much with the type of
workload. For example, on my array, this is how throughput looks with
different workloads.

8 sequential buffered readers 115 MB/s
8 direct sequential readers bs=64K 64 MB/s
8 direct sequential readers bs=4K 14 MB/s

8 buffered random readers bs=64K 3 MB/s
8 direct random readers bs=64K 15 MB/s
8 direct random readers bs=4K 1.5 MB/s

So throughput seems to be varying from 1.5 MB/s to 115 MB/s depending
on workload. What should be the BW limits per cgroup to divide disk BW
in half/half between two groups?

So I took a conservative estimate, divided the max bandwidth by 2, and
thought of the array capacity as 60MB/s, assigning each cgroup 30MB/s. In
some cases I have assigned even 10MB/s or 5MB/s to each cgroup to see the
effects of throttling. I am using the "Leaky bucket" policy for all the
tests.
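
The limits were set through io-throttle's blockio.bandwidth-max cgroup
file, roughly as in the sketch below. The value format shown
(<device>:<bytes/sec>:<strategy>:<bucket size>) and the device path are
only illustrative; check the patch documentation for the exact interface:

mount -t cgroup -o blockio none /cgroup/blockio
mkdir /cgroup/blockio/group1 /cgroup/blockio/group2
# 30 MB/s per group, leaky bucket (strategy 0 assumed here)
echo /dev/mapper/testlun:$((30*1024*1024)):0:0 > /cgroup/blockio/group1/blockio.bandwidth-max
echo /dev/mapper/testlun:$((30*1024*1024)):0:0 > /cgroup/blockio/group2/blockio.bandwidth-max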

As the theme of the two controllers is different, at some places it might
sound like an apples vs oranges comparison. But it still does help...

Multiple Random Reader vs Sequential Reader
===============================================
Generally random readers bring down the throughput of others in the
system. Ran a test to see the impact of an increasing number of random
readers on a single sequential reader in a different group.

Vanilla CFQ
-----------------------------------
[Multiple Random Reader] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 23KB/s 23KB/s 22KB/s 691 msec 1 13519KB/s 468K usec
2 152KB/s 152KB/s 297KB/s 244K usec 1 12380KB/s 31675 usec
4 174KB/s 156KB/s 638KB/s 249K usec 1 10860KB/s 36715 usec
8 49KB/s 11KB/s 310KB/s 1856 msec 1 1292KB/s 990K usec
16 63KB/s 48KB/s 877KB/s 762K usec 1 3905KB/s 506K usec
32 35KB/s 27KB/s 951KB/s 2655 msec 1 1109KB/s 1910K usec

IO scheduler controller + CFQ
-----------------------------------
[Multiple Random Reader] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 228KB/s 228KB/s 223KB/s 132K usec 1 5551KB/s 129K usec
2 97KB/s 97KB/s 190KB/s 154K usec 1 5718KB/s 122K usec
4 115KB/s 110KB/s 445KB/s 208K usec 1 5909KB/s 116K usec
8 23KB/s 12KB/s 158KB/s 2820 msec 1 5445KB/s 168K usec
16 11KB/s 3KB/s 145KB/s 5963 msec 1 5418KB/s 164K usec
32 6KB/s 2KB/s 139KB/s 12762 msec 1 5398KB/s 175K usec

Notes:
- Sequential reader in group2 seems to be well isolated from random readers
in group1. Throughput and latency of the sequential reader are stable and
don't drop as the number of random readers increases in the system.

io-throttle + CFQ
------------------
BW limit group1=10 MB/s BW limit group2=10 MB/s
[Multiple Random Reader] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 37KB/s 37KB/s 36KB/s 218K usec 1 8006KB/s 20529 usec
2 185KB/s 183KB/s 360KB/s 228K usec 1 7475KB/s 33665 usec
4 188KB/s 171KB/s 699KB/s 262K usec 1 6800KB/s 46224 usec
8 84KB/s 51KB/s 573KB/s 1800K usec 1 2835KB/s 885K usec
16 21KB/s 9KB/s 294KB/s 3590 msec 1 437KB/s 1855K usec
32 34KB/s 27KB/s 980KB/s 2861K usec 1 1145KB/s 1952K usec

Notes:
- I have set up limits of 10MB/s in both cgroups. Now the random reader
group will never achieve that kind of speed, so it will not be throttled,
and it then goes on to impact the throughput and latency of other groups
in the system.

- Now the key question is how conservative one should be in setting up the
max BW limit. On this box, if a customer has bought a 10MB/s cgroup and
is running some random readers, it will kill the throughput of other
groups in the system and their latencies will shoot up. No isolation in
this case.

- So in general, max BW provides isolation from high speed groups but it
does not provide isolation from random reader groups which are moving
slowly.

Multiple Sequential Reader vs Random Reader
===============================================
Now running a reverse test where in one group I am running an increasing
number of sequential readers and in the other group I am running one
random reader, to see the impact of sequential readers on the random
reader.

Vanilla CFQ
-----------------------------------
[Multiple Sequential Reader] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 13978KB/s 13978KB/s 13650KB/s 27614 usec 1 22KB/s 227 msec
2 6225KB/s 6166KB/s 12101KB/s 568K usec 1 10KB/s 457 msec
4 4052KB/s 2462KB/s 13107KB/s 322K usec 1 6KB/s 841 msec
8 1899KB/s 557KB/s 12960KB/s 829K usec 1 13KB/s 1628 msec
16 1007KB/s 279KB/s 13833KB/s 1629K usec 1 10KB/s 3236 msec
32 506KB/s 98KB/s 13704KB/s 3389K usec 1 6KB/s 3238 msec

IO scheduler controller + CFQ
-----------------------------------
[Multiple Sequential Reader] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 5721KB/s 5721KB/s 5587KB/s 126K usec 1 223KB/s 126K usec
2 3216KB/s 1442KB/s 4549KB/s 349K usec 1 224KB/s 176K usec
4 1895KB/s 640KB/s 5121KB/s 775K usec 1 222KB/s 189K usec
8 957KB/s 285KB/s 6368KB/s 1680K usec 1 223KB/s 142K usec
16 458KB/s 132KB/s 6455KB/s 3343K usec 1 219KB/s 165K usec
32 248KB/s 55KB/s 6001KB/s 6957K usec 1 220KB/s 504K usec

Notes:
- Random reader is well isolated from increasing number of sequential
readers in other group. BW and latencies are stable.

io-throttle + CFQ
-----------------------------------
BW limit group1=10 MB/s BW limit group2=10 MB/s
[Multiple Sequential Reader] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 8200KB/s 8200KB/s 8007KB/s 20275 usec 1 37KB/s 217K usec
2 3926KB/s 3919KB/s 7661KB/s 122K usec 1 16KB/s 441 msec
4 2271KB/s 1497KB/s 7672KB/s 611K usec 1 9KB/s 927 msec
8 1113KB/s 513KB/s 7507KB/s 849K usec 1 21KB/s 1020 msec
16 661KB/s 236KB/s 7959KB/s 1679K usec 1 13KB/s 2926 msec
32 292KB/s 109KB/s 7864KB/s 3446K usec 1 8KB/s 3439 msec

BW limit group1=5 MB/s BW limit group2=5 MB/s
[Multiple Sequential Reader] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 4686KB/s 4686KB/s 4576KB/s 21095 usec 1 57KB/s 219K usec
2 2298KB/s 2179KB/s 4372KB/s 132K usec 1 37KB/s 431K usec
4 1245KB/s 1019KB/s 4449KB/s 324K usec 1 26KB/s 835 msec
8 584KB/s 403KB/s 4109KB/s 833K usec 1 30KB/s 1625K usec
16 346KB/s 252KB/s 4605KB/s 1641K usec 1 129KB/s 3236K usec
32 175KB/s 56KB/s 4269KB/s 3236K usec 1 8KB/s 3235 msec

Notes:

- Above result is surprising to me. I have run it twice. In the first run
I set the per-cgroup limit to 10MB/s and in the second run to 5MB/s. In
both cases, as the number of sequential readers increases in the other
group, the random reader's throughput decreases and latencies increase.
This is happening despite the fact that the sequential readers are being
throttled to make sure they do not impact the workload in the other group.
Wondering why random readers are not seeing consistent throughput and
latencies.

- Andrea, can you please also run similar tests to see if you see the same
results or not. This is to rule out any testing methodology errors or
scripting bugs. :-) I have also collected snapshots of some cgroup files
like bandwidth-max, throttlecnt, and stats. Let me know if you want those
to see what is happening here.

Multiple Sequential Reader vs Sequential Reader
===============================================
- This time random readers are out of the picture, and I am trying to see
the effect of an increasing number of sequential readers on another
sequential reader running in a different group.

Vanilla CFQ
-----------------------------------
[Multiple Sequential Reader] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 6325KB/s 6325KB/s 6176KB/s 114K usec 1 6902KB/s 120K usec
2 4588KB/s 3102KB/s 7510KB/s 571K usec 1 4564KB/s 680K usec
4 3242KB/s 1158KB/s 9469KB/s 495K usec 1 3198KB/s 410K usec
8 1775KB/s 459KB/s 12011KB/s 1178K usec 1 1366KB/s 818K usec
16 943KB/s 296KB/s 13285KB/s 1923K usec 1 728KB/s 1816K usec
32 511KB/s 148KB/s 13555KB/s 3286K usec 1 391KB/s 3212K usec

IO scheduler controller + CFQ
-----------------------------------
[Multiple Sequential Reader] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 6781KB/s 6781KB/s 6622KB/s 109K usec 1 6691KB/s 115K usec
2 3758KB/s 1876KB/s 5502KB/s 693K usec 1 6373KB/s 419K usec
4 2100KB/s 671KB/s 5751KB/s 987K usec 1 6330KB/s 569K usec
8 1023KB/s 355KB/s 6969KB/s 1569K usec 1 6086KB/s 120K usec
16 520KB/s 130KB/s 7094KB/s 3140K usec 1 5984KB/s 119K usec
32 245KB/s 86KB/s 6621KB/s 6571K usec 1 5850KB/s 113K usec

Notes:
- BW and latencies of sequential reader in group 2 are fairly stable as
number of readers increase in first group.

io-throttle + CFQ
-----------------------------------
BW limit group1=30 MB/s BW limit group2=30 MB/s
[Multiple Sequential Reader] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 6343KB/s 6343KB/s 6195KB/s 116K usec 1 6993KB/s 109K usec
2 4583KB/s 3046KB/s 7451KB/s 583K usec 1 4516KB/s 433K usec
4 2945KB/s 1324KB/s 9552KB/s 602K usec 1 3001KB/s 583K usec
8 1804KB/s 473KB/s 12257KB/s 861K usec 1 1386KB/s 815K usec
16 942KB/s 265KB/s 13560KB/s 1659K usec 1 718KB/s 1658K usec
32 462KB/s 143KB/s 13757KB/s 3482K usec 1 409KB/s 3480K usec

Notes:
- BW decreases and latencies increase in group2 as the number of readers
increases in the first group. This should be due to the fact that no
throttling happens, as none of the groups is hitting the limit of 30MB/s.
To me this is the tricky part: how is a service provider supposed to set
the limits of the groups? If groups are not hitting their max limits, they
will still impact the BW and latencies of the other group.

BW limit group1=10 MB/s BW limit group2=10 MB/s
[Multiple Sequential Reader] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 4128KB/s 4128KB/s 4032KB/s 215K usec 1 4076KB/s 170K usec
2 2880KB/s 1886KB/s 4655KB/s 291K usec 1 2891KB/s 212K usec
4 1912KB/s 888KB/s 5872KB/s 417K usec 1 1881KB/s 411K usec
8 1032KB/s 432KB/s 7312KB/s 841K usec 1 853KB/s 816K usec
16 540KB/s 259KB/s 7844KB/s 1728K usec 1 503KB/s 1609K usec
32 291KB/s 111KB/s 7920KB/s 3417K usec 1 249KB/s 3205K usec

Notes:
- Same test with 10MB/s as group limit. This is again a surprising result.
Max BW in first group is being throttled but still throughput is
dropping significantly in second group and latencies are on the rise.

- The limit of the first group is 10MB/s but it is achieving a max BW of
only around 8MB/s. What happened to the remaining 2MB/s?

- Andrea, again, please do run this test. The throughput drop in the second
group stumps me and makes me wonder whether I am doing something wrong.

BW limit group1=5 MB/s BW limit group2=5 MB/s
[Multiple Sequential Reader] [Sequential Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 2434KB/s 2434KB/s 2377KB/s 110K usec 1 2415KB/s 120K usec
2 1639KB/s 1186KB/s 2759KB/s 222K usec 1 1709KB/s 220K usec
4 1114KB/s 648KB/s 3314KB/s 420K usec 1 1163KB/s 414K usec
8 567KB/s 366KB/s 4060KB/s 901K usec 1 527KB/s 816K usec
16 329KB/s 179KB/s 4324KB/s 1613K usec 1 311KB/s 1613K usec
32 178KB/s 70KB/s 4320KB/s 3235K usec 1 163KB/s 3209K usec

- Setting the limit to 5MB/s per group also does not seem to be helping
the second group.

Multiple Random Writer vs Random Reader
===============================================
This time running multiple random writers in the first group and seeing
the impact on throughput and latency of a random reader in a different
group.

Vanilla CFQ
-----------------------------------
[Multiple Random Writer] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 64018KB/s 64018KB/s 62517KB/s 353K usec 1 190KB/s 96 msec
2 35298KB/s 35257KB/s 68899KB/s 208K usec 1 76KB/s 2416 msec
4 16387KB/s 14662KB/s 60630KB/s 3746K usec 1 106KB/s 2308K usec
8 5106KB/s 3492KB/s 33335KB/s 2995K usec 1 193KB/s 2292K usec
16 3676KB/s 3002KB/s 51807KB/s 2283K usec 1 72KB/s 2298K usec
32 2169KB/s 1480KB/s 56882KB/s 1990K usec 1 35KB/s 1093 msec

IO scheduler controller + CFQ
-----------------------------------
[Multiple Random Writer] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 20369KB/s 20369KB/s 19892KB/s 877K usec 1 255KB/s 137K usec
2 14347KB/s 14288KB/s 27964KB/s 1010K usec 1 228KB/s 117K usec
4 6996KB/s 6701KB/s 26775KB/s 1362K usec 1 221KB/s 180K usec
8 2849KB/s 2770KB/s 22007KB/s 2660K usec 1 250KB/s 485K usec
16 1463KB/s 1365KB/s 22384KB/s 2606K usec 1 254KB/s 115K usec
32 799KB/s 681KB/s 22404KB/s 2879K usec 1 266KB/s 107K usec

Notes
- BW and latencies of random reader in second group are fairly stable.

io-throttle + CFQ
-----------------------------------
BW limit group1=30 MB/s BW limit group2=30 MB/s
[Multiple Random Writer] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 21920KB/s 21920KB/s 21406KB/s 1017K usec 1 353KB/s 432K usec
2 14291KB/s 9626KB/s 23357KB/s 1832K usec 1 362KB/s 177K usec
4 7130KB/s 5135KB/s 24736KB/s 1336K usec 1 348KB/s 425K usec
8 3165KB/s 2949KB/s 23792KB/s 2133K usec 1 336KB/s 146K usec
16 1653KB/s 1406KB/s 23694KB/s 2198K usec 1 337KB/s 115K usec
32 793KB/s 717KB/s 23198KB/s 2195K usec 1 330KB/s 192K usec

BW limit group1=10 MB/s BW limit group2=10 MB/s
[Multiple Random Writer] [Random Reader]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 7903KB/s 7903KB/s 7718KB/s 1037K usec 1 474KB/s 103K usec
2 4496KB/s 4428KB/s 8715KB/s 1091K usec 1 450KB/s 553K usec
4 2153KB/s 1827KB/s 7914KB/s 2042K usec 1 458KB/s 108K usec
8 1129KB/s 1087KB/s 8688KB/s 1280K usec 1 432KB/s 98215 usec
16 606KB/s 527KB/s 8668KB/s 2303K usec 1 426KB/s 90609 usec
32 312KB/s 259KB/s 8599KB/s 2557K usec 1 441KB/s 95283 usec

Notes:
- IO throttling seems to be working really well here. Random writers are
contained in the first group and this gives stable BW and latencies
to random reader in second group.

Multiple Buffered Writer vs Buffered Writer
===========================================
This time run multiple buffered writers in group1 and a single buffered
writer in the other group, and see if we can provide fairness and
isolation.

Vanilla CFQ
------------
[Multiple Buffered Writer] [Buffered Writer]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 68997KB/s 68997KB/s 67380KB/s 645K usec 1 67122KB/s 567K usec
2 47509KB/s 46218KB/s 91510KB/s 865K usec 1 45118KB/s 865K usec
4 28002KB/s 26906KB/s 105MB/s 1649K usec 1 26879KB/s 1643K usec
8 15985KB/s 14849KB/s 117MB/s 943K usec 1 15653KB/s 766K usec
16 11567KB/s 6881KB/s 128MB/s 1174K usec 1 7333KB/s 947K usec
32 5877KB/s 3649KB/s 130MB/s 1205K usec 1 5142KB/s 988K usec

IO scheduler controller + CFQ
-----------------------------------
[Multiple Buffered Writer] [Buffered Writer]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 68580KB/s 68580KB/s 66972KB/s 2901K usec 1 67194KB/s 2901K usec
2 47419KB/s 45700KB/s 90936KB/s 3149K usec 1 44628KB/s 2377K usec
4 27825KB/s 27274KB/s 105MB/s 1177K usec 1 27584KB/s 1177K usec
8 15382KB/s 14288KB/s 114MB/s 1539K usec 1 14794KB/s 783K usec
16 9161KB/s 7592KB/s 124MB/s 3177K usec 1 7713KB/s 886K usec
32 4928KB/s 3961KB/s 126MB/s 1152K usec 1 6465KB/s 4510K usec

Notes:
- It does not work. The buffered writer in the second group is being
overwhelmed by the writers in group1.

- This is currently a limitation of the IO scheduler based controller, as
the page cache at the higher layer evens out the traffic and does not
send more traffic from the higher weight group.

- This is something that needs more work at higher layers, like per-cgroup
dirty limits in the memory controller and a method to write out buffered
pages belonging to a particular memory cgroup. This is still being
brainstormed.

io-throttle + CFQ
-----------------------------------
BW limit group1=30 MB/s BW limit group2=30 MB/s
[Multiple Buffered Writer] [Buffered Writer]
nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
1 33863KB/s 33863KB/s 33070KB/s 3046K usec 1 25165KB/s 13248K usec
2 13457KB/s 12906KB/s 25745KB/s 9286K usec 1 29958KB/s 3736K usec
4 7414KB/s 6543KB/s 27145KB/s 10557K usec 1 30968KB/s 8356K usec
8 3562KB/s 2640KB/s 24430KB/s 12012K usec 1 30801KB/s 7037K usec
16 3962KB/s 881KB/s 26632KB/s 12650K usec 1 31150KB/s 7173K usec
32 3275KB/s 406KB/s 27295KB/s 14609K usec 1 26328KB/s 8069K usec

Notes:
- This seems to work well here. io-throttle is throttling the writers
before they write too much data into the page cache. One side effect of
this seems to be that a process will now not be allowed to write at
memory speed into the page cache and will be limited to the disk IO speed
limits set for the cgroup.

Andrea is thinking of removing the throttling in balance_dirty_pages() to
allow writing at disk speed till we hit the dirty limits. But removing it
leads to a different issue where too many dirty pages from a single cgroup
can be present in the page cache, and if that cgroup is a slow-moving one,
then its pages are flushed to disk at a slower speed, delaying other
higher-rate cgroups. (All discussed in private mails with Andrea.)


ioprio class and iopriority with-in cgroups issues with IO-throttle
===================================================================

Currently the throttling logic is designed in such a way that it makes the
throttling uniform for every process in the group. So we will lose the
differentiation between different classes of processes, or the
differentiation between different priorities of processes, within a group.

I have run the tests of these in the past and reported it here in the
past.

https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html

Thanks
Vivek

2009-10-10 22:28:13

by Andrea Righi

[permalink] [raw]
Subject: Re: Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)

On Sat, Oct 10, 2009 at 03:53:16PM -0400, Vivek Goyal wrote:
> On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:
>
> [..]
> > > Environment
> > > ==========
> > > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
> >
> > That's a bit of a toy.
> >
> > Do we have testing results for more enterprisey hardware? Big storage
> > arrays? SSD? Infiniband? iscsi? nfs? (lol, gotcha)
> >
> >
>
> Hi All,

Hi Vivek,

thanks for posting this detailed report first of all. A few comments
below.

>
> Couple of days back I posted some performance number of "IO scheduler
> controller" and "dm-ioband" here.
>
> http://lkml.org/lkml/2009/10/8/9
>
> Now I have run similar tests with Andrea Righi's IO throttling approach
> of max bandwidth control. This is the exercise to understand pros/cons
> of each approach and see how can we take things forward.
>
> Environment
> ===========
> Software
> --------
> - 2.6.31 kenrel
> - IO scheduler controller V10 on top of 2.6.31
> - IO throttling patch on top of 2.6.31. Patch is available here.
>
> http://www.develer.com/~arighi/linux/patches/io-throttle/old/cgroup-io-throttle-2.6.31.patch
>
> Hardware
> --------
> A storage array of 5 striped disks of 500GB each.
>
> Used fio jobs for 30 seconds in various configurations. Most of the IO is
> direct IO to eliminate the effects of caches.
>
> I have run three sets for each test. Blindly reporting results of set2
> from each test, otherwise it is too much of data to report.
>
> Had lun of 2500GB capacity. Used 200G partition with ext3 file system for
> my testing. For IO scheduler controller testing, created two cgroups of
> weight 100 each so that effectively disk can be divided half/half between
> two groups.
>
> For IO throttling patches also created two cgroups. Now tricky part is
> that it is a max bw controller and not a proportional weight controller.
> So dividing the disk capacity half/half between two cgroups is tricky. The
> reason being I just don't know what's the BW capacity of underlying
> storage. Throughput varies so much with type of workload. For example, on
> my arrary, this is how throughput looks like with different workloads.
>
> 8 sequential buffered readers 115 MB/s
> 8 direct sequential readers bs=64K 64 MB/s
> 8 direct sequential readers bs=4K 14 MB/s
>
> 8 buffered random readers bs=64K 3 MB/s
> 8 direct random readers bs=64K 15 MB/s
> 8 direct random readers bs=4K 1.5 MB/s
>
> So throughput seems to be varying from 1.5 MB/s to 115 MB/s depending
> on workload. What should be the BW limits per cgroup to divide disk BW
> in half/half between two groups?
>
> So I took a conservative estimate and divide max bandwidth divide by 2,
> and thought of array capacity as 60MB/s and assign each cgroup 30MB/s. In
> some cases I have assigened even 10MB/s or 5MB/s to each cgropu to see the
> effects of throttling. I am using "Leaky bucket" policy for all the tests.
>
> As theme of two controllers is different, at some places it might sound
> like apples vs oranges comparison. But still it does help...
>
> Multiple Random Reader vs Sequential Reader
> ===============================================
> Generally random readers bring the throughput down of others in the
> system. Ran a test to see the impact of increasing number of random readers on
> single sequential reader in different groups.
>
> Vanilla CFQ
> -----------------------------------
> [Multiple Random Reader] [Sequential Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 23KB/s 23KB/s 22KB/s 691 msec 1 13519KB/s 468K usec
> 2 152KB/s 152KB/s 297KB/s 244K usec 1 12380KB/s 31675 usec
> 4 174KB/s 156KB/s 638KB/s 249K usec 1 10860KB/s 36715 usec
> 8 49KB/s 11KB/s 310KB/s 1856 msec 1 1292KB/s 990K usec
> 16 63KB/s 48KB/s 877KB/s 762K usec 1 3905KB/s 506K usec
> 32 35KB/s 27KB/s 951KB/s 2655 msec 1 1109KB/s 1910K usec
>
> IO scheduler controller + CFQ
> -----------------------------------
> [Multiple Random Reader] [Sequential Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 228KB/s 228KB/s 223KB/s 132K usec 1 5551KB/s 129K usec
> 2 97KB/s 97KB/s 190KB/s 154K usec 1 5718KB/s 122K usec
> 4 115KB/s 110KB/s 445KB/s 208K usec 1 5909KB/s 116K usec
> 8 23KB/s 12KB/s 158KB/s 2820 msec 1 5445KB/s 168K usec
> 16 11KB/s 3KB/s 145KB/s 5963 msec 1 5418KB/s 164K usec
> 32 6KB/s 2KB/s 139KB/s 12762 msec 1 5398KB/s 175K usec
>
> Notes:
> - Sequential reader in group2 seems to be well isolated from random readers
> in group1. Throughput and latency of sequential reader are stable and
> don't drop as number of random readers inrease in system.
>
> io-throttle + CFQ
> ------------------
> BW limit group1=10 MB/s BW limit group2=10 MB/s
> [Multiple Random Reader] [Sequential Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 37KB/s 37KB/s 36KB/s 218K usec 1 8006KB/s 20529 usec
> 2 185KB/s 183KB/s 360KB/s 228K usec 1 7475KB/s 33665 usec
> 4 188KB/s 171KB/s 699KB/s 262K usec 1 6800KB/s 46224 usec
> 8 84KB/s 51KB/s 573KB/s 1800K usec 1 2835KB/s 885K usec
> 16 21KB/s 9KB/s 294KB/s 3590 msec 1 437KB/s 1855K usec
> 32 34KB/s 27KB/s 980KB/s 2861K usec 1 1145KB/s 1952K usec
>
> Notes:
> - I have setup limits of 10MB/s in both the cgroups. Now random reader
> group will never achieve that kind of speed, so it will not be throttled
> and then it goes onto impact the throughput and latency of other groups
> in the system.
>
> - Now the key question is how conservative one should in be setting up
> max BW limit. On this box if a customer has bought 10MB/s cgroup and if
> he is running some random readers it will kill throughput of other
> groups in the system and their latencies will shoot up. No isolation in
> this case.
>
> - So in general, max BW provides isolation from high speed groups but it
> does not provide isolaton from random reader groups which are moving
> slow.

Remember that in addition to blockio.bandwidth-max the io-throttle
controller also provides blockio.iops-max to enforce hard limits on the
number of IO operations per second. Probably for this testcase both
cgroups should be limited in terms of BW and iops to achieve better
isolation.
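
For example, something along these lines (the value format is illustrative
only, assuming the same <device>:<limit>:<strategy>:<bucket> layout as
bandwidth-max; the device path and the iops number are just placeholders):

# cap group1 both at 10 MB/s and at ~150 IO operations per second
echo /dev/sdX:$((10*1024*1024)):0:0 > /cgroup/blockio/group1/blockio.bandwidth-max
echo /dev/sdX:150:0:0               > /cgroup/blockio/group1/blockio.iops-max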

>
> Multiple Sequential Reader vs Random Reader
> ===============================================
> Now running a reverse test where in one group I am running increasing
> number of sequential readers and in other group I am running one random
> reader and see the impact of sequential readers on random reader.
>
> Vanilla CFQ
> -----------------------------------
> [Multiple Sequential Reader] [Random Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 13978KB/s 13978KB/s 13650KB/s 27614 usec 1 22KB/s 227 msec
> 2 6225KB/s 6166KB/s 12101KB/s 568K usec 1 10KB/s 457 msec
> 4 4052KB/s 2462KB/s 13107KB/s 322K usec 1 6KB/s 841 msec
> 8 1899KB/s 557KB/s 12960KB/s 829K usec 1 13KB/s 1628 msec
> 16 1007KB/s 279KB/s 13833KB/s 1629K usec 1 10KB/s 3236 msec
> 32 506KB/s 98KB/s 13704KB/s 3389K usec 1 6KB/s 3238 msec
>
> IO scheduler controller + CFQ
> -----------------------------------
> [Multiple Sequential Reader] [Random Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 5721KB/s 5721KB/s 5587KB/s 126K usec 1 223KB/s 126K usec
> 2 3216KB/s 1442KB/s 4549KB/s 349K usec 1 224KB/s 176K usec
> 4 1895KB/s 640KB/s 5121KB/s 775K usec 1 222KB/s 189K usec
> 8 957KB/s 285KB/s 6368KB/s 1680K usec 1 223KB/s 142K usec
> 16 458KB/s 132KB/s 6455KB/s 3343K usec 1 219KB/s 165K usec
> 32 248KB/s 55KB/s 6001KB/s 6957K usec 1 220KB/s 504K usec
>
> Notes:
> - Random reader is well isolated from increasing number of sequential
> readers in other group. BW and latencies are stable.
>
> io-throttle + CFQ
> -----------------------------------
> BW limit group1=10 MB/s BW limit group2=10 MB/s
> [Multiple Sequential Reader] [Random Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 8200KB/s 8200KB/s 8007KB/s 20275 usec 1 37KB/s 217K usec
> 2 3926KB/s 3919KB/s 7661KB/s 122K usec 1 16KB/s 441 msec
> 4 2271KB/s 1497KB/s 7672KB/s 611K usec 1 9KB/s 927 msec
> 8 1113KB/s 513KB/s 7507KB/s 849K usec 1 21KB/s 1020 msec
> 16 661KB/s 236KB/s 7959KB/s 1679K usec 1 13KB/s 2926 msec
> 32 292KB/s 109KB/s 7864KB/s 3446K usec 1 8KB/s 3439 msec
>
> BW limit group1=5 MB/s BW limit group2=5 MB/s
> [Multiple Sequential Reader] [Random Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 4686KB/s 4686KB/s 4576KB/s 21095 usec 1 57KB/s 219K usec
> 2 2298KB/s 2179KB/s 4372KB/s 132K usec 1 37KB/s 431K usec
> 4 1245KB/s 1019KB/s 4449KB/s 324K usec 1 26KB/s 835 msec
> 8 584KB/s 403KB/s 4109KB/s 833K usec 1 30KB/s 1625K usec
> 16 346KB/s 252KB/s 4605KB/s 1641K usec 1 129KB/s 3236K usec
> 32 175KB/s 56KB/s 4269KB/s 3236K usec 1 8KB/s 3235 msec
>
> Notes:
>
> - Above result is surprising to me. I have run it twice. In first run, I
> setup per cgroup limit as 10MB/s and in second run I set it up 5MB/s. In
> both the cases as number of sequential readers increase in other groups,
> random reader's throughput decreases and latencies increase. This is
> happening despite the fact that sequential readers are being throttled
> to make sure it does not impact workload in other group. Wondering why
> random readers are not seeing consistent throughput and latencies.

Maybe because CFQ is still trying to be fair among processes instead of
cgroups. Remember that io-throttle doesn't touch the CFQ code (for this
I'm definitely convinced that CFQ should be changed to think also in
terms of cgroups, and io-throttle alone is not enough).

So, even if group1 is being throttled, it is still able to submit some
requests that get a higher priority with respect to the requests
submitted by the single random reader task.

It could be interesting to test another IO scheduler (deadline, AS or
even noop) to check if this is the actual problem.

>
> - Andrea, can you please also run similar tests to see if you see same
> results or not. This is to rule out any testing methodology errors or
> scripting bugs. :-). I also have collected the snapshot of some cgroup
> files like bandwidth-max, throttlecnt, and stats. Let me know if you want
> those to see what is happenig here.

Sure, I'll do some tests ASAP. Another interesting test would be to set
a blockio.iops-max limit also for the sequential readers' cgroup, to be
sure we're not touching some iops physical disk limit.

Could you post all the options you used with fio, so I can repeat some
tests as similar as possible to yours?

>
> Multiple Sequential Reader vs Sequential Reader
> ===============================================
> - This time running random readers are out of the picture and trying to
> see the effect of increasing number of sequential readers on another
> sequential reader running in a different group.
>
> Vanilla CFQ
> -----------------------------------
> [Multiple Sequential Reader] [Sequential Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 6325KB/s 6325KB/s 6176KB/s 114K usec 1 6902KB/s 120K usec
> 2 4588KB/s 3102KB/s 7510KB/s 571K usec 1 4564KB/s 680K usec
> 4 3242KB/s 1158KB/s 9469KB/s 495K usec 1 3198KB/s 410K usec
> 8 1775KB/s 459KB/s 12011KB/s 1178K usec 1 1366KB/s 818K usec
> 16 943KB/s 296KB/s 13285KB/s 1923K usec 1 728KB/s 1816K usec
> 32 511KB/s 148KB/s 13555KB/s 3286K usec 1 391KB/s 3212K usec
>
> IO scheduler controller + CFQ
> -----------------------------------
> [Multiple Sequential Reader] [Sequential Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 6781KB/s 6781KB/s 6622KB/s 109K usec 1 6691KB/s 115K usec
> 2 3758KB/s 1876KB/s 5502KB/s 693K usec 1 6373KB/s 419K usec
> 4 2100KB/s 671KB/s 5751KB/s 987K usec 1 6330KB/s 569K usec
> 8 1023KB/s 355KB/s 6969KB/s 1569K usec 1 6086KB/s 120K usec
> 16 520KB/s 130KB/s 7094KB/s 3140K usec 1 5984KB/s 119K usec
> 32 245KB/s 86KB/s 6621KB/s 6571K usec 1 5850KB/s 113K usec
>
> Notes:
> - BW and latencies of sequential reader in group 2 are fairly stable as
> number of readers increase in first group.
>
> io-throttle + CFQ
> -----------------------------------
> BW limit group1=30 MB/s BW limit group2=30 MB/s
> [Multiple Sequential Reader] [Sequential Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 6343KB/s 6343KB/s 6195KB/s 116K usec 1 6993KB/s 109K usec
> 2 4583KB/s 3046KB/s 7451KB/s 583K usec 1 4516KB/s 433K usec
> 4 2945KB/s 1324KB/s 9552KB/s 602K usec 1 3001KB/s 583K usec
> 8 1804KB/s 473KB/s 12257KB/s 861K usec 1 1386KB/s 815K usec
> 16 942KB/s 265KB/s 13560KB/s 1659K usec 1 718KB/s 1658K usec
> 32 462KB/s 143KB/s 13757KB/s 3482K usec 1 409KB/s 3480K usec
>
> Notes:
> - BW decreases and latencies increase in group2 as number of readers
> increase in first group. This should be due to fact that no throttling
> will happen as none of the groups is hitting the limit of 30MB/s. To
> me this is the tricky part. How a service provider is supposed to
> set the limit of groups. If groups are not hitting max limits, it will
> still impact the BW and latencies in other group.

Are you using 4k block size here? Because with blocks that small you could
hit some physical iops limit. Also for this case it could be interesting
to see what happens when setting both BW and iops hard limits.

>
> BW limit group1=10 MB/s BW limit group2=10 MB/s
> [Multiple Sequential Reader] [Sequential Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 4128KB/s 4128KB/s 4032KB/s 215K usec 1 4076KB/s 170K usec
> 2 2880KB/s 1886KB/s 4655KB/s 291K usec 1 2891KB/s 212K usec
> 4 1912KB/s 888KB/s 5872KB/s 417K usec 1 1881KB/s 411K usec
> 8 1032KB/s 432KB/s 7312KB/s 841K usec 1 853KB/s 816K usec
> 16 540KB/s 259KB/s 7844KB/s 1728K usec 1 503KB/s 1609K usec
> 32 291KB/s 111KB/s 7920KB/s 3417K usec 1 249KB/s 3205K usec
>
> Notes:
> - Same test with 10MB/s as group limit. This is again a surprising result.
> Max BW in first group is being throttled but still throughput is
> dropping significantly in second group and latencies are on the rise.

Same consideration about CFQ and/or iops limit. Could you post all the
fio options you've used also for this test (or better, for all tests)?

>
> - Limit of first group is 10MB/s but it is achieving max BW of around
> 8MB/s only. What happened to rest of the 2MB/s?

Ditto.

>
> - Andrea, again, please do run this test. The throughput drop in second
> group stumps me and forces me to think if I am doing something wrong.
>
> BW limit group1=5 MB/s BW limit group2=5 MB/s
> [Multiple Sequential Reader] [Sequential Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 2434KB/s 2434KB/s 2377KB/s 110K usec 1 2415KB/s 120K usec
> 2 1639KB/s 1186KB/s 2759KB/s 222K usec 1 1709KB/s 220K usec
> 4 1114KB/s 648KB/s 3314KB/s 420K usec 1 1163KB/s 414K usec
> 8 567KB/s 366KB/s 4060KB/s 901K usec 1 527KB/s 816K usec
> 16 329KB/s 179KB/s 4324KB/s 1613K usec 1 311KB/s 1613K usec
> 32 178KB/s 70KB/s 4320KB/s 3235K usec 1 163KB/s 3209K usec
>
> - Setting the limit to 5MB/s per group also does not seem to be helping
> the second group.
>
> Multiple Random Writer vs Random Reader
> ===============================================
> This time running multiple random writers in first group and see the
> impact on throughput and latency of random reader in different group.
>
> Vanilla CFQ
> -----------------------------------
> [Multiple Random Writer] [Random Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 64018KB/s 64018KB/s 62517KB/s 353K usec 1 190KB/s 96 msec
> 2 35298KB/s 35257KB/s 68899KB/s 208K usec 1 76KB/s 2416 msec
> 4 16387KB/s 14662KB/s 60630KB/s 3746K usec 1 106KB/s 2308K usec
> 8 5106KB/s 3492KB/s 33335KB/s 2995K usec 1 193KB/s 2292K usec
> 16 3676KB/s 3002KB/s 51807KB/s 2283K usec 1 72KB/s 2298K usec
> 32 2169KB/s 1480KB/s 56882KB/s 1990K usec 1 35KB/s 1093 msec
>
> IO scheduler controller + CFQ
> -----------------------------------
> [Multiple Random Writer] [Random Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 20369KB/s 20369KB/s 19892KB/s 877K usec 1 255KB/s 137K usec
> 2 14347KB/s 14288KB/s 27964KB/s 1010K usec 1 228KB/s 117K usec
> 4 6996KB/s 6701KB/s 26775KB/s 1362K usec 1 221KB/s 180K usec
> 8 2849KB/s 2770KB/s 22007KB/s 2660K usec 1 250KB/s 485K usec
> 16 1463KB/s 1365KB/s 22384KB/s 2606K usec 1 254KB/s 115K usec
> 32 799KB/s 681KB/s 22404KB/s 2879K usec 1 266KB/s 107K usec
>
> Notes
> - BW and latencies of random reader in second group are fairly stable.
>
> io-throttle + CFQ
> -----------------------------------
> BW limit group1=30 MB/s BW limit group2=30 MB/s
> [Multiple Random Writer] [Random Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 21920KB/s 21920KB/s 21406KB/s 1017K usec 1 353KB/s 432K usec
> 2 14291KB/s 9626KB/s 23357KB/s 1832K usec 1 362KB/s 177K usec
> 4 7130KB/s 5135KB/s 24736KB/s 1336K usec 1 348KB/s 425K usec
> 8 3165KB/s 2949KB/s 23792KB/s 2133K usec 1 336KB/s 146K usec
> 16 1653KB/s 1406KB/s 23694KB/s 2198K usec 1 337KB/s 115K usec
> 32 793KB/s 717KB/s 23198KB/s 2195K usec 1 330KB/s 192K usec
>
> BW limit group1=10 MB/s BW limit group2=10 MB/s
> [Multiple Random Writer] [Random Reader]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 7903KB/s 7903KB/s 7718KB/s 1037K usec 1 474KB/s 103K usec
> 2 4496KB/s 4428KB/s 8715KB/s 1091K usec 1 450KB/s 553K usec
> 4 2153KB/s 1827KB/s 7914KB/s 2042K usec 1 458KB/s 108K usec
> 8 1129KB/s 1087KB/s 8688KB/s 1280K usec 1 432KB/s 98215 usec
> 16 606KB/s 527KB/s 8668KB/s 2303K usec 1 426KB/s 90609 usec
> 32 312KB/s 259KB/s 8599KB/s 2557K usec 1 441KB/s 95283 usec
>
> Notes:
> - IO throttling seems to be working really well here. Random writers are
> contained in the first group and this gives stable BW and latencies
> to random reader in second group.
>
> Multiple Buffered Writer vs Buffered Writer
> ===========================================
> This time run multiple buffered writers in group1 and see run a single
> buffered writer in other group and see if we can provide fairness and
> isolation.
>
> Vanilla CFQ
> ------------
> [Multiple Buffered Writer] [Buffered Writer]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 68997KB/s 68997KB/s 67380KB/s 645K usec 1 67122KB/s 567K usec
> 2 47509KB/s 46218KB/s 91510KB/s 865K usec 1 45118KB/s 865K usec
> 4 28002KB/s 26906KB/s 105MB/s 1649K usec 1 26879KB/s 1643K usec
> 8 15985KB/s 14849KB/s 117MB/s 943K usec 1 15653KB/s 766K usec
> 16 11567KB/s 6881KB/s 128MB/s 1174K usec 1 7333KB/s 947K usec
> 32 5877KB/s 3649KB/s 130MB/s 1205K usec 1 5142KB/s 988K usec
>
> IO scheduler controller + CFQ
> -----------------------------------
> [Multiple Buffered Writer] [Buffered Writer]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 68580KB/s 68580KB/s 66972KB/s 2901K usec 1 67194KB/s 2901K usec
> 2 47419KB/s 45700KB/s 90936KB/s 3149K usec 1 44628KB/s 2377K usec
> 4 27825KB/s 27274KB/s 105MB/s 1177K usec 1 27584KB/s 1177K usec
> 8 15382KB/s 14288KB/s 114MB/s 1539K usec 1 14794KB/s 783K usec
> 16 9161KB/s 7592KB/s 124MB/s 3177K usec 1 7713KB/s 886K usec
> 32 4928KB/s 3961KB/s 126MB/s 1152K usec 1 6465KB/s 4510K usec
>
> Notes:
> - It does not work. Buffered writer in second group are being overwhelmed
> by writers in group1.
>
> - This is a limitation of IO scheduler based controller currently as page
> cache at higher layer evens out the traffic and does not throw more
> traffic from higher weight group.
>
> - This is something needs more work at higher layers like dirty limts
> per cgroup in memory contoller and the method to writeout buffered
> pages belonging to a particular memory cgroup. This is still being
> brainstormed.
>
> io-throttle + CFQ
> -----------------------------------
> BW limit group1=30 MB/s BW limit group2=30 MB/s
> [Multiple Buffered Writer] [Buffered Writer]
> nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> 1 33863KB/s 33863KB/s 33070KB/s 3046K usec 1 25165KB/s 13248K usec
> 2 13457KB/s 12906KB/s 25745KB/s 9286K usec 1 29958KB/s 3736K usec
> 4 7414KB/s 6543KB/s 27145KB/s 10557K usec 1 30968KB/s 8356K usec
> 8 3562KB/s 2640KB/s 24430KB/s 12012K usec 1 30801KB/s 7037K usec
> 16 3962KB/s 881KB/s 26632KB/s 12650K usec 1 31150KB/s 7173K usec
> 32 3275KB/s 406KB/s 27295KB/s 14609K usec 1 26328KB/s 8069K usec
>
> Notes:
> - This seems to work well here. io-throttle is throttling the writers
> before they write too much of data in page cache. One side effect of
> this seems to be that now a process will not be allowed to write at
> memory speed in page cahce and will be limited to disk IO speed limits
> set for the cgroup.
>
> Andrea is thinking of removing throttling in balance_dirty_pages() to allow
> writting at disk speed till we hit dirty_limits. But removing it leads
> to a different issue where too many dirty pages from a single group can
> be present from a cgroup in page cache and if that cgroup is slow moving
> one, then pages are flushed to disk at slower speed delyaing other
> higher rate cgroups. (all discussed in private mails with Andrea).

I confirm this. :) But IMHO before removing the throttling in
balance_dirty_pages() we really need the per-cgroup dirty limit / dirty
page cache quota.

>
>
> ioprio class and iopriority with-in cgroups issues with IO-throttle
> ===================================================================
>
> Currently throttling logic is designed in such a way that it makes the
> throttling uniform for every process in the group. So we will loose the
> differentiation between different class of processes or differnetitation
> between different priority of processes with-in group.
>
> I have run the tests of these in the past and reported it here in the
> past.
>
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html
>
> Thanks
> Vivek

--
Andrea Righi - Develer s.r.l
http://www.develer.com

2009-10-11 12:34:39

by Vivek Goyal

[permalink] [raw]
Subject: Re: Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)

On Sun, Oct 11, 2009 at 12:27:30AM +0200, Andrea Righi wrote:

[..]
> >
> > - Andrea, can you please also run similar tests to see if you see same
> > results or not. This is to rule out any testing methodology errors or
> > scripting bugs. :-). I also have collected the snapshot of some cgroup
> > files like bandwidth-max, throttlecnt, and stats. Let me know if you want
> > those to see what is happenig here.
>
> Sure, I'll do some tests ASAP. Another interesting test would be to set
> a blockio.iops-max limit also for the sequential readers' cgroup, to be
> sure we're not touching some iops physical disk limit.
>
> Could you post all the options you used with fio, so I can repeat some
> tests as similar as possible to yours?
>

I will respond to rest of the points later after some testing with iops-max
rules. In the mean time here are my fio options so that you can try to
replicate the tests.

I am simply copying and pasting from my script. I have written my own
program "semwait" so that two different instances of fio can synchronize
on an external semaphore. Generally all the jobs go in a single fio file,
but here we need to put the two fio instances in two different cgroups. It
is important that the two fio jobs are synchronized and start at the same
time after laying out files. (This becomes primarily useful in write
testing; reads are generally fine once the files have been laid out.)

Sequential readers
------------------
fio_args="--rw=read --bs=4K --size=2G --runtime=30 --numjobs=$nr_jobs --direct=1"
fio $fio_args --name=$jobname --directory=/mnt/$blockdev/fio --exec_prerun="'/usr/local/bin/semwait fiocgroup'" >> $outputdir/$outputfile &

Random Reader
-------------
fio_args="--rw=randread --bs=4K --size=1G --runtime=30 --direct=1 --numjobs=$nr_jobs"
fio $fio_args --name=$jobname --directory=/mnt/$blockdev/fio --exec_prerun="'/usr/local/bin/semwait fiocgroup'" >> $outputdir/$outputfile &

Random Writer
-------------
fio_args="--rw=randwrite --bs=64K --size=2G --runtime=30 --numjobs=$nr_jobs1 --ioengine=libaio --iodepth=4 --direct=1"

fio $fio_args --name=$jobname --directory=/mnt/$blockdev/fio --exec_prerun="'/usr/local/bin/semwait fiocgroup'" >> $outputdir/$outputfile &
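
Putting it together, each test did roughly the following. The cgroup paths
and the final semaphore-post step are from my own scripts and are shown
only as a sketch of the synchronization ("sempost" here stands in for
whatever releases the semaphore that semwait blocks on):

# move the shell into group1, then fork the first fio instance there
echo $$ > /cgroup/io/group1/tasks
fio $fio_args_group1 --name=job1 --directory=/mnt/$blockdev/fio \
    --exec_prerun="'/usr/local/bin/semwait fiocgroup'" >> $outputdir/job1.out &

# move the shell into group2 and fork the second fio instance there
echo $$ > /cgroup/io/group2/tasks
fio $fio_args_group2 --name=job2 --directory=/mnt/$blockdev/fio \
    --exec_prerun="'/usr/local/bin/semwait fiocgroup'" >> $outputdir/job2.out &

# once both jobs have laid out their files and are blocked in semwait,
# release the semaphore so they start issuing IO at the same time
/usr/local/bin/sempost fiocgroup
wait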

Thanks
Vivek

2009-10-12 21:12:50

by Vivek Goyal

[permalink] [raw]
Subject: Re: Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)

On Sun, Oct 11, 2009 at 12:27:30AM +0200, Andrea Righi wrote:

[..]
> > Multiple Random Reader vs Sequential Reader
> > ===============================================
> > Generally random readers bring the throughput down of others in the
> > system. Ran a test to see the impact of increasing number of random readers on
> > single sequential reader in different groups.
> >
> > Vanilla CFQ
> > -----------------------------------
> > [Multiple Random Reader] [Sequential Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 23KB/s 23KB/s 22KB/s 691 msec 1 13519KB/s 468K usec
> > 2 152KB/s 152KB/s 297KB/s 244K usec 1 12380KB/s 31675 usec
> > 4 174KB/s 156KB/s 638KB/s 249K usec 1 10860KB/s 36715 usec
> > 8 49KB/s 11KB/s 310KB/s 1856 msec 1 1292KB/s 990K usec
> > 16 63KB/s 48KB/s 877KB/s 762K usec 1 3905KB/s 506K usec
> > 32 35KB/s 27KB/s 951KB/s 2655 msec 1 1109KB/s 1910K usec
> >
> > IO scheduler controller + CFQ
> > -----------------------------------
> > [Multiple Random Reader] [Sequential Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 228KB/s 228KB/s 223KB/s 132K usec 1 5551KB/s 129K usec
> > 2 97KB/s 97KB/s 190KB/s 154K usec 1 5718KB/s 122K usec
> > 4 115KB/s 110KB/s 445KB/s 208K usec 1 5909KB/s 116K usec
> > 8 23KB/s 12KB/s 158KB/s 2820 msec 1 5445KB/s 168K usec
> > 16 11KB/s 3KB/s 145KB/s 5963 msec 1 5418KB/s 164K usec
> > 32 6KB/s 2KB/s 139KB/s 12762 msec 1 5398KB/s 175K usec
> >
> > Notes:
> > - Sequential reader in group2 seems to be well isolated from random readers
> > in group1. Throughput and latency of sequential reader are stable and
> > don't drop as number of random readers inrease in system.
> >
> > io-throttle + CFQ
> > ------------------
> > BW limit group1=10 MB/s BW limit group2=10 MB/s
> > [Multiple Random Reader] [Sequential Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 37KB/s 37KB/s 36KB/s 218K usec 1 8006KB/s 20529 usec
> > 2 185KB/s 183KB/s 360KB/s 228K usec 1 7475KB/s 33665 usec
> > 4 188KB/s 171KB/s 699KB/s 262K usec 1 6800KB/s 46224 usec
> > 8 84KB/s 51KB/s 573KB/s 1800K usec 1 2835KB/s 885K usec
> > 16 21KB/s 9KB/s 294KB/s 3590 msec 1 437KB/s 1855K usec
> > 32 34KB/s 27KB/s 980KB/s 2861K usec 1 1145KB/s 1952K usec
> >
> > Notes:
> > - I have setup limits of 10MB/s in both the cgroups. Now random reader
> > group will never achieve that kind of speed, so it will not be throttled
> > and then it goes onto impact the throughput and latency of other groups
> > in the system.
> >
> > - Now the key question is how conservative one should in be setting up
> > max BW limit. On this box if a customer has bought 10MB/s cgroup and if
> > he is running some random readers it will kill throughput of other
> > groups in the system and their latencies will shoot up. No isolation in
> > this case.
> >
> > - So in general, max BW provides isolation from high speed groups but it
> > does not provide isolaton from random reader groups which are moving
> > slow.
>
> Remember that in addition to blockio.bandwidth-max the io-throttle
> controlller also provides blockio.iops-max to enforce hard limits on the
> number of IO operations per second. Probably for this testcase both
> cgroups should be limited in terms of BW and iops to achieve a better
> isolation.
>

I modified my report scripts to also output aggregate iops numbers and to
remove the max-bandwidth and min-bandwidth numbers. So for the same tests
and same results I am now reporting iops numbers also. (I have not re-run
the tests.)

IO scheduler controller + CFQ
-----------------------------------
[Multiple Random Reader] [Sequential Reader]
nr Agg-bandw Max-latency Agg-iops nr Agg-bandw Max-latency Agg-iops
1 223KB/s 132K usec 55 1 5551KB/s 129K usec 1387
2 190KB/s 154K usec 46 1 5718KB/s 122K usec 1429
4 445KB/s 208K usec 111 1 5909KB/s 116K usec 1477
8 158KB/s 2820 msec 36 1 5445KB/s 168K usec 1361
16 145KB/s 5963 msec 28 1 5418KB/s 164K usec 1354
32 139KB/s 12762 msec 23 1 5398KB/s 175K usec 1349

io-throttle + CFQ
-----------------------------------
BW limit group1=10 MB/s BW limit group2=10 MB/s
[Multiple Random Reader] [Sequential Reader]
nr Agg-bandw Max-latency Agg-iops nr Agg-bandw Max-latency Agg-iops
1 36KB/s 218K usec 9 1 8006KB/s 20529 usec 2001
2 360KB/s 228K usec 89 1 7475KB/s 33665 usec 1868
4 699KB/s 262K usec 173 1 6800KB/s 46224 usec 1700
8 573KB/s 1800K usec 139 1 2835KB/s 885K usec 708
16 294KB/s 3590 msec 68 1 437KB/s 1855K usec 109
32 980KB/s 2861K usec 230 1 1145KB/s 1952K usec 286

Note that in the case of the random reader groups, iops are really small.
A few thoughts.

- What should be the iops limit I choose for the group? Let's say I choose
"80"; then things should be better for the sequential reader group, but
just think of what will happen to the random reader group, especially if
the nature of the workload in group1 changes to sequential. Group1 will
simply be killed.

So yes, one can limit a group both by BW as well as iops-max, but this
requires you to know in advance exactly what workload is running in the
group. The moment the workload changes, these settings might have a very
bad effect.

So my biggest concern with max-bw and max-iops limits is how one will
configure the system for a dynamic environment. Think of two virtual
machines being used by two customers. At one point they might be doing
some copy operation and running a sequential workload, and later some
webserver or database query might be doing some random read operations.

- Notice the interesting case of 16 random readers. iops for the random
reader group are really low, but the throughput and iops of the
sequential reader group are still very bad. I suspect that at the CFQ
level some kind of mixup has taken place where we have not enabled
idling for the sequential reader and the disk became seek bound, hence
both groups are losing. (Just a guess.)
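
One way to check this guess would be to trace the disk while the
16-random-readers case is running; a rough sketch using the standard
blktrace tools (device name and trace length are placeholders):

# Trace the test device for 30 seconds during the run.
blktrace -d /dev/sdb -o seekcheck -w 30
# Inspect the dispatch (D) events; large jumps in the sector column
# between consecutive dispatches mean the disk has gone seek bound.
blkparse -i seekcheck | grep ' D ' | less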

Out of curiosity I looked at the results of set1 and set3 also, and
they seem to be exhibiting similar behavior.

Set1
----
io-throttle + CFQ
-----------------------------------
BW limit group1=10 MB/s BW limit group2=10 MB/s
[Multiple Random Reader] [Sequential Reader]
nr Agg-bandw Max-latency Agg-iops nr Agg-bandw Max-latency Agg-iops
1 37KB/s 227K usec 9 1 8033KB/s 18773 usec 2008
2 342KB/s 601K usec 84 1 7406KB/s 476K usec 1851
4 677KB/s 163K usec 167 1 6743KB/s 69196 usec 1685
8 310KB/s 1780 msec 74 1 882KB/s 915K usec 220
16 877KB/s 431K usec 211 1 3278KB/s 274K usec 819
32 1109KB/s 1823 msec 261 1 1217KB/s 1022K usec 304

Set3
----
io-throttle + CFQ
-----------------------------------
BW limit group1=10 MB/s BW limit group2=10 MB/s
[Multiple Random Reader] [Sequential Reader]
nr Agg-bandw Max-latency Agg-iops nr Agg-bandw Max-latency Agg-iops
1 34KB/s 693K usec 8 1 7908KB/s 469K usec 1977
2 343KB/s 204K usec 85 1 7402KB/s 33962 usec 1850
4 691KB/s 228K usec 171 1 6847KB/s 76957 usec 1711
8 306KB/s 1806 msec 73 1 852KB/s 925K usec 213
16 287KB/s 3581 msec 63 1 439KB/s 1820K usec 109
32 976KB/s 3592K usec 230 1 1170KB/s 2895K usec 292

> >
> > Multiple Sequential Reader vs Random Reader
> > ===============================================
> > Now running a reverse test where in one group I am running increasing
> > number of sequential readers and in other group I am running one random
> > reader and see the impact of sequential readers on random reader.
> >
> > Vanilla CFQ
> > -----------------------------------
> > [Multiple Sequential Reader] [Random Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 13978KB/s 13978KB/s 13650KB/s 27614 usec 1 22KB/s 227 msec
> > 2 6225KB/s 6166KB/s 12101KB/s 568K usec 1 10KB/s 457 msec
> > 4 4052KB/s 2462KB/s 13107KB/s 322K usec 1 6KB/s 841 msec
> > 8 1899KB/s 557KB/s 12960KB/s 829K usec 1 13KB/s 1628 msec
> > 16 1007KB/s 279KB/s 13833KB/s 1629K usec 1 10KB/s 3236 msec
> > 32 506KB/s 98KB/s 13704KB/s 3389K usec 1 6KB/s 3238 msec
> >
> > IO scheduler controller + CFQ
> > -----------------------------------
> > [Multiple Sequential Reader] [Random Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 5721KB/s 5721KB/s 5587KB/s 126K usec 1 223KB/s 126K usec
> > 2 3216KB/s 1442KB/s 4549KB/s 349K usec 1 224KB/s 176K usec
> > 4 1895KB/s 640KB/s 5121KB/s 775K usec 1 222KB/s 189K usec
> > 8 957KB/s 285KB/s 6368KB/s 1680K usec 1 223KB/s 142K usec
> > 16 458KB/s 132KB/s 6455KB/s 3343K usec 1 219KB/s 165K usec
> > 32 248KB/s 55KB/s 6001KB/s 6957K usec 1 220KB/s 504K usec
> >
> > Notes:
> > - Random reader is well isolated from increasing number of sequential
> > readers in other group. BW and latencies are stable.
> >
> > io-throttle + CFQ
> > -----------------------------------
> > BW limit group1=10 MB/s BW limit group2=10 MB/s
> > [Multiple Sequential Reader] [Random Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 8200KB/s 8200KB/s 8007KB/s 20275 usec 1 37KB/s 217K usec
> > 2 3926KB/s 3919KB/s 7661KB/s 122K usec 1 16KB/s 441 msec
> > 4 2271KB/s 1497KB/s 7672KB/s 611K usec 1 9KB/s 927 msec
> > 8 1113KB/s 513KB/s 7507KB/s 849K usec 1 21KB/s 1020 msec
> > 16 661KB/s 236KB/s 7959KB/s 1679K usec 1 13KB/s 2926 msec
> > 32 292KB/s 109KB/s 7864KB/s 3446K usec 1 8KB/s 3439 msec
> >
> > BW limit group1=5 MB/s BW limit group2=5 MB/s
> > [Multiple Sequential Reader] [Random Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 4686KB/s 4686KB/s 4576KB/s 21095 usec 1 57KB/s 219K usec
> > 2 2298KB/s 2179KB/s 4372KB/s 132K usec 1 37KB/s 431K usec
> > 4 1245KB/s 1019KB/s 4449KB/s 324K usec 1 26KB/s 835 msec
> > 8 584KB/s 403KB/s 4109KB/s 833K usec 1 30KB/s 1625K usec
> > 16 346KB/s 252KB/s 4605KB/s 1641K usec 1 129KB/s 3236K usec
> > 32 175KB/s 56KB/s 4269KB/s 3236K usec 1 8KB/s 3235 msec
> >
> > Notes:
> >
> > - Above result is surprising to me. I have run it twice. In first run, I
> > setup per cgroup limit as 10MB/s and in second run I set it up 5MB/s. In
> > both the cases as number of sequential readers increase in other groups,
> > random reader's throughput decreases and latencies increase. This is
> > happening despite the fact that sequential readers are being throttled
> > to make sure it does not impact workload in other group. Wondering why
> > random readers are not seeing consistent throughput and latencies.
>
> Maybe because CFQ is still trying to be fair among processes instead of
> cgroups. Remember that io-throttle doesn't touch the CFQ code (for this
> I'm definitely convinced that CFQ should be changed to think also in
> terms of cgroups, and io-throttle alone is not enough).
>

True. I think that's what is happening here. CFQ will see requests from
all the sequential readers and will try to give each of these a 100ms
slice, but the random reader will get one chance to dispatch requests
and will then again be at the back of the service tree.

Throttling at higher layers should help a bit, so that group1 does not
get to run for too long, but it still does not seem to be helping a lot.

So it becomes important that the underlying IO scheduler knows about
groups and does the scheduling accordingly; otherwise we run into issues
of "weak isolation" between groups and latencies that do not improve.

> So, even if group1 is being throttled in part it is still able to submit
> some requests that get a higher priority respect to the requests
> submitted by the single random reader task.
>
> It could be interesting to test another IO scheduler (deadline, as or
> even noop) to check if this is the actual problem.
>
> >
> > - Andrea, can you please also run similar tests to see if you see same
> > results or not. This is to rule out any testing methodology errors or
> > scripting bugs. :-). I also have collected the snapshot of some cgroup
> > files like bandwidth-max, throttlecnt, and stats. Let me know if you want
> > those to see what is happenig here.
>
> Sure, I'll do some tests ASAP. Another interesting test would be to set
> a blockio.iops-max limit also for the sequential readers' cgroup, to be
> sure we're not touching some iops physical disk limit.
>
> Could you post all the options you used with fio, so I can repeat some
> tests as similar as possible to yours?
>
> >
> > Multiple Sequential Reader vs Sequential Reader
> > ===============================================
> > - This time running random readers are out of the picture and trying to
> > see the effect of increasing number of sequential readers on another
> > sequential reader running in a different group.
> >
> > Vanilla CFQ
> > -----------------------------------
> > [Multiple Sequential Reader] [Sequential Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 6325KB/s 6325KB/s 6176KB/s 114K usec 1 6902KB/s 120K usec
> > 2 4588KB/s 3102KB/s 7510KB/s 571K usec 1 4564KB/s 680K usec
> > 4 3242KB/s 1158KB/s 9469KB/s 495K usec 1 3198KB/s 410K usec
> > 8 1775KB/s 459KB/s 12011KB/s 1178K usec 1 1366KB/s 818K usec
> > 16 943KB/s 296KB/s 13285KB/s 1923K usec 1 728KB/s 1816K usec
> > 32 511KB/s 148KB/s 13555KB/s 3286K usec 1 391KB/s 3212K usec
> >
> > IO scheduler controller + CFQ
> > -----------------------------------
> > [Multiple Sequential Reader] [Sequential Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 6781KB/s 6781KB/s 6622KB/s 109K usec 1 6691KB/s 115K usec
> > 2 3758KB/s 1876KB/s 5502KB/s 693K usec 1 6373KB/s 419K usec
> > 4 2100KB/s 671KB/s 5751KB/s 987K usec 1 6330KB/s 569K usec
> > 8 1023KB/s 355KB/s 6969KB/s 1569K usec 1 6086KB/s 120K usec
> > 16 520KB/s 130KB/s 7094KB/s 3140K usec 1 5984KB/s 119K usec
> > 32 245KB/s 86KB/s 6621KB/s 6571K usec 1 5850KB/s 113K usec
> >
> > Notes:
> > - BW and latencies of sequential reader in group 2 are fairly stable as
> > number of readers increase in first group.
> >
> > io-throttle + CFQ
> > -----------------------------------
> > BW limit group1=30 MB/s BW limit group2=30 MB/s
> > [Multiple Sequential Reader] [Sequential Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 6343KB/s 6343KB/s 6195KB/s 116K usec 1 6993KB/s 109K usec
> > 2 4583KB/s 3046KB/s 7451KB/s 583K usec 1 4516KB/s 433K usec
> > 4 2945KB/s 1324KB/s 9552KB/s 602K usec 1 3001KB/s 583K usec
> > 8 1804KB/s 473KB/s 12257KB/s 861K usec 1 1386KB/s 815K usec
> > 16 942KB/s 265KB/s 13560KB/s 1659K usec 1 718KB/s 1658K usec
> > 32 462KB/s 143KB/s 13757KB/s 3482K usec 1 409KB/s 3480K usec
> >
> > Notes:
> > - BW decreases and latencies increase in group2 as number of readers
> > increase in first group. This should be due to fact that no throttling
> > will happen as none of the groups is hitting the limit of 30MB/s. To
> > me this is the tricky part. How a service provider is supposed to
> > set the limit of groups. If groups are not hitting max limits, it will
> > still impact the BW and latencies in other group.
>
> Are you using 4k block size here? because in case of too small blocks
> you could hit some physical iops limit. Also for this case it could be
> interesting to see what happens setting both BW and iops hard limits.
>

Hmm, here are the same results posted again with iops numbers.

io-throttle + CFQ
-----------------------------------
BW limit group1=30 MB/s BW limit group2=30 MB/s
[Multiple Sequential Reader] [Sequential Reader]
nr Agg-bandw Max-latency Agg-iops nr Agg-bandw Max-latency Agg-iops
1 6195KB/s 116K usec 1548 1 6993KB/s 109K usec 1748
2 7451KB/s 583K usec 1862 1 4516KB/s 433K usec 1129
4 9552KB/s 602K usec 2387 1 3001KB/s 583K usec 750
8 12257KB/s 861K usec 3060 1 1386KB/s 815K usec 346
16 13560KB/s 1659K usec 3382 1 718KB/s 1658K usec 179
32 13757KB/s 3482K usec 3422 1 409KB/s 3480K usec 102

BW limit group1=10 MB/s BW limit group2=10 MB/s
[Multiple Sequential Reader] [Sequential Reader]
nr Agg-bandw Max-latency Agg-iops nr Agg-bandw Max-latency Agg-iops
1 4032KB/s 215K usec 1008 1 4076KB/s 170K usec 1019
2 4655KB/s 291K usec 1163 1 2891KB/s 212K usec 722
4 5872KB/s 417K usec 1466 1 1881KB/s 411K usec 470
8 7312KB/s 841K usec 1824 1 853KB/s 816K usec 213
16 7844KB/s 1728K usec 1956 1 503KB/s 1609K usec 125
32 7920KB/s 3417K usec 1969 1 249KB/s 3205K usec 62

BW limit group1=5 MB/s BW limit group2=5 MB/s
[Multiple Sequential Reader] [Sequential Reader]
nr Agg-bandw Max-latency Agg-iops nr Agg-bandw Max-latency Agg-iops
1 2377KB/s 110K usec 594 1 2415KB/s 120K usec 603
2 2759KB/s 222K usec 689 1 1709KB/s 220K usec 427
4 3314KB/s 420K usec 828 1 1163KB/s 414K usec 290
8 4060KB/s 901K usec 1011 1 527KB/s 816K usec 131
16 4324KB/s 1613K usec 1074 1 311KB/s 1613K usec 77
32 4320KB/s 3235K usec 1067 1 163KB/s 3209K usec 40

Note that with a BW limit of 30MB/s we are able to hit more than 3400
iops, but with bw=5MB/s we are hitting close to 1100 iops. So I think we
are under-utilizing the storage here and have not run into any kind of
iops limit.
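
As a quick sanity check on these numbers, the reported iops are
consistent with aggregate bandwidth divided by a 4KB request size, for
example:

# 32 readers, 30MB/s limit: 13757KB/s / 4KB ~= 3439 iops (reported 3422)
# 32 readers,  5MB/s limit:  4320KB/s / 4KB ~= 1080 iops (reported 1067)
awk 'BEGIN { print 13757/4, 4320/4 }'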

> >
> > BW limit group1=10 MB/s BW limit group2=10 MB/s
> > [Multiple Sequential Reader] [Sequential Reader]
> > nr Max-bandw Min-bandw Agg-bandw Max-latency nr Agg-bandw Max-latency
> > 1 4128KB/s 4128KB/s 4032KB/s 215K usec 1 4076KB/s 170K usec
> > 2 2880KB/s 1886KB/s 4655KB/s 291K usec 1 2891KB/s 212K usec
> > 4 1912KB/s 888KB/s 5872KB/s 417K usec 1 1881KB/s 411K usec
> > 8 1032KB/s 432KB/s 7312KB/s 841K usec 1 853KB/s 816K usec
> > 16 540KB/s 259KB/s 7844KB/s 1728K usec 1 503KB/s 1609K usec
> > 32 291KB/s 111KB/s 7920KB/s 3417K usec 1 249KB/s 3205K usec
> >
> > Notes:
> > - Same test with 10MB/s as group limit. This is again a surprising result.
> > Max BW in first group is being throttled but still throughput is
> > dropping significantly in second group and latencies are on the rise.
>
> Same consideration about CFQ and/or iops limit. Could you post all the
> fio options you've used also for this test (or better, for all tests)?
>

Already posted in a separate mail.

> >
> > - Limit of first group is 10MB/s but it is achieving max BW of around
> > 8MB/s only. What happened to rest of the 2MB/s?
>
> Ditto.
>

For the 10MB/s case, max iops seems to be around 2000 collectively, well
below 3400. So I doubt that this is a case of hitting the iops limit of
the disk.

Thanks
Vivek

2009-10-17 15:18:19

by Andrea Righi

[permalink] [raw]
Subject: Re: Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)

On Mon, Oct 12, 2009 at 05:11:20PM -0400, Vivek Goyal wrote:

[snip]

> I modified my report scripts to also output aggregate iops numbers and
> remove the max-bandwidth and min-bandwidth numbers. So for the same tests
> and the same results I am now reporting iops numbers as well. (I have not
> re-run the tests.)
>
> IO scheduler controller + CFQ
> -----------------------------------
> [Multiple Random Reader] [Sequential Reader]
> nr Agg-bandw Max-latency Agg-iops nr Agg-bandw Max-latency Agg-iops
> 1 223KB/s 132K usec 55 1 5551KB/s 129K usec 1387
> 2 190KB/s 154K usec 46 1 5718KB/s 122K usec 1429
> 4 445KB/s 208K usec 111 1 5909KB/s 116K usec 1477
> 8 158KB/s 2820 msec 36 1 5445KB/s 168K usec 1361
> 16 145KB/s 5963 msec 28 1 5418KB/s 164K usec 1354
> 32 139KB/s 12762 msec 23 1 5398KB/s 175K usec 1349
>
> io-throttle + CFQ
> -----------------------------------
> BW limit group1=10 MB/s BW limit group2=10 MB/s
> [Multiple Random Reader] [Sequential Reader]
> nr Agg-bandw Max-latency Agg-iops nr Agg-bandw Max-latency Agg-iops
> 1 36KB/s 218K usec 9 1 8006KB/s 20529 usec 2001
> 2 360KB/s 228K usec 89 1 7475KB/s 33665 usec 1868
> 4 699KB/s 262K usec 173 1 6800KB/s 46224 usec 1700
> 8 573KB/s 1800K usec 139 1 2835KB/s 885K usec 708
> 16 294KB/s 3590 msec 68 1 437KB/s 1855K usec 109
> 32 980KB/s 2861K usec 230 1 1145KB/s 1952K usec 286
>
> Note that in the case of random reader groups, iops are really small. A
> few thoughts:
>
> - What iops limit should I choose for the group? Let's say I choose "80";
> then things should be better for the sequential reader group, but just
> think of what will happen to the random reader group, especially if the
> nature of the workload in group1 changes to sequential. Group1 will
> simply be killed.
>
> So yes, one can limit a group both by BW as well as iops-max, but this
> requires you to know in advance exactly what workload is running in the
> group. The moment the workload changes, these settings might have very
> bad effects.
>
> So my biggest concern with max-bw and max-iops limits is how one will
> configure the system for a dynamic environment. Think of two virtual
> machines being used by two customers. At one point they might be doing
> some copy operation and running a sequential workload, and later some
> webserver or database query might be doing random read operations.

The main problem IMHO is how to accurately evaluate the cost of an IO
operation. On rotational media, for example, the cost of reading two
distant blocks is not the same as the cost of reading two contiguous
blocks (while on a flash/SSD drive the cost is probably the same).

io-throttle tries to quantify the cost in absolute terms (iops and BW),
but this is not enough to cover all the possible cases. For example, you
could hit a physical disk limit because your workload is too seeky, even
if the iops and BW numbers are low.
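
A quick back-of-envelope example of that, assuming a typical rotational
disk that can sustain on the order of 100-150 random 4KB reads per
second:

# A ~120 iops disk doing 4KB random reads tops out around 480KB/s, so a
# purely seeky workload can fully saturate the spindle while staying far
# below a 10MB/s bandwidth cap.
awk 'BEGIN { printf "%d KB/s\n", 120 * 4 }'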

>
> - Notice the interesting case of 16 random readers. iops for the random
> reader group are really low, but the throughput and iops of the
> sequential reader group are still very bad. I suspect that at the CFQ
> level some kind of mixup has taken place where we have not enabled
> idling for the sequential reader and the disk became seek bound, hence
> both groups are losing. (Just a guess.)

Yes, my guess is the same.

I've re-run some of your tests using an SSD (a MOBI MTRON
MSD-PATA3018-ZIF1), but changed a few parameters: I used a larger block
size for the sequential workload (there's no need to reduce the block
size of the individual reads if we expect to read a lot of contiguous
blocks).

And for all the io-throttle tests I switched to the noop scheduler (CFQ
must be made cgroup-aware before being used together with io-throttle;
otherwise the result is that one simply breaks the logic of the other).

=== io-throttle settings ===
cgroup #1: max-bw 10MB/s, max-iops 2150 iop/s
cgroup #2: max-bw 10MB/s, max-iops 2150 iop/s

During the tests I used a larger block size for the sequential readers
than for the random readers:

sequential-read: block size = 1MB
random-read: block size = 4KB
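
For anyone wanting to reproduce this, a rough sketch of the setup steps
described above; the cgroup mount point and device name are assumptions
for illustration, while the scheduler switch and the fio_args convention
come from the tests below.

# Switch the test device to the noop scheduler.
echo noop > /sys/block/sdb/queue/scheduler
# Move the current shell into cgroup #1 before starting its fio jobs
# (cgroup #2 is populated the same way).
echo $$ > /cgroup/blockio/group1/tasks
# Launch the cgroup's workload using the fio_args listed with each test.
fio --name=job1 $fio_args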

sequential-readers vs sequential-reader
=======================================
[ cgroup #1 workload ]
fio_args="--rw=read --bs=1M --size=512M --runtime=30 --numjobs=N --direct=1"
[ cgroup #2 workload ]
fio_args="--rw=read --bs=1M --size=512M --runtime=30 --numjobs=1 --direct=1"

__2.6.32-rc5__
[ cgroup #1 ] [ cgroup #2 ]
tasks aggr-bw tasks aggr-bw
1 36210KB/s 1 36992KB/s
2 47558KB/s 1 24479KB/s
4 57587KB/s 1 14809KB/s
8 64667KB/s 1 8393KB/s

__2.6.32-rc5-io-throttle__
[ cgroup #1 ] [ cgroup #2 ]
tasks aggr-bw tasks aggr-bw
1 10195KB/s 1 10193KB/s
2 10279KB/s 1 10276KB/s
4 10281KB/s 1 10277KB/s
8 10279KB/s 1 10277KB/s

random-readers vs sequential-reader
===================================
[ cgroup #1 workload ]
fio_args="--rw=randread --bs=4k --size=512M --runtime=30 --numjobs=N --direct=1"
[ cgroup #2 workload ]
fio_args="--rw=read --bs=1M --size=512M --runtime=30 --numjobs=1 --direct=1"

__2.6.32-rc5__
[ cgroup #1 ] [ cgroup #2 ]
tasks aggr-bw tasks aggr-bw
1 4767KB/s 1 52819KB/s
2 5900KB/s 1 39788KB/s
4 7783KB/s 1 27966KB/s
8 9296KB/s 1 17606KB/s

__2.6.32-rc5-io-throttle__
[ cgroup #1 ] [ cgroup #2 ]
tasks aggr-bw tasks aggr-bw
1 8861KB/s 1 8886KB/s
2 8887KB/s 1 7578KB/s
4 8886KB/s 1 7271KB/s
8 8889KB/s 1 7489KB/s

sequential-readers vs random-reader
===================================
[ cgroup #1 workload ]
fio_args="--rw=read --bs=1M --size=512M --runtime=30 --numjobs=N --direct=1"
[ cgroup #2 workload ]
fio_args="--rw=randread --bs=4k --size=512M --runtime=30 --numjobs=1 --direct=1"

__2.6.32-rc5__
[ cgroup #1 ] [ cgroup #2 ]
tasks aggr-bw tasks aggr-bw
1 54511KB/s 1 4865KB/s
2 70312KB/s 1 965KB/s
4 71543KB/s 1 484KB/s
8 72899KB/s 1 98KB/s

__2.6.32-rc5-io-throttle__
[ cgroup #1 ] [ cgroup #2 ]
tasks aggr-bw tasks aggr-bw
1 8875KB/s 1 8885KB/s
2 8884KB/s 1 8148KB/s
4 8886KB/s 1 7637KB/s
8 8886KB/s 1 7411KB/s

random-readers vs random-reader
===============================
[ cgroup #1 workload ]
fio_args="--rw=randread --bs=4k --size=512M --runtime=30 --numjobs=N --direct=1"
[ cgroup #2 workload ]
fio_args="--rw=randread --bs=4k --size=512M --runtime=30 --numjobs=1 --direct=1"

__2.6.32-rc5__
[ cgroup #1 ] [ cgroup #2 ]
tasks aggr-bw tasks aggr-bw
1 6141KB/s 1 6320KB/s
2 8567KB/s 1 3987KB/s
4 9783KB/s 1 2610KB/s
8 11067KB/s 1 1227KB/s

__2.6.32-rc5-io-throttle__
[ cgroup #1 ] [ cgroup #2 ]
tasks aggr-bw tasks aggr-bw
1 8883KB/s 1 8886KB/s
2 8888KB/s 1 7676KB/s
4 8887KB/s 1 7364KB/s
8 8884KB/s 1 7264KB/s

With the SSD there is no consistent degradation of cgroup #2 when we
increase the number of concurrent random reader tasks in cgroup #1 (in
both the random-vs-random and random-vs-sequential cases).

We should analyze the details more closely (blktrace would probably help
here), but it seems that in your tests the mix of CFQ and io-throttle
generated too seeky a workload, which caused the bad performance of the
sequential reader.

-Andrea