Date: Thu, 1 Oct 2009 09:31:09 -0400
From: Vivek Goyal <vgoyal@redhat.com>
To: Ryo Tsuruta <ryov@valinux.co.jp>
Cc: nauman@google.com, linux-kernel@vger.kernel.org, jens.axboe@oracle.com,
       containers@lists.linux-foundation.org, dm-devel@redhat.com,
       dpshah@google.com, lizf@cn.fujitsu.com, mikew@google.com,
       fchecconi@gmail.com, paolo.valente@unimore.it, fernando@oss.ntt.co.jp,
       s-uchida@ap.jp.nec.com, taka@valinux.co.jp, guijianfeng@cn.fujitsu.com,
       jmoyer@redhat.com, dhaval@linux.vnet.ibm.com, balbir@linux.vnet.ibm.com,
       righi.andrea@gmail.com, m-ikeda@ds.jp.nec.com, agk@redhat.com,
       akpm@linux-foundation.org, peterz@infradead.org, jmarchan@redhat.com,
       torvalds@linux-foundation.org, mingo@elte.hu, riel@redhat.com,
       yoshikawa.takuya@oss.ntt.co.jp
Subject: Re: IO scheduler based IO controller V10
Message-ID: <20091001133109.GA4058@redhat.com>
References: <20090929141049.GA12141@redhat.com> <20090930.174319.183036386.ryov@valinux.co.jp> <20090930110500.GA26631@redhat.com> <20091001.154125.104044685.ryov@valinux.co.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20091001.154125.104044685.ryov@valinux.co.jp>
User-Agent: Mutt/1.5.18 (2008-05-17)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 11934
Lines: 254

On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > > Hi Vivek,
> > > 
> > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > I was thinking that elevator layer will do the merge of bios. So IO
> > > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > > into the disk and again timestamp with finish time once request finishes.
> > > > 
> > > > This way higher layer can get an idea how much disk time a group of bios
> > > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > > then time accounting becomes an issue.
> > > > 
> > > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume
> > > > time elapsed between each of milestones is t. Also assume that all these
> > > > requests are from same queue/group.
> > > > 
> > > >         t0   t1   t2   t3  t4   t5   t6   t7
> > > >         rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
> > > > 
> > > > Now higher layer will think that time consumed by group is:
> > > > 
> > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > > 
> > > > But the time elapsed is only 7t.
> > > 
> > > IO controller can know how many requests are issued and still in
> > > progress. Is it not enough to accumulate the time while in-flight IOs
> > > exist?
> > > 
> > 
> > That time would not reflect disk time used. It will be follwoing.
> > 
> > (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> > (time spent in disk)
> 
> In the case where multiple IO requests are issued from IO controller,
> that time measurement is the time from when the first IO request is
> issued until when the endio is called for the last IO request. Does
> not it reflect disk time?
> 

Not accurately as it will be including the time spent in CFQ queues as
well as dispatch queue. I will not worry much about dispatch queue time
but time spent CFQ queues can be significant.

This is assuming that you are using token based scheme and will be
dispatching requests from multiple groups at the same time.

But if you figure out a way that you dispatch requests from one group only
at one time and wait for all requests to finish and then let next group
go, then above can work fairly accurately. In that case it will become
like CFQ with the only difference that effectively we have one queue per
group instread of per process.

> > > > Secondly if a different group is running only single sequential reader,
> > > > there CFQ will be driving queue depth of 1 and time will not be running
> > > > faster and this inaccuracy in accounting will lead to unfair share between
> > > > groups.
> > > >
> > > > So we need something better to get a sense which group used how much of
> > > > disk time.
> > > 
> > > It could be solved by implementing the way to pass on such information
> > > from IO scheduler to higher layer controller.
> > > 
> > 
> > How would you do that? Can you give some details exactly how and what
> > information IO scheduler will pass to higher level IO controller so that IO
> > controller can attribute right time to the group.
> 
> If you would like to know when the idle timer is expired, how about
> adding a function to IO controller to be notified it from IO
> scheduler? IO scheduler calls the function when the timer is expired.
> 

This probably can be done. So this is like syncing between lower layers
and higher layers about when do we start idling and when do we stop it and
both the layers should be in sync.

This is something my common layer approach does. Becuase it is so close to
IO scheuler, I can do it relatively easily.

One probably can create interfaces to even propogate this information up.
But this all will probably come into the picture only if we don't use
token based schemes and come up with something where at one point of time
dispatch are from one group only.

> > > > > How about making throttling policy be user selectable like the IO
> > > > > scheduler and putting it in the higher layer? So we could support
> > > > > all of policies (time-based, size-based and rate limiting). There
> > > > > seems not to only one solution which satisfies all users. But I agree
> > > > > with starting with proportional bandwidth control first. 
> > > > > 
> > > > 
> > > > What are the cases where time based policy does not work and size based
> > > > policy works better and user would choose size based policy and not timed
> > > > based one?
> > > 
> > > I think that disk time is not simply proportional to IO size. If there
> > > are two groups whose wights are equally assigned and they issue
> > > different sized IOs repsectively, the bandwidth of each group would
> > > not distributed equally as expected. 
> > > 
> > 
> > If we are providing fairness in terms of time, it is fair. If we provide
> > equal time slots to two processes and if one got more IO done because it
> > was not wasting time seeking or it issued bigger size IO, it deserves that
> > higher BW. IO controller will make sure that process gets fair share in
> > terms of time and exactly how much BW one got will depend on the workload.
> > 
> > That's the precise reason that fairness in terms of time is better on
> > seeky media.
> 
> If the seek time is negligible, the bandwidth would not be distributed 
> according to a proportion of weight settings. I think that it would be
> unclear for users to understand how bandwidth is distributed. And I
> also think that seeky media would gradually become obsolete,
> 

I can understand that if lesser the seek cost game starts changing and
probably a size based policy also work decently.

In that case at some point of time probably CFQ will also need to support
another mode/policy where fairness is provided in terms of size of IO, if
it detects a SSD with hardware queuing. Currently it seem to be disabling
the idling in that case. But this is not very good from fairness point of
view. I guess if CFQ wants to provide fairness in such cases, it needs to
dynamically change the shape and start thinking in terms of size of IO.

So far my testing has been very limited to hard disks connected to my
computer. I will do some testing on high end enterprise storage and see
how much do seek matter and how well both the implementations work.

> > > > I am not against implementing things in higher layer as long as we can
> > > > ensure tight control on latencies, strong isolation between groups and
> > > > not break CFQ's class and ioprio model with-in group.
> > > > 
> > > > > BTW, I will start to reimplement dm-ioband into block layer.
> > > > 
> > > > Can you elaborate little bit on this?
> > > 
> > > bio is grabbed in generic_make_request() and throttled as well as
> > > dm-ioband's mechanism. dmsetup command is not necessary any longer.
> > > 
> > 
> > Ok, so one would not need dm-ioband device now, but same dm-ioband
> > throttling policies will apply. So until and unless we figure out a
> > better way, the issues I have pointed out will still exists even in
> > new implementation.
> 
> Yes, those still exist, but somehow I would like to try to solve them.
> 
> > > The default value of io_limit on the previous test was 128 (not 192)
> > > which is equall to the default value of nr_request.
> > 
> > Hm..., I used following commands to create two ioband devices.
> > 
> > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> > "weight 0 :100" | dmsetup create ioband1
> > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> > "weight 0 :100" | dmsetup create ioband2
> > 
> > Here io_limit value is zero so it should pick default value. Following is
> > output of "dmsetup table" command.
> > 
> > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
> >                                     ^^^^
> > IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> > to be 192?
> 
> The default vaule has changed since v1.12.0 and increased from 128 to 192.
> 
> > > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > > writes.
> > > 
> > > O.K. You removed "conv=fdatasync", the new dm-ioband handles
> > > sync/async requests separately, and it solves this
> > > buffered-write-starves-read problem. I would like to post it soon
> > > after doing some more test.
> > > 
> > > > On top of that can you please give some details how increasing the
> > > > buffered queue length reduces the impact of writers?
> > > 
> > > When the number of in-flight IOs exceeds io_limit, processes which are
> > > going to issue IOs are made sleep by dm-ioband until all the in-flight
> > > IOs are finished. But IO scheduler layer can accept IO requests more
> > > than the value of io_limit, so it was a bottleneck of the throughput.
> > > 
> > 
> > Ok, so it should have been throughput bottleneck but how did it solve the
> > issue of writer starving the reader as you had mentioned in the mail.
> 
> As wrote above, I modified dm-ioband to handle sync/async requests
> separately, so even if writers do a lot of buffered IOs, readers can
> issue IOs regardless writers' busyness. Once the IOs are backlogged
> for throttling, the both sync and async requests are issued according
> to the other of arrival.
> 

Ok, so if both the readers and writers are buffered and some tokens become
available then these tokens will be divided half and half between readers
or writer queues?

> > Secondly, you mentioned that processes are made to sleep once we cross 
> > io_limit. This sounds like request descriptor facility on requeust queue
> > where processes are made to sleep.
> >
> > There are threads in kernel which don't want to sleep while submitting
> > bios. For example, btrfs has bio submitting thread which does not want
> > to sleep hence it checks with device if it is congested or not and not
> > submit the bio if it is congested.  How would you handle such cases. Have
> > you implemented any per group congestion kind of interface to make sure
> > such IO's don't sleep if group is congested.
> >
> > Or this limit is per ioband device which every group on the device is
> > sharing. If yes, then how would you provide isolation between groups 
> > because if one groups consumes io_limit tokens, then other will simply
> > be serialized on that device?
> 
> There are two kind of limit and both limit the number of IO requests
> which can be issued simultaneously, but one is for per ioband device, 
> the other is for per ioband group. The per group limit assigned to
> each group is calculated by dividing io_limit according to their
> proportion of weight.
> 
> The kernel thread is not made to sleep by the per group limit, because
> several kinds of kernel threads submit IOs from multiple groups and
> for multiple devices in a single thread. At this time, the kernel
> thread is made to sleep by the per device limit only.
> 

Interesting. Actually not blocking kernel threads on per group limit
and instead blocking it only on per device limts sounds like a good idea.

I can also do something similar and that will take away the need of
exporting per group congestion interface to higher layers and reduce
complexity. If some kernel thread does not want to block, these will
continue to use existing per device/bdi congestion interface.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/