Date: Tue, 5 Sep 2017 14:02:28 -0700
From: Shaohua Li <shli@kernel.org>
To: Paolo VALENTE
Cc: Shaohua Li, Linux Kernel Mailing List, linux-block, Kernel-team@fb.com, tj@kernel.org, axboe@fb.com, vgoyal@redhat.com
Subject: Re: [PATCH V6 00/18] blk-throttle: add .low limit
Message-ID: <20170905210228.vzwjtg24fbmwfl6y@kernel.org>

On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
>
> > On 15 Jan 2017, at 04:42, Shaohua Li wrote:
> >
> > Hi,
> >
> > cgroup still lacks a good I/O controller. CFQ works well for hard disks,
> > but not so well for SSDs. This patch set tries to add a conservative
> > limit to blk-throttle. It isn't proportional scheduling, but it can help
> > prioritize cgroups. There are several advantages to choosing
> > blk-throttle:
> > - blk-throttle resides early in the block stack. It works for both bio-
> >   and request-based queues.
> > - blk-throttle is lightweight in general. It still takes the queue lock,
> >   but it would not be hard to implement a per-cpu cache and remove the
> >   lock contention.
> > - blk-throttle doesn't use the 'idle disk' mechanism that CFQ/BFQ use.
> >   That mechanism has been shown to harm performance on fast SSDs.
> >
> > The patch set adds a new io.low limit to blk-throttle. It is only for
> > cgroup2. The existing io.max is hard-limit throttling: a cgroup with a
> > max limit never dispatches more IO than its max limit. io.low, by
> > contrast, is best-effort throttling: cgroups with a 'low' limit can run
> > above their 'low' limit at appropriate times. Specifically, if all
> > cgroups reach their 'low' limit, all cgroups can run above their 'low'
> > limit. If any cgroup runs under its 'low' limit, all other cgroups will
> > run according to their 'low' limit. The 'low' limit thus plays two
> > roles: it lets cgroups use free bandwidth, and it protects the bandwidth
> > promised by each cgroup's 'low' limit.
> >
> > An example usage: we have a high prio cgroup with a high 'low' limit and
> > a low prio cgroup with a low 'low' limit. If the high prio cgroup isn't
> > running, the low prio cgroup can run above its 'low' limit, so we don't
> > waste bandwidth. When the high prio cgroup runs and is below its 'low'
> > limit, the low prio cgroup will run under its 'low' limit. This protects
> > the high prio cgroup's access to resources.
> >
>
> Hi Shaohua,

Hi,

sorry for the late response.

> I would like to ask you some questions, to make sure I fully understand
> how the 'low' limit and the idle-group detection work in your scenario
> above. Suppose that: the drive has a random-I/O peak rate of 100 MB/s,
> the high prio group has a 'low' limit of 90 MB/s, and the low prio group
> has a 'low' limit of 10 MB/s.
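
To make this scenario concrete, the two 'low' limits could be set through
the cgroup2 io.low interface along the following lines. This is only a
sketch: the cgroup paths and the 8:0 device number are placeholders, and
the rbps/wbps values are simply 90 MB/s and 10 MB/s expressed in bytes.

#include <stdio.h>

/* Write one settings line into a cgroup's io.low file. */
static int set_low(const char *cgroup, const char *settings)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/io.low", cgroup);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", settings);
	return fclose(f);
}

int main(void)
{
	/* high prio group: 'low' limit of 90 MB/s (90 * 1024 * 1024 bytes) */
	set_low("/sys/fs/cgroup/high", "8:0 rbps=94371840 wbps=94371840");
	/* low prio group: 'low' limit of 10 MB/s */
	set_low("/sys/fs/cgroup/low", "8:0 rbps=10485760 wbps=10485760");
	return 0;
}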
> If
> - the high prio process happens to do, say, only 5 MB/s for a long time,
> - the low prio process constantly does greedy I/O, and
> - the idle-group detection is not being used,
> then the low prio process is limited to 10 MB/s during all this time
> interval, and only 10% of the device bandwidth is utilized.
>
> To recover the lost bandwidth through idle-group detection, we need to
> set a target I/O latency for the high prio group. The high prio group
> should happen to be below the threshold, and thus be detected as idle,
> leaving the low prio group free to use all the bandwidth.
>
> Here are my questions:
> 1) Is all I wrote above correct?

Yes

> 2) In particular, maybe there are other, better mechanisms to saturate
> the bandwidth in the above scenario?

I assume this is covered by 4) below.

> If what I wrote above is correct:
> 3) Doesn't fluctuation occur? I mean: when the low prio group gets full
> bandwidth, the latency threshold of the high prio group may be exceeded,
> causing the high prio group to no longer be considered idle, and thus
> the low prio group to be limited again; this in turn will cause the
> threshold to no longer be exceeded, and so on.

That's true. We try to mitigate the fluctuation by increasing the low
prio cgroup's bandwidth gradually, though.

> 4) Is there a way to compute an appropriate target latency for the high
> prio group, if it is a generic group, for which the latency requirements
> of the processes it contains are only partially known or completely
> unknown? By an appropriate target latency, I mean a target latency that
> enables the framework to fully utilize the device bandwidth while the
> high prio group is doing less I/O than its limit.

Not sure how we can do this. The device's max bandwidth varies with
request size and read/write ratio, so we don't know when the max
bandwidth is reached. Also, I think we must consider the case where the
workloads never use the full bandwidth of a disk, which is pretty common
for SSDs (at least in our environment).

Thanks,
Shaohua
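
As an illustration of the gradual ramp-up mentioned in the answer to 3),
the idea can be modeled in userspace roughly as follows. This is only a
toy sketch of the approach, not the blk-throttle code; all names and
constants are made up.

#include <stdbool.h>
#include <stdio.h>

#define LOW_LIMIT_MBPS	 10	/* low prio group's 'low' limit */
#define MAX_MBPS	100	/* assumed device peak rate */
#define STEP_MBPS	  5	/* ramp step, to damp oscillation */

static unsigned int limit = LOW_LIMIT_MBPS;

/*
 * Called periodically. While the high prio group stays below its latency
 * target (i.e. looks idle), raise the low prio group's effective limit
 * step by step instead of jumping straight to full bandwidth, so that a
 * single latency spike does not make the limit oscillate between
 * 10 and 100 MB/s.
 */
static void update_limit(bool high_prio_idle)
{
	if (high_prio_idle) {
		if (limit + STEP_MBPS <= MAX_MBPS)
			limit += STEP_MBPS;
	} else {
		limit = LOW_LIMIT_MBPS;	/* downgrade right away */
	}
}

int main(void)
{
	/* toy trace: high prio group idle, then briefly busy, then idle */
	bool idle[] = { true, true, true, false, true, true };

	for (unsigned int i = 0; i < sizeof(idle) / sizeof(idle[0]); i++) {
		update_limit(idle[i]);
		printf("tick %u: low prio limit %u MB/s\n", i, limit);
	}
	return 0;
}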