Date: Tue, 5 Sep 2017 14:02:28 -0700
From: Shaohua Li <shli@kernel.org>
To: Paolo VALENTE
Cc: Shaohua Li, Linux Kernel Mailing List, linux-block, Kernel-team@fb.com, tj@kernel.org, axboe@fb.com, vgoyal@redhat.com
Subject: Re: [PATCH V6 00/18] blk-throttle: add .low limit
Message-ID: <20170905210228.vzwjtg24fbmwfl6y@kernel.org>

On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
>
> > On 15 Jan 2017, at 04:42, Shaohua Li wrote:
> >
> > Hi,
> >
> > cgroup still lacks a good I/O controller. CFQ works well for hard disks,
> > but not so well for SSDs. This patch set tries to add a conservative
> > limit to blk-throttle. It isn't proportional scheduling, but it can help
> > prioritize cgroups. There are several advantages to choosing
> > blk-throttle:
> > - blk-throttle resides early in the block stack. It works for both bio-
> >   and request-based queues.
> > - blk-throttle is lightweight in general. It still takes the queue lock,
> >   but it would not be hard to implement a per-cpu cache and remove the
> >   lock contention.
> > - blk-throttle doesn't use the 'idle disk' mechanism that CFQ/BFQ use.
> >   That mechanism has been shown to harm performance on fast SSDs.
> >
> > The patch set adds a new io.low limit to blk-throttle. It is only for
> > cgroup2. The existing io.max is hard-limit throttling: a cgroup with a
> > max limit never dispatches more IO than its max limit. io.low, by
> > contrast, is best-effort throttling: cgroups with a 'low' limit can run
> > above their 'low' limit at appropriate times. Specifically, if all
> > cgroups reach their 'low' limit, all cgroups can run above their 'low'
> > limit. If any cgroup runs under its 'low' limit, all other cgroups will
> > run according to their 'low' limit. The 'low' limit thus plays two
> > roles: it lets cgroups use free bandwidth, and it protects the bandwidth
> > promised by each cgroup's 'low' limit.
> >
> > An example usage: we have a high prio cgroup with a high 'low' limit and
> > a low prio cgroup with a low 'low' limit. If the high prio cgroup isn't
> > running, the low prio cgroup can run above its 'low' limit, so we don't
> > waste bandwidth. When the high prio cgroup runs and is below its 'low'
> > limit, the low prio cgroup will run under its 'low' limit. This protects
> > the high prio cgroup's access to resources.
> >
>
> Hi Shaohua,

Hi,

sorry for the late response.

> I would like to ask you some questions, to make sure I fully understand
> how the 'low' limit and the idle-group detection work in your scenario
> above. Suppose that: the drive has a random-I/O peak rate of 100 MB/s,
> the high prio group has a 'low' limit of 90 MB/s, and the low prio group
> has a 'low' limit of 10 MB/s.
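
To make this scenario concrete, the two 'low' limits could be set through
the cgroup2 io.low interface along the following lines. This is only a
sketch: the cgroup paths and the 8:0 device number are placeholders, and
the rbps/wbps values are simply 90 MB/s and 10 MB/s expressed in bytes.

#include <stdio.h>

/* Write one settings line into a cgroup's io.low file. */
static int set_low(const char *cgroup, const char *settings)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/io.low", cgroup);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", settings);
	return fclose(f);
}

int main(void)
{
	/* high prio group: 'low' limit of 90 MB/s (90 * 1024 * 1024 bytes) */
	set_low("/sys/fs/cgroup/high", "8:0 rbps=94371840 wbps=94371840");
	/* low prio group: 'low' limit of 10 MB/s */
	set_low("/sys/fs/cgroup/low", "8:0 rbps=10485760 wbps=10485760");
	return 0;
}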
> If
> - the high prio process happens to do, say, only 5 MB/s for a long time,
> - the low prio process constantly does greedy I/O, and
> - the idle-group detection is not being used,
> then the low prio process is limited to 10 MB/s during all this time
> interval, and only 10% of the device bandwidth is utilized.
>
> To recover the lost bandwidth through idle-group detection, we need to
> set a target I/O latency for the high prio group. The high prio group
> should happen to be below the threshold, and thus be detected as idle,
> leaving the low prio group free to use all the bandwidth.
>
> Here are my questions:
> 1) Is all I wrote above correct?

Yes

> 2) In particular, maybe there are other, better mechanisms to saturate
> the bandwidth in the above scenario?

I assume this is covered by 4) below.

> If what I wrote above is correct:
> 3) Doesn't fluctuation occur? I mean: when the low prio group gets full
> bandwidth, the latency threshold of the high prio group may be exceeded,
> causing the high prio group to no longer be considered idle, and thus
> the low prio group to be limited again; this in turn will cause the
> threshold to no longer be exceeded, and so on.

That's true. We try to mitigate the fluctuation by increasing the low
prio cgroup's bandwidth gradually, though.

> 4) Is there a way to compute an appropriate target latency for the high
> prio group, if it is a generic group, for which the latency requirements
> of the processes it contains are only partially known or completely
> unknown? By an appropriate target latency, I mean a target latency that
> enables the framework to fully utilize the device bandwidth while the
> high prio group is doing less I/O than its limit.

Not sure how we can do this. The device's max bandwidth varies with
request size and read/write ratio, so we don't know when the max
bandwidth is reached. Also, I think we must consider the case where the
workloads never use the full bandwidth of a disk, which is pretty common
for SSDs (at least in our environment).

Thanks,
Shaohua
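
As an illustration of the gradual ramp-up mentioned in the answer to 3),
the idea can be modeled in userspace roughly as follows. This is only a
toy sketch of the approach, not the blk-throttle code; all names and
constants are made up.

#include <stdbool.h>
#include <stdio.h>

#define LOW_LIMIT_MBPS	 10	/* low prio group's 'low' limit */
#define MAX_MBPS	100	/* assumed device peak rate */
#define STEP_MBPS	  5	/* ramp step, to damp oscillation */

static unsigned int limit = LOW_LIMIT_MBPS;

/*
 * Called periodically. While the high prio group stays below its latency
 * target (i.e. looks idle), raise the low prio group's effective limit
 * step by step instead of jumping straight to full bandwidth, so that a
 * single latency spike does not make the limit oscillate between
 * 10 and 100 MB/s.
 */
static void update_limit(bool high_prio_idle)
{
	if (high_prio_idle) {
		if (limit + STEP_MBPS <= MAX_MBPS)
			limit += STEP_MBPS;
	} else {
		limit = LOW_LIMIT_MBPS;	/* downgrade right away */
	}
}

int main(void)
{
	/* toy trace: high prio group idle, then briefly busy, then idle */
	bool idle[] = { true, true, true, false, true, true };

	for (unsigned int i = 0; i < sizeof(idle) / sizeof(idle[0]); i++) {
		update_limit(idle[i]);
		printf("tick %u: low prio limit %u MB/s\n", i, limit);
	}
	return 0;
}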