From: Dave Chiluk
Date: Fri, 24 May 2019 10:14:36 -0500
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
To: Phil Auld
Cc: Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups@vger.kernel.org, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc@vger.kernel.org, Ben Segall

On Fri, May 24, 2019 at 9:32 AM Phil Auld wrote:
> On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> > If the machine runs at/close to capacity, won't the overallocation
> > of the quota to bursty tasks necessarily negatively impact every other
> > task?
Should the "unused" quota be available only on idle CPUs? > > (Or maybe this is the behavior achieved here, and only the comment and > > the commit message should be fixed...) > > > > It's bounded by the amount left unused from the previous period. So > theoretically a process could use almost twice its quota. But then it > would have nothing left over in the next period. To repeat it would have > to not use any that next period. Over a longer number of periods it's the > same amount of CPU usage. > > I think that is more fair than throttling a process that has never used > its full quota. > > And it removes complexity. > > Cheers, > Phil Actually it's not even that bad. The overallocation of quota to a bursty task in a period is limited to at most one slice per cpu, and that slice must not have been used in the previous periods. The slice size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and defaults to 5ms. If a bursty task goes from underutilizing quota to using it's entire quota, it will not be able to burst in the subsequent periods. Therefore in an absolute worst case contrived scenario, a bursty task can add at most 5ms to the latency of other threads on the same CPU. I think this worst case 5ms tradeoff is entirely worth it. This does mean that a theoretically a poorly written massively threaded application on an 80 core box, that spreads itself onto 80 cpu run queues, can overutilize it's quota in a period by at most 5ms * 80 CPUs in a sincle period (slice * number of runqueues the application is running on). But that means that each of those threads would have had to not be use their quota in a previous period, and it also means that the application would have to be carefully written to exacerbate this behavior. Additionally if cpu bound threads underutilize a slice of their quota in a period due to the cfs choosing a bursty task to run, they should theoretically be able to make it up in the following periods when the bursty task is unable to "burst". Please be careful here quota and slice are being treated differently. Quota does not roll-over between periods, only slices of quota that has already been allocated to per cpu run queues. If you allocate 100ms of quota per period to an application, but it only spreads onto 3 cpu run queues that means it can in the worst case use 3 x slice size = 15ms in periods following underutilization. So why does this matter. Well applications that use thread pools *(*cough* java *cough*) with lots of tiny little worker threads, tend to spread themselves out onto a lot of run queues. These worker threads grab quota slices in order to run, then rarely use all of their slice (1 or 2ms out of the 5ms). This results in those worker threads starving the main application of quota, and then expiring the remainder of that quota slice on the per-cpu. Going back to my earlier 100ms quota / 80 cpu example. That means only 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a period. So only 20 out of these 80 cpus ever get a slice allocated to them. By allowing these per-cpu run queues to use their remaining slice in following periods these worker threads do not need to be allocated additional slice, and thereby the main threads are actually able to use the allocated cpu quota. This can be experienced by running fibtest available at https://github.com/indeedeng/fibtest/. $ runfibtest 1 runs a single fast thread taskset to cpu 0 $ runfibtest 8 Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset to cpus 1-7. 
The 8-thread run is expected to show fewer iterations, but the bigger problem is that the cpu usage is far less than the 500ms it should have received.

Thanks for the engagement on this,
Dave Chiluk