From: bsegall@google.com
To: Dave Chiluk
Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
    Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond,
    Cong Wang, Jonathan Corbet, linux-doc@vger.kernel.org,
    pjt@google.com
Subject: Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high
 throttling by removing expiration of cpu-local slices
References: <1558121424-2914-1-git-send-email-chiluk+linux@indeed.com>
 <1559156926-31336-1-git-send-email-chiluk+linux@indeed.com>
 <1559156926-31336-2-git-send-email-chiluk+linux@indeed.com>
Date: Thu, 30 May 2019 13:44:55 -0700
In-Reply-To: (Dave Chiluk's message of "Thu, 30 May 2019 12:53:37 -0500")
Dave Chiluk writes:

> On Wed, May 29, 2019 at 02:05:55PM -0700, bsegall@google.com wrote:
>> Dave Chiluk writes:
>>
>> Yeah, having run the test, stranding only 1ms per cpu rather than 5
>> doesn't help if you only have 10ms of quota and even 10 threads/cpus.
>> The slack timer isn't important in this test, though I think it
>> probably should be changed.
>
> My min_cfs_rq_runtime was already set to 1ms.

Yeah, I meant min_cfs_rq_runtime vs the 5ms slice you'd strand if the
slack return were broken.

> Additionally, raising the quota from 10ms to 50ms or even 100ms still
> results in throttling without full quota usage.
>
>> Decreasing min_cfs_rq_runtime helps, but would mean that we have to
>> pull quota more often / always. The worst case here I think is where
>> you run/sleep for ~1ns, so you wind up taking the lock twice every
>> min_cfs_rq_runtime: once for assign and once to return all but min,
>> which you then use up doing short run/sleep. I suppose that
>> determines how much we care about this overhead at all.
>
> I'm not so concerned about how inefficiently the user-space
> application runs, as that's up to the individual developer.

Increasing scheduler overhead is what I was worried about; that's
something we generally try to avoid.

> The fibtest testcase is purely my approximation of what a java
> application with lots of worker threads might do; I didn't have a
> great deterministic java reproducer, and I feared posting java to
> LKML. I'm more concerned with the fact that the user requested
> 10ms/period or 100ms/period and they hit throttling while
> simultaneously not seeing that amount of cpu usage. I.e., on an
> 8-core machine:
>
> $ ./runfibtest 1
> Iterations Completed(M): 1886
> Throttled for: 51
> CPU Usage (msecs) = 507
> $ ./runfibtest 8
> Iterations Completed(M): 1274
> Throttled for: 52
> CPU Usage (msecs) = 380
>
> In the 8-core case, where we have 7 do-nothing threads on cpus 1-7,
> we see only 380ms of usage and 52 periods of throttling, when we
> should have received ~500ms of cpu usage.
>
> Looking more closely at the __return_cfs_rq_runtime logic, I noticed
>
>	if (cfs_b->quota != RUNTIME_INF &&
>	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
>
> which is awfully similar to the logic that was fixed by 512ac999. Is
> it possible that we are just never returning runtime back to the
> cfs_b because of the runtime_expires comparison here?

The relevant issue that patch fixed is that the old conditional was
backwards. Also, lowering min_cfs_rq_runtime to 0 fixes your testcase,
so the return path is working.

>> Removing expiration means that in the worst case period and quota
>> can be effectively twice what the user specified, but only on very
>> particular workloads.
>
> I'm only removing expiration of slices that have already been
> assigned to individual cfs_rqs. My understanding is that there is at
> most one cfs_rq per cpu, and each of those can have at most one slice
> of available runtime. So the worst-case burst is slice_ms * cpus.
> Please help me understand how you get to twice the user-specified
> quota and period, as it's not obvious to me (I've only been looking
> at this for a few months).

The reason this effect is so significant is that slice_ms * cpus is
roughly 100% of the quota, so yes, it's roughly the same thing.
Unfortunately, if there are more spare cpus on the system, just
doubling quota and period (keeping the same ratio) would not fix your
issue, while removing expiration does, while also potentially having
that doubling effect.
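To put rough numbers on that (assuming the default bandwidth slice,
sched_cfs_bandwidth_slice_us = 5000, i.e. 5ms): on your 8-core machine,
up to 5ms * 8 = 40ms of quota can sit stranded on per-cpu cfs_rqs at
once. That is 80% of a 50ms quota, and four periods' worth of a 10ms
quota, which is consistent with raising the quota alone not making the
throttling go away.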
>> I think we should at least think about instead lowering
>> min_cfs_rq_runtime to some smaller value.
>
> Do you mean lower than 1ms?

Yes
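For reference, the return path in question looks roughly like this (a
simplified paraphrase of __return_cfs_rq_runtime from
kernel/sched/fair.c of this era, before the expiration removal; not a
verbatim quote):

	/*
	 * Paraphrased sketch: on dequeue, a cfs_rq gives everything
	 * above min_cfs_rq_runtime back to the global pool, but only
	 * when its local slice has the same expiration as the pool.
	 */
	static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
	{
		struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
		s64 slack_runtime = cfs_rq->runtime_remaining -
				    min_cfs_rq_runtime;

		if (slack_runtime <= 0)
			return;

		raw_spin_lock(&cfs_b->lock);
		if (cfs_b->quota != RUNTIME_INF &&
		    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
			cfs_b->runtime += slack_runtime;

			/* under rq->lock; defer unthrottle to the slack timer */
			if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
			    !list_empty(&cfs_b->throttled_cfs_rq))
				start_cfs_bandwidth(cfs_b);
		}
		raw_spin_unlock(&cfs_b->lock);

		/* even if not returned, don't keep the slack locally */
		cfs_rq->runtime_remaining -= slack_runtime;
	}

This is the comparison Dave quoted above: if the local slice's
runtime_expires doesn't match the global pool's, the slack is dropped
locally rather than returned to cfs_b.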