From: Dave Chiluk
Date: Mon, 18 Mar 2019 09:59:32 -0500
Subject: Cgroup cpu throttling with low cpu usage of multi-threaded applications on high-core count machines
To: Peter Zijlstra, Ingo Molnar, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang

We are seeing a high amount of cgroup cpu throttling, as measured by
nr_throttled/nr_periods, while also seeing low cpu usage when running
highly threaded applications on high core-count machines. In
particular we are seeing this with "thread pool" design pattern
applications that are run on kubernetes with hard cpu limits. We've
seen similar issues on other microservice cloud architectures that
use cgroup cpu constraints. Most of the advice out there for this
problem is either to over-commit cpu, which is wasteful, or to turn
off hard limits and rely on the cpu_shares mechanism instead.
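For reference, here is a minimal sketch of one way to compute that
throttling ratio from the cpu controller's cpu.stat file. The cgroup
path and group name ("mygroup") are placeholders and assume a cgroup
v1 cpu hierarchy; cgroup v2 exposes similar counters in its own
cpu.stat with slightly different field names.

/* throttle_ratio.c: read nr_periods/nr_throttled from a cgroup's
 * cpu.stat and print the fraction of periods that were throttled.
 * Build: gcc -o throttle_ratio throttle_ratio.c
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* placeholder path; adjust for your cgroup mount and group name */
        const char *path = "/sys/fs/cgroup/cpu/mygroup/cpu.stat";
        FILE *f = fopen(path, "r");
        char key[64];
        long long val, nr_periods = 0, nr_throttled = 0, throttled_time = 0;

        if (!f) {
                perror(path);
                return 1;
        }
        while (fscanf(f, "%63s %lld", key, &val) == 2) {
                if (!strcmp(key, "nr_periods"))
                        nr_periods = val;
                else if (!strcmp(key, "nr_throttled"))
                        nr_throttled = val;
                else if (!strcmp(key, "throttled_time"))
                        throttled_time = val;
        }
        fclose(f);

        if (nr_periods)
                printf("throttled %lld of %lld periods (%.1f%%), "
                       "throttled_time %lld ns\n",
                       nr_throttled, nr_periods,
                       100.0 * nr_throttled / nr_periods, throttled_time);
        return 0;
}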
We've root caused this to cfs bandwidth slices being allocated to
runqueues that own threads that do very little work. This results in
the primary "fast" worker threads being starved for runtime and
throttled, while the runtime allocated to the cfs_rq's of the less
productive threads goes unused. Eventually the time slices on the
less productive threads expire, wasting cpu quota. The issue is
exacerbated even further as you move from 8-core to 80-core machines,
as slices are allocated to, and left unused on, that many more
cfs_rq's. On an 80-core machine, handing the default time slice (5ms)
to each cfs_rq requires 400ms of quota per 100ms period, i.e. 4 CPUs
worth of quota, simply to allow each cfs_rq to hold a single slice.
In reality tasks rarely get spread out to every core like this, but
that is the worst-case scenario. This is also why we saw a
performance regression when moving from older 46-core machines to
newer 80-core machines. Now that the world is moving to
micro-services architectures such as kubernetes, more and more
applications are being run with cgroup cpu constraints like this.

I have created an artificial C testcase that reproduces the problem
and have posted it at https://github.com/indeedeng/fibtest

I have used that testcase to identify 512ac99 as the source of this
performance regression. However, as far as I can tell, 512ac99 is
technically a correct patch. What was happening before it is that the
runtime on each cfs_rq would almost never be expired, because the
following conditional would almost always be true:

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
...
        if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
                /* extend local deadline, drift is bounded above by 2 ticks */
                cfs_rq->runtime_expires += TICK_NSEC;
...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I verified this by adding a variable to the cfs_bandwidth structure
that counted, in the else clause of that if, all of the
runtime_remaining that would have been expired. The else clause was
never hit pre-512ac99 on my test machines, whereas after 512ac99 lots
of runtime would be expired. I understand that this experience
differs from that of the submitters of 512ac99, and I suspect there
may be some architecture or configuration difference at play there.
Looking back at commit 51f2176d, which introduced the above logic,
this behavior appears to have existed since 3.16.

Cong Wang submitted a patch that happens to work around this by
implementing bursting based on idle time:
https://lore.kernel.org/patchwork/patch/907450/. However beneficial,
that patch is orthogonal to the root cause of this problem, but I
wanted to mention it.
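To make the reproduction concrete, below is a rough sketch of the
same idea as the fibtest testcase mentioned above (illustrative only,
not the code from that repository): one "fast" thread spins
continuously while a handful of mostly idle "slow" threads wake up
briefly on other cpus, pulling bandwidth slices onto cfs_rq's that
leave most of that runtime unused. Run it inside a cpu cgroup with a
hard limit (the setup commands in the comment are for a cgroup v1
hierarchy and are just an example); on an affected kernel you would
expect cpu usage to stay well below the quota while nr_throttled
climbs.

/* repro_sketch.c: one always-busy thread plus several mostly-idle
 * threads, pinned to different cpus so that quota gets sliced across
 * many per-cpu runqueues.
 *
 * Example cgroup v1 setup (paths/values are illustrative):
 *   mkdir /sys/fs/cgroup/cpu/repro
 *   echo 100000 > /sys/fs/cgroup/cpu/repro/cpu.cfs_period_us
 *   echo  50000 > /sys/fs/cgroup/cpu/repro/cpu.cfs_quota_us
 *   echo $$     > /sys/fs/cgroup/cpu/repro/cgroup.procs
 *
 * Build: gcc -O2 -pthread -o repro_sketch repro_sketch.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

#define NR_SLOW 7       /* mostly-idle worker threads */

static void pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* best effort; silently ignored if the cpu doesn't exist */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* The "fast" worker: always has work to do. */
static void *fast_worker(void *arg)
{
        volatile unsigned long long x = 0;

        (void)arg;
        pin_to_cpu(0);
        for (;;)
                x += x * x + 1;
        return NULL;
}

/* A "slow" worker: wakes up, does a little work, goes back to sleep. */
static void *slow_worker(void *arg)
{
        volatile unsigned long long x = 0;
        int cpu = (int)(long)arg;
        int i;

        pin_to_cpu(cpu);
        for (;;) {
                for (i = 0; i < 10000; i++)
                        x += x * x + 1;
                usleep(10000);  /* ~10ms idle between wakeups */
        }
        return NULL;
}

int main(void)
{
        pthread_t t;
        long i;

        pthread_create(&t, NULL, fast_worker, NULL);
        for (i = 0; i < NR_SLOW; i++)
                pthread_create(&t, NULL, slow_worker, (void *)(i + 1));
        pause();        /* run until killed */
        return 0;
}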
So my question is, what should be done?

1. Make the expiration time of a time slice configurable, with the
default set to INF to match the behavior of the kernel as it existed
in v3.16..v4.18-rc4.

2. Another option, which should work afaik, would be to remove all of
the cfs_bandwidth time slice expiration logic, as time slices
naturally expire as they get used anyway. This is actually my
preferred course of action, as it is the most performant and it
removes what appears to be some very hardware-sensitive logic.
Additionally, the hard limit description still holds true, albeit not
strictly per accounting period. However, since no one has complained
about that over the 5 years it was broken, I think it's pretty safe
to assume that very few people are actually watching that carefully.
Instead, it's a much worse user experience when you ask for .5 cpu
and are only able to use .1 of it while being throttled because of
time slice expiration.

Thank you,
Dave Chiluk

p.s. I've copied representatives of Netflix and Yelp as well, as we
were talking about this at the Scale 17x conference. There we
discovered that we had all individually hit this issue and had it on
our roadmaps to fix.