From: Josh Don
Date: Wed, 16 Jun 2021 18:01:59 -0700
Subject: Re: [PATCH] sched: cgroup SCHED_IDLE support
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Paul Turner, David Rientjes,
    Oleg Rombakh, Viresh Kumar, Steve Sistare, linux-kernel,
    Rik van Riel
References: <20210608231132.32012-1-joshdon@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hey Tejun,

Thanks for taking a look.

On Wed, Jun 16, 2021 at 8:42 AM Tejun Heo wrote:
>
> Hello,
>
> On Tue, Jun 08, 2021 at 04:11:32PM -0700, Josh Don wrote:
> > This extends SCHED_IDLE to cgroups.
> >
> > Interface: cgroup/cpu.idle.
> >  0: default behavior
> >  1: SCHED_IDLE
> >
> > Extending SCHED_IDLE to cgroups means that we incorporate the existing
> > aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its
> > descendant threads towards the idle_h_nr_running count of all of its
> > ancestor cgroups. Thus, sched_idle_rq() will work properly.
> > Additionally, SCHED_IDLE cgroups are configured with minimum weight.
> >
> > There are two key differences between the per-task and per-cgroup
> > SCHED_IDLE interfaces:
> >
> > - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to
> >   maintain their relative weights. The entity that is "idle" is the
> >   cgroup, not the tasks themselves.
> >
> > - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption
> >   decision is not made by comparing the current task with the woken task,
> >   but rather by comparing their matching sched_entity.
> >
> > A typical use-case for this is a user that creates an idle and a
> > non-idle subtree. The non-idle subtree will dominate competition vs
> > the idle subtree, but the idle subtree will still be high priority
> > vs other users on the system. The latter is accomplished via comparing
> > matching sched_entity in the wakeup preemption path (this could also be
> > improved by making the sched_idle_rq() decision dependent on the
> > perspective of a specific task).
>
> A high-level problem that I see with the proposal is that this would bake
> the current recursive implementation into the interface. The semantics of
> the currently exposed interface, at least the weight-based part, are abstract
> and don't necessarily dictate how the scheduling is actually performed.
> Adding this would mean that we're now codifying the current behavior of
> fully nested scheduling into the interface.
>
> There are several practical challenges with the current implementation
> caused by the full nesting - e.g. nesting levels are expensive for
> context-switch-heavy applications, often costing over 1% per level, and
> heuristics which assume a global queue may behave unexpectedly - i.e. we
> can create conditions where things like idle-wakeup boost behave very
> differently depending on whether tasks are inside a cgroup or not, even
> when the eventual relative weights and past usages are similar.

To be clear, are you talking mainly about the idle_h_nr_running accounting?
The fact that setting cpu.idle propagates changes upwards to ancestor
cgroups? The goal is to match the SCHED_IDLE semantics here, which
requires this behavior. The end result is no different than a nested
cgroup with SCHED_IDLE tasks.

One possible alternative would be to only count root-level cpu.idle
cgroups towards the overall idle_h_nr_running. I've not opted for that,
in order to make this work more similarly to the task interface. Also
note that there isn't additional overhead from the accounting, since
enqueue/dequeue already traverses these nesting levels.

Could you elaborate a bit more on your statement here?

> Can you please give more details on why this is beneficial? Is the benefit
> mostly around making configuration easy or are there actual scheduling
> behaviors that you can't achieve otherwise?

There are actual scheduling behaviors that cannot be achieved without
the cgroup interface. The task SCHED_IDLE mechanism is a combination of
special SCHED_IDLE handling (idle_h_nr_running, check_preempt_wakeup)
and minimum scheduling weight.

Consider a tree like

        root
       /    \
      A      C
     / \     |
    B  idle  t4
    |   | \
   t1  t2  t3

Here, 'idle' is our cpu.idle cgroup. The following properties would not
be possible if we moved t2/t3 into SCHED_IDLE without the cgroup
interface:

- t1 always preempts t2/t3 on wakeup, but t4 does not.

- t2 and t3 have different, non-minimum weights. Technically we could
  also achieve this by adding another layer of nested cgroups, but that
  starts to make the hierarchy much more complex.

- I've also discussed with Peter a possible extension (vruntime
  adjustments) to the current SCHED_IDLE semantics. Similarly to the
  first bullet here, we'd need a cgroup idle toggle to achieve certain
  scheduling behaviors with this.

Thanks,
Josh
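P.S. For concreteness, the example tree above could be built through the
cgroup v2 filesystem roughly as follows. This is only a sketch: it assumes
cgroup2 is mounted at /sys/fs/cgroup, root privileges, and a kernel carrying
this patch (which adds the cpu.idle file); the $Tn_PID variables are
placeholders for the tasks t1..t4.

```shell
cd /sys/fs/cgroup
echo "+cpu" > cgroup.subtree_control     # enable the cpu controller for children

mkdir -p A/B A/idle C
echo "+cpu" > A/cgroup.subtree_control   # A gets child groups B and idle

# Mark the whole 'idle' subtree as SCHED_IDLE via the proposed interface.
echo 1 > A/idle/cpu.idle

# Place the tasks (placeholder PIDs; substitute real ones).
echo "$T1_PID" > A/B/cgroup.procs
echo "$T2_PID" > A/idle/cgroup.procs
echo "$T3_PID" > A/idle/cgroup.procs
echo "$T4_PID" > C/cgroup.procs
```

Inside A/idle, t2 and t3 keep their relative weights (e.g. as set via
nice), while the subtree as a whole competes at minimum weight against
A/B, so t1 preempts them on wakeup but t4 is unaffected.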