From: Benjamin Segall
To: Valentin Schneider
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Ingo Molnar, Juri Lelli,
 Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
 Daniel Bristot de Oliveira, Phil Auld, Clark Williams, Tomas Glozar
Subject: Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace
In-Reply-To: (Valentin Schneider's message of "Mon, 18 Dec 2023 19:07:12 +0100")
References: <20231130161245.3894682-1-vschneid@redhat.com>
 <20231130161245.3894682-2-vschneid@redhat.com>
Date: Mon, 18 Dec 2023 14:34:38 -0800

Valentin Schneider writes:

> On 07/12/23 15:47, Benjamin Segall wrote:
>> Alright, this took longer than I intended, but here's the basic version
>> with the duplicate "runqueue" just being a list rather than actual EEVDF
>> or anything sensible.
>>
>> I also wasn't actually able to adapt it to work via task_work like I
>> thought I could; flagging the entry to the relevant schedule() calls is
>> too critical.
>>
>
> Thanks for sharing this! Here's a first set of comments.
>
>> --------
>>
>> Subject: [RFC PATCH] sched/fair: only throttle CFS tasks in userspace
>>
>> The basic idea of this implementation is to maintain duplicate runqueues
>> in each cfs_rq that contain duplicate pointers to the sched_entitys which
>> should bypass throttling. Then we can skip throttling cfs_rqs that have
>> any such children, and when we pick inside any not-actually-throttled
>> cfs_rq, we only look at this duplicated list.
>>
>> "Which tasks should bypass throttling" here is "all schedule() calls
>> that don't set a special flag", but could instead involve the lockdep
>> markers (except for the problem of percpu-rwsem and similar) or explicit
>> flags around syscalls and faults, or something else.
>>
>> This approach avoids any O(tasks) loops, but leaves partially-throttled
>> cfs_rqs still contributing their full h_nr_running to their parents,
>> which might result in worse balancing. Also it adds more (generally
>> still small) overhead to the common enqueue/dequeue/pick paths.
>>
>> The very basic debug test added is to run a cpusoaker and "cat
>> /sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the same
>> cgroup with a quota < 1 cpu.
>
> So if I'm trying to compare notes:
>
> The dual tree:
> - Can actually throttle cfs_rq's once all tasks are in userspace
> - Adds (light-ish) overhead to enqueue/dequeue
> - Doesn't keep *nr_running updated when not fully throttled
>
> Whereas dequeue via task_work:
> - Can never fully throttle a cfs_rq
> - Adds (not so light) overhead to unthrottle
> - Keeps *nr_running updated

Yeah, this sounds right. (Though for the task_work version, once all the
tasks are throttled, the cfs_rq is dequeued like normal, so it doesn't
seem like a big deal to me.)

> My thoughts right now are that there might be a way to mix both worlds: go
> with the dual tree, but subtract from h_nr_running at dequeue_kernel()
> (sort of doing the count update in throttle_cfs_rq() early) and keep track
> of in-user tasks via handle_kernel_task_prev() to add this back when
> unthrottling a not-fully-throttled cfs_rq. I need to actually think about
> it though :-)

Probably, but I had tons of trouble getting the accounting correct as it
was :P

>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +	/*
>> +	 * TODO: This might trigger, I'm not sure/don't remember.
>> +	 * Regardless, while we do not explicitly handle the case where
>> +	 * h_kernel_running goes to 0, we will call
>> +	 * account/check_cfs_rq_runtime at worst in entity_tick and notice
>> +	 * that we can now properly do the full throttle_cfs_rq.
>> +	 */
>> +	WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));
>
> FWIW this triggers pretty much immediately in my environment (buildroot
> over QEMU).

Good to know; I feel like this should be avoidable (and definitely should
be avoided if it can be), but I might be forgetting some case that's
mostly unavoidable. I'll have to poke at it when I have more time.

>> @@ -8316,15 +8471,44 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>
>>  preempt:
>>  	resched_curr(rq);
>>  }
>>
>> +static void handle_kernel_task_prev(struct task_struct *prev)
>> +{
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +	struct sched_entity *se = &prev->se;
>> +	bool new_kernel = is_kernel_task(prev);
>> +	bool prev_kernel = !list_empty(&se->kernel_node);
>> +	/*
>> +	 * These extra loops are bad and against the whole point of the merged
>> +	 * PNT, but it's a pain to merge, particularly since we want it to occur
>> +	 * before check_cfs_runtime.
>> +	 */
>> +	if (prev_kernel && !new_kernel) {
>> +		WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
>> +		for_each_sched_entity(se) {
>> +			dequeue_kernel(cfs_rq_of(se), se, 1);
>> +			if (cfs_rq_throttled(cfs_rq_of(se)))
>> +				break;
>> +		}
>> +	} else if (!prev_kernel && new_kernel && se->on_rq) {
>> +		for_each_sched_entity(se) {
>> +			enqueue_kernel(cfs_rq_of(se), se, 1);
>> +			if (cfs_rq_throttled(cfs_rq_of(se)))
>> +				break;
>> +		}
>> +	}
>> +#endif
>> +}
>
> I'm trying to grok what the implications of a second tasks_timeline would
> be on picking - we'd want an RBtree update equivalent to put_prev_entity()'s
> __enqueue_entity(), but that second tree doesn't follow the same
> enqueue/dequeue cycle as the first one.
> Would a __dequeue_entity() + __enqueue_entity() to force a rebalance make
> sense? That'd also take care of updating the min_vruntime.

Yeah, the most sensible thing to do would probably be to just have curr
not be in the queue, the same as with the normal rbtree, and save the "on
kernel list" state as an on_rq-equivalent bool instead of as
!list_empty(kernel_node).