From: Vincent Guittot
Date: Mon, 20 Jul 2020 17:20:44 +0200
Subject: Re: [SchedulerTaskPacking] Small background task packing
To: Parth Shah
Cc: Patrick Bellasi, LKML, Ingo Molnar, "Peter Zijlstra (Intel)", Juri Lelli, Paul Turner, Ben Segall, Thomas Gleixner, Jonathan Corbet, Dhaval Giani, Dietmar Eggemann, Josef Bacik, Chris Hyser

Hi Parth,

On Thu, 9 Jul 2020 at 14:09, Parth Shah wrote:
>
> A) Name:
>
> Small background task packing
>
>
> B) Target behaviour:
>
> Every fair-class task wakeup follows the same procedure: find an idle CPU
> and wake the task on it. There are two major wakeup paths:
> 1. Slow path: wake the task on the idle CPU in the shallowest idle state,
>    found by searching the highest sched_domain flagged with
>    SD_BALANCE_FORK or SD_BALANCE_EXEC.
> 2. Fast path: wake the task on an idle CPU in the LLC sched_domain of the
>    waking CPU. There are optimizations that bias placement towards the
>    task's prev_cpu or wake_cpu. This path is used in most cases, the main
>    exceptions being fork() and exec().
>
> Always picking an idle CPU is a fair way to spread load uniformly across
> system resources, but not every task deserves to wake up an idle core,
> because doing so can hurt power consumption -- for example, a small
> background task that wakes an idle core and then runs only for a very
> short time. Reducing the number of active cores lets the busier cores
> boost their frequency, so the power saving can also turn into a
> performance benefit.
>
> The existing wakeup path has no mechanism to detect small background
> tasks that could be packed onto fewer cores.
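
Just to make the target workload concrete, here is a toy userspace model of
such a task -- not part of the proposal, purely illustrative: it does about
1 ms of work every 100 ms (roughly 1% utilisation), so waking a whole idle
core for it gains little while costing package power.

    /* Toy model of a "small background task": ~1 ms of work every 100 ms. */
    #include <time.h>
    #include <unistd.h>

    static void busy_for_1ms(void)
    {
            struct timespec start, now;

            clock_gettime(CLOCK_MONOTONIC, &start);
            do {
                    clock_gettime(CLOCK_MONOTONIC, &now);
            } while ((now.tv_sec - start.tv_sec) * 1000000000L +
                     (now.tv_nsec - start.tv_nsec) < 1000000L);
    }

    int main(void)
    {
            for (;;) {
                    busy_for_1ms();         /* short burst of work */
                    usleep(100 * 1000);     /* then sleep for 100 ms */
            }
    }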
>
>
> C) Existing control paths:
>
> fair:: .select_task_rq = select_task_rq_fair
>
> fair::select_task_rq_fair()
>     // 1) Slow path: find the idle CPU in the shallowest idle state
>     find_idlest_cpu()
>
>     // 2) Fast path: find an idle CPU
>     fair::select_idle_sibling()
>         // Wake up on an idle core if one is available
>         fair::select_idle_core()
>         // Else wake up on an idle CPU if one is available
>         fair::select_idle_cpu()
>         fair::select_idle_smt()
>
> There are multiple ways to reach fair::select_task_rq_fair():
>
> // 1) try_to_wake_up, which should lead to the fast path
> core::try_to_wake_up()
>     cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>
> // 2) wake_up_new_task, which should lead to the slow path
> core::wake_up_new_task()
>     __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>
> // 3) sched_exec, which should lead to the slow path
> core::sched_exec()
>     dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p),
>                                               SD_BALANCE_EXEC, 0);
>
>
> D) Desired behaviour:
>
> Latency-tolerant tasks with small utilization should be packed onto an
> already busy core rather than waking up a new core/CPU.
>
> Upon detecting a small background task, a different code path can be used
> to search for a busy core, as described below:
>
> sched/fair.c:
> static inline bool is_small_bg_task(struct task_struct *p)
> {
>         if (is_latency_tolerant(p) &&
>             (task_util(p) < (SCHED_CAPACITY_SCALE >> 3)))

This condition seems quite arbitrary: it probably works on your current
platform, but other platforms might want a different threshold. Is there a
way to know how much utilization can be added to a CPU without using more
capacity than the extra capacity provided by turbo mode?

Also, you may want to use task_util_est() instead of task_util() to make
sure the small background task has little chance of growing into a large
one and ending up using more capacity than turbo provides.

>                 return true;
>
>         return false;
> }
>
> sched/fair.c: in select_task_rq_fair()
>
>         if (sd_flag & SD_BALANCE_WAKE) {
>                 if (is_small_bg_task(p)) {
>                         // New proposed wakeup path to do task packing
>                         new_cpu = select_non_idle_core(p, prev_cpu);
>                         if (new_cpu >= 0)
>                                 return new_cpu;
>                 }
>         }
>
> where select_non_idle_core() searches for a busy core that is already
> running some tasks and selects an idle CPU on that core to pack the
> waking task onto.
>
> Complete code can be found in the TurboSched patches [1].
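
To make the earlier point about the threshold and task_util_est() concrete,
here is a minimal sketch -- not part of the TurboSched patches -- of what
the check could look like. sysctl_sched_turbo_headroom is a hypothetical
per-platform tunable expressing, in SCHED_CAPACITY_SCALE units, how much
extra capacity turbo mode is expected to provide; is_latency_tolerant() is
the latency_nice-based helper from the series.

    /* Hypothetical per-platform tunable: extra capacity expected from turbo. */
    static unsigned int sysctl_sched_turbo_headroom = SCHED_CAPACITY_SCALE >> 3;

    static inline bool is_small_bg_task(struct task_struct *p)
    {
            /*
             * Use the utilization estimate so a task that has recently been
             * large is not mistaken for a small one.
             */
            return is_latency_tolerant(p) &&
                   task_util_est(p) < sysctl_sched_turbo_headroom;
    }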
>
>
> E) Existing knobs (if any): reference to whatever existing tunable
>
> There is no existing knob that can hint the scheduler about the latency
> nature of a task. Detecting the latency nature of a task would help
> classify it as a small background task that can be packed onto fewer
> cores.
>
> There are user-space hacks to pack background tasks, using cpuset.cpus or
> sched_setaffinity() to manually affine the process to a few dedicated
> cores.
>
>
> F) Existing knobs (if any): one paragraph description of the limitations
>
> The sysadmin/user has to define a cpumask for each process (task pinning),
> which is static in nature. Pinning small background tasks has several
> limitations:
>
> - With just a single small background task there is no power-consumption
>   benefit; it would be preferable to pin it to an already busy CPU.
>
> - If a task changes its behaviour during its life-cycle, the sysadmin has
>   to manually pin/unpin it. The user can only classify a task as
>   "background"; they cannot say if and when it will also be "small" in
>   utilization.
>
> - The sysadmin cannot adjust the task's affinity mask to follow the
>   scheduling pattern and so obtain the best performance and energy
>   consumption.
>
>
> G) Proportionality Analysis: check the nature of the target behavior
>
> Task packing has to be non-invasive, meaning that only tasks which are
> latency tolerant should be packed onto fewer cores. Providing this
> behaviour needs a run-time tunable knob that can hint the scheduler as to
> whether the waking task may be packed or not.
>
> Depending on the nature of the task, a specific wakeup path is followed:
> 1. For latency-tolerant tasks with low utilization, the newly proposed
>    wakeup path is followed to do the packing.
> 2. For latency-sensitive tasks, the existing wakeup approach is used.
>
>
> H) Range Analysis: identify meaningful ranges
>
> The new knob can be a binary input accepting 0/1, where 0 means
> latency-sensitive and 1 means latency-tolerant.
>
> Latency-sensitive tasks (value = 0) search for an idle CPU only in the
> LLC sched_domain, while latency-tolerant tasks (value = 1) may go beyond
> the LLC sched_domain (just like the slow path), but here in order to
> search for a busy core.
>
> Mapping analysis:
> =================
> The latency_nice proposal [2] offers a [-20, 19] range which can be
> mapped onto a binary switch, e.g. using a threshold-based function.
>
> However, it is also possible to extend the busy-core search by limiting
> the time spent searching based on the latency_nice value in [-20, 19]: a
> value of 19 means searching the whole chip for a busy core, whereas a
> value of 10 could mean searching half of the cores in the chip.
>
>
> I) System-Wide tuning: which knobs are required
>
> The knobs provided by latency_nice should be enough to get the desired
> effect.
>
>
> J) Per-Task tuning: which knobs are required
>
> sched_setscheduler() is sufficient.
>
>
> K) Task-Group tuning: which knobs are required
>
> A single attribute classifying a task-group as latency_nice or
> latency_tolerant is sufficient.
>
>
> .:: References
> ==============
>
> [1] [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies
>     for longer durations
>     https://lkml.org/lkml/2020/1/21/39
> [2] [PATCH v5 0/4] Introduce per-task latency_nice for scheduler hints
>     Message-ID: 20200228090755.22829-1-parth@linux.ibm.com
>     https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com
> [3] TurboSched: the return of small-task packing
>     https://lwn.net/Articles/792471/
>     https://www.phoronix.com/scan.php?page=news_item&px=Linux-TurboSched-V4
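
One more thought on the mapping analysis under (H): the relation between
latency_nice and how widely the busy-core search may go could be as simple
as the toy helper below. This only illustrates the idea described above; it
is not code from the patches, and packing_search_width() is a hypothetical
name.

    /*
     * Toy mapping from latency_nice [-20, 19] to how many cores the
     * busy-core search may visit: 19 scans the whole chip, 10 roughly
     * half of it, and values <= 0 skip the packing path entirely.
     */
    static inline int packing_search_width(int latency_nice, int nr_cores)
    {
            if (latency_nice <= 0)
                    return 0;                       /* latency-sensitive: no packing */

            return nr_cores * latency_nice / 19;    /* 19 -> whole chip */
    }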