Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc
To: Steven Sistare, mingo@redhat.com, peterz@infradead.org
Cc: subhra.mazumdar@oracle.com, dhaval.giani@oracle.com, rohit.k.jain@oracle.com,
    daniel.m.jordan@oracle.com, pavel.tatashin@microsoft.com, matt@codeblueprint.co.uk,
    umgwanakikbuti@gmail.com, riel@redhat.com, jbacik@fb.com, juri.lelli@redhat.com,
    linux-kernel@vger.kernel.org
References: <1540220381-424433-1-git-send-email-steven.sistare@oracle.com>
            <1540220381-424433-8-git-send-email-steven.sistare@oracle.com>
From: Valentin Schneider
Date: Mon, 29 Oct 2018 19:34:50 +0000
On 26/10/2018 19:28, Steven Sistare wrote:
> On 10/26/2018 2:04 PM, Valentin Schneider wrote:
[...]
>>
>> I was thinking that perhaps we could have scenarios where some rq's
>> keep stealing tasks off of each other and we end up circulating tasks
>> between CPUs. Now, that would only happen if we had a handful of tasks
>> with a very tiny period, and I'm not aware of any (real) hyperactive
>> workloads similar to those generated by hackbench where that could happen.
>
> That will not happen with the current code, as it only steals if nr_running > 1.
> The src loses a task, the dst gains it and has nr_running == 1, so it will not
> be re-stolen.
>

That's indeed fine, I was thinking of something like this:

Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2). As soon as the
task on r_1 ends/blocks, we'll go through idle balancing and can potentially
steal the non-running task from r_2. Sometime later, the task that was
running on r_1 wakes up, and we end up with r_1->nr_running == 2 and
r_2->nr_running == 1. IOW, we've swapped their roles in that example, and
the whole thing can repeat.

The shorter the period of those tasks, the more often we'll migrate them
between rq's, which is why I wonder if we shouldn't have some sort of
throttling.

> If we modify the code to handle misfits, we may steal when src nr_running == 1,
> but a fast CPU will only steal the lone task from a slow one, never fast from fast,
> and never slow from fast, so no tug of war.
>
>> In short, I wonder if we should have task_hot() in there. Drawing a
>> parallel with load_balance(), even if load-balancing is happening between
>> rqs of the same LLC, we do go check task_hot().
>> Have you already experimented with adding a task_hot() check in here?
>
> I tried task_hot, to see if L1/L2 cache warmth matters much on L1/L2/L3 systems,
> and it reduced steals and overall performance.
>

Mmm, so task_hot() mainly implements two mechanisms: the CACHE_HOT_BUDDY
sched feature and the exec_start threshold.

The first one should be sidestepped in the stealing case since we won't pass
(if env->dst_rq->nr_running), which leaves us with the threshold. We might
want to sidestep it when we are doing balancing within an LLC domain
(env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold in such
cases.

In any case, I think it would make sense to add some LLC conditions to
task_hot() so that:
- regular load_balance() can also benefit from them
- task stealing has at least some sort of throttling

On a sidenote, I find it a bit odd that the exec_start threshold depends on
sysctl_sched_migration_cost, which to me is more about idle_balance() cost
than about "how long does it take for a previously-run task to go cache cold".

>> I've run some iterations of hackbench (hackbench 2 process 100000) to
>> investigate this task bouncing, but I didn't really see any of it. That was
>> just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
>> with more CPUs.
>>
>> ----->8-----
>>
>> activations: # of task activations (task starts running)
>> cpu_migrations: # of activations where cpu != prev_cpu
>> % stats are percentiles
>>
>> - STEAL:
>>
>> | stat  | cpu_migrations | activations |
>> |-------+----------------+-------------|
>> | count |    2005.000000 | 2005.000000 |
>> | mean  |      16.244888 |  290.608479 |
>> | std   |      38.963138 |  253.003528 |
>> | min   |       0.000000 |    3.000000 |
>> | 50%   |       3.000000 |  239.000000 |
>> | 75%   |       8.000000 |  436.000000 |
>> | 90%   |      45.000000 |  626.000000 |
>> | 99%   |     188.960000 | 1073.000000 |
>> | max   |     369.000000 | 1417.000000 |
>>
>> - NO_STEAL:
>>
>> | stat  | cpu_migrations | activations |
>> |-------+----------------+-------------|
>> | count |    2005.000000 | 2005.000000 |
>> | mean  |      15.260848 |  297.860848 |
>> | std   |      46.331890 |  253.210813 |
>> | min   |       0.000000 |    3.000000 |
>> | 50%   |       3.000000 |  252.000000 |
>> | 75%   |       7.000000 |  444.000000 |
>> | 90%   |      32.600000 |  643.600000 |
>> | 99%   |     214.880000 | 1127.520000 |
>> | max   |     467.000000 | 1547.000000 |
>>
>> ----->8-----
>>
>> Otherwise, my only other concern at the moment is that since stealing
>> doesn't care about load, we could steal a task that would cause a big
>> imbalance, which wouldn't have happened with a call to load_balance().
>>
>> I don't think this can be triggered with a symmetrical workload like
>> hackbench, so I'll go explore something else.
>
> The dst is about to go idle with zero load, so stealing can only improve the
> instantaneous balance between src and dst. For longer term average load, we
> still rely on periodic load_balance to make adjustments.
>

Right, so my line of thinking was that by not doing a load_balance() and
taking a shortcut (stealing a task), we may end up just postponing a
load_balance() to after we've stolen a task. I guess in those cases there's
no magic trick to be found and we just have to deal with it.

And then there's some of the logic like we have in update_sd_pick_busiest(),
where we e.g.
try to prevent misfit tasks from running on LITTLEs, but then if such tasks
are waiting to be run and a LITTLE frees itself up, I *think* it's okay to
steal it.

> All good questions, keep them coming.
>
> - Steve
>