Subject: Re: [RFC 00/60] Coscheduling for Linux
To: Jan H. Schönherr, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel@vger.kernel.org
References: <20180907214047.26914-1-jschoenh@amazon.de>
 <3336974a-38f7-41dd-25a7-df05e077444f@oracle.com>
 <90282ce3-dd14-73dc-fb9f-e78bb4042221@amazon.de>
From: Subhra Mazumdar
Date: Wed, 19 Sep 2018 14:53:45 -0700
In-Reply-To: <90282ce3-dd14-73dc-fb9f-e78bb4042221@amazon.de>

On 09/18/2018 04:44 AM, Jan H. Schönherr wrote:
> On 09/18/2018 02:33 AM, Subhra Mazumdar wrote:
>> On 09/07/2018 02:39 PM, Jan H. Schönherr wrote:
>>> A) Quickstart guide for the impatient.
>>> --------------------------------------
>>>
>>> Here is a quickstart guide to set up coscheduling at core-level for
>>> selected tasks on an SMT-capable system:
>>>
>>> 1. Apply the patch series to v4.19-rc2.
>>> 2. Compile with "CONFIG_COSCHEDULING=y".
>>> 3. Boot into the newly built kernel with an additional kernel command
>>>    line argument "cosched_max_level=1" to enable coscheduling up to
>>>    core-level.
>>> 4. Create one or more cgroups and set their "cpu.scheduled" to "1".
>>> 5. Put tasks into the created cgroups and set their affinity explicitly.
>>> 6. Enjoy tasks of the same group and on the same core executing
>>>    simultaneously, whenever they are executed.
>>>
>>> You are not restricted to coscheduling at core-level. Just select higher
>>> numbers in steps 3 and 4. See also further below for more information,
>>> esp. when you want to try higher numbers on larger systems.
>>>
>>> Setting affinity explicitly for tasks within coscheduled cgroups is
>>> currently necessary, as the load balancing portion is still missing in
>>> this series.
>>>
>> I don't get the affinity part. If I create two cgroups by giving them
>> only cpu shares (no cpuset) and set their cpu.scheduled=1, will this
>> ensure co-scheduling of each group at core level for all cores in the
>> system?
>
> Short answer: Yes. But ignoring the affinity part will very likely result
> in a poor experience with this patch set.
>
> I was referring to the CPU affinity of a task, which you can set via
> sched_setaffinity() from within a program or via taskset from the command
> line. For each task/thread within a cgroup, you should set the affinity
> to exactly one CPU. Otherwise -- as the load balancing part is still
> missing -- you might end up with all tasks running on one CPU or some
> other unfortunate load distribution.
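
To make that pinning step concrete, it could look roughly like the sketch
below from within a program (minimal sketch only; the CPU number is
arbitrary and error handling is just a perror()):

    /* Pin the calling thread to a single CPU. Minimal sketch only. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static int pin_to_cpu(int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);

            /* pid 0 means "the calling thread" */
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                    perror("sched_setaffinity");
                    return -1;
            }
            return 0;
    }

    int main(void)
    {
            return pin_to_cpu(0) ? 1 : 0;
    }

Externally, "taskset -pc <cpu> <pid>" does the same for an existing task;
either way, each task/thread of the group gets exactly one CPU.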
> Coscheduling itself does not care about the load, so each group will be
> (co-)scheduled at core level, no matter where the tasks ended up.
>
> Regards
> Jan
>
> PS: Below is an example to illustrate the resulting schedules a bit
> better, and what might happen if you don't bind the to-be-coscheduled
> tasks to individual CPUs.
>
> For example, consider a dual-core system with SMT (i.e., 4 CPUs in
> total), two task groups A and B, and tasks within them a0, a1, ... and
> b0, b1, ... respectively.
>
> Let the system topology look like this:
>
>            System             (level 2)
>           /      \
>      Core 0      Core 1       (level 1)
>      /    \      /    \
>   CPU0  CPU1  CPU2  CPU3      (level 0)
>
> If you set cpu.scheduled=1 for A and B, each core will be coscheduled
> independently, if there are tasks of A or B on the core. Assuming there
> are runnable tasks in A and B and some other tasks on a core, you will
> see a schedule like:
>
>     A -> B -> other tasks -> A -> B -> other tasks -> ...
>
> (or some permutation thereof) happen synchronously across both CPUs of a
> core -- with no guarantees which tasks within A/within B/within the other
> tasks will execute simultaneously -- and with no guarantee what will
> execute on the other two CPUs simultaneously. (The distribution of CPU
> time between A, B, and other tasks follows the usual CFS
> weight-proportional distribution, just at core level.) If neither CPU of
> a core has any runnable tasks of a certain group, it won't be part of the
> schedule (e.g., A -> other -> A -> other).
>
> With cpu.scheduled=2, you lift this schedule to system level, and you
> would see it happen across all four CPUs synchronously. With
> cpu.scheduled=0, you get this schedule at CPU level, as we're all used
> to, with no synchronization between CPUs. (It gets a tad more interesting
> when you start mixing groups with cpu.scheduled=1 and =2.)
>
> Here are some schedules that you might see with A and B coscheduled at
> core level (they can be enforced this way, along the horizontal
> dimension, by setting the affinity of tasks; without setting the
> affinity, it could be any of them):
>
> Tasks equally distributed within A and B:
>
>   t  CPU0   CPU1   CPU2   CPU3
>   0  a0     a1     b2     b3
>   1  a0     a1     other  other
>   2  b0     b1     other  other
>   3  b0     b1     a2     a3
>   4  other  other  a2     a3
>   5  other  other  b2     b3
>
> All tasks within A and B on one CPU:
>
>   t  CPU0   CPU1   CPU2   CPU3
>   0  a0     --     other  other
>   1  a1     --     other  other
>   2  b0     --     other  other
>   3  b1     --     other  other
>   4  other  other  other  other
>   5  a2     --     other  other
>   6  a3     --     other  other
>   7  b2     --     other  other
>   8  b3     --     other  other
>
> Tasks within a group equally distributed across one core:
>
>   t  CPU0   CPU1   CPU2   CPU3
>   0  a0     a2     b1     b3
>   1  a0     a3     other  other
>   2  a1     a3     other  other
>   3  a1     a2     b0     b3
>   4  other  other  b0     b2
>   5  other  other  b1     b2
>
> You will never see an A-task sharing a core with a B-task at any point in
> time (except for the 2 microseconds or so that the collective context
> switch takes).

Ok, got it. Can we have a more generic interface, like specifying a set of
task IDs to be co-scheduled at a particular level, rather than tying this
to cgroups? KVMs may not always run with cgroups, and there might be other
use cases where we want co-scheduling that doesn't relate to cgroups.
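
For instance, something along these lines -- purely hypothetical, just to
sketch the idea; the names and signature are made up:

    /* Hypothetical interface -- not part of this patch set. */
    struct cosched_attr {
            pid_t *pids;    /* tasks to coschedule as one group */
            int    npids;
            int    level;   /* 0 = CPU, 1 = core, 2 = system, ... */
    };

    /* e.g., a new syscall or a prctl() extension */
    int sched_cosched_set(const struct cosched_attr *attr);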