Subject: Re: [PATCH -tip 00/32] Core scheduling (v9)
To: Joel Fernandes, Vincent Guittot
Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
 Vineeth Pillai, Aaron Lu, Aubrey Li, Thomas Gleixner, linux-kernel,
 Ingo Molnar, Linus Torvalds, Frederic Weisbecker, Kees Cook, Greg Kerr,
 Phil Auld, Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini,
 vineeth@bitbyteword.org, Chen Yu, Christian Brauner, Agata Gruza,
 Antonio Gomez Iglesias, Alexander Graf, konrad.wilk@oracle.com,
 Dario Faggioli, Paul Turner, Steven Rostedt, Patrick Bellasi, Jiang Biao,
 Alexandre Chartre, James Bottomley, OWeisse@umich.edu, Dhaval Giani,
 Junaid Shahid, Jesse Barnes, "Hyser,Chris", Ben Segall, Josh Don, Hao Luo,
 Tom Lendacky, Aubrey Li, "Paul E. McKenney", Tim Chen
References: <20201117232003.3580179-1-joel@joelfernandes.org>
From: "Ning, Hongyu"
Message-ID: <2aa08e37-c938-c98b-8212-556d63eb730f@linux.intel.com>
Date: Thu, 3 Dec 2020 14:16:28 +0800
X-Mailing-List: linux-kernel@vger.kernel.org

On 2020/11/24 23:08, Joel Fernandes wrote:
>>>
>>> Core-Scheduling
>>> ===============
>>> Enclosed is series v9 of core scheduling.
>>> v9 is rebased on tip/master (fe4adf6f92c4 ("Merge branch 'irq/core'"))..
>>> I hope that this version is acceptable to be merged (pending any new review
>>> comments that arise) as the main issues in the past are all resolved:
>>> 1. Vruntime comparison.
>>> 2. Documentation updates.
>>> 3. CGroup and per-task interface developed by Google and Oracle.
>>> 4. Hotplug fixes.
>>> Almost all patches also have Reviewed-by or Acked-by tag. See below for full
>>> list of changes in v9.
>>>
>>> Introduction of feature
>>> =======================
>>> Core scheduling is a feature that allows only trusted tasks to run
>>> concurrently on cpus sharing compute resources (eg: hyperthreads on a
>>> core). The goal is to mitigate the core-level side-channel attacks
>>> without requiring to disable SMT (which has a significant impact on
>>> performance in some situations).
>>> Core scheduling (as of v7) mitigates
>>> user-space to user-space attacks and user to kernel attack when one of
>>> the siblings enters the kernel via interrupts or system call.
>>>
>>> By default, the feature doesn't change any of the current scheduler
>>> behavior. The user decides which tasks can run simultaneously on the
>>> same core (for now by having them in the same tagged cgroup). When a tag
>>> is enabled in a cgroup and a task from that cgroup is running on a
>>> hardware thread, the scheduler ensures that only idle or trusted tasks
>>> run on the other sibling(s). Besides security concerns, this feature can
>>> also be beneficial for RT and performance applications where we want to
>>> control how tasks make use of SMT dynamically.
>>>
>>> Both a CGroup and Per-task interface via prctl(2) are provided for configuring
>>> core sharing. More details are provided in documentation patch. Kselftests are
>>> provided to verify the correctness/rules of the interface.
>>>
>>> Testing
>>> =======
>>> ChromeOS testing shows 300% improvement in keypress latency on a Google
>>> docs key press with Google hangout test (the maximum latency drops from 150ms
>>> to 50ms for keypresses).
>>>
>>> Julien: TPCC tests showed improvements with core-scheduling as below. With kernel
>>> protection enabled, it does not show any regression. Possibly ASI will improve
>>> the performance for those who choose kernel protection (can be toggled through
>>> sched_core_protect_kernel sysctl).
>>>                                 average    stdev         diff
>>> baseline (SMT on)               1197.272   44.78312824
>>> core sched (   kernel protect)  412.9895   45.42734343   -65.51%
>>> core sched (no kernel protect)  686.6515   71.77756931   -42.65%
>>> nosmt                           408.667    39.39042872   -65.87%
>>> (Note these results are from v8).
>>>
>>> Vineeth tested sysbench and does not see any regressions.
>>> Hong and Aubrey tested v9 and see results similar to v8. There is a known issue
>>> with uperf that does regress.
>>> This appears to be because of ksoftirq heavily
>>> contending with other tasks on the core. The consensus is this can be improved
>>> in the future.
>>>
>>> Changes in v9
>>> =============
>>> - Note that the vruntime snapshot change is written in 2 patches to show the
>>>   progression of the idea and prevent merge conflicts:
>>>     sched/fair: Snapshot the min_vruntime of CPUs on force idle
>>>     sched: Improve snapshotting of min_vruntime for CGroups
>>>   Same with the RT priority inversion change:
>>>     sched: Fix priority inversion of cookied task with sibling
>>>     sched: Improve snapshotting of min_vruntime for CGroups
>>> - Disable coresched on certain AMD HW.
>>>

Adding workload and negative case test results for the posted core scheduling v9:

- kernel under test:
-- coresched community v9 posted from https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=sched/coresched-v9-posted (tag: sched/coresched-v9-posted)
-- latest commit: d48636e429de (HEAD -> coresched-v9-posted, tag: sched/coresched-v9-posted) sched: Debug bits...
-- coresched=on kernel parameter applied

- workloads:
-- A. sysbench cpu (192 threads) + sysbench cpu (192 threads)
-- B. sysbench cpu (192 threads) + sysbench mysql (192 threads, mysqld forced into the same cgroup)
-- C. uperf netperf.xml (192 threads over TCP or UDP protocol separately)
-- D. will-it-scale context_switch via pipe (192 threads)

- negative case:
-- A. continuously toggle cpu.core_tag while a full-load uperf workload is running with cs_on
-- B. continuously toggle the smt setting via /sys/devices/system/cpu/smt/control while a full-load uperf workload is running with cs_on
-- C. continuously switch cgroups between the cs_on cgroup and the cs_off cgroup via cgclassify while a full-load uperf workload is running

- test machine setup:
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4

- test results of workloads, no obvious performance drop compared to community v8 build:
-- workload A:
+----------------------+------+----------------------+------------------------+
| workloads            | **   | sysbench cpu * 192   | sysbench cpu * 192     |
+======================+======+======================+========================+
| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_cpu_1      |
+----------------------+------+----------------------+------------------------+
| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
+----------------------+------+----------------------+------------------------+
| coresched_normalized | **   | 0.97                 | 1.02                   |
+----------------------+------+----------------------+------------------------+
| default_normalized   | **   | 1.00                 | 1.00                   |
+----------------------+------+----------------------+------------------------+
| smtoff_normalized    | **   | 0.60                 | 0.60                   |
+----------------------+------+----------------------+------------------------+

-- workload B:
+----------------------+------+----------------------+------------------------+
| workloads            | **   | sysbench cpu * 192   | sysbench mysql * 192   |
+======================+======+======================+========================+
| cgroup               | **   | cg_sysbench_cpu_0    | cg_sysbench_mysql_0    |
+----------------------+------+----------------------+------------------------+
| record_item          | **   | Tput_avg (events/s)  | Tput_avg (events/s)    |
+----------------------+------+----------------------+------------------------+
| coresched_normalized | **   | 0.94                 | 0.88                   |
+----------------------+------+----------------------+------------------------+
| default_normalized   | **   | 1.00                 | 1.00                   |
+----------------------+------+----------------------+------------------------+
| smtoff_normalized    | **   | 0.56                 | 0.84                   |
+----------------------+------+----------------------+------------------------+

-- workload C:
+----------------------+------+---------------------------+---------------------------+
| workloads            | **   | uperf netperf TCP * 192   | uperf netperf UDP * 192   |
+======================+======+===========================+===========================+
| cgroup               | **   | cg_uperf                  | cg_uperf                  |
+----------------------+------+---------------------------+---------------------------+
| record_item          | **   | Tput_avg (Gb/s)           | Tput_avg (Gb/s)           |
+----------------------+------+---------------------------+---------------------------+
| coresched_normalized | **   | 0.64                      | 0.68                      |
+----------------------+------+---------------------------+---------------------------+
| default_normalized   | **   | 1.00                      | 1.00                      |
+----------------------+------+---------------------------+---------------------------+
| smtoff_normalized    | **   | 0.92                      | 0.89                      |
+----------------------+------+---------------------------+---------------------------+

-- workload D:
+----------------------+------+-------------------------------+
| workloads            | **   | will-it-scale * 192           |
|                      |      | (pipe based context_switch)   |
+======================+======+===============================+
| cgroup               | **   | cg_will-it-scale              |
+----------------------+------+-------------------------------+
| record_item          | **   | threads_avg                   |
+----------------------+------+-------------------------------+
| coresched_normalized | **   | 0.30                          |
+----------------------+------+-------------------------------+
| default_normalized   | **   | 1.00                          |
+----------------------+------+-------------------------------+
| smtoff_normalized    | **   | 0.87                          |
+----------------------+------+-------------------------------+

-- notes on record_item:
* coresched_normalized: smton, cs enabled, test result normalized by default value
* default_normalized: smton, cs disabled, test result normalized by default value
* smtoff_normalized: smtoff, test result normalized by default value

- test results of negative case, all as expected, no kernel panic or system hang observed

Hongyu
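
For reference, the "normalized by default value" figures in the tables above, and the diff column in the quoted TPCC results, can be reproduced with something like the sketch below. This is only an illustration and not the script used for these runs: the function names are hypothetical, and the inputs are the TPCC averages quoted from the cover letter, since the raw sysbench, uperf and will-it-scale values are not part of this mail.

```python
# Illustrative sketch only: normalize each configuration's average result
# by the default (SMT on, core scheduling off) average.
# Function names are hypothetical; the numbers are the TPCC averages
# quoted in the v9 cover letter.

def normalize(value, default):
    """Return value as a fraction of the default-configuration result."""
    if default == 0:
        raise ValueError("default result must be non-zero")
    return value / default


def diff_percent(value, default):
    """Return the relative change versus default, in percent."""
    return (normalize(value, default) - 1.0) * 100.0


tpcc_avg = {
    "baseline (SMT on)": 1197.272,
    "core sched (kernel protect)": 412.9895,
    "core sched (no kernel protect)": 686.6515,
    "nosmt": 408.667,
}

default = tpcc_avg["baseline (SMT on)"]

# Map each configuration to (normalized value, diff in percent).
results = {
    name: (normalize(avg, default), diff_percent(avg, default))
    for name, avg in tpcc_avg.items()
}

for name, (norm, diff) in results.items():
    print(f"{name:32s} normalized={norm:.2f} diff={diff:+.2f}%")
```

With these inputs the diff values come out as -65.51%, -42.65% and -65.87%, matching the quoted TPCC table; the baseline row normalizes to 1.00 by construction, the same way default_normalized is always 1.00 in the tables above.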