Subject: Re: [RFC] Design proposal for upstream core-scheduling interface
From: Vineeth Pillai
Date: Mon, 24 Aug 2020 07:32:05 -0400
To: Joel Fernandes
Cc: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
    mingo@kernel.org, Thomas Gleixner, Paul Turner,
    linux-kernel@vger.kernel.org, fweisbec@gmail.com, Kees Cook,
    Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
    Pawan Gupta, Paolo Bonzini, Chen Yu, Christian Brauner,
    chris hyser, dhaval.giani@gmail.com, "Paul E. McKenney",
    joshdon@google.com, xii@google.com, haoluo@google.com,
    bsegall@google.com
In-Reply-To: <20200822030155.GA414063@google.com>
McKenney" , joshdon@google.com, xii@google.com, haoluo@google.com, bsegall@google.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > Let me know your thoughts and looking forward to a good LPC MC discussion= ! > Nice write up Joel, thanks for taking time to compile this with great detai= l! After going through the details of interface proposal using cgroup v2 controllers, and based on our discussion offline, would like to note down this idea about a new pseudo filesystem interface for core scheduling. We could include this also for the API discussion during core scheduler MC. coreschedfs: pseudo filesystem interface for Core Scheduling ---------------------------------------------------------------------------= ------- The basic requirement of core scheduling is simple - we need to group a set of tasks into a trust group that can share a core. So we don=E2=80=99t real= ly need a nested hierarchy for the trust groups. Cgroups v2 follow a unified nested hierarchy model that causes a considerable confusion if the trusted tasks are in different levels of the hierarchy and we need to allow them to share the core. Cgroup v2's single hierarchy model makes it difficult to regroup tasks in different levels of nesting for core scheduling. As noted in this mail, we could use multi-file approach and other interfaces like prctl to overcome this limitation. The idea proposed here to overcome the above limitation is to come up with = a new pseudo filesystem - =E2=80=9Ccoreschedfs=E2=80=9D. This filesystem is basic= ally a flat filesystem with maximum nesting level of 1. That means, root directory can have sub-directories for sub-groups, but those sub-directories cannot have more sub-directories representing trust groups. Root directory is to represent the system wide trust group and sub-directories represent trusted groups. Each directory including the root directory has the following set of files/directories: - cookie_id: User exposed id for a cookie. This can be compared to a file descriptor. This could be used in programmatic API to join/leave a group - properties: This is an interface to specify how child tasks of this group should behave. Can be used for specifying future flag requirements as well. Current list of properties include: NEW_COOKIE_FOR_CHILD: All fork() for tasks in this group will result in creation of a new trust group SAME_COOKIE_FOR_CHILD: All fork() for tasks in this group will end up in this same group ROOT_COOKIE_FOR_CHILD: All fork() for tasks in this group goes to the root group - tasks: Lists the tasks in this group. Main interface for adding removing tasks in a group - : A directory per task who is am member of this trust group. - /properties: This file is same as the parent properties file but this is to override the group setting. This pseudo filesystem can be mounted any where in the root filesystem, I propose the default to be in =E2=80=9C/sys/kernel/coresched=E2=80=9D When coresched is enabled, kernel internally creates the framework for this filesystem. The filesystem gets mounted to the default location and admin can change this if needed. All tasks by default are in the root group. The admin or programs can then create trusted groups on top of this filesystem. Hooks will be placed in fork() and exit() to make sure that the filesystem=E2=80=99s view of tasks is up-to-date with the system. 
Similarly, any APIs that manipulate core scheduling trust groups should
also make sure that the filesystem's view is updated.

Note: the above idea is very similar to cgroup v1. Since there is no
unified hierarchy in cgroup v1, most of the features of coreschedfs
could be implemented as a cgroup v1 controller. But as no new v1
controllers are allowed, I feel the best alternative for a simple API
is to come up with a new filesystem - coreschedfs.

The advantages of this approach are:

- Detached from the cgroup unified hierarchy, so the very simple
  requirement of core scheduling can be easily materialized.
- The admin can have fine-grained control of groups using the shell
  and scripting.
- Programmatic access is possible through existing APIs like mkdir,
  rmdir, write and read. Alternatively, new APIs could be built around
  the cookie_id, either wrapping the above Linux APIs or using a new
  system call for core scheduling.
- Fine-grained permission control using Linux filesystem permissions
  and ACLs.

Disadvantages are:

- Yet another pseudo filesystem.
- Very similar to cgroup v1, so it might re-implement features that
  cgroup v1 already provides.

Use Cases
---------

Use case 1: Google cloud
------------------------

Since we no longer depend on cgroup v2 hierarchies, there is no issue
of nesting and sharing. The main daemon can create trusted groups in
the filesystem and set the required permissions on each group. The init
process of each job can then be added to its respective group, from
where it can create child tasks as needed. Multiple jobs of the same
customer that need to share a core can be housed in one group.

Use case 2: Chrome browser
--------------------------

We start with one group for the first task and then set its properties
to NEW_COOKIE_FOR_CHILD, so each forked child lands in its own trust
group.

Use case 3: Chrome VMs
----------------------

Similar to the Chrome browser case; additionally, the VM task can put
each vCPU in its own group.

Use case 4: Oracle
------------------

This is also similar to use case 1 with this interface. All tasks that
need to be in the root group can easily be added there by the admin.

Use case 5: General virtualization
----------------------------------

The requirement is that each VM should be isolated. This can easily be
done by creating a new group per VM.

Please have a look at the above proposal and let us know your thoughts.
We shall include this as well in the interface discussion at the core
scheduling MC.

Thanks,
Vineeth