From: Tvrtko Ursulin
To: Intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Tejun Heo,
    Johannes Weiner, Zefan Li, Dave Airlie, Daniel Vetter, Rob Clark,
    Stéphane Marchesin, "T . J . Mercier", Kenny.Ho@amd.com,
    Christian König, Brian Welty, Tvrtko Ursulin
Subject: [RFC v3 00/12] DRM scheduling cgroup controller
Date: Thu, 12 Jan 2023 16:55:57 +0000
Message-Id: <20230112165609.1083270-1-tvrtko.ursulin@linux.intel.com>
X-Mailer: git-send-email 2.34.1
X-Mailing-List: linux-kernel@vger.kernel.org

This series contains a proposal for a DRM scheduling cgroup controller which
implements a weight based, hierarchical GPU usage budget controller, similar in
concept to some of the existing controllers.

Motivation mostly comes from my earlier proposal where I identified that GPU
scheduling lags significantly behind what is available for CPU and IO.
Whereas back then I was proposing to somehow tie this with process nice,
feedback mostly was that people wanted cgroups. So here it is - in the world of
heterogeneous computing pipelines I think it is time to do something about this
gap.

Code is not finished but should survive some light experimentation. I am
sharing it early since the topic has been controversial in the past. I hope to
demonstrate there are gains to be had in real world usage(*), today, and that
the concepts the proposal relies on are well enough established and stable.

*) Specifically under ChromeOS which uses cgroups to control CPU bandwidth for
   VMs based on the window focused status. It can be demonstrated how GPU
   scheduling control can easily be integrated into that setup.

*) Another real world example later in the cover letter.

There should be no conflict between this proposal and any efforts to implement
a memory usage based controller. The skeleton DRM cgroup controller is
deliberately purely a skeleton patch where any further functionality can be
added with no real conflicts. [In fact, perhaps scheduling is even easier to
deal with than memory accounting.]

Structure of the series is as follows:

    1-2) Improve client ownership tracking in DRM core.
      3) Adds a skeleton DRM cgroup controller with no functionality.
    4-9) Laying down some infrastructure to enable the controller.
     10) The controller itself.
  11-12) i915 support for the controller.

The proposal defines a delegation of duties between the three parties: cgroup
controller, DRM core and individual drivers. Two way communication interfaces
are then defined to enable the delegation to work.

DRM scheduling soft limits
~~~~~~~~~~~~~~~~~~~~~~~~~~

Because of the heterogeneous hardware and driver DRM capabilities, soft limits
are implemented as a loose co-operative (bi-directional) interface between the
controller and DRM core.

The controller configures the GPU time allowed per group and periodically scans
the tasks belonging to it to detect the over budget condition, at which point
it invokes a callback notifying the DRM core of the condition.

DRM core provides an API to query per process GPU utilization and a second API
to receive notification from the cgroup controller when the group enters or
exits the over budget condition.

Individual DRM drivers which implement the interface are expected to act on
this in a best-effort manner only. There are no guarantees that the soft limits
will be respected.

DRM scheduling soft limits interface files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  drm.weight
        Standard cgroup weight based control [1, 10000] used to configure the
        relative distribution of GPU time between the sibling groups.

  drm.period_us (Most probably only a debugging aid during RFC phase.)
        An integer representing the period with which the controller should
        look at the GPU usage by the group and potentially send the over/under
        budget signal.
        Value of zero (default) disables the soft limit checking.

This builds upon the per client GPU utilisation work which landed recently for
a few drivers. My thinking is that, in principle, the intersection of drivers
which support both that and some sort of scheduling control, like priorities,
could also support this controller.

Another really interesting angle for this controller is that it mimics the same
control method used by the CPU scheduler, that is, proportional/weight based
GPU time budgeting. This makes it easy to configure and does not need a new
mental model.

However, as the introduction mentions, GPUs are much more heterogeneous and
therefore the controller uses very "soft" wording as to what it promises. The
general statement is that it can define budgets, notify clients when they are
over them, and let individual drivers implement best effort handling of those
conditions.

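To make the budgeting idea above concrete, here is a minimal sketch of a single
scan pass over sibling groups. It is only a model under stated assumptions, not
the controller code from this series: the function name scan_pass, the input
format, and the rule "budget = period * weight / sum of sibling weights" are
all hypothetical, and the sketch ignores hierarchy and unused budget sharing.

```shell
#!/bin/sh
# Illustrative model only, not the kernel implementation.
# scan_pass PERIOD_US
#   Reads one sibling group per line from stdin:
#     "<name> <drm.weight> <gpu_time_used_us>"
#   Prints the assumed budget of each group for the period and whether the
#   group is over that budget.
scan_pass() {
    awk -v period="$1" '
        # Remember every sibling and accumulate the sum of weights.
        { name[NR] = $1; weight[NR] = $2; used[NR] = $3; total += $2 }
        END {
            for (i = 1; i <= NR; i++) {
                # Assumed rule: budget is proportional to weight / sum of weights.
                budget = int(period * weight[i] / total)
                state = (used[i] > budget) ? "over" : "ok"
                printf "%s budget_us=%d used_us=%d %s\n", name[i], budget, used[i], state
            }
        }'
}

# Example: two siblings with weights 50 and 100 sharing a one second period.
printf 'background 50 600000\nforeground 100 400000\n' | scan_pass 1000000
```

With these made-up inputs the weight 50 group is given a budget of 333333us
and flagged as over budget, while the weight 100 group gets 666666us and is
within it. In the proposal the "over" state is what would trigger the
notification towards DRM core; what happens next is up to the driver.
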
Delegation of duties in the implementation goes like this:

 * DRM cgroup controller implements the control files and the scanning loop.
 * DRM core is required to track all DRM clients belonging to processes so it
   can answer when asked how much GPU time a process is using.
 * DRM core also provides a callback which the controller will call when a
   certain process is over budget.
 * Individual drivers need to implement two similar hooks, but which work for
   a single DRM client: an over budget callback and a GPU utilisation query.

What I have demonstrated in practice is that when wired to i915, in a really
primitive way where the over-budget condition simply lowers the scheduling
priority, the concept can be almost as effective as static priority control.
I say almost because the design where budget control depends on the periodic
usage scanning has a fundamental delay, so responsiveness will depend on the
scanning period, which may or may not be a problem for a particular use case.

There are also interesting conversations to be had around mental models for
what GPU usage is as a single number when faced with GPUs which have different
execution engines. To an extent this is similar to the multi-core and cgroup
CPU controller problems, but definitely goes further than that.

I deliberately did not want to include any such complications in the controller
itself and left the individual drivers to handle it. For instance, the i915
over-budget callback will not do anything unless the client's GPU usage is on a
physical engine which is oversubscribed. This enables multiple clients to be
harmlessly over budget, as long as they are not competing for the same GPU
resource.

Example usage from within a Linux desktop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Standard Linux distributions like Ubuntu already use cgroups heavily for
session management and that could easily be extended with the DRM controller.

After logging into the system graphically we can enable the DRM controller
throughout the cgroups hierarchy:

  echo +drm > /sys/fs/cgroup/cgroup.subtree_control
  echo +drm > /sys/fs/cgroup/user.slice/cgroup.subtree_control
  echo +drm > /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.subtree_control

Next we will open two SSH sessions, just so separate cgroups are handily
created by systemd for this experiment.

Roughly simultaneously we run the following two benchmarks, one in each
session:

  1) ./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen /no_scorebox /benchmark /benchmark_duration_ms=60000

  2) vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 --fullscreen 1 --gfx=glfw -t gl_manhattan

(The only reason for vsync off here is that I struggled to find an easily
runnable and demanding enough benchmark, or a screen large enough to make even
the simpler ones demanding.)

With this test we get 252fps from GpuTest and 96fps from GfxBenchmark.

The premise here is that one of these GPU intensive benchmarks is intended to
be run by the user with lower priority. Imagine kicking off some background
compute processing and continuing to use the UI for other tasks. Hence the user
will now re-run the test after first lowering the weight control of the first
session (DRM cgroup):

  1) echo 50 | sudo tee /sys/fs/cgroup/`cut -d':' -f3 /proc/self/cgroup`/drm.weight
     ./GpuTest /test=pixmark_julia_fp32 /width=1920 /height=1080 /fullscreen /no_scorebox /benchmark /benchmark_duration_ms=60000

  2) vblank_mode=0 bin/testfw_app --gl_api=desktop_core --width=1920 --height=1080 --fullscreen 1 --gfx=glfw -t gl_manhattan

In this case we will see that GpuTest has recorded 208fps (~18% down) and
GfxBenchmark 114fps (18% up), demonstrating that even a very simple approach of
wiring up i915 to the DRM cgroup controller can enable external GPU scheduling
control.

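The three "echo +drm" commands used in this example can be generalised into a
small helper that enables the controller on every ancestor of a given cgroup.
This is only a convenience sketch: the function name enable_drm_for and its
arguments are hypothetical, and it assumes a cgroup v2 hierarchy (normally
mounted at /sys/fs/cgroup), a kernel carrying this RFC so that the "drm"
controller exists, and sufficient privileges to write the control files.

```shell
#!/bin/sh
# Hypothetical helper, not part of the series.
# enable_drm_for ROOT CGROUP_PATH
#   ROOT        - cgroup v2 mount point, for example /sys/fs/cgroup
#   CGROUP_PATH - target cgroup relative to ROOT, in the form printed in the
#                 third field of /proc/self/cgroup
# Writes "+drm" to cgroup.subtree_control of ROOT and of every ancestor of the
# target, but not of the target itself.
enable_drm_for() {
    dir=$1
    rest=${2#/}                 # drop the leading slash, if any
    while [ -n "$rest" ]; do
        # Enable the controller for the children of the current level.
        echo +drm > "$dir/cgroup.subtree_control" || return 1
        # Descend one path component towards the target.
        dir=$dir/${rest%%/*}
        case $rest in
            */*) rest=${rest#*/} ;;
            *)   rest= ;;
        esac
    done
}

# Possible usage from inside one of the SSH sessions (needs privileges):
#   enable_drm_for /sys/fs/cgroup "$(cut -d: -f3 /proc/self/cgroup)"
```

Stopping one level above the target mirrors the manual commands, which enable
the controller on the root, user.slice and the per-user slice so that the
session cgroups below them gain the drm.* files.
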
 * Note here that default weight is 100, so setting 50 for the background
   session is asking the controller to give it half as much GPU bandwidth.

 * Also note that in the RFC stage the DRM controller itself boots in a
   disabled state and needs to be explicitly enabled by setting a scanning
   period, such as:

     echo 1000000 | sudo tee /sys/fs/cgroup/drm.period_us

v2:
 * Prefaced the series with some core DRM work as suggested by Christian.
 * Dropped the priority based controller for now.
 * Dropped the introspection cgroup controller file.
 * Implemented unused budget sharing/propagation.
 * Some small fixes/tweaks as per review feedback and in general.

v3:
 * Dropped one upstreamed patch.
 * Logging cleanup (use DRM macros where available).

Tvrtko Ursulin (12):
  drm: Track clients by tgid and not tid
  drm: Update file owner during use
  cgroup: Add the DRM cgroup controller
  drm/cgroup: Track clients per owning process
  drm/cgroup: Allow safe external access to file_priv
  drm/cgroup: Add ability to query drm cgroup GPU time
  drm/cgroup: Add over budget signalling callback
  drm/cgroup: Only track clients which are providing drm_cgroup_ops
  cgroup/drm: Client exit hook
  cgroup/drm: Introduce weight based drm cgroup control
  drm/i915: Wire up with drm controller GPU time query
  drm/i915: Implement cgroup controller over budget throttling

 Documentation/admin-guide/cgroup-v2.rst    |   37 ++
 drivers/gpu/drm/Kconfig                    |    1 +
 drivers/gpu/drm/Makefile                   |    1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |    6 +-
 drivers/gpu/drm/drm_auth.c                 |    3 +-
 drivers/gpu/drm/drm_cgroup.c               |  203 +++++++
 drivers/gpu/drm/drm_debugfs.c              |   12 +-
 drivers/gpu/drm/drm_file.c                 |   60 +-
 drivers/gpu/drm/drm_ioctl.c                |    3 +
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c |   38 +-
 drivers/gpu/drm/i915/i915_driver.c         |   11 +
 drivers/gpu/drm/i915/i915_drm_client.c     |  209 ++++++-
 drivers/gpu/drm/i915/i915_drm_client.h     |   13 +
 drivers/gpu/drm/nouveau/nouveau_drm.c      |    5 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c        |    6 +-
 include/drm/drm_clients.h                  |   47 ++
 include/drm/drm_drv.h                      |   36 ++
 include/drm/drm_file.h                     |   17 +-
 include/linux/cgroup_drm.h                 |   13 +
 include/linux/cgroup_subsys.h              |    4 +
 init/Kconfig                               |    8 +
 kernel/cgroup/Makefile                     |    1 +
 kernel/cgroup/drm.c                        |  570 ++++++++++++++++++
 23 files changed, 1273 insertions(+), 31 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_cgroup.c
 create mode 100644 include/drm/drm_clients.h
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c

--
2.34.1