Date: Fri, 14 Jan 2022 17:05:59 -0800
In-Reply-To: <20220115010622.3185921-1-hridya@google.com>
Message-Id: <20220115010622.3185921-2-hridya@google.com>
References: <20220115010622.3185921-1-hridya@google.com>
Subject: [RFC 1/6] gpu: rfc: Proposal for a GPU cgroup controller
From: Hridya Valsaraju
To: David Airlie, Daniel Vetter, Maarten Lankhorst, Maxime Ripard,
 Thomas Zimmermann, Jonathan Corbet, Greg Kroah-Hartman, Arve Hjønnevåg,
 Todd Kjos, Martijn Coenen, Joel Fernandes, Christian Brauner,
 Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal, Benjamin Gaignard,
 Liam Mark, Laura Abbott, Brian Starkey, John Stultz, Christian König,
 Tejun Heo, Zefan Li, Johannes Weiner, Dave Airlie, Kenneth Graunke,
 Simon Ser, Jason Ekstrand, Matthew Auld, Matthew Brost, Li Li,
 Marco Ballesio, Finn Behrens, Hang Lu, Wedson Almeida Filho,
 Masahiro Yamada, Andrew Morton, Nathan Chancellor, Kees Cook,
 Nick Desaulniers, Miguel Ojeda, Vipin Sharma, Chris Down,
 Daniel Borkmann, Vlastimil Babka,
 Arnd Bergmann, dri-devel@lists.freedesktop.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-media@vger.kernel.org,
 linaro-mm-sig@lists.linaro.org, cgroups@vger.kernel.org
Cc: Kenny.Ho@amd.com, daniels@collabora.com, kaleshsingh@google.com,
 tjmercier@google.com
X-Mailing-List: linux-kernel@vger.kernel.org

This patch adds a proposal for a new GPU cgroup controller for
accounting/limiting GPU and GPU-related memory allocations.
The proposed controller is based on the DRM cgroup controller[1] and
follows the design of the RDMA cgroup controller.

The new cgroup controller would:
* Allow setting per-cgroup limits on the total size of buffers charged
  to it.
* Allow setting per-device limits on the total size of buffers
  allocated by a device within a cgroup.
* Expose a per-device/allocator breakdown of the buffers charged to a
  cgroup.

The prototype in the following patches only implements memory accounting
using the GPU cgroup controller; it does not implement limit setting.

[1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/

Signed-off-by: Hridya Valsaraju
---

Hi all,

Here is the RFC documentation for the GPU cgroup controller that we
talked about at LPC 2021 along with a prototype. I reached out to Tejun
with the idea recently and he mentioned that cgroup-aware BPF (by Kenny
Ho) or the new misc cgroup controller can also be considered as
alternatives to track GPU resources. I am sending the RFC to the list to
give everyone else a chance to chime in with their thoughts as well so
that we can reach an agreement on how to proceed.

Thanks in advance!
Regards,
Hridya

 Documentation/gpu/rfc/gpu-cgroup.rst | 192 +++++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst      |   4 +
 2 files changed, 196 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst

diff --git a/Documentation/gpu/rfc/gpu-cgroup.rst b/Documentation/gpu/rfc/gpu-cgroup.rst
new file mode 100644
index 000000000000..9bff23007b22
--- /dev/null
+++ b/Documentation/gpu/rfc/gpu-cgroup.rst
@@ -0,0 +1,192 @@
+===================================
+GPU cgroup controller
+===================================
+
+Goals
+=====
+This document intends to outline a plan to create a cgroup v2 controller subsystem
+for the per-cgroup accounting of device and system memory allocated by the GPU
+and related subsystems.
+
+The new cgroup controller would:
+
+* Allow setting per-cgroup limits on the total size of buffers charged to it.
+
+* Allow setting per-device limits on the total size of buffers allocated by a
+  device/allocator within a cgroup.
+
+* Expose a per-device/allocator breakdown of the buffers charged to a cgroup.
+
+Alternatives Considered
+=======================
+
+The following alternatives were considered:
+
+The memory cgroup controller
+____________________________
+
+1. As was noted in [1], memory accounting provided by the GPU cgroup
+controller is not a good fit for integration into memcg due to the
+differences in how accounting is performed. It implements a mechanism
+for the allocator attribution of GPU and GPU-related memory by
+charging each buffer to the cgroup of the process on behalf of which
+the memory was allocated. The buffer stays charged to the cgroup until
+it is freed regardless of whether the process retains any references
+to it. On the other hand, the memory cgroup controller offers a more
+fine-grained charging and uncharging behavior depending on the kind of
+page being accounted.
+
+2. Memcg performs accounting in units of pages. In the DMA-BUF buffer sharing model,
+a process takes a reference to the entire buffer (hence keeping it alive) even if
+it is only accessing parts of it. Therefore, per-page memory tracking for DMA-BUF
+memory accounting would only introduce additional overhead without any benefits.
+
+[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
+
+Userspace service to keep track of buffer allocations and releases
+__________________________________________________________________
+
+1. There is no way for a userspace service to intercept all allocations and releases.
+2. In case the process gets killed or restarted, we lose all accounting so far.
+
+UAPI
+====
+When enabled, the new cgroup controller would create the following files in every cgroup.
+
+::
+
+        gpu.memory.current (R)
+        gpu.memory.max (R/W)
+
+gpu.memory.current is a read-only file and would contain per-device memory allocations
+in a key-value format where the key is a string representing the device name
+and the value is the size of memory charged to the device in the cgroup in bytes.
+
+For example:
+
+::
+
+        cat /sys/kernel/fs/cgroup1/gpu.memory.current
+        dev1 4194304
+        dev2 4194304
+
+The string key for each device is set by the device driver when the device registers
+with the GPU cgroup controller to participate in resource accounting (see section
+'Design and Implementation' for more details).
+
+gpu.memory.max is a read/write file. It would show the current total
+size limits on memory usage for the cgroup and the limits on total memory usage
+for each allocator/device.
+
+Setting a total limit for a cgroup can be done as follows:
+
+::
+
+        echo "total 41943040" > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+Setting a limit for a particular device/allocator can be done as follows:
+
+::
+
+        echo "dev1 4194304" > /sys/kernel/fs/cgroup1/gpu.memory.max
+
+In this example, 'dev1' is the string key set by the device driver during
+registration.
+
+Design and Implementation
+=========================
+
+The cgroup controller would closely follow the design of the RDMA cgroup controller
+subsystem where each cgroup maintains a list of resource pools.
+Each resource pool references a device and holds the counter that tracks the
+current total and the maximum limit set for that device.
+
+The code block below is a preliminary sketch of what the core kernel data structures
+and APIs would look like.
+
+.. code-block:: c
+
+        /**
+         * The GPU cgroup controller data structure.
+         */
+        struct gpucg {
+                struct cgroup_subsys_state css;
+                /* list of all resource pools that belong to this cgroup */
+                struct list_head rpools;
+        };
+
+        struct gpucg_device {
+                /*
+                 * list of various resource pools in various cgroups that the device is
+                 * part of.
+                 */
+                struct list_head rpools;
+                /* list of all devices registered for GPU cgroup accounting */
+                struct list_head dev_node;
+                /* name to be used as identifier for accounting and limit setting */
+                const char *name;
+        };
+
+        struct gpucg_resource_pool {
+                /* The device whose resource usage is tracked by this resource pool */
+                struct gpucg_device *device;
+
+                /* list of all resource pools for the cgroup */
+                struct list_head cg_node;
+
+                /*
+                 * list maintained by the gpucg_device to keep track of its
+                 * resource pools
+                 */
+                struct list_head dev_node;
+
+                /* tracks memory usage of the resource pool */
+                struct page_counter total;
+        };
+
+        /**
+         * gpucg_register_device - Registers a device for memory accounting using the
+         * GPU cgroup controller.
+         *
+         * @gpucg_dev: The device to register for memory accounting. Must remain valid
+         * after registration.
+         * @name: Pointer to a string literal to denote the name of the device.
+         */
+        void gpucg_register_device(struct gpucg_device *gpucg_dev, const char *name);
+
+        /**
+         * gpucg_try_charge - charge memory to the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to charge the memory to.
+         * @device: The device to charge the memory to.
+         * @usage: size of memory to charge in bytes.
+         *
+         * Return: returns 0 if the charging is successful and otherwise returns an
+         * error code.
+         */
+        int gpucg_try_charge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+        /**
+         * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_device.
+         *
+         * @gpucg: The gpu cgroup to uncharge the memory from.
+         * @device: The device to uncharge the memory from.
+         * @usage: size of memory to uncharge in bytes.
+         */
+        void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_device *device, u64 usage);
+
+Future Work
+===========
+Additional GPU resources can be supported by adding new controller files.
+
+Upstreaming Plan
+================
+* Decide on a UAPI that accommodates all use-cases for the upstream GPU ecosystem
+  as well as for Android.
+
+* Prototype the GPU cgroup controller and integrate its usage into the DMA-BUF
+  system heap.
+
+* Demonstrate its usage from userspace in the Android Open Source Project.
+
+* Send out RFCs to LKML for the GPU cgroup controller and iterate.
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..0a9bcd94e95d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    gpu-cgroup.rst
-- 
2.34.1.703.g22d0c6ccf7-goog
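
Illustration (not part of the patch): the RFC describes per-cgroup resource
pools, one per device, that are charged and uncharged in bytes and checked
against a per-device limit and a cgroup-wide "total" limit. The following is
a minimal userspace sketch of that flow, under stated assumptions. It only
borrows the names gpucg, gpucg_device, gpucg_resource_pool, gpucg_try_charge
and gpucg_uncharge from the proposal. The function bodies, the fixed-size
pool array (the RFC uses kernel lists), the helpers gpucg_init,
gpucg_get_pool, gpucg_fits and gpucg_show_current, and the choice of -ENOMEM
as the error code are all hypothetical, and limit enforcement itself is not
implemented by the prototype patches.

```c
/*
 * Userspace model of the proposed charge/uncharge flow. This is NOT kernel
 * code: the names mirror the RFC, but the function bodies, the fixed-size
 * pool array and the -ENOMEM return value are assumptions.
 */
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define GPUCG_MAX_POOLS 8	/* hypothetical capacity; the RFC uses lists */

struct gpucg_device {
	const char *name;	/* key reported in gpu.memory.current */
};

struct gpucg_resource_pool {
	const struct gpucg_device *device;
	uint64_t usage;		/* bytes currently charged */
	uint64_t max;		/* per-device limit; UINT64_MAX means none */
};

struct gpucg {
	struct gpucg_resource_pool rpools[GPUCG_MAX_POOLS];
	size_t nr_rpools;
	uint64_t total;		/* bytes charged across all devices */
	uint64_t total_max;	/* "total" limit; UINT64_MAX means none */
};

void gpucg_init(struct gpucg *cg)
{
	cg->nr_rpools = 0;
	cg->total = 0;
	cg->total_max = UINT64_MAX;
}

/* Find the pool for @dev in @cg, creating it (unlimited) on first use. */
struct gpucg_resource_pool *gpucg_get_pool(struct gpucg *cg,
					   const struct gpucg_device *dev)
{
	struct gpucg_resource_pool *rp;

	for (size_t i = 0; i < cg->nr_rpools; i++)
		if (cg->rpools[i].device == dev)
			return &cg->rpools[i];
	if (cg->nr_rpools == GPUCG_MAX_POOLS)
		return NULL;
	rp = &cg->rpools[cg->nr_rpools++];
	rp->device = dev;
	rp->usage = 0;
	rp->max = UINT64_MAX;
	return rp;
}

/* Overflow-safe check that cur + add stays within max. */
static int gpucg_fits(uint64_t cur, uint64_t add, uint64_t max)
{
	return cur <= max && add <= max - cur;
}

int gpucg_try_charge(struct gpucg *cg, const struct gpucg_device *dev,
		     uint64_t usage)
{
	struct gpucg_resource_pool *rp = gpucg_get_pool(cg, dev);

	/* Refuse if the pool is unavailable or either limit would be hit. */
	if (!rp || !gpucg_fits(rp->usage, usage, rp->max) ||
	    !gpucg_fits(cg->total, usage, cg->total_max))
		return -ENOMEM;
	rp->usage += usage;
	cg->total += usage;
	return 0;
}

void gpucg_uncharge(struct gpucg *cg, const struct gpucg_device *dev,
		    uint64_t usage)
{
	struct gpucg_resource_pool *rp = gpucg_get_pool(cg, dev);

	if (!rp)
		return;
	if (usage > rp->usage)	/* clamp instead of underflowing */
		usage = rp->usage;
	assert(cg->total >= usage);	/* total covers every pool */
	rp->usage -= usage;
	cg->total -= usage;
}

/* Prints one "<name> <bytes>" line per device, as in gpu.memory.current. */
void gpucg_show_current(const struct gpucg *cg, FILE *out)
{
	for (size_t i = 0; i < cg->nr_rpools; i++)
		fprintf(out, "%s %" PRIu64 "\n", cg->rpools[i].device->name,
			cg->rpools[i].usage);
}
```

In this model, with total_max set to 6291456 bytes, charging 4194304 bytes to
one device succeeds and a further 4194304-byte charge to a second device is
refused until the first buffer is uncharged. Keeping a running total next to
the per-device pools is one possible design; it lets the "total" check avoid
walking every pool on each charge.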