From: "T.J. Mercier"
Date: Thu, 8 Sep 2022 09:44:34 -0700
Subject: Re: GPU device resource reservations with cgroups?
In-Reply-To: <7e047ee0-0243-d9d4-f0bc-7ed19ed33c19@quicinc.com>
To: Jeffrey Hugo
Cc: Tejun Heo, Zefan Li, cgroups@vger.kernel.org, Johannes Weiner,
 dri-devel@lists.freedesktop.org, "linux-kernel@vger.kernel.org",
 Carl Vanderlip, quic_ajitpals@quicinc.com, quic_pkanojiy@quicinc.com

On Tue, Aug 16, 2022 at 1:39 PM Jeffrey Hugo wrote:
>
> Hello cgroup experts,
>
> I have a GPU device [1] that supports organizing its resources for the
> purposes of supporting containers. I am attempting to determine how to
> represent this in the upstream kernel, and I wonder if it fits in cgroups.
>
> The device itself has a number of resource types – compute cores,
> memory, bus replicators, semaphores, and dma channels. Any particular
> workload may consume some set of these resources. For example, a
> workload may consume two compute cores, 1GB of memory, one dma channel,
> but no semaphores and no bus replicators.
>
> By default all of the resources are in a global pool. This global pool
> is managed by the device firmware. Linux makes a request to the
> firmware to load a workload. The firmware reads the resource
> requirements from the workload itself, and then checks the global pool.
> If the global pool contains sufficient resources to satisfy the needs of
> the workload, the firmware assigns the required resources from the
> global pool to the workload. If there are insufficient resources, the
> workload request from Linux is rejected.
>
> Some users may want to share the device between multiple containers, but
> provide device-level isolation between those containers. For example, a
> user may have 4 workloads to run, one per container, and each workload
> takes 1/4th of the set of compute cores. The user would like to reserve
> sets of compute cores for each container so that container X always has
> the expected set of resources available, and if container Y
> malfunctions, it cannot "steal" resources from container X.
>
> To support this, the firmware supports a concept of partitioning. A
> partition is a pool of resources which are removed from the global pool
> and pre-assigned to the partition's pool. A workload can then be run
> from within a partition, and it consumes resources from that partition's
> pool instead of from the global pool. The firmware manages creating
> partitions and assigning resources to them.
>
> Partitions do not nest.
>
Do partitions have any significance in hardware, or are they just a
logical concept? Does it matter which compute core / bus replicator /
dma channel a user gets, or are they interchangeable between uses?

> In the above user example, the user can create 4 partitions and divide
> up the compute cores among them. Then the user can assign each
> individual container its own individual partition. Each container
> would be limited to the resources within its assigned partition, but
> would also have exclusive access to those resources.
> This essentially provides isolation, and some Quality of Service (QoS).
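If I'm following the model correctly, the admission scheme amounts to
something like the toy sketch below (purely illustrative; the resource
names and numbers are made up rather than taken from your driver):

# Toy model of the admission scheme described above: a workload is
# granted everything it asks for from a pool, or the request is
# rejected outright. Resource names and counts are made up.
from dataclasses import dataclass, field


@dataclass
class Pool:
    resources: dict = field(default_factory=dict)

    def try_load(self, request: dict) -> bool:
        """Grant the whole request from this pool, or reject it."""
        if any(self.resources.get(r, 0) < n for r, n in request.items()):
            return False  # insufficient resources: workload rejected
        for r, n in request.items():
            self.resources[r] -= n
        return True


def carve_partition(global_pool: Pool, reservation: dict) -> Pool:
    """Move a reservation out of the global pool into a new partition."""
    if not global_pool.try_load(reservation):
        raise ValueError("not enough free resources for this partition")
    return Pool(dict(reservation))


global_pool = Pool({"compute_cores": 16, "dma_channels": 8, "semaphores": 32})
partition = carve_partition(global_pool,
                            {"compute_cores": 4, "dma_channels": 2,
                             "semaphores": 8})

print(partition.try_load({"compute_cores": 2, "dma_channels": 1}))  # True
print(partition.try_load({"compute_cores": 8}))                     # False
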
>
> How this is currently implemented (in downstream) is perhaps not ideal.
> A privileged daemon process reads a configuration file which defines
> the number of partitions and the set of resources assigned to each.
> That daemon makes requests to the firmware to create the partitions, and
> gets a unique ID for each. Then the daemon makes a request to the
> driver to create a "shadow device", which is a child dev node. The
> driver verifies with the firmware that the partition ID is valid, and
> then creates the dev node. Internally the driver associates this shadow
> device with the partition ID so that each request to the firmware is
> tagged with the partition ID by the driver. This tagging allows the
> firmware to determine that a request is targeted at a specific
> partition. Finally, the shadow device is passed into the container
> instead of the normal dev node. The userspace within the container
> operates the shadow device normally.
>
> One concern with the current implementation is that it is possible to
> create a large number of partitions. Since each partition is
> represented by a shadow device dev node, this can create a large number
> of dev nodes and exhaust the minor number space.
>
> I wonder if this functionality is better represented by a cgroup.
> Instead of creating a dev node for the partition, we can just run the
> container process within the cgroup. However, it doesn't look like
> cgroups have a concept of resource reservation; it is just a limit. If
> that impression is accurate, then I struggle to see how to provide the
> desired isolation, as some entity not under the cgroup could consume all
> of the device resources, leaving the containers unable to perform their
> tasks.

Given the top-down resource distribution policy for cgroups, I think
you'd have to have a cgroup subtree to which the limits for these
resources are exclusively delegated, and maintain the placement of
processes in the appropriate cgroup under this subtree (one per
partition + global). The limit for these resources in all other
subtrees under the root would need to be 0. The only trick would be to
maintain the limit(s) on the global pool based on the sum of the limits
for the partitions, to ensure that the global pool cannot exhaust
resources "reserved" for the partitions (see the rough sketch at the
bottom of this mail). If partitions don't come and go at runtime, that
seems pretty straightforward; otherwise I could see the maintenance and
adjustment of those limits becoming a source of frustration.

>
> So, cgroup experts, does this sound like something that should be
> represented by a cgroup, or is cgroup the wrong mechanism for this
> use case?
>
> [1] -
> https://lore.kernel.org/all/1660588956-24027-1-git-send-email-quic_jhugo@quicinc.com/
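For what it's worth, the arrangement I have in mind looks roughly like
the untested sketch below. It assumes a hypothetical per-cgroup limit
file ("gpudev.max", modeled on the misc controller's misc.max format)
and made-up resource keys and totals; none of this is an existing
interface:

# Rough sketch of the cgroup arrangement described above: one cgroup
# per partition plus one for the global pool, with the global limit
# kept at "total minus the sum of the reservations" so the global pool
# can never eat into the partitions. "gpudev.max" and the resource keys
# below are hypothetical, modeled on the misc controller's misc.max.
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup/gpudev")
TOTAL = {"compute_cores": 16, "dma_channels": 8}

# Per-container reservations ("partitions").
PARTITIONS = {
    "container1": {"compute_cores": 4, "dma_channels": 2},
    "container2": {"compute_cores": 4, "dma_channels": 2},
}


def write_max(cgroup: Path, limits: dict) -> None:
    """Write one '<resource> <value>' line per key, misc.max style."""
    cgroup.mkdir(parents=True, exist_ok=True)
    (cgroup / "gpudev.max").write_text(
        "".join(f"{res} {val}\n" for res, val in limits.items()))


# Each partition cgroup is limited to exactly its reservation.
for name, limits in PARTITIONS.items():
    write_max(CGROUP_ROOT / name, limits)

# Everything else runs in the "global" cgroup, whose limit is the total
# minus the sum of all reservations, so reserved resources stay free.
write_max(CGROUP_ROOT / "global", {
    res: TOTAL[res] - sum(p[res] for p in PARTITIONS.values())
    for res in TOTAL
})

Adjusting a partition at runtime would then mean updating both that
partition's limit and the global one, which is where I'd expect the
frustration to come from.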