From: Marek Olšák
Date: Tue, 8 Aug 2023 13:03:11 -0400
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
To: Sebastian Wick
Cc: André Almeida, dri-devel@lists.freedesktop.org,
 amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org,
 kernel-dev@igalia.com, alexander.deucher@amd.com,
 christian.koenig@amd.com, pierre-eric.pelloux-prayer@amd.com,
 Simon Ser, Rob Clark, Pekka Paalanen, Daniel Stone, Dave Airlie,
 Michel Dänzer, Samuel Pitoiset, Timur Kristóf, Bas Nieuwenhuizen,
 Randy Dunlap, Pekka Paalanen
References: <20230627132323.115440-1-andrealmeid@igalia.com>
X-Mailing-List: linux-kernel@vger.kernel.org

It's the same situation as SIGSEGV. A process can catch the signal,
but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
to catch the GPU error and prevent the process termination. If you
don't use the API, you'll get undefined behavior, which means anything
can happen, including process termination.

Marek

On Tue, Aug 8, 2023 at 8:14 AM Sebastian Wick wrote:
>
> On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter wrote:
> >
> > On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen
> > > Signed-off-by: André Almeida
> > > ---
> > >
> > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> > >
> > > Changes:
> > > - Grammar fixes (Randy)
> > >
> > >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > >  mmapped regular files. Threads cause additional pain with signal
> > >  handling as well.
> > >
> > > +Device reset
> > > +============
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > > +faulty applications and everything in between the many layers. Some errors
> > > +require resetting the device in order to make the device usable again.
> > > +This sections describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +------------------
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to perform
> > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > > +should keep track of resets, because userspace can query any time about the
> > > +reset stats for an specific context. This is needed to propagate to the rest of
> > > +the stack that a reset has happened. Currently, this is implemented by each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +----------------
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the device has
> > > +been reset, and this can be checked more often if the UMD requires it. After
> > > +detecting a reset, UMD will then proceed to report it to the application using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +----------
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with device resets. However,
> > > +there is no guarantee that the app will use such features correctly, and the
> > > +UMD can implement policies to close the app if it is a repeating offender,
> >
> > Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> > offender killing is more a policy where the kernel enforces policy, and no
> > longer up to userspace to dtrt (because very clearly userspace is not
> > really doing the right thing anymore when it's just hanging the gpu in an
> > endless loop).
> > Also maybe tune it down further to something like "the kernel driver
> > may implement ..."
> >
> > In my opinion the umd shouldn't implement these kinds of magic guesses, the
> > entire point of robustness apis is to delegate responsibility for
> > correctly recovering to the application. And the kernel is left with
> > enforcing fair resource usage policies (which eventually might be a
> > cgroups limit on how much gpu time you're allowed to waste with gpu
> > resets).
>
> Killing apps that the kernel thinks are misbehaving really doesn't
> seem like a good idea to me. What if the process is a service getting
> restarted after getting killed? What if killing that process leaves
> the system in a bad state?
>
> Can't the kernel provide some information to user space so that e.g.
> systemd can handle those situations?
>
> > > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > +the user interface from being correctly displayed. This should be done even if
> > > +the app is correct but happens to trigger some bug in the hardware/driver.
> > > +
> > > +OpenGL
> > > +~~~~~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > > +interface tells if a reset has happened, and if so, all the context state is
> > > +considered lost and the app proceeds by creating new ones. If it is possible to
> > > +determine that robustness is not in use, the UMD will terminate the app when a
> > > +reset is detected, giving that the contexts are lost and the app won't be able
> > > +to figure this out and recreate the contexts.
> > > +
> > > +Vulkan
> > > +~~~~~~
> > > +
> > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > > +This error code means, among other things, that a device reset has happened and
> > > +it needs to recreate the contexts to keep going.
> > > +
> > > +Reporting causes of resets
> > > +--------------------------
> > > +
> > > +Apart from propagating the reset through the stack so apps can recover, it's
> > > +really useful for driver developers to learn more about what caused the reset in
> > > +first place. DRM devices should make use of devcoredump to store relevant
> > > +information about the reset, so this information can be added to user bug
> > > +reports.
> >
> > Since we do not seem to have a solid consensus in the community about
> > non-robust userspace, maybe we could just document that lack of consensus
> > to unblock this patch? Something like this:
> >
> > Non-Robust Userspace
> > --------------------
> >
> > Userspace that doesn't support robust interfaces (like a non-robust
> > OpenGL context or API without any robustness support like libva) leaves the
> > robustness handling entirely to the userspace driver. There is no strong
> > community consensus on what the userspace driver should do in that case,
> > since all reasonable approaches have some clear downsides.
> >
> > With the s/UMD/KMD/ further up and maybe something added to record the
> > non-robustness non-consensus:
> >
> > Acked-by: Daniel Vetter
> >
> > Cheers, Daniel
> >
> >
> >
> > > +
> > >  .. _drm_driver_ioctl:
> > >
> > >  IOCTL Support on Device Nodes
> > > --
> > > 2.41.0
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> >
>
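
The SIGSEGV analogy at the top of this message can be tried out with a
small sketch. It is not part of the patch and says nothing about how any
driver behaves; it only shows the signal side of the comparison. The
helper name run_child is hypothetical, the code assumes a POSIX system
where os.fork is available, and the signal is sent synthetically with
os.kill rather than produced by a real memory fault or GPU reset.

```python
import os
import signal


def run_child(install_handler: bool) -> tuple:
    """Fork a child, send it SIGSEGV, and report how the child ended.

    Hypothetical helper for illustration only. Returns ("exited", code)
    or ("killed", signal_number).
    """
    pid = os.fork()
    if pid == 0:  # child process
        caught = []
        if install_handler:
            # "Robust" case: the process opted in to handling the fault.
            signal.signal(signal.SIGSEGV,
                          lambda signum, frame: caught.append(signum))
        else:
            # "Non-robust" case: keep the default action, which terminates.
            signal.signal(signal.SIGSEGV, signal.SIG_DFL)
        os.kill(os.getpid(), signal.SIGSEGV)
        # Only reached if the handler absorbed the signal.
        os._exit(0 if caught else 1)

    # Parent: wait for the child and decode its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return ("killed", os.WTERMSIG(status))
    return ("exited", os.WEXITSTATUS(status))
```

Under these assumptions, run_child(True) should report a normal exit and
run_child(False) should report termination by SIGSEGV. That is the same
split described above: an application that uses the robustness API gets
to see the error and decide what to do, while one that does not is left
with whatever the default outcome happens to be.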