Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp38057603rwd; Wed, 12 Jul 2023 02:19:25 -0700 (PDT) X-Google-Smtp-Source: APBJJlH0hf/sLo0YOm6UZZ2/KFJFmgiVMjpMsHOUtNe5dAhgeusepTIJmvFI6lz3+dPNRMAC38WI X-Received: by 2002:a05:6358:7f05:b0:135:57d0:d170 with SMTP id p5-20020a0563587f0500b0013557d0d170mr8867963rwn.11.1689153564976; Wed, 12 Jul 2023 02:19:24 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1689153564; cv=none; d=google.com; s=arc-20160816; b=R0X+1cIe5KC4k/O2ucg7YHoxbhhVnu79rpro+W695Jdksnl9gyFksnK6blfHc5CUlK c6BIGtzUYufX/6j+9VN+BdQyR3Y1WhyCl201X3MWW9zVM6xG6CcSe+v8mypiF4WWy34P fNTv1Ch/FY72WSt1tUrhApw8tX8D9+pmNPHuqczYQ0KQPHmgN6bTNqZBEOWGWIgK5T7/ lVorzLbdjb6msCeL8ImcfFEhlmfdb1FITZcaf7I7WFjOl6X+gU1dHAoDZb3WnmpAHSYB FP5CSDMf4AF9eVeGeBuxgCujtjNLVBGqoVbiPUnzPbW/NzQRUUAfyRm5PbWXiOwqHP9w kacA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:user-agent :content-transfer-encoding:references:in-reply-to:date:cc:to:from :subject:message-id; bh=tqDp0N7Irkpo2k8FddOq/MBn1uPtmgnLmt2nTj2oAQY=; fh=as83xKXz40zkMGkThvvAdXZoMmgBjHMDzpLxpmxp5DI=; b=I9Rhukb9ArJUjxCWIMVzA8GWKjgeQT42pFpW6X5wJQK3uNAdTjXhS0SX3eAke62kUn em8iAqm+9Geg1IrGR8dFpdamngnhp5NtVzCNQvtNErg5qb531SsW1f8vFt/DeNdwJhKZ B1v0ivAgbgPxKh8OFGf7Y21tPygtIJ0od0AjnRHZ7/7ZdU+pk/8PLIq+U4DaykeVyS96 L7aYkcfglLXLiElwAwwY/TlQZN5ntsCF4gsXuKgYuULjRfGWgFA68O61KODS0h0JLcvp ni+CIBnPtkf0Zv1+gTE7EyRh1Mxg4mqn737NH45X1+JcTpdIKq4z4M64HWD2yaUVqDsZ RKhA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id s7-20020a637707000000b0055380fbd9e1si2794858pgc.483.2023.07.12.02.19.13; Wed, 12 Jul 2023 02:19:24 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229475AbjGLJCb convert rfc822-to-8bit (ORCPT + 99 others); Wed, 12 Jul 2023 05:02:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51812 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232963AbjGLJBX (ORCPT ); Wed, 12 Jul 2023 05:01:23 -0400 Received: from metis.ext.pengutronix.de (metis.ext.pengutronix.de [IPv6:2001:67c:670:201:290:27ff:fe1d:cc33]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0935A1994 for ; Wed, 12 Jul 2023 02:00:07 -0700 (PDT) Received: from ptz.office.stw.pengutronix.de ([2a0a:edc0:0:900:1d::77] helo=[IPv6:::1]) by metis.ext.pengutronix.de with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1qJVhX-0002vB-0h; Wed, 12 Jul 2023 10:59:59 +0200 Message-ID: <0e7f2b0cc29ac77d4a55d0de6a66c477d867fbf7.camel@pengutronix.de> Subject: Re: [PATCH 3/6] drm/amdgpu: Rework coredump to use memory dynamically From: Lucas Stach To: Christian =?ISO-8859-1?Q?K=F6nig?= , =?ISO-8859-1?Q?Andr=E9?= Almeida , dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org Cc: pierre-eric.pelloux-prayer@amd.com, 'Marek =?UTF-8?Q?Ol=C5=A1=C3=A1k=27?= , Timur =?ISO-8859-1?Q?Krist=F3f?= , michel.daenzer@mailbox.org, Samuel Pitoiset , kernel-dev@igalia.com, alexander.deucher@amd.com Date: Wed, 12 Jul 2023 10:59:56 +0200 In-Reply-To: References: <20230711213501.526237-1-andrealmeid@igalia.com> <20230711213501.526237-4-andrealmeid@igalia.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8BIT User-Agent: Evolution 3.46.4 (3.46.4-1.fc37) MIME-Version: 1.0 X-SA-Exim-Connect-IP: 2a0a:edc0:0:900:1d::77 X-SA-Exim-Mail-From: l.stach@pengutronix.de X-SA-Exim-Scanned: No (on metis.ext.pengutronix.de); SAEximRunCond expanded to false X-PTX-Original-Recipient: linux-kernel@vger.kernel.org X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Am Mittwoch, dem 12.07.2023 um 10:37 +0200 schrieb Christian König: > Am 11.07.23 um 23:34 schrieb André Almeida: > > Instead of storing coredump information inside amdgpu_device struct, > > move if to a proper separated struct and allocate it dynamically. This > > will make it easier to further expand the logged information. > > Verry big NAK to this. The problem is that memory allocation isn't > allowed during a GPU reset. > > What you could do is to allocate the memory with GFP_ATOMIC or similar, > but for a large structure that might not be possible. > I'm still not fully clear on what the rules are here. In etnaviv we do devcoredump allocation in the GPU reset path with __GFP_NOWARN | __GFP_NORETRY, which means the allocation will kick memory reclaim if necessary, but will just give up if no memory could be made available easily. This satisfies the forward progress guarantee in the absence of successful memory allocation, which is the most important property in this path, I think. However, I'm not sure if the reclaim could lead to locking issues or something like that with the more complex use-cases with MMU notifiers and stuff like that. Christian, do you have any experience or information that would be good to share in this regard? Regards, Lucas > Regards, > Christian. > > > > > Signed-off-by: André Almeida > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 14 +++-- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 65 ++++++++++++++-------- > > 2 files changed, 51 insertions(+), 28 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > index dbe062a087c5..e1cc83a89d46 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > @@ -1068,11 +1068,6 @@ struct amdgpu_device { > > uint32_t *reset_dump_reg_list; > > uint32_t *reset_dump_reg_value; > > int num_regs; > > -#ifdef CONFIG_DEV_COREDUMP > > - struct amdgpu_task_info reset_task_info; > > - bool reset_vram_lost; > > - struct timespec64 reset_time; > > -#endif > > > > bool scpm_enabled; > > uint32_t scpm_status; > > @@ -1085,6 +1080,15 @@ struct amdgpu_device { > > uint32_t aid_mask; > > }; > > > > +#ifdef CONFIG_DEV_COREDUMP > > +struct amdgpu_coredump_info { > > + struct amdgpu_device *adev; > > + struct amdgpu_task_info reset_task_info; > > + struct timespec64 reset_time; > > + bool reset_vram_lost; > > +}; > > +#endif > > + > > static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev) > > { > > return container_of(ddev, struct amdgpu_device, ddev); > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index e25f085ee886..23b9784e9787 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -4963,12 +4963,17 @@ static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev) > > return 0; > > } > > > > -#ifdef CONFIG_DEV_COREDUMP > > +#ifndef CONFIG_DEV_COREDUMP > > +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, > > + struct amdgpu_reset_context *reset_context) > > +{ > > +} > > +#else > > static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset, > > size_t count, void *data, size_t datalen) > > { > > struct drm_printer p; > > - struct amdgpu_device *adev = data; > > + struct amdgpu_coredump_info *coredump = data; > > struct drm_print_iterator iter; > > int i; > > > > @@ -4982,21 +4987,21 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset, > > drm_printf(&p, "**** AMDGPU Device Coredump ****\n"); > > drm_printf(&p, "kernel: " UTS_RELEASE "\n"); > > drm_printf(&p, "module: " KBUILD_MODNAME "\n"); > > - drm_printf(&p, "time: %lld.%09ld\n", adev->reset_time.tv_sec, adev->reset_time.tv_nsec); > > - if (adev->reset_task_info.pid) > > + drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredump->reset_time.tv_nsec); > > + if (coredump->reset_task_info.pid) > > drm_printf(&p, "process_name: %s PID: %d\n", > > - adev->reset_task_info.process_name, > > - adev->reset_task_info.pid); > > + coredump->reset_task_info.process_name, > > + coredump->reset_task_info.pid); > > > > - if (adev->reset_vram_lost) > > + if (coredump->reset_vram_lost) > > drm_printf(&p, "VRAM is lost due to GPU reset!\n"); > > - if (adev->num_regs) { > > + if (coredump->adev->num_regs) { > > drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n"); > > > > - for (i = 0; i < adev->num_regs; i++) > > + for (i = 0; i < coredump->adev->num_regs; i++) > > drm_printf(&p, "0x%08x: 0x%08x\n", > > - adev->reset_dump_reg_list[i], > > - adev->reset_dump_reg_value[i]); > > + coredump->adev->reset_dump_reg_list[i], > > + coredump->adev->reset_dump_reg_value[i]); > > } > > > > return count - iter.remain; > > @@ -5004,14 +5009,34 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset, > > > > static void amdgpu_devcoredump_free(void *data) > > { > > + kfree(data); > > } > > > > -static void amdgpu_reset_capture_coredumpm(struct amdgpu_device *adev) > > +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost, > > + struct amdgpu_reset_context *reset_context) > > { > > + struct amdgpu_coredump_info *coredump; > > struct drm_device *dev = adev_to_drm(adev); > > > > - ktime_get_ts64(&adev->reset_time); > > - dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_KERNEL, > > + coredump = kmalloc(sizeof(*coredump), GFP_KERNEL); > > + > > + if (!coredump) { > > + DRM_ERROR("%s: failed to allocate memory for coredump\n", __func__); > > + return; > > + } > > + > > + memset(coredump, 0, sizeof(*coredump)); > > + > > + coredump->reset_vram_lost = vram_lost; > > + > > + if (reset_context->job && reset_context->job->vm) > > + coredump->reset_task_info = reset_context->job->vm->task_info; > > + > > + coredump->adev = adev; > > + > > + ktime_get_ts64(&coredump->reset_time); > > + > > + dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL, > > amdgpu_devcoredump_read, amdgpu_devcoredump_free); > > } > > #endif > > @@ -5119,15 +5144,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle, > > goto out; > > > > vram_lost = amdgpu_device_check_vram_lost(tmp_adev); > > -#ifdef CONFIG_DEV_COREDUMP > > - tmp_adev->reset_vram_lost = vram_lost; > > - memset(&tmp_adev->reset_task_info, 0, > > - sizeof(tmp_adev->reset_task_info)); > > - if (reset_context->job && reset_context->job->vm) > > - tmp_adev->reset_task_info = > > - reset_context->job->vm->task_info; > > - amdgpu_reset_capture_coredumpm(tmp_adev); > > -#endif > > + > > + amdgpu_coredump(tmp_adev, vram_lost, reset_context); > > + > > if (vram_lost) { > > DRM_INFO("VRAM is lost due to GPU reset!\n"); > > amdgpu_inc_vram_lost(tmp_adev); >