Message-ID: <50fa1365-ab6a-58a1-e82f-ebaf1b623010@igalia.com>
Date: Fri, 14 Jul 2023 09:23:55 -0300
Subject: Re: [PATCH v2 5/6] drm/amdgpu: Log IBs and ring name at coredump
To: Christian König
Cc: pierre-eric.pelloux-prayer@amd.com, dri-devel@lists.freedesktop.org,
 'Marek Olšák', amd-gfx@lists.freedesktop.org, Timur Kristóf,
 linux-kernel@vger.kernel.org, michel.daenzer@mailbox.org, Samuel Pitoiset,
 kernel-dev@igalia.com, Bas Nieuwenhuizen, alexander.deucher@amd.com,
 christian.koenig@amd.com
References: <20230713213242.680944-1-andrealmeid@igalia.com>
 <20230713213242.680944-6-andrealmeid@igalia.com>
 <6485568b-da41-b549-f6bd-36139df59215@gmail.com>
From: André Almeida
In-Reply-To: <6485568b-da41-b549-f6bd-36139df59215@gmail.com>

On 14/07/2023 04:57, Christian König wrote:
> On 13.07.23 at 23:32, André Almeida wrote:
>> Log the IB addresses used by the hung job along with the stuck ring
>> name. Note that due to nested IBs, the one that caused the reset itself
>> may not be in a listed address.
>>
>> Signed-off-by: André Almeida
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  3 +++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31 +++++++++++++++++++++-
>>   2 files changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index e1cc83a89d46..cfeaf93934fd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1086,6 +1086,9 @@ struct amdgpu_coredump_info {
>>       struct amdgpu_task_info         reset_task_info;
>>       struct timespec64               reset_time;
>>       bool                            reset_vram_lost;
>> +    u64                *ibs;
>> +    u32                num_ibs;
>> +    char                ring_name[16];
>>   };
>>   #endif
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 07546781b8b8..431ccc3d7857 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5008,12 +5008,24 @@ static ssize_t amdgpu_devcoredump_read(char
>> *buffer, loff_t offset,
>>                      coredump->adev->reset_dump_reg_value[i]);
>>       }
>> +    if (coredump->num_ibs) {
>> +        drm_printf(&p, "IBs:\n");
>> +        for (i = 0; i < coredump->num_ibs; i++)
>> +            drm_printf(&p, "\t[%d] 0x%llx\n", i, coredump->ibs[i]);
>> +    }
>> +
>> +    if (coredump->ring_name[0] != '\0')
>> +        drm_printf(&p, "ring name: %s\n", coredump->ring_name);
>> +
>>       return count - iter.remain;
>>   }
>>   static void amdgpu_devcoredump_free(void *data)
>>   {
>> -    kfree(data);
>> +    struct amdgpu_coredump_info *coredump = data;
>> +
>> +    kfree(coredump->ibs);
>> +    kfree(coredump);
>>   }
>>   static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
>> @@ -5021,6 +5033,8 @@ static void amdgpu_coredump(struct amdgpu_device
>> *adev, bool vram_lost,
>>   {
>>       struct amdgpu_coredump_info *coredump;
>>       struct drm_device *dev = adev_to_drm(adev);
>> +    struct amdgpu_job *job = reset_context->job;
>> +    int i;
>>       coredump = kmalloc(sizeof(*coredump), GFP_NOWAIT);
>> @@ -5038,6 +5052,21 @@ static void amdgpu_coredump(struct
>> amdgpu_device *adev, bool vram_lost,
>>       coredump->adev = adev;
>> +    if (job && job->num_ibs) {
>
> I really really really don't want any dependency of the core dump
> feature towards the job.
>

Because of the lifetime of the job? Do you think implementing
amdgpu_job_get()/put() would help here?

> What we could do is to record the first executed IB VAs in the hw fence,
> but I'm not sure how useful this is in the first place.
>

I see; any hint here about the timed-out job would be helpful AFAIK.

> We have some internal feature in progress to query the VA of the draw
> command which cause the waves currently executing in the SQ to be
> retrieved.
>
>> +        struct amdgpu_ring *ring = to_amdgpu_ring(job->base.sched);
>> +        u32 num_ibs = job->num_ibs;
>> +
>> +        coredump->ibs = kmalloc_array(num_ibs, sizeof(coredump->ibs),
>> GFP_NOWAIT);
>
> This can fail pretty easily.

Because of its size?
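
For what it's worth, my understanding is that GFP_NOWAIT cannot enter direct
reclaim, so the allocation can fail under memory pressure regardless of its
size. A minimal sketch of how that failure could be handled, where
coredump_copy_ib_addrs() is a hypothetical helper name and not part of this
patch:

/*
 * Sketch only: copy the IB addresses with a nowait allocation and let the
 * caller skip the IB section of the dump when it fails. Assumes the usual
 * amdgpu.h context for struct amdgpu_job.
 */
static u64 *coredump_copy_ib_addrs(struct amdgpu_job *job, u32 *count)
{
    u64 *ibs;
    u32 i;

    *count = 0;

    /* GFP_NOWAIT cannot sleep for reclaim, so NULL is a normal outcome. */
    ibs = kcalloc(job->num_ibs, sizeof(*ibs), GFP_NOWAIT);
    if (!ibs)
        return NULL;

    for (i = 0; i < job->num_ibs; i++)
        ibs[i] = job->ibs[i].gpu_addr;

    *count = job->num_ibs;
    return ibs;
}

The dump would then simply omit the IB section when the copy fails, which is
essentially what the NULL check in the patch already does.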
>
> Christian.
>
>> +        if (coredump->ibs)
>> +            coredump->num_ibs = num_ibs;
>> +
>> +        for (i = 0; i < coredump->num_ibs; i++)
>> +            coredump->ibs[i] = job->ibs[i].gpu_addr;
>> +
>> +        if (ring)
>> +            strncpy(coredump->ring_name, ring->name, 16);
>> +    }
>> +
>>       ktime_get_ts64(&coredump->reset_time);
>>
>>       dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_NOWAIT,
>
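
Coming back to the job lifetime question above, a rough sketch of the
amdgpu_job_get()/put() idea, assuming struct amdgpu_job grew a kref. None of
these helpers exist today; this is only to illustrate how the coredump could
pin the job instead of borrowing it from the reset context:

/*
 * Hypothetical only: assumes struct amdgpu_job gains a
 * 'struct kref refcount' member. Not existing amdgpu API.
 */
static inline struct amdgpu_job *amdgpu_job_get(struct amdgpu_job *job)
{
    kref_get(&job->refcount);
    return job;
}

static void amdgpu_job_release(struct kref *ref)
{
    struct amdgpu_job *job = container_of(ref, struct amdgpu_job, refcount);

    /* A real version would go through the existing job free path. */
    kfree(job);
}

static inline void amdgpu_job_put(struct amdgpu_job *job)
{
    kref_put(&job->refcount, amdgpu_job_release);
}

amdgpu_coredump() would then take a reference while it copies the data and
drop it in amdgpu_devcoredump_free(), so the dump never outlives the job data
it points at.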