Received: by 2002:a05:6a10:9848:0:0:0:0 with SMTP id x8csp859864pxf; Wed, 7 Apr 2021 13:29:29 -0700 (PDT) X-Google-Smtp-Source: ABdhPJx96L1Cjo4W4h5WDvThW+4r2aMD3Yk2+b9bJgwSYJDSQTYymgAfBYkPJ7AG5hiPHFdLz/vA X-Received: by 2002:a02:7086:: with SMTP id f128mr5337938jac.104.1617827369763; Wed, 07 Apr 2021 13:29:29 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1617827369; cv=none; d=google.com; s=arc-20160816; b=LplnFSFPmYOqVeOguHXCEA2A7TyFx8H+UEYrvncAfcqpY8xdNBI+uGbYC7YBzE/AUq wR+S1wRRJHwD1pS+jx5IIIqtdY0iQ/J+G+fXi/FlPuVQ9QzDspOy9u+y4b0jFRY1LBEB dCgzfbfdOsTkoDmHTfb6b4JFtzV+L13seFAaIl2o3GcC2fALFAZAUUGPEyRay8f3AqMt 9SRUtZIWNI0TR7Ir9drL884zUt1cZZa4yWz4Wf+Eeh2/wVxGp/xgM9Mu9StLPxMF15+N 6obF9WJU7mzsADzwG+0TL45xnh+L57qdn4HX3SMjwksRfxL1q/INjghhwQfsRDdYWd/L NL+Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=plCjkshfPYRJPx8mJ31yMaLutaXbhwBVGKKe4dKObKg=; b=MS1thsaV+lbO3ouXxewLk8KAwRVqNbXNoHMpJg+Nfa6p/RtJGswQhFZMHvx6+iwN7o Q1W11D93IGl6WA/jUhGn4//wrh+nYAk1vftDQQ0oIcv4i+mYgr12bu3hUAD9kJY/tws+ r4gD+F9Z8KJV9Ni7duK+PvTq6lh082qxAQCsA7Es/aG6hLnhGtYIFbSsm9vVUvBnwqWX 7ezBVL1L2ikq75DCTco3wuj9bHAmWc0A1pP30M5m3caYfLX4JpTqmB8weOtyLBMpqFBK tYFNGzu1/8C5CdG/Z+zd1A7MQxuJit/QYQoPjNRjAUy2Zp2RgBeZecxD+AvjitegWKMI MWCA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@qtec.com header.s=google header.b=SBNCdqNh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=qtec.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id a14si22967554jap.42.2021.04.07.13.29.17; Wed, 07 Apr 2021 13:29:29 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@qtec.com header.s=google header.b=SBNCdqNh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=qtec.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1346825AbhDGHsF (ORCPT + 99 others); Wed, 7 Apr 2021 03:48:05 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43772 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1346805AbhDGHsE (ORCPT ); Wed, 7 Apr 2021 03:48:04 -0400 Received: from mail-ed1-x534.google.com (mail-ed1-x534.google.com [IPv6:2a00:1450:4864:20::534]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BD771C06174A for ; Wed, 7 Apr 2021 00:47:53 -0700 (PDT) Received: by mail-ed1-x534.google.com with SMTP id w23so8394034edx.7 for ; Wed, 07 Apr 2021 00:47:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=qtec.com; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=plCjkshfPYRJPx8mJ31yMaLutaXbhwBVGKKe4dKObKg=; b=SBNCdqNh5r5ZKz/pliBDgu/zjGwpff6X/nMMNlQmT7otug07Z6cKL91Q8eqijlggW1 GIU4Zkpxf9PRcK+1jKgOYrph1WS/8B9detl2G+Zy0E3x6BQs3YQcuMZ3h8Zu+Od6FyDo yAaSC2CsjNIh8jstnXqh4/49PkqX+G1QvY9oi+8j8zsMHlhEmLHg3hDZptpOEOq98/WM 5zXIMnL/POlJkAKeagAFrLAo3kNPu/xPJ7Pg8qS3ntuwMV7AH+qxamiHzrENUGdzIejJ kg8olDBaXWedrKL+0s0GV6BP57iWHHu0x2yQbfCclpx4iRguiHa770hQ4f9rXEQSPQXl p9jw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=plCjkshfPYRJPx8mJ31yMaLutaXbhwBVGKKe4dKObKg=; b=GWFZHvVttTi1ihLWdYkDufJEDGg1uYxtL+vhodmPov5izN83uj6CYwgj6PjfvBqLfG kW4VWWsqjtm5k3S272otz/TzVb230ImyBse8vhXBJ2QA1yfqgNLwsqbrBPlIOEkoLTj6 LzKRVVZHuj6I+FmHtoQWF13YofO99RibE4v0xyz8Wl9Kz1h3Wy0Xv/Ke8rIu2nPTvLcj nW2gHrGeUrM+mSyEUtWyHc+fDqWiJ6YGcV26MudCI64WdN7qEtH1BfrRqZpFAr0jpG8r HtKtNvjltLuv/yHSP8eqjdTytdujxOgSyiebiNVzJrnBVX34mThfAh3AEjo77zWzIaZ0 IZSQ== X-Gm-Message-State: AOAM532sN9YTZii0RmimpxwagPJMq80KnHQDbCNtj5OMuUvgU0KcLHga zVP+BBDx6F/Sud7Cfjk2SazDZfxKOUkYrq/CQAffiA== X-Received: by 2002:aa7:c5c4:: with SMTP id h4mr2669994eds.375.1617781672421; Wed, 07 Apr 2021 00:47:52 -0700 (PDT) MIME-Version: 1.0 References: <20210318083236.43578-1-daniel@qtec.com> <375f0915-83b3-c729-b95f-939d828d24d0@amd.com> In-Reply-To: From: Daniel Gomez Date: Wed, 7 Apr 2021 09:47:41 +0200 Message-ID: Subject: Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages To: Alex Deucher Cc: =?UTF-8?Q?Christian_K=C3=B6nig?= , Felix Kuehling , David Airlie , "linux-kernel@vger.kernel.org" , "dri-devel@lists.freedesktop.org" , "Deucher, Alexander" , "amd-gfx@lists.freedesktop.org" , "Koenig, Christian" , "dagmcr@gmail.com" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 6 Apr 2021 at 22:56, Alex Deucher wrote: > > On Mon, Mar 22, 2021 at 6:34 AM Christian K=C3=B6nig > wrote: > > > > Hi Daniel, > > > > Am 22.03.21 um 10:38 schrieb Daniel Gomez: > > > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling = wrote: > > >> This caused a regression in kfdtest in a large-buffer stress test af= ter > > >> memory allocation for user pages fails: > > > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and > > > not this one. > > > Just some background for the mem leak patch if helps to understand th= is: > > > The leak was introduce here: > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/co= mmit/?id=3D0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb > > > where the bound status was introduced for all drm drivers including > > > radeon and amdgpu. So this patch just reverts the logic to the > > > original code but keeping the bound status. In my case, the binding > > > code allocates the user pages memory and returns without bounding (at > > > amdgpu_gtt_mgr_has_gart_addr). So, > > > when the unbinding happens, the memory needs to be cleared to prevent= the leak. > > > > Ah, now I understand what's happening here. Daniel your patch is not > > really correct. > > > > The problem is rather that we don't set the tt object to bound if it > > doesn't have a GTT address. > > > > Going to provide a patch for this. > > Did this patch ever land? I don't think so but I might send a v2 following Christian's comment if you guys agree. Also, the patch here is for radeon but the pagefault issue reported by Felix is affected by the amdgpu one: radeon patch: drm/radeon/ttm: Fix memory leak userptr pages https://patchwork.kernel.org/project/dri-devel/patch/20210318083236.43578-1= -daniel@qtec.com/ amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages https://patchwork.kernel.org/project/dri-devel/patch/20210317160840.36019-1= -daniel@qtec.com/ I assume both need to be fixed with the same approach. Daniel > > Alex > > > > > Regards, > > Christian. > > > > > > > >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -1= 6 > > >> [17359.543746] BUG: kernel NULL pointer dereference, address: 000000= 0000000000 > > >> [17359.551494] #PF: supervisor read access in kernel mode > > >> [17359.557375] #PF: error_code(0x0000) - not-present page > > >> [17359.563247] PGD 0 P4D 0 > > >> [17359.566514] Oops: 0000 [#1] SMP PTI > > >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd= -fkuehlin #193 > > >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS= 3201 06/17/2016 > > >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgp= u] > > >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8= 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e= 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45 > > >> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206 > > >> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000= 000000000000 > > >> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff= 950eadec5e80 > > >> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000= 000000000000 > > >> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff= 950c03377800 > > >> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000= 000000000000 > > >> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000)= knlGS:0000000000000000 > > >> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > >> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 0000= 0000001706e0 > > >> [17359.682883] Call Trace: > > >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu] > > >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm] > > >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm] > > >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu] > > >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 = [amdgpu] > > >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu] > > >> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu] > > >> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu] > > >> [17359.734152] __x64_sys_ioctl+0x8b/0xd0 > > >> [17359.738796] do_syscall_64+0x2d/0x40 > > >> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > >> [17359.749205] RIP: 0033:0x7febb083b6d7 > > >> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00= 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f= 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48 > > >> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX:= 0000000000000010 > > >> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000= 7febb083b6d7 > > >> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000= 000000000003 > > >> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 0000= 0000c4000004 > > >> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000= 559416e4e2aa > > >> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000= 000000000000 > > >> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable= _filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched i= p_tables x_tables > > >> [17359.837776] CR2: 0000000000000000 > > >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]--- > > >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgp= u] > > >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8= 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e= 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45 > > >> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206 > > >> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000= 000000000000 > > >> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff= 950eadec5e80 > > >> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000= 000000000000 > > >> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff= 950c03377800 > > >> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000= 000000000000 > > >> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000)= knlGS:0000000000000000 > > >> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > >> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 0000= 0000001706e0 > > > From what I understand, the init_user_pages fails (returns EBUSY) an= d > > > the code goes to allocate_init_user_pages_failed where the unbind and > > > the userptr clear occurs. > > > Can we prevent this if we save the bounding status + userptr alloc? s= o > > > the function amdgpu_ttm_backend_unbind returns without trying to clea= r > > > the userptr memory? > > > > > > Something like: > > > > > > amdgpu_ttm_backend_bind: > > > if (gtt->userptr) { > > > r =3D amdgpu_ttm_tt_pin_userptr(bdev, ttm); > > > if (r) ... > > > gtt->sg_table =3D true; > > > } > > > > > > amdgpu_ttm_backend_unbind: > > > if (gtt->sg_table) { > > > if (gtt->user_ptr) ... > > > } > > > > > > If you agree, I'll send a v2 patch. Otherwise, maybe we could return > > > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated. > > > > > > Any other ideas? > > > > > > Regards, > > > Daniel > > > > > >> Reverting this patch fixes the problem for me. > > >> > > >> Regards, > > >> Felix > > >> > > >> On 2021-03-18 10:57 p.m., Alex Deucher wrote: > > >>> Applied. Thanks! > > >>> > > >>> Alex > > >>> > > >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian > > >>> wrote: > > >>>> Reviewed-by: Christian K=C3=B6nig > > >>>> ________________________________ > > >>>> Von: Daniel Gomez > > >>>> Gesendet: Donnerstag, 18. M=C3=A4rz 2021 09:32 > > >>>> Cc: dagmcr@gmail.com ; Daniel Gomez ; Deucher, Alexander ; Koenig, Christian ; David Airlie ; Daniel Vetter ; amd-gfx@lists.freedesktop.org ; dri-devel@lists.freedesktop.org ; lin= ux-kernel@vger.kernel.org > > >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages > > >>>> > > >>>> If userptr pages have been pinned but not bounded, > > >>>> they remain uncleared. > > >>>> > > >>>> Signed-off-by: Daniel Gomez > > >>>> --- > > >>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++-- > > >>>> 1 file changed, 3 insertions(+), 2 deletions(-) > > >>>> > > >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm= /radeon/radeon_ttm.c > > >>>> index e8c66d10478f..bbcc6264d48f 100644 > > >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c > > >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c > > >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct= ttm_bo_device *bdev, struct ttm_tt > > >>>> struct radeon_ttm_tt *gtt =3D (void *)ttm; > > >>>> struct radeon_device *rdev =3D radeon_get_rdev(bdev); > > >>>> > > >>>> + if (gtt->userptr) > > >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm); > > >>>> + > > >>>> if (!gtt->bound) > > >>>> return; > > >>>> > > >>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages); > > >>>> > > >>>> - if (gtt->userptr) > > >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm); > > >>>> gtt->bound =3D false; > > >>>> } > > >>>> > > >>>> -- > > >>>> 2.30.2 > > >>>> > > >>>> _______________________________________________ > > >>>> dri-devel mailing list > > >>>> dri-devel@lists.freedesktop.org > > >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel > > >>> _______________________________________________ > > >>> dri-devel mailing list > > >>> dri-devel@lists.freedesktop.org > > >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel > > > _______________________________________________ > > > amd-gfx mailing list > > > amd-gfx@lists.freedesktop.org > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx > >