From: Daniel Vetter
Date: Tue, 12 May 2020 18:20:18 +0200
Subject: Re: [RFC 10/17] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
To: Christian König
Cc: DRI Development, LKML, "open list:DMA BUFFER SHARING FRAMEWORK",
    "moderated list:DMA BUFFER SHARING FRAMEWORK", linux-rdma,
    amd-gfx list, intel-gfx, Chris Wilson, Maarten Lankhorst,
    Daniel Vetter
In-Reply-To: <879b127e-2180-bc59-f522-252416a7ac01@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, May 12, 2020 at 5:56 PM Christian König wrote:
>
> Hui what? Offhand that doesn't look correct to me.

It's not GFP_ATOMIC, it's just that GFP_ATOMIC is the only shotgun we
have to avoid direct reclaim. And direct reclaim might need to call into
your mmu notifier, which might need to wait on a fence, which is never
going to happen because your scheduler is stuck.
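
To make the flag relationship concrete, here is a small userspace C
sketch (not kernel code). The SK_ names and bit values are made up for
illustration; only the way the masks are composed is meant to mirror the
kernel's GFP_KERNEL, GFP_NOFS, GFP_NOIO and GFP_ATOMIC as I understand
them. The sk_fs_reclaim_checked() helper is a hypothetical stand-in for
"lockdep validates this allocation", modelled as "the FS bit is set".

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real values differ. */
#define SK_GFP_IO             0x01u
#define SK_GFP_FS             0x02u
#define SK_GFP_DIRECT_RECLAIM 0x04u
#define SK_GFP_KSWAPD_RECLAIM 0x08u
#define SK_GFP_HIGH           0x10u

/* Composite masks, composed the same way as the kernel ones. */
#define SK_GFP_RECLAIM (SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)
#define SK_GFP_KERNEL  (SK_GFP_RECLAIM | SK_GFP_IO | SK_GFP_FS)
#define SK_GFP_NOFS    (SK_GFP_RECLAIM | SK_GFP_IO)
#define SK_GFP_NOIO    (SK_GFP_RECLAIM)
#define SK_GFP_ATOMIC  (SK_GFP_HIGH | SK_GFP_KSWAPD_RECLAIM)

/* True if the allocation may enter direct reclaim in the caller. */
static bool sk_may_direct_reclaim(unsigned int flags)
{
	return (flags & SK_GFP_DIRECT_RECLAIM) != 0;
}

/* Hypothetical model of an annotation that only looks at the FS bit. */
static bool sk_fs_reclaim_checked(unsigned int flags)
{
	return (flags & SK_GFP_FS) != 0;
}
```

In this model SK_GFP_NOFS and SK_GFP_NOIO can still enter direct reclaim
but are not checked, while SK_GFP_ATOMIC is the only mask of the four
that cannot enter direct reclaim at all.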

Note that all the explanations for the deadlocks and stuff I'm trying to
hunt here are in the other patches; the driver ones are more
informational, so I left these rather bare-bones, just enough to shut up
lockdep so I can get through the entire driver and all major areas
(scheduler, reset, modeset code).

Now you can do something like GFP_NOFS, but the only reason that works
is because the direct reclaim annotations (fs_reclaim_acquire/release)
only validate against __GFP_FS, and not against any of the other flags.
We should probably add some lockdep annotations so that __GFP_RECLAIM is
annotated against the __mmu_notifier_invalidate_range_start_map lockdep
map I've recently added for mmu notifiers.

End result (assuming I'm not mixing anything up here, this is all rather
tricky stuff): GFP_ATOMIC is the only kind of memory allocation you can
do.

> Why the heck should this be an atomic context? If that's correct
> allocating memory is the least of the problems we have.

It's not about atomic, it's !__GFP_RECLAIM. Which more or less is
GFP_ATOMIC. Correct fix is probably GFP_ATOMIC + a mempool for the
scheduler fences so that if you can't allocate them for some reason, you
at least know that your scheduler should eventually retire some of them,
which you can then pick up from the mempool to guarantee forward
progress.

But I really didn't dig into details of the code, this was just a quick
hack.

So sleeping and taking all kinds of locks (but not all, e.g.
dma_resv_lock and drm_modeset_lock are no-go) is still totally ok. Just
think

#define GFP_NO_DIRECT_RECLAIM GFP_ATOMIC

Cheers, Daniel

>
> Regards,
> Christian.
>
> On 12.05.20 at 10:59, Daniel Vetter wrote:
> > My dma-fence lockdep annotations caught an inversion because we
> > allocate memory where we really shouldn't:
> >
> >   kmem_cache_alloc+0x2b/0x6d0
> >   amdgpu_fence_emit+0x30/0x330 [amdgpu]
> >   amdgpu_ib_schedule+0x306/0x550 [amdgpu]
> >   amdgpu_job_run+0x10f/0x260 [amdgpu]
> >   drm_sched_main+0x1b9/0x490 [gpu_sched]
> >   kthread+0x12e/0x150
> >
> > Trouble right now is that lockdep only validates against GFP_FS, which
> > would be good enough for shrinkers. But for mmu_notifiers we actually
> > need !GFP_ATOMIC, since they can be called from any page laundering,
> > even if GFP_NOFS or GFP_NOIO are set.
> >
> > I guess we should improve the lockdep annotations for
> > fs_reclaim_acquire/release.
> >
> > Ofc real fix is to properly preallocate this fence and stuff it into
> > the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
> > the way.
> >
> > v2: Two more allocations in scheduler paths.
> >
> > First one:
> >
> >   __kmalloc+0x58/0x720
> >   amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
> >   amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> >   drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> >   drm_sched_main+0xf9/0x490 [gpu_sched]
> >
> > Second one:
> >
> >   kmem_cache_alloc+0x2b/0x6d0
> >   amdgpu_sync_fence+0x7e/0x110 [amdgpu]
> >   amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
> >   amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> >   drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> >   drm_sched_main+0xf9/0x490 [gpu_sched]
> >
> > Cc: linux-media@vger.kernel.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Cc: linux-rdma@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-gfx@lists.freedesktop.org
> > Cc: Chris Wilson
> > Cc: Maarten Lankhorst
> > Cc: Christian König
> > Signed-off-by: Daniel Vetter
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
> >  3 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > index d878fe7fee51..055b47241bb1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
> >  	uint32_t seq;
> >  	int r;
> >
> > -	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
> > +	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
> >  	if (fence == NULL)
> >  		return -ENOMEM;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> > index fe92dcd94d4a..fdcd6659f5ad 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> > @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
> >  	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
> >  		return amdgpu_sync_fence(sync, ring->vmid_wait, false);
> >
> > -	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
> > +	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
> >  	if (!fences)
> >  		return -ENOMEM;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> > index b87ca171986a..330476cc0c86 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> > @@ -168,7 +168,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f,
> >  	if (amdgpu_sync_add_later(sync, f, explicit))
> >  		return 0;
> >
> > -	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
> > +	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
> >  	if (!e)
> >  		return -ENOMEM;
> >
>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
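
The "GFP_ATOMIC + a mempool" idea from the reply can be sketched in
userspace C as follows. This is only a rough model under stated
assumptions, not the amdgpu or kernel implementation: every sketch_*
name is hypothetical, malloc() stands in for a non-blocking slab
allocation, and the fast_path_ok argument simulates that allocation
succeeding or failing. In the kernel the mempool API (mempool_alloc()
and mempool_free()) would likely be the facility to reach for instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_POOL_MIN 4	/* size of the preallocated reserve */

struct sketch_fence {
	unsigned int seq;
};

struct sketch_pool {
	struct sketch_fence *reserve[SKETCH_POOL_MIN];
	size_t nr_free;		/* reserve entries currently available */
};

/* Release everything still sitting in the reserve. */
static void sketch_pool_destroy(struct sketch_pool *p)
{
	while (p->nr_free > 0)
		free(p->reserve[--p->nr_free]);
}

/* Fill the reserve up front, where blocking allocation is still fine. */
static bool sketch_pool_init(struct sketch_pool *p)
{
	p->nr_free = 0;
	while (p->nr_free < SKETCH_POOL_MIN) {
		struct sketch_fence *f = malloc(sizeof(*f));

		if (!f) {
			sketch_pool_destroy(p);
			return false;
		}
		p->reserve[p->nr_free++] = f;
	}
	return true;
}

/*
 * Try the non-blocking allocation first and fall back to the reserve.
 * NULL means the caller has to wait until a fence is retired.
 */
static struct sketch_fence *sketch_pool_alloc(struct sketch_pool *p,
					      bool fast_path_ok)
{
	if (fast_path_ok) {
		struct sketch_fence *f = malloc(sizeof(*f));

		if (f)
			return f;
	}
	if (p->nr_free > 0)
		return p->reserve[--p->nr_free];
	return NULL;
}

/* A retired fence refills the reserve before going back to the heap. */
static void sketch_pool_free(struct sketch_pool *p, struct sketch_fence *f)
{
	if (p->nr_free < SKETCH_POOL_MIN)
		p->reserve[p->nr_free++] = f;
	else
		free(f);
}
```

The forward-progress argument is visible in sketch_pool_free(): once the
reserve is exhausted, the next retired fence goes straight back into it,
so a later sketch_pool_alloc() can succeed without ever entering
reclaim.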