From: Mani Milani
Date: Wed, 16 Nov 2022 10:54:29 +1100
Subject: Re: [PATCH] drm/i915: Fix unhandled deadlock in grab_vma()
To: Thomas Hellström
Cc: Matthew Auld, LKML, Tvrtko Ursulin, Maarten Lankhorst, Chris Wilson,
 Christian König, Daniel Vetter, David Airlie, Jani Nikula,
 Joonas Lahtinen, Niranjana Vishwanathapura, Nirmoy Das, Rodrigo Vivi,
 dri-devel@lists.freedesktop.org, intel-gfx@lists.freedesktop.org
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Thomas,

It is a user-space application that crashes due to receiving an
-ENOSPC. This occurs after code reaches the line below, where
grab_vma() fails (due to deadlock) and consequently
i915_gem_evict_for_node() returns -ENOSPC. I provided the call stack
for when this happens in my previous message:
https://github.com/torvalds/linux/blame/59d0d52c30d4991ac4b329f049cc37118e00f5b0/drivers/gpu/drm/i915/i915_gem_evict.c#L386

Context:
This crash is happening on an Intel GeminiLake Chromebook when running
a video seek h264 stress test, and it is 100% reproducible. To
troubleshoot, I did a git bisect and found the following commit to be
the culprit (which is when grab_vma() was introduced):
https://github.com/torvalds/linux/commit/7e00897be8bf13ef9c68c95a8e386b714c29ad95

I also have crash dumps and further logs that I can send you if
needed. But please let me know how to share those with you, since
pasting them here does not seem reasonable to me.

Thanks,
Mani

On Mon, Nov 14, 2022 at 11:48 PM Thomas Hellström wrote:
>
> Hi, Mani.
>
> On 11/14/22 03:16, Mani Milani wrote:
> > Thank you for your comments.
> >
> > To Thomas's point, the crash always seems to happen when the following
> > sequence of events occurs:
> >
> > 1. When inside "i915_gem_evict_vm()", the call to
> > "i915_gem_object_trylock(vma->obj, ww)" fails (due to deadlock), and
> > eviction of a vma is skipped as a result. Basically if the code
> > reaches here:
> > https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/i915/i915_gem_evict.c#L468
> > And here is the stack dump for this scenario:
> > Call Trace:
> >
> >  dump_stack_lvl+0x68/0x95
> >  i915_gem_evict_vm+0x1d2/0x369
> >  eb_validate_vmas+0x54a/0x6ae
> >  eb_relocate_parse+0x4b/0xdb
> >  i915_gem_execbuffer2_ioctl+0x6f5/0xab6
> >  ? i915_gem_object_prepare_write+0xfb/0xfb
> >  drm_ioctl_kernel+0xda/0x14d
> >  drm_ioctl+0x27f/0x3b7
> >  ? i915_gem_object_prepare_write+0xfb/0xfb
> >  __se_sys_ioctl+0x7a/0xbc
> >  do_syscall_64+0x56/0xa1
> >  ? exit_to_user_mode_prepare+0x3d/0x8c
> >  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> > RIP: 0033:0x78302de5fae7
> > Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> > 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> > 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
> > RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
> > RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
> > RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
> > R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
> > R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
> >
> > It is worth noting that "i915_gem_evict_vm()" still returns success in
> > this case.
> >
> > 2. After step 1 occurs, the next call to "grab_vma()" always fails
> > (with "i915_gem_object_trylock(vma->obj, ww)" failing also due to
> > deadlock), which then results in the crash.
> > Here is the stack dump for this scenario:
> > Call Trace:
> >
> >  dump_stack_lvl+0x68/0x95
> >  grab_vma+0x6c/0xd0
> >  i915_gem_evict_for_node+0x178/0x23b
> >  i915_gem_gtt_reserve+0x5a/0x82
> >  i915_vma_insert+0x295/0x29e
> >  i915_vma_pin_ww+0x41e/0x5c7
> >  eb_validate_vmas+0x5f5/0x6ae
> >  eb_relocate_parse+0x4b/0xdb
> >  i915_gem_execbuffer2_ioctl+0x6f5/0xab6
> >  ? i915_gem_object_prepare_write+0xfb/0xfb
> >  drm_ioctl_kernel+0xda/0x14d
> >  drm_ioctl+0x27f/0x3b7
> >  ? i915_gem_object_prepare_write+0xfb/0xfb
> >  __se_sys_ioctl+0x7a/0xbc
> >  do_syscall_64+0x56/0xa1
> >  ? exit_to_user_mode_prepare+0x3d/0x8c
> >  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> > RIP: 0033:0x78302de5fae7
> > Code: c0 0f 89 74 ff ff ff 48 83 c4 08 49 c7 c4 ff ff ff ff 5b 4c
> > 89 e0 41 5c 41 5d 5d c3 0f 1f 80 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> > 3d 01 f0 ff ff 73 01 c3 48 8b 0d 51 c3 0c 00 f7 d8 64 89 01 48
> > RSP: 002b:00007ffe64b87f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > RAX: ffffffffffffffda RBX: 000003cc00470000 RCX: 000078302de5fae7
> > RDX: 00007ffe64b87fd0 RSI: 0000000040406469 RDI: 000000000000000d
> > RBP: 00007ffe64b87fa0 R08: 0000000000000013 R09: 000003cc004d0950
> > R10: 0000000000000200 R11: 0000000000000246 R12: 000000000000000d
> > R13: 0000000000000000 R14: 00007ffe64b87fd0 R15: 0000000040406469
> >
> >
> > My Notes:
> > - I verified that the two "i915_gem_object_trylock()" failures I
> > mentioned above are due to deadlock by slightly modifying the code to
> > call "i915_gem_object_lock()" only in those exact cases and subsequent
> > to the trylock failure, only to look at the return error code.
> > - The two cases mentioned above are the only cases where
> > "i915_gem_object_trylock(obj, ww)" is called with the second argument
> > not being forced to NULL.
> > - When in either of the two cases above (i.e. inside "grab_vma()" or
> > "i915_gem_evict_vm") I replace calling "i915_gem_object_trylock" with
> > "i915_gem_object_lock", the issue gets resolved (because deadlock is
> > detected and resolved).
> >
> > So if this matches the design better, another solution could be
> > for "grab_vma" to continue to call "i915_gem_object_trylock", but for
> > "i915_gem_evict_vm" to call "i915_gem_object_lock" instead.
>
> No, i915_gem_object_lock() is not allowed when the vm mutex is held.
>
> >
> >
> > Further info:
> > - Would you like any further info on the crash? If so, could you
> > please advise 1) what exactly you need and 2) how I can share it with
> > you, especially if it is big dumps?
>
> Yes, I would like to know how the crash manifests itself. Is it a kernel
> BUG or a kernel WARNING or is it the user-space application that crashes
> due to receiving an -ENOSPC?
>
> Thanks,
>
> Thomas
>
> >
> >
> > Thanks.
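
The failure path discussed in this thread (a trylock that fails because of a deadlock, which the eviction code then reports as -ENOSPC) can be illustrated with a small user-space C model. This is only a sketch under stated assumptions: toy_ctx, toy_lock, toy_trylock(), toy_lock_full() and the toy_evict_*() helpers are hypothetical names invented for this illustration, not i915 or ww_mutex APIs, and the model never sleeps or backs off. It is meant to show one point only: a boolean trylock cannot tell the caller why it failed, whereas a full lock call can return -EDEADLK, which is what the diagnostic experiment in the quoted notes relied on.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not kernel data structures. */
struct toy_ctx { int id; };                        /* one locking context */
struct toy_lock { const struct toy_ctx *owner; };  /* NULL means unlocked */

/* Trylock reports only success or failure, never the reason. */
static bool toy_trylock(struct toy_lock *l, const struct toy_ctx *ctx)
{
    assert(l && ctx);
    if (l->owner)
        return false;
    l->owner = ctx;
    return true;
}

/*
 * Full lock: can report that the caller has to back off.
 * Assumption: the toy never sleeps, so any contention with another
 * context is reported as -EDEADLK.
 */
static int toy_lock_full(struct toy_lock *l, const struct toy_ctx *ctx)
{
    assert(l && ctx);
    if (!l->owner) {
        l->owner = ctx;
        return 0;
    }
    if (l->owner == ctx)
        return -EALREADY;
    return -EDEADLK;
}

/* Eviction step as described in the thread: trylock failure -> -ENOSPC. */
static int toy_evict_with_trylock(struct toy_lock *l, const struct toy_ctx *ctx)
{
    if (!toy_trylock(l, ctx))
        return -ENOSPC;   /* the deadlock is hidden from the caller */
    l->owner = NULL;      /* "evict", then unlock */
    return 0;
}

/* Diagnostic variant: use the full lock only to see the error code. */
static int toy_evict_with_lock(struct toy_lock *l, const struct toy_ctx *ctx)
{
    int err = toy_lock_full(l, ctx);

    if (err)
        return err;       /* -EDEADLK is visible to the caller */
    l->owner = NULL;      /* "evict", then unlock */
    return 0;
}
```

With an object held by one context, calling toy_evict_with_trylock() from a second context returns -ENOSPC, while toy_evict_with_lock() returns -EDEADLK. As the quoted reply points out, i915_gem_object_lock() is not allowed while the vm mutex is held, so the second helper mirrors only the diagnostic experiment and is not a proposed fix.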