From: John Stultz
Date: Fri, 7 Jul 2023 13:23:07 -0700
Subject: Re: ww_mutex.sh hangs since v5.16-rc1
To: Li Zhijian
Cc: peterz@infradead.org, mingo@redhat.com, will@kernel.org,
    longman@redhat.com, boqun.feng@gmail.com, open list,
    "linux-kselftest@vger.kernel.org", "lkp@lists.01.org", Chris Wilson,
    Dietmar Eggemann, Joel Fernandes, Maarten Lankhorst
In-Reply-To: <895ef450-4fb3-5d29-a6ad-790657106a5a@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Nov 30, 2021 at 5:26 PM Li Zhijian wrote:
>
> LKP/0Day found that ww_mutex.sh cannot complete since v5.16-rc1, but
> I'm pretty sorry that we failed to bisect the FBC, instead, the bisection pointed
> to a/below merge commit(91e1c99e17) finally.
>
> Due to this hang, other tests in the same group are also blocked in 0Day, we
> hope we can fix this hang ASAP.
>
> So if you have any idea about this, or need more debug information, feel free to let me know :)
>
> BTW, ww_mutex.sh was failed in v5.15 without hang, and looks it cannot reproduce on a vm.
>

So, as part of the proxy-execution work, I've been recently trying to
understand why the patch series was causing apparent hangs in the
ww_mutex test with large (64) cpu counts. I was assuming my changes
were causing a lost wakeup somehow, but as I dug in I found it looked
like the stress_inorder_work() function was live-locking.

I noticed that adding printks to the logic would change the behavior,
and finally realized I could reproduce a livelock against mainline by
adding a printk before the "return -EDEADLK;" in __ww_mutex_kill(),
making it clear the logic was timing sensitive. Then searching around
I found this old and unresolved thread.

Part of the issue is that we may not hit the timeout check at the end
of the loop, as the EDEADLK case short-cuts back to retry, allowing
the test to effectively get stuck.

But I know with ww_mutexes there's supposed to be a guarantee of
forward progress as the older context wins, but it's not clear to me
that works here. The EDEADLK case results in a releasing and
reacquiring of the locks (only with the contended lock being taken
first), and if a second EDEADLK occurs, it starts over again from
scratch (though with the new contended lock being chosen first
instead - which seems to lose any progress).
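
For readers following along, the retry short-cut described above can be
illustrated with a minimal userspace sketch. This is not the kernel test
code: struct sim, sim_lock_all() and sim_worker() are hypothetical names
invented for illustration, and the fake clock and the fixed streak of
-EDEADLK results are assumptions standing in for jiffies and for real
lock contention. It only shows the control-flow shape, where an -EDEADLK
result jumps back to retry without ever reaching the timeout check.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
struct sim {
	int now;             /* fake clock, advanced once per lock pass */
	int deadline;        /* stand-in for the test's timeout */
	int deadlock_streak; /* how many passes will report -EDEADLK */
	int attempts;        /* lock passes made so far */
	int timeout_checks;  /* times the loop condition was reached */
};

/* Pretend to take every lock; fail with -EDEADLK while the streak lasts. */
static int sim_lock_all(struct sim *s)
{
	assert(s->deadlock_streak >= 0);
	s->now++;
	s->attempts++;
	if (s->deadlock_streak > 0) {
		s->deadlock_streak--;
		return -EDEADLK;
	}
	return 0;
}

/* Loop shape under discussion: -EDEADLK jumps to retry, skipping the check. */
static void sim_worker(struct sim *s)
{
	do {
		int err;
retry:
		err = sim_lock_all(s);
		if (err == -EDEADLK)
			goto retry; /* the while() condition is not evaluated */
		s->timeout_checks++;
	} while (s->now <= s->deadline);
}
```

With deadline set to 10 and deadlock_streak set to 1000, sim_worker()
makes 1001 lock passes but evaluates the timeout condition only once,
long after the deadline has passed. If the streak never ended, the
check would never be reached at all.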

So maybe the test has broken that guarantee in how it restarts, or
with 128 threads trying to acquire a random order of 16 locks without
contention (and the order shifting slightly each time it does see
contention) it might just be a very big space to resolve if we don't
luck into good timing.

Anyway, I wanted to get some feedback from folks who have a better
theoretical understanding of the ww_mutexes. With large cpu counts are
we just asking for trouble here? Is the test doing something wrong? Or
is there possibly a ww_mutex bug under this?

thanks
-john