Date: Mon, 12 Jun 2023 14:36:02 +0800
From: Zorro Lang
To: Linus Torvalds
Cc: Dave Chinner, "Darrick J. Wong", linux-xfs@vger.kernel.org,
    "Eric W. Biederman", Mike Christie, "Michael S. Tsirkin",
    linux-kernel@vger.kernel.org
Subject: Re: [6.5-rc5 regression] core dump hangs (was Re: [Bug report] fstests generic/051 (on xfs) hang on latest linux v6.5-rc5+)
Message-ID: <20230612063602.qk2mgh55leqqefpc@zlang-mailbox>
References: <20230611124836.whfktwaumnefm5z5@zlang-mailbox> <20230612015145.GA11441@frogsfrogsfrogs>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Jun 11, 2023 at 08:14:25PM -0700, Linus Torvalds wrote:
> On Sun, Jun 11, 2023 at 7:22 PM Dave Chinner wrote:
> >
> > I guess the regression fix needs a regression fix....
>
> Yup.
>
> From the description of the problem, it sounds like this happens on
> real hardware, no vhost anywhere?
>
> Or maybe Darrick (who doesn't see the issue) is running on raw
> hardware, and you and Zorro are running in a virtual environment?

I tested in both virtual environments and on raw hardware. We have a
pool of test machines containing lots of different systems: real
machines, KVM guests, and other kinds of VMs, covering different
arches (aarch64, s390x, ppc64le and x86_64) and different kinds of
storage (virt, hardware RAID, generic SCSI disks, pmem, etc.). They
all hang on fstests generic/051.

I remember Darrick said he tested with ~160 VMs (needs confirmation
from him), so this issue might not be related to VMs vs. real
machines. Hmm... maybe it's related to some kernel config? If Darrick
would like to provide his .config file, I can diff it against mine to
check the differences.

Thanks,
Zorro

> It sounds like zap_other_threads() and coredump_task_exit() do not
> agree about the core_state->nr_threads counting, which is part of what
> changed there.
>
> [ Goes off to look ]
>
> Hmm.
> Both seem to be using the same test for
>
>     (t->flags & (PF_IO_WORKER | PF_USER_WORKER)) != PF_USER_WORKER
>
> which I don't love - I don't think io_uring threads should participate
> in core dumping either, so I think the test could just be
>
>     (t->flags & PF_IO_WORKER)
>
> but that shouldn't be the issue here.
>
> But according to
>
>     https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
>
> it's clearly hanging in wait_for_completion_state() in
> coredump_wait(), so it really looks like some confusion about that
> core_waiters (aka core_state->nr_threads) count.
>
> Oh. Humm. Mike changed that initial rough patch of mine, and I had
> moved the "if you don't participate in core dumps" test up also past
> the "do_coredump()" logic.
>
> And I think it's horribly *wrong* for a thread that doesn't get
> counted for core-dumping to go into do_coredump(), because then it
> will set the "core_state" to possibly be the core-state of the vhost
> thread that isn't even counted.
>
> So *maybe* this attached patch might fix it? I haven't thought very
> deeply about this, but vhost workers most definitely shouldn't call
> do_coredump(), since they are then not counted.
>
> (And again, I think we should just check that PF_IO_WORKER bit, not
> use this more complex test, but that's a separate and bigger change).
>
>     Linus