From: Suren Baghdasaryan
Date: Wed, 12 Jun 2024 08:17:20 -0700
Subject: Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
To: Max Kellermann
Cc: Johannes Weiner, Peter Zijlstra, linux-kernel@vger.kernel.org

On Tue, Jun 11, 2024 at 11:49 PM Max Kellermann wrote:
>
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan wrote:
> > Instead I think what might be happening is that the task is terminated
> > while it's in memstall.
>
> How is it possible to terminate a task that's in memstall?
> This must be between psi_memstall_enter() and psi_memstall_leave(),
> but I had already checked all the callers and found nothing
> suspicious; no obvious way to escape the section without
> psi_memstall_leave(). In my understanding, it's impossible to
> terminate a task that's currently stuck in the kernel. First, it needs
> to leave the kernel and go back to userspace, doesn't it?

Doh! I assumed this could happen, but it should not, unless
psi_memstall_enter() and psi_memstall_leave() are unbalanced. My bad.

Since the issue is hard to reproduce, maybe you could add debugging
code that stores _RET_IP_ in the task_struct at the end of
psi_memstall_enter() and clears it in psi_memstall_leave(). Then in
do_exit() you check whether it is still set and generate a warning
reporting the recorded _RET_IP_. That should tell us which
psi_memstall_enter() call is missing its psi_memstall_leave().

>
> > I think if your theory was
> > correct and psi_task_change() was called while task's cgroup is
> > destroyed then task_psi_group() would have returned an invalid pointer
> > and we would crash once that value is dereferenced.
>
> I was thinking of something slightly different; something about the
> cgroup being deleted or a task being terminated and the bookkeeping of
> the PSI flags getting wrong, maybe some data race. I found the whole
> PSI code with per-task flags, per-cpu per-cgroup counters and flags
> somewhat obscure (but somebody else's code is always obscure, of
> course); I thought there was a lot of potential for mistakes with the
> bookkeeping, but I found nothing specific.
>
> Anyway, thanks for looking into this - I hope we can get a grip on
> this issue, as it's preventing me from using PSI values for actual
> process management; the servers that go into this state will always
> appear overloaded and that would lead to killing all the workload
> processes forever.
>
> Max
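
To make the debugging idea above concrete, here is a rough userspace
model of it, not a kernel patch. Everything in it is a stand-in:
"struct task", its "memstall_ip" field and the three functions are
hypothetical names that only mimic psi_memstall_enter(),
psi_memstall_leave() and the check in do_exit(). In the kernel,
_RET_IP_ expands to (unsigned long)__builtin_return_address(0), which
is what the model calls directly (GCC/Clang builtin). Nested memstall
sections are ignored here.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for task_struct; only the debug field matters. */
struct task {
	const char *comm;
	void *memstall_ip;	/* caller of the still-open memstall section, or NULL */
};

/* Models the end of psi_memstall_enter(): remember who opened the section. */
__attribute__((noinline))
void memstall_enter(struct task *t)
{
	/* the real stall accounting would happen before this point */
	t->memstall_ip = __builtin_return_address(0);
}

/* Models psi_memstall_leave(): the section is closed, forget the caller. */
void memstall_leave(struct task *t)
{
	t->memstall_ip = NULL;
}

/*
 * Models the check in do_exit(): warn if the task exits with a section
 * still open. Returns the recorded caller address, or NULL if balanced.
 */
void *task_exit_check(const struct task *t)
{
	if (t->memstall_ip)
		fprintf(stderr, "%s: exited inside memstall, entered from %p\n",
			t->comm, t->memstall_ip);
	return t->memstall_ip;
}
```

In a real kernel build the warning would more likely be a WARN_ONCE()
printing the address with %pS so that it resolves to a symbol, but
that part is untested and only an assumption on my side.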