From: Daniel Colascione
Date: Wed, 21 Nov 2018 17:08:08 -0800
Subject: Re: [PATCH v2] Add /proc/pid_gen
To: Andrew Morton
Cc: linux-kernel, Linux API, Tim Murray, Primiano Tucci, Joel Fernandes,
    Jonathan Corbet, Mike Rapoport, Vlastimil Babka, Roman Gushchin,
    Prashant Dhamdhere, "Dennis Zhou (Facebook)", "Eric W. Biederman",
    rostedt@goodmis.org, tglx@linutronix.de, mingo@kernel.org,
    linux@dominikbrodowski.net, jpoimboe@redhat.com, Ard Biesheuvel,
    Michal Hocko, Stephen Rothwell, ktsanaktsidis@zendesk.com,
    David Howells, "open list:DOCUMENTATION"
References: <20181121201452.77173-1-dancol@google.com>
    <20181121205428.165205-1-dancol@google.com>
    <20181121141220.0e533c1dcb4792480efbf3ff@linux-foundation.org>
    <20181121145043.fa029f4f91afddc2a10bb81e@linux-foundation.org>
    <20181121162247.467fcab6c0aca0819a822286@linux-foundation.org>
    <20181121165741.ef089df784482632c4a66370@linux-foundation.org>
In-Reply-To: <20181121165741.ef089df784482632c4a66370@linux-foundation.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 21, 2018 at 4:57 PM Andrew Morton wrote:
>
> On Wed, 21 Nov 2018 16:28:56 -0800 Daniel Colascione wrote:
>
> > > > The problem here is the possibility of confusion, even if it's rare.
> > > > Does the naive approach of just walking /proc and ignoring the
> > > > possibility of PID reuse races work most of the time? Sure. But "most
> > > > of the time" isn't good enough. It's not that there are tons of sob
> > > > stories: it's that without completely robust reporting, we can't rule
> > > > out of the possibility that weirdness we observe in a given trace is
> > > > actually just an artifact from a kinda-sort-working best-effort trace
> > > > collection system instead of a real anomaly in behavior. Tracing,
> > > > essentially, gives us deltas for system state, and without an accurate
> > > > baseline, collected via some kind of scan on trace startup, it's
> > > > impossible to use these deltas to robustly reconstruct total system
> > > > state at a given time. And this matters, because errors in
> > > > reconstruction (e.g., assigning a thread to the wrong process because
> > > > the IDs happen to be reused) can affect processing of the whole trace.
> > > > If it's 3am and I'm analyzing the lone trace from a dogfooder
> > > > demonstrating a particularly nasty problem, I don't want to find out
> > > > that the trace I'm analyzing ended up being useless because the
> > > > kernel's trace system is merely best effort. It's very cheap to be
> > > > 100% reliable here, so let's be reliable and rule out sources of
> > > > error.
> > >
> > > So we're solving a problem which isn't known to occur, but solving it
> > > provides some peace-of-mind? Sounds thin!
> >
> > So you want to reject a cheap fix for a problem that you know occurs
> > at some non-zero frequency? There's a big difference between "may or
> > may not occur" and "will occur eventually, given enough time, and so
> > must be taken into account in analysis". Would you fix a refcount race
> > that you knew was possible, but didn't observe? What, exactly, is your
> > threshold for accepting a fix that makes tracing more reliable?
>
> Well for a start I'm looking for a complete patch changelog. One which
> permits readers to fully understand the user-visible impact of the
> problem.

The patch already describes the problem, the solution, and the way in
which this solution is provided. What more information do you want?

> If it is revealed that is a theoretical problem which has negligible
> end-user impact then sure, it is rational to leave things as they are.
> That's what "negligible" means!

I don't think the problem is negligible. There's a huge difference
between 99% and 100% reliability! The possibility of a theoretical
problem is a real problem when, in retrospective analysis, the
possibility of theoretical problems must be taken into account when
trying to figure out how the system got into whatever state it was
observed to be in.

Look, if I were proposing some expensive new bit of infrastructure,
that would be one thing. But this is trivial. What form of patch
*would* you take here? Would you take a tracepoint, as I discussed in
your other message? Is there *any* snapshot approach here that you
would take? Is your position that providing an atomic process tree
hierarchy snapshot is just not a capability the kernel should provide?

I'm writing trace analysis tools, and I'm saying that in order to be
confident in the results of the analysis, we need a way to be certain
about baseline system state, and without added robustness, there's
always going to be some doubt as to whether any particular observation
is real or an artifact. I'm open to various technical options for
providing this information, but I think it's reasonable to ask the
system "what is your state?" and somehow get back an answer that's
guaranteed not to be self-contradictory.

Have you done much retrospective long trace analysis?
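
To make the "consistent baseline" idea concrete, here is a rough Python
sketch of a /proc walk wrapped in a retry loop. It is only an
illustration under stated assumptions, not the patch's interface: it
assumes the kernel exposes some generation counter whose value changes
whenever a PID may have been reused, and it takes that counter as a
caller-supplied function (read_generation is a hypothetical name). The
helpers scan_proc and consistent_snapshot are likewise made up for this
example.

```python
import os
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def scan_proc(proc_root: str = "/proc") -> Dict[int, int]:
    """Best-effort walk of proc_root: map each PID to its parent PID."""
    parents: Dict[int, int] = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # process exited between listdir() and open()
        # The comm field may itself contain spaces or parentheses, so split
        # only after the last ')'. The fields that follow are: state, ppid, ...
        fields = stat[stat.rindex(")") + 1:].split()
        parents[int(name)] = int(fields[1])
    return parents


def consistent_snapshot(read_generation: Callable[[], int],
                        scan: Callable[[], T],
                        max_attempts: int = 10) -> T:
    """Repeat scan() until the generation counter is unchanged across it.

    ASSUMPTION: read_generation() returns a value that changes whenever a
    PID may have been reused. That semantic is hypothetical here.
    """
    for _ in range(max_attempts):
        before = read_generation()
        snapshot = scan()
        if read_generation() == before:
            # Counter unchanged: under the stated assumption, no PID was
            # reused while the scan was running.
            return snapshot
    raise RuntimeError("generation counter kept changing; no stable snapshot")
```

A caller would pass scan_proc as the scan argument, plus a reader for
whatever counter the kernel ends up exposing. Without such a counter,
scan_proc alone is the naive walk described above: it usually works,
but it cannot tell the caller when a PID was reused mid-scan.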