From: Daniel Colascione
Date: Wed, 21 Nov 2018 16:30:42 -0800
Subject: Re: [PATCH v2] Add /proc/pid_gen
To: Andrew Morton
Cc: linux-kernel, Linux API, Tim Murray, Primiano Tucci, Joel Fernandes,
    Jonathan Corbet, Mike Rapoport, Vlastimil Babka, Roman Gushchin,
    Prashant Dhamdhere, "Dennis Zhou (Facebook)", "Eric W. Biederman",
    rostedt@goodmis.org, tglx@linutronix.de, mingo@kernel.org,
    linux@dominikbrodowski.net, jpoimboe@redhat.com, Ard Biesheuvel,
    Michal Hocko, Stephen Rothwell, ktsanaktsidis@zendesk.com,
    David Howells, "open list:DOCUMENTATION", Mathieu Desnoyers
References: <20181121201452.77173-1-dancol@google.com>
    <20181121205428.165205-1-dancol@google.com>
    <20181121141220.0e533c1dcb4792480efbf3ff@linux-foundation.org>
    <20181121145043.fa029f4f91afddc2a10bb81e@linux-foundation.org>
    <20181121162247.467fcab6c0aca0819a822286@linux-foundation.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 21, 2018 at 4:28 PM Daniel Colascione wrote:
>
> On Wed, Nov 21, 2018 at 4:22 PM Andrew Morton wrote:
> >
> > On Wed, 21 Nov 2018 15:21:40 -0800 Daniel Colascione wrote:
> >
> > > On Wed, Nov 21, 2018 at 2:50 PM Andrew Morton wrote:
> > > >
> > > > On Wed, 21 Nov 2018 14:40:28 -0800 Daniel Colascione wrote:
> > > >
> > > > > On Wed, Nov 21, 2018 at 2:12 PM Andrew Morton wrote:
> > > > > > ...
> > >
> > > > > I wouldn't call tracing a specialized thing: it's important enough to
> > > > > justify its own summit and a whole ecosystem of trace collection and
> > > > > analysis tools. We use it every day in Android. It's tremendously
> > > > > helpful for understanding system behavior, especially in cases where
> > > > > multiple components interact in ways that we can't readily predict or
> > > > > replicate. Reliability and precision in this area are essential:
> > > > > retrospective analysis of difficult-to-reproduce problems involves
> > > > > puzzling over trace files and testing hypotheses, and when the trace
> > > > > system itself is occasionally unreliable, the set of hypotheses to
> > > > > consider grows. I've tried to keep the amount of kernel infrastructure
> > > > > needed to support this precision and reliability to a minimum, pushing
> > > > > most of the complexity to userspace. But we do need, from the kernel,
> > > > > reliable process disambiguation.
> > > > >
> > > > > Besides: things like checkpoint and restart are also non-core
> > > > > features, but the kernel has plenty of infrastructure to support them.
> > > > > We're talking about a very lightweight feature in this thread.
> > > >
> > > > I'm still not understanding the seriousness of the problem. Presumably
> > > > you've hit problems in real life which were serious and frequent enough
> > > > to justify getting down and writing the code. Please share some sob stories
> > > > with us!
> > >
> > > The problem here is the possibility of confusion, even if it's rare.
> > > Does the naive approach of just walking /proc and ignoring the
> > > possibility of PID reuse races work most of the time? Sure. But "most
> > > of the time" isn't good enough. It's not that there are tons of sob
> > > stories: it's that without completely robust reporting, we can't rule
> > > out the possibility that weirdness we observe in a given trace is
> > > actually just an artifact from a kinda-sorta-working best-effort trace
> > > collection system instead of a real anomaly in behavior. Tracing,
> > > essentially, gives us deltas for system state, and without an accurate
> > > baseline, collected via some kind of scan on trace startup, it's
> > > impossible to use these deltas to robustly reconstruct total system
> > > state at a given time. And this matters, because errors in
> > > reconstruction (e.g., assigning a thread to the wrong process because
> > > the IDs happen to be reused) can affect processing of the whole trace.
> > > If it's 3am and I'm analyzing the lone trace from a dogfooder
> > > demonstrating a particularly nasty problem, I don't want to find out
> > > that the trace I'm analyzing ended up being useless because the
> > > kernel's trace system is merely best effort. It's very cheap to be
> > > 100% reliable here, so let's be reliable and rule out sources of
> > > error.
> >
> > So we're solving a problem which isn't known to occur, but solving it
> > provides some peace of mind? Sounds thin!
>
> So you want to reject a cheap fix for a problem that you know occurs
> at some non-zero frequency? There's a big difference between "may or
> may not occur" and "will occur eventually, given enough time, and so
> must be taken into account in analysis". Would you fix a refcount race
> that you knew was possible, but didn't observe? What, exactly, is your
> threshold for accepting a fix that makes tracing more reliable?