From: Daniel Colascione
Date: Wed, 21 Nov 2018 16:21:00 -0800
Subject: Re: [PATCH v2] Add /proc/pid_gen
To: Andy Lutomirski
Cc: Andrew Morton, linux-kernel, Linux API, Tim Murray, Primiano Tucci,
    Joel Fernandes, Jonathan Corbet, Mike Rapoport, Vlastimil Babka,
    Roman Gushchin, Prashant Dhamdhere, "Dennis Zhou (Facebook)",
    "Eric W. Biederman", rostedt@goodmis.org, tglx@linutronix.de,
    mingo@kernel.org, linux@dominikbrodowski.net, jpoimboe@redhat.com,
    Ard Biesheuvel, Michal Hocko, Stephen Rothwell,
    ktsanaktsidis@zendesk.com, David Howells, "open list:DOCUMENTATION"
In-Reply-To: <37255927-1A93-4B8B-A916-B5A3983D56B6@amacapital.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 21, 2018 at 3:35 PM Andy Lutomirski wrote:
> > On Nov 21, 2018, at 4:21 PM, Daniel Colascione wrote:
> >
> >> On Wed, Nov 21, 2018 at 2:50 PM Andrew Morton wrote:
> >>
> >>> On Wed, 21 Nov 2018 14:40:28 -0800 Daniel Colascione wrote:
> >>>
> >>>> On Wed, Nov 21, 2018 at 2:12 PM Andrew Morton wrote:
> >>>>
> >>>>> On Wed, 21 Nov 2018 12:54:20 -0800 Daniel Colascione wrote:
> >>>>>
> >>>>> Trace analysis code needs a coherent picture of the set of processes
> >>>>> and threads running on a system. While it's possible to enumerate all
> >>>>> tasks via /proc, this enumeration is not atomic. If PID numbering
> >>>>> rolls over during snapshot collection, the resulting snapshot of the
> >>>>> process and thread state of the system may be incoherent, confusing
> >>>>> trace analysis tools. The fundamental problem is that if a PID is
> >>>>> reused during a userspace scan of /proc, it's impossible to tell, in
> >>>>> post-processing, whether a fact that the userspace /proc scanner
> >>>>> reports regarding a given PID refers to the old or new task named by
> >>>>> that PID, as the scan of that PID may or may not have occurred before
> >>>>> the PID reuse, and there's no way to "stamp" a fact read from the
> >>>>> kernel with a trace timestamp.
> >>>>>
> >>>>> This change adds a per-pid-namespace 64-bit generation number,
> >>>>> incremented on PID rollover, and exposes it via a new proc file
> >>>>> /proc/pid_gen.
> >>>>> By examining this file before and after /proc
> >>>>> enumeration, user code can detect the potential reuse of a PID and
> >>>>> restart the task enumeration process, repeating until it gets a
> >>>>> coherent snapshot.
> >>>>>
> >>>>> PID rollover ought to be rare, so in practice, scan repetitions will
> >>>>> be rare.
> >>>>
> >>>> In general, tracing is a rather specialized thing. Why is this very
> >>>> occasional confusion a sufficiently serious problem to warrant addition
> >>>> of this code?
> >>>
> >>> I wouldn't call tracing a specialized thing: it's important enough to
> >>> justify its own summit and a whole ecosystem of trace collection and
> >>> analysis tools. We use it every day in Android. It's tremendously
> >>> helpful for understanding system behavior, especially in cases where
> >>> multiple components interact in ways that we can't readily predict or
> >>> replicate. Reliability and precision in this area are essential:
> >>> retrospective analysis of difficult-to-reproduce problems involves
> >>> puzzling over trace files and testing hypotheses, and when the trace
> >>> system itself is occasionally unreliable, the set of hypotheses to
> >>> consider grows. I've tried to keep the amount of kernel infrastructure
> >>> needed to support this precision and reliability to a minimum, pushing
> >>> most of the complexity to userspace. But we do need, from the kernel,
> >>> reliable process disambiguation.
> >>>
> >>> Besides: things like checkpoint and restart are also non-core
> >>> features, but the kernel has plenty of infrastructure to support them.
> >>> We're talking about a very lightweight feature in this thread.
> >>
> >> I'm still not understanding the seriousness of the problem. Presumably
> >> you've hit problems in real-life which were serious and frequent enough
> >> to justify getting down and writing the code. Please share some sob stories
> >> with us!
> >
> > The problem here is the possibility of confusion, even if it's rare.
> > Does the naive approach of just walking /proc and ignoring the
> > possibility of PID reuse races work most of the time? Sure. But "most
> > of the time" isn't good enough. It's not that there are tons of sob
> > stories: it's that without completely robust reporting, we can't rule
> > out the possibility that weirdness we observe in a given trace is
> > actually just an artifact from a kinda-sorta-working best-effort trace
> > collection system instead of a real anomaly in behavior. Tracing,
> > essentially, gives us deltas for system state, and without an accurate
> > baseline, collected via some kind of scan on trace startup, it's
> > impossible to use these deltas to robustly reconstruct total system
> > state at a given time. And this matters, because errors in
> > reconstruction (e.g., assigning a thread to the wrong process because
> > the IDs happen to be reused) can affect processing of the whole trace.
> > If it's 3am and I'm analyzing the lone trace from a dogfooder
> > demonstrating a particularly nasty problem, I don't want to find out
> > that the trace I'm analyzing ended up being useless because the
> > kernel's trace system is merely best effort. It's very cheap to be
> > 100% reliable here, so let's be reliable and rule out sources of
> > error.
> >
> >>>> Which userspace tools will be using pid_gen? Are the developers of
> >>>> those tools signed up to use pid_gen?
> >>>
> >>> I'll be changing Android tracing tools to capture process snapshots
> >>> using pid_gen, using the algorithm in the commit message.
> >>
> >> Which other tools could use this and what was the feedback from their
> >> developers?
> >
> > I'm going to have Android's systrace and Perfetto use this approach.
> > Exactly how many tools signed up to use this feature do you need?
> >
> >> Those people are the intended audience and the
> >> best-positioned reviewers so let's hear from them?
> >
> > I'm writing plenty of trace analysis tools myself, so I'm part of this
> > intended audience. Other tracing tool authors have told me about
> > out-of-tree hacks for process atomic snapshots via ftrace events. This
> > approach avoids the necessity of these more-invasive hacks.
>
> Would a tracepoint for pid reuse solve your problem?

I initially thought "no", but maybe it would after all. The /proc
scanner would need some way of consuming this tracepoint so that it
would know to re-run the scan. For some tracing systems (like Perfetto),
doing simultaneous event consumption and recording would work (although
it'd be awkward), but other systems (like things based on the simpler
atrace framework) aren't able to process events, and inferring PID reuse
after the fact isn't as helpful as being able to re-scan in response to
PID rollover.

OTOH, the ftrace histogram stuff could record a running count of
rollover events pretty easily, independent of the trace recording
machinery, and that running event count would be good enough for doing
the re-scan I want. If /proc/pid_gen isn't on the table, doing it that
way (with a new tracepoint) would work too. There'd be no way to
separate out rollovers per PID namespace though, so we'd have to re-scan
when *any* PID namespace rolled over, not just the current one. But
that's probably not a problem.

I'll send a patch with a PID rollover tracepoint. Thanks.
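
For concreteness, the re-scan loop discussed in this thread could look
roughly like the sketch below. It is only an illustration under stated
assumptions, not the code any tracing tool actually uses. The names
coherent_snapshot, list_pids, and read_pid_gen are hypothetical, and
read_pid_gen assumes the proposed /proc/pid_gen file holds a single
decimal integer, which the thread does not specify. The counter source
is passed in as a callable, so a running count of rollover tracepoint
events could be substituted for /proc/pid_gen.

```python
import os
from typing import Callable, Dict


def list_pids(proc_root: str = "/proc") -> Dict[int, str]:
    """Map each numeric entry under proc_root to the text of its comm file."""
    tasks = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, name, "comm")) as f:
                tasks[int(name)] = f.read().strip()
        except OSError:
            # The task exited between listdir() and open(); skip it.
            continue
    return tasks


def read_pid_gen(path: str = "/proc/pid_gen") -> int:
    """Read the rollover generation counter.

    Assumption: the file is the one proposed in this thread and contains
    one decimal integer. It may not exist on any given kernel.
    """
    with open(path) as f:
        return int(f.read().strip())


def coherent_snapshot(
    read_gen: Callable[[], int],
    scan: Callable[[], Dict[int, str]],
    max_attempts: int = 16,
) -> Dict[int, str]:
    """Repeat scan() until the generation counter is unchanged across it."""
    for _ in range(max_attempts):
        before = read_gen()
        snapshot = scan()
        if read_gen() == before:
            # No rollover was observed during the scan, so no PID in the
            # snapshot can have been reused through a rollover.
            return snapshot
    raise RuntimeError("PID numbering kept rolling over during the scan")
```

A caller on a kernel that provides the counter would use
coherent_snapshot(read_pid_gen, list_pids). The retry bound is an
arbitrary choice for the sketch; since rollover ought to be rare, the
loop should normally finish on the first pass.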