Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp573248imu; Thu, 22 Nov 2018 02:03:53 -0800 (PST) X-Google-Smtp-Source: AFSGD/VT4QWzoMDz9lYa8bIOO5BFDwrbheGdmKVp6q/F63J5zb86xc2XOSXShPEzarmkJLrRfJb6 X-Received: by 2002:a63:2586:: with SMTP id l128mr9566275pgl.104.1542881033582; Thu, 22 Nov 2018 02:03:53 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1542881033; cv=none; d=google.com; s=arc-20160816; b=vgn0FmgavHB7LH9ykbfq4HXThd7NZtzswKbeKdco0WwJyNBtbkq+a0/cygwZqtoHUl cDsjPFLEALyUEKEdKPChOHSzwV8W8MeczvoJKk7R87oHhsUeZe7Ugak0qIFvETJSfrHd pOoknVq7AO44n7pTqn0vx1rrx1ugoziLQusYUkbFG09GRe5wl/QKOUdXoLLq1q18FwnC LukcoKcdZNlaiAzlXbnpO0X5ENqtLGivT7qjUeJA6cUjCZNHy1p5gsL5COdJPbYpP9h0 UmiikfKB4cagIIAI4eTKJ3Z1s6p8TzhaNdsvv5akAau4eHqReRYXLKenlWF9qPPf4R2O ZpcQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=M9tjA2lexfD1mUkUX1Fi8axcHOA8khiAorvZefsr+Tc=; b=xWX35TcL+2fyORdvsBpM7KEpCKYH/0CpNZ7ZHk7TMgqxnLoo3bycIHWWm1drXDNsH4 f8jy6H6juI7NdSYrhnh1RN9KqFox3/20EUKs8svDH8bF8WyxXA4rZe3G6sr4GCv+j8Fq 6MN8NWz1GPbuT4wWKAzoUewA5CPemc5iqYvy+X7bwL9PiXsL0b2rOqkHEDFpQERZA35N PBEiAs+OKkZVpWRGwlsplL8kiBJEwDpr84rW62+u1iu5BAD0NnrMTkjLzUFIEWohodh2 yx3txnv2xpT/WWd8+qJAg4jAeCRHeEKj/ZuKdF/aJMYY6XOG7oXx8H6+FrJe7w4rNZD3 LrKQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=uqsfKe5r; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id p14-v6si50137704plq.271.2018.11.22.02.03.38; Thu, 22 Nov 2018 02:03:53 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=uqsfKe5r; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2390236AbeKVJRG (ORCPT + 99 others); Thu, 22 Nov 2018 04:17:06 -0500 Received: from mail-vk1-f194.google.com ([209.85.221.194]:42712 "EHLO mail-vk1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2390227AbeKVJRF (ORCPT ); Thu, 22 Nov 2018 04:17:05 -0500 Received: by mail-vk1-f194.google.com with SMTP id y14so1595509vky.9 for ; Wed, 21 Nov 2018 14:40:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=M9tjA2lexfD1mUkUX1Fi8axcHOA8khiAorvZefsr+Tc=; b=uqsfKe5rzzL+6gPwo7cGlTd5Xx09B/Myn22adVg6PEaMepPKfFOCFWojESzUMBW3SU /932QBoqHLtPy+kDFg1kwKrLLtysnqw8BG5Xk5BR+AsEk2hgU95C/GjLadLOjR7b/N/o aConQjXLLUfKug2PXGlqbh4pp8ebE7hBEwWO4ou+vZmq2r6yEDEpYrN6PpkK0jGbCoJb snx2U9Kcfn4YwUBKWVqWjFZtb7UApu0BX99Ao0wpX6lW2QoUhYpfrcp1ds5k+pxsn/LU hBxp46Je3Q8g6x6iWAYmwDtWMGg3UDWnJzH36XYgXX6FuxfaifY0GH/Com+IjzKz539a jsDg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=M9tjA2lexfD1mUkUX1Fi8axcHOA8khiAorvZefsr+Tc=; b=D3OOYMd6gSpjX1+dTTEYlXgTHR5QwlpdJZBjT1Z1zZMGwrpHWD9CJohIloDkIzLRYG e7IPIzBGsLCLZWgeyPG3BmP5O+ssvvq4vZ+feM9fkf8PyaU3iDEaWMmZAm53r/SuSsfK p2/DAjp7oLsMPwVWCM4ReLz7EdMviIc7c0vWA1A2tBArUtKVyFJriRLJrvhFJXboQT7g dBjVPPHncrBJiQT4sghVqFZpf1ZPfvzfaufBDZ7gKIfoPSLimH1rXVdexMuHfAllS2nd OajEGQ7oB6xniINcXkkwRI0vbPYfh+vZsCZq/RtAkSt4QGv+pVp8IAZ8TocaN5RtU1q4 WqHg== X-Gm-Message-State: AA+aEWav6ZPgN6MFCJFqszEgm0Pwk4NaVkBVxxTyps4K9HZzOHJ6+V6O k9ulos3bkVgLu7sLduUTWZ86p94kAXyjbLCzyzjUiA== X-Received: by 2002:a1f:7cca:: with SMTP id x193mr3620786vkc.89.1542840040673; Wed, 21 Nov 2018 14:40:40 -0800 (PST) MIME-Version: 1.0 References: <20181121201452.77173-1-dancol@google.com> <20181121205428.165205-1-dancol@google.com> <20181121141220.0e533c1dcb4792480efbf3ff@linux-foundation.org> In-Reply-To: <20181121141220.0e533c1dcb4792480efbf3ff@linux-foundation.org> From: Daniel Colascione Date: Wed, 21 Nov 2018 14:40:28 -0800 Message-ID: Subject: Re: [PATCH v2] Add /proc/pid_gen To: Andrew Morton Cc: linux-kernel , Linux API , Tim Murray , Primiano Tucci , Joel Fernandes , Jonathan Corbet , Mike Rapoport , Vlastimil Babka , Roman Gushchin , Prashant Dhamdhere , "Dennis Zhou (Facebook)" , "Eric W. Biederman" , rostedt@goodmis.org, tglx@linutronix.de, mingo@kernel.org, linux@dominikbrodowski.net, jpoimboe@redhat.com, ard.biesheuvel@linaro.org, Michal Hocko , sfr@canb.auug.org.au, ktsanaktsidis@zendesk.com, David Howells , "open list:DOCUMENTATION" Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 21, 2018 at 2:12 PM Andrew Morton wrote: > > On Wed, 21 Nov 2018 12:54:20 -0800 Daniel Colascione wrote: > > > Trace analysis code needs a coherent picture of the set of processes > > and threads running on a system. While it's possible to enumerate all > > tasks via /proc, this enumeration is not atomic. If PID numbering > > rolls over during snapshot collection, the resulting snapshot of the > > process and thread state of the system may be incoherent, confusing > > trace analysis tools. The fundamental problem is that if a PID is > > reused during a userspace scan of /proc, it's impossible to tell, in > > post-processing, whether a fact that the userspace /proc scanner > > reports regarding a given PID refers to the old or new task named by > > that PID, as the scan of that PID may or may not have occurred before > > the PID reuse, and there's no way to "stamp" a fact read from the > > kernel with a trace timestamp. > > > > This change adds a per-pid-namespace 64-bit generation number, > > incremented on PID rollover, and exposes it via a new proc file > > /proc/pid_gen. By examining this file before and after /proc > > enumeration, user code can detect the potential reuse of a PID and > > restart the task enumeration process, repeating until it gets a > > coherent snapshot. > > > > PID rollover ought to be rare, so in practice, scan repetitions will > > be rare. > > In general, tracing is a rather specialized thing. Why is this very > occasional confusion a sufficiently serious problem to warrant addition > of this code? I wouldn't call tracing a specialized thing: it's important enough to justify its own summit and a whole ecosystem of trace collection and analysis tools. We use it in every day in Android. It's tremendously helpful for understanding system behavior, especially in cases where multiple components interact in ways that we can't readily predict or replicate. Reliability and precision in this area are essential: retrospective analysis of difficult-to-reproduce problems involves puzzling over trace files and testing hypothesis, and when the trace system itself is occasionally unreliable, the set of hypothesis to consider grows. I've tried to keep the amount of kernel infrastructure needed to support this precision and reliability to a minimum, pushing most of the complexity to userspace. But we do need, from the kernel, reliable process disambiguation. Besides: things like checkpoint and restart are also non-core features, but the kernel has plenty of infrastructure to support them. We're talking about a very lightweight feature in this thread. > Which userspace tools will be using pid_gen? Are the developers of > those tools signed up to use pid_gen? I'll be changing Android tracing tools to capture process snapshots using pid_gen, using the algorithm in the commit message. > > --- a/include/linux/pid.h > > +++ b/include/linux/pid.h > > @@ -112,6 +112,7 @@ extern struct pid *find_ge_pid(int nr, struct pid_namespace *); > > int next_pidmap(struct pid_namespace *pid_ns, unsigned int last); > > > > extern struct pid *alloc_pid(struct pid_namespace *ns); > > +extern u64 read_pid_generation(struct pid_namespace *ns); > > pig_generation_read() would be a better (and more idiomatic) name. Thanks. I'll change it. > > > extern void free_pid(struct pid *pid); > > extern void disable_pid_allocation(struct pid_namespace *ns); > > > > ... > > > > +u64 read_pid_generation(struct pid_namespace *ns) > > +{ > > + u64 generation; > > + > > + > > + spin_lock_irq(&pidmap_lock); > > + generation = ns->generation; > > + spin_unlock_irq(&pidmap_lock); > > + return generation; > > +} > > What is the spinlocking in here for? afaict the only purpose it serves > is to make the 64-bit read atomic, so it isn't needed on 32-bit? ITYM the spinlock is necessary *only* on 32-bit, since 64-bit architectures have atomic 64-bit reads, and 64-bit reads on 32-bit architectures can tear. This function isn't a particularly hot path, so I thought consistency across architectures would be more valuable than avoiding the lock on some systems. > > void disable_pid_allocation(struct pid_namespace *ns) > > { > > spin_lock_irq(&pidmap_lock); > > @@ -449,6 +463,17 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns) > > return idr_get_next(&ns->idr, &nr); > > } > > > > +#ifdef CONFIG_PROC_FS > > +static int pid_generation_show(struct seq_file *m, void *v) > > +{ > > + u64 generation = > > + read_pid_generation(proc_pid_ns(file_inode(m->file))); > > u64 generation; > > generation = read_pid_generation(proc_pid_ns(file_inode(m->file))); > > is a nicer way of avoiding column wrap. Sure. > > + seq_printf(m, "%llu\n", generation); > > + return 0; > > + > > +}; > > +#endif > > + > > void __init pid_idr_init(void) > > { > > /* Verify no one has done anything silly: */ > > @@ -465,4 +490,13 @@ void __init pid_idr_init(void) > > > > init_pid_ns.pid_cachep = KMEM_CACHE(pid, > > SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT); > > + > > +} > > + > > +void __init pid_proc_init(void) > > +{ > > + /* pid_idr_init is too early, so get a separate init function. */ > > s/get a/use a/ Will change. > > +#ifdef CONFIG_PROC_FS > > + WARN_ON(!proc_create_single("pid_gen", 0, NULL, pid_generation_show)); > > +#endif > > } > > This whole function could vanish if !CONFIG_PROC_FS. Doesn't matter > much with __init code though. I wanted to keep the ifdefed region as small as possible. I wonder whether LTO is good enough to make the function and its call site disappear entirely in configurations where it has an empty body.