Received: by 2002:a25:1506:0:0:0:0:0 with SMTP id 6csp4948049ybv; Wed, 26 Feb 2020 05:51:17 -0800 (PST) X-Google-Smtp-Source: APXvYqyxzFyXWlSgmlvb0MY6l+xDshlxFwTvzmnWKvYq1ZchCQZy2K2a0gTAw52B8JYXkqFhyp32 X-Received: by 2002:aca:48cd:: with SMTP id v196mr3262487oia.102.1582725077773; Wed, 26 Feb 2020 05:51:17 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1582725077; cv=none; d=google.com; s=arc-20160816; b=bbYfHWd7egP8PJxRd5zm/5NZYbes8oUr6F4w+bO09m8AOp+4nmAdums5af3YeV9cR0 JwAkICTryhfV+JLgh6XaRgDtuujlJXMX5FFlz3jiaVT4a9q5Up63kBP4AN8rJDEO6ggT /vKGS22n33wW7Sn09q5INLEuDCkFJfUOrPULMin9URg9TW2R8iJ5rvM3c+X+zaV702ei y0YDl5SK0OL6OYtjMLdOalKxum5FLUIQVar3+Ur2TPlXZ9naDZ8b8R5fNlFiL+pm7tqm TULdoy/TaGrfd6FQCb3OmT+vacDCt7Z/y/ygQR8kkwMEm6YON2eUWwXM9592loF+gPRY rxiQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:dkim-signature; bh=W1kcFDCxBAKgTZSDNvvKohek86v0q+UsZEyJchNCOOg=; b=mr+b6K4YC7316llkQrW7IfbEZSdzzsbt1aCL8IGfyA42uYgVyNNQr1epylTe/h/kg1 OmVVDv4XWFGZ4Gmpvjjb382EhIhLYjiw0t0eZUCTqZBETsUaXCWiYf+DBPG2gkn/RUGn E37ot7WwS2guIYLQGwKnMW3Kndx3DhGZll6PWLr7OkJTWDeWAETv/BJG8qI1xg1SIbT9 fODfrFqIzhK9xz7aqctGZyafnexY5r6u0pYaj4ZKGkSCo0EH9lM62X3Do0EaHIet1Zs7 tIi/tBIVhMDlxQ0RccjnckWj2bQFrx7wtOZF21UMNflZ46rDVWXW+QUlk8wm1UpPwnn8 unDg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=Ev1MjJdU; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id 2si1263196oij.97.2020.02.26.05.51.06; Wed, 26 Feb 2020 05:51:17 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=Ev1MjJdU; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727401AbgBZNul (ORCPT + 99 others); Wed, 26 Feb 2020 08:50:41 -0500 Received: from mail-pj1-f74.google.com ([209.85.216.74]:51687 "EHLO mail-pj1-f74.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726388AbgBZNul (ORCPT ); Wed, 26 Feb 2020 08:50:41 -0500 Received: by mail-pj1-f74.google.com with SMTP id z5so2076359pjt.1 for ; Wed, 26 Feb 2020 05:50:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=W1kcFDCxBAKgTZSDNvvKohek86v0q+UsZEyJchNCOOg=; b=Ev1MjJdUHo1fXI45iPMFb/YK1S5aeMO9r3qt9MOO26XJ/B3tEhzoDV8/1bnD6VOVzO 0qLiW2eZyLB/sTMShBiLBj7lr0OuXUpFBbvaPUVnqSd1lFc2ObORZYB+/am8/MnPdaSf GiyoJl/ZtFrXgpd1TdExHY84DfwtgjOZXh2DQkIZP/irJObqsOuBUaFwViCRTsI7PgzH NpfrF6wr9TV9Unc7yuqa6i4VJzGNcslgg2lOgRd3k+RzB3cjYcURF8H42awFuUf1/QWV vLCHeVRfXSdPfUlN8bUkzD9DO0ubjz9Hia4LEPHNNXxE9hP/SeFLnEuriWOL4bTm5p4+ BW2g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=W1kcFDCxBAKgTZSDNvvKohek86v0q+UsZEyJchNCOOg=; b=RVrkqW8hO2xtNhRPjIiQADfCDSe4e+1nyxDKtIemrU5LLaGSkfzU4jfr5aiPitYis1 Qi5/zw7CG3a3REdal+DNkm9gcI2XQiI+5jBDV5adJHQmeylaWhdfNYse2+VrNsStFJpD sB3CDBRT8+jN75sSQhnjlLtY1xTGY6Zt7/eSZhKoFMamQcjOgahxJ/ZZEy/nuyXQMl2G b9tajDQAJD3cLnN6MLQ39vn0wvsM+SHvn9z15b6Gu/cRnaptZUcfudd0yfhfcaMa7IkD LZQwvrHdGTxiK4lP/9QsoKRQ92SK02D6hrObzR7+fMkhUvTo1oMSMRWxdAgVIsApPLJj 2ejw== X-Gm-Message-State: APjAAAU/CDx8G1XUsWz1l757koCERaoOebgXvXnXKNiHGtMU12ekBA8b 7R3yjsUzCNxWHQxbrWC9Tbaa/3GUcETVzkA7pIlvsEX9wRhOJafmT9lA3hm1O8lXbB5WOkHy79e HgZeXMRKS9N6seRcH1E4HE8Dws2/u3B1j8tBGYtvykyjVhrIallnZjNpH04XNK0DwQo0YTQ== X-Received: by 2002:a63:5f8e:: with SMTP id t136mr3911349pgb.411.1582725038910; Wed, 26 Feb 2020 05:50:38 -0800 (PST) Date: Wed, 26 Feb 2020 05:50:26 -0800 In-Reply-To: <20200226135027.34538-1-lrizzo@google.com> Message-Id: <20200226135027.34538-2-lrizzo@google.com> Mime-Version: 1.0 References: <20200226135027.34538-1-lrizzo@google.com> X-Mailer: git-send-email 2.25.0.265.gbab2e86ba0-goog Subject: [PATCH v3 1/2] kstats: kernel metric collector From: Luigi Rizzo To: linux-kernel@vger.kernel.org, mhiramat@kernel.org, akpm@linux-foundation.org, gregkh@linuxfoundation.org, naveen.n.rao@linux.ibm.com, ardb@kernel.org, rizzo@iet.unipi.it, pabeni@redhat.com, giuseppe.lettieri@unipi.it, toke@redhat.com, hawk@kernel.org, mingo@redhat.com, acme@kernel.org, rostedt@goodmis.org, peterz@infradead.org Cc: Luigi Rizzo Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org kstats is a helper to accumulate in-kernel metrics (timestamps, sizes, etc.) and show distributions through debugfs. Set CONFIG_KSTATS=m or y to enable it. Creating a metric takes one line of code (and one to destroy it): struct kstats *key = kstats_new("foo", 3 /* frac_bits */); ... kstats_delete(key); The following line records a u64 sample: kstats_record(key, value); kstats_record() is cheap (5ns hot cache, 250ns cold cache). Samples are accumulated in a per-cpu array with 2^frac_bits slots for each power of 2. Using frac_bits = 3 gives about 30 slots per decade. Each metric has an entry in /sys/kernel/debug/kstats which can be used to read and control data collection: - writing start/stop/reset starts and stops sample collection, or resets the counters, e.g. echo reset > /sys/kernel/debug/kstats/foo - reading from the node produces a detailed output that can be processed with external tools for improved presentation: cat /sys/kernel/debug/kstats/foo ... slot 55 CPU 0 count 589 avg 480 p 0.027613 slot 55 CPU 1 count 18 avg 480 p 0.002572 slot 55 CPU 2 count 25 avg 480 p 0.003325 ... slot 55 CPUS 16 count 814 avg 480 p 0.002474 ... slot 97 CPU 8 count 1150 avg 20130 p 0.447442 slot 97 CPU 12 count 26 avg 20294 p 0.275555 slot 97 CPUS 16 count 152585 avg 19809 p 0.651747 slot 98 CPU 0 count 38 avg 21360 p 0.954691 slot 98 CPU 1 count 456 avg 21412 p 0.872619 ... slot 144 CPUS 16 count 12 avg 1061896 p 0.999999 slot 146 CPU 4 count 1 avg 1313664 p 1.000000 slot 146 CPUS 16 count 1 avg 1313664 p 1.000000 Postprocessing examples: look at statistics for a single CPU (note double space) cd /sys/kernel/kstats/ echo reset > foo; watch grep "'CPU 12 '" foo Use gnuplot to plot distribution and cdf in a terminal from a remote host gnuplot set terminal dumb size 200 80 set logscale x plot " --- include/linux/kstats.h | 62 +++++++++ lib/Kconfig.debug | 7 + lib/Makefile | 1 + lib/kstats.c | 290 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 360 insertions(+) create mode 100644 include/linux/kstats.h create mode 100644 lib/kstats.c diff --git a/include/linux/kstats.h b/include/linux/kstats.h new file mode 100644 index 0000000000000..636cc007fa138 --- /dev/null +++ b/include/linux/kstats.h @@ -0,0 +1,62 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_KSTATS_H +#define _LINUX_KSTATS_H + +#include + +/* + * Helper to collect and report kernel metrics. Use as follows: + * + * - creates a new debugfs entry in /sys/kernel/debug/kstats/foo + * to collect the metric, accumulating samples in 2^frac_bits slots + * per power of 2 + * + * struct kstats *key = kstats_new("foo", frac_bits); + * + * - add instrumentation around code: + * + * u64 dt = ktime_get_ns(); // about 20ns + *
+ * dt = ktime_get_ns() - dt; // about 20ns + * kstats_record(key, dt); // 5ns hot cache, 300ns cold + * + * - read values from debugfs + * cat /sys/kernel/debug/kstats/foo + * ... + * slot 55 CPU 0 count 589 avg 480 p 0.027613 + * slot 55 CPU 1 count 18 avg 480 p 0.002572 + * slot 55 CPU 2 count 25 avg 480 p 0.003325 + * ... + * slot 55 CPUS 28 count 814 avg 480 p 0.002474 + * ... + * slot 97 CPU 13 count 1150 avg 20130 p 0.447442 + * slot 97 CPUS 28 count 152585 avg 19809 p 0.651747 + * ... + * + * - write to the file STOP, START, RESET executes the corresponding action + * + * echo RESET > /sys/kernel/debug/kstats/foo + */ + +struct kstats; + +#if defined(CONFIG_KSTATS) || defined(CONFIG_KSTATS_MODULE) +/* Add an entry to debugfs. */ +struct kstats *kstats_new(const char *name, u8 frac_bits); + +/* Record a sample */ +void kstats_record(struct kstats *key, u64 value); + +/* Remove an entry and frees memory */ +void kstats_delete(struct kstats *key); +#else +static inline struct kstats *kstats_new(const char *name, u8 frac_bits) +{ + return NULL; +} + +static inline void kstats_record(struct kstats *key, u64 value) {} +static inline void kstats_delete(struct kstats *key) {} +#endif + +#endif /* _LINUX_KSTATS_H */ diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 69def4a9df009..7d7fae57bfde5 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1452,6 +1452,13 @@ config LATENCYTOP Enable this option if you want to use the LatencyTOP tool to find out which userspace is blocking on what kernel operations. +config KSTATS + tristate "collect percpu metrics and export distributions through debugfs" + depends on DEBUG_FS + help + Helper library to collect percpu kernel metrics, exporting + distributions through debugfs. See lib/kstats.c + source "kernel/trace/Kconfig" config PROVIDE_OHCI1394_DMA_INIT diff --git a/lib/Makefile b/lib/Makefile index 611872c069269..3f8578007db4e 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -236,6 +236,7 @@ obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o obj-$(CONFIG_PERCPU_TEST) += percpu_test.o +obj-$(CONFIG_KSTATS) += kstats.o obj-$(CONFIG_ASN1) += asn1_decoder.o diff --git a/lib/kstats.c b/lib/kstats.c new file mode 100644 index 0000000000000..885ab448708de --- /dev/null +++ b/lib/kstats.c @@ -0,0 +1,290 @@ +/* + * kstats, collect samples and export distributions through debugfs + * + * CREATE OBJECT: + * struct kstats *key = kstats_new("some_name", 3); + * + * ADD A SAMPLE: + * u64 dt = ktime_get_ns(); // about 20ns + * + * dt = ktime_get_ns() - t; // about 20ns + * kstats_record(key, dt); // 5ns hot cache, 250ns cold + * + * SHOW DATA: + * cat /sys/kernel/debug/kstats/some_name + * + * ... + * slot 12 CPU 0 count 764 avg 12 p 0.011339 + * slot 12 CPU 1 count 849 avg 12 p 0.011496 + * slot 12 CPU 2 count 712 avg 12 p 0.009705 + * ... + * slot 12 CPU 243 count 1 avg 12 p 0.000097 + * slot 12 CPUS 256 count 19977 avg 12 p 0.006153 + * ... + */ + +#include +#include +#include +#include +#include +#include +#include + +/* + * Values are 64 bit unsigned and are accumulated per cpu, in one bucket for + * each power of 2. Each bucket is further subdivided in 2^frac_bits slots. + * The range for each slot is 2^-frac_bits of the base value for the bucket. + */ +#define BUCKETS 64 /* Total powers of 2 */ + +/* For large values, sum is scaled to reduce the chance of overflow */ +#define SUM_SCALE 20 + +/* Internal names start with ks_, external ones with kstats_ */ + +struct ks_slot { + u64 samples; + u64 sum; +}; + +struct kstats { + u16 n_slots; /* typically BUCKETS * 2^frac_bits + 2 */ + u8 frac_bits; + u8 frac_mask; /* 2^frac_bits - 1 */ + bool active; /* recording status */ + struct ks_slot __percpu *slots; + struct dentry *entry; /* debugfs entry */ +}; + +static void ks_print(struct seq_file *p, int slot, int cpu, u64 sum, + u64 tot, unsigned long samples, unsigned long avg) +{ + unsigned long frac = (tot == 0) ? 0 : ((sum % tot) * 1000000) / tot; + + seq_printf(p, + "slot %-3d CPU%c %-4d count %8lu avg %8lu p %c.%06lu\n", + slot, cpu == nr_cpu_ids ? 'S' : ' ', cpu, + samples, avg, sum == tot ? '1' : '0', frac); +} + +/* Read handler */ +static int ks_show_entry(struct seq_file *p, void *v) +{ + u32 slot, cpu; + struct ks_slot *slots, *cur; + struct kstats *ks = p->private; + u64 *partials, *totals, grand_total = 0; + const size_t rowsize = ks ? ks->n_slots * sizeof(struct ks_slot) : 0; + + if (!ks) + return -ENOENT; + if (!rowsize) + return 0; + /* + * Counters are updated while we read them, so make a copy first. + * kvzalloc'ed memory contains three areas: + * + * slots: [ nr_cpu_ids ][ ks->n_slots ](struct ks_slot) + * partials: [ nr_cpu_ids ](u64) + * totals: [ nr_cpu_ids ](u64) + */ + slots = kvzalloc(nr_cpu_ids * (rowsize + 2 * sizeof(u64)), GFP_KERNEL); + if (!slots) + return -ENOMEM; + partials = (u64 *)(slots + ks->n_slots * nr_cpu_ids); + totals = partials + nr_cpu_ids; + /* Copy data and compute counts totals (per-cpu and grand_total). + * These values are needed to compute percentiles. + */ + for_each_possible_cpu(cpu) { + cur = slots + ks->n_slots * cpu; + memcpy(cur, per_cpu_ptr(ks->slots, cpu), rowsize); + for (slot = 0; slot < ks->n_slots; slot++) + totals[cpu] += cur[slot].samples; + grand_total += totals[cpu]; + } + + /* Second pass, produce individual lines */ + for (slot = 0; slot < ks->n_slots; slot++) { + u64 n, samples = 0, sum = 0, samples_cumulative = 0; + u32 bucket = slot >> ks->frac_bits; + u32 sum_shift = bucket < SUM_SCALE ? 0 : bucket - SUM_SCALE; + + for_each_possible_cpu(cpu) { + cur = slots + ks->n_slots * cpu; + sum += cur[slot].sum; + n = cur[slot].samples; + samples += n; + partials[cpu] += n; + samples_cumulative += partials[cpu]; + if (n == 0) + continue; + ks_print(p, slot, cpu, partials[cpu], totals[cpu], n, + (cur[slot].sum / n) << sum_shift); + } + if (samples == 0) + continue; + ks_print(p, slot, nr_cpu_ids, samples_cumulative, grand_total, + samples, (sum / samples) << sum_shift); + } + kvfree(slots); + return 0; +} + +static ssize_t ks_write(struct file *fp, const char __user *user_buffer, + size_t count, loff_t *position) +{ + struct inode *ino = fp->f_inode; + struct kstats *ks = ino ? ino->i_private : NULL; + char buf[256] = {}; + ssize_t ret; + u32 cpu; + + if (count >= sizeof(buf) - 1) + return -EINVAL; + ret = simple_write_to_buffer(buf, sizeof(buf), + position, user_buffer, count); + if (ret < 0) + return ret; + /* Trim final newline if any */ + if (count > 0 && buf[count - 1] == '\n') + buf[count - 1] = '\0'; + + if (ks == NULL) + return -EINVAL; + + if (!strcasecmp(buf, "START")) { + ks->active = 1; + } else if (!strcasecmp(buf, "STOP")) { + ks->active = 0; + } else if (!strcasecmp(buf, "RESET")) { + for_each_possible_cpu(cpu) { + memset(per_cpu_ptr(ks->slots, cpu), 0, + ks->n_slots * sizeof(struct ks_slot)); + } + } else { + ret = -EINVAL; + } + /* TODO: add another command to turn off and deallocate memory. */ + return ret; +} + +static int ks_open(struct inode *inode, struct file *f) +{ + return single_open(f, ks_show_entry, inode->i_private); +} + +static const struct file_operations ks_file_ops = { + .owner = THIS_MODULE, + .open = ks_open, + .release = single_release, + .read = seq_read, + .write = ks_write, + .llseek = seq_lseek, +}; + +static struct dentry *ks_root; /* kstats root in debugfs */ + +static int __init ks_init(void) +{ + ks_root = debugfs_create_dir("kstats", NULL); + return 0; +} + +static void __exit ks_exit(void) +{ + debugfs_remove_recursive(ks_root); +} + +/* Run as soon as possible, but after debugfs, which is in core_initcall */ +postcore_initcall(ks_init); +module_exit(ks_exit); +MODULE_LICENSE("GPL"); + +/* User API: kstats_new(), kstats_delete(), kstats_record() */ + +struct kstats *kstats_new(const char *name, u8 frac_bits) +{ + struct kstats *ks = NULL; + const char *errmsg = ""; + + if (IS_ERR_OR_NULL(ks_root)) { + errmsg = "ks_root not set yet"; + goto error; + } + + if (frac_bits > 5) { + pr_info("fractional bits %d too large, using 3\n", frac_bits); + frac_bits = 3; + } + ks = kzalloc(sizeof(*ks), GFP_KERNEL); + if (!ks) + return NULL; + ks->active = 1; + ks->frac_bits = frac_bits; + ks->frac_mask = (1 << frac_bits) - 1; + ks->n_slots = ((BUCKETS - frac_bits + 1) << frac_bits) + 1; + + /* Add one extra bucket for user timestamps */ + ks->slots = __alloc_percpu((1 + ks->n_slots) * sizeof(struct ks_slot), + sizeof(u64)); + if (!ks->slots) { + errmsg = "failed to allocate pcpu"; + goto error; + } + + /* 'ks' is saved in the inode (entry->d_inode->i_private). */ + ks->entry = debugfs_create_file(name, 0444, ks_root, ks, &ks_file_ops); + __module_get(THIS_MODULE); + return ks; + +error: + pr_info("kstats: '%s' error %s\n", name, errmsg); + kstats_delete(ks); + return NULL; +} +EXPORT_SYMBOL(kstats_new); + +void kstats_delete(struct kstats *ks) +{ + if (!ks) + return; + debugfs_remove(ks->entry); + if (ks->slots) + free_percpu(ks->slots); + *ks = (struct kstats){}; + kfree(ks); + module_put(THIS_MODULE); +} +EXPORT_SYMBOL(kstats_delete); + +void kstats_record(struct kstats *ks, u64 val) +{ + u32 bucket, slot; + + if (!ks || !ks->active) + return; + /* The leftmost 1 selects the bucket, subsequent frac_bits select + * the slot within the bucket. fls returns 0 when the argument is 0. + */ + bucket = fls64(val >> ks->frac_bits); + slot = bucket == 0 ? val : + ((bucket << ks->frac_bits) | + ((val >> (bucket - 1)) & ks->frac_mask)); + + /* Use the last slot on overflow if BUCKETS < 64 */ + if (slot > ks->n_slots - 2) + slot = ks->n_slots - 1; + + /* preempt_disable makes sure samples and sum modify the same slot. + * this_cpu_add() uses a non-interruptible add to protect against + * hardware interrupts which may call kstats_record. + */ + preempt_disable(); + this_cpu_add(ks->slots[slot].samples, 1); + this_cpu_add(ks->slots[slot].sum, + bucket < SUM_SCALE ? val : (val >> (bucket - SUM_SCALE))); + preempt_enable(); +} +EXPORT_SYMBOL(kstats_record); -- 2.25.0.265.gbab2e86ba0-goog