2015-07-29 04:51:58

by Takao Indoh

Subject: [PATCH RFC 0/3] x86: Intel Processor Trace Logger

Hi all,

This patch series provides a logging feature for Intel Processor Trace
(Intel PT).

Intel PT is a new feature of Intel's "Broadwell" CPUs; it captures
information about program execution flow. Here is an article about
Intel PT:
https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing

Once Intel PT is enabled, the events that change program flow, such as
branch instructions, exceptions, interrupts and traps, are logged in
memory. This is very useful for debugging because we can see the
detailed behavior of the software.

This patch series creates a log buffer for Intel PT and enables logging
at boot time. When a kernel panic occurs, we can retrieve this log
buffer from the crash dump file taken by kdump and reconstruct the flow
that led to the panic.

Takao Indoh (3):
x86: Add Intel PT common files
x86: Add Intel PT logger
x86: Stop Intel PT and save its registers when panic occurs

arch/x86/Kconfig | 16 ++
arch/x86/include/asm/intel_pt.h | 84 +++++++++
arch/x86/kernel/cpu/Makefile | 3 +
arch/x86/kernel/cpu/intel_pt.h | 131 -------------
arch/x86/kernel/cpu/intel_pt_cap.c | 69 +++++++
arch/x86/kernel/cpu/intel_pt_log.c | 288 +++++++++++++++++++++++++++++
arch/x86/kernel/cpu/intel_pt_perf.h | 78 ++++++++
arch/x86/kernel/cpu/perf_event_intel_pt.c | 54 +-----
arch/x86/kernel/crash.c | 9 +
9 files changed, 556 insertions(+), 176 deletions(-)
create mode 100644 arch/x86/include/asm/intel_pt.h
delete mode 100644 arch/x86/kernel/cpu/intel_pt.h
create mode 100644 arch/x86/kernel/cpu/intel_pt_cap.c
create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c
create mode 100644 arch/x86/kernel/cpu/intel_pt_perf.h


2015-07-29 04:52:15

by Takao Indoh

Subject: [PATCH RFC 1/3] x86: Add Intel PT common files

Rename the existing intel_pt.h to intel_pt_perf.h as a perf-specific
header, and create a new intel_pt.h as a common header for the Intel PT
feature. Also add intel_pt_cap.c for the Intel PT capability code.

Signed-off-by: Takao Indoh <[email protected]>
---
arch/x86/include/asm/intel_pt.h | 82 ++++++++++++++++++
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/intel_pt.h | 131 -----------------------------
arch/x86/kernel/cpu/intel_pt_cap.c | 69 +++++++++++++++
arch/x86/kernel/cpu/intel_pt_perf.h | 78 +++++++++++++++++
arch/x86/kernel/cpu/perf_event_intel_pt.c | 54 ++----------
6 files changed, 239 insertions(+), 176 deletions(-)
create mode 100644 arch/x86/include/asm/intel_pt.h
delete mode 100644 arch/x86/kernel/cpu/intel_pt.h
create mode 100644 arch/x86/kernel/cpu/intel_pt_cap.c
create mode 100644 arch/x86/kernel/cpu/intel_pt_perf.h

diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
new file mode 100644
index 0000000..7cb16e1
--- /dev/null
+++ b/arch/x86/include/asm/intel_pt.h
@@ -0,0 +1,82 @@
+/*
+ * Intel(R) Processor Trace common header
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#ifndef __INTEL_PT_H__
+#define __INTEL_PT_H__
+
+/*
+ * Table of Physical Addresses bits
+ */
+enum topa_sz {
+ TOPA_4K = 0,
+ TOPA_8K,
+ TOPA_16K,
+ TOPA_32K,
+ TOPA_64K,
+ TOPA_128K,
+ TOPA_256K,
+ TOPA_512K,
+ TOPA_1MB,
+ TOPA_2MB,
+ TOPA_4MB,
+ TOPA_8MB,
+ TOPA_16MB,
+ TOPA_32MB,
+ TOPA_64MB,
+ TOPA_128MB,
+ TOPA_SZ_END,
+};
+
+static inline unsigned int sizes(enum topa_sz tsz)
+{
+ return 1 << (tsz + 12);
+};
+
+struct topa_entry {
+ u64 end : 1;
+ u64 rsvd0 : 1;
+ u64 intr : 1;
+ u64 rsvd1 : 1;
+ u64 stop : 1;
+ u64 rsvd2 : 1;
+ u64 size : 4;
+ u64 rsvd3 : 2;
+ u64 base : 36;
+ u64 rsvd4 : 16;
+};
+
+#define TOPA_SHIFT 12
+#define PT_CPUID_LEAVES 2
+
+/*
+ * Capability stuff
+ */
+enum pt_capabilities {
+ PT_CAP_max_subleaf = 0,
+ PT_CAP_cr3_filtering,
+ PT_CAP_topa_output,
+ PT_CAP_topa_multiple_entries,
+ PT_CAP_payloads_lip,
+};
+
+void pt_cap_init(void);
+u32 pt_cap_get(enum pt_capabilities cap);
+const char *pt_cap_name(enum pt_capabilities cap);
+int pt_cap_num(void);
+
+#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 9bff687..77d371c 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_CPU_SUP_CYRIX_32) += cyrix.o
obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o
+obj-$(CONFIG_CPU_SUP_INTEL) += intel_pt_cap.o

obj-$(CONFIG_PERF_EVENTS) += perf_event.o

diff --git a/arch/x86/kernel/cpu/intel_pt.h b/arch/x86/kernel/cpu/intel_pt.h
deleted file mode 100644
index 1c338b0..0000000
--- a/arch/x86/kernel/cpu/intel_pt.h
+++ /dev/null
@@ -1,131 +0,0 @@
-/*
- * Intel(R) Processor Trace PMU driver for perf
- * Copyright (c) 2013-2014, Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * Intel PT is specified in the Intel Architecture Instruction Set Extensions
- * Programming Reference:
- * http://software.intel.com/en-us/intel-isa-extensions
- */
-
-#ifndef __INTEL_PT_H__
-#define __INTEL_PT_H__
-
-/*
- * Single-entry ToPA: when this close to region boundary, switch
- * buffers to avoid losing data.
- */
-#define TOPA_PMI_MARGIN 512
-
-/*
- * Table of Physical Addresses bits
- */
-enum topa_sz {
- TOPA_4K = 0,
- TOPA_8K,
- TOPA_16K,
- TOPA_32K,
- TOPA_64K,
- TOPA_128K,
- TOPA_256K,
- TOPA_512K,
- TOPA_1MB,
- TOPA_2MB,
- TOPA_4MB,
- TOPA_8MB,
- TOPA_16MB,
- TOPA_32MB,
- TOPA_64MB,
- TOPA_128MB,
- TOPA_SZ_END,
-};
-
-static inline unsigned int sizes(enum topa_sz tsz)
-{
- return 1 << (tsz + 12);
-};
-
-struct topa_entry {
- u64 end : 1;
- u64 rsvd0 : 1;
- u64 intr : 1;
- u64 rsvd1 : 1;
- u64 stop : 1;
- u64 rsvd2 : 1;
- u64 size : 4;
- u64 rsvd3 : 2;
- u64 base : 36;
- u64 rsvd4 : 16;
-};
-
-#define TOPA_SHIFT 12
-#define PT_CPUID_LEAVES 2
-
-enum pt_capabilities {
- PT_CAP_max_subleaf = 0,
- PT_CAP_cr3_filtering,
- PT_CAP_topa_output,
- PT_CAP_topa_multiple_entries,
- PT_CAP_payloads_lip,
-};
-
-struct pt_pmu {
- struct pmu pmu;
- u32 caps[4 * PT_CPUID_LEAVES];
-};
-
-/**
- * struct pt_buffer - buffer configuration; one buffer per task_struct or
- * cpu, depending on perf event configuration
- * @cpu: cpu for per-cpu allocation
- * @tables: list of ToPA tables in this buffer
- * @first: shorthand for first topa table
- * @last: shorthand for last topa table
- * @cur: current topa table
- * @nr_pages: buffer size in pages
- * @cur_idx: current output region's index within @cur table
- * @output_off: offset within the current output region
- * @data_size: running total of the amount of data in this buffer
- * @lost: if data was lost/truncated
- * @head: logical write offset inside the buffer
- * @snapshot: if this is for a snapshot/overwrite counter
- * @stop_pos: STOP topa entry in the buffer
- * @intr_pos: INT topa entry in the buffer
- * @data_pages: array of pages from perf
- * @topa_index: table of topa entries indexed by page offset
- */
-struct pt_buffer {
- int cpu;
- struct list_head tables;
- struct topa *first, *last, *cur;
- unsigned int cur_idx;
- size_t output_off;
- unsigned long nr_pages;
- local_t data_size;
- local_t lost;
- local64_t head;
- bool snapshot;
- unsigned long stop_pos, intr_pos;
- void **data_pages;
- struct topa_entry *topa_index[0];
-};
-
-/**
- * struct pt - per-cpu pt context
- * @handle: perf output handle
- * @handle_nmi: do handle PT PMI on this cpu, there's an active event
- */
-struct pt {
- struct perf_output_handle handle;
- int handle_nmi;
-};
-
-#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/cpu/intel_pt_cap.c b/arch/x86/kernel/cpu/intel_pt_cap.c
new file mode 100644
index 0000000..a2cfbfc
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt_cap.c
@@ -0,0 +1,69 @@
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/mm.h>
+#include <asm/intel_pt.h>
+
+enum cpuid_regs {
+ CR_EAX = 0,
+ CR_ECX,
+ CR_EDX,
+ CR_EBX
+};
+
+static u32 cpuid_cache[4 * PT_CPUID_LEAVES];
+static int pt_cap_initialized;
+
+#define PT_CAP(_n, _l, _r, _m) \
+ [PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l, \
+ .reg = _r, .mask = _m }
+
+static struct pt_cap_desc {
+ const char *name;
+ u32 leaf;
+ u8 reg;
+ u32 mask;
+} pt_caps[] = {
+ PT_CAP(max_subleaf, 0, CR_EAX, 0xffffffff),
+ PT_CAP(cr3_filtering, 0, CR_EBX, BIT(0)),
+ PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
+ PT_CAP(topa_multiple_entries, 0, CR_ECX, BIT(1)),
+ PT_CAP(payloads_lip, 0, CR_ECX, BIT(31)),
+};
+
+u32 pt_cap_get(enum pt_capabilities cap)
+{
+ struct pt_cap_desc *cd = &pt_caps[cap];
+ u32 c = cpuid_cache[cd->leaf * 4 + cd->reg];
+ unsigned int shift = __ffs(cd->mask);
+
+ return (c & cd->mask) >> shift;
+}
+
+const char *pt_cap_name(enum pt_capabilities cap)
+{
+ return pt_caps[cap].name;
+}
+
+int pt_cap_num(void)
+{
+ return ARRAY_SIZE(pt_caps);
+}
+
+void __init pt_cap_init(void)
+{
+ int i;
+
+ if (pt_cap_initialized)
+ return;
+
+ for (i = 0; i < PT_CPUID_LEAVES; i++) {
+ cpuid_count(20, i,
+ &cpuid_cache[CR_EAX + i*4],
+ &cpuid_cache[CR_EBX + i*4],
+ &cpuid_cache[CR_ECX + i*4],
+ &cpuid_cache[CR_EDX + i*4]);
+ }
+
+ pt_cap_initialized = 1;
+}
+
diff --git a/arch/x86/kernel/cpu/intel_pt_perf.h b/arch/x86/kernel/cpu/intel_pt_perf.h
new file mode 100644
index 0000000..1e77646
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt_perf.h
@@ -0,0 +1,78 @@
+/*
+ * Intel(R) Processor Trace PMU driver for perf
+ * Copyright (c) 2013-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * Intel PT is specified in the Intel Architecture Instruction Set Extensions
+ * Programming Reference:
+ * http://software.intel.com/en-us/intel-isa-extensions
+ */
+
+#ifndef __INTEL_PT_PERF_H__
+#define __INTEL_PT_PERF_H__
+
+/*
+ * Single-entry ToPA: when this close to region boundary, switch
+ * buffers to avoid losing data.
+ */
+#define TOPA_PMI_MARGIN 512
+
+struct pt_pmu {
+ struct pmu pmu;
+};
+
+/**
+ * struct pt_buffer - buffer configuration; one buffer per task_struct or
+ * cpu, depending on perf event configuration
+ * @cpu: cpu for per-cpu allocation
+ * @tables: list of ToPA tables in this buffer
+ * @first: shorthand for first topa table
+ * @last: shorthand for last topa table
+ * @cur: current topa table
+ * @nr_pages: buffer size in pages
+ * @cur_idx: current output region's index within @cur table
+ * @output_off: offset within the current output region
+ * @data_size: running total of the amount of data in this buffer
+ * @lost: if data was lost/truncated
+ * @head: logical write offset inside the buffer
+ * @snapshot: if this is for a snapshot/overwrite counter
+ * @stop_pos: STOP topa entry in the buffer
+ * @intr_pos: INT topa entry in the buffer
+ * @data_pages: array of pages from perf
+ * @topa_index: table of topa entries indexed by page offset
+ */
+struct pt_buffer {
+ int cpu;
+ struct list_head tables;
+ struct topa *first, *last, *cur;
+ unsigned int cur_idx;
+ size_t output_off;
+ unsigned long nr_pages;
+ local_t data_size;
+ local_t lost;
+ local64_t head;
+ bool snapshot;
+ unsigned long stop_pos, intr_pos;
+ void **data_pages;
+ struct topa_entry *topa_index[0];
+};
+
+/**
+ * struct pt - per-cpu pt context
+ * @handle: perf output handle
+ * @handle_nmi: do handle PT PMI on this cpu, there's an active event
+ */
+struct pt {
+ struct perf_output_handle handle;
+ int handle_nmi;
+};
+
+#endif /* __INTEL_PT_PERF_H__ */
diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index 183de71..c3aec2c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -27,21 +27,15 @@
#include <asm/perf_event.h>
#include <asm/insn.h>
#include <asm/io.h>
+#include <asm/intel_pt.h>

#include "perf_event.h"
-#include "intel_pt.h"
+#include "intel_pt_perf.h"

static DEFINE_PER_CPU(struct pt, pt_ctx);

static struct pt_pmu pt_pmu;

-enum cpuid_regs {
- CR_EAX = 0,
- CR_ECX,
- CR_EDX,
- CR_EBX
-};
-
/*
* Capabilities of Intel PT hardware, such as number of address bits or
* supported output schemes, are cached and exported to userspace as "caps"
@@ -53,32 +47,6 @@ enum cpuid_regs {
* width encoded in IP-related packets), and event configuration (bitmasks with
* permitted values for certain bit fields).
*/
-#define PT_CAP(_n, _l, _r, _m) \
- [PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l, \
- .reg = _r, .mask = _m }
-
-static struct pt_cap_desc {
- const char *name;
- u32 leaf;
- u8 reg;
- u32 mask;
-} pt_caps[] = {
- PT_CAP(max_subleaf, 0, CR_EAX, 0xffffffff),
- PT_CAP(cr3_filtering, 0, CR_EBX, BIT(0)),
- PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
- PT_CAP(topa_multiple_entries, 0, CR_ECX, BIT(1)),
- PT_CAP(payloads_lip, 0, CR_ECX, BIT(31)),
-};
-
-static u32 pt_cap_get(enum pt_capabilities cap)
-{
- struct pt_cap_desc *cd = &pt_caps[cap];
- u32 c = pt_pmu.caps[cd->leaf * 4 + cd->reg];
- unsigned int shift = __ffs(cd->mask);
-
- return (c & cd->mask) >> shift;
-}
-
static ssize_t pt_cap_show(struct device *cdev,
struct device_attribute *attr,
char *buf)
@@ -121,35 +89,31 @@ static int __init pt_pmu_hw_init(void)
size_t size;
int ret;
long i;
+ int cap_num;

attrs = NULL;
ret = -ENODEV;
if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT))
goto fail;

- for (i = 0; i < PT_CPUID_LEAVES; i++) {
- cpuid_count(20, i,
- &pt_pmu.caps[CR_EAX + i*4],
- &pt_pmu.caps[CR_EBX + i*4],
- &pt_pmu.caps[CR_ECX + i*4],
- &pt_pmu.caps[CR_EDX + i*4]);
- }
+ pt_cap_init();
+ cap_num = pt_cap_num();

ret = -ENOMEM;
- size = sizeof(struct attribute *) * (ARRAY_SIZE(pt_caps)+1);
+ size = sizeof(struct attribute *) * (cap_num+1);
attrs = kzalloc(size, GFP_KERNEL);
if (!attrs)
goto fail;

- size = sizeof(struct dev_ext_attribute) * (ARRAY_SIZE(pt_caps)+1);
+ size = sizeof(struct dev_ext_attribute) * (cap_num+1);
de_attrs = kzalloc(size, GFP_KERNEL);
if (!de_attrs)
goto fail;

- for (i = 0; i < ARRAY_SIZE(pt_caps); i++) {
+ for (i = 0; i < cap_num; i++) {
struct dev_ext_attribute *de_attr = de_attrs + i;

- de_attr->attr.attr.name = pt_caps[i].name;
+ de_attr->attr.attr.name = pt_cap_name(i);

sysfs_attr_init(&de_attr->attr.attr);

--
1.7.1

2015-07-29 05:02:40

by Takao Indoh

Subject: [PATCH RFC 2/3] x86: Add Intel PT logger

This patch provides an Intel PT logging feature. When the system boots
with the parameter "intel_pt_log", log buffers for Intel PT are
allocated and logging starts; processor flow information is then
written to the log buffers by the hardware, like a flight recorder.
This is very helpful for investigating the cause of a kernel panic.

The log buffer size is specified by the parameter
"intel_pt_log_buf_len=<size>", in bytes (for example, booting with
"intel_pt_log intel_pt_log_buf_len=4194304" allocates a 4MB buffer per
CPU). The buffer is used as a circular buffer, so old events are
overwritten by new events.

Signed-off-by: Takao Indoh <[email protected]>
---
arch/x86/Kconfig | 16 ++
arch/x86/kernel/cpu/Makefile | 2 +
arch/x86/kernel/cpu/intel_pt_log.c | 288 ++++++++++++++++++++++++++++++++++++
3 files changed, 306 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 55bced1..c31400f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1658,6 +1658,22 @@ config X86_INTEL_MPX

If unsure, say N.

+config X86_INTEL_PT_LOG
+ prompt "Intel PT logger"
+ def_bool n
+ depends on CPU_SUP_INTEL
+ ---help---
+ Intel PT is a hardware feature that can capture information
+ about program execution flow. Once Intel PT is enabled, the
+ events that change program flow, such as branch instructions,
+ exceptions, interrupts and traps, are logged in memory.
+
+ This option enables the Intel PT logging feature at boot
+ time. When a kernel panic occurs, the Intel PT log buffer can
+ be retrieved from the crash dump file and used to reconstruct
+ the detailed flow that led to the panic.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 77d371c..24629ff 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -58,6 +58,8 @@ obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o perf_event_amd_ibs.o

obj-$(CONFIG_HYPERVISOR_GUEST) += vmware.o hypervisor.o mshyperv.o

+obj-$(CONFIG_X86_INTEL_PT_LOG) += intel_pt_log.o
+
ifdef CONFIG_X86_FEATURE_NAMES
quiet_cmd_mkcapflags = MKCAP $@
cmd_mkcapflags = $(CONFIG_SHELL) $(srctree)/$(src)/mkcapflags.sh $< $@
diff --git a/arch/x86/kernel/cpu/intel_pt_log.c b/arch/x86/kernel/cpu/intel_pt_log.c
new file mode 100644
index 0000000..b1c4d66
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_pt_log.c
@@ -0,0 +1,288 @@
+/*
+ * Intel Processor Trace Logger
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <asm/intel_pt.h>
+
+#define PT_LOG_GFP (GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY)
+
+struct pt_log_buf {
+ int cpu;
+
+ void **region; /* array of pointer to output region */
+ int region_size; /* size of region array */
+ int region_order; /* page order of region */
+
+ void **tbl; /* array of pointer to ToPA table */
+ int tbl_size; /* size of tbl array */
+
+ /* Saved registers on panic */
+ u64 saved_msr_ctl;
+ u64 saved_msr_status;
+ u64 saved_msr_output_base;
+ u64 saved_msr_output_mask;
+};
+
+static int pt_log_enabled;
+static int pt_log_buf_nr_pages = 1024; /* number of pages for log buffer */
+
+static DEFINE_PER_CPU(struct pt_log_buf, pt_log_buf_ptr);
+static struct cpumask pt_cpu_mask;
+
+static void enable_pt(int enable)
+{
+ u64 ctl;
+
+ rdmsrl(MSR_IA32_RTIT_CTL, ctl);
+
+ if (enable)
+ ctl |= RTIT_CTL_TRACEEN;
+ else
+ ctl &= ~RTIT_CTL_TRACEEN;
+
+ wrmsrl(MSR_IA32_RTIT_CTL, ctl);
+}
+
+void save_intel_pt_registers(void)
+{
+ struct pt_log_buf *buf = this_cpu_ptr(&pt_log_buf_ptr);
+
+ if (!cpumask_test_cpu(smp_processor_id(), &pt_cpu_mask))
+ return;
+
+ enable_pt(0);
+
+ rdmsrl(MSR_IA32_RTIT_CTL, buf->saved_msr_ctl);
+ rdmsrl(MSR_IA32_RTIT_STATUS, buf->saved_msr_status);
+ rdmsrl(MSR_IA32_RTIT_OUTPUT_BASE, buf->saved_msr_output_base);
+ rdmsrl(MSR_IA32_RTIT_OUTPUT_MASK, buf->saved_msr_output_mask);
+}
+
+static void setup_pt_ctl_register(void)
+{
+ u64 reg;
+
+ rdmsrl(MSR_IA32_RTIT_CTL, reg);
+
+ reg |= RTIT_CTL_OS|RTIT_CTL_USR|RTIT_CTL_TOPA|RTIT_CTL_TSC_EN|RTIT_CTL_BRANCH_EN;
+
+ wrmsrl(MSR_IA32_RTIT_CTL, reg);
+}
+
+static void setup_pt_output_register(void *base, unsigned int topa_idx,
+ unsigned int output_off)
+{
+ u64 reg;
+
+ wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, virt_to_phys(base));
+
+ reg = 0x7f | ((u64)topa_idx << 7) | ((u64)output_off << 32);
+
+ wrmsrl(MSR_IA32_RTIT_OUTPUT_MASK, reg);
+}
+
+static void *pt_alloc_pages(void **buf, int *index, int node, int order)
+{
+ struct page *page;
+ void *ptr = NULL;
+
+ page = alloc_pages_node(node, PT_LOG_GFP, order);
+ if (page) {
+ ptr = page_address(page);
+ buf[(*index)++] = ptr;
+ }
+
+ return ptr;
+}
+
+static void pt_free_pages(void **buf, int size)
+{
+ int i;
+
+ for (i = 0; i < size; i++)
+ __free_page(virt_to_page(buf[i]));
+}
+
+static int setup_pt_buffer(struct pt_log_buf *buf)
+{
+ int node = cpu_to_node(buf->cpu);
+ int size, order;
+
+ if (pt_cap_get(PT_CAP_topa_multiple_entries)) {
+ /* A page is used as one output region */
+ size = pt_log_buf_nr_pages;
+ order = 0;
+ } else {
+ /* One contiguous memory range is used as one output region */
+ size = 1;
+ order = min(get_order(pt_log_buf_nr_pages*PAGE_SIZE),
+ TOPA_SZ_END - 1);
+ }
+
+ buf->region = kzalloc_node(size * sizeof(void *), GFP_KERNEL, node);
+ if (!buf->region)
+ return -ENOMEM;
+
+ buf->region_size = 0;
+ buf->region_order = order;
+
+ while (buf->region_size < size) {
+ if (!pt_alloc_pages(buf->region, &(buf->region_size),
+ node, order)) {
+ pt_free_pages(buf->region, buf->region_size);
+ kfree(buf->region);
+ return -ENOMEM;
+ }
+ }
+
+ return 0;
+}
+
+static int setup_pt_topa_tbl(struct pt_log_buf *buf)
+{
+ int node = cpu_to_node(buf->cpu);
+ int nr_pages, nr_entries_per_page, i;
+ struct topa_entry *entry;
+ int topa_offset = 0;
+ void *new_tbl;
+
+ /*
+ * Count the number of ToPA entries in a page. A ToPA entry is
+ * 8 bytes, therefore there are (PAGE_SIZE >> 3) entries in one
+ * page, and one of them is reserved for the END entry.
+ */
+ nr_entries_per_page = (PAGE_SIZE >> 3) - 1;
+
+ nr_pages = 0;
+ while (nr_pages*nr_entries_per_page < buf->region_size)
+ nr_pages++;
+
+ buf->tbl = kzalloc_node(nr_pages * sizeof(void *), GFP_KERNEL, node);
+ if (!buf->tbl)
+ return -ENOMEM;
+
+ buf->tbl_size = 0;
+ entry = pt_alloc_pages(buf->tbl, &(buf->tbl_size), node, 0);
+ if (!entry)
+ goto fail;
+
+ /* Insert all buf->region pages into ToPA table */
+ for (i = 0; i < buf->region_size; i++) {
+ if (topa_offset == nr_entries_per_page) {
+ /* Use the last entry as END entry */
+ new_tbl = pt_alloc_pages(buf->tbl, &(buf->tbl_size),
+ node, 0);
+ if (!new_tbl)
+ goto fail;
+
+ entry[topa_offset].end = 1;
+ entry[topa_offset].base =
+ virt_to_phys(new_tbl) >> TOPA_SHIFT;
+ topa_offset = 0;
+ entry = new_tbl;
+ }
+
+ /* Add region to ToPA table */
+ entry[topa_offset].size = buf->region_order;
+ entry[topa_offset].base =
+ virt_to_phys(buf->region[i]) >> TOPA_SHIFT;
+ topa_offset++;
+ }
+
+ /* END entry */
+ entry[topa_offset].end = 1;
+ entry[topa_offset].base = virt_to_phys(buf->tbl[0]) >> TOPA_SHIFT;
+
+ return 0;
+
+fail:
+ pt_free_pages(buf->tbl, buf->tbl_size);
+ kfree(buf->tbl);
+ return -ENOMEM;
+}
+
+static void pt_log_start(void *data)
+{
+ struct pt_log_buf *buf = this_cpu_ptr(&pt_log_buf_ptr);
+
+ setup_pt_output_register(buf->tbl[0], 0, 0);
+ setup_pt_ctl_register();
+
+ enable_pt(1);
+ cpumask_set_cpu(smp_processor_id(), &pt_cpu_mask);
+}
+
+__init int pt_log_init(void)
+{
+ int cpu;
+ struct cpumask status;
+
+ cpumask_clear(&pt_cpu_mask);
+ cpumask_clear(&status);
+
+ if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT))
+ return 0;
+
+ if (!pt_log_enabled)
+ return 0;
+
+ pt_cap_init();
+
+ if (!pt_cap_get(PT_CAP_topa_output)) {
+ pr_err("ToPA table is not supported.\n");
+ return -ENODEV;
+ }
+
+ /* Prepare log buffer */
+ for_each_online_cpu(cpu) {
+ struct pt_log_buf *buf = per_cpu_ptr(&pt_log_buf_ptr, cpu);
+
+ buf->cpu = cpu;
+ if (setup_pt_buffer(buf)) {
+ pr_err("[%d]: Failed to set up log buffer\n", cpu);
+ continue;
+ }
+
+ if (setup_pt_topa_tbl(buf)) {
+ pt_free_pages(buf->region, buf->region_size);
+ kfree(buf->region);
+ pr_err("[%d]: Failed to set up ToPA table\n", cpu);
+ continue;
+ }
+
+ cpumask_set_cpu(cpu, &status);
+ }
+
+ /* Start logging on each CPU */
+ smp_call_function_many(&status, pt_log_start, NULL, 1);
+ if (cpumask_test_cpu(smp_processor_id(), &status))
+ pt_log_start(NULL);
+
+ pr_info("logging started: %*pb\n", cpumask_pr_args(&pt_cpu_mask));
+
+ return 0;
+}
+postcore_initcall(pt_log_init);
+
+static __init int pt_log_buf_setup(char *str)
+{
+ int len;
+
+ if (get_option(&str, &len))
+ pt_log_buf_nr_pages = len>>PAGE_SHIFT;
+
+ return 1;
+}
+__setup("intel_pt_log_buf_len", pt_log_buf_setup);
+
+static __init int pt_log_setup(char *str)
+{
+ pt_log_enabled = 1;
+ return 1;
+}
+__setup("intel_pt_log", pt_log_setup);
--
1.7.1

2015-07-29 04:52:27

by Takao Indoh

Subject: [PATCH RFC 3/3] x86: Stop Intel PT and save its registers when panic occurs

When a panic occurs, Intel PT logging is stopped to prevent it from
overwriting its log buffer. The Intel PT registers are saved in memory
on panic; a debugger needs them to find the last position where Intel
PT wrote data.
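
As an illustration (not part of this patch, just a sketch of what a
debugger could do): the last write position can be derived from the
saved MSRs using the same field layout that setup_pt_output_register()
in patch 2/3 programs, i.e. the ToPA entry index in bits 31:7 of
OUTPUT_MASK and the offset within the output region in bits 63:32. The
helper name below is made up for illustration only.

/* Sketch only: locate the last write position from the registers
 * saved by save_intel_pt_registers(). "buf" is a struct pt_log_buf
 * as defined in patch 2/3. */
static void pt_log_show_last_pos(struct pt_log_buf *buf)
{
	u64 table_pa   = buf->saved_msr_output_base;        /* current ToPA table */
	u32 entry_idx  = (buf->saved_msr_output_mask >> 7) & 0x1ffffff;
	u32 region_off = buf->saved_msr_output_mask >> 32;  /* bytes into region */

	pr_info("PT stopped at table %llx, entry %u, offset %u\n",
		table_pa, entry_idx, region_off);
}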

Signed-off-by: Takao Indoh <[email protected]>
---
arch/x86/include/asm/intel_pt.h | 2 ++
arch/x86/kernel/crash.c | 9 +++++++++
2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/intel_pt.h b/arch/x86/include/asm/intel_pt.h
index 7cb16e1..71bcd8d 100644
--- a/arch/x86/include/asm/intel_pt.h
+++ b/arch/x86/include/asm/intel_pt.h
@@ -79,4 +79,6 @@ u32 pt_cap_get(enum pt_capabilities cap);
const char *pt_cap_name(enum pt_capabilities cap);
int pt_cap_num(void);

+void save_intel_pt_registers(void);
+
#endif /* __INTEL_PT_H__ */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index e068d66..953c086 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -35,6 +35,7 @@
#include <asm/cpu.h>
#include <asm/reboot.h>
#include <asm/virtext.h>
+#include <asm/intel_pt.h>

/* Alignment required for elf header segment */
#define ELF_CORE_HEADER_ALIGN 4096
@@ -127,6 +128,10 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();

+#ifdef CONFIG_X86_INTEL_PT_LOG
+ save_intel_pt_registers();
+#endif
+
disable_local_APIC();
}

@@ -172,6 +177,10 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
cpu_emergency_vmxoff();
cpu_emergency_svm_disable();

+#ifdef CONFIG_X86_INTEL_PT_LOG
+ save_intel_pt_registers();
+#endif
+
#ifdef CONFIG_X86_IO_APIC
/* Prevent crash_kexec() from deadlocking on ioapic_lock. */
ioapic_zap_locks();
--
1.7.1

2015-07-29 05:44:30

by Alexander Shishkin

Subject: Re: [PATCH RFC 0/3] x86: Intel Processor Trace Logger

Takao Indoh <[email protected]> writes:

> Hi all,
>
> This patch creates log buffer for Intel PT and enable logging at boot
> time. When kernel panic occurs, we can get this log buffer from
> crashdump file by kdump, and reconstruct the flow that led to the panic.

Good to see this work going forward!

> Takao Indoh (3):
> x86: Add Intel PT common files
> x86: Add Intel PT logger
> x86: Stop Intel PT and save its registers when panic occurs
>
> arch/x86/Kconfig | 16 ++
> arch/x86/include/asm/intel_pt.h | 84 +++++++++
> arch/x86/kernel/cpu/Makefile | 3 +
> arch/x86/kernel/cpu/intel_pt.h | 131 -------------
> arch/x86/kernel/cpu/intel_pt_cap.c | 69 +++++++
> arch/x86/kernel/cpu/intel_pt_log.c | 288 +++++++++++++++++++++++++++++
> arch/x86/kernel/cpu/intel_pt_perf.h | 78 ++++++++
> arch/x86/kernel/cpu/perf_event_intel_pt.c | 54 +-----
> arch/x86/kernel/crash.c | 9 +
> 9 files changed, 556 insertions(+), 176 deletions(-)
> create mode 100644 arch/x86/include/asm/intel_pt.h
> delete mode 100644 arch/x86/kernel/cpu/intel_pt.h
> create mode 100644 arch/x86/kernel/cpu/intel_pt_cap.c
> create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c
> create mode 100644 arch/x86/kernel/cpu/intel_pt_perf.h

One note here: you want to use -M with git-format-patch so that renames
are handled better.

Regards,
--
Alex

2015-07-29 05:52:13

by Takao Indoh

Subject: Re: [PATCH RFC 0/3] x86: Intel Processor Trace Logger

On 2015/07/29 14:44, Alexander Shishkin wrote:
> Takao Indoh <[email protected]> writes:
>
>> Hi all,
>>
>> This patch creates log buffer for Intel PT and enable logging at boot
>> time. When kernel panic occurs, we can get this log buffer from
>> crashdump file by kdump, and reconstruct the flow that led to the panic.
>
> Good to see this work going forward!
>
>> Takao Indoh (3):
>> x86: Add Intel PT common files
>> x86: Add Intel PT logger
>> x86: Stop Intel PT and save its registers when panic occurs
>>
>> arch/x86/Kconfig | 16 ++
>> arch/x86/include/asm/intel_pt.h | 84 +++++++++
>> arch/x86/kernel/cpu/Makefile | 3 +
>> arch/x86/kernel/cpu/intel_pt.h | 131 -------------
>> arch/x86/kernel/cpu/intel_pt_cap.c | 69 +++++++
>> arch/x86/kernel/cpu/intel_pt_log.c | 288 +++++++++++++++++++++++++++++
>> arch/x86/kernel/cpu/intel_pt_perf.h | 78 ++++++++
>> arch/x86/kernel/cpu/perf_event_intel_pt.c | 54 +-----
>> arch/x86/kernel/crash.c | 9 +
>> 9 files changed, 556 insertions(+), 176 deletions(-)
>> create mode 100644 arch/x86/include/asm/intel_pt.h
>> delete mode 100644 arch/x86/kernel/cpu/intel_pt.h
>> create mode 100644 arch/x86/kernel/cpu/intel_pt_cap.c
>> create mode 100644 arch/x86/kernel/cpu/intel_pt_log.c
>> create mode 100644 arch/x86/kernel/cpu/intel_pt_perf.h
>
> One note here: you want to use -M with git-format-patch so that renames
> are handled better.

Thank you, I didn't know about this option. I'll use it next time.

Thanks,
Takao Indoh


>
> Regards,
> --
> Alex
>

2015-07-29 06:08:08

by Alexander Shishkin

Subject: Re: [PATCH RFC 2/3] x86: Add Intel PT logger

Takao Indoh <[email protected]> writes:

> This patch provides Intel PT logging feature. When system boots with a
> parameter "intel_pt_log", log buffers for Intel PT are allocated and
> logging starts, then processor flow information is written in the log
> buffer by hardware like flight recorder. This is very helpful to
> investigate a cause of kernel panic.
>
> The log buffer size is specified by the parameter
> "intel_pt_log_buf_len=<size>". This buffer is used as circular buffer,
> therefore old events are overwritten by new events.

[skip]

> +static void enable_pt(int enable)
> +{
> + u64 ctl;
> +
> + rdmsrl(MSR_IA32_RTIT_CTL, ctl);

Ideally, you shouldn't need this rdmsr(), because in this code you
should know exactly which ctl bits you need set when you enable.
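
E.g. something along these lines should do (untested sketch, assuming
the same ctl bits that your setup_pt_ctl_register() sets):

static void enable_pt(int enable)
{
	/* build the value from scratch instead of read-modify-write */
	u64 ctl = RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_TOPA |
		  RTIT_CTL_TSC_EN | RTIT_CTL_BRANCH_EN;

	if (enable)
		ctl |= RTIT_CTL_TRACEEN;

	wrmsrl(MSR_IA32_RTIT_CTL, ctl);
}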

> +
> + if (enable)
> + ctl |= RTIT_CTL_TRACEEN;
> + else
> + ctl &= ~RTIT_CTL_TRACEEN;
> +
> + wrmsrl(MSR_IA32_RTIT_CTL, ctl);
> +}

But the bigger problem with this approach is that it duplicates the
existing driver's functionality and some of the code, which just makes
it harder to maintain, among other things.

Instead, we should be able to use the existing perf functionality to
enable the system-wide tracing, so that it goes through the
driver. Another thing to remember is that you'd also need some of the
sideband data (vm mappings, context switches) to be able to properly
decode the trace, which can also come from perf. And it'd also be much
less code. The only missing piece is the code that would allocate the
ring buffer for such events.

Something like:

static DEFINE_PER_CPU(struct perf_event *, perf_kdump_event);

static struct perf_event_attr perf_kdump_attr;

...

static int perf_kdump_init(void)
{
	struct perf_event *event;
	int cpu;

	get_online_cpus();
	for_each_possible_cpu(cpu) {
		event = perf_event_create_kernel_counter(&perf_kdump_attr,
							 cpu, NULL,
							 NULL, NULL);

		...

		ret = rb_alloc_kernel(event, perf_kdump_data_size,
				      perf_kdump_aux_size);

		...

		per_cpu(perf_kdump_event, cpu) = event;
	}
	put_online_cpus();
}

2015-07-29 08:13:53

by Takao Indoh

Subject: Re: [PATCH RFC 2/3] x86: Add Intel PT logger

On 2015/07/29 15:08, Alexander Shishkin wrote:
> Takao Indoh <[email protected]> writes:
>
>> This patch provides Intel PT logging feature. When system boots with a
>> parameter "intel_pt_log", log buffers for Intel PT are allocated and
>> logging starts, then processor flow information is written in the log
>> buffer by hardware like flight recorder. This is very helpful to
>> investigate a cause of kernel panic.
>>
>> The log buffer size is specified by the parameter
>> "intel_pt_log_buf_len=<size>". This buffer is used as circular buffer,
>> therefore old events are overwritten by new events.
>
> [skip]
>
>> +static void enable_pt(int enable)
>> +{
>> + u64 ctl;
>> +
>> + rdmsrl(MSR_IA32_RTIT_CTL, ctl);
>
> Ideally, you shouldn't need this rdmsr(), because in this code you
> should know exactly which ctl bits you need set when you enable.

I see, I'll remove this rdmsr in the next version.

>
>> +
>> + if (enable)
>> + ctl |= RTIT_CTL_TRACEEN;
>> + else
>> + ctl &= ~RTIT_CTL_TRACEEN;
>> +
>> + wrmsrl(MSR_IA32_RTIT_CTL, ctl);
>> +}
>
> But the bigger problem with this approach is that it duplicates the
> existing driver's functionality and some of the code, which just makes
> it harder to maintain amoung other things.
>
> Instead, we should be able to do use the existing perf functionality to
> enable the system-wide tracing, so that it goes through the

"existing driver" means PMU driver (perf_event_intel_pt.c)?

The feature in these patches is a sort of flight recorder. Once it
starts, it never stops and does not export anything to user space; it
just captures data with minimum overhead in preparation for a kernel
panic. This usage is different from perf, and therefore I'm not sure
whether this feature can be implemented on top of the perf
infrastructure.

> driver. Another thing to remember is that you'd also need some of the
> sideband data (vm mappings, context switches) to be able to properly
> decode the trace, which also can come from perf. And it'd also be much
> less code. The only missing piece is the code that would allocate the
> ring buffer for such events.

The sideband data is needed if we want to reconstruct user program
flow, but is it needed to reconstruct the kernel panic path?

Thanks,
Takao Indoh


>
> Something like:
>
> static DEFINE_PER_CPU(struct perf_event *, perf_kdump_event);
>
> static struct perf_event_attr perf_kdump_attr;
>
> ...
>
> static int perf_kdump_init(void)
> {
> struct perf_event *event;
> int cpu;
>
> get_online_cpus();
> for_each_possible_cpu(cpu) {
> event = perf_create_kernel_counter(&perf_kdump_attr,
> cpu, NULL,
> NULL, NULL);
>
> ...
>
> ret = rb_alloc_kernel(event, perf_kdump_data_size, perf_kdump_aux_size);
>
> ...
>
> per_cpu(perf_kdump_event, cpu) = event;
> }
> put_online_cpus();
> }
>

2015-07-29 09:09:19

by Alexander Shishkin

Subject: Re: [PATCH RFC 2/3] x86: Add Intel PT logger

Takao Indoh <[email protected]> writes:

> On 2015/07/29 15:08, Alexander Shishkin wrote:
>> Instead, we should be able to do use the existing perf functionality to
>> enable the system-wide tracing, so that it goes through the
>
> "existing driver" means PMU driver (perf_event_intel_pt.c)?

Yes.

> The feature of these patches is a sort of flight recorder. Once it
> starts, never stop, not export anything to user, it just captures data
> with minimum overhead in preparation for kernel panic. This usage is
> different from perf and therefore I'm not sure whether this feature can
> be implemented using perf infrastructure.

Why not? There is an established infrastructure for in-kernel perf
events already, take a look at the nmi watchdog, for example.
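
For reference, the hardlockup detector sets up its per-cpu counter
roughly like this (paraphrased from kernel/watchdog.c with details
trimmed, not a verbatim copy; the wrapper function name is just for
illustration):

static struct perf_event_attr wd_hw_attr = {
	.type		= PERF_TYPE_HARDWARE,
	.config		= PERF_COUNT_HW_CPU_CYCLES,
	.size		= sizeof(struct perf_event_attr),
	.pinned		= 1,
	.disabled	= 1,
};

static int watchdog_like_init(int cpu)
{
	struct perf_event *event;

	/* per-cpu kernel counter; the callback runs from NMI context */
	event = perf_event_create_kernel_counter(&wd_hw_attr, cpu, NULL,
						 watchdog_overflow_callback,
						 NULL);
	if (IS_ERR(event))
		return PTR_ERR(event);

	perf_event_enable(event);
	return 0;
}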

>> driver. Another thing to remember is that you'd also need some of the
>> sideband data (vm mappings, context switches) to be able to properly
>> decode the trace, which also can come from perf. And it'd also be much
>> less code. The only missing piece is the code that would allocate the
>> ring buffer for such events.
>
> The sideband data is needed if we want to reconstruct user program flow,
> but is it needed to reconstruct kernel panic path?

You are not really interested in the panic path so much as the events
leading up to the panic, and those usually have context, which is much
easier to reconstruct with sideband info. Some of it you can
reconstruct by walking the kernel's data structures, but that is not
reliable after the panic.

Regards,
--
Alex

2015-07-30 01:49:31

by Takao Indoh

Subject: Re: [PATCH RFC 2/3] x86: Add Intel PT logger

On 2015/07/29 18:09, Alexander Shishkin wrote:
> Takao Indoh <[email protected]> writes:
>
>> On 2015/07/29 15:08, Alexander Shishkin wrote:
>>> Instead, we should be able to do use the existing perf functionality to
>>> enable the system-wide tracing, so that it goes through the
>>
>> "existing driver" means PMU driver (perf_event_intel_pt.c)?
>
> Yes.
>
>> The feature of these patches is a sort of flight recorder. Once it
>> starts, never stop, not export anything to user, it just captures data
>> with minimum overhead in preparation for kernel panic. This usage is
>> different from perf and therefore I'm not sure whether this feature can
>> be implemented using perf infrastructure.
>
> Why not? There is an established infrastructure for in-kernel perf
> events already, take a look at the nmi watchdog, for example.

Ok, I'm reading the code around perf_event_create_kernel_counter. It
seems to work for my purpose; I'll try to update my patches to use it.

Thanks,
Takao Indoh

>
>>> driver. Another thing to remember is that you'd also need some of the
>>> sideband data (vm mappings, context switches) to be able to properly
>>> decode the trace, which also can come from perf. And it'd also be much
>>> less code. The only missing piece is the code that would allocate the
>>> ring buffer for such events.
>>
>> The sideband data is needed if we want to reconstruct user program flow,
>> but is it needed to reconstruct kernel panic path?
>
> You are not really interested in the panic path as much as events
> leading up to the panic and those usually have context, which is much
> easier to reconstruct with sideband info. Some of it you can reconstruct
> by walking kernel's data structures, but that is not reliable after the
> panic.
>
> Regards,
> --
> Alex
>

2015-07-30 05:34:51

by Alexander Shishkin

Subject: Re: [PATCH RFC 2/3] x86: Add Intel PT logger

Takao Indoh <[email protected]> writes:

> Ok, I'm reading the code around perf_event_create_kernel_counter. It
> seems to work for my purpose, I'll try to update my patch with this.

Thank you.

Regards,
--
Alex

2015-08-02 10:02:39

by Thomas Gleixner

Subject: Re: [PATCH RFC 1/3] x86: Add Intel PT common files

On Wed, 29 Jul 2015, Takao Indoh wrote:
> +/*
> + * Table of Physical Addresses bits
> + */
> +enum topa_sz {
> + TOPA_4K = 0,
> + TOPA_8K,
> + TOPA_16K,
> + TOPA_32K,
> + TOPA_64K,
> + TOPA_128K,
> + TOPA_256K,
> + TOPA_512K,
> + TOPA_1MB,
> + TOPA_2MB,
> + TOPA_4MB,
> + TOPA_8MB,
> + TOPA_16MB,
> + TOPA_32MB,
> + TOPA_64MB,
> + TOPA_128MB,
> + TOPA_SZ_END,
> +};

While moving this around, can we pretty please clean that up? That
enum is just pointless. None of the values is ever used, and they
hardly have any value since they are trivially computable.

> +static inline unsigned int sizes(enum topa_sz tsz)
> +{
> + return 1 << (tsz + 12);

12?? PAGE_SHIFT perhaps?

> +#define TOPA_SHIFT 12

Sigh.
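
Something like this would be sufficient (untested):

/* derive the region size and base from the page order directly */
static inline unsigned int topa_region_bytes(unsigned int order)
{
	return PAGE_SIZE << order;	/* order 0..15 covers 4K..128M */
}

static inline u64 topa_entry_base(void *region)
{
	return virt_to_phys(region) >> PAGE_SHIFT;
}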

> diff --git a/arch/x86/kernel/cpu/intel_pt_cap.c b/arch/x86/kernel/cpu/intel_pt_cap.c
> new file mode 100644
> index 0000000..a2cfbfc
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/intel_pt_cap.c
> @@ -0,0 +1,69 @@
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/mm.h>
> +#include <asm/intel_pt.h>
> +
> +enum cpuid_regs {
> + CR_EAX = 0,
> + CR_ECX,
> + CR_EDX,
> + CR_EBX
> +};
> +
> +static u32 cpuid_cache[4 * PT_CPUID_LEAVES];

4 ? Magic constant pulled from thin air?

> +static int pt_cap_initialized;
> +
> +#define PT_CAP(_n, _l, _r, _m) \
> + [PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l, \
> + .reg = _r, .mask = _m }
> +
> +static struct pt_cap_desc {
> + const char *name;
> + u32 leaf;
> + u8 reg;
> + u32 mask;
> +} pt_caps[] = {
> + PT_CAP(max_subleaf, 0, CR_EAX, 0xffffffff),
> + PT_CAP(cr3_filtering, 0, CR_EBX, BIT(0)),
> + PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
> + PT_CAP(topa_multiple_entries, 0, CR_ECX, BIT(1)),
> + PT_CAP(payloads_lip, 0, CR_ECX, BIT(31)),
> +};
> +
> +u32 pt_cap_get(enum pt_capabilities cap)
> +{
> + struct pt_cap_desc *cd = &pt_caps[cap];
> + u32 c = cpuid_cache[cd->leaf * 4 + cd->reg];

Ditto

> + unsigned int shift = __ffs(cd->mask);
> +
> + return (c & cd->mask) >> shift;
> +}
> +
> +const char *pt_cap_name(enum pt_capabilities cap)
> +{
> + return pt_caps[cap].name;
> +}
> +
> +int pt_cap_num(void)
> +{
> + return ARRAY_SIZE(pt_caps);
> +}
> +
> +void __init pt_cap_init(void)
> +{
> + int i;
> +
> + if (pt_cap_initialized)
> + return;
> +
> + for (i = 0; i < PT_CPUID_LEAVES; i++) {
> + cpuid_count(20, i,
> + &cpuid_cache[CR_EAX + i*4],

Once more.

Thanks,

tglx

2015-08-03 03:14:44

by Takao Indoh

Subject: Re: [PATCH RFC 1/3] x86: Add Intel PT common files

On 2015/08/02 19:02, Thomas Gleixner wrote:
> On Wed, 29 Jul 2015, Takao Indoh wrote:
>> +/*
>> + * Table of Physical Addresses bits
>> + */
>> +enum topa_sz {
>> + TOPA_4K = 0,
>> + TOPA_8K,
>> + TOPA_16K,
>> + TOPA_32K,
>> + TOPA_64K,
>> + TOPA_128K,
>> + TOPA_256K,
>> + TOPA_512K,
>> + TOPA_1MB,
>> + TOPA_2MB,
>> + TOPA_4MB,
>> + TOPA_8MB,
>> + TOPA_16MB,
>> + TOPA_32MB,
>> + TOPA_64MB,
>> + TOPA_128MB,
>> + TOPA_SZ_END,
>> +};
>
> While moving this around, can we pretty please clean that up? That
> enum just pointless. None of the values is ever used and they hardly
> have any value as they are just computable.

Ok, I'll update my patches based on Alex's comments, but before that
I'll clean up intel_pt.h and perf_event_intel_pt.c.

Thanks,
Takao Indoh

>
>> +static inline unsigned int sizes(enum topa_sz tsz)
>> +{
>> + return 1 << (tsz + 12);
>
> 12?? PAGE_SHIFT perhaps?
>
>> +#define TOPA_SHIFT 12
>
> Sigh.
>
>> diff --git a/arch/x86/kernel/cpu/intel_pt_cap.c b/arch/x86/kernel/cpu/intel_pt_cap.c
>> new file mode 100644
>> index 0000000..a2cfbfc
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/intel_pt_cap.c
>> @@ -0,0 +1,69 @@
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/mm.h>
>> +#include <asm/intel_pt.h>
>> +
>> +enum cpuid_regs {
>> + CR_EAX = 0,
>> + CR_ECX,
>> + CR_EDX,
>> + CR_EBX
>> +};
>> +
>> +static u32 cpuid_cache[4 * PT_CPUID_LEAVES];
>
> 4 ? Magic constant pulled from thin air?
>
>> +static int pt_cap_initialized;
>> +
>> +#define PT_CAP(_n, _l, _r, _m) \
>> + [PT_CAP_ ## _n] = { .name = __stringify(_n), .leaf = _l, \
>> + .reg = _r, .mask = _m }
>> +
>> +static struct pt_cap_desc {
>> + const char *name;
>> + u32 leaf;
>> + u8 reg;
>> + u32 mask;
>> +} pt_caps[] = {
>> + PT_CAP(max_subleaf, 0, CR_EAX, 0xffffffff),
>> + PT_CAP(cr3_filtering, 0, CR_EBX, BIT(0)),
>> + PT_CAP(topa_output, 0, CR_ECX, BIT(0)),
>> + PT_CAP(topa_multiple_entries, 0, CR_ECX, BIT(1)),
>> + PT_CAP(payloads_lip, 0, CR_ECX, BIT(31)),
>> +};
>> +
>> +u32 pt_cap_get(enum pt_capabilities cap)
>> +{
>> + struct pt_cap_desc *cd = &pt_caps[cap];
>> + u32 c = cpuid_cache[cd->leaf * 4 + cd->reg];
>
> Ditto
>
>> + unsigned int shift = __ffs(cd->mask);
>> +
>> + return (c & cd->mask) >> shift;
>> +}
>> +
>> +const char *pt_cap_name(enum pt_capabilities cap)
>> +{
>> + return pt_caps[cap].name;
>> +}
>> +
>> +int pt_cap_num(void)
>> +{
>> + return ARRAY_SIZE(pt_caps);
>> +}
>> +
>> +void __init pt_cap_init(void)
>> +{
>> + int i;
>> +
>> + if (pt_cap_initialized)
>> + return;
>> +
>> + for (i = 0; i < PT_CPUID_LEAVES; i++) {
>> + cpuid_count(20, i,
>> + &cpuid_cache[CR_EAX + i*4],
>
> Once more.
>
> Thanks,
>
> tglx
>

2015-08-26 08:11:27

by Takao Indoh

Subject: Re: [PATCH RFC 2/3] x86: Add Intel PT logger

On 2015/07/29 15:08, Alexander Shishkin wrote:
> Takao Indoh <[email protected]> writes:
>
>> This patch provides Intel PT logging feature. When system boots with a
>> parameter "intel_pt_log", log buffers for Intel PT are allocated and
>> logging starts, then processor flow information is written in the log
>> buffer by hardware like flight recorder. This is very helpful to
>> investigate a cause of kernel panic.
>>
>> The log buffer size is specified by the parameter
>> "intel_pt_log_buf_len=<size>". This buffer is used as circular buffer,
>> therefore old events are overwritten by new events.
>
> [skip]
>
>> +static void enable_pt(int enable)
>> +{
>> + u64 ctl;
>> +
>> + rdmsrl(MSR_IA32_RTIT_CTL, ctl);
>
> Ideally, you shouldn't need this rdmsr(), because in this code you
> should know exactly which ctl bits you need set when you enable.
>
>> +
>> + if (enable)
>> + ctl |= RTIT_CTL_TRACEEN;
>> + else
>> + ctl &= ~RTIT_CTL_TRACEEN;
>> +
>> + wrmsrl(MSR_IA32_RTIT_CTL, ctl);
>> +}
>
> But the bigger problem with this approach is that it duplicates the
> existing driver's functionality and some of the code, which just makes
> it harder to maintain amoung other things.
>
> Instead, we should be able to do use the existing perf functionality to
> enable the system-wide tracing, so that it goes through the
> driver. Another thing to remember is that you'd also need some of the
> sideband data (vm mappings, context switches) to be able to properly
> decode the trace, which also can come from perf. And it'd also be much
> less code. The only missing piece is the code that would allocate the
> ring buffer for such events.

Alexander,

I checked the perf code to find out what kinds of information are
needed as sideband data. It seems that the following two events are
used:
- sched:sched_switch
- dummy (PERF_COUNT_SW_DUMMY)

So, what I need to do is add kernel counters for three events
(intel_pt, sched:sched_switch, dummy). Is my understanding correct?
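
If so, the attributes would look roughly like this, I think (just a
sketch; the intel_pt PMU type and the sched_switch tracepoint id are
placeholders that would be resolved at runtime):

/* sketch only; the zero type/config values are runtime placeholders */
static struct perf_event_attr pt_log_attr = {
	.size		= sizeof(struct perf_event_attr),
	.type		= 0,	/* dynamic type of the intel_pt PMU */
	.disabled	= 1,
};

static struct perf_event_attr sched_switch_attr = {
	.size		= sizeof(struct perf_event_attr),
	.type		= PERF_TYPE_TRACEPOINT,
	.config		= 0,	/* event id of sched:sched_switch */
	.sample_period	= 1,
	.disabled	= 1,
};

static struct perf_event_attr dummy_attr = {
	.size		= sizeof(struct perf_event_attr),
	.type		= PERF_TYPE_SOFTWARE,
	.config		= PERF_COUNT_SW_DUMMY,
	.mmap		= 1,	/* vm mappings */
	.comm		= 1,
	.task		= 1,
	.disabled	= 1,
};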

Thanks,
Takao Indoh

>
> Something like:
>
> static DEFINE_PER_CPU(struct perf_event *, perf_kdump_event);
>
> static struct perf_event_attr perf_kdump_attr;
>
> ...
>
> static int perf_kdump_init(void)
> {
> struct perf_event *event;
> int cpu;
>
> get_online_cpus();
> for_each_possible_cpu(cpu) {
> event = perf_create_kernel_counter(&perf_kdump_attr,
> cpu, NULL,
> NULL, NULL);
>
> ...
>
> ret = rb_alloc_kernel(event, perf_kdump_data_size, perf_kdump_aux_size);
>
> ...
>
> per_cpu(perf_kdump_event, cpu) = event;
> }
> put_online_cpus();
> }
>