Hi,
This patch series adds support to store trace events in pstore.
Storing trace entries in persistent RAM would help in understanding what
happened just before the system went down. The trace events that led to the
crash can be retrieved from the pstore after a warm reboot. This will help
debug what happened before the machine’s last breath. This has to be done in a
scalable way so that tracing a live system does not impact the performance
of the system.
This requires a new backend, ramtrace, which allocates pages from
persistent storage for the tracing utility. This feature can be enabled
using TRACE_EVENTS_TO_PSTORE.
In this feature, the new backend is used only as a page allocator. Once the
user chooses to use pstore to record trace entries, the existing ring buffer
pages are freed and new pages are allocated from pstore. After this switch,
the ring buffer continues to operate just as before, with little extra overhead.
Since the ring buffer uses the persistent RAM buffer directly to record
trace entries, all tracers would also persist across reboot.
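To illustrate the idea, below is a conceptual sketch only (not code from this
series; the real logic lives in kernel/trace/ring_buffer.c, patches 1 and 6)
of the one thing that changes after the switch: where a ring buffer data page
is allocated from.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/topology.h>
#include <linux/ramtrace.h>

/* Conceptual sketch: pick the backing store for one ring buffer data page. */
static void *data_page_alloc(int cpu, bool persist)
{
	struct page *page;

	if (persist)
		return ramtrace_alloc_page(cpu);	/* pstore-backed page */

	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
	return page ? page_address(page) : NULL;
}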
To test this feature, I used a simple module that calls panic() during
a write operation to a file in the tracefs directory. Before writing to the
file, the ring buffer is moved to the persistent RAM buffer from the command
line as shown below:
$ echo 1 > /sys/kernel/tracing/options/persist
Writing to the file:
$ echo 1 > /sys/kernel/tracing/crash/panic_on_write
The above write operation results in a system crash. After reboot, once
pstore is mounted, the trace entries from the previous boot are available in
the file /sys/fs/pstore/trace-ramtrace-0.
Looking through this file gives us the stack trace that led to the crash.
<...>-1 [001] .... 49.083909: __vfs_write <-vfs_write
<...>-1 [001] .... 49.083933: panic <-panic_on_write
<...>-1 [001] d... 49.084195: printk <-panic
<...>-1 [001] d... 49.084201: vprintk_func <-printk
<...>-1 [001] d... 49.084207: vprintk_default <-printk
<...>-1 [001] d... 49.084211: vprintk_emit <-printk
<...>-1 [001] d... 49.084216: __printk_safe_enter <-vprintk_emit
<...>-1 [001] d... 49.084219: _raw_spin_lock <-vprintk_emit
<...>-1 [001] d... 49.084223: vprintk_store <-vprintk_emit
A one-line description of each patch is given below:
Patch 1 adds support to allocate ring buffer pages from persistent RAM buffer.
Patch 2 introduces a new backend, ramtrace.
Patch 3 adds methods to read previous boot pages from pstore.
Patch 4 adds the functionality to allocate page-sized memory from pstore.
Patch 5 adds the seq_operation methods to iterate through trace entries.
Patch 6 modifies ring_buffer to allocate from ramtrace when pstore is used.
Patch 7 adds ramtrace DT node as child-node of /reserved-memory.
Nachammai Karuppiah (7):
tracing: Add support to allocate pages from persistent memory
pstore: Support a new backend, ramtrace
pstore: Read and iterate through trace entries in PSTORE
pstore: Allocate and free page-sized memory in persistent RAM buffer
tracing: Add support to iterate through pages retrieved from pstore
tracing: Use ramtrace alloc and free methods while using persistent
RAM
dt-bindings: ramtrace: Add ramtrace DT node
.../bindings/reserved-memory/ramtrace.txt | 13 +
drivers/of/platform.c | 1 +
fs/pstore/Makefile | 2 +
fs/pstore/inode.c | 46 +-
fs/pstore/platform.c | 1 +
fs/pstore/ramtrace.c | 821 +++++++++++++++++++++
include/linux/pstore.h | 3 +
include/linux/ramtrace.h | 28 +
include/linux/ring_buffer.h | 19 +
include/linux/trace.h | 13 +
kernel/trace/Kconfig | 10 +
kernel/trace/ring_buffer.c | 663 ++++++++++++++++-
kernel/trace/trace.c | 312 +++++++-
kernel/trace/trace.h | 5 +-
14 files changed, 1924 insertions(+), 13 deletions(-)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/ramtrace.txt
create mode 100644 fs/pstore/ramtrace.c
create mode 100644 include/linux/ramtrace.h
--
2.7.4
Add support in the ring buffer to allocate pages from the persistent RAM
buffer. This feature supports switching to persistent memory and back.
A new option, 'persist', has been added; once it is enabled, the pages in the
ring buffer are freed and new pages are allocated from persistent
memory.
Signed-off-by: Nachammai Karuppiah <[email protected]>
---
kernel/trace/Kconfig | 10 ++
kernel/trace/ring_buffer.c | 257 ++++++++++++++++++++++++++++++++++++++++++++-
kernel/trace/trace.c | 12 ++-
kernel/trace/trace.h | 3 +-
4 files changed, 279 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index a4020c0..f72a9df 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -739,6 +739,16 @@ config GCOV_PROFILE_FTRACE
Note that on a kernel compiled with this config, ftrace will
run significantly slower.
+config TRACE_EVENTS_TO_PSTORE
+ bool "Enable users to store trace records in persistent storage"
+ default n
+ help
+ This option enables users to store trace records in a
+ persistent RAM buffer so that they can be retrieved after
+ system reboot.
+
+ If unsure, say N.
+
config FTRACE_SELFTEST
bool
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f15471c..60b587a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -25,7 +25,7 @@
#include <linux/list.h>
#include <linux/cpu.h>
#include <linux/oom.h>
-
+#include <linux/ramtrace.h>
#include <asm/local.h>
static void update_pages_handler(struct work_struct *work);
@@ -479,6 +479,9 @@ struct ring_buffer_per_cpu {
struct completion update_done;
struct rb_irq_work irq_work;
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ bool use_pstore;
+#endif
};
struct trace_buffer {
@@ -513,6 +516,15 @@ struct ring_buffer_iter {
int missed_events;
};
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+/* This semaphore is being used to ensure that buffer_data_page memory
+ * is not switched to persistent storage or vice versa while a reader page
+ * is swapped out. All consuming reads need to be finished before memory
+ * switch happens.
+ */
+DECLARE_RWSEM(trace_read_sem);
+#endif
+
/**
* ring_buffer_nr_pages - get the number of buffer pages in the ring buffer
* @buffer: The ring_buffer to get the number of pages from
@@ -1705,6 +1717,247 @@ static void update_pages_handler(struct work_struct *work)
complete(&cpu_buffer->update_done);
}
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+static void free_buffer_data_page(struct buffer_data_page *page, int cpu,
+ bool persist)
+{
+ if (persist)
+ ramtrace_free_page(page, cpu);
+ else
+ free_page((unsigned long)page);
+
+}
+
+static int rb_allocate_persistent_pages(struct buffer_data_page **pages,
+ int nr_pages, int cpu)
+{
+ int i;
+
+ for (i = 0; i < nr_pages; i++) {
+ void *address = ramtrace_alloc_page(cpu);
+
+ if (!address)
+ goto free_pages;
+ pages[i] = address;
+ }
+ return 0;
+
+free_pages:
+ for (i = 0; i < nr_pages; i++)
+ ramtrace_free_page(pages[i], cpu);
+
+ return -ENOMEM;
+}
+
+static int
+rb_allocate_buffer_data_pages(struct buffer_data_page **pages, int nr_pages,
+ int cpu)
+{
+ bool user_thread = current->mm != NULL;
+ gfp_t mflags;
+ long i;
+
+ /*
+ * Check if the available memory is there first.
+ * Note, si_mem_available() only gives us a rough estimate of available
+ * memory. It may not be accurate. But we don't care, we just want
+ * to prevent doing any allocation when it is obvious that it is
+ * not going to succeed.
+ */
+ i = si_mem_available();
+ if (i < nr_pages)
+ return -ENOMEM;
+
+ /*
+ * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
+ * gracefully without invoking oom-killer and the system is not
+ * destabilized.
+ */
+ mflags = GFP_KERNEL | __GFP_RETRY_MAYFAIL;
+
+ /*
+ * If a user thread allocates too much, and si_mem_available()
+ * reports there's enough memory, even though there is not.
+ * Make sure the OOM killer kills this thread. This can happen
+ * even with RETRY_MAYFAIL because another task may be doing
+ * an allocation after this task has taken all memory.
+ * This is the task the OOM killer needs to take out during this
+ * loop, even if it was triggered by an allocation somewhere else.
+ */
+ if (user_thread)
+ set_current_oom_origin();
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page;
+
+ page = alloc_pages_node(cpu_to_node(cpu), mflags, 0);
+ if (!page)
+ goto free_pages;
+ pages[i] = page_address(page);
+ rb_init_page(pages[i]);
+
+ if (user_thread && fatal_signal_pending(current))
+ goto free_pages;
+ }
+
+ if (user_thread)
+ clear_current_oom_origin();
+ return 0;
+free_pages:
+ for (i = 0; i < nr_pages; i++)
+ free_page((unsigned long)pages[i]);
+
+ return -ENOMEM;
+}
+
+static void rb_switch_memory(struct trace_buffer *buffer, bool persist)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+ struct list_head *head;
+ struct buffer_page *bpage;
+ struct buffer_data_page ***new_pages;
+ unsigned long flags;
+ int cpu, nr_pages;
+
+ new_pages = kmalloc_array(buffer->cpus, sizeof(void *), GFP_KERNEL);
+
+ for_each_buffer_cpu(buffer, cpu) {
+ cpu_buffer = buffer->buffers[cpu];
+ nr_pages = cpu_buffer->nr_pages;
+ /* Include the reader page */
+ new_pages[cpu] = kmalloc_array(nr_pages + 1, sizeof(void *), GFP_KERNEL);
+ if (persist) {
+ if (rb_allocate_persistent_pages(new_pages[cpu],
+ nr_pages + 1, cpu) < 0)
+ goto out;
+ } else {
+ if (rb_allocate_buffer_data_pages(new_pages[cpu],
+ nr_pages + 1, cpu) < 0)
+ goto out;
+ }
+ }
+
+ for_each_buffer_cpu(buffer, cpu) {
+ int i = 0;
+
+ cpu_buffer = buffer->buffers[cpu];
+ nr_pages = cpu_buffer->nr_pages;
+ /* Acquire the reader lock to ensure reading is disabled. */
+ raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+
+ if (RB_WARN_ON(cpu_buffer, local_read(&cpu_buffer->committing)))
+ goto out;
+ /* Prevent another thread from grabbing free_page. */
+ arch_spin_lock(&cpu_buffer->lock);
+
+ free_buffer_data_page(cpu_buffer->reader_page->page,
+ cpu, cpu_buffer->use_pstore);
+ cpu_buffer->reader_page->page = new_pages[cpu][i++];
+ rb_head_page_deactivate(cpu_buffer);
+
+ head = cpu_buffer->pages;
+ if (head) {
+ list_for_each_entry(bpage, head, list) {
+ free_buffer_data_page(bpage->page, cpu,
+ cpu_buffer->use_pstore);
+ bpage->page = new_pages[cpu][i++];
+ rb_init_page(bpage->page);
+ }
+ bpage = list_entry(head, struct buffer_page, list);
+ free_buffer_data_page(bpage->page, cpu,
+ cpu_buffer->use_pstore);
+ bpage->page = new_pages[cpu][nr_pages];
+ rb_init_page(bpage->page);
+ }
+ kfree(new_pages[cpu]);
+
+ if (cpu_buffer->free_page) {
+ free_buffer_data_page(cpu_buffer->free_page, cpu,
+ cpu_buffer->use_pstore);
+ cpu_buffer->free_page = NULL;
+ }
+
+ cpu_buffer->use_pstore = persist;
+
+ rb_reset_cpu(cpu_buffer);
+ arch_spin_unlock(&cpu_buffer->lock);
+ raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+ }
+
+ kfree(new_pages);
+ return;
+out:
+ for_each_buffer_cpu(buffer, cpu) {
+ int i = 0;
+
+ cpu_buffer = buffer->buffers[cpu];
+ for (i = 0; i < cpu_buffer->nr_pages + 1; i++) {
+ if (new_pages[cpu][i])
+ free_buffer_data_page(new_pages[cpu][i], cpu,
+ persist);
+ }
+ kfree(new_pages[cpu]);
+ }
+ kfree(new_pages);
+}
+
+void pstore_tracing_off(void);
+
+/**
+ * ring_buffer_switch_memory - switch the buffer's data pages to persistent
+ * memory if 'persist' is true, or back to non-persistent memory if false.
+ */
+int
+ring_buffer_switch_memory(struct trace_buffer *buffer, const char *tracer_name,
+ int clock_id, bool persist)
+{
+ int cpu;
+ int online_cpu = 0;
+ int nr_pages_total = 0;
+
+ if (RB_WARN_ON(buffer, !down_write_trylock(&trace_read_sem)))
+ return -EBUSY;
+
+ if (persist) {
+ /* Quit if there is no reserved ramtrace region available */
+ if (!is_ramtrace_available())
+ return -ENOMEM;
+
+ /* Disable pstore_trace buffers which are used for reading
+ * previous boot data pages.
+ */
+ pstore_tracing_off();
+
+ /* Estimate the number of pages needed. */
+ for_each_buffer_cpu(buffer, cpu) {
+ online_cpu++;
+ /* count the reader page as well */
+ nr_pages_total += buffer->buffers[cpu]->nr_pages + 1;
+ }
+ /* Initialize ramtrace pages */
+ if (init_ramtrace_pages(online_cpu, nr_pages_total, tracer_name, clock_id))
+ return -ENOMEM;
+ }
+
+
+ ring_buffer_record_disable(buffer);
+
+ /* Make sure all pending commits have finished */
+ synchronize_rcu();
+
+ /* prevent another thread from changing buffer sizes */
+ mutex_lock(&buffer->mutex);
+
+ rb_switch_memory(buffer, persist);
+
+ mutex_unlock(&buffer->mutex);
+
+ ring_buffer_record_enable(buffer);
+ up_write(&trace_read_sem);
+ return 0;
+
+}
+#endif
+
/**
* ring_buffer_resize - resize the ring buffer
* @buffer: the buffer to resize.
@@ -4716,6 +4969,7 @@ void *ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
out:
rb_init_page(bpage);
+ down_read(&trace_read_sem);
return bpage;
}
@@ -4753,6 +5007,7 @@ void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu, void *data
out:
free_page((unsigned long)bpage);
+ up_read(&trace_read_sem);
}
EXPORT_SYMBOL_GPL(ring_buffer_free_read_page);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bb62269..2b3d8e9 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -48,6 +48,7 @@
#include <linux/fsnotify.h>
#include <linux/irq_work.h>
#include <linux/workqueue.h>
+#include <linux/ramtrace.h>
#include "trace.h"
#include "trace_output.h"
@@ -265,7 +266,8 @@ unsigned long long ns2usecs(u64 nsec)
/* trace_flags that are default zero for instances */
#define ZEROED_TRACE_FLAGS \
- (TRACE_ITER_EVENT_FORK | TRACE_ITER_FUNC_FORK)
+ (TRACE_ITER_EVENT_FORK | TRACE_ITER_FUNC_FORK | \
+ TRACE_ITER_PERSIST)
/*
* The global_trace is the descriptor that holds the top-level tracing
@@ -4851,6 +4853,14 @@ int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
trace_printk_control(enabled);
}
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ if (mask == TRACE_ITER_PERSIST) {
+ ring_buffer_switch_memory(tr->array_buffer.buffer,
+ tr->current_trace->name,
+ tr->clock_id, enabled);
+ }
+#endif
+
return 0;
}
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 13db400..2a4ab72 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1336,7 +1336,8 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
FUNCTION_FLAGS \
FGRAPH_FLAGS \
STACK_FLAGS \
- BRANCH_FLAGS
+ BRANCH_FLAGS \
+ C(PERSIST, "persist"),
/*
* By defining C, we can make TRACE_FLAGS a list of bit names
--
2.7.4
Add a new trace_array, pstore_trace. This descriptor holds the
top-level buffers used for managing the pages retrieved from
persistent RAM. Since pstore_trace uses pages that pertain to the
previous boot, no writes ever happen to these buffers. The
reads are non-consuming, so the readers do not need to be
serialized.
The buffers in pstore_trace are disabled once the user switches live
tracing over to the persistent RAM buffer.
During the first seq_start call to read the previous boot
trace entries, the top-level buffers of pstore_trace are set up.
The pages retrieved from pstore are used to construct
cpu_buffer->pages for pstore_trace.
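For orientation, these seq methods are meant to be wired into a
seq_operations table on the pstore side. A minimal sketch of that wiring is
shown below; the table name is made up here and the actual registration lives
in fs/pstore/inode.c (patch 3).

#include <linux/seq_file.h>
#include <linux/trace.h>

/* Hypothetical table name; the real hookup is in fs/pstore/inode.c. */
static const struct seq_operations pstore_trace_seq_ops = {
	.start = pstore_trace_start,
	.next  = pstore_trace_next,
	.show  = pstore_trace_show,
	.stop  = pstore_trace_stop,
};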
Signed-off-by: Nachammai Karuppiah <[email protected]>
---
include/linux/ring_buffer.h | 19 +++
include/linux/trace.h | 13 ++
kernel/trace/ring_buffer.c | 284 +++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 300 +++++++++++++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 2 +
5 files changed, 616 insertions(+), 2 deletions(-)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index c76b2f3..ece71c9 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -18,6 +18,13 @@ struct ring_buffer_event {
u32 array[];
};
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+struct data_page {
+ struct list_head list;
+ struct buffer_data_page *page;
+};
+#endif
+
/**
* enum ring_buffer_type - internal ring buffer types
*
@@ -210,4 +217,16 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
#define trace_rb_cpu_prepare NULL
#endif
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+struct trace_buffer *reconstruct_ring_buffer(void);
+
+void ring_buffer_order_pages(struct list_head *pages);
+int ring_buffer_switch_memory(struct trace_buffer *buffer,
+ const char *tracer_name, int clock_id,
+ bool persist);
+void ring_buffer_set_tracer_name(struct trace_buffer *buffer,
+ const char *tracer_name);
+void ring_buffer_free_pstore_trace(struct trace_buffer *buffer);
+#endif
+
#endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/linux/trace.h b/include/linux/trace.h
index 7fd86d3..8f37b70 100644
--- a/include/linux/trace.h
+++ b/include/linux/trace.h
@@ -32,6 +32,19 @@ int trace_array_printk(struct trace_array *tr, unsigned long ip,
void trace_array_put(struct trace_array *tr);
struct trace_array *trace_array_get_by_name(const char *name);
int trace_array_destroy(struct trace_array *tr);
+
+
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+struct trace_iterator;
+
+void *pstore_trace_start(struct seq_file *m, loff_t *pos);
+void *pstore_trace_next(struct seq_file *m, void *v, loff_t *pos);
+int pstore_trace_show(struct seq_file *m, void *v);
+void pstore_trace_stop(struct seq_file *m, void *v);
+int pstore_tracing_release(struct trace_iterator *iter);
+void pstore_tracing_erase(void);
+#endif
+
#endif /* CONFIG_TRACING */
#endif /* _LINUX_TRACE_H */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 60b587a..34e50c1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1296,6 +1296,92 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
return 0;
}
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+static int rb_reconstruct_pages(struct ring_buffer_per_cpu *cpu_buffer,
+ struct list_head *dpages, int cpu)
+{
+ struct buffer_page *bpage, *tmp;
+ struct data_page *dpage;
+ LIST_HEAD(pages);
+
+ list_for_each_entry(dpage, dpages, list) {
+ bpage = kzalloc(ALIGN(sizeof(*bpage), cache_line_size()),
+ GFP_KERNEL);
+ if (!bpage)
+ goto free_pages;
+
+ list_add_tail(&bpage->list, &pages);
+ bpage->page = dpage->page;
+ }
+
+ if (!list_empty(&pages)) {
+ cpu_buffer->pages = pages.next;
+ list_del(&pages);
+ } else
+ cpu_buffer->pages = NULL;
+
+ return 0;
+
+free_pages:
+ list_for_each_entry_safe(bpage, tmp, &pages, list) {
+ list_del(&bpage->list);
+ kfree(bpage);
+ }
+ return -ENOMEM;
+}
+
+static struct ring_buffer_per_cpu *
+__reconstruct_cpu_buffer(struct trace_buffer *rb, struct list_head *dpages,
+ void *page, int cpu)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+ struct buffer_page *bpage;
+ struct data_page *dpage;
+
+ cpu_buffer = kzalloc(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
+ GFP_KERNEL);
+ if (!cpu_buffer)
+ return NULL;
+
+ cpu_buffer->buffer = rb;
+ raw_spin_lock_init(&cpu_buffer->reader_lock);
+ lockdep_set_class(&cpu_buffer->reader_lock, rb->reader_lock_key);
+ cpu_buffer->lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
+
+ bpage = kzalloc(ALIGN(sizeof(*bpage), cache_line_size()),
+ GFP_KERNEL);
+ if (!bpage)
+ goto fail_free_buffer;
+
+ bpage->page = page;
+
+ rb_check_bpage(cpu_buffer, bpage);
+ cpu_buffer->reader_page = bpage;
+ INIT_LIST_HEAD(&cpu_buffer->reader_page->list);
+ INIT_LIST_HEAD(&cpu_buffer->new_pages);
+
+ if (rb_reconstruct_pages(cpu_buffer, dpages, cpu) < 0)
+ goto fail_free_reader;
+ INIT_LIST_HEAD(&cpu_buffer->reader_page->list);
+
+ cpu_buffer->head_page = list_entry(cpu_buffer->pages,
+ struct buffer_page, list);
+ cpu_buffer->commit_page = list_entry(cpu_buffer->pages->prev,
+ struct buffer_page, list);
+
+ rb_head_page_activate(cpu_buffer);
+
+ return cpu_buffer;
+
+fail_free_reader:
+ free_buffer_page(cpu_buffer->reader_page);
+
+fail_free_buffer:
+ kfree(cpu_buffer);
+ return NULL;
+}
+#endif /* CONFIG_TRACE_EVENTS_TO_PSTORE */
+
static struct ring_buffer_per_cpu *
rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
{
@@ -1378,6 +1464,81 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
kfree(cpu_buffer);
}
+
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+/**
+ * reconstruct_ring_buffer - reconstruct ring_buffer for pstore trace
+ */
+struct trace_buffer *reconstruct_ring_buffer(void)
+{
+ struct trace_buffer *buffer;
+ static struct lock_class_key __key;
+ void *page;
+ int bsize;
+ int i, cpu;
+
+ buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
+ GFP_KERNEL);
+ if (!buffer)
+ return NULL;
+
+ if (!zalloc_cpumask_var(&buffer->cpumask, GFP_KERNEL))
+ goto release_buffer;
+ buffer->cpus = ramtrace_get_prev_boot_nr_cpus();
+
+ buffer->reader_lock_key = &__key;
+
+ bsize = sizeof(void *) * buffer->cpus;
+ buffer->buffers = kzalloc(ALIGN(bsize, cache_line_size()),
+ GFP_KERNEL);
+ if (!buffer->buffers)
+ goto release_cpumask_var;
+
+ /*
+ * Allocate an empty reader page. This page doesn't contain any data
+ * and is set as the reader page. The same reader page is used for all
+ * CPUs. Since this page is guaranteed to always remain empty,
+ * all CPUs can use the same page. The pages retrieved from PSTORE are
+ * used to populate cpu_buffer->pages list.
+ */
+ page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+ if (!page)
+ goto release_buffers;
+ page = page_address(page);
+ rb_init_page(page);
+ for (i = 0; i < buffer->cpus; i++) {
+ struct list_head *dpages = ramtrace_get_read_buffer(i);
+
+ if (dpages) {
+ buffer->buffers[i] = __reconstruct_cpu_buffer(buffer,
+ dpages, page, i);
+ if (!buffer->buffers[i])
+ goto release_reader_page;
+ cpumask_set_cpu(i, buffer->cpumask);
+ }
+
+ }
+ if (cpumask_empty(buffer->cpumask))
+ goto release_reader_page;
+
+ return buffer;
+
+release_reader_page:
+ free_page((unsigned long)page);
+release_buffers:
+ for_each_buffer_cpu(buffer, cpu) {
+ if (buffer->buffers[cpu])
+ rb_free_cpu_buffer(buffer->buffers[cpu]);
+ }
+ kfree(buffer->buffers);
+release_cpumask_var:
+ free_cpumask_var(buffer->cpumask);
+release_buffer:
+ kfree(buffer);
+ return NULL;
+}
+#endif /* CONFIG_TRACE_EVENTS_TO_PSTORE */
+
/**
* __ring_buffer_alloc - allocate a new ring_buffer
* @size: the size in bytes per cpu that is needed.
@@ -1478,12 +1639,75 @@ ring_buffer_free(struct trace_buffer *buffer)
}
EXPORT_SYMBOL_GPL(ring_buffer_free);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+static void
+rb_free_pstore_trace_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+ struct list_head *head = cpu_buffer->pages;
+ struct buffer_page *bpage, *tmp;
+
+ kfree(cpu_buffer->reader_page);
+
+ if (head) {
+ rb_head_page_deactivate(cpu_buffer);
+ list_for_each_entry_safe(bpage, tmp, head, list) {
+ list_del_init(&bpage->list);
+ kfree(bpage);
+ }
+
+ bpage = list_entry(head, struct buffer_page, list);
+ kfree(bpage);
+ }
+ kfree(cpu_buffer);
+}
+
+/**
+ * ring_buffer_free_pstore_trace - free pstore_trace buffers.
+ *
+ * Free top-level buffers and buffer_page pertaining to previous boot trace
+ * provided by pstore_trace descriptor.
+ */
+void ring_buffer_free_pstore_trace(struct trace_buffer *buffer)
+{
+ int cpu;
+ void *page = NULL;
+
+ for_each_buffer_cpu(buffer, cpu) {
+ if (!page) {
+ page = buffer->buffers[cpu]->reader_page->page;
+ printk("reader page %px\n", page);
+ free_page((unsigned long)page);
+ }
+ rb_free_pstore_trace_cpu_buffer(buffer->buffers[cpu]);
+ }
+ kfree(buffer->buffers);
+ free_cpumask_var(buffer->cpumask);
+
+ kfree(buffer);
+}
+#endif
+
void ring_buffer_set_clock(struct trace_buffer *buffer,
u64 (*clock)(void))
{
buffer->clock = clock;
}
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+void ring_buffer_set_tracer_name(struct trace_buffer *buffer,
+ const char *tracer_name)
+{
+ int cpu;
+
+ get_online_cpus();
+ for_each_buffer_cpu(buffer, cpu)
+ if (buffer->buffers[cpu]->use_pstore && cpu_online(cpu)) {
+ ramtrace_set_tracer_name(tracer_name);
+ break;
+ }
+}
+#endif
+
void ring_buffer_set_time_stamp_abs(struct trace_buffer *buffer, bool abs)
{
buffer->time_stamp_abs = abs;
@@ -5251,6 +5475,66 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
return 0;
}
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+void ring_buffer_order_pages(struct list_head *pages)
+{
+ struct data_page *temp, *data_page, *min_page;
+ u64 min_ts = 0;
+ u64 prev_ts;
+ int count = 0;
+
+ min_page = NULL;
+
+ /* Find the oldest page and move the list head before it.
+ * While starting from the oldest page, the list should mostly be in
+ * order except for a few out-of-order pages, as long as the buffer had
+ * not been repeatedly expanded and shrunk.
+ */
+ list_for_each_entry_safe(data_page, temp, pages, list) {
+ u64 ts = data_page->page->time_stamp;
+
+ if (ts == 0) {
+ list_del(&data_page->list);
+ kfree(data_page);
+ } else {
+ count++;
+ if (ts < min_ts || min_ts == 0) {
+ min_ts = ts;
+ min_page = data_page;
+ }
+ }
+ }
+
+ if (min_ts) {
+ /* move the list head before the oldest page */
+ list_move_tail(pages, &min_page->list);
+ prev_ts = min_ts;
+ data_page = min_page;
+ list_for_each_entry(data_page, pages, list) {
+ u64 ts = data_page->page->time_stamp;
+
+ if (ts >= prev_ts)
+ prev_ts = ts;
+ else {
+ struct data_page *node, *swap_page;
+
+ /* Move out of order page to the right place */
+ list_for_each_entry(node, pages, list) {
+ if (node->page->time_stamp > ts) {
+ swap_page = data_page;
+ data_page = list_entry(data_page->list.prev, struct data_page, list);
+ list_del(&swap_page->list);
+ list_add_tail(&swap_page->list, &node->list);
+ break;
+ }
+ }
+ }
+ }
+ }
+
+}
+#endif /* CONFIG_TRACE_EVENTS_TO_PSTORE */
+
#ifdef CONFIG_RING_BUFFER_STARTUP_TEST
/*
* This is a basic integrity check of the ring buffer.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2b3d8e9..16e50ba8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -277,6 +277,16 @@ static struct trace_array global_trace = {
.trace_flags = TRACE_DEFAULT_FLAGS,
};
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+/*
+ * The pstore_trace is the descriptor that holds the top-level tracing
+ * buffers for the pages retrieved from persistent storage.
+ */
+static struct trace_array pstore_trace = {
+ .trace_flags = TRACE_DEFAULT_FLAGS,
+};
+#endif
+
LIST_HEAD(ftrace_trace_arrays);
int trace_array_get(struct trace_array *this_tr)
@@ -650,6 +660,26 @@ int tracing_is_enabled(void)
return !global_trace.buffer_disabled;
}
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+/**
+ * pstore_tracing_is_enabled - Report whether pstore_trace is enabled
+ *
+ * This is similar to tracing_is_enabled() but checks pstore_trace instead.
+ * pstore_trace holds the tracing buffers for the pages pertaining to previous
+ * boot retrieved from pstore.
+ */
+int pstore_tracing_is_enabled(void)
+{
+ /*
+ * For quick access (irqsoff uses this in fast path), just
+ * return the mirror variable of the state of the ring buffer.
+ * It's a little racy, but we don't really care.
+ */
+ smp_rmb();
+ return !pstore_trace.buffer_disabled;
+}
+#endif
+
/*
* trace_buf_size is the size in bytes that is allocated
* for a buffer. Note, the number of bytes is always rounded
@@ -1299,6 +1329,21 @@ void tracing_off(void)
}
EXPORT_SYMBOL_GPL(tracing_off);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+
+/**
+ * pstore_tracing_off - turn off tracing buffers.
+ *
+ * In the case of pstore_trace, turning off the tracing buffers stops readers from
+ * retrieving any more data. This is needed once the global_trace tries to
+ * use pstore memory.
+ */
+void pstore_tracing_off(void)
+{
+ tracer_tracing_off(&pstore_trace);
+}
+#endif
+
void disable_trace_on_warning(void)
{
if (__disable_trace_on_warning) {
@@ -5826,7 +5871,7 @@ static void tracing_set_nop(struct trace_array *tr)
{
if (tr->current_trace == &nop_trace)
return;
-
+
tr->current_trace->enabled--;
if (tr->current_trace->reset)
@@ -5945,6 +5990,9 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf)
tr->current_trace = t;
tr->current_trace->enabled++;
trace_branch_enable(tr);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ ring_buffer_set_tracer_name(tr->array_buffer.buffer, tr->current_trace->name);
+#endif
out:
mutex_unlock(&trace_types_lock);
@@ -7056,9 +7104,257 @@ static int snapshot_raw_open(struct inode *inode, struct file *filp)
return ret;
}
-
#endif /* CONFIG_TRACER_SNAPSHOT */
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+/*
+ * pstore_trace_set_up - set up pstore_trace descriptor.
+ *
+ * This is called from the first seq_start call to set up pstore_trace for
+ * the very first read operation. The pages from pstore are read and the
+ * ring buffer is reconstructed.
+ */
+static struct trace_array *pstore_trace_set_up(void)
+{
+ struct trace_array *p_tr = &pstore_trace;
+ struct tracer *t;
+ char *tracer_name;
+
+ /*
+ * Create the top level buffers during first seq_start call.
+ * Use the previously created one in the subsequent calls.
+ */
+ if (p_tr->array_buffer.buffer)
+ return p_tr;
+
+ tracer_name = ramtrace_get_prev_boot_tracer_name();
+ mutex_lock(&trace_types_lock);
+ for (t = trace_types; t; t = t->next) {
+ if (strcmp(t->name, tracer_name) == 0)
+ break;
+ }
+ mutex_unlock(&trace_types_lock);
+ if (!t)
+ goto release_tr_info;
+ p_tr->current_trace = t;
+
+ p_tr->clock_id = ramtrace_get_prev_boot_clock_id();
+
+ p_tr->array_buffer.tr = p_tr;
+
+ p_tr->array_buffer.buffer = reconstruct_ring_buffer();
+ if (!p_tr->array_buffer.buffer)
+ goto release_tr_info;
+
+ raw_spin_lock_init(&p_tr->start_lock);
+
+ INIT_LIST_HEAD(&p_tr->systems);
+ INIT_LIST_HEAD(&p_tr->events);
+ INIT_LIST_HEAD(&p_tr->hist_vars);
+ INIT_LIST_HEAD(&p_tr->err_log);
+
+ ftrace_init_trace_array(p_tr);
+ list_add(&p_tr->list, &ftrace_trace_arrays);
+
+ return p_tr;
+
+release_tr_info:
+ return NULL;
+}
+
+static struct trace_iterator *pstore_iter_setup(void)
+{
+ struct trace_array *p_tr;
+ struct trace_iterator *iter;
+ int cpu, cpus;
+
+ p_tr = pstore_trace_set_up();
+ if (!p_tr)
+ return NULL;
+
+ iter = kzalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ goto out;
+
+ iter->buffer_iter = kcalloc(nr_cpu_ids, sizeof(*iter->buffer_iter),
+ GFP_KERNEL);
+ if (!iter->buffer_iter)
+ goto release;
+
+ memset(iter->buffer_iter, 0, nr_cpu_ids * sizeof(*iter->buffer_iter));
+ iter->trace = p_tr->current_trace;
+ iter->trace->use_max_tr = false;
+
+ if (!zalloc_cpumask_var(&iter->started, GFP_KERNEL))
+ goto fail;
+
+ iter->tr = p_tr;
+ iter->array_buffer = &p_tr->array_buffer;
+
+ iter->snapshot = true;
+
+ iter->pos = -1;
+ iter->cpu_file = RING_BUFFER_ALL_CPUS;
+ mutex_init(&iter->mutex);
+
+ if (iter->trace && iter->trace->open)
+ iter->trace->open(iter);
+
+ if (trace_clocks[p_tr->clock_id].in_ns)
+ iter->iter_flags |= TRACE_FILE_TIME_IN_NS;
+
+ cpus = ramtrace_get_prev_boot_nr_cpus();
+ for (cpu = 0; cpu < cpus; cpu++) {
+ iter->buffer_iter[cpu] =
+ ring_buffer_read_prepare(iter->array_buffer->buffer, cpu,
+ GFP_KERNEL);
+ }
+ ring_buffer_read_prepare_sync();
+ for (cpu = 0; cpu < cpus; cpu++)
+ ring_buffer_read_start(iter->buffer_iter[cpu]);
+
+ return iter;
+
+fail:
+ kfree(iter->buffer_iter);
+release:
+ kfree(iter);
+out:
+ return ERR_PTR(-ENOMEM);
+}
+
+void *pstore_trace_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct trace_iterator *iter = m->private;
+ int i = (int) *pos;
+ void *ent;
+
+ WARN_ON_ONCE(iter->leftover);
+
+ (*pos)++;
+
+ /* can't go backwards */
+ if (iter->idx > i)
+ return NULL;
+
+ if (iter->idx < 0)
+ ent = trace_find_next_entry_inc(iter);
+ else
+ ent = iter;
+
+ while (ent && iter->idx < i)
+ ent = trace_find_next_entry_inc(iter);
+
+ iter->pos = *pos;
+
+ if (ent == NULL)
+ return NULL;
+ return iter;
+}
+/*
+ * Below are the seq_operation methods used to read the previous boot
+ * data pages from pstore. In this case, there is no producer and no
+ * consuming read. So we do not have to serialize readers.
+ */
+void *pstore_trace_start(struct seq_file *m, loff_t *pos)
+{
+ struct trace_iterator *iter = m->private;
+ void *p = NULL;
+ loff_t l = 0;
+
+ /*
+ * pstore_trace is disabled once the user starts utilizing the
+ * ramtrace pstore region to write the trace records.
+ */
+ if (!pstore_tracing_is_enabled())
+ return NULL;
+ if (iter == NULL) {
+ iter = pstore_iter_setup();
+ if (!iter)
+ return NULL;
+ m->private = iter;
+ }
+
+
+ if (*pos != iter->pos) {
+ iter->ent = NULL;
+ iter->cpu = 0;
+ iter->idx = -1;
+ iter->leftover = 0;
+ for (p = iter; p && l < *pos; p = pstore_trace_next(m, p, &l))
+ ;
+ } else {
+ if (!iter->leftover) {
+ l = *pos - 1;
+ p = pstore_trace_next(m, iter, &l);
+ } else
+ p = iter;
+ }
+
+ return p;
+}
+
+int pstore_trace_show(struct seq_file *m, void *v)
+{
+ struct trace_iterator *iter = v;
+ int ret;
+
+ if (iter->ent == NULL) {
+ if (iter->tr) {
+ seq_printf(m, "# tracer: %s\n", iter->trace->name);
+ seq_puts(m, "#\n");
+ }
+ } else if (iter->leftover) {
+ ret = trace_print_seq(m, &iter->seq);
+ iter->leftover = ret;
+
+ } else {
+ print_trace_line(iter);
+ ret = trace_print_seq(m, &iter->seq);
+ iter->leftover = ret;
+ }
+ return 0;
+}
+
+void pstore_trace_stop(struct seq_file *m, void *v)
+{
+}
+
+int pstore_tracing_release(struct trace_iterator *iter)
+{
+ int cpu;
+
+ if (!iter)
+ return 0;
+ mutex_lock(&trace_types_lock);
+ for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+ if (iter->buffer_iter[cpu])
+ ring_buffer_read_finish(iter->buffer_iter[cpu]);
+
+ if (iter->trace && iter->trace->close)
+ iter->trace->close(iter);
+
+ mutex_unlock(&trace_types_lock);
+ mutex_destroy(&iter->mutex);
+ free_cpumask_var(iter->started);
+ kfree(iter->buffer_iter);
+ kfree(iter);
+
+ return 0;
+}
+
+void pstore_tracing_erase(void)
+{
+ struct trace_array *trace = &pstore_trace;
+
+ if (!trace->array_buffer.buffer)
+ return;
+ ring_buffer_free_pstore_trace(trace->array_buffer.buffer);
+ trace->array_buffer.buffer = NULL;
+
+}
+#endif /* CONFIG_TRACE_EVENTS_TO_PSTORE */
+
static const struct file_operations tracing_thresh_fops = {
.open = tracing_open_generic,
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2a4ab72..66670f8 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -2078,4 +2078,6 @@ static __always_inline void trace_iterator_reset(struct trace_iterator *iter)
iter->pos = -1;
}
+void pstore_tracing_off(void);
+
#endif /* _LINUX_KERNEL_TRACE_H */
--
2.7.4
If persistent RAM is being used to record trace entries, allocate and
free pages using ramtrace_alloc_page and ramtrace_free_page.
Signed-off-by: Nachammai Karuppiah <[email protected]>
---
kernel/trace/ring_buffer.c | 122 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 119 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 34e50c1..c99719e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -353,6 +353,18 @@ static void free_buffer_page(struct buffer_page *bpage)
kfree(bpage);
}
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+static void
+free_buffer_page_cpu(struct buffer_page *bpage, int cpu, bool use_pstore)
+{
+ if (use_pstore) {
+ ramtrace_free_page(bpage->page, cpu);
+ kfree(bpage);
+ } else
+ free_buffer_page(bpage);
+}
+#endif
+
/*
* We need to fit the time_stamp delta into 27 bits.
*/
@@ -1200,7 +1212,12 @@ static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
return 0;
}
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu,
+ bool use_pstore)
+#else
static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
+#endif
{
struct buffer_page *bpage, *tmp;
bool user_thread = current->mm != NULL;
@@ -1214,6 +1231,11 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
* to prevent doing any allocation when it is obvious that it is
* not going to succeed.
*/
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ if (use_pstore)
+ i = ramtrace_available_mem();
+ else
+#endif
i = si_mem_available();
if (i < nr_pages)
return -ENOMEM;
@@ -1246,10 +1268,22 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
list_add(&bpage->list, pages);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ if (use_pstore) {
+ void *address = ramtrace_alloc_page(cpu);
+
+ if (!address)
+ goto free_pages;
+ bpage->page = address;
+ } else {
+#endif
page = alloc_pages_node(cpu_to_node(cpu), mflags, 0);
if (!page)
goto free_pages;
bpage->page = page_address(page);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ }
+#endif
rb_init_page(bpage->page);
if (user_thread && fatal_signal_pending(current))
@@ -1263,7 +1297,11 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
free_pages:
list_for_each_entry_safe(bpage, tmp, pages, list) {
list_del_init(&bpage->list);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ free_buffer_page_cpu(bpage, cpu, use_pstore);
+#else
free_buffer_page(bpage);
+#endif
}
if (user_thread)
clear_current_oom_origin();
@@ -1278,7 +1316,12 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
WARN_ON(!nr_pages);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ if (__rb_allocate_pages(nr_pages, &pages, cpu_buffer->cpu,
+ cpu_buffer->use_pstore))
+#else
if (__rb_allocate_pages(nr_pages, &pages, cpu_buffer->cpu))
+#endif
return -ENOMEM;
/*
@@ -1414,10 +1457,23 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
rb_check_bpage(cpu_buffer, bpage);
cpu_buffer->reader_page = bpage;
+
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ if (cpu_buffer->use_pstore) {
+ void *address = ramtrace_alloc_page(cpu);
+
+ if (!address)
+ goto fail_free_reader;
+ bpage->page = address;
+ } else {
+#endif
page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ }
+#endif
rb_init_page(bpage->page);
INIT_LIST_HEAD(&cpu_buffer->reader_page->list);
@@ -1436,7 +1492,12 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
return cpu_buffer;
fail_free_reader:
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ free_buffer_page_cpu(cpu_buffer->reader_page, cpu,
+ cpu_buffer->use_pstore);
+#else
free_buffer_page(cpu_buffer->reader_page);
+#endif
fail_free_buffer:
kfree(cpu_buffer);
@@ -1447,18 +1508,32 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
{
struct list_head *head = cpu_buffer->pages;
struct buffer_page *bpage, *tmp;
-
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ free_buffer_page_cpu(cpu_buffer->reader_page, cpu_buffer->cpu,
+ cpu_buffer->use_pstore);
+#else
free_buffer_page(cpu_buffer->reader_page);
+#endif
rb_head_page_deactivate(cpu_buffer);
if (head) {
list_for_each_entry_safe(bpage, tmp, head, list) {
list_del_init(&bpage->list);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ free_buffer_page_cpu(bpage, cpu_buffer->cpu,
+ cpu_buffer->use_pstore);
+#else
free_buffer_page(bpage);
+#endif
}
bpage = list_entry(head, struct buffer_page, list);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ free_buffer_page_cpu(bpage, cpu_buffer->cpu,
+ cpu_buffer->use_pstore);
+#else
free_buffer_page(bpage);
+#endif
}
kfree(cpu_buffer);
@@ -1832,7 +1907,12 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned long nr_pages)
* We have already removed references to this list item, just
* free up the buffer_page and its page
*/
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ free_buffer_page_cpu(to_remove_page, cpu_buffer->cpu,
+ cpu_buffer->use_pstore);
+#else
free_buffer_page(to_remove_page);
+#endif
nr_removed--;
} while (to_remove_page != last_page);
@@ -1913,7 +1993,12 @@ rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer)
list_for_each_entry_safe(bpage, tmp, &cpu_buffer->new_pages,
list) {
list_del_init(&bpage->list);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ free_buffer_page_cpu(bpage, cpu_buffer->cpu,
+ cpu_buffer->use_pstore);
+#else
free_buffer_page(bpage);
+#endif
}
}
return success;
@@ -2252,8 +2337,14 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
* allocated without receiving ENOMEM
*/
INIT_LIST_HEAD(&cpu_buffer->new_pages);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ if (__rb_allocate_pages(cpu_buffer->nr_pages_to_update,
+ &cpu_buffer->new_pages, cpu,
+ cpu_buffer->use_pstore)) {
+#else
if (__rb_allocate_pages(cpu_buffer->nr_pages_to_update,
&cpu_buffer->new_pages, cpu)) {
+#endif
/* not enough memory for new pages */
err = -ENOMEM;
goto out_err;
@@ -2319,7 +2410,12 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
INIT_LIST_HEAD(&cpu_buffer->new_pages);
if (cpu_buffer->nr_pages_to_update > 0 &&
__rb_allocate_pages(cpu_buffer->nr_pages_to_update,
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ &cpu_buffer->new_pages, cpu_id,
+ cpu_buffer->use_pstore)) {
+#else
&cpu_buffer->new_pages, cpu_id)) {
+#endif
err = -ENOMEM;
goto out_err;
}
@@ -2379,7 +2475,12 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
list_for_each_entry_safe(bpage, tmp, &cpu_buffer->new_pages,
list) {
list_del_init(&bpage->list);
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ free_buffer_page_cpu(bpage, cpu,
+ cpu_buffer->use_pstore);
+#else
free_buffer_page(bpage);
+#endif
}
}
out_err_unlock:
@@ -5184,13 +5285,22 @@ void *ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
if (bpage)
goto out;
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ if (cpu_buffer->use_pstore) {
+ bpage = (struct buffer_data_page *)ramtrace_alloc_page(cpu);
+ if (!bpage)
+ return ERR_PTR(-ENOMEM);
+ } else {
+#endif
page = alloc_pages_node(cpu_to_node(cpu),
GFP_KERNEL | __GFP_NORETRY, 0);
if (!page)
return ERR_PTR(-ENOMEM);
bpage = page_address(page);
-
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ }
+#endif
out:
rb_init_page(bpage);
down_read(&trace_read_sem);
@@ -5229,7 +5339,13 @@ void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu, void *data
arch_spin_unlock(&cpu_buffer->lock);
local_irq_restore(flags);
- out:
+out:
+#ifdef CONFIG_TRACE_EVENTS_TO_PSTORE
+ if (cpu_buffer->use_pstore) {
+ ramtrace_free_page(bpage, cpu);
+ return;
+ }
+#endif
free_page((unsigned long)bpage);
up_read(&trace_read_sem);
}
--
2.7.4
The ramtrace backend acts as a page allocator and manages
the persistent RAM buffer.
ramtrace supports allocation and deallocation of page-sized memory
through the methods ramtrace_alloc_page and ramtrace_free_page.
This functionality is required by the ring buffer when the user
switches to persistent storage.
Just before allocating pages for recording trace entries, the ramtrace
backend frees the list used to track the pages pertaining to the previous
boot. After this, reading previous-boot trace entries from
/sys/fs/pstore is disabled.
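As a rough usage sketch (an assumed caller, not part of this patch; the real
consumer is the ring buffer in a later patch of this series), the allocator
would be used roughly as follows.

#include <linux/ramtrace.h>

/* Hypothetical helper: grab one pstore-backed page for a CPU, or NULL. */
static void *example_get_trace_page(int cpu)
{
	if (!is_ramtrace_available() || ramtrace_available_mem() < 1)
		return NULL;
	/* Pair with ramtrace_free_page(page, cpu) when done. */
	return ramtrace_alloc_page(cpu);
}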
Signed-off-by: Nachammai Karuppiah <[email protected]>
---
fs/pstore/ramtrace.c | 336 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ramtrace.h | 13 ++
2 files changed, 349 insertions(+)
diff --git a/fs/pstore/ramtrace.c b/fs/pstore/ramtrace.c
index ca48a76..de6d09e8 100644
--- a/fs/pstore/ramtrace.c
+++ b/fs/pstore/ramtrace.c
@@ -19,6 +19,17 @@ module_param(mem_size, ulong, 0400);
MODULE_PARM_DESC(mem_size,
"size of reserved RAM used to store trace data");
+struct ramtrace_pagelist {
+ struct list_head list;
+ void *page;
+};
+
+struct tr_persistent_info {
+ char *tracer_name;
+ int trace_clock;
+ unsigned int nr_cpus;
+ struct list_head **data_pages;
+};
struct ramtrace_context {
phys_addr_t phys_addr; /* Physical address of the persistent memory */
@@ -37,6 +48,50 @@ struct ramtrace_context {
int read_buffer_status;
};
+/*
+ * The first page in the ramtrace area is the metadata page, followed by
+ * bitmap pages and then the buffer_data_page allocated by trace.
+ * Each bitmap page can represent up to page_size * 8 pages.
+ * The number of bitmaps needed per cpu is determined by the size of the
+ * pstore memory. Each CPU is allocated sufficient bitmap pages to represent
+ * the entire memory region.
+ * The figure below illustrates how the ramtrace memory area is organized.
+ *
+ * +------------------------------------------+
+ * | metadata |
+ * +------------------------------------------+
+ * | CPU 1 Bitmap 1 to buffer pages |
+ * +------------------------------------------+
+ * | CPU 1 Bitmap 2 to buffer pages |
+ * +------------------------------------------+
+ * | . . . |
+ * +------------------------------------------+
+ * | CPU 1 Bitmap K to buffer pages |
+ * +------------------------------------------+
+ * | CPU 2 Bitmap 1 to buffer pages |
+ * +------------------------------------------+
+ * | CPU 2 Bitmap 2 to buffer pages |
+ * +------------------------------------------+
+ * | . . . |
+ * +------------------------------------------+
+ * | CPU 2 Bitmap K to buffer pages |
+ * +------------------------------------------+
+ * | . . . . . . |
+ * +------------------------------------------+
+ * | CPU N Bitmap K to buffer pages |
+ * +------------------------------------------+
+ * | buffer_data_page 1 belonging to any CPU |
+ * +------------------------------------------+
+ * | buffer_data_page 2 belonging to any CPU |
+ * +------------------------------------------+
+ * | . . . |
+ * +------------------------------------------+
+ * | buffer_data_page (K x 4096) |
+ * | belonging to any CPU |
+ * +------------------------------------------+
+ */
+
+static void ramtrace_read_pages(void);
static int ramtrace_pstore_open(struct pstore_info *psi);
static ssize_t ramtrace_pstore_read(struct pstore_record *record);
static int ramtrace_pstore_erase(struct pstore_record *record);
@@ -122,6 +177,227 @@ static int ramtrace_pstore_erase(struct pstore_record *record)
static struct platform_device *dummy;
+bool is_ramtrace_available(void)
+{
+ return trace_ctx.size > 0;
+}
+
+int ramtrace_available_mem(void)
+{
+ return trace_ctx.pages_available;
+}
+
+/**
+ * ramtrace_init_bitmap: Initialize bitmap pages.
+ *
+ * This method allocates and initializes bitmap pages.
+ */
+static void ramtrace_init_bitmap(unsigned int npages)
+{
+ int i;
+ unsigned long flags;
+ struct ramtrace_pagelist *freelist = trace_ctx.freelist;
+
+ trace_ctx.bitmap_pages = kmalloc_array(npages, sizeof(void *),
+ GFP_KERNEL);
+ spin_lock_irqsave(&trace_ctx.lock, flags);
+ for (i = 0; i < npages; i++) {
+ struct ramtrace_pagelist *freelist_node;
+ void *page;
+
+ freelist_node = list_next_entry(freelist, list);
+ page = freelist_node->page;
+ memset(page, 0, PAGE_SIZE);
+ trace_ctx.bitmap_pages[i] = page;
+ list_del(&freelist_node->list);
+ kfree(freelist_node);
+ }
+ spin_unlock_irqrestore(&trace_ctx.lock, flags);
+ trace_ctx.base_address = trace_ctx.bitmap_pages[npages - 1] + PAGE_SIZE;
+}
+
+
+static void ramtrace_write_int(int **buffer, int n)
+{
+ **buffer = n;
+ (*buffer)++;
+}
+
+void ramtrace_set_clock_id(int clock_id)
+{
+ *(trace_ctx.clock_id) = clock_id;
+}
+
+void ramtrace_set_tracer_name(const char *tracer_name)
+{
+ sprintf(trace_ctx.tracer_name, "%s", tracer_name);
+}
+
+/*
+ * init_ramtrace_pages: Initialize metadata page, bitmap and trace context.
+ *
+ * Below is the layout of the metadata page.
+ * +------------------------------------------+
+ * | Kernel Version |
+ * +------------------------------------------+
+ * | tracer_name |
+ * +------------------------------------------+
+ * | Number of CPU’s Buffers = N |
+ * +------------------------------------------+
+ * | trace_clock_name |
+ * +------------------------------------------+
+ * | pages per cpu |
+ * +------------------------------------------+
+ */
+
+int
+init_ramtrace_pages(int cpu, unsigned long npages, const char *tracer_name,
+ int clock_id)
+{
+ const char kernel_version[] = UTS_RELEASE;
+ struct ramtrace_pagelist *freelist_node;
+ void *metapage;
+ unsigned long flags;
+ int n_bitmap = 0;
+ int ramtrace_pages;
+
+
+ ramtrace_pages = (trace_ctx.size >> PAGE_SHIFT) - 1;
+
+ /* Calculate number of bitmap pages required for npages */
+ n_bitmap = ramtrace_pages / ((PAGE_SIZE << 3) + cpu);
+
+ if (ramtrace_pages % (PAGE_SIZE << 3) > cpu)
+ n_bitmap++;
+ if (ramtrace_pages - n_bitmap < npages)
+ return 1;
+
+ spin_lock_irqsave(&trace_ctx.lock, flags);
+ freelist_node = list_next_entry(trace_ctx.freelist, list);
+ metapage = freelist_node->page;
+ list_del(&freelist_node->list);
+ spin_unlock_irqrestore(&trace_ctx.lock, flags);
+
+ pstore_tracing_erase();
+ free_persist_info();
+
+ /* Initialize metadata page */
+ ramtrace_write_int((int **)&metapage, cpu);
+ trace_ctx.clock_id = (int *)metapage;
+ ramtrace_write_int((int **)&metapage, clock_id);
+ ramtrace_write_int((int **)&metapage, n_bitmap);
+ sprintf(metapage, "%s", kernel_version);
+ metapage += strlen(kernel_version) + 1;
+ trace_ctx.tracer_name = (char *)metapage;
+ sprintf(metapage, "%s", tracer_name);
+
+ kfree(freelist_node);
+ trace_ctx.cpu = cpu;
+ trace_ctx.num_bitmap_per_cpu = n_bitmap;
+ trace_ctx.pages_available = ramtrace_pages - n_bitmap;
+ ramtrace_init_bitmap(cpu * n_bitmap);
+ return 0;
+}
+
+static void ramtrace_set_bit(char *bitmap, int index)
+{
+ bitmap[index >> 3] |= (1 << index % 8);
+}
+
+static bool ramtrace_is_allocated(char *bitmap, int index)
+{
+ return bitmap[index >> 3] & (1 << index % 8);
+}
+
+static void ramtrace_reset_bit(char *bitmap, int index)
+{
+ bitmap[index >> 3] &= ~(1 << index % 8);
+}
+
+
+void *ramtrace_alloc_page(int cpu)
+{
+ void *address = NULL;
+ struct ramtrace_pagelist *freelist = trace_ctx.freelist;
+
+ if (!list_empty(&freelist->list)) {
+ struct ramtrace_pagelist *freelist_node;
+ char *bitmap_page;
+ unsigned long page_num;
+ unsigned long flags;
+ int index, bitmap_page_index;
+
+ /* Acquire lock and obtain a page from freelist */
+ spin_lock_irqsave(&trace_ctx.lock, flags);
+ freelist_node = list_next_entry(freelist, list);
+ list_del(&freelist_node->list);
+ trace_ctx.pages_available--;
+ spin_unlock_irqrestore(&trace_ctx.lock, flags);
+
+ address = freelist_node->page;
+ memset(address, 0, PAGE_SIZE);
+
+ /* Determine the bitmap index for the allocated page */
+ page_num = (address - trace_ctx.base_address) >> PAGE_SHIFT;
+
+ /* Every bitmap page represents PAGE_SIZE * 8 or
+ * 1 << (PAGE_SHIFT + 3) pages. Determine the nth bitmap for
+ * this cpu associated with the allocated page address.
+ */
+ bitmap_page_index = page_num >> (PAGE_SHIFT + 3);
+ bitmap_page = trace_ctx.bitmap_pages[trace_ctx.num_bitmap_per_cpu * cpu + bitmap_page_index];
+ /* Determine the index */
+ index = page_num - (bitmap_page_index << (PAGE_SHIFT + 3));
+
+ ramtrace_set_bit(bitmap_page, index);
+
+ }
+ return address;
+
+}
+
+void ramtrace_free_page(void *page_address, int cpu)
+{
+ void *bitmap;
+ int index;
+
+ /*
+ * Determine the page number by calculating the offset from the base
+ * address and divide it by page size.
+ * Each bitmap can hold page_size * 8 indices. In case we have more
+ * than one bitmap per cpu, divide page_num by (page size * 8).
+ */
+ unsigned long page_num = (page_address - trace_ctx.base_address) >> PAGE_SHIFT;
+ int bitmap_page_index = page_num >> (PAGE_SHIFT + 3);
+
+ if (page_address == NULL)
+ return;
+ bitmap = (char *)(trace_ctx.bitmap_pages[trace_ctx.num_bitmap_per_cpu * cpu + bitmap_page_index]);
+ /*
+ * When a single bitmap per cpu is used, page_num gives the index
+ * in the bitmap. In case of multiple bitmaps per cpu,
+ * page_num - bitmap_page_index * page_size * 8 gives the index.
+ * Note: When page_num is less than (page_size * 8), bitmap_page_index
+ * is zero.
+ */
+ index = page_num - (bitmap_page_index << (PAGE_SHIFT + 3));
+ if (ramtrace_is_allocated(bitmap, index)) {
+ struct ramtrace_pagelist *freelist_node =
+ kmalloc(sizeof(struct ramtrace_pagelist), GFP_KERNEL);
+ unsigned long flags;
+
+ freelist_node->page = page_address;
+ spin_lock_irqsave(&trace_ctx.lock, flags);
+ list_add_tail(&freelist_node->list, &(trace_ctx.freelist->list));
+ trace_ctx.pages_available++;
+ spin_unlock_irqrestore(&trace_ctx.lock, flags);
+
+ ramtrace_reset_bit(bitmap, index);
+ }
+
+}
+
+
static int ramtrace_parse_dt(struct platform_device *pdev,
struct ramtrace_platform_data *pdata)
{
@@ -232,6 +508,32 @@ ramtrace_read_bitmap(int n_cpu, int n_bitmap, struct list_head **per_cpu)
}
+struct list_head *ramtrace_get_read_buffer(int n_cpu)
+{
+ if (n_cpu >= (trace_ctx.persist_info)->nr_cpus)
+ return NULL;
+
+ return (trace_ctx.persist_info)->data_pages[n_cpu];
+}
+
+int ramtrace_get_prev_boot_nr_cpus(void)
+{
+ return (trace_ctx.persist_info)->nr_cpus;
+}
+
+int ramtrace_get_prev_boot_clock_id(void)
+{
+ return (trace_ctx.persist_info)->trace_clock;
+}
+
+char *ramtrace_get_prev_boot_tracer_name(void)
+{
+ return (trace_ctx.persist_info)->tracer_name;
+}
+
+
+
+
static void ramtrace_read_pages(void)
{
void *metapage = trace_ctx.vaddr;
@@ -270,6 +572,40 @@ static void ramtrace_read_pages(void)
out:
trace_ctx.persist_info = persist;
}
+
+/**
+ * free_persist_info - free the list pertaining to previous boot.
+ *
+ * Free the list and array that was allocated to manage previous boot data.
+ * Note: There is no need to free the ramtrace pages memory area.
+ */
+static void free_persist_info(void)
+{
+ struct tr_persistent_info *persist;
+ int i;
+
+ persist = trace_ctx.persist_info;
+
+ if (persist) {
+ for (i = 0; i < persist->nr_cpus; i++) {
+ struct ramtrace_pagelist *node, *tmp;
+ struct list_head *page_list = persist->data_pages[i];
+
+ if (page_list == NULL)
+ continue;
+ list_for_each_entry_safe(node, tmp, page_list, list) {
+ list_del(&node->list);
+ kfree(node);
+ }
+ kfree(page_list);
+ }
+ kfree(persist->data_pages);
+ kfree(persist->tracer_name);
+ kfree(persist);
+ }
+ trace_ctx.persist_info = NULL;
+}
+
static int ramtrace_init_mem(struct ramtrace_context *ctx)
{
diff --git a/include/linux/ramtrace.h b/include/linux/ramtrace.h
index faf459f..8f9936c 100644
--- a/include/linux/ramtrace.h
+++ b/include/linux/ramtrace.h
@@ -13,3 +13,16 @@ struct ramtrace_platform_data {
unsigned long mem_size;
phys_addr_t mem_address;
};
+
+void *ramtrace_alloc_page(int cpu);
+void ramtrace_free_page(void *address, int cpu);
+void ramtrace_dump(void);
+int init_ramtrace_pages(int cpu, unsigned long npages,
+ const char *tracer_name, int clock_id);
+bool is_ramtrace_available(void);
+struct list_head *ramtrace_get_read_buffer(int cpu);
+char *ramtrace_get_prev_boot_tracer_name(void);
+int ramtrace_get_prev_boot_clock_id(void);
+int ramtrace_get_prev_boot_nr_cpus(void);
+int ramtrace_available_mem(void);
+void ramtrace_set_tracer_name(const char *tracer_name);
--
2.7.4
On Wed, Sep 2, 2020 at 4:01 PM Nachammai Karuppiah
<[email protected]> wrote:
>
> Hi,
>
> This patch series adds support to store trace events in pstore.
>
> Storing trace entries in persistent RAM would help in understanding what
> happened just before the system went down. The trace events that led to the
> crash can be retrieved from the pstore after a warm reboot. This will help
> debug what happened before machine’s last breath. This has to be done in a
> scalable way so that tracing a live system does not impact the performance
> of the system.
Just to add, Nachammai was my intern in the recent outreachy program
and we designed together a way for trace events to be written to
pstore backed memory directory instead of regular memory. The basic
idea is to allocate ftrace's ring buffer on pstore memory and have it
right there. Then recover it on reboot. Nachammai wrote the code with
some guidance :) . I talked to Steve as well in the past about the
basic idea of this. Steve is on vacation this week though.
This is similar to what +Sai Prakash Ranjan was trying to do some time
ago: https://lkml.org/lkml/2018/9/8/221 . But that approach involved
higher overhead due to synchronization of writing to the otherwise
lockless ring buffer.
+Brian Norris has also expressed interest in this feature.
thanks,
- Joel
On Wed, Sep 2, 2020 at 5:47 PM Joel Fernandes <[email protected]> wrote:
>
> On Wed, Sep 2, 2020 at 4:01 PM Nachammai Karuppiah
> <[email protected]> wrote:
> >
> > Hi,
> >
> > This patch series adds support to store trace events in pstore.
> >
Been a long day...
> > Storing trace entries in persistent RAM would help in understanding what
> > happened just before the system went down. The trace events that led to the
> > crash can be retrieved from the pstore after a warm reboot. This will help
> > debug what happened before machine’s last breath. This has to be done in a
> > scalable way so that tracing a live system does not impact the performance
> > of the system.
>
> Just to add, Nachammai was my intern in the recent outreachy program
> and we designed together a way for trace events to be written to
> pstore backed memory directory instead of regular memory. The basic
s/directory/directly/
> idea is to allocate frace's ring buffer on pstore memory and have it
> right there. Then recover it on reboot. Nachammai wrote the code with
s/right/write/
- Joel
> some guidance :) . I talked to Steve as well in the past about the
> basic of idea of this. Steve is on vacation this week though.
>
> This is similar to what +Sai Prakash Ranjan was trying to do sometime
> ago: https://lkml.org/lkml/2018/9/8/221 . But that approach involved
> higher overhead due to synchronization of writing to the otherwise
> lockless ring buffer.
>
> +Brian Norris has also expressed interest for this feature.
>
> thanks,
>
> - Joel
>
> > [...]
On 2020-09-03 03:17, Joel Fernandes wrote:
> On Wed, Sep 2, 2020 at 4:01 PM Nachammai Karuppiah
> <[email protected]> wrote:
>>
>> Hi,
>>
>> This patch series adds support to store trace events in pstore.
>>
>> Storing trace entries in persistent RAM would help in understanding
>> what
>> happened just before the system went down. The trace events that led
>> to the
>> crash can be retrieved from the pstore after a warm reboot. This will
>> help
>> debug what happened before machine’s last breath. This has to be done
>> in a
>> scalable way so that tracing a live system does not impact the
>> performance
>> of the system.
>
> Just to add, Nachammai was my intern in the recent outreachy program
> and we designed together a way for trace events to be written to
> pstore backed memory directory instead of regular memory. The basic
> idea is to allocate frace's ring buffer on pstore memory and have it
> right there. Then recover it on reboot. Nachammai wrote the code with
> some guidance :) . I talked to Steve as well in the past about the
> basic of idea of this. Steve is on vacation this week though.
>
> This is similar to what +Sai Prakash Ranjan was trying to do sometime
> ago: https://lkml.org/lkml/2018/9/8/221 . But that approach involved
> higher overhead due to synchronization of writing to the otherwise
> lockless ring buffer.
>
> +Brian Norris has also expressed interest for this feature.
>
Great work, Nachammai and Joel. I have a few boards with warm-reboot
support and will test this series in the coming days.
Thanks,
Sai
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
On Wed, Sep 2, 2020 at 3:47 PM Joel Fernandes <[email protected]> wrote:
>
> On Wed, Sep 2, 2020 at 4:01 PM Nachammai Karuppiah
> <[email protected]> wrote:
> >
> > Hi,
> >
> > This patch series adds support to store trace events in pstore.
> >
> > Storing trace entries in persistent RAM would help in understanding what
> > happened just before the system went down. The trace events that led to the
> > crash can be retrieved from the pstore after a warm reboot. This will help
> > debug what happened before machine’s last breath. This has to be done in a
> > scalable way so that tracing a live system does not impact the performance
> > of the system.
>
> Just to add, Nachammai was my intern in the recent outreachy program
> and we designed together a way for trace events to be written to
> pstore backed memory directory instead of regular memory. The basic
> idea is to allocate frace's ring buffer on pstore memory and have it
> right there. Then recover it on reboot. Nachammai wrote the code with
> some guidance :) . I talked to Steve as well in the past about the
> basic of idea of this. Steve is on vacation this week though.
ramoops is already the RAM backend for pstore and ramoops already has
an ftrace region defined. What am I missing?
From a DT standpoint, we already have a reserved persistent RAM
binding too. There are already too many kernel specifics on how it is
used; we don't need more of that in DT. We're not going to add another
separate region (actually, you can have as many regions defined as you
want. They will just all be 'ramoops' compatible).
Rob
Hi Rob,
(Back from holidays, digging through the email pile). Reply below:
On Thu, Sep 3, 2020 at 2:09 PM Rob Herring <[email protected]> wrote:
>
> On Wed, Sep 2, 2020 at 3:47 PM Joel Fernandes <[email protected]> wrote:
> >
> > On Wed, Sep 2, 2020 at 4:01 PM Nachammai Karuppiah
> > <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > This patch series adds support to store trace events in pstore.
> > >
> > > Storing trace entries in persistent RAM would help in understanding what
> > > happened just before the system went down. The trace events that led to the
> > > crash can be retrieved from the pstore after a warm reboot. This will help
> > > debug what happened before machine’s last breath. This has to be done in a
> > > scalable way so that tracing a live system does not impact the performance
> > > of the system.
> >
> > Just to add, Nachammai was my intern in the recent outreachy program
> > and we designed together a way for trace events to be written to
> > pstore backed memory directory instead of regular memory. The basic
> > idea is to allocate frace's ring buffer on pstore memory and have it
> > right there. Then recover it on reboot. Nachammai wrote the code with
> > some guidance :) . I talked to Steve as well in the past about the
> > basic of idea of this. Steve is on vacation this week though.
>
> ramoops is already the RAM backend for pstore and ramoops already has
> an ftrace region defined. What am I missing?
ramoops is too slow for tracing. Honestly, the ftrace functionality in
ramoops should be removed in favor of Nachammai's patches (she did it
for events, but function tracing could be trivially added). No one uses
the current ftrace in pstore because it is darned slow. ramoops sits
in between the writing of the ftrace record and the memory being
written to, adding more overhead in the process, while also writing
ftrace records in a non-ftrace format. So ramoops' API and
infrastructure fundamentally do not meet the requirements of high-speed
persistent tracing. The idea of this work is to keep trace events
enabled for a long period of time (possibly even in production) at low
overhead, until a problem like a machine crash happens.
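To make the overhead argument concrete, here is a rough sketch; it is not code from the series, and the record layout, cursor handling, and helper names are assumptions made purely for illustration. The point is that a ramoops-style path copies every record into a separate persistent zone, while a ramtrace-style path builds the record in place, because the ring buffer page itself already lives in the persistent region (e.g. a page returned by ramtrace_alloc_page()).

#include <linux/string.h>
#include <linux/types.h>

/* Hypothetical record, loosely modeled on a function-trace entry. */
struct example_rec {
	unsigned long ip;
	unsigned long parent_ip;
	u64 ts;
};

/*
 * ramoops-style: the record is built elsewhere and then copied into
 * the persistent zone, an extra copy (and locking, not shown) on
 * every single event.
 */
static void example_copy_path(void *pstore_zone, size_t *pos,
			      const struct example_rec *rec)
{
	memcpy((char *)pstore_zone + *pos, rec, sizeof(*rec));
	*pos += sizeof(*rec);
}

/*
 * ramtrace-style: the ring buffer page is itself persistent memory,
 * so the caller fills the returned slot directly and no second copy
 * is needed.
 */
static struct example_rec *example_direct_path(void *persistent_page,
					       size_t *pos)
{
	struct example_rec *slot =
		(struct example_rec *)((char *)persistent_page + *pos);

	*pos += sizeof(*slot);
	return slot;
}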
> From a DT standpoint, we already have a reserved persistent RAM
> binding too. There's already too much kernel specifics on how it is
> used, we don't need more of that in DT. We're not going to add another
> separate region (actually, you can have as many regions defined as you
> want. They will just all be 'ramoops' compatible).
I agree with the sentiment here on DT. Maybe the DT can be generalized
to provide a RAM region to which either ramoops or ramtrace can
attach.
- Joel
On Thu, 10 Sep 2020 21:25:11 -0400
Joel Fernandes <[email protected]> wrote:
> Hi Rob,
> (Back from holidays, digging through the email pile). Reply below:
Whatever happened to this?
Sorry, I was expecting more replies, and when there was nothing, it got
lost in my inbox.
>
> On Thu, Sep 3, 2020 at 2:09 PM Rob Herring <[email protected]> wrote:
> >
> > On Wed, Sep 2, 2020 at 3:47 PM Joel Fernandes <[email protected]> wrote:
> > >
> > > On Wed, Sep 2, 2020 at 4:01 PM Nachammai Karuppiah
> > > <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This patch series adds support to store trace events in pstore.
> > > >
> > > > Storing trace entries in persistent RAM would help in understanding what
> > > > happened just before the system went down. The trace events that led to the
> > > > crash can be retrieved from the pstore after a warm reboot. This will help
> > > > debug what happened before machine’s last breath. This has to be done in a
> > > > scalable way so that tracing a live system does not impact the performance
> > > > of the system.
> > >
> > > Just to add, Nachammai was my intern in the recent outreachy program
> > > and we designed together a way for trace events to be written to
> > > pstore backed memory directory instead of regular memory. The basic
> > > idea is to allocate frace's ring buffer on pstore memory and have it
> > > right there. Then recover it on reboot. Nachammai wrote the code with
> > > some guidance :) . I talked to Steve as well in the past about the
> > > basic of idea of this. Steve is on vacation this week though.
> >
> > ramoops is already the RAM backend for pstore and ramoops already has
> > an ftrace region defined. What am I missing?
>
> ramoops is too slow for tracing. Honestly, the ftrace functionality in
> ramoops should be removed in favor of Nachammai's patches (she did it
> for events but function tracing could be trivially added). No one uses
> the current ftrace in pstore because it is darned slow. ramoops sits
> in between the writing of the ftrace record and the memory being
> written to adding more overhead in the process, while also writing
> ftrace records in a non-ftrace format. So ramoop's API and
> infrastructure fundamentally does not meet the requirements of high
> speed persistent tracing. The idea of this work is to keep the trace
> events enabled for a long period time (possibly even in production)
> and low overhead until the problem like machine crashing happens.
>
> > From a DT standpoint, we already have a reserved persistent RAM
> > binding too. There's already too much kernel specifics on how it is
> > used, we don't need more of that in DT. We're not going to add another
> > separate region (actually, you can have as many regions defined as you
> > want. They will just all be 'ramoops' compatible).
>
> I agree with the sentiment here on DT. Maybe the DT can be generalized
> to provide a ram region to which either ramoops or ramtrace can
> attach.
Right,
Perhaps just remove patch 7, but still have the ramoops work move forward?
-- Steve
On Thu, Jun 30, 2022 at 3:48 PM Steven Rostedt <[email protected]> wrote:
>
> On Thu, 10 Sep 2020 21:25:11 -0400
> Joel Fernandes <[email protected]> wrote:
>
> > Hi Rob,
> > (Back from holidays, digging through the email pile). Reply below:
>
> What ever happen to this?
>
> Sorry, I was expecting more replies, and when there was nothing, it got
> lost in my inbox.
>
[...]
> > > From a DT standpoint, we already have a reserved persistent RAM
> > > binding too. There's already too much kernel specifics on how it is
> > > used, we don't need more of that in DT. We're not going to add another
> > > separate region (actually, you can have as many regions defined as you
> > > want. They will just all be 'ramoops' compatible).
> >
> > I agree with the sentiment here on DT. Maybe the DT can be generalized
> > to provide a ram region to which either ramoops or ramtrace can
> > attach.
>
> Right,
>
> Perhaps just remove patch 7, but still have the ramoops work move forward?
This was an internship project submission which stalled after the
internship ended; I imagine Nachammai has moved on to other
things since.
I am curious how this came onto your radar after two years. Did someone
tell you to prioritize improving the performance of ftrace on pstore? I
could probably make time to work on it more if someone has a use case
for this.
Thanks,
- Joel
On Fri, 1 Jul 2022 12:37:35 -0400
Joel Fernandes <[email protected]> wrote:
> I am curious how this came on your radar after 2 years, did someone
> tell you to prioritize improving performance of ftrace on pstore? I
> could probably make time to work on it more if someone has a usecase
> for this or something.
I'm looking into ways to extract the ftrace ring buffer from crashes, and
it was brought up that pstore was used before.
-- Steve
On Fri, Jul 1, 2022 at 12:46 PM Steven Rostedt <[email protected]> wrote:
>
> On Fri, 1 Jul 2022 12:37:35 -0400
> Joel Fernandes <[email protected]> wrote:
>
> > I am curious how this came on your radar after 2 years, did someone
> > tell you to prioritize improving performance of ftrace on pstore? I
> > could probably make time to work on it more if someone has a usecase
> > for this or something.
>
> I'm looking into ways to extract the ftrace ring buffer from crashes, and
> it was brought up that pstore was used before.
Interesting. In the case of pstore, you know exactly where the pages
are for ftrace. How would you know that for the buddy system, where
pages are in the wild wild west? I guess you would need to track where
ftrace pages were allocated within the crash dump/report.
- Joel
On Fri, 1 Jul 2022 12:53:17 -0400
Joel Fernandes <[email protected]> wrote:
> Interesting. In the case of pstore, you know exactly where the pages
> are for ftrace. How would you know that for the buddy system where
> pages are in the wild wild west? I guess you would need to track where
> ftrace pages where allocated, within the crash dump/report.
kexec/kdump already does that (of course it requires the DWARF symbols of
the kernel to be accessible by the kdump kernel).
But if we write the raw ftrace data into persistent memory that can survive
a reboot, then we can extract that raw data and convert it back to text
offline.
Thus, I would like to remove the conversion to text and the compression into
pstore, and possibly look at a solution that simply writes the raw data
into pstore.
-- Steve