Date: Tue, 11 Oct 2016 05:57:53 -0400
From: Steven Rostedt <rostedt@goodmis.org>
To: Joel Fernandes <joelaf@google.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [RFC 0/7] pstore: Improve performance of ftrace backend with
 ramoops
Message-ID: <20161011055753.2690178f@grimm.local.home>
In-Reply-To: <1475904515-24970-1-git-send-email-joelaf@google.com>
References: <1475904515-24970-1-git-send-email-joelaf@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3243
Lines: 71

On Fri,  7 Oct 2016 22:28:27 -0700
Joel Fernandes <joelaf@google.com> wrote:

> Here's an early RFC for a patch series on improving ftrace throughput with
> ramoops. I am hoping to get some early comments so I'm releasing it in advance.
> It is functional and tested.
> 
> Currently ramoops uses a single zone to store function traces. To make this
> work, it has to uses locking to synchronize accesses to the buffers. Recently
> the synchronization was completely moved from a cmpxchg mechanism to raw
> spinlocks due to difficulties in using cmpxchg on uncached memory and also on
> RAMs behind PCIe. [1] This change further dropped the peformance of ramoops
> pstore backend by more than half in my tests.
> 
> This patch series improves the situation dramatically by around 280% from what
> it is now by creating a ramoops persistent zone for each CPU and avoiding use of
> locking altogether for ftrace. At init time, the persistent zones are then
> merged together.
> 
> Here are some tests to show the improvements.  Tested using a qemu quad core
> x86_64 instance with -mem-path to persist the guest RAM to a file. I measured
> avergage throughput of dd over 30 seconds:
> 
> dd if=/dev/zero | pv | dd of=/dev/null
> 
> Without this patch series: 24MB/s
> With per-cpu buffers and counter increment: 91.5 MB/s (improvement by ~ 281%)
> with per-cpu buffers and trace_clock: 51.9 MB/s
> 
> Some more considerations:
> 1. Inorder to do the merge of the individual buffers, I am using racy counters
> since I didn't want to sacrifice throughput for perfect time stamps.
> trace_clock() for timestamps although did the job but was almost half the
> throughput of using counter based timestamp.
> 
> 2. Since the patches divide the available ftrace persistent space by the number
> of CPUs, lesser space will now be available per-CPU however the user is free to
> disable per CPU behavior and revert to the old behavior by specifying
> PSTORE_PER_CPU flag.  Its a space vs performance trade-off so if user has
> enough space and not a lot of CPUs, then using per-CPU persistent buffers make
> sense for better performance.
> 
> 3. Without using any counters or timestamps, the improvement is even more
> (~140MB/s) but the buffers cannot be merged.
> 
> [1] https://lkml.org/lkml/2016/9/8/375

>From a tracing point of view, I have no qualms with this patch set.

-- Steve

> 
> Joel Fernandes (7):
>   pstore: Make spinlock per zone instead of global
>   pstore: locking: dont lock unless caller asks to
>   pstore: Remove case of PSTORE_TYPE_PMSG write using deprecated
>     function
>   pstore: Make ramoops_init_przs generic for other prz arrays
>   ramoops: Split ftrace buffer space into per-CPU zones
>   pstore: Add support to store timestamp counter in ftrace records
>   pstore: Merge per-CPU ftrace zones into one zone for output
> 
>  fs/pstore/ftrace.c         |   3 +
>  fs/pstore/inode.c          |   7 +-
>  fs/pstore/internal.h       |  34 -------
>  fs/pstore/ram.c            | 234 +++++++++++++++++++++++++++++++++++----------
>  fs/pstore/ram_core.c       |  30 +++---
>  include/linux/pstore.h     |  69 +++++++++++++
>  include/linux/pstore_ram.h |   6 +-
>  7 files changed, 280 insertions(+), 103 deletions(-)
>