To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel
From: Alexey Budankov
Subject: [PATCH v14 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
Organization: Intel Corp.
Date: Mon, 15 Oct 2018 09:26:09 +0300

Currently in record mode the tool implements trace writing serially.
The algorithm loops over the mapped per-cpu data buffers and stores ready
data chunks into a trace file using the write() system call.

In some circumstances the kernel may lack free space in a buffer
because the other half of that buffer has not yet been written to disk:
the tool is busy writing some other buffer's data at that moment.

Thus the serial trace writing implementation may cause the kernel to
lose profiling data, and that is what is observed when profiling highly
parallel CPU bound workloads on machines with a large number of cores.

An experiment profiling matrix multiplication code executing 128 threads
on Intel Xeon Phi (KNM) with 272 cores, like below, demonstrates a data
loss metric value of 98%:

    /usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
        --call-graph dwarf,1024 --user-regs=IP,SP,BP --switch-events \
        -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
        matrix.gcc

The data loss metric is the ratio lost_time/elapsed_time, where lost_time
is the sum of the time intervals containing PERF_RECORD_LOST records and
elapsed_time is the elapsed application run time under profiling.

Applying asynchronous trace streaming through the POSIX AIO API [1] lowers
the data loss metric value, providing a 2x improvement (from 98% to ~1%).

Asynchronous trace streaming is currently limited to glibc linkage.
musl libc [5] also provides a POSIX AIO API implementation, however the
patch kit has not been tested with it.
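For readers who want to reproduce the data loss metric defined above, here
is a minimal sketch of the lost_time/elapsed_time calculation. It is not
taken from the patch kit: struct trace_interval, its fields and
data_loss_metric() are hypothetical names, and the sketch assumes the
trace has already been split into time intervals, each flagged by whether
it contains a PERF_RECORD_LOST record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one time interval of the trace. */
struct trace_interval {
	uint64_t start_ns;
	uint64_t end_ns;
	int has_lost;	/* non-zero if the interval contains PERF_RECORD_LOST */
};

/*
 * lost_time / elapsed_time, where lost_time sums the lengths of the
 * intervals flagged has_lost and elapsed_ns is the application run
 * time under profiling. The result is a fraction (0.98 means 98%).
 */
double data_loss_metric(const struct trace_interval *iv, size_t nr,
			uint64_t elapsed_ns)
{
	uint64_t lost_ns = 0;
	size_t i;

	assert(elapsed_ns > 0);

	for (i = 0; i < nr; i++) {
		if (iv[i].has_lost)
			lost_ns += iv[i].end_ns - iv[i].start_ns;
	}

	return (double)lost_ns / (double)elapsed_ns;
}
```

With this definition a run in which almost every interval carries a lost
record yields a value close to 1.0, which is the 98% case reported above.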
There may be other libc libraries linked by the Perf tool that currently
lack POSIX AIO API support [2], [3], [4], so the NO_AIO define may be used
to limit the Perf tool binary to serial streaming only.

---
Alexey Budankov (3):
  perf util: map data buffer for preserving collected data
  perf record: enable asynchronous trace writing
  perf record: extend trace writing to multi AIO

 tools/perf/Documentation/perf-record.txt |   5 +
 tools/perf/Makefile.config               |   5 +
 tools/perf/Makefile.perf                 |   7 +-
 tools/perf/builtin-record.c              | 252 ++++++++++++++++++++++++++++++-
 tools/perf/perf.h                        |   1 +
 tools/perf/util/evlist.c                 |   6 +-
 tools/perf/util/evlist.h                 |   2 +-
 tools/perf/util/mmap.c                   | 146 +++++++++++++++++-
 tools/perf/util/mmap.h                   |  26 +++-
 9 files changed, 439 insertions(+), 11 deletions(-)

---
Changes in v14:
- implement default nr_cblocks_default variable
- fix --aio option handling

Changes in v13:
- named new functions with _aio_ word
- grouped aio functions under single #ifdef HAVE_AIO_SUPPORT
- moved perf_mmap__aio_push() stub into header
- removed trailing white space

Changes in v12:
- applied stub functions design for the whole patch kit
- grouped AIO related data into a struct under struct perf_mmap
- implemented record__aio_get/set_pos(), record__aio_enabled()
- implemented simple --aio option
- extended --aio option to --aio-cblocks=

Changes in v11:
- replaced both lseek() syscalls in every loop iteration with only two
  syscalls, just before and after the loop in record__mmap_read_evlist(),
  and advance the *in-flight* off file pos value in perf_mmap__aio_push()

Changes in v10:
- moved specific code to perf_mmap__aio_mmap(), perf_mmap__aio_munmap();
- adjusted error reporting by using %m
- avoided lseek() setting file pos back in case of record__aio_write() failure
- compacted code selecting between serial and AIO streaming
- optimized call places of record__mmap_read_sync()
- added description of aio-cblocks option into perf-record.txt

Changes in v9:
- enable AIO streaming only when --aio-cblocks option is specified explicitly
- enable AIO based implementation when linking with glibc only
- define NO_AIO to limit Perf binary to serial implementation

Changes in v8:
- ran the whole thing through checkpatch.pl and corrected found issues
  except lines longer than 80 symbols
- corrected comments alignment and formatting
- moved multi AIO implementation into 3rd patch in the series
- implemented explicit cblocks array allocation
- split AIO completion check into separate record__aio_complete()
- set nr_cblocks default to 1 and max allowed value to 4

Changes in v7:
- implemented handling record.aio setting from perfconfig file

Changes in v6:
- adjusted setting of priorities for cblocks;
- handled errno == EAGAIN case from aio_write() return;

Changes in v5:
- resolved livelock on perf record -e intel_pt// -- dd if=/dev/zero of=/dev/null count=100000
- data loss metrics decreased from 25% to 2x in trialed configuration;
- reshaped layout of data structures;
- implemented --aio option;
- avoided nanosleep() prior to calling aio_suspend();
- switched to per-cpu aio multi buffer record__aio_sync();
- record_mmap_read_sync() now does global sync just before
  switching trace file or collection stop;

Changes in v4:
- converted mmap()/munmap() to malloc()/free() for mmap->data buffer management
- converted void *bf to struct perf_mmap *md in signatures
- wrote comment in perf_mmap__push() just before perf_mmap__get();
- wrote comment in record__mmap_read_sync() on possible restarting
  of aio_write() operation and releasing perf_mmap object after all;
- added perf_mmap__put() for the cases of failed aio_write();

Changes in v3:
- wrote comments about nanosleep(0.5ms) call prior to aio_suspend()
  to cope with intrusiveness of its implementation in glibc;
- wrote comments about rationale behind copying profiling data
  into mmap->data buffer;

Changes in v2:
- converted zalloc() to calloc() for allocation of mmap_aio array,
- cleared typo and adjusted fallback branch code;

---

[1] http://man7.org/linux/man-pages/man7/aio.7.html
[2] https://android.googlesource.com/platform/bionic/+/master/docs/status.md
[3] https://www.uclibc.org/
[4] https://uclibc-ng.org/
[5] https://www.musl-libc.org/
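
As an illustration of the general technique the series relies on (copy a
ready chunk aside, queue it with aio_write() at an explicitly tracked file
offset, and wait for completion later), here is a minimal standalone
sketch using only the POSIX AIO calls from [1]. It is not the code from
the patches: struct aio_chunk, chunk_write_async() and chunk_sync() are
hypothetical names, the 4096-byte buffer size is an arbitrary assumption,
and short writes are not restarted.

```c
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-buffer state; not the layout used by the patch kit. */
struct aio_chunk {
	struct aiocb cblock;	/* control block of the in-flight write */
	char data[4096];	/* private copy of the ready data chunk */
	int in_flight;		/* non-zero while a write is queued */
};

/*
 * Copy len bytes from src into the chunk's private buffer and queue an
 * asynchronous write at file offset *off. On success *off is advanced
 * by len, so no lseek() is needed between writes. Returns 0 or -1.
 */
int chunk_write_async(struct aio_chunk *c, int fd, const void *src,
		      size_t len, off_t *off)
{
	if (c->in_flight || len > sizeof(c->data))
		return -1;

	memcpy(c->data, src, len);

	memset(&c->cblock, 0, sizeof(c->cblock));
	c->cblock.aio_fildes = fd;
	c->cblock.aio_buf = c->data;
	c->cblock.aio_nbytes = len;
	c->cblock.aio_offset = *off;
	c->cblock.aio_sigevent.sigev_notify = SIGEV_NONE;

	/* EAGAIN means the request queue is temporarily full: retry. */
	while (aio_write(&c->cblock) != 0) {
		if (errno != EAGAIN)
			return -1;
	}

	c->in_flight = 1;
	*off += (off_t)len;
	return 0;
}

/*
 * Wait until the queued write has completed. Returns the number of
 * bytes written, 0 if nothing was queued, or -1 on error.
 */
ssize_t chunk_sync(struct aio_chunk *c)
{
	const struct aiocb *list[1] = { &c->cblock };

	if (!c->in_flight)
		return 0;

	/* aio_suspend() may return early (EINTR), so re-check the status. */
	while (aio_error(&c->cblock) == EINPROGRESS)
		aio_suspend(list, 1, NULL);

	c->in_flight = 0;
	return aio_return(&c->cblock);
}
```

A caller would keep one struct aio_chunk per buffer, call
chunk_write_async() for each ready chunk while looping over the buffers,
and call chunk_sync() on a chunk before reusing it and before closing or
switching the trace file. On glibc versions older than 2.34 the AIO
functions live in librt, so the sketch may need to be linked with -lrt.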