From: "Verma, Vishal L"
To: "Elliott@hp.com"
Cc: "linux-kernel@vger.kernel.org", "mingo@kernel.org", "neilb@suse.de",
	"hch@lst.de", "Williams, Dan J", "axboe@fb.com", "hpa@zytor.com",
	"linux-nvdimm@lists.01.org", "gregkh@linuxfoundation.org",
	"luto@amacapital.net", "david@fromorbit.com"
Subject: Re: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector updates
Date: Mon, 18 May 2015 22:38:11 +0000
Message-ID: <1431988690.15857.31.camel@intel.com>
References: <20150428181203.35812.60474.stgit@dwillia2-desk3.amr.corp.intel.com>
	<20150428182557.35812.38292.stgit@dwillia2-desk3.amr.corp.intel.com>
	<94D0CD8314A33A4D9D801C0FE68B40295A9097E5@G9W0745.americas.hpqcorp.net>
In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40295A9097E5@G9W0745.americas.hpqcorp.net>

On Sun, 2015-05-17 at 01:19 +0000, Elliott, Robert (Server Storage) wrote:
> > -----Original Message-----
> > From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
> > Dan Williams
> > Sent: Tuesday, April 28, 2015 1:26 PM
> > To: linux-nvdimm@lists.01.org
> > Cc: Ingo Molnar; Neil Brown; Greg KH; Dave Chinner;
> > linux-kernel@vger.kernel.org; Andy Lutomirski; Jens Axboe;
> > H. Peter Anvin; Christoph Hellwig
> > Subject: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector updates
> >
> > From: Vishal Verma
> >
> > BTT stands for Block Translation Table, and is a way to provide
> > power-fail sector atomicity semantics for block devices that have the
> > ability to perform byte-granularity IO. It relies on the ->rw_bytes()
> > capability of provided nd namespace devices.
> >
> > The BTT works as a stacked block device, and reserves a chunk of space
> > from the backing device for its accounting metadata. BLK namespaces may
> > mandate use of a BTT and expect the bus to initialize a BTT if one is
> > not already present. Otherwise, if a BTT is desired for other
> > namespaces (or partitions of a namespace), a BTT may be manually
> > configured.
> ...
>
> Running btt above pmem with a variety of workloads, I see an awful lot
> of time spent in two places:
> * _raw_spin_lock
> * btt_make_request

I'm curious about why _raw_spin_lock might be taking up so much time -
does the test system have more than 256 CPUs? Even ignoring Dan's fix
for nr_cpus, which will allow the lockless-path optimization to kick in
(coming in v3), the lane spinlocks should be largely uncontended unless
you have more than 256 CPUs. One explanation could be that the time is
dominated by 'map_locks' - I'd need to get my perf setup going again to
find out.

(The more I think about it, the more I'm convinced this is the case.
Fortunately, it is simple to reduce map_lock contention by increasing
the number of locks - see the sketch below.)
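To illustrate what I mean (a minimal sketch, not the actual nd_btt
code - the stripe count, the hash, and all helper names here are
assumptions for the example): stripe the map lock by a hash of the
premap block, so a 4B map update only takes the lock for its stripe,
and two threads contend only when their blocks hash to the same
stripe.

#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define MAP_LOCK_STRIPES 128	/* assumed tunable, power of two */

struct btt_map_locks {
	spinlock_t lock[MAP_LOCK_STRIPES];
};

static void btt_map_locks_init(struct btt_map_locks *ml)
{
	int i;

	for (i = 0; i < MAP_LOCK_STRIPES; i++)
		spin_lock_init(&ml->lock[i]);
}

/* Pick the stripe that guards a given premap block. */
static inline spinlock_t *map_lock_for(struct btt_map_locks *ml, u32 premap)
{
	return &ml->lock[premap & (MAP_LOCK_STRIPES - 1)];
}

/* A 4B in-place map entry update holds only its own stripe's lock. */
static void btt_map_update(struct btt_map_locks *ml, __le32 *map,
			   u32 premap, u32 new_entry)
{
	spinlock_t *lock = map_lock_for(ml, premap);

	spin_lock(lock);
	map[premap] = cpu_to_le32(new_entry);
	spin_unlock(lock);
}

With 128 stripes instead of a single lock, the expected contention on
any one lock for a uniform random workload should drop by roughly two
orders of magnitude.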
> This occurs for fio to raw /dev/ndN devices, ddpt over ext4 or xfs,
> cp -R of large directories, and running make on the linux kernel.
>
> Some specific results:
>
> fio 4 KiB random reads, WC cache type, memcpy:
> * 43175 MB/s, 8 M IOPS pmem0 and pmem1
> * 18500 MB/s, 1.5 M IOPS nd0 and nd1
>
> fio 4 KiB random reads, WC cache type, memcpy with non-temporal
> loads (when everything is 64-byte aligned):
> * 33814 MB/s, 4.3 M IOPS nd0 and nd1
>
> Zeroing out 32 GiB with ddpt:
> * 19 s, 1800 MiB/s pmem
> * 55 s, 625 MiB/s btt
>
> If btt_make_request needs to stall this much, maybe it'd be better
> to utilize the blk-mq request queues, keeping requests in per-CPU
> queues while they're waiting, and using IPIs for completion
> interrupts when they're finally done.
>
> fio 4 KiB random reads without non-temporal memcpy
> ==================================================
> perf top shows memcpy_erms taking all the time, a function that
> uses REP MOVSB instructions:
>
>  85.78%  [kernel]  [k] memcpy_erms
>   1.21%  [kernel]  [k] _raw_spin_lock
>   0.72%  [nd_btt]  [k] btt_make_request
>   0.67%  [kernel]  [k] do_blockdev_direct_IO
>   0.47%  fio       [.] get_io_u
>
> fio 4 KiB random reads with non-temporal memcpy
> ===============================================
> perf top shows there are still quite a few unaligned accesses
> resulting in legacy memcpy, but about equal time is now spent
> in legacy vs NT memcpy:

I take it memcpy_lnt_st_64 expects 64B-aligned accesses? The BTT will
always have some unaligned accesses - the map updates are 4B long.
Flog updates are 16B long, but about half of them will not start at a
64B boundary due to how the flog is laid out. I'm open to hearing ideas
about how this could be improved. One way, as Dan already said, is to
reduce the number of metadata updates by configuring a larger (4K)
sector size.

>  30.47%  [kernel]  [k] memcpy_erms
>  26.27%  [kernel]  [k] memcpy_lnt_st_64
>   5.37%  [kernel]  [k] _raw_spin_lock
>   2.20%  [kernel]  [k] btt_make_request
>   2.03%  [kernel]  [k] do_blockdev_direct_IO
>   1.41%  fio       [.] get_io_u
>   1.22%  [kernel]  [k] btt_map_read
>   1.15%  [kernel]  [k] pmem_rw_bytes
>   1.01%  [kernel]  [k] nd_btt_rw_bytes
>   0.98%  [kernel]  [k] nd_region_acquire_lane
>   0.89%  fio       [.] get_next_rand_block
>   0.88%  fio       [.] thread_main
>   0.79%  fio       [.] ios_completed
>   0.76%  fio       [.] td_io_queue
>   0.75%  [kernel]  [k] _raw_spin_lock_irqsave
>   0.68%  [kernel]  [k] kmem_cache_free
>   0.66%  [kernel]  [k] kmem_cache_alloc
>   0.59%  [kernel]  [k] __audit_syscall_exit
>   0.57%  [kernel]  [k] aio_complete
>   0.54%  [kernel]  [k] do_io_submit
>   0.52%  [kernel]  [k] _raw_spin_unlock_irqrestore
>
> fio randrw workload
> ===================
> perf top shows that adding writes to the mix brings btt_make_request
> and its cpu_relax() loop to the forefront:

This is good data - we've thought about benchmarking an alternate
strategy for the 'Read Tracking Table': instead of a linear array, use
a hash table (hash new_postmap into an RTT slot). It will add false
waiting due to hash collisions, but writers won't have to walk the
whole array - they can just hash into the right slot. A rough sketch
follows.

Another option, which Dan suggested, is that instead of waiting in the
cpu_relax() loop, we drop the lane and start over with
nd_region_acquire_lane(). That way we wait in the spin_lock stage,
giving other threads/CPUs that resolve to the same spinlock a chance
to run (also sketched below).
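Here is what I have in mind for the hashed RTT (a sketch only -
RTT_VALID matches the flag visible in the annotated loop below, but
the slot count, the hash, and the helper names are invented for this
example). Readers claim the slot their postmap block hashes to; a
writer that wants to reuse a block polls exactly one slot instead of
walking all nfree entries:

#include <linux/atomic.h>
#include <linux/types.h>
#include <asm/processor.h>	/* cpu_relax() */

#define RTT_SLOTS	256		/* assumed, power of two */
#define RTT_VALID	(1U << 31)

static atomic_t rtt[RTT_SLOTS];

static inline unsigned int rtt_hash(u32 postmap)
{
	/* golden-ratio multiply, top bits; any cheap mix would do */
	return (postmap * 0x61C88647U) >> 24;	/* 24 = 32 - log2(RTT_SLOTS) */
}

/*
 * Reader: publish the block about to be read.  Spinning here when
 * another reader holds the slot for a *different* block is the false
 * waiting mentioned above.
 */
static unsigned int rtt_begin_read(u32 postmap)
{
	unsigned int slot = rtt_hash(postmap);

	while (atomic_cmpxchg(&rtt[slot], 0, (int)(RTT_VALID | postmap)) != 0)
		cpu_relax();
	return slot;
}

static void rtt_end_read(unsigned int slot)
{
	atomic_set(&rtt[slot], 0);
}

/* Writer: wait on the one slot the new block hashes to. */
static void rtt_wait_for_readers(u32 new_postmap)
{
	unsigned int slot = rtt_hash(new_postmap);

	while (atomic_read(&rtt[slot]) == (int)(RTT_VALID | new_postmap))
		cpu_relax();
}

A real version would still need the reader to re-verify the map entry
after claiming its slot, as the current code does after setting
rtt[lane].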
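And a sketch of the drop-the-lane variant. nd_region_acquire_lane()
and nd_region_release_lane() are the lane API visible in the profiles
above (I'm assuming the pair takes the nd_region and returns/accepts a
lane number); everything else - get_free_block(), rtt_has_reader(),
do_write_and_map_update() - is a stand-in for the real btt write path:

#include <linux/types.h>

struct arena_info;
struct nd_region;

unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);

/* stand-in helpers for this example */
u32 get_free_block(struct arena_info *arena, unsigned int lane);
bool rtt_has_reader(struct arena_info *arena, u32 postmap);
void do_write_and_map_update(struct arena_info *arena, unsigned int lane,
			     u32 premap, u32 new_postmap,
			     const void *buf, unsigned int len);

static int btt_write_pg_sketch(struct nd_region *nd_region,
			       struct arena_info *arena, u32 premap,
			       const void *buf, unsigned int len)
{
	unsigned int lane;
	u32 new_postmap;

retry:
	lane = nd_region_acquire_lane(nd_region);
	new_postmap = get_free_block(arena, lane);

	if (rtt_has_reader(arena, new_postmap)) {
		/*
		 * A reader still holds the block we want to reuse:
		 * give the lane back and re-contend on the lane
		 * spinlock instead of spinning in cpu_relax(), so
		 * other threads mapped to this lane can make progress.
		 */
		nd_region_release_lane(nd_region, lane);
		goto retry;
	}

	do_write_and_map_update(arena, lane, premap, new_postmap, buf, len);
	nd_region_release_lane(nd_region, lane);
	return 0;
}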
>  21.09%  [nd_btt]   [k] btt_make_request
>  19.06%  [kernel]   [k] memcpy_erms
>  14.35%  [kernel]   [k] _raw_spin_lock
>  10.38%  [nd_pmem]  [k] memcpy_lnt_st_64
>   1.57%  [kernel]   [k] do_blockdev_direct_IO
>   1.51%  [nd_pmem]  [k] memcpy_lt_snt_64
>   1.43%  [nd_btt]   [k] nd_btt_rw_bytes
>   1.39%  [kernel]   [k] radix_tree_next_chunk
>   1.33%  [kernel]   [k] put_page
>   1.21%  [nd_pmem]  [k] pmem_rw_bytes
>   1.11%  fio        [.] get_io_u
>   0.90%  fio        [.] io_u_queued_complete
>   0.74%  [kernel]   [k] system_call
>   0.72%  [libnd]    [k] nd_region_acquire_lane
>   0.71%  [nd_btt]   [k] btt_map_read
>   0.62%  fio        [.] thread_main
>
> inside btt_make_request:
>
>        |            /* Wait if the new block is being read from */
>        |            for (i = 0; i < arena->nfree; i++)
>   2.98 |          ↓ je     2b4
>   0.05 |            mov    0x60(%r14),%rax
>   0.00 |            mov    %ebx,%edx
>        |            xor    %esi,%esi
>   0.03 |            or     $0x80000000,%edx
>   0.05 |            nop
>        |            while (arena->rtt[i] == (RTT_VALID | new_postmap))
>  22.98 |     290:   mov    %esi,%edi
>   0.01 |            cmp    %edx,(%rax,%rdi,4)
>  30.97 |            lea    0x0(,%rdi,4),%rcx
>  21.05 |          ↓ jne    2ab
>        |            nop
>        |            }
>        |
>        |            /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
>        |            static inline void rep_nop(void)
>        |            {
>        |                    asm volatile("rep; nop" ::: "memory");
>        |     2a0:   pause
>        |            mov    0x60(%r14),%rax
>        |            cmp    (%rax,%rcx,1),%edx
>        |          ↑ je     2a0
>        |            }
>
> ddpt zeroing out
> ================
> perf top shows 27% in spinlocks, and 14% in btt_make_request (all in
> the "wait if the new block is being read from" loop).
>
>  26.48%  [kernel]   [k] _raw_spin_lock
>  14.46%  [nd_btt]   [k] btt_make_request
>  13.14%  [kernel]   [k] memcpy_erms
>  10.34%  [kernel]   [k] copy_user_enhanced_fast_string
>   3.12%  [nd_pmem]  [k] memcpy_lt_snt_64
>   1.15%  [kernel]   [k] __block_commit_write.isra.21
>   0.96%  [nd_pmem]  [k] pmem_rw_bytes
>   0.96%  [nd_btt]   [k] nd_btt_rw_bytes
>   0.86%  [kernel]   [k] unlock_page
>   0.65%  [kernel]   [k] _raw_spin_lock_irqsave
>   0.58%  [kernel]   [k] bdev_read_only
>   0.56%  [kernel]   [k] release_pages
>   0.54%  [nd_pmem]  [k] memcpy_lnt_st_64
>   0.53%  [ext4]     [k] ext4_mark_iloc_dirty
>   0.52%  [kernel]   [k] __wake_up_bit
>   0.52%  [kernel]   [k] __clear_user
>
> ---
> Robert Elliott, HP Server Storage

Thanks, this is great analysis - I'd be quite interested in what these
numbers look like with the two changes Dan mentioned.

	-Vishal