Subject: Performance regression from switching lock to rw-sem for anon-vma tree
From: Tim Chen
To: Ingo Molnar
Cc: Andrea Arcangeli, Mel Gorman, "Shi, Alex", Andi Kleen, Andrew Morton,
    Michel Lespinasse, Davidlohr Bueso, "Wilcox, Matthew R", Dave Hansen,
    Peter Zijlstra, Rik van Riel, linux-kernel@vger.kernel.org, linux-mm
Date: Thu, 13 Jun 2013 16:26:32 -0700
Message-ID: <1371165992.27102.573.camel@schen9-DESK>

Ingo,

When we switched the anon-vma tree's lock from a mutex to an rw-sem
(commit 5a505085), we encountered regressions for fork-heavy workloads.
A number of rw-sem optimizations since then (e.g. lock stealing) helped
to mitigate the problem.

I ran an experiment on the 3.10-rc4 kernel to compare the performance of
the rw-sem implementation against one that uses a mutex, and saw an 8%
throughput regression for rw-sem.

For the experiments, I used the exim mail server workload from the
MOSBENCH test suite on a 4-socket Westmere machine and a 4-socket Ivy
Bridge machine, with the number of clients sending mail equal to the
number of cores.  The mail server forks off a process to handle each
incoming mail and put it into the mail spool, so the lock protecting the
anon-vma tree is stressed by the heavy forking.  On both machines the
mutex implementation had 8% more throughput.  I pinned the CPU frequency
to maximum for all runs.

I've tried two separate tweaks to the rw-sem on 3.10-rc4, testing each
tweak individually (rough sketches of both follow below):

1) Add an owner field that is set while a writer holds the lock, and
introduce optimistic spinning when an active writer is holding the
semaphore.  This reduced context switching by 30%, to a level very close
to the mutex implementation.  However, I did not see any throughput
improvement in exim.

2) When the active field of sem->count is non-zero (i.e. someone is
holding the lock), skip directly to the rwsem_down_write_failed() slow
path, without adding RWSEM_ACTIVE_WRITE_BIAS to sem->count and then
taking it off again, saving us two atomic operations.  Since the slow
path will try lock stealing again later anyway, this should be safe.
Unfortunately it did not improve the exim workload either.
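To make tweak (1) concrete, here is a minimal userspace C11 model of the
idea.  This is not the actual kernel patch: toy_wsem, owner_on_cpu and
toy_block() are illustrative stand-ins for struct rw_semaphore, the new
owner field, and the real wait-list slow path.

/*
 * Tweak (1), modeled in userspace: spin on the lock while the current
 * owner is running on a CPU, and only block once the owner sleeps.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

struct toy_wsem {
	atomic_bool locked;		/* a writer holds the sem */
	atomic_bool owner_on_cpu;	/* that writer is currently running */
};

/* Stand-in for the real blocking slow path (wait list + schedule()). */
static void toy_block(struct toy_wsem *s)
{
	while (atomic_load_explicit(&s->locked, memory_order_acquire))
		sched_yield();		/* models sleeping, not spinning */
}

static void toy_down_write(struct toy_wsem *s)
{
	for (;;) {
		bool expected = false;

		if (atomic_compare_exchange_weak_explicit(&s->locked,
				&expected, true,
				memory_order_acquire, memory_order_relaxed))
			break;				/* lock acquired */

		/*
		 * Optimistic spin: while the owner is on a CPU it is
		 * likely to release soon, so spinning is cheaper than
		 * two context switches.  This is what cut the context
		 * switch rate by ~30% in the experiment above.
		 */
		if (atomic_load(&s->owner_on_cpu))
			continue;	/* keep spinning (cpu_relax() in the kernel) */

		toy_block(s);		/* owner is asleep: block instead */
	}
	atomic_store(&s->owner_on_cpu, true);
}

static void toy_up_write(struct toy_wsem *s)
{
	atomic_store(&s->owner_on_cpu, false);
	atomic_store_explicit(&s->locked, false, memory_order_release);
}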
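Tweak (2) sketched the same way.  The count layout here (TOY_ACTIVE_MASK,
TOY_ACTIVE_WRITE_BIAS) only loosely mimics the 3.10 rwsem definitions,
readers are not modeled, and toy_down_write_failed() is a spinning
stand-in for the real blocking slow path with lock stealing.

/*
 * Tweak (2): probe sem->count before the down_write fast path.  If the
 * active part is already non-zero, the add/sub of the write bias would
 * be wasted work, so go straight to the failed path, which can still
 * steal the lock.  Saves two atomic RMW operations per contended
 * acquisition.
 */
#include <stdatomic.h>

#define TOY_ACTIVE_MASK		0x0000ffffL
#define TOY_WAITING_BIAS	(-0x10000L)
#define TOY_ACTIVE_WRITE_BIAS	(TOY_WAITING_BIAS + 1)

struct toy_sem {
	atomic_long count;	/* 0 when free, as with the real rwsem */
};

/* Toy stand-in for rwsem_down_write_failed(): spin until the active
 * part drains, then steal the lock with a CAS. */
static void toy_down_write_failed(struct toy_sem *s)
{
	for (;;) {
		long c = atomic_load(&s->count);

		if (!(c & TOY_ACTIVE_MASK) &&
		    atomic_compare_exchange_weak(&s->count, &c,
						 c + TOY_ACTIVE_WRITE_BIAS))
			return;		/* stole the lock */
	}
}

static void toy_down_write(struct toy_sem *s)
{
	/* The tweak: peek first, and skip the bias add when busy. */
	long c = atomic_load_explicit(&s->count, memory_order_relaxed);

	if (c & TOY_ACTIVE_MASK) {
		toy_down_write_failed(s);
		return;
	}

	/* Stock fast path: add the bias; on contention, take it off
	 * again (the two atomic ops the probe above avoids). */
	if (atomic_fetch_add(&s->count, TOY_ACTIVE_WRITE_BIAS)
	    + TOY_ACTIVE_WRITE_BIAS != TOY_ACTIVE_WRITE_BIAS) {
		atomic_fetch_sub(&s->count, TOY_ACTIVE_WRITE_BIAS);
		toy_down_write_failed(s);
	}
}

static void toy_up_write(struct toy_sem *s)
{
	atomic_fetch_sub(&s->count, TOY_ACTIVE_WRITE_BIAS);
}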
Any suggestions on the difference between rwsem and mutex performance,
and on possible improvements to recover this regression?

Thanks.

Tim

vmstat for mutex implementation:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd      free  buff  cache   si   so    bi    bo     in     cs us sy id wa st
38  0      0 130957920 47860 199956    0    0     0    56 236342 476975 14 72 14  0  0
41  0      0 130938560 47860 219900    0    0     0     0 236816 479676 14 72 14  0  0

vmstat for rw-sem implementation (3.10-rc4):
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd      free  buff  cache   si   so    bi    bo     in     cs us sy id wa st
40  0      0 130933984 43232 202584    0    0     0     0 321817 690741 13 71 16  0  0
39  0      0 130913904 43232 224812    0    0     0     0 322193 692949 13 71 16  0  0

Profile for mutex implementation:
  5.02%  exim  [kernel.kallsyms]  [k] page_fault
  3.67%  exim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
  2.66%  exim  [kernel.kallsyms]  [k] unmap_single_vma
  2.15%  exim  [kernel.kallsyms]  [k] do_raw_spin_lock
  2.14%  exim  [kernel.kallsyms]  [k] page_cache_get_speculative
  2.04%  exim  [kernel.kallsyms]  [k] copy_page_rep
  1.58%  exim  [kernel.kallsyms]  [k] clear_page_c
  1.55%  exim  [kernel.kallsyms]  [k] cpu_relax
  1.55%  exim  [kernel.kallsyms]  [k] mutex_unlock
  1.42%  exim  [kernel.kallsyms]  [k] __slab_free
  1.16%  exim  [kernel.kallsyms]  [k] mutex_lock
  1.12%  exim  libc-2.13.so       [.] vfprintf
  0.99%  exim  [kernel.kallsyms]  [k] find_vma
  0.95%  exim  [kernel.kallsyms]  [k] __list_del_entry

Profile for rw-sem implementation (3.10-rc4):
  4.88%  exim     [kernel.kallsyms]  [k] page_fault
  3.43%  exim     [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
  2.65%  exim     [kernel.kallsyms]  [k] unmap_single_vma
  2.46%  exim     [kernel.kallsyms]  [k] do_raw_spin_lock
  2.25%  exim     [kernel.kallsyms]  [k] copy_page_rep
  2.01%  exim     [kernel.kallsyms]  [k] page_cache_get_speculative
  1.81%  exim     [kernel.kallsyms]  [k] clear_page_c
  1.51%  exim     [kernel.kallsyms]  [k] __slab_free
  1.12%  exim     libc-2.13.so       [.] vfprintf
  1.06%  exim     [kernel.kallsyms]  [k] __list_del_entry
  1.02%  swapper  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  1.00%  exim     [kernel.kallsyms]  [k] find_vma
  0.93%  exim     [kernel.kallsyms]  [k] mutex_unlock

turbostat for mutex implementation:
pk cor CPU    %c0   GHz  TSC    %c1   %c3   %c6  CTMP  %pc3  %pc6
            82.91  2.39 2.39  11.65  2.76  2.68    51  0.00  0.00

turbostat for rw-sem implementation (3.10-rc4):
pk cor CPU    %c0   GHz  TSC    %c1   %c3   %c6  CTMP  %pc3  %pc6
            80.10  2.39 2.39  14.96  2.80  2.13    52  0.00  0.00