From: Waiman Long
To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, "H. Peter Anvin"
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch, Waiman Long
Subject: [PATCH 0/7] locking/qspinlock: Enhance pvqspinlock & introduce queued unfair lock
Date: Sat, 11 Jul 2015 16:36:51 -0400
Message-Id: <1436647018-49734-1-git-send-email-Waiman.Long@hp.com>

This patchset consists of two parts:

1) Patches 1-5 enhance the performance of PV qspinlock, especially for
   overcommitted guests. The first patch moves all CPU kicking to the
   unlock code. The 2nd and 3rd patches implement kick-ahead and
   wait-early mechanisms that were shown to improve performance for
   overcommitted guests; they are inspired by the "Do Virtual Machines
   Really Scale?" blog post from Sanidhya Kashyap. The 4th patch adds
   code to collect PV qspinlock statistics. The last patch adds pending
   bit support to PV qspinlock to improve performance at light load.
   This is important because the PV queuing code has even higher
   overhead than the native queuing code.

2) Patch 6 introduces a queued unfair lock as a replacement for the
   existing unfair byte lock. The queued unfair lock is fairer than the
   byte lock currently in the qspinlock while improving performance at
   high contention levels. Patch 7 adds a kernel command line option to
   KVM for disabling PV spinlocks, similar to the one in Xen, if the
   administrators choose to do so.
The last patch adds statistics collection to the queued unfair lock code.

Linux kernel builds were run in KVM guests on an 8-socket, 4 cores/socket
Westmere-EX system and a 4-socket, 8 cores/socket Haswell-EX system, so
both systems have 32 physical CPUs. VM guests (no NUMA pinning) were set
up with 32, 48 and 60 vCPUs. The kernel build times (make -j <n>, where
<n> was the number of vCPUs) on the various configurations were as
follows:

Westmere-EX (8x4):

  Kernel                  32 vCPUs   48 vCPUs   60 vCPUs
  ------                  --------   --------   --------
  pvticketlock (4.1.1)     5m02.0s   13m27.6s   15m49.9s
  pvqspinlock (4.2-rc1)    3m39.9s   11m17.8s   12m19.9s
  patched pvqspinlock      3m38.5s    9m27.8s    9m39.4s
  unfair byte lock         4m23.8s    7m14.7s    8m50.4s
  unfair queued lock       3m03.4s    3m29.7s    4m15.4s

Haswell-EX (4x8):

  Kernel                  32 vCPUs   48 vCPUs   60 vCPUs
  ------                  --------   --------   --------
  pvticketlock (4.1.1)     1m58.9s   18m57.0s   20m46.1s
  pvqspinlock (4.2-rc1)    1m59.9s   18m44.2s   18m57.0s
  patched pvqspinlock      2m01.7s    8m03.7s    8m29.5s
  unfair byte lock         2m04.5s    2m46.7s    3m15.6s
  unfair queued lock       1m59.4s    2m04.9s    2m18.6s

It can be seen that the queued unfair lock has the best performance in
almost all cases. As shown in patch 4, the overhead of PV kicking and
waiting is quite high; the unfair locks avoid that overhead and spend
the time on productive work instead. On the other hand, the pvqspinlock
is fair while the byte lock is not. The queued unfair lock sits between
those two: not as fair as the pvqspinlock, but fairer than the byte
lock.

Looking at the PV locks, the pvqspinlock patches increased performance
in the overcommitted guests by about 20% on Westmere-EX and more than
2X on Haswell-EX. More investigation may be needed to find out why
there was a slowdown on Haswell-EX compared with Westmere-EX.

In conclusion, an unfair lock actually performs better when a VM guest
is overcommitted. If there is no overcommitment, the PV locks work
fine, too.
When the VM guest was changed to NUMA pinned (direct mapping between
physical and virtual CPUs) in the Westmere-EX system, the build times
became:

  Kernel                  32 vCPUs
  ------                  --------
  pvticketlock (4.1.1)     2m47.1s
  pvqspinlock (4.2-rc1)    2m45.9s
  patched pvqspinlock      2m45.2s
  unfair byte lock         2m45.4s
  unfair queued lock       2m44.9s

It can be seen that the build times are virtually the same for all the
configurations.

Waiman Long (7):
  locking/pvqspinlock: Only kick CPU at unlock time
  locking/pvqspinlock: Allow vCPUs kick-ahead
  locking/pvqspinlock: Implement wait-early for overcommitted guest
  locking/pvqspinlock: Collect slowpath lock statistics
  locking/pvqspinlock: Add pending bit support
  locking/qspinlock: A fairer queued unfair lock
  locking/qspinlock: Collect queued unfair lock slowpath statistics

 arch/x86/Kconfig                    |   8 +
 arch/x86/include/asm/qspinlock.h    |  17 +-
 kernel/locking/qspinlock.c          | 140 ++++++++++-
 kernel/locking/qspinlock_paravirt.h | 436 ++++++++++++++++++++++++++++++++---
 kernel/locking/qspinlock_unfair.h   | 327 ++++++++++++++++++++++++++
 5 files changed, 880 insertions(+), 48 deletions(-)
 create mode 100644 kernel/locking/qspinlock_unfair.h