Date: Mon, 20 Aug 2018 16:02:23 -0400
From: "J. Bruce Fields"
To: Martin Wilck
Cc: Jeff Layton, NeilBrown, Alexander Viro,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
Message-ID: <20180820200223.GG5468@fieldses.org>
References: <153369219467.12605.13472423449508444601.stgit@noble>
	<20180808182959.GB23873@fieldses.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Aug 20, 2018 at 01:02:21PM +0200, Martin Wilck wrote:
> On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote:
> > On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> > > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > > > If you have a many-core machine, and have many threads all wanting
> > > > to briefly lock a given file (udev is known to do this), you can
> > > > get quite poor performance.
> > > >
> > > > When one thread releases a lock, it wakes up all other threads that
> > > > are waiting (classic thundering herd) - one will get the lock and
> > > > the others go back to sleep.
> > > > When you have few cores, this is not very noticeable: by the time
> > > > the 4th or 5th thread gets enough CPU time to try to claim the
> > > > lock, the earlier threads have claimed it, done what was needed,
> > > > and released.  With 50+ cores, the contention can easily be
> > > > measured.
> > > >
> > > > This patchset creates a tree of pending lock requests in which
> > > > siblings don't conflict and each lock request does conflict with
> > > > its parent.  When a lock is released, only requests which don't
> > > > conflict with each other are woken.
> > > >
> > > > Testing shows that lock-acquisitions-per-second is now fairly
> > > > stable even as the number of contending processes goes to 1000.
> > > > Without this patch, locks-per-second drops off steeply after a few
> > > > tens of processes.
> > > >
> > > > There is a small cost to this extra complexity.
> > > > At 20 processes running a particular test on 72 cores, lock
> > > > acquisitions per second drop from 1.8 million to 1.4 million with
> > > > this patch.  For 100 processes, this patch still provides 1.4
> > > > million while without this patch there are about 700,000.
> > > >
> > > > NeilBrown
> > > >
> > > > ---
> > > >
> > > > NeilBrown (4):
> > > >       fs/locks: rename some lists and pointers.
> > > >       fs/locks: allow a lock request to block other requests.
> > > >       fs/locks: change all *_conflict() functions to return bool.
> > > >       fs/locks: create a tree of dependent requests.
> > > >
> > > >
> > > >  fs/cifs/file.c                  |    2 -
> > > >  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
> > > >  include/linux/fs.h              |    5 +
> > > >  include/trace/events/filelock.h |   16 ++--
> > > >  4 files changed, 103 insertions(+), 62 deletions(-)
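
To make the structure described in that cover letter a bit more concrete,
here is a rough, self-contained C sketch of the idea.  The names
(struct lock_request, conflicts(), queue_waiter(), wake_top_level()) are
invented for illustration and are not the actual fs/locks.c code in this
series; the point is only that a waiter which conflicts with an
already-queued waiter gets parked beneath it, so an unlock wakes just one
sibling list whose members cannot conflict with each other.

/* Rough sketch only: invented names, not the fs/locks.c code from this
 * series.  Waiters that conflict with an already-queued waiter are
 * parked beneath it, so every sibling list is conflict-free and an
 * unlock only needs to wake one sibling list. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct lock_request {
	int owner;			/* hypothetical lock owner id      */
	long start, end;		/* byte range requested            */
	struct lock_request *parent;	/* waiter we are parked beneath    */
	struct lock_request *child;	/* first waiter parked beneath us  */
	struct lock_request *sibling;	/* next waiter on the same level   */
};

/* Two requests conflict when the owners differ and the ranges overlap. */
static bool conflicts(const struct lock_request *a,
		      const struct lock_request *b)
{
	return a->owner != b->owner && a->start <= b->end && b->start <= a->end;
}

/* Queue a waiter: descend beneath the first waiter it conflicts with,
 * so that no two waiters on the same level ever conflict. */
static void queue_waiter(struct lock_request **level, struct lock_request *req)
{
	struct lock_request *w;

	for (w = *level; w; w = w->sibling) {
		if (conflicts(w, req)) {
			req->parent = w;
			queue_waiter(&w->child, req);
			return;
		}
	}
	req->sibling = *level;		/* conflict-free here: join this level */
	*level = req;
}

/* On unlock, wake only the top level; by construction none of these
 * waiters can conflict with each other. */
static void wake_top_level(struct lock_request *level,
			   void (*wake_fn)(struct lock_request *))
{
	struct lock_request *w;

	for (w = level; w; w = w->sibling)
		wake_fn(w);
}

static void wake_one(struct lock_request *req)
{
	printf("wake owner %d [%ld..%ld]\n", req->owner, req->start, req->end);
}

int main(void)
{
	struct lock_request a = { .owner = 1, .start = 0, .end = 99 };
	struct lock_request b = { .owner = 2, .start = 0, .end = 99 };
	struct lock_request c = { .owner = 3, .start = 200, .end = 299 };
	struct lock_request *waiters = NULL;

	queue_waiter(&waiters, &a);
	queue_waiter(&waiters, &b);	/* conflicts with a: parked under it  */
	queue_waiter(&waiters, &c);	/* no conflict: stays on the top level */

	/* Only a and c are woken; b stays asleep until a is done. */
	wake_top_level(waiters, wake_one);
	return 0;
}

How a woken waiter's own children are handled afterwards (presumably they
are promoted or re-queued once it finishes) is the business of the last
patch in the series; the sketch only shows the queuing invariant.
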
> > > Nice work! I looked over this and I think it looks good.
> > >
> > > I made an attempt to fix this issue several years ago, but my method
> > > sucked as it ended up penalizing the unlocking task too much.  This
> > > is much cleaner and should scale well overall, I think.
> >
> > I think I also took a crack at this at one point while I was at
> > UM/CITI and never got anything I was happy with.  Looks like good
> > work!
> >
> > I remember one main obstacle was that I felt like I never had a good
> > benchmark....
> >
> > How did you choose this workload and hardware?  Was it in fact udev
> > (booting a large machine?), or was there some other motivation?
>
> Some details can be found here:
>
> https://github.com/systemd/systemd/pull/9551
> https://github.com/systemd/systemd/pull/8667#issuecomment-385520335
>
> and comments further down.  8667 has been superseded by 9551.
>
> The original problem was that the symlink "/dev/disk/by-partlabel/primary"
> may be claimed by _many_ devices on big systems under certain
> distributions, which use older versions of parted for partition creation
> on GPT disk labels.  I've seen systems with literally thousands of
> contenders for this symlink.
>
> We found that with current systemd, this can cause a boot-time race
> where the wrong device is eventually assigned as the "best" contender
> for the symlink (e.g. a partition on a multipath member rather than a
> partition on the multipath map itself).  I extended the udev test suite,
> creating a test that makes this race easily reproducible, at least on
> systems with many CPUs (the test host I used most had 72 cores).
>
> I created a udev patch that would use systemd's built-in fcntl-based
> locking to avoid this race, but I found that it would massively slow
> down the system, and found the contention to be in the spin locks in
> posix_lock_common().  (I therefore added more patches to systemd to make
> the locking scale better, but that's irrelevant for the kernel-side
> discussion.)
>
> I further created an artificial test just for the scaling of
> fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the
> scaling problem easily and do some quantitative experiments.  My tests
> didn't use any byte ranges, only "full" locking of 0-byte files.

Thanks for the explanation!

I wonder whether there's also anything we could do to keep every waiter
from having to take the same spinlock.

--b.

> > Not that I'm likely to do it any time soon, but could you share
> > sufficient details for someone else to reproduce your results?
> >
> > --b.
>
> The udev test code can be found in the links above.  It adds a new
> script "test/sd-script.py" that is run after "test/sys-script.py" with a
> numeric argument indicating the number of contenders for the test link
> to be created, such as "python test/sd-script.py test 1000".  The next
> step is to run "test/udev-test.pl 152", e.g. under perf (152 is the test
> ID of the scaling test).
>
> Of course I can also share my other test program if you wish.
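
Since that test program isn't posted here, the following is only a guess
at what a minimal version of the fcntl(F_OFD_SETLKW) scaling test Martin
describes might look like: a number of processes, each with its own open
file description on the same 0-byte file, repeatedly taking and dropping
a whole-file write lock.  The path and the NPROC/ITERS values are
arbitrary.

/* Illustrative guess at the kind of scaling test described above, not
 * the real program: NPROC processes contend for a whole-file OFD write
 * lock on one 0-byte file. */
#define _GNU_SOURCE		/* F_OFD_SETLKW needs this with glibc */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 100
#define ITERS 10000

static void worker(const char *path)
{
	/* Each child opens the file itself: OFD locks belong to the open
	 * file description, so sharing the parent's fd would mean no
	 * contention at all. */
	int fd = open(path, O_RDWR);
	struct flock fl = { .l_whence = SEEK_SET, .l_start = 0, .l_len = 0 };
	int i;

	if (fd < 0) {
		perror("open");
		_exit(1);
	}
	for (i = 0; i < ITERS; i++) {
		fl.l_type = F_WRLCK;
		if (fcntl(fd, F_OFD_SETLKW, &fl) < 0) {
			perror("F_OFD_SETLKW");
			_exit(1);
		}
		/* briefly "hold" the lock, then drop it */
		fl.l_type = F_UNLCK;
		fcntl(fd, F_OFD_SETLKW, &fl);
	}
	_exit(0);
}

int main(void)
{
	const char *path = "/tmp/lock-scaling-test";
	int p;

	close(open(path, O_RDWR | O_CREAT, 0600));	/* 0-byte lock file */
	for (p = 0; p < NPROC; p++)
		if (fork() == 0)
			worker(path);
	while (wait(NULL) > 0)
		;
	return 0;
}

The flock() side of such a test can be exercised the same way by
replacing the fcntl() calls with flock(fd, LOCK_EX) and
flock(fd, LOCK_UN).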