Subject: Re: [PATCH 0/4] locks: avoid thundering-herd wake-ups
From: Martin Wilck
To: "J. Bruce Fields", Jeff Layton
Cc: NeilBrown, Alexander Viro, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org
Date: Mon, 20 Aug 2018 13:02:21 +0200
In-Reply-To: <20180808182959.GB23873@fieldses.org>
References: <153369219467.12605.13472423449508444601.stgit@noble>
    <20180808182959.GB23873@fieldses.org>

On Wed, 2018-08-08 at 14:29 -0400, J. Bruce Fields wrote:
> On Wed, Aug 08, 2018 at 12:47:22PM -0400, Jeff Layton wrote:
> > On Wed, 2018-08-08 at 11:51 +1000, NeilBrown wrote:
> > > If you have a many-core machine, and have many threads all wanting
> > > to briefly lock a given file (udev is known to do this), you can
> > > get quite poor performance.
> > >
> > > When one thread releases a lock, it wakes up all other threads that
> > > are waiting (classic thundering-herd) - one will get the lock and
> > > the others go to sleep.
> > > When you have few cores, this is not very noticeable: by the time
> > > the 4th or 5th thread gets enough CPU time to try to claim the
> > > lock, the earlier threads have claimed it, done what was needed,
> > > and released.  With 50+ cores, the contention can easily be
> > > measured.
> > >
> > > This patchset creates a tree of pending lock requests in which
> > > siblings don't conflict and each lock request does conflict with
> > > its parent.  When a lock is released, only requests which don't
> > > conflict with each other are woken.
> > >
> > > Testing shows that lock-acquisitions-per-second is now fairly
> > > stable even as the number of contending processes goes to 1000.
> > > Without this patch, locks-per-second drops off steeply after a few
> > > tens of processes.
> > >
> > > There is a small cost to this extra complexity.
> > > At 20 processes running a particular test on 72 cores, the lock
> > > acquisitions per second drop from 1.8 million to 1.4 million with
> > > this patch.  For 100 processes, this patch still provides 1.4
> > > million, while without this patch there are about 700,000.
> > >
> > > NeilBrown
> > >
> > > ---
> > >
> > > NeilBrown (4):
> > >       fs/locks: rename some lists and pointers.
> > >       fs/locks: allow a lock request to block other requests.
> > >       fs/locks: change all *_conflict() functions to return bool.
> > >       fs/locks: create a tree of dependent requests.
> > >
> > >
> > >  fs/cifs/file.c                  |    2 -
> > >  fs/locks.c                      |  142 +++++++++++++++++++++++++--------------
> > >  include/linux/fs.h              |    5 +
> > >  include/trace/events/filelock.h |   16 ++--
> > >  4 files changed, 103 insertions(+), 62 deletions(-)
> >
> > Nice work! I looked over this and I think it looks good.
> >
> > I made an attempt to fix this issue several years ago, but my method
> > sucked as it ended up penalizing the unlocking task too much.  This
> > is much cleaner and should scale well overall, I think.
>
> I think I also took a crack at this at one point while I was at
> UM/CITI and never got anything I was happy with.  Looks like good
> work!
>
> I remember one main obstacle was that I felt like I never had a good
> benchmark....
>
> How did you choose this workload and hardware?  Was it in fact udev
> (booting a large machine?), or was there some other motivation?

Some details can be found here:

https://github.com/systemd/systemd/pull/9551
https://github.com/systemd/systemd/pull/8667#issuecomment-385520335

and comments further down. 8667 has been superseded by 9551.

The original problem was that the symlink "/dev/disk/by-partlabel/primary"
may be claimed by _many_ devices on big systems under certain
distributions, which use older versions of parted for partition creation
on GPT disk labels. I've seen systems with literally thousands of
contenders for this symlink.

We found that with current systemd, this can cause a boot-time race in
which the wrong device eventually wins as the "best" contender for the
symlink (e.g. a partition on a multipath member rather than a partition
on the multipath map itself). I extended the udev test suite, creating a
test that makes this race easily reproducible, at least on systems with
many CPUs (the test host I used most had 72 cores).

I created a udev patch that would use systemd's built-in fcntl-based
locking to avoid this race, but I found that it would massively slow
down the system, and found the contention to be in the spin locks in
posix_lock_common(). (I therefore added more patches to systemd to make
the locking scale better, but that's irrelevant for the kernel-side
discussion.)

I further created an artificial test just for the scaling of
fcntl(F_OFD_SETLKW) and flock(), with which I could reproduce the
scaling problem easily and do some quantitative experiments. My tests
didn't use any byte ranges, only "full" locking of 0-byte files.
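For illustration, a minimal sketch of what a test of that shape looks
like (this is not my actual test program; the lock file path, NPROC and
NITER below are arbitrary placeholders): many forked processes each
repeatedly take and drop a blocking F_OFD_SETLKW write lock over the
whole of one empty file, which is exactly the workload the cover letter
describes.

/* Sketch only: contend on one whole-file OFD lock from many processes. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 100   /* number of contending processes (placeholder) */
#define NITER 10000 /* lock/unlock cycles per process (placeholder) */

static void contend(const char *path)
{
    struct flock fl;
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0) {
        perror("open");
        exit(1);
    }
    for (int i = 0; i < NITER; i++) {
        memset(&fl, 0, sizeof(fl));   /* l_pid must be 0 for OFD locks */
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;                 /* length 0 = lock the whole file */

        /* blocking acquire - this is where the herd piles up */
        if (fcntl(fd, F_OFD_SETLKW, &fl) < 0) {
            perror("F_OFD_SETLKW");
            exit(1);
        }
        fl.l_type = F_UNLCK;
        if (fcntl(fd, F_OFD_SETLK, &fl) < 0) {
            perror("unlock");
            exit(1);
        }
    }
    close(fd);
    exit(0);
}

int main(void)
{
    const char *path = "/tmp/lock-scaling-test";  /* placeholder path */

    for (int i = 0; i < NPROC; i++) {
        pid_t pid = fork();

        if (pid == 0)
            contend(path);
        if (pid < 0) {
            perror("fork");
            return 1;
        }
    }
    while (wait(NULL) > 0)
        ;   /* time the whole run externally, e.g. with perf */
    return 0;
}

With NPROC in the hundreds on a many-core machine, a test of this shape
should show the drop in lock acquisitions per second discussed above.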
> Not that I'm likely to do it any time soon, but could you share
> sufficient details for someone else to reproduce your results?
>
> --b.

The udev test code can be found in the above links. It adds a new script
"test/sd-script.py" that would be run after "test/sys-script.py", with a
numeric argument indicating the number of contenders to be created for
the test link, such as "python test/sd-script.py test 1000". The next
step would be running "test/udev-test.pl 152", e.g. under perf (152 is
the test ID of the scaling test).

Of course I can also share my other test program if you so desire.

Regards,
Martin