Subject: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
From: Ian Kent
To: Greg Kroah-Hartman
Cc: Tejun Heo, Stephen Rothwell, Andrew Morton, Al Viro, Rick Lindsley,
    David Howells, Miklos Szeredi, linux-fsdevel, Kernel Mailing List
Date: Wed, 17 Jun 2020 15:37:43 +0800
Message-ID: <159237905950.89469.6559073274338175600.stgit@mickey.themaw.net>
X-Mailing-List: linux-kernel@vger.kernel.org

For very large IBM Power mainframe systems with hundreds of CPUs and TBs
of RAM, booting can take a very long time.

Initial reports showed that booting a configuration of several hundred
CPUs and 64TB of RAM would take more than 30 minutes and require kernel
parameters of udev.children-max=1024 and
systemd.default_timeout_start_sec=3600 to prevent dropping into
emergency mode.

Gathering information about what's happening during the boot is a bit
challenging, but two main issues appeared to be: a large number of path
lookups for non-existent files, and very high lock contention in the VFS
during path walks, particularly in the dentry allocation code path.

The underlying cause of this was thought to be the sheer number of sysfs
memory objects, 100,000+ for a 64TB memory configuration, as the hardware
divides the memory into 256MB logical blocks. This is believed to be due
to either IBM Power hardware design or a requirement of the mainframe
software used to create logical partitions (LPARs, which are used to
install an operating system to provide services), since these can be
made up of a wide range of resources: CPUs, memory, disks, etc.

It's not yet clear whether the creation of sysfs nodes for these memory
devices can be postponed or spread out over a longer period of time.
That's because the high overhead looks to be due to notifications
received by udev, which invokes a systemd program for each of them, and
attempts by the systemd folks to improve this have not focused on
changing the handling of these notifications, possibly because of
difficulties with doing so. This remains an avenue of investigation.

Kernel traces show there are many path walks, with a fairly large portion
of those for non-existent paths.
However, looking at the systemd code invoked by the udev action, it
appears there's only one additional lookup for each invocation, so the
large number of negative lookups is most likely due to the large number
of notifications rather than a fault with the systemd program.

The series here tries to reduce the locking needed during path walks,
based on the assumption that there are many path walks with a fairly
large portion of those for non-existent paths, as described above.

That was done by adding kernfs negative dentry caching (for non-existent
paths) to avoid the continual alloc/free cycle of dentries, and by
introducing a read/write semaphore to increase kernfs concurrency during
path walks.

With these changes we still need kernel parameters of
udev.children-max=2048 and systemd.default_timeout_start_sec=300 for
the fastest boot times of under 5 minutes.

There may be opportunities for further improvements, but the series here
has seen a fair amount of testing, and of thinking about what those
further improvements could be. Discussing it with Rick Lindsley, I
suspect further improvements will be harder to implement for somewhat
less gain, so I think what we have here is a good start for now.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the
  attribute handling code.

---

Ian Kent (6):
      kernfs: switch kernfs to use an rwsem
      kernfs: move revalidate to be near lookup
      kernfs: improve kernfs path resolution
      kernfs: use revision to identify directory node changes
      kernfs: refactor attr locking
      kernfs: make attr_mutex a local kernfs node lock

 fs/kernfs/dir.c             |  284 ++++++++++++++++++++++++++++---------------
 fs/kernfs/file.c            |    4 -
 fs/kernfs/inode.c           |   58 +++++----
 fs/kernfs/kernfs-internal.h |   29 ++++
 fs/kernfs/mount.c           |   12 +-
 fs/kernfs/symlink.c         |    4 -
 include/linux/kernfs.h      |    7 +
 7 files changed, 259 insertions(+), 139 deletions(-)

--
Ian