Subject: Re: [RFC] Thing 1: Shardmap fox Ext4
To: Andreas Dilger
Cc: "Theodore Y. Ts'o", linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, OGAWA Hirofumi
From: Daniel Phillips
Date: Wed, 4 Dec 2019 13:44:30 -0800

On 2019-12-04 10:31 a.m., Andreas Dilger wrote:
> One important use case that we have for Lustre that is not yet in the
> upstream ext4[*] is the ability to do parallel directory operations.
> This means we can create, lookup, and/or unlink entries in the same
> directory concurrently, to increase parallelism for large directories.

This is a requirement for an upcoming transactional version of user space
Shardmap. In the database world they call it "row locking". I am working
on a hash-based scheme with single-record granularity that maps onto the
existing shard buckets, which should be nice and efficient; maybe a bit
tricky with respect to rehash, but it looks not too bad. Per-shard rwlocks
are a simpler alternative, but might get a bit fiddly if you need to lock
multiple entries in the same directory at the same time, which is required
for mv, is it not? (A rough sketch of this kind of scheme is below.)
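For illustration only, not actual Shardmap code: a minimal user space
sketch of hash-striped record locking, with pthread rwlocks standing in
for whatever the real implementation would use, and with made-up names
(REC_LOCKS, rec_lock_for, and so on). The pair-lock helper shows the
ordering rule that keeps two concurrent renames from deadlocking:

/*
 * Sketch only, not Shardmap code: per-record locks striped by the same
 * name hash that selects the shard bucket.  REC_LOCKS and the function
 * names are hypothetical.
 */
#include <pthread.h>
#include <stdint.h>

#define REC_LOCKS 1024				/* power-of-two lock stripes */

static pthread_rwlock_t rec_lock[REC_LOCKS];

static void rec_locks_init(void)
{
	for (unsigned i = 0; i < REC_LOCKS; i++)
		pthread_rwlock_init(&rec_lock[i], NULL);
}

/* Lookup takes this shared (rdlock); create/unlink take it exclusive. */
static pthread_rwlock_t *rec_lock_for(uint64_t hash)
{
	return &rec_lock[hash & (REC_LOCKS - 1)];
}

/*
 * mv must hold the source and target entry locks together: always take
 * the lower stripe first so two concurrent renames cannot deadlock.
 */
static void rec_wrlock_pair(uint64_t h1, uint64_t h2)
{
	unsigned a = h1 & (REC_LOCKS - 1), b = h2 & (REC_LOCKS - 1);

	if (a == b) {
		pthread_rwlock_wrlock(&rec_lock[a]);	/* one stripe covers both */
	} else {
		pthread_rwlock_wrlock(&rec_lock[a < b ? a : b]);
		pthread_rwlock_wrlock(&rec_lock[a < b ? b : a]);
	}
}

A per-shard variant would presumably be the same picture with the shard
index in place of the full hash, at the cost of coarser granularity.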
> This is implemented by progressively locking the htree root and index
> blocks (typically read-only), then leaf blocks (read-only for lookup,
> read-write for insert/delete). This provides improved parallelism
> as the directory grows in size.

This should be much easier and more efficient with Shardmap, because there
are only three levels: top-level shard array, shard hash bucket, and record
block. Locking applies only to the cache, so there is no need to worry
about a possible upper tier during an incremental "reshard". I think
Shardmap will also split more cleanly across metadata nodes than HTree.

> Will there be some similar ability in Shardmap to have parallel ops?

This work is already in progress for user space Shardmap. If there is also
a kernel use case, then we can just go forward assuming that this work, or
some variation of it, applies to both. We need VFS changes to exploit
parallel dirops in general, I think, as your comment below confirms. Seems
like a good bit of work for somebody; I bet the benchmarks will show well,
suitable grist for a master's thesis I would think.

Fine-grained directory locking may have a small enough footprint in the
Shardmap kernel port that there is no strong argument for getting rid of
it just because the VFS doesn't support it yet. Really, this has the smell
of a VFS flaw (interested in Al's comments...)

> Also, does Shardmap have the ability to shrink as entries are removed?

No shrink so far. What would you suggest, keeping in mind that POSIX+NFS
semantics mean we cannot in general defrag on the fly? I planned to just
hole_punch blocks that happen to become completely empty. This aspect has
so far not gotten attention because, historically, we just never shrink a
directory except via fsck/tools. What would you like to see here? Maybe an
ioctl to invoke directory defrag? A mode bit to indicate we don't care
about persistent telldir cookies? How about automatic defrag that only
runs when the directory open count is zero, plus a flag to disable it?
(A sketch of the hole_punch idea follows below.)
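To make the hole_punch idea concrete, a user space sketch, assuming the
record blocks live in an ordinary file and that the caller has already
determined the block holds no live records; fallocate() with
FALLOC_FL_PUNCH_HOLE is the standard Linux interface for this:

#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>

/*
 * Sketch only: give one completely empty record block back to the
 * filesystem.  Nothing moves, so record offsets (and hence telldir
 * cookies) stay stable; the punched range reads back as zeroes.
 */
static int punch_empty_block(int fd, off_t offset, off_t blocksize)
{
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, blocksize) == -1)
		return -errno;		/* e.g. punching unsupported here */
	return 0;
}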
> [*] we've tried to submit the pdirops patch a couple of times, but the
> main blocker is that the VFS has a single directory mutex and couldn't
> use the added functionality without significant VFS changes.

How significant would it be: really nasty, or just somewhat nasty? I bet
the resulting efficiencies would show up in some general use cases.

> Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD

This URL gives me git://git.whamcloud.com/fs/lustre-release.git/summary,
am I missing something?

Regards,

Daniel