Subject: Re: [RFC] Thing 1: Shardmap fox Ext4
From: Daniel Phillips
To: "Theodore Y. Ts'o"
Cc: linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, OGAWA Hirofumi
Date: Sun, 1 Dec 2019 17:45:05 -0800
Message-ID: <6b6242d9-f88b-824d-afe9-d42382a93b34@phunq.net>
In-Reply-To: <20191127142508.GB5143@mit.edu>
References: <176a1773-f5ea-e686-ec7b-5f0a46c6f731@phunq.net>
    <20191127142508.GB5143@mit.edu>

On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
> (3) It's not particularly well documented...

We regard that as an issue needing attention.
Here is a pretty picture to get started:

   https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format

This needs some explaining. The bottom part of the directory file is a
simple linear range of directory blocks, with a freespace map block
appearing once every 4K blocks or so. This freespace mapping is
somewhat subtle and needs a post of its own, which will come a couple
of posts from now.

The Shardmap index appears at a higher logical address, sufficiently
far above the directory base to accommodate a reasonable number of
record entry blocks below it. We try not to place the index at so high
an address that the radix tree gets extra levels, slowing everything
down.

When the index needs to be expanded, either because some shard exceeded
a threshold number of entries or because the record entry blocks ran
into the bottom of the index, a new index tier with more shards is
created at a higher logical address. The lower index tier is not copied
immediately to the upper tier. Instead, each shard is split
incrementally when an insert pushes it to the threshold. This bounds
the latency of any given insert to the time needed to split one shard,
which we target nominally at less than one millisecond. Thus, Shardmap
takes a modest step in the direction of real time response.

Each index tier is just a simple array of shards, each of which fills
up with 8-byte entries from bottom to top. The count of entries in each
shard is stored separately in a table just below the shard array. So at
shard load time, we can determine rapidly from the count table which
tier a given shard belongs to. There are other advantages to breaking
the shard counts out separately, having to do with the persistent
memory version of Shardmap, interesting details that I will leave for
later.

When all lower tier shards have been deleted, the lower tier may be
overwritten by the expanding record entry block region.

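To make the incremental expansion concrete, here is a toy Python sketch.
It is not Shardmap's code. The names Tier and Index, the 16-bit hash,
the use of the top hash bits to select a shard, the doubling of the
shard count in the new tier, and the None marker for a shard that has
already been split are all assumptions made for illustration. The
sketch also finishes any pending splits when an upper tier shard
overflows, which is a simplification of mine and not something the
description above specifies.

```python
HASH_BITS = 16  # assumed hash width, chosen only for this sketch


class Tier:
    """One index tier: a flat array of shards, each a list of hash keys."""

    def __init__(self, bits):
        self.bits = bits
        # A slot becomes None once its shard has been split into the upper tier.
        self.shards = [[] for _ in range(1 << bits)]

    def slot(self, key):
        # Assumption: the top `bits` bits of the hash select the shard.
        return key >> (HASH_BITS - self.bits)


class Index:
    def __init__(self, bits=1, threshold=4):
        self.threshold = threshold
        self.lower = Tier(bits)
        self.upper = None   # exists only while an expansion is in progress
        self.splits = 0     # count of single-shard splits performed

    def _owner(self, key):
        """Return (tier, slot) of the shard that currently owns `key`."""
        i = self.lower.slot(key)
        if self.lower.shards[i] is not None:
            return self.lower, i
        return self.upper, self.upper.slot(key)

    def _split(self, i):
        """Move one lower tier shard into its two upper tier children."""
        if self.upper is None:
            self.upper = Tier(self.lower.bits + 1)
        for key in self.lower.shards[i]:
            self.upper.shards[self.upper.slot(key)].append(key)
        self.lower.shards[i] = None
        self.splits += 1
        if all(s is None for s in self.lower.shards):
            # Lower tier is empty: the upper tier becomes the only tier.
            self.lower, self.upper = self.upper, None

    def insert(self, key):
        tier, i = self._owner(key)
        if len(tier.shards[i]) >= self.threshold:
            if tier is self.upper:
                # Simplification: complete the pending expansion first.
                pending = [j for j, s in enumerate(self.lower.shards)
                           if s is not None]
                for j in pending:
                    self._split(j)
                i = self.lower.slot(key)
            self._split(i)          # only this one shard is split now
            tier, i = self._owner(key)
        tier.shards[i].append(key)

    def contains(self, key):
        tier, i = self._owner(key)
        return key in tier.shards[i]
```

An insert that finds its shard at the threshold pays for splitting that
one shard only, and the second tier disappears as soon as the last lower
tier shard has been split, so the index is back to a single tier.
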
In practice, a Shardmap file normally has just one tier, the other tier
existing only long enough to complete the incremental expansion of the
shard table, insert by insert.

There is a small header in the lowest record entry block, giving the
positions of the one or two index tiers, the count of entry blocks, and
various tuning parameters such as maximum shard size and average depth
of cache hash collision lists.

That is it for the media format. Very simple, is it not? My next post
will explain the Shardmap directory block format, with a focus on
deficiencies of the traditional Ext2 format that were addressed.

Regards,

Daniel