LinuxLists.cc - Re: [RFC] Thing 1: Shardmap fox Ext4

2019-12-02 01:46:15

Subject: Re: [RFC] Thing 1: Shardmap fox Ext4

On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
> (3) It's not particularly well documented...

We regard that as an issue needing attention. Here is a pretty picture
to get started:

https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format

This needs some explaining. The bottom part of the directory file is
a simple linear range of directory blocks, with a freespace map block
appearing once every 4K blocks or so. This freespace mapping needs a
post of its own, it is somewhat subtle. This will be a couple of posts
in the future.

The Shardmap index appears at a higher logical address, sufficiently
far above the directory base to accommodate a reasonable number of
record entry blocks below it. We try not to place the index at so high
an address that the radix tree gets extra levels, slowing everything
down.

When the index needs to be expanded, either because some shard exceeded
a threshold number of entries, or the record entry blocks ran into the
the bottom of the index, then a new index tier with more shards is
created at a higher logical address. The lower index tier is not copied
immediately to the upper tier, but rather, each shard is incrementally
split when it hits the threshold because of an insert. This bounds the
latency of any given insert to the time needed to split one shard, which
we target nominally at less than one millisecond. Thus, Shardmap takes a
modest step in the direction of real time response.

Each index tier is just a simple array of shards, each of which fills
up with 8 byte entries from bottom to top. The count of entries in each
shard is stored separately in a table just below the shard array. So at
shard load time, we can determine rapidly from the count table which
tier a given shard belongs to. There are other advantages to breaking
the shard counts out separately having to do with the persistent memory
version of Shardmap, interesting details that I will leave for later.

When all lower tier shards have been deleted, the lower tier may be
overwritten by the expanding record entry block region. In practice,
a Shardmap file normally has just one tier most of the time, the other
tier existing only long enough to complete the incremental expansion
of the shard table, insert by insert.

There is a small header in the lowest record entry block, giving the
positions of the one or two index tiers, count of entry blocks, and
various tuning parameters such as maximum shard size and average depth
of cache hash collision lists.

That is it for media format. Very simple, is it not? My next post
will explain the Shardmap directory block format, with a focus on
deficiencies of the traditional Ext2 format that were addressed.

Regards,

Daniel

2019-12-04 15:59:57

by Viacheslav Dubeyko

[permalink] [raw]

Subject: Re: [RFC] Thing 1: Shardmap fox Ext4

On Sun, 2019-12-01 at 17:45 -0800, Daniel Phillips wrote:
> On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
> > (3) It's not particularly well documented...
>
> We regard that as an issue needing attention. Here is a pretty
> picture
> to get started:
>
> https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format
>
> This needs some explaining. The bottom part of the directory file is
> a simple linear range of directory blocks, with a freespace map block
> appearing once every 4K blocks or so. This freespace mapping needs a
> post of its own, it is somewhat subtle. This will be a couple of
> posts
> in the future.
>
> The Shardmap index appears at a higher logical address, sufficiently
> far above the directory base to accommodate a reasonable number of
> record entry blocks below it. We try not to place the index at so
> high
> an address that the radix tree gets extra levels, slowing everything
> down.
>
> When the index needs to be expanded, either because some shard
> exceeded
> a threshold number of entries, or the record entry blocks ran into
> the
> the bottom of the index, then a new index tier with more shards is
> created at a higher logical address. The lower index tier is not
> copied
> immediately to the upper tier, but rather, each shard is
> incrementally
> split when it hits the threshold because of an insert. This bounds
> the
> latency of any given insert to the time needed to split one shard,
> which
> we target nominally at less than one millisecond. Thus, Shardmap
> takes a
> modest step in the direction of real time response.
>
> Each index tier is just a simple array of shards, each of which fills
> up with 8 byte entries from bottom to top. The count of entries in
> each
> shard is stored separately in a table just below the shard array. So
> at
> shard load time, we can determine rapidly from the count table which
> tier a given shard belongs to. There are other advantages to breaking
> the shard counts out separately having to do with the persistent
> memory
> version of Shardmap, interesting details that I will leave for later.
>
> When all lower tier shards have been deleted, the lower tier may be
> overwritten by the expanding record entry block region. In practice,
> a Shardmap file normally has just one tier most of the time, the
> other
> tier existing only long enough to complete the incremental expansion
> of the shard table, insert by insert.
>
> There is a small header in the lowest record entry block, giving the
> positions of the one or two index tiers, count of entry blocks, and
> various tuning parameters such as maximum shard size and average
> depth
> of cache hash collision lists.
>
> That is it for media format. Very simple, is it not? My next post
> will explain the Shardmap directory block format, with a focus on
> deficiencies of the traditional Ext2 format that were addressed.
>

I've tried to take a look into the source code. And it was not easy
try. :) I expected to have the bird-fly understanding from shardmap.h
file. My expectation was to find the initial set of structure
declarations with the good comments. But, frankly speaking, it's very
complicated path for the concept understanding. Even from C++ point of
view, the class declarations look very complicated if there are mixing
of fields with methods declarations. It's tough to read such
implementation.

So, I believe it makes sense to declare the necessary set of structures
in the file's beginning with the good comments. Even it will be good to
split the structure declarations and methods in different files. I
believe it will ease the way to understand the concept. Otherwise, it
will be tough to review such code.

Thanks,
Viacheslav Dubeyko.

2019-12-04 18:24:43

by Andreas Dilger

[permalink] [raw]

Subject: Re: [RFC] Thing 1: Shardmap fox Ext4

On Dec 1, 2019, at 6:45 PM, Daniel Phillips <[email protected]> wrote:
>
> On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
>> (3) It's not particularly well documented...
>
> We regard that as an issue needing attention. Here is a pretty picture
> to get started:
>
> https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format

The shardmap diagram is good conceptually, but it would be useful
to add a legend on the empty side of the diagram that shows the on-disk
structures.

>
> This needs some explaining. The bottom part of the directory file is
> a simple linear range of directory blocks, with a freespace map block
> appearing once every 4K blocks or so. This freespace mapping needs a
> post of its own, it is somewhat subtle. This will be a couple of posts
> in the future.
>
> The Shardmap index appears at a higher logical address, sufficiently
> far above the directory base to accommodate a reasonable number of
> record entry blocks below it. We try not to place the index at so high
> an address that the radix tree gets extra levels, slowing everything
> down.
>
> When the index needs to be expanded, either because some shard exceeded
> a threshold number of entries, or the record entry blocks ran into the
> the bottom of the index, then a new index tier with more shards is
> created at a higher logical address. The lower index tier is not copied
> immediately to the upper tier, but rather, each shard is incrementally
> split when it hits the threshold because of an insert. This bounds the
> latency of any given insert to the time needed to split one shard, which
> we target nominally at less than one millisecond. Thus, Shardmap takes a
> modest step in the direction of real time response.
>
> Each index tier is just a simple array of shards, each of which fills
> up with 8 byte entries from bottom to top. The count of entries in each
> shard is stored separately in a table just below the shard array. So at
> shard load time, we can determine rapidly from the count table which
> tier a given shard belongs to. There are other advantages to breaking
> the shard counts out separately having to do with the persistent memory
> version of Shardmap, interesting details that I will leave for later.
>
> When all lower tier shards have been deleted, the lower tier may be
> overwritten by the expanding record entry block region. In practice,
> a Shardmap file normally has just one tier most of the time, the other
> tier existing only long enough to complete the incremental expansion
> of the shard table, insert by insert.
>
> There is a small header in the lowest record entry block, giving the
> positions of the one or two index tiers, count of entry blocks, and
> various tuning parameters such as maximum shard size and average depth
> of cache hash collision lists.
>
> That is it for media format. Very simple, is it not? My next post
> will explain the Shardmap directory block format, with a focus on
> deficiencies of the traditional Ext2 format that were addressed.
>
> Regards,
>
> Daniel

Cheers, Andreas

Attachments:

signature.asc (890.00 B)
Message signed with OpenPGP

2019-12-05 09:46:48

by Daniel Phillips

[permalink] [raw]

Subject: Re: [RFC] Thing 1: Shardmap fox Ext4

On 2019-12-04 7:55 a.m., Vyacheslav Dubeyko wrote:
>> That is it for media format. Very simple, is it not? My next post
>> will explain the Shardmap directory block format, with a focus on
>> deficiencies of the traditional Ext2 format that were addressed.
>
> I've tried to take a look into the source code. And it was not easy
> try. :)

Let's see what we can do about that, starting with removing the duopack
(media index entry) and tripack (cache index entry) templates. Now that
the design has settled down we don't need that level of generality so
much any more. The replacements are mostly C-style now and by the time
the Tux3 kernel port is done, will be officially C.

So far I only described the media format, implemented in define_layout(),
which I hope is self explanatory. You should be able to tie it back to
this diagram pretty easily.

https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format

> I expected to have the bird-fly understanding from shardmap.h
> file. My expectation was to find the initial set of structure
> declarations with the good comments.

Our wiki is slowly getting populated with design documentation. Most of
what you see in shardmap.h is concerned with the Shardmap cache form,
where all the action happens. I have not said much about that yet, but
there is a post on the way. The main structures are struct shard (a
self contained hash table) and struct keymap (a key value store
populated with shards). Those are obvious I think, please correct me
if I am wrong. A more tricky one is struct tier, which implements our
incremental hash table expansion. You might expect that to be a bit
subtle, and it is.

Before getting into those details, there is an upcoming post about
the record block format, which is pretty non-abstract and, I think,
easy enough to understand from the API declaration in shardmap.h and
the code in recops.c.

There is a diagram here:

https://github.com/danielbot/Shardmap/wiki/Shardmap-record-block-format

but the post this belongs to is not quite ready to go out yet. That one
will be an interlude before for the cache form discussion, which is
where the magic happens, things like rehash and reshard and add_tier,
and the way the hash code gets chopped up as it runs through the access
stack.

Here is a diagram of the cache structures, very simple:

https://github.com/danielbot/Shardmap/wiki/Shardmap-cache-format

And here is a diagram of the Shardmap three level hashing scheme,
which ties everything together:

https://github.com/danielbot/Shardmap/wiki/Shardmap-hashing-scheme

This needs explanation. It is something new that you won't find in any
textbook, this is the big reveal right here.

> But, frankly speaking, it's very
> complicated path for the concept understanding. Even from C++ point of
> view, the class declarations look very complicated if there are mixing
> of fields with methods declarations.

In each class, fields are declared first, then methods. In the kernel
port of course we will not have classes, and the function names will be
longer as usual.

> So, I believe it makes sense to declare the necessary set of structures
> in the file's beginning with the good comments. Even it will be good to
> split the structure declarations and methods in different files. I
> believe it will ease the way to understand the concept. Otherwise, it
> will be tough to review such code.

Declaring structures and functions in the same file is totally normal
for kernel code, you don't really want these in separate files unless
they break out naturally that way.

This code is dense, there is a lot going on in not very many lines. So
we need lots of lines of documentation to make up for that, which has
not been a priority until now, so please bear with me. And please do
not hesitate to ask specific questions - the answers may well end up in
the wiki.

Regards,

Daniel

2019-12-06 11:48:27

by Viacheslav Dubeyko

[permalink] [raw]

Subject: Re: [RFC] Thing 1: Shardmap fox Ext4

On Thu, 2019-12-05 at 01:46 -0800, Daniel Phillips wrote:
> On 2019-12-04 7:55 a.m., Vyacheslav Dubeyko wrote:
> > >

<snipped and reoredered>

> And here is a diagram of the Shardmap three level hashing scheme,
> which ties everything together:
>
> https://github.com/danielbot/Shardmap/wiki/Shardmap-hashing-scheme
>
> This needs explanation. It is something new that you won't find in
> any
> textbook, this is the big reveal right here.
>

This diagram is pretty good and provides the high-level view of the
whole scheme. But, maybe, it makes sense to show the granularity of
hash code. It looks like the low hash is the hash of a name. Am I
correct? But how the mid- and high- parts of the hash code are defined?
It looks like that cached shard stores LBAs of record entry blocks are
associated with the low hash values. But what does it mean that shard
is cached?

Here is a diagram of the cache structures, very simple:
>
> https://github.com/danielbot/Shardmap/wiki/Shardmap-cache-format
>

This diagram is not easy to relate with the previous one. So, shard
table and shard array are the same entities or not? Or do you mean that
shard table is storeed on the volume but shard array is constructed in
memory?

>There is a diagram here:
>
>
>
https://github.com/danielbot/Shardmap/wiki/Shardmap-record-block-format

I am slightly confused here. Does header be located at the bottom of
the record block? My understanding is that records grow from top of the
block down to the header direction. Am I correct? Why header is not
located at the top of the block with entry dictionary? Any special
purpose here?

Thanks,
Viacheslav Dubeyko.

2019-12-07 00:46:28

by Daniel Phillips

[permalink] [raw]

Subject: Re: [RFC] Thing 1: Shardmap for Ext4

On 2019-12-06 3:47 a.m., Vyacheslav Dubeyko wrote:
> On Thu, 2019-12-05 at 01:46 -0800, Daniel Phillips wrote:
>> On 2019-12-04 7:55 a.m., Vyacheslav Dubeyko wrote:
>>>>
>
> <snipped and reoredered>
>
>> And here is a diagram of the Shardmap three level hashing scheme,
>> which ties everything together:
>>
>> https://github.com/danielbot/Shardmap/wiki/Shardmap-hashing-scheme
>>
>> This needs explanation. It is something new that you won't find in
>> any
>> textbook, this is the big reveal right here.
>>
>
> This diagram is pretty good and provides the high-level view of the
> whole scheme. But, maybe, it makes sense to show the granularity of
> hash code. It looks like the low hash is the hash of a name. Am I
> correct?

Not quite. A 64 bit hash code is computed per name, then divided up into
three parts as shown in the diagram. Each part of the hash addresses a
different level of the Shardmap index hierarchy: high bits address the
top level shard array, giving a pointer to a shard; middle bits address
a hash bucket within that shard; low bits are used to resolve collisions
within the hash bucket (and collisions still may occur even when the low
bits are considered, forcing a record block access and full string
compare.

> But how the mid- and high- parts of the hash code are defined?

Given the above description, does the diagram make sense? If so I will
add this description to the wiki.

> It looks like that cached shard stores LBAs of record entry blocks are
> associated with the low hash values.

Rather, associated with the entire hash value.

> But what does it mean that shard is cached?

This is the cache form of the shard, meaning that the unordered hash/lba
index pairs (duopack) were read in from media and loaded into this cache
object (or newly created by recent directory operations.)

> Here is a diagram of the cache structures, very simple:
>>
>> https://github.com/danielbot/Shardmap/wiki/Shardmap-cache-format
>
> This diagram is not easy to relate with the previous one. So, shard
> table and shard array are the same entities or not?

They are, and I have updated the hashing scheme diagram to refer to both
as "array". I will similarly update the code, which currently calls the
shard array field "map".

> Or do you mean that
> shard table is storeed on the volume but shard array is constructed in
> memory?

Sorry about that, it should be clear now. On the volume, a simple
unordered collection of hash:lba pairs is stored per shard, which is
reorganized into shard cache form (a hash table object) at demand-load
time.

>> There is a diagram here:
>>
> https://github.com/danielbot/Shardmap/wiki/Shardmap-record-block-format
>
> I am slightly confused here. Does header be located at the bottom of
> the record block?

The header (just 32 bytes at the moment, possibly to be expanded to 48
or 64) is stored at the top of the zeroth record entry block, which is
therefore a little smaller than any other record entry block.

> My understanding is that records grow from top of the
> block down to the header direction. Am I correct? Why header is not
> located at the top of the block with entry dictionary? Any special
> purpose here?

That should be clear now. I will add the above descriptive text to the
wiki.

Regards,

Daniel