Hi Ted,
I trust you will find your initial points satisfactorily addressed below.
On 2019-11-27 6:25 a.m., Theodore Y. Ts'o wrote:
> A couple of quick observations about Shardmap.
>
> (1) It's licensed[1] under the GPLv3, so it's not compatible with the
> kernel license. That doesn't matter much for ext4, because...
>
> [1] https://github.com/danielbot/Shardmap/blob/master/LICENSE
The kernel port of Shardmap will (necessarily) be licensed under GPLv2.
> (2) It's implemented as userspace code (e.g., it uses open(2),
> mmap(2), et. al) and using C++, so it would need to be reimplemented
> from scratch for use in the kernel.
Right. Some of these details, like open, are obviously trivial, others
less so. Reimplementing from scratch is an overstatement because the
actual intrusions of user space code are just a small portion of the code
and nearly all abstracted behind APIs that can be implemented as needed
for userspace or kernel in out of line helpers, so that the main source
is strictly unaware of the difference. That said, we can just fork off a
kernel version and not worry about keeping compatibility with user space
if you wish, though putting in the extra effort to make it dual mode
would probably be helpful for e2fsck.
Also, most of this work is already being done for Tux3, so the only
Ext4-specific work needing doing may well be the differences in atomic
commit required to accommodate Ext4's ordered journaling, versus Tux3's
(more precise) delta commit. To that end, we could discuss the atomic
commit strategy that we use for the persistent memory implementation of
Shardmap, which may turn out to be largely applicable to Ext4's journal
transaction model.
> (3) It's not particularly well documented, making the above more
> challenging, but it appears to be a variation of an extensible hashing
> scheme, which was used by dbx and Berkley DB.
Sorry about that. There is this post from a few years back:
https://lkml.org/lkml/2013/6/18/869
And there is a paper in the works. I can also follow up here with a post
on Shardmap internals, a number of which are interesting and unique.
Shardmap (introduced above as "an O(1) extensible hash table") is indeed
an extensible hashing scheme. Fixed size hash tables are impractical for
databases and file system directories because small data sets waste too
much table space and large data sets have too many collisions. Therefore
every such design must incorporate some form of extensibility. Shardmap's
extension scheme is unique, and worthy of note in its own right as a
contribution to hash table technology. We did benchmark against Berkeley
DB and found Shardmap to be markedly faster. I will hunt around for those
numbers.
Very briefly, the Shardmap index has two distinct forms, one optimized
for media and the other for cache. These are bijective, each being
constructable from the other. The media form (the backing store) only
has a single purpose: to reconstruct the cache form on demand, one shard
at a time.
The cache form is the main source of Shardmap's efficiency. This is a
two level hash table with each entry in the top level table being a
pointer to a self contained hash table object. In contrast to other
extensible hashing schemes, these cache shards are not themselves
extensible. Rather, we simply rewrite entire shards into subshards
as needed.
The top level hash table is where the extensibility happens. At some
threshold, the top level table is expanded by duplicating the pointers
to the hash objects so that multiple buckets may reference the same
hash object. When any of those objects passes a threshold number of
entries, it is split into multiple, smaller hash objects, each with a
unique pointer from the top level table. Traversing this two level
table for lookup or existence tests takes just a few nanoseconds.
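In code, the cache side lookup is about as simple as it sounds. Here is a
minimal sketch with invented names (the real code is in the repo linked
above; whether the low or high hash bits select the top level slot is a
detail I am glossing over):

#include <stdint.h>

struct shard;                           /* self contained hash table object */

struct keymap {
        struct shard **map;             /* top level table, 1 << mapbits slots */
        unsigned mapbits;               /* log2 of top level table size */
};

static struct shard *map_shard(struct keymap *keymap, uint64_t hash)
{
        unsigned i = hash & (((uint64_t)1 << keymap->mapbits) - 1);
        return keymap->map[i];          /* several slots may share one shard
                                           until that shard splits */
}

Lookup within the shard is then an ordinary bucket probe on the remaining
hash bits, all in cache, with no block I/O.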
Extending the hash in cache is mirrored by extending the media form,
by serializing the cache shard into multiple linear regions on media.
Now here is the key idea: even taking the cost of this media rewrite
into account, insert performance remains O(1), just with a slightly
higher constant factor. Shardmap exploits this subtle mathematical
fact to get the best of both worlds: O(1) performance like a hash and
extensibility like a BTree.
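(The arithmetic behind that claim is just the usual geometric series
argument: a shard is rewritten only when it splits, and by the time it
splits it has absorbed roughly as many new inserts as the entries it has
to rewrite, so the total copy work over N inserts stays proportional to
N, in other words O(1) per insert amortized.)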
In fact, if you wish to avoid that constant media rewrite factor
entirely, Shardmap lets you do it, by allowing you to specify the
number and size of shards at directory creation time. I have not
benchmarked this, but it could improve average create performance by 20%
or so. However, even with the "extra" media copy, Shardmap still has
impressively high insert performance; in fact, it is significantly
faster than any of the high performance key value stores we have tried
so far.
> (4) Because of (2), we won't be able to do any actual benchmarks for a
> while.
The license is not an issue: the copyright is entirely mine, so it can
be relicensed as convenient. Just indicate where the GPLv2 version should
be posted and I will make it so. Perhaps a new Github repo, or Gitlab?
> I just checked the latest version of Tux3[2], and it appears
> to be be still using a linear search scheme for its directory ---
> e.g., an O(n) lookup ala ext2. So I'm guessing Shardmap may have been
> *designed* for Tux3, but it has not yet been *implemented* for Tux3?
>
> [2] https://github.com/OGAWAHirofumi/linux-tux3/blob/hirofumi/fs/tux3/dir.c#L283
Correct, not yet ported to Tux3; however, this work is in progress. There
are some sticky little points to work out such as how to implement the
largish cache shard objects without using virtual memory. The PAGEMAP
compilation option in the current source breaks those objects up into
pages, essentially doing virtual memory by hand, which will add some
small amount of additional overhead to the kernel version versus the
user space version, nothing to worry about. However it does make me wish
that we had better kernel support for virtual memory.
There are various other kernel porting details that are maybe a bit too
fine grained for this post. Example: Shardmap is a memory mapped db, but
we don't have mmap in kernel, so we must do this by hand also.
> (5) The claim is made that readdir() accesses files sequentially; but
> there is also mention in Shardmap of compressing shards (e.g.,
> rewriting them) to squeeze out deleted and tombstone entries. This
> pretty much guarantees that it will not be possible to satisfy POSIX
> requirements of telldir(2)/seekdir(3) (using a 32-bit or 64-bitt
> cookie), NFS (which also requires use of a 32-bit or 64-bit cookie
> while doing readdir scan), or readdir() semantics in the face of
> directory entries getting inserted or removed from the directory.
No problem, the data blocks are completely separate from the index so
readdir just walks through them in linear order a la classic UFS/Ext2.
What could possibly be simpler, faster or more POSIX compliant?
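To make that concrete, here is roughly what the kernel side could look
like; a sketch only, with the Shardmap specific helpers being hypothetical
names rather than real code:

static int shardmap_iterate(struct file *file, struct dir_context *ctx)
{
        struct inode *dir = file_inode(file);
        unsigned blockbits = dir->i_blkbits, blockmask = (1 << blockbits) - 1;

        /* ctx->pos is a plain byte offset into the record blocks, which
           shard compaction never moves, so the cookie stays valid */
        while (ctx->pos < dir->i_size) {
                struct buffer_head *bh = shardmap_entry_block(dir, ctx->pos >> blockbits); /* hypothetical */
                struct record *rec;

                if (!bh)
                        return -EIO;
                rec = record_at(bh->b_data, ctx->pos & blockmask);      /* hypothetical */
                if (!record_deleted(rec) &&
                    !dir_emit(ctx, rec->name, rec->namelen, rec->inum, rec->type)) {
                        brelse(bh);
                        return 0;       /* user buffer full, resume here later */
                }
                ctx->pos += record_size(rec);   /* hypothetical: on-media record length */
                brelse(bh);
        }
        return 0;
}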
> (To be specific, POSIX requires readdir returns each entry in a
> directory once and only once, and in the case of a directory entry
> which is removed or inserted, that directory entry must be returned
> exactly zero or one times. This is true even if telldir(2) ort
> seekdir(2) is used to memoize a particular location in the directory,
> which means you have a 32-bit or 64-bit cookie to define a particular
> location in the readdir(2) stream. If the file system wants to be
> exportable via NFS, it must meet similar requirements ---- except the
> 32-bit or 64-bit cookie MUST survive a reboot.)
So we finally get to fix this nagging HTree defect after all these
years. Thank you once again for that sweet hack, but with luck we
will be able to obsolete it by this time next year.
Regards,
Daniel
On Wed, Nov 27, 2019 at 02:27:27PM -0800, Daniel Phillips wrote:
> > (2) It's implemented as userspace code (e.g., it uses open(2),
> > mmap(2), et. al) and using C++, so it would need to be reimplemented
> > from scratch for use in the kernel.
>
> Right. Some of these details, like open, are obviously trivial, others
> less so. Reimplementing from scratch is an overstatement because the
> actual intrusions of user space code are just a small portion of the code
> and nearly all abstracted behind APIs that can be implemented as needed
> for userspace or kernel in out of line helpers, so that the main source
> is strictly unaware of the difference.
The use of C++ with templates is presumably one of the "less so"
parts, and it was that which I had in mind when I said,
"reimplementing from scratch".
> Also, most of this work is already being done for Tux3,
Great, when that work is done, we can take a look at the code and
see....
> > (5) The claim is made that readdir() accesses files sequentially; but
> > there is also mention in Shardmap of compressing shards (e.g.,
> > rewriting them) to squeeze out deleted and tombstone entries. This
> > pretty much guarantees that it will not be possible to satisfy POSIX
> > requirements of telldir(2)/seekdir(3) (using a 32-bit or 64-bitt
> > cookie), NFS (which also requires use of a 32-bit or 64-bit cookie
> > while doing readdir scan), or readdir() semantics in the face of
> > directory entries getting inserted or removed from the directory.
>
> No problem, the data blocks are completely separate from the index so
> readdir just walks through them in linear order a la classic UFS/Ext2.
> What could possibly be simpler, faster or more POSIX compliant?
OK, so what you're saying then is for every single directory entry
addition or removal, there must be (at least) two blocks which must be
modified, an (at least one) index block, and a data block, no? That
makes it worse than htree, where most of the time we only need to
modify a single leaf node. We only have to touch an index block when
a leaf node gets full and it needs to be split.
Anyway, let's wait and see how you and Hirofumi-san work out those
details for Tux3, and we can look at that and consider next steps at
that time.
Cheers,
- Ted
On 2019-11-27 6:28 p.m., Theodore Y. Ts'o wrote:
> On Wed, Nov 27, 2019 at 02:27:27PM -0800, Daniel Phillips wrote:
>>> (2) It's implemented as userspace code (e.g., it uses open(2),
>>> mmap(2), et. al) and using C++, so it would need to be reimplemented
>>> from scratch for use in the kernel.
>>
>> Right. Some of these details, like open, are obviously trivial, others
>> less so. Reimplementing from scratch is an overstatement because the
>> actual intrusions of user space code are just a small portion of the code
>> and nearly all abstracted behind APIs that can be implemented as needed
>> for userspace or kernel in out of line helpers, so that the main source
>> is strictly unaware of the difference.
>
> The use of C++ with templates is presumably one of the "less so"
> parts, and it was that which I had in mind when I said,
> "reimplementing from scratch".
Ah, I see what you mean. To be honest, C++ has now become so natural for
me that I don't even think about it. You and Linus really ought to get
some of that :-)
If you look closely, you will notice that my C++ is largely just "C
compiled by C++", and I cheerfully undertake to convert away the few
places where I have simplified my life and sped up development by using
actual idiomatic C++ constructs.
By way of anecdote, coding the current user space version of Shardmap
in C++ cut my development time to go from the pure C prototype to the
world record persistent memory demonstration by a factor of roughly 3.
I now find it far faster to develop in C++ and then translate mindlessly
back to C as necessary, than to slog through the entire development
process using nothing but classic C, a language that really ought to
have done the right thing and stayed back in the century for which it
was created.
But to each his own. Ask for a pure C version licensed under GPLv2
and you shall receive. Note that we already have one here:
https://github.com/OGAWAHirofumi/tux3/blob/master/user/devel/shard.c
Perhaps this will be easier on your eyes. It is essentially the same
thing less the persistent memory support and plus a bug or two.
Ah, one more anecdote. Shardmap implements polymorphic record block
operations, so that the low level record format can be uniquely tailored
to the kind of data being stored. Overlooking the fact that we can
simply remove that mechanism for the kernel port because Ext4 does
not need more than one kind of record format, I can cheerfully
report that the official C++ way of implementing polymorphism using
virtual functions turned out to suck donkey dung compared to the
classic C/kernel way, where function vectors are handled as first
class data objects.
I actually implemented it both ways, but the virtual function way
turned out to be unspeakably painful for various reasons, hard to
read, and hard to modify without having it regularly blow up into
zillions of itty bitty little insane pieces. One day, after sinking
a couple of weeks into finally getting it working the official C++
way, I just threw it all out and recoded in kernel style, which
took about 3 hours, and the result was not only much easier to read
and write, it also generated better machine code.
So there you have it, ammunition to use against C++ if you want it.
But oh wait, it's still C++ isn't it? Why yes it is. C++, just try
it, you'll like it, and nobody is too late to learn it.
But once again, let's be very clear about it: I'm going to remove
*all* the C++ from Shardmap in aid of integrating with Tux3 and
Ext4. So there is no need at all to stay awake at night worrying
about this question.
>> Also, most of this work is already being done for Tux3,
>
> Great, when that work is done, we can take a look at the code and
> see....
Surely there is much to discuss even before the Tux3 kernel port is
completed. Discussing and planning are cheap compared to leaving
things to the last minute as usual and then rushing them.
>>> (5) The claim is made that readdir() accesses files sequentially; but
>>> there is also mention in Shardmap of compressing shards (e.g.,
>>> rewriting them) to squeeze out deleted and tombstone entries. This
>>> pretty much guarantees that it will not be possible to satisfy POSIX
>>> requirements of telldir(2)/seekdir(3) (using a 32-bit or 64-bitt
>>> cookie), NFS (which also requires use of a 32-bit or 64-bit cookie
>>> while doing readdir scan), or readdir() semantics in the face of
>>> directory entries getting inserted or removed from the directory.
>>
>> No problem, the data blocks are completely separate from the index so
>> readdir just walks through them in linear order a la classic UFS/Ext2.
>> What could possibly be simpler, faster or more POSIX compliant?
>
> OK, so what you're saying then is for every single directory entry
> addition or removal, there must be (at least) two blocks which must be
> modified, an (at least one) index block, and a data block, no? That
> makes it worse than htree, where most of the time we only need to
> modify a single leaf node. We only have to touch an index block when
> a leaf node gets full and it needs to be split.
The operative word above is "single". Usually when we modify a single
entry in a directory we do not care whether the file system touches one
block or two, because a typical minimum commit involves many more than
that.
It may be that you were really thinking about mass instances of single
updates, which Shardmap handles much more efficiently than HTree. Under
mass insert, Shardmap repeatedly updates the same record block whereas
HTree updates some random, usually different leaf block per insert.
You are right that Shardmap also must update the shard fifo tail block,
however there is only one index shard up to 64K entries, so all the new
index entries go into the same tail block(s). Shardmap wins this one by
a mile.
As far as deletes go, of course you know how bad HTree is at that. Sure,
HTree only needs to update a single block to remove an entry, but then
it does unspeakable things to the inode table that tend to cause serious
performance losses in not-so-rare corner cases. Shardmap definitely
fixes that.
For O_SYNC operation you have more of a point; however, again I doubt
that one directory block versus two will move the needle, and if it did,
to the point of somebody actually caring about it, we can easily finesse
away one or both of those updates using journal techniques. More likely,
nobody will ever notice or care about this single extra block per sync
commit.
> Anyway, let's wait and see how you and Hirofumi-san work out those
> details for Tux3
We need to discuss those details up front in order to avoid duplicating
work or heading off in costly design directions that you may reject
later. And who would dream of depriving the LKML viewing public of their
weekly kernel design discussion entertainment? Not me.
Important example: how is atomic directory commit going to work for
Ext4? What can we do in the immediate future to make that work easier
for Ext4 devs? And many other details. The atomic commit discussion
alone is essential, and is a long lead item as we have all
experienced.
Bottom line: let's keep talking, it's better, and there is much of
interest to discuss. Surely you would at least like to know what
happened to your suggestion back in New Orleans about how to track
free records in huge directories?
Regards,
Daniel
On 2019-11-27 6:28 p.m., Theodore Y. Ts'o wrote:
> The use of C++ with templates is presumably one of the "less so"
> parts, and it was that which I had in mind when I said,
> "reimplementing from scratch".
Ah, duopack and tripack. Please consider the C code that was replaced
by those. See the tons of bogus extra parameters resulting in nonsense
like this:
set_entry(shard_entry(shard, bucket), key & bitmask(shard->lowbits), loc, next, shard->locbits, nextbits);
which in the C++ version looks like:
*entry = {trio.pack(next, loc, key & bitmask(lowbits))};
They generate roughly identical machine code, but I know which one I prefer
to maintain. That said, porting back to C (in progress right now) includes
substituting some appropriate macro code for those sweet little variable
field width structure templates, which, by the way, are central to Shardmap's
scalability and performance. These are what allow the index entry to stay
at just 8 bytes over the entire range from one entry to one billion.
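To give a flavor of what that hand expansion looks like, roughly (a
sketch only; field order, names and bounds handling are illustrative,
not the real code):

#include <stdint.h>

struct trio { unsigned nextbits, locbits; };    /* the hash gets whatever bits remain of 64 */

static inline uint64_t trio_pack(const struct trio *t, uint64_t next, uint64_t loc, uint64_t hash)
{
        return next | (loc << t->nextbits) | (hash << (t->nextbits + t->locbits));
}

static inline uint64_t trio_loc(const struct trio *t, uint64_t entry)
{
        return (entry >> t->nextbits) & (((uint64_t)1 << t->locbits) - 1);
}

The point being that the field widths live in one little descriptor and
every accessor is just a shift and a mask.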
So right, tripack and duopack will be reimplemented from scratch, using
the template code as a model. Basically just expanding the templates by
hand and adding in some macro sugar for clarity. Shardmap itself does not
need to be rewritten from scratch in order to port to kernel, far from it.
Regards,
Daniel
On Wed, Nov 27, 2019 at 08:27:59PM -0800, Daniel Phillips wrote:
> You are right that Shardmap also must update the shard fifo tail block,
> however there is only one index shard up to 64K entries, so all the new
> index entries go into the same tail block(s).
So how big is an index shard? If it is 64k entries, and each entry is
16 bytes (8 bytes hash, 8 bytes block number), then a shard is a
megabyte, right? Are entries in an index shard stored in sorted or
unsorted manner? If they are stored in an unsorted manner, then when
trying to do a lookup, you need to search all of the index shard ---
which means for a directory that is being frequently accessed, the
entire index shard has to be kept in memory, no? (Or paged in as
necessary, if you are using mmap in userspace).
> Important example: how is atomic directory commit going to work for
> Ext4?
The same way all metadata updates work in ext4. Which is to say, you
need to declare the maximum number of 4k metadata blocks that an
operation might need to change when calling ext4_journal_start() to
create a handle; and before modifying a 4k block, you need to call
ext4_journal_get_write_access(), passing in the handle and the block's
buffer_head. After modifying the block, you must call
ext4_handle_dirty_metadata() on the buffer_head. And when you are
doing with the changes in an atomic metadata operation, you call
ext4_journal_stop() on the handle.
This hasn't changed since the days of ext3 and htree.
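Schematically, for an operation touching a single directory block
(credits computation and error cleanup elided):

        handle_t *handle = ext4_journal_start(dir, EXT4_HT_DIR, credits);

        if (IS_ERR(handle))
                return PTR_ERR(handle);
        err = ext4_journal_get_write_access(handle, bh);
        if (!err) {
                /* modify the 4k metadata block through bh->b_data */
                err = ext4_handle_dirty_metadata(handle, dir, bh);
        }
        ext4_journal_stop(handle);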
- Ted
On 2019-11-30 9:50 a.m., Theodore Y. Ts'o wrote:
> On Wed, Nov 27, 2019 at 08:27:59PM -0800, Daniel Phillips wrote:
>> You are right that Shardmap also must update the shard fifo tail block,
>> however there is only one index shard up to 64K entries, so all the new
>> index entries go into the same tail block(s).
>
> So how big is an index shard? If it is 64k entries, and each entry is
> 16 bytes (8 bytes hash, 8 bytes block number), then a shard is a
> megabyte, right? Are entries in an index shard stored in sorted or
> unsorted manner? If they are stored in an unsorted manner, then when
> trying to do a lookup, you need to search all of the index shard ---
> which means for a directory that is being frequently accessed, the
> entire index shard has to be kept in memory, no? (Or paged in as
> necessary, if you are using mmap in userspace).
Exactly correct, except that in cache a shard is a hash table, while
on media it is just an unordered collection that is entered into hash
buckets at shard load time.
This is actually the main defining characteristic of Shardmap, both
giving rise to the theoretical read multiplication issue alluded to above
and, on the positive side, acting as a kind of cache read ahead due to
the coarse demand read granularity. In other words, each time we hit
a not present shard, we read multiple blocks of the given shard into
cache all at once, instead of loading blocks piecemeal with lots of
inefficient little reads. This is especially beneficial for spinning
disk, which we Tux3 devs still worry about and, I would think, you do
too. Paraphrasing the immortal bard, "it's not dead yet".
Shard size is tunable at directory creation time. A shard entry is
actually 8 bytes, not 16, because the block number field can initially be
very small, just 10 bits by default, leaving 53 bits for the hash and one
bit to indicate tombstone or not. As the directory size increases, the
block number field size increases to accommodate more record blocks and
the hash field size decreases; however, the number of shards increases at
the same rate (more or less linearly, enforced by logic) so that, together
with the hash bits implied by the shard number, the hash resolution stays
constant. Isn't that nice?
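(Purely illustrative numbers: at creation there is a single shard and an
entry might carry 1 + 10 + 53 bits, for 53 bits of resolution. Grow the
directory to where the block field needs 20 bits and the shard count has
increased by a factor of 2**10, and the entry carries 1 + 20 + 43 bits,
while the 10 bits implied by which shard the entry lives in bring the
resolution back up to 53.)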
The importance of hash resolution is that, at high scale any hash
collision within a bucket must be resolved by accessing the record
block and resolving it there. So high collision rate corresponds to
significant slowdown in operations, getting worse linearly as the
directory expands. This is N**2 behavior in the sense that the time
to perform N operations increases as N**2 (IOW our usual definition
of file system performance.) It turns out that 53 hash bits are
sufficient to hold the collision rate to a few tens in one billion
inserts, insignificant at that scale, even more so at typical scale.
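(Back of envelope: with b bits of hash resolution, the expected number of
collisions after N random inserts is about N*(N-1)/2**(b+1); for N = 10**9
and b = 53 that is roughly 10**18 / 1.8*10**16, or about 55, which is
where "a few tens" comes from.)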
The question of cache footprint is indeed central, as you imply. 8
bytes per entry cuts the cache footprint in half, so that is nice.
But would it be better if we could hold only the pieces of index
in cache that we actually need? This question is far more subtle
than it first seems. Here is a possibly surprising mathematical
fact: when the number of accesses is similar to the number of cache
blocks, the cache footprint of any randomly accessed index is the
entire cache. This is entirely independent of the index algorithm in
use: you see exactly the same behavior with BTrees. The best and
possibly only remedy is to make the index as compact as possible,
hence the impact of 8 byte vs 16 byte entries.
This highlights another significant advantage that Shardmap has
over HTree: HTree embeds its entries directly in the index while
Shardmap separates them out into traditional record entry blocks.
The HTree strategy does save significant CPU by avoiding one
block level dereference; however, as mentioned earlier, this merely
allows HTree to tie Shardmap in block accesses at the lowest
index scale, because Shardmap does one of its accesses into a
cache object, thus avoiding radix tree overhead. The cost of
HTree's strategy at high scale, or with a large number of
directories open, is large, a factor of 2 or more greater cache
pressure depending on average name size.
So Shardmap is significantly more cache friendly than HTree, however,
as you deduced, if cache thrashing does happen then Shardmap with
shard size in the 256K to 1 Mbyte range might have to read a hundred
times as many blocks to reload an evicted shard as HTree does to
load a single btree block. On the other hand, the thrashing knee
occurs with 3-5 times less cache for Shardmap than HTree, so which
one wins here? I say Shardmap, because Shardmap does more with less
cache. If you are thrashing that badly then your machine must be
grossly misconfigured for its load.
Now suppose you wanted to fix this theoretical read multiplication: how?
An emerging technique aimed at precisely the kind of dual format caching
scheme that Shardmap uses has been dubbed "anticache". Instead of evicting
and reloading the cache object, page the cache out to swap, or to any
available backing store (e.g., to the file system volume itself). Then the
cache object can be reloaded at your choice of granularity, for example 4K
pages, loaded in via the hash table walk algorithm. This takes one or more
steps: 1) look up the hash chain head in the bucket table, possibly loading
a page; 2+) if the entry is not found there, follow the next chain link,
possibly loading another page.
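A sketch of what that page granular walk might look like, with every name
below hypothetical since none of this is implemented anywhere yet:

#include <stdint.h>
#include <stddef.h>

enum { noentry = 0xffffffff };          /* hypothetical end-of-chain sentinel */

static struct entry *anticache_lookup(struct shard *shard, uint64_t hash)
{
        unsigned bucket = hash & (((uint64_t)1 << shard->tablebits) - 1);
        unsigned link = bucket_head(shard, bucket);     /* step 1: may fault in one table page */

        while (link != noentry) {
                struct entry *entry = entry_at(shard, link);    /* step 2+: may fault in one fifo page */
                if (entry_hash(entry) == hash)
                        return entry;
                link = entry_next(entry);
        }
        return NULL;
}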
The thing is, I can't imagine any Linux configuration that could hit
this, short of an artificial construction. Take the case of a 1M file
directory. The shardmap media fifos for that will be 8MB, there will
be 16 of them (depending on tuning) and each cache shard will be
somewhere around 640K or 160 pages per shard for a total of 1280
cache pages, or 5 MB cache footprint. If this is going to put your
system into thrash then I submit that the real problem is not
Shardmap.
So this theoretical read multiplication issue is essentially just a
question of how fast we can thrash. If somebody does come up with a
valid use case where we need to thrash faster than now, we can always
implement anticache or something similar, and maybe make that a
generic facility while at it, because again, the real problem is not
Shardmap, it is somebody feeding in an inappropriate load for their
machine configuration.
>> Important example: how is atomic directory commit going to work for
>> Ext4?
>
> The same way all metadata updates work in ext4. Which is to say, you
> need to declare the maximum number of 4k metadata blocks that an
> operation might need to change when calling ext4_journal_start() to
> create a handle; and before modifying a 4k block, you need to call
> ext4_journal_get_write_access(), passing in the handle and the block's
> buffer_head. After modifying the block, you must call
> ext4_handle_dirty_metadata() on the buffer_head. And when you are
> doing with the changes in an atomic metadata operation, you call
> ext4_journal_stop() on the handle.
>
> This hasn't changed since the days of ext3 and htree.
OK good. And I presume that directory updates are prevented until
the journal transaction is at least fully written to buffers. Maybe
delayed until the journal transaction is actually committed?
In Tux3 we don't block directory updates during backend commit, and I
just assumed that Ext4 and others also do that now, so thanks for the
correction. As far as I can see, there will be no new issue with Shardmap,
as you say. My current plan is that user space mmap will become kmap in
kernel. I am starting on this part for Tux3 right now. My goal is to
refine the current Shardmap data access api to hide the fact that mmap
is used in user space but kmap in kernel. Again, I wish we actually had
mmap in kernel and maybe we should consider properly supporting it in
the future, perhaps by improving kmalloc.
One thing we do a bit differently from a traditional fs is that, in the
common, unfragmented case, mass inserts go into the same block until
the block is full. So we get a significant speedup by avoiding a page
cache lookup and kmap per insert. Borrowing a bit of mechanism from
the persistent memory version of Shardmap, we create the new entries
in a separate cache page. Then, on commit, we copy this "front buffer"
to the page cache. I think that will translate pretty well to Ext4 also.
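In sketch form (hypothetical names again; the real thing also has to deal
with block boundaries and the fragmented case):

#include <errno.h>
#include <stdint.h>
#include <string.h>

struct frontbuf {
        char *data;             /* private page, not in the page cache */
        unsigned size, used;
};

static int front_insert(struct frontbuf *fb, const char *name, unsigned len, uint64_t inum)
{
        unsigned need = record_size(len);       /* hypothetical: on-media record length */

        if (fb->used + need > fb->size)
                return -ENOSPC;                 /* caller commits this block, then retries */
        record_create(fb->data + fb->used, name, len, inum);    /* hypothetical */
        fb->used += need;
        return 0;
}

static void front_commit(struct frontbuf *fb, char *block)      /* block: the kmapped page cache block */
{
        memcpy(block, fb->data, fb->used);      /* one copy and one kmap per commit, not per insert */
}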
Regards,
Daniel
On 2019-11-27 6:28 p.m., Theodore Y. Ts'o wrote:
> The use of C++ with templates is presumably one of the "less so"
> parts, and it was that which I had in mind when I said,
> "reimplementing from scratch".
The templates were removed without reimplementing from scratch:
https://github.com/danielbot/Shardmap/blob/master/shardmap.h#L88
https://github.com/danielbot/Shardmap/blob/master/shardmap.cc#L82
The duopack/tripack facilities, central to Shardmap's efficient scalability,
are now just ordinary C code that happens to be compiled by a C++ compiler. I
think the machine code should be identical to what the templates produced,
though I did not verify.
This was a strictly mechanical conversion, less error prone than
reimplementing from scratch, I would think. I expect the rest of the
back conversions to be similarly mechanical.
Regards,
Daniel