Hi folks,
Here is my somewhat tardy followup to my Three Things post from earlier
this fall. I give you Thing 1, Shardmap. What I hope to accomplish today
is to initiate a conversation with Ext4 developers, and other interested
observers, which will eventually result in merging the new Shardmap
directory index code into Ext4, thereby solving a number of longstanding
issues with the venerable and somewhat problematic HTree.
HTree is the most used directory index in the known universe. HTree does
some things fantastically well, particularly in the most common range of
directory sizes, but also exhibits some well known flaws and at least one
previously unknown flaw, explained below. Shardmap is a new index design,
just seven years old, an O(1) extensible hash table, meant to address all
of HTree's flaws while improving performance and scaling into the
previously inaccessible billion file per directory range. Subject to
verifying these claims, it would seem logical to move on to the
logistics of porting Shardmap to Ext4 as an optional index algorithm,
eventually deprecating HTree.
Shardmap equals or outperforms HTree at all scales, with the single
exception of one theoretical case with a likewise theoretical solution.
Shardmap is O(1) in all operations - insert, delete, lookup and readdir,
while HTree is O(log N) in the first three and suffers from a relatively
large constant readdir overhead. This is not the only reason Shardmap is
faster than HTree, far from it.
I will break performance discussion down into four ranges:
1) unindexed single block, to about 300 entries
2) indexed up to 30 thousand entries
3) indexed up to 3 million entries
4) indexed up to 1 billion entries.
In theory, Shardmap and HTree are exactly tied in the common single
block case, because creating the index is delayed in both cases until
there are at least two blocks to index. However, Shardmap broke away
from the traditional Ext2 entry block format in order to improve block
level operation efficiency and to prevent record fragmentation under
heavy aging, and is therefore significantly faster than HTree even in
the single block case.
HTree does not function at all in the fourth range, up to 1 billion
entries, because its Btree has at most 2 levels. This simple flaw could be
corrected without much difficulty but Shardmap would remain superior for
a number of reasons.
The most interesting case is the 300 to 30 thousand entry range, where
HTree and Shardmap should theoretically be nearly equal, each requiring
two accesses per lookup. However, HTree does two radix tree lookups while
Shardmap does one, and the other lookup is in a cached memory object.
Coupled with the faster record level operations, Shardmap is significantly
faster in this range. In the 30 thousand to 3 million range, Shardmap's
performance advantage further widens in accordance with O(1) / O(log N).
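To make the two-access lookup concrete, here is a rough sketch in C. All
structure names, field widths and the hash function are illustrative, not
the actual Shardmap code: the point is only that the first probe hits a
cached in-memory shard (no block I/O), and just the second step touches a
record block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified sketch of a Shardmap-style lookup: one
 * probe of an in-memory shard (a cached hash table) yields a record
 * block number; one block read then finds the actual entry. */

#define SHARD_BUCKETS 256

struct shard_entry {
	uint32_t hash;   /* hash bits kept for disambiguation */
	uint32_t block;  /* record block holding the directory entry */
};

struct shard {
	struct shard_entry bucket[SHARD_BUCKETS];
};

/* trivial stand-in hash (FNV-1a); the real code uses its own hash */
static uint32_t name_hash(const char *name)
{
	uint32_t h = 2166136261u;
	while (*name)
		h = (h ^ (uint8_t)*name++) * 16777619u;
	return h;
}

/* Step 1: probe the cached in-memory shard - no block I/O here */
static int shard_probe(const struct shard *shard, uint32_t hash,
		       uint32_t *block)
{
	const struct shard_entry *e = &shard->bucket[hash % SHARD_BUCKETS];
	if (e->hash != hash)
		return 0;	/* miss (collisions handled elsewhere) */
	*block = e->block;	/* Step 2 would read this record block */
	return 1;
}
```

HTree, by contrast, pays a page cache radix tree lookup for each of its two
block accesses.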
For inserts, Shardmap's streaming update strategy is far superior to
HTree's random index block write approach. HTree will tend to dirty every
index block under heavy insert load, so that every index block must be
written to media per commit, causing serious write multiplication
issues. In fact, all Btree schemes suffer from this issue, which on the
face of it appears to be enough reason to eliminate the Btree as a
realistic design alternative. Shardmap dramatically reduces such per
commit write multiplication by appending new index entries linearly to
the tail blocks of a small number of index shards. For delete,
Shardmap avoids write multiplication by appending tombstone entries to
index shards, thereby addressing a well known HTree delete performance
issue.
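The append-only update style is easy to sketch. The following C fragment is
a toy model, with invented names and a flat in-memory array standing in for
a shard's tail blocks, but it shows the essential move: both insert and
delete append a fixed-size entry at the tail, so only the tail block ever
goes dirty per commit, and a delete is a tombstone rather than an in-place
erase.

```c
#include <assert.h>
#include <stdint.h>

#define SHARD_CAP 1024
#define TOMBSTONE 0		/* block 0 never holds entries here */

struct fifo_entry {
	uint32_t hash;
	uint32_t block;		/* TOMBSTONE marks a delete */
};

struct fifo_shard {
	unsigned tail;		/* next free slot; only tail block dirties */
	struct fifo_entry entry[SHARD_CAP];
};

static void shard_insert(struct fifo_shard *s, uint32_t hash, uint32_t block)
{
	s->entry[s->tail++] = (struct fifo_entry){ hash, block };
}

static void shard_delete(struct fifo_shard *s, uint32_t hash)
{
	/* append-only delete: a tombstone, not an in-place erase */
	s->entry[s->tail++] = (struct fifo_entry){ hash, TOMBSTONE };
}

/* Scan newest to oldest so a tombstone hides older inserts */
static int shard_lookup(const struct fifo_shard *s, uint32_t hash,
			uint32_t *block)
{
	for (unsigned i = s->tail; i-- > 0; ) {
		if (s->entry[i].hash != hash)
			continue;
		if (s->entry[i].block == TOMBSTONE)
			return 0;
		*block = s->entry[i].block;
		return 1;
	}
	return 0;
}
```

Accumulated tombstones are eventually squeezed out by rewriting a shard,
which is a batched, sequential operation rather than random block writes.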
HTree has always suffered from a serious mismatch between name storage
order and inode storage order, greatly exacerbated by the large number
of directory entries and inodes stored per block (false sharing). In
particular, a typical HTree hash order directory traversal accesses the
inode table randomly, multiplying both the cache footprint and write
traffic. Historically, this was the cause of many performance complaints
about HTree, though to be sure, such complaints have fallen off with
the advent of solid state storage. Even so, this issue will continue to
rear its ugly head when users push the boundaries of cache and directory size
(google telldir+horror). Shardmap avoids this issue entirely by storing
and traversing directory entries in simple, classic linear order.
This touches on the single most significant difference between Shardmap
and HTree: Shardmap strictly separates its index from record entry blocks,
while HTree embeds entries directly in the Btree index. The HTree
strategy performs impressively well at low to medium directory scales,
but at higher scales it causes significantly more cache pressure, due to
the fact that the cache footprint of any randomly accessed index is
necessarily the entire index. In contrast, Shardmap stores directory
entries permanently in creation order, so that directory traversal is
in simple linear order with effectively zero overhead. This avoids
perhaps HTree's most dubious design feature, its arcane and not completely
reliable hash order readdir traversal, which miraculously has avoided
serious meltdown over these past two decades due to a legendary hack by
Ted and subsequent careful adaptation to handle various corner cases.
Nowhere in Shardmap will you find any such arcane and marginally
supportable code, which should be a great relief to Ext4 maintainers.
Or to put it another way, if somebody out there wishes to export a
billion file directory using NFSv2, Shardmap will not be the reason
why that does not work whereas HTree most probably would be.
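To illustrate the point (this is a hypothetical layout, not Shardmap's
actual record block format), creation-order traversal means the readdir
cursor is nothing more than a linear position in the record area, which can
double as a stable telldir/seekdir cookie:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: directory entries live in a flat record area in the
 * order they were created.  Deleted entries are skipped but still
 * occupy their slot, so positions of surviving entries never move. */

struct rec {
	uint8_t len;		/* 0 terminates the record area */
	uint8_t deleted;	/* skipped, position still advances */
	char	name[62];
};

#define NRECS 64

/* Return index of the next live entry at or after *pos and advance
 * *pos past it; -1 at end of directory.  *pos is the cookie. */
static int next_entry(const struct rec *recs, unsigned *pos)
{
	while (*pos < NRECS && recs[*pos].len) {
		unsigned i = (*pos)++;
		if (!recs[i].deleted)
			return (int)i;
	}
	return -1;
}
```

Compare that to handing out positions in hash order, where the cookie must
encode a location in a hash-ordered stream that can split and move under
the reader's feet.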
Besides separating out the index from entry records and accessing those
records linearly in most situations, Shardmap also benefits from a very
compact index design. Even if a directory has a billion entries, each
index entry is only eight bytes in size. This exercise in structure
compaction proved possible because the number of hash bits that must be
stored per entry decreases as the number of index shards increases, freeing up
bits for larger block numbers as the directory expands. Shardmap
therefore implements an index entry as several variable sized fields.
This strategy works well up to the billion entry range, above which the
number of hash index collisions (each of which must be resolved by
accessing an underlying record block) starts to increase noticeably.
This is really the only factor that limits Shardmap performance above
a billion entries. Should we wish Shardmap to scale to trillions of
entries without losing performance, we will need to increase the index
entry size to ten bytes or so, or come up with some as yet unknown
clever improvement (potential thesis project goes here).
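A sketch of that variable-field packing, with made-up field widths rather
than Shardmap's real layout: with 2^shard_bits shards, the low shard_bits
of the hash are implied by which shard an entry lives in, so the stored
hash can shrink by that amount as the directory grows, handing the freed
bits to the block number field within a fixed 64-bit entry.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_BITS 64
#define HASH_BITS  40		/* hypothetical full hash width */

/* Assumes shard_bits < HASH_BITS, as in any sane configuration */
static uint64_t pack_entry(uint64_t hash, uint64_t block,
			   unsigned shard_bits)
{
	unsigned hash_bits  = HASH_BITS - shard_bits;	/* stored hash */
	unsigned block_bits = ENTRY_BITS - hash_bits;	/* rest: block */
	uint64_t hash_part  = (hash >> shard_bits) &
			      ((1ull << hash_bits) - 1);
	return (hash_part << block_bits) |
	       (block & ((1ull << block_bits) - 1));
}

static uint64_t entry_block(uint64_t entry, unsigned shard_bits)
{
	unsigned block_bits = ENTRY_BITS - (HASH_BITS - shard_bits);
	return entry & ((1ull << block_bits) - 1);
}
```

With these invented widths, a small directory (shard_bits = 0) stores 40
hash bits and 24 block bits, while a huge one (shard_bits = 16) stores only
24 hash bits, leaving 40 bits of block number.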
There are many additional technical details of Shardmap that ought to be
explained, however today my primary purpose is simply to introduce what
I view as a compelling case for obsoleting HTree. To that end, fewer
words are better and this post is already quite long enough. I would love
to get into some other interesting details, for example, the Bigmap free
space mapping strategy, but that really needs its own post to do it
justice, as do a number of other subjects.
Wrapping up, what about that theoretical case where HTree outperforms
Shardmap? This is theoretical because one needs to operate on a huge
directory with tiny cache to trigger it. Both Shardmap and HTree will
exhibit read multiplication under such conditions, due to frequent
cache evictions, however the Shardmap read multiplication will be many
times higher than HTree because of its coarser cache granularity. In the
unlikely event that we ever need to fix this, one viable solution is to
add paging support for Shardmap's in-memory cache structure, a standard
technique sometimes called "anticache".
That is it for today. There remains much to explain about Shardmap both
within and beyond the Ext4 context. For example, Shardmap has proved to
work very well as a standalone key value store, particularly with
persistent memory. In fact, we have benchmarked Shardmap at over three
million atomic, durable database operations per second on an Intel
Optane server, which might well be a new world record. The details of
how this was done are fascinating, however this post is far too small to
contain them today. Perhaps this should be thing 1(b) for next week.
Regards,
Daniel
A couple of quick observations about Shardmap.
(1) It's licensed[1] under the GPLv3, so it's not compatible with the
kernel license. That doesn't matter much for ext4, because...
[1] https://github.com/danielbot/Shardmap/blob/master/LICENSE
(2) It's implemented as userspace code (e.g., it uses open(2),
mmap(2), et al.) in C++, so it would need to be reimplemented
from scratch for use in the kernel.
(3) It's not particularly well documented, making the above more
challenging, but it appears to be a variation of an extensible hashing
scheme, which was used by dbx and Berkeley DB.
(4) Because of (2), we won't be able to do any actual benchmarks for a
while. I just checked the latest version of Tux3[2], and it appears
to still be using a linear search scheme for its directory ---
e.g., an O(n) lookup a la ext2. So I'm guessing Shardmap may have been
*designed* for Tux3, but it has not yet been *implemented* for Tux3?
[2] https://github.com/OGAWAHirofumi/linux-tux3/blob/hirofumi/fs/tux3/dir.c#L283
(5) The claim is made that readdir() accesses files sequentially; but
there is also mention in Shardmap of compressing shards (e.g.,
rewriting them) to squeeze out deleted and tombstone entries. This
pretty much guarantees that it will not be possible to satisfy POSIX
requirements of telldir(2)/seekdir(3) (using a 32-bit or 64-bit
cookie), NFS (which also requires use of a 32-bit or 64-bit cookie
while doing readdir scan), or readdir() semantics in the face of
directory entries getting inserted or removed from the directory.
(To be specific, POSIX requires readdir returns each entry in a
directory once and only once, and in the case of a directory entry
which is removed or inserted, that directory entry must be returned
exactly zero or one times. This is true even if telldir(2) or
seekdir(2) is used to memoize a particular location in the directory,
which means you have a 32-bit or 64-bit cookie to define a particular
location in the readdir(2) stream. If the file system wants to be
exportable via NFS, it must meet similar requirements --- except the
32-bit or 64-bit cookie MUST survive a reboot.)
Regards,
- Ted
On Dec 1, 2019, at 1:21 AM, Daniel Phillips <[email protected]> wrote:
>>> Important example: how is atomic directory commit going to work for
>>> Ext4?
>>
>> The same way all metadata updates work in ext4. Which is to say, you
>> need to declare the maximum number of 4k metadata blocks that an
>> operation might need to change when calling ext4_journal_start() to
>> create a handle; and before modifying a 4k block, you need to call
>> ext4_journal_get_write_access(), passing in the handle and the block's
>> buffer_head. After modifying the block, you must call
>> ext4_handle_dirty_metadata() on the buffer_head. And when you are
>> done with the changes in an atomic metadata operation, you call
>> ext4_journal_stop() on the handle.
>>
>> This hasn't changed since the days of ext3 and htree.
>
> OK good. And I presume that directory updates are prevented until
> the journal transaction is at least fully written to buffers. Maybe
> delayed until the journal transaction is actually committed?
>
> In Tux3 we don't block directory updates during backend commit, and I
> just assumed that Ext4 and others also do that now, so thanks for the
> correction. As far as I can see, there will be no new issue with Shardmap,
> as you say. My current plan is that user space mmap will become kmap in
> kernel. I am starting on this part for Tux3 right now. My goal is to
> refine the current Shardmap data access api to hide the fact that mmap
> is used in user space but kmap in kernel. Again, I wish we actually had
> mmap in kernel and maybe we should consider properly supporting it in
> the future, perhaps by improving kmalloc.
>
> One thing we do a bit differently from a traditional fs is, in the
> common, unfragmented case, mass inserts go into the same block until
> the block is full. So we get a significant speedup by avoiding a page
> cache lookup and kmap per insert. Borrowing a bit of mechanism from
> the persistent memory version of Shardmap, we create the new entries
> in a separate cache page. Then, on commit, copy this "front buffer" to
> the page cache. I think that will translate pretty well to Ext4 also.
One important use case that we have for Lustre that is not yet in the
upstream ext4[*] is the ability to do parallel directory operations.
This means we can create, lookup, and/or unlink entries in the same
directory concurrently, to increase parallelism for large directories.
This is implemented by progressively locking the htree root and index
blocks (typically read-only), then leaf blocks (read-only for lookup,
read-write for insert/delete). This provides improved parallelism
as the directory grows in size.
Will there be some similar ability in Shardmap to have parallel ops?
Also, does Shardmap have the ability to shrink as entries are removed?
Cheers, Andreas
[*] we've tried to submit the pdirops patch a couple of times, but the
main blocker is that the VFS has a single directory mutex and couldn't
use the added functionality without significant VFS changes.
Patch at https://git.whamcloud.com/?p=fs/lustre-release.git;f=ldiskfs/kernel_patches/patches/rhel8/ext4-pdirop.patch;hb=HEAD
On 2019-12-04 10:03 a.m., Andreas Dilger wrote:
>> Here is a pretty picture to get started:
>>
>> https://github.com/danielbot/Shardmap/wiki/Shardmap-media-format
>
> The shardmap diagram is good conceptually, but it would be useful
> to add a legend on the empty side of the diagram that shows the on-disk
> structures.
Sounds good, but not sure exactly what you had in mind. Fields of a
shard entry? Fields of the block 0 header? The record entry block has
its own diagram, and is polymorphic anyway, so no fixed format.
Regards,
Daniel