On Tue, 2019-11-26 at 17:47 -0800, Daniel Phillips wrote:
> Hi folks,
>
> Here is my somewhat tardy followup to my Three Things post from earlier
> this fall. I give you Thing 1, Shardmap. What I hope to accomplish today
> is to initiate a conversation with Ext4 developers, and other interested
> observers, which will eventually result in merging the new Shardmap
> directory index code into Ext4, thereby solving a number of longstanding
> issues with the venerable and somewhat problematic HTree.
>
> HTree is the most used directory index in the known universe. HTree does
> some things fantastically well, particularly in the most common range of
> directory sizes, but also exhibits some well known flaws and at least one
> previously unknown flaw, explained below. Shardmap is a new index design,
> just seven years old, an O(1) extensible hash table, meant to address all
> of HTree's flaws while improving performance and scaling into the
> previously inaccessible billion file per directory range. Subject to
> verifying these claims, it would seem to be logical to move on to the
> logistics of porting Shardmap to Ext4 as an optional index algorithm,
> eventually deprecating HTree.
>
As far as I know, a folder usually contains dozens or hundreds of
files/folders on average. Many research studies have shown this. Do you
have in mind some special use case where a folder could contain a
billion files? Could you share some research work that describes a
practical use case with a billion files per folder?
If you are talking about improving the performance then do you mean
some special open-source implementation?
> Shardmap equals or outperforms HTree at all scales, with the single
> exception of one theoretical case with a likewise theoretical solution.
> Shardmap is O(1) in all operations - insert, delete, lookup and readdir,
> while HTree is O(log N) in the first three and suffers from a relatively
> large constant readdir overhead. This is not the only reason Shardmap is
> faster than HTree, far from it.
>
> I will break performance discussion down into four ranges:
> 1) unindexed single block, to about 300 entries
> 2) indexed up to 30 thousand entries
> 3) indexed up to 3 million entries
> 4) indexed up to 1 billion entries.
>
> In theory, Shardmap and HTree are exactly tied in the common single
> block case, because creating the index is delayed in both cases until
> there are at least two blocks to index. However, Shardmap broke away
> from the traditional Ext2 entry block format in order to improve block
> level operation efficiency and to prevent record fragmentation under
> heavy aging, and is therefore significantly faster than HTree even in
> the single block case.
>
> HTree does not function at all in the fourth range, up to 1 billion
> entries, because its BTree has at most 2 levels. This simple flaw could
> be corrected without much difficulty, but Shardmap would remain superior
> for a number of reasons.
>
> The most interesting case is the 300 to 30 thousand entry range, where
> HTree and Shardmap should theoretically be nearly equal, each requiring
> two accesses per lookup. However, HTree does two radix tree lookups while
> Shardmap does one, and the other lookup is in a cached memory object.
> Coupled with the faster record level operations, Shardmap is
> significantly faster in this range. In the 30 thousand to 3 million
> range, Shardmap's performance advantage further widens in accordance
> with O(1) / O(log N).
>
> For inserts, Shardmap's streaming update strategy is far superior to
> HTree's random index block write approach. HTree will tend to dirty every
> index block under heavy insert load, so that every index block must be
> written to media per commit, causing serious write multiplication
> issues. In fact, all Btree schemes suffer from this issue, which on the
> face of it appears to be enough reason to eliminate the Btree as a
> realistic design alternative. Shardmap dramatically reduces such per
> commit write multiplication by appending new index entries linearly to
> the tail blocks of a small number of index shards. For delete, Shardmap
> avoids write multiplication by appending tombstone entries to index
> shards, thereby addressing a well known HTree delete performance issue.
Do you mean Copy-On-Write policy here or some special technique? How
well would Shardmap suit the SSD use case? Do you mean that we could
reduce the write amplification issue for the NAND flash case?
>
> HTree has always suffered from a serious mismatch between name storage
> order and inode storage order, greatly exacerbated by the large number
> of directory entries and inodes stored per block (false sharing). In
> particular, a typical HTree hash order directory traversal accesses the
> inode table randomly, multiplying both the cache footprint and write
> traffic. Historically, this was the cause of many performance complaints
> about HTree, though to be sure, such complaints have fallen off with
> the advent of solid state storage. Even so, this issue will continue to
> rear its ugly head when users push the boundaries of cache and directory
> size (google telldir+horror). Shardmap avoids this issue entirely by
> storing and traversing directory entries in simple, classic linear order.
>
> This touches on the single most significant difference between Shardmap
> and HTree: Shardmap strictly separates its index from record entry
> blocks, while HTree embeds entries directly in the BTree index. The
> HTree strategy performs impressively well at low to medium directory
> scales, but at higher scales it causes significantly more cache
> pressure, due to the fact that the cache footprint of any randomly
> accessed index is necessarily the entire index. In contrast, Shardmap
> stores directory entries permanently in creation order, so that
> directory traversal is in simple linear order with effectively zero
> overhead. This avoids perhaps HTree's most dubious design feature, its
> arcane and not completely reliable hash order readdir traversal, which
> miraculously has avoided serious meltdown over these past two decades
> due to a legendary hack by Ted and subsequent careful adaptation to
> handle various corner cases. Nowhere in Shardmap will you find any such
> arcane and marginally supportable code, which should be a great relief
> to Ext4 maintainers. Or to put it another way, if somebody out there
> wishes to export a billion file directory using NFSv2, Shardmap will
> not be the reason why that does not work whereas HTree most probably
> would be.
>
> Besides separating out the index from entry records and accessing those
> records linearly in most situations, Shardmap also benefits from a very
> compact index design. Even if a directory has a billion entries, each
> index entry is only eight bytes in size. This exercise in structure
> compaction proved possible because the number of hash bits needed for
> the hash code decreases as the number of index shards increases, freeing
> up bits for larger block numbers as the directory expands. Shardmap
> therefore implements an index entry as several variable sized fields.
> This strategy works well up to the billion entry range, above which the
> number of hash index collisions (each of which must be resolved by
> accessing an underlying record block) starts to increase noticeably.
> This is really the only factor that limits Shardmap performance above
> a billion entries. Should we wish Shardmap to scale to trillions of
> entries without losing performance, we will need to increase the index
> entry size to ten bytes or so, or come up with some as yet unknown
> clever improvement (potential thesis project goes here).
>
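A toy illustration of that bit-budget argument (invented names and field split, not Shardmap's actual on-media format): if an entry's shard is selected by the top k bits of its hash, those k bits need not be stored in the entry itself, so a fixed 64-bit entry can hand the reclaimed bits over to the record block number as the directory grows.

/* illustrative sketch only, not the real Shardmap index entry layout */
#include <stdint.h>

struct shard_geometry {
	unsigned hash_bits;	/* hash bits stored explicitly per entry */
	unsigned block_bits;	/* bits left over for the record block number */
};

static inline struct shard_geometry geometry(unsigned shard_index_bits,
					     unsigned total_hash_bits)
{
	struct shard_geometry g;

	/* bits implied by the shard number are not stored in the entry */
	g.hash_bits = total_hash_bits - shard_index_bits;
	g.block_bits = 64 - g.hash_bits;
	return g;
}

static inline uint64_t make_entry(struct shard_geometry g,
				  uint64_t hash, uint64_t block)
{
	uint64_t hash_mask = (1ULL << g.hash_bits) - 1;

	/* assumes block < 2^block_bits */
	return ((hash & hash_mask) << g.block_bits) | block;
}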
> There are many additional technical details of Shardmap that ought to be
> explained, however today my primary purpose is simply to introduce what
> I view as a compelling case for obsoleting HTree. To that end, fewer
> words are better and this post is already quite long enough. I would
> love to get into some other interesting details, for example, the Bigmap
> free space mapping strategy, but that really needs its own post to do it
> justice, as do a number of other subjects.
>
> Wrapping up, what about that theoretical case where HTree outperforms
> Shardmap? This is theoretical because one needs to operate on a huge
> directory with tiny cache to trigger it. Both Shardmap and HTree will
> exhibit read multiplication under such conditions, due to frequent
> cache evictions, however the Shardmap read multiplication will be many
> times higher than HTree because of its coarser cache granularity. In the
> unlikely event that we ever need to fix this, one viable solution is to
> add paging support for Shardmap's in-memory cache structure, a standard
> technique sometimes called "anticache".
>
> That is it for today. There remains much to explain about Shardmap both
> within and beyond the Ext4 context. For example, Shardmap has proved to
> work very well as a standalone key value store, particularly with
> persistent memory. In fact, we have benchmarked Shardmap at over three
> million atomic, durable database operations per second on an Intel
> Optane server, which might well be a new world record. The details of
> how this was done are fascinating, however this post is far too small to
> contain them today. Perhaps this should be thing 1(b) for next week.
>
Let's imagine that we need to implement the Shardmap approach. Could
you estimate the implementation and stabilization time? How expensive
and long-term could this effort be?
Thanks,
Viacheslav Dubeyko.
On 2019-11-26 11:40 p.m., Vyacheslav Dubeyko wrote:
> As far as I know, a folder usually contains dozens or hundreds of
> files/folders on average. Many research studies have shown this. Do you
> have in mind some special use case where a folder could contain a
> billion files? Could you share some research work that describes a
> practical use case with a billion files per folder?
You are entirely correct that the vast majority of directories contain
only a handful of files. That is my case (1). A few directories on a
typical server can go into the tens of thousands of files. There was
a time when we could not handle those efficiently, and now thanks to
HTree we can. Some directories go into the millions, ask the Lustre
people about that. If you could have a directory with a billion files
then somebody will have a use for it. For example, you may be able to
avoid a database for a particular application and just use the file
system instead.
Now, scaling to a billion files is just one of several things that
Shardmap does better than HTree. More immediately, Shardmap implements
readdir simply, accurately and efficiently, unlike HTree. See here for
some discussion:
https://lwn.net/Articles/544520/
"Widening ext4's readdir() cookie"
See the recommendation that is sometimes offered to work around
HTree's issues with processing files in hash order. Basically, read
the entire directory into memory, sort by inode number, then process
in that order. As an application writer do you really want to do this,
or would you prefer that the filesystem just take care of it for you so
the normal, simple and readable code is also the most efficient code?
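For concreteness, the workaround being criticized looks roughly like the following in application code, using ordinary POSIX directory calls. This is the burden Shardmap aims to lift from applications, not anything Shardmap itself does.

/* sketch of the "read all, sort by inode, then process" workaround */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/types.h>

struct ent { ino_t ino; char name[256]; };

static int by_ino(const void *a, const void *b)
{
	ino_t x = ((const struct ent *)a)->ino;
	ino_t y = ((const struct ent *)b)->ino;
	return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
	DIR *dir = opendir(argc > 1 ? argv[1] : ".");
	struct ent *v = NULL;
	size_t n = 0, max = 0;
	struct dirent *de;

	if (!dir)
		return 1;
	while ((de = readdir(dir))) {
		if (n == max) {
			v = realloc(v, (max = max ? 2 * max : 1024) * sizeof *v);
			if (!v)
				return 1;
		}
		v[n].ino = de->d_ino;
		snprintf(v[n].name, sizeof v[n].name, "%s", de->d_name);
		n++;
	}
	closedir(dir);
	qsort(v, n, sizeof *v, by_ino);		/* inode order, not hash order */
	for (size_t i = 0; i < n; i++)		/* "process" = print, for the demo */
		printf("%ju %s\n", (uintmax_t)v[i].ino, v[i].name);
	free(v);
	return 0;
}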
> If you are talking about improving the performance then do you mean
> some special open-source implementation?
I mean the same kind of kernel filesystem implementation that HTree
currently has. Open source of course, GPLv2 to be specific.
>> For delete, Shardmap avoids write multiplication by appending tombstone
>> entries to index shards, thereby addressing a well known HTree delete
>> performance issue.
>
> Do you mean Copy-On-Write policy here or some special technique?
The technique Shardmap uses to reduce write amplification under heavy
load is somewhat similar to the technique used by Google's Bigtable to
achieve a similar result for data files. (However, the resemblance to
Bigtable ends there.)
Each update to a Shardmap index is done twice: once in a highly
optimized hash table shard in cache, then again by appending an
entry to the tail of the shard's media "fifo". Media writes are
therefore mostly linear. I say mostly, because if there is a large
number of shards then a single commit may need to update the tail
block of each one, which still adds up to vastly fewer blocks than
the BTree case, where it is easy to construct cases where every
index block must be updated on every commit, a nasty example of
n**2 performance overhead.
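As a rough sketch of that double update (names invented here; the real thing is in the Shardmap source): the entry lands in the cached hash table shard immediately, and the same entry, or a tombstone on delete, is appended at the shard fifo's tail, so the eventual media write-out stays linear.

/* illustrative sketch, not the Shardmap source */
#include <stdint.h>

enum { TOMBSTONE = 1 };

struct fifo_entry {
	uint64_t packed;	/* packed hash + block index entry */
	uint8_t flags;		/* e.g. TOMBSTONE on delete */
};

struct shard {
	/* in-cache hash table for this shard omitted */
	struct fifo_entry *fifo;	/* in-memory image of the media fifo */
	unsigned tail;			/* append point; written out linearly */
};

static void shard_insert(struct shard *s, uint64_t packed)
{
	/* 1) update the cached hash table shard (omitted) */
	/* 2) append to the fifo tail; commits write only tail blocks */
	s->fifo[s->tail++] = (struct fifo_entry){ .packed = packed };
}

static void shard_delete(struct shard *s, uint64_t packed)
{
	/* deletes never rewrite old fifo blocks; they append a tombstone */
	s->fifo[s->tail++] = (struct fifo_entry){ .packed = packed,
						  .flags = TOMBSTONE };
}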
> How well would Shardmap suit the SSD use case? Do you mean that we
> could reduce the write amplification issue for the NAND flash case?
Correct. Reducing write amplification is particularly important for
flash based storage. It also has a noticeable beneficial effect on
efficiency under many common and not so common loads.
> Let's imagine that we need to implement the Shardmap approach. Could
> you estimate the implementation and stabilization time? How expensive
> and long-term could this effort be?
Shardmap is already implemented and stable, though it does need wider
usage and testing. Code is available here:
https://github.com/danielbot/Shardmap
Shardmap needs to be ported to kernel, already planned and in progress
for Tux3. Now I am proposing that the Ext4 team should consider porting
Shardmap to Ext4, or at least enter into a serious discussion of the
logistics.
Added Darrick to cc, as he is already fairly familiar with this subject,
once was an Ext4 developer, and perhaps still is should the need arise.
By the way, is there a reason that Ted's MIT address bounced on my
original post?
Regards,
Daniel
> On Nov 27, 2019, at 11:28 AM, Daniel Phillips <[email protected]> wrote:
>
> On 2019-11-26 11:40 p.m., Vyacheslav Dubeyko wrote:
>> As far as I know, usually, a folder contains dozens or hundreds
>> files/folders in average. There are many research works that had showed
>> this fact. Do you mean some special use-case when folder could contain
>> the billion files? Could you share some research work that describes
>> some practical use-case with billion files per folder?
>
> You are entirely correct that the vast majority of directories contain
> only a handful of files. That is my case (1). A few directories on a
> typical server can go into the tens of thousands of files. There was
> a time when we could not handle those efficiently, and now thanks to
> HTree we can. Some directories go into the millions, ask the Lustre
> people about that. If you could have a directory with a billion files
> then somebody will have a use for it. For example, you may be able to
> avoid a database for a particular application and just use the file
> system instead.
>
> Now, scaling to a billion files is just one of several things that
> Shardmap does better than HTree. More immediately, Shardmap implements
> readdir simply, accurately and efficiently, unlike HTree. See here for
> some discussion:
>
> https://lwn.net/Articles/544520/
> "Widening ext4's readdir() cookie"
>
So it looks like Shardmap could be better for the case of a billion files in one
folder. But what about the regular case of dozens or hundreds of files in one
folder? Will Shardmap be better than HTree? If the ordinary user sees no visible
performance improvement, then it makes sense to treat Shardmap as an
optional feature. What do you think?
Does this mean that Shardmap is ext4-oriented only? Could it be used for other
file systems?
> See the recommendation that is sometimes offered to work around
> HTree's issues with processing files in hash order. Basically, read
> the entire directory into memory, sort by inode number, then process
> in that order. As an application writer do you really want to do this,
> or would you prefer that the filesystem just take care of it for you so
> the normal, simple and readable code is also the most efficient code?
>
I slightly missed the point here. Reading the whole directory sounds like reading
the dentry tree from the volume. As far as I can see, the dentries are ordered
by names or by hashes. But if we are talking about inode numbers then we mean
the inode tree. So I have a misunderstanding here. What do you mean?
>> If you are talking about improving the performance then do you mean
>> some special open-source implementation?
>
> I mean the same kind of kernel filesystem implementation that HTree
> currently has. Open source of course, GPLv2 to be specific.
>
I meant the Shardmap implementation. As far as I can see, only the user-space
implementation is available now. So my question still stands: it's hard to say how
efficient Shardmap could be on the kernel side, as part of the ext4 subsystem, for example.
>>> For delete, Shardmap avoids write multiplication by appending tombstone
>>> entries to index shards, thereby addressing a well known HTree delete
>>> performance issue.
>>
>> Do you mean Copy-On-Write policy here or some special technique?
>
> The technique Shardmap uses to reduce write amplification under heavy
> load is somewhat similar to the technique used by Google's Bigtable to
> achieve a similar result for data files. (However, the resemblance to
> Bigtable ends there.)
>
> Each update to a Shardmap index is done twice: once in a highly
> optimized hash table shard in cache, then again by appending an
> entry to the tail of the shard's media "fifo". Media writes are
> therefore mostly linear. I say mostly, because if there is a large
> number of shards then a single commit may need to update the tail
> block of each one, which still adds up to vastly fewer blocks than
> the BTree case, where it is easy to construct cases where every
> index block must be updated on every commit, a nasty example of
> n**2 performance overhead.
>
It sounds like updates are added in a log-structured manner. But what about
the obsolete/invalid blocks? Does this mean that some GC technique is needed
here? I am not completely sure that it would be beneficial for ext4.
By the way, could the old index blocks be used like snapshots in the case
of corruption or other nasty issues?
>> How could be good Shardmap for the SSD use-case? Do you mean that we
>> could reduce write amplification issue for NAND flash case?
>
> Correct. Reducing write amplification is particularly important for
> flash based storage. It also has a noticeable beneficial effect on
> efficiency under many common and not so common loads.
>
>> Let's imagine that it needs to implement the Shardmap approach. Could
>> you estimate the implementation and stabilization time? How expensive
>> and long-term efforts could it be?
>
> Shardmap is already implemented and stable, though it does need wider
> usage and testing. Code is available here:
>
> https://github.com/danielbot/Shardmap
>
> Shardmap needs to be ported to kernel, already planned and in progress
> for Tux3. Now I am proposing that the Ext4 team should consider porting
> Shardmap to Ext4, or at least enter into a serious discussion of the
> logistics.
>
> Added Darrick to cc, as he is already fairly familiar with this subject,
> once was an Ext4 developer, and perhaps still is should the need arise.
> By the way, is there a reason that Ted's MIT address bounced on my
> original post?
>
It's hard to talk about stability because we have no kernel-side implementation
of Shardmap for ext4 yet. I suppose the porting would take about a year,
and the stabilization twice as long. Porting user-space code to the kernel
could be a tricky task. Could you estimate how many lines of code the core
part of Shardmap contains? Does it need to change the ext4 on-disk layout for
this feature? How easily can ext4 be modified for Shardmap support?
Thanks,
Viacheslav Dubeyko.
On 2019-11-27 11:35 a.m., Viacheslav Dubeyko wrote:
> So it looks like Shardmap could be better for the case of a billion files in one folder.
> But what about the regular case of dozens or hundreds of files in one
> folder? Will Shardmap be better than HTree?
Yes, Shardmap is also faster than HTree in that case. Both Shardmap and
HTree are unindexed in that range, however Shardmap is faster because of
two things:
1) Shardmap uses a more efficient record block format that incorporates
a single byte hash code that avoids 99% of failed string compares.
2) Shardmap "pins" the current target block in which new entries are
created, avoiding repeated radix tree lookups for insert under load.
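Point (1) might look roughly like the following sketch (a hypothetical record layout, not the actual Shardmap block format): a stored low hash byte lets a lookup reject about 255 of 256 non-matching records without ever touching the name.

/* illustrative sketch, invented layout */
#include <stdint.h>
#include <string.h>

struct dirrec {
	uint8_t lowhash;	/* low byte of the name hash */
	uint8_t namelen;
	char name[62];		/* fixed-size field for the sketch */
	uint64_t inum;
};

static struct dirrec *find_in_block(struct dirrec *rec, unsigned count,
				    const char *name, uint8_t namelen,
				    uint8_t lowhash)
{
	for (unsigned i = 0; i < count; i++) {
		if (rec[i].lowhash != lowhash || rec[i].namelen != namelen)
			continue;	/* cheap reject, no string compare */
		if (!memcmp(rec[i].name, name, namelen))
			return &rec[i];
	}
	return NULL;
}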
As soon as you get into thousands of files, the difference between
Shardmap and HTree widens dramatically so that Shardmap ends up faster
by a factor of 2, 3, 4, etc as directory size increases. Not just
faster than HTree, but faster than any tree based scheme, because of
the O(1) / O(log N) equation.
Up in the millions and billions of files, HTree just drops out of the
running, but if it were actually improved to operate in that range then
lookups would be at least 4 times slower due to index block probes, and
heavy insert loads would be orders of magnitude slower due to write
multiplication on commit. Of course I hear you when you say that you
don't have any million file directories to worry about, but some folks
do. (Any comment, Andreas?)
> If the ordinary user sees no visible
> performance improvement, then it makes sense to treat Shardmap as an
> optional feature. What do you think?
I am confident that a significant number of users will perceive the
performance improvement, and that few if any will perceive a slowdown for
any reason short of an outright bug.
> Does this mean that Shardmap is ext4-oriented only? Could it be used for other
> file systems?
Shardmap is not Ext4-only. In fact, Shardmap is firstly the directory
index for Tux3, and I am now proposing that Shardmap should also be
the new directory index for Ext4.
I also heard a suggestion that Shardmap could/should become a generic
kernel library facility that could be used by any file system or
kernel subsystem that requires a high performance persistent
associative string mapping.
Shardmap is also well on its way to being released as a generic high
performance KVS, including supporting persistent memory, a role it
performs in a highly satisfactory way. There will be a post about
this later, but for today, a spoiler: atomic, durable KVS database
transactions at a rate in excess of three per microsecond(!)
>> See the recommendation that is sometimes offered to work around
>> HTree's issues with processing files in hash order. Basically, read
>> the entire directory into memory, sort by inode number, then process
>> in that order. As an application writer do you really want to do this,
> or would you prefer that the filesystem just take care of it for you so
>> the normal, simple and readable code is also the most efficient code?
>
> I slightly missed the point here. Reading the whole directory sounds like reading
> the dentry tree from the volume. As far as I can see, the dentries are ordered
> by names or by hashes. But if we are talking about inode numbers then we mean
> the inode tree. So I have a misunderstanding here. What do you mean?
It's a bit of a horror show. Ted is the expert on it, I was largely a
bystander at the time this was implemented. Basically, the issue is that
HTree is a hash-ordered BTree (the H in HTree) and the only way to
traverse it for readdir that can possibly satisfy POSIX requirements is
by hash order. If you try to traverse in linear block order then a
simultaneous insert could split a block and move some entries to other
blocks, which then may be returned by readdir twice or not at all. So
hash order traversal is necessary, but this is not easy because hashes
can collide. So what Ted did is, create a temporary structure that
persists for some period of time, to utilize a higher resolution hash
code which is used to resolve collisions and provide telldir cookies
for readdir. Basically. This is even more tricky than it sounds for
various horrifying reasons.
If you want the whole story you will have to ask Ted. Suffice to say that
it takes a special kind of mind to even conceive of such a mechanism, let
alone get it working so well that we have not seen a single complaint
about it for years. However, this code is by any standard, scary and only
marginally maintainable. It further seems likely that a sufficiently
determined person could construct a case where it fails, though I cannot
swear to that one way or the other.
Why did we go to all this effort as opposed to just introducing some
additional ordering metadata as XFS does? Because HTree is faster
without that additional metadata to maintain, and more compact. The
user perceives this; the user does not perceive the pain suffered by
implementors to get this working, nor does the user need to confront
the full horror of the source code. The user cares only about how it
works, and it does work pretty darn well. That said, deprecating this
code will still be immensely satisfying from where I sit. It is more
than likely that Ted shares the same view, though I certainly cannot
speak for him.
In summary, we should all just be happy that this readdir hack worked
well enough over the last 15 years or so to run the world's internet
and phones so reliably. Now let's retire it please, and move on to
something with a sounder design basis, and which is far easier to
understand and maintain, and runs faster to boot. Now, with Shardmap,
readdir runs at media transfer speed, with near zero cpu, unlike
HTree.
>>> If you are talking about improving the performance then do you mean
>>> some special open-source implementation?
>>
>> I mean the same kind of kernel filesystem implementation that HTree
>> currently has. Open source of course, GPLv2 to be specific.
>
> I meant the Shardmap implementation. As far as I can see, only the
> user-space implementation is available now. So my question still
> stands: it's hard to say how efficient Shardmap could be on the kernel
> side, as part of the ext4 subsystem, for example.
That is actually quite easy to predict. All of our benchmarking so far
has been with user space Shardmap running on top of Ext4, so we already
have a pretty accurate picture of the overheads involved. That said,
there will be two main differences between the user space code and the
kernel code:
1) We don't have virtual memory in kernel in any practical form, so
we need to simulate it with explicit lookups in a vector of page
pointers, costing a tiny but likely measurable amount of cpu compared
to the hardware implementation that user space enjoys.
2) We don't know the overhead of atomic commit for Ext4 yet. We do
have a pretty good picture for Tux3: near zero. And we have a very
clear picture of the atomic commit overhead for persistent memory,
which is nonzero but much less than annoying. So atomic commit
overhead for Ext4 should be similar. This is really where the skill
of Ext4 developers kicks in, and of course I expect great things
in that regard, as has been the case consistently to date.
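Point (1) above amounts to something like this sketch (illustrative names only): a vector of page pointers stands in for the flat virtual mapping that user space gets for free from the MMU, at the cost of one extra lookup per access.

/* illustrative sketch, not kernel code */
#include <stddef.h>

#define PG_SHIFT 12
#define PG_MASK  ((1UL << PG_SHIFT) - 1)

struct shard_cache {
	void **pages;		/* vector of page pointers */
	size_t npages;
};

/* resolve a byte offset within the cache to an address by hand,
 * instead of letting hardware page tables do the work */
static inline void *cache_addr(struct shard_cache *c, size_t offset)
{
	return (char *)c->pages[offset >> PG_SHIFT] + (offset & PG_MASK);
}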
>>>> For delete, Shardmap avoids write multiplication by appending tombstone
>>>> entries to index shards, thereby addressing a well known HTree delete
>>>> performance issue.
>>>
>>> Do you mean Copy-On-Write policy here or some special technique?
>>
>> The technique Shardmap uses to reduce write amplification under heavy
>> load is somewhat similar to the technique used by Google's Bigtable to
>> achieve a similar result for data files. (However, the resemblance to
>> Bigtable ends there.)
>>
>> Each update to a Shardmap index is done twice: once in a highly
>> optimized hash table shard in cache, then again by appending an
>> entry to the tail of the shard's media "fifo". Media writes are
>> therefore mostly linear. I say mostly, because if there is a large
>> number of shards then a single commit may need to update the tail
>> block of each one, which still adds up to vastly fewer blocks than
>> the BTree case, where it is easy to construct cases where every
>> index block must be updated on every commit, a nasty example of
>> n**2 performance overhead.
>
> It sounds like updates are added in a log-structured manner. But what about
> the obsolete/invalid blocks? Does this mean that some GC technique is needed
> here? I am not completely sure that it would be beneficial for ext4.
This is vaguely similar to log structured updating, but then again, it
is more different than similar. A better term might be "streaming
updates". This is a popular theme of modern file system and database
design, and the basis of many recent performance breakthroughs.
Shardmap never garbage collects. Instead, when a shard fifo gets too
many tombstones, it is just compacted by writing out the entire cache
shard on top of the old, excessively fluffy shard fifo. This is both
efficient and rare, for various reasons that require detailed analysis
of the data structures involved. I will get to that eventually, but now
is probably not the best time. The source code makes it clear.
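A sketch of that compaction rule, with invented names and an arbitrary threshold (the real policy lives in the Shardmap source): when the media fifo carries well more entries than the cache shard currently holds live, rewrite the whole cache shard over it instead of collecting garbage.

/* illustrative sketch, threshold chosen for the example only */
struct shard_stats {
	unsigned live;		/* entries currently live in the cache shard */
	unsigned fifo_total;	/* entries ever appended to the media fifo */
};

static int shard_needs_compact(const struct shard_stats *s)
{
	/* e.g. compact once more than half of the fifo is dead weight */
	return s->fifo_total > 2 * s->live;
}

static void shard_commit(struct shard_stats *s)
{
	if (shard_needs_compact(s)) {
		/* rewrite the entire cache shard over the old fifo ... */
		s->fifo_total = s->live;
	} else {
		/* ... otherwise just append this commit's new entries
		 * and tombstones at the fifo tail as usual */
	}
}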
> By the way, could the old index blocks be used like snapshots in the case
> of corruption or other nasty issues?
My feeling is, that is not a natural fit. However, rescuing Shardmap from
index corruption is easy: just throw away the entire index and construct
a new one by walking the entry record blocks. Maybe there are cute ways
to make that incremental, but the simplest easiest way should actually be
enough for the long term.
>>> Let's imagine that it needs to implement the Shardmap approach. Could
>>> you estimate the implementation and stabilization time? How expensive
>>> and long-term efforts could it be?
>>
>> Shardmap is already implemented and stable, though it does need wider
>> usage and testing. Code is available here:
>>
>> https://github.com/danielbot/Shardmap
>>
>> Shardmap needs to be ported to kernel, already planned and in progress
>> for Tux3. Now I am proposing that the Ext4 team should consider porting
>> Shardmap to Ext4, or at least enter into a serious discussion of the
>> logistics.
>>
>> Added Darrick to cc, as he is already fairly familiar with this subject,
>> once was an Ext4 developer, and perhaps still is should the need arise.
>> By the way, is there a reason that Ted's MIT address bounced on my
>> original post?
>
> It's hard to talk about stability because we have no kernel-side implementation
> of Shardmap for ext4 yet. I suppose the porting would take about a year,
> and the stabilization twice as long.
Agreed, my best guess is roughly similar.
> Porting user-space code to the kernel could be a tricky task.
Sometimes true, however not in this case. Shardmap is broadly similar to
other code we have ported from user space to kernel in the past, with the
two exceptions I noted above. All just part of a kernel hacker's normal day
in my humble opinion.
> Could you estimate how many lines of code the core
> part of Shardmap contains?
The finished Ext4 code should be somewhere between 2k and 3k lines, about
the same as HTree.
> Does it need to change the ext4 on-disk layout for this feature?
No, this new form of index would be a mount option, and only used for
new directories. The HTree code will necessarily remain part of Ext4
forever, for compatibility with existing volumes. By "deprecate HTree"
I meant, eventually make the Shardmap index the default after it has
gone through its stabilization period of who knows how long? Three
years? Five? It's hard to know the future in that regard.
This would be roughly similar to the transition we did from unindexed
to HTree index some 18 years ago. (Deja vu all over again!) Last time it
went smoothly and this time we additionally have the benefit of having
done it before.
> How easily can ext4 be modified for Shardmap support?
From the user's point of view, completely trivial. Initially just a mount
option, and later, no action at all. From the sysadmin's point of view,
something new to learn about, some new procedures in case things go wrong,
but essentially all in a day's work. From the developer's point of view,
one of the easier major hacks that one could contemplate, I expect.
Technical risk is nearly nil because Shardmap is already quite
mature, being seven years old as it is, and having had the benefit of
considerable earlier design experience.
Regards,
Daniel
On Nov 27, 2019, at 7:54 PM, Daniel Phillips <[email protected]> wrote:
>
> On 2019-11-27 11:35 a.m., Viacheslav Dubeyko wrote:
>> So, it looks like that Shardmap could be better for the case of billion files in one folder. But what’s about the regular case when it could be
>> dozens/hundreds of files in one folder? Will Shardmap be better than HTree?
>
> Yes, Shardmap is also faster than HTree in that case. Both Shardmap and
> HTree are unindexed in that range, however Shardmap is faster because of
> two things:
>
> 1) Shardmap uses a more efficient record block format that incorporates
> a single byte hash code that avoids 99% of failed string compares.
>
> 2) Shardmap "pins" the current target block in which new entries are
> created, avoiding repeated radix tree lookups for insert under load.
>
> As soon as you get into thousands of files, the difference between
> Shardmap and HTree widens dramatically so that Shardmap ends up faster
> by a factor of 2, 3, 4, etc as directory size increases. Not just
> faster than HTree, but faster than any tree based scheme, because of
> the O(1) / O(log N) equation.
>
> Up in the millions and billions of files, HTree just drops out of the
> running, but if it were actually improved to operate in that range then
Actually, 3-level htree was implemented for ext4 several years ago, but
was held back because there was no e2fsck support for it. That was
finished and the 3-level htree support was landed to ext4 in 2017 in
commit v4.12-rc2-15-ge08ac99. In theory it could allow ~5B entries in
a single directory (the 2GB size limit was also removed at the same time).
The code change for this was relatively straightforward, but as you
wrote elsewhere the big problem is each htree insert/lookup/remove
operation degrades to random 4KB IOPS for every change (for the
directory leaf blocks on insert/remove, and for the itable blocks on
readdir), so has about 4096 / 64 = 64x write inflation or more.
A log-structured directory insert/remove feature is appealing to me
if it can avoid this overhead in practice.
> lookups would be at least 4 times slower due to index block probes, and
> heavy insert loads would be orders of magnitude slower due to write
> multiplication on commit. Of course I hear you when you say that you
> don't have any million file directories to worry about, but some folks
> do. (Any comment, Andreas?)
We regularly have directories in the 1M+ size, because users can easily
run many thousands of processes concurrently creating files in the same
directory. The 2-level htree topped out around 10-12M entries, which was
hit occasionally. At the same time, we also put in directory size limits
so that admins could *prevent* users from doing this, because it also can
cause problems for the user/admin when they need to process such large
directories ("ls -l" will of course never finish).
>> If the ordinary user hasn’t visible performance improvement then it makes
>> sense to consider Shardmap like the optional feature. What do you think?
>
> I am confident that a significant number of users will perceive the
> performance improvement, and that few if any will perceive a slowdown for
> any reason short of an outright bug.
>
>> Does it mean that Shardmap is ext4 oriented only? Could it be used for
>> another file systems?
>
> Shardmap is not Ext4-only. In fact, Shardmap is firstly the directory
> index for Tux3, and I am now proposing that Shardmap should also be
> the new directory index for Ext4.
>
> I also heard a suggestion that Shardmap could/should become a generic
> kernel library facility that could be used by any file system or
> kernel subsystem that requires a high performance persistent
> associative string mapping.
>
> Shardmap is also well on its way to being released as a generic high
> performance KVS, including supporting persistent memory, a role it
> performs in a highly satisfactory way. There will be a post about
> this later, but for today, a spoiler: atomic, durable KVS database
> transactions at a rate in excess of three per microsecond(!)
>
>>> See the recommendation that is sometimes offered to work around
>>> HTree's issues with processing files in hash order. Basically, read
>>> the entire directory into memory, sort by inode number, then process
>>> in that order. As an application writer do you really want to do this,
>>> or would you prefer that the filesystem just take care of it for you so
>>> the normal, simple and readable code is also the most efficient code?
>>
>> I slightly missed the point here. To read the whole directory sounds like
>> to read the dentries tree from the volume. As far as I can see, the
>> dentries are ordered by names or by hashes. But if we are talking about
>> inode number then we mean the inodes tree. So, I have misunderstanding
>> here. What do you mean?
>
> It's a bit of a horror show. Ted is the expert on it, I was largely a
> bystander at the time this was implemented. Basically, the issue is that
> HTree is a hash-ordered BTree (the H in HTree) and the only way to
> traverse it for readdir that can possibly satisfy POSIX requirements is
> by hash order. If you try to traverse in linear block order then a
> simultaneous insert could split a block and move some entries to other
> blocks, which then may be returned by readdir twice or not at all. So
> hash order traversal is necessary, but this is not easy because hashes
> can collide. So what Ted did is, create a temporary structure that
> persists for some period of time, to utilize a higher resolution hash
> code which is used to resolve collisions and provide telldir cookies
> for readdir. Basically. This is even more tricky than it sounds for
> various horrifying reasons.
>
> If you want the whole story you will have to ask Ted. Suffice to say that
> it takes a special kind of mind to even conceive of such a mechanism, let
> alone get it working so well that we have not seen a single complaint
> about it for years. However, this code is by any standard, scary and only
> marginally maintainable. It further seems likely that a sufficiently
> determined person could construct a case where it fails, though I cannot
> swear to that one way or the other.
>
> Why did we go to all this effort as opposed to just introducing some
> additional ordering metadata as XFS does? Because HTree is faster
> without that additional metadata to maintain, and more compact. The
> user perceives this; the user does not perceive the pain suffered by
> implementors to get this working, nor does the user need to confront
> the full horror of the source code. The user cares only about how it
> works, and it does work pretty darn well. That said, deprecating this
> code will still be immensely satisfying from where I sit. It is more
> than likely that Ted shares the same view, though I certainly cannot
> speak for him.
>
> In summary, we should all just be happy that this readdir hack worked
> well enough over the last 15 years or so to run the world's internet
> and phones so reliably. Now let's retire it please, and move on to
> something with a sounder design basis, and which is far easier to
> understand and maintain, and runs faster to boot. Now, with Shardmap,
> readdir runs at media transfer speed, with near zero cpu, unlike
> HTree.
>
>>>> If you are talking about improving the performance then do you mean
>>>> some special open-source implementation?
>>>
>>> I mean the same kind of kernel filesystem implementation that HTree
>>> currently has. Open source of course, GPLv2 to be specific.
>>
>> I meant the Shardmap implementation. As far as I can see, the
>> user-space implementation is available only now. So, my question is
>> still here. It’s hard to say how efficient the Shardmap could be on
>> kernel side as ext4 subsystem, for example.
>
> That is actually quite easy to predict. All of our benchmarking so far
> has been with user space Shardmap running on top of Ext4, so we already
> have a pretty accurate picture of the overheads involved. That said,
> there will be two main differences between the user space code and the
> kernel code:
>
> 1) We don't have virtual memory in kernel in any practical form, so
> we need to simulate it with explicit lookups in a vector of page
> pointers, costing a tiny but likely measurable amount of cpu compared
> to the hardware implementation that user space enjoys.
>
> 2) We don't know the overhead of atomic commit for Ext4 yet. We do
> have a pretty good picture for Tux3: near zero. And we have a very
> clear picture of the atomic commit overhead for persistent memory,
> which is nonzero but much less than annoying. So atomic commit
> overhead for Ext4 should be similar. This is really where the skill
> of Ext4 developers kicks in, and of course I expect great things
> in that regard, as has been the case consistently to date.
>
>>>>> For delete, Shardmap avoids write multiplication by appending tombstone
>>>>> entries to index shards, thereby addressing a well known HTree delete
>>>>> performance issue.
>>>>
>>>> Do you mean Copy-On-Write policy here or some special technique?
>>>
>>> The technique Shardmap uses to reduce write amplification under heavy
>>> load is somewhat similar to the technique used by Google's Bigtable to
>>> achieve a similar result for data files. (However, the resemblance to
>>> Bigtable ends there.)
>>>
>>> Each update to a Shardmap index is done twice: once in a highly
>>> optimized hash table shard in cache, then again by appending an
>>> entry to the tail of the shard's media "fifo". Media writes are
>>> therefore mostly linear. I say mostly, because if there is a large
>>> number of shards then a single commit may need to update the tail
>>> block of each one, which still adds up to vastly fewer blocks than
>>> the BTree case, where it is easy to construct cases where every
>>> index block must be updated on every commit, a nasty example of
>>> n**2 performance overhead.
>>
>> It sounds like adding updates in log-structured manner. But what’s about
>> the obsolete/invalid blocks? Does it mean that it need to use some GC
>> here? I am not completely sure that it could be beneficial for the ext4.
>
> This is vaguely similar to log structured updating, but then again, it
> is more different than similar. A better term might be "streaming
> updates". This is a popular theme of modern file system and database
> design, and the basis of many recent performance breakthroughs.
>
> Shardmap never garbage collects. Instead, when a shard fifo gets too
> many tombstones, it is just compacted by writing out the entire cache
> shard on top of the old, excessively fluffy shard fifo. This is both
> efficient and rare, for various reasons that require detailed analysis
> of the data structures involved. I will get to that eventually, but now
> is probably not the best time. The source code makes it clear.
>
>> By the way, could the old index blocks be used like the snapshots in the
>> case of corruptions or other nasty issues?
>
> My feeling is, that is not a natural fit. However, rescuing Shardmap from
> index corruption is easy: just throw away the entire index and construct
> a new one by walking the entry record blocks. Maybe there are cute ways
> to make that incremental, but the simplest easiest way should actually be
> enough for the long term.
>
>>>> Let's imagine that it needs to implement the Shardmap approach. Could
>>>> you estimate the implementation and stabilization time? How expensive
>>>> and long-term efforts could it be?
>>>
>>> Shardmap is already implemented and stable, though it does need wider
>>> usage and testing. Code is available here:
>>>
>>> https://github.com/danielbot/Shardmap
>>>
>>> Shardmap needs to be ported to kernel, already planned and in progress
>>> for Tux3. Now I am proposing that the Ext4 team should consider porting
>>> Shardmap to Ext4, or at least enter into a serious discussion of the
>>> logistics.
>>>
>>> Added Darrick to cc, as he is already fairly familiar with this subject,
>>> once was an Ext4 developer, and perhaps still is should the need arise.
>>> By the way, is there a reason that Ted's MIT address bounced on my
>>> original post?
>>
>> It’s hard to talk about stability because we haven’t kernel-side implementation
>> of Shardmap for ext4. I suppose that it needs to spend about a year for the
>> porting and twice more time for the stabilization.
>
> Agreed, my best guess is roughly similar.
>
>> To port a user-space code to the kernel could be the tricky task.
>
> Sometimes true, however not in this case. Shardmap is broadly similar to
> other code we have ported from user space to kernel in the past, with the
> two exceptions I noted above. All just part of a kernel hacker's normal day
> in my humble opinion.
>
>> Could you estimate how many lines of code the core
>> part of Shardmap contains?
>
> The finished Ext4 code should be somewhere between 2k and 3k lines, about
> the same as HTree.
>
>> Does it need to change the ext4 on-disk layout for this feature?
>
> No, this new form of index would be a mount option, and only used for
> new directories. The HTree code will necessarily remain part of Ext4
> forever, for compatibility with existing volumes. By "deprecate HTree"
> I meant, eventually make the Shardmap index the default after it has
> gone through its stabilization period of who knows how long? Three
> years? Five? It's hard to know the future in that regard.
>
> This would be roughly similar to the transition we did from unindexed
> to HTree index some 18 years ago. (Deja vu all over again!) Last time it
> went smoothly and this time we additionally have the benefit of having
> done it before.
>
> How easy ext4 functionality can be modified for Shardmap support?
>
> From the user's point of view, completely trivial. Initially just a mount
> option, and later, no action at all. From the sysadmin's point of view,
> something new to learn about, some new procedures in case things go wrong,
> but essentially all in a day's work. From the developer's point of view,
> one of the easier major hacks that one could contemplate, I expect.
> Technical risk is nearly nil because Shardmap is already quite
> mature, being seven years old as it is, and having had the benefit of
> considerable earlier design experience.
>
> Regards,
>
> Daniel
Cheers, Andreas
On 2019-11-28 1:15 a.m., Andreas Dilger wrote:
> On Nov 27, 2019, at 7:54 PM, Daniel Phillips <[email protected]> wrote:
>>
>> On 2019-11-27 11:35 a.m., Viacheslav Dubeyko wrote:
>>> So, it looks like that Shardmap could be better for the case of billion files in one folder. But what’s about the regular case when it could be
>>> dozens/hundreds of files in one folder? Will Shardmap be better than HTree?
>>
>> Yes, Shardmap is also faster than HTree in that case. Both Shardmap and
>> HTree are unindexed in that range, however Shardmap is faster because of
>> two things:
>>
>> 1) Shardmap uses a more efficient record block format that incorporates
>> a single byte hash code that avoids 99% of failed string compares.
>>
>> 2) Shardmap "pins" the current target block in which new entries are
>> created, avoiding repeated radix tree lookups for insert under load.
>>
>> As soon as you get into thousands of files, the difference between
>> Shardmap and HTree widens dramatically so that Shardmap ends up faster
>> by a factor of 2, 3, 4, etc as directory size increases. Not just
>> faster than HTree, but faster than any tree based scheme, because of
>> the O(1) / O(log N) equation.
>>
>> Up in the millions and billions of files, HTree just drops out of the
>> running, but if it were actually improved to operate in that range then
>
> Actually, 3-level htree was implemented for ext4 several years ago, but
> was held back because there was no e2fsck support for it. That was
> finished and the 3-level htree support was landed to ext4 in 2017 in
> commit v4.12-rc2-15-ge08ac99. In theory it could allow ~5B entries in
> a single directory (the 2GB size limit was also removed at the same time).
>
> The code change for this was relatively straightforward,
Let's clarify: straightforward for you. The fact that HTree sat for 15
years with my lame, coded-in-a-day two level scheme suggests that this
would not be straightforward for everyone. In my defence, I plead that
at the time I regarded 10 million files as essentially infinite. This
makes me think of "640K ought to be enough for anyone" naturally.
Some day somebody will have a use case for billion file directories.
Until that day comes, I am ok with regarding this capability as a mere
stunt, or proof of scalability. Scaling 2 orders of magnitude higher is
just one of the improvements Shardmap offers, and in some sense, the
least important of them.
> but as you
> wrote elsewhere the big problem is each htree insert/lookup/remove
> operation degrades to random 4KB IOPS for every change (for the
> directory leaf blocks on insert/remove, and for the itable blocks on
> readdir), so has about 4096 / 64 = 64x write inflation or more.
Right. Shardmap linearizes pretty much everything, which seems to be
pretty much a necessity to achieve optimal performance with lowest
cache pressure, particularly with machines that are only marginally
capable of the load they are attempting.
> A log-structured directory insert/remove feature is appealing to me
> if it can avoid this overhead in practice.
The main problem with pure log structured updates is reading back the
log, which is randomly interleaved if written out in strict creation
order. This is the issue that LSM (as used by RocksDB and friends)
attempts to address:
https://en.wikipedia.org/wiki/Log-structured_merge-tree
Unfortunately, these solutions tend to be complex and top heavy on the
read side. In practice Shardmap won solidly in all of insert, lookup
and delete benchmarks when we benchmarked it. I am not sure why that
is, but RocksDB seems to be trying to do a lot of fancy stuff.
Shardmap uses a far simpler update technique: each new index entry is
simply appended to the end of the appropriate shard. The fewer the
shards, the more efficient that is. However, even when there are 1K
or up to 8K shards as we use for our 1 billion file benchmark, this
algorithm still works well in practice. The key trick is, include
enough updates in the commit so that each shard receives enough to
reduce the tail block write multiplication to a dull roar.
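For a rough sense of the arithmetic (my illustrative numbers, not taken from the benchmark): with 8192 shards and a commit batch of one million inserts, each shard tail receives on the order of 1,000,000 / 8192, about 120 eight-byte entries or roughly 1KB, per commit, so the per-shard cost stays near one tail block write rather than one block write per insert.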
On my not very spectacular Ryzen 7 workstation backed by SSD, we
create 1 billion directory entries in a bit less than 12 minutes. Is
this fast enough? On spinning rust it takes two or three minutes
longer. Clearly Ext4 plus the block layer are doing a nice job of
minimizing seeks for this load.
>> lookups would be at least 4 times slower due to index block probes, and
>> heavy insert loads would be orders of magnitude slower due to write
>> multiplication on commit. Of course I hear you when you say that you
>> don't have any million file directories to worry about, but some folks
>> do. (Any comment, Andreas?)
>
> We regularly have directories in the 1M+ size, because users can easily
> run many thousands of processes concurrently creating files in the same
> directory. The 2-level htree topped out around 10-12M entries, which was
> hit occasionally. At the same time, we also put in directory size limits
> so that admins could *prevent* users from doing this, because it also can
> cause problems for the user/admin when they need to process such large
> directories ("ls -l" will of course never finish).
OK, this confirms my impression of the sort of loads you feed to HTree.
In this range, I promise that you will see a solid speedup from Shardmap
for the directory ops alone, plus far better cache coherency with respect
to the inode table, and of course, a readdir that does not verge on
brilliant insanity.
Thanks for the comments Andreas, and I hope it is clear that I am also
offering Shardmap for Lustre. Not that you would even need to ask.
Regards,
Daniel