2014-12-12 00:42:23

by Akira Hayakawa

Subject: Re: [PATCH v2] staging: writeboost: Add dm-writeboost

Mike,

Below are my comments on Joe's previous comments:

1. Writeboost shouldn't split the bio into 4KB chunks.
No, it is necessary.
I know WALB (https://github.com/starpos/walb) logs data without
splitting, but its data structure becomes complicated as a result.
If you read my code carefully, you will notice that splitting
keeps the design simple and helps performance.

2. Writeboost has to avoid copying of data to in-core buffer.
No, it is also necessary.
Think about partial writes (smaller than 4KB).
Also, the incoming I/O rate is far smaller than the memory
throughput, so the copy does not affect performance, and in
practice it doesn't (see the fio results).
I will also copy data when I implement read-caching, which will
surely be criticized by you or Joe eventually.

As for the way this is being discussed,
it is NOT fair to argue from technically meaningless data,
in this case data measured on a VM. That is violence, not discussion.
I am really amazed that you two suddenly and "unstably" turned to nacking
within a day.
I showed you data in which Writeboost already surpasses the spindle, and
ignoring it is also unfair. Did you think I cheated? With read-caching,
Writeboost can surely become even more performant.

My conclusion at this point is that I don't want to change these two fundamental
design decisions. If you want to nack without understanding how important these
things are, that is fair enough, because device-mapper is your castle.
But in that case I can't (not won't) persuade you and will,
sadly, have to withdraw from upstreaming *forever*.

I think you aren't even trying to understand the importance of Writeboost,
although this person really understands it (https://github.com/yongseokoh/dm-src).
SSD caching should be log-structured.

- Akira

On 12/12/14 12:26 AM, Mike Snitzer wrote:
> On Wed, Dec 10 2014 at 6:42am -0500,
> Akira Hayakawa <[email protected]> wrote:
>
>> This patch adds dm-writeboost to staging tree.
>>
>> dm-writeboost is a log-structured SSD-caching driver.
>> It caches data in log-structured way on the cache device
>> so that the performance is maximized.
>>
>> The merit of putting this driver in staging tree is
>> to make it possible to get more feedback from users
>> and polish the codes.
>>
>> Signed-off-by: Akira Hayakawa <[email protected]>
>
> Joe Thornber Nacked this from going into staging based on his further
> review of the code. On the basis that you should _not_ be copying all
> pages associated with each bio into an in-core buffer. Have you
> addressed that fundamental problem?
>
> But taking a step back: Why do you think going into staging actually
> helps you? Anyone who is willing to test this code should be able to
> apply a patch to their kernel. Having to feed changes to Greg to
> deliver updates to any early consumers of this _EXPERIMENTAL_ target
> seems misplaced when you consider that Joe has detailed important
> fixes, capabilities and tools that need addressing before anyone should
> trust their production data to dm-writeboost.
>
> I think you're completely missing that you are pushing _hard_ for this
> target to go upstream before it is actually ready. In doing so you're
> so hung up on that "upstream" milestone that you cannot appreciate the
> reluctance that Joe and I have given the quantity of code you're pushing
> -- especially when coupled with the limited utility of dm-writeboost in
> comparison with full featured upstream drivers like dm-cache and bcache.
>
> As for this v2, you didn't detail what you changed (I can obviously
> apply v1 and then v2 to see the difference but a brief summary would be
> helpful in general when you revise a patch)
>
> But one inline comment:
>
>> diff --git a/drivers/staging/writeboost/TODO b/drivers/staging/writeboost/TODO
>> new file mode 100644
>> index 0000000..761a9fe
>> --- /dev/null
>> +++ b/drivers/staging/writeboost/TODO
>> @@ -0,0 +1,52 @@
>> +TODO:
>> +
>> +- Get the GitExtract test so it's performance is similar to raw spindle.
>
> No, the whole point of a write cache is to improve performance. You
> should be trying to surpass the performance of the raw spindle.
>


2014-12-12 09:12:26

by Bart Van Assche

Subject: Re: [dm-devel] [PATCH v2] staging: writeboost: Add dm-writeboost

On 12/12/14 01:42, Akira Hayakawa wrote:
> 1. Writeboost shouldn't split the bio into 4KB chunks.
> No. It is necessary.
> I know WALB (https://github.com/starpos/walb) logs data without
> splitting but the data structure becomes complicated.
> If you read my code carefully, you will notice that splitting
> helps the design simplicity and performance.

This is the first time I see someone claiming that reducing the request
size improves performance. I don't know any SSD model for which
splitting requests improves performance.

Additionally, since bio's are split by dm-writeboost, this makes me
wonder how atomic writes will ever be supported? Atomic writes are
being standardized by the T10 SCSI committee. I don't think the Linux
block layer already supports atomic writes today but I expect support
for atomic writes to be added to the block layer sooner or later. See
e.g. http://www.t10.org/doc13.htm / SBC-4 SPC-5 Atomic writes and reads
for the latest draft specification.

Bart.

2014-12-12 09:35:40

by Akira Hayakawa

Subject: Re: [dm-devel] [PATCH v2] staging: writeboost: Add dm-writeboost

On 12/12/14 6:12 PM, Bart Van Assche wrote:
> This is the first time I see someone claiming that reducing the request size improves performance. I don't know any SSD model for which splitting requests improves performance.

Writeboost batches a number of writes into a log (which is 512KB large) and submits it
to the SSD, which maximizes both the throughput and the lifetime of the SSD.
I think you rather misunderstand how Writeboost works.

- Akira

2014-12-12 11:41:26

by Bart Van Assche

Subject: Re: [dm-devel] [PATCH v2] staging: writeboost: Add dm-writeboost

On 12/12/14 10:35, Akira Hayakawa wrote:
> Writeboost batches number of writes into a log (that is 512KB large) and submits to SSD
> which maximizes the throughput and the lifetime of the SSD.

Does this mean that dm-writeboost is similar to btier? If so, this
makes me wonder which of these two subsystems would be the best
candidate for upstream inclusion. As far as I know btier doesn't split
bio's. The source code of btier can be downloaded from
http://sourceforge.net/projects/tier/.

Bart.

2014-12-12 14:24:54

by Joe Thornber

Subject: Re: [PATCH v2] staging: writeboost: Add dm-writeboost

On Fri, Dec 12, 2014 at 09:42:15AM +0900, Akira Hayakawa wrote:
> The SSD-caching should be log-structured.

No argument there, and this is why I've supported you with
dm-writeboost over the last couple of years.

However, after looking at the current code, and using it I think it's
a long, long way from being ready for production. As we've already
discussed there are some very naive design decisions in there, such as
copying every bio payload to another memory buffer, splitting all io
down to 4k. Think about the cpu overhead and memory consumption!
Think about how it will perform when memory is constrained and it
can't allocate many of those rambufs! I'm sure more issues will be
found if I read further.

I'm sorry to have disappointed you so, but if I let this go upstream
it would mean a massive amount of support work for me, not to mention
a damaged reputation for dm.

Mike raised the question of why you want this in the kernel so much?
You'd find none of the distros would support it; so it doesn't widen
your audience much. It's far better for you to maintain it outside of
the kernel at this point. Any users will be bold, adventurous people,
who will be quite capable of building a kernel module.

- Joe

2014-12-12 15:09:19

by Akira Hayakawa

Subject: Re: [PATCH v2] staging: writeboost: Add dm-writeboost

> However, after looking at the current code, and using it I think it's
> a long, long way from being ready for production. As we've already
> discussed there are some very naive design decisions in there, such as
> copying every bio payload to another memory buffer, splitting all io
> down to 4k. Think about the cpu overhead and memory consumption!
> Think about how it will perform when memory is constrained and it
> can't allocate many of those rambufs! I'm sure more issues will be
> found if I read further.
These decisions are based on measurement. They are not naive.
I am someone who dislikes performance optimization without measurement.
As a result, I regard what the simplicity brings as more important
than what any other possible design decision could offer.

About the CPU consumption:
the average CPU consumption while running a random-write fio job
against a consumer-level SSD is only 3% or so,
which is 5 times more efficient than bcache per IOPS.

With a RAM-backed cache device, it reaches about 1.5GB/sec throughput.
Even in this case the CPU consumption is only 12%.
Please see this post:
http://www.redhat.com/archives/dm-devel/2014-February/msg00000.html

I don't think the CPU consumption is large enough to be a problem.

About the memory consumption:
you seem to misunderstand the facts.
The rambufs are allocated statically, not dynamically.
The default amount is 8MB, and that is usually not worth arguing about.

> Mike raised the question of why you want this in the kernel so much?
> You'd find none of the distros would support it; so it doesn't widen
> your audience much. It's far better for you to maintain it outside of
> the kernel at this point. Any users will be bold, adventurous people,
> who will be quite capable of building a kernel module.
Some people already deploy Writeboost in their daily use.
The term "log-structured" seems to easily attract storage people's attention.
If this driver is merged upstream, I think it will gain a much wider audience and
thus more feedback.
When my driver was covered by Phoronix before, it actually drew attention.
Those readers must be waiting for Writeboost to become available upstream.
http://www.phoronix.com/scan.php?page=news_item&px=MTQ1Mjg

> I'm sorry to have disappointed you so, but if I let this go upstream
> it would mean a massive amount of support work for me, not to mention
> a damaged reputation for dm.
If you read the code further, you will find how simple the mechanism is,
not to mention how simple the code itself is.

- Akira

On 12/12/14 11:24 PM, Joe Thornber wrote:
> On Fri, Dec 12, 2014 at 09:42:15AM +0900, Akira Hayakawa wrote:
>> The SSD-caching should be log-structured.
>
> No argument there, and this is why I've supported you with
> dm-writeboost over the last couple of years.
>
> However, after looking at the current code, and using it I think it's
> a long, long way from being ready for production. As we've already
> discussed there are some very naive design decisions in there, such as
> copying every bio payload to another memory buffer, splitting all io
> down to 4k. Think about the cpu overhead and memory consumption!
> Think about how it will perform when memory is constrained and it
> can't allocate many of those rambufs! I'm sure more issues will be
> found if I read further.
>
> I'm sorry to have disappointed you so, but if I let this go upstream
> it would mean a massive amount of support work for me, not to mention
> a damaged reputation for dm.
>
> Mike raised the question of why you want this in the kernel so much?
> You'd find none of the distros would support it; so it doesn't widen
> your audience much. It's far better for you to maintain it outside of
> the kernel at this point. Any users will be bold, adventurous people,
> who will be quite capable of building a kernel module.
>
> - Joe
>

2014-12-13 06:45:34

by Jianjian Huo

Subject: Re: [dm-devel] [PATCH v2] staging: writeboost: Add dm-writeboost

If I understand it correctly, the whole idea is indeed very simple:
the provider/consumer and circular-buffer model. The SSD is used as a circular
write buffer; the write flush thread stores incoming writes to this buffer
sequentially as the provider, and the writeback thread writes those data logs
sequentially into the backing device as the consumer.

If writeboost can do that without any random writes, then it can probably
save the SSD/FTL from doing a lot of dirty work and exploit the faster
sequential read/write performance of the SSD. That'll be awesome.
However, I saw that every data log segment in its design has a metadata
header, with fields like dirty_bits, so I guess writeboost has to randomly
rewrite those headers of stored data logs on the SSD; also, splitting every bio
into 4KB will hurt its ability to reach the maximum raw SSD throughput, since
modern NAND flash has pages much bigger than 4KB; so overall I think the
actual benefit writeboost gets from this design will be discounted.

The good thing is that writeboost apparently doesn't use garbage
collection to clean old invalid logs; this avoids the double
garbage-collection effect other caching modules have, where essentially
both the caching module and the SSD internally perform garbage
collection.

And one question: how long will the data log replay take during init
if the SSD is almost full of dirty data logs?

Jianjian

On Fri, Dec 12, 2014 at 7:09 AM, Akira Hayakawa <[email protected]> wrote:
>> However, after looking at the current code, and using it I think it's
>> a long, long way from being ready for production. As we've already
>> discussed there are some very naive design decisions in there, such as
>> copying every bio payload to another memory buffer, splitting all io
>> down to 4k. Think about the cpu overhead and memory consumption!
>> Think about how it will perform when memory is constrained and it
>> can't allocate many of those rambufs! I'm sure more issues will be
>> found if I read further.
> These decisions are made based on measurement. They are not naive.
> I am a man who dislikes performance optimization without measurement.
> As a result, I regard things brought by the simplicity much important
> than what's from other design decisions possible.
>
> About the CPU consumption,
> the average CPU consumption while performing random write fio
> with consumer level SSD is only 3% or so,
> which is 5 times efficient than bcache per iops.
>
> With RAM-backed cache device, it reaches about 1.5GB/sec throughput.
> Even in this case the CPU consumption is only 12%.
> Please see this post,
> http://www.redhat.com/archives/dm-devel/2014-February/msg00000.html
>
> I don't think the CPU consumption is small enough to ignore.
>
> About the memory consumption,
> you seem to misunderstand the fact.
> The rambufs are not dynamically allocated but statically.
> The default amount is 8MB and this is usually not to argue.
>
>> Mike raised the question of why you want this in the kernel so much?
>> You'd find none of the distros would support it; so it doesn't widen
>> your audience much. It's far better for you to maintain it outside of
>> the kernel at this point. Any users will be bold, adventurous people,
>> who will be quite capable of building a kernel module.
> Some people deploy Writeboost in their daily use.
> The sound of "log-structured" seems to easily attract storage guys' attention.
> If this driver is merged into upstream, I think it gains many audience and
> thus feedback.
> When my driver was introduced by Phoronix before, it actually drew attentions.
> They must wait for Writeboost become available in upstream.
> http://www.phoronix.com/scan.php?page=news_item&px=MTQ1Mjg
>
>> I'm sorry to have disappointed you so, but if I let this go upstream
>> it would mean a massive amount of support work for me, not to mention
>> a damaged reputation for dm.
> If you read the code further, you will find how simple the mechanism is.
> Not to mention the code itself is.
>
> - Akira
>
> On 12/12/14 11:24 PM, Joe Thornber wrote:
>> On Fri, Dec 12, 2014 at 09:42:15AM +0900, Akira Hayakawa wrote:
>>> The SSD-caching should be log-structured.
>>
>> No argument there, and this is why I've supported you with
>> dm-writeboost over the last couple of years.
>>
>> However, after looking at the current code, and using it I think it's
>> a long, long way from being ready for production. As we've already
>> discussed there are some very naive design decisions in there, such as
>> copying every bio payload to another memory buffer, splitting all io
>> down to 4k. Think about the cpu overhead and memory consumption!
>> Think about how it will perform when memory is constrained and it
>> can't allocate many of those rambufs! I'm sure more issues will be
>> found if I read further.
>>
>> I'm sorry to have disappointed you so, but if I let this go upstream
>> it would mean a massive amount of support work for me, not to mention
>> a damaged reputation for dm.
>>
>> Mike raised the question of why you want this in the kernel so much?
>> You'd find none of the distros would support it; so it doesn't widen
>> your audience much. It's far better for you to maintain it outside of
>> the kernel at this point. Any users will be bold, adventurous people,
>> who will be quite capable of building a kernel module.
>>
>> - Joe
>>
>
> --
> dm-devel mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/dm-devel

2014-12-13 14:08:07

by Akira Hayakawa

Subject: Re: [dm-devel] [PATCH v2] staging: writeboost: Add dm-writeboost

Hi,

Jianjian, you really hit on a key point of the fundamental design.

> If I understand it correctly, the whole idea indeed is very simple,
> the consumer/provider and circular buffer model. use SSD as a circular
> write buffer, write flush thread stores incoming writes to this buffer
> sequentially as provider, and writeback thread write those data logs
> sequentially into backing device as consumer.
>
> If writeboost can do that without any random writes, then probably it
> can save SSD/FTL of doing a lot of dirty jobs, and utilize the faster
> sequential read/write performance from SSD. That'll be awesome.
> However, I saw every data log segment in its design has meta data
> header, like dirty_bits, so I guess writeboost has to randomly write
> those data headers of stored data logs in SSD; also, splitting all bio
> into 4KB will hurt its ability to get max raw SSD throughput, modern
> NAND Flash has pages much bigger than 4KB; so overall I think the
> actual benefits writeboost gets from this design will be discounted.
You understand *almost* correctly.

Writeboost has two circular buffers, not one: the RAM buffers and the SSD.
The incoming bio is split into 4KB chunks at the virtual make_request
and these are NOT directly remapped to the SSD.
As you mentioned, if I had designed it that way, many metadata updates would happen.
That's really bad, since SSDs are very bad at small updates.

Actually, the 4KB bios are first stored in a RAM buffer, which is 512KB large.
(512-4)/4=127 4KB chunks of bio data are stored in the RAM buffer, and the 4KB
metadata section at the head is filled in after that.

The RAM buffer is then called a "log" and, as you mentioned, is flushed to the SSD
as a single 512KB sequential write. This definitely maximizes throughput and lifetime.

Unfortunately, this is not always the case, because of barrier request handling.
But when the write load is really heavy (e.g. massive dirty page writeback),
Writeboost works as above.
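
To make the layout concrete, one 512KB segment looks roughly like this
(a simplified sketch; the field names below are illustrative and not the
exact identifiers used in the driver):

#include <linux/types.h>

#define SEG_SIZE	(512 * 1024)
#define BLK_SIZE	(4 * 1024)
#define BLKS_PER_SEG	((SEG_SIZE - BLK_SIZE) / BLK_SIZE)	/* = 127 */

/* per-4KB-block metadata kept inside the 4KB header */
struct seg_metablock {
	__le64 sector;		/* backing-device address of this 4KB block */
	__u8 dirty_bits;	/* which 512B sectors inside the block are dirty */
};

/* the 4KB metadata section at the head of a segment */
struct seg_header {
	__le64 id;		/* monotonically increasing log id */
	__le32 checksum;	/* covers the whole 512KB segment */
	__u8 length;		/* number of 4KB data blocks actually used */
	struct seg_metablock mb[BLKS_PER_SEG];
};

/* ...followed by up to 127 x 4KB of data, flushed as one sequential write */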

> The good thing is that it seems writeboost doesn't use garbage
> collection to clean old invalid logs, this will avoid the double
> garage collection effect other caching module has, which essentially
> both caching module and internal SSD will perform garbage collections
> twice.
Yes. And I believe an SSD can largely avoid wear-leveling work if it is used in
such a fairly sequential way. Am I right? Indeed, Writeboost is really SSD friendly.

> And one question, how long will be data logs replay time during init,
> if SSD is almost full of dirty data logs?
Sorry, I don't have data right now, but it's slow, as you may imagine.
I will measure the time later.

The major reason is that it needs to read each full 512KB segment to calculate the
checksum and know whether the log was half written.
(Reading a 500GB SSD that performs 500MB/sec sequential reads takes 1000 seconds.)
I think running the procedure in parallel, to exploit the full internal parallelism
of the SSD, may improve performance, but that only scales the coefficient from 1 down to 1/n.
Writeboost definitely isn't a fit for a machine that needs to reboot frequently (e.g. a desktop).

There is a way to reduce the init time: we can record "what is the latest log written back"
in the superblock. This lets us skip reads that aren't essential.

The corresponding code is the replay_log_on_cache() function. Please read it if you are
interested.
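
Roughly, the resume path does something like this (a simplified pseudo-C
sketch; the helper names are made up, and the real replay_log_on_cache()
differs in detail):

u64 id = find_oldest_log_id(cache);	/* located via small 4KB metadata reads */
for (;;) {
	struct seg_header *seg = read_full_segment(cache, id);	/* 512KB seqread */
	if (!seg || seg->checksum != calc_checksum(seg))
		break;			/* half-written (torn) log: stop replaying */
	replay_into_hash_table(seg);	/* re-register each 4KB block in memory */
	id++;
}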

Thanks,

- Akira

On 12/13/14 3:45 PM, Jianjian Huo wrote:
> If I understand it correctly, the whole idea indeed is very simple,
> the consumer/provider and circular buffer model. use SSD as a circular
> write buffer, write flush thread stores incoming writes to this buffer
> sequentially as provider, and writeback thread write those data logs
> sequentially into backing device as consumer.
>
> If writeboost can do that without any random writes, then probably it
> can save SSD/FTL of doing a lot of dirty jobs, and utilize the faster
> sequential read/write performance from SSD. That'll be awesome.
> However, I saw every data log segment in its design has meta data
> header, like dirty_bits, so I guess writeboost has to randomly write
> those data headers of stored data logs in SSD; also, splitting all bio
> into 4KB will hurt its ability to get max raw SSD throughput, modern
> NAND Flash has pages much bigger than 4KB; so overall I think the
> actual benefits writeboost gets from this design will be discounted.
>
> The good thing is that it seems writeboost doesn't use garbage
> collection to clean old invalid logs, this will avoid the double
> garage collection effect other caching module has, which essentially
> both caching module and internal SSD will perform garbage collections
> twice.
>
> And one question, how long will be data logs replay time during init,
> if SSD is almost full of dirty data logs?
>
> Jianjian
>
> On Fri, Dec 12, 2014 at 7:09 AM, Akira Hayakawa <[email protected]> wrote:
>>> However, after looking at the current code, and using it I think it's
>>> a long, long way from being ready for production. As we've already
>>> discussed there are some very naive design decisions in there, such as
>>> copying every bio payload to another memory buffer, splitting all io
>>> down to 4k. Think about the cpu overhead and memory consumption!
>>> Think about how it will perform when memory is constrained and it
>>> can't allocate many of those rambufs! I'm sure more issues will be
>>> found if I read further.
>> These decisions are made based on measurement. They are not naive.
>> I am a man who dislikes performance optimization without measurement.
>> As a result, I regard things brought by the simplicity much important
>> than what's from other design decisions possible.
>>
>> About the CPU consumption,
>> the average CPU consumption while performing random write fio
>> with consumer level SSD is only 3% or so,
>> which is 5 times efficient than bcache per iops.
>>
>> With RAM-backed cache device, it reaches about 1.5GB/sec throughput.
>> Even in this case the CPU consumption is only 12%.
>> Please see this post,
>> http://www.redhat.com/archives/dm-devel/2014-February/msg00000.html
>>
>> I don't think the CPU consumption is small enough to ignore.
>>
>> About the memory consumption,
>> you seem to misunderstand the fact.
>> The rambufs are not dynamically allocated but statically.
>> The default amount is 8MB and this is usually not to argue.
>>
>>> Mike raised the question of why you want this in the kernel so much?
>>> You'd find none of the distros would support it; so it doesn't widen
>>> your audience much. It's far better for you to maintain it outside of
>>> the kernel at this point. Any users will be bold, adventurous people,
>>> who will be quite capable of building a kernel module.
>> Some people deploy Writeboost in their daily use.
>> The sound of "log-structured" seems to easily attract storage guys' attention.
>> If this driver is merged into upstream, I think it gains many audience and
>> thus feedback.
>> When my driver was introduced by Phoronix before, it actually drew attentions.
>> They must wait for Writeboost become available in upstream.
>> http://www.phoronix.com/scan.php?page=news_item&px=MTQ1Mjg
>>
>>> I'm sorry to have disappointed you so, but if I let this go upstream
>>> it would mean a massive amount of support work for me, not to mention
>>> a damaged reputation for dm.
>> If you read the code further, you will find how simple the mechanism is.
>> Not to mention the code itself is.
>>
>> - Akira
>>
>> On 12/12/14 11:24 PM, Joe Thornber wrote:
>>> On Fri, Dec 12, 2014 at 09:42:15AM +0900, Akira Hayakawa wrote:
>>>> The SSD-caching should be log-structured.
>>>
>>> No argument there, and this is why I've supported you with
>>> dm-writeboost over the last couple of years.
>>>
>>> However, after looking at the current code, and using it I think it's
>>> a long, long way from being ready for production. As we've already
>>> discussed there are some very naive design decisions in there, such as
>>> copying every bio payload to another memory buffer, splitting all io
>>> down to 4k. Think about the cpu overhead and memory consumption!
>>> Think about how it will perform when memory is constrained and it
>>> can't allocate many of those rambufs! I'm sure more issues will be
>>> found if I read further.
>>>
>>> I'm sorry to have disappointed you so, but if I let this go upstream
>>> it would mean a massive amount of support work for me, not to mention
>>> a damaged reputation for dm.
>>>
>>> Mike raised the question of why you want this in the kernel so much?
>>> You'd find none of the distros would support it; so it doesn't widen
>>> your audience much. It's far better for you to maintain it outside of
>>> the kernel at this point. Any users will be bold, adventurous people,
>>> who will be quite capable of building a kernel module.
>>>
>>> - Joe
>>>
>>
>> --
>> dm-devel mailing list
>> [email protected]
>> https://www.redhat.com/mailman/listinfo/dm-devel
>
> --
> dm-devel mailing list
> [email protected]
> https://www.redhat.com/mailman/listinfo/dm-devel
>

2014-12-14 02:12:41

by Akira Hayakawa

Subject: Re: [dm-devel] [PATCH v2] staging: writeboost: Add dm-writeboost

Hi,

> The major reason is, it needs to read full 512KB segment to calculate checksum to
> know if the log isn't half written.
> (Read 500GB SSD that performs 500MB/sec seqread spends 1000secs)
I've just measured how long cache resuming takes.

I used a 2GB SSD as the cache device.

512KB seqread over the cache device:
8.252sec (242MB/sec)

Resume when all caches are dirty:
10.339sec

Typically, if you use a 128GB SSD it will take 5-10 minutes (128GB at ~242MB/sec is about 9 minutes).

As I predicted, the resume time is close to the sequential read time.
In other words, it's fully I/O-bound.
(If you read the code you will notice that it first searches for the
oldest log as the starting point. These are 4KB metadata reads, but they still
take some time. The remaining ~2 seconds are thought to be spent on this.)

- Akira

On 12/13/14 11:07 PM, Akira Hayakawa wrote:
> Hi,
>
> Jianjian, You really get a point at the fundamental design.
>
>> If I understand it correctly, the whole idea indeed is very simple,
>> the consumer/provider and circular buffer model. use SSD as a circular
>> write buffer, write flush thread stores incoming writes to this buffer
>> sequentially as provider, and writeback thread write those data logs
>> sequentially into backing device as consumer.
>>
>> If writeboost can do that without any random writes, then probably it
>> can save SSD/FTL of doing a lot of dirty jobs, and utilize the faster
>> sequential read/write performance from SSD. That'll be awesome.
>> However, I saw every data log segment in its design has meta data
>> header, like dirty_bits, so I guess writeboost has to randomly write
>> those data headers of stored data logs in SSD; also, splitting all bio
>> into 4KB will hurt its ability to get max raw SSD throughput, modern
>> NAND Flash has pages much bigger than 4KB; so overall I think the
>> actual benefits writeboost gets from this design will be discounted.
> You understand *almost* correctly.
>
> Writeboost has two circular buffers, not one; RAM buffers and SSD.
> The incoming bio is split into 4KB chunks at the virtual make_request
> and are NOT directly remapped to the SSD.
> As you mentioned, if I designed so, many update on the metadata happens.
> That's really bad since SSD is very bad at small update.
>
> Actually, the 4KB bio is first stored in RAM buffer, which is 512KB large.
> There are (512-4)/4=127 4KB bio data stored in the RAM buffer and 4KB metadata
> section at the head is made after that.
>
> The RAM buffer is now called "log" and as you mentioned, flushed to the SSD
> as 512KB sequential write. This definitely maximizes throughput and lifetime.
>
> Unfortunately, this is not always the case because of barrier request handlings.
> But, when the writes is really heavy (e.g. massive dirty page writeback),
> Writeboost works as above.
>
>> The good thing is that it seems writeboost doesn't use garbage
>> collection to clean old invalid logs, this will avoid the double
>> garage collection effect other caching module has, which essentially
>> both caching module and internal SSD will perform garbage collections
>> twice.
> Yes. And I believe SSDs can remove wear-leveling if I used it as fairly sequential.
> Am I right? Indeed, Writeboost is really SSD frinedly.
>
>> And one question, how long will be data logs replay time during init,
>> if SSD is almost full of dirty data logs?
> Sorry, I don't have a data now but it's slow as you may imagine.
> I will measure the time on later.
>
> The major reason is, it needs to read full 512KB segment to calculate checksum to
> know if the log isn't half written.
> (Read 500GB SSD that performs 500MB/sec seqread spends 1000secs)
> I think making the procedure done in parallel to exploit the full internal parallelism
> inside SSD may improve performance but it's just the matter of coefficient down from 1 to 1/n.
> Definitely, Writeboost isn't fit for a machine that needs reboot frequently (e.g. desktop).
>
> There is a way to reduce the init time. We can dump "what is the latest log written back"
> on the superblock. This can skip readings that aren't essential.
>
> The corresponding code is replay_log_on_cache() function. Please read if you are
> interested.
>
> Thanks,
>
> - Akira
>
> On 12/13/14 3:45 PM, Jianjian Huo wrote:
>> If I understand it correctly, the whole idea indeed is very simple,
>> the consumer/provider and circular buffer model. use SSD as a circular
>> write buffer, write flush thread stores incoming writes to this buffer
>> sequentially as provider, and writeback thread write those data logs
>> sequentially into backing device as consumer.
>>
>> If writeboost can do that without any random writes, then probably it
>> can save SSD/FTL of doing a lot of dirty jobs, and utilize the faster
>> sequential read/write performance from SSD. That'll be awesome.
>> However, I saw every data log segment in its design has meta data
>> header, like dirty_bits, so I guess writeboost has to randomly write
>> those data headers of stored data logs in SSD; also, splitting all bio
>> into 4KB will hurt its ability to get max raw SSD throughput, modern
>> NAND Flash has pages much bigger than 4KB; so overall I think the
>> actual benefits writeboost gets from this design will be discounted.
>>
>> The good thing is that it seems writeboost doesn't use garbage
>> collection to clean old invalid logs, this will avoid the double
>> garage collection effect other caching module has, which essentially
>> both caching module and internal SSD will perform garbage collections
>> twice.
>>
>> And one question, how long will be data logs replay time during init,
>> if SSD is almost full of dirty data logs?
>>
>> Jianjian
>>
>> On Fri, Dec 12, 2014 at 7:09 AM, Akira Hayakawa <[email protected]> wrote:
>>>> However, after looking at the current code, and using it I think it's
>>>> a long, long way from being ready for production. As we've already
>>>> discussed there are some very naive design decisions in there, such as
>>>> copying every bio payload to another memory buffer, splitting all io
>>>> down to 4k. Think about the cpu overhead and memory consumption!
>>>> Think about how it will perform when memory is constrained and it
>>>> can't allocate many of those rambufs! I'm sure more issues will be
>>>> found if I read further.
>>> These decisions are made based on measurement. They are not naive.
>>> I am a man who dislikes performance optimization without measurement.
>>> As a result, I regard things brought by the simplicity much important
>>> than what's from other design decisions possible.
>>>
>>> About the CPU consumption,
>>> the average CPU consumption while performing random write fio
>>> with consumer level SSD is only 3% or so,
>>> which is 5 times efficient than bcache per iops.
>>>
>>> With RAM-backed cache device, it reaches about 1.5GB/sec throughput.
>>> Even in this case the CPU consumption is only 12%.
>>> Please see this post,
>>> http://www.redhat.com/archives/dm-devel/2014-February/msg00000.html
>>>
>>> I don't think the CPU consumption is small enough to ignore.
>>>
>>> About the memory consumption,
>>> you seem to misunderstand the fact.
>>> The rambufs are not dynamically allocated but statically.
>>> The default amount is 8MB and this is usually not to argue.
>>>
>>>> Mike raised the question of why you want this in the kernel so much?
>>>> You'd find none of the distros would support it; so it doesn't widen
>>>> your audience much. It's far better for you to maintain it outside of
>>>> the kernel at this point. Any users will be bold, adventurous people,
>>>> who will be quite capable of building a kernel module.
>>> Some people deploy Writeboost in their daily use.
>>> The sound of "log-structured" seems to easily attract storage guys' attention.
>>> If this driver is merged into upstream, I think it gains many audience and
>>> thus feedback.
>>> When my driver was introduced by Phoronix before, it actually drew attentions.
>>> They must wait for Writeboost become available in upstream.
>>> http://www.phoronix.com/scan.php?page=news_item&px=MTQ1Mjg
>>>
>>>> I'm sorry to have disappointed you so, but if I let this go upstream
>>>> it would mean a massive amount of support work for me, not to mention
>>>> a damaged reputation for dm.
>>> If you read the code further, you will find how simple the mechanism is.
>>> Not to mention the code itself is.
>>>
>>> - Akira
>>>
>>> On 12/12/14 11:24 PM, Joe Thornber wrote:
>>>> On Fri, Dec 12, 2014 at 09:42:15AM +0900, Akira Hayakawa wrote:
>>>>> The SSD-caching should be log-structured.
>>>>
>>>> No argument there, and this is why I've supported you with
>>>> dm-writeboost over the last couple of years.
>>>>
>>>> However, after looking at the current code, and using it I think it's
>>>> a long, long way from being ready for production. As we've already
>>>> discussed there are some very naive design decisions in there, such as
>>>> copying every bio payload to another memory buffer, splitting all io
>>>> down to 4k. Think about the cpu overhead and memory consumption!
>>>> Think about how it will perform when memory is constrained and it
>>>> can't allocate many of those rambufs! I'm sure more issues will be
>>>> found if I read further.
>>>>
>>>> I'm sorry to have disappointed you so, but if I let this go upstream
>>>> it would mean a massive amount of support work for me, not to mention
>>>> a damaged reputation for dm.
>>>>
>>>> Mike raised the question of why you want this in the kernel so much?
>>>> You'd find none of the distros would support it; so it doesn't widen
>>>> your audience much. It's far better for you to maintain it outside of
>>>> the kernel at this point. Any users will be bold, adventurous people,
>>>> who will be quite capable of building a kernel module.
>>>>
>>>> - Joe
>>>>
>>>
>>> --
>>> dm-devel mailing list
>>> [email protected]
>>> https://www.redhat.com/mailman/listinfo/dm-devel
>>
>> --
>> dm-devel mailing list
>> [email protected]
>> https://www.redhat.com/mailman/listinfo/dm-devel
>>
>

2014-12-14 02:46:32

by Jianjian Huo

Subject: Re: [dm-devel] [PATCH v2] staging: writeboost: Add dm-writeboost

On Sat, Dec 13, 2014 at 6:07 AM, Akira Hayakawa <[email protected]> wrote:
> Hi,
>
> Jianjian, You really get a point at the fundamental design.
>
>> If I understand it correctly, the whole idea indeed is very simple,
>> the consumer/provider and circular buffer model. use SSD as a circular
>> write buffer, write flush thread stores incoming writes to this buffer
>> sequentially as provider, and writeback thread write those data logs
>> sequentially into backing device as consumer.
>>
>> If writeboost can do that without any random writes, then probably it
>> can save SSD/FTL of doing a lot of dirty jobs, and utilize the faster
>> sequential read/write performance from SSD. That'll be awesome.
>> However, I saw every data log segment in its design has meta data
>> header, like dirty_bits, so I guess writeboost has to randomly write
>> those data headers of stored data logs in SSD; also, splitting all bio
>> into 4KB will hurt its ability to get max raw SSD throughput, modern
>> NAND Flash has pages much bigger than 4KB; so overall I think the
>> actual benefits writeboost gets from this design will be discounted.
> You understand *almost* correctly.
>
> Writeboost has two circular buffers, not one; RAM buffers and SSD.
> The incoming bio is split into 4KB chunks at the virtual make_request
> and are NOT directly remapped to the SSD.
> As you mentioned, if I designed so, many update on the metadata happens.
> That's really bad since SSD is very bad at small update.
>
> Actually, the 4KB bio is first stored in RAM buffer, which is 512KB large.
> There are (512-4)/4=127 4KB bio data stored in the RAM buffer and 4KB metadata
> section at the head is made after that.
>
> The RAM buffer is now called "log" and as you mentioned, flushed to the SSD
> as 512KB sequential write. This definitely maximizes throughput and lifetime.
>
> Unfortunately, this is not always the case because of barrier request handlings.
> But, when the writes is really heavy (e.g. massive dirty page writeback),
> Writeboost works as above.
>
How about invalidating previous writes to the same sector address? If the
first write is stored in one 512KB log on the SSD and the user later writes
the same address, will writeboost invalidate the old write by updating the
metadata header in place in that 512KB log? And what about other metadata such as
the superblock: will writeboost update it in place? If writeboost
has such frequent random in-place updates, then the benefits beyond
utilizing faster sequential writes will probably be much discounted.

>> The good thing is that it seems writeboost doesn't use garbage
>> collection to clean old invalid logs, this will avoid the double
>> garage collection effect other caching module has, which essentially
>> both caching module and internal SSD will perform garbage collections
>> twice.
> Yes. And I believe SSDs can remove wear-leveling if I used it as fairly sequential.
> Am I right? Indeed, Writeboost is really SSD frinedly.
>
Even if writeboost writes everything sequentially, the SSD probably won't
be able to recognize that and will still do its own wear-leveling regardless.
Its job will be easier, though, since there is no need to move cold data, etc.

Generally, SSDs vary a lot; IMHO, a given optimization technique
may work for one model but not for others, since their
internal NAND flash management algorithms can be very different.

>> And one question, how long will be data logs replay time during init,
>> if SSD is almost full of dirty data logs?
> Sorry, I don't have a data now but it's slow as you may imagine.
> I will measure the time on later.
>
> The major reason is, it needs to read full 512KB segment to calculate checksum to
> know if the log isn't half written.
> (Read 500GB SSD that performs 500MB/sec seqread spends 1000secs)
> I think making the procedure done in parallel to exploit the full internal parallelism
> inside SSD may improve performance but it's just the matter of coefficient down from 1 to 1/n.
> Definitely, Writeboost isn't fit for a machine that needs reboot frequently (e.g. desktop).
>
> There is a way to reduce the init time. We can dump "what is the latest log written back"
> on the superblock. This can skip readings that aren't essential.
>
> The corresponding code is replay_log_on_cache() function. Please read if you are
> interested.
>
> Thanks,
>
> - Akira
>
> On 12/13/14 3:45 PM, Jianjian Huo wrote:
>> If I understand it correctly, the whole idea indeed is very simple,
>> the consumer/provider and circular buffer model. use SSD as a circular
>> write buffer, write flush thread stores incoming writes to this buffer
>> sequentially as provider, and writeback thread write those data logs
>> sequentially into backing device as consumer.
>>
>> If writeboost can do that without any random writes, then probably it
>> can save SSD/FTL of doing a lot of dirty jobs, and utilize the faster
>> sequential read/write performance from SSD. That'll be awesome.
>> However, I saw every data log segment in its design has meta data
>> header, like dirty_bits, so I guess writeboost has to randomly write
>> those data headers of stored data logs in SSD; also, splitting all bio
>> into 4KB will hurt its ability to get max raw SSD throughput, modern
>> NAND Flash has pages much bigger than 4KB; so overall I think the
>> actual benefits writeboost gets from this design will be discounted.
>>
>> The good thing is that it seems writeboost doesn't use garbage
>> collection to clean old invalid logs, this will avoid the double
>> garage collection effect other caching module has, which essentially
>> both caching module and internal SSD will perform garbage collections
>> twice.
>>
>> And one question, how long will be data logs replay time during init,
>> if SSD is almost full of dirty data logs?
>>
>> Jianjian
>>
>> On Fri, Dec 12, 2014 at 7:09 AM, Akira Hayakawa <[email protected]> wrote:
>>>> However, after looking at the current code, and using it I think it's
>>>> a long, long way from being ready for production. As we've already
>>>> discussed there are some very naive design decisions in there, such as
>>>> copying every bio payload to another memory buffer, splitting all io
>>>> down to 4k. Think about the cpu overhead and memory consumption!
>>>> Think about how it will perform when memory is constrained and it
>>>> can't allocate many of those rambufs! I'm sure more issues will be
>>>> found if I read further.
>>> These decisions are made based on measurement. They are not naive.
>>> I am a man who dislikes performance optimization without measurement.
>>> As a result, I regard things brought by the simplicity much important
>>> than what's from other design decisions possible.
>>>
>>> About the CPU consumption,
>>> the average CPU consumption while performing random write fio
>>> with consumer level SSD is only 3% or so,
>>> which is 5 times efficient than bcache per iops.
>>>
>>> With RAM-backed cache device, it reaches about 1.5GB/sec throughput.
>>> Even in this case the CPU consumption is only 12%.
>>> Please see this post,
>>> http://www.redhat.com/archives/dm-devel/2014-February/msg00000.html
>>>
>>> I don't think the CPU consumption is small enough to ignore.
>>>
>>> About the memory consumption,
>>> you seem to misunderstand the fact.
>>> The rambufs are not dynamically allocated but statically.
>>> The default amount is 8MB and this is usually not to argue.
>>>
>>>> Mike raised the question of why you want this in the kernel so much?
>>>> You'd find none of the distros would support it; so it doesn't widen
>>>> your audience much. It's far better for you to maintain it outside of
>>>> the kernel at this point. Any users will be bold, adventurous people,
>>>> who will be quite capable of building a kernel module.
>>> Some people deploy Writeboost in their daily use.
>>> The sound of "log-structured" seems to easily attract storage guys' attention.
>>> If this driver is merged into upstream, I think it gains many audience and
>>> thus feedback.
>>> When my driver was introduced by Phoronix before, it actually drew attentions.
>>> They must wait for Writeboost become available in upstream.
>>> http://www.phoronix.com/scan.php?page=news_item&px=MTQ1Mjg
>>>
>>>> I'm sorry to have disappointed you so, but if I let this go upstream
>>>> it would mean a massive amount of support work for me, not to mention
>>>> a damaged reputation for dm.
>>> If you read the code further, you will find how simple the mechanism is.
>>> Not to mention the code itself is.
>>>
>>> - Akira
>>>
>>> On 12/12/14 11:24 PM, Joe Thornber wrote:
>>>> On Fri, Dec 12, 2014 at 09:42:15AM +0900, Akira Hayakawa wrote:
>>>>> The SSD-caching should be log-structured.
>>>>
>>>> No argument there, and this is why I've supported you with
>>>> dm-writeboost over the last couple of years.
>>>>
>>>> However, after looking at the current code, and using it I think it's
>>>> a long, long way from being ready for production. As we've already
>>>> discussed there are some very naive design decisions in there, such as
>>>> copying every bio payload to another memory buffer, splitting all io
>>>> down to 4k. Think about the cpu overhead and memory consumption!
>>>> Think about how it will perform when memory is constrained and it
>>>> can't allocate many of those rambufs! I'm sure more issues will be
>>>> found if I read further.
>>>>
>>>> I'm sorry to have disappointed you so, but if I let this go upstream
>>>> it would mean a massive amount of support work for me, not to mention
>>>> a damaged reputation for dm.
>>>>
>>>> Mike raised the question of why you want this in the kernel so much?
>>>> You'd find none of the distros would support it; so it doesn't widen
>>>> your audience much. It's far better for you to maintain it outside of
>>>> the kernel at this point. Any users will be bold, adventurous people,
>>>> who will be quite capable of building a kernel module.
>>>>
>>>> - Joe
>>>>
>>>
>>> --
>>> dm-devel mailing list
>>> [email protected]
>>> https://www.redhat.com/mailman/listinfo/dm-devel
>>
>> --
>> dm-devel mailing list
>> [email protected]
>> https://www.redhat.com/mailman/listinfo/dm-devel
>>
>

2014-12-14 03:01:00

by Akira Hayakawa

Subject: Re: [PATCH v2] staging: writeboost: Add dm-writeboost

Hi,

I've just measured how much the splitting affects performance.

I think sequential reads make the discussion concrete,
so below are the results of reading 6.4GB (64MB * 100) sequentially.

HDD:
64MB read
real 2m1.191s
user 0m0.000s
sys 0m0.470s

Writeboost (HDD+SSD):
64MB read
real 2m13.532s
user 0m0.000s
sys 0m28.740s

The splitting actually has some effect (2m1s -> 2m13s is a 10% loss),
but it is not too big if we consider that the typical workload is NOT sequential reads
(if it were, the user shouldn't be using SSD caching in the first place).

Splitting bios into 4KB chunks makes the cache lookup and locking simple,
and the fact is that this contributes to the performance of both writes and reads;
don't miss that. Without it, writes in particular wouldn't be so fast in Writeboost,
and it would lose much of its appeal.

Since simple and fast is the ideal for any software, I am really unwilling
to change this fundamental design decision: splitting.

However, an idea of selective splitting could be proposed as a future enhancement.
Adding a layer so that a target can choose whether it needs splitting might be
interesting. I think Writeboost could bypass big writes/reads at the cost of
a duplicated cache lookup. Could dm-cache also benefit from this extension?

Conceptually, it's like this:
before: bio -> ~map:bio->bio
after: bio -> ~should_split:bio->bool -> ~map:bio->bio
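
A purely hypothetical sketch of such a hook (this is NOT an existing
device-mapper interface, only an illustration of the idea):

#include <linux/bio.h>
#include <linux/device-mapper.h>

/* hypothetical extension: ask the target before splitting a bio into 4KB chunks */
struct split_aware_target {
	bool (*should_split)(struct dm_target *ti, struct bio *bio);
	int (*map)(struct dm_target *ti, struct bio *bio);
};

/* e.g. Writeboost could keep splitting writes but pass big reads through whole */
static bool wb_should_split(struct dm_target *ti, struct bio *bio)
{
	return bio_data_dir(bio) == WRITE;
}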

- Akira


On 12/13/14 12:09 AM, Akira Hayakawa wrote:
>> However, after looking at the current code, and using it I think it's
>> a long, long way from being ready for production. As we've already
>> discussed there are some very naive design decisions in there, such as
>> copying every bio payload to another memory buffer, splitting all io
>> down to 4k. Think about the cpu overhead and memory consumption!
>> Think about how it will perform when memory is constrained and it
>> can't allocate many of those rambufs! I'm sure more issues will be
>> found if I read further.
> These decisions are made based on measurement. They are not naive.
> I am a man who dislikes performance optimization without measurement.
> As a result, I regard things brought by the simplicity much important
> than what's from other design decisions possible.
>
> About the CPU consumption,
> the average CPU consumption while performing random write fio
> with consumer level SSD is only 3% or so,
> which is 5 times efficient than bcache per iops.
>
> With RAM-backed cache device, it reaches about 1.5GB/sec throughput.
> Even in this case the CPU consumption is only 12%.
> Please see this post,
> http://www.redhat.com/archives/dm-devel/2014-February/msg00000.html
>
> I don't think the CPU consumption is small enough to ignore.
>
> About the memory consumption,
> you seem to misunderstand the fact.
> The rambufs are not dynamically allocated but statically.
> The default amount is 8MB and this is usually not to argue.
>
>> Mike raised the question of why you want this in the kernel so much?
>> You'd find none of the distros would support it; so it doesn't widen
>> your audience much. It's far better for you to maintain it outside of
>> the kernel at this point. Any users will be bold, adventurous people,
>> who will be quite capable of building a kernel module.
> Some people deploy Writeboost in their daily use.
> The sound of "log-structured" seems to easily attract storage guys' attention.
> If this driver is merged into upstream, I think it gains many audience and
> thus feedback.
> When my driver was introduced by Phoronix before, it actually drew attentions.
> They must wait for Writeboost become available in upstream.
> http://www.phoronix.com/scan.php?page=news_item&px=MTQ1Mjg
>
>> I'm sorry to have disappointed you so, but if I let this go upstream
>> it would mean a massive amount of support work for me, not to mention
>> a damaged reputation for dm.
> If you read the code further, you will find how simple the mechanism is.
> Not to mention the code itself is.
>
> - Akira
>
> On 12/12/14 11:24 PM, Joe Thornber wrote:
>> On Fri, Dec 12, 2014 at 09:42:15AM +0900, Akira Hayakawa wrote:
>>> The SSD-caching should be log-structured.
>>
>> No argument there, and this is why I've supported you with
>> dm-writeboost over the last couple of years.
>>
>> However, after looking at the current code, and using it I think it's
>> a long, long way from being ready for production. As we've already
>> discussed there are some very naive design decisions in there, such as
>> copying every bio payload to another memory buffer, splitting all io
>> down to 4k. Think about the cpu overhead and memory consumption!
>> Think about how it will perform when memory is constrained and it
>> can't allocate many of those rambufs! I'm sure more issues will be
>> found if I read further.
>>
>> I'm sorry to have disappointed you so, but if I let this go upstream
>> it would mean a massive amount of support work for me, not to mention
>> a damaged reputation for dm.
>>
>> Mike raised the question of why you want this in the kernel so much?
>> You'd find none of the distros would support it; so it doesn't widen
>> your audience much. It's far better for you to maintain it outside of
>> the kernel at this point. Any users will be bold, adventurous people,
>> who will be quite capable of building a kernel module.
>>
>> - Joe
>>
>

2014-12-14 03:22:34

by Akira Hayakawa

Subject: Re: [dm-devel] [PATCH v2] staging: writeboost: Add dm-writeboost

Jianjian,

> How about invalidating previous writes on same sector address? if
> first write is stored in one 512KB log in SSD, later when user write
> the same address, will writeboost invalid old write by updating meta
> data header in place in that 512KB log? and other meta data like
> superblock, will writeboost will update them in place? if writeboost
> has those frequent random in-place updates, probably those benefits
> except utilizing the faster sequential writes will be much discounted.
At runtime, Writeboost keeps equivalent metadata in memory as a hash table.
This not only makes cache lookup fast but also makes invalidation easy.
Without it, Writeboost's speed would be badly discounted.
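
For illustration only (the identifiers below are made up and the driver's
actual code differs), a 4KB-granularity in-memory index can stay this simple:

#include <linux/hashtable.h>
#include <linux/types.h>

struct metablock {
	sector_t sector;	/* key: 4KB-aligned address on the backing device */
	u8 dirty_bits;		/* which 512B sectors of the cached block are dirty */
	struct hlist_node hnode;
};

static DEFINE_HASHTABLE(cache_index, 16);	/* 2^16 buckets */

static struct metablock *ht_lookup(sector_t key)
{
	struct metablock *mb;

	hash_for_each_possible(cache_index, mb, hnode, key)
		if (mb->sector == key)
			return mb;
	return NULL;
}

static void ht_invalidate(sector_t key)
{
	struct metablock *mb = ht_lookup(key);

	if (mb)
		mb->dirty_bits = 0;	/* the older cached copy is no longer valid */
}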

Logs on the cache device are never updated partially (that would be too dangerous
if we consider partial write failures). Durability is another benefit of
log-structured caching.

As for writing to the superblock, that is indeed random. But we don't need to
update the superblock on every write, only every second or so. To begin with, it's
not logically essential, so you can turn it off (that's the default).
The option is for users who care a lot about the resume time.
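
Concretely, the idea is just to record the id of the last written-back log once
in a while (a sketch; the struct and helper names here are hypothetical):

#include <linux/types.h>

/* rewritten by a worker roughly once per second, never on the write path */
struct superblock_record {
	__le64 last_writeback_segment_id;
};

static void update_superblock_record(struct wb_device *wb)
{
	struct superblock_record rec = {
		.last_writeback_segment_id = cpu_to_le64(wb->last_writeback_id),
	};

	write_superblock_record(wb, &rec);	/* hypothetical I/O helper */
}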

> Even if writeboost write everything sequentially, probably SSD won't
> be able to recognize that and do its own wear-leveling whatsoever. It
> will be easier, since there is no need to move cold data, etc.
>
> Generally, SSDs vary a lot, IMHO, one certain optimization technique
> may work for certain model, but may not work for others, since they
> internal NAND Flash management algorithms can be very different.
Yes, thanks.
I believe making the SSD's internal work easy can not only stabilize the
performance but also reduce the possibility of being bitten by bugs inside
the SSD itself, such as a petit freeze.

- Akira

On 12/14/14 11:46 AM, Jianjian Huo wrote:
> On Sat, Dec 13, 2014 at 6:07 AM, Akira Hayakawa <[email protected]> wrote:
>> Hi,
>>
>> Jianjian, You really get a point at the fundamental design.
>>
>>> If I understand it correctly, the whole idea indeed is very simple,
>>> the consumer/provider and circular buffer model. use SSD as a circular
>>> write buffer, write flush thread stores incoming writes to this buffer
>>> sequentially as provider, and writeback thread write those data logs
>>> sequentially into backing device as consumer.
>>>
>>> If writeboost can do that without any random writes, then probably it
>>> can save SSD/FTL of doing a lot of dirty jobs, and utilize the faster
>>> sequential read/write performance from SSD. That'll be awesome.
>>> However, I saw every data log segment in its design has meta data
>>> header, like dirty_bits, so I guess writeboost has to randomly write
>>> those data headers of stored data logs in SSD; also, splitting all bio
>>> into 4KB will hurt its ability to get max raw SSD throughput, modern
>>> NAND Flash has pages much bigger than 4KB; so overall I think the
>>> actual benefits writeboost gets from this design will be discounted.
>> You understand *almost* correctly.
>>
>> Writeboost has two circular buffers, not one; RAM buffers and SSD.
>> The incoming bio is split into 4KB chunks at the virtual make_request
>> and are NOT directly remapped to the SSD.
>> As you mentioned, if I designed so, many update on the metadata happens.
>> That's really bad since SSD is very bad at small update.
>>
>> Actually, the 4KB bio is first stored in RAM buffer, which is 512KB large.
>> There are (512-4)/4=127 4KB bio data stored in the RAM buffer and 4KB metadata
>> section at the head is made after that.
>>
>> The RAM buffer is now called "log" and as you mentioned, flushed to the SSD
>> as 512KB sequential write. This definitely maximizes throughput and lifetime.
>>
>> Unfortunately, this is not always the case because of barrier request handlings.
>> But, when the writes is really heavy (e.g. massive dirty page writeback),
>> Writeboost works as above.
>>
> How about invalidating previous writes on same sector address? if
> first write is stored in one 512KB log in SSD, later when user write
> the same address, will writeboost invalid old write by updating meta
> data header in place in that 512KB log? and other meta data like
> superblock, will writeboost will update them in place? if writeboost
> has those frequent random in-place updates, probably those benefits
> except utilizing the faster sequential writes will be much discounted.
>
>>> The good thing is that writeboost doesn't seem to use garbage
>>> collection to clean up old invalid logs. This avoids the double
>>> garbage-collection effect other caching modules have, where essentially
>>> both the caching module and the SSD internally perform garbage
>>> collection.
>> Yes. And I believe the SSD can skip much of its wear-leveling work if I
>> use it in a fairly sequential way. Am I right? Indeed, Writeboost is
>> really SSD friendly.
>>
> Even if writeboost writes everything sequentially, the SSD probably
> won't be able to recognize that and will still do its own wear-leveling
> regardless. It will be easier, though, since there is no need to move
> cold data, etc.
>
> Generally, SSDs vary a lot. IMHO, a certain optimization technique may
> work for one model but not for others, since their internal NAND flash
> management algorithms can be very different.
>
>>> And one question: how long will the data log replay take during init
>>> if the SSD is almost full of dirty data logs?
>> Sorry, I don't have data right now, but it's as slow as you may imagine.
>> I will measure the time later.
>>
>> The major reason is that it needs to read each full 512KB segment and
>> calculate its checksum to know whether the log was half written.
>> (Reading a 500GB SSD that does 500MB/sec sequential reads takes 1000 seconds.)
>> Doing the procedure in parallel to exploit the internal parallelism of
>> the SSD may improve performance, but that only scales the coefficient
>> down from 1 to 1/n.
>> Writeboost definitely isn't a good fit for a machine that needs to reboot
>> frequently (e.g. a desktop).
>>
>> There is a way to reduce the init time: we can record "which log was most
>> recently written back" in the superblock. That lets replay skip reads
>> that aren't essential.
>>
>> The corresponding code is the replay_log_on_cache() function. Please read
>> it if you are interested.
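For illustration, here is the replay idea above as a short C sketch, again
with simplified, made-up names -- this is not the actual
replay_log_on_cache() code, and it reuses the ram_buffer layout from the
earlier sketch:

#include <stdbool.h>
#include <stdint.h>

struct ram_buffer;  /* the 512KB segment layout from the earlier sketch */

/* Stand-ins for the real I/O, checksum and index-rebuilding code. */
extern bool     read_segment(uint64_t slot, struct ram_buffer *buf);
extern uint64_t segment_id(const struct ram_buffer *buf);
extern uint32_t stored_checksum(const struct ram_buffer *buf);
extern uint32_t checksum_segment(const struct ram_buffer *buf);
extern void     apply_segment(const struct ram_buffer *buf);

void replay_logs(uint64_t nr_segments, uint64_t last_written_back_id,
                 struct ram_buffer *buf)
{
	/*
	 * With the "latest log written back" recorded in the superblock,
	 * replay can start right after it instead of reading every segment.
	 */
	uint64_t id = last_written_back_id + 1;

	for (;;) {
		uint64_t slot = id % nr_segments;   /* circular layout */

		/* Each candidate segment is read in full (512KB)... */
		if (!read_segment(slot, buf))
			break;

		/*
		 * ...and accepted only if it is the expected log and its
		 * checksum matches; a mismatch means a half-written log,
		 * so replay stops there.
		 */
		if (segment_id(buf) != id ||
		    checksum_segment(buf) != stored_checksum(buf))
			break;

		apply_segment(buf);   /* rebuild the in-core state from it */
		id++;
	}
}

Reading segments one by one like this is what makes the init time roughly
"SSD size / sequential read throughput"; doing the reads in parallel only
divides that coefficient, as noted above.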
>>
>> Thanks,
>>
>> - Akira
>>
>> On 12/13/14 3:45 PM, Jianjian Huo wrote:
>>> If I understand it correctly, the whole idea is indeed very simple:
>>> the provider/consumer, circular-buffer model. The SSD is used as a
>>> circular write buffer; the flush thread stores incoming writes to this
>>> buffer sequentially as the provider, and the writeback thread writes
>>> those data logs sequentially to the backing device as the consumer.
>>>
>>> If writeboost can do that without any random writes, then it can
>>> probably spare the SSD/FTL a lot of dirty work, and utilize the faster
>>> sequential read/write performance of the SSD. That'll be awesome.
>>> However, I saw that every data log segment in its design has a metadata
>>> header, with fields like dirty_bits, so I guess writeboost has to
>>> randomly update those headers of stored data logs on the SSD; also,
>>> splitting every bio into 4KB will hurt its ability to reach the raw SSD
>>> throughput, since modern NAND flash has pages much bigger than 4KB; so
>>> overall I think the actual benefits writeboost gets from this design
>>> will be discounted.
>>>
>>> The good thing is that writeboost doesn't seem to use garbage
>>> collection to clean up old invalid logs. This avoids the double
>>> garbage-collection effect other caching modules have, where essentially
>>> both the caching module and the SSD internally perform garbage
>>> collection.
>>>
>>> And one question: how long will the data log replay take during init
>>> if the SSD is almost full of dirty data logs?
>>>
>>> Jianjian
>>>
>>> On Fri, Dec 12, 2014 at 7:09 AM, Akira Hayakawa <[email protected]> wrote:
>>>>> However, after looking at the current code, and using it I think it's
>>>>> a long, long way from being ready for production. As we've already
>>>>> discussed there are some very naive design decisions in there, such as
>>>>> copying every bio payload to another memory buffer, splitting all io
>>>>> down to 4k. Think about the cpu overhead and memory consumption!
>>>>> Think about how it will perform when memory is constrained and it
>>>>> can't allocate many of those rambufs! I'm sure more issues will be
>>>>> found if I read further.
>>>> These decisions were made based on measurement. They are not naive.
>>>> I am a man who dislikes performance optimization without measurement.
>>>> As a result, I value what the simplicity brings more highly than what
>>>> other possible design decisions could bring.
>>>>
>>>> About CPU consumption:
>>>> the average CPU consumption while running a random-write fio job
>>>> against a consumer-level SSD is only about 3%,
>>>> which is five times more efficient than bcache per IOPS.
>>>>
>>>> With a RAM-backed cache device, it reaches about 1.5GB/sec of throughput.
>>>> Even in that case the CPU consumption is only 12%.
>>>> Please see this post:
>>>> http://www.redhat.com/archives/dm-devel/2014-February/msg00000.html
>>>>
>>>> I think the CPU consumption is small enough to ignore.
>>>>
>>>> About memory consumption,
>>>> you seem to misunderstand the facts.
>>>> The rambufs are not allocated dynamically but statically, up front.
>>>> The default total is 8MB, which is usually not worth arguing about.
>>>>
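To put a number on that, here is a minimal sketch of the idea. The names
and the userspace malloc() are only for illustration, not the driver's own
allocation code; the point is that the pool is a fixed, up-front
reservation rather than a per-bio allocation:

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RAMBUF_SIZE  (512 * 1024)                  /* one log segment    */
#define POOL_SIZE    (8 * 1024 * 1024)             /* default total: 8MB */
#define NR_RAMBUFS   (POOL_SIZE / RAMBUF_SIZE)     /* = 16 buffers       */

struct rambuf_pool {
	uint8_t *bufs[NR_RAMBUFS];
};

/* Reserve the whole pool once; setup simply fails if this fails. */
static int init_rambuf_pool(struct rambuf_pool *pool)
{
	for (size_t i = 0; i < NR_RAMBUFS; i++) {
		pool->bufs[i] = malloc(RAMBUF_SIZE);
		if (!pool->bufs[i])
			return -1;
	}
	return 0;
}

So the worst case is 16 buffers of 512KB each, 8MB in total, reserved up
front rather than something that grows under I/O load.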
>>>>> Mike raised the question of why you want this in the kernel so much?
>>>>> You'd find none of the distros would support it; so it doesn't widen
>>>>> your audience much. It's far better for you to maintain it outside of
>>>>> the kernel at this point. Any users will be bold, adventurous people,
>>>>> who will be quite capable of building a kernel module.
>>>> Some people already deploy Writeboost in daily use.
>>>> The phrase "log-structured" seems to easily attract storage people's attention.
>>>> If this driver is merged upstream, I think it will gain a wider audience
>>>> and thus more feedback.
>>>> When my driver was covered by Phoronix before, it actually drew attention.
>>>> Those readers must be waiting for Writeboost to become available upstream.
>>>> http://www.phoronix.com/scan.php?page=news_item&px=MTQ1Mjg
>>>>
>>>>> I'm sorry to have disappointed you so, but if I let this go upstream
>>>>> it would mean a massive amount of support work for me, not to mention
>>>>> a damaged reputation for dm.
>>>> If you read the code further, you will find how simple the mechanism is,
>>>> not to mention the code itself.
>>>>
>>>> - Akira
>>>>
>>>> On 12/12/14 11:24 PM, Joe Thornber wrote:
>>>>> On Fri, Dec 12, 2014 at 09:42:15AM +0900, Akira Hayakawa wrote:
>>>>>> The SSD-caching should be log-structured.
>>>>>
>>>>> No argument there, and this is why I've supported you with
>>>>> dm-writeboost over the last couple of years.
>>>>>
>>>>> However, after looking at the current code, and using it I think it's
>>>>> a long, long way from being ready for production. As we've already
>>>>> discussed there are some very naive design decisions in there, such as
>>>>> copying every bio payload to another memory buffer, splitting all io
>>>>> down to 4k. Think about the cpu overhead and memory consumption!
>>>>> Think about how it will perform when memory is constrained and it
>>>>> can't allocate many of those rambufs! I'm sure more issues will be
>>>>> found if I read further.
>>>>>
>>>>> I'm sorry to have disappointed you so, but if I let this go upstream
>>>>> it would mean a massive amount of support work for me, not to mention
>>>>> a damaged reputation for dm.
>>>>>
>>>>> Mike raised the question of why you want this in the kernel so much?
>>>>> You'd find none of the distros would support it; so it doesn't widen
>>>>> your audience much. It's far better for you to maintain it outside of
>>>>> the kernel at this point. Any users will be bold, adventurous people,
>>>>> who will be quite capable of building a kernel module.
>>>>>
>>>>> - Joe
>>>>>
>>>>
>>>
>>