Hi Jens and Kernel Gurus,
We are submitting the EnhanceIO(TM) caching driver for inclusion in the Linux kernel. It is an enterprise-grade caching solution that has been validated extensively in-house and in a large number of enterprise installations, and it offers capabilities not found in dm-cache or bcache. The EnhanceIO(TM) caching solution has been reported to deliver a substantial performance improvement over a RAID source device in various types of applications - file servers, relational and object databases, replication engines, web hosting, messaging and more. It has been proven in independent testing, such as testing by Demartek.
We believe that the EnhanceIO(TM) driver will add substantial value to the Linux kernel by letting customers exploit SSDs to their full potential. We would like you to consider it for inclusion in the kernel.
Thanks.
--
Amit Kale
Features and capabilities are described below. The patch is being submitted in a separate email.
1. User interface
There are commands for creation and deletion of caches and editing of cache parameters.
2. Software interface
This kernel driver uses the block device interface to receive IOs submitted by upper layers, such as filesystems or applications, and to submit IOs to the HDD and SSDs. It is fully transparent from the upper layers' viewpoint.
3. Availability
Caches can be created and deleted while applications are online, so there is no downtime at all. Crash recovery completes in milliseconds. Error recovery times depend on the kind of error; caches continue working without downtime through intermittent errors.
4. Security
Cache operations require root privilege. IO operations are done at the block layer, which is implicitly protected by device-node-level access control.
5. Portability
It works with any HDD or SSD that supports the Linux block layer interface, so it is device agnostic. We have tested Linux x86 32-bit and 64-bit configurations. We have compiled it for big-endian machines, although we haven't tested that configuration.
6. Persistence of cache configuration
Cache configuration can be made persistent through startup scripts.
7. Persistence of cached data
Cached data can be made persistent. There is an option to treat it as volatile only in the case of abrupt shutdowns or reboots. This is to prevent the HDD and the SSD of a cache from going out of sync, for example in enterprise environments where a large number of HDDs and SSDs may be accessed by a number of servers through a switch.
8. SSD life
An administrator can choose a cache mode appropriate for the desired SSD life. SSD life depends on the number of writes, which is determined by the cache mode: read-only (R1, W0), write-through (R1, W1), write-back (R1, W1 + metadata writes).
9. Performance
EnhanceIO(TM) driver throughput is equal to SSD throughput for 100% cache hits; no difference is measurable within the error margins of throughput measurement. Throughput generally depends on the cache hit rate and falls between HDD and SSD throughput. Interestingly, throughput can even be higher than SSD throughput in a few test cases.
10. ACID properties
EnhanceIO(TM) metadata layout does not contain a journal or a shadow-page update capability, which are the typical mechanisms used to ensure atomicity. The write-back feature has instead been developed on the basis of safe sequences of SSD operations. A safe sequence is explained as follows: a cache operation may involve multiple block writes, and a sequence of block writes is safe when executing only the first few of them does not result in inconsistencies. If a sequence involves two block writes and only the first one is written, no inconsistency results. An example of this is a dirty block write - the dirty data is written first, and the updated metadata is written after that. If an abnormal shutdown occurs after the dirty data is written, it does not cause an inconsistency: the dirty block is ignored because its metadata is yet to be written, and since the application was not returned an IO completion status before the shutdown, it assumes that the IO never made it to the HDD.
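As a simplified illustration of the safe sequence for a dirty block write (the types and helper names below are made up for this sketch and are not the actual driver code):

#include <stdint.h>
#include <stddef.h>

enum cb_state { CB_INVALID, CB_VALID, CB_DIRTY };

struct cache_block_md {
    uint64_t hdd_sector;    /* HDD location this cache block maps to */
    enum cb_state state;    /* INVALID, VALID or DIRTY */
};

/* Hypothetical helpers; each returns 0 only after its write has completed. */
int ssd_write_data(uint64_t cache_block, const void *buf, size_t len);
int ssd_write_metadata(uint64_t cache_block, const struct cache_block_md *md);

int cache_write_dirty(uint64_t cache_block, const void *buf, size_t len,
                      uint64_t hdd_sector)
{
    struct cache_block_md md = { .hdd_sector = hdd_sector, .state = CB_DIRTY };
    int err;

    /* Step 1: persist the dirty data itself. */
    err = ssd_write_data(cache_block, buf, len);
    if (err)
        return err;

    /*
     * Crash window: the data is on the SSD, but the on-SSD metadata still
     * describes the old state, so recovery simply ignores this block.
     */

    /* Step 2: persist the metadata that makes the dirty data visible. */
    err = ssd_write_metadata(cache_block, &md);
    if (err)
        return err;

    /* Only now is a success status returned to the application. */
    return 0;
}

If the shutdown happens between the two steps, the block is ignored on the next startup, exactly as described above.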
The EnhanceIO(TM) driver offers an atomicity guarantee at the SSD's internal flash block size. In the case of an abnormal shutdown, each block in an IO request persists either in full or not at all; in no event is an incomplete block found when the cache is enabled later. For example, for an SSD with an internal flash block size of 4kB: if an 8kB IO was requested at offset 0, the end result could be the first 4kB written, the last 4kB written, or neither written. If a 4kB IO was requested at offset 2kB, the end result could be the first 2kB (contained in the first cache block) written, the last 2kB (contained in the second cache block) written, or neither written.
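To make the block boundaries in the two examples concrete, here is a tiny standalone sketch (illustrative only, not driver code) that maps an IO range onto 4kB cache blocks:

#include <stdio.h>
#include <stdint.h>

#define CACHE_BLOCK_SIZE 4096ULL   /* the 4kB internal flash block of the example */

/* Report which cache blocks the byte range [offset, offset + len) touches. */
static void map_io(uint64_t offset, uint64_t len)
{
    uint64_t first = offset / CACHE_BLOCK_SIZE;
    uint64_t last  = (offset + len - 1) / CACHE_BLOCK_SIZE;

    printf("%llu bytes at offset %llu -> cache blocks %llu..%llu\n",
           (unsigned long long)len, (unsigned long long)offset,
           (unsigned long long)first, (unsigned long long)last);
}

int main(void)
{
    map_io(0, 8192);     /* 8kB at offset 0:   spans blocks 0 and 1 */
    map_io(2048, 4096);  /* 4kB at offset 2kB: spans blocks 0 and 1 */
    return 0;
}

Each of those blocks persists atomically, but no ordering is promised between them, which gives exactly the set of outcomes listed above.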
If an SSD's internal flash block size is smaller than the cache block size, a torn-page problem could occur, where a cache block may contain garbage after an abrupt power failure or an OS crash. This is an extremely rare condition; at present, no recently manufactured SSD is known to have an internal flash block size of less than 4kB.
Block devices are required to offer an atomicity guarantee at the sector level (512 bytes) at a minimum. EnhanceIO(TM) write-back conforms to this requirement.
If upper layers such as filesystems or databases are not prepared to handle sector-level atomicity, they may fail after an abnormal shutdown with the EnhanceIO(TM) driver just as they would with an HDD, so HDD guarantees are not diluted in this context.
If upper layers require a sequential write guarantee in addition to atomicity, they will work on an HDD but may fail with the EnhanceIO(TM) driver in the case of an abnormal shutdown; such layers will not work with RAID either. A sequential write guarantee means that a device writes blocks strictly in sequential order, so if a block in an IO range is persistent, all blocks prior to it are also persistent. Enterprise software packages do not make this assumption, so they will not have a problem with the EnhanceIO(TM) driver. EnhanceIO(TM) thus offers the same level of guarantees as RAID in this context.
EnhanceIO(TM) caches should offer the same end-result guarantee as an HDD in the case of parallel IOs: parallel IOs do not result in stale or inconsistent cache data. This is not a requirement from filesystems, as filesystems use the page cache and do not issue simultaneous IO requests with overlapping IO ranges; however, some applications may require this guarantee.
EnhanceIO(TM) write-back has been implemented to ensure that data is not lost once a success status has been returned for an application-requested IO. This holds across OS crashes, abrupt power failures, planned reboots and planned power-offs.
11. Error conditions
Handling of power failures and of intermittent and permanent device failures is described in another email.
On Fri, May 24 2013, OS Engineering wrote:
> Hi Jens and Kernel Gurus,
[snip]
Thanks for writing all of this up, but I'm afraid it misses the point
somewhat. As stated previously, we have (now) two existing competing
implementations in the kernel. I'm looking for justification on why YOUR
solution is better. A writeup and documentation on error handling
details is nice and all, but it doesn't answer the key important
questions.
Let's say somebody sends in a patch that he/she claims improves memory
management performance. To justify such a patch (or any patch, really),
the maintenance burden vs performance benefit needs to be quantified.
Such a person had better supply a set of before and after numbers, such
that the benefit can be quantified.
It's really the same with your solution. You mention "the solution has
been proven in independent testing, such as testing by Demartek.". I
have no idea what this testing is, what they ran, compared with, etc.
So, to put it bluntly, I need to see some numbers. Run relevant
workloads on EnhanceIO, bcache, dm-cache. Show why EnhanceIO is better.
Then we can decide whether it really is the superior solution. Or,
perhaps, it turns out there are inefficiencies in eg bcache/dm-cache
that could be fixed up.
Usually I'm not such a stickler for including new code. But a new driver
is different from EnhanceIO. If somebody submitted a patch to add a
newly written driver for hw that we already have a driver for, that
would be a similar situation.
The executive summary: your writeup was good, but we need some relevant
numbers to look at too.
--
Jens Axboe
Hi Jens,
I mistakenly dropped the web link to the Demartek study while composing my email. The Demartek study is published here: http://www.demartek.com/Demartek_STEC_S1120_PCIe_Evaluation_2013-02.html. It's an independent study. Here are a few numbers taken from this report, from a database comparison measured in transactions per second:
HDD baseline (40 disks) - 2570 tps
240GB Cache - 9844 tps
480GB cache - 19758 tps
RAID5 pure SSD - 32380 tps
RAID0 pure SSD - 40467 tps
There are two types of performance comparisons: application based and IO pattern based. Application-based tests measure the efficiency of cache replacement algorithms; these are time consuming. The above tests were done by Demartek over a period of time. I don't have performance comparisons between the EnhanceIO(TM) driver, bcache and dm-cache. I'll try to get them done in-house.
IO pattern based tests can be done quickly; however, since the IO pattern is fixed prior to the test, the output tends to depend on whether the pattern suits the caching algorithm. These are relatively easy, and I can definitely post this comparison.
Regarding IO error handling - that's really our USP :-). While it won't be possible to run bcache and dm-cache through our internal error test suites, I'll try to come up with a few points based on a code comparison.
Thanks.
-Amit
> -----Original Message-----
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Jens Axboe
> Sent: Saturday, May 25, 2013 12:17 AM
> To: OS Engineering
> Cc: LKML; Padmini Balasubramaniyan; Amit Phansalkar
> Subject: Re: EnhanceIO(TM) caching driver features [1/3]
>
> [full quote of the message above snipped]
Please don't top post!
On Sat, May 25 2013, Amit Kale wrote:
> Hi Jens,
>
> I by mistake dropped the weblink to demartek study while composing my
> email. The demartek study is published here:
> http://www.demartek.com/Demartek_STEC_S1120_PCIe_Evaluation_2013-02.html.
> It's an independent study. Here are a few numbers taken from this
> report. In a database comparison using transactions per second
> HDD baseline (40 disks) - 2570 tps
> 240GB Cache - 9844 tps
> 480GB cache - 19758 tps
> RAID5 pure SSD - 32380 tps
> RAID0 pure SSD - 40467 tps
>
> There are two types of performance comparisons, application based and
> IO pattern based. Application based tests measure efficiency of cache
> replacement algorithms. These are time consuming. Above tests were
> done by demartek over a period of time. I don't have performance
> comparisons between EnhanceIO(TM) driver, bcache and dm-cache. I'll
> try to get them done in-house.
Unless I'm badly mistaken, that study is only on enhanceio, it does not
compare it to any other solutions. Additionally, it's running on
Windows?! I don't think it's too much to ask to see results on the
operating system for which you are submitting the changes.
> IO pattern based tests can be done quickly. However since IO pattern
> is fixed prior to the test, output tends to depend on whether the IO
> pattern suits the caching algorithm. These are relatively easy. I can
> definitely post this comparison.
It's fairly trivial to do some synthetic cache testing with fio, using
eg the zipf distribution. That'll get you data reuse, for both reads and
writes (if you want), in the selected distribution.
--
Jens Axboe
On Saturday 25 May 2013, Jens Axboe wrote:
> Please don't top post!
I'll have to use a different email client for that. Note that I am writing this
from my personal email address. This email and any future emails I write from
this address represent my personal views, and sTec can't be held responsible for
them.
> On Sat, May 25 2013, Amit Kale wrote:
> > Hi Jens,
> >
> > I by mistake dropped the weblink to demartek study while composing my
> > email. The demartek study is published here:
> > http://www.demartek.com/Demartek_STEC_S1120_PCIe_Evaluation_2013-02.html.
> > It's an independent study. Here are a few numbers taken from this
> > report. In a database comparison using transactions per second
> >
> > HDD baseline (40 disks) - 2570 tps
> > 240GB Cache - 9844 tps
> > 480GB cache - 19758 tps
> > RAID5 pure SSD - 32380 tps
> > RAID0 pure SSD - 40467 tps
> >
> > There are two types of performance comparisons, application based and
> > IO pattern based. Application based tests measure efficiency of cache
> > replacement algorithms. These are time consuming. Above tests were
> > done by demartek over a period of time. I don't have performance
> > comparisons between EnhanceIO(TM) driver, bcache and dm-cache. I'll
> > try to get them done in-house.
>
> Unless I'm badly mistaken, that study is only on enhanceio, it does not
> compare it to any other solutions.
That's correct. I haven't seen any application-level, benchmark-based
comparisons between different caching solutions on any platform.
> Additionally, it's running on
> Windows?!
Yes. However, as I have said above, application-level testing is primarily a
test of the cache replacement algorithm, so the effect of the platform is
smaller, although not zero.
> I don't think it's too much to ask to see results on the
> operating system for which you are submitting the changes.
Agreed, that's a fair thing to ask.
>
> > IO pattern based tests can be done quickly. However since IO pattern
> > is fixed prior to the test, output tends to depend on whether the IO
> > pattern suits the caching algorithm. These are relatively easy. I can
> > definitely post this comparison.
>
> It's fairly trivial to do some synthetic cache testing with fio, using
> eg the zipf distribution. That'll get you data reuse, for both reads and
> writes (if you want), in the selected distribution.
While running a test is trivial, deciding what IO patterns to run is a difficult
problem. The bottom line for sequential, random, zipf and pareto is the same -
they all test a fixed IO pattern, which at best is very unlike an application
pattern. Cache behavior affects the IO addresses that an application issues: the
list of blocks requested by an application is different when running on an HDD,
an SSD or a cache. Memory usage by the cache is one significant factor in this
effect, and IO latency (higher or lower) also has an effect when multiple threads
are processing transactions.
Regardless of which IO pattern is used, the following characteristics to a large
extent measure the efficiency of a cache engine apart from its cache replacement
algorithm.
Hit rate - Can be varied from 0 to 100%. 90% and above are of interest for
caching.
Read versus write mix - Can be varied from 0/100, 10/90, ..., 90/10, 100/0.
IO block size - Fixed, equal to or a multiple of the cache block size. Variable
block sizes complicate analysis of results.
Does the following comparison sound interesting? I welcome others to propose
modifications or other approaches.
Cache block size = 4kB
HDD size = 500GB
SSD size = 100GB and 500GB
The SSD size equal to the HDD size is only there to study cache behavior, so not
all of the tests below need to be performed with it.
For the read-only cache mode (works best for write intensive loads):
  R/W mix - 10/90 and 30/70
  Block size - 64kB (write intensive loads are usually long writes)
  Cache hit ratio - 0%, 90%, 95%, 100%
For the write-through cache mode (works best for read intensive loads):
  R/W mix - 100/0, 90/10
  Block size - 4kB, 16kB, 128kB
  Cache hit ratio - 90%, 95%, 100%
For the write-back cache mode (works best for fluctuating loads):
  R/W mix - 90/10
  Block size - 4kB
  Cache hit ratio - 95%
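For illustration, the write-through case with a 90/10 mix and 4kB blocks could be
expressed as an fio job along the following lines. The device path, sizes and the
zipf theta are placeholders; the theta and the working-set size relative to the SSD
would have to be tuned to approximate the target hit ratios.

; illustrative fio job, not a finalized test definition
[global]
ioengine=libaio
direct=1
time_based=1
runtime=300

[wt-90-10-4k]
; hypothetical HDD device node with the cache enabled on it (destructive test!)
filename=/dev/sdb
rw=randrw
rwmixread=90
bs=4k
iodepth=32
; skewed random distribution produces data reuse, i.e. cache hits
random_distribution=zipf:1.1
; working set sized relative to the 100GB SSD cache
size=400g

The read-only and write-back rows would change only rwmixread, bs and the
distribution skew.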
Thanks.
-Amit