>Beta3 of device-mapper is now available at:
>
>ftp://ftp.sistina.com/pub/LVM2/device-mapper/device-mapper-beta3.0.tgz
>
>The accompanying LVM2 toolset:
>
> ftp://ftp.sistina.com/pub/LVM2/tools/LVM2.0-beta3.0.tgz
>
>The main addition for this release is high-performance persistent
>snapshots; see http://people.sistina.com/~thornber/snap_performance.html
>for a comparison with LVM1 and EVMS.
Thanks for the results. I tried the same thing, but with the latest
release (beta 4) and I am not observing the same behavior. Your results
show very little difference in performance when using different chunk
sizes for snapshots, but I observed a range of 10 to 24 seconds for this
same test on beta4 (I have also included EVMS 1.1 pre4):
#### dbench with 2 clients ####
####       mem=32MB        ####

              EVMS 1.1 pre4             LVM2 Beta4.1
chunk   ------------------------   ------------------------
size    1st   2nd   3rd    Ave      Ave    1st   2nd   3rd
-----   ---   ---   ---   -----    -----   ---   ---   ---
8k       11     9     9    9.66     24.0    23    24    25
16k       9     9     8    8.66     17.6    18    17    18
32k       8     9     7    8.00     12.6    13    12    13
64k       9     8     9    8.66     11.3    12    11    11
128k      9     8     9    8.66     10.0    10     9    11
256k      8     9     9    8.66     10.0    10    10    10
512k      8     9     9    8.66     10.0    10    10    10
none      7     7     6    6.66     6.66     8     6     6

results in seconds
none = baseline, no snapshot
As you can see, the smaller chunk sizes did make a difference in the
times. The EVMS results also now use the async option, and they are
generally more consistent and faster in all cases regardless of chunk
size. The baselines are the same, as I expected, since we are disk
bound and the non-snapshot IO code paths are not that much different.
I believe the major difference in snapshot performance comes down to
disk latency: I suspect EVMS drives the IO more efficiently so that
head movement is minimized. FYI, this test used two disks, one for
the original volume and one for the snapshot.
I also wanted to run a test that concentrated only on IO, so I scrapped
dbench in favor of plain old dd. First I ran with all of my memory:
#### time dd if=/dev/zero of=/dev/<vol> bs=4k count=25000 ####
####                     mem=768MB                        ####

                 EVMS 1.1 pre4                     LVM2 Beta4.1
chunk   ------------------------------   ------------------------------
size      1st     2nd     3rd     Ave      Ave      1st     2nd     3rd
-----   ------  ------  ------  ------   ------   ------  ------  ------
8k      10.773  10.271  10.752  10.599   28.076   27.581  28.065  28.582
16k      7.621   7.785   7.557   7.684   14.926   14.672  14.856  15.251
32k      7.676   7.747   7.537   7.653   11.947   12.082  12.026  11.734
64k      7.534   7.889   7.873   7.765   11.407   11.548  11.436  11.238
128k     7.803   7.660   7.511   7.658   11.248   11.216  11.130  11.399
256k     7.629   7.677   7.631   7.646   11.122   11.256  10.973  11.137
512k     7.677   7.593   7.920   7.730   10.813   11.104  10.736  10.601
none     4.734   4.956   4.751   4.814    4.887    4.755   4.974   4.933

results in seconds
none = baseline, no snapshot
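For reference, each LVM2 data point came from something like the following
(volume names and sizes are placeholders from my scripts, not exact
commands):

    # create the snapshot with a 64k chunk (-c/--chunksize); in my setup
    # it gets allocated on the second disk
    lvcreate -s -c 64 -L 2G -n snap /dev/vg0/orig
    # timed write straight to the original volume
    time dd if=/dev/zero of=/dev/vg0/orig bs=4k count=25000
    # tear down before the next chunk size
    lvremove -f /dev/vg0/snap
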
As you can see, again, the small chunk sizes really hurt the performance
of the LVM2 snapshots. I can tell you this extra time is not spent in the
kernel; we are just waiting longer for the disk to complete its
transactions. I am really curious why you did not experience this
behaviour in your tests.
Considering what goes on during a snapshot copy (one read from one disk
and two parallel writes to different disks), the EVMS results are about
the best you could possibly achieve compared to the plain write test
(assuming that a disk read is a little faster than a disk write). The
baseline results are what I expected: nearly identical times, since we
bottleneck on disk throughput.
>Please be warned that snapshots will deadlock under load on 2.4.18
>kernels due to a bug in the VM system; 2.4.19-pre8 works fine.
That leads me to my next test, the same as above but with only 32 MB of
memory. Whatever problem exists, it may still be present in 2.4.19-rc1
(only for LVM2):
#### time dd if=/dev/zero of=/dev/<vol> bs=4k count=25000 ####
####                      mem=32MB                        ####

                 EVMS 1.1 pre4                     LVM2 Beta4.1
chunk   ------------------------------   ------------------------------
size      1st     2nd     3rd     Ave      Ave      1st     2nd     3rd
-----   ------  ------  ------  ------   ------   ------  ------  ------
8k       9.290   9.825   9.513   9.543   43.519   42.121  44.918     DNF
16k      8.540   8.684   9.016   8.747      DNF      DNF     DNF     DNF
32k      8.607   8.512   8.339   8.486   20.216      DNF     DNF  20.216
64k      8.202   8.436   8.137   8.258   14.355   13.972  14.737     DNF
128k     8.269   7.772   8.505   8.182   11.915   11.828     DNF  12.002
256k     8.667   8.022   8.236   8.308   15.212   10.952  23.319  11.366
512k     8.249   7.961   8.602   8.271   12.480   13.996     DNF  10.964
none     4.046   4.215   4.464   4.242    4.294    4.318   4.094   4.469

results in seconds
none = no snapshot
DNF = Did Not Finish (system was unresponsive after 15 minutes)
The performance did drop off again for small chunk sizes on LVM2, and
sometimes it was very bad. EVMS was incrementally slower overall, which
is IMO acceptable considering the memory available. On the "DNF" runs I
could not get the system to respond to anything; most likely this is the
deadlock issue you mentioned above.
If you have any ideas why our test results differ, please let me know.
I can send you my test scripts if you like. Below are the system specs:
System: 800 MHz PIII, 768 MB RAM, 3 x 18 GB 15k rpm SCSI disks
Linux:  2.4.19-rc1 with LVM2 Beta 4.1 and EVMS 1.1 pre4
Also, if anyone has ideas on how to test the "other half" of snapshots,
reading the snapshot while writing to the original, please send them my
way. Perhaps a simulated tape backup of the snapshot while something is
thrashing the original, and of course something we can measure.
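Something along these lines is what I have in mind (device names are
placeholders, counts arbitrary):

    # writer thrashing the original volume in the background
    dd if=/dev/zero of=/dev/vg0/orig bs=4k count=100000 &
    # simulated tape backup: stream the snapshot out sequentially, timed
    time dd if=/dev/vg0/snap of=/dev/null bs=64k
    wait
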
Regards,
Andrew Theurer
Andrew,
On Fri, Jul 12, 2002 at 06:21:08PM -0500, Andrew Theurer wrote:
> Thanks for the results. I tried the same thing, but with the latest
> release (beta 4) and I am not observing the same behavior. Your results
> show very little difference in performance when using different chunk
> sizes for snapshots, but I observed a range of 10 to 24 seconds for this
> same test on beta4 (I have also included EVMS 1.1 pre4):
I must admit your results are strange to say the least; the only
explanation that I can think of at the moment is that you have been
running LVM1.
Just to reassure me that this is not the case, can you please make
sure that the LVM1 driver is not available, and that your path is not
picking up old LVM1 tools by mistake? There was a time when the tools
were installed in /usr/sbin rather than /sbin.
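Something along these lines should be enough to check (adjust the paths
to match your install):

    # make sure no old LVM1 binaries are found first in the PATH
    type -a lvcreate vgcreate
    ls -l /sbin/lvcreate /usr/sbin/lvcreate
    # the LVM1 kernel driver exposes /proc/lvm; this should not exist
    ls /proc/lvm
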
- Joe
Andrew,
On Fri, Jul 12, 2002 at 06:21:08PM -0500, Andrew Theurer wrote:
> Thanks for the results. I tried the same thing, but with the latest
> release (beta 4) and I am not observing the same behavior.
I've just read through the evms 1.1-pre4 code.
You (or Kevin Corry rather) have obviously looked at the
device-mapper snapshot code and tried to reimplement it. I think it
might have been better if you'd actually used the relevant files from
device-mapper such as the kcopyd daemon.
Also I have serious concerns about the correctness of your
implementation, for example:
i) It is possible for the exception table to be updated *before* the
copy-on-write of the data actually occurs. This means that, briefly,
the on-disk state of the snapshot is inconsistent. Users of EVMS
who experience a machine failure will therefore have no option but to
delete all snapshots.
ii) The exception table is only written out every
(EVMS_VSECTOR_SIZE/sizeof(u_int64_t)) exceptions. Surely I've misread
the code here? This would mean I could have a machine that triggers
one fewer exception than this, then does nothing more to the volume,
and at no point would the exception table get written to disk.
Or do you periodically flush the exception table as well?
Again, if the machine crashed the snapshot would be useless.
iii) EVMS is writing the exception table data to the COW device
asynchronously; you *cannot* do this without risking deadlocks with
the VM system.
- Joe
On Monday 15 July 2002 03:59, Joe Thornber wrote:
> Andrew,
>
> On Fri, Jul 12, 2002 at 06:21:08PM -0500, Andrew Theurer wrote:
> > Thanks for the results. I tried the same thing, but with the latest
> > release (beta 4) and I am not observing the same behavior. Your results
> > show very little difference in performance when using different chunk
> > sizes for snapshots, but I observed a range of 10 to 24 seconds for this
> > same test on beta4 (I have also included EVMS 1.1 pre4):
>
> I must admit your results are strange to say the least; the only
> explanation that I can think of at the moment is that you have been
> running LVM1.
>
> Just to reassure me that this is not the case, can you please make
> sure that the LVM1 driver is not available, and that your path is not
> picking up old LVM1 tools by mistake? There was a time when the tools
> were installed in /usr/sbin rather than /sbin.
Joe,
I assure you this is not LVM1. I have both sets of tools installed, but
they are under different directories, and my test scripts call them with
full paths. I have also done some tracing to verify that I am in
device-mapper code. Also, the kernel I tested with has only device-mapper
and EVMS, no LVM1.
-Andrew Theurer
On Monday 15 July 2002 07:40, Joe Thornber wrote:
> On Fri, Jul 12, 2002 at 06:21:08PM -0500, Andrew Theurer wrote:
> > Thanks for the results. I tried the same thing, but with the latest
> > release (beta 4) and I am not observing the same behavior.
>
> I've just read through the evms 1.1-pre4 code.
>
> You (or Kevin Corry rather) have obviously looked at the
> device-mapper snapshot code and tried to reimplement it. I think it
> might have been better if you'd actually used the relevant files from
> device-mapper such as the kcopyd daemon.
Actually, I've never looked at the snapshotting code in LVM2.
We are both writing code with very similar end goals, and there are only
so many ways to perform I/O from within the block layer, so it is not
surprising that our designs share some similarities.
> Also I have serious concerns about the correctness of your
> implementation, for example:
>
> i) It is possible for the exception table to be updated *before* the
> copy-on-write of the data actually occurs. This means that, briefly,
> the on-disk state of the snapshot is inconsistent. Users of EVMS
> who experience a machine failure will therefore have no option but to
> delete all snapshots.
>
> ii) The exception table is only written out every
> (EVMS_VSECTOR_SIZE/sizeof(u_int64_t)) exceptions. Surely I've misread
> the code here? This would mean I could have a machine that triggers
> one fewer exception than this, then does nothing more to the volume,
> and at no point would the exception table get written to disk.
> Or do you periodically flush the exception table as well?
> Again, if the machine crashed the snapshot would be useless.
In the current design, the COW table is written to disk in two cases:
either when the current COW sector is full, or on a clean system shutdown.
All snapshots will thus be persistent across a clean shutdown or reboot.
Currently, an async snapshot will be disabled if the system crashes. This
wasn't a big secret: the latest HOWTO explains it in the section on
snapshotting, and in the EVMS GUI there is a note attached to the "async"
option saying this as well.
We designed async snapshots in EVMS for maximum performance, and accepting
the one caveat above provides a significant performance increase. Writing
the COW table only when it is full avoids a lot of unnecessary disk head
seeking. Also, allowing out-of-order I/Os means that the write request to
the original that triggered the copy can be released much sooner - as soon
as the chunk has been read from the original, and in parallel with the
write to the snapshot.
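To put a number on that flush frequency: assuming EVMS_VSECTOR_SIZE is the
usual 512 bytes and each exception entry is a u_int64_t, a COW sector holds
512 / 8 = 64 entries, so in async mode the exception table only goes to
disk once per 64 copied chunks.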
We've discussed adding a mode to async snapshots that provides persistence
across system crashes (at a slight performance penalty, of course), but
there's been a lot going on here lately, so I haven't had a chance to get
it all coded up. In the meantime, the synchronous option is still available
for those who are scared of system crashes. Personally, I'm not that
scared; I'd have a hard time remembering the last time one of my production
machines crashed unexpectedly.
> iii) EVMS is writing the exception table data to the COW device
> asynchronously; you *cannot* do this without risking deadlocks with
> the VM system.
I don't see how this is the case. From the VM's viewpoint, writing the COW
table is basically no different than copying the chunks from the original to
the snapshot. We've never experienced any VM deadlocks in the testing we've
done. If you provide us with some more details about the deadlock you are
describing, we'll give it some more thought.
-Kevin
Kevin,
On Mon, Jul 15, 2002 at 01:56:54PM -0500, Kevin Corry wrote:
> In the current design, the COW table is written to disk in two cases:
> either when the current COW sector is full, or on a clean system shutdown.
> All snapshots will thus be persistent across a clean shutdown or reboot.
> Currently, an async snapshot will be disabled if the system crashes. This
> wasn't a big secret: the latest HOWTO explains it in the section on
> snapshotting, and in the EVMS GUI there is a note attached to the "async"
> option saying this as well.
>
> We designed async snapshots in EVMS for maximum performance, and accepting
> the one caveat above provides a significant performance increase. Writing
> the COW table only when it is full avoids a lot of unnecessary disk head
> seeking.
...
> In the meantime, the synchronous option is still available
> for those who are scared of system crashes. Personally, I'm not that
> scared; I'd have a hard time remembering the last time one of my production
> machines crashed unexpectedly.
So you are saying that your async snapshots should only be used on
production machines, and only where the data stored on the snapshot is so
unimportant that you don't mind losing it. Nice.
In future, could you mention this caveat when you post comparison
benchmarks?
device-mapper *does* ensure that the snapshot is always consistent.
I don't believe the benchmarks posted at the top of this thread at
all. Not only are you claiming poor performance for device-mapper,
but also that this performance degrades as the chunk size is reduced.
device-mapper has been tested by a variety of people on many different
machines/architectures and we've only ever seen a flat performance
profile as chunk size increases; if anything, there is a very slight
degradation as the chunk size gets too large.
For instance, I just ran a test on my dev box; this should
not be considered a tuned benchmark by any means.
dbench -2 on a 32M RAM system:
no snapshot 8.22
8k 13.59
16k 13.99
32k 13.33
64k 12.90
128k 13.442
256k 13.654
512k 13.84
As far as I'm concerned, you should be comparing this with the slower
but consistent synchronous snapshots in EVMS.
- Joe
> I don't believe the benchmarks posted at the top of this thread at
> all. Not only are you claiming poor performance for device-mapper,
> but also that this performance degrades as the chunk size is reduced.
> device-mapper has been tested by a variety of people on many different
> machines/architectures and we've only ever seen a flat performance
> profile as chunk size increases; if anything, there is a very slight
> degradation as the chunk size gets too large.
Joe, a couple of things occurred to me. When you sent out the email about
device-mapper beta 3 (22/5/2002), and more specifically the user tools 95.07,
your results at http://people.sistina.com/~thornber/snap_performance.html
state that the default chunk size is 8k. This is not the case: the 95.07
code below defaults the -c value to 32 (KB), which it doubles to 64 sectors,
i.e. a 32k chunk:
static int _read_params(struct lvcreate_params *lp, struct cmd_context *cmd,
                        int argc, char **argv)
{
        /*
         * Set the defaults.
         */
        memset(lp, 0, sizeof(*lp));

        lp->chunk_size = 2 * arg_int_value(cmd, chunksize_ARG, 32);
        log_verbose("setting chunksize to %d sectors.", lp->chunk_size);

        if (!_read_name_params(lp, cmd, &argc, &argv) ||
            !_read_size_params(lp, cmd, &argc, &argv) ||
            !_read_stripe_params(lp, cmd, &argc, &argv))
                return 0;
And more importantly, I also could not successfully set the chunksize option
with "-c" or "--chunksize" on 95.07. I did not chase down the error; I just
changed the default size myself and reran. Since you had many people
test this, is it possible the chunksize did not change at all, and they were
at 32k all the time? That would explain why these results are flat, while I
am getting a wider range.
Now, this option does work in beta4, so obviously someone working on LVM2
knew about it and fixed it. I am wondering whether any of your testers have
run on beta4.
> For instance, I just ran a test on my dev box; this should
> not be considered a tuned benchmark by any means.
>
> dbench -2 on a 32M RAM system:
>
> no snapshot 8.22
> 8k 13.59
> 16k 13.99
> 32k 13.33
> 64k 12.90
> 128k 13.442
> 256k 13.654
> 512k 13.84
Joe, are you absolutely sure these tests had the disk cache disabled? That's
the only hardware thing I can think of that would make a difference.
It seems we can go 'round and 'round to no end as long as we have HW
differences, so I have asked for access to an OSDL system we can both run
on. That way there will be no difference in our HW. I'll let you know when
I hear back from them, so we can both test on the same system (if you want
to).
-Andrew Theurer
Andrew,
On Tue, Jul 16, 2002 at 11:05:49AM -0500, Andrew Theurer wrote:
> at 32k all the time? That would explain why these results are flat, while I
> am getting a wider range.
I think this is a red herring; the chunk size code did accidentally
get backed out of CVS for a while. But variable chunk sizes were
certainly going into the kernel when we did the development.
> Joe, are you absolutely sure these tests had the disk cache disabled? That's
> the only hardware thing I can think of that would make a difference.
Absolutely sure. Those figures were for a pair of PVs that were
sharing an IDE cable so I can certainly get things moving faster.
> It seems we can go 'round and 'round to no end as long as we have HW
> differences, so I have asked for access to an OSDL system we can both run
> on. That way there will be no difference in our HW. I'll let you know when
> I hear back from them, so we can both test on the same system (if you want
> to).
Excellent idea; we should probably think up some better benchmarks too.
- Joe
On Tue, 2002-07-16 at 12:31, Joe Thornber wrote:
> > Joe, are you absolutely sure these tests had the disk cache disabled? That's
> > the only hardware thing I can think of that would make a difference.
>
> Absolutely sure. Those figures were for a pair of PVs that were
> sharing an IDE cable so I can certainly get things moving faster.
Some IDE drives ignore commands to turn off the write-back cache, or
turn it back on when load gets high.
Try iozone -s 50M -i 0 -o with writeback on and off. If you get the
same answer, the benchmarks are suspect....
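For example (IDE disk and device name assumed):

    hdparm -W0 /dev/hda     # disable the drive's write-back cache
    iozone -s 50M -i 0 -o
    hdparm -W1 /dev/hda     # re-enable it
    iozone -s 50M -i 0 -o
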
-chris
On Tue, Jul 16, 2002 at 01:27:24PM -0400, Chris Mason wrote:
> Some IDE drives ignore commands to turn off the write-back cache, or
> turn it back on when load gets high.
I'm amazed hardware manufacturers would dare to do such a thing - I
guess I'm naive.
> Try iozone -s 50M -i 0 -o with writeback on and off. If you get the
> same answer, the benchmarks are suspect....
I tried this and found that /proc/ide/hda/settings was lying: it always
reports write caching as disabled. Once write caching was actually
disabled with hdparm, the disks didn't re-enable it on their own.
These are new test results, run on a pair of 5k rpm disks with write
caching disabled (three runs and their average):
Non-persistent snapshots
------------------------
            Run 1    Run 2    Run 3   Average
baseline   15.349   14.835   14.724    14.979
8k         19.43    21.629   19.951    20.337
16k        19.373   21.579   20.373    20.442
32k        22.158   19.505   21.472    21.045
64k        19.745   19.273   20.191    19.736
128k       19.588   20.86    20.166    20.205
256k       19.67    19.819   21.216    20.235
512k       20.439   22.621   20.753    21.271

Persistent snapshots
--------------------
            Run 1    Run 2    Run 3   Average
baseline   15.342   14.81    14.514    14.889
8k         25.24    26.985   27.092    26.439
16k        24.869   23.943   24.687    24.500
32k        27.084   25.861   24.728    25.891
64k        23.646   23.051   24.786    23.828
128k       23.761   24.596   25.53     24.629
256k       25.472   26.962   27.225    26.553
512k       29.076   27.183   27.795    28.018
As you can see, the persistent snapshots take a significant
performance hit compared to the non-persistent ones due to the
overhead of ensuring that all the data on the disk is consistent.
I would expect the EVMS async snapshots to perform similarly to
device-mapper's non-persistent snapshots, since they are not ensuring
any form of consistency; effectively they are not writing any
exception metadata to the disk until a controlled shutdown occurs.
- Joe
On Wed, 2002-07-17 at 11:31, Joe Thornber wrote:
> On Tue, Jul 16, 2002 at 01:27:24PM -0400, Chris Mason wrote:
> > Some IDE drives ignore commands to turn off the write-back cache, or
> > turn it back on when load gets high.
>
> I'm amazed hardware manufacturers would dare to do such a thing - I
> guess I'm naive.
It's not explicitly against the spec. Some of them also like to make
"cache flush" a no-op too. The next generation of ATA is supposed to add
explicit "direct to media" write command facilities.