Hi,
I am trying to implement a copy-on-write operation by reading the
original disk block and writing it to some other location, and then
allowing the write to pass through (the write is blocked until the read
of the original block completes). I tried using submit_bio() /
sb_bread() to read the block and the completion API to signal the end
of the read, but the performance is very bad: disk writes take around
12 times longer. Is there a better way to improve the performance?
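In outline, the synchronous path I have now looks roughly like this
(simplified sketch, most error handling omitted; cow_read_original() and
cow_read_endio() are just illustrative names for my helpers):

struct cow_read_done {
	struct completion done;
	int error;
};

static void cow_read_endio(struct bio *bio, int error)
{
	struct cow_read_done *d = bio->bi_private;

	d->error = error;
	complete(&d->done);
}

/* Synchronously read the old contents of the sector the write will hit. */
static int cow_read_original(struct block_device *bdev, sector_t sector,
			     struct page *page, unsigned int len)
{
	struct cow_read_done d;
	struct bio *bio;

	init_completion(&d.done);
	d.error = 0;

	bio = bio_alloc(GFP_NOIO, 1);
	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_private = &d;
	bio->bi_end_io = cow_read_endio;
	bio_add_page(bio, page, len, 0);

	submit_bio(READ, bio);
	wait_for_completion(&d.done);	/* the incoming write is stalled here */
	bio_put(bio);

	return d.error;
}

Only after this returns (and the saved data has been written elsewhere)
do I let the original write bio go down to the device.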
Not waiting for the completion of the read and letting the disk write
go through gives good performance, but in under 10% of cases the read
happens after the write and ends up with the new data instead of the
original data.
Regards.
On Tue, 9 Apr 2013, Prashant Shah wrote:
> Date: Tue, 9 Apr 2013 14:35:56 +0530
> From: Prashant Shah <[email protected]>
> To: [email protected]
> Subject: Fwd: block level cow operation
>
> Hi,
>
> I am trying to implement a copy-on-write operation
Hi,
In ext4? Why are you trying to do that?
> by reading the
> original disk block and writing it to some other location, and then
> allowing the write to pass through (the write is blocked until the read
> of the original block completes). I tried using submit_bio() /
> sb_bread() to read the block and the completion API to signal the end
> of the read, but the performance is very bad: disk writes take around
> 12 times longer. Is there a better way to improve the performance?
I am not sure what you're trying to achieve here, but the simplest
answer is yes, there is a way to improve the performance: use the device
mapper to do this. The thinp target provides block-level COW
functionality, which lets you do snapshots efficiently, for example.
-Lukas
>
> Not waiting for the completion of the read and letting the disk write
> go through gives good performance, but in under 10% of cases the read
> happens after the write and ends up with the new data instead of the
> original data.
>
> Regards.
On Tue, 9 Apr 2013 14:35:56 +0530, Prashant Shah <[email protected]> wrote:
> Hi,
>
> I am trying to implement a copy-on-write operation by reading the
> original disk block and writing it to some other location, and then
> allowing the write to pass through (the write is blocked until the read
> of the original block completes). I tried using submit_bio() /
> sb_bread() to read the block and the completion API to signal the end
> of the read, but the performance is very bad: disk writes take around
> 12 times longer. Is there a better way to improve the performance?
>
Yes, obviously. Instead of synchronous block-by-block handling, which
gives about 1-3 MB/s, you should not block bio/request handling but
simply defer the original bio. Something like this:
OUR_MAIN_ENTERING_POINT {
	if (bio_data_dir(bio) == WRITE) {
		if (cow_required(bio)) {
			/* save the old data first; the original bio is
			 * resubmitted later, from the worker */
			cow_bio = create_cow_copy(bio);
			submit_bio(READ, cow_bio);
			return;
		}
	}
	/* COW is not required, pass the bio through unchanged */
	generic_make_request(bio);
}

struct bio *create_cow_copy(struct bio *bio)
{
	/* Read the original contents of the sectors @bio is about to
	 * overwrite; once that read completes we will issue @bio. */
	struct bio *cow_bio = bio_alloc(GFP_NOIO, bio->bi_vcnt);

	cow_bio->bi_sector = bio->bi_sector;
	cow_bio->bi_bdev = bio->bi_bdev;
	/* ... allocate pages for cow_bio and bio_add_page() them ... */
	cow_bio->bi_private = bio;
	cow_bio->bi_end_io = cow_end_io;
	return cow_bio;
}

void cow_end_io(struct bio *cow_bio, int error)
{
	/*
	 * Once we are done saving the original content we may send the
	 * original bio.  But end_io may be called from various contexts,
	 * even from interrupt context, so we are not allowed to call
	 * submit_bio() here.  Put the original bio on a list and let our
	 * worker thread submit it for us later.
	 */
	add_bio_to_the_list((struct bio *)cow_bio->bi_private);
}
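The deferred resubmission itself can be just a bio list plus a work item,
for example (untested sketch, not taken from dm-snap; the helper names
are made up and the workqueue setup is only hinted at in the comments):

static struct bio_list deferred_bios = BIO_EMPTY_LIST;
static DEFINE_SPINLOCK(deferred_lock);
static struct workqueue_struct *cow_wq;		/* alloc_workqueue() at init */
static struct work_struct deferred_work;	/* INIT_WORK(&deferred_work, deferred_worker) at init */

/* Called from cow_end_io(); may run in interrupt context, so only queue. */
static void add_bio_to_the_list(struct bio *bio)
{
	unsigned long flags;

	spin_lock_irqsave(&deferred_lock, flags);
	bio_list_add(&deferred_bios, bio);
	spin_unlock_irqrestore(&deferred_lock, flags);

	queue_work(cow_wq, &deferred_work);
}

/* Runs in process context, so here it is safe to resubmit the originals. */
static void deferred_worker(struct work_struct *work)
{
	struct bio *bio;
	unsigned long flags;

	for (;;) {
		spin_lock_irqsave(&deferred_lock, flags);
		bio = bio_list_pop(&deferred_bios);
		spin_unlock_irqrestore(&deferred_lock, flags);

		if (!bio)
			break;

		/* the original, delayed write */
		generic_make_request(bio);
	}
}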
This approach gives us reasonable performance, about 3 times slower than
raw disk throughput.
For a reference implementation you may look at drivers/md/dm-snap or at
the Acronis snapapi module (AFAIR it is open source).
> Not waiting for the completion of the read and letting the disk write
> go through gives good performance, but in under 10% of cases the read
> happens after the write and ends up with the new data instead of the
> original data.
No, never do that. The block layer does not guarantee you any ordering.
>
> Regards.
On Tue, Apr 09, 2013 at 02:35:56PM +0530, Prashant Shah wrote:
> I am trying to implement a copy-on-write operation by reading the
> original disk block and writing it to some other location....
Lukas asked the correct first question, which is: why are you trying to
do this? If the goal is to make COW snapshots, then there's a lot of
accounting information that you'll need to keep track of, and it is very
doubtful that ext4 would be the right place to do it.
If the goal is to do efficient writes to cheap eMMC flash for random
write workloads (i.e., the same problem f2fs is trying to solve), it's
not totally insane to try to adapt ext4 to handle this.
#1 You'd need to add support to mballoc so it understands how to align
its block writes on eMMC erase-block boundaries, and to add a mode where
it hands out sequentially increasing physical blocks, ignoring the
logical block numbers (a sketch of the alignment arithmetic follows
after #4).
#2 You'd need to intercept the write requests at the writepages() and
writepage() calls, and that's where the decision to allocate a new set
of block numbers would have to be made, based on some flag set on either
a per-filesystem or a per-open-file basis. As part of the I/O completion
callback, where today we have code paths to convert an uninitialized
extent to an initialized extent, we could teach that code path to update
the logical block mapping.
#3 You'd have to come up with some approach to deal with direct I/O
(including potentially not supporting COW writes for DIO).
#4 You'd probably only want to do this for indirect block mapped
files, since for a random write workload, the extent tree would
become very inefficient very quickly.
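(To illustrate the alignment part of #1: the arithmetic itself is
trivial; a hypothetical helper might look like the following. This is
not existing mballoc code, and it assumes the erase block size is a
power-of-two multiple of the filesystem block size and has been
discovered or configured somehow.)

#include <linux/kernel.h>
#include <linux/types.h>

/* Hypothetical helper, for illustration only: round a candidate
 * physical block up to the next erase-block boundary. */
static inline u64 erase_block_align(u64 pblk, u32 fs_blocks_per_erase_block)
{
	return round_up(pblk, fs_blocks_per_erase_block);
}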
So it's not _insane_, but it is a huge amount of work, it would be very
tricky, and it's not something that I would recommend, say, to a student
looking for a term project. It would also not be faster on SSDs or
HDDs. The only reason to do something like this would be to deal with
the extremely low-cost FTL of cheap eMMC flash devices (where the BOM
cost of eMMC is approximately two orders of magnitude cheaper than that
of SSDs). So if you are benchmarking this on an HDD or SSD, don't be
surprised if it's much slower. And if you are benchmarking on eMMC, you
have to make sure the writes are appropriately erase-block aligned, or
any hope of a performance gain is lost.
Regards,
- Ted
Hi,
On Tue, Apr 9, 2013 at 8:16 PM, Dmitry Monakhov <[email protected]> wrote:
>
> simply defer the original bio. Something like this:
>
> OUR_MAIN_ENTERING_POINT {
> 	if (bio_data_dir(bio) == WRITE) {
> 		if (cow_required(bio)) {
> 			/* save the old data first; the original bio is
> 			 * resubmitted later, from the worker */
> 			cow_bio = create_cow_copy(bio);
> 			submit_bio(READ, cow_bio);
> 			return;
> 		}
> 	}
> 	/* COW is not required, pass the bio through unchanged */
> 	generic_make_request(bio);
> }
> This approach gives us reasonable performance, about 3 times slower
> than raw disk throughput.
> For a reference implementation you may look at drivers/md/dm-snap or at
> the Acronis snapapi module (AFAIR it is open source).
Thanks. That is what I was looking for. I got the reference code from
the snapapi module, which is open source.
It is not specific to any particular filesystem.
Regards.
Hi,
On Thu, Apr 25, 2013 at 6:30 PM, Prashant Shah <[email protected]> wrote:
> Hi,
>
> On Tue, Apr 9, 2013 at 8:16 PM, Dmitry Monakhov <[email protected]> wrote:
>>
>> simply defer the original bio. Something like this:
>>
>> OUR_MAIN_ENTERING_POINT {
>> 	if (bio_data_dir(bio) == WRITE) {
>> 		if (cow_required(bio)) {
>> 			/* save the old data first; the original bio is
>> 			 * resubmitted later, from the worker */
>> 			cow_bio = create_cow_copy(bio);
>> 			submit_bio(READ, cow_bio);
>> 			return;
>> 		}
>> 	}
>> 	/* COW is not required, pass the bio through unchanged */
>> 	generic_make_request(bio);
>> }
>
>> This approach gives us reasonable performance, about 3 times slower
>> than raw disk throughput.
>> For a reference implementation you may look at drivers/md/dm-snap or at
>> the Acronis snapapi module (AFAIR it is open source).
Is this scenario possible?
Suppose a write bio (bio1) for a particular sector is under COW, waiting
for the read of the original block to complete, and at the same time
another write bio (bio2) arrives for the same sector. The original order
is bio1 then bio2, but since bio1 is delayed by the COW, the order that
reaches the queue becomes bio2 followed by bio1. This would leave the
final on-disk data as that of bio1 instead of bio2.
Regards.