From: Cédric Villemain
Date: Wed, 24 Nov 2010 00:35:44 +0100
Subject: Re: Bcache version 9
To: Kent Overstreet
Cc: linux-bcache@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
In-Reply-To: <20101121140808.GA6429@moria>

First, I am really happy to see this project appearing here.

2010/11/21 Kent Overstreet:
> Bcache is a patch to use SSDs to transparently cache arbitrary block
> devices. Its main claim to fame is that it's designed for the
> performance characteristics of SSDs - it avoids random writes and
> extraneous IO at all costs, instead allocating buckets sized to your
> erase blocks and filling them up sequentially. It uses a hybrid
> btree/log, instead of a hash table as some other caches.

Is that its main difference from flashcache?
https://github.com/facebook/flashcache/blob/master/doc/flashcache-doc.txt

>
> It does both writethrough and writeback caching - it can use most of
> your SSD for buffering random writes, which are then flushed
> sequentially to the backing device. Skips sequential IO, too.
>
> Current status:
> Recovering from unclean shutdown has been the main focus, and is now
> working magnificently - I'm having no luck breaking it. This version
> looks to be plenty safe enough for beta testing (still, make backups).
>
> Proper discard support is in and enabled by default; bcache won't ever
> write to the same location twice without issuing a discard to that
> bucket.

Is this related to the possible torn page issue outlined by the flashcache
developers?

> On my test box with a Corsair Nova, I'm seeing around a 30% hit
> in mysql performance with it on - there might be a bit of room for
> improvement, but I'm also curious if other drives do better. Even with
> that hit it's well worth it though, the performance degradation over
> time on this drive without TRIM is massive.
>
> The sysfs stuff has all been moved around and should be a little more
> standard now; the few files that aren't specific to a device
> (register_cache, register_dev) could use a better location - any
> suggestions?
>
> The btree cache has been rewritten and simplified, should exhibit less
> memory pressure than the old code.
>
> The initial implementation of incremental garbage collection is done -
> this version doesn't yet normally gc incrementally, as it was needed to
> handle allocation failure without deadlocking while ordering writes
> correctly. But finishing it is only a bit more work and will give much
> better worst case latency and slightly better cache utilization.
>
> Bcache is available from
> git://evilpiepirate.org/~kent/linux-bcache.git
> git://evilpiepirate.org/~kent/bcache-tools.git
>
> And the (somewhat outdated) wiki is
> http://bcache.evilpiepirate.org
>
> diff --git a/Documentation/bcache.txt b/Documentation/bcache.txt
> new file mode 100644
> index 0000000..fc0ebac
> --- /dev/null
> +++ b/Documentation/bcache.txt
> @@ -0,0 +1,170 @@
> +Say you've got a big slow raid 6, and an X-25E or three. Wouldn't it be
> +nice if you could use them as cache... Hence bcache.
> +
> +Userspace tools and a wiki are at:
> +  git://evilpiepirate.org/~kent/bcache-tools.git
> +  http://bcache.evilpiepirate.org
> +
> +It's designed around the performance characteristics of SSDs - it only allocates
> +in erase block sized buckets, and it uses a hybrid btree/log to track cached
> +extents (which can be anywhere from a single sector to the bucket size). It's
> +designed to avoid random writes at all costs; it fills up an erase block
> +sequentially, then issues a discard before reusing it.
> +
> +Caching can be transparently enabled and disabled on arbitrary block devices
> +while they're in use. A cache stores the UUIDs of the devices it is caching,
> +allowing caches to safely persist across reboots. There's currently a hard
> +limit of 256 backing devices per cache.
> +
> +Both writethrough and writeback caching are supported. Writeback defaults to
> +off, but can be switched on and off arbitrarily at runtime. Bcache goes to
> +great lengths to order all writes to the cache so that the cache is always in a
> +consistent state on disk, and it never returns writes as completed until all
> +necessary data and metadata writes are completed. It's designed to safely
> +tolerate unclean shutdown without loss of data.
> +
> +Writeback caching can use most of the cache for buffering writes - writing
> +dirty data to the backing device is always done sequentially, scanning from the
> +start to the end of the index.
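
For the runtime writeback switch mentioned above, a minimal sketch of how it
could be wrapped. The per-device bcache/writeback file and "echo 1" to enable
it come from the quoted documentation; writing 0 to disable, the function name
and the SYSFS_BLOCK override are assumptions made only for illustration.

```shell
#!/bin/sh
# Sketch only, not part of bcache or bcache-tools.
# SYSFS_BLOCK is a hypothetical override so the function can be dry-run
# against a scratch directory instead of the real /sys/block.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# Usage: bcache_set_writeback <backing device name> on|off
bcache_set_writeback() {
    case "$2" in
        on)  value=1 ;;   # documented: echo 1 enables writeback
        off) value=0 ;;   # assumption: 0 disables it (the file is a boolean)
        *)   echo "usage: bcache_set_writeback <dev> on|off" >&2
             return 2 ;;
    esac
    printf '%s\n' "$value" > "$SYSFS_BLOCK/$1/bcache/writeback"
}
```

With the real sysfs tree this would be called as "bcache_set_writeback md1 on"
once md1 has been registered with a cache.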
> +
> +Since random IO is what SSDs excel at, there generally won't be much benefit
> +to caching large sequential IO. Bcache detects sequential IO and skips it;
> +it also keeps a rolling average of the IO sizes per task, and as long as the
> +average is above the cutoff it will skip all IO from that task - instead of
> +caching the first 512k after every seek. Backups and large file copies should
> +thus entirely bypass the cache.
> +
> +If an IO error occurs or an inconsistency is detected, caching is
> +automatically disabled; if dirty data was present in the cache it first
> +disables writeback caching and waits for all dirty data to be flushed.
> +
> +All configuration is done via sysfs. To use sde to cache md1, assuming the
> +SSD's erase block size is 128k:
> +
> +  make-bcache -b128k /dev/sde
> +  echo "/dev/sde" > /sys/kernel/bcache/register_cache
> +  echo "<UUID> /dev/md1" > /sys/kernel/bcache/register_dev
> +
> +More suitable for scripting might be
> +  echo "`blkid /dev/md1 -s UUID -o value` /dev/md1" \
> +        > /sys/kernel/bcache/register_dev
> +
> +Then, to enable writeback:
> +
> +  echo 1 > /sys/block/md1/bcache/writeback
> +
> +Other sysfs files for the backing device:
> +
> +  bypassed
> +    Sum of all IO, reads and writes, that have bypassed the cache
> +
> +  cache_hits
> +  cache_misses
> +  cache_hit_ratio
> +    Hits and misses are counted per individual IO as bcache sees them; a
> +    partial hit is counted as a miss.
> +
> +  clear_stats
> +    Writing to this file resets all the statistics
> +
> +  flush_delay_ms
> +  flush_delay_ms_sync
> +    Optional delay for btree writes to allow for more coalescing of updates to
> +    the index. Defaults to 10 ms for normal writes and 0 for sync writes.
> +
> +  sequential_cutoff
> +    A sequential IO will bypass the cache once it passes this threshold; the
> +    most recent 128 IOs are tracked so sequential IO can be detected even when
> +    it isn't all done at once.
> +
> +  unregister
> +    Writing to this file disables caching on that device
> +
> +  writeback
> +    Boolean, if off only writethrough caching is done
> +
> +  writeback_delay
> +    When dirty data is written to the cache and it previously did not contain
> +    any, waits some number of seconds before initiating writeback. Defaults to
> +    30.
> +
> +  writeback_percent
> +    To allow for more buffering of random writes, writeback only proceeds when
> +    more than this percentage of the cache is unavailable. Defaults to 0.
> +
> +  writeback_running
> +    If off, writeback of dirty data will not take place at all. Dirty data will
> +    still be added to the cache until it is mostly full; only meant for
> +    benchmarking. Defaults to on.
> +
> +For the cache:
> +  btree_avg_keys_written
> +    Average number of keys per write to the btree when a node wasn't being
> +    rewritten - indicates how much coalescing is taking place.
> +
> +  btree_cache_size
> +    Number of btree buckets currently cached in memory
> +
> +  btree_written
> +    Sum of all btree writes, in (kilo/mega/giga) bytes
> +
> +  clear_stats
> +    Clears the statistics associated with this cache
> +
> +  discard
> +    Boolean; if on a discard/TRIM will be issued to each bucket before it is
> +    reused. Defaults to on if supported.
> +
> +  heap_size
> +    Number of buckets that are available for reuse (aren't used by the btree or
> +    dirty data)
> +
> +  nbuckets
> +    Total buckets in this cache
> +
> +  synchronous
> +    Boolean; when on all writes to the cache are strictly ordered such that it
> +    can recover from unclean shutdown. If off it will not generally wait for
> +    writes to complete, but the entire cache contents will be invalidated on
> +    unclean shutdown. Not recommended that it be turned off when writeback is
> +    on.
> +
> +  unregister
> +    Closes the cache device and all devices being cached; if dirty data is
> +    present it will disable writeback caching and wait for it to be flushed.
> +
> +  written
> +    Sum of all data that has been written to the cache; comparison with
> +    btree_written gives the amount of write inflation in bcache.
> +
> +To script the UUID lookup, you could do something like:
> +  echo "`blkid /dev/md1 -s UUID -o value` /dev/md1"\
> +        > /sys/kernel/bcache/register_dev
> +
> +Caveats:
> +
> +Bcache appears to be quite stable and reliable at this point, but there are a
> +number of potential issues.
> +
> +The ordering requirement of barriers is silently ignored; for ext4 (and
> +possibly other filesystems) you must explicitly mount with -o nobarrier or you
> +risk severe filesystem corruption in the event of unclean shutdown.
> +
> +A change to the generic block layer for ad hoc bio splitting can potentially
> +break other things; if a bio is used without calling bio_init() or bio_endio()
> +is called more than once, the kernel will BUG(). Ext4, raid1, raid10 and lvm
> +work fine for me; raid5/6 and I'm told btrfs are not.
> +
> +
> +Caching partitions doesn't do anything (though using them as caches works just
> +fine). Using the whole device instead works.
> +
> +Nothing is done to prevent the use of a backing device without the cache it has
> +been used with, when the cache contains dirty data; if you do, terrible things
> +will happen.
> +
> +Furthermore, if the cache didn't have any dirty data and you mount the backing
> +device without the cache, you've now made the cache contents stale and they
> +need to be manually invalidated. For now the only way to do that is rerun
> +make-bcache. The solution to both issues will be the introduction of a bcache
> +specific container format for the backing device, which will come at some point
> +in the future along with thin provisioning and volume management.
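
The registration steps from the quoted documentation could be wrapped roughly
as follows. This is only a sketch: the register_cache and register_dev files
and the "UUID device" line format are taken from the documentation above,
while the function names and the BCACHE_SYSFS override are hypothetical and
exist only so the script can be dry-run against a scratch directory.

```shell
#!/bin/sh
# Sketch only, not part of bcache-tools.
# BCACHE_SYSFS is a hypothetical override of the documented sysfs location.
BCACHE_SYSFS="${BCACHE_SYSFS:-/sys/kernel/bcache}"

# Print the line register_dev expects: the backing device's UUID, a space,
# then the backing device path.
bcache_register_line() {
    printf '%s %s\n' "$1" "$2"
}

# Register a cache device, then attach a backing device to it.
#   $1 = cache device   (already formatted with make-bcache)
#   $2 = backing device
#   $3 = UUID of the backing device
bcache_attach() {
    printf '%s\n' "$1" > "$BCACHE_SYSFS/register_cache" || return 1
    bcache_register_line "$3" "$2" > "$BCACHE_SYSFS/register_dev"
}
```

On a real system the call would look like
bcache_attach /dev/sde /dev/md1 "$(blkid /dev/md1 -s UUID -o value)",
run after make-bcache -b128k /dev/sde.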

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL: Expertise, Training and Support