2010-11-21 14:09:45

by Kent Overstreet

Subject: Bcache version 9

Bcache is a patch to use SSDs to transparently cache arbitrary block
devices. Its main claim to fame is that it's designed for the
performance characteristics of SSDs - it avoids random writes and
extraneous IO at all costs, instead allocating buckets sized to your
erase blocks and filling them up sequentially. It uses a hybrid
btree/log, instead of a hash table as some other caches do.

It does both writethrough and writeback caching - it can use most of
your SSD for buffering random writes, which are then flushed
sequentially to the backing device. Skips sequential IO, too.

Current status:
Recovering from unclean shutdown has been the main focus, and is now
working magnificently - I'm having no luck breaking it. This version
looks to be plenty safe enough for beta testing (still, make backups).

Proper discard support is in and enabled by default; bcache won't ever
write to the same location twice without issuing a discard to that
bucket. On my test box with a Corsair Nova, I'm seeing around a 30% hit
in mysql performance with it on - there might be a bit of room for
improvement, but I'm also curious if other drives do better. Even with
that hit it's well worth it though, the performance degradation over
time on this drive without TRIM is massive.
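
To illustrate the allocation pattern described above, here is a toy
Python model. It is only a sketch and not bcache's actual code: the names
Bucket, BucketAllocator and the discard callback are hypothetical, and it
assumes one bucket equals one erase block. Writes only ever append to the
open bucket, and a bucket that has held data is discarded before reuse.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Bucket:
    index: int
    used: int = 0                # next free sector offset; writes only append
    needs_discard: bool = False  # True once the bucket has held data


class BucketAllocator:
    """Toy model: append-only buckets, discard before any reuse."""

    def __init__(self, nbuckets: int, bucket_size: int,
                 discard: Callable[[int], None]) -> None:
        self.bucket_size = bucket_size      # in sectors; assumed erase block size
        self.buckets: List[Bucket] = [Bucket(i) for i in range(nbuckets)]
        self.free: List[int] = list(range(nbuckets))
        self.discard = discard              # hypothetical stand-in for issuing TRIM
        self.current: Optional[Bucket] = None

    def _open_bucket(self) -> Bucket:
        if not self.free:
            raise RuntimeError("no free buckets (a real cache would gc here)")
        bucket = self.buckets[self.free.pop(0)]
        if bucket.needs_discard:
            self.discard(bucket.index)      # never rewrite without a discard
            bucket.needs_discard = False
        bucket.used = 0
        return bucket

    def write(self, nsectors: int) -> Tuple[int, int]:
        """Append nsectors; return (bucket index, offset) where they landed."""
        if nsectors > self.bucket_size:
            raise ValueError("write larger than a bucket")
        if (self.current is None
                or self.current.used + nsectors > self.bucket_size):
            self.current = self._open_bucket()
        bucket = self.current
        offset = bucket.used
        bucket.used += nsectors
        bucket.needs_discard = True
        return bucket.index, offset

    def release(self, index: int) -> None:
        """Mark a bucket as reusable, e.g. after its data was invalidated."""
        if self.current is not None and self.current.index == index:
            self.current = None
        self.free.append(index)
```

With two 8-sector buckets, writes of 5, 3 and 4 sectors fill bucket 0 and
then move on to bucket 1; once bucket 0 is released, the next write that
needs a fresh bucket triggers exactly one discard of bucket 0.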

The sysfs stuff has all been moved around and should be a little more
standard now; the few files that aren't specific to a device
(register_cache, register_dev) could use a better location - any
suggestions?

The btree cache has been rewritten and simplified, should exhibit less
memory pressure than the old code.

The initial implementation of incremental garbage collection is done. It
was needed to handle allocation failure without deadlocking while still
ordering writes correctly, so this version doesn't yet gc incrementally in
normal operation. Finishing it is only a bit more work, and will give much
better worst case latency and slightly better cache utilization.

Bcache is available from
git://evilpiepirate.org/~kent/linux-bcache.git
git://evilpiepirate.org/~kent/bcache-tools.git

And the (somewhat outdated) wiki is
http://bcache.evilpiepirate.org

diff --git a/Documentation/bcache.txt b/Documentation/bcache.txt
new file mode 100644
index 0000000..fc0ebac
--- /dev/null
+++ b/Documentation/bcache.txt
@@ -0,0 +1,170 @@
+Say you've got a big slow raid 6, and an X-25E or three. Wouldn't it be
+nice if you could use them as cache... Hence bcache.
+
+Userspace tools and a wiki are at:
+ git://evilpiepirate.org/~kent/bcache-tools.git
+ http://bcache.evilpiepirate.org
+
+It's designed around the performance characteristics of SSDs - it only allocates
+in erase block sized buckets, and it uses a hybrid btree/log to track cached
+extents (which can be anywhere from a single sector to the bucket size). It's
+designed to avoid random writes at all costs; it fills up an erase block
+sequentially, then issues a discard before reusing it.
+
+Caching can be transparently enabled and disabled on arbitrary block devices
+while they're in use. A cache stores the UUIDs of the devices it is caching,
+allowing caches to safely persist across reboots. There's currently a hard
+limit of 256 backing devices per cache.
+
+Both writethrough and writeback caching are supported. Writeback defaults to
+off, but can be switched on and off arbitrarily at runtime. Bcache goes to
+great lengths to order all writes to the cache so that the cache is always in a
+consistent state on disk, and it never returns writes as completed until all
+necessary data and metadata writes are completed. It's designed to safely
+tolerate unclean shutdown without loss of data.
+
+Writeback caching can use most of the cache for buffering writes - writing
+dirty data to the backing device is always done sequentially, scanning from the
+start to the end of the index.
+
+Since random IO is what SSDs excel at, there generally won't be much benefit
+to caching large sequential IO. Bcache detects sequential IO and skips it;
+it also keeps a rolling average of the IO sizes per task, and as long as the
+average is above the cutoff it will skip all IO from that task - instead of
+caching the first 512k after every seek. Backups and large file copies should
+thus entirely bypass the cache.
+
+If an IO error occurs or an inconsistency is detected, caching is
+automatically disabled; if dirty data was present in the cache it first
+disables writeback caching and waits for all dirty data to be flushed.
+
+All configuration is done via sysfs. To use sde to cache md1, assuming the
+SSD's erase block size is 128k:
+
+ make-bcache -b128k /dev/sde
+ echo "/dev/sde" > /sys/kernel/bcache/register_cache
+ echo "<UUID> /dev/md1" > /sys/kernel/bcache/register_dev
+
+More suitable for scripting might be
+ echo "`blkid /dev/md1 -s UUID -o value` /dev/md1" \
+ > /sys/kernel/bcache/register_dev
+
+Then, to enable writeback:
+
+ echo 1 > /sys/block/md1/bcache/writeback
+
+Other sysfs files for the backing device:
+
+ bypassed
+ Sum of all IO, reads and writes, that have bypassed the cache
+
+ cache_hits
+ cache_misses
+ cache_hit_ratio
+ Hits and misses are counted per individual IO as bcache sees them; a
+ partial hit is counted as a miss.
+
+ clear_stats
+ Writing to this file resets all the statistics
+
+ flush_delay_ms
+ flush_delay_ms_sync
+ Optional delay for btree writes to allow for more coalescing of updates to
+ the index. Defaults to 10 ms for normal writes and 0 for sync writes.
+
+ sequential_cutoff
+ A sequential IO will bypass the cache once it passes this threshold; the
+ most recent 128 IOs are tracked so sequential IO can be detected even when
+ it isn't all done at once.
+
+ unregister
+ Writing to this file disables caching on that device
+
+ writeback
+ Boolean, if off only writethrough caching is done
+
+ writeback_delay
+ When dirty data is written to the cache and it previously did not contain
+ any, waits some number of seconds before initiating writeback. Defaults to
+ 30.
+
+ writeback_percent
+ To allow for more buffering of random writes, writeback only proceeds when
+ more than this percentage of the cache is unavailable. Defaults to 0.
+
+ writeback_running
+ If off, writeback of dirty data will not take place at all. Dirty data will
+ still be added to the cache until it is mostly full; only meant for
+ benchmarking. Defaults to on.
+
+For the cache:
+ btree_avg_keys_written
+ Average number of keys per write to the btree when a node wasn't being
+ rewritten - indicates how much coalescing is taking place.
+
+ btree_cache_size
+ Number of btree buckets currently cached in memory
+
+ btree_written
+ Sum of all btree writes, in (kilo/mega/giga) bytes
+
+ clear_stats
+ Clears the statistics associated with this cache
+
+ discard
+ Boolean; if on a discard/TRIM will be issued to each bucket before it is
+ reused. Defaults to on if supported.
+
+ heap_size
+ Number of buckets that are available for reuse (aren't used by the btree or
+ dirty data)
+
+ nbuckets
+ Total buckets in this cache
+
+ synchronous
+ Boolean; when on all writes to the cache are strictly ordered such that it
+ can recover from unclean shutdown. If off it will not generally wait for
+ writes to complete, but the entire cache contents will be invalidated on
+ unclean shutdown. Not recommended that it be turned off when writeback is
+ on.
+
+ unregister
+ Closes the cache device and all devices being cached; if dirty data is
+ present it will disable writeback caching and wait for it to be flushed.
+
+ written
+ Sum of all data that has been written to the cache; comparison with
+ btree_written gives the amount of write inflation in bcache.
+
+To script the UUID lookup, you could do something like:
+ echo "`blkid /dev/md1 -s UUID -o value` /dev/md1"\
+ > /sys/kernel/bcache/register_dev
+
+Caveats:
+
+Bcache appears to be quite stable and reliable at this point, but there are a
+number of potential issues.
+
+The ordering requirement of barriers is silently ignored; for ext4 (and
+possibly other filesystems) you must explicitly mount with -o nobarrier or you
+risk severe filesystem corruption in the event of unclean shutdown.
+
+A change to the generic block layer for ad hoc bio splitting can potentially
+break other things; if a bio is used without calling bio_init() or bio_endio()
+is called more than once, the kernel will BUG(). Ext4, raid1, raid10 and lvm
+work fine for me; raid5/6 and, I'm told, btrfs do not.
+
+Caching partitions doesn't do anything (though using them as caches works just
+fine). Using the whole device instead works.
+
+Nothing is done to prevent the use of a backing device without the cache it has
+been used with, when the cache contains dirty data; if you do, terrible things
+will happen.
+
+Furthermore, if the cache didn't have any dirty data and you mount the backing
+device without the cache, you've now made the cache contents stale and they
+need to be manually invalidated. For now the only way to do that is rerun
+make-bcache. The solution to both issues will be the introduction of a bcache
+specific container format for the backing device, which will come at some point
+in the future along with thin provisioning and volume management.
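
The sequential bypass described in the documentation above (the
sequential_cutoff threshold, the 128 most recent IOs, and the per-task
rolling average) can be sketched roughly as follows. This is a toy Python
model under stated assumptions, not the kernel implementation: the class
and method names are hypothetical, the rolling average is assumed to be an
exponential moving average of the current sequential run length, and the
weight of 0.25 is made up for illustration.

```python
from collections import OrderedDict
from typing import Dict


class SequentialDetector:
    """Toy model of sequential-IO bypass; all names are hypothetical."""

    def __init__(self, cutoff_bytes: int, tracked: int = 128,
                 weight: float = 0.25) -> None:
        self.cutoff = cutoff_bytes
        self.tracked = tracked      # how many recent IOs are remembered
        self.weight = weight        # EMA weight; an assumption, not from bcache
        # Maps the end offset of a recent IO to the bytes in its run so far.
        self.streams: "OrderedDict[int, int]" = OrderedDict()
        self.task_avg: Dict[str, float] = {}

    def should_bypass(self, task: str, offset: int, length: int) -> bool:
        # An IO that starts where a tracked IO ended continues that run,
        # even if other IOs were submitted in between.
        run = self.streams.pop(offset, 0) + length
        self.streams[offset + length] = run
        while len(self.streams) > self.tracked:
            self.streams.popitem(last=False)    # forget the oldest IO

        # Per-task rolling average of run lengths.
        prev = self.task_avg.get(task)
        avg = float(run) if prev is None else prev + self.weight * (run - prev)
        self.task_avg[task] = avg

        # Bypass once the run passes the cutoff, or while the task's
        # average is still above it (so a seek doesn't reset the decision).
        return run > self.cutoff or avg > self.cutoff
```

With a 512k cutoff and 128k IOs, the first four IOs of a stream are cached
and the fifth onward bypass the cache. A task that has been streaming for a
while keeps bypassing right after a seek, because its average is still
above the cutoff, while an unrelated task doing small random IO is cached.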


2010-11-22 01:12:33

by Greg KH

Subject: Re: Bcache version 9

On Sun, Nov 21, 2010 at 06:09:34AM -0800, Kent Overstreet wrote:
> +++ b/Documentation/bcache.txt

For new sysfs files, please create Documentation/ABI files.

> +All configuration is done via sysfs. To use sde to cache md1, assuming the
> +SSD's erase block size is 128k:
> +
> + make-bcache -b128k /dev/sde
> + echo "/dev/sde" > /sys/kernel/bcache/register_cache
> + echo "<UUID> /dev/md1" > /sys/kernel/bcache/register_dev

/sys/kernel/bcache/? Really?

Come on, shouldn't this be somewhere else? You only have 1 file here,
right?

Shouldn't it be a configfs file instead as that is what you are doing?

thanks,

greg k-h

2010-11-23 08:07:45

by Kent Overstreet

Subject: Re: Bcache version 9

On 11/21/2010 05:09 PM, Greg KH wrote:
> On Sun, Nov 21, 2010 at 06:09:34AM -0800, Kent Overstreet wrote:
>> +++ b/Documentation/bcache.txt
>
> For new sysfs files, please create Documentation/ABI files.
>
>> +All configuration is done via sysfs. To use sde to cache md1, assuming the
>> +SSD's erase block size is 128k:
>> +
>> + make-bcache -b128k /dev/sde
>> + echo "/dev/sde"> /sys/kernel/bcache/register_cache
>> + echo "<UUID> /dev/md1"> /sys/kernel/bcache/register_dev
>
> /sys/kernel/bcache/? Really?

That was a completely arbitrary choice dating from when I first started
hacking on it. No point in moving it when it might be moved again :p

> Come on, shouldn't this be somewhere else? You only have 1 file here,
> right?

Two files (really three, but the third is for gimpy latency tracing and
will die eventually). register_dev is there so on bootup you don't have
to wait for the cache to be discovered - when you add a cache device if
there's a backing device waiting for a cache, and the cache has seen
that UUID before it'll do what you want.
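
As a rough illustration of that matching, here is a toy Python model. It
is a sketch only: the Registry class and its method signatures are
hypothetical and merely mirror the two sysfs file names, and attaching a
backing device whose UUID no cache has seen before is left out.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


class Registry:
    """Toy model of register_dev / register_cache matching by UUID."""

    def __init__(self) -> None:
        self.caches: Dict[str, Set[str]] = {}       # cache dev -> UUIDs it has seen
        self.pending: Dict[str, str] = {}           # UUID -> waiting backing dev
        self.attached: List[Tuple[str, str]] = []   # (backing dev, cache dev)

    def register_dev(self, uuid: str, dev: str) -> Optional[str]:
        """Attach dev to a cache that knows its UUID, else leave it waiting."""
        for cache, known in self.caches.items():
            if uuid in known:
                self.attached.append((dev, cache))
                return cache
        self.pending[uuid] = dev
        return None

    def register_cache(self, cache: str, known_uuids: Iterable[str]) -> List[str]:
        """Add a cache and attach any waiting backing devices it has seen."""
        known = set(known_uuids)
        self.caches[cache] = known
        matched = []
        for uuid in list(self.pending):
            if uuid in known:
                dev = self.pending.pop(uuid)
                self.attached.append((dev, cache))
                matched.append(dev)
        return matched
```

So registering /dev/md1 first just parks it; when /dev/sde shows up later
and already knows md1's UUID, the two are paired without any further
action from userspace.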

> Shouldn't it be a configfs file instead as that is what you are doing?

That was one of the possibilities I had in mind. My main issue with that
though is I don't see any way to just have a bare config_item - per the
documentation, the user must do a mkdir() first, which just doesn't make
any sense for bcache. There's no point in having a persistent object
besides the one associated with the block device. Maybe there would be
in the future, with multiple cache devices, but I still think it's a
lousy interface for that problem - what bcache wants is something more
like a syscall; you wouldn't use configfs to replace mount(), for example.

There do exist global interfaces in sysfs, not attached to any device -
besides /sys/kernel, there's /sys/fs which doesn't have any rhyme or
reason to it I can discern. ecryptfs has /sys/fs/ecryptfs/version,
ext4 has per device stuff that you can't find from the device's dir (you
wouldn't know /sys/fs/ext4/md0 exists from looking at /sys/block/md0).
There's also /sys/fs/cgroup, which is another unique thing as far as I
can tell...

Then there's /sys/module which has a bunch of ad hoc stuff, but as far
as I can tell that's all still module parameters and register_cache and
register_dev certainly aren't module parameters.

So anyways, I absolutely agree that there are better solutions than
/sys/kernel/bcache but I want to replace it with something correct, not
something that sucks less. Ideas/flames are of course more than welcome :)

2010-11-23 23:36:11

by Cédric Villemain

Subject: Re: Bcache version 9

First, I am really happy to see this project appearing here.

2010/11/21 Kent Overstreet:
> Bcache is a patch to use SSDs to transparently cache arbitrary block
> devices. Its main claim to fame is that it's designed for the
> performance characteristics of SSDs - it avoids random writes and
> extraneous IO at all costs, instead allocating buckets sized to your
> erase blocks and filling them up sequentially. It uses a hybrid
> btree/log, instead of a hash table as some other caches do.

Is that its main difference from flashcache?
https://github.com/facebook/flashcache/blob/master/doc/flashcache-doc.txt

>
> It does both writethrough and writeback caching - it can use most of
> your SSD for buffering random writes, which are then flushed
> sequentially to the backing device. Skips sequential IO, too.
>
> Current status:
> Recovering from unclean shutdown has been the main focus, and is now
> working magnificently - I'm having no luck breaking it. This version
> looks to be plenty safe enough for beta testing (still, make backups).
>
> Proper discard support is in and enabled by default; bcache won't ever
> write to the same location twice without issuing a discard to that
> bucket.

Is that related to the possible torn page issue outlined by the flashcache developers?

> On my test box with a Corsair Nova, I'm seeing around a 30% hit
> in mysql performance with it on - there might be a bit of room for
> improvement, but I'm also curious if other drives do better. Even with
> that hit it's well worth it though, the performance degradation over
> time on this drive without TRIM is massive.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL: Expertise, Training and Support

2010-11-24 06:25:27

by Kent Overstreet

Subject: Re: Bcache version 9

On 11/23/2010 03:35 PM, Cédric Villemain wrote:
> First, I am really happy to see this project appearing here.
>
> 2010/11/21 Kent Overstreet:
>> Bcache is a patch to use SSDs to transparently cache arbitrary block
>> devices. Its main claim to fame is that it's designed for the
>> performance characteristics of SSDs - it avoids random writes and
>> extraneous IO at all costs, instead allocating buckets sized to your
>> erase blocks and filling them up sequentially. It uses a hybrid
>> btree/log, instead of a hash table as some other caches do.
>
> Is that its main difference from flashcache?
> https://github.com/facebook/flashcache/blob/master/doc/flashcache-doc.txt

Yeah. It's a more complex approach, but it's capable of significantly
higher performance. Performance has regressed some lately (I've been
concentrating on other things and don't really have the hardware for
performance work), but a month or so ago it was benchmarking around 50%
higher than flashcache, with mysql on an X25-E.

>
>>
>> It does both writethrough and writeback caching - it can use most of
>> your SSD for buffering random writes, which are then flushed
>> sequentially to the backing device. Skips sequential IO, too.
>>
>> Current status:
>> Recovering from unclean shutdown has been the main focus, and is now
>> working magnificently - I'm having no luck breaking it. This version
>> looks to be plenty safe enough for beta testing (still, make backups).
>>
>> Proper discard support is in and enabled by default; bcache won't ever
>> write to the same location twice without issuing a discard to that
>> bucket.
>
> Is that related to the possible torn page issue outlined by the flashcache developers?

Kind of. Bcache isn't subject to that issue, but that's because bcache
is cow and always strictly orders writes.
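
To make that concrete, here is a toy Python model of why copy-on-write
plus strict ordering sidesteps torn writes. It is a sketch under stated
assumptions, not bcache's code: CowCache, SimulatedCrash and the
crash_after_data flag are hypothetical, and an in-memory list stands in for
the SSD. New data always lands in a fresh location, and the index is only
updated after the data write has completed.

```python
from typing import Dict, List, Optional


class SimulatedCrash(Exception):
    """Raised to model an unclean shutdown between the two write steps."""


class CowCache:
    """Toy copy-on-write store: data first, index update second."""

    def __init__(self) -> None:
        self.log: List[bytes] = []          # append-only data area
        self.index: Dict[int, int] = {}     # sector -> position in the log

    def write(self, sector: int, data: bytes,
              crash_after_data: bool = False) -> None:
        # Step 1: the data goes to a fresh location; nothing is overwritten.
        self.log.append(data)
        if crash_after_data:
            raise SimulatedCrash()
        # Step 2: publish the new location only after step 1 has finished.
        self.index[sector] = len(self.log) - 1

    def read(self, sector: int) -> Optional[bytes]:
        pos = self.index.get(sector)
        return None if pos is None else self.log[pos]
```

If the crash happens after step 1, the index still points at the old,
complete copy, so a reader never sees a mix of old and new data; the
unreferenced new copy is simply wasted space until its bucket is reused.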

2010-12-01 04:15:22

by Greg KH

Subject: Re: Bcache version 9

On Tue, Nov 23, 2010 at 12:07:36AM -0800, Kent Overstreet wrote:
> On 11/21/2010 05:09 PM, Greg KH wrote:
> >On Sun, Nov 21, 2010 at 06:09:34AM -0800, Kent Overstreet wrote:
> >>+++ b/Documentation/bcache.txt
> >
> >For new sysfs files, please create Documentation/ABI files.
> >
> >>+All configuration is done via sysfs. To use sde to cache md1, assuming the
> >>+SSD's erase block size is 128k:
> >>+
> >>+ make-bcache -b128k /dev/sde
> >>+ echo "/dev/sde"> /sys/kernel/bcache/register_cache
> >>+ echo "<UUID> /dev/md1"> /sys/kernel/bcache/register_dev
> >
> >/sys/kernel/bcache/? Really?
>
> That was a completely arbitrary choice dating from when I first
> started hacking on it. No point in moving it when it might be moved
> again :p

Heh.

> >Come on, shouldn't this be somewhere else? You only have 1 file here,
> >right?
>
> Two files (really three, but the third is for gimpy latency tracing
> and will die eventually). register_dev is there so on bootup you
> don't have to wait for the cache to be discovered - when you add a
> cache device if there's a backing device waiting for a cache, and
> the cache has seen that UUID before it'll do what you want.
>
> >Shouldn't it be a configfs file instead as that is what you are doing?
>
> That was one of the possibilities I had in mind. My main issue with
> that though is I don't see any way to just have a bare config_item -
> per the documentation, the user must do a mkdir() first, which just
> doesn't make any sense for bcache. There's no point in having a
> persistent object besides the one associated with the block device.
> Maybe there would be in the future, with multiple cache devices, but
> I still think it's a lousy interface for that problem - what bcache
> wants is something more like a syscall; you wouldn't use configfs to
> replace mount(), for example.

True, but I thought configfs could handle "bare" config items, you might
want to look a bit closer as to how people are using it. But I could be
totally wrong however.

> There do exist global interfaces in sysfs, not attached to any
> device - besides /sys/kernel, there's /sys/fs which doesn't have any
> rhyme or reason to it I can discern.

/sys/fs is for different filesystem specific things.

> ecryptfs has
> /sys/fs/ecryptfs/version, ext4 has per device stuff that you can't
> find from the device's dir (you wouldn't know /sys/fs/ext4/md0
> exists from looking at /sys/block/md0). There's also /sys/fs/cgroup,
> which is another unique thing as far as I can tell...

No, sys/fs/cgroup/ is where the cgroup filesystem is mounted.

> Then there's /sys/module which has a bunch of ad hoc stuff, but as
> far as I can tell that's all still module parameters and
> register_cache and register_dev certainly aren't module parameters.

It's not ad hoc, it's module specific things.

> So anyways, I absolutely agree that there are better solutions than
> /sys/kernel/bcache but I want to replace it with something correct,
> not something that sucks less. Ideas/flames are of course more than
> welcome :)

What is "bcache"? Is it related to filesystems? If so, use
/sys/fs/bcache and I have no issues with it. But don't put it in
/sys/kernel/ without at least asking.

thanks,

greg k-h

2010-12-04 03:44:38

by Kent Overstreet

Subject: Re: Bcache version 9




On 11/30/2010 08:16 PM, Greg KH wrote:
> True, but I thought configfs could handle "bare" config items, you might
> want to look a bit closer as to how people are using it. But I could be
> totally wrong however.

The documentation is pretty specific and I haven't seen any
counterexamples, but I'll see what I can find.

>> There do exist global interfaces in sysfs, not attached to any
>> device - besides /sys/kernel, there's /sys/fs which doesn't have any
>> rhyme or reason to it I can discern.
>
> /sys/fs is for different filesystem specific things.
>
>> ecryptfs has
>> /sys/fs/ecryptfs/version, ext4 has per device stuff that you can't
>> find from the device's dir (you wouldn't know /sys/fs/ext4/md0
>> exists from looking at /sys/block/md0). There's also /sys/fs/cgroup,
>> which is another unique thing as far as I can tell...
>
> No, sys/fs/cgroup/ is where the cgroup filesystem is mounted.

Yes, but as far as how the namespace is used it's exactly the same. By
that logic, I could stick anything in /sys/fs if I made a filesystem for
it to mount there. cgroupfs is just an interface, users wouldn't care if
the same interface was written against sysfs (except for mounting
multiple instances, but that's still not an argument for putting a
mountpoint in /sys/fs).

>> Then there's /sys/module which has a bunch of ad hoc stuff, but as
>> far as I can tell that's all still module parameters and
>> register_cache and register_dev certainly aren't module parameters.
>
> It's not ad hoc, it's module specific things.

Exactly :p Bcache lives in a module, as does most code. There's no
pattern to it besides that, is all I was saying.

> What is "bcache"? Is it related to filesystems?

It uses SSDs to cache block devices; you'd cache say /dev/md0 with
/dev/sdb, reads and writes get added to the cache and writes get
buffered in the cache if writeback caching is on.

> If so, use
> /sys/fs/bcache and I have no issues with it. But don't put it in
> /sys/kernel/ without at least asking.

You could say it's related to filesystems, but it's an awful stretch
since it lives entirely at the block layer.

It's on the list of things that need fixing before merging, but that's a
solid list. Priority #1 has been making it rock solid, which appears to
be done... I've still got to finish handling all the potential memory
allocation failures correctly and do something about the hooks in the
block layer, which is a much bigger problem. I prefer my hacks to be
obvious, ugly hacks :)

2010-12-04 05:41:24

by John Drescher

Subject: Re: Bcache version 9

>> Is that its main difference from flashcache?
>> https://github.com/facebook/flashcache/blob/master/doc/flashcache-doc.txt
>
> Yeah. It's a more complex approach, but it's capable of significantly higher
> performance. Performance has regressed some lately (I've been concentrating
> on other things and don't really have the hardware for performance work),
> but a month or so ago it was benchmarking around 50% higher than flashcache,
> with mysql on an X25-E.

BTW, thanks for releasing this.
I am just in the middle of evaluating flashcache to speed up
slow IO in kvm clients when storing the VMs on hard disks. I will try
your patch when I get a chance.

John

2010-12-16 11:21:55

by Kent Overstreet

Subject: Re: Bcache version 9

On 11/30/2010 08:16 PM, Greg KH wrote:
> True, but I thought configfs could handle "bare" config items, you might
> want to look a bit closer as to how people are using it. But I could be
> totally wrong however.

Just in case I wasn't the only one confused, I stared at the
documentation and examples some more... still wasn't completely sure, so
I wrote the code, and bare attributes without a configfs item do work. I
just switched to configfs for the stuff that was in /sys/kernel/bcache,
it'll be in the next version.

Almost got everything done on my todo list before it's ready to be
submitted...