2017-07-24 18:57:22

by Pavel Machek

Subject: bcache with existing ext4 filesystem

Hi!

Would it be feasible to run bcache (write-through) with existing ext4
filesystem?

I have 400GB of data I'd rather not move, and SSD I could use for
caching. Ok, SSD is connected over USB2, but I guess it is still way
faster than seeking harddrive on random access... I have kernels on
that partition, so it would be nice if grub2 could still read it, and
it would be good if I could go back to old kernel.

IIRC ext* filesystems have first 1024 bytes reserved for the
bootloader. Unfortunately, cache_sb is bigger than that, and it is
normally at offset 4K in the disk.

Is cache_sb.d[] being used for backing devices? Could I just make
SB_JOURNAL_BUCKETS smaller?

Remaining problem is how to invalidate the cache when someone mounts
the filesystem without bcache; but I believe that should be possible
to check using "last mount time" field in ext4 superblock.

bcache would save "last mount time" during shutdown, and would just
consider the cache stale if someone mounted it in between....
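
The "last mount time" check described above could be sketched roughly as
follows. This is only an illustration, not bcache code: it assumes the
standard ext2/3/4 on-disk layout (superblock 1024 bytes into the device,
s_mtime at byte 0x2C, s_magic 0xEF53 at byte 0x38, both little-endian),
and the function names and the staleness policy are hypothetical. It
only covers changes that update s_mtime; a tool that writes to the
device without mounting it would need something more.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offsets inside the 1024-byte ext2/3/4 superblock (little-endian fields). */
#define EXT4_SB_MTIME_OFF 0x2C   /* s_mtime: last mount time */
#define EXT4_SB_MAGIC_OFF 0x38   /* s_magic */
#define EXT4_SUPER_MAGIC  0xEF53

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * sb points at the superblock as read from byte 1024 of the device.
 * Returns 0 and stores s_mtime on success, -1 if this is not an ext superblock.
 */
int ext4_last_mount_time(const uint8_t *sb, size_t len, uint32_t *mtime)
{
	if (len < 1024)
		return -1;
	if ((sb[EXT4_SB_MAGIC_OFF] | (sb[EXT4_SB_MAGIC_OFF + 1] << 8)) != EXT4_SUPER_MAGIC)
		return -1;
	*mtime = get_le32(sb + EXT4_SB_MTIME_OFF);
	return 0;
}

/* Hypothetical policy: the cache is usable only if the saved value still matches. */
int cache_is_stale(uint32_t saved_mtime, uint32_t current_mtime)
{
	return saved_mtime != current_mtime;
}
```

The caching layer would record the value at clean shutdown and compare it
at the next attach, dropping the cache contents on a mismatch.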

Does the plan look reasonable?

Thanks,

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-07-24 19:17:27

by Pavel Machek

Subject: Re: bcache with existing ext4 filesystem

On Mon 2017-07-24 21:08:16, Reindl Harald wrote:
>
>
> On 24.07.2017 at 20:57, Pavel Machek wrote:
> >Would it be feasible to run bcache (write-through) with existing ext4
> >filesystem?
> >
> >I have 400GB of data I'd rather not move, and SSD I could use for
> >caching. Ok, SSD is connected over USB2, but I guess it is still way
> >faster than seeking harddrive on random access
>
> i doubt that seriously - USB2 has a terrible latency

Well.. if that's too slow, I can get SSD M.2; plus bcache docs say
that combination works.

And... if you ever tried to do git diff while git checkout is running
on spinning rust... spinning rust has awful parameters when idle, and
it only gets worse when loaded :-(.

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-07-24 19:27:30

by Theodore Ts'o

Subject: Re: bcache with existing ext4 filesystem

On Mon, Jul 24, 2017 at 09:15:48PM +0200, Pavel Machek wrote:
> >
> > On 24.07.2017 at 20:57, Pavel Machek wrote:
> > >Would it be feasible to run bcache (write-through) with existing ext4
> > >filesystem?
> > >
> > >I have 400GB of data I'd rather not move, and SSD I could use for
> > >caching. Ok, SSD is connected over USB2, but I guess it is still way
> > >faster than seeking harddrive on random access
> >
> > i doubt that seriously - USB2 has a terrible latency
>
> Well.. if that's too slow, I can get SSD M.2; plus bcache docs say
> that combination works.
>
> And... if you ever tried to do git diff while git checkout is running
> on spinning rust... spinning rust has awful parameters when idle, and
> it only gets worse when loaded :-(.

So some hard numbers. Max throughput of USB 2.0 is 53 MiB/s[1]. In
actual practice the max throughput you will see out of the USB 2.0
interface is 30-40 MiB/s. In contrast, a HDD doing sequential reads
can easily do much more than that.

[1] https://superuser.com/questions/317217/whats-the-maximum-typical-speed-possible-with-a-usb2-0-drive

So a lot is going to depend on how bcache works. If you can get large
sequential reads and writes to *bypass* the cache device, then I think
there's a good chance that bcache on a USB 2.0 device won't hurt. It
might not help as much as you like, but that's a function of the
overhead of populating the cache and whether the cache can keep the
useful bits in the cache device.

Cheers,

- Ted

2017-07-24 19:27:43

by Reindl Harald

Subject: Re: bcache with existing ext4 filesystem



On 24.07.2017 at 20:57, Pavel Machek wrote:
> Would it be feasible to run bcache (write-through) with existing ext4
> filesystem?
>
> I have 400GB of data I'd rather not move, and SSD I could use for
> caching. Ok, SSD is connected over USB2, but I guess it is still way
> faster than seeking harddrive on random access

i doubt that seriously - USB2 has a terrible latency

2017-07-24 20:07:17

by Pavel Machek

Subject: Re: bcache with existing ext4 filesystem

Hi!

On Mon 2017-07-24 15:27:18, Theodore Ts'o wrote:
> On Mon, Jul 24, 2017 at 09:15:48PM +0200, Pavel Machek wrote:
> > >
> > > On 24.07.2017 at 20:57, Pavel Machek wrote:
> > > >Would it be feasible to run bcache (write-through) with existing ext4
> > > >filesystem?
> > > >
> > > >I have 400GB of data I'd rather not move, and SSD I could use for
> > > >caching. Ok, SSD is connected over USB2, but I guess it is still way
> > > >faster than seeking harddrive on random access
> > >
> > > i doubt that seriously - USB2 has a terrible latency
> >
> > Well.. if that's too slow, I can get SSD M.2; plus bcache docs say
> > that combination works.
> >
> > And... if you ever tried to do git diff while git checkout is running
> > on spinning rust... spinning rust has awful parameters when idle, and
> > it only gets worse when loaded :-(.
>
> So some hard numbers. Max throughput of USB 2.0 is 53 MiB/s[1]. In
> actual practice the max throughput you will see out of the USB 2.0
> interface is 30-40 MiB/s. In contrast, a HDD doing sequential reads
> can easily do much more than that.
>
> [1] https://superuser.com/questions/317217/whats-the-maximum-typical-speed-possible-with-a-usb2-0-drive
>
> So a lot is going to depend on how bcache works. If you can get large
> sequential reads and writes to *bypass* the cache device, then I think
> there's a good chance that bcache on a USB 2.0 device won't hurt. It
> might not help as much as you like, but that's a function of the
> overhead of populating the cache and whether the cache can keep the
> useful bits in the cache device.

Another useful number is that spinning rust does less than 3MB/sec on
common operations done by git. (Yes, probably because of a lot of
seeking). So... a USB device should be able to help.
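
To make the comparison concrete, here is a back-of-the-envelope helper.
It is only a sketch: it models a transfer as size divided by sustained
rate and ignores per-request latency, which is exactly what the USB2
objection is about. The 90 MiB workload used below is a made-up figure
for illustration; only the roughly 3 MB/s (seek-bound HDD) and
30-40 MiB/s (USB 2.0 in practice) rates come from this thread.

```c
/*
 * Seconds needed to move `mib` MiB at a sustained rate of `mib_per_s` MiB/s.
 * Returns -1.0 if the rate is not positive.
 */
double transfer_seconds(double mib, double mib_per_s)
{
	if (mib_per_s <= 0.0)
		return -1.0;
	return mib / mib_per_s;
}
```

With these numbers, transfer_seconds(90.0, 3.0) gives 30 seconds for the
seeking disk and transfer_seconds(90.0, 30.0) gives 3 seconds over USB 2.0,
provided the reads actually hit the cache.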

Question for you was... Is the first 1KiB of each ext4 filesystem still
free and "reserved for a bootloader"?

If I needed more for bcache superblock (8KiB, IIRC), would that be
easy to accomplish on existing filesystem?

Thanks,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-07-25 04:52:06

by Theodore Ts'o

Subject: Re: bcache with existing ext4 filesystem

On Mon, Jul 24, 2017 at 10:04:51PM +0200, Pavel Machek wrote:
> Question for you was... Is the first 1KiB of each ext4 filesystem still
> free and "reserved for a bootloader"?

Yes.

> If I needed more for bcache superblock (8KiB, IIRC), would that be
> easy to accomplish on existing filesystem?

Huh? Why would the bcache superblock matter when you're talking about
the ext4 layout? The bcache superblock will be on the bcache
device/partition, and the ext4 superblock will be on the ext4
device/partition.

- Ted

2017-07-25 06:43:09

by Pavel Machek

Subject: Re: bcache with existing ext4 filesystem

On Tue 2017-07-25 00:51:56, Theodore Ts'o wrote:
> On Mon, Jul 24, 2017 at 10:04:51PM +0200, Pavel Machek wrote:
> > Question for you was... Is the first 1KiB of each ext4 filesystem still
> > free and "reserved for a bootloader"?
>
> Yes.

Thanks.

> > If I needed more for bcache superblock (8KiB, IIRC), would that be
> > easy to accomplish on existing filesystem?
>
> Huh? Why would the bcache superblock matter when you're talking about
> the ext4 layout? The bcache superblock will be on the bcache
> device/partition, and the ext4 superblock will be on the ext4
> device/partition.

I'd like to enable bcache on already existing ext4 partition. AFAICT
normal situation, even on the backing device, is:

| 8KiB bcache superblock | 1KiB reserved | ext4 superblock | 400GB data |

Unfortunately, that would mean shifting 400GB data 8KB forward, and
compatibility problems. So I'd prefer adding bcache superblock into
the reserved space, so I can have caching _and_ compatibility with
grub2 etc (and avoid 400GB move):

| 1KiB (modified) bcache superblock | ext4 superblock | 400GB data |
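
The constraint behind the second layout can be written down as a small
check. This is a sketch under the assumptions stated in this thread: 512-byte
sectors, and 1024 bytes free before the ext4 superblock. The function
name is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_BYTES  512u
#define EXT4_BOOT_PAD 1024u  /* bytes before the ext4 superblock */

/*
 * True if a superblock of sb_bytes placed at sector sb_sector ends before
 * the ext4 superblock begins, i.e. it fits in the reserved boot area.
 */
bool fits_before_ext4_sb(uint64_t sb_sector, uint64_t sb_bytes)
{
	return sb_sector * SECTOR_BYTES + sb_bytes <= EXT4_BOOT_PAD;
}
```

The stock placement (sector 8, i.e. byte 4096) fails this check, while a
superblock at sector 0 passes only if it is shrunk to at most 1024 bytes.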

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-07-25 11:00:54

by Vojtech Pavlik

Subject: Re: bcache with existing ext4 filesystem

On Tue, Jul 25, 2017 at 08:43:04AM +0200, Pavel Machek wrote:
> On Tue 2017-07-25 00:51:56, Theodore Ts'o wrote:
> > On Mon, Jul 24, 2017 at 10:04:51PM +0200, Pavel Machek wrote:
> > > Question for you was... Is the first 1KiB of each ext4 filesystem still
> > > free and "reserved for a bootloader"?
> >
> > Yes.
>
> Thanks.
>
> > > If I needed more for bcache superblock (8KiB, IIRC), would that be
> > > easy to accomplish on existing filesystem?
> >
> > Huh? Why would the bcache superblock matter when you're talking about
> > the ext4 layout? The bcache superblock will be on the bcache
> > device/partition, and the ext4 superblock will be on the ext4
> > device/partition.
>
> I'd like to enable bcache on already existing ext4 partition. AFAICT
> normal situation, even on the backing device, is:
>
> | 8KiB bcache superblock | 1KiB reserved | ext4 superblock | 400GB data |
>
> Unfortunately, that would mean shifting 400GB data 8KB forward, and
> compatibility problems. So I'd prefer adding bcache superblock into
> the reserved space, so I can have caching _and_ compatibility with
> grub2 etc (and avoid 400GB move):

The common way to do that is to move the beginning of the partition,
assuming your ext4 lives in a partition.

I don't see how overlapping the ext4 and the bcache backing device
starts would give you what you want, because bcache assumes the
backing device data starts with an offset.

--
Vojtech Pavlik

2017-07-25 11:12:14

by Pavel Machek

Subject: Re: bcache with existing ext4 filesystem

On Tue 2017-07-25 12:32:48, Vojtech Pavlik wrote:
> On Tue, Jul 25, 2017 at 08:43:04AM +0200, Pavel Machek wrote:
> > On Tue 2017-07-25 00:51:56, Theodore Ts'o wrote:
> > > On Mon, Jul 24, 2017 at 10:04:51PM +0200, Pavel Machek wrote:
> > > > Question for you was... Is the first 1KiB of each ext4 filesystem still
> > > > free and "reserved for a bootloader"?
> > >
> > > Yes.
> >
> > Thanks.
> >
> > > > If I needed more for bcache superblock (8KiB, IIRC), would that be
> > > > easy to accomplish on existing filesystem?
> > >
> > > Huh? Why would the bcache superblock matter when you're talking about
> > > the ext4 layout? The bcache superblock will be on the bcache
> > > device/partition, and the ext4 superblock will be on the ext4
> > > device/partition.
> >
> > I'd like to enable bcache on already existing ext4 partition. AFAICT
> > normal situation, even on the backing device, is:
> >
> > | 8KiB bcache superblock | 1KiB reserved | ext4 superblock | 400GB data |
> >
> > Unfortunately, that would mean shifting 400GB data 8KB forward, and
> > compatibility problems. So I'd prefer adding bcache superblock into
> > the reserved space, so I can have caching _and_ compatibility with
> > grub2 etc (and avoid 400GB move):
>
> The common way to do that is to move the beginning of the partition,
> assuming your ext4 lives in a partition.

Well... if I move the partition, grub2 (etc) will be unable to access
data on it. (Plus I do not have free space before some of the
partitions I'd like to be cached).

> I don't see how overlapping the ext4 and the bcache backing device
> starts would give you what you want, because bcache assumes the
> backing device data starts with an offset.

My plan is to make offset 0. AFAICT bcache superblock can be shrunk.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-07-25 13:46:10

by Pavel Machek

Subject: Re: bcache with existing ext4 filesystem

Hi!

> > Question for you was... Is the first 1KiB of each ext4 filesystem still
> > free and "reserved for a bootloader"?
>
> Yes.
>
> > If I needed more for bcache superblock (8KiB, IIRC), would that be
> > easy to accomplish on existing filesystem?
>
> Huh? Why would the bcache superblock matter when you're talking about
> the ext4 layout? The bcache superblock will be on the bcache
> device/partition, and the ext4 superblock will be on the ext4
> device/partition.

So this is what I came up with so far. With the SSD in a USB2 enclosure
(and a previous version of the patch), git diff goes from 9 seconds to
3 seconds; not too bad. git diff on a (different) SSD takes 1.5 seconds.

Warning: this is rather dangerous, as it is easy to make cache go
out-of-sync with data partition. To solve that...

Is there some field in ext2 superblock that changes every time
filesystem is changed? Is mtime changed by fsck/badblocks/...?

Best regards,
Pavel



diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e57353e..f8c0aef 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -6,6 +6,7 @@
  * Copyright 2012 Google, Inc.
  */

+#define DEBUG
 #include "bcache.h"
 #include "btree.h"
 #include "debug.h"
@@ -67,7 +68,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 {
 	const char *err;
 	struct cache_sb *s;
-	struct buffer_head *bh = __bread(bdev, 1, SB_SIZE);
+	struct buffer_head *bh = __bread(bdev, SB_SECTOR/8, SB_SIZE);
 	unsigned i;

 	if (!bh)
@@ -95,10 +96,11 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 	pr_debug("read sb version %llu, flags %llu, seq %llu, journal size %u",
 		 sb->version, sb->flags, sb->seq, sb->keys);

-	err = "Not a bcache superblock";
+	err = "Not a bcache superblock: offset";
 	if (sb->offset != SB_SECTOR)
 		goto err;

+	err = "Not a bcache superblock: magic";
 	if (memcmp(sb->magic, bcache_magic, 16))
 		goto err;

@@ -124,7 +126,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 	case BCACHE_SB_VERSION_BDEV:
 		sb->data_offset = BDEV_DATA_START_DEFAULT;
 		break;
-	case BCACHE_SB_VERSION_BDEV_WITH_OFFSET:
+//	case BCACHE_SB_VERSION_BDEV_WITH_OFFSET:
 		sb->data_offset = le64_to_cpu(s->data_offset);

 		err = "Bad data offset";
@@ -132,6 +134,15 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 			goto err;

 		break;
+	case BCACHE_SB_VERSION_BDEV_WITH_OFFSET:
+	case BCACHE_SB_VERSION_BDEV_EXT4_LITE:
+		sb->data_offset = le64_to_cpu(s->data_offset);
+		printk("Size of sb is %d\n", sizeof(*sb));
+		WARN_ON(sizeof(*sb) > 1024);
+		if (sizeof(*sb) > 1024)
+			goto err;
+		break;
+
 	case BCACHE_SB_VERSION_CDEV:
 	case BCACHE_SB_VERSION_CDEV_WITH_UUID:
 		sb->nbuckets = le64_to_cpu(s->nbuckets);
diff --git a/include/uapi/linux/bcache.h b/include/uapi/linux/bcache.h
index e3bb063..b1ef80c 100644
--- a/include/uapi/linux/bcache.h
+++ b/include/uapi/linux/bcache.h
@@ -142,12 +142,13 @@ static inline struct bkey *bkey_idx(const struct bkey *k, unsigned nr_keys)
 #define BCACHE_SB_VERSION_BDEV 1
 #define BCACHE_SB_VERSION_CDEV_WITH_UUID 3
 #define BCACHE_SB_VERSION_BDEV_WITH_OFFSET 4
-#define BCACHE_SB_MAX_VERSION 4
+#define BCACHE_SB_VERSION_BDEV_EXT4_LITE 5
+#define BCACHE_SB_MAX_VERSION 5

-#define SB_SECTOR 8
+#define SB_SECTOR 0
 #define SB_SIZE 4096
 #define SB_LABEL_SIZE 32
-#define SB_JOURNAL_BUCKETS 256U
+#define SB_JOURNAL_BUCKETS 64U
 /* SB_JOURNAL_BUCKETS must be divisible by BITS_PER_LONG */
 #define MAX_CACHES_PER_SET 8



--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-07-25 16:11:01

by Theodore Ts'o

Subject: Re: bcache with existing ext4 filesystem

On Tue, Jul 25, 2017 at 01:12:10PM +0200, Pavel Machek wrote:
>
> Well... if I move the partition, grub2 (etc) will be unable to access
> data on it. (Plus I do not have free space before some of the
> partitions I'd like to be cached).

Both Grub and Linux's implementation of ext4 expect the superblock to
be at offset 1024 bytes from the beginning of the block device.

From looking at Documentation/bcache.txt, the problem is that bcache
works much like LVM or device mapper. That is, you have to create the
file system on /dev/bcacheN. That implies that grub needs to
understand bcache, which as far as I understand, it doesn't today.

- Ted

2017-07-25 18:02:32

by Theodore Ts'o

Subject: Re: bcache with existing ext4 filesystem

On Tue, Jul 25, 2017 at 03:46:04PM +0200, Pavel Machek wrote:
>
> Is there some field in ext2 superblock that changes every time
> filesystem is changed? Is mtime changed by fsck/badblocks/...?

No, there isn't. If we were writing the superblock every time the
file system is changed it would be ***extremely*** flash unfriendly.
It would also be a scalability bottleneck, it would cause us to pay an
extra HDD seek, etc. So it's a really bad Bad BAD idea, and so we
don't do it.

Cheers,

- Ted

2017-07-25 18:13:11

by Eric Wheeler

Subject: Re: bcache with existing ext4 filesystem

On Tue, 25 Jul 2017, Pavel Machek wrote:

> On Tue 2017-07-25 12:32:48, Vojtech Pavlik wrote:
> > On Tue, Jul 25, 2017 at 08:43:04AM +0200, Pavel Machek wrote:
> > > On Tue 2017-07-25 00:51:56, Theodore Ts'o wrote:
> > > > On Mon, Jul 24, 2017 at 10:04:51PM +0200, Pavel Machek wrote:
> > > > > Question for you was... Is the first 1KiB of each ext4 filesystem still
> > > > > free and "reserved for a bootloader"?
> > > >
> > > > Yes.
> > >
> > > Thanks.
> > >
> > > > > If I needed more for bcache superblock (8KiB, IIRC), would that be
> > > > > easy to accomplish on existing filesystem?
> > > >
> > > > Huh? Why would the bcache superblock matter when you're talking about
> > > > the ext4 layout? The bcache superblock will be on the bcache
> > > > device/partition, and the ext4 superblock will be on the ext4
> > > > device/partition.
> > >
> > > I'd like to enable bcache on already existing ext4 partition. AFAICT
> > > normal situation, even on the backing device, is:
> > >
> > > | 8KiB bcache superblock | 1KiB reserved | ext4 superblock | 400GB data |
> > >
> > > Unfortunately, that would mean shifting 400GB data 8KB forward, and
> > > compatibility problems. So I'd prefer adding bcache superblock into
> > > the reserved space, so I can have caching _and_ compatibility with
> > > grub2 etc (and avoid 400GB move):
> >
> > The common way to do that is to move the beginning of the partition,
> > assuming your ext4 lives in a partition.
>
> Well... if I move the partition, grub2 (etc) will be unable to access
> data on it. (Plus I do not have free space before some of the
> partitions I'd like to be cached).

Why not use dm-linear and prepend space for the bcache superblock? If
this is your boot device, then you would need to write a custom
initrd hook too.

Note that if bcache comes up without its cache, you will need to:
echo 1 > /sys/block/sdX/bcache/running

This is of course unsafe with writeback but should be fine with
writethrough.
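
For illustration, the dm-linear idea amounts to a two-line device-mapper
table: a small pad device first, then the existing filesystem device
mapped directly after it. The helper below only formats such a table; it
is a sketch, not a tested recipe. The device paths in the usage note are
hypothetical, and the 16-sector pad assumes bcache's default backing-device
data offset (BDEV_DATA_START_DEFAULT, 8 KiB), so that the cached data
starts exactly where the old partition starts.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define BCACHE_DEFAULT_DATA_SECTORS 16  /* 16 sectors of 512 bytes = 8 KiB */

/*
 * Writes a two-line dm-linear table into buf. Each line has the form
 *   <start> <length> linear <device> <offset>
 * with all numbers in 512-byte sectors. Returns the snprintf() result.
 */
int format_prepend_table(char *buf, size_t n,
			 const char *pad_dev, uint64_t pad_sectors,
			 const char *fs_dev, uint64_t fs_sectors)
{
	return snprintf(buf, n,
			"0 %" PRIu64 " linear %s 0\n"
			"%" PRIu64 " %" PRIu64 " linear %s 0\n",
			pad_sectors, pad_dev,
			pad_sectors, fs_sectors, fs_dev);
}
```

Calling it with "/dev/loop0", BCACHE_DEFAULT_DATA_SECTORS, "/dev/sda4" and
the partition size in sectors yields a table that could be piped to
dmsetup create; the bcache backing device would then be set up on the
resulting mapper device rather than on sda4 itself.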


--
Eric Wheeler


>
> > I don't see how overlapping the ext4 and the bcache backing device
> > starts would give you what you want, because bcache assumes the
> > backing device data starts with an offset.
>
> My plan is to make offset 0. AFAICT bcache superblock can be shrunk.
>
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>

2017-07-25 20:55:33

by Pavel Machek

Subject: Re: bcache with existing ext4 filesystem

On Tue 2017-07-25 14:02:25, Theodore Ts'o wrote:
> On Tue, Jul 25, 2017 at 03:46:04PM +0200, Pavel Machek wrote:
> >
> > Is there some field in ext2 superblock that changes every time
> > filesystem is changed? Is mtime changed by fsck/badblocks/...?
>
> No, there isn't. If we were writing the superblock every time the
> file system is changed it would be ***extremely*** flash unfriendly.
> It would also be a scalability bottleneck, it would cause us to pay an
> extra HDD seek, etc. So it's a really bad Bad BAD idea, and so we
> don't do it.

Ok, I did not mean "every time" when I said "every time". That would
be too heavy.

I mean... is there something changed by the regular mount (like mtime)
plus by operations like fsck and badblocks?

In particular, does fsck change mtime when it writes to the
filesystem?

Thanks,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-07-25 22:02:28

by Pavel Machek

Subject: Re: bcache with existing ext4 filesystem

Hi!

> > > > Unfortunately, that would mean shifting 400GB data 8KB forward, and
> > > > compatibility problems. So I'd prefer adding bcache superblock into
> > > > the reserved space, so I can have caching _and_ compatibility with
> > > > grub2 etc (and avoid 400GB move):
> > >
> > > The common way to do that is to move the beginning of the partition,
> > > assuming your ext4 lives in a partition.
> >
> > Well... if I move the partition, grub2 (etc) will be unable to access
> > data on it. (Plus I do not have free space before some of the
> > partitions I'd like to be cached).
>
> Why not use dm-linear and prepend space for the bcache superblock? If
> this is your boot device, then you would need to write a custom
> initrd hook too.

Thanks for a pointer. That would actually work, but I'd have to be
very, very careful using it...

...because if I, or systemd or some kind of automounter sees the
underlying device (sda4) and writes to it (it is valid ext4 after
all), I'll have inconsistent base device and cache ... and that will
be asking for major problems (even in writethrough mode).

Actually, this already would be usable, if we killed content of cache
device on every mount. Hmmm. I have reasonably long uptimes these days.

If possible, I'd like something more clever: bcache saves mtime of the
ext4 filesystem on shutdown. If the mtime does not match on the next
startup, it means someone fsck-ed the filesystem or mounted it
directly or something, and cache is invalid.

Bonus would be some kind of interlock with "incompatible feature"
bits. If the bcache has dirty data in write-back cache, it would be
nice to have "incompatible feature" bit set, so that tools that don't
have access to the cache refuse to touch it.
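
The interlock relies on how ext4 treats incompatible feature bits: an
implementation refuses a filesystem whose s_feature_incompat field
(a little-endian u32 at byte 0x60 of the superblock) has any bit set that
it does not know. A minimal sketch of that rule follows; the "dirty
cache" bit value is purely hypothetical and is not allocated in any real
ext4, and the function name is made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL bit for illustration only: "a cache holds dirty data". */
#define HYPOTHETICAL_INCOMPAT_BCACHE_DIRTY 0x40000000u

/*
 * Incompat rule: a tool may touch the filesystem only if every incompat
 * bit set on disk is one it supports.
 */
bool may_touch_fs(uint32_t fs_incompat, uint32_t supported_incompat)
{
	return (fs_incompat & ~supported_incompat) == 0;
}
```

A tool unaware of the hypothetical bit would get false here as long as
the bit stays set, and would be allowed in again once the cache has been
flushed and the bit cleared.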

Best regards,

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-07-26 17:41:34

by Eric Wheeler

Subject: Re: bcache with existing ext4 filesystem

On Wed, 26 Jul 2017, Pavel Machek wrote:

> Hi!
>
> > > > > Unfortunately, that would mean shifting 400GB data 8KB forward, and
> > > > > compatibility problems. So I'd prefer adding bcache superblock into
> > > > > the reserved space, so I can have caching _and_ compatibility with
> > > > > grub2 etc (and avoid 400GB move):
> > > >
> > > > The common way to do that is to move the beginning of the partition,
> > > > assuming your ext4 lives in a partition.
> > >
> > > Well... if I move the partition, grub2 (etc) will be unable to access
> > > data on it. (Plus I do not have free space before some of the
> > > partitions I'd like to be cached).
> >
> > Why not use dm-linear and prepend space for the bcache superblock? If
> > this is your boot device, then you would need to write a custom
> > initrd hook too.
>
> Thanks for a pointer. That would actually work, but I'd have to be
> very, very careful using it...
>
> ...because if I, or systemd or some kind of automounter sees the
> underlying device (sda4) and writes to it (it is valid ext4 after
> all), I'll have inconsistent base device and cache ... and that will
> be asking for major problems (even in writethrough mode).

Sigh. Gone are the days when distributions would only mount filesystems
if you ask them to.

If this is a desktop, then I'm not sure what to suggest. But for server
with no GUI, turn off the grub2 osprober (GRUB_DISABLE_OS_PROBER="true" in
/etc/sysconfig/grub). If this was LVM, then I would suggest also setting
global_filter in lvm.conf. If you find other places that need poked to
prevent automounting then please let me know!

As for ext4 feature bits, can they be arbitrarily named? (I think they are
bits, so maybe not). Maybe propose a patch to ext4 to provide a
"disable_mount" feature. This would prevent mounting altogether, and you
would set/clear it when you care to. A strange feature indeed. Doing
this as an obscure feature on a single filesystem doesn't quite seem
right.


It might be better to have a block-layer device-mount mask so devices that
are allowed to be mounted can be whitelisted on the kernel cmdline or
something. blk.allow_mount=8:16,253:*,... or blk.disallow_mount=8:32 (or
probably both).

Just ideas, but it would be great to allow mounting only those
major/minors which are authorized, particularly with increasingly complex
block-device stacks.
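
To show what matching against such a mask might look like, here is a
userspace sketch. Note that blk.allow_mount and blk.disallow_mount are
only the proposal above and do not exist as kernel parameters; the
function name and the decision to skip malformed entries are assumptions
made for this example.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Returns true if major:minor matches an entry of a comma-separated mask
 * such as "8:16,253:*". A minor of '*' matches every minor of that major.
 * Malformed entries are skipped.
 */
bool devnum_in_mask(const char *mask, unsigned long major, unsigned long minor)
{
	const char *p = mask;

	while (*p) {
		char *end;
		unsigned long maj = strtoul(p, &end, 10);
		bool match = false;

		if (end != p && *end == ':') {
			p = end + 1;
			if (*p == '*') {
				match = (maj == major);
				p++;
			} else {
				unsigned long min = strtoul(p, &end, 10);

				if (end != p)
					match = (maj == major && min == minor);
				p = end;
			}
		}
		/* Accept only if the entry ended cleanly. */
		if (match && (*p == ',' || *p == '\0'))
			return true;
		/* Advance to the start of the next entry. */
		while (*p && *p != ',')
			p++;
		if (*p == ',')
			p++;
	}
	return false;
}
```

With the mask "8:16,253:*", device 8:16 and every 253:N would be allowed,
while 8:32 would not.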


--
Eric Wheeler




>
> Actually, this already would be usable, if we killed content of cache
> device on every mount. Hmmm. I have reasonably long uptimes these days.
>
> If possible, I'd like something more clever: bcache saves mtime of the
> ext4 filesystem on shutdown. If the mtime does not match on the next
> startup, it means someone fsck-ed the filesystem or mounted it
> directly or something, and cache is invalid.
>
> Bonus would be some kind of interlock with "incompatible feature"
> bits. If the bcache has dirty data in write-back cache, it would be
> nice to have "incompatible feature" bit set, so that tools that don't
> have access to the cache refuse to touch it.
>
> Best regards,
>
> Pavel
>
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>

2017-07-26 19:00:00

by Austin S Hemmelgarn

Subject: Re: bcache with existing ext4 filesystem

On 2017-07-26 13:41, Eric Wheeler wrote:
> On Wed, 26 Jul 2017, Pavel Machek wrote:
>
>> Hi!
>>
>>>>>> Unfortunately, that would mean shifting 400GB data 8KB forward, and
>>>>>> compatibility problems. So I'd prefer adding bcache superblock into
>>>>>> the reserved space, so I can have caching _and_ compatibility with
>>>>>> grub2 etc (and avoid 400GB move):
>>>>>
>>>>> The common way to do that is to move the beginning of the partition,
>>>>> assuming your ext4 lives in a partition.
>>>>
>>>> Well... if I move the partition, grub2 (etc) will be unable to access
>>>> data on it. (Plus I do not have free space before some of the
>>>> partitions I'd like to be cached).
>>>
>>> Why not use dm-linear and prepend space for the bcache superblock? If
>>> this is your boot device, then you would need to write a custom
>>> initrd hook too.
>>
>> Thanks for a pointer. That would actually work, but I'd have to be
>> very, very careful using it...
>>
>> ...because if I, or systemd or some kind of automounter sees the
>> underlying device (sda4) and writes to it (it is valid ext4 after
>> all), I'll have inconsistent base device and cache ... and that will
>> be asking for major problems (even in writethrough mode).
>
> Sigh. Gone are the days when distributions would only mount filesystems
> if you ask them to.
This is only an issue if:
1. You are crazy and have your desktop set up to auto-mount stuff on
insertion (instead of on-access), or you're using a brain-dead desktop
that doesn't let you turn this functionality off.
or:
2. You are mounting by filesystem label or filesystem UUID
(/dev/mapper/* and /dev/<vgname>/<lvname> links are more than
sufficiently stable unless you're changing the storage stack all the
time). This is of course almost always the case these days because for
some reason the big distros assume that the device mapper links aren't
stable or aren't safe to use.
or:
3. You're working with a multi-device BTRFS volume (and in that case,
it's not an issue of auto-mounting it, but an issue of buggy kernel
behavior combined with the 'scan everything for BTRFS' udev rule that's
now upstream in udev).
>
> If this is a desktop, then I'm not sure what to suggest. But for server
> with no GUI, turn off the grub2 osprober (GRUB_DISABLE_OS_PROBER="true" in
> /etc/sysconfig/grub). If this was LVM, then I would suggest also setting
> global_filter in lvm.conf. If you find other places that need poked to
> prevent automounting then please let me know!
Nope, just those and what I mentioned above under 1 and 2.
>
> As for ext4 feature bits, can they be arbitrarily named? (I think they are
> bits, so maybe not). Maybe propose a patch to ext4 to provide a
> "disable_mount" feature. This would prevent mounting altogether, and you
> would set/clear it when you care to. A strange feature indeed. Doing
> this as an obscure feature on a single filesystem doesn't quite seem
> right.
In the case of ext4, you can use the MMP feature (with the associated
minor overhead) to enforce this.
>
>
> It might be better to have a block-layer device-mount mask so devices that
> are allowed to be mounted can be whitelisted on the kernel cmdline or
> something. blk.allow_mount=8:16,253:*,... or blk.disallow_mount=8:32 (or
> probably both).
>
> Just ideas, but it would be great to allow mounting only those
> major/minors which are authorized, particularly with increasingly complex
> block-device stacks.
If it's not explicitly created as a mount unit, fstab entry, or
automount entry, and you're not using a brain-dead desktop, and you're
not using FS labels or UUIDs to mount, it _really_ isn't an issue
except on BTRFS, and for that this would provide _no_ benefit, because
the issue there results from confusion over which device to use combined
with the assumption that a UUID stored on the device is sufficient to
uniquely identify a filesystem.

2017-07-26 19:16:30

by Eric Wheeler

Subject: Re: bcache with existing ext4 filesystem

On Wed, 26 Jul 2017, Austin S. Hemmelgarn wrote:

> On 2017-07-26 13:41, Eric Wheeler wrote:
> > On Wed, 26 Jul 2017, Pavel Machek wrote:
> >
> > > Hi!
> > >
> > > > > > > Unfortunately, that would mean shifting 400GB data 8KB forward,
> > > > > > > and
> > > > > > > compatibility problems. So I'd prefer adding bcache superblock
> > > > > > > into
> > > > > > > the reserved space, so I can have caching _and_ compatibility with
> > > > > > > grub2 etc (and avoid 400GB move):
> > > > > >
> > > > > > The common way to do that is to move the beginning of the partition,
> > > > > > assuming your ext4 lives in a partition.
> > > > >
> > > > > Well... if I move the partition, grub2 (etc) will be unable to access
> > > > > data on it. (Plus I do not have free space before some of the
> > > > > partitions I'd like to be cached).
> > > >
> > > > Why not use dm-linear and prepend space for the bcache superblock? If
> > > > this is your boot device, then you would need to write a custom
> > > > initrd hook too.
> > >
> > > Thanks for a pointer. That would actually work, but I'd have to be
> > > very, very careful using it...
> > >
> > > ...because if I, or systemd or some kind of automounter sees the
> > > underlying device (sda4) and writes to it (it is valid ext4 after
> > > all), I'll have inconsistent base device and cache ... and that will
> > > be asking for major problems (even in writethrough mode).
> >
> > Sigh. Gone are the days when distributions would only mount filesystems
> > if you ask them to.
> This is only an issue if:
> 1. You are crazy and have your desktop set up to auto-mount stuff on insertion
> (instead of on-access), or you're using a brain-dead desktop that doesn't let
> you turn this functionality off.
> or:
> 2. You are mounting by filesystem label or filesystem UUID (/dev/mapper/* and
> /dev/<vgname>/<lvname> links are more than sufficiently stable unless you're
> changing the storage stack all the time). This is of course almost always the
> case these days because for some reason the big distros assume that the device
> mapper links aren't stable or aren't safe to use.
> or:
> 3. You're working with a multi-device BTRFS volume (and in that case, it's not
> an issue of auto-mounting it, but an issue of buggy kernel behavior combined
> with the 'scan everything for BTRFS' udev rule that's now upstream in udev).
> >
> > If this is a desktop, then I'm not sure what to suggest. But for server
> > with no GUI, turn off the grub2 osprober (GRUB_DISABLE_OS_PROBER="true" in
> > /etc/sysconfig/grub). If this was LVM, then I would suggest also setting
> > global_filter in lvm.conf. If you find other places that need poked to
> > prevent automounting then please let me know!
> Nope, just those and what I mentioned above under 1 and 2.
> >
> > As for ext4 feature bits, can they be arbitrarily named? (I think they are
> > bits, so maybe not). Maybe propose a patch to ext4 to provide a
> > "disable_mount" feature. This would prevent mounting altogether, and you
> > would set/clear it when you care to. A strange feature indeed. Doing
> > this as an obscure feature on a single filesystem doesn't quite seem
> > right.
> In the case of ext4, you can use the MMP feature (with the associated minor
> overhead) to enforce this.

Neat! However, MMP might not get flagged if bcache is in writeback mode
and caches the superblock update. He would need to set MMP on the backing
device explicitly, which might work but seems like a bad idea for
consistency. Writethrough would be fine.
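
One way to check whether the MMP flag actually reached a given device is to
read the ext4 superblock from it and test the feature bit. The sketch below
assumes the filesystem starts at offset 0 of the device being read and uses
the standard on-disk layout (superblock at byte 1024, s_magic at 0x38,
s_feature_incompat at 0x60). The helper names are made up for this example.

```python
import struct

EXT4_SB_OFFSET = 1024   # primary superblock starts 1024 bytes into the filesystem
EXT4_MAGIC = 0xEF53     # s_magic, little-endian u16 at offset 0x38
INCOMPAT_MMP = 0x0100   # EXT4_FEATURE_INCOMPAT_MMP


def has_mmp(sb):
    """Return True if the given superblock image has the MMP bit set."""
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT4_MAGIC:
        raise ValueError("not an ext2/3/4 superblock")
    (incompat,) = struct.unpack_from("<I", sb, 0x60)  # s_feature_incompat
    return bool(incompat & INCOMPAT_MMP)


def read_superblock(path):
    """Read the primary superblock from a device or image file."""
    with open(path, "rb") as dev:
        dev.seek(EXT4_SB_OFFSET)
        return dev.read(1024)
```

For example, has_mmp(read_superblock("/dev/bcache0")) would report the flag
as seen through the cache; the device path is only a placeholder. Reading a
block device normally requires root.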

Given the earlier precautions about which volume gets mounted (e.g., by
device name, not UUID), he should be OK as long as he mounts by neither
label nor UUID.

-Eric

> >
> >
> > It might be better to have a block-layer device-mount mask so devices that
> > are allowed to be mounted can be whitelisted on the kernel cmdline or
> > something. blk.allow_mount=8:16,253:*,... or blk.disallow_mount=8:32 (or
> > probably both).
> >
> > Just ideas, but it would be great to allow mounting only those
> > major/minors which are authorized, particularly with increasingly complex
> > block-device stacks.
> If it's not explicitly created as a mount unit, fstab entry, or automount
> entry, and you're not using a brain-dead desktop, and you're not using FS
> labels or UUID's to mount, it _really_ isn't an issue except on BTRFS, and for
> that this would provide _no_ benefit, because the issue there results from
> confusion over which device to use combined with the assumption that a UUID
> stored on the device is sufficient to uniquely identify a filesystem.
>
>

2017-07-26 20:01:32

by Pavel Machek

[permalink] [raw]
Subject: Re: bcache with existing ext4 filesystem

Hi!

> > > > > > Unfortunately, that would mean shifting 400GB data 8KB forward, and
> > > > > > compatibility problems. So I'd prefer adding bcache superblock into
> > > > > > the reserved space, so I can have caching _and_ compatibility with
> > > > > > grub2 etc (and avoid 400GB move):
> > > > >
> > > > > The common way to do that is to move the beginning of the partition,
> > > > > assuming your ext4 lives in a partition.
> > > >
> > > > Well... if I move the partition, grub2 (etc) will be unable to access
> > > > data on it. (Plus I do not have free space before some of the
> > > > partitions I'd like to be cached).
> > >
> > > Why not use dm-linear and prepend space for the bcache superblock? If
> > > this is your boot device, then you would need to write a custom
> > > initrd hook too.
> >
> > Thanks for a pointer. That would actually work, but I'd have to be
> > very, very careful using it...
> >
> > ...because if I, or systemd or some kind of automounter sees the
> > underlying device (sda4) and writes to it (it is valid ext4 after
> > all), I'll have inconsistent base device and cache ... and that will
> > be asking for major problems (even in writethrough mode).
>
> Sigh. Gone are the days when distributions would only mount filesystems
> if you ask them to.
>
> If this is a desktop, then I'm not sure what to suggest. But for server

This is a desktop.

> with no GUI, turn off the grub2 osprober (GRUB_DISABLE_OS_PROBER="true" in
> /etc/sysconfig/grub). If this was LVM, then I would suggest also setting
> global_filter in lvm.conf. If you find other places that need poked to
> prevent automounting then please let me know!
>
> As for ext4 feature bits, can they be arbitrarily named? (I think they are
> bits, so maybe not). Maybe propose a patch to ext4 to provide a

Hmm. I guess I could just set a "read-only compatible" bit meaning "this
device has a cache"... but last time I checked, ext3 still replayed the
journal during a read-only mount...
Pavel
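
As a rough sketch of the idea above: ext2/3/4 refuse any mount when the
superblock carries an unknown "incompat" bit, and refuse a read-write mount
(while still allowing a read-only one) when it carries an unknown "ro_compat"
bit. CACHED_BIT below is a made-up value standing in for a "this device has
a cache" flag; ext4 defines no such feature. The sketch does not model
journal replay, so it says nothing about whether a read-only mount still
writes to the device.

```python
# Sketch of how ext2/3/4 react to feature bits the running kernel does not
# know. CACHED_BIT is hypothetical; it is not a real ext4 feature flag.

CACHED_BIT = 0x80000000


def mount_decision(ro_compat, incompat, known_ro_compat, known_incompat, want_rw):
    """Return 'refuse', 'ro' or 'rw' for the given superblock feature words."""
    if incompat & ~known_incompat:
        return "refuse"  # unknown incompat bit: no mount at all
    if ro_compat & ~known_ro_compat:
        # unknown ro_compat bit: a read-write mount fails, read-only is allowed
        return "refuse" if want_rw else "ro"
    return "rw" if want_rw else "ro"
```

A kernel without knowledge of the flag would then refuse to mount the
filesystem read-write, while a kernel that knows the flag could go on to
mount it normally.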

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

