2008-12-24 21:10:38

by Alberto Bertogli

Subject: jbd2 inside a device mapper module


Hi!

I'm writing a small device mapper module, and I'm interested in placing
a jbd/jbd2 journal on the backing device.

I started by trying to do a __bread() manually (just for early tests)
inside my map function. But it got stuck, as far as I could see,
waiting for a buffer head in wait_on_buffer() IIRC (I could track it
down again if it's needed). And I couldn't find why it was locked, since
it was an unused loopback device, and my code didn't even deal with
buffer heads.

Then, since I was planning on using jbd/jbd2 anyway, I decided to use it
(and went for jbd2).


Now, I'm having issues with journal creation.

I tried using "mkfs.ext3 -O journal_dev", but jbd2_journal_load()
complains that it can't find the journal superblock.

And if I modify jbd2_journal_create(), removing the 'if
(journal->j_inode == NULL)' check (I imagine it's there for a reason,
but from a quick look at the code couldn't find it and thought it was
worth a try) then when creating it I get a warning (pasted below) and it
gets locked up, which I think may be related to what happened when I did
__bread(), but obviously I'm not sure at all.

And I got stuck there, so I thought it'd be better to ask. Does anyone
have any ideas or suggestions on what I'm doing wrong?


I've not published my code yet because it's really rough, but if anyone
wants to take a look at it, please let me know. I was planning on
posting it when it was at least working.

Thanks a lot,
Alberto



[42949814.780000] ------------[ cut here ]------------
[42949814.780000] WARNING: at /pub/src/linux/linux-2.6/fs/buffer.c:1186 mark_buffer_dirty+0x77/0xa0()
[42949814.780000] Modules linked in:
[42949814.780000] Call Trace:
[42949814.780000] 678f17d8: [<6003988b>] warn_on_slowpath+0x5b/0x80
[42949814.780000] 678f1818: [<600bbe9a>] __find_get_block_slow+0x7a/0x110
[42949814.780000] 678f1858: [<600bc219>] __find_get_block+0x79/0x180
[42949814.780000] 678f1888: [<60033b05>] __might_sleep+0x105/0x130
[42949814.780000] 678f18c8: [<600bc353>] __getblk+0x33/0x270
[42949814.780000] 678f18f8: [<600bc7e7>] mark_buffer_dirty+0x77/0xa0
[42949814.780000] 678f1918: [<6012b1a8>] jbd2_journal_create+0x88/0x170
[42949814.780000] 678f1958: [<601aac70>] csum_ctr+0x1b0/0x240
[42949814.780000] 678f1968: [<6019b810>] get_target_type+0x60/0xa0
[42949814.780000] 678f19a8: [<6019b0d4>] dm_table_add_target+0x174/0x3b0
[42949814.780000] 678f1a08: [<6019d057>] table_load+0xb7/0x200
[42949814.780000] 678f1a68: [<6019dd98>] dm_ctl_ioctl+0x288/0x300
[42949814.780000] 678f1a98: [<6019cfa0>] table_load+0x0/0x200
[42949814.780000] 678f1c18: [<600a82fb>] vfs_ioctl+0x1b/0x70
[42949814.780000] 678f1c28: [<600a8770>] do_vfs_ioctl+0x400/0x660
[42949814.780000] 678f1ca8: [<600a8a1a>] sys_ioctl+0x4a/0x80
[42949814.780000] 678f1ce8: [<6001a310>] handle_syscall+0x50/0x80
[42949814.780000] 678f1d08: [<6002bf1f>] userspace+0x3ff/0x530
[42949814.780000] 678f1fc8: [<60017012>] fork_handler+0x62/0x70
[42949814.780000]
[42949814.780000] ---[ end trace ebc125a00ee8f9d2 ]---


2008-12-24 23:20:14

by Alberto Bertogli

Subject: Re: jbd2 inside a device mapper module


[Adding lkml on the CC list, somehow I managed to screw the address and
sent it to the mm-cc list instead]

On Wed, Dec 24, 2008 at 07:10:38PM -0200, Alberto Bertogli wrote:
>
> Hi!
>
> I'm writing a small device mapper module, and I'm interested in placing
> a jbd/jbd2 journal on the backing device.
>
> I started by trying to do a __bread() manually (just for early tests)
> inside my map function. But it got stuck, as far as I could see,
> waiting for a buffer head in wait_on_buffer() IIRC (I could track it
> down again if it's needed). And I couldn't find why it was locked, since
> it was an unused loopback device, and my code didn't even deal with
> buffer heads.
>
> Then, since I was planning on using jbd/jbd2 anyway, I decided to use it
> (and went for jbd2).
>
>
> Now, I'm having issues with journal creation.
>
> I tried using "mkfs.ext3 -O journal_dev", but jbd2_journal_load()
> complains that it can't find the journal superblock.
>
> And if I modify jbd2_journal_create(), removing the 'if
> (journal->j_inode == NULL)' check (I imagine it's there for a reason,
> but from a quick look at the code couldn't find it and thought it was
> worth a try) then when creating it I get a warning (pasted below) and it
> gets locked up, which I think may be related to what happened when I did
> __bread(), but obviously I'm not sure at all.
>
> And I got stuck there, so I thought it'd be better to ask. Does anyone
> have any ideas or suggestions on what I'm doing wrong?
>
>
> I've not published my code yet because it's really rough, but if anyone
> wants to take a look at it, please let me know. I was planning on
> posting it when it was at least working.
>
> Thanks a lot,
> Alberto
>
>
>
> [42949814.780000] ------------[ cut here ]------------
> [42949814.780000] WARNING: at /pub/src/linux/linux-2.6/fs/buffer.c:1186 mark_buffer_dirty+0x77/0xa0()
> [42949814.780000] Modules linked in:
> [42949814.780000] Call Trace:
> [42949814.780000] 678f17d8: [<6003988b>] warn_on_slowpath+0x5b/0x80
> [42949814.780000] 678f1818: [<600bbe9a>] __find_get_block_slow+0x7a/0x110
> [42949814.780000] 678f1858: [<600bc219>] __find_get_block+0x79/0x180
> [42949814.780000] 678f1888: [<60033b05>] __might_sleep+0x105/0x130
> [42949814.780000] 678f18c8: [<600bc353>] __getblk+0x33/0x270
> [42949814.780000] 678f18f8: [<600bc7e7>] mark_buffer_dirty+0x77/0xa0
> [42949814.780000] 678f1918: [<6012b1a8>] jbd2_journal_create+0x88/0x170
> [42949814.780000] 678f1958: [<601aac70>] csum_ctr+0x1b0/0x240
> [42949814.780000] 678f1968: [<6019b810>] get_target_type+0x60/0xa0
> [42949814.780000] 678f19a8: [<6019b0d4>] dm_table_add_target+0x174/0x3b0
> [42949814.780000] 678f1a08: [<6019d057>] table_load+0xb7/0x200
> [42949814.780000] 678f1a68: [<6019dd98>] dm_ctl_ioctl+0x288/0x300
> [42949814.780000] 678f1a98: [<6019cfa0>] table_load+0x0/0x200
> [42949814.780000] 678f1c18: [<600a82fb>] vfs_ioctl+0x1b/0x70
> [42949814.780000] 678f1c28: [<600a8770>] do_vfs_ioctl+0x400/0x660
> [42949814.780000] 678f1ca8: [<600a8a1a>] sys_ioctl+0x4a/0x80
> [42949814.780000] 678f1ce8: [<6001a310>] handle_syscall+0x50/0x80
> [42949814.780000] 678f1d08: [<6002bf1f>] userspace+0x3ff/0x530
> [42949814.780000] 678f1fc8: [<60017012>] fork_handler+0x62/0x70
> [42949814.780000]
> [42949814.780000] ---[ end trace ebc125a00ee8f9d2 ]---



2008-12-24 23:52:36

by Theodore Ts'o

Subject: Re: jbd2 inside a device mapper module

On Wed, Dec 24, 2008 at 07:10:38PM -0200, Alberto Bertogli wrote:
>
> I'm writing a small device mapper module, and I'm interested in placing
> a jbd/jbd2 journal on the backing device.
>
> I started by trying to do a __bread() manually (just for early tests)
> inside my map function. But it got stuck, as far as I could see,
> waiting for a buffer head in wait_on_buffer() IIRC (I could track it
> down again if it's needed). And I couldn't find why it was locked, since
> it was an unused loopback device, and my code didn't even deal with
> buffer heads.

I have no idea why you would need to do manual __breads(). No doubt
I'm missing some context here.

> Then, since I was planning on using jbd/jbd2 anyway, I decided to use it
> (and went for jbd2).
>
> Now, I'm having issues with journal creation.
>
> I tried using "mkfs.ext3 -O journal_dev", but jbd2_journal_load()
> complains that it can't find the journal superblock.

So I'll tell you how to do this via simple hard drives, and you can
figure out how to make it work with dm. Note that if the journal
device isn't on a stand-alone spindle, it's probably not going to help
you. The whole point of using an external journal device is to avoid
the seeking on the journal device, or to take advantage of the speed
of a battery-backed NVRAM device. I'm not sure how much sense it
makes to use a dm-based external journal device.... What exactly do you
hope to achieve?

To create an external journal device on the device /dev/sda:

mke2fs -O journal_dev /dev/sda

To create a new filesystem on /dev/sdb1 that will use the external
journal found on /dev/sda:

mke2fs -j -J device=/dev/sda /dev/sdb1

- Ted

P.S. All of this is in the mke2fs man page....

2008-12-25 14:37:05

by Alberto Bertogli

Subject: Re: jbd2 inside a device mapper module

On Wed, Dec 24, 2008 at 06:49:15PM -0500, Theodore Tso wrote:
> On Wed, Dec 24, 2008 at 07:10:38PM -0200, Alberto Bertogli wrote:
> >
> > I'm writing a small device mapper module, and I'm interested in placing
> > a jbd/jbd2 journal on the backing device.
> >
> > I started by trying to do a __bread() manually (just for early tests)
> > inside my map function. But it got stuck, as far as I could see,
> > waiting for a buffer head in wait_on_buffer() IIRC (I could track it
> > down again if it's needed). And I couldn't find why it was locked, since
> > it was an unused loopback device, and my code didn't even deal with
> > buffer heads.
>
> I have no idea why you would need to do manual __breads(). No doubt
> I'm missing some context here.

I'm writing (just for fun and learning purposes) a device mapper module
that stores checksums on writes and verifies them on reads. The
integrity metadata (currently just the checksum) is interleaved in the
backing device: one sector holding the integrity metadata for the
following 64 data sectors.
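
The interleaved layout described above can be sketched in userspace C as
follows. This is only an illustration under assumed conventions: sectors
are numbered from 0, each group starts with its metadata sector, and the
helper names (imd_sector, data_sector, logical_from_data) are
hypothetical, not taken from the actual module.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 1 metadata sector followed by 64 data sectors, repeated. */
#define DATA_PER_GROUP 64u
#define GROUP_SECTORS  (DATA_PER_GROUP + 1u)

/* Physical sector holding the integrity metadata for a logical sector. */
static inline uint64_t imd_sector(uint64_t logical)
{
    return (logical / DATA_PER_GROUP) * GROUP_SECTORS;
}

/* Physical sector holding the data of a logical sector. */
static inline uint64_t data_sector(uint64_t logical)
{
    return imd_sector(logical) + 1 + (logical % DATA_PER_GROUP);
}

/* Reverse mapping; the argument must not be a metadata sector. */
static inline uint64_t logical_from_data(uint64_t physical)
{
    assert(physical % GROUP_SECTORS != 0);
    return (physical / GROUP_SECTORS) * DATA_PER_GROUP
           + (physical % GROUP_SECTORS) - 1;
}
```

With this numbering, logical sector 64 is the first sector of the second
group, so its metadata lives in physical sector 65 and its data in
physical sector 66.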

The reason for the __bread() is explained below.


> > Then, since I was planning on using jbd/jbd2 anyway, I decided to use it
> > (and went for jbd2).
> >
> > Now, I'm having issues with journal creation.
> >
> > I tried using "mkfs.ext3 -O journal_dev", but jbd2_journal_load()
> > complains that it can't find the journal superblock.
>
> So I'll tell you how to do this via simple hard drives, and you can
> figure out how to make it work with dm. Note that if the journal
> device isn't on a stand-alone spindle, it's probably not going to help
> you. The whole point of using an external journal device is to avoid
> the seeking on the journal device, or to take advantage of the speed
> of a battery-backed NVRAM device. I'm not sure how much sense it
> makes to use a dm-based external journal device.... What exactly do you
> hope to achieve?
>
> To create an external journal device on the device /dev/sda:
>
> mke2fs -O journal_dev /dev/sda
>
> To create a new filesystem on /dev/sdb1 that will use the external
> journal found on /dev/sda:
>
> mke2fs -j -J device=/dev/sda /dev/sdb1
>
> - Ted
>
> P.S. All of this is in the mke2fs man page....

Thanks. I've found and tried that (that's what I meant with the
paragraph you quote), but I couldn't make it work.

I'll try to make my intentions more clear, but please let me know if I'm
not explaining myself.


For each write on the dm device I should not only write the data in the
backing device, but also update the corresponding integrity metadata.

So, to update the metadata, I should first read that sector from the
backing device, then update it, and finally write it back. As an early
experiment I began to do the first part without caring for the atomicity
of the update. I tried __bread() (just as an experiment, because I've
been using dm-io to do the reads so far) without success.
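
A minimal userspace sketch of that read-modify-write step, operating on
an in-memory copy of the metadata sector. The entry layout is an
assumption made for illustration: 512-byte sectors, 64 entries of 8 bytes
each, with a little-endian CRC-32 in the first 4 bytes of an entry. The
names imd_set_crc and imd_check are hypothetical, and the actual I/O
(reading the sector and writing it back) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE        512u
#define ENTRIES_PER_SECTOR 64u
#define ENTRY_SIZE         (SECTOR_SIZE / ENTRIES_PER_SECTOR) /* 8 bytes (assumed) */

/* Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320). */
static inline uint32_t crc32_buf(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* "Modify" step: store the CRC of one data sector in entry idx. */
static inline void imd_set_crc(uint8_t *imd, unsigned idx, const uint8_t *data)
{
    assert(idx < ENTRIES_PER_SECTOR);
    uint32_t crc = crc32_buf(data, SECTOR_SIZE);
    uint8_t *e = imd + idx * ENTRY_SIZE;
    e[0] = (uint8_t)crc;
    e[1] = (uint8_t)(crc >> 8);
    e[2] = (uint8_t)(crc >> 16);
    e[3] = (uint8_t)(crc >> 24);
}

/* Read-side check: returns 1 if the data matches the stored CRC. */
static inline int imd_check(const uint8_t *imd, unsigned idx, const uint8_t *data)
{
    assert(idx < ENTRIES_PER_SECTOR);
    const uint8_t *e = imd + idx * ENTRY_SIZE;
    uint32_t stored = (uint32_t)e[0] | ((uint32_t)e[1] << 8) |
                      ((uint32_t)e[2] << 16) | ((uint32_t)e[3] << 24);
    return stored == crc32_buf(data, SECTOR_SIZE);
}
```

Nothing here makes the metadata and data updates atomic with respect to
each other; that is the problem the journal was meant to solve.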


I then thought of giving jbd2 a try, with the final intention of using
it to update the metadata and the data in an atomic way. I'd devote some
space at the beginning of the backing device for the journal, and use it
internally to that purpose (so it has nothing to do with ext3/4).

The first problem I stumbled upon was that jbd2_journal_create() doesn't
like journals initialized using jbd2_journal_init_dev() (because it has
no j_inode). I had two choices: either try to create the journal some other
way, or remove the j_inode test in jbd2_journal_create().

I suspected the test was there for a reason, but I couldn't find it from
a quick look, so I tried it anyway, which resulted in the warning from
the first email.

Then I tried to create the journal using mke2fs as you described, but
jbd2_journal_load() fails when trying to load it.


To summarize, these are my questions:

- Why does __bread() get stuck when called from inside a dm map
function? It looks like it's waiting on a buffer_head, but why?
- What is the reason behind the j_inode check in jbd2_journal_create()?
- Does mke2fs -O journal_dev create a journal that jbd2_journal_load()
is supposed to read without any knowledge of ext2/3/4 stuff? If not,
how can I create such a journal? I'll be looking at the e2fsprogs
code for the answer to this question later today (I haven't looked at
it yet).

Obviously, I'm not expecting long detailed answers; any tip on where I
can find them would be greatly appreciated.

Thanks a lot,
Alberto


2008-12-25 15:52:48

by Theodore Ts'o

Subject: Re: jbd2 inside a device mapper module

On Thu, Dec 25, 2008 at 12:35:35PM -0200, Alberto Bertogli wrote:
>
> Thanks. I've found and tried that (that's what I meant with the
> paragraph you quote), but I couldn't make it work.

See attached transcript. I did it using lvm/dm just to show it's not
a device-mapper problem.

> The first problem I stumbled upon was that jbd2_journal_create() doesn't
> like journals initialized using jbd2_journal_init_dev() (because it has
> no j_inode). I had two choices: either try to create the journal some other
> way, or remove the j_inode test in jbd2_journal_create().

ext4_journal_create is ancient code dating back to ext3/jbd, and even
there it's code which has been obsolete for about 6-7 years. In fact,
I plan to remove ext4_journal_create, the journal_inum mount option,
and jbd2_journal_init_dev, because the supported way of creating a
journal is using mke2fs. I need to double check and make sure ocfs2
isn't using jbd2_journal_init_dev before I remove it from the jbd2
layer, but really, this sort of thing should be done all in userspace.

> Then I tried to create the journal using mke2fs as you described, but
> jbd2_journal_load() fails when trying to load it.

See attached. Works fine for me.

> - Why does __bread() get stuck when called from inside a dm map
> function? It looks like it's waiting on a buffer_head, but why?

I'm not a dm guy, so I can't answer this, but I suspect the issue may
be a lock ordering issue.

> - What is the reason behind the j_inode check in jbd2_journal_create()?

jbd2_journal_create was only designed for creating inode-based
journals, and it's a deprecated function that will likely be removed
soon.

> - Does mke2fs -O journal_dev create a journal that jbd2_journal_load()
> is supposed to read without any knowledge of ext2/3/4 stuff? If not,
> how can I create such a journal? I'll be looking at the e2fsprogs
> code for the answer to this question later today (I haven't looked at
> it yet).

mke2fs -O journal_dev creates an external journal, but when you create
a filesystem, you need to specify the location of the
external journal. Hence:

mke2fs -O journal_dev /dev/extern_journal_dev
mke2fs -t ext4 -J device=/dev/extern_journal_dev /dev/filesystem_dev

As I said in my last message, I've tested it, and it works Just Fine.

- Ted

Script started on Thu 25 Dec 2008 10:22:11 AM EST
Top-level shell (parent script)
Using forwarded ssh authentication socket
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy%
ext3root thunk -wi-a- 15.00G
footest thunk -wi-a- 1.00G
foresight thunk -wi-a- 5.00G
old-root thunk -wi-a- 128.00G
rmake thunk -wi-a- 2.00G
root thunk -wi-ao 128.00G
sff-torrent thunk -wi-a- 7.00G
testext4 thunk -wi-a- 1.00G
# mke2fs -O journal_dev /dev/thunk/footest
mke2fs 1.41.3 (12-Oct-2008)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 262144 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done
# mke2fs -t ext4 -J device=/dev/thunk/footest /dev/thunk/testext4
mke2fs 1.41.3 (12-Oct-2008)
Using journal device's blocksize: 4096
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376

Writing inode tables: done
Adding journal to device /dev/thunk/footest: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
# dumpe2fs -h /dev/thunk/testext4
dumpe2fs 1.41.3 (12-Oct-2008)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 47b3315f-7b0d-40ab-995e-de1ddaaf3528
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 65536
Block count: 262144
Reserved block count: 13107
Free blocks: 257701
Free inodes: 65525
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 63
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Thu Dec 25 10:23:12 2008
Last mount time: n/a
Last write time: Thu Dec 25 10:23:12 2008
Mount count: 0
Maximum mount count: 29
Last checked: Thu Dec 25 10:23:12 2008
Check interval: 15552000 (6 months)
Next check after: Tue Jun 23 11:23:12 2009
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal UUID: 484902c6-34a5-4cd2-9f66-02a3251bfc9e
Journal device: 0xfe06
Default directory hash: half_md4
Directory Hash Seed: 2889d0e3-ca37-443d-b9a3-12e3b0e26d70

# mount /dev/thunk/testext4 /mnt
# df /mnt
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/thunk-testext4
1032088 1284 978376 1% /mnt
# umount /mnt
# exit

Script done on Thu 25 Dec 2008 10:23:37 AM EST

2008-12-26 00:02:07

by Alberto Bertogli

Subject: Re: jbd2 inside a device mapper module

On Thu, Dec 25, 2008 at 10:52:48AM -0500, Theodore Tso wrote:
> On Thu, Dec 25, 2008 at 12:35:35PM -0200, Alberto Bertogli wrote:
> > - What is the reason behind the j_inode check in jbd2_journal_create()?
>
> jbd2_journal_create was only designed for creating inode-based
> journals, and it's a deprecated function that will likely be removed
> soon.

Thanks, I didn't know that!


> > - Does mke2fs -O journal_dev create a journal that jbd2_journal_load()
> > is supposed to read without any knowledge of ext2/3/4 stuff? If not,
> > how can I create such a journal? I'll be looking at the e2fsprogs
> > code for the answer to this question later today (I haven't looked at
> > it yet).
>
> mke2fs -O journal_dev creates an external journal, but when you create
> a filesystem, you need to specify the location of the
> external journal. Hence:
>
> mke2fs -O journal_dev /dev/extern_journal_dev
> mke2fs -t ext4 -J device=/dev/extern_journal_dev /dev/filesystem_dev
>
> As I said in my last message, I've tested it, and it works Just Fine.

I think I'm not explaining myself correctly. My code has _nothing_ to do
with ext2/3/4 (or any other filesystem) whatsoever. I'm not using the
journal as an external one for a filesystem. I want to use it to be able
to do atomic writes in my own, filesystem-independent, device-mapper
code.

After what you told me (both this and the deprecation of
jbd2_journal_create()), I took a look at e2fsprogs' source. From what I
could see, "mke2fs -O journal_dev" creates the external journal inside
some ext2/3/4 structures, which caused my journal-loading code to fail
(because it doesn't know about ext stuff).

So, I wrote a small "mkjournal" utility that creates a journal on the
block device without any ext2/3/4 stuff. It's based on e2fsprogs'
mkjournal.c, except it doesn't have any ext2 stuff. And it worked great!
I'm now able to load the journal just fine.
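
For illustration, a bare journal superblock of the kind such a utility
would write can be sketched as below. This is not the actual mkjournal
code. The field offsets and constants are my reading of
journal_superblock_t in include/linux/jbd2.h (all fields big-endian);
they are an assumption and should be checked against the kernel headers
in use. The helper names put_be32 and make_journal_sb are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define JBD2_MAGIC 0xC03B3998u /* JBD2_MAGIC_NUMBER */
#define JBD2_SB_V2 4u          /* JBD2_SUPERBLOCK_V2 block type */

/* Store a 32-bit value in big-endian byte order. */
static inline void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Fill `block` (blocksize bytes) with a minimal superblock for an empty
 * journal of `nblocks` blocks. The caller writes it to the first block
 * of the journal area; no ext2/3/4 wrapper is involved.
 */
static inline void make_journal_sb(uint8_t *block, uint32_t blocksize,
                                   uint32_t nblocks)
{
    assert(blocksize >= 1024 && nblocks > 1);
    memset(block, 0, blocksize);
    put_be32(block + 0, JBD2_MAGIC);  /* h_magic */
    put_be32(block + 4, JBD2_SB_V2);  /* h_blocktype */
    put_be32(block + 12, blocksize);  /* s_blocksize */
    put_be32(block + 16, nblocks);    /* s_maxlen */
    put_be32(block + 20, 1);          /* s_first: first usable log block */
    put_be32(block + 24, 1);          /* s_sequence: first expected commit ID */
    /* s_start (offset 28) stays 0, which marks the log as empty. */
}
```

The kernel may impose further constraints (for example a minimum journal
length), so treat this as a starting point rather than a complete format
description.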

Thanks a lot for all the help!

Alberto


2008-12-26 03:37:36

by Theodore Ts'o

Subject: Re: jbd2 inside a device mapper module

On Thu, Dec 25, 2008 at 10:00:05PM -0200, Alberto Bertogli wrote:
>
> I think I'm not explaining myself correctly. My code has _nothing_ to do
> with ext2/3/4 (or any other filesystem) whatsoever. I'm not using the
> journal as an external one for a filesystem. I want to use it to be able
> to do atomic writes in my own, filesystem-independent, device-mapper
> code.

How many block writes are you batching into a single transaction? If
you're not careful you may find that performance overhead will be
quite expensive.

> After what you told me (both this and the deprecation of
> jbd2_journal_create()), I took a look at e2fsprogs' source. From what I
> could see, "mke2fs -O journal_dev" creates the external journal inside
> some ext2/3/4 structures, which caused my journal-loading code to fail
> (because it doesn't know about ext stuff).

Yes, this is necessary because in a production system you need to be
able to identify the external journal by UUID, and the ext2/3/4
superblock makes it easy to add a label, UUID, et al. It also
significantly lowers the chance that an external journal will get
misidentified as some other filesystem based on the data stored in the
journal.

- Ted

2008-12-26 16:19:11

by Alberto Bertogli

Subject: Re: jbd2 inside a device mapper module

On Thu, Dec 25, 2008 at 10:37:36PM -0500, Theodore Tso wrote:
> On Thu, Dec 25, 2008 at 10:00:05PM -0200, Alberto Bertogli wrote:
> >
> > I think I'm not explaining myself correctly. My code has _nothing_ to do
> > with ext2/3/4 (or any other filesystem) whatsoever. I'm not using the
> > journal as an external one for a filesystem. I want to use it to be able
> > to do atomic writes in my own, filesystem-independent, device-mapper
> > code.
>
> How many block writes are you batching into a single transaction? If
> you're not careful you may find that performance overhead will be
> quite expensive.

At this moment I'm trying to keep it simple, so I plan to batch two for
each sector written to the device: one for the metadata and one for the
data.


> > After what you told me (both this and the deprecation of
> > jbd2_journal_create()), I took a look at e2fsprogs' source. From what I
> > could see, "mke2fs -O journal_dev" creates the external journal inside
> > some ext2/3/4 structures, which caused my journal-loading code to fail
> > (because it doesn't know about ext stuff).
>
> Yes, this is necessary because in a production system you need to be
> able to identify the external journal by UUID, and the ext2/3/4
> superblock makes it easy to add a label, UUID, et al. It also
> significantly lowers the chance that an external journal will get
> misidentified as some other filesystem based on the data stored in the
> journal.

Yes, it makes sense. I've reserved the first sector for that purpose.

Thanks a lot,
Alberto


2008-12-26 18:06:42

by Theodore Ts'o

Subject: Re: jbd2 inside a device mapper module

On Fri, Dec 26, 2008 at 02:17:08PM -0200, Alberto Bertogli wrote:
>
> At this moment I'm trying to keep it simple, so I plan to batch two for
> each sector written to the device: one for the metadata and one for the
> data.
>

I think I can pretty much guarantee that your performance will be so
horrible that it won't be worth using.

> > Yes, this is necessary because in a production system you need to be
> > able to identify the external journal by UUID, and the ext2/3/4
> > superblock makes it easy to add a label, UUID, et al. It also
> > significantly lowers the chance that an external journal will get
> > misidentified as some other filesystem based on the data stored in the
> > journal.
>
> Yes, it makes sense. I've reserved the first sector for that purpose.

Why not just use the ext3/4 external journal format?

- Ted

2008-12-27 03:01:54

by Alberto Bertogli

Subject: Re: jbd2 inside a device mapper module

On Fri, Dec 26, 2008 at 01:06:42PM -0500, Theodore Tso wrote:
> On Fri, Dec 26, 2008 at 02:17:08PM -0200, Alberto Bertogli wrote:
> >
> > At this moment I'm trying to keep it simple, so I plan to batch two for
> > each sector written to the device: one for the metadata and one for the
> > data.
> >
>
> I think I can pretty much guarantee that your performance will be so
> horrible that it won't be worth using.

Thanks for the warning.

I have a couple of alternatives in mind, the most decent one at the
moment is having two metadatas (M1 and M2) for each block, and
update M1 on the first write to the given block, M2 on the second, M1 on
the third, and so on.

So, if a block holds "A" and M1 holds crc("A"), and the user wants
to write "B" to the block, I would first write crc("B") in M2, and then
write "B" to the block.

The biggest problem I can see with this approach is that I need a
timestamp on the metadata so I can determine where to write (M1 or
M2).

And I'm not sure if it'd perform better than the journal, though.
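
A rough sketch of the M1/M2 idea in userspace C. It assumes a per-slot
sequence counter in place of a timestamp, and it takes already-computed
checksums as input. The struct and function names are hypothetical, and
counter wraparound is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct imd_slot {
    uint32_t crc; /* checksum of one version of the data block */
    uint32_t seq; /* higher value means written more recently */
};

struct imd_pair {
    struct imd_slot m[2]; /* M1 and M2 */
};

/* Index of the slot that was written least recently. */
static inline int older_slot(const struct imd_pair *p)
{
    return p->m[0].seq <= p->m[1].seq ? 0 : 1;
}

/* Before writing new data: record its checksum in the older slot. */
static inline void record_intent(struct imd_pair *p, uint32_t new_crc)
{
    assert(p != NULL);
    int victim = older_slot(p);
    int newest = !victim;
    p->m[victim].crc = new_crc;
    p->m[victim].seq = p->m[newest].seq + 1;
}

/* On read: the data is acceptable if it matches either slot. */
static inline int verify(const struct imd_pair *p, uint32_t data_crc)
{
    return p->m[0].crc == data_crc || p->m[1].crc == data_crc;
}
```

If a crash happens after the metadata write but before the data write,
the block still holds "A" and the other slot still holds crc("A"), so
verification succeeds for either outcome.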

Do you have any suggestions as to how I can handle this issue?


> > > Yes, this is necessary because in a production system you need to be
> > > able to identify the external journal by UUID, and the ext2/3/4
> > > superblock makes it easy to add a label, UUID, et al. It also
> > > significantly lowers the chance that an external journal will get
> > > misidentified as some other filesystem based on the data stored in the
> > > journal.
> >
> > Yes, it makes sense. I've reserved the first sector for that purpose.
>
> Why not just use the ext3/4 external journal format?

Wouldn't that lead to confusion, because people can think the device
holds an ext3/4 external journal, while it actually holds a
device-mapper backing device that happens to contain a journal?

What would be the advantages of using the ext3/4 journal format, over a
simple initial sector and the journal following?

Thanks,
Alberto


2008-12-27 19:29:50

by Theodore Ts'o

Subject: Re: jbd2 inside a device mapper module

On Sat, Dec 27, 2008 at 01:00:20AM -0200, Alberto Bertogli wrote:
> I have a couple of alternatives in mind, the most decent one at the
> moment is having two metadatas (M1 and M2) for each block, and
> update M1 on the first write to the given block, M2 on the second, M1 on
> the third, and so on.

I don't see how this would help. You still have to do synchronous
writes for safety, which is what is going to kill your performance.

What you want to do is to batch as many writes as possible. Until the
underlying filesystem requests a flush, you can afford to hold off
writing the block to disk. Otherwise, you'll end up turning each 4k
write into two 8k synchronous writes, which will be a performance
disaster. If you hold off, it's much more likely that you'll be
able to batch a large number of blocks into a single transaction.
Also, if a block gets modified multiple times (for example, with an
inode table block where tar writes one file, and then another), if you
hold off the write as long as possible, you can write the inode
table block only once, instead of multiple times.

Note that this means that you have to wait until the last minute to
calculate the checksum, since the buffer could be modified after the
write request. OCFS2 does this, by using a commit-time callback to
calculate the checksums used.
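
The batching idea can be illustrated with a small userspace sketch:
writes to the same block are coalesced in memory, and checksums are
computed only at flush time. Everything here is an assumption for
illustration, including the fixed-size table, the names (write_batch,
batch_write, batch_flush) and the byte-sum stand-in for a real checksum.
The actual disk I/O and journal commit are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE  4096u
#define MAX_PENDING 8u

struct pending_block {
    uint64_t blocknr;
    uint8_t data[BLOCK_SIZE];
};

struct write_batch {
    struct pending_block slot[MAX_PENDING];
    size_t used;
};

/* Stand-in checksum (plain byte sum); a real module would use a CRC. */
static inline uint32_t toy_csum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];
    return sum;
}

/* Queue a write. Returns 0 on success, -1 if the batch must be flushed first. */
static inline int batch_write(struct write_batch *b, uint64_t blocknr,
                              const uint8_t *data)
{
    for (size_t i = 0; i < b->used; i++) {
        if (b->slot[i].blocknr == blocknr) {
            /* Block already pending: overwrite in place, no extra write. */
            memcpy(b->slot[i].data, data, BLOCK_SIZE);
            return 0;
        }
    }
    if (b->used == MAX_PENDING)
        return -1;
    b->slot[b->used].blocknr = blocknr;
    memcpy(b->slot[b->used].data, data, BLOCK_SIZE);
    b->used++;
    return 0;
}

/* Flush: checksum each pending block now, then empty the batch. */
static inline size_t batch_flush(struct write_batch *b,
                                 uint32_t csums[MAX_PENDING])
{
    assert(b->used <= MAX_PENDING);
    size_t n = b->used;
    for (size_t i = 0; i < n; i++)
        csums[i] = toy_csum(b->slot[i].data, BLOCK_SIZE);
    b->used = 0;
    return n;
}
```

Two writes to block 7 followed by one to block 9 leave only two pending
blocks, and the checksum for block 7 reflects its final contents.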

The bottom line is that doing something like this in an efficient way
is tricky.

> > Why not just use the ext3/4 external journal format?
>
> Wouldn't that lead to confusion, because people can think the device
> holds an ext3/4 external journal, while it actually holds a
> device-mapper backing device that happens to contain a journal?

Not really; the external journal has a label and uuid, and the journal
superblock has a place to store the uuid of the "client" of the
journal. So there is plenty of information available to tie an
external journal to some device-mapper backing device.

> What would be the advantages of using the ext3/4 journal format, over a
> simple initial sector and the journal following?

There are already existing tools to find the external journal, using the
blkid library. So you only have to store the UUID of the journal in
the superblock of the device-mapper backing device, and then you can
easily find the external journal as follows:

journal_fn = blkid_get_devname(ctx->blkid, "UUID", uuid);

- Ted

2008-12-27 20:01:46

by Andreas Dilger

Subject: Re: jbd2 inside a device mapper module

On Dec 25, 2008 12:35 -0200, Alberto Bertogli wrote:
> On Wed, Dec 24, 2008 at 06:49:15PM -0500, Theodore Tso wrote:
> > I have no idea why you would need to do manual __breads(). No doubt
> > I'm missing some context here.
>
> I'm writing (just for fun and learning purposes) a device mapper module
> that stores checksums on writes and verifies them on reads. The
> integrity metadata (currently just the checksum) is interleaved in the
> backing device: one sector holding the integrity metadata for the
> following 64 data sectors.

Alex and I discussed implementing checksums for ext4 using an external
device like this, and he might have some more design information for
you.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-12-29 06:20:14

by shyam_iyer

Subject: RE: Re: jbd2 inside a device mapper module

Andreas Dilger wrote:
> On Dec 25, 2008 12:35 -0200, Alberto Bertogli wrote:
> > On Wed, Dec 24, 2008 at 06:49:15PM -0500, Theodore Tso wrote:
> > > I have no idea why you would need to do manual __breads(). No doubt
> > > I'm missing some context here.
> >
> > I'm writing (just for fun and learning purposes) a device mapper
> > module that stores checksums on writes and verifies them on reads. The
> > integrity metadata (currently just the checksum) is interleaved in the
> > backing device: one sector holding the integrity metadata for the
> > following 64 data sectors.
>
> Alex and I discussed implementing checksums for ext4 using an external
> device like this, and he might have some more design information for
> you.


That external device could possibly be a TPM chip that can store
checksums.

Shyam Iyer
Dell Linux Engineering



2008-12-29 21:08:43

by Alberto Bertogli

Subject: Re: [dm-devel] Re: jbd2 inside a device mapper module

On Mon, Dec 29, 2008 at 11:50:14AM +0530, [email protected] wrote:
> Andreas Dilger wrote:
> > On Dec 25, 2008 12:35 -0200, Alberto Bertogli wrote:
> > > On Wed, Dec 24, 2008 at 06:49:15PM -0500, Theodore Tso wrote:
> > > > I have no idea why you would need to do manual __breads(). No doubt
> > > > I'm missing some context here.
> > >
> > > I'm writing (just for fun and learning purposes) a device mapper
> > > module that stores checksums on writes and verifies them on reads. The
> > > integrity metadata (currently just the checksum) is interleaved in the
> > > backing device: one sector holding the integrity metadata for the
> > > following 64 data sectors.
> >
> > Alex and I discussed implementing checksums for ext4 using an external
> > device like this, and he might have some more design information for
> > you.
>
>
> That external device could possibly be a TPM chip that can store
> checksums.

Thanks for the suggestion. The code I have at the moment (without the
journal stuff) already has the capability of storing checksums in a
different device. It's one of the reasons why I would prefer to avoid
using jbd.

I think I'll go with the "two metadatas" approach and see how it goes.
Worst case scenario is that I have to drop that code, which means being
back where I am now, only with one less option.

Thanks,
Alberto


2008-12-29 21:31:48

by Alberto Bertogli

[permalink] [raw]
Subject: Re: jbd2 inside a device mapper module

On Sat, Dec 27, 2008 at 02:29:50PM -0500, Theodore Tso wrote:
> On Sat, Dec 27, 2008 at 01:00:20AM -0200, Alberto Bertogli wrote:
> > I have a couple of alternatives in mind, the most decent one at the
> > moment is having two metadatas (M1 and M2) for each block, and
> > update M1 on the first write to the given block, M2 on the second, M1 on
> > the third, and so on.
>
> I don't see how this would help. You still have to do synchronous
> writes for safety, which is what is going to kill your performance.

I was thinking of queueing the writes to the metadata, and then queueing
the writes of the data marked with bio_barrier(); when the data write
completes, I end the original bio.

Although if the metadata is on a different device, I do have to wait
for the metadata to be written because the barrier is useless; but OTOH
if I use a journal I can't split my data and metadata in two different
devices, can I? (without using two journals or doing more complex
stuff).
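
As an illustration of the "two metadatas" idea quoted above, here is one
possible reading of it in plain C. It is only a sketch under several
assumptions that are not in the thread: each slot carries a write counter
(gen) so the older slot can be picked for the next update, the metadata is
written before the data, and Adler-32 stands in for whatever checksum the
module really uses. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct imd_slot {
    uint32_t csum;
    uint32_t gen;            /* assumed write counter; wraparound ignored */
};

struct imd_entry {
    struct imd_slot m[2];    /* m[0] is M1, m[1] is M2 */
};

/* Adler-32, used here only as a stand-in checksum. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Before writing the data: store its checksum in the older slot, so the
 * slot describing the data currently on disk is left untouched. */
static void imd_prepare_write(struct imd_entry *e, const uint8_t *buf,
                              size_t len)
{
    assert(e != NULL && buf != NULL);

    /* On a tie (fresh entry) treat M2 as newest, so M1 is updated first. */
    int newest = e->m[0].gen > e->m[1].gen ? 0 : 1;
    int victim = 1 - newest;

    e->m[victim].csum = toy_csum(buf, len);
    e->m[victim].gen = e->m[newest].gen + 1;
}

/* On read: the data is accepted if it matches either slot. A match on the
 * older slot means the metadata write landed but the data write did not. */
static bool imd_verify(const struct imd_entry *e, const uint8_t *buf,
                       size_t len)
{
    uint32_t c = toy_csum(buf, len);

    return c == e->m[0].csum || c == e->m[1].csum;
}
```

The point of alternating is that a crash between the metadata write and the
data write still leaves one slot matching the old data; the ordering between
the two writes (the barrier discussed above) is still required.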


> What you want to do is to batch as many writes as possible. Until the
> underlying filesystem requests a flush, you can afford to hold off
> writing the block to disk. Otherwise, you'll end up turning each 4k

I think I can't do this at the device-mapper layer. There's a .flush
function pointer, but I think it's suspend-related; and in any case I
gave it a try and it's never called during normal operation.


> > > Why not just use the ext3/4 external journal format?
> >
> > Wouldn't that lead to confusion, because people can think the device
> > holds an ext3/4 external journal, while it actually holds a
> > device-mapper backing device that happens to contain a journal?
>
> Not really; the external journal has a label and uuid, and the journal
> superblock has a place to store the uuid of the "client" of the
> journal. So there is plenty of information available to tie an
> external journal to some device-mapper backing device.
>
> > What would be the advantages of using the ext3/4 journal format, over a
> > simple initial sector and the journal following?
>
> There are already existing tools to find the external journal, using the
> blkid library. So you only have to store the UUID of the journal in
> the superblock of the device-mapper backing device, and then you can
> easily find the external journal as follows:
>
> journal_fn = blkid_get_devname(ctx->blkid, "UUID", uuid);

Thanks a lot for the suggestions!

As I said in the other email, I'll give the writes a try and see how it
goes. If their performance sucks (which, from what you tell me, is
likely), at least I'll have something that works.

Thanks a lot,
Alberto


2008-12-30 07:09:02

by Alex Tomas

[permalink] [raw]
Subject: Re: [dm-devel] Re: jbd2 inside a device mapper module

one good thing about JBD is that you can't update target block and csum
atomically. so, either you use some form of COW or you use journalling.
given we already have JBD it'd make sense to use it?

thanks, Alex

Alberto Bertogli wrote:
> I think I'll go with the "two metadatas" approach and see how it goes.
> Worst case scenario is that I have to drop that code, which means being
> back where I am now, only with one less option.
>
> Thanks,
> Alberto
>


2008-12-30 13:54:34

by Alberto Bertogli

[permalink] [raw]
Subject: Re: [dm-devel] Re: jbd2 inside a device mapper module

On Tue, Dec 30, 2008 at 09:55:57AM +0300, Alex Tomas wrote:
> one good thing about JBD is that you can't update target block and csum
> atomically. so, either you use some form of COW or you use journalling.
> given we already have JBD it'd make sense to use it?

I'm sorry, but I'm not following. Is that first sentence right?

The main disadvantage I see of using jbd at the moment is that I lose
the possibility of having checksums and data in a different device.

The only alternative to jbd that I have at the moment is the "two
metadatas" approach I explained in another email (but please let me know
if it wasn't clear).

They both provide what I need (atomicity in data and csum writes). One
is easier and more tested, but prevents a feature. The other is a bit more
difficult, untested and written by me, but allows a feature. I have no
idea how they will behave performance-wise (both are expected to suck,
according to the other emails).

At this moment I'm going with the two metadatas approach, because I
think it has fewer limitations and it'd be fun to write. If it then turns
out to be unfit for some reason, I can always go back and use jbd. But I'm
obviously open to suggestions and more alternatives.

Thanks a lot,
Alberto