2011-08-13 10:21:28

by Ivan Shmakov

Subject: e2dis: a Jigdo-like tool for Ext2+ FS

A couple of weeks ago I started working on a tool
(tentatively named “Ext2 disassembler”) that walks through an
Ext2+ filesystem (or an image of one) and produces a mapping of
files' (inodes') relative block numbers to the image's (or
“physical”) block numbers.

The version-that-works (apparently) is almost done, pending
upload to a publicly-accessible Git repository.

However, there's a considerable amount of work to be done
before the tool becomes really usable. Therefore, I'd
appreciate any help with it.

TIA.

Why am I interested in this?

Recently, there was a discussion in debian-devel@ on whether the
Debian project should provide images for easy deployment within
“virtual” environments (such as KVM, Xen, etc.)

Such images (which, I assume, will use a filesystem supported by
e2fsprogs) are going to be quite large: from hundreds of MiB to a
few GiB (depending on the intended usage) per architecture per
version.

Earlier, to reduce the burden of mirroring ISO 9660 (CD,
DVD, etc.) images, the Jigdo (for Jigsaw Download) tool was
introduced. The tool uses SHA-1 to associate pieces of a
filesystem image with the contents of the files of a specified
set. As a result, the tool produces an association map, in
which the parts of the image for which no matching files are
known are embedded. (A helper file, which contains the URIs the
files may be downloaded from, is also generated.)

Given such an association map, and the files, the tool is
capable of restoring the image.

The tool is filesystem-agnostic. Unfortunately, it relies on
the fact that files on an ISO 9660 filesystem are never
fragmented, which doesn't hold for Ext2+.

However, given the knowledge of the filesystem, it's possible to
solve the task of describing the parts of a given image as being
parts of the files specified.
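
A toy illustration of the difference, in Python. The 4-byte block size, the image layout, and the helper name are invented for the example; none of this is taken from Jigdo or e2dis. It only shows why a search for a file as one contiguous run fails on a fragmented image, while a per-block map still describes the file completely:

```python
BLOCK = 4  # toy block size in bytes (assumption; real Ext2+ blocks are 1 KiB or larger)

file_data = b"AAAABBBBCCCC"  # a three-block file

# Toy "image": the file's blocks are stored out of order,
# with unrelated data between them.
image = b"mmmm" + b"BBBB" + b"xxxx" + b"AAAA" + b"CCCC"


def blocks(data):
    """Split data into BLOCK-sized pieces."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


# A filesystem-agnostic search for the whole file as one contiguous run fails...
found_contiguous = file_data in image

# ...whereas a map of logical block -> physical block describes it fully.
# Here the map is found by searching; a filesystem-aware tool would
# read it from the inode instead.
image_blocks = blocks(image)
block_map = {n: image_blocks.index(b) for n, b in enumerate(blocks(file_data))}

print(found_contiguous)  # False
print(block_map)         # {0: 3, 1: 1, 2: 4}
```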

Done

The tool iterates over the inodes, and records the
logical-to-physical block correspondence. All the “chunks”
belonging to the same inode are marked as such.

The mapping is written to a SQLite database.
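
A minimal sketch of this recording step in Python, for illustration only. The actual tool is built on libext2fs, and the table layout, column names, and function names below are assumptions rather than the schema e2dis uses. Consecutive (logical, physical) block pairs are merged into chunks, and each chunk is stored together with its inode number:

```python
import sqlite3


def coalesce(pairs):
    """Merge sorted (logical, physical) block pairs into
    (logical, physical, nblocks) chunks of consecutive blocks."""
    chunks = []
    for logical, physical in pairs:
        if chunks:
            l0, p0, n = chunks[-1]
            # Extend the last chunk if this block continues it on both sides.
            if logical == l0 + n and physical == p0 + n:
                chunks[-1] = (l0, p0, n + 1)
                continue
        chunks.append((logical, physical, 1))
    return chunks


def record(db, inode, pairs):
    """Store the chunks of one inode (hypothetical schema)."""
    db.executemany(
        "INSERT INTO chunk (inode, logical, physical, nblocks)"
        " VALUES (?, ?, ?, ?)",
        [(inode, l, p, n) for l, p, n in coalesce(pairs)],
    )


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE chunk"
    " (inode INTEGER, logical INTEGER, physical INTEGER, nblocks INTEGER)"
)

# Made-up inode 12: logical blocks 0-2 at physical 100-102,
# then logical blocks 3-4 at physical 200-201.
record(db, 12, [(0, 100), (1, 101), (2, 102), (3, 200), (4, 201)])

rows = db.execute(
    "SELECT logical, physical, nblocks FROM chunk"
    " WHERE inode = 12 ORDER BY logical"
).fetchall()
print(rows)  # [(0, 100, 3), (3, 200, 2)]
```

Storing runs instead of single blocks keeps the table small for files that are mostly contiguous.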

To do

Message digests are to be computed and recorded as well.

Non-payload blocks are to be annotated as well.

A tool to reassemble the image.

Command line interface. (Preferably compliant with the GNU Coding
Standards.)
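
To make the reassembly and digest items concrete, here is a hedged Python sketch. The 4-byte block size, the chunk tuple layout, and the function name are assumptions made for the example and do not describe e2dis's actual formats. The image is rebuilt from file contents plus the leftover (non-file) blocks, and a SHA-1 digest of the whole image confirms that the result is identical to the original:

```python
import hashlib

BLOCK = 4  # toy block size in bytes (assumption)


def reassemble(size_blocks, chunks, files, leftovers):
    """Rebuild an image of size_blocks blocks.

    chunks:    (file_key, logical, physical, nblocks) tuples
    files:     file_key -> file contents (bytes)
    leftovers: physical block number -> bytes, for blocks
               that belong to no known file
    """
    image = bytearray(size_blocks * BLOCK)
    for key, logical, physical, n in chunks:
        data = files[key][logical * BLOCK:(logical + n) * BLOCK]
        start = physical * BLOCK
        image[start:start + len(data)] = data
    for physical, data in leftovers.items():
        image[physical * BLOCK:(physical + 1) * BLOCK] = data
    return bytes(image)


# Toy original image: the file is stored out of order between other blocks.
original = b"mmmm" + b"CCCC" + b"AAAA" + b"BBBB" + b"xxxx"
expected_digest = hashlib.sha1(original).hexdigest()

files = {"f": b"AAAABBBBCCCC"}
chunks = [("f", 0, 2, 2), ("f", 2, 1, 1)]   # logical 0-1 -> 2-3, logical 2 -> 1
leftovers = {0: b"mmmm", 4: b"xxxx"}        # blocks not covered by any file

rebuilt = reassemble(5, chunks, files, leftovers)
print(hashlib.sha1(rebuilt).hexdigest() == expected_digest)  # True
```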

--
FSF associate member #7257


2011-08-14 06:56:49

by Ivan Shmakov


2011-08-15 09:30:01

by Lukas Czerner

Subject: Re: e2dis: a Jigdo-like tool for Ext2+ FS

On Sat, 13 Aug 2011, Ivan Shmakov wrote:

> A couple of weeks ago I started working on a tool
> (tentatively named “Ext2 disassembler”) that walks through an
> Ext2+ filesystem (or an image of one) and produces a mapping of
> files' (inodes') relative block numbers to the image's (or
> “physical”) block numbers.

Hi Ivan,

I have not seen your code, but that sounds like something that debugfs
(part of e2fsprogs) is already doing very well (and a lot more). This is
exactly the "extN disassembler" you're talking about, and with a little
bit of scripting around it you should be able to dig any information you
desire from the file system, so I do not think that a new application is
needed. But I might be wrong, just take a look at it.

Thanks!
-Lukas


--

2011-08-15 11:10:48

by Ivan Shmakov

Subject: Re: e2dis: a Jigdo-like tool for Ext2+ FS

>>>>> Lukas Czerner writes:
>>>>> On Sat, 13 Aug 2011, Ivan Shmakov wrote:

>> A couple of weeks ago I started working on a tool (tentatively
>> named “Ext2 disassembler”) that walks through an Ext2+ filesystem (or
>> an image of one) and produces a mapping of files' (inodes') relative
>> block numbers to the image's (or “physical”) block numbers.

> I have not seen your code, but that sounds like something that
> debugfs (part of e2fsprogs) is already doing very well (and a lot
> more). This is exactly the "extN disassembler" you're talking about

Not quite. The meaning of “disassembler” here is that the image
is torn into parts, which could later be assembled together to
form exactly the same image (by an “image assembler” tool.)

It's not implied that e2dis will ever produce some sort of
human-readable output (as its primary result.) For that,
debugfs(8) should indeed suffice.

> and with a little bit of scripting around it you should be able to dig
> any information you desire from the file system, so I do not think
> that a new application is needed. But I might be wrong, just take a
> look at it.

Indeed, my first try was to use debugfs(8). However, there are
several issues with it:

• I see no way to obtain the list of used inodes in debugfs(8)
(as of 1.41.12 debian 2); therefore, I had to resort to
trying the ‘stat’ command on every possible inode number;

• also, the (binary) filesystem data is serialized into ASCII by
debugfs(8) and is parsed afterwards by the invoking tool,
which is computationally inefficient (especially if applied
to a filesystem with a size on the order of several GiB and
a number of used inodes on the order of tens of thousands,
or more);

• moreover, I see no claims that the output of the debugfs(8)
‘stat’ command won't ever change (nor do I see a formal
description of the aforementioned output: its source is the
only form of specification I could rely on); my guess is that the
C API, being documented, is going to be much more stable.

That being said, most of the code I've written so far is
concerned /not/ with the filesystems per se (i.e., libext2fs
calls), but with data recording: representing the data in a
compact way, interfacing with SQLite, etc. (The SHA-1 computation
and GNU-style CLI will require some coding as well, thus making
the Ext2+ FS-specific parts even smaller when compared to the
overall code size.)

[…]

--
FSF associate member #7257

2011-08-15 16:12:57

by Lukas Czerner

Subject: Re: e2dis: a Jigdo-like tool for Ext2+ FS

On Mon, 15 Aug 2011, Ivan Shmakov wrote:

> >>>>> Lukas Czerner writes:
> >>>>> On Sat, 13 Aug 2011, Ivan Shmakov wrote:
>
> >> A couple of weeks ago I started working on a tool (tentatively
> >> named “Ext2 disassembler”) that walks through an Ext2+ filesystem (or
> >> an image of one) and produces a mapping of files' (inodes') relative
> >> block numbers to the image's (or “physical”) block numbers.
>
> > I have not seen your code, but that sounds like something that
> > debugfs (part of e2fsprogs) is already doing very well (and a lot
> > more). This is exactly the "extN disassembler" you're talking about
>
> Not quite. The meaning of “disassembler” here is that the image
> is torn into parts, which could later be assembled together to
> form exactly the same image (by an “image assembler” tool.)

Ok then, I have misunderstood your intentions. I thought that you needed
to get the logical-to-physical mappings of inodes.

>
> It's not implied that e2dis will ever produce some sort of
> human-readable output (as its primary result.) For that,
> debugfs(8) should indeed suffice.

Ok, I was just implying that you can use debugfs as a tool to figure out
what to read, e.g. which physical blocks belong to which inode. If you
have already tried debugfs and it did not suit your needs, I am ok with
that.

>
> > and with a little bit of scripting around it you should be able to dig
> > any information you desire from the file system, so I do not think
> > that a new application is needed. But I might be wrong, just take a
> > look at it.
>
> Indeed, my first try was to use debugfs(8). However, there are
> several issues with it:
>
> • I see no way to obtain the list of used inodes in debugfs(8)
> (as of 1.41.12 debian 2); therefore, I had to resort to
> trying the ‘stat’ command on every possible inode number;

I am not sure if there is a way to list used inodes in debugfs but it
should be very easy to implement.

>
> • also, the (binary) filesystem data is serialized into ASCII by
> debugfs(8) and is parsed afterwards by the invoking tool,
> which is computationally inefficient (especially if applied
> to a filesystem with a size on the order of several GiB and
> a number of used inodes on the order of tens of thousands,
> or more);

Oh, I was not trying to say that you should use debugfs to dig data out,
but rather to get a hint of where the data lies in the image.

>
> • moreover, I see no claims that the output of the debugfs(8)
> ‘stat’ command won't ever change (nor do I see a formal
> description of the aforementioned output: its source is the
> only form of specification I could rely on); my guess is that the
> C API, being documented, is going to be much more stable.

Well, you can probably say that about every tool, but that is not a good
enough reason to duplicate the code for everything. Although I am not
saying that you're doing so.

Thanks!
-Lukas


2011-08-17 05:49:26

by Ivan Shmakov

Subject: Re: e2dis: a Jigdo-like tool for Ext2+ FS

>>>>> Lukas Czerner writes:
>>>>> On Mon, 15 Aug 2011, Ivan Shmakov wrote:

BTW, the primary Git repository for the project is now located
at Gitorious:

git://gitorious.org/e2dis/e2dis-devel.git
http://gitorious.org/e2dis/e2dis-devel.git
https://gitorious.org/e2dis/e2dis-devel

The most notable improvement that I made recently is that the
payload message digests are now recorded. (The support for the
whole-image and metadata message digests is not yet committed.)

[…]

>> • moreover, I see no claims that the output of the debugfs(8) ‘stat’
>> command won't ever change (nor do I see a formal description of
>> the aforementioned output: its source is the only form of
>> specification I could rely on); my guess is that the C API, being
>> documented, is going to be much more stable.

> Well, you can probably say that about every tool,

Why, there are plenty of tools that support either some sort of
standardized output, or output that's user-defined (based on
some standardized items.) E.g., the output of ls(1) and
date(1) is defined by POSIX (IEEE Std 1003.1-2008), etc.

> but that is not a good enough reason to duplicate the code for
> everything. Although I am not saying that you're doing so.

I believe that there's very little, if any, code duplication
between e2dis and debugfs.

[…]

--
FSF associate member #7257 Coming soon: Software Freedom Day
http://mail.sf-day.org/lists/listinfo/ planning-ru (ru), sfd-discuss (en)