2004-04-13 15:44:49

by Guillaume Lacôte

[permalink] [raw]
Subject: Using compression before encryption in device-mapper

Hi,

I hope this is the right place to post this message; I tried to keep it small.
Basically I really would like to implement compression at the dm level,
despite all of the problems. The reason for this is that reducing redundancy
through compression tremendously reduces an attacker's chances of success. I
had implemented this idea in a Java archiver
( http://jsam.sourceforge.net ).

Although I am not a good kernel hacker, I have spent some time reading
compressed-loop.c, loop-aes, dm-crypt.c, and various threads from lkml
including http://www.uwsg.iu.edu/hypermail/linux/kernel/0402.2/0035.html
Thus I would appreciate if you could answer the following questions regarding
the implementation of a "dm-compress" dm personality.

0) Has this problem already been addressed, and if so, where ?

1) Using dm: I want to be able to use compression both above dm (to compress
before encrypting a volume) and below it (to do a RAID on compressed
volumes). I assume there is no other way than to make compression a dm
personality. Is this correct (or should I use something more similar to a
compressed loop device) ?

2) Block I/O boundaries: compression does not preserve size. I plan to use a
mapping from real sectors to their compressed location (e.g. struct {
sector_t sector; size_t compressed_length; } *mapping; see the sketch below).
I believe this mapping can be stored on another dm target and buffered
dynamically. Is this correct, or should it remain in (non-swappable ?) memory ?

3) Compressed sectors have varying sizes, while dm targets only deal with full
blocks. Thus every compressed request may need to be fragmented into several
block aligned requests. This might imply reading a full block before
partially filling it with new data. Is it an exceedingly difficult task ?
Will this kill performance ?

4) Block allocation on writes: this is the most difficult problem I believe.
When rewriting a sector, its new compressed length might not be the same as
before. This would require a whole sector allocation mechanism: managing
lists of free space, optimizing dynamic allocation to reduce fragmentation,
etc. Is there any solution other than adapting the algorithms used in
e.g. ext2 ?

5) As a workaround to 2, 3 and 4, I plan to systematically allocate 2 sectors
per real sector (space efficiency is _not_ my aim, growing entropy per bit is)
and to use a trivial dynamic Huffman compression algorithm. Is this solution
(which means having only half the physically available space) acceptable ?

6) Shall this whole idea of compression be ruled out of dm and only be
implemented at the file-system level (e.g. as a plugin for ReiserFS4) ?
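
To make question (2) a bit more concrete, here is a rough userspace sketch of
the mapping table I have in mind; the typedef and the flat in-memory table are
just illustrations, not a proposed on-disk format:

#include <stdint.h>
#include <stddef.h>

typedef uint64_t sector_t;      /* userspace stand-in for the kernel type */

/* one entry per logical sector of the exported device */
struct compressed_extent {
        sector_t start;                 /* first physical sector of the data */
        size_t   compressed_length;     /* compressed payload size in bytes  */
};

static struct compressed_extent *mapping;   /* indexed by logical sector */

/* where does logical sector 'lsec' live on the real device? */
static struct compressed_extent *lookup(sector_t lsec)
{
        return &mapping[lsec];
}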

Thank you very much in advance for your criticism and advice.

G. Lacôte.



2004-04-13 16:57:11

by Timothy Miller

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper



Guillaume Lacôte wrote:
> Hi,
>
> I hope this is the right place to post this message; I tried to keep it small.
> Basically I really would like to implement compression at the dm level,
> despite all of the problems. The reason for this is that reducing redundancy
> through compression tremendously reduces the possibilities of success for an
> attacker. I had implemented this idea in a java archiver (
> http://jsam.sourceforge.net ).
>
> Although I am not a good kernel hacker, I have spent some time reading
> compressed-loop.c, loop-aes, dm-crypt.c, and various threads from lkml
> including http://www.uwsg.iu.edu/hypermail/linux/kernel/0402.2/0035.html
> Thus I would appreciate if you could answer the following questions regarding
> the implementation of a "dm-compress" dm personality.
>
[snip]

I have a suggestion. If you're compressing only for the sake of
obfuscation, then don't really try to save any space. Use a fast
compression algorithm which doesn't necessarily do a great job.

When you're going to write, compress the block. If it gets smaller,
fine. Store it in the same space it would have required even if it were
uncompressed. If the block gets bigger, then store it uncompressed.
Whether or not the block could be compressed would be stored in metadata
(in the inode, I guess).
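
In userspace terms, something like this; zlib is only a stand-in for whatever
fast compressor gets picked, and the returned flag is whatever metadata bit
(inode or otherwise) ends up holding it:

#include <string.h>
#include <zlib.h>

#define BLOCK_SIZE 4096

/* Compress one block with a fast setting.  If the result is not smaller,
 * keep the block verbatim and return 0; otherwise return 1.  The caller
 * stores the return value as its per-block "compressed" flag. */
static int pack_block(const Bytef *in, Bytef *out, uLongf *out_len)
{
        Bytef tmp[BLOCK_SIZE + BLOCK_SIZE / 1000 + 16]; /* zlib worst case + margin */
        uLongf tmp_len = sizeof(tmp);

        if (compress2(tmp, &tmp_len, in, BLOCK_SIZE, Z_BEST_SPEED) == Z_OK &&
            tmp_len < BLOCK_SIZE) {
                memcpy(out, tmp, tmp_len);
                *out_len = tmp_len;
                return 1;               /* stored compressed */
        }
        memcpy(out, in, BLOCK_SIZE);
        *out_len = BLOCK_SIZE;
        return 0;                       /* stored as-is, same size as before */
}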

2004-04-13 17:45:31

by Jörn Engel

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Tue, 13 April 2004 17:44:40 +0200, Guillaume Lacôte wrote:
>
> I hope this is the right place to post this message; I tried to keep it small.
> Basically I really would like to implement compression at the dm level,
> despite all of the problems. The reason for this is that reducing redundancy
> through compression tremendously reduces the possibilities of success for an
> attacker. I had implemented this idea in a java archiver (
> http://jsam.sourceforge.net ).
>
> Although I am not a good kernel hacker, I have spent some time reading
> compressed-loop.c, loop-aes, dm-crypt.c, and various threads from lkml
> including http://www.uwsg.iu.edu/hypermail/linux/kernel/0402.2/0035.html
> Thus I would appreciate if you could answer the following questions regarding
> the implementation of a "dm-compress" dm personality.
>
> 0) Has this problem already been adressed, and if yes, where ?

Yes, at the filesystem level. Jffs2 is usable, although not
well-suited for disks and the like; ext2compr appears to be unusable.
At the device level, I haven't heard of anything yet.

> 1) Using dm: I want to be able to use compression both above dm (to compress
> before encrypting a volume) and below it (to do a RAID on compressed
> volumes). I assume there is no other way than to make compression be a dm
> personality. Is this correct (or shall I use something more similar to a
> compressed loop device) ?

I'd go for a dm implementation.

> 2) Block I/O boundaries: compression does not preserve size. I plan to use a
> mapping from real sectors to their compressed location (e.g. struct {
> sector_t sector; size_t compressed_length }* mapping; ). I believe this
> mapping can be stored on another dm target and bufferized dynamically. Is
> this correct, or shall it remain in (non-swappable ?) memory ?
>
> 3) Compressed sectors have varying sizes, while dm targets only deal with full
> blocks. Thus every compressed request may need to be fragmented into several
> block aligned requests. This might imply reading a full block before
> partially filling it with new data. Is it an exceedingly difficult task ?
> Will this kill performance ?
>
> 4) Block allocation on writes: this is the most difficult problem I believe.
> When rewriting a sector, its new compressed length might not be the same as
> before. This would require a whole sector allocation mechanism: magaging
> lists of free space, optimizing dynamic allocation to reduce fragmentation,
> etc. Is there another solution than to adapt algorithms used in for e.g. ext2
> ?

If you really want to deal with this, you end up with a device that
can grow and shrink depending on the data. Unless you have a strange
fetish for pain, you shouldn't even think about it.

> 5) As a workaround to 2,3,4 I plan to systematically allocate 2 sectors per
> real sector (space efficiency is _not_ my aim, growing entropy per bit is)
> and to use a trivial dynamic huffman compression algorithm. Is this solution
> (which means having half less space than physically available) acceptable ?

Makes sense. One of the zlib developers actually calculated the
maximum expansion when zlib-compressing data, so you could even get
away with a net size of more than 50% of the raw device, but that makes
the code more complicated. Your call.

Performance should not be a big issue, as encryption is a performance
killer anyway.

Whether it is acceptable depends on the user. Make it optional and
let the user decide.

> 6) Shall this whole idea of compression be ruled out of dm and only be
> implemented at the file-system level (e.g. as a plugin for ReiserFS4) ?

Again, depends on the user. But from experience, there are plenty of
users who want something like this.

Jörn

--
A victorious army first wins and then seeks battle.
-- Sun Tzu

2004-04-13 19:42:58

by Ville Herva

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Tue, Apr 13, 2004 at 07:45:16PM +0200, you [Jörn Engel] wrote:
>
> Yes, on the filesystems level. Jffs2 is usable, although not well-suited
> for disks and similar, ext2compr appears to be unusable.

Quite the opposite: e2compr on 2.2.x (and 2.0 as far as I can remember) is
very stable. The 2.4 port wasn't usable in my testing, though.


-- v --

[email protected]

2004-04-14 06:48:08

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Thank you for your prompt answers.
On Tuesday 13 April 2004 18:57, Timothy Miller wrote:
[snip]
> I have a suggestion. If you're compressing only for the sake of
> obfuscation, then don't really try to save any space. Use a fast
> compression algorithm which doesn't necessarily do a great job.
I prefer speaking in terms of "entropy per bit" and "redundancy" rather than
"obfuscation", although you got the idea.
Actually I plan to use a basic dynamic Huffman coder: the rationale for this is
that there is no meta-data (and any meta-data would help an attacker), since
the weighted tree is updated as (de)compression proceeds. And from J. S.
Vitter's article (Design and Analysis of Dynamic Huffman Codes) we know that
in the worst-case scenario the Huffman encoding will use one additional bit per
byte. Thus allocating 9 native blocks for every 8 "compressed" blocks will
do, although it is a bit more complicated than the 2-to-1 scheme I had
suggested.
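
To spell out the arithmetic (a toy userspace check; the one-extra-bit-per-byte
figure is the worst case from Vitter quoted above):

#include <stdio.h>

#define SECTOR 512

/* worst case assumed above: at most one extra bit per input byte */
static unsigned long worst_case_bytes(unsigned long plain_bytes)
{
        unsigned long extra_bits = plain_bytes;         /* 1 bit per byte */
        return plain_bytes + (extra_bits + 7) / 8;
}

int main(void)
{
        unsigned long plain = 8 * SECTOR;               /* 8 logical sectors */
        unsigned long worst = worst_case_bytes(plain);

        printf("%lu plain bytes -> at most %lu bytes -> fits in %lu sectors\n",
               plain, worst, (worst + SECTOR - 1) / SECTOR);
        /* prints: 4096 plain bytes -> at most 4608 bytes -> fits in 9 sectors */
        return 0;
}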

>
> When you're going to write, compress the block. If it gets smaller,
> fine. Store it in the same space it would have required even if it were
> uncompressed. If the block gets bigger, then store it uncompressed.
> Whether or not the block could be compressed would be stored in metadata
> (in the inode, I guess).
Actually I do _not_ want to do that. The reason for that is that I want to add
yet another layer before compression, which would interleave real data with
random bytes. These random bytes are not drawn uniformly but rather drawn so
as to make the distribution over Huffman trees (and thus over the encodings)
uniform. This ensures that in order to decode my real data, an attacker has
to decode the random data first; but since all _compressed_ random sequences
are made equiprobable, there is (hopefully) no better way for him to do this
than brute force. This is the idea I have (successfully ?) implemented in
http://jsam.sourceforge.net .

Thus I still want to "compress" my data even if its size grows.

The problem I encounter, however, is that if I forcibly allocate more space
than required (e.g. 9 physical blocks for every 8 compressed blocks) I will
need padding. And padding is generally unwise, cryptographically speaking ...


2004-04-14 06:55:08

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Thank you for your answers.
On Tuesday 13 April 2004 19:45, Jörn Engel wrote:

>> 0) Has this problem already been adressed, and if yes, where ?
> Yes, on the filesystems level. Jffs2 is usable, although not
> well-suited for disks and similar, ext2compr appears to be unusable.
> On the device level, I haven't heard of anything yet.
Thank you, I didn't know about Jffs2; however I believe it is not an
implementation at the device level, which is what I would like.

> > 1) Using dm:
> I'd go for a dm implementation.
>
> > 2) Block I/O boundaries:
> > 3) Compressed sectors have varying sizes ...
> > 4) Block allocation on writes:
>
> If you really want to deal with this, you end up with a device that
> can grow and shrink depending on the data. Unless you have a strange
> fetish for pain, you shouldn't even think about it.
Since space efficiency is _not_ my aim I plan to forcibly allocate 3 physical
blocks for every 2 "compressed" blocks (as the result should (?) always fit
with a dynamic Huffman encoding).

>
> > 5) As a workaround to 2,3,4 I plan to systematically allocate 2 sectors
> > per real sector (space efficiency is _not_ my aim, growing entropy per
> > bit is) and to use a trivial dynamic huffman compression algorithm. Is
> > this solution (which means having half less space than physically
> > available) acceptable ?
>
> Makes sense. One of the zlib developers actually calculated the
> maximum expansion when zlib-compressing data, so you could even get
> away with more than 50% net size, but that makes the code more
> complicated. Your call.
Oops ! I thought it was possible to guarantee with Huffman encoding (which
is more basic than Lempel-Ziv) that the compressed data uses no more than 1
extra bit for every byte (i.e. 12.5% more space).

>
> Performance should not be a big issue, as encryption is a performance
> killer anyway.
I am not sure that this is good news ;)
>
> Whether it is acceptable depends on the user. Make it optional and
> let the user decide.
>
> > 6) Shall this whole idea of compression be ruled out of dm and only be
> > implemented at the file-system level (e.g. as a plugin for ReiserFS4) ?
>
> Again, depends on the user. But from experience, there are plenty of
> users who want something like this.
Unfortunately I have failed to find substantial code or documentation on the
encryption plugin for Reiser4, for example. Do you know of any ?

>
> Jörn
Thank you, Guillaume.

2004-04-14 09:43:34

by Jörn Engel

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Wed, 14 April 2004 08:54:56 +0200, Guillaume Lacôte wrote:
> On Tuesday 13 April 2004 19:45, Jörn Engel wrote:
>
> Thank you, I didn't know about Jffs2; however I believe it is not an
> implemendation at the device level as I would like.

Correct.

> Since space efficiency is _not_ my aim I plan to forcibly allocate 3 physical
> blocks for every 2 "compressed" blocks (as it should (?) always fit with a
> dynamic Huffman encoding).
>
> [...]
>
> Oops ! I thought it was possible to guarantee with the Huffman encoding (which
> is more basic than Lempev-Zif) that the compressed data use no more than 1
> bit for every byte (i.e. 12,5% more space).

Makes sense, although I'd like to see the proof first. Shouldn't be
too hard to do.

> > Performance should not be a big issue, as encryption is a performance
> > killer anyway.
> I am not sure that this is good news ;)

Is it news? ;)

> > Again, depends on the user. But from experience, there are plenty of
> > users who want something like this.
> Unfortunately I failed to find substantial code/documentation on encryption
> plugin for Reiser4, for example. Do you know about some ?

My interest in reiserfs ceased some time ago. Have you asked
their list? <[email protected]>

Jörn

--
Victory in war is not repetitious.
-- Sun Tzu

2004-04-14 10:02:12

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

> > Oops ! I thought it was possible to guarantee with the Huffman encoding
> > (which is more basic than Lempev-Zif) that the compressed data use no
> > more than 1 bit for every byte (i.e. 12,5% more space).
>
> Makes sense, although I'd like to see the proof first. Shouldn't be
> too hard to do.
>
I was referring to the paper by Jeffrey Scott Vitter "Design and Analysis of
Dynamic Huffman Codes" (accessible through http://acm.org). It defines a
refinement of the well-known dynamic Huffman algorithm by Faller, Gallager
and Knuth such that the encoded length will use at most _one_ more bit per
encoded letter than the optimal two-pass Huffman algorithm (it is also shown
that the FGK algorithm can use twice the optimal length plus one more bit per
letter).

My conclusion comes from the fact that for every text, the optimal two-pass
Huffman encoding can _not_ be longer than the native encoding (apart from the
dictionary encoding).

Actually I plan to implement the easier FGK algorithm first - if the whole
matter makes sense.

2004-04-14 11:25:19

by Jörn Engel

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Wed, 14 April 2004 12:02:07 +0200, Guillaume Lacôte wrote:
>
> > > Oops ! I thought it was possible to guarantee with the Huffman encoding
> > > (which is more basic than Lempev-Zif) that the compressed data use no
> > > more than 1 bit for every byte (i.e. 12,5% more space).
> >
> > Makes sense, although I'd like to see the proof first. Shouldn't be
> > too hard to do.
> >
> I was referring to the paper by Jeffrey Scott Vitter "Design and Analysis of
> Dynamic Huffman Codes" (accessible through http://acm.org). It defines a
> refinement of the well-known dynamic Huffman algorithm by Faller, Gallager
> and Knuth such that the encoded length will use at most _one_ more bit per
> encoded letter than the optimal two-pass Huffman algorithm (it is also shown
> that the FGK algorithm an use twice the optimal length + on more bit per
> letter).
>
> My conclusion comes from the fact that for every text, the optimal two-pass
> huffman encoding can _not_ be longer than the native encoding (apart from the
> dictionnary encoding).

Makes sense. You should try to use the crypto/ infrastructure in the
kernel, add the compression algorithm there and use it through the
normal interface. Apart from that, good luck! :)

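The skeleton would look roughly like crypto/deflate.c. This is only a rough,
untested sketch; the exact struct fields and callback signatures differ
between kernel versions, so check your tree:

#include <linux/module.h>
#include <linux/crypto.h>
#include <linux/errno.h>

static int dynhuff_compress(void *ctx, const u8 *src, unsigned int slen,
                            u8 *dst, unsigned int *dlen)
{
        /* run the adaptive Huffman coder over src, write the result to
         * dst and store the compressed length in *dlen */
        return -ENOSYS;         /* placeholder */
}

static int dynhuff_decompress(void *ctx, const u8 *src, unsigned int slen,
                              u8 *dst, unsigned int *dlen)
{
        return -ENOSYS;         /* placeholder */
}

static struct crypto_alg dynhuff_alg = {
        .cra_name    = "dynhuff",
        .cra_flags   = CRYPTO_ALG_TYPE_COMPRESS,
        .cra_ctxsize = 0,
        .cra_module  = THIS_MODULE,
        .cra_list    = LIST_HEAD_INIT(dynhuff_alg.cra_list),
        .cra_u       = { .compress = {
                .coa_compress   = dynhuff_compress,
                .coa_decompress = dynhuff_decompress } }
};

static int __init dynhuff_init(void)
{
        return crypto_register_alg(&dynhuff_alg);
}

static void __exit dynhuff_exit(void)
{
        crypto_unregister_alg(&dynhuff_alg);
}

module_init(dynhuff_init);
module_exit(dynhuff_exit);
MODULE_LICENSE("GPL");

A dm-compress target would then get the algorithm with something like
crypto_alloc_tfm("dynhuff", 0) and call crypto_comp_compress() /
crypto_comp_decompress(), much the way dm-crypt uses the cipher interface.
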
Jörn

--
Simplicity is prerequisite for reliability.
-- Edsger W. Dijkstra

2004-04-14 13:30:48

by Paulo Marques

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Guillaume Lacôte wrote:

>>>Oops ! I thought it was possible to guarantee with the Huffman encoding
>>>(which is more basic than Lempev-Zif) that the compressed data use no
>>>more than 1 bit for every byte (i.e. 12,5% more space).

WTF??

Zlib gives a maximum increase of 0.1% + 12 bytes (from the zlib manual), which
for a 512-byte block is a guaranteed worst-case increase of about 2.4%.

I think that zlib already does the "if this is bigger than original, just mark
the block type as uncompressed" algorithm internally, so the increase is minimal
in the worst case.

A while ago I started working on a proof-of-concept kind of thing: a network
block device server that compressed the data sent to it.

I think that if we want to go ahead with this, we really should make the extra
effort to have actual compression, and use the extra space.

From my experience it is possible to get "near tar.gz" compression ratios on a
scheme like this.

Biggest problems:

1 - We really need to have an extra layer of metadata. This is the worst part.
Not only does it make the thing much more complex, it also brings new problems,
like making sure that the order in which data is written to disk is transparent to the
upper layers and won't wreck the journal on a journaling file system.

2 - The compression layer should report a large block size upwards, and use a
small block size downwards, so that compression is as efficient as possible.
Good results are obtained with a 32kB / 512 byte ratio. This can cause extra
read-modify-write cycles upwards.

3 - If we use a fixed size partition to store compressed data, the apparent
uncompressed size of the block device would have to change. Filesystems aren't
prepared to have a block device that changes size on-the-fly. If we can solve
this problem, then this compression layer could be really useful. Otherwise it
can only be used over loopback on files that can grow in size (this could still
be useful this way).

4 - The block device has no knowledge of what blocks are actually being used by
the filesystem above, so it has to compress and store blocks that are actually
"unused". This is not an actual problem, is just that it could be a little more
efficient if it could ignore unused blocks.

When I did the tests I mounted an ext2 filesystem over the network block device.
At least with ext2, the requests were gathered so that the server would often
receive requests of 128kB, so the big block size problem is not too serious
(performance will be bad anyway, this is a clear space/speed trade-off). This
was kernel 2.4. I don't know enough about the kernel internals to know which
layer is responsible for this "gathering".

On the up side, having an extra metadata layer already provides the "is not
compressed" bit, so that we never need more than a block of disk to store one
block of uncompressed data.

Anyway, I really think that if there is no solution for problem 3, there is
little point in pushing this forward.

For a "better encryption only" scheme, we could use a much simpler scheme, like
having a number of reserved blocks on the start of the block device to hold a
bitmap of all the blocks. On this bitmap a 1 means that the block is
uncompressed, so that if, after compression, the block is bigger than the
original we can store it uncompressed.
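
The bookkeeping for that is tiny. A userspace sketch, with a made-up device
size:

#include <stdint.h>

#define NBLOCKS (1u << 20)              /* blocks on the device (example) */

/* 1 bit per block: 1 = stored uncompressed, 0 = stored compressed.
 * This would live in the reserved blocks at the start of the device. */
static uint8_t uncompressed_bitmap[NBLOCKS / 8];

static void mark_uncompressed(uint32_t block, int uncompressed)
{
        uint8_t mask = 1u << (block % 8);

        if (uncompressed)
                uncompressed_bitmap[block / 8] |= mask;
        else
                uncompressed_bitmap[block / 8] &= ~mask;
}

static int is_uncompressed(uint32_t block)
{
        return (uncompressed_bitmap[block / 8] >> (block % 8)) & 1;
}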

--
Paulo Marques - http://www.grupopie.com
"In a world without walls and fences who needs windows and gates?"

2004-04-14 13:34:48

by Jörn Engel

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Wed, 14 April 2004 13:44:33 +0100, Paulo Marques wrote:
> Guillaume Lacôte wrote:
>
> >>>Oops ! I thought it was possible to guarantee with the Huffman encoding
> >>>(which is more basic than Lempev-Zif) that the compressed data use no
> >>>more than 1 bit for every byte (i.e. 12,5% more space).
>
> WTF??
>
> Zlib gives a maximum increase of 0.1% + 12 bytes (from the zlib manual),
> which for a 512 block will be a 2.4% guaranteed increase.
>
> I think that zlib already does the "if this is bigger than original, just
> mark the block type as uncompressed" algorithm internally, so the increase
> is minimal in the worst case.

Correct, but Guillaume doesn't care about compression efficiency.
"mark the block uncompressed" is precisely what he does *not* want,
unless I got him wrong.

Jörn

--
...one more straw can't possibly matter...
-- Kirby Bakken

2004-04-14 13:58:38

by maccorin

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Wed, Apr 14, 2004 at 01:44:33PM +0100, Paulo Marques wrote:
> Guillaume Lacôte wrote:
>
> >>>Oops ! I thought it was possible to guarantee with the Huffman encoding
> >>>(which is more basic than Lempev-Zif) that the compressed data use no
> >>>more than 1 bit for every byte (i.e. 12,5% more space).
>
> WTF??
>
> Zlib gives a maximum increase of 0.1% + 12 bytes (from the zlib manual),
> which for a 512 block will be a 2.4% guaranteed increase.
>
> I think that zlib already does the "if this is bigger than original, just
> mark the block type as uncompressed" algorithm internally, so the increase
> is minimal in the worst case.

It is my impression that meta-data is exactly what he wants to avoid, as it
would provide known data for crackers.

2004-04-14 14:02:50

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Wednesday 14 April 2004 14:44, Paulo Marques wrote:
> Guillaume Lacôte wrote:
> >>>Oops ! I thought it was possible to guarantee with the Huffman encoding
> >>>(which is more basic than Lempev-Zif) that the compressed data use no
> >>>more than 1 bit for every byte (i.e. 12,5% more space).
>
> WTF??
>
> Zlib gives a maximum increase of 0.1% + 12 bytes (from the zlib manual),
> which for a 512 block will be a 2.4% guaranteed increase.
>
> I think that zlib already does the "if this is bigger than original, just
> mark the block type as uncompressed" algorithm internally, so the increase
> is minimal in the worst case.
Actually (see my reply to Timothy Miller) I really want to do "compression"
even if it does not reduce space: it is a matter of growing the per-bit
entropy rather than gaining space (see http://jsam.sourceforge.net). Moreover
I do not want to use sophisticated algorithms (in order to be able to compute
plain-text random distributions that ensure that the compressed distributions
will be uniform, which is very difficult with e.g. zlib; in particular
having any kind of "meta-data", "signatures" or "dictionary" is a no-go for
me). See details at the end of this post.

I should have made clear that my aim is not space efficiency, although I do
believe that having a transparent space-efficient compression target a-la
cloop would be nice.

>
> A while ago I started working on a proof of concept kind of thing, that was
> a network block device server that compressed the data sent to it.
Would it be possible for you to point me to the relevant material ?
>
> I think that if we want to go ahead with this, we really should make the
> extra effort to have actual compression, and use the extra space.
I strongly agree but I fear that I do not have sufficient knowledge to go this
way.
>
> From my experience it is possible to get "near tar.gz" compression ratios
> on a scheme like this.
>
> Biggest problems:
>
> 1 - We really need to have an extra layer of metadata. This is the worse
> part. Not only makes the thing much more complex, but it brings new
> problems, like making sure that the order the data is written to disk is
> transparent to the upper layers and won't wreck the journal on a journaling
> file system.
>
> 2 - The compression layer should report a large block size upwards, and use
> a little block size downwards, so that compression is as efficient as
> possible. Good results are obtained with a 32kB / 512 byte ratio. This can
> cause extra read-modify-write cycles upwards.
I failed to understand; could you provide me with more details please ?
>
> 3 - If we use a fixed size partition to store compressed data, the apparent
> uncompressed size of the block device would have to change. Filesystems
> aren't prepared to have a block device that changes size on-the-fly. If we
> can solve this problem, then this compression layer could be really useful.
> Otherwise it can only be used over loopback on files that can grow in size
> (this could still be useful this way).
>
> 4 - The block device has no knowledge of what blocks are actually being
> used by the filesystem above, so it has to compress and store blocks that
> are actually "unused". This is not an actual problem, is just that it could
> be a little more efficient if it could ignore unused blocks.
>
> When I did the tests I mounted an ext2 filesystem over the network block
> device. At least with ext2, the requests were gathered so that the server
> would often receive requests of 128kB, so the big block size problem is not
> too serious (performance will be bad anyway, this is a clear space/speed
> trade-off). This was kernel 2.4. I don't know enough about the kernel
> internals to know which layer is responsible for this "gathering".
>
> On the up side, having an extra metadata layer already provides the "is not
> compressed" bit, so that we never need more than a block of disk to store
> one block of uncompressed data.
>
> Anyway, I really think that if there is no solution for problem 3, there is
> little point in pushing this forward.
You are right I presume.
>
> For a "better encryption only" scheme, we could use a much simpler scheme,
> like having a number of reserved blocks on the start of the block device to
> hold a bitmap of all the blocks. On this bitmap a 1 means that the block is
> uncompressed, so that if, after compression, the block is bigger than the
> original we can store it uncompressed.
As I said earlier, my point is definitely not to gain space, but to grow the
"per-bit entropy". I really want to encode my data even if this grows its
length, as is done in http://jsam.sourceforge.net . My final goal is the
following: for each plain block, first draw a chunk of random bytes, and then
compress the random bytes followed by the plain data with a dynamic
Huffman encoding. The random bytes are _not_ drawn uniformly, but rather so
that the distribution on Huffman trees (and thus on encodings) is uniform.
This ensures (?) that an attacker really has no other solution to decipher
the data than brute force: each and every key is possible, and more
precisely, each and every key is equiprobable.
This is definitely _not_ the case with ext2 over dm-crypt, since for example
the first sectors are well-known and can help reduce the key search space.
Once you have a pre-calculated dictionary of all possible outputs of the
first 512 bytes of an ext2 partition with all possible keys, finding a secret
key is trivial ...

This has little to do with compression for space efficiency, although you are
definitely right that this could be useful to more end-users ...

2004-04-14 14:40:07

by Grzegorz Kulewski

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

I think that not only compression but possibly also encryption should be
moved to the fs layer.

How?
In Reiser4 there are plugins.
In other fses (as far as I remember from e2compr and maybe other posts on
this list) there is only one bigger problem with compression, and only if
we want to support mmap (I do not remember the details), and that problem
lies somewhere between the current mm and vfs implementations. I think that
(probably) Linus once said that this problem can be solved by changing these
implementations. The same probably goes for encryption.
In order to protect against guessing the key from decrypting possibly
well-known values in the superblock and other metadata (such as fs size and
signature), we could probably place random numbers before them and xor each 4 bytes with
last 4 bytes before encryption (or use any other hash function).
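
Roughly like this (userspace sketch, untested; the prefix length is arbitrary
and rand() is only a stand-in for a real random source such as
get_random_bytes()):

#include <stdint.h>
#include <stdlib.h>

#define PREFIX_WORDS 4          /* random 32-bit words prepended to the block */

/* Chain each 32-bit word with the previous one before handing the buffer
 * to the cipher, so that known metadata no longer encrypts to a
 * predictable block.  Undone by walking the buffer backwards. */
static void whiten(uint32_t *buf, size_t nwords)
{
        size_t i;

        for (i = 0; i < PREFIX_WORDS; i++)
                buf[i] = (uint32_t)rand() ^ ((uint32_t)rand() << 16); /* toy RNG */

        for (i = PREFIX_WORDS; i < nwords; i++)
                buf[i] ^= buf[i - 1];
}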

Why?
Because in the dm approach you are encrypting entire blocks at once, while in
the fs approach you are encrypting only the needed parts. This can even bring
more security: if the fs merges small files into one block, and if it is
patched to move the beginning of a not-full block's data to a random position
in that block, an attacker must crack the whole fs and its metadata structures
to know where your data actually is and what key is used to encrypt it (we
can have several different keys for different parts of the fs to make things
harder). So we can have a situation where one block contains 2 or 3 or maybe
more files (or parts) encrypted using different keys and hashes, and these
files reside at different offsets in that block. I think that this is easier
to implement and protects better.

What do you think?


Grzegorz Kulewski

2004-04-14 15:02:58

by Pascal Schmidt

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Wed, 14 Apr 2004 16:10:20 +0200, you wrote in linux.kernel:

> Actually (see my reply to Timothy Miller) I really want to do "compression"
> even if it does not reduce space: it is a matter of growing the per-bit
> entropy rather than to gain space (see http://jsam.sourceforge.net).

How is the per-bit entropy higher when the same amount of data (and
thus entropy on that data) is sometimes contained in *more* bits?

I can see the argument if data is really compressed, because then more
bits than would normally fit into, say, a sector, contribute to the
entropy of the final sector.

--
Ciao,
Pascal

2004-04-14 15:07:13

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Thank you for your comments.

On Wednesday 14 April 2004 16:39, Grzegorz Kulewski wrote:
> I think that not only compression should be moved to fs layer but
> possibly encryption also.
This is way easier IMHO, and totally avoids the burden of meta-data
versus plain-data encryption. However there is one big drawback: you benefit
from encryption only when using the file-system for which you have such a
plugin (for example, you can _not_ do a RAID5 array over encrypted disks).
I believe this is too much of a drawback, but perhaps I am wrong.
>
> How?
> In Reiser4 there are plugins. [...]
> In order to protect guessing the key from decrypting possibly-well-known
> values in superblock and other metadata (cuch as fs size and signature) we
> could probably place random numbers before them and xor each 4 bytes with
> last 4 bytes before encryption (or use any other hash function).
This is the idea I have in mind, but with a slightly more sophisticated scheme
(see my reply to Paulo Marques): basically the sole purpose of "compressing" is
to improve the per-bit entropy, and you can combine this with adequately
inserting random bytes before your data according to your compression
algorithm.

>
> Why?
> Because in dm approach you are encrypting entire blocks at once and in fs
> approach you are encrypting only needed parts. [...]
> So we can have situation that in one block there is 2 or 3 or
> maybe more files (or parts) encrypted using different keys and hashes and
> that these files reside at different offsets in that block. I think that
> this is easier to implement and protects better.
Actually I would have used the counter-argument to prove your point ;)
Let me explain : I believe (but I am no cryptographer) that it is a very
unwise idea to encrypt both data and meta-data at the same time. I prefer
extracting all meta-data, possibly encrypting it somewhere, and encrypting the
data independently. This is basically because you can easily gain information
on meta-data (but not so easily on the data itself). This is the reason why
encrypting at the file-system level (where you encipher the content of files)
is better than at the device level (where you also encrypt meta-data; this is
however what I plan to do).

To follow your example, I would even prefer that the offsets of the 2 or 3
files be public rather than have them encrypted and let the attacker know that
the first 8 bytes of my sector encode the offsets of 2 files, each of which is
less than 4096 (and thus has many 0 bits inside it): this might
tremendously help him reduce his search space.

But I do agree that encrypting at the file-system level might be a better
approach. However I believe this raises two issues:
1) you may want to also encipher the meta-data itself. A basic way to do this
would be to additionally use dm-crypt.
2) you can only profit from encryption with the file-system it has been
implemented for.

I personally have no political stance on whether developing a Reiser4-only
compression+encryption plugin would be unwise. I just loosely believe that it
is always better to support several FSes. Unless, of course, this is
doomed to fail, which is the question I would like to answer before working
further on it.

What do you think ?

Guillaume.

2004-04-14 15:25:43

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Wednesday 14 April 2004 17:02, Pascal Schmidt wrote:
> On Wed, 14 Apr 2004 16:10:20 +0200, you wrote in linux.kernel:
> > Actually (see my reply to Timothy Miller) I really want to do
> > "compression" even if it does not reduce space: it is a matter of growing
> > the per-bit entropy rather than to gain space (see
> > http://jsam.sourceforge.net).
>
> How is the per-bit entropy higher when the same amount of data (and
> thus entropy on that data) is sometimes contained in *more* bits?
>
> I can see the argument if data is really compressed, because then more
> bits than would normally fit into, say, a sector, contribute to the
> entropy of the final sector.
You are perfectly right. If I recall correctly from Ming Li and Paul Vitanyi,
An Introduction to Kolmogorov Complexity, the minimum length of an
encoding (e.g. a static two-pass Huffman encoding) is equal to the total
entropy of the text, or one more than that (this is also asymptotically equal
to the Kolmogorov complexity of the text). In particular the static encoding
can not be longer than the original text.
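
In symbols: if p is the symbol distribution of the text and \bar L the
expected codeword length per symbol of an optimal (static Huffman) code,
the standard bound is

    H(p) \le \bar L < H(p) + 1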

However I do _not_ want to have any form of meta-data, dictionary, etc. Thus
I need to use dynamic encoding, where the Huffman tree is dynamically updated
(both while compressing and while decompressing) as characters are
read/written. The problem is that the encoding "evolves" and in particular it
can be (and usually is) worse than the static encoding.

J. S. Vitter showed that the dynamic algorithm by Faller, Gallager and Knuth
can use twice as many bytes plus one bit per byte as the optimal static
Huffman encoding. On the other hand, J. S. Vitter describes a dynamic algorithm
that uses only one more bit per byte in the worst case when compared to the
static encoding.

You are right that in this very case, the per-bit entropy will be
(1 - 1/(1+1/8)) = 1/9, i.e. about 11% lower than in the original text. The point is that this
case (which has nothing to do with the case where a text can be well
compressed or not, this is the worst _relative_ performance of dynamic versus
static encoding) does not happen "too often".

Note that I wish to prepend random bytes to the block of real text
before compressing and ciphering, so as to make the distribution of Huffman
trees uniform. The ultimate goal is to leave an attacker no other solution
than to brute-force test all possible keys. An indirect consequence
of this is that the probability of being in the "poor dynamic performance"
case does not depend on the data itself (but on the random drawing).

I hope that I made things clearer and that I didn't make any mistake (please
feel free to correct me).

Guillaume.

2004-04-14 15:26:25

by Paulo Marques

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Guillaume Lacôte wrote:

>...
> Actually (see my reply to Timothy Miller) I really want to do "compression"
> even if it does not reduce space: it is a matter of growing the per-bit
> entropy rather than to gain space (see http://jsam.sourceforge.net). Moreover
> I do not want to use sophisticated algorithms (in order to be able to compute
> plain text random distributions that ensure that the compressed distributions
> will be uniform, which is very difficult with for e.g zlib; in particular
> having any kind of "meta-data", "signatures" or "dictionnary" is a no-go for
> me). See details at the end of this post.


Point taken


> ...
>
>>A while ago I started working on a proof of concept kind of thing, that was
>>a network block device server that compressed the data sent to it.
>>
> Would it be possible for you to point me to the relevant material ?
>


I just need to tidy it up a little :)

Maybe I can publish it tomorrow or something like that.


>....

>>2 - The compression layer should report a large block size upwards, and use
>>a little block size downwards, so that compression is as efficient as
>>possible. Good results are obtained with a 32kB / 512 byte ratio. This can
>>cause extra read-modify-write cycles upwards.
>>
> I failed to understand; could you provide me with more details please ?
>


If we are to compress on a block basis, the bigger the block the higher the
compression ratio we'll be able to achieve (using zlib, for instance). However,
data sent to the actual block device will have to go in blocks as well.

For instance, if we compress a 32kB block and it only needs 8980 bytes to be
stored, we need 18 512-byte blocks to store it. On average, we will lose half
a lower-level block per "upper level" block of data. With a 32kB / 512 byte
ratio, that is on average 256 bytes lost per 32kB of data, about 0.8% (which is
more than acceptable).
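
In code, the loss is just the round-up to the next 512-byte boundary:

#include <stdio.h>

#define LOWER 512                       /* block size used downwards */

/* sectors needed to store one compressed upper-level block */
static unsigned int sectors_needed(unsigned int compressed_bytes)
{
        return (compressed_bytes + LOWER - 1) / LOWER;
}

int main(void)
{
        /* a 32kB block that compressed down to 8980 bytes */
        unsigned int n = sectors_needed(8980);

        printf("%u sectors, %u bytes of padding lost\n", n, n * LOWER - 8980);
        /* prints: 18 sectors, 236 bytes of padding lost */
        return 0;
}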


>>...
> As I said earlier I my point is definetely not to gain space, but to grow the
> "per-bit entropy". I really want to encode my data even if this grows its
> length, as is done in http://jsam.sourceforge.net . My final goal is the
> following: for each plain block first draw a chunk of random bytes, and then
> compresse both the random bytes followed by the plain data with a dynamic
> huffman encoding. The random bytes are _not_ drawn uniformly, but rather so
> that the distribution on huffman trees (and thus on encodings) is uniform.
> This ensures (?) that an attacker really has not other solution to decipher
> the data than brute-force: each and every key is possible, and more
> precisely, each and every key is equi-probable.


Ok, we are definitely fighting different wars here.

Anyway, I'll try to gather what I did with the network block device server and
place it somewhere where you can look at it. It will probably help you do some
tests, too. Because it is a block device in "user space", it is much simpler to
develop and test different approaches, and gather some results, before trying
things inside the kernel.

If you want to start now, just go to:

http://nbd.sourceforge.net/

and download the source for the network block device server. My server is
probably more complex than the original, because of all the metadata handling.

I hope this helps,

--
Paulo Marques - http://www.grupopie.com
"In a world without walls and fences who needs windows and gates?"

2004-04-14 15:33:06

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Thank you for your feed-back.
On Wednesday 14 April 2004 17:23, Paulo Marques wrote:
> Guillaume Lacôte wrote:
> >...
> > Actually (see my reply to Timothy Miller) I really want to do
> > "compression" even if it does not reduce space: it is a matter of growing
> > the per-bit entropy rather than to gain space (see
> > http://jsam.sourceforge.net). Moreover I do not want to use sophisticated
> > algorithms (in order to be able to compute plain text random
> > distributions that ensure that the compressed distributions will be
> > uniform, which is very difficult with for e.g zlib; in particular having
> > any kind of "meta-data", "signatures" or "dictionnary" is a no-go for
> > me). See details at the end of this post.
>
> Point taken
Please see the criticism from Pascal Schmidt, as I think he has a point ...

> > Would it be possible for you to point me to the relevant material ?
>
> I just need to tidy it up a little :)
>
> Maybe I can publish it tomorrow or something like that.
Since I will be on holiday until the end of April, please feel free to take
your time (although I remain highly interested).

>
> >....
> >
> >>2 - The compression layer should report a large block size upwards, and
> >> use a little block size downwards, so that compression is as efficient
> >> as possible. Good results are obtained with a 32kB / 512 byte ratio.
> >> This can cause extra read-modify-write cycles upwards.
> >
> > I failed to understand; could you provide me with more details please ?
>
> If we are to compress on a block basis, the bigger the block the higher the
> compression ratio we'll be able to achieve (using zlib, for instance).
> However, data sent to the actual block device will have to go in blocks
> itself.
On the other hand, since I will need to interleave random data "quite often", I
prefer my blocks not to be too large (although this should be a choice for the
end-user).
>
> For instance, if we compress a 32kB block and it only needs 8980 bytes to
> be stored, we need 18 512byte blocks to store it. On average, we will lose
> 1/2 of the actual block size per "upper level" block size bytes of data. In
> a 32kB/512 byte ratio, we would lose on average 256 bytes per 32kb of data
> ~ 0.8% (which is more than acceptable).
I got it, thank you.
>
> >>...
> >
> > As I said earlier I my point is definetely not to gain space, but to grow
> > the "per-bit entropy". I really want to encode my data even if this grows
> > its length, as is done in http://jsam.sourceforge.net . My final goal is
> > the following: for each plain block first draw a chunk of random bytes,
> > and then compresse both the random bytes followed by the plain data with
> > a dynamic huffman encoding. The random bytes are _not_ drawn uniformly,
> > but rather so that the distribution on huffman trees (and thus on
> > encodings) is uniform. This ensures (?) that an attacker really has not
> > other solution to decipher the data than brute-force: each and every key
> > is possible, and more precisely, each and every key is equi-probable.
>
> Ok, we are definitely fighting different wars here.
>
> Anyway, I'll try to gather what I did with the network block device server
> and place it somewhere where you can look at it. It will probably help you
> do some tests, too.
Thank you in advance.
> Because it is a block device in "user space", it is
> much simpler to develop and test different approaches, and gather some
> results, before trying things inside the kernel.
You are right.
>
> If you want to start now, just go to:
>
> http://nbd.sourceforge.net/
>
> and download the source for the network block device server. My server is
> probably more complex than the original, because of all the metadata
> handling.
I will do this once I come back from holidays (no net there).
>
> I hope this helps,
It does, thank you.
Guillaume.

2004-04-14 16:14:18

by Grzegorz Kulewski

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Wed, 14 Apr 2004, Guillaume Lacôte wrote:

> However there is one big drawback: you benefit
> from encryption only when using the file-system for which you have such a
> plugin (for example, you can _not_ do a RAID5 array over encrypted disks).

Why do you need this? Why is encrypting the fs not enough?


> Let me explain : I believe (but I am no cryptographer) that it is a very
> unwise idea to encrypt both data and meta-data at the same time. I prefer
> extracting all meta-data, possibly encrypting them somewhere, and encrypt
> data independently. This is basically because you can easily gain information
> on meta-data (but not so easily on data themselves). This is the reason why
> encrypting at the file-system level (where you encipher the content of files)
> is better than at the device level (where you also encrypt meta-data; this is
> however what I plan to do).

It depends on the data and on the metadata. Of course we can use
compression and hashing (already discussed on this list in the dm-crypt
thread) to make data more protected. But, at the fs level, we can encrypt
metadata with a different key (for the metadata or part of the metadata only)
and this way keep it near the data (which helps performance by reducing
expensive disk seeks) but encrypted "differently".


> To follow your example, I would even prefer the offsets of the 2 or 3 files be
> public rather than having them encrypted and let the attacker know that the
> first 8 bytes of my sector encodes for offsets for 2 files, each of which
> being less than 4096 (thus having many 0 bits inside it) : this might
> tremendously help him reduce his search space.

But why do you want to make the attacker's life easier? With fs level
encryption we can hide the offsets in a secure way and encrypt both data and
metadata (but differently).


> However I believe this raises two issues:
> 1) you may want to also encipher the meta-data itself. A basic way to do this
> would be to additionnaly use dm-crypt.

But it will not help performance.


> 2) you can only profit from encryption with the file-system it has been
> implemented for.
>
> I personally have no political stance on wether developping a Reiser4-only
> compression+encryption plugin would be unwise. I just loosily believe that it
> is always better to support several FSes. Unless, of course, if this is
> doomed to failed, which is the question I would like to answer before working
> further on it.

Well, by changing some parts of the VFS and maybe the MM we can make encryption and
compression easier to implement for various fs developers. Maybe we can
add some generic "messing with the data and metadata" (compressing,
encrypting, etc.) hooks to VFS?

I think that a generic solution is best, but compression at the block level is
very hard to implement. Remember you must keep the block size the same
(because the majority of filesystems will work only with 512, 1024, ... block
sizes) while still keeping your additional compression data near the
compressed data (for disk seek performance reasons). And you will not gain any
size benefit from this type of compression (building a device with a
non-constant block size is horrible, and you would probably have to change a
significant amount of the kernel to do that; think about fragmentation also).
And compression is a CPU-consuming process (I think you cannot implement
Huffman in O(n) time; probably something like O(n log n) with a not very small
constant is required). And if you just want to make your data harder to
decrypt or to analyse in its encrypted form, you can use some simple O(n)
algorithms designed for this purpose (some were discussed on this list in the
dm-crypt thread).

But feel free to try :-)


Grzegorz Kulewski

2004-04-14 17:28:02

by Bill Davidsen

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Wed, 14 Apr 2004, Paulo Marques wrote:

Guillaume Lacôte wrote:
>
> >>>Oops ! I thought it was possible to guarantee with the Huffman encoding
> >>>(which is more basic than Lempev-Zif) that the compressed data use no
> >>>more than 1 bit for every byte (i.e. 12,5% more space).
>
> WTF??
>
> Zlib gives a maximum increase of 0.1% + 12 bytes (from the zlib manual), which
> for a 512 block will be a 2.4% guaranteed increase.
>
> I think that zlib already does the "if this is bigger than original, just mark
> the block type as uncompressed" algorithm internally, so the increase is minimal
> in the worst case.

I kind of think that the compression, even negative, is wanted here to
assist in obfuscation.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2004-04-14 19:29:56

by Pascal Schmidt

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Wed, 14 Apr 2004, Guillaume Lacôte wrote:

> You are right that in this very case, the per-bit entropy will be
> (1 - 1/(1+1/8) ) ~ 12% lower than in the original text. The point is
> that this case (which has nothing to do with the case where a text can
> be well compressed or not, this is the worst _relative_ performance of
> dynamic versus static encoding) does not happen "too often".

I was only pointing out a corner case where your argument doesn't
hold. I completely agree that it is rather unlikely to happen with
real world data.

There is one exception, though: data that has already been compressed
before on a higher level; by the user, for example. However, in
that case you could say that what you're trying to accomplish has already
happened before during the original compression run.

--
Ciao,
Pascal

2004-04-15 09:28:51

by Jörn Engel

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Tue, 13 April 2004 17:44:40 +0200, Guillaume Lacôte wrote:
>
> I hope this is the right place to post this message; I tried to keep it small.
> Basically I really would like to implement compression at the dm level,
> despite all of the problems. The reason for this is that reducing redundancy
> through compression tremendously reduces the possibilities of success for an
> attacker. I had implemented this idea in a java archiver (
> http://jsam.sourceforge.net ).
>
> Although I am not a good kernel hacker, I have spent some time reading
> compressed-loop.c, loop-aes, dm-crypt.c, and various threads from lkml
> including http://www.uwsg.iu.edu/hypermail/linux/kernel/0402.2/0035.html
> Thus I would appreciate if you could answer the following questions regarding
> the implementation of a "dm-compress" dm personality.

Btw, looks like the whole idea is broken. Consider a block of
known-plaintext. The known plaintext happens to be at the beginning
of the block and has more entropy than the blocksize of your block
cypher. Bang, you're dead.

Your whole approach depends on the fact that any known plaintext in
the device is either not at the beginning of the block or has very
little entropy. 1k of zeroes already has too much entropy if you use
any form of Huffman, so without RLE you're not just frequently dead, but
practically always.

Does the idea still sound sane to you?

Jörn

--
...one more straw can't possibly matter...
-- Kirby Bakken

2004-04-22 08:01:06

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Thank you for your feed-back and sorry for my late answer (just back from
holidays).
> Btw, looks like the whole idea is broken. Consider a block of
> known-plaintext. The known plaintext happens to be at the beginning
> of the block and has more entropy than the blocksize of your block
> cypher. Bang, you're dead.
I am sorry but I failed to understand this problem; could you elaborate on
it ? Are you saying that if I have known plaintext that compresses to at least
one full block, the problem remains ?

- Now (with dm-crypt) = basically the first 512 bytes are known (apart from
the iv, see the discussion in the dm-crypt threads). This tremendously helps a
brute-force attack (use a cluster to pre-calculate all possible encryptions
of these 512 bytes and you are done).

- What I suggest = 1) grow entropy (this does not solve the above problem)
2) interleave random bytes _before_ each (plain) block of data. Bytes are
drawn so as to make the distribution on Huffman encodings uniform.
Thus even if you know the plain text, even if it would compress to more than
4kB, since the block starts with random bytes you know nothing about, and
since these random bytes have changed the Huffman encoding used to encode the
known plain text, I claim that you know essentially nothing (?).

Another way to say the same thing: start by choosing one Huffman encoding among
all possible 8-bit encodings; "write" this encoding, and then write your
(possibly known) data with this encoding. To "write" the encoding, just
draw one particular random sequence whose Huffman encoding is the one you
want (this avoids having a "dictionary" or any other meta-data in the
encrypted data).

>
> Your whole approach depends on the fact, that any known plaintext in
> the device is either not at the beginning of the block or has very
> little entropy. 1k of zeroes already has too much entropy if you use
> any form of huffman, so without RLE you're not frequently dead, but
> practically always.
You are perfectly right that Huffman is a poor redundancy fighter and will not
always drive the size of data down to its Kolmogorov complexity. The "1k of
zeroes" might still be a problem. I agree that a better compression algorithm
would be nice, but the problem is that I _need_ to know how to draw random bytes
so as to make the _compressed_ encoding uniformly distributed.
>
> Does the idea still sound sane to you?
I hope so; what is your opinion ?
Guillaume.

2004-04-22 09:18:42

by Jörn Engel

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Thu, 22 April 2004 09:59:14 +0200, Guillaume Lacôte wrote:
> > Btw, looks like the whole idea is broken. Consider a block of
> > known-plaintext. The known plaintext happens to be at the beginning
> > of the block and has more entropy than the blocksize of your block
> > cypher. Bang, you're dead.
> I am sorry but I failed to understand this problem; could you elaborate on it
> ? Are you saying that if I have known plaintext that compreses to at least on
> full block, the problem remains ?

Before:
http://marc.theaimsgroup.com/?l=linux-kernel&m=107419912024246&w=2
The described attack works against user-friendly passwords. People
who can remember long strings from /dev/random are safe.

After:
You compressed the known plaintext. But the compressed known
plaintext remains known plaintext, just shorter. If it remains longer
than one block of the block cypher, it is still sufficient for a
dictionary attack. Nothing gained.

And of course it has to be at the beginning of a compression block, so
the offset is known in advance.

> - Now (with dm-crypt) = basically the first 512 bytes are known (apart from
> iv, see discussion in dm-crypt threads). This termendously helps a
> brute-force attack (use a cluster to pre-calculate all possible encryptions
> of these 512 bytes and you are done).
>
> - What I suggest = 1) grow entropy (this does not solve the above problem)
> 2) interleave random bytes _before_ each (plain) block of data. Bytes are
> drawn as to make the distribution on Huffman encodings uniform.
> Thus even if you know the plain text, even if it would compress to more than
> 4kB, since the block starts with random bytes you know nothing about, and
> since these random bytes have changed the Huffman encoding used to encode the
> known plain text, I claim that you know about nothing (?).

Ok, that makes the attack a little harder (but not much). After any
amount of random data, you end up with a random Huffman tree, agreed.
Compress another 1k of zeros and see what the Huffman tree looks like
now. There are not too many options for what the compressed data could
look like, right? Somewhere in the ballpark of 2, 4 or maybe 8.

So now the attack takes twice as long. Yeah, it helps a little, but
is it really worth the trouble?
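
For what it's worth, a quick numerical check of this point (my own sketch, with
the simplification that the Huffman tree is rebuilt from byte counts rather than
updated adaptively, and with a uniformly random prefix): once 1 kB of zeros has
been counted after any 1 kB prefix, byte 0x00 dominates and its code collapses
to just one or two possibilities.

    import collections, heapq, os

    def huffman_codes(freq):
        # freq maps byte value -> count; returns byte value -> code (bit string)
        heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            lo, hi = heapq.heappop(heap), heapq.heappop(heap)
            merged = {s: "0" + c for s, c in lo[2].items()}
            merged.update({s: "1" + c for s, c in hi[2].items()})
            heapq.heappush(heap, [lo[0] + hi[0], tiebreak, merged])
            tiebreak += 1
        return heap[0][2]

    seen = collections.Counter()
    for _ in range(200):
        prefix = os.urandom(1024)                      # random bytes up front
        codes = huffman_codes(collections.Counter(prefix + bytes(1024)))
        seen[codes[0]] += 1                            # the code 0x00 ended up with
    print("distinct codes observed for 0x00:", sorted(seen))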

> You are perfectly right that Huffman is a poor redundancy fighter and will not
> always drive the size of data down to its Kolmogorov complexity. The "1k of
> zeroes" might still be a problem. I agree that a better compression algorithm
> would be nice, but the problem is that I _need_ to know how to draw random bytes
> so as to make the _compressed_ encoding uniformly distributed.
> >
> > Does the idea still sound sane to you?
> I hope so; what is your opinion ?

I doubt it. Maybe with a statistical encoding or even with block
sorting followed by statistical encoding, you increase the complexity
by much more than 8. But without changes, the idea looks pretty
futile.

Jörn

--
When in doubt, punt. When somebody actually complains, go back and fix it...
The 90% solution is a good thing.
-- Rob Landley

2004-04-22 10:20:22

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Thank you for your prompt answer ;)

> Before:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107419912024246&w=2
Yes, I had seen this.

> After:

> And of course it has to be at the beginning of a compression block, so
> the offset is known in advance.
OK, but this is not the case anymore if you insert random bytes first (?).
Where does the encrypted non-random data start ? This depends on the huffman
encoding, which you know nothing about.

>
> Ok, that makes the attack a little harder (but not much). After any
> amount of random data, you end up with a random huffman tree, agreed.
> Compress another 1k of zeros and see what the huffman tree looks like
> now. There are not too many options for what the compressed data could
> look like, right? Somewhere in the ballpark of 2, 4 or maybe 8.
OK, you are right. So assuming you have:
1) n randomly chosen bytes (n is itself random)
Here the dynamic Huffman tree is T1 (uniformly distributed among all
trees)
2) p zero bytes
Now the tree is T2.
3) some more valuable data you would not want the attacker to know about.
My whole idea is useless if the attacker can gain sufficient knowledge of T2.
So if p >> n, then T2 will look like the well-known tree with 0 encoded on 1
bit. I had missed this problem, thank you.

However (?):
a) the attacker still knows nothing about the rest of the tree (i.e. what the
encodings of all the other bytes are). Basically, to construct T2 you take T1, let
"0" be the coding of "0", and prefix the codes of all other bytes with a "1".

On the other hand, replace (2) with a known sequence of p _equally
distributed_ bytes: now T2 looks like the well-balanced tree, and its
encoding is trivial and totally known to the attacker. This is a real problem
(apart from b and c below), but how often can this happen ?

b) the attacker does not know (?) where the real data starts in the enciphered
stream, since Huffman is variable-length.

c) it is sufficient to ensure that not (p >> n). This is easily satisfied if
the expectation of n is in the order of one block size, since at most p <
block_size.

> > > Does the idea still sound sane to you?
> > I hope so; what is your opinion ?
> I doubt it. Maybe with a statistical encoding or even with block
> sorting followed by statistical encoding, you increase the complexity
> by much more than 8. But without changes, the idea looks pretty
> futile.
Could you detail what you mean with statistical encoding ? Thank you in
advance, Guillaume.

2004-04-22 12:15:57

by Jörn Engel

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Thu, 22 April 2004 12:20:08 +0200, Guillaume Lacôte wrote:
>
> > After:
> > And of course it has to be at the beginning of a compression block, so
> > the offset is known in advance.
> OK, but this is not the case anymore if you insert random bytes first (?).
> Where does the encrypted non-random data start ? This depends on the huffman
> encoding, which you know nothing about.

True. But since the data is random, chances are it is impossible to
compress, so guessing the compressed data should be feasible. The
guess could be off by a few bits, but again, this makes the attack
harder by just a small factor.

Yeah, in the end, all the small factors can multiply up to something quite
sufficient.

> a) the attacker still knows nothing about the rest of the tree (i.e. what the
> encodings of all the other bytes are). Basically, to construct T2 you take T1, let
> "0" be the coding of "0", and prefix the codes of all other bytes with a "1".

Too complicated. Figure out where to find a block of encrypted
zeros, then look it up in a dictionary. If that is successful, you
have the key, full stop.

The attacker could even look for the most common block (or the 100
most common blocks), do a dictionary attack on those and afterwards
guess if the plaintext is all zeros, all ones or something similar.

> b) the attacker does not know (?) where the real data starts in the enciphered
> stream, since Huffman is variable-length.

What if the known plaintext is all zeros and compressed known
plaintext is longer than two encryption blocks? There will be one
block containing only compressed zeros.

> c) it is sufficient to ensure that not (p >> n). This is easily satisfied if
> the expectation of n is in the order of one block size, since at most p <
> block_size.

Block_size is confusing. Encryption block? Compression block?
Either way, even with p << n, you end up with, at most, 512 different
encodings for a zero. Again, 2, 4 or 8 are much more realistic.

> Could you detail what you mean with statistical encoding ? Thank you in
> advance, Guillaume.

Sorry, I meant arithmetical encoding. Statistical is the superset for
huffman (discrete) and arithmetic (continuous).

http://dogma.net/markn/articles/arith/part1.htm

Jörn

PS: To shorten this endless story, all you're trying to accomplish is to
avoid a dictionary attack on known plaintext. Those attacks are
meaningless, as long as the encryption key is strong enough. So
use a true random key and be done with it.

And if you care about usability as well, store the key along with the
encrypted data, but encrypt the key with the digest of a simple
password. That is prone to dictionary attacks, but since it only
encrypts true random data (the real key), you're safe again.
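
For illustration, a minimal sketch of that scheme (my own toy example, not what
SuSE or any real tool ships: a plain digest stands in for a proper KDF, and a
one-time XOR stands in for a real cipher):

    import hashlib, os

    def wrap_key(master_key: bytes, password: bytes) -> bytes:
        kek = hashlib.sha256(password).digest()        # key-encryption key
        return bytes(m ^ k for m, k in zip(master_key, kek))

    def unwrap_key(wrapped: bytes, password: bytes) -> bytes:
        return wrap_key(wrapped, password)             # XOR is its own inverse

    master_key = os.urandom(32)                        # keys the bulk cipher
    stored = wrap_key(master_key, b"simple password")  # kept next to the data
    assert unwrap_key(stored, b"simple password") == master_key
    # Every password guess unwraps to *some* 32-byte value; because the real
    # key is uniformly random, the wrapping itself never confirms a guess.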

--
Money does not make you happy.
Happiness does not fill your stomach.

2004-04-22 13:07:04

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Thanks for continuing this discussion.
>
> True. But since the data is random, chances are it is impossible to
> compress, so guessing the compressed data should be feasible. The
> guess could be off by a few bits, but again, this makes the attack
> harder by just a small factor.
Not exactly; the random bytes are _not_ drawn uniformly; they are drawn so as
to make the distribution on Huffman trees uniform.

The rough algorithm is as follows:

1) set up the weight of each of the 256 possible bytes. If you want byte x to
have a Huffman code of length k, let its weight be proportional to 2^-k (a
code of length k corresponds to a probability of 2^-k).

2) now draw a sequence of n bytes with these weights; choose n sufficiently
high so that with high probability the empirical distribution of the bytes
you have actually drawn is close to the theoretical one. This ensures that
the resulting Huffman tree will be close to the one you wanted. This follows
from the law of large numbers (the Central Limit Theorem quantifies how fast
the empirical distribution converges).

So the actual random sequence _will_ most often be highly compressible. The
point is that every possible Huffman tree is equally likely to result from the
drawn sequence. This maximizes the entropy (i.e. it minimizes the information
known to the attacker). A rough sketch of the drawing step follows below.
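
The sketch (my own illustration, under stated assumptions: a static Huffman
rebuild from the empirical counts rather than a true adaptive coder, and an
arbitrary choice of target code lengths):

    import collections, heapq, random

    def huffman_lengths(freq):
        # freq maps byte value -> count; returns byte value -> code length in bits
        heap = [[count, i, {sym: 0}] for i, (sym, count) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            lo, hi = heapq.heappop(heap), heapq.heappop(heap)
            merged = {s: d + 1 for s, d in lo[2].items()}
            merged.update({s: d + 1 for s, d in hi[2].items()})
            heapq.heappush(heap, [lo[0] + hi[0], tiebreak, merged])
            tiebreak += 1
        return heap[0][2]

    # Target code lengths: bytes 0-1 -> 2 bits, bytes 2-5 -> 4 bits; the Kraft
    # budget left over (1 - 2*2^-2 - 4*2^-4 = 0.25) is spread over the other 250.
    target = {x: 2 for x in range(2)}
    target.update({x: 4 for x in range(2, 6)})
    weights = [2.0 ** -target[x] if x in target else 0.25 / 250 for x in range(256)]

    n = 8192                          # large enough for the counts to concentrate
    drawn = random.choices(range(256), weights=weights, k=n)
    lengths = huffman_lengths(collections.Counter(drawn))
    print("byte 0x00: target 2 bits, got", lengths[0])
    print("byte 0x02: target 4 bits, got", lengths[2])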

>
> > a) the attacker still knows nothing about the rest of the tree (i.e. what
> > the encodings of all the other bytes are). Basically, to construct T2 you take
> > T1, let "0" be the coding of "0", and prefix the codes of all other bytes
> > with a "1".
>
> Too complicated. Figure out where to find a block of encrypted
> zeros, then look it up in a dictionary. If that is successful, you
> have the key, full stop.
OK; my whole point is that it is difficult to "figure out where to find a
block of encrypted zeros". I hope to make this as difficult as first cracking
the preceding sequence of random bytes, which is, by design, only feasible by
brute force.
I may be wrong on this. So assume I have stored 4kB of zeros.
What you have is the following:
L1 = n random bytes
L2 = the 4kB of zeros, compressed with the dynamic Huffman tree after L1, then
enciphered.
You do not know n, nor the actual sequence of bytes I have drawn, nor the
secret key. How do you proceed (I do not claim this is impossible, although I
hope it is; I just would like to understand) ?

>
> The attacker could even look for the most common block (or the 100
> most common blocks), do a dictionary attack on those and afterwards
> guess if the plaintext is all zeros, all ones or something similar.
I do not understand this assertion. What does "look for the most common block"
mean if you have no clue about which Huffman tree was used to encode them (all
Huffman trees are made equi-probable) ? In some of the trees, '0x0' =>
'00000000', in some other trees '0x0' => '0011', in some others, it gets
'00011110100100', etc.

>
> > b) the attacker does not know (?) where the real data starts in the
> > enciphered stream, since Huffman is variable-length.
>
> What if the known plaintext is all zeros and compressed known
> plaintext is longer than two encryption blocks? There will be one
> block containing only compressed zeros.
OK, maybe I should have made clear that the drawing of random bytes is to be
done _at the beginning of each and every block_ (as is done in
http://jsam.sourceforge.net).

> > c) it is sufficient to ensure that not (p >> n). This is easily
> > satisfied if the expectation of n is in the order of one block size,
> > since at most p < block_size.
>
> Block_size is confusing. Encryption block? Compression block?
Sorry about that: I meant plain data block (ie your 4kB of real data).
This will be written to 5kB of data or so : about 1kB of random bytes, then
the data (possibly compressed), everything being encrypted at the end.

> Either way, even with p << n, you end up with, at most, 512 different
> encodings for a zero. Again, 2, 4 or 8 are much more realistic.
I fail to understand this: assume I want to encode one huge 4GB file full of
zeros. This will lead to 10^6 blocks of size 4kB. On writing, each block will
be preceded by its own sequence of random bytes, and thus have its own uniformly
drawn Huffman encoding. Thus the data fed to the cipher will be different
for each block. What have I missed ?

>
> > Could you detail what you mean with statistical encoding ? Thank you in
> > advance, Guillaume.
>
> Sorry, I meant arithmetical encoding. Statistical is the superset for
> huffman (discrete) and arithmetic (continuous).
>
> http://dogma.net/markn/articles/arith/part1.htm
Thank you for the link.
The idea of using a partition of (0,1) to represent codes is that of Kraft; so
is the idea of using a range of real numbers to represent the code. The
remark that inside such a range (e.g. [0.43046721 ; 0.4782969[ ) there are
numbers that have a low Kolmogorov complexity (e.g. 0.45) is neat, though.

But whatever happens, the optimal length of the encoding will be exactly equal
to the total entropy of the message, up to a constant of 1 (see Li and
Vitányi, An Introduction to Kolmogorov Complexity, Th. 1.11.2, p. 75).

It can happen that arithmetic encoding comes closer to the optimal encoding
than Huffman, but I fail to understand why this should always be the case.

But it holds for sure that Huffman encoding, especially adaptive, will _not_
always achieve it either.
>
> Jörn
>
> PS: To shorten this endless story, all you're trying to accomplish is to
> avoid a dictionary attack on known plaintext. Those attacks are
> meaningless, as long as the encryption key is strong enough. So
> use a true random key and be done with it.
Could you please point me to the relevant literature on this subject ? I have
read the NIST-AES site on this question, but still: if the entropy of the
plain text is low, the entropy of the encoded text _will_ be low. I believe
entropy is an orthogonal concept. And the lower the entropy, the more
information is available to the attacker. Whether dividing the search space by
4 leads to any practicable speed-up in the key search is another matter; I
believe you. But if I can prevent it altogether in the first place, why not ?

>
> And if you care about usability as well, store the key along with the
> encrypted data, but encrypt the key with the digest of a simple
> password. That is prone to dictionary attacks, but since it only
> encrypts true random data (the real key), you're safe again.
This is what SuSE has been doing for some time, I believe. But this is just
another matter. I already assume that the key is random. However, the known
text is not, and this is my whole problem.

Thank you for your time, and feel free not to answer if you feel bored ...
Guillaume.

2004-04-22 16:09:43

by Jörn Engel

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Ignoring most of your reply...

On Thu, 22 April 2004 15:06:43 +0200, Guillaume Lacôte wrote:
> >
> > The attacker could even look for the most common block (or the 100
> > most common blocks), do a dictionary attack on those and afterwards
> > guess if the plaintext is all zeros, all ones or something similar.
> I do not understand this assertion. What does "look for the most common block"
> mean if you have no clue about which Huffman tree was used to encode them (all
> Huffman trees are made equi-probable) ? In some of the trees, '0x0' =>
> '00000000', in some other trees '0x0' => '0011', in some others, it gets
> '00011110100100', etc.

1. Cut up the complete device into chunks of 16 bytes (or whatever your
cipher uses).
2. Add all of those chunks to a huge hash, counting how often they
occur.
3. For the most common ones, try a dictionary attack.

Yeah, sure, the attacker has no idea what the plaintext of those
blocks is, but if they appear often enough, it has to be something
quite common. Something like, say, all ones or all zeros. Or like
one of those 48 common huffman encodings thereof.
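
For concreteness, a sketch of the three steps above (the device path and block
size are placeholders of my choosing; step 3 itself is not shown):

    import collections

    BLOCK = 16                                  # e.g. the AES block size
    counts = collections.Counter()
    with open("/dev/mapper/encrypted-volume", "rb") as dev:   # placeholder path
        while True:
            chunk = dev.read(BLOCK)
            if len(chunk) < BLOCK:
                break
            counts[chunk] += 1
    # Step 3: run the dictionary attack (trial decryptions against guessed
    # plaintexts such as all-zeros) only on these few popular ciphertext blocks.
    for chunk, n in counts.most_common(100):
        print(n, chunk.hex())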

> OK, maybe I should have made clear that the drawing of random bytes is to be
> done _at the beginning of each and every block_ (as is done in
> http://jsam.sourceforge.net).

So what! You end up with maybe three bits per zero (assuming all
zeros). Depending on the size of random data up front, they start
with bit 1, 2 or 3. Makes 3*2^3 or 24 possibilities. Same for all
ones, give a total of 48. Great, a dictionary attack is 48x slower
now!

> > > c) it is sufficient to ensure that not (p >> n). This is easily
> > > satisfied if the expectation of n is in the order of one block size,
> > > since at most p < block_size.
> >
> > Block_size is confusing. Encryption block? Compression block?
> Sorry about that: I meant plain data block (ie your 4kB of real data).
> This will be written to 5kB of data or so : about 1kB of random bytes, then
> the data (possibly compressed), everything being encrypted at the end.

Still, towards the end of all-ones or all-zeros, each byte will be
encoded with the same 1-3bit value.

> > Either way, even with p << n, you end up with, at most, 512 different
> > encodings for a zero. Again, 2, 4 or 8 are much more realistic.
> I fail to understand this: assume I want to encode one huge 4GB file full of
> zeros. This will lead to 10^6 blocks of size 4kB. On writing, each block will
> be preceded by its own sequence of random bytes, and thus have its own uniformly
> drawn Huffman encoding. Thus the data fed to the cipher will be different
> for each block. What have I missed ?

That huffman encodings are discrete, not continuous. Towards the end
of each 4k block of zeros, you end up with very few likely encodings.
And those are also stable for quite a while.

> The idea of using a partition of (0,1) to represent codes is that of Kraft; so
> is the idea of using a range of real numbers to represent the code. The
> remark that inside such a range (e.g. [0.43046721 ; 0.4782969[ ) there are
> numbers that have a low Kolmogorov complexity (e.g. 0.45) is neat, though.
>
> But whatever happens, the optimal length of the encoding will be exactly equal
> to the total entropy of the message, up to a constant of 1 (see Li and
> Vitányi, An Introduction to Kolmogorov Complexity, Th. 1.11.2, p. 75).
>
> It can happen that arithmetic encoding comes closer to the optimal encoding
> than Huffman, but I fail to understand why this should always be the case.

That's just a minor point for you, although still valid. In almost
all cases, the perfect number of bits is not 1, 2, 3, etc. but 2.54, 1.78
and such. Huffman cannot assign fractional bit lengths, which explains why
all-ones compresses so horribly with huffman and so well with just about
anything else. In almost every case, huffman is worse, just usually
not by that much.
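
A quick back-of-the-envelope check of this (the distribution is illustrative and
of my choosing; the huffman figure is only a rough estimate of the code lengths
such a skewed source would get):

    import math

    p_zero = 0.99                            # a source that is 99% zero bytes
    p_other = (1 - p_zero) / 255             # the rest spread evenly
    entropy = -(p_zero * math.log2(p_zero) + 255 * p_other * math.log2(p_other))
    huffman_est = p_zero * 1 + (1 - p_zero) * 9      # ~1-bit code for zero,
                                                     # ~9-bit codes for the rest
    print(f"entropy          : {entropy:.2f} bits/byte")     # about 0.16
    print(f"huffman estimate : {huffman_est:.2f} bits/byte")  # about 1.08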

> > PS: To shorten this endless story, all you're trying to accomplish is to
> > avoid a dictionary attack on known plaintext. Those attacks are
> > meaningless, as long as the encryption key is strong enough. So
> > use a true random key and be done with it.
> Could you please point me to the relevant literature on this subject ? I have
> read the NIST-AES site on this question, but still: if the entropy of the
> plain text is low, the entropy of the encoded text _will_ be low. I believe
> entropy is an orthogonal concept. And the lower the entropy, the more
> information is available to the attacker. Whether dividing the search space by
> 4 leads to any practicable speed-up in the key search is another matter; I
> believe you. But if I can prevent it altogether in the first place, why not ?

Sorry, I don't have a document for common sense. What you are
protecting against is easy guessing of bad keys. Make the key better
and your whole point is void.

> > And if you care about usability as well, store the key along with the
> > encrypted data, but encrypt the key with the digest of a simple
> > password. That is prone to dictionary attacks, but since it only
> > encrypts true random data (the real key), you're safe again.
> This is what SuSE has been doing for some time, I believe. But this is just
> another matter. I already assume that the key is random. However, the known
> text is not, and this is my whole problem.

In that case, what's your point? If the key is strong and the
encryption is strong (I sure hope AES is), nothing short of brute
force can be successful. What are you protecting against?

Jörn

--
A defeated army first battles and then seeks victory.
-- Sun Tzu

2004-04-23 15:17:09

by Guillaume Lacôte

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

Feel free to ignore all of my reply; please note that I am not trying to "be
right" or to "be wrong", but I still do not understand ... Thank you again for
your time and pedagogy.

> Yeah, sure, the attacker has no idea what the plaintext of those
> blocks is, but if they appear often enough, it has to be something
> quite common. Something like, say, all ones or all zeros. Or like
> one of those 48 common huffman encodings thereof.
> [...]
> So what! You end up with maybe three bits per zero (assuming all
> zeros). Depending on the size of random data up front, they start
> with bit 1, 2 or 3. Makes 3*2^3 or 24 possibilities. Same for all
> ones, giving a total of 48. Great, a dictionary attack is 48x slower
> now!
> [...]
> Still, towards the end of all-ones or all-zeros, each byte will be
> encoded with the same 1-3bit value.
The point I fail to understand is the following: you know the enciphered
value of these 1-3 bits. But how can you know what the
compressed-but-deciphered 1-3 bit value is ? OK, my text contains only 0s. OK,
these 0s appear to be "011" once encrypted. How do you launch your
dictionary attack ? You do _not_ (?) know what the 3-bit deciphered code for
"0" is. Or maybe you do ?

> [...]
> In that case, what's your point? If the key is strong and the
> encryption is strong (I sure hope AES is), nothing short of brute
> force can be successful. What are you protecting against?
Maybe my "endless" story is absurd, but I am _not_ protecting against weak
keys; I am trying to protect against weak _data_, which is the basis for
dictionary attacks even in the case of perfectly random keys.

Thank you for having read till here,
Guillaume.

2004-04-23 16:58:44

by Jörn Engel

[permalink] [raw]
Subject: Re: Using compression before encryption in device-mapper

On Fri, 23 April 2004 17:16:53 +0200, Guillaume Lacôte wrote:
>
> Feel free to ignore all of my reply; please note that I am not trying to "be
> right" or to "be wrong", but I still do not understand ... Thank you again for
> your time and pedagogy.

No need to mention this. I'll stop whenever I feel like it.

> > Yeah, sure, the attacker has no idea what the plaintext of those
> > blocks is, but if they appear often enough, it has to be something
> > quite common. Something like, say, all ones or all zeros. Or like
> > one of those 48 common huffman encodings thereof.
> > [...]
> > So what! You end up with maybe three bits per zero (assuming all
> > zeros). Depending on the size of random data up front, they start
> > with bit 1, 2 or 3. Makes 3*2^3 or 24 possibilities. Same for all
> > ones, giving a total of 48. Great, a dictionary attack is 48x slower
> > now!
> > [...]
> > Still, towards the end of all-ones or all-zeros, each byte will be
> > encoded with the same 1-3bit value.
> The point I fail to understand is the following: you know the enciphered
> value of these 1-3 bits. But how can you know what the
> compressed-but-deciphered 1-3 bit value is ? OK, my text contains only 0s. OK,
> these 0s appear to be "011" once encrypted. How do you launch your
> dictionary attack ? You do _not_ (?) know what the 3-bit deciphered code for
> "0" is. Or maybe you do ?

I know the encrypted text and I know it is common, so the uncompressed
decrypted text is likely 0x00,... Now, what may the compressed
decyphered text be? It could be:
000,000,...
001,001,...
010,010,...
011,011,...
.
.
.

Actually, the 000 and 111 cases don't even have three variants
depending on when they start, so the 48 above comes down to 40. And
since I honestly don't care whether it was 0x00 or 0xff that was
encrypted to 001,001,..., it even comes down to 20.

You are right, I don't know which one of those 20 it is, but I am
quite sure it is one of them.
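
A rough enumeration of those candidates (my own sketch, assuming the run is
encoded with a repeated code of 1-3 bits and that only the starting bit offset
within a 16-byte cipher block is unknown; after removing duplicates the
dictionary comes out even smaller than 20):

    BLOCK_BITS = 16 * 8                        # one 16-byte cipher block

    candidates = set()
    for length in (1, 2, 3):                   # plausible code lengths for the run byte
        for value in range(2 ** length):       # every code of that length
            code = format(value, f"0{length}b")
            for offset in range(length):       # where the repetition starts
                bits = (code * (BLOCK_BITS // length + 2))[offset:offset + BLOCK_BITS]
                candidates.add(bits)
    print(len(candidates), "distinct plaintext blocks for the dictionary")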

> > In that case, what's your point? If the key is strong and the
> > encryption is strong (I sure hope AES is), nothing short of brute
> > force can be successful. What are you protecting against?
> Maybe my "endless" story is absurd, but I am _not_ protecting against weak
> keys; I am trying to protect against weak _data_, which is the basis for
> dictionary attacks even in the case of perfectly random keys.

Show me the paper. Since when is AES weak against known plaintext
attacks?

Jörn

--
The grand essentials of happiness are: something to do, something to
love, and something to hope for.
-- Allan K. Chalmers