2008-10-29 01:50:16

by Phillip Lougher

Subject: [PATCH V2 00/16] Squashfs: compressed read-only filesystem

Hi,

This is a respin of the Squashfs patches, incorporating the review comments
received. Thanks to everyone who sent comments.

Summary of changes in patch respin:

1. Functions changed to return 0 on success and -ESOMETHING on error (see
the sketch after this list)
2. Header files moved from include/linux to fs/squashfs
3. Variables changed to use sb and inode
4. Number of squashfs_read_metadata() parameters reduced
5. Xattr placeholder code tweaked
6. TRACE and ERROR macros fixed to use pr_debug and pr_warning
7. Some obsolete macros in squashfs_fs.h removed
8. A number of gotos to return statements replaced with direct returns
9. Sparse with endian checking (make C=2 CHECKFLAGS="-D__CHECK_ENDIAN__")
errors fixed
10. get_dir_index_using_name() misaligned access fixed
11. Fix a couple of printk warnings on PPC64
12. Shorten a number of variable names
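
To illustrate items 1 and 8, here is a minimal sketch of the convention.
The struct and function are hypothetical, not code from the patches:

#include <linux/errno.h>

/* Hypothetical example: return 0 on success and -ESOMETHING on error
 * (item 1), using direct returns rather than gotos that merely jump
 * to a return statement (item 8). */
struct example_table {
	void *data;
	int nr_entries;
};

static int example_lookup(struct example_table *table, int index)
{
	if (!table || !table->data)
		return -EINVAL;		/* error returned directly */
	if (index < 0 || index >= table->nr_entries)
		return -ERANGE;		/* no "goto failed" chain */
	return 0;			/* success */
}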

There is now a public git repository on kernel.org. Pull/clone from
git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-2.6.git

These 16 patches are against 2.6.28-rc2.

Following is the original re-submission overview, detailing the major
changes since the original 2005 submission, and a case for its inclusion.

Thanks

Phillip

This is a second attempt at mainlining Squashfs. The first attempt was way
way back in early 2005 :-) Since then the filesystem layout has undergone
two major revisions, and the kernel code has been almost completely
rewritten. Both changes address the criticisms made of the original
attempt.

Summary of changes:
1. Filesystem layout is now 64-bit, in theory filesystems and
files can be 2^64 in size.

2. Filesystem is now fixed little-endian.

3. "." and ".." are now returned by readdir.

4. Sparse files are now supported.

5. Filesystem is now exportable (NFS etc.).

6. Datablocks up to 1 Mbyte are now supported.

Codewise all of the packed bit-fields and the swap macros have been removed in
favour of aligned structures and in-line swapping using leXX_to_cpu(). The
code has also been extensively restructured, reformatted to kernel coding
standards and commented.
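
As an illustration, the change is from packed bit-fields plus swap macros
to something of this shape (a made-up structure, not the real Squashfs
layout):

#include <linux/types.h>
#include <asm/byteorder.h>

/* Hypothetical on-disk structure: fixed little-endian, naturally
 * aligned, and annotated so sparse endian checking can verify the
 * conversions. */
struct example_disk_inode {
	__le16 inode_type;
	__le16 mode;
	__le32 size;
};

static void example_read_inode(struct example_disk_inode *d,
			       unsigned int *type, unsigned int *size)
{
	/* swap in-line at the point of use */
	*type = le16_to_cpu(d->inode_type);
	*size = le32_to_cpu(d->size);
}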

Previously there was resistance to including another compressed filesystem
when Linux already had cramfs, and pressure for a strong case to be made
for Squashfs. Hopefully the case for additional compressed filesystems has
been settled over the last couple of years; even so, it is worth listing
the features Squashfs has over cramfs, which is still the only read-only
compressed filesystem in mainline.

Max filesystem size: cramfs 16 Mbytes, Squashfs 2^64 bytes
Max filesize: cramfs 16 Mbytes, Squashfs 2^64 bytes
Block size: cramfs 4K, Squashfs default 128K, max 1Mbyte
Tail-end packing: cramfs no, Squashfs yes
Directory indexes: cramfs no, Squashfs yes
Compressed metadata: cramfs no, Squashfs yes
Hard link support: cramfs no, Squashfs yes
Support for "." and ".." in readdir: cramfs no, Squashfs yes
Real inode numbers: cramfs no, Squashfs yes. Cramfs gives device inodes,
fifos and empty directories the same inode number of 1!
Exportable filesystem (NFS, etc.): cramfs no, Squashfs yes
Active maintenance: cramfs no (it is listed as orphaned, probably no active
work for years), Squashfs yes

Sorry for the list formatting, but many email readers are very unforgiving
when displaying tabbed lists, so I avoided tabs.

For those who want hard performance statistics,
http://tree.celinuxforum.org/CelfPubWiki/SquashFsComparisons gives
a full comparison of the performance of Squashfs against cramfs, zisofs,
cloop and ext3. I made these tests a number of years ago using Squashfs 2.1,
but they are still valid. In fact the performance should now be better.

Cramfs is a limited filesystem: it's good for some embedded uses but not
much else now, and its layout and features haven't changed in the
eight-plus years since its release. Squashfs, despite never being in
mainline, has been actively developed for over six years, and in that time
has gone through four layout revisions, each revision improving compression
and performance where limitations were found. For an often-dismissed
filesystem, Squashfs has advanced features such as metadata compression and
tail-end packing for greater compression, and directory indexes for faster
dentry operations.

Despite not being in mainline, it is widely used. It is packaged
by all major distributions (Ubuntu, Fedora, Debian, SUSE, Gentoo), it is used
on most LiveCDs, it is extensively used in embedded systems (STBs, routers,
mobile phones), and notably is used in such things as Splashtop and the
Amazon Kindle.

Anyway, that's my case for inclusion. If any readers want Squashfs
mainlined, now is probably a good time to offer support!

There are 16 patches in the patch set, and the patches are against the
latest linux-next tree (linux 2.6.27-next-20081016).

Finally, I would like to acknowledge the financial support of the Consumer
Embedded Linux Forum (CELF). They've made it possible for me to spend the
last four months working full time on this mainlining attempt.

Phillip


2008-10-29 02:30:09

by Andrew Morton

Subject: Re: [PATCH V2 00/16] Squashfs: compressed read-only filesystem

On Wed, 29 Oct 2008 01:49:55 +0000 Phillip Lougher <[email protected]> wrote:

> Hi,
>
> This is a respin of the Squashfs patches, incorporating the review comments
> received. Thanks to everyone who sent comments.
>
> Summary of changes in patch respin:
>
> 1. Functions changed to return 0 on success and -ESOMETHING on error
> 2. Header files moved from include/linux to fs/squashfs
> 3. Variables changed to use sb and inode
> 4. Number of squashfs_read_metadata() parameters reduced
> 5. Xattr placeholder code tweaked
> 6. TRACE and ERROR macros fixed to use pr_debug and pr_warning
> 7. Some obsolete macros in squashfs_fs.h removed
> 8. A number of gotos to return statements replaced with direct returns
> 9. Sparse with endian checking (make C=2 CHECKFLAGS="-D__CHECK_ENDIAN__")
> errors fixed
> 10. get_dir_index_using_name() misaligned access fixed
> 11. Fix a couple of printk warnings on PPC64
> 12. Shorten a number of variable names

- what are the limitations of squashfs (please add this to the
changelog of patch #1 or something). Does it support nfsd? (yes, it
does!) xattrs and acls? File size limits, entries-per-directory,
etc, etc?

What is on the todo list?

- Please check that all non-static identifiers really did need global
scope. I saw some which surprised me a bit.

- Please check that all global identifiers have suitable names. For
example "get_fragment_location" is a poor choice for a kernel-wide
identifier. It could clash with other subsystems, mainly. Plus it
is hardly self-identifying. I see quite a few such cases.

- The fs uses vmalloc() rather a lot. I'd suggest that this be
explained and justified in the design/implementation overview,
wherever that is. This should include a means by which we can
estimate (or at least understand) the amount of memory which will be
allocated in this way.

Because vmalloc() is unreliable. It is a fixed-size resource on
each machine. Some machines will run out much much earlier than
others. It will set an upper limit on the number of filesystems
which can be concurrently mounted, and presumably upon the size of
those filesystems. On a per-machine basis, which is worse.

It also exposes users to vmalloc arena fragmentation. eg: mount
ten 1G filesystems, then unmount every second one, then try to mount
a 2G filesystem and you find you have no contiguous vmalloc space
(simplified example).

See, this vmalloc thing is a fairly big deal. What's up with all
of this??

- The fs uses brelse() in quite a few places where the bh is known to
be non-zero. Suggest that these be converted to the more efficient
and modern put_bh().
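
For example, a change of this shape (illustrative fragment, not taken
from the patches):

#include <linux/buffer_head.h>

/* brelse() is "if (bh) __brelse(bh)"; when bh is known to be valid
 * the NULL check is redundant and put_bh() can drop the reference
 * directly. */
static void example_release(struct buffer_head *bh)
{
	put_bh(bh);	/* was: brelse(bh); */
}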

- this:

+/*
+ * Blocks in Squashfs are compressed. To avoid repeatedly decompressing
+ * recently accessed data Squashfs uses two small metadata and fragment caches.
+ *
+ * This file implements a generic cache implementation used for both caches,
+ * plus functions layered on top of the generic cache implementation to
+ * access the metadata and fragment caches.
+ */

confuses me. Why not just decompress these blocks into pagecache
and let the VFS handle the caching??

The real bug here is that this rather obvious question wasn't
answered anywhere in the patch submission (afaict). How to fix that?

Methinks we need a squashfs.txt which covers these things.

- I suspect that squashfs_cache_put() has races around the handling
of cache->waiting. Does it assume that another CPU wrote that flag
prior to adding itself to the waitqueue? How can the other task do
that atomically? What about memory ordering issues?

Suggest that cache->lock coverage be extended to clear all this up.

- Quite a few places do kzalloc(a * b, ...). Please convert to
kcalloc() which has checks for multiplicative overflows.
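
i.e. a conversion of this shape (hypothetical function, illustrative
only):

#include <linux/slab.h>

/* kcalloc() zeroes the memory like kzalloc(), but also returns NULL
 * if nr * size would overflow. */
static void *example_alloc(size_t nr, size_t size)
{
	/* was: return kzalloc(nr * size, GFP_KERNEL); */
	return kcalloc(nr, size, GFP_KERNEL);
}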

> There is now a public git repository on kernel.org. Pull/clone from
> git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-2.6.git

Generally looks OK to me. Please prepare a tree for linux-next
inclusion, and unless serious problems are pointed out I'd suggest
shooting for a 2.6.29 merge.

2008-10-29 21:40:19

by Matt Mackall

Subject: Re: [PATCH V2 00/16] Squashfs: compressed read-only filesystem

On Wed, 2008-10-29 at 01:49 +0000, Phillip Lougher wrote:
> Hi,
>
> This is a respin of the Squashfs patches, incorporating the review comments
> received. Thanks to everyone who sent comments.

I read over the v3 source a few weeks ago and must say this looks
greatly improved.

--
Mathematics is the supreme nostalgia of our time.

2008-10-31 00:29:35

by Phillip Lougher

Subject: Re: [PATCH V2 00/16] Squashfs: compressed read-only filesystem

Matt Mackall wrote:
> On Wed, 2008-10-29 at 01:49 +0000, Phillip Lougher wrote:
>> Hi,
>>
>> This is a respin of the Squashfs patches, incorporating the review comments
>> received. Thanks to everyone who sent comments.
>
> I read over the v3 source a few weeks ago and must say this looks
> greatly improved.
>

Yes, thanks. The v3 source was a bit of a mess; it had grown organically
from the earliest version of Squashfs, and had long needed restructuring,
closer attention to coding standards, and commenting. I think the v4
source is a major improvement; it's partly thanks to CELF that I got
the necessary time off work to knock it into shape.

Phillip

2008-11-03 14:15:08

by Evgeniy Polyakov

Subject: Re: [PATCH V2 00/16] Squashfs: compressed read-only filesystem

Hi.

On Wed, Oct 29, 2008 at 01:49:55AM +0000, Phillip Lougher ([email protected]) wrote:
> Summary of changes in patch respin:
>
> 1. Functions changed to return 0 on success and -ESOMETHING on error
> 2. Header files moved from include/linux to fs/squashfs
> 3. Variables changed to use sb and inode
> 4. Number of squashfs_read_metadata() parameters reduced
> 5. Xattr placeholder code tweaked
> 6. TRACE and ERROR macros fixed to use pr_debug and pr_warning
> 7. Some obsolete macros in squashfs_fs.h removed
> 8. A number of gotos to return statements replaced with direct returns
> 9. Sparse with endian checking (make C=2 CHECKFLAGS="-D__CHECK_ENDIAN__")
> errors fixed
> 10. get_dir_index_using_name() misaligned access fixed
> 11. Fix a couple of printk warnings on PPC64
> 12. Shorten a number of variable names

Looks very good.

As a generic comment on style: imho u64 is more appropriate than
long long; at the least, it's fewer keys to press when typing :)

--
Evgeniy Polyakov

2009-01-04 07:55:50

by Phillip Lougher

Subject: Re: [PATCH V2 00/16] Squashfs: compressed read-only filesystem

Andrew Morton wrote a couple of months ago (sadly how time flies):

> - what are the limitations of squashfs (please add this to the
> changelog of patch #1 or something). Does it support nfsd? (yes, it
> does!) xattrs and acls? File size limits, entries-per-directory,
> etc, etc?

Xattrs and ACLs are not supported; this is a todo.

File size limits are in theory 2^64 bytes. In practice about 2 TiB.

No limits on the entries per directory.

>
> What is on the todo list?
>

Xattrs and ACLs.

The above information has been added to a squashfs.txt file.

> - Please check that all non-static identifiers really did need global
> scope. I saw some which surprised me a bit.
>
> - Please check that all global identifiers have suitable names. For
> example "get_fragment_location" is a poor choice for a kernel-wide
> identifier. It could clash with other subsystems, mainly. Plus it
> is hardly self-identifying. I see quite a few such cases.

Done and fixed.

>
> - The fs uses vmalloc() rather a lot. I'd suggest that this be
> explained and justified in the design/implementation overview,
> wherever that is. This should include a means by which we can
> estimate (or at least understand) the amount of memory which will be
> allocated in this way.
>
> Because vmalloc() is unreliable. It is a fixed-size resource on
> each machine. Some machines will run out much much earlier than
> others. It will set an upper limit on the number of filesystems
> which can be concurrently mounted, and presumably upon the size of
> those filesystems. On a per-machine basis, which is worse.
>
> It also exposes users to vmalloc arena fragmentation. eg: mount
> ten 1G filesystems, then unmount every second one, then try to mount
> a 2G filesystem and you find you have no contiguous vmalloc space
> (simplified example).
>
> See, this vmalloc thing is a fairly big deal. What's up with all
> of this??

Vmalloc was used to allocate block data (128 KiB by default).

I've removed all vmallocs from Squashfs. All buffers are now kmalloced
in PAGE_CACHE_SIZE-sized chunks.
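
A rough sketch of the approach; the names and error handling are
illustrative, not the actual patch:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/pagemap.h>

/* Sketch: build a block-sized buffer from page-sized kmalloc chunks
 * instead of one large vmalloc() region. */
static void **example_alloc_block(unsigned int block_size)
{
	int i, n = DIV_ROUND_UP(block_size, PAGE_CACHE_SIZE);
	void **data = kcalloc(n, sizeof(void *), GFP_KERNEL);

	if (!data)
		return NULL;

	for (i = 0; i < n; i++) {
		data[i] = kmalloc(PAGE_CACHE_SIZE, GFP_KERNEL);
		if (!data[i]) {
			while (--i >= 0)
				kfree(data[i]);
			kfree(data);
			return NULL;
		}
	}
	return data;
}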

>
> - The fs uses brelse() in quite a few places where the bh is known to
> be non-zero. Suggest that these be converted to the more efficient
> and modern put_bh().
>

Fixed.

> - this:
>
> +/*
> + * Blocks in Squashfs are compressed. To avoid repeatedly decompressing
> + * recently accessed data Squashfs uses two small metadata and fragment caches.
> + *
> + * This file implements a generic cache implementation used for both caches,
> + * plus functions layered on top of the generic cache implementation to
> + * access the metadata and fragment caches.
> + */
>
> confuses me. Why not just decompress these blocks into pagecache
> and let the VFS handle the caching??
>

The cache is not used for file datablocks; these are decompressed and
cached in the page cache in the normal way. The cache is only used to
temporarily cache fragment and metadata blocks which have been read as
a result of a metadata (i.e. inode or directory) or fragment access.
Because metadata and fragments are packed together into blocks (to
gain greater compression), the read of a particular piece of metadata or
fragment will retrieve other metadata/fragments packed with it, and
these, because of locality of reference, may be read in the near
future. Temporarily caching them ensures they are available for
near-future access without requiring an additional read and decompress.

The cache is deliberately kept small, caching only the last couple of
blocks read. The total overhead is 3 x block_size (3 x 128 KiB = 384 KiB)
for fragments and 8 x 8 KiB (64 KiB) for metadata blocks, a total of
448 KiB.

If you think this is too large I can reduce the number of fragments and
metadata blocks cached.

Because these blocks are larger than PAGE_CACHE_SIZE it is not easy to
use the page cache. As Joern said, "there really isn't any infrastructure
to deal with such cases yet. Bufferheads deal with blocks smaller than a
page, not the other way around." Storing these in the page cache would
introduce a lot more complexity in terms of locking and associated race
conditions.

As previously mentioned, I have rewritten the cache implementation to
avoid vmalloc and to use PAGE_CACHE_SIZE blocks. This eliminates the
vmalloc fragmentation issues, and is a first step in moving the
implementation over to using the page cache in the future.

> The real bug here is that this rather obvious question wasn't
> answered anywhere in the patch submission (afaict). How to fix that?
>
> Methinks we need a squashfs.txt which covers these things.

Added to a new squashfs.txt file.

>
> - I suspect that squashfs_cache_put() has races around the handling
> of cache->waiting. Does it assume that another CPU wrote that flag
> prior to adding itself to the waitqueue? How can the other task do
> that atomically? What about memory ordering issues?
>
> Suggest that cache->lock coverage be extended to clear all this up.
>

Fixed.
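
In outline, the waiting flag is now only read and cleared with
cache->lock held. A simplified sketch (not the exact code):

#include <linux/spinlock.h>
#include <linux/wait.h>

struct example_cache {
	spinlock_t lock;
	wait_queue_head_t wait_queue;
	int waiting;
};

/* Simplified sketch: because the flag is only touched under
 * cache->lock, there is no waiter/waker ordering left to chance. */
static void example_cache_put(struct example_cache *cache, int *refcount)
{
	spin_lock(&cache->lock);
	if (--(*refcount) == 0 && cache->waiting) {
		cache->waiting = 0;
		spin_unlock(&cache->lock);
		wake_up(&cache->wait_queue);
		return;
	}
	spin_unlock(&cache->lock);
}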

> - Quite a few places do kzalloc(a * b, ...). Please convert to
> kcalloc() which has checks for multiplicative overflows.

Fixed.

>
>> There is now a public git repository on kernel.org. Pull/clone from
>> git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-2.6.git
>
> Generally looks OK to me. Please prepare a tree for linux-next
> inclusion, and unless serious problems are pointed out I'd suggest
> shooting for a 2.6.29 merge.
>

Ok. I'll re-spin the patches against 2.6.28 tomorrow (Sunday), and I'll
prepare a tree for linux-next.

Thanks

Phillip

2009-01-04 19:05:16

by Leon Woestenberg

Subject: Re: [PATCH V2 00/16] Squashfs: compressed read-only filesystem

Hello,

On Sun, Jan 4, 2009 at 8:55 AM, Phillip Lougher
<[email protected]> wrote:
>> - what are the limitations of squashfs (please add this to the
>> changelog of patch #1 or something). Does it support nfsd? (yes, it
>> does!) xattrs and acls? File size limits, entries-per-directory,
>> etc, etc?
>
> Xattrs and ACLs are not supported; this is a todo.
> File size limits are in theory 2^64 bytes. In practice about 2 TiB.
>
...
>
> Ok. I'll re-spin the patches against 2.6.28 tomorrow (Sunday), and I'll
> prepare a tree for linux-next.
>

For use cases such as embedded firmware, the limitations are
uninteresting and the compression savings are very interesting,
especially where the resulting filesystem has to creep through slow
wires such as half-duplex serial links.

I have been using squashfs 2.2 through 3.4 without problems for years,
for distributing Linux-based firmware to embedded devices.

Many thanks for your continued efforts on mainlining squashfs,

Regards,
--
Leon