LinuxLists.cc - [PATCH 04/10] AXFS: axfs

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

On Thursday 21 August 2008, Jared Hulbert wrote:
> +/***************** functions in other axfs files ******************************/
> +int axfs_get_sb(struct file_system_type *, int, const char *, void *,
> + struct vfsmount *);
> +void axfs_kill_super(struct super_block *);
> +void axfs_profiling_add(struct axfs_super *, unsigned long, unsigned int);
> +int axfs_copy_mtd(struct super_block *, void *, u64, u64);
> +int axfs_copy_block(struct super_block *, void *, u64, u64);

*Never* put extern declarations into a .c file, that's what headers are for.
If you ever change the definition, the compiler doesn't get a chance to
warn you otherwise.

> +/******************************************************************************/
> +static int axfs_readdir(struct file *, void *, filldir_t);
> +static int axfs_mmap(struct file *, struct vm_area_struct *);
> +static ssize_t axfs_file_read(struct file *, char __user *, size_t, loff_t *);
> +static int axfs_readpage(struct file *, struct page *);
> +static int axfs_fault(struct vm_area_struct *, struct vm_fault *);
> +static struct dentry *axfs_lookup(struct inode *, struct dentry *,
> + struct nameidata *);
> +static int axfs_get_xip_mem(struct address_space *, pgoff_t, int, void **,
> + unsigned long *);

For style reasons, also please don't put static forward declarations anywhere,
but define the functions in the right order so you don't need them.

Arnd <><

2008-08-21 12:18:24

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

On Thursday 21 August 2008, Jared Hulbert wrote:
> + array_index = AXFS_GET_INODE_ARRAY_INDEX(sbi, ino_number);
> + array_index += page->index;
> +
> + node_index = AXFS_GET_NODE_INDEX(sbi, array_index);
> + node_type = AXFS_GET_NODE_TYPE(sbi, array_index);
> +
> + if (node_type == Compressed) {
> + /* node is in compessed region */
> + cnode_offset = AXFS_GET_CNODE_OFFSET(sbi, node_index);
> + cnode_index = AXFS_GET_CNODE_INDEX(sbi, node_index);
> + down_write(&sbi->lock);
> + if (cnode_index != sbi->current_cnode_index) {
> + /* uncompress only necessary if different cblock */
> + ofs = AXFS_GET_CBLOCK_OFFSET(sbi, cnode_index);
> + len = AXFS_GET_CBLOCK_OFFSET(sbi, cnode_index + 1);
> + len -= ofs;
> + axfs_copy_data(sb, cblk1, &(sbi->compressed), ofs, len);
> + axfs_uncompress_block(cblk0, cblk_size, cblk1, len);
> + sbi->current_cnode_index = cnode_index;
> + }
> + downgrade_write(&sbi->lock);
> + max_len = cblk_size - cnode_offset;
> + len = max_len > PAGE_CACHE_SIZE ? PAGE_CACHE_SIZE : max_len;
> + src = (void *)((unsigned long)cblk0 + cnode_offset);
> + memcpy(pgdata, src, len);
> + up_read(&sbi->lock);

This looks very nice, but could use some comments about how the data is
actually stored on disk. It took me some time to figure out that it actually
allows to do tail merging into compressed blocks, which I was about to suggest
you implement ;-). Cramfs doesn't have them, and I found that they are the
main reason why squashfs compresses better than cramfs, besides the default
block size, which you can change on either one.

Have you seen any benefit of the rwsem over a simple mutex? I would guess
that you can never even get into the situation where you get concurrent
readers since I haven't found a single down_read() in your code, only
downgrade_write().

Arnd <><

2008-08-21 15:07:04

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

> Have you seen any benefit of the rwsem over a simple mutex? I would guess
> that you can never even get into the situation where you get concurrent
> readers since I haven't found a single down_read() in your code, only
> downgrade_write()

We implemented a rwsem here because you can get concurrent readers.
My understanding is that downgrade_write() puts the rwsem into the
same state as down_read(). Am I mistaken?

2008-08-21 15:14:48

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

On Thursday 21 August 2008, Jared Hulbert wrote:
> > Have you seen any benefit of the rwsem over a simple mutex? I would guess
> > that you can never even get into the situation where you get concurrent
> > readers since I haven't found a single down_read() in your code, only
> > downgrade_write()
>
> We implemented a rwsem here because you can get concurrent readers.
> My understanding is that downgrade_write() puts the rwsem into the
> same state as down_read(). Am I mistaken?

Your interpretation of downgrade_write is correct, but if every thread
always does

down_write();
serialized_code();
downgrade_write();
parallel_code();
up_read();

Then you still won't have any concurrency, because each thread trying
to down_write() will be blocked until the previous one has done its up_read(),
causing parallel_code() to be serialized as well.

In addition to that, I'd still consider it better to use a simple mutex
if parallel_code() is a much faster operation than serialized_code(), as it
is in your case, where only the memcpy is parallel and that is much faster
than the inflate.

Arnd <><
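
To make the locking argument concrete, here is a minimal userspace sketch of
the same read path guarded by a plain mutex. It is only an illustration under
stated assumptions: a pthread mutex stands in for a kernel mutex, and
struct cblock_cache, fake_decompress(), cache_read() and CBLK_SIZE are
hypothetical names that do not appear in the patch. Since every caller has to
take the lock exclusively first anyway, nothing is lost by holding it across
the short memcpy as well.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define CBLK_SIZE 8192	/* hypothetical uncompressed c_block size */

struct cblock_cache {
	pthread_mutex_t lock;		/* takes the place of the rwsem */
	long current_index;		/* -1 means nothing is cached */
	unsigned decompress_count;	/* instrumentation for this sketch only */
	unsigned char data[CBLK_SIZE];
};

/* Stand-in for axfs_copy_data() + axfs_uncompress_block(): fills the
 * buffer with a byte pattern derived from the block index. */
static void fake_decompress(struct cblock_cache *c, long index)
{
	memset(c->data, (int)(index & 0xff), sizeof(c->data));
	c->decompress_count++;
}

static void cache_init(struct cblock_cache *c)
{
	pthread_mutex_init(&c->lock, NULL);
	c->current_index = -1;
	c->decompress_count = 0;
}

/* Copy len bytes starting at offset within c_block `index` into dst.
 * Returns 0 on success, -1 if the range does not fit in the block. */
static int cache_read(struct cblock_cache *c, long index, size_t offset,
		      void *dst, size_t len)
{
	assert(c != NULL && dst != NULL);
	if (offset > CBLK_SIZE || len > CBLK_SIZE - offset)
		return -1;

	pthread_mutex_lock(&c->lock);
	if (index != c->current_index) {
		/* the slow, serialized part */
		fake_decompress(c, index);
		c->current_index = index;
	}
	/* the fast part, also under the same lock */
	memcpy(dst, c->data + offset, len);
	pthread_mutex_unlock(&c->lock);
	return 0;
}
```

Two consecutive reads from the same block decompress once; a read from a
different block replaces the cached one.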

2008-08-22 00:32:50

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

Jared Hulbert wrote:

> +
> +static int axfs_iget5_test(struct inode *inode, void *opaque)
> +{
> + u64 *inode_number = (u64 *) opaque;
> +
> + if (inode->i_sb == NULL) {
> + printk(KERN_ERR "axfs_iget5_test:"
> + " the super block is set to null\n");
> + }
> + if (inode->i_ino == *inode_number)
> + return 1; /* matches */
> + else
> + return 0; /* does not match */
> +}
> +

This implies inode_numbers are unique in AXFS? If so you can get rid of
the axfs_iget5_set/test logic. This is only necessary for filesystems
with non-unique inode numbers like cramfs.

> +
> +struct inode *axfs_create_vfs_inode(struct super_block *sb, int ino)
> +{
> + struct axfs_super *sbi = AXFS_SB(sb);
> + struct inode *inode;
> + u64 size;
> +
> + inode = iget5_locked(sb, ino, axfs_iget5_test, axfs_iget5_set, &ino);

If inode_numbers are unique, use iget_locked here.

> +
> + if (!(inode && (inode->i_state & I_NEW)))
> + return inode;
> +
> + inode->i_mode = AXFS_GET_MODE(sbi, ino);
> + inode->i_uid = AXFS_GET_UID(sbi, ino);
> + size = AXFS_GET_INODE_FILE_SIZE(sbi, ino);
> + inode->i_size = size;

What's the reason for splitting this into two lines, rather than

inode->i_size = AXFS_GET_INODE_FILE_SIZE(sbi, ino);

> + inode->i_blocks = AXFS_GET_INODE_NUM_ENTRIES(sbi, ino);
> + inode->i_blkbits = PAGE_CACHE_SIZE * 8;
> + inode->i_gid = AXFS_GET_GID(sbi, ino);
> +
> + inode->i_mtime = inode->i_atime = inode->i_ctime = sbi->timestamp;

No unique per inode time? Will cause problems using AXFS for archives
etc. where preserving timestamps is important.

> + inode->i_ino = ino;
> +

Unnecessary, set by iget_locked/iget5_locked

> +
> +static int axfs_readdir(struct file *filp, void *dirent, filldir_t filldir)
> +{
> + struct inode *inode = filp->f_dentry->d_inode;
> + struct super_block *sb = inode->i_sb;
> + struct axfs_super *sbi = AXFS_SB(sb);
> + u64 ino_number = inode->i_ino;
> + u64 entry;
> + loff_t dir_index;
> + char *name;
> + int namelen, mode;
> + int err = 0;
> +
> + /* Get the current index into the directory and verify it is not beyond
> + the end of the list */
> + dir_index = filp->f_pos;
> + if (dir_index >= AXFS_GET_INODE_NUM_ENTRIES(sbi, ino_number))
> + goto out;
> +
> + /* Verify the inode is for a directory */
> + if (!(S_ISDIR(inode->i_mode))) {
> + err = -EINVAL;
> + goto out;
> + }
> +
> + while (dir_index < AXFS_GET_INODE_NUM_ENTRIES(sbi, ino_number)) {
> + entry = AXFS_GET_INODE_ARRAY_INDEX(sbi, ino_number) + dir_index;
> +
> + name = (char *)AXFS_GET_INODE_NAME(sbi, entry);

One to one mapping between inode number and inode name? No hard link
support...?

> + namelen = strlen(name);
> +
> + mode = (int)AXFS_GET_MODE(sbi, entry);
> + err = filldir(dirent, name, namelen, dir_index, entry, mode);
> +
> + if (err)
> + break;
> +
> + dir_index++;
> + filp->f_pos = dir_index;
> + }
> +
> +out:
> + return 0;
> +}

Are "." and ".." stored in the directory? If not then axfs_readdir
should fabricate them to avoid confusing applications that expect
readdir(3) to return them.
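
A minimal userspace sketch of what fabricating the two entries could look
like, assuming they are not stored on disk. Everything here is hypothetical
(struct toy_dir, toy_readdir(), the emit callback and struct name_log are
illustration only, not AXFS or VFS interfaces): positions 0 and 1 are reserved
for "." and "..", and the stored entries start at position 2.

```c
#include <assert.h>
#include <stddef.h>

/* Callback in the spirit of filldir: returns nonzero when the caller's
 * buffer is full and iteration should stop. */
typedef int (*emit_fn)(void *ctx, const char *name, unsigned long ino);

struct toy_dir {
	unsigned long ino;		/* inode number of this directory */
	unsigned long parent_ino;	/* inode number of its parent */
	const char **names;		/* stored entry names */
	const unsigned long *inos;	/* stored entry inode numbers */
	size_t count;			/* number of stored entries */
};

/* Emit entries starting at *pos and advance *pos past each entry that
 * was accepted, so a later call resumes where this one stopped. */
static void toy_readdir(const struct toy_dir *dir, size_t *pos,
			emit_fn emit, void *ctx)
{
	assert(dir != NULL && pos != NULL && emit != NULL);

	if (*pos == 0) {
		if (emit(ctx, ".", dir->ino))
			return;
		(*pos)++;
	}
	if (*pos == 1) {
		if (emit(ctx, "..", dir->parent_ino))
			return;
		(*pos)++;
	}
	while (*pos - 2 < dir->count) {
		size_t i = *pos - 2;

		if (emit(ctx, dir->names[i], dir->inos[i]))
			return;
		(*pos)++;
	}
}

/* Simple collector used to exercise toy_readdir(). */
struct name_log {
	const char *names[16];
	size_t n;
	size_t limit;	/* pretend buffer capacity, at most 16 */
};

static int log_name(void *ctx, const char *name, unsigned long ino)
{
	struct name_log *log = ctx;

	(void)ino;
	if (log->n >= log->limit)
		return 1;	/* buffer full */
	log->names[log->n++] = name;
	return 0;
}
```

The position is only advanced after an entry has been accepted, which is the
same pattern the quoted loop uses with filp->f_pos.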

> +static ssize_t axfs_file_read(struct file *filp, char __user *buf, size_t len,
> + loff_t *ppos)

> + actual_size = len > remaining ? remaining : len;
> + readlength = actual_size < PAGE_SIZE ? actual_size : PAGE_SIZE;

Use min() or min_t()
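
As a rough illustration of the suggestion, the two ternaries collapse into a
pair of minimum operations. This is a userspace sketch: min_size() and
read_chunk_len() are hypothetical helpers, not kernel functions. In the kernel
itself the same thing would be spelled with min() or min_t(size_t, a, b).

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's min_t(size_t, a, b). */
static inline size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Bytes to read in one step: no more than the caller asked for, no more
 * than what is left in the file, and no more than one page. */
static size_t read_chunk_len(size_t len, size_t remaining, size_t page_size)
{
	assert(page_size > 0);
	return min_size(min_size(len, remaining), page_size);
}
```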

> +
> +static int axfs_readpage(struct file *file, struct page *page)
> +{
> +
> + if (node_type == Compressed) {
> + /* node is in compessed region */
> + cnode_offset = AXFS_GET_CNODE_OFFSET(sbi, node_index);
> + cnode_index = AXFS_GET_CNODE_INDEX(sbi, node_index);
> + down_write(&sbi->lock);
> + if (cnode_index != sbi->current_cnode_index) {
> + /* uncompress only necessary if different cblock */
> + ofs = AXFS_GET_CBLOCK_OFFSET(sbi, cnode_index);
> + len = AXFS_GET_CBLOCK_OFFSET(sbi, cnode_index + 1);
> + len -= ofs;
> + axfs_copy_data(sb, cblk1, &(sbi->compressed), ofs, len);
> + axfs_uncompress_block(cblk0, cblk_size, cblk1, len);
> + sbi->current_cnode_index = cnode_index;

I assume compressed blocks can be larger than PAGE_CACHE_SIZE? This
suffers from the rather obvious inefficiency that you decompress a big
block > PAGE_CACHE_SIZE, but only copy one PAGE_CACHE_SIZE page out of
it. If multiple files are being read simultaneously (a common
occurrence), then each is going to replace your one cached uncompressed
block (sbi->current_cnode_index), leading to decompressing the same
blocks over and over again on sequential file access.

readpage file A, index 1 -> decompress block X
readpage file B, index 1 -> decompress block Y (replaces X)
readpage file A, index 2 -> repeated decompress of block X (replaces Y)
readpage file B, index 2 -> repeated decompress of block Y (replaces X)

and so on.
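
The access pattern above can be reproduced with a small simulation. This is a
sketch under assumptions, not AXFS code: count_decompressions() and MAX_SLOTS
are hypothetical, block indexes are assumed non-negative, and replacement is
plain round-robin. It counts how many decompressions a given sequence of
block accesses costs for a cache holding `slots` uncompressed blocks.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 8

/* Returns the number of cache misses (i.e. decompressions) for the
 * given access sequence. Block indexes must be >= 0 because -1 marks
 * an empty slot. */
static unsigned count_decompressions(const int *accesses, size_t n,
				     size_t slots)
{
	int cached[MAX_SLOTS];
	size_t victim = 0;
	unsigned misses = 0;
	size_t i, s;

	assert(slots >= 1 && slots <= MAX_SLOTS);
	for (s = 0; s < slots; s++)
		cached[s] = -1;

	for (i = 0; i < n; i++) {
		int hit = 0;

		for (s = 0; s < slots; s++) {
			if (cached[s] == accesses[i]) {
				hit = 1;
				break;
			}
		}
		if (hit)
			continue;
		/* miss: decompress and replace the next slot in turn */
		cached[victim] = accesses[i];
		victim = (victim + 1) % slots;
		misses++;
	}
	return misses;
}
```

With the interleaved sequence X, Y, X, Y a single slot misses on every access,
while two slots miss only on the first touch of each block.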

> + }
> + downgrade_write(&sbi->lock);
> + max_len = cblk_size - cnode_offset;
> + len = max_len > PAGE_CACHE_SIZE ? PAGE_CACHE_SIZE : max_len;

Again, min() or min_t(). Lots of these.

Phillip

2008-08-22 02:22:50

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

Arnd Bergmann wrote:
> On Thursday 21 August 2008, Jared Hulbert wrote:
>> + array_index = AXFS_GET_INODE_ARRAY_INDEX(sbi, ino_number);
>> + array_index += page->index;
>> +
>> + node_index = AXFS_GET_NODE_INDEX(sbi, array_index);
>> + node_type = AXFS_GET_NODE_TYPE(sbi, array_index);
>> +
>> + if (node_type == Compressed) {
>> + /* node is in compessed region */
>> + cnode_offset = AXFS_GET_CNODE_OFFSET(sbi, node_index);
>> + cnode_index = AXFS_GET_CNODE_INDEX(sbi, node_index);
>> + down_write(&sbi->lock);
>> + if (cnode_index != sbi->current_cnode_index) {
>> + /* uncompress only necessary if different cblock */
>> + ofs = AXFS_GET_CBLOCK_OFFSET(sbi, cnode_index);
>> + len = AXFS_GET_CBLOCK_OFFSET(sbi, cnode_index + 1);
>> + len -= ofs;
>> + axfs_copy_data(sb, cblk1, &(sbi->compressed), ofs, len);
>> + axfs_uncompress_block(cblk0, cblk_size, cblk1, len);
>> + sbi->current_cnode_index = cnode_index;
>> + }
>> + downgrade_write(&sbi->lock);
>> + max_len = cblk_size - cnode_offset;
>> + len = max_len > PAGE_CACHE_SIZE ? PAGE_CACHE_SIZE : max_len;
>> + src = (void *)((unsigned long)cblk0 + cnode_offset);
>> + memcpy(pgdata, src, len);
>> + up_read(&sbi->lock);
>
> This looks very nice, but could use some comments about how the data is
> actually stored on disk. It took me some time to figure out that it actually
> allows to do tail merging into compressed blocks, which I was about to suggest
> you implement ;-). Cramfs doesn't have them, and I found that they are the
> main reason why squashfs compresses better than cramfs, besides the default
> block size, which you can change on either one.

Squashfs has much larger block sizes than cramfs (last time I looked it
was limited to 4K blocks), and it compresses the metadata which helps to
get better compression. But tail merging (fragments in Squashfs
terminology) is obviously a major reason why Squashfs gets good compression.

The AXFS code is rather obscure but it doesn't look to me that it does
tail merging. The following code wouldn't work if the block in question
was a tail contained in a larger block. It assumes the block extends to
the end of the compressed block (cblk_size - cnode_offset).

>> + max_len = cblk_size - cnode_offset;
>> + len = max_len > PAGE_CACHE_SIZE ? PAGE_CACHE_SIZE : max_len;
>> + src = (void *)((unsigned long)cblk0 + cnode_offset);
>> + memcpy(pgdata, src, len);

Perhaps the AXFS authors could clarify this?

Phillip

2008-08-22 03:24:15

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

> Squashfs has much larger block sizes than cramfs (last time I looked it was
> limited to 4K blocks), and it compresses the metadata which helps to get
> better compression. But tail merging (fragments in Squashfs terminology) is
> obviously a major reason why Squashfs gets good compression.
>
> The AXFS code is rather obscure but it doesn't look to me that it does tail
> merging. The following code wouldn't work if the block in question was a
> tail contained in a larger block. It assumes the block extends to the end
> of the compressed block (cblk_size - cnode_offset).

A c_block is the unit that gets compressed. It can contain multiple
c_nodes. The c_block can be PAGE_SIZE to 4GB in size, in theory :)
The c_nodes can be 1B to PAGE_SIZE. in any alignment. I pack many
tails as c_nodes in a c_block.

>>> + max_len = cblk_size - cnode_offset;
>>> + len = max_len > PAGE_CACHE_SIZE ? PAGE_CACHE_SIZE : max_len;
>>> + src = (void *)((unsigned long)cblk0 + cnode_offset);
>>> + memcpy(pgdata, src, len);
>
> Perhaps the AXFS authors could clarify this?

The memcpy in question copies a c_node to the page. The len is the
smaller of two values: the max length of a c_node, which is also the
size of the buffer I'm copying to (PAGE_CACHE_SIZE), or the difference
between the beginning of the c_node in the c_block and the end of the
c_block. The confusion is probably because this copies extra crap to
the page for tails.
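
For readers following the length calculation, here is a userspace sketch of
copying one c_node out of an uncompressed c_block. It deliberately differs
from the quoted code in one respect: it assumes the true length of the c_node
is known to the caller (node_len, a hypothetical parameter) and zero-fills the
rest of the page instead of copying whatever follows the tail. copy_cnode()
and PAGE_BYTES are illustration names, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096	/* stands in for PAGE_CACHE_SIZE */

/* Copy the c_node that starts at cnode_offset inside an uncompressed
 * c_block of cblk_size bytes into a page-sized buffer. The copy length
 * is the smallest of: bytes left in the c_block, one page, and the
 * node's own length. Bytes past the copy are zeroed.
 * Returns the number of bytes copied, or 0 if the offset is out of
 * range (the page is left untouched in that case). */
static size_t copy_cnode(unsigned char *page, const unsigned char *cblk,
			 size_t cblk_size, size_t cnode_offset,
			 size_t node_len)
{
	size_t len;

	assert(page != NULL && cblk != NULL);
	if (cnode_offset >= cblk_size)
		return 0;

	len = cblk_size - cnode_offset;		/* max_len in the patch */
	if (len > PAGE_BYTES)
		len = PAGE_BYTES;
	if (len > node_len)
		len = node_len;

	memcpy(page, cblk + cnode_offset, len);
	memset(page + len, 0, PAGE_BYTES - len);
	return len;
}
```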

2008-08-22 03:28:04

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

> I assume compressed blocks can be larger than PAGE_CACHE_SIZE? This suffers
> from the rather obvious inefficiency that you decompress a big block >
> PAGE_CACHE_SIZE, but only copy one PAGE_CACHE_SIZE page out of it. If
> multiple files are being read simultaneously (a common occurrence), then
> each is going to replace your one cached uncompressed block
> (sbi->current_cnode_index), leading to decompressing the same blocks over
> and over again on sequential file access.
>
> readpage file A, index 1 -> decompress block X
> readpage file B, index 1 -> decompress block Y (replaces X)
> readpage file A, index 2 -> repeated decompress of block X (replaces Y)
> readpage file B, index 2 -> repeated decompress of block Y (replaces X)
>
> and so on.

Yep. Been thinking about optimizing it. So far it hasn't been an
issue for my customers. Most fs traffic being on the XIP pages. Once
I get a good automated performance test up we'll probably look into
something to improve this.

2008-08-22 03:29:53

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

Jared Hulbert wrote:

>
> The memcpy in question copies a c_node to the page. The len is the
> smaller of two values: the max length of a c_node, which is also the
> size of the buffer I'm copying to (PAGE_CACHE_SIZE), or the difference
> between the beginning of the c_node in the c_block and the end of the
> c_block. The confusion is probably because this copies extra crap to
> the page for tails.

Ah yes, that's where I got confused :) Glad to see AXFS uses tail packing.

Phillip

2008-08-22 03:46:50

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

Jared Hulbert wrote:
>> I assume compressed blocks can be larger than PAGE_CACHE_SIZE? This suffers
>> from the rather obvious inefficiency that you decompress a big block >
>> PAGE_CACHE_SIZE, but only copy one PAGE_CACHE_SIZE page out of it. If
>> multiple files are being read simultaneously (a common occurrence), then
>> each is going to replace your one cached uncompressed block
>> (sbi->current_cnode_index), leading to decompressing the same blocks over
>> and over again on sequential file access.
>>
>> readpage file A, index 1 -> decompress block X
>> readpage file B, index 1 -> decompress block Y (replaces X)
>> readpage file A, index 2 -> repeated decompress of block X (replaces Y)
>> readpage file B, index 2 -> repeated decompress of block Y (replaces X)
>>
>> and so on.
>
> Yep. Been thinking about optimizing it. So far it hasn't been an
> issue for my customers. Most fs traffic being on the XIP pages. Once
> I get a good automated performance test up we'll probably look into
> something to improve this.

It's relatively easy to solve. Squashfs explicitly pushes the extra
pages into the pagecache (so subsequent readpages find them there and
don't call readpage on squashfs again).
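
The idea can be modelled in a few lines of userspace C. This is a toy under
stated assumptions, not Squashfs or page cache code: struct toy_pagecache,
toy_readpage(), PAGES_PER_BLOCK and TOY_PAGES are all hypothetical. On a miss
the whole block is decompressed once and every page it covers is marked
present, so later reads of neighbouring pages never reach the decompressor.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_BLOCK 4	/* hypothetical: one block covers 4 pages */
#define TOY_PAGES 16

struct toy_pagecache {
	unsigned char present[TOY_PAGES];	/* 1 if the page is cached */
	unsigned decompress_count;		/* instrumentation only */
};

/* Model of a page read. A page that is already present is a cache hit;
 * in the kernel readpage would not even be called for it. */
static void toy_readpage(struct toy_pagecache *pc, unsigned page_index)
{
	unsigned first, i;

	assert(pc != NULL);
	if (page_index >= TOY_PAGES || pc->present[page_index])
		return;

	/* miss: decompress the containing block once ... */
	pc->decompress_count++;

	/* ... and push every page of that block into the cache */
	first = page_index - page_index % PAGES_PER_BLOCK;
	for (i = first; i < first + PAGES_PER_BLOCK && i < TOY_PAGES; i++)
		pc->present[i] = 1;
}
```

Reading two files alternately, one page at a time, then costs one
decompression per block rather than one per page.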

Phillip

2008-08-22 10:00:43

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

On Friday 22 August 2008, Phillip Lougher wrote:
> >
> > This looks very nice, but could use some comments about how the data is
> > actually stored on disk. It took me some time to figure out that it actually
> > allows to do tail merging into compressed blocks, which I was about to suggest
> > you implement ;-). Cramfs doesn't have them, and I found that they are the
> > main reason why squashfs compresses better than cramfs, besides the default
> > block size, which you can change on either one.
>
> Squashfs has much larger block sizes than cramfs (last time I looked it
> was limited to 4K blocks), and it compresses the metadata which helps to
> get better compression. But tail merging (fragments in Squashfs
> terminology) is obviously a major reason why Squashfs gets good compression.

The *default* block size in cramfs is smaller than in squashfs, but they both
have user selectable block sizes. I found the impact of compressed metadata
to be almost zero. I hacked up a mksquashfs to avoid tail merging, and found
that the image size for squashfs and cramfs is practically identical if you
use the same block size and no tail merging.

> The AXFS code is rather obscure but it doesn't look to me that it does
> tail merging. The following code wouldn't work if the block in question
> was a tail contained in a larger block. It assumes the block extends to
> the end of the compressed block (cblk_size - cnode_offset).

yes, I thought the same thing when I first read that code, and was about
to send a lengthy reply about how it should be changed when I saw that
it already does exactly that ;-).

Arnd <><

2008-08-22 17:08:53

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

Arnd Bergmann wrote:
> On Friday 22 August 2008, Phillip Lougher wrote:
>>> This looks very nice, but could use some comments about how the data is
>>> actually stored on disk. It took me some time to figure out that it actually
>>> allows to do tail merging into compressed blocks, which I was about to suggest
>>> you implement ;-). Cramfs doesn't have them, and I found that they are the
>>> main reason why squashfs compresses better than cramfs, besides the default
>>> block size, which you can change on either one.
>> Squashfs has much larger block sizes than cramfs (last time I looked it
>> was limited to 4K blocks), and it compresses the metadata which helps to
>> get better compression. But tail merging (fragments in Squashfs
>> terminology) is obviously a major reason why Squashfs gets good compression.
>
> The *default* block size in cramfs is smaller than in squashfs, but they both
> have user selectable block sizes. I found the impact of compressed metadata
> to be almost zero.

Squashfs stores significantly more metadata than cramfs. Remember
cramfs has no support for filesystems > ~ 16Mbytes, no inode timestamps,
truncates uid/gids, no hard-links, no nlink counts, no hashed
directories, no unique inode numbers. If Squashfs didn't compress the
metadata it would be significantly larger than cramfs.

Cheers

Phillip

2008-08-22 17:20:32

by Jörn Engel

Subject: Re: [PATCH 04/10] AXFS: axfs_inode.c

On Fri, 22 August 2008 18:08:51 +0100, Phillip Lougher wrote:
>
> Squashfs stores significantly more metadata than cramfs. Remember
> cramfs has no support for filesystems > ~ 16Mbytes, no inode timestamps,
> truncates uid/gids, no hard-links, no nlink counts, no hashed
> directories, no unique inode numbers. If Squashfs didn't compress the
> metadata it would be significantly larger than cramfs.

Elsewhere in this maze of threads Arnd claimed to have tested the
benefits of metadata compression and found that it made little impact.

My guess is that it would make a large impact if metadata would be a
significant part of the filesystem image. Usually metadata is close
enough to 0% to be mistaken for statistical noise. So compressing it
makes a significant impact on an insignificant amount of data.

Jörn

--
One of my most productive days was throwing away 1000 lines of code.
-- Ken Thompson.

2008-08-22 18:04:29