2008-05-22 15:07:49

by Nathan Roberts

Subject: Storing inodes in a separate block device?


Has a feature ever been considered (or does one already exist) for storing
inodes in a block device separate from the data? Is it even a "reasonable"
thing to do, or are there major pitfalls that one would run into?

The rationale behind this question comes from use cases where a file
system is storing very large numbers of files. Reading files in these
file systems will essentially incur at least two seeks: one for the
inode, one for the data blocks. If the seek to the inode were more
efficient, dramatic performance gains could be achieved for such use cases.
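
As a rough illustration (made-up but typical numbers): at ~10 ms per
random seek on a SATA disk, two seeks cost ~20 ms per small file. If
the inode lookup instead costs ~1 ms on a fast device, the per-file
cost drops to ~11 ms, nearly halving read time for a seek-bound
workload.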

Fast-seeking devices (such as flash-based devices) are becoming much
more mainstream these days and would seem like a reasonable home for
the inodes. The $/GB is not as good as disks but it's much better than
DRAM. For many use cases, the number of these "fast access" inodes that
would need to be cached in RAM is near 0. So, RAM savings are also a
potential benefit.

I've run some basic tests using ext4 on a SATA array plus a USB thumb
drive for the inodes. Even with the slowness of a thumb drive, I was
able to see encouraging results (>50% read throughput improvement for a
mixture of 4K-8K files).

I'm interested in hearing thoughts/potential pitfalls/etc.

Nathan






2008-05-22 15:22:20

by Eric Sandeen

Subject: Re: Storing inodes in a separate block device?

Nathan Roberts wrote:
> Has a feature ever been considered (or does one already exist) for storing
> inodes in a block device separate from the data? Is it even a "reasonable"
> thing to do, or are there major pitfalls that one would run into?

XFS has such a thing, although it evolved for slightly different
reasons. The "realtime subvolume" is a data-only volume, with all
metadata on the main block device. It also has some different allocator
characteristics. In practice I don't think it's been used much in the
field on Linux, but ISTR some people have had good luck for some workloads.
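
For reference, setting one up looks roughly like this (device names are
just examples, and this is from memory, so check the man pages):

    mkfs.xfs -r rtdev=/dev/sdc1 /dev/sdb1
    mount -o rtdev=/dev/sdc1 /dev/sdb1 /mnt

Note that data only lands on the realtime device for files that carry
the realtime bit (settable via xfs_io).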

> The rationale behind this question comes from use cases where a file
> system is storing very large numbers of files. Reading files in these
> file systems will essentially incur at least two seeks: one for the
> inode, one for the data blocks. If the seek to the inode were more
> efficient, dramatic performance gains could be achieved for such use cases.
>
> Fast-seeking devices (such as flash-based devices) are becoming much
> more mainstream these days and would seem like a reasonable home for
> the inodes. The $/GB is not as good as disks but it's much better than
> DRAM. For many use cases, the number of these "fast access" inodes that
> would need to be cached in RAM is near 0. So, RAM savings are also a
> potential benefit.

One downside may be flash wear; in a hand-wavy way I could imagine that
data blocks may change less often than metadata in many use cases
(think atimes, directory updates and whatnot). Just a thought.

> I've run some basic tests using ext4 on a SATA array plus a USB thumb
> drive for the inodes. Even with the slowness of a thumb drive, I was
> able to see encouraging results (>50% read throughput improvement for a
> mixture of 4K-8K files).

How'd you test this? Do you have a patch? Sounds interesting.

Thanks,
-Eric

> I'm interested in hearing thoughts/potential pitfalls/etc.
>
> Nathan


2008-05-22 16:03:23

by Andreas Dilger

Subject: Re: Storing inodes in a separate block device?

On May 22, 2008 09:53 -0500, Nathan Roberts wrote:
> Has a feature ever been considered (or does one already exist) for storing
> inodes in a block device separate from the data? Is it even a "reasonable"
> thing to do, or are there major pitfalls that one would run into?

There was a filesystem called "dualfs" that implemented this - I believe
it was a hacked version of ext3. It showed quite decent results.
Similarly, Lustre splits the filesystem metadata (path, permissions,
attributes) onto a different filesystem from the data.

For ext4 there is work being done on the FLEX_BG feature, which will allow
clustering of the metadata into larger groups inside the filesystem, in
order to reduce seeking when doing filesystem scans. This could be taken
to the extreme of grouping all of the metadata into a single area, and
using LVM to place that area on a separate disk.
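
For example, something like the following should pack the metadata of 64
block groups together (assuming an e2fsprogs new enough to know about
flex_bg; the group count is just an example):

    mke2fs -j -O flex_bg -G 64 /dev/sdXX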

Putting the journal on a separate disk would also help reduce seeking
during writes. Using flash for the journal is not useful, though, since
journal IO is almost exclusively linear, with no seeking.
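
For reference, an external journal is already easy to set up with the
stock tools; roughly (device names are just examples):

    mke2fs -O journal_dev /dev/sdYY
    mke2fs -j -J device=/dev/sdYY /dev/sdXX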

> The rationale behind this question comes from use cases where a file system
> is storing very large numbers of files. Reading files in these file systems
> will essentially incur at least two seeks: one for the inode, one for the
> data blocks. If the seek to the inode were more efficient, dramatic
> performance gains could be achieved for such use cases.
>
> Fast-seeking devices (such as flash-based devices) are becoming much more
> mainstream these days and would seem like a reasonable home for the
> inodes. The $/GB is not as good as disks but it's much better than DRAM.
> For many use cases, the number of these "fast access" inodes that would
> need to be cached in RAM is near 0. So, RAM savings are also a potential
> benefit.
>
> I've run some basic tests using ext4 on a SATA array plus a USB thumb drive
> for the inodes. Even with the slowness of a thumb drive, I was able to see
> encouraging results (>50% read throughput improvement for a mixture of
> 4K-8K files).

Ah, are you using FLEX_BG for this test? It would also be interesting to
see if splitting the metadata onto a normal disk had the same effect,
just by virtue of allowing the data and metadata to seek independently.

There is also work to aggregate the allocation of ext4 file allocation
metadata; while this speeds up unlink, the current algorithm hurts
read performance. Having the file allocation metadata on a separate
disk may avoid this performance hit. It may also be that we just need
to make more of an effort to do readahead on this metadata.
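
Something as simple as the sketch below might already help (untested,
and the 16-block window is a made-up number), using the generic
sb_breadahead() helper:

    /* Start background reads on the metadata blocks that follow the
     * one we need; if the workload never touches them we have only
     * queued IO that was sequential with our read anyway. */
    int i;
    for (i = 1; i < 16; i++)
            sb_breadahead(sb, block + i);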

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-05-22 16:58:57

by Nathan Roberts

Subject: Re: Storing inodes in a separate block device?


>> I've run some basic tests using ext4 on a SATA array plus a USB thumb
>> drive for the inodes. Even with the slowness of a thumb drive, I was
>> able to see encouraging results (>50% read throughput improvement for a
>> mixture of 4K-8K files).
>
> How'd you test this? Do you have a patch? Sounds interesting.

Right now I have only changed enough code to be able to test the theory.
It's in no way a presentable patch at this point. With some simplifying
assumptions, the code changes were pretty easy:
- Parse a new "idev=" mount option
- Store bdev information for the inode block device in the sb_info struct
- Change __ext4_get_inode_loc() to recalculate the block offset in the
case of a separate device and issue __getblk() to the alternate device
(rough sketch below)

- A simple utility that copies inodes from one block device to another
is the only other thing that's needed. (This was simpler than modifying
the tools. It also allowed me to easily perform BEFORE/AFTER comparisons
with the only real variable being where the inodes are located.)
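
To give a flavour of the redirect, a minimal sketch of the idea (not the
actual diff; "s_inode_bdev" is just an illustrative name for the field
added to the sb_info struct):

    /* Inside __ext4_get_inode_loc(): read the inode table block from
     * the alternate device when an idev= option was given at mount
     * time, otherwise fall back to the main device as usual. */
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    struct block_device *bdev;

    bdev = sbi->s_inode_bdev ? sbi->s_inode_bdev : sb->s_bdev;
    /* (recalculation of "block" for the alternate device's layout is
     * omitted here) */
    bh = __getblk(bdev, block, sb->s_blocksize);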

So, to get a file system going:
- mke2fs as usual
- copy the inodes from the original blkdev to inode_blkdev (yes, there
are two copies of the inodes; space conservation was not my objective)
- mount using the idev=<inode block device> option
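
In command form, the setup is roughly the following (device names are
examples, and inode_copy is a placeholder name for the throwaway copy
utility mentioned above):

    mke2fs -j /dev/sdXX
    inode_copy /dev/sdXX /dev/sdYY
    mount -o idev=/dev/sdYY /dev/sdXX /mnt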


To run the test:
- mkfs
- mount WITHOUT idev= option
- Create 10 million files
- copy inodes to inode_blkdev

SEQ1
-----
- umount, mount readonly, WITHOUT idev
- echo 3 > /proc/sys/vm/drop_caches
- Read 5000 random files using 500 threads, record average read time

SEQ2
-----
- umount, mount readonly, WITH idev,
- drop_caches
- Read 5000 random files using 500 threads, record average read time

- Repeat SEQ1 and then SEQ2 to verify no unexpected caching is going on
(should see same results as original run).
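
The reader itself is nothing fancy; roughly this shape (a sketch with
error handling and timing trimmed, and the file name scheme is just a
placeholder for whatever the create step used):

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NTHREADS 500
    #define NFILES   5000

    /* Each thread opens a share of randomly chosen files and reads
     * them whole; the real harness also records per-read times. */
    static void *reader(void *arg)
    {
            char path[64], buf[8192];
            int i, fd;

            for (i = 0; i < NFILES / NTHREADS; i++) {
                    snprintf(path, sizeof(path), "/mnt/files/%d",
                             rand() % 10000000);
                    fd = open(path, O_RDONLY);
                    if (fd < 0)
                            continue;
                    while (read(fd, buf, sizeof(buf)) > 0)
                            ;
                    close(fd);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t t[NTHREADS];
            int i;

            for (i = 0; i < NTHREADS; i++)
                    pthread_create(&t[i], NULL, reader, NULL);
            for (i = 0; i < NTHREADS; i++)
                    pthread_join(t[i], NULL);
            return 0;
    }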

--

The filesystem features reported by dumpe2fs were:
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype needs_recovery extents sparse_super large_file


Thanks,
Nathan