On Mon, 14 Jan 2008 08:39:01 -0500
Abhishek Rai <[email protected]> wrote:
> This is the patch for 2.6.24-rc6 -mm tree, please let me know if anyone would like a patch against another recent kernel. Ingo Molnar has already posted a patch for 2.6.24-rc7.
>
Please always retain and maintain the full changelog with each version of a
patch. The changelog in your earlier "Clustering indirect blocks in Ext3"
patch was good. I retained that (after renaming it to "ext3: indirect
block clustering" rather than this dopey name) and then, err, dropped the
patch ;)
Please cc linux-ext4 on ext2/3/4-related work.
Please update Documentation/filesystems/ext3.txt when adding new mount
options.
Here's a quick low-levelish nitpickingish implementation review. I haven't
thought about the design-level things yet.
This code will need careful checking regarding the journalling side of
things - to verify that if the machine crashes at any stage, recovery will
produce a consistent and secure fs. It is all-nigh impossible to spot bugs
in this area via the usual kernel bug reporting process and it is very hard
to effectively runtime test recovery even in a development situation.
Careful review unusually important here.
> +/
> + * Count number of free blocks in a block group that don't lie in the
> + * metacluster region of the block group.
> + */
> +static void
> +ext3_init_grp_free_nonmc_blocks(struct super_block *sb,
> + struct buffer_head *bitmap_bh,
> + unsigned long block_group)
> +{
> + struct ext3_sb_info *sbi = EXT3_SB(sb);
> + struct ext3_bg_info *bgi = &sbi->s_bginfo[block_group];
> +
> + BUG_ON(!test_opt(sb, METACLUSTER));
> +
> + spin_lock(sb_bgl_lock(sbi, block_group));
> + if (bgi->bgi_free_nonmc_blocks_count >= 0)
> + goto out;
> +
> + bgi->bgi_free_nonmc_blocks_count =
> + ext3_count_free(bitmap_bh, sbi->s_nonmc_blocks_per_group/8);
> +
> +out:
> + spin_unlock(sb_bgl_lock(sbi, block_group));
> + BUG_ON(bgi->bgi_free_nonmc_blocks_count >
> + sbi->s_nonmc_blocks_per_group);
> +}
hm, so ext3_count_free() hits prime time.
ext3_count_free() should be converted to use the standard hweight8().
Really it should be converted to hweight_long() or hweight32() or something
- it is presently probably inefficient.
This function appears to be called frequently and it calls the slow-looking
ext3_count_free() under spinlock. This might have scalability problems on
the right machine running the right workload.
Can we address that before we hit it?
> +/*
> + * ext3_update_nonmc_block_count:
> + * Update bgi_free_nonmc_blocks_count for block group 'group_no' following
> + * an allocation or deallocation.
> + *
> + * @group_no: affected block group
> + * @start: start of the [de]allocated range
> + * @count: number of blocks [de]allocated
> + * @allocation: 1 if blocks were allocated, 0 otherwise.
> + */
> +static inline void
> +ext3_update_nonmc_block_count(struct ext3_sb_info *sbi, unsigned long group_no,
> + ext3_grpblk_t start, unsigned long count,
> + int allocation)
> +{
> + struct ext3_bg_info *bginfo = &sbi->s_bginfo[group_no];
> + ext3_grpblk_t change;
> +
> + BUG_ON(bginfo->bgi_free_nonmc_blocks_count < 0);
> + BUG_ON(start >= sbi->s_nonmc_blocks_per_group);
> +
> + change = min_t(ext3_grpblk_t, start + count,
> + sbi->s_nonmc_blocks_per_group) - start;
> +
> + spin_lock(sb_bgl_lock(sbi, group_no));
> + BUG_ON(bginfo->bgi_free_nonmc_blocks_count >
> + sbi->s_nonmc_blocks_per_group);
> + BUG_ON(allocation && bginfo->bgi_free_nonmc_blocks_count < change);
> +
> + bginfo->bgi_free_nonmc_blocks_count += (allocation ? -change : change);
> +
> + BUG_ON(bginfo->bgi_free_nonmc_blocks_count >
> + sbi->s_nonmc_blocks_per_group);
> + spin_unlock(sb_bgl_lock(sbi, group_no));
> +}
Far too large to inline. Please just remove all inlines from the patch.
The compiler will sort it out.
> +static ext3_grpblk_t
> +bitmap_find_prev_zero_bit(char *map, ext3_grpblk_t start, ext3_grpblk_t lowest)
> +{
> + ext3_grpblk_t k, blk;
> +
> + k = start & ~7;
> + while (lowest <= k) {
> + if (map[k/8] != '\255' &&
> + (blk = ext3_find_next_zero_bit(map, k + 8, k))
> + < (k + 8))
> + return blk;
> +
> + k -= 8;
> + }
> + return -1;
> +}
Please rename `k' to something meaningful.
The logic in here looks odd. If (map[k/8] != 0xff) then the
ext3_find_next_zero_bit() cannot fail. So why do we test for it in this
fashion?
Perhaps some commnetary is needed here.
> +static ext3_grpblk_t
> +bitmap_search_prev_usable_block(ext3_grpblk_t start, struct buffer_head *bh,
> + ext3_grpblk_t lowest)
> +{
> + ext3_grpblk_t next;
> + struct journal_head *jh = bh2jh(bh);
> +
> + /*
> + * The bitmap search --- search backward alternately through the actual
> + * bitmap and the last-committed copy until we find a bit free in
> + * both
> + */
> + while (start >= lowest) {
Little style nit: in bitmap_find_prev_zero_bit() you used (lowest < k)
which is a little surprising but I assumeed that you were following the (a
< b < c) layout. But here you're not doing that.
> + next = bitmap_find_prev_zero_bit(bh->b_data, start, lowest);
> + if (next < lowest)
> + return -1;
> + if (ext3_test_allocatable(next, bh))
> + return next;
> + jbd_lock_bh_state(bh);
> + if (jh->b_committed_data)
> + start = bitmap_find_prev_zero_bit(jh->b_committed_data,
> + next, lowest);
> + jbd_unlock_bh_state(bh);
> + }
> + return -1;
> +}
>
> ...
>
> +int ext3_alloc_indirect_blocks(struct super_block *sb,
> + struct buffer_head *bitmap_bh,
> + struct ext3_group_desc *gdp,
> + int group_no, unsigned long indirect_blks,
> + ext3_fsblk_t new_blocks[])
> +{
> + struct ext3_bg_info *bgi = &EXT3_SB(sb)->s_bginfo[group_no];
> + ext3_grpblk_t blk = EXT3_BLOCKS_PER_GROUP(sb) - 1;
> + ext3_grpblk_t mc_start = EXT3_SB(sb)->s_nonmc_blocks_per_group;
> + ext3_fsblk_t group_first_block;
> + int allocated = 0;
> +
> + BUG_ON(!test_opt(sb, METACLUSTER));
> +
> + /* This check is racy but that wouldn't harm us. */
> + if (bgi->bgi_free_nonmc_blocks_count >=
> + le16_to_cpu(gdp->bg_free_blocks_count))
> + return 0;
> +
> + group_first_block = ext3_group_first_block_no(sb, group_no);
> + while (allocated < indirect_blks && blk >= mc_start) {
> + if (!ext3_test_allocatable(blk, bitmap_bh)) {
> + blk = bitmap_search_prev_usable_block(blk, bitmap_bh,
> + mc_start);
> + continue;
> + }
> + if (claim_block(sb_bgl_lock(EXT3_SB(sb), group_no), blk,
> + bitmap_bh)) {
> + new_blocks[allocated++] = group_first_block + blk;
> + } else {
> + /*
> + * The block was allocated by another thread, or it
> + * was allocated and then freed by another thread
> + */
> + cpu_relax();
Whoa. The first ever use of cpu_relax in a filesystem (apart from sysfs,
but that's hardy endorsement).
What are you trying to do here? Why was this needed?
> + }
> + if (allocated < indirect_blks)
> + blk = bitmap_search_prev_usable_block(blk, bitmap_bh,
> + mc_start);
> + }
> + return allocated;
> +}
> +
> +/*
> + * check_allocated_blocks:
> + * Helper function for ext3_new_blocks. Checks newly allocated block
> + * numbers.
> + */
> +int check_allocated_blocks(ext3_fsblk_t blk, unsigned long num,
> + struct super_block *sb, int group_no,
> + struct ext3_group_desc *gdp,
> + struct buffer_head *bitmap_bh)
> +{
> + struct ext3_super_block *es = EXT3_SB(sb)->s_es;
> + struct ext3_sb_info *sbi = EXT3_SB(sb);
> + ext3_fsblk_t grp_blk = blk - ext3_group_first_block_no(sb, group_no);
> +
> + if (in_range(le32_to_cpu(gdp->bg_block_bitmap), blk, num) ||
> + in_range(le32_to_cpu(gdp->bg_inode_bitmap), blk, num) ||
> + in_range(blk, le32_to_cpu(gdp->bg_inode_table),
> + EXT3_SB(sb)->s_itb_per_group) ||
> + in_range(blk + num - 1, le32_to_cpu(gdp->bg_inode_table),
> + EXT3_SB(sb)->s_itb_per_group)) {
> + ext3_error(sb, "ext3_new_blocks",
> + "Allocating block in system zone - "
> + "blocks from "E3FSBLK", length %lu",
> + blk, num);
> + return 1;
> + }
> +
> +#ifdef CONFIG_JBD_DEBUG
> + {
> + struct buffer_head *debug_bh;
> +
> + /* Record bitmap buffer state in the newly allocated block */
> + debug_bh = sb_find_get_block(sb, blk);
> + if (debug_bh) {
> + BUFFER_TRACE(debug_bh, "state when allocated");
> + BUFFER_TRACE2(debug_bh, bitmap_bh, "bitmap state");
> + brelse(debug_bh);
> + }
> + }
> + jbd_lock_bh_state(bitmap_bh);
> + spin_lock(sb_bgl_lock(sbi, group_no));
> + if (buffer_jbd(bitmap_bh) && bh2jh(bitmap_bh)->b_committed_data) {
> + int i;
> +
> + for (i = 0; i < num; i++) {
> + if (ext3_test_bit(grp_blk+i,
> + bh2jh(bitmap_bh)->b_committed_data))
> + printk(KERN_ERR "%s: block was unexpectedly set"
> + " in b_committed_data\n", __FUNCTION__);
> + }
> + }
> + ext3_debug("found bit %d\n", grp_blk);
> + spin_unlock(sb_bgl_lock(sbi, group_no));
> + jbd_unlock_bh_state(bitmap_bh);
> +#endif
Has the CONFIG_JBD_DEBUG=y code been tested?
> +
> + if (blk + num - 1 >= le32_to_cpu(es->s_blocks_count)) {
> + ext3_error(sb, "ext3_new_blocks",
> + "block("E3FSBLK") >= blocks count(%d) - "
> + "block_group = %d, es == %p ", blk,
> + le32_to_cpu(es->s_blocks_count), group_no, es);
> + return 1;
> + }
> +
> + return 0;
> +}
>
> ...
>
> + - (indirect_blks_done + grp_mc_alloc);
> grp_alloc_blk = ext3_try_to_allocate_with_rsv(sb, handle,
> - group_no, bitmap_bh, -1, my_rsv,
> - &num, &fatal);
> + group_no, bitmap_bh, grp_target_blk,
> + my_rsv, &grp_alloc);
> + if (grp_alloc_blk < 0)
> + grp_alloc = 0;
> +
> + /*
> + * If we couldn't allocate anything, there is nothing more to
> + * do with this block group, so move over to the next. But
> + * before that We must release write access to the old one via
> + * ext3_journal_release_buffer(), else we'll run out of credits.
> + */
> + if (grp_mc_alloc == 0 && grp_alloc == 0) {
> + BUFFER_TRACE(bitmap_bh, "journal_release_buffer");
> + ext3_journal_release_buffer(handle, bitmap_bh);
> + goto next;
> + }
> +
> + BUFFER_TRACE(bitmap_bh, "journal_dirty_metadata for "
> + "bitmap block");
> + fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
> if (fatal)
> goto out;
> - if (grp_alloc_blk >= 0)
> +
> + ext3_debug("using block group %d(%d)\n",
> + group_no, gdp->bg_free_blocks_count);
> +
> + BUFFER_TRACE(gdp_bh, "get_write_access");
> + fatal = ext3_journal_get_write_access(handle, gdp_bh);
> + if (fatal)
> + goto out;
> +
> + /* Should this be called before ext3_journal_dirty_metadata? */
If you're concerned here I'd suggest that you formulate a question for the
ext3/4 maintainers to think about.
> + for (i = 0; i < grp_mc_alloc; i++) {
> + if (check_allocated_blocks(
> + new_blocks[indirect_blks_done + i], 1, sb,
> + group_no, gdp, bitmap_bh))
> + goto out;
> + }
> + if (grp_alloc > 0) {
> + ret_block = ext3_group_first_block_no(sb, group_no) +
> + grp_alloc_blk;
> + if (check_allocated_blocks(ret_block, grp_alloc, sb,
> + group_no, gdp, bitmap_bh))
> + goto out;
> + }
> +
> + indirect_blks_done += grp_mc_alloc;
> + performed_allocation = 1;
> +
> + /* The caller will add the new buffer to the journal. */
> + if (grp_alloc > 0)
> + ext3_debug("allocating block %lu. "
> + "Goal hits %d of %d.\n",
> + ret_block, goal_hits, goal_attempts);
> +
> + spin_lock(sb_bgl_lock(sbi, group_no));
> + gdp->bg_free_blocks_count =
> + cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) -
> + (grp_mc_alloc + grp_alloc));
> + spin_unlock(sb_bgl_lock(sbi, group_no));
> + percpu_counter_sub(&sbi->s_freeblocks_counter,
> + (grp_mc_alloc + grp_alloc));
> +
> + BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for "
> + "group descriptor");
> + err = ext3_journal_dirty_metadata(handle, gdp_bh);
> + if (!fatal)
> + fatal = err;
> +
> + sb->s_dirt = 1;
> + if (fatal)
> + goto out;
> +
> + brelse(bitmap_bh);
> + bitmap_bh = NULL;
> +
>
> ...
>
> +
> +/*
> + * ext3_ind_read_end_bio --
> + *
> + * bio callback for read IO issued from ext3_read_indblocks.
> + * May be called multiple times until the whole I/O completes at
> + * which point bio->bi_size = 0 and it frees read_info and bio.
> + * The first time it is called, first_bh is unlocked so that any sync
> + * waier can unblock.
typo.
> + */
> +static void ext3_ind_read_end_bio(struct bio *bio, int err)
> +{
> + struct ext3_ind_read_info *read_info = bio->bi_private;
> + struct buffer_head *bh;
> + int uptodate = !err && test_bit(BIO_UPTODATE, &bio->bi_flags);
It's pretty regrettable that ext3 now starts using bios directly. All of
jbd and ext3 has thus far avoided this additional coupling.
Why was this necessary?
> + int i;
> +
> + if (err == -EOPNOTSUPP)
> + set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
> +
> + /* Wait for all buffers to finish - is this needed? */
> + if (bio->bi_size)
> + return;
> +
> + for (i = 0; i < read_info->count; i++) {
> + bh = read_info->bh[i];
> + if (err == -EOPNOTSUPP)
> + set_bit(BH_Eopnotsupp, &bh->b_state);
> +
> + if (uptodate) {
> + BUG_ON(buffer_uptodate(bh));
> + BUG_ON(ext3_buffer_prefetch(bh));
> + set_buffer_uptodate(bh);
> + if (read_info->seq_prefetch)
> + ext3_set_buffer_prefetch(bh);
> + }
> +
> + unlock_buffer(bh);
> + brelse(bh);
> + }
> +
> + kfree(read_info);
> + bio_put(bio);
> +}
> +
> +/*
> + * ext3_get_max_read --
> + * @inode: inode of file.
> + * @block: block number in file (starting from zero).
> + * @offset_in_dind_block: offset of the indirect block inside it's
> + * parent doubly-indirect block.
> + *
> + * Compute the maximum no. of indirect blocks that can be read
> + * satisfying following constraints:
> + * - Don't read indirect blocks beyond the end of current
> + * doubly-indirect block.
> + * - Don't read beyond eof.
> + */
> +static inline unsigned long ext3_get_max_read(const struct inode *inode,
> + int block,
> + int offset_in_dind_block)
This is far too large to be inlined.
> +{
> + const struct super_block *sb = inode->i_sb;
> + unsigned long max_read;
> + unsigned long ptrs = EXT3_ADDR_PER_BLOCK(inode->i_sb);
> + unsigned long ptrs_bits = EXT3_ADDR_PER_BLOCK_BITS(inode->i_sb);
> + unsigned long blocks_in_file =
> + (inode->i_size + sb->s_blocksize - 1) >> sb->s_blocksize_bits;
> + unsigned long remaining_ind_blks_in_dind =
> + (ptrs >= offset_in_dind_block) ? (ptrs - offset_in_dind_block)
> + : 0;
> + unsigned long remaining_ind_blks_before_eof =
> + ((blocks_in_file - EXT3_NDIR_BLOCKS + ptrs - 1) >> ptrs_bits) -
> + ((block - EXT3_NDIR_BLOCKS) >> ptrs_bits);
> +
> + BUG_ON(block >= blocks_in_file);
> +
> + max_read = min_t(unsigned long, remaining_ind_blks_in_dind,
> + remaining_ind_blks_before_eof);
Can use plain old min() here.
> + BUG_ON(max_read < 1);
Please prefer to use the (much) friendlier ext3_error() over the
user-hostile BUG. Where it makes sense.
This is especially the case where the assertion could be triggered by
corrupted disk contents - doing a BUG() in response to that is a coding
bug.
> + return max_read;
> +}
> +
> +static void ext3_read_indblocks_submit(struct bio **pbio,
> + struct ext3_ind_read_info **pread_info,
> + int *read_cnt, int seq_prefetch)
> +{
> + struct bio *bio = *pbio;
> + struct ext3_ind_read_info *read_info = *pread_info;
> +
> + BUG_ON(*read_cnt < 1);
> +
> + read_info->seq_prefetch = seq_prefetch;
> + read_info->count = *read_cnt;
> + read_info->size = bio->bi_size;
> + bio->bi_private = read_info;
> + bio->bi_end_io = ext3_ind_read_end_bio;
> + submit_bio(READ, bio);
> +
> + *pbio = NULL;
> + *pread_info = NULL;
> + *read_cnt = 0;
> +}
> +
> +struct ind_block_info {
> + ext3_fsblk_t blockno;
> + struct buffer_head *bh;
> +};
> +
> +static int ind_info_cmp(const void *a, const void *b)
> +{
> + struct ind_block_info *info_a = (struct ind_block_info *)a;
> + struct ind_block_info *info_b = (struct ind_block_info *)b;
> +
> + return info_a->blockno - info_b->blockno;
> +}
> +
> +static void ind_info_swap(void *a, void *b, int size)
> +{
> + struct ind_block_info *info_a = (struct ind_block_info *)a;
> + struct ind_block_info *info_b = (struct ind_block_info *)b;
> + struct ind_block_info tmp;
> +
> + tmp = *info_a;
> + *info_a = *info_b;
> + *info_b = tmp;
> +}
Unneeded (and undesirable) casts of void*.
> +/*
> + * ext3_read_indblocks_async --
> + * @sb: super block
> + * @ind_blocks[]: array of indirect block numbers on disk
> + * @count: maximum number of indirect blocks to read
> + * @first_bh: buffer_head for indirect block ind_blocks[0], may be
> + * NULL
> + * @seq_prefetch: if this is part of a sequential prefetch and buffers'
> + * prefetch bit must be set.
> + * @blocks_done: number of blocks considered for prefetching.
> + *
> + * Issue a single bio request to read upto count buffers identified in
> + * ind_blocks[]. Fewer than count buffers may be read in some cases:
> + * - If a buffer is found to be uptodate and it's prefetch bit is set, we
> + * don't look at any more buffers as they will most likely be in the cache.
> + * - We skip buffers we cannot lock without blocking (except for first_bh
> + * if specified).
> + * - We skip buffers beyond a certain range on disk.
> + *
> + * This function must issue read on first_bh if specified unless of course
> + * it's already uptodate.
> + */
This comment is almost-but-not-quite in kerneldoc format. Please review
the /** comments in ext3 and convert newly-added commetns to match.
> +static int ext3_read_indblocks_async(struct super_block *sb,
> + const __le32 ind_blocks[], int count,
> + struct buffer_head *first_bh,
> + int seq_prefetch,
> + unsigned long *blocks_done)
> +{
> + struct buffer_head *bh;
> + struct bio *bio = NULL;
> + struct ext3_ind_read_info *read_info = NULL;
> + int read_cnt = 0, blk;
> + ext3_fsblk_t prev_blk = 0, io_start_blk = 0, curr;
> + struct ind_block_info *ind_info = NULL;
> + int err = 0, ind_info_count = 0;
> +
> + BUG_ON(count < 1);
> + /* Don't move this to ext3_get_max_read() since callers often need to
> + * trim the count returned by that function. So this bound must only
> + * be imposed at the last moment. */
> + count = min_t(unsigned long, count, EXT3_IND_READ_MAX);
Both count and EXT3_IND_READ_MAX are plain old int. Forcing them to ulong
for this comparison is a little odd.
Please check whether those two things have appropriate types - can they
ever legitimately go negative? I doubt it, in which case they should have
unsigned types.
Please review all newly-added min_t's for these things.
> + *blocks_done = 0UL;
> +
> + if (count == 1 && first_bh) {
This comparison could do with a comment explaining what's happening?
> + lock_buffer(first_bh);
> + get_bh(first_bh);
> + first_bh->b_end_io = end_buffer_read_sync;
> + submit_bh(READ, first_bh);
> + *blocks_done = 1UL;
> + return 0;
> + }
> +
> + ind_info = kmalloc(count * sizeof(*ind_info), GFP_KERNEL);
> + if (unlikely(!ind_info))
> + return -ENOMEM;
> +
> + /*
> + * First pass: sort block numbers for all indirect blocks that we'll
> + * read. This allows us to scan blocks in sequenial order during the
> + * second pass which helps coalasce requests to contiguous blocks.
> + * Since we sort block numbers here instead of assuming any specific
> + * layout on the disk, we have some protection against different
> + * indirect block layout strategies as long as they keep all indirect
> + * blocks close by.
> + */
> + for (blk = 0; blk < count; blk++) {
> + curr = le32_to_cpu(ind_blocks[blk]);
> + if (!curr)
> + continue;
> +
> + /*
> + * Skip this block if it lies too far from blocks we have
> + * already decided to read. "Too far" should typically indicate
> + * lying on a different track on the disk. EXT3_IND_READ_MAX
> + * seems reasonable for most disks.
> + */
> + if (io_start_blk > 0 &&
> + (max(io_start_blk, curr) - min(io_start_blk, curr) >=
> + EXT3_IND_READ_MAX))
> + continue;
> +
> + if (blk == 0 && first_bh) {
> + bh = first_bh;
> + get_bh(first_bh);
> + } else {
> + bh = sb_getblk(sb, curr);
> + if (unlikely(!bh)) {
> + err = -ENOMEM;
> + goto failure;
> + }
> + }
> +
> + if (buffer_uptodate(bh)) {
> + if (ext3_buffer_prefetch(bh)) {
> + brelse(bh);
> + break;
> + }
> + brelse(bh);
brelse() is an old-fashioned thing which checks for a NULL arg. Unless you
really have a bh* which might legitimately be NULL, please prefer put_bh().
> + continue;
> + }
> +
> + if (io_start_blk == 0)
> + io_start_blk = curr;
> +
> + ind_info[ind_info_count].blockno = curr;
> + ind_info[ind_info_count].bh = bh;
> + ind_info_count++;
> + }
> + *blocks_done = blk;
> +
> + sort(ind_info, ind_info_count, sizeof(*ind_info),
> + ind_info_cmp, ind_info_swap);
> +
> + /* Second pass: compose bio requests and issue them. */
> + for (blk = 0; blk < ind_info_count; blk++) {
> + bh = ind_info[blk].bh;
> + curr = ind_info[blk].blockno;
> +
> + if (prev_blk > 0 && curr != prev_blk + 1) {
> + ext3_read_indblocks_submit(&bio, &read_info,
> + &read_cnt, seq_prefetch);
> + prev_blk = 0;
> + }
> +
> + /* Lock the buffer without blocking, skipping any buffers
> + * which would require us to block. first_bh when specified is
> + * an exception as caller typically wants it to be read for
> + * sure (e.g., ext3_read_indblocks_sync).
> + */
> + if (bh == first_bh) {
> + lock_buffer(bh);
> + } else if (test_set_buffer_locked(bh)) {
> + brelse(bh);
> + continue;
> + }
> +
>
> ...
>
> +
> + kfree(ind_info);
> + return 0;
> +
> +failure:
> + while (--read_cnt >= 0) {
> + unlock_buffer(read_info->bh[read_cnt]);
> + brelse(read_info->bh[read_cnt]);
> + }
> + *blocks_done = 0UL;
> +
> +done:
> + kfree(read_info);
> +
> + if (bio)
> + bio_put(bio);
> +
> + kfree(ind_info);
> + return err;
> +}
Again, I really don't see why we needed to dive into the BIO layer for
this. Why not use submit_bh() and friends for all of this?
Also, why is it necessary to do block sorting at the filesystem level? The
block layer does that.
> + NULL, 1, &blocks_done);
> +
> + goto done;
> + }
> + brelse(prev_bh);
> + }
> +
> + /* Either random read, or sequential detection failed above.
> + * We always prefetch the next indirect block in this case
> + * whenever possible.
> + * This is because for random reads of size ~512KB, there is
> + * >12% chance that a read will span two indirect blocks.
> + */
> + *err = ext3_read_indblocks_sync(sb, ind_blocks,
> + (max_read >= 2) ? 2 : 1,
> + first_bh, 0, &blocks_done);
> + if (*err)
> + goto out;
> + }
> +
> +done:
> + /* Reader: pointers */
> + if (!verify_chain(chain, &chain[depth - 2])) {
> + brelse(first_bh);
> + goto changed;
> + }
> + add_chain(&chain[depth - 1], first_bh,
> + (__le32 *)first_bh->b_data + offsets[depth - 1]);
> + /* Reader: end */
> + if (!chain[depth - 1].key)
> + goto out;
> +
> + BUG_ON(!buffer_uptodate(first_bh));
> + return NULL;
> +
> +changed:
> + *err = -EAGAIN;
> + goto out;
> +failure:
> + *err = -EIO;
> +out:
> + if (*err) {
> + ext3_debug("Error %d reading indirect blocks\n", *err);
> + return &chain[depth - 2];
> + } else
> + return &chain[depth - 1];
> +}
I'm wondering about the interaction between this code and the
buffer_boundary() logic. I guess we should disable the buffer_boundary()
handling when this code is in effect. Have you reviewed and tested that
aspect?
> @@ -1692,6 +1699,13 @@ static int ext3_fill_super (struct super
> }
> sbi->s_frags_per_block = 1;
> sbi->s_blocks_per_group = le32_to_cpu(es->s_blocks_per_group);
> + if (test_opt(sb, METACLUSTER)) {
> + sbi->s_nonmc_blocks_per_group = sbi->s_blocks_per_group -
> + sbi->s_blocks_per_group / 12;
> + sbi->s_nonmc_blocks_per_group &= ~7;
> + } else
> + sbi->s_nonmc_blocks_per_group = sbi->s_blocks_per_group;
> +
hm, magic numbers.
> + sbi->s_bginfo = kmalloc(sbi->s_groups_count *
> + sizeof(*sbi->s_bginfo), GFP_KERNEL);
> + if (!sbi->s_bginfo) {
> + printk(KERN_ERR "EXT3-fs: not enough memory\n");
> + goto failed_mount3;
> + }
> + for (i = 0; i < sbi->s_groups_count; i++)
> + sbi->s_bginfo[i].bgi_free_nonmc_blocks_count = -1;
> + } else
> + sbi->s_bginfo = NULL;
> +
> /*
> * set up enough so that it can read an inode
> */
I'm wondering about the real value of this change, really.
In any decent environment, people will fsck their ext3 filesystems during
planned downtime, and the benefit of reducing that downtime from 6
hours/machine to 2 hours/machine is probably fairly small, given that there
is no service interruption. (The same applies to desktops and laptops).
Sure, the benefit is not *zero*, but it's small. Much less than it would
be with ext2. I mean, the "avoid unplanned fscks" feature is the whole
reason why ext3 has journalling (and boy is that feature expensive during
normal operation).
So... it's unobvious that the benefit of this feature is worth its risks
and costs?
On Tue, Jan 15, 2008 at 03:04:41AM -0800, Andrew Morton wrote:
> I'm wondering about the real value of this change, really.
>
> In any decent environment, people will fsck their ext3 filesystems during
> planned downtime, and the benefit of reducing that downtime from 6
> hours/machine to 2 hours/machine is probably fairly small, given that there
> is no service interruption. (The same applies to desktops and laptops).
>
> Sure, the benefit is not *zero*, but it's small. Much less than it would
> be with ext2. I mean, the "avoid unplanned fscks" feature is the whole
> reason why ext3 has journalling (and boy is that feature expensive during
> normal operation).
>
> So... it's unobvious that the benefit of this feature is worth its risks
> and costs?
They won't fsck in planned downtimes. They will have to use fsck when
the shit hits the fan and they need to. Not sure about ext3, but big
XFS user with a close tie to the US goverment were concerned about this
case for really big filesystems and have sponsored speedup including
multithreading xfs_repair. I'm pretty sure the same arguments apply
to ext3, even if the filesystems are a few magnitudes smaller.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
---end quoted text---
On Tue, Jan 15, 2008 at 01:15:33PM +0000, Christoph Hellwig wrote:
> They won't fsck in planned downtimes. They will have to use fsck when
> the shit hits the fan and they need to. Not sure about ext3, but big
> XFS user with a close tie to the US goverment were concerned about this
> case for really big filesystems and have sponsored speedup including
> multithreading xfs_repair. I'm pretty sure the same arguments apply
> to ext3, even if the filesystems are a few magnitudes smaller.
And to add to that thanks to the not quite optimal default of
peridocially checking that I alwasy forget to turn off on test machines
an ext3 fsck speedup would be in my personal interested, and probably
that of tons of developers :)
Andrew Morton wrote:
> I'm wondering about the real value of this change, really.
>
> In any decent environment, people will fsck their ext3 filesystems during
> planned downtime, and the benefit of reducing that downtime from 6
> hours/machine to 2 hours/machine is probably fairly small, given that there
> is no service interruption. (The same applies to desktops and laptops).
>
> Sure, the benefit is not *zero*, but it's small. Much less than it would
> be with ext2. I mean, the "avoid unplanned fscks" feature is the whole
> reason why ext3 has journalling (and boy is that feature expensive during
> normal operation).
>
> So... it's unobvious that the benefit of this feature is worth its risks
> and costs?
I actually think that the value of this kind of reduction is huge. We
have seen fsck run for days (not just hours) which makes the "restore
from backup" versus "fsck" decision favor the tapes...
ric
On Tue, Jan 15, 2008 at 01:15:33PM +0000, Christoph Hellwig wrote:
> They won't fsck in planned downtimes. They will have to use fsck when
> the shit hits the fan and they need to. Not sure about ext3, but big
> XFS user with a close tie to the US goverment were concerned about this
> case for really big filesystems and have sponsored speedup including
> multithreading xfs_repair. I'm pretty sure the same arguments apply
> to ext3, even if the filesystems are a few magnitudes smaller.
Agreed, 100%. Even if you fsck snapshots during slow periods, it
still doesn't help you if the filesystem gets corrupted due to a
hardware or software error. That's where this will matter the most.
Val Hensen has done a proof of concept patch that multi-threads e2fsck
(and she's working on one that would be long-term supportable) that
might reduce the value of this patch, but metaclustering should still
help.
> > In any decent environment, people will fsck their ext3 filesystems during
> > planned downtime, and the benefit of reducing that downtime from 6
> > hours/machine to 2 hours/machine is probably fairly small, given that there
> > is no service interruption. (The same applies to desktops and laptops).
> >
> > Sure, the benefit is not *zero*, but it's small. Much less than it would
> > be with ext2. I mean, the "avoid unplanned fscks" feature is the whole
> > reason why ext3 has journalling (and boy is that feature expensive during
> > normal operation).
Also, it's not just reducing fsck times, although that's the main one.
The last time this was suggested, the rationale was to speed up the
"rm dvd.iso" case. Also, something which *could* be done, if Abhishek
wants to pursue it, would be to pull in all of the indirect blocks
when the file is opened, and create an in-memory extent tree that
would speed up access to the file. It's rarely worth doing this
without metaclustering, since it doesn't help for sequential I/O, only
random I/O, but with metaclustering it would also be a win for
sequential I/O. (This would also remove the minor performance
degradation for sequential I/O imposed by metaclustering, and in fact
improve it slightly for really big files.)
- Ted
On Tue, 15 Jan 2008 03:04:41 PST, Andrew Morton said:
> In any decent environment, people will fsck their ext3 filesystems during
> planned downtime, and the benefit of reducing that downtime from 6
> hours/machine to 2 hours/machine is probably fairly small, given that there
> is no service interruption. (The same applies to desktops and laptops).
I've got multiple boxes across the hall that have 50T of disk on them, in one
case as one large filesystem, and the users want *more* *bigger* still (damned
researchers - you put a 15 teraflop supercomputer in the room, and then they
want someplace to *put* all the numbers that come spewing out of there.. ;)
There comes a point where that downtime gets too long to be politically
expedient. 6->2 may not be a biggie, because you can likely get a 6 hour
window. 24->8 suddenly looks a lot different.
(Having said that, I'll admit the one 52T filesystem is an SGI Itanium box
running Suse and using XFS rather than ext3).
Has anybody done a back-of-envelope of what this would do for fsck times for
a "max realistically achievable ext3 filesystem" (i.e. 100T-200T or ext3
design limit, whichever is smaller)?
(And one of the research crew had a not-totally-on-crack proposal to get a
petabyte of spinning oxide. Figuring out how to back that up would probably
have landed firmly in my lap. Ouch. ;)
On Tue, 15 Jan 2008 10:09:16 EST, Ric Wheeler said:
> I actually think that the value of this kind of reduction is huge. We
> have seen fsck run for days (not just hours) which makes the "restore
> from backup" versus "fsck" decision favor the tapes...
Funny thing is that for many of these sorts of cases, "restore from backup"
is *also* a "days" issue unless you do a *lot* of very clever planning
ahead to be able to get multiple tape drives moving at the same time while
not causing issues at the receiving end either....
On Jan 15, 2008 23:25 -0500, [email protected] wrote:
> I've got multiple boxes across the hall that have 50T of disk on them, in one
> case as one large filesystem, and the users want *more* *bigger* still (damned
> researchers - you put a 15 teraflop supercomputer in the room, and then they
> want someplace to *put* all the numbers that come spewing out of there.. ;)
>
> There comes a point where that downtime gets too long to be politically
> expedient. 6->2 may not be a biggie, because you can likely get a 6 hour
> window. 24->8 suddenly looks a lot different.
>
> (Having said that, I'll admit the one 52T filesystem is an SGI Itanium box
> running Suse and using XFS rather than ext3).
>
> Has anybody done a back-of-envelope of what this would do for fsck times for
> a "max realistically achievable ext3 filesystem" (i.e. 100T-200T or ext3
> design limit, whichever is smaller)?
This is exactly the kind of environment that Lustre is designed for.
Not only do you get parallel IO performance, but you can also do parallel
e2fsck on the individual filesystems when you need it. Not that we aren't
also working on improving e2fsck performance, but 100 * 4TB e2fsck in
parallel is much better than 1 * 400TB e2fsck (probably not possible on
a system today due to RAM constraints though I haven't really done any
calculations either way). I know customers were having RAM problems with
3 * 2TB e2fsck in parallel on a 2GB node.
Most customers these days use 2-4 4-8TB filesystems per server
(inexpensive SMP node with 2-4GB RAM instead of a single monstrous SGI box).
We have many Lustre filesystems in the 100-200TB range today, some over 1PB
already and much larger ones being planned.
> (And one of the research crew had a not-totally-on-crack proposal to get a
> petabyte of spinning oxide. Figuring out how to back that up would probably
> have landed firmly in my lap. Ouch. ;)
Charge them for 2PB of storage, and use rsync :-).
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Jan 15, 2008 10:28 AM, Theodore Tso <[email protected]> wrote:
> Also, it's not just reducing fsck times, although that's the main one.
> The last time this was suggested, the rationale was to speed up the
> "rm dvd.iso" case. Also, something which *could* be done, if Abhishek
> wants to pursue it, would be to pull in all of the indirect blocks
> when the file is opened, and create an in-memory extent tree that
> would speed up access to the file. It's rarely worth doing this
> without metaclustering, since it doesn't help for sequential I/O, only
> random I/O, but with metaclustering it would also be a win for
> sequential I/O. (This would also remove the minor performance
> degradation for sequential I/O imposed by metaclustering, and in fact
> improve it slightly for really big files.)
>
> - Ted
>
Also, since the in memory extent tree will now occupy much less space,
we can keep them cached for a much longer time which will improve
performance of random reads. The new metaclustering patch is more
amenable to this trick since it reduces fragmentation thereby reducing
the number of extents.
Thanks,
Abhishek
On Tuesday 15 January 2008 03:04, Andrew Morton wrote:
> I'm wondering about the real value of this change, really.
>
> In any decent environment, people will fsck their ext3 filesystems
> during planned downtime, and the benefit of reducing that downtime
> from 6 hours/machine to 2 hours/machine is probably fairly small,
> given that there is no service interruption. (The same applies to
> desktops and laptops).
>
> Sure, the benefit is not *zero*, but it's small. Much less than it
> would be with ext2. I mean, the "avoid unplanned fscks" feature is
> the whole reason why ext3 has journalling (and boy is that feature
> expensive during normal operation).
>
> So... it's unobvious that the benefit of this feature is worth its
> risks and costs?
Since I am waiting for an Ext3 fsck to complete right now, I thought I
would while away the time by tagging on my "me too" to the concensus
that faster fsck is indeed worth the cost, which is (ahem) free.
Regards,
Daniel
On Thursday 17 January 2008 04:47, Abhishek Rai wrote:
> > if Abhishek wants to pursue it, would be to pull in all of the
> > indirect blocks when the file is opened, and create an in-memory
> > extent tree that would speed up access to the file. It's rarely
> > worth doing this without metaclustering, since it doesn't help for
> > sequential I/O, only random I/O, but with metaclustering it would
> > also be a win for sequential I/O. (This would also remove the
> > minor performance degradation for sequential I/O imposed by
> > metaclustering, and in fact improve it slightly for really big
> > files.)
>
> Also, since the in memory extent tree will now occupy much less
> space, we can keep them cached for a much longer time which will
> improve performance of random reads. The new metaclustering patch is
> more amenable to this trick since it reduces fragmentation thereby
> reducing the number of extents.
I can see value in preemptively loading indirect blocks into the buffer
cache, but is building a second-order extent tree really worth the
effort? Probing the buffer cache is very fast.
Regards,
Daniel
On Sat, Jan 19, 2008 at 08:10:20PM -0800, Daniel Phillips wrote:
>
> I can see value in preemptively loading indirect blocks into the buffer
> cache, but is building a second-order extent tree really worth the
> effort? Probing the buffer cache is very fast.
It's not that much effort, and for a big database (say, like a 50GB
database file), the indirect blocks would take up 50 megabytes of
memory. Collapsing it into an extent tree would save that memory into
a few kilobytes. I suppose a database server would probably have
5-10GB's of memory, so the grand scheme of things it's not a vast
amount of memory, but the trick is keeping the indirect blocks pinned
so they don't get pushed out by some vast, gigunndo Java application
running in the same server as the database. If you have the indirect
blocks encoded into the extent tree, then you don't have to worry
about that.
- Ted
On Sunday 20 January 2008 18:51, Theodore Tso wrote:
> On Sat, Jan 19, 2008 at 08:10:20PM -0800, Daniel Phillips wrote:
> > I can see value in preemptively loading indirect blocks into the
> > buffer cache, but is building a second-order extent tree really
> > worth the effort? Probing the buffer cache is very fast.
>
> It's not that much effort, and for a big database (say, like a 50GB
> database file), the indirect blocks would take up 50 megabytes of
> memory. Collapsing it into an extent tree would save that memory
> into a few kilobytes. I suppose a database server would probably
> have 5-10GB's of memory, so the grand scheme of things it's not a
> vast amount of memory, but the trick is keeping the indirect blocks
> pinned so they don't get pushed out by some vast, gigunndo Java
> application running in the same server as the database. If you have
> the indirect blocks encoded into the extent tree, then you don't have
> to worry about that.
Hi Ted,
OK I think you are right, because this is a nice step towards developing
an on-disk extent format for Ext4 that avoids committing design
mistakes to permanent storage. The benefit can be proven using a pure
cache, in order to justify the considerable work necessary to make it
persistent.
Chris and Jens have an effort going to implement a physical disk extent
cache for loop.c. It is actually the same problem, and I smell a
library here.
Issue: how do you propose to make this cache evictable?
Regards,
Daniel