Hi!
Here is a small patch that implements concurrent block allocation
for ext3. It removes lock_super() in ext3_new_block() and ext3_free_blocks().
Modifications of the counters in the superblock and group descriptors are
protected by a spinlock. Tested on SMP for several hours.
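(The s_alloc_lock spinlock itself lives in the in-core superblock info and is
initialized at mount time; that part is not in the balloc.c hunk below, it is
roughly just the following sketch -- exact placement may differ:)

	/* include/linux/ext3_fs_sb.h */
	struct ext3_sb_info {
		/* ... existing fields ... */
		spinlock_t s_alloc_lock;	/* guards the free-blocks counters */
	};

	/* fs/ext3/super.c, ext3_fill_super() */
	spin_lock_init(&EXT3_SB(sb)->s_alloc_lock);
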
--- linux/fs/ext3/balloc.c Thu Feb 20 16:19:06 2003
+++ balloc.c Mon Mar 10 16:00:49 2003
@@ -118,7 +118,6 @@
printk ("ext3_free_blocks: nonexistent device");
return;
}
- lock_super (sb);
es = EXT3_SB(sb)->s_es;
if (block < le32_to_cpu(es->s_first_data_block) ||
block + count < block ||
@@ -214,11 +213,13 @@
block + i);
BUFFER_TRACE(bitmap_bh, "bit already cleared");
} else {
+ spin_lock(&EXT3_SB(sb)->s_alloc_lock);
dquot_freed_blocks++;
gdp->bg_free_blocks_count =
cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)+1);
es->s_free_blocks_count =
cpu_to_le32(le32_to_cpu(es->s_free_blocks_count)+1);
+ spin_unlock(&EXT3_SB(sb)->s_alloc_lock);
}
/* @@@ This prevents newly-allocated data from being
* freed and then reallocated within the same
@@ -267,7 +268,6 @@
error_return:
brelse(bitmap_bh);
ext3_std_error(sb, err);
- unlock_super(sb);
if (dquot_freed_blocks)
DQUOT_FREE_BLOCK(inode, dquot_freed_blocks);
return;
@@ -408,7 +408,6 @@
return 0;
}
- lock_super(sb);
es = EXT3_SB(sb)->s_es;
if (le32_to_cpu(es->s_free_blocks_count) <=
le32_to_cpu(es->s_r_blocks_count) &&
@@ -461,6 +460,7 @@
ext3_debug("Bit not found in block group %d.\n", group_no);
+repeat:
/*
* Now search the rest of the groups. We assume that
* i and gdp correctly point to the last group visited.
@@ -538,9 +538,9 @@
/* The superblock lock should guard against anybody else beating
* us to this point! */
- J_ASSERT_BH(bitmap_bh, !ext3_test_bit(ret_block, bitmap_bh->b_data));
BUFFER_TRACE(bitmap_bh, "setting bitmap bit");
- ext3_set_bit(ret_block, bitmap_bh->b_data);
+ if (ext3_set_bit(ret_block, bitmap_bh->b_data))
+ goto repeat;
performed_allocation = 1;
#ifdef CONFIG_JBD_DEBUG
@@ -586,11 +586,13 @@
ext3_debug("allocating block %d. Goal hits %d of %d.\n",
ret_block, goal_hits, goal_attempts);
+ spin_lock(&EXT3_SB(sb)->s_alloc_lock);
gdp->bg_free_blocks_count =
cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) - 1);
es->s_free_blocks_count =
cpu_to_le32(le32_to_cpu(es->s_free_blocks_count) - 1);
-
+ spin_unlock(&EXT3_SB(sb)->s_alloc_lock);
+
BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
err = ext3_journal_dirty_metadata(handle, gdp_bh);
if (!fatal)
@@ -606,7 +608,6 @@
if (fatal)
goto out;
- unlock_super(sb);
*errp = 0;
brelse(bitmap_bh);
return ret_block;
@@ -618,7 +619,6 @@
*errp = fatal;
ext3_std_error(sb, fatal);
}
- unlock_super(sb);
/*
* Undo the block allocation
*/
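To spell out what happens without lock_super(): if two CPUs race for the same
bitmap bit, the loser sees that from the return value of ext3_set_bit() and
jumps back to the repeat: label to look for another free block, while the
counter updates are serialized by s_alloc_lock. Condensed (error handling and
journaling calls omitted, and assuming ext3_set_bit() acts as a test-and-set
that returns the old bit value), the ext3_new_block() claim path is now
roughly:

	repeat:
		/* scan this and the following groups for a candidate bit, as before */
		...
		BUFFER_TRACE(bitmap_bh, "setting bitmap bit");
		if (ext3_set_bit(ret_block, bitmap_bh->b_data))
			goto repeat;	/* lost the race: somebody claimed it first */
		performed_allocation = 1;
		...
		spin_lock(&EXT3_SB(sb)->s_alloc_lock);
		gdp->bg_free_blocks_count =
			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) - 1);
		es->s_free_blocks_count =
			cpu_to_le32(le32_to_cpu(es->s_free_blocks_count) - 1);
		spin_unlock(&EXT3_SB(sb)->s_alloc_lock);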
On Mar 10, 2003 18:41 +0300, Alex Tomas wrote:
> Here is a small patch that implements concurrent block allocation
> for ext3. It removes lock_super() in ext3_new_block() and ext3_free_blocks().
> Modifications of the counters in the superblock and group descriptors are
> protected by a spinlock. Tested on SMP for several hours.
Any idea how much this improves performance? What sort of tests were
you running? We could improve things a bit further by having separate
per-group locks for updating the group descriptor info, and updating the
superblock only lazily at statfs and unmount time (with a suitable feature
flag so e2fsck can fix it up at recovery time), but you seem to have gotten
the majority of the parallelism from this fix.
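Very roughly, the per-group variant would just trade the single lock for an
array indexed by group (the s_bg_locks name below is only illustrative,
nothing like it exists in the tree today) and stop touching
es->s_free_blocks_count in the fast path:

	/* sketch: one spinlock per block group instead of one per filesystem */
	struct ext3_sb_info {
		/* ... */
		spinlock_t *s_bg_locks;		/* one per group, allocated at mount */
	};

	spin_lock(&EXT3_SB(sb)->s_bg_locks[group_no]);
	gdp->bg_free_blocks_count =
		cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count) - 1);
	spin_unlock(&EXT3_SB(sb)->s_bg_locks[group_no]);
	/* es->s_free_blocks_count recomputed from the group descriptors at
	 * statfs/unmount time; needs a feature flag so e2fsck can fix it up */
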
> @@ -214,11 +213,13 @@
> block + i);
> BUFFER_TRACE(bitmap_bh, "bit already cleared");
> } else {
> + spin_lock(&EXT3_SB(sb)->s_alloc_lock);
> dquot_freed_blocks++;
> gdp->bg_free_blocks_count =
> cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)+1);
> es->s_free_blocks_count =
> cpu_to_le32(le32_to_cpu(es->s_free_blocks_count)+1);
> + spin_unlock(&EXT3_SB(sb)->s_alloc_lock);
One minor nit is that you left an ext3_error() for the "bit already cleared"
case just above this patch hunk.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
On Mon 10 Mar 03 17:25, Andreas Dilger wrote:
> One minor nit is that you left an ext3_error() for the "bit already
> cleared" case just above this patch hunk.
But that one belongs there, because no two threads should be trying to free
the same block at the same time.
Regards,
Daniel
>>>>> Andreas Dilger (AD) writes:
AD> Any idea how much this improves performance? What sort of tests
AD> were you running? We could improve things a bit further by having
AD> separate per-group locks for updating the group descriptor info,
AD> and updating the superblock only lazily at statfs and unmount time
AD> (with a suitable feature flag so e2fsck can fix it up at recovery
AD> time), but you seem to have gotten the majority of the parallelism
AD> from this fix.
I'm trying to measure the improvement.
The tests were:
1) on a big fs (1GB):
   lots of processes (up to 50) creating and removing directories and files, plus
   untarring a kernel tree and make -j4 bzImage, plus
   dd if=/dev/zero of=/mnt/dump.file bs=1M count=8000; rm -f /mnt/dump.file
2) on a small fs (64MB):
   20 processes creating and removing lots of files and directories
In fact, I caught dozens of debug messages about set_bit collisions. Then
I fsck'ed the fs to be sure all is OK.
>> @@ -214,11 +213,13 @@
>> block + i);
>> BUFFER_TRACE(bitmap_bh, "bit already cleared");
>> } else {
>> + spin_lock(&EXT3_SB(sb)->s_alloc_lock);
>> dquot_freed_blocks++;
>> gdp->bg_free_blocks_count =
>> cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)+1);
>> es->s_free_blocks_count =
>> cpu_to_le32(le32_to_cpu(es->s_free_blocks_count)+1);
>> + spin_unlock(&EXT3_SB(sb)->s_alloc_lock);
AD> One minor nit is that you left an ext3_error() for the "bit
AD> already cleared" case just above this patch hunk.
Hmm, what's wrong with it?
with best regards, Alex
SDET on my machine (16x NUMA-Q) has fallen in love with your patch,
and has decided to elope with it to a small desert island. This is
despite its one disk hung off node 0, and the IO throughput of a
slightly damp piece of cotton thread. Apologies for the loss of your
patch as it gets whisked away ;-)
M.
PS. Oh, I had this bit in, per akpm's instructions: for best results, add ____cacheline_aligned_in_smp to struct ext2_bg_info (see the sketch below).
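(That is just padding the per-group info out to a cacheline so the counters
and locks of neighbouring groups don't false-share; the actual ext2_bg_info
fields in the -mjb tree may differ, this is only a sketch:)

	/* sketch: keep each block group's info on its own cacheline on SMP */
	struct ext2_bg_info {
		u32 free_blocks_count;		/* illustrative fields only */
		spinlock_t balloc_lock;
	} ____cacheline_aligned_in_smp;
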
PPS. I'll try to run some more focused tests with aim7 over the weekend.
As if we needed it ...
-------------------------
DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not-comparable with any published results.
Results are shown as percentages of the first set displayed
SDET 1 (see disclaimer)
                        Throughput    Std. Dev
    2.5.64-bk3-mjb3         100.0%        1.8%
    2.5.64-mjb3-ext2        102.0%        1.1%

SDET 2 (see disclaimer)
                        Throughput    Std. Dev
    2.5.64-bk3-mjb3         100.0%        3.7%
    2.5.64-mjb3-ext2        106.1%        3.1%

SDET 4 (see disclaimer)
                        Throughput    Std. Dev
    2.5.64-bk3-mjb3         100.0%        1.5%
    2.5.64-mjb3-ext2        101.1%        2.1%

SDET 8 (see disclaimer)
                        Throughput    Std. Dev
    2.5.64-bk3-mjb3         100.0%        0.2%
    2.5.64-mjb3-ext2        113.3%        0.7%

SDET 16 (see disclaimer)
                        Throughput    Std. Dev
    2.5.64-bk3-mjb3         100.0%        1.1%
    2.5.64-mjb3-ext2        167.1%        0.8%

SDET 32 (see disclaimer)
                        Throughput    Std. Dev
    2.5.64-bk3-mjb3         100.0%        0.9%
    2.5.64-mjb3-ext2        170.7%        0.1%

SDET 64 (see disclaimer)
                        Throughput    Std. Dev
    2.5.64-bk3-mjb3         100.0%        0.7%
    2.5.64-mjb3-ext2        157.2%        0.5%

SDET 128 (see disclaimer)
                        Throughput    Std. Dev
    2.5.64-bk3-mjb3         100.0%        0.3%
    2.5.64-mjb3-ext2        151.3%        0.8%
> SDET on my machine (16x NUMA-Q) has fallen in love with your patch,
> and has decided to elope with it to a small desert island. This is
> despite its one disk hung off node 0, and the IO throughput of a
> slightly damp piece of cotton thread. Apologies for the loss of your
> patch as it gets whisked away ;-)
Dbench (1 disk, x440: 8 real CPUs, 16 with HT)
before:
Throughput 265.032 MB/sec (NB=331.29 MB/sec 2650.32 MBit/sec) 256 procs
after:
Throughput 381.964 MB/sec (NB=477.454 MB/sec 3819.64 MBit/sec) 256 procs
(I took the second run; the first ones are slower, but it seems to be stable after that.)
NUMA-Q 16-way (1 disk, 16 CPUs)
before:
Throughput 48.5304 MB/sec (NB=60.663 MB/sec 485.304 MBit/sec) 256 procs
after:
Throughput 58.8483 MB/sec (NB=73.5603 MB/sec 588.483 MBit/sec) 256 procs
NUMA-Q has slower disks, old adaptors, and a slow cross-node interconnect.
> before:
> Throughput 48.5304 MB/sec (NB=60.663 MB/sec 485.304 MBit/sec) 256 procs
> after:
> Throughput 58.8483 MB/sec (NB=73.5603 MB/sec 588.483 MBit/sec) 256 procs
OK, akpm wanted dbench 32 instead:
before:
Throughput 187.637 MB/sec (NB=234.546 MB/sec 1876.37 MBit/sec) 32 procs
after:
Throughput 378.664 MB/sec (NB=473.33 MB/sec 3786.64 MBit/sec) 32 procs
/me likes.
M.