2007-11-13 14:03:53

by Coly Li

Subject: [PATCH] ext4: dir inode reservation V3

The basic idea of my dir inode reservation patch can be found here:
http://lists.openwall.net/linux-ext4/2007/11/05/3

1, What does dir inode reservation do
Dir inode reservation reserves several inodes in the inode table for a directory when the directory
is created. When a new file is created under this directory, its inode is allocated from the
reserved inode area. This is called the dir_ireserve inode allocator.


2, How does dir inode reservation help
If we create a huge number of directories and create files in each directory alternately (as some
web proxies or mail servers do), the current linear inode allocator in ext[234] introduces unnecessary
hard disk accesses and seeks during creation/unlink, which results in poor performance.

Here is an example. The uppercase letters represent directory inodes, and the corresponding lowercase
letters represent file inodes under the parent directory. When the number of directories is much
larger than the number of block groups, some directory inodes will be allocated in the inode table like this,
<-- blk 0 -->
A B C D E F H I
As files are added under each directory alternately, some time later (when each directory has
2 files) the layout of this inode table will be,
<-- blk 0 -->     <-- blk 1 -->     <-- blk 2 -->
A B C D E F H I   a b c d e f h i   a b c d e f h i
Now suppose these directories are unlinked recursively in hashed order. Every time a file is unlinked
from a directory, the directory's inode must also be written back to disk. If each I/O session can
only write 1 block to disk (the simplest condition, which will not happen in practice), the access
sequence of inode table blocks is,
1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0
From the above sequence, we can see that in order to remove 8 directories and their files, 32 blocks
have to be written to disk.

The dir inode reservation patch tries to improve unlink performance under the above condition. With
the dir_ireserve inode allocator, directory inodes are allocated like this,
<-- blk 0 --> <-- blk 1 --> ...<-- blk 6 --> <-- blk 7 -->
A             B             ...H             I
If we create new files under each directory, the layout of the inode table will be,
<-- blk 0 --> <-- blk 1 --> ...<-- blk 6 --> <-- blk 7 -->
A a a         B b b         ...H h h         I i i
Because the file inodes are very close to their parent directory inode, when these files are removed
recursively, the directory inode and its file inodes can be updated to disk in one I/O session. If
each I/O session can only write 1 block to disk (as assumed before), the access sequence is,
0 1 2 3 4 5 6 7
With the dir_ireserve inode allocator, 24 extra block I/Os are avoided. At the same time, because the
file inodes are near the parent directory inode, we also avoid unnecessary hard disk seeks between
the file inodes and the directory inode.

From benchmarks, I also found a helpful side effect of dir inode reservation -- it is 10%~30% faster
to create these directories and files, and the gain grows as more directories are created. The
reason is the same: when a new file is created, the parent directory inode is written to disk too.
Placing the file inodes and the directory inode close together lets many redundant block I/Os be merged.
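
To make the policy concrete, here is a tiny userspace sketch of how the allocation goals are chosen
(the constants below are only illustrative; in the patch they come from the superblock and the
dir_ireserve mount option):

	#include <stdio.h>

	#define INODES_PER_GROUP 16384	/* illustrative; per-filesystem in ext4 */
	#define DIR_IRESERVE_NR  16	/* dir_ireserve=low: 1 dir slot + 15 file slots */

	int main(void)
	{
		unsigned long n, parent_ino;

		/* directory inodes land on reservation boundaries within a group:
		 * bit 0, 16, 32, ... so each keeps DIR_IRESERVE_NR - 1 spare slots */
		for (n = 0; n < 4; n++)
			printf("directory %lu -> inode bitmap bit %lu\n",
			       n, n * DIR_IRESERVE_NR);

		/* a file created under a directory starts its linear search at the
		 * parent's offset in the group, i.e. inside the reserved window
		 * (mirrors dir->i_ino % EXT4_INODES_PER_GROUP(sb) in the patch) */
		parent_ino = 33;	/* hypothetical parent directory inode number */
		printf("file goal offset = %lu\n", parent_ino % INODES_PER_GROUP);
		return 0;
	}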


3, Fully compatible with current ext[234]
This is the 3rd release (and 5th housekeeping version) of my dir inode reservation patch. V1 was only
for feasibility research, and V2 was incompatible with ext[23]; both required patching e2fsprogs. V3
is a kernel-only patch, with no e2fsprogs modification. V3 makes no on-disk modification or file
system data structure change, therefore it is fully compatible with current ext[234].

4, Dir inode reservation is optional
Dir inode reservation is optional; you can use -o followed by one of these options to enable dir
inode reservation when mounting an ext4 file system:
dir_ireserve=low
dir_ireserve=normal
dir_ireserve=high
Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high'
reserves 127 inodes. Reserving more than 127 inodes does not bring an obvious performance benefit.


5, Performance numbers
On a Core-Duo PC with 2MB DDM memory and a 7200 RPM SATA disk, I built a 50GB ext4 partition, created
50000 directories, and created 15 (1KB) files in each directory alternately. After a remount, I
removed all the directories and files recursively with 'rm -rf'. Below is the benchmark result,
                 normal ext4                  ext4 with dir inode reservation
mount options:   -o data=writeback            -o data=writeback,dir_ireserve=low
Create dirs:     real 0m49.101s               real 2m59.703s
Create files:    real 24m17.962s              real 21m8.161s
Unlink all:      real 24m43.788s              real 17m29.862s
Creating dirs with dir inode reservation is slower than normal ext4, as predicted, because allocating
directory inodes in non-linear order causes extra hard disk seeks and block I/O. Creating files with
dir inode reservation is 13% faster than normal ext4. Unlinking all the directories and files is
29.2% faster, as expected.
When the number of directories increases, the performance improvement becomes more considerable. More
benchmark results will be posted here if necessary; I need more time to run more test cases.
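
For reference, a sketch of the benchmark workload as described above (the path names and the
round-robin reading of 'alternately' are my assumptions, not part of the patch):

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/stat.h>
	#include <unistd.h>

	#define NDIRS	50000
	#define NFILES	15

	int main(void)
	{
		char path[64], buf[1024] = { 0 };	/* 1KB of data per file */
		int d, f, fd;

		for (d = 0; d < NDIRS; d++) {		/* step 1: create the directories */
			snprintf(path, sizeof(path), "dir%05d", d);
			if (mkdir(path, 0755) && errno != EEXIST)
				perror(path);
		}
		for (f = 0; f < NFILES; f++) {		/* step 2: files, round-robin over dirs */
			for (d = 0; d < NDIRS; d++) {
				snprintf(path, sizeof(path), "dir%05d/file%02d", d, f);
				fd = open(path, O_CREAT | O_WRONLY, 0644);
				if (fd < 0) {
					perror(path);
					continue;
				}
				if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
					perror(path);
				close(fd);
			}
		}
		return 0;
	}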


6, Credits
Andreas Dilger and Daniel Phillips developed the original idea of inode reservation. Andreas
introduced the idea to me when I wanted to improve unlink performance on ext4. Theodore Tso pointed
out the compatibility issue in the V2 patch and offered several pieces of useful advice on e2fsprogs
patching. Mingming Cao reviewed the V2 kernel patch and pointed out several issues with it, e.g.
lack of documentation and it being unacceptable for the ext4 patch queue, and provided much other
help with my work. Jan Kara helped review the V2 kernel patch; most of the code in the V3 patch was
developed based on Jan Kara's advice. Thanks also to Eric Sandeen, Chris Wedgwood and the other
friendly people who discussed ext4 with me on IRC. Thank you all for making me enjoy open source
development.


7, Feedback
For any feedback or criticism, please email coyli at suse dot de. I am very happy to hear from you
and to improve the dir inode reservation patch.


Signed-off-by: Coly Li <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Mingming Cao <[email protected]>
---
fs/ext4/ialloc.c | 77 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/super.c | 16 +++++++++
include/linux/ext4_fs.h | 8 ++++
include/linux/ext4_fs_sb.h | 2 +
4 files changed, 103 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 17b5df1..f838a72 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -10,6 +10,8 @@
* Stephen Tweedie ([email protected]), 1993
* Big-endian to little-endian byte-swapping/bitmaps by
* David S. Miller ([email protected]), 1995
+ * Directory inodes reservation by
+ * Coly Li ([email protected]), 2007
*/

#include <linux/time.h>
@@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
return -1;
}

+static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
+ int mode, ext4_group_t *group, unsigned long *ino)
+{
+ struct super_block *sb;
+ struct ext4_sb_info *sbi;
+ struct ext4_group_desc *gdp = NULL;
+ struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
+ ext4_group_t ires_group = *group;
+ unsigned long ires_ino;
+ int i, bit;
+
+ sb = dir->i_sb;
+ sbi = EXT4_SB(sb);
+
+ /* if the inode number is not for directory,
+ * only try to allocate after directory's inode
+ */
+ if (!S_ISDIR(mode)) {
+ *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
+ return 0;
+ }
+
+ /* reserve inodes for new directory */
+ for (i = 0; i < sbi->s_groups_count; i++) {
+ gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
+ if (!gdp)
+ goto fail;
+ bit = 0;
+try_same_group:
+ if (bit < EXT4_INODES_PER_GROUP(sb)) {
+ brelse(bitmap_bh);
+ bitmap_bh = read_inode_bitmap(sb, ires_group);
+ if (!bitmap_bh)
+ goto fail;
+
+ BUFFER_TRACE(bitmap_bh, "get_write_access");
+ if (ext4_journal_get_write_access(
+ handle, bitmap_bh) != 0)
+ goto fail;
+ if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
+ bit, bitmap_bh->b_data)) {
+ /* we won it */
+ BUFFER_TRACE(bitmap_bh,
+ "call ext4_journal_dirty_metadata");
+ if (ext4_journal_dirty_metadata(handle,
+ bitmap_bh) != 0)
+ goto fail;
+ ires_ino = bit;
+ goto find;
+ }
+ /* we lost it */
+ jbd2_journal_release_buffer(handle, bitmap_bh);
+ bit += sbi->s_dir_ireserve_nr;
+ goto try_same_group;
+ }
+ if (++ires_group == sbi->s_groups_count)
+ ires_group = 0;
+ }
+ goto fail;
+find:
+ brelse(bitmap_bh);
+ *group = ires_group;
+ *ino = ires_ino;
+ return 0;
+fail:
+ brelse(bitmap_bh);
+ return -ENOSPC;
+}
+
/*
* There are two policies for allocating an inode. If the new inode is
* a directory, then a forward search is made for a block group with both
@@ -543,6 +614,12 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode)

ino = 0;

+ if (test_opt(sb, DIR_IRESERVE)) {
+ err = ext4_ino_from_ireserve(handle, dir,
+ mode, &group, &ino);
+ if ((!err) && S_ISDIR(mode))
+ goto got;
+ }
repeat_in_this_group:
ino = ext4_find_next_zero_bit((unsigned long *)
bitmap_bh->b_data, EXT4_INODES_PER_GROUP(sb), ino);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b626339..9b873b7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -884,6 +884,7 @@ enum {
Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
Opt_journal_checksum, Opt_journal_async_commit,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
+ Opt_dir_ireserve_low, Opt_dir_ireserve_normal, Opt_dir_ireserve_high,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
@@ -929,6 +930,9 @@ static match_table_t tokens = {
{Opt_data_journal, "data=journal"},
{Opt_data_ordered, "data=ordered"},
{Opt_data_writeback, "data=writeback"},
+ {Opt_dir_ireserve_low, "dir_ireserve=low"},
+ {Opt_dir_ireserve_normal, "dir_ireserve=normal"},
+ {Opt_dir_ireserve_high, "dir_ireserve=high"},
{Opt_offusrjquota, "usrjquota="},
{Opt_usrjquota, "usrjquota=%s"},
{Opt_offgrpjquota, "grpjquota="},
@@ -1311,6 +1315,18 @@ clear_qf_name:
return 0;
sbi->s_stripe = option;
break;
+ case Opt_dir_ireserve_low:
+ set_opt(sbi->s_mount_opt, DIR_IRESERVE);
+ sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_LOW;
+ break;
+ case Opt_dir_ireserve_normal:
+ set_opt(sbi->s_mount_opt, DIR_IRESERVE);
+ sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_NORMAL;
+ break;
+ case Opt_dir_ireserve_high:
+ set_opt(sbi->s_mount_opt, DIR_IRESERVE);
+ sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_HIGH;
+ break;
default:
printk (KERN_ERR
"EXT4-fs: Unrecognized mount option \"%s\" "
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index bcdb59d..88f5173 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -92,6 +92,13 @@ struct ext4_allocation_request {
#define EXT4_GOOD_OLD_FIRST_INO 11

/*
+ * Macro-instructions used to reserve inodes for directories
+ */
+#define EXT4_DIR_IRESERVE_LOW 16
+#define EXT4_DIR_IRESERVE_NORMAL 64
+#define EXT4_DIR_IRESERVE_HIGH 128
+
+/*
* Maximal count of links to a file
*/
#define EXT4_LINK_MAX 65000
@@ -502,6 +509,7 @@ do { \
#define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
#define EXT4_MOUNT_DELALLOC 0x2000000 /* Delalloc support */
#define EXT4_MOUNT_MBALLOC 0x4000000 /* Buddy allocation support */
+#define EXT4_MOUNT_DIR_IRESERVE 0x10000000/* dir inode reservation */
/* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
#define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt
diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h
index 744e746..790b0cf 100644
--- a/include/linux/ext4_fs_sb.h
+++ b/include/linux/ext4_fs_sb.h
@@ -147,6 +147,8 @@ struct ext4_sb_info {

/* locality groups */
struct ext4_locality_group *s_locality_groups;
+ /* number of inodes we reserve in a directory */
+ int s_dir_ireserve_nr;
};
#define EXT4_GROUP_INFO(sb, group) \
EXT4_SB(sb)->s_group_info[(group) >> EXT4_DESC_PER_BLOCK_BITS(sb)] \


--
Coly Li
SuSE PRC Labs


2007-11-13 14:10:10

by Alex Tomas

Subject: Re: [PATCH] ext4: dir inode reservation V3

hmm. so you trade 265% degradation of creation for 40% improvement of unlink?

thanks, Alex

Coly Li wrote:
> normal ext4 ext4 with dir inode reservation
> mount options: -o data=writeback -o data=writeback,dir_ireserve=low
> Create dirs: real 0m49.101s real 2m59.703s
> Create files: real 24m17.962s real 21m8.161s
> Unlink all: real 24m43.788s real 17m29.862s
> Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating
> directory inodes in non-linear order will cause extra hard disk seeking and block I/O. Creating
> files with dir inode reservation is 13% faster than normal ext4. Unlink all the directories and
> files is 29.2% faster as expected.
> When number of directories is increased, the performance improvement will be more considerable. More
> benchmark result will be posted here if necessary, because I need more time to run more test cases.

2007-11-13 16:19:11

by Coly Li

Subject: Re: [PATCH] ext4: dir inode reservation V3

Thanks for the feedback :-)

Alex Tomas wrote:
> hmm. so you trade 265% degradation of creation for 40% improvement of
> unlink?
>
The 265% degradation is only for creating 50000 empty directories, which is not a common case.
There is a 13% improvement when creating 15 files in each directory. The total time for creating
these directories and files is 25m6s VS. 24m86s; indeed, dir inode reservation is a little faster.

Maybe most people will not create huge numbers of empty directories in their applications, so IMHO
the 265% degradation is acceptable.

If users really need to create so many empty directories, they can also mount the file system without
dir inode reservation to get better performance.

> thanks, Alex
>
> Coly Li wrote:
>> normal ext4 ext4 with dir inode reservation
>> mount options: -o data=writeback -o
>> data=writeback,dir_ireserve=low
>> Create dirs: real 0m49.101s real 2m59.703s
>> Create files: real 24m17.962s real 21m8.161s
>> Unlink all: real 24m43.788s real 17m29.862s
>> Creating dirs with dir inode reservation is slower than normal ext4 as
>> predicted, because allocating
>> directory inodes in non-linear order will cause extra hard disk
>> seeking and block I/O. Creating
>> files with dir inode reservation is 13% faster than normal ext4.
>> Unlink all the directories and
>> files is 29.2% faster as expected.
>> When number of directories is increased, the performance improvement
>> will be more considerable. More
>> benchmark result will be posted here if necessary, because I need more
>> time to run more test cases.
>

--
Coly Li
SuSE PRC Labs

2007-11-13 16:35:07

by Coly Li

Subject: Re: [PATCH] ext4: dir inode reservation V3



Coly Li wrote:
> Thanks for the feedback :-)
>
> Alex Tomas wrote:
>> hmm. so you trade 265% degradation of creation for 40% improvement of
>> unlink?
>>
> 265% degradation is only for creating 50000 empty directories. This is not a common case.
> There are 13% improvement on create 15 files in each directories. Total time on creating these
> directories and files are 25m6s VS. 24m86s, indeed, dir inode reservation is a little faster.

Sorry, a typo here: it's 25m6s VS. 24m7.86s.

>
> Maybe most of the people will not create dozens of empty directories in their applications,
> therefore IMHO the 265% degradation is acceptable.
>
> If user really need to create so many empty directories, they also can mount the file system without
> dir inode reservation to get better performance.
>
>> thanks, Alex
>>
>> Coly Li wrote:
>>> normal ext4 ext4 with dir inode reservation
>>> mount options: -o data=writeback -o
>>> data=writeback,dir_ireserve=low
>>> Create dirs: real 0m49.101s real 2m59.703s
>>> Create files: real 24m17.962s real 21m8.161s
>>> Unlink all: real 24m43.788s real 17m29.862s
>>> Creating dirs with dir inode reservation is slower than normal ext4 as
>>> predicted, because allocating
>>> directory inodes in non-linear order will cause extra hard disk
>>> seeking and block I/O. Creating
>>> files with dir inode reservation is 13% faster than normal ext4.
>>> Unlink all the directories and
>>> files is 29.2% faster as expected.
>>> When number of directories is increased, the performance improvement
>>> will be more considerable. More
>>> benchmark result will be posted here if necessary, because I need more
>>> time to run more test cases.
>

--
Coly Li
SuSE PRC Labs

2007-11-20 02:01:04

by Mingming Cao

Subject: Re: [PATCH] ext4: dir inode reservation V3

On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
> Basic idea of my dir inode reservation patch can be found here,
> http://lists.openwall.net/linux-ext4/2007/11/05/3
>
> 1, What does dir inode reservation do
> Dir inode reservation tries to reserve several inodes in inodes table for a directory when this
> directory is created. When create new file under this directory, try to allocate inode from the
> reserved inodes area. This is called as dir_ireserve inode allocator.
>
Thanks for the update.

Let me try to understand your method:

So the basic idea is to not do linear inode allocation for directories? The inode
structure block for a directory only comes from block 0, N, 2N, ... where the number
of skipped blocks N is stored in the in-core
superblock structure.

Whenever we need to allocate an inode for a directory, we skip N reserved bits
(space for N*16 inodes) if the previous block is already allocated. That way
two directories are placed with a hole of N*16 inode structures between them, which
lets the files under the first directory stay closer to their parent
directory. Is this correct?


> 4, Dir inode reservation is optional
> Dir inode reservation is optional, you can use -o followed by one of these options to enable dir
> inode reservation during mount ext4 file system:
> dir_ireserve=low
> dir_ireserve=normal
> dir_ireserve=high

It would be nice to pass the tuning info low/normal/high (16/64/128 blocks
correspondingly) via something other than mount options.

> Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high'
> reserves 127 inodes. Reserving more than 127 inodes does not help to performance obviously.
>
>
> 5, Performance number
> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create
> 50000 directories, and create 15 (1KB) files in each directory alternatively. After a remount, I
> tried to remove all the directories and files recursively by a 'rm -rf'. Bellow is the benchmark result,
> normal ext4 ext4 with dir inode reservation
> mount options: -o data=writeback -o data=writeback,dir_ireserve=low
> Create dirs: real 0m49.101s real 2m59.703s
> Create files: real 24m17.962s real 21m8.161s
> Unlink all: real 24m43.788s real 17m29.862s
> Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating
> directory inodes in non-linear order will cause extra hard disk seeking and block I/O.

Hmm... I suspect there is a bug in your patch; the extra seeks should not
make it 4 times slower.

>
> #include <linux/time.h>
> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
> return -1;
> }
>
> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
> + int mode, ext4_group_t *group, unsigned long *ino)
> +{
> + struct super_block *sb;
> + struct ext4_sb_info *sbi;
> + struct ext4_group_desc *gdp = NULL;
> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
> + ext4_group_t ires_group = *group;
> + unsigned long ires_ino;
> + int i, bit;
> +
> + sb = dir->i_sb;
> + sbi = EXT4_SB(sb);
> +
> + /* if the inode number is not for directory,
> + * only try to allocate after directory's inode
> + */
> + if (!S_ISDIR(mode)) {
> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
> + return 0;
> + }
> +
> + /* reserve inodes for new directory */
> + for (i = 0; i < sbi->s_groups_count; i++) {
> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
> + if (!gdp)
> + goto fail;
> + bit = 0;
> +try_same_group:
> + if (bit < EXT4_INODES_PER_GROUP(sb)) {
> + brelse(bitmap_bh);
> + bitmap_bh = read_inode_bitmap(sb, ires_group);
> + if (!bitmap_bh)
> + goto fail;
> +
> + BUFFER_TRACE(bitmap_bh, "get_write_access");
> + if (ext4_journal_get_write_access(
> + handle, bitmap_bh) != 0)
> + goto fail;
> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
> + bit, bitmap_bh->b_data)) {
> + /* we won it */
> + BUFFER_TRACE(bitmap_bh,
> + "call ext4_journal_dirty_metadata");
> + if (ext4_journal_dirty_metadata(handle,
> + bitmap_bh) != 0)
> + goto fail;
> + ires_ino = bit;
> + goto find;
> + }
> + /* we lost it */
> + jbd2_journal_release_buffer(handle, bitmap_bh);
> + bit += sbi->s_dir_ireserve_nr;
> + goto try_same_group;
> + }
> + if (++ires_group == sbi->s_groups_count)
> + ires_group = 0;
> + }
> + goto fail;
> +find:
> + brelse(bitmap_bh);
> + *group = ires_group;
> + *ino = ires_ino;
> + return 0;
> +fail:
> + brelse(bitmap_bh);
> + return -ENOSPC;
> +}
> +
> /*
> * There are two policies for allocating an inode. If the new inode is
> * a directory, then a forward search is made for a block group with both
> @@ -543,6 +614,12 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode)
>
> ino = 0;
>
> + if (test_opt(sb, DIR_IRESERVE)) {
> + err = ext4_ino_from_ireserve(handle, dir,
> + mode, &group, &ino);
> + if ((!err) && S_ISDIR(mode))
> + goto got;
> + }


So you are calling ext4_ino_from_ireserve() inside a loop that iterates over all
block groups? I think this is a bug; it should move outside of the loop.

> repeat_in_this_group:
> ino = ext4_find_next_zero_bit((unsigned long *)
> bitmap_bh->b_data, EXT4_INODES_PER_GROUP(sb), ino);
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b626339..9b873b7 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -884,6 +884,7 @@ enum {
> Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
> Opt_journal_checksum, Opt_journal_async_commit,
> Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
> + Opt_dir_ireserve_low, Opt_dir_ireserve_normal, Opt_dir_ireserve_high,
> Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
> Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
> Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
> @@ -929,6 +930,9 @@ static match_table_t tokens = {
> {Opt_data_journal, "data=journal"},
> {Opt_data_ordered, "data=ordered"},
> {Opt_data_writeback, "data=writeback"},
> + {Opt_dir_ireserve_low, "dir_ireserve=low"},
> + {Opt_dir_ireserve_normal, "dir_ireserve=normal"},
> + {Opt_dir_ireserve_high, "dir_ireserve=high"},
> {Opt_offusrjquota, "usrjquota="},
> {Opt_usrjquota, "usrjquota=%s"},
> {Opt_offgrpjquota, "grpjquota="},
> @@ -1311,6 +1315,18 @@ clear_qf_name:
> return 0;
> sbi->s_stripe = option;
> break;
> + case Opt_dir_ireserve_low:
> + set_opt(sbi->s_mount_opt, DIR_IRESERVE);
> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_LOW;
> + break;
> + case Opt_dir_ireserve_normal:
> + set_opt(sbi->s_mount_opt, DIR_IRESERVE);
> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_NORMAL;
> + break;
> + case Opt_dir_ireserve_high:
> + set_opt(sbi->s_mount_opt, DIR_IRESERVE);
> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_HIGH;
> + break;
> default:
> printk (KERN_ERR
> "EXT4-fs: Unrecognized mount option \"%s\" "
> diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
> index bcdb59d..88f5173 100644
> --- a/include/linux/ext4_fs.h
> +++ b/include/linux/ext4_fs.h
> @@ -92,6 +92,13 @@ struct ext4_allocation_request {
> #define EXT4_GOOD_OLD_FIRST_INO 11
>
> /*
> + * Macro-instructions used to reserve inodes for directories
> + */
> +#define EXT4_DIR_IRESERVE_LOW 16
> +#define EXT4_DIR_IRESERVE_NORMAL 64
> +#define EXT4_DIR_IRESERVE_HIGH 128
> +
> +/*
> * Maximal count of links to a file
> */
> #define EXT4_LINK_MAX 65000
> @@ -502,6 +509,7 @@ do { \
> #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
> #define EXT4_MOUNT_DELALLOC 0x2000000 /* Delalloc support */
> #define EXT4_MOUNT_MBALLOC 0x4000000 /* Buddy allocation support */
> +#define EXT4_MOUNT_DIR_IRESERVE 0x10000000/* dir inode reservation */
> /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
> #ifndef _LINUX_EXT2_FS_H
> #define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt
> diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h
> index 744e746..790b0cf 100644
> --- a/include/linux/ext4_fs_sb.h
> +++ b/include/linux/ext4_fs_sb.h
> @@ -147,6 +147,8 @@ struct ext4_sb_info {
>
> /* locality groups */
> struct ext4_locality_group *s_locality_groups;
> + /* number of inodes we reserve in a directory */
> + int s_dir_ireserve_nr;
> };
> #define EXT4_GROUP_INFO(sb, group) \
> EXT4_SB(sb)->s_group_info[(group) >> EXT4_DESC_PER_BLOCK_BITS(sb)] \
>
>

2007-11-20 04:05:59

by Coly Li

Subject: Re: [PATCH] ext4: dir inode reservation V3

Thanks for the feedback :-)

Mingming Cao wrote:
> On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
>> Basic idea of my dir inode reservation patch can be found here,
>> http://lists.openwall.net/linux-ext4/2007/11/05/3
>>
>> 1, What does dir inode reservation do
>> Dir inode reservation tries to reserve several inodes in inodes table for a directory when this
>> directory is created. When create new file under this directory, try to allocate inode from the
>> reserved inodes area. This is called as dir_ireserve inode allocator.
>>
> Thanks for the update.
>
> Let me try to understand your method:
>
> So the basic idea is not do linear inode allocation for directory? Inode
> structure block for directory file is only coming from block 0, N, N
> +N,... where the number of skipped blocks N is stored in the in-core
> superblock structure.

N is not stored in the in-core superblock; N = s_dir_ireserve_nr / inodes_per_block. What is stored in
the in-core superblock is the number of inodes to be reserved for each directory.

>
> When ever need to allocate an inode for directory, skip N reserved bits
> (space for N*16 inodes) if the previous block is already allocated. That
> way place two directories with the hole of N*16 inodes structures, then
> allow files under the first directory stay closer with their parent
> directory. Is this correct?

The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because the directory inode also uses
an inode slot from the reserved area, the number of slots left for files is (s_dir_ireserve_nr - 1).
For example, with dir_ireserve=low (s_dir_ireserve_nr = 16), the directory takes one slot and 15
slots remain for its files. Except for the reserved inode count, your understanding exactly matches
my idea.

>
>
>> 4, Dir inode reservation is optional
>> Dir inode reservation is optional, you can use -o followed by one of these options to enable dir
>> inode reservation during mount ext4 file system:
>> dir_ireserve=low
>> dir_ireserve=normal
>> dir_ireserve=high
>
> Would be nice to pass the tuning info low/normal/high(16/64/128 blocks
> correspondingly) via something else rather than mount options.

Sure, I agree with you. I am also wondering whether this patch should let users specify the reserved
inode count directly instead of only low/normal/high. I am also looking for a way to display the
tuning info more conveniently to users.

>
>> Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high'
>> reserves 127 inodes. Reserving more than 127 inodes does not help to performance obviously.
>>
>>
>> 5, Performance number
>> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create
>> 50000 directories, and create 15 (1KB) files in each directory alternatively. After a remount, I
>> tried to remove all the directories and files recursively by a 'rm -rf'. Bellow is the benchmark result,
>> normal ext4 ext4 with dir inode reservation
>> mount options: -o data=writeback -o data=writeback,dir_ireserve=low
>> Create dirs: real 0m49.101s real 2m59.703s
>> Create files: real 24m17.962s real 21m8.161s
>> Unlink all: real 24m43.788s real 17m29.862s
>> Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating
>> directory inodes in non-linear order will cause extra hard disk seeking and block I/O.
>
> Hmm...I suspect there is bug in your patch, the extra seek should not
> contribute to 4 times slower

I agree with you :-)

>
>> #include <linux/time.h>
>> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
>> return -1;
>> }
>>
>> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
>> + int mode, ext4_group_t *group, unsigned long *ino)
>> +{
>> + struct super_block *sb;
>> + struct ext4_sb_info *sbi;
>> + struct ext4_group_desc *gdp = NULL;
>> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
>> + ext4_group_t ires_group = *group;
>> + unsigned long ires_ino;
>> + int i, bit;
>> +
>> + sb = dir->i_sb;
>> + sbi = EXT4_SB(sb);
>> +
>> + /* if the inode number is not for directory,
>> + * only try to allocate after directory's inode
>> + */
>> + if (!S_ISDIR(mode)) {
>> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
>> + return 0;
>> + }
>> +
>> + /* reserve inodes for new directory */
>> + for (i = 0; i < sbi->s_groups_count; i++) {
>> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
>> + if (!gdp)
>> + goto fail;
>> + bit = 0;
>> +try_same_group:
>> + if (bit < EXT4_INODES_PER_GROUP(sb)) {
>> + brelse(bitmap_bh);
>> + bitmap_bh = read_inode_bitmap(sb, ires_group);
>> + if (!bitmap_bh)
>> + goto fail;
>> +
>> + BUFFER_TRACE(bitmap_bh, "get_write_access");
>> + if (ext4_journal_get_write_access(
>> + handle, bitmap_bh) != 0)
>> + goto fail;
>> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
>> + bit, bitmap_bh->b_data)) {
>> + /* we won it */
>> + BUFFER_TRACE(bitmap_bh,
>> + "call ext4_journal_dirty_metadata");
>> + if (ext4_journal_dirty_metadata(handle,
>> + bitmap_bh) != 0)
>> + goto fail;
>> + ires_ino = bit;
>> + goto find;
>> + }
>> + /* we lost it */
>> + jbd2_journal_release_buffer(handle, bitmap_bh);
>> + bit += sbi->s_dir_ireserve_nr;
>> + goto try_same_group;
>> + }
>> + if (++ires_group == sbi->s_groups_count)
>> + ires_group = 0;
>> + }
>> + goto fail;
>> +find:
>> + brelse(bitmap_bh);
>> + *group = ires_group;
>> + *ino = ires_ino;
>> + return 0;
>> +fail:
>> + brelse(bitmap_bh);
>> + return -ENOSPC;
>> +}
>> +
>> /*
>> * There are two policies for allocating an inode. If the new inode is
>> * a directory, then a forward search is made for a block group with both
>> @@ -543,6 +614,12 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode)
>>
>> ino = 0;
>>
>> + if (test_opt(sb, DIR_IRESERVE)) {
>> + err = ext4_ino_from_ireserve(handle, dir,
>> + mode, &group, &ino);
>> + if ((!err) && S_ISDIR(mode))
>> + goto got;
>> + }
>
>
> So you are calling ext4_ino_from_ireserve() inside a loop iterate all
> block groups? I think this is bug, it should move outside of the loop.
>

Hmm, sure, calling ext4_ino_from_ireserve() inside this loop is redundant; it should be outside of
this loop. Sharp eyes :-) So this is the second bug in my patch.

I will submit the V4 patch and basic benchmark numbers before the next internal conf-call. Thanks for
your review, it is a great help to me :-)


>> repeat_in_this_group:
>> ino = ext4_find_next_zero_bit((unsigned long *)
>> bitmap_bh->b_data, EXT4_INODES_PER_GROUP(sb), ino);
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index b626339..9b873b7 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -884,6 +884,7 @@ enum {
>> Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
>> Opt_journal_checksum, Opt_journal_async_commit,
>> Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
>> + Opt_dir_ireserve_low, Opt_dir_ireserve_normal, Opt_dir_ireserve_high,
>> Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
>> Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
>> Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
>> @@ -929,6 +930,9 @@ static match_table_t tokens = {
>> {Opt_data_journal, "data=journal"},
>> {Opt_data_ordered, "data=ordered"},
>> {Opt_data_writeback, "data=writeback"},
>> + {Opt_dir_ireserve_low, "dir_ireserve=low"},
>> + {Opt_dir_ireserve_normal, "dir_ireserve=normal"},
>> + {Opt_dir_ireserve_high, "dir_ireserve=high"},
>> {Opt_offusrjquota, "usrjquota="},
>> {Opt_usrjquota, "usrjquota=%s"},
>> {Opt_offgrpjquota, "grpjquota="},
>> @@ -1311,6 +1315,18 @@ clear_qf_name:
>> return 0;
>> sbi->s_stripe = option;
>> break;
>> + case Opt_dir_ireserve_low:
>> + set_opt(sbi->s_mount_opt, DIR_IRESERVE);
>> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_LOW;
>> + break;
>> + case Opt_dir_ireserve_normal:
>> + set_opt(sbi->s_mount_opt, DIR_IRESERVE);
>> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_NORMAL;
>> + break;
>> + case Opt_dir_ireserve_high:
>> + set_opt(sbi->s_mount_opt, DIR_IRESERVE);
>> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_HIGH;
>> + break;
>> default:
>> printk (KERN_ERR
>> "EXT4-fs: Unrecognized mount option \"%s\" "
>> diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
>> index bcdb59d..88f5173 100644
>> --- a/include/linux/ext4_fs.h
>> +++ b/include/linux/ext4_fs.h
>> @@ -92,6 +92,13 @@ struct ext4_allocation_request {
>> #define EXT4_GOOD_OLD_FIRST_INO 11
>>
>> /*
>> + * Macro-instructions used to reserve inodes for directories
>> + */
>> +#define EXT4_DIR_IRESERVE_LOW 16
>> +#define EXT4_DIR_IRESERVE_NORMAL 64
>> +#define EXT4_DIR_IRESERVE_HIGH 128
>> +
>> +/*
>> * Maximal count of links to a file
>> */
>> #define EXT4_LINK_MAX 65000
>> @@ -502,6 +509,7 @@ do { \
>> #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
>> #define EXT4_MOUNT_DELALLOC 0x2000000 /* Delalloc support */
>> #define EXT4_MOUNT_MBALLOC 0x4000000 /* Buddy allocation support */
>> +#define EXT4_MOUNT_DIR_IRESERVE 0x10000000/* dir inode reservation */
>> /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
>> #ifndef _LINUX_EXT2_FS_H
>> #define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt
>> diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h
>> index 744e746..790b0cf 100644
>> --- a/include/linux/ext4_fs_sb.h
>> +++ b/include/linux/ext4_fs_sb.h
>> @@ -147,6 +147,8 @@ struct ext4_sb_info {
>>
>> /* locality groups */
>> struct ext4_locality_group *s_locality_groups;
>> + /* number of inodes we reserve in a directory */
>> + int s_dir_ireserve_nr;
>> };
>> #define EXT4_GROUP_INFO(sb, group) \
>> EXT4_SB(sb)->s_group_info[(group) >> EXT4_DESC_PER_BLOCK_BITS(sb)] \
>>
>>
>

--
Coly Li
SuSE PRC Labs

2007-11-20 15:58:27

by Jan Kara

Subject: Re: [PATCH] ext4: dir inode reservation V3

Hi Coly,

finally I've found some time to have a look at a new version of your
patch.

> 5, Performance number
> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
> partition, and tried to create 50000 directories, and create 15 (1KB)
> files in each directory alternatively. After a remount, I tried to
> remove all the directories and files recursively by a 'rm -rf'. Bellow
> is the benchmark result,
> normal ext4 ext4 with dir inode reservation
> mount options: -o data=writeback -o data=writeback,dir_ireserve=low
> Create dirs: real 0m49.101s real 2m59.703s
> Create files: real 24m17.962s real 21m8.161s
> Unlink all: real 24m43.788s real 17m29.862s
> Creating dirs with dir inode reservation is slower than normal ext4 as
> predicted, because allocating directory inodes in non-linear order
> will cause extra hard disk seeking and block I/O. Creating files with
> dir inode reservation is 13% faster than normal ext4. Unlink all the
> directories and files is 29.2% faster as expected. When number of
> directories is increased, the performance improvement will be more
> considerable. More benchmark result will be posted here if necessary,
> because I need more time to run more test cases.
Maybe to get a better idea - could you compare how long an untar of a kernel tree,
a find through the whole kernel tree, and a removal of the tree take? Also,
measuring CPU load during your benchmarks would be useful so that we can see
whether we don't increase the cost of searching the bitmaps too much...

The patch is nicely short ;)

> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 17b5df1..f838a72 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -10,6 +10,8 @@
> * Stephen Tweedie ([email protected]), 1993
> * Big-endian to little-endian byte-swapping/bitmaps by
> * David S. Miller ([email protected]), 1995
> + * Directory inodes reservation by
> + * Coly Li ([email protected]), 2007
> */
>
> #include <linux/time.h>
> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
> return -1;
> }
>
> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
> + int mode, ext4_group_t *group, unsigned long *ino)
> +{
> + struct super_block *sb;
> + struct ext4_sb_info *sbi;
> + struct ext4_group_desc *gdp = NULL;
> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
> + ext4_group_t ires_group = *group;
> + unsigned long ires_ino;
> + int i, bit;
> +
> + sb = dir->i_sb;
> + sbi = EXT4_SB(sb);
> +
> + /* if the inode number is not for directory,
> + * only try to allocate after directory's inode
> + */
> + if (!S_ISDIR(mode)) {
> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
> + return 0;
> + }
^^^ You don't set a group here - is this intentional? It means we get
the group from find_group_other() and there we start searching at a
place corresponding to parent's inode number... That would be an
interesting heuristic but I'm not sure if that's what you want.

> +
> + /* reserve inodes for new directory */
> + for (i = 0; i < sbi->s_groups_count; i++) {
> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
> + if (!gdp)
> + goto fail;
> + bit = 0;
> +try_same_group:
> + if (bit < EXT4_INODES_PER_GROUP(sb)) {
> + brelse(bitmap_bh);
> + bitmap_bh = read_inode_bitmap(sb, ires_group);
> + if (!bitmap_bh)
> + goto fail;
> +
> + BUFFER_TRACE(bitmap_bh, "get_write_access");
> + if (ext4_journal_get_write_access(
> + handle, bitmap_bh) != 0)
> + goto fail;
> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
> + bit, bitmap_bh->b_data)) {
> + /* we won it */
> + BUFFER_TRACE(bitmap_bh,
> + "call ext4_journal_dirty_metadata");
> + if (ext4_journal_dirty_metadata(handle,
> + bitmap_bh) != 0)
> + goto fail;
> + ires_ino = bit;
> + goto find;
> + }
> + /* we lost it */
> + jbd2_journal_release_buffer(handle, bitmap_bh);
> + bit += sbi->s_dir_ireserve_nr;
> + goto try_same_group;
> + }
So the above is just a while loop coded with gotos... A while loop
would IMO be better.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-11-20 16:32:03

by Coly Li

Subject: Re: [PATCH] ext4: dir inode reservation V3

Jan,

Thanks for taking time to review the patch :-)

Jan Kara wrote:
> Hi Coly,
>
> finally I've found some time to have a look at a new version of your
> patch.
>
>> 5, Performance number
>> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
>> partition, and tried to create 50000 directories, and create 15 (1KB)
>> files in each directory alternatively. After a remount, I tried to
>> remove all the directories and files recursively by a 'rm -rf'. Bellow
>> is the benchmark result,
>> normal ext4 ext4 with dir inode reservation
>> mount options: -o data=writeback -o data=writeback,dir_ireserve=low
>> Create dirs: real 0m49.101s real 2m59.703s
>> Create files: real 24m17.962s real 21m8.161s
>> Unlink all: real 24m43.788s real 17m29.862s
>> Creating dirs with dir inode reservation is slower than normal ext4 as
>> predicted, because allocating directory inodes in non-linear order
>> will cause extra hard disk seeking and block I/O. Creating files with
>> dir inode reservation is 13% faster than normal ext4. Unlink all the
>> directories and files is 29.2% faster as expected. When number of
>> directories is increased, the performance improvement will be more
>> considerable. More benchmark result will be posted here if necessary,
>> because I need more time to run more test cases.
> Maybe to get some better idea - could you compare how long does take
> untar of a kernel tree, find through the whole kernel tree and removal
> of the tree? Also measuring CPU load during your benchmarks would be
> useful so that we can see whether we don't increase too much the cost of
> searching in bitmaps...

Sure, good ideas, I will add these items to my benchmark list.

>
> The patch is nicely short ;)
Thanks for the encouragement.
The first version was more than 2000 lines, including kernel and e2fsprogs patches. In the end I
found that an approximately 100-line kernel patch can achieve the same performance. Really nice :-)

>
>> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
>> index 17b5df1..f838a72 100644
>> --- a/fs/ext4/ialloc.c
>> +++ b/fs/ext4/ialloc.c
>> @@ -10,6 +10,8 @@
>> * Stephen Tweedie ([email protected]), 1993
>> * Big-endian to little-endian byte-swapping/bitmaps by
>> * David S. Miller ([email protected]), 1995
>> + * Directory inodes reservation by
>> + * Coly Li ([email protected]), 2007
>> */
>>
>> #include <linux/time.h>
>> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
>> return -1;
>> }
>>
>> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
>> + int mode, ext4_group_t *group, unsigned long *ino)
>> +{
>> + struct super_block *sb;
>> + struct ext4_sb_info *sbi;
>> + struct ext4_group_desc *gdp = NULL;
>> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
>> + ext4_group_t ires_group = *group;
>> + unsigned long ires_ino;
>> + int i, bit;
>> +
>> + sb = dir->i_sb;
>> + sbi = EXT4_SB(sb);
>> +
>> + /* if the inode number is not for directory,
>> + * only try to allocate after directory's inode
>> + */
>> + if (!S_ISDIR(mode)) {
>> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
>> + return 0;
>> + }
> ^^^ You don't set a group here - is this intentional? It means we get
> the group from find_group_other() and there we start searching at a
> place corresponding to parent's inode number... That would be an
> interesting heuristic but I'm not sure if that's what you want.

Yes, when allocating a file inode, it just returns the first inode offset in the reserved area of the
parent directory. In this case the group is decided by find_group_other() or find_group_orlov();
ext4_ino_from_ireserve() just tries to persuade the linear inode allocator to search for a free inode
slot after the parent's inode.

>
>> +
>> + /* reserve inodes for new directory */
>> + for (i = 0; i < sbi->s_groups_count; i++) {
>> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
>> + if (!gdp)
>> + goto fail;
>> + bit = 0;
>> +try_same_group:
>> + if (bit < EXT4_INODES_PER_GROUP(sb)) {
>> + brelse(bitmap_bh);
>> + bitmap_bh = read_inode_bitmap(sb, ires_group);
>> + if (!bitmap_bh)
>> + goto fail;
>> +
>> + BUFFER_TRACE(bitmap_bh, "get_write_access");
>> + if (ext4_journal_get_write_access(
>> + handle, bitmap_bh) != 0)
>> + goto fail;
>> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
>> + bit, bitmap_bh->b_data)) {
>> + /* we won it */
>> + BUFFER_TRACE(bitmap_bh,
>> + "call ext4_journal_dirty_metadata");
>> + if (ext4_journal_dirty_metadata(handle,
>> + bitmap_bh) != 0)
>> + goto fail;
>> + ires_ino = bit;
>> + goto find;
>> + }
>> + /* we lost it */
>> + jbd2_journal_release_buffer(handle, bitmap_bh);
>> + bit += sbi->s_dir_ireserve_nr;
>> + goto try_same_group;
>> + }
> So this above is just a while loop coded with goto... While loop
> would be IMO better.

The only reason for me to use a goto is the 80-column limitation :) BTW, the goto does not hurt
performance or readability here. IMHO, it's acceptable :-)

>
> Honza

Today I finished the V4 patch, which fixes a bug and provides a new mount option to configure the
reserved inode count. It can be faster at creating empty directories; I will release it once I finish
basic benchmarking. Your advice on benchmarking is valuable, and I will post the benchmark results
next week.

Best regards.

--
Coly Li
SuSE PRC Labs

2007-11-20 16:44:45

by Jan Kara

Subject: Re: [PATCH] ext4: dir inode reservation V3

On Wed 21-11-07 00:40:17, Coly Li wrote:
> Jan Kara wrote:
> >> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> >> index 17b5df1..f838a72 100644
> >> --- a/fs/ext4/ialloc.c
> >> +++ b/fs/ext4/ialloc.c
> >> @@ -10,6 +10,8 @@
> >> * Stephen Tweedie ([email protected]), 1993
> >> * Big-endian to little-endian byte-swapping/bitmaps by
> >> * David S. Miller ([email protected]), 1995
> >> + * Directory inodes reservation by
> >> + * Coly Li ([email protected]), 2007
> >> */
> >>
> >> #include <linux/time.h>
> >> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
> >> return -1;
> >> }
> >>
> >> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
> >> + int mode, ext4_group_t *group, unsigned long *ino)
> >> +{
> >> + struct super_block *sb;
> >> + struct ext4_sb_info *sbi;
> >> + struct ext4_group_desc *gdp = NULL;
> >> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
> >> + ext4_group_t ires_group = *group;
> >> + unsigned long ires_ino;
> >> + int i, bit;
> >> +
> >> + sb = dir->i_sb;
> >> + sbi = EXT4_SB(sb);
> >> +
> >> + /* if the inode number is not for directory,
> >> + * only try to allocate after directory's inode
> >> + */
> >> + if (!S_ISDIR(mode)) {
> >> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
> >> + return 0;
> >> + }
> > ^^^ You don't set a group here - is this intentional? It means we get
> > the group from find_group_other() and there we start searching at a
> > place corresponding to parent's inode number... That would be an
> > interesting heuristic but I'm not sure if that's what you want.
> Yes, if allocating a file inode, just return first inode offset in the reserved area of parent
> directory. In this case, group is decided by find_group_other() or find_group_orlov(),
> ext4_ino_from_ireserve() just tries to persuade linear inode allocator to search free inode slot
> after parent's inode.
But what I mean is: suppose the parent directory is in group 1, with inode number 10. Now
find_group_other() sets the group to 2, and you set the inode offset to 10, so the
linear allocator will start searching in group 2 at offset 10, which is
*not* just after the directory inode...
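
One way to keep the two consistent for the !S_ISDIR case would be to derive both the goal group and
the offset from the parent - just a sketch of the idea, not part of the posted patch (ext4 inode
numbers start at 1, hence the '- 1'):

	if (!S_ISDIR(mode)) {
		/* derive both the goal group and the offset from the parent,
		 * so the linear search really starts just after the
		 * directory's inode */
		*group = (dir->i_ino - 1) / EXT4_INODES_PER_GROUP(sb);
		*ino = (dir->i_ino - 1) % EXT4_INODES_PER_GROUP(sb);
		return 0;
	}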

> >> +
> >> + /* reserve inodes for new directory */
> >> + for (i = 0; i < sbi->s_groups_count; i++) {
> >> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
> >> + if (!gdp)
> >> + goto fail;
> >> + bit = 0;
> >> +try_same_group:
> >> + if (bit < EXT4_INODES_PER_GROUP(sb)) {
> >> + brelse(bitmap_bh);
> >> + bitmap_bh = read_inode_bitmap(sb, ires_group);
> >> + if (!bitmap_bh)
> >> + goto fail;
> >> +
> >> + BUFFER_TRACE(bitmap_bh, "get_write_access");
> >> + if (ext4_journal_get_write_access(
> >> + handle, bitmap_bh) != 0)
> >> + goto fail;
> >> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
> >> + bit, bitmap_bh->b_data)) {
> >> + /* we won it */
> >> + BUFFER_TRACE(bitmap_bh,
> >> + "call ext4_journal_dirty_metadata");
> >> + if (ext4_journal_dirty_metadata(handle,
> >> + bitmap_bh) != 0)
> >> + goto fail;
> >> + ires_ino = bit;
> >> + goto find;
> >> + }
> >> + /* we lost it */
> >> + jbd2_journal_release_buffer(handle, bitmap_bh);
> >> + bit += sbi->s_dir_ireserve_nr;
> >> + goto try_same_group;
> >> + }
> > So this above is just a while loop coded with goto... While loop
> > would be IMO better.
>
> The only reason for me to use a goto, is 80 column limitation :) BTW,
> goto does not hurt performance and readability here. IMHO, it's
> acceptable :-)
But you could just remove goto try_same_group; and change 'if' to 'while'.
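
For illustration, the same scan with the goto replaced by a while loop (logic unchanged from the
posted patch; untested sketch):

	/* reserve inodes for new directory */
	for (i = 0; i < sbi->s_groups_count; i++) {
		gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
		if (!gdp)
			goto fail;
		bit = 0;
		while (bit < EXT4_INODES_PER_GROUP(sb)) {
			brelse(bitmap_bh);
			bitmap_bh = read_inode_bitmap(sb, ires_group);
			if (!bitmap_bh)
				goto fail;

			BUFFER_TRACE(bitmap_bh, "get_write_access");
			if (ext4_journal_get_write_access(handle, bitmap_bh) != 0)
				goto fail;
			if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
						 bit, bitmap_bh->b_data)) {
				/* we won it */
				BUFFER_TRACE(bitmap_bh,
					"call ext4_journal_dirty_metadata");
				if (ext4_journal_dirty_metadata(handle,
								bitmap_bh) != 0)
					goto fail;
				ires_ino = bit;
				goto find;
			}
			/* we lost it, try the next reservation boundary */
			jbd2_journal_release_buffer(handle, bitmap_bh);
			bit += sbi->s_dir_ireserve_nr;
		}
		if (++ires_group == sbi->s_groups_count)
			ires_group = 0;
	}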

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2007-11-20 20:22:08

by Mingming Cao

Subject: Re: [PATCH] ext4: dir inode reservation V3

On Tue, 2007-11-20 at 12:14 +0800, Coly Li wrote:
> Thanks for the feedback :-)
>
> Mingming Cao wrote:
> > On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
> >> Basic idea of my dir inode reservation patch can be found here,
> >> http://lists.openwall.net/linux-ext4/2007/11/05/3
> >>
> >> 1, What does dir inode reservation do
> >> Dir inode reservation tries to reserve several inodes in inodes table for a directory when this
> >> directory is created. When create new file under this directory, try to allocate inode from the
> >> reserved inodes area. This is called as dir_ireserve inode allocator.
> >>
> > Thanks for the update.
> >
> > Let me try to understand your method:
> >
> > So the basic idea is not do linear inode allocation for directory? Inode
> > structure block for directory file is only coming from block 0, N, N
> > +N,... where the number of skipped blocks N is stored in the in-core
> > superblock structure.
>
> N is not stored in in-core superblock. N = s_dir_ireserve_nr / inodes_per_block. What is stored in
> in-core superblock is number of inodes to be reserved for each directory.
>
> >
> > When ever need to allocate an inode for directory, skip N reserved bits
> > (space for N*16 inodes) if the previous block is already allocated. That
> > way place two directories with the hole of N*16 inodes structures, then
> > allow files under the first directory stay closer with their parent
> > directory. Is this correct?
>
> The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because directory inode will also
> use a inode slot from reserved area, reset slots number for files is (s_dir_ireserve_nr - 1).
> Except for the reserved inodes number, your understanding exactly matches my idea.
>
Ok, thanks for the clarification.

The performance gain on a large number of directories looks interesting,
but I am afraid this makes the uninitialized block group feature less
useful (for speeding up fsck) than before :( The reserved inodes will cause
the unused inode watermark to be higher than before, and spread the
directories over to other block groups earlier than before. Maybe it is
worth it... not sure.

> >
> >
> >> 4, Dir inode reservation is optional
> >> Dir inode reservation is optional, you can use -o followed by one of these options to enable dir
> >> inode reservation during mount ext4 file system:
> >> dir_ireserve=low
> >> dir_ireserve=normal
> >> dir_ireserve=high
> >
> > Would be nice to pass the tuning info low/normal/high(16/64/128 blocks
> > correspondingly) via something else rather than mount options.
>
> Sure, I agree with you. Also I am thinking should this patch permit user to input reserved inodes
> number directly other than a low/normal/high. Also I am looking for methods to display the tuning
> info more convenient to users.
>
export/tune through /procfs?

> >
> >> Currently, 'low' reserves 15 file inodes for each directory, 'normal' reserves 31 inodes and 'high'
> >> reserves 127 inodes. Reserving more than 127 inodes does not help to performance obviously.
> >>
> >>
> >> 5, Performance number
> >> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4 partition, and tried to create
> >> 50000 directories, and create 15 (1KB) files in each directory alternatively. After a remount, I
> >> tried to remove all the directories and files recursively by a 'rm -rf'. Bellow is the benchmark result,
> >> normal ext4 ext4 with dir inode reservation
> >> mount options: -o data=writeback -o data=writeback,dir_ireserve=low
> >> Create dirs: real 0m49.101s real 2m59.703s
> >> Create files: real 24m17.962s real 21m8.161s
> >> Unlink all: real 24m43.788s real 17m29.862s
> >> Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating
> >> directory inodes in non-linear order will cause extra hard disk seeking and block I/O.
> >
> > Hmm...I suspect there is bug in your patch, the extra seek should not
> > contribute to 4 times slower
>
> I agree with you :-)
>
> >
> >> #include <linux/time.h>
> >> @@ -478,6 +480,75 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
> >> return -1;
> >> }
> >>
> >> +static int ext4_ino_from_ireserve(handle_t *handle, struct inode *dir,
> >> + int mode, ext4_group_t *group, unsigned long *ino)
> >> +{
> >> + struct super_block *sb;
> >> + struct ext4_sb_info *sbi;
> >> + struct ext4_group_desc *gdp = NULL;
> >> + struct buffer_head *gdp_bh = NULL, *bitmap_bh = NULL;
> >> + ext4_group_t ires_group = *group;
> >> + unsigned long ires_ino;
> >> + int i, bit;
> >> +
> >> + sb = dir->i_sb;
> >> + sbi = EXT4_SB(sb);
> >> +
> >> + /* if the inode number is not for directory,
> >> + * only try to allocate after directory's inode
> >> + */
> >> + if (!S_ISDIR(mode)) {
> >> + *ino = dir->i_ino % EXT4_INODES_PER_GROUP(sb);
> >> + return 0;
> >> + }
> >> +
> >> + /* reserve inodes for new directory */
> >> + for (i = 0; i < sbi->s_groups_count; i++) {
> >> + gdp = ext4_get_group_desc(sb, ires_group, &gdp_bh);
> >> + if (!gdp)
> >> + goto fail;
> >> + bit = 0;

It would be nice to remember the last allocated bit for each block
group, so we don't have to start from bit 0 (in the worst case scanning the
entire block group) for every single directory. Probably this can be
saved in the in-core block group descriptor. Right now the in-core
block group descriptor structure is the same as the on-disk block group
structure, though; it might be worth separating them and providing load/store
helper functions.
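
Just to illustrate the idea (the structure and helper below are hypothetical, they do not exist in
ext4 today):

	/* Hypothetical in-core per-group state, separate from the on-disk
	 * ext4_group_desc, caching where the last directory reservation landed. */
	struct ext4_group_info_ireserve {
		unsigned long	last_dir_bit;	/* last bit handed to a directory */
	};

	/* Start the reservation scan from the cached position instead of bit 0;
	 * fall back to a full scan only if nothing is free after it. */
	static inline unsigned long ireserve_start_bit(struct ext4_sb_info *sbi,
					struct ext4_group_info_ireserve *gi)
	{
		return gi->last_dir_bit + sbi->s_dir_ireserve_nr;
	}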

> >> +try_same_group:
> >> + if (bit < EXT4_INODES_PER_GROUP(sb)) {
> >> + brelse(bitmap_bh);
> >> + bitmap_bh = read_inode_bitmap(sb, ires_group);
> >> + if (!bitmap_bh)
> >> + goto fail;
> >> +
> >> + BUFFER_TRACE(bitmap_bh, "get_write_access");
> >> + if (ext4_journal_get_write_access(
> >> + handle, bitmap_bh) != 0)
> >> + goto fail;
> >> + if (!ext4_set_bit_atomic(sb_bgl_lock(sbi, ires_group),
> >> + bit, bitmap_bh->b_data)) {
> >> + /* we won it */
> >> + BUFFER_TRACE(bitmap_bh,
> >> + "call ext4_journal_dirty_metadata");
> >> + if (ext4_journal_dirty_metadata(handle,
> >> + bitmap_bh) != 0)
> >> + goto fail;
> >> + ires_ino = bit;
> >> + goto find;
> >> + }
> >> + /* we lost it */
> >> + jbd2_journal_release_buffer(handle, bitmap_bh);
> >> + bit += sbi->s_dir_ireserve_nr;
> >> + goto try_same_group;
> >> + }
> >> + if (++ires_group == sbi->s_groups_count)
> >> + ires_group = 0;
> >> + }
> >> + goto fail;
> >> +find:
> >> + brelse(bitmap_bh);
> >> + *group = ires_group;
> >> + *ino = ires_ino;
> >> + return 0;
> >> +fail:
> >> + brelse(bitmap_bh);
> >> + return -ENOSPC;
> >> +}
> >> +
> >> /*
> >> * There are two policies for allocating an inode. If the new inode is
> >> * a directory, then a forward search is made for a block group with both
> >> @@ -543,6 +614,12 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode)
> >>
> >> ino = 0;
> >>
> >> + if (test_opt(sb, DIR_IRESERVE)) {
> >> + err = ext4_ino_from_ireserve(handle, dir,
> >> + mode, &group, &ino);
> >> + if ((!err) && S_ISDIR(mode))
> >> + goto got;
> >> + }
> >
> >
> > So you are calling ext4_ino_from_ireserve() inside a loop iterate all
> > block groups? I think this is bug, it should move outside of the loop.
> >
>
> Hmm, sure, putting ext4_ino_from_ireserve() in this loop is redundant, it should be out of this
> loop. Neat eyes:-) So this is second bug in my patch.
>
> I will submit V4 patch and basic benchmark number before next internal conf-call. Thanks for your
> review, great help to me :-)
>
>
> >> repeat_in_this_group:
> >> ino = ext4_find_next_zero_bit((unsigned long *)
> >> bitmap_bh->b_data, EXT4_INODES_PER_GROUP(sb), ino);
> >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >> index b626339..9b873b7 100644
> >> --- a/fs/ext4/super.c
> >> +++ b/fs/ext4/super.c
> >> @@ -884,6 +884,7 @@ enum {
> >> Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
> >> Opt_journal_checksum, Opt_journal_async_commit,
> >> Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
> >> + Opt_dir_ireserve_low, Opt_dir_ireserve_normal, Opt_dir_ireserve_high,
> >> Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
> >> Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
> >> Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
> >> @@ -929,6 +930,9 @@ static match_table_t tokens = {
> >> {Opt_data_journal, "data=journal"},
> >> {Opt_data_ordered, "data=ordered"},
> >> {Opt_data_writeback, "data=writeback"},
> >> + {Opt_dir_ireserve_low, "dir_ireserve=low"},
> >> + {Opt_dir_ireserve_normal, "dir_ireserve=normal"},
> >> + {Opt_dir_ireserve_high, "dir_ireserve=high"},
> >> {Opt_offusrjquota, "usrjquota="},
> >> {Opt_usrjquota, "usrjquota=%s"},
> >> {Opt_offgrpjquota, "grpjquota="},
> >> @@ -1311,6 +1315,18 @@ clear_qf_name:
> >> return 0;
> >> sbi->s_stripe = option;
> >> break;
> >> + case Opt_dir_ireserve_low:
> >> + set_opt(sbi->s_mount_opt, DIR_IRESERVE);
> >> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_LOW;
> >> + break;
> >> + case Opt_dir_ireserve_normal:
> >> + set_opt(sbi->s_mount_opt, DIR_IRESERVE);
> >> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_NORMAL;
> >> + break;
> >> + case Opt_dir_ireserve_high:
> >> + set_opt(sbi->s_mount_opt, DIR_IRESERVE);
> >> + sbi->s_dir_ireserve_nr = EXT4_DIR_IRESERVE_HIGH;
> >> + break;
> >> default:
> >> printk (KERN_ERR
> >> "EXT4-fs: Unrecognized mount option \"%s\" "
> >> diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
> >> index bcdb59d..88f5173 100644
> >> --- a/include/linux/ext4_fs.h
> >> +++ b/include/linux/ext4_fs.h
> >> @@ -92,6 +92,13 @@ struct ext4_allocation_request {
> >> #define EXT4_GOOD_OLD_FIRST_INO 11
> >>
> >> /*
> >> + * Macro-instructions used to reserve inodes for directories
> >> + */
> >> +#define EXT4_DIR_IRESERVE_LOW 16
> >> +#define EXT4_DIR_IRESERVE_NORMAL 64
> >> +#define EXT4_DIR_IRESERVE_HIGH 128
> >> +
> >> +/*
> >> * Maximal count of links to a file
> >> */
> >> #define EXT4_LINK_MAX 65000
> >> @@ -502,6 +509,7 @@ do { \
> >> #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
> >> #define EXT4_MOUNT_DELALLOC 0x2000000 /* Delalloc support */
> >> #define EXT4_MOUNT_MBALLOC 0x4000000 /* Buddy allocation support */
> >> +#define EXT4_MOUNT_DIR_IRESERVE 0x10000000/* dir inode reservation */
> >> /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
> >> #ifndef _LINUX_EXT2_FS_H
> >> #define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt
> >> diff --git a/include/linux/ext4_fs_sb.h b/include/linux/ext4_fs_sb.h
> >> index 744e746..790b0cf 100644
> >> --- a/include/linux/ext4_fs_sb.h
> >> +++ b/include/linux/ext4_fs_sb.h
> >> @@ -147,6 +147,8 @@ struct ext4_sb_info {
> >>
> >> /* locality groups */
> >> struct ext4_locality_group *s_locality_groups;
> >> + /* number of inodes we reserve in a directory */
> >> + int s_dir_ireserve_nr;
> >> };
> >> #define EXT4_GROUP_INFO(sb, group) \
> >> EXT4_SB(sb)->s_group_info[(group) >> EXT4_DESC_PER_BLOCK_BITS(sb)] \
> >>
> >>
> >
>

2007-11-21 02:09:48

by Andreas Dilger

Subject: Re: [PATCH] ext4: dir inode reservation V3

On Nov 20, 2007 12:22 -0800, Mingming Cao wrote:
> On Tue, 2007-11-20 at 12:14 +0800, Coly Li wrote:
> > Mingming Cao wrote:
> > > On Tue, 2007-11-13 at 22:12 +0800, Coly Li wrote:
> > The hole is (s_dir_ireserve_nr - 1), not N * s_dir_ireserve_nr. Because
> > directory inode will also use a inode slot from reserved area, reset
> > slots number for files is (s_dir_ireserve_nr - 1). Except for the
> > reserved inodes number, your understanding exactly matches my idea.
>
> The performance gain on large number of directories looks interesting,
> but I am afraid this makes the uninitialized block group feature less
> useful (to speed up fsck) than before:( The reserved inode will cause
> the unused inode watermark higher than before, and spread the
> directories over to other block groups earlier than before. Maybe it
> worth it...not sure.

My original thoughts on the design for this were slightly different:
- that the per-directory reserved window would scale with the size of
the directory, so that even (or especially) with htree directories the
inodes would be kept in hash-ordered sets to speed up stat/unlink
- it would be possible/desirable to re-use the existing block bitmap
reservation code to handle inode bitmap reservation for directories
while those directories are in-core. We already have the mechanisms
for this, "all" that would need to change is have the reservation code
point at the inode bitmaps but I don't know how easy that is
- after an unmount/remount it would be desirable to re-use the same blocks
for getting good hash->inode mappings, wherein lies the problem of
compatibility

One possible solution for the restart problem is searching the directory
leaf block in which an entry is being added for existing inode numbers and trying
to use those as a goal for the inode allocation... This has a minor problem
with ordering, because usually the inode is allocated before the dirent
is created, but it isn't impossible to fix (e.g. find the dirent slot first,
keep a pointer to it, check for inode goals, and then fill in the dirent
inum after allocating the inode)

> > >> 5, Performance number
> > >> On a Core-Duo, 2MB DDM memory, 7200 RPM SATA PC, I built a 50GB ext4
> > >> partition, and tried to create 50000 directories, and create 15 (1KB)
> > >> files in each directory alternatively. After a remount, I tried to
> > >> remove all the directories and files recursively by a 'rm -rf'.
> > >> Below is the benchmark result,
> > >> normal ext4 ext4 with dir inode reservation
> > >> mount options: -o data=writeback -o data=writeback,dir_ireserve=low
> > >> Create dirs: real 0m49.101s real 2m59.703s
> > >> Create files: real 24m17.962s real 21m8.161s
> > >> Unlink all: real 24m43.788s real 17m29.862s

One likely reason that the create dirs step is slower is that this is
doing a lot more IO than in the past. Only a single inode in each
inode table block is being used, so that means that a lot of empty
bytes are being read and written (maybe 16x as much data in this case).

Also, in what order are you creating files in the directories? If you
are creating them in directory order like:

for (f = 0; f < 15; f++)
for (i = 0; i < 50000; i++)
touch dir$i/f$f

then it is completely unsurprising that directory reservation is faster
at file create/unlink because those inodes are now contiguous at the
expense of having gaps in the inode sequence. Creating 15 files per
directory is of course the optimum test case also.

How does this patch behave with benchmarks like dbench, mongo, postmark?

> It would be nice to remember the last allocated bit for each block
> group, so we don't have to start from bit 0 (in the worse case scan the
> entire block group) for every single directory. Probably this can be
> saved in the in-core block group descriptor. Right now the in core
> block group descriptor structure is the same on-disk block-group
> structure though, it might worth to separate them and provide load/store
> helper functions.

Note that mballoc already creates an in-memory struct for each group.
I think the initialization of this should be moved outside of mballoc
so that it can be used for other purposes as you propose.

Eric had a benchmark where creating many files/subdirs would cause
a huge slowdown because of bitmap searching, and having a per-group
pointer with the first free inode (or last allocated inode might be
less work to track) would speed this up a lot.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.