2007-12-03 18:12:42

by Aneesh Kumar K.V

Subject: Understanding mballoc

Alex,

This is my attempt at understanding the multiblock allocator. I have
a few questions marked as FIXME below; can you help answer them?
Most of this data is already in the patch queue as a commit message.
I have updated some details regarding preallocation. Once we
understand the details I will update the patch queue commit message.



The allocation request involves a request for multiple blocks near
the specified goal block.

During the initialization phase of the allocator we decide whether to use
group preallocation or inode preallocation, depending on the size of the
request. If the request is smaller than sbi->s_mb_small_req we select group
preallocation. This is needed because we would like to keep small files
close together. The value of s_mb_small_req is 256 blocks.
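
To make the rule concrete, here is a minimal userspace sketch of the
size-based decision (the constant comes from the text above; the helper
name and strings are purely illustrative, not the kernel's):

#include <stdio.h>

/* From the text above: requests smaller than s_mb_small_req
 * (256 blocks) use the per-CPU locality group preallocation. */
#define S_MB_SMALL_REQ 256

/* Hypothetical helper: pick the preallocation pool for a request. */
static const char *prealloc_pool(unsigned int req_blocks)
{
    if (req_blocks < S_MB_SMALL_REQ)
        return "locality group (per-CPU) preallocation";
    return "per-inode preallocation";
}

int main(void)
{
    printf(" 16 blocks -> %s\n", prealloc_pool(16));
    printf("512 blocks -> %s\n", prealloc_pool(512));
    return 0;
}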

/* FIXME!!
Does the value of s_mb_small_req depend on s_mb_prealloc_table?
If yes, then how do we update s_mb_small_req? We have a hook to update
the prealloc table via /proc, but that doesn't update s_mb_small_req.
*/

/* FIXME!! The code within ext4_mb_group_or_file does the following:
if (ac->ac_o_ex.fe_len >= sbi->s_mb_large_req)
    return;

if (ac->ac_o_ex.fe_len >= sbi->s_mb_small_req)
    return;

That doesn't seem to make sense, because if the len is greater than
s_mb_large_req it will always be greater than s_mb_small_req, so the
first check looks redundant. What are we expecting to do here?
*/


In the first stage the allocator looks at the inode prealloc list,
ext4_inode_info->i_prealloc_list, which contains the list of prealloc
spaces for this particular inode. An inode prealloc space is
represented as:

pa_lstart -> the logical start block for this prealloc space
pa_pstart -> the physical start block for this prealloc space
pa_len    -> length of this prealloc space
pa_free   -> free space available in this prealloc space

The inode preallocation space is searched by _logical_ start block.
Only if the logical file block falls within the range of a prealloc
space do we consume that particular prealloc space. This makes sure
that we get contiguous physical blocks representing the file blocks.

The important thing to note about inode prealloc space is that we
don't modify any of its fields except pa_free.
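
Putting the fields and the lookup rule together, here is a minimal
userspace sketch (the struct is trimmed to the fields listed above, and
pa_try_use is a hypothetical helper, not the kernel function):

#include <stdio.h>

/* Sketch of an inode prealloc space, limited to the fields above
 * (the real kernel struct also carries locks, list heads, etc.). */
struct prealloc_space {
    unsigned long pa_lstart; /* logical start block  */
    unsigned long pa_pstart; /* physical start block */
    unsigned int  pa_len;    /* length in blocks     */
    unsigned int  pa_free;   /* blocks still unused  */
};

/* Hypothetical helper: serve 'len' blocks at logical block 'lblock'
 * from 'pa' if the logical block lies inside the prealloc range.
 * Returns the physical block, or 0 on a miss. Note that only
 * pa_free is modified, matching the rule above. */
static unsigned long pa_try_use(struct prealloc_space *pa,
                                unsigned long lblock, unsigned int len)
{
    if (lblock < pa->pa_lstart || lblock >= pa->pa_lstart + pa->pa_len)
        return 0;    /* logical block outside this prealloc space */
    if (pa->pa_free < len)
        return 0;    /* not enough blocks left in the space */
    pa->pa_free -= len;
    /* the physical block mirrors the logical offset into the space */
    return pa->pa_pstart + (lblock - pa->pa_lstart);
}

int main(void)
{
    struct prealloc_space pa = { 100, 5000, 64, 64 };

    printf("lblock 120 -> pblock %lu\n", pa_try_use(&pa, 120, 8));
    printf("lblock 300 -> pblock %lu\n", pa_try_use(&pa, 300, 8));
    return 0;
}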

If we are not able to find blocks in the inode prealloc space, and if we
have the group allocation flag set, then we look at the locality group
prealloc space. These are per-CPU prealloc lists represented as

ext4_sb_info.s_locality_groups[smp_processor_id()]

/* FIXME!!
After getting the locality group related to the current CPU we could be
scheduled out and scheduled in on a different CPU. So why are we making
the locality group per-CPU?
*/

The locality group prealloc space is used by checking whether we have
enough free space (pa_free) within the prealloc space.
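
A rough userspace sketch of both steps, the per-CPU lookup and the
pa_free check (NR_CPUS, the struct and lg_try_alloc are illustrative
stand-ins; the real code also takes a per-group lock):

#include <stdio.h>

#define NR_CPUS 4

struct locality_group {
    unsigned int pa_free; /* free blocks left in this group's space */
};

/* Per-CPU locality groups, mirroring
 * ext4_sb_info.s_locality_groups[smp_processor_id()]. */
static struct locality_group s_locality_groups[NR_CPUS];

/* Hypothetical helper: take 'len' blocks from this CPU's locality
 * group prealloc space if it has enough free space (the pa_free
 * check described above). */
static int lg_try_alloc(unsigned int cpu, unsigned int len)
{
    struct locality_group *lg = &s_locality_groups[cpu];

    if (lg->pa_free < len)
        return 0;           /* miss: fall back to the buddy cache */
    lg->pa_free -= len;     /* consume from the preallocation */
    return 1;
}

int main(void)
{
    s_locality_groups[0].pa_free = 512;
    printf("100 blocks on cpu0: %s\n", lg_try_alloc(0, 100) ? "hit" : "miss");
    printf("500 blocks on cpu0: %s\n", lg_try_alloc(0, 500) ? "hit" : "miss");
    return 0;
}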


If we can't allocate blocks via the inode prealloc or the locality group
prealloc, then we look at the buddy cache. The buddy cache is represented
by ext4_sb_info.s_buddy_cache (a struct inode) whose file offsets map to
the buddy and bitmap information of the different groups. The buddy
information is attached to the buddy cache inode so that we can access it
through the page cache. The information for each group is loaded via
ext4_mb_load_buddy, and consists of the block bitmap and the buddy
information, stored in the inode as

{ page }
[ group 0 bitmap] [ group 0 buddy] [group 1][ group 1]...


one block each for bitmap and buddy information.
So for each group we take up 2 blocks. A page can
contain blocks_per_page (PAGE_CACHE_SIZE / blocksize) blocks,
so it holds information for groups_per_page groups, which
is blocks_per_page/2.
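
A small userspace sketch of the layout arithmetic above, assuming
blocks_per_page >= 2 (the real code handles blocksize == page size as a
special case):

#include <stdio.h>

#define PAGE_CACHE_SIZE 4096

/* Where group 'group's bitmap and buddy blocks live inside the buddy
 * cache inode, given the layout above: two file blocks per group,
 * bitmap first, then buddy. */
static void locate_group(unsigned int group, unsigned int blocksize)
{
    unsigned int blocks_per_page = PAGE_CACHE_SIZE / blocksize;
    unsigned int bitmap_blk = group * 2;     /* file block of bitmap */
    unsigned int buddy_blk  = group * 2 + 1; /* file block of buddy  */

    printf("group %u: bitmap page %u slot %u, buddy page %u slot %u\n",
           group,
           bitmap_blk / blocks_per_page, bitmap_blk % blocks_per_page,
           buddy_blk / blocks_per_page, buddy_blk % blocks_per_page);
}

int main(void)
{
    /* 1KB blocks -> 4 blocks per page -> 2 groups per page */
    locate_group(0, 1024);
    locate_group(1, 1024);
    locate_group(2, 1024);
    return 0;
}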

The buddy cache inode is not stored on disk. The inode is
thrown away when the filesystem is unmounted.

We look for the requested number (count) of free blocks in the buddy
cache. If we are able to locate that many free blocks we return
with additional information regarding the rest of the
contiguous physical blocks available.

/* FIXME:
We need to explain the normalization of the request length.
What are the conditions we check the request length
against? Why are group requests always made at 512 blocks?


Buddy scanning follows different criteria. We need to explain what
a "criteria" is and how they influence the allocation.
*/

If we allocate more space than we requested, the remaining
space gets added to the locality group prealloc space or the
inode prealloc space.
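
A worked example of that bookkeeping (all numbers illustrative, using
the 512-block group request mentioned above):

#include <stdio.h>

int main(void)
{
    unsigned int requested  = 16;  /* blocks the write actually needs */
    unsigned int normalized = 512; /* group request padded to 512     */

    /* the whole normalized extent is allocated from the buddy cache;
     * the unused tail becomes the preallocation */
    unsigned int pa_len  = normalized;
    unsigned int pa_free = normalized - requested;

    printf("file uses %u blocks, prealloc: pa_len=%u pa_free=%u\n",
           requested, pa_len, pa_free);
    return 0;
}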


Both prealloc spaces get populated as above. So the first request
hits the buddy cache, which results in the prealloc space getting
filled. That prealloc space is then used for subsequent requests.


2007-12-03 19:29:39

by Andreas Dilger

Subject: Re: Understanding mballoc

On Dec 03, 2007 23:42 +0530, Aneesh Kumar K.V wrote:
> This is my attempt at understanding the multiblock allocator. I have
> a few questions marked as FIXME below; can you help answer them?
> Most of this data is already in the patch queue as a commit message.
> I have updated some details regarding preallocation. Once we
> understand the details I will update the patch queue commit message.

Some comments below; Alex can answer more authoritatively.

> If we are not able to find blocks in the inode prealloc space, and if we
> have the group allocation flag set, then we look at the locality group
> prealloc space. These are per-CPU prealloc lists represented as
>
> ext4_sb_info.s_locality_groups[smp_processor_id()]
>
> /* FIXME!!
> After getting the locality group related to the current CPU we could be
> scheduled out and scheduled in on a different CPU. So why are we making
> the locality group per-CPU?
> */

I think just to avoid contention between CPUs. While it is possible to get
rescheduled at this point, it is definitely unlikely. There does appear
to still be proper locking for the locality group, so at worst we get
contention between 2 CPUs for the preallocation instead of all of them.

> /* FIXME:
> We need to explain the normalization of the request length.
> What are the conditions we check the request length
> against? Why are group requests always made at 512 blocks?

Probably no particular reason for 512 blocks = 2MB, other than that a
decent number of smaller requests can fit in there before we look
for another one.

One note on normalization: given recent benchmarks showing an e2fsck
performance improvement from clustering indirect blocks, it would seem
that allocating index blocks in the same preallocation group could
provide a similar improvement for mballoc+extents.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.