Message-ID: <49D38D4B.7020701@panasas.com>
Date: Wed, 01 Apr 2009 18:50:35 +0300
From: Boaz Harrosh <bharrosh@panasas.com>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1b3pre) Gecko/20090315 Remi/3.0-0.b2.fc10.remi Thunderbird/3.0b2
MIME-Version: 1.0
To: Tejun Heo <tj@kernel.org>
CC: axboe@kernel.dk, linux-kernel@vger.kernel.org,
       fujita.tomonori@lab.ntt.co.jp
Subject: Re: [PATCH 08/17] bio: reimplement bio_copy_user_iov()
References: <1238593472-30360-1-git-send-email-tj@kernel.org> <1238593472-30360-9-git-send-email-tj@kernel.org>
In-Reply-To: <1238593472-30360-9-git-send-email-tj@kernel.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4133
Lines: 105

On 04/01/2009 04:44 PM, Tejun Heo wrote:
> Impact: more modular implementation
> 
> Break down bio_copy_user_iov() into the following steps.
> 
> 1. bci and page allocation
> 2. copying data if WRITE
> 3. create bio accordingly
> 
> bci is now responsible for managing any copy related resources.  Given
> source iov, bci_create() allocates bci and fills it with enough pages
> to cover the source iov.  The allocated pages are described with a
> sgl.
> 
> Note that new allocator always rounds up rq_map_data->offset to page
> boundary to simplify implementation and guarantee enough DMA padding
> area at the end.  As the only user, scsi sg, always passes in zero
> offset, this doesn't cause any actual behavior difference.  Also,
> nth_page() is used to walk to the next page rather than directly
> adding to struct page *.
> 
> Copying back and forth is done using bio_memcpy_sgl_uiov() which is
> implemented using sg mapping iterator and iov iterator.
> 
> The last step is done using bio_create_from_sgl().
> 
> This patch by itself adds one more level of indirection via sgl and
> more code but components factored out here will be used for future
> code refactoring.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Hi dear Tejun

I've looked hard and deep into your patchset, and I would like to
suggest an improvement.

[Option 1]
What your code actually uses from the sgl code base is:
 for_each_sg
 sg_mapping_iter and its
	sg_miter_start, sg_miter_next
 ... (what else)

I would like you to define the above for bvec(s), just the way you like
them. Then the code works directly on the destination bvec inside the final
bio: one less copy, no intermediate allocation, and no kmalloc of
bigger-than-page buffers.

These are all small inlines, so duplicating them will not affect
kernel size at all. You are not using the chaining ability of sgl(s),
so they can be simplified. You will see that not having the intermediate
copy simplifies the code even more.

Since no outside user currently needs sgl(s), no functionality is lost.
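
To make [Option 1] concrete, here is a minimal userspace sketch of what a
mapping iterator over bvec(s) could look like. It is not kernel code: the
names demo_bvec, demo_bvec_miter and demo_copy_to_bvecs are hypothetical, and
"page" is a plain buffer pointer standing in for struct page, so no kmap is
modelled. It only shows the shape of a start/next pair that copies straight
into the final vector.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for a bio_vec entry; "page" is a plain buffer here. */
struct demo_bvec {
	unsigned char *page;
	unsigned int len;
	unsigned int offset;
};

/* Hypothetical iterator, loosely modelled on sg_mapping_iter. */
struct demo_bvec_miter {
	struct demo_bvec *bv;	/* next element to visit */
	unsigned int nr_left;	/* elements not yet visited */
	unsigned char *addr;	/* start of the current segment */
	size_t length;		/* bytes in the current segment */
};

static void demo_bvec_miter_start(struct demo_bvec_miter *it,
				  struct demo_bvec *bv, unsigned int nr)
{
	assert(it && (bv || nr == 0));
	it->bv = bv;
	it->nr_left = nr;
	it->addr = NULL;
	it->length = 0;
}

/* Step to the next segment; returns 1 if one is available, 0 at the end. */
static int demo_bvec_miter_next(struct demo_bvec_miter *it)
{
	if (it->nr_left == 0)
		return 0;
	it->addr = it->bv->page + it->bv->offset;
	it->length = it->bv->len;
	it->bv++;
	it->nr_left--;
	return 1;
}

/* Copy up to len bytes from a flat source directly into the segments.
 * Returns the number of bytes copied. */
static size_t demo_copy_to_bvecs(struct demo_bvec *bv, unsigned int nr,
				 const void *src, size_t len)
{
	struct demo_bvec_miter it;
	const unsigned char *p = src;
	size_t done = 0;

	demo_bvec_miter_start(&it, bv, nr);
	while (done < len && demo_bvec_miter_next(&it)) {
		size_t n = it.length < len - done ? it.length : len - done;

		memcpy(it.addr, p + done, n);
		done += n;
	}
	return done;
}
```

The copy loop never builds a second list; it fills the destination vector in
place, which is the point of the suggestion.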

[Option 2]
Keep a pointer to an sgl, not a bvec, in the bio; again the code works on the
final destination. Later users of the block layer that call blk_rq_fill_sgl
(blk_rq_map_sg) will just get a copy of the pointer, so another allocation and
copy are saved. This option will spill outside the scope of the current
patches, into the bvec hacking code.


I do like your long-term vision of separating the DMA part from the virtual part
of scatterlists. Note how they are actually two disjoint lists altogether. After
dma_map does its thing, the DMA physical list might be shorter than the virtual
one, and sizes might not correspond at all. The DMA mapping code regards the DMA
part as an empty list that gets appended to while processing; any match between
segments is accidental. (That is: inside the scatterlist, the virtual address
most probably does not match the DMA address.)
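
As a toy illustration of the two lists drifting apart, the userspace sketch
below appends to an initially empty output list and merges a segment into the
previous one whenever it starts exactly where the previous one ends. The names
demo_seg and demo_dma_map are hypothetical, addresses are passed through
unchanged, and a real mapping implementation does far more (IOMMU translation,
segment size and boundary limits); this only shows why the output can be
shorter than the input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous range: start address and length in bytes. */
struct demo_seg {
	uint64_t addr;
	uint32_t len;
};

/* Build the "DMA" list in out[] from the input list in[].
 * out[] must have room for nr_in entries. Adjacent ranges are merged.
 * Returns the number of entries written to out[]. */
static size_t demo_dma_map(const struct demo_seg *in, size_t nr_in,
			   struct demo_seg *out)
{
	size_t nr_out = 0;

	assert(in && out);
	for (size_t i = 0; i < nr_in; i++) {
		if (nr_out &&
		    out[nr_out - 1].addr + out[nr_out - 1].len == in[i].addr)
			out[nr_out - 1].len += in[i].len;	/* extend */
		else
			out[nr_out++] = in[i];			/* append */
	}
	return nr_out;
}
```

With three input ranges where the first two are adjacent, the output holds two
entries, and neither the count nor the sizes line up with the input any more.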

So [option 1] matches more closely to that vision.

Historically the code was doing:
  Many-sources => scatterlist => biovec => scatterlist => dma-scatterlist

Only as of 2.6.30 can we say that we have dropped a step, to do:
  Many-sources => biovec => scatterlist => dma-scatterlist

Now you want to bring back the extra step, and I hate it.
[Option 2] can make that even shorter:
  Many-sources => scatterlist => dma-scatterlist

Please consider [option 1]. It will only add some source code,
but it will not increase code size (maybe it will even decrease it),
and it will be fast.

Please consider that this code path is used by me, in exofs and
pNFS-objects, in a very, very hot path where memory pressure is a
common scenario.

And I have one more question.
Are you sure kmalloc of bigger-than-page buffers is safe? As I
understand it, that tries to allocate physically contiguous pages,
which gets harder as time passes, and the last time I tried this with a
kmem_cache (due to a bug) it crashed the kernel randomly after 2 minutes of use.
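
For a sense of scale, the sketch below computes how many physically contiguous
pages a given request size implies, assuming pages are handed out in
power-of-two blocks as in the buddy allocator. demo_order is a hypothetical
userspace helper, not a kernel function, and the page size is a parameter
rather than a fixed value.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (1 << order) pages cover size bytes. */
static unsigned int demo_order(size_t size, size_t page_size)
{
	size_t pages;
	unsigned int order = 0;

	assert(page_size > 0);
	pages = (size + page_size - 1) / page_size;	/* round up */
	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```

Under these assumptions, with 4096-byte pages a request of 4097 bytes already
needs an order-1 block (two contiguous pages), and 20000 bytes needs order 3
(eight contiguous pages).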

Thanks
Boaz