Date: Fri, 5 Jan 2007 18:36:39 -0700
From: "Dan Williams"
To: "Evgeniy Polyakov"
Subject: asynchronous memory transfer/transform api(s) (was: Re: [RFC] Heads up on a series of AIO patchsets)
Cc: linux-kernel@vger.kernel.org

[ trimmed the cc to just linux-kernel ]

On 1/3/07, Evgeniy Polyakov wrote:
> On Tue, Jan 02, 2007 at 02:38:13PM -0700, Dan Williams (dan.j.williams@intel.com) wrote:
> > Would you have time to comment on the approach I have taken to
> > implement a standard asynchronous memcpy interface?  It seems it
> > would be a good complement to what you are proposing.  The entity
> > that describes the aio operation could take advantage of
> > asynchronous engines to carry out copies or other transforms (maybe
> > an acrypto tie-in as well).
> >
> > Here is the posting for 2.6.19.  There have since been updates for
> > 2.6.20, but the overall approach remains the same.
> > intro: http://marc.theaimsgroup.com/?l=linux-raid&m=116491661527161&w=2
> > async_tx: http://marc.theaimsgroup.com/?l=linux-raid&m=116491753318175&w=2
>
> My first impression is that it has too many lists :)

Ok, I think I can cut this down if I use static allocations for the
software descriptors.

> Looks good, but IMHO there are steps to implement further.
> I have not found any kind of scheduler there - what if the system has
> two async engines?

It supports multiple engines (the iop341 has three), but it only does
simple round-robin load balancing.  I will look to acrypto for other
load-balancing schemes; at the same time I am concerned about the cpu
overhead of a scheduling algorithm.

> What if a sync engine is faster than an async one in some cases (as
> it indeed is for small buffers) and should be selected at that time?

Yes, I need to profile for this and make it tunable (a rough sketch of
the policy I have in mind is below).

> What if you want to add additional transformations for some devices,
> like crypto processing or checksumming?

It would be nice to use an async_tx-style engine to handle the xor and
GF multiply portions of a crypto algorithm, but how do we handle the
case when an acrypto-style engine is available to handle the entire
algorithm end-to-end?  It seems like there would need to be a third
layer to hide this distinction from application code, but as quoted
below maybe a layer is not the right way to go...
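For reference, the selection policy I have in mind for the round robin
and the sync/async crossover is on the order of the following
standalone C sketch.  The names here (struct engine, submit_copy, the
threshold value) are placeholders for illustration, not the actual
async_tx or iop-adma code:

	/*
	 * Illustrative sketch only -- it models the two policies
	 * discussed above: round-robin dispatch across multiple
	 * engines, plus a tunable size threshold below which a
	 * synchronous memcpy wins.
	 */
	#include <stddef.h>
	#include <string.h>

	struct engine {
		int id;
		/* submit a copy to the hardware channel, return a cookie */
		int (*submit_copy)(struct engine *e, void *dst, void *src,
				   size_t len);
	};

	#define NR_ENGINES 3		/* e.g. the three iop341 channels */
	static struct engine engines[NR_ENGINES];
	static unsigned int next_engine;	/* round-robin cursor */
	static size_t sync_threshold = 256;	/* tunable: profiling picks this */

	/* Copy 'len' bytes: synchronously for small buffers, offloaded otherwise. */
	static int do_copy(void *dst, void *src, size_t len)
	{
		struct engine *e;

		if (len < sync_threshold) {
			memcpy(dst, src, len);	/* the cpu wins for small buffers */
			return 0;		/* completes immediately */
		}
		e = &engines[next_engine++ % NR_ENGINES];
		return e->submit_copy(e, dst, src, len);
	}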
>
> I would just create a driver for the low-level engine and export its
> functionality - iop3_async_copy(), iop3_async_checksum(),
> iop3_async_crypto_1(), iop3_async_crypto_2() and so on.
>
> There will be a lot of potential users of exactly that functionality,
> but not strictly hardcoded higher-layer operations like raidX.

This is the goal of the async_tx layer; it provides (currently)
async_memcpy, async_memset, async_xor and async_xor_zero_sum.  The
iop-adma driver exposes iop_adma_prep_dma_memcpy,
iop_adma_prep_dma_xor, etc. for async_tx to consume.

> A more generic solution must be used to select the appropriate
> device.  We had a very brief discussion about an asynchronous crypto
> layer (acrypto) and how its ideas could be used for async dma
> engines - the user should not even know how his data has been
> transferred - he calls async_copy(), which selects the appropriate
> device (and a sync copy is just an additional, usual device in that
> case) from the list of devices exporting that functionality.
> Selection can be done in millions of different ways, from taking the
> first one on the list (this is essentially how your approach is
> implemented right now) to using special (including run-time updated)
> heuristics (as is done in acrypto).

async_memcpy makes an attempt at transparent fallback, but it
currently assumes that an engine is always faster.

> Thinking further, async_copy() is just a usual case for the async
> class of operations.  So the same logic above must be applied on this
> layer too.
>
> But 'layers are the way to design protocols, not implement them'.
>                                       David Miller on netchannels
>
> So, the user should not even know about layers - he should just say
> 'copy data from pointer A to pointer B', or 'copy data from pointer A
> to socket B', or even 'copy it from file "/tmp/file" to
> "192.168.0.1:80:tcp"', without ever knowing that there are sockets
> and/or memcpy() calls, and if the user requests that it be performed
> asynchronously, he must be notified later (one might expect that I
> will prefer to use kevent :).  The same approach can thus be used by
> NFS/SAMBA/CIFS and other users.
>
> That is how I started to implement AIO (it looks like it is becoming
> popular):
> 1. the system exports the set of operations it supports (send,
>    receive, copy, crypto, ....)
> 2. each operation has a subsequent set of suboptions (different
>    crypto types, for example)
> 3. each operation has a set of low-level drivers which support it
>    (with optional performance or any other parameters)
> 4. each driver, when loaded, publishes its capabilities (async copy
>    with speed A, XOR and so on)

The only problem with engines is that they are problematic to use on
userspace buffers (on non-coherent architectures).

> From the user's point of view an aio_sendfile() or async_copy() will
> look like the following:
> 1. call aio_schedule_pointer(source='0xaabbccdd', dest='0x123456578')
> 1. call aio_schedule_file_socket(source='/tmp/file', dest='socket')
> 1. call aio_schedule_file_addr(source='/tmp/file',
>    dest='192.168.0.1:80:tcp')
>
> or any other similar call,
>
> then wait for the returned descriptor in kevent_get_events() or
> provide your own cookie in each call.
>
> Each request is then converted into a FIFO of smaller requests like
> 'open file', 'open socket', 'get in user pages' and so on, each of
> which is handled on an appropriate device (hardware or software);
> completion of each request starts processing of the next one.
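If I follow the acrypto model, steps 1-4 above amount to something
like the following sketch.  The names (async_device, op_mask,
async_device_register(), async_device_select()) are mine, made up for
illustration - this is not the real acrypto interface:

	enum async_op { ASYNC_OP_COPY, ASYNC_OP_XOR, ASYNC_OP_CRYPTO };

	struct async_device {
		const char *name;
		unsigned long op_mask;	/* which async_op bits this device handles */
		unsigned int speed;	/* published performance hint */
		struct async_device *next;
	};

	static struct async_device *device_list;

	/* Step 4 above: a driver publishes its capabilities at load time. */
	static void async_device_register(struct async_device *dev)
	{
		dev->next = device_list;
		device_list = dev;
	}

	/*
	 * One possible selection heuristic: the fastest registered
	 * device that supports the operation.  A plain-cpu "sync"
	 * device sits on the same list, so the caller never learns
	 * whether hardware was involved.
	 */
	static struct async_device *async_device_select(enum async_op op)
	{
		struct async_device *dev, *best = NULL;

		for (dev = device_list; dev; dev = dev->next)
			if ((dev->op_mask & (1UL << op)) &&
			    (!best || dev->speed > best->speed))
				best = dev;
		return best;
	}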
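And the FIFO-of-subrequests conversion, if I read it correctly, is a
completion-chained pipeline along these lines (again, subreq and
friends are made-up names for illustration, not code from any existing
AIO patchset):

	struct subreq {
		/* one step: 'open file', 'get user pages', 'dma the data', ... */
		int (*run)(struct subreq *sr);
		struct subreq *next;	/* FIFO order */
	};

	/* Called by whichever device (hardware or software) finished 'sr'. */
	static void subreq_complete(struct subreq *sr)
	{
		if (sr->next)
			sr->next->run(sr->next);
		/* else: the whole request is done -- post the kevent/cookie */
	}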
>
> Reading the microthreading design notes I recall a comparison of the
> NPTL and Erlang threading models on the Debian site - they are
> _completely_ different models.  NPTL creates real threads, which I
> suppose (I hope NOT) is what the microthreading design will do too.
> It is slow.  (Or is it not, Zach?  We are intrigued :)
> It's damn bloody slow to create a thread compared to the correct
> non-blocking state machine.  The TUX state machine is similar to what
> I had in my first kevent-based FS and network AIO patchset, and to
> what I will use for the current async processing work.
>
> A bit of empty words actually, but it can provide some food for
> thought.

Not at all, thank you for taking a look at the code.

> --
> Evgeniy Polyakov

--
Dan