2002-06-23 23:58:34

by Adam J. Richter

[permalink] [raw]
Subject: RFC: turn scatterlist into a linked list, eliminate bio_vec

I have an idea of how to simplify all of the device drivers
that do DMA, the block device layer, networking, scsi, USB, and
others. I came up with this partly in an effort to explore whether I
could go further with the bio_append changes, and partly in response
to Martin Dalecki suggesting that I think about "a generalized packet
driver command layer, which is independent from SCSI" (although this
idea does not go that far).

The idea is to turn struct scatterlist from an array into a
linked list. Here are some advantages:

1. We can more easily consolidate all of the calls to
pci_{,un}map_sg, blk_rq_map_sg, etc. in the higher level dispatch
routines for each device type (ll_rw_blk.c, scsi.c, etc.), because
the higher level code would already have the struct scatterlist, so
it would not have to do memory allocation per request. It also
means we have one less place to worry about memory allocation
failures (or add a lot of complexity to prevent them). Hardware
device drivers would just receive a struct scatterlist and stuff
DMA values.

2. The gather/scatter merge optimizations now done for block
devices can be restated as operating on any scatterlist, so the same
code could optimize merging and breaking up streams to USB endpoints,
for example.

3. The gather/scatter merge optimizations will be faster
because it will not be necessary to copy or reallocate arrays.
Instead, we can do O(1) manipulations of head and tail pointers.
Given prior knowledge about exactly what kind of gather/scatter lists
the underlying driver can handle, the merging code could do the
merges one time as it does the math about what can be merged.

4. The potential memory costs for the raid code to split
linked lists is smaller than that for copying arrays. At most,
it will cost num_disks * sizeof(struct scatterlist).

5. We can more easily prepend or append additional data to a
request (assuming a tail pointer, which is needed by the block driver
code for merging anyhow). This may be useful, say, when one needs
basically to encapsulate one protocol in another (a sketch of this
O(1) splicing follows below).
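
To illustrate points 3 and 5 concretely, here is a minimal sketch
(my own illustration, not part of the proposed patch; the names are
invented) of the O(1) splicing that the head and tail pointers allow:

struct sg_chain {
        struct scatterlist *head, *tail;
};

/* Append chain b onto chain a in constant time. No array is copied
   or reallocated, unlike merging two struct scatterlist arrays. */
static void sg_chain_splice(struct sg_chain *a, struct sg_chain *b)
{
        if (a->head == NULL)
                *a = *b;
        else {
                a->tail->next = b->head;
                a->tail = b->tail;
        }
        b->head = b->tail = NULL;
}

Prepending for encapsulation is the same operation with the arguments
reversed.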

Right now, just about every type of device that can
potentially do a large amount of IO has its own way of chaining
these blocks of arbitrary data together (request/bio/bio_vec for
block devices, struct urb, struct sk_buff_head for network packets,
etc.). At the lowest level of division, I believe that these data
structures already require their blocks of data not to cross
page boundaries. Most of these data structures contain other important
data and could not be entirely eliminated, but they could be changed to
each contain one struct scatterlist.

The result would be that the request handler for most block
device drivers, network drivers, usb drivers, etc. would be
simplified. The common denominator of struct scatterlist between
different types of drivers could also simplify the various
encapsulation systems (for example, right now the sddr09 scsi over usb
interface actually receives a scatterlist and copies all of the data
into a single linear buffer, even though USB controllers are good at
gather/scatter).

When implementing this in practice, there are things
that could be done to preserve backward compatibility for a
while, like leaving the old array-based routines in place for
a time.

Anyhow, to illustrate better what I have in mind, I have
attached a description of the approximate changes in data structures
and a pair of helper routines that could centralize most of the
scatterlist mapping and unmapping.

Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."


/* Parameters from ll_rw_blk.c. Used for expressing limits and keeping
running totals on sglist sizes. */
struct scatter_sizes {
unsigned int hw_segments;
unsigned int phys_segments;
unsigned int segment_size;
unsigned int total_size;
};

struct scatterlist {
...include all existing fields. Add these:

struct scatterlist *next;
void *driver_priv;
dma_addr_t driver_priv_dma;

...also sglist_alloc() will allocate additional space at the end of
struct scatterlist as requested by the hardware driver, which the
hardware driver will use for things like the hardware device's
gather/scatter structures.
};


typedef void (bio_end_io_t) (struct bio *,
unsigned long bi_flags /* combined with bi_rw */);


struct bio {
/* merge bi_rw and bi_flags into a parameter to submit_bio */
struct bio *bi_next; /* request queue link */
unsigned int bi_size;
bio_end_io_t *bi_end_io;
void *bi_private;
bio_destructor_t *bi_destructor; /* destructor */
};

typedef void (unprep_rq_fn)(request_queue_t *q, struct request *req);

struct request {
Delete current_nr_sectors, nr_phys_segments, nr_hw_segments.
Replace them with:

int rw; /* formerly bio.bi_rw */
unprep_rq_fn *unprep_rq_fn; /* Called by
blkdev_release_request*/
struct scatterlist *head, *tail;
struct scatter_sizes sizes; /* sizes.segment_size refers to the
size of the last segment */
};

struct request_queue {
...new fields:
int sgpriv_size;
mempool_t *sgpriv_pool;
struct pci_dev *dma_map_dev;
};

/* Allocate one scatterlist element, with any extra space requested in
q->sgpriv_size. */
struct scatterlist *sglist_alloc(request_queue_t *q);
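
For example, sglist_alloc() might look like the following sketch
(assuming the trailing per-driver area is sized by q->sgpriv_size;
error handling kept minimal):

struct scatterlist *sglist_alloc(request_queue_t *q)
{
        struct scatterlist *sg;

        sg = kmalloc(sizeof(*sg) + q->sgpriv_size, GFP_KERNEL);
        if (sg != NULL) {
                memset(sg, 0, sizeof(*sg));
                if (q->sgpriv_size)
                        sg->driver_priv = sg + 1; /* trailing area */
        }
        return sg;
}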

#define QUEUE_FLAG_DMA_MAP_MUST_SUCCEED 4

/* Generic routines usable as queue->{,un}prep_fn: */

int rq_dma_prep(request_queue_t *q, struct request *req) {
int failed = 0;
struct scatterlist *sg;
int fail_value = q->flags & QUEUE_FLAG_DMA_MAP_MUST_SUCCEED;

sg = req->head;
if (pci_map_sg(q->dma_map_dev, sg, bio_dir_to_dma_dir(req->rw)) < 0)
failed = fail_value;
if (failed || !q->sgpriv_pool)
return failed;

/* Let's assume the new pci_map_sg will free unused scatterlists. */
while (sg != NULL && count--) {
sg->driver_priv = mempool_alloc(q->sgpriv_pool, GFP_KERNEL);

sg->driver_priv_dma =
pci_map_single(req->dma_map_dev, sg->driver_priv, len,
PCI_DMA_TODEVICE);
if (sg_dma->dma_add_priv == 0);
failed = fail_value;
}
}
if (failed)
rq_dma_unprep(q, req);

return failed;
}


void rq_dma_unprep(request_queue_t *q, struct request *req) {
struct scatterlist *sg;

pci_unmap_sg(q->dma_map_dev, req->head, bio_dir_to_dma_dir(req->rw));
if (q->sgpriv_pool != NULL) {
for (sg = req->head; sg != NULL; sg = sg->next) {
int len = q->sgpriv_size;

pci_unmap_single(q->dma_map_dev, sg->driver_priv_dma,
len, PCI_DMA_TODEVICE);
mempool_free(sg->driver_priv, q->sgpriv_pool);
}
}
}


2002-06-24 06:30:25

by David Miller

[permalink] [raw]
Subject: Re: RFC: turn scatterlist into a linked list, eliminate bio_vec

From: "Adam J. Richter" <[email protected]>
Date: Sun, 23 Jun 2002 16:58:20 -0700

/* Let's assume the new pci_map_sg will free unused scatterlists. */
while (sg != NULL && count--) {
sg->driver_priv = mempool_alloc(q->sgpriv_pool, GFP_KERNEL);

sg->driver_priv_dma =
pci_map_single(req->dma_map_dev, sg->driver_priv, len,
PCI_DMA_TODEVICE);
if (sg_dma->dma_add_priv == 0);
failed = fail_value;
}
}

Driver descriptors are not supposed to be done using pci_map_*()
and friends. You are supposed to use consistent DMA memory for
this purpose. Please read DMA-mapping.txt a few more times Adam :-)

Also, this while loop never terminates :-)

2002-06-24 12:41:25

by Adam J. Richter

[permalink] [raw]
Subject: Re: RFC: turn scatterlist into a linked list, eliminate bio_vec

>From: "David S. Miller" <[email protected]>
>Date: Sun, 23 Jun 2002 23:24:16 -0700 (PDT)

>> /* Let's assume the new pci_map_sg will free unused scatterlists. */
>> while (sg != NULL && count--) {
>> sg->driver_priv = mempool_alloc(q->sgpriv_pool, GFP_KERNEL);
>>
>> sg->driver_priv_dma =
>> pci_map_single(req->dma_map_dev, sg->driver_priv, len,
>> PCI_DMA_TODEVICE);
>> if (sg_dma->dma_add_priv == 0);
>> failed = fail_value;
>> }
>> }

>Driver descriptors are not supposed to be done using pci_map_*()
>and friends. You are supposed to use consistent DMA memory for
>this purpose. Please read DMA-mapping.txt a few more times Adam :-)

Sorry if I was not clear enough about the purpose of
the new scatterlist->driver_priv field. It is a "streaming" data
structure, to use the terminology of DMA-mapping.txt (i.e., one that
would typically only be allocated for a few microseconds as an IO is
built up and sent). Its purpose is to hold the hardware-specific
gather-scatter segment descriptor, which typically looks something like this:

struct foo_corporation_scsi_controller_sg_element {
u64 data_addr;
u64 next_addr;
u16 data_len;
u16 reserved;
u32 various_flags;
};
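
For instance (purely illustrative; the helper name is invented and the
flag value is a placeholder), a low-level driver could fill such
descriptors straight from the mapped list, with no per-request
allocation of its own:

static void foo_fill_descriptors(struct scatterlist *sg)
{
        struct foo_corporation_scsi_controller_sg_element *e;

        for (; sg != NULL; sg = sg->next) {
                e = sg->driver_priv;
                e->data_addr = sg_dma_address(sg);
                e->data_len = sg_dma_len(sg);
                /* driver_priv_dma of the next element is the bus
                   address for the hardware's chain pointer: */
                e->next_addr = sg->next ? sg->next->driver_priv_dma : 0;
                e->various_flags = 0;
        }
}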

I probably should not have mentioned adding sg->driver_priv{,_dma}
in my proposal, because it is largely independent. The reason that
it is related is that it should allow the "mid-layer" device driver code
(scsi.c, usb.c, etc.) to do all of the streaming DMA mapping for
typical DMA-based device drivers.

Come to think of it, my use of pci_map_single is incorrect
after all, because the driver has not yet filled in that data structure
at that point. Since the data structures are being allocated from a
single contiguous block that spans a couple of pages that is being used
only for this purpose, perhaps it would be fastest to pci_alloc_consistent
the whole memory pool for those little descriptors at initialization time
and then change that loop to do the following.

sg->driver_priv = mempool_alloc(q->sgpriv_pool, GFP_KERNEL);
sg->driver_priv_dma = (sg->driver_priv - q->driver_priv_start) +
q->driver_priv_start_dma_addr;


>Also, this while loop never terminates :-)

Oops! Sorry. The bottom of that loop needs

sg = sg->next;



Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."

2002-06-24 12:45:26

by David Miller

[permalink] [raw]
Subject: Re: RFC: turn scatterlist into a linked list, eliminate bio_vec

From: "Adam J. Richter" <[email protected]>
Date: Mon, 24 Jun 2002 04:42:45 -0700

Sorry if I was not clear enough about the purpose of
the new scatterlist->driver_priv field. It is a "streaming" data
structure, to use the terminology of DMA-mapping.txt (i.e., one that
would typically only be allocated for a few microseconds as an IO is
built up and sent). Its purpose is to hold the hardware-specific
gather-scatter segment descriptor, which typically looks something like this:

struct foo_corporation_scsi_controller_sg_element {
u64 data_addr;
u64 next_addr;
u16 data_len;
u16 reserved;
u32 various_flags;
};

This is small, about one cacheline, and thus is not to be used
with non-consistent memory. Also, if you use streaming memory, where
is the structure written to and where is the pci_dma_sync_single
(which is a costly cache flush on many systems btw, another reason to
use consistent memory) between the CPU writes and giving the
descriptor to the device?

Again, reread DMA-mapping.txt a few more times before composing
your response to me. I really meant that you don't understand
the purpose of consistent vs. streaming memory.

Come to think of it, my use of pci_map_single is incorrect
after all, because the driver has not yet filled in that data structure
at that point. Since the data structures are being allocated from a
single contiguous block that spans a couple of pages that is being used
only for this purpose, perhaps it would be fastest to pci_alloc_consistent
the whole memory pool for those little descriptors at initialization time
and then change that loop to do the following.

Please use PCI pools, this is what they were designed for. Pools of
like-sized objects allocated out of consistent DMA memory.

So as it stands I still think your proposal is buggy.

2002-06-24 19:30:15

by William Lee Irwin III

[permalink] [raw]
Subject: Re: RFC: turn scatterlist into a linked list, eliminate bio_vec

On Sun, Jun 23, 2002 at 04:58:20PM -0700, Adam J. Richter wrote:
> The idea is to turn struct scatterlist from an array into a
> linked list. Here are some advantages:

I'm all for it, though I wonder sometimes if a tree might be better.


Cheers,
Bill

2002-06-24 20:44:46

by Adam J. Richter

[permalink] [raw]
Subject: Re: RFC: turn scatterlist into a linked list, eliminate bio_vec

> Sorry if I was not clear enough about the purpose of
> the new scatterlist->driver_priv field. It is a "streaming" data
> structure, to use the terminology of DMA-mapping.txt (i.e., one that
> would typically only be allocated for a few microseconds as an IO is
> built up and sent). Its purpose is to hold the hardware-specific
> gather-scatter segment descriptor, which typically looks something like this:
>
> struct foo_corporation_scsi_controller_sg_element {
> u64 data_addr;
> u64 next_addr;
> u16 data_len;
> u16 reserved;
> u32 various_flags;
> };
>
>This is small, about one cacheline, and thus is not to be used
>with non-consistent memory. Also, if you use streaming memory, where
>is the structure written to and where is the pci_dma_sync_single
>(which is a costly cache flush on many systems btw, another reason to
>use consistent memory) between the CPU writes and giving the
>descriptor to the device?

>Again, reread DMA-mapping.txt a few more times before composing
>your response to me. I really meant that you don't understand
>the purpose of consistent vs. streaming memory.

In my previous email, I suggested using consistent memory in
the paragraph that followed. I have underlined the sentence below.

Anyhow, I have reread DMA-mapping.txt end to end and updated my
sample code to suggest making available some scatterlist merging
code that is currently only available to block devices. Just to
demonstrate that to you, I have a few comments:

1. It would help to document if pci_pool_alloc(pool,SLAB_KERNEL,&dma)
can ever return failure (as opposed to blocking forever). That would
affect whether certain error legs have to be written in a device
driver.

2. It sure would be nice if there were a couple of ifdef symbols
that controlled the expansion of DECLARE_PCI_UNMAP_{ADDR,LEN}, so
that the usual default implementations could be put in <linux/pci.h>,
and so that it would be possible to portably write an sg_dma_address()
that wants to return sg.dma_addr if that field exists, and
sg.vaddr otherwise. This is of practical use because I would like
to do the same thing for my proposed sglist.driver_priv{,_dma} fields.

3. Isn't the stuff about scatterlist->address obsolete? I thought that
field was deleted in 2.5.


> Come to think of it, my use of pci_map_single is incorrect
> after all, because the driver has not yet filled in that data structure
> at that point. Since the data structures are being allocated from a
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> single contiguous block that spans a couple of pages that is being used
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> only for this purpose, perhaps it would be fastest to pci_alloc_consistent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> the whole memory pool for those little descriptors at initialization time
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> and then change that loop to do the following.
>
>Please use PCI pools, this is what they were designed for. Pools of
>like-sized objects allocated out of consistent DMA memory.

Thanks for reminding me of them. Although I had explored using
them in another situation, I forgot about them for this situation, where
they make sense, especially if pci_pool_alloc(...,SLAB_KERNEL,...) can
never fail for lack of memory (although they could block indefinitely).
I have updated my sample code accordingly.

>So as it stands I still think your proposal is buggy.

Do you understand that scatterlist.driver_priv{,_dma} is just
a follow-on optimization of my proposal to turn struct scatterlist
into a linked list, so that pci_map_sg &co. can be consolidated into
the "mid-level" drivers (scsi.o, usb.o)?

I hope the updated code below and the clarification of my
previous email address the scatterlist.driver_priv{,_dma} issues that
you raised.

Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."

/* Parameters from ll_rw_blk.c. Used for expressing limits and keeping
running totals on sglist sizes. */
struct scatter_sizes {
unsigned int hw_segments;
unsigned int phys_segments;
unsigned int segment_size;
unsigned int total_size;
};

struct scatter_limit {
struct scatter_sizes sizes;
u32 boundary_mask; /* u64? */
};

struct dma_stream {
struct scatter_limit *limit;
struct pci_pool *sgpriv_pool;
struct pci_dev *pci_dev;
unsigned long flags;
};

/* dma_stream.flags: */
#define DMA_MAP_MUST_SUCCEED 1

struct scatterlist {
...include all existing fields. Add these:

struct scatterlist *next;
void *driver_priv;
dma_addr_t driver_priv_dma;
};

/* Formerly blk_rq_map_sg. Now all devices can get the DMA merge optimizations
currently available to block devices. */
extern int dma_coalesce_sg(struct scatter_limit *limit,
struct scatterlist *sg);
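
A sketch of the kind of merging dma_coalesce_sg() could perform; only
the segment-size and boundary checks are shown, and the unlinked
entries are assumed to be freed later, as in the earlier draft:

int dma_coalesce_sg(struct scatter_limit *limit, struct scatterlist *sg)
{
        int count = 1;

        while (sg != NULL && sg->next != NULL) {
                struct scatterlist *n = sg->next;
                unsigned long start = page_to_phys(sg->page) + sg->offset;
                unsigned long end = start + sg->length;

                if (end == page_to_phys(n->page) + n->offset &&
                    sg->length + n->length <= limit->sizes.segment_size &&
                    ((start ^ (end + n->length - 1)) &
                     ~(unsigned long) limit->boundary_mask) == 0) {
                        sg->length += n->length; /* fold n into sg... */
                        sg->next = n->next;      /* ...and unlink it */
                } else {
                        sg = n;
                        count++;
                }
        }
        return count;
}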

typedef void (bio_end_io_t) (struct bio *,
unsigned long bi_flags /* combined with bi_rw */);


struct bio {
/* merge bi_rw and bi_flags into a parameter to submit_bio */
struct bio *bi_next; /* request queue link */
unsigned int bi_size;
bio_end_io_t *bi_end_io;
void *bi_private;
bio_destructor_t *bi_destructor;
};

typedef void (unprep_rq_fn)(request_queue_t *q, struct request *req);

struct request {
Include all existing fields except current_nr_sectors,
nr_phys_segments, nr_hw_segments. Add these:

int map_sg_count;
unprep_rq_fn *unprep_rq_fn; /* Called by
blkdev_release_request*/
struct scatterlist *head, *tail;
struct scatter_sizes sizes; /* sizes.segment_size refers to the
size of the last segment */
atomic_t num_bios; /* Cannot free scatterlist until
last request completes. */
};

struct request_queue {
...delete max_{sectors,phys_segments,hw_segments,segment_size} and
seg_boundary_mask. Replace them with this field:

struct dma_stream *stream; /* Shareable between different queues,
for example, disks on the same
controller, or between /dev/loop/nnn,
and its underlying device. */
};


static inline int req_dma_dir(struct request *req) {
return (req->flags & REQ_RW) ? PCI_DMA_TODEVICE : PCI_DMA_FROMDEVICE;
}

/* Generic routines usable as queue->{,un}prep_fn: */
int rq_dma_prep(struct dma_stream *stream, struct request *req) {
int count;
struct scatterlist *sg = req->head;
int alloc_flags;
int err_on_mem_fail =
(stream->flags & DMA_MAP_MUST_SUCCEED) ? -ENOMEM : 0;
int mem_fail;

count = dma_coalesce_sg(stream->limit, sg);
count = pci_map_sg(stream->pci_dev, sg, count, req_dma_dir(req));
if (count < 0 && err_on_mem_fail)
return count;

/* Perhaps we could change the new pci_map_sg so that if it
shortens the scatterlist, it still leaves the unused
scatterlist entries linked at the end of the list, but that
the first unused entry (if any) is marked with
sg->length = 0, just so we don't have to remember the
count returned by pci_map_sg (since there is no easy
place to store it). Otherwise, we need to remember the count,
like so:
*/
req->map_sg_count = count;

/* "Read ahead" requests should abort rather than block. */
alloc_flags = (req->flags & REQ_RW_AHEAD) ? SLAB_ATOMIC : SLAB_KERNEL;

/*
* SLAB_KERNEL will never fail, so no need to check. It will
* just block until memory is available.
*/
mem_fail = 0;
if (stream->sgpriv_pool) {
while (count--) {
sg->driver_priv = pci_pool_alloc(stream->sgpriv_pool,
alloc_flags, &sg->driver_priv_dma);
if (sg->driver_priv == NULL)
mem_fail = err_on_mem_fail;
sg = sg->next;
}
}

return mem_fail;
}


void rq_dma_unprep(request_queue_t *q, struct request *req) {
struct dma_stream *stream = q->stream;
struct scatterlist *sg = req->head;

pci_unmap_sg(stream->pci_dev, sg, req->map_sg_count, req_dma_dir(req));
if (stream->sgpriv_pool) {
int count = req->map_sg_count;
while (count--) {
pci_pool_free(stream->sgpriv_pool, sg->driver_priv,
sg->driver_priv_dma);
sg = sg->next;
}
}
}
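
To show how the pair would be used, here is a hypothetical mid-level
dispatch path (function names invented for illustration):

static int mid_level_dispatch(request_queue_t *q, struct request *req)
{
        int err = rq_dma_prep(q->stream, req);

        if (err)
                return err; /* -ENOMEM, only when DMA_MAP_MUST_SUCCEED */
        hw_driver_queue_request(q, req); /* fills the hw sg descriptors */
        return 0;
}

/* ...and the completion path calls rq_dma_unprep(q, req). */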

/* By the way, we cannot move the sgpriv_pool into struct pci_device and
have pci_{,un}map_sg take care of sg->driver_priv{,_dma} because

1. Non-PCI drivers use pci_device == NULL. Perhaps, in the future,
this will cease to be an issue if the DMA mapping stuff can
be abstracted to "struct dma_device".

2. Conceivably, a driver might want more than one pool for
different IO functions on the same device. This is similar
to the example of a sound card requiring different DMA masks
in DMA-mapping.txt. Both problems could be solved by
creating a "struct dma_channel", which could contain the
gather/scatter parameters currently in struct request_queue,
the dma_mask parameter from pci_device, and an allocation pool
for sglist->driver_priv.
*/





2002-06-25 14:04:53

by David Miller

[permalink] [raw]
Subject: Re: RFC: turn scatterlist into a linked list, eliminate bio_vec

From: "Adam J. Richter" <[email protected]>
Date: Mon, 24 Jun 2002 13:44:25 -0700

1. It would help to document if pci_pool_alloc(pool,SLAB_KERNEL,&dma)
can ever return failure (as opposed to blocking forever). That would
affect whether certain error legs have to be written in a device
driver.

Because a SLAB_KERNEL allocation can fail, thus can a pci_pool_alloc
allocation. These rules have everything to do with what SLAB_KERNEL
means and there is nothing special about the fact that this flag is
being passed to pci_pool_alloc instead of kmalloc or similar.

2. It sure would be nice if there were a couple of ifdef symbols
that controlled the expansion of DECLARE_PCI_UNMAP_{ADDR,LEN}, so
that the usual default implementations could be put in <linux/pci.h>,
and so that it would be possible to portably write an sg_dma_address()
that wants to return sg.dma_addr if that field exists, and
sg.vaddr otherwise. This is of practical use because I would like
to do the same thing for my proposed sglist.driver_priv{,_dma} fields.

Huh? sg_dma_address() must return virt_to_bus() on non-IOMMU
platforms. If it returned the virtual address that would be
a BUG(). Why do you want to change the semantics of
sg_dma_address() in such a radical way? It makes no sense.

3. Isn't the stuff about scatterlist->address obsolete? I thought that
field was deleted in 2.5.

Yes, it should be killed, feel free to submit a patch.

Thanks for reminding me of them. Although I had explored using
them in another situation, I forgot about them for this situation, where
they make sense, especially if pci_pool_alloc(...,SLAB_KERNEL,...) can
never fail for lack of memory (although they could block indefinitely).
I have updated my sample code accordingly.

SLAB_KERNEL can fail for pci_pool_alloc as it can fail for
kmalloc(...SLAB_KERNEL); What makes you think that pci_pool_alloc
changes the well defined semantics of the SLAB_KERNEL flag?

Do you understand that scatterlist.driver_priv{,_dma} is just
a follow-on optimization of my proposal to turn struct scatterlist
into a linked list, so that pci_map_sg &co. can be consolidated into
the "mid-level" drivers (scsi.o, usb.o)?

To be honest I think this is a horrible idea while we still lack
generic devices. What about my SBUS scsi drivers, are they just sort
of "left out" of your infrastructure because it is PCI based?

Now is not the time for this. When we finally have the generic struct
device stuff, then you can start doing DMA stuff at the generic layer.
Until then you serve only to break things for non-PCI folks.

2002-06-25 14:36:11

by Adam J. Richter

[permalink] [raw]
Subject: Re: RFC: turn scatterlist into a linked list, eliminate bio_vec

[Sorry for the duplication, Dave. I accidentally sent a reply to you only.]

> 1. It would help to document if pci_pool_alloc(pool,SLAB_KERNEL,&dma)
> can ever return failure (as opposed to blocking forever). That would
> affect whether certain error legs have to be written in a device
> driver.
>
>Because a SLAB_KERNEL allocation can fail, thus can a pci_pool_alloc
>allocation. These rules have everything to do with what SLAB_KERNEL
>means and there is nothing special about the fact that this flag is
>being passed to pci_pool_alloc instead of kmalloc or similar.

Then we should use mempool_alloc instead, which, according
to Jens Axboe, never fails with GFP_KERNEL:

|> = Adam Richter
| = Jens Axboe

|> Even so, if __GFP_WAIT never fails, then it can deadlock (for
|> example, some other device driver has a memory leak). Under a
|> scheme like bio_chain (provided that it is changed not to call a
|> memory allocator that can deadlock), the only way you deadlock is
|> if there really is deadlock bug in the lower layers that process
|> the underlying request.
|
|This whole dead lock debate has been done to death before, I suggest you
|find the mempool discussions in the lkml archives from the earlier 2.5
|series. Basically we maintain deadlock free allocation although we never
|fail allocs by saying that a single bio allocation is short lived (or
|at least not held indefinitely). That plus a reserve of XX bios makes
|sure that someone will always return a bio to the pool sooner or later
|and at least the get_from_pool alloc above will succeed sooner or later
|even if vm pressure is ridiculous.
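
(For reference, the reserve Jens describes would be set up once, when
the queue is created; a sketch, with the wrapper and cache names
invented:

static void *sgpriv_mempool_alloc(int gfp_mask, void *data)
{
        return kmem_cache_alloc((kmem_cache_t *) data, gfp_mask);
}

static void sgpriv_mempool_free(void *element, void *data)
{
        kmem_cache_free((kmem_cache_t *) data, element);
}

q->sgpriv_pool = mempool_create(SGPRIV_MIN_RESERVE,
                                sgpriv_mempool_alloc,
                                sgpriv_mempool_free, sgpriv_cache);

mempool_alloc(q->sgpriv_pool, GFP_KERNEL) can then block but will not
fail, which is the property relied on above.)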



> 2. It sure would be nice if there were a couple of ifdef symbols
> that controlled the expansion of DECLARE_PCI_UNMAP_{ADDR,LEN}, so
> that the usual default implementations could be put in <linux/pci.h>,
> and so that it would be possible to portably write an sg_dma_address()
> that wants to return sg.dma_addr if that field exists, and
> sg.vaddr otherwise. This is of practical use because I would like
> to do the same thing for my proposed sglist.driver_priv{,_dma} fields.
>
>Huh? sg_dma_address() must return virt_to_bus() on non-IOMMU
>platforms. If it returned the virtual address that would be
>a BUG(). Why do you want to change the semantics of
>sg_dma_address() in such a radical way? It makes no sense.

You don't understand what I'm asking. On some platforms, such
as mips, there is no sglist->dma_address field. Instead, sg_dma_address()
is defined as:

#define sg_dma_address(sg) (virt_to_bus((sg)->address))

What I want is:
1. To do the same thing with sg->driver_priv, and
2. To not have to repeat the implementation in n different
architecture files.

I would like something like this:

In include/asm-foo/pci.h.

/* Note that NEED_DMA_UNMAP_ADDR is different from NEED_SEPARATE_DMA_ADDR */
#define NEED_DMA_UNMAP_LEN 1
#define NEED_DMA_UNMAP_ADDR 1
#define NEED_SEPARATE_DMA_ADDR 1


In include/linux/pci.h:


#ifdef NEED_DMA_UNMAP_LEN
# define DECLARE_PCI_UNMAP_LEN(LEN_NAME) size_t LEN_NAME
# define pci_unmap_len(PTR, LEN_NAME) ((PTR)->LEN_NAME)
# define pci_unmap_len_set(PTR, LEN_NAME, VAL) (((PTR)->LEN_NAME) = (VAL))
#else
# define DECLARE_PCI_UNMAP_LEN(LEN_NAME)
# define pci_unmap_len(PTR, LEN_NAME) (0)
# define pci_unmap_len_set(PTR, LEN_NAME, VAL) do { } while (0)
#endif

#ifdef NEED_DMA_UNMAP_ADDR
# define DECLARE_PCI_UNMAP_ADDR(ADDR_NAME) dma_addr_t ADDR_NAME
# define pci_unmap_addr(PTR, ADDR_NAME) ((PTR)->ADDR_NAME)
# define pci_unmap_addr_set(PTR, ADDR_NAME, VAL) (((PTR)->ADDR_NAME) = (VAL))
#else
# define DECLARE_PCI_UNMAP_ADDR(ADDR_NAME)
# define pci_unmap_addr(PTR, ADDR_NAME) (0)
# define pci_unmap_addr_set(PTR, ADDR_NAME, VAL) do { } while (0)
#endif

#ifdef NEED_SEPARATE_DMA_ADDR
# define DECLARE_PCI_DMA_ADDR(NAME) dma_addr_t NAME ## _dma
# define PCI_DMA_ADDR(NAME) NAME ## _dma
#else
# define DECLARE_PCI_DMA_ADDR(NAME)
# define PCI_DMA_ADDR(NAME) virt_to_bus(NAME)
#endif
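
With those macros, the proposed field could then be declared and read
back portably; illustrative usage, not an existing API:

struct scatterlist {
        ...
        void *driver_priv;
        DECLARE_PCI_DMA_ADDR(driver_priv); /* dma_addr_t on platforms
                                              that need it, nothing
                                              otherwise */
};

/* Expands to sg->driver_priv_dma on platforms with a separate DMA
   address, and to virt_to_bus(sg->driver_priv) elsewhere: */
dma = PCI_DMA_ADDR(sg->driver_priv);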




> 3. Isn't the stuff about scatterlist->address obsolete? I thought that
> field was deleted in 2.5.
>
>Yes, it should be killed, feel free to submit a patch.

OK. Will do when I get back from OLS.

>SLAB_KERNEL can fail for pci_pool_alloc as it can fail for
>kmalloc(...SLAB_KERNEL); What makes you think that pci_pool_alloc
>changes the well defined semantics of the SLAB_KERNEL flag?

> Do you understand that scatterlist.driver_priv{,_dma} is just
> a follow-on optimization of my proposal to turn struct scatterlist
> into a linked list, so that pci_map_sg &co. can be consolidated into
> the "mid-level" drivers (scsi.o, usb.o)?
>
>To be honest I think this is a horrible idea while we still lack
>generic devices. What about my SBUS scsi drivers, are they just sort
>of "left out" of your infrastructure because it is PCI based?

Oh, I would love to get rid of the struct pci_dev usage. In
fact, I had a version of that illustrative code that basically proposed
that (struct dma_device, dma_alloc_consistent, etc.). I am just trying
to make these changes incrementally. Even without that change, moving
to an abstract dma_limits would be a step in that direction.

>Now is not the time for this. When we finally have the generic struct
>device stuff, then you can start doing DMA stuff at the generic layer.
>Until then you serve only to break things for non-PCI folks.

How does my proposal change anything for the non-PCI folks or
make it harder to abstract away from PCI?

I have to catch a plane to OLS now, so I may not be able to
respond for a while.

Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."

2002-06-25 21:27:31

by Adam J. Richter

[permalink] [raw]
Subject: Re: RFC: turn scatterlist into a linked list, eliminate bio_vec

> I believe that my proposal would make it much easier to
> reduce the dependency on struct pci_device, because it should
> greatly reduce the amount of code that would have to be changed.

>Fine but do it when we have the abstraction to actually
>make it work, not before. [...]

How can I help? Who is working on this? Is this just an idea
right now? Are there patches floating around?

If you want this to work through drivers/base, I could rename
struct bus_type to struct device_type (it is really quite misnamed), and
create a new struct bus_type. Here is roughly what I have in mind. I am
cc'ing linux-kernel in case anyone else wants to point out problems
or indicate that they are working on this.

struct bus_type {
#ifdef CONFIG_DEVICEFS /* I really want to be able to config driverfs out */
const char *name; /* "pci", "isa", "sbus"... */
#endif
void *(*alloc_consistent)(struct device *dev, size_t size,
dma_addr_t *dma_addr);
void (*free_consistent)(struct device *dev, size_t size,
void *addr, dma_addr_t dma_handle);
map_single...
unmap_single...
sync_single...
};

struct device {
...
struct bus_type *bus_type;
struct device_type *dev_type; /* Formerly device.bus */
u64 dma_mask; /* less than device.parent.dma_mask.
Individual DMA channels on this
device might have tighter masks
than this, but will never have
looser ones. */
};

static inline void*
dma_alloc_consistent(struct device *dev, size_t size, dma_addr_t *dma_addr)
{
return (*dev->bus_type->alloc_consistent)(dev, size, dma_addr);
}

static inline dma_addr_t
dma_map_single(struct device *dev, void *vaddr, size_t size, int direction)
{
#ifdef NEED_SEPARATE_DMA_ADDR
return (*dev->bus_type->map_single)(dev, vaddr, size, direction);
#else
return virt_to_bus(vaddr);
#endif
}

static inline void dma_unmap_single(struct device *dev, dma_addr_t dma_addr,
size_t size, int direction)
{
if (direction == PCI_DMA_NONE)
BUG();
#ifdef NEED_SEPARATE_DMA_ADDR
(*dev->bus_type->unmap_single)(dev, dma_addr, size, direction);
#endif
}


...in include/pci.h:...

/* Implemented in each arch subdirectory */
extern void *__pci_alloc_consistent(struct device *dev, size_t size,
dma_addr_t *dma_addr);

/* Legacy interface */
static inline void *pci_alloc_consistent(struct pci_dev *pcidev, size_t size,
dma_addr_t *dma_addr)
{
return __pci_alloc_consistent(&pcidev->dev, size, dma_addr);
}


....Somewhere in drivers/pci:....

struct bus_type pci_bus_type = {
.name = "pci",
.alloc_consistent = __pci_alloc_consistent,
...
};

...Somewhere in drivers/?....

/* ISA uses the PCI routines. */
struct bus_type isa_bus_type = {
.name = "isa",
.alloc_consistent = __pci_alloc_consistent,
...
};

...Somewhere in drivers/sbus...

struct bus_type sbus_bus_type = {
.name = "sbus",
.alloc_consistent = __sbus_alloc_consistent,
...
};


...In each arch/xxxx/pci.c...

void *__pci_alloc_consistent(struct device *dev, size_t size,
dma_addr_t *dma_addr)
{
struct pci_dev *pci_dev = list_entry(dev, struct pci_dev, dev);
...guts of existing pci_alloc_consistent implementation...
}
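
A driver written against this interface would then work unchanged on
PCI, ISA, or SBUS devices; something like the following sketch (the
ring size and function names are invented):

static int foo_probe(struct device *dev)
{
        dma_addr_t ring_dma;
        void *ring;

        ring = dma_alloc_consistent(dev, FOO_RING_BYTES, &ring_dma);
        if (ring == NULL)
                return -ENOMEM;
        /* ...program ring_dma into the device, remember ring... */
        return 0;
}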


Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."