2007-01-25 06:40:24

by Pete Zaitcev

[permalink] [raw]
Subject: Juju

Hi, Kristian:

I only looked briefly at SBP-2, and at submit/callback paths it pulled,
because I do not understand most of the other issues.

Executive summary: please implement proper ORB cancellation. This is
how you verify that your locking model is worth anything: by interaction
of normal transactions with ->remove, ->suspend and SCSI aborts.
Currently, there's no telling if the code is sane or not.

I see that ORBs are always allocated with a call (like SKB) and not
embedded into drivers (like URBs). It's great, keep it up. Also,
never allow drivers to pass DMA-mapped buffers into fw_send_request
and friends. We made both of these mistakes in USB, and it hurts.

Now, about small things:

diff --git a/fw-sbp2.c b/fw-sbp2.c
index 8888f27..3a65095 100644
--- a/fw-sbp2.c
+++ b/fw-sbp2.c
@@ -303,7 +303,7 @@ complete_transaction(struct fw_card *card, int rcode,
unsigned long flags;

orb->rcode = rcode;
- if (rcode != RCODE_COMPLETE) {
+ if (rcode != RCODE_COMPLETE) { /* Huh? */
spin_lock_irqsave(&card->lock, flags);
list_del(&orb->link);
spin_unlock_irqrestore(&card->lock, flags);

This looks like an inverted test. Who does remove ORB from the list
if it's completed normally?

@@ -320,8 +320,7 @@ sbp2_send_orb(struct sbp2_orb *orb, struct fw_unit *unit,
unsigned long flags;

orb->pointer.high = 0;
- orb->pointer.low = orb->request_bus;
- fw_memcpy_to_be32(&orb->pointer, &orb->pointer, sizeof orb->pointer);
+ orb->pointer.low = cpu_to_be32(orb->request_bus);

spin_lock_irqsave(&device->card->lock, flags);
list_add_tail(&orb->link, &sd->orb_list);

I'll whine later about this

@@ -347,6 +346,7 @@ static void sbp2_cancel_orbs(struct fw_unit *unit)
list_splice_init(&sd->orb_list, &list);
spin_unlock_irqrestore(&device->card->lock, flags);

+ /* XXX Notify the hardware -- show-stopper. */
list_for_each_entry_safe(orb, next, &list, link) {
orb->rcode = RCODE_CANCELLED;
orb->callback(orb, NULL);
@@ -364,6 +364,9 @@ complete_management_orb(struct sbp2_orb *base_orb, struct sbp2_status *status)
complete(&orb->done);
}

+/*
+ * This function must be called from a process context.
+ */
static int
sbp2_send_management_orb(struct fw_unit *unit, int node_id, int generation,
int function, int lun, void *response)
@@ -374,9 +377,9 @@ sbp2_send_management_orb(struct fw_unit *unit, int node_id, int generation,
unsigned long timeout;
int retval = -ENOMEM;

- orb = kzalloc(sizeof *orb, GFP_ATOMIC);
+ orb = kzalloc(sizeof *orb, GFP_KERNEL);
if (orb == NULL)
- return -ENOMEM;
+ goto out_orballoc;

/* The sbp2 device is going to send a block read request to
* read out the request from host memory, so map it for

No reason to use the atomic pool, this function sleeps later anyway.

@@ -384,17 +387,22 @@ sbp2_send_management_orb(struct fw_unit *unit, int node_id, int generation,
orb->base.request_bus =
dma_map_single(device->card->device, &orb->request,
sizeof orb->request, DMA_TO_DEVICE);
- if (orb->base.request_bus == 0)
- goto out;
+ if (dma_mapping_error(orb->base.request_bus))
+ goto out_orbreq;

orb->response_bus =
dma_map_single(device->card->device, &orb->response,
sizeof orb->response, DMA_FROM_DEVICE);
- if (orb->response_bus == 0)
- goto out;
+ if (dma_mapping_error(orb->response_bus))
+ goto out_orbresp;

orb->request.response.high = 0;
orb->request.response.low = orb->response_bus;
+ /*
+ * XXX Kristian, what exactly makes you think that DMA address
+ * is 32 bits wide above? This is only guaranteed if device->
+ * card->device has mask 0xffffffff. Where is that set?
+ */

orb->request.misc =
management_orb_notify |

Obvious.

@@ -450,16 +458,18 @@ sbp2_send_management_orb(struct fw_unit *unit, int node_id, int generation,

retval = 0;
out:
- dma_unmap_single(device->card->device, orb->base.request_bus,
- sizeof orb->request, DMA_TO_DEVICE);
dma_unmap_single(device->card->device, orb->response_bus,
sizeof orb->response, DMA_FROM_DEVICE);
+ out_orbresp:
+ dma_unmap_single(device->card->device, orb->base.request_bus,
+ sizeof orb->request, DMA_TO_DEVICE);
+ out_orbreq:

if (response)
fw_memcpy_from_be32(response,
orb->response, sizeof orb->response);
kfree(orb);
-
+ out_orballoc:
return retval;
}


This is how to use "out" labels properly, kernel style.

@@ -746,6 +756,7 @@ complete_command_orb(struct sbp2_orb *base_orb, struct sbp2_status *status)
struct fw_unit *unit = orb->unit;
struct fw_device *device = fw_device(unit->device.parent);
struct scatterlist *sg;
+ struct scsi_cmnd *cmd;
int result;

if (status != NULL) {
@@ -798,10 +809,11 @@ complete_command_orb(struct sbp2_orb *base_orb, struct sbp2_status *status)
sizeof orb->request_buffer_bus,
DMA_FROM_DEVICE);

- orb->cmd->result = result << 16;
- orb->done(orb->cmd);
-
+ cmd = orb->cmd;
kfree(orb);
+
+ cmd->result = result << 16;
+ cmd->done(cmd);
}

static void sbp2_command_orb_map_scatterlist(struct sbp2_command_orb *orb)

Just a stylistic sugar.

@@ -896,6 +908,7 @@ static int sbp2_scsi_queuecommand(struct scsi_cmnd *cmd, scsi_done_fn_t done)
struct fw_device *device = fw_device(unit->device.parent);
struct sbp2_device *sd = unit->device.driver_data;
struct sbp2_command_orb *orb;
+ u32 misc;

/* Bidirectional commands are not yet implemented, and unknown
* transfer direction not handled. */
@@ -917,28 +930,32 @@ static int sbp2_scsi_queuecommand(struct scsi_cmnd *cmd, scsi_done_fn_t done)
orb->base.request_bus =
dma_map_single(device->card->device, &orb->request,
sizeof orb->request, DMA_TO_DEVICE);
+ if (dma_mapping_error(orb->base.request_bus)) {
+ kfree(orb);
+ cmd->result = DID_ERROR << 16; /* XXX Better error? */
+ done(cmd);
+ return 0;
+ }

orb->unit = unit;
- orb->done = done;
orb->cmd = cmd;

- orb->request.next.high = SBP2_ORB_NULL;
+ orb->request.next.high = cpu_to_be32(SBP2_ORB_NULL);
orb->request.next.low = 0x0;
/* At speed 100 we can do 512 bytes per packet, at speed 200,
* 1024 bytes per packet etc. The SBP-2 max_payload field
* specifies the max payload size as 2 ^ (max_payload + 2), so
* if we set this to max_speed + 7, we get the right value. */
- orb->request.misc =
+ misc =
command_orb_max_payload(device->node->max_speed + 7) |
command_orb_speed(device->node->max_speed) |
command_orb_notify;

if (cmd->sc_data_direction == DMA_FROM_DEVICE)
- orb->request.misc |=
- command_orb_direction(SBP2_DIRECTION_FROM_MEDIA);
+ misc |= command_orb_direction(SBP2_DIRECTION_FROM_MEDIA);
else if (cmd->sc_data_direction == DMA_TO_DEVICE)
- orb->request.misc |=
- command_orb_direction(SBP2_DIRECTION_TO_MEDIA);
+ misc |= command_orb_direction(SBP2_DIRECTION_TO_MEDIA);
+ orb->request.misc = cpu_to_be32(misc);

if (cmd->use_sg) {
sbp2_command_orb_map_scatterlist(orb);
@@ -947,6 +964,7 @@ static int sbp2_scsi_queuecommand(struct scsi_cmnd *cmd, scsi_done_fn_t done)
* could we get the scsi or blk layer to do that by
* reporting our max supported block size? */
fw_error("command > 64k\n");
+ kfree(orb);
cmd->result = DID_ERROR << 16;
done(cmd);
return 0;
@@ -954,11 +972,7 @@ static int sbp2_scsi_queuecommand(struct scsi_cmnd *cmd, scsi_done_fn_t done)
sbp2_command_orb_map_buffer(orb);
}

- fw_memcpy_to_be32(&orb->request, &orb->request, sizeof orb->request);
-
- memset(orb->request.command_block,
- 0, sizeof orb->request.command_block);
- memcpy(orb->request.command_block, cmd->cmnd, COMMAND_SIZE(*cmd->cmnd));
+ memcpy(orb->request.command_block, cmd->cmnd, 12);

orb->base.callback = complete_command_orb;


The way you byteswap irks me. You make the job of static analisys
tools harder, for no benefit. The sensible rule of thumb is, use wire
order for untyped arrays of bytes (you do that), use CPU order when
kernel knows structures (you kinda do it, but in a very messy way
by swapping 32-bit values; now all you need is one 16-bit value for
the whole idea to unravel), and never mix them! Alas, this is how
fw_send_request is designed: the payload can be header, or can be data.

But like I said, it's small bananas.

Cheers,
-- Pete

P.S. If I were using git, I would clone a Linus git tree and work
there. All this out-of-tree module stuff is just a waste of time.
You even wrote an RPM spec! What for? Not to mention that the
blacklists that you install interfere with non-jujufied kernels.
If you work with a cloned repo, your pre-integration history will
also be available post-integration. I presume you want to integrate
into mainline eventually, don't you?


2007-01-25 17:14:01

by Stefan Richter

[permalink] [raw]
Subject: Re: Juju

Pete Zaitcev wrote:
...
> Now, about small things:

Thanks Pete, I'm sure I will find some of these issues in mainline's
sbp2 too. :-)

...
> P.S. If I were using git, I would clone a Linus git tree and work
> there. All this out-of-tree module stuff is just a waste of time.

Kristian just announced
http://gitweb.freedesktop.org/?p=users/krh/linux-2.6.git
git://people.freedesktop.org/~krh/linux-2.6
shortly before you posted.

(And time-lagged/ filtered versions: kernel.org's linux1394-2.6.git, and
non-canonical patchkits for 2.6.{18,19,20} on my homepage.)
--
Stefan Richter
-=====-=-=== ---= ==--=
http://arcgraph.de/sr/

2007-01-25 21:24:31

by Kristian Høgsberg

[permalink] [raw]
Subject: Re: Juju

Pete Zaitcev wrote:
> Hi, Kristian:
>
> I only looked briefly at SBP-2, and at submit/callback paths it pulled,
> because I do not understand most of the other issues.

Great, thanks for giving this a look-over, much appreciated.

> Executive summary: please implement proper ORB cancellation. This is
> how you verify that your locking model is worth anything: by interaction
> of normal transactions with ->remove, ->suspend and SCSI aborts.
> Currently, there's no telling if the code is sane or not.

Good catch, my ORB cancellation is a bit rough, and I believe there are a
couple of races there, though I haven't been able to actually trigger these.
When cancelling an ORB I need to cancel the ORB, the pending transaction and
potentially the packet queued for DMA in the fw-ohci driver. This is not
fully in place yet and I need to sit down and audit this path.

> I see that ORBs are always allocated with a call (like SKB) and not
> embedded into drivers (like URBs). It's great, keep it up. Also,
> never allow drivers to pass DMA-mapped buffers into fw_send_request
> and friends. We made both of these mistakes in USB, and it hurts.

Oh, the ORBs are SBP-2 specific data structures, struct fw_transaction is
probably what corresponds to USB URBs. This struct is defined in
fw-transaction.h and is available for embedding into other structs, such as
struct sbp2_orb in fw-sbp2. Is that what you're suggesting against, and what
are the problems with this approach?

As for passing DMA-mapped buffers, do you mean that fw_send_request (or code
below that function) should map the payload as it currently does as opposed to
callers pre-mapping DMA buffers?

> Now, about small things:
>
> diff --git a/fw-sbp2.c b/fw-sbp2.c
> index 8888f27..3a65095 100644
> --- a/fw-sbp2.c
> +++ b/fw-sbp2.c
> @@ -303,7 +303,7 @@ complete_transaction(struct fw_card *card, int rcode,
> unsigned long flags;
>
> orb->rcode = rcode;
> - if (rcode != RCODE_COMPLETE) {
> + if (rcode != RCODE_COMPLETE) { /* Huh? */
> spin_lock_irqsave(&card->lock, flags);
> list_del(&orb->link);
> spin_unlock_irqrestore(&card->lock, flags);
>
> This looks like an inverted test. Who does remove ORB from the list
> if it's completed normally?

An SBP-2 ORB transaction sequence consists of a number of FireWire
transactions. First the initiator prepares an SBP-2 ORB (object request
block) in host memory and maps this for DMA. Then the initiators sends its
bus address and the physical address of the ORB to the SBP-2 device. This is
the request that complete_transaction deals with. If we get rcode ==
RCODE_COMPLETE here, we successfully wrote the ORB address to the SBP-2 device
and we leave the ORB in the list so the SBP-2 transaction can continue. Any
other rcode means that the write failed and we need to back out the
transaction and leave it to upper layers to retry.

The rest of the SBP-2 transaction continues with the SBP-2 device issuing a
FireWire read transaction to read out the ORB from host memory and then carry
out the instructions in the ORB. Once the device has completed the ORB it
will do a status write to the status address specified in the ORB, at which
point the SBP-2 transaction is complete.

> @@ -320,8 +320,7 @@ sbp2_send_orb(struct sbp2_orb *orb, struct fw_unit *unit,
> unsigned long flags;
>
> orb->pointer.high = 0;
> - orb->pointer.low = orb->request_bus;
> - fw_memcpy_to_be32(&orb->pointer, &orb->pointer, sizeof orb->pointer);
> + orb->pointer.low = cpu_to_be32(orb->request_bus);
>
> spin_lock_irqsave(&device->card->lock, flags);
> list_add_tail(&orb->link, &sd->orb_list);
>
> I'll whine later about this
>
> @@ -347,6 +346,7 @@ static void sbp2_cancel_orbs(struct fw_unit *unit)
> list_splice_init(&sd->orb_list, &list);
> spin_unlock_irqrestore(&device->card->lock, flags);
>
> + /* XXX Notify the hardware -- show-stopper. */

Yes, agreed.

> list_for_each_entry_safe(orb, next, &list, link) {
> orb->rcode = RCODE_CANCELLED;
> orb->callback(orb, NULL);
> @@ -364,6 +364,9 @@ complete_management_orb(struct sbp2_orb *base_orb, struct sbp2_status *status)
> complete(&orb->done);
> }
>
> +/*
> + * This function must be called from a process context.
> + */
> static int
> sbp2_send_management_orb(struct fw_unit *unit, int node_id, int generation,
> int function, int lun, void *response)
> @@ -374,9 +377,9 @@ sbp2_send_management_orb(struct fw_unit *unit, int node_id, int generation,
> unsigned long timeout;
> int retval = -ENOMEM;
>
> - orb = kzalloc(sizeof *orb, GFP_ATOMIC);
> + orb = kzalloc(sizeof *orb, GFP_KERNEL);
> if (orb == NULL)
> - return -ENOMEM;
> + goto out_orballoc;
>
> /* The sbp2 device is going to send a block read request to
> * read out the request from host memory, so map it for
>
> No reason to use the atomic pool, this function sleeps later anyway.

Oh, true.

> @@ -384,17 +387,22 @@ sbp2_send_management_orb(struct fw_unit *unit, int node_id, int generation,
> orb->base.request_bus =
> dma_map_single(device->card->device, &orb->request,
> sizeof orb->request, DMA_TO_DEVICE);
> - if (orb->base.request_bus == 0)
> - goto out;
> + if (dma_mapping_error(orb->base.request_bus))
> + goto out_orbreq;
>
> orb->response_bus =
> dma_map_single(device->card->device, &orb->response,
> sizeof orb->response, DMA_FROM_DEVICE);
> - if (orb->response_bus == 0)
> - goto out;
> + if (dma_mapping_error(orb->response_bus))
> + goto out_orbresp;
>
> orb->request.response.high = 0;
> orb->request.response.low = orb->response_bus;
> + /*
> + * XXX Kristian, what exactly makes you think that DMA address
> + * is 32 bits wide above? This is only guaranteed if device->
> + * card->device has mask 0xffffffff. Where is that set?
> + */

I guess that should be in pci_probe in fw-ohci.c, but it isn't. If the
allocation is in physical memory above the 4G limit, will dma_map_single set
up a trampoline buffer below 4G if the device DMA mask is 0xffffffff? I have
a similar problem when mapping the buffer passed down from the SCSI stack, in
that I don't know for sure that these buffers are accessible within the first 4G.

Didn't know about dma_mapping_error, cool, will use that.

> orb->request.misc =
> management_orb_notify |
>
> Obvious.

Huh? What do you mean?

> @@ -450,16 +458,18 @@ sbp2_send_management_orb(struct fw_unit *unit, int node_id, int generation,
>
> retval = 0;
> out:
> - dma_unmap_single(device->card->device, orb->base.request_bus,
> - sizeof orb->request, DMA_TO_DEVICE);
> dma_unmap_single(device->card->device, orb->response_bus,
> sizeof orb->response, DMA_FROM_DEVICE);
> + out_orbresp:
> + dma_unmap_single(device->card->device, orb->base.request_bus,
> + sizeof orb->request, DMA_TO_DEVICE);
> + out_orbreq:
>
> if (response)
> fw_memcpy_from_be32(response,
> orb->response, sizeof orb->response);
> kfree(orb);
> -
> + out_orballoc:
> return retval;
> }
>
>
> This is how to use "out" labels properly, kernel style.

Yeah, I guess... I did actually track down every function call in the destroy
path to make sure that they're safe to call on the values that kzalloc sets
them to.

> @@ -746,6 +756,7 @@ complete_command_orb(struct sbp2_orb *base_orb, struct sbp2_status *status)
> struct fw_unit *unit = orb->unit;
> struct fw_device *device = fw_device(unit->device.parent);
> struct scatterlist *sg;
> + struct scsi_cmnd *cmd;
> int result;
>
> if (status != NULL) {
> @@ -798,10 +809,11 @@ complete_command_orb(struct sbp2_orb *base_orb, struct sbp2_status *status)
> sizeof orb->request_buffer_bus,
> DMA_FROM_DEVICE);
>
> - orb->cmd->result = result << 16;
> - orb->done(orb->cmd);
> -
> + cmd = orb->cmd;
> kfree(orb);
> +
> + cmd->result = result << 16;
> + cmd->done(cmd);
> }
>
> static void sbp2_command_orb_map_scatterlist(struct sbp2_command_orb *orb)
>
> Just a stylistic sugar.
>
> @@ -896,6 +908,7 @@ static int sbp2_scsi_queuecommand(struct scsi_cmnd *cmd, scsi_done_fn_t done)
> struct fw_device *device = fw_device(unit->device.parent);
> struct sbp2_device *sd = unit->device.driver_data;
> struct sbp2_command_orb *orb;
> + u32 misc;
>
> /* Bidirectional commands are not yet implemented, and unknown
> * transfer direction not handled. */
> @@ -917,28 +930,32 @@ static int sbp2_scsi_queuecommand(struct scsi_cmnd *cmd, scsi_done_fn_t done)
> orb->base.request_bus =
> dma_map_single(device->card->device, &orb->request,
> sizeof orb->request, DMA_TO_DEVICE);
> + if (dma_mapping_error(orb->base.request_bus)) {
> + kfree(orb);
> + cmd->result = DID_ERROR << 16; /* XXX Better error? */
> + done(cmd);
> + return 0;
> + }

Yikes, I thought I had most of the dma_map_single error cases handled.

> orb->unit = unit;
> - orb->done = done;
> orb->cmd = cmd;
>
> - orb->request.next.high = SBP2_ORB_NULL;
> + orb->request.next.high = cpu_to_be32(SBP2_ORB_NULL);
> orb->request.next.low = 0x0;
> /* At speed 100 we can do 512 bytes per packet, at speed 200,
> * 1024 bytes per packet etc. The SBP-2 max_payload field
> * specifies the max payload size as 2 ^ (max_payload + 2), so
> * if we set this to max_speed + 7, we get the right value. */
> - orb->request.misc =
> + misc =
> command_orb_max_payload(device->node->max_speed + 7) |
> command_orb_speed(device->node->max_speed) |
> command_orb_notify;
>
> if (cmd->sc_data_direction == DMA_FROM_DEVICE)
> - orb->request.misc |=
> - command_orb_direction(SBP2_DIRECTION_FROM_MEDIA);
> + misc |= command_orb_direction(SBP2_DIRECTION_FROM_MEDIA);
> else if (cmd->sc_data_direction == DMA_TO_DEVICE)
> - orb->request.misc |=
> - command_orb_direction(SBP2_DIRECTION_TO_MEDIA);
> + misc |= command_orb_direction(SBP2_DIRECTION_TO_MEDIA);
> + orb->request.misc = cpu_to_be32(misc);
>
> if (cmd->use_sg) {
> sbp2_command_orb_map_scatterlist(orb);
> @@ -947,6 +964,7 @@ static int sbp2_scsi_queuecommand(struct scsi_cmnd *cmd, scsi_done_fn_t done)
> * could we get the scsi or blk layer to do that by
> * reporting our max supported block size? */
> fw_error("command > 64k\n");
> + kfree(orb);
> cmd->result = DID_ERROR << 16;
> done(cmd);
> return 0;
> @@ -954,11 +972,7 @@ static int sbp2_scsi_queuecommand(struct scsi_cmnd *cmd, scsi_done_fn_t done)
> sbp2_command_orb_map_buffer(orb);
> }
>
> - fw_memcpy_to_be32(&orb->request, &orb->request, sizeof orb->request);
> -
> - memset(orb->request.command_block,
> - 0, sizeof orb->request.command_block);
> - memcpy(orb->request.command_block, cmd->cmnd, COMMAND_SIZE(*cmd->cmnd));
> + memcpy(orb->request.command_block, cmd->cmnd, 12);
>
> orb->base.callback = complete_command_orb;
>
>
> The way you byteswap irks me. You make the job of static analisys
> tools harder, for no benefit. The sensible rule of thumb is, use wire
> order for untyped arrays of bytes (you do that), use CPU order when
> kernel knows structures (you kinda do it, but in a very messy way
> by swapping 32-bit values; now all you need is one 16-bit value for
> the whole idea to unravel), and never mix them! Alas, this is how
> fw_send_request is designed: the payload can be header, or can be data.

Yeah, your changes for fw-sbp2.c makes sense, the byteswapping in-place always
felt kinda iffy. None of the wire structs have any 16-bit fields, though,
I've intentionally made them all 32 bit values. The SBP-2 ORBs have 16 bit
values in them but they never cross 32 bit boundaries, and they're shifted and
or'ed together so that when they're byteswapped they end up in the right
places. But your changes make it a little more clear and I can then use
__be32 in the wirte structs.

For fw-transaction.c and code touching the FireWire headers
(fw_packet::header[4]) the rule is that it's CPU endian data. This is the
only way to avoid unecessary swaps. The OHCI controller will expect the
headers to be little endian but byteswap them as 32 bit words as it sends out
the packets. So, ironically, on big endian architechtures we need to byteswap
the headers to little endian when setting up DMA. So since the header byte
swapping depends on the host controller, it only makes sense that the stack
doesn't swap, but leaves that to the controller.

> But like I said, it's small bananas.
>
> Cheers,
> -- Pete
>
> P.S. If I were using git, I would clone a Linus git tree and work
> there. All this out-of-tree module stuff is just a waste of time.
> You even wrote an RPM spec! What for? Not to mention that the
> blacklists that you install interfere with non-jujufied kernels.
> If you work with a cloned repo, your pre-integration history will
> also be available post-integration. I presume you want to integrate
> into mainline eventually, don't you?

Indeed, I've just moved to an in-tree development model now. I still think
the out-off-tree model is a good way to prototype, get started and reach
"critical mass" with your driver. But as I'm starting to integrate and sync
with Stefans tree now, an in-tree model is a lot easier to work with. The
tree is here

git://people.freedesktop.org/~krh/linux-2.6

with gitweb available here:

http://gitweb.freedesktop.org/?p=users/krh/linux-2.6.git;a=summary

cheers,
Kristian

2007-01-25 23:41:04

by Pete Zaitcev

[permalink] [raw]
Subject: Re: Juju

On Thu, 25 Jan 2007 16:18:35 -0500, Kristian Høgsberg <[email protected]> wrote:

> > I see that ORBs are always allocated with a call (like SKB) and not
> > embedded into drivers (like URBs). It's great, keep it up. Also,
> > never allow drivers to pass DMA-mapped buffers into fw_send_request
> > and friends. We made both of these mistakes in USB, and it hurts.
>
> Oh, the ORBs are SBP-2 specific data structures, struct fw_transaction is
> probably what corresponds to USB URBs. This struct is defined in
> fw-transaction.h and is available for embedding into other structs, such as
> struct sbp2_orb in fw-sbp2. Is that what you're suggesting against, and what
> are the problems with this approach?

Fortunately we do not care about out-of-tree drivers, which are most
affected, you may even call it a feature ^_^. My main problem is,
we can't refcount URBs, so usbmon can't tap them and must copy.

I understand that fw_transaction points to an external buffer while
SKB does not, so any jujumon code won't be able to refcount data.
So maybe it's not important.

> As for passing DMA-mapped buffers, do you mean that fw_send_request (or code
> below that function) should map the payload as it currently does as opposed to
> callers pre-mapping DMA buffers?

Yes, this is what I mean. It serves well for our networking stack,
which is pretty fast. For some reason someone decided that USB users
should be let mapping their buffers. This led to big trouble with
usbmon, extra fields in URB (although already fat...), tying up precious
IOMMU resources (not so precious on x86, fortunately).

> > orb->rcode = rcode;
> > - if (rcode != RCODE_COMPLETE) {
> > + if (rcode != RCODE_COMPLETE) { /* Huh? */
> > spin_lock_irqsave(&card->lock, flags);
> > list_del(&orb->link);
> > spin_unlock_irqrestore(&card->lock, flags);
> >
> > This looks like an inverted test. Who does remove ORB from the list
> > if it's completed normally?

>[...]
> The rest of the SBP-2 transaction continues with the SBP-2 device issuing a
> FireWire read transaction to read out the ORB from host memory and then carry
> out the instructions in the ORB. Once the device has completed the ORB it
> will do a status write to the status address specified in the ORB, at which
> point the SBP-2 transaction is complete.

You know, I wanted to use this picture for a long time:
http://www.flickr.com/photos/zaitcev/369269557/
I hope it's all right.

I guess my problem is that I don't see just why the ORB transaction callback
is being delivered before the whole I/O was done.

Now that you drew my attention to sbp2_status_write(), this looks wrong:

/* Lookup the orb corresponding to this status write. */
spin_lock_irqsave(&card->lock, flags);
list_for_each_entry(orb, &sd->orb_list, link) {
if (status_get_orb_high(status) == 0 &&
status_get_orb_low(status) == orb->request_bus) {
list_del(&orb->link);
break;
}
}
spin_unlock_irqrestore(&card->lock, flags);

Why is it that fw_request can't carry a pointer?

> > orb->request.response.high = 0;
> > orb->request.response.low = orb->response_bus;
> > + /*
> > + * XXX Kristian, what exactly makes you think that DMA address
> > + * is 32 bits wide above? This is only guaranteed if device->
> > + * card->device has mask 0xffffffff. Where is that set?
> > + */
>
> I guess that should be in pci_probe in fw-ohci.c, but it isn't. If the
> allocation is in physical memory above the 4G limit, will dma_map_single set
> up a trampoline buffer below 4G if the device DMA mask is 0xffffffff?

If an IOMMU (or SWIOTLB) is not available, dma_map_single will fail.
If IOMMU is available, it will allocate a mapping so that DMA address
is constrained, failing if tables are full. At least this was my
understanding. However, I implemented that many years ago.

> I have
> a similar problem when mapping the buffer passed down from the SCSI stack, in
> that I don't know for sure that these buffers are accessible within the first 4G.

Please ask James, I do not remember that either.

For block layer, Jens guarantees it for you (it's controlled by
blk_queue_bounce_limit), but I am not sure about SCSI. I don't expect
it to have bounce buffers though.

> > Obvious.
>
> Huh? What do you mean?

I meant the replacement of the tests for zero with dma_mapping_error().

> > @@ -450,16 +458,18 @@ sbp2_send_management_orb(struct fw_unit *unit, int node_id, int generation,
> >
> > retval = 0;
> > out:
> > - dma_unmap_single(device->card->device, orb->base.request_bus,
> > - sizeof orb->request, DMA_TO_DEVICE);
> > dma_unmap_single(device->card->device, orb->response_bus,
> > sizeof orb->response, DMA_FROM_DEVICE);
> > + out_orbresp:
> > + dma_unmap_single(device->card->device, orb->base.request_bus,
> > + sizeof orb->request, DMA_TO_DEVICE);
> > + out_orbreq:
> >
> > if (response)
> > fw_memcpy_from_be32(response,
> > orb->response, sizeof orb->response);
> > kfree(orb);
> > -
> > + out_orballoc:
> > return retval;
> > }
> >
> >
> > This is how to use "out" labels properly, kernel style.
>
> Yeah, I guess... I did actually track down every function call in the destroy
> path to make sure that they're safe to call on the values that kzalloc sets
> them to.

I challenge this claim. I am pretty sure that dma_unmap_single on a buffer
that wasn't mapped in the first place can cause havoc, primarily because
zero bus address can be valid.

> > @@ -917,28 +930,32 @@ static int sbp2_scsi_queuecommand(struct scsi_cmnd *cmd, scsi_done_fn_t done)
> > orb->base.request_bus =
> > dma_map_single(device->card->device, &orb->request,
> > sizeof orb->request, DMA_TO_DEVICE);
> > + if (dma_mapping_error(orb->base.request_bus)) {
> > + kfree(orb);
> > + cmd->result = DID_ERROR << 16; /* XXX Better error? */
> > + done(cmd);
> > + return 0;
> > + }
>
> Yikes, I thought I had most of the dma_map_single error cases handled.

Well, on SPARC we lived for years with a panic when IOMMU ran out and
nothing happened, so no biggie. It never happens. And when it does
you're dead anyway. Testing this is important in network drivers,
where the host can recover from running out of IOMMU.

> None of the wire structs have any 16-bit fields, though,
> I've intentionally made them all 32 bit values.

I figured as much, but I don't know 1394 enough to be sure.

> Indeed, I've just moved to an in-tree development model now. I still think
> the out-off-tree model is a good way to prototype, get started and reach
> "critical mass" with your driver.

Yeah, repo is far smaller. Maybe I'm just so used to building
throwaway kernels for every test driver.

-- Pete

2007-01-26 02:35:35

by Stefan Richter

[permalink] [raw]
Subject: Re: Juju

Pete Zaitcev wrote:
> On Thu, 25 Jan 2007 16:18:35 -0500, Kristian H?gsberg <[email protected]> wrote:
...
>> will do a status write to the status address specified in the ORB, at which
>> point the SBP-2 transaction is complete.
>
> You know, I wanted to use this picture for a long time:
> http://www.flickr.com/photos/zaitcev/369269557/

The fundamental thing about SBP-2 is that ORBs ( = SCSI command blocks
plus SBP-2 header) and data buffers all reside in the memory of the
initiator (or of a 3rd party on the FireWire bus). The target peeks and
pokes them when and how it sees fit. The initiator pushes only tiny
notifications about availability of new ORBs to the target. The target
eventually completes SCSI commands in-order or out-of-order and signals
so by pushing a status block per one or more completed commands.

(Juju's fw-sbp2 gives only one command at a time to the target.
Mainline's sbp2 can optionally give more commands in a row, but the
implementation is subtly broken in several ways and therefore disabled
by default until I fix it right after hell froze over.)

Another important thing to know in order to understand fw-sbp2 and sbp2
is that they currently rely on OHCI-1394's physical DMA feature, which
I'll not explain here. It means two things: 1. FireWire bus addresses of
ORBs and buffers are directly derived from the DMA mapped address.
(FireWire bus addresses are the addresses used in communication between
SBP-2 initiator and target.) 2. Almost all of the transfers done by the
target do not generate interrupts. (Just the status write generates an
interrupt.)

...
> Now that you drew my attention to sbp2_status_write(), this looks wrong:
>
> /* Lookup the orb corresponding to this status write. */
> spin_lock_irqsave(&card->lock, flags);
> list_for_each_entry(orb, &sd->orb_list, link) {
> if (status_get_orb_high(status) == 0 &&
> status_get_orb_low(status) == orb->request_bus) {
> list_del(&orb->link);
> break;
> }
> }
> spin_unlock_irqrestore(&card->lock, flags);
>
> Why is it that fw_request can't carry a pointer?

The target wrote an SBP-2 status block into our memory. The status block
contains the FireWire bus address of the ORB to which it belongs. Juju's
fw-sbp2 does the same as mainline's sbp2: Looking through the pile of
unfinished ORBs for one with the same FireWire bus address, which was
previously derived from the DMA mapped address. Since there aren't many
mapped ORBs per target, a linked list is a reasonable data structure to
search over. That said --- Kristian, doesn't fw-sbp2 have at most 1 ORB
in sd->orb_list?
--
Stefan Richter
-=====-=-=== ---= ==-=-
http://arcgraph.de/sr/

2007-01-26 04:07:45

by Kristian Høgsberg

[permalink] [raw]
Subject: Re: Juju

Stefan Richter wrote:
> Pete Zaitcev wrote:
>> On Thu, 25 Jan 2007 16:18:35 -0500, Kristian H?gsberg <[email protected]> wrote:
> ...
>>> will do a status write to the status address specified in the ORB, at which
>>> point the SBP-2 transaction is complete.
>> You know, I wanted to use this picture for a long time:
>> http://www.flickr.com/photos/zaitcev/369269557/

Haha, sure :)

> The fundamental thing about SBP-2 is that ORBs ( = SCSI command blocks
> plus SBP-2 header) and data buffers all reside in the memory of the
> initiator (or of a 3rd party on the FireWire bus). The target peeks and
> pokes them when and how it sees fit. The initiator pushes only tiny
> notifications about availability of new ORBs to the target. The target
> eventually completes SCSI commands in-order or out-of-order and signals
> so by pushing a status block per one or more completed commands.
>
> (Juju's fw-sbp2 gives only one command at a time to the target.
> Mainline's sbp2 can optionally give more commands in a row, but the
> implementation is subtly broken in several ways and therefore disabled
> by default until I fix it right after hell froze over.)
>
> Another important thing to know in order to understand fw-sbp2 and sbp2
> is that they currently rely on OHCI-1394's physical DMA feature, which
> I'll not explain here. It means two things: 1. FireWire bus addresses of
> ORBs and buffers are directly derived from the DMA mapped address.
> (FireWire bus addresses are the addresses used in communication between
> SBP-2 initiator and target.) 2. Almost all of the transfers done by the
> target do not generate interrupts. (Just the status write generates an
> interrupt.)

Another thing that probably makes my explanation a little confusing is that
there are two types of transactions: FireWire transactions which consists of a
request followed by a response and are pretty much the smallest interaction
you can have with a remote device. Then there are SBP-2 transactions, which
are a higer level sequence layered on top of FireWire transactions. An SBP-2
transaction consists of a sequence of FireWire transactions, the first of
which is initiated by the initiator. This is the FireWire transaction that
complete_transaction handles. When this first FireWire transaction finishes
succesfully, we know that the SBP-2 transaction has been started and we sit
back and wait for the target to do it's part. If that initial FireWire
transaction fails, we need to fail the SBP-2 transaction we we're trying to start.

> ...
>> Now that you drew my attention to sbp2_status_write(), this looks wrong:
>>
>> /* Lookup the orb corresponding to this status write. */
>> spin_lock_irqsave(&card->lock, flags);
>> list_for_each_entry(orb, &sd->orb_list, link) {
>> if (status_get_orb_high(status) == 0 &&
>> status_get_orb_low(status) == orb->request_bus) {
>> list_del(&orb->link);
>> break;
>> }
>> }
>> spin_unlock_irqrestore(&card->lock, flags);
>>
>> Why is it that fw_request can't carry a pointer?
>
> The target wrote an SBP-2 status block into our memory. The status block
> contains the FireWire bus address of the ORB to which it belongs. Juju's
> fw-sbp2 does the same as mainline's sbp2: Looking through the pile of
> unfinished ORBs for one with the same FireWire bus address, which was
> previously derived from the DMA mapped address.

But the status write actually does carry the address of the ORB it signals the
completion of. So in theory, we could just read out the ORB address from the
status write packet and map that back to kernel virtual memory and do an
appropriate container_of() call and we should have the struct sbp2_orb
pointer. The reason I still search through the list is of course that this is
way to much trust to put into hardware as buggy as external storage devices.
Blindly dereferencing a pointer returned by storage driver firmware is
probably a very bad idea.

One thing I want to do (though very low priority) is to allocate the ORBs out
of a preallocated circular buffer. We can then check that the ORB pointer
returned in the status write points into this buffer and that it's a multiple
of the ORB size, at which point it should be safe to dereference it.

> Since there aren't many
> mapped ORBs per target, a linked list is a reasonable data structure to
> search over. That said --- Kristian, doesn't fw-sbp2 have at most 1 ORB
> in sd->orb_list?

Yes, there is only ever one pending ORB in the list, so looking through the
list is not exactly a time sink :)

Kristian

2007-01-26 04:50:19

by Pete Zaitcev

[permalink] [raw]
Subject: Re: Juju

On Fri, 26 Jan 2007 03:35:19 +0100, Stefan Richter <[email protected]> wrote:

> The fundamental thing about SBP-2 is that ORBs ( = SCSI command blocks
> plus SBP-2 header) and data buffers all reside in the memory of the
> initiator (or of a 3rd party on the FireWire bus).

I recognize the concept, I worked with SRP in Infiniband a bit.

> The target wrote an SBP-2 status block into our memory. The status block
> contains the FireWire bus address of the ORB to which it belongs. [...]

I see. SRP has a more flexible tag which can be used to look up
the just completed command more effectively. But if we only submit
one, it's a moot point of course.

> [...] Since there aren't many
> mapped ORBs per target, a linked list is a reasonable data structure to
> search over.

Righto. I'm used to having thousands of oustanding commands in arrays.

-- Pete

2007-01-26 06:47:12

by Greg KH

[permalink] [raw]
Subject: Re: Juju

On Thu, Jan 25, 2007 at 03:38:24PM -0800, Pete Zaitcev wrote:
> On Thu, 25 Jan 2007 16:18:35 -0500, Kristian H??gsberg <[email protected]> wrote:
>
> > > I see that ORBs are always allocated with a call (like SKB) and not
> > > embedded into drivers (like URBs). It's great, keep it up. Also,
> > > never allow drivers to pass DMA-mapped buffers into fw_send_request
> > > and friends. We made both of these mistakes in USB, and it hurts.
> >
> > Oh, the ORBs are SBP-2 specific data structures, struct fw_transaction is
> > probably what corresponds to USB URBs. This struct is defined in
> > fw-transaction.h and is available for embedding into other structs, such as
> > struct sbp2_orb in fw-sbp2. Is that what you're suggesting against, and what
> > are the problems with this approach?
>
> Fortunately we do not care about out-of-tree drivers, which are most
> affected, you may even call it a feature ^_^. My main problem is,
> we can't refcount URBs, so usbmon can't tap them and must copy.

urbs are reference counted, it's just that not all drivers who create
them use them that way :(

Perhaps you can inforce this in the new codebase...

thanks,

greg k-h

2007-01-26 10:53:47

by Stefan Richter

[permalink] [raw]
Subject: Re: Juju

Pete Zaitcev wrote:
> On Fri, 26 Jan 2007 03:35:19 +0100, Stefan Richter <[email protected]> wrote:
>> The target wrote an SBP-2 status block into our memory. The status block
>> contains the FireWire bus address of the ORB to which it belongs. [...]
>
> I see. SRP has a more flexible tag which can be used to look up
> the just completed command more effectively. But if we only submit
> one, it's a moot point of course.

And indeed, in the SBP-2 realm the ORB's address is actually the tag of
the SCSI task.

>> [...] Since there aren't many
>> mapped ORBs per target, a linked list is a reasonable data structure to
>> search over.
>
> Righto. I'm used to having thousands of oustanding commands in arrays.

If I ever get to fix sbp2's handling of dynamically appended ORB lists,
the next step would be to measure which queue depth is optimal for
different hardware and different workloads. (Certainly much less than
thousands for typical SBP-2 target(s)+initiator(s) setups.)
--
Stefan Richter
-=====-=-=== ---= ==-=-
http://arcgraph.de/sr/

2007-01-26 11:30:11

by Stefan Richter

[permalink] [raw]
Subject: Re: Juju

Kristian H?gsberg wrote:
> Another thing that probably makes my explanation a little confusing is that
> there are two types of transactions: FireWire transactions which consists of a
> request followed by a response and are pretty much the smallest interaction
> you can have with a remote device. Then there are SBP-2 transactions, which
> are a higer level sequence layered on top of FireWire transactions.
[...]

Exactly. One SBP-2 transaction is one SCSI task's request and completion
and contains
1. one 1394 write transaction, requested by initiator node,
2. one 1394 read transaction, requested by the target node,
3. zero to many 1394 read or write transactions, requested by the
target node,
4. one 1394 write transaction, requested by the target node
(if everything goes well and if we don't have dynamically appended ORB
lists). So it might be less confusing if we only speak of "1394
transactions" and "SCSI tasks".

BTW, SBP-3 allows to wrap step 2 into step 1. I read that some SBP-2
targets support this SBP-3 feature.

[...]
>> The target wrote an SBP-2 status block into our memory. The status block
>> contains the FireWire bus address of the ORB to which it belongs. Juju's
>> fw-sbp2 does the same as mainline's sbp2: Looking through the pile of
>> unfinished ORBs for one with the same FireWire bus address, which was
>> previously derived from the DMA mapped address.
>
> But the status write actually does carry the address of the ORB it signals the
> completion of. So in theory, we could just read out the ORB address from the
> status write packet and map that back to kernel virtual memory

"Map back to virtual memory" is exactly what we do already, although in
a rather dumb fashion. It would be easier if there was a portable
bus_to_virt() which would do the job for us. This is much more of an
issue if we want to work without OHCI-1394 physical DMA: Then we have to
do this inverse DMA mapping also for arbitrary pieces of all scatter
gather elements of the SCSI data buffer.

[...]
> One thing I want to do (though very low priority) is to allocate the ORBs out
> of a preallocated circular buffer.

Incidentally, one thing I want to do (though at low priority) is to use
a circular ORB buffer... Well, linux1394 has a number of more serious
issues pending at bugzilla.kernel.org, and meanwhile people are eager to
run things like FreeBob or Kino on Juju --- thus it remains to be seen
who of us gets to it first. :-)
--
Stefan Richter
-=====-=-=== ---= ==-=-
http://arcgraph.de/sr/

2007-01-26 19:57:49

by Kristian Høgsberg

[permalink] [raw]
Subject: Re: Juju

Greg KH wrote:
> On Thu, Jan 25, 2007 at 03:38:24PM -0800, Pete Zaitcev wrote:
>> On Thu, 25 Jan 2007 16:18:35 -0500, Kristian H??gsberg <[email protected]> wrote:
>>
>>>> I see that ORBs are always allocated with a call (like SKB) and not
>>>> embedded into drivers (like URBs). It's great, keep it up. Also,
>>>> never allow drivers to pass DMA-mapped buffers into fw_send_request
>>>> and friends. We made both of these mistakes in USB, and it hurts.
>>> Oh, the ORBs are SBP-2 specific data structures, struct fw_transaction is
>>> probably what corresponds to USB URBs. This struct is defined in
>>> fw-transaction.h and is available for embedding into other structs, such as
>>> struct sbp2_orb in fw-sbp2. Is that what you're suggesting against, and what
>>> are the problems with this approach?
>> Fortunately we do not care about out-of-tree drivers, which are most
>> affected, you may even call it a feature ^_^. My main problem is,
>> we can't refcount URBs, so usbmon can't tap them and must copy.
>
> urbs are reference counted, it's just that not all drivers who create
> them use them that way :(
>
> Perhaps you can inforce this in the new codebase...

It's a small change to make the fw_transaction struct opaque and ref-counted,
and it's definitely still doable. But the nice thing about embedding the
struct is that you have one memory allocation failure path less to worry
about. And I haven't yet, and don't expect to see a use case that will need
ref-counted struct fw_transaction, the ownership is always clearly defined.
But I can go either way on this and if there is a good reason to ref count
them it's a pretty small change.

cheers,
Kristian

2007-01-29 00:16:33

by Pete Zaitcev

[permalink] [raw]
Subject: Re: Juju

On Thu, 25 Jan 2007 16:18:35 -0500, Kristian Høgsberg <[email protected]> wrote:

> Indeed, I've just moved to an in-tree development model now. I still think
> the out-off-tree model is a good way to prototype, get started and reach
> "critical mass" with your driver. But as I'm starting to integrate and sync
> with Stefans tree now, an in-tree model is a lot easier to work with. The
> tree is here
>
> git://people.freedesktop.org/~krh/linux-2.6

This seems to have disappeared. Was it moved or dropped?

-- Pete

2007-01-29 20:00:26

by Kristian Høgsberg

[permalink] [raw]
Subject: Re: Juju

Pete Zaitcev wrote:
> On Thu, 25 Jan 2007 16:18:35 -0500, Kristian Høgsberg <[email protected]> wrote:
>
>> Indeed, I've just moved to an in-tree development model now. I still think
>> the out-off-tree model is a good way to prototype, get started and reach
>> "critical mass" with your driver. But as I'm starting to integrate and sync
>> with Stefans tree now, an in-tree model is a lot easier to work with. The
>> tree is here
>>
>> git://people.freedesktop.org/~krh/linux-2.6
>
> This seems to have disappeared. Was it moved or dropped?

No, it's still there, and I just did a git clone on it. How does it fail for
you? In any case, you're probably better of pulling from Stefans kernel.org tree:

git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6.git

since that's where new work is going to show up. I'm still trying to figure
out a good workflow for this, but if all patches are going to through
linux1394-devel, there's not much point in me publishing a branch in the
freedesktop.org repo, if the patches are going to be edited/tweaked and the
committed to Stefans repo.

Kristian

2007-01-29 20:25:48

by Pete Zaitcev

[permalink] [raw]
Subject: Re: Juju

On Mon, 29 Jan 2007 14:53:28 -0500, Kristian Høgsberg <[email protected]> wrote:

> >> git://people.freedesktop.org/~krh/linux-2.6
> >
> > This seems to have disappeared. Was it moved or dropped?
>
> No, it's still there, and I just did a git clone on it. How does it fail for
> you?

Oh never mind, sorry. It said "Already up-to-date", but since HTTP
does not show it, I assumed it's not there.

> In any case, you're probably better of pulling from Stefans kernel.org tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6.git
>
> since that's where new work is going to show up. I'm still trying to figure
> out a good workflow for this, but if all patches are going to through
> linux1394-devel, there's not much point in me publishing a branch in the
> freedesktop.org repo, if the patches are going to be edited/tweaked and the
> committed to Stefans repo.

I see, thanks.

-- Pete