This patch implements GSI transactions. A GSI transaction is a
structure that represents a single request (consisting of one or
more TREs) sent to the GSI hardware. The last TRE in a transaction
includes a flag requesting that the GSI interrupt the AP to notify
that it has completed.
TREs are executed and completed strictly in order. For this reason,
the completion of a single TRE implies that all previous TREs (in
particular all of those "earlier" in a transaction) have completed.
Whenever there is a need to send a request (a set of TREs) to the
IPA, a GSI transaction is allocated, specifying the number of TREs
that will be required. Details of the request (e.g. transfer offsets
and lengths) are represented in a Linux scatterlist array that is
incorporated in the transaction structure.
Once "filled," the transaction is committed. The GSI transaction
layer performs all needed mapping (and unmapping) for DMA, and
issues the request to the hardware. When the hardware signals
that the request has completed, a function in the IPA layer is
called, allowing for cleanup or followup activity to be performed
before the transaction is freed.
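As a rough illustration (not driver code; all names here are made up
for the example), the reserve-at-allocation idea described above
reduces to a few lines of userspace C:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: TREs are reserved when a transaction is
 * allocated, so ring exhaustion is detected as early as possible,
 * rather than at commit time.
 */
struct trans_pool {
	unsigned int tre_avail;		/* TREs not yet reserved */
};

/* Reserve tre_count TREs; fails up front if too few remain */
static bool trans_alloc(struct trans_pool *pool, unsigned int tre_count)
{
	if (pool->tre_avail < tre_count)
		return false;		/* exhaustion caught at alloc time */
	pool->tre_avail -= tre_count;

	return true;
}

/* Completing (or abandoning) a transaction returns its TREs */
static void trans_free(struct trans_pool *pool, unsigned int tre_count)
{
	pool->tre_avail += tre_count;
}
```

The real code does this accounting under trans_info->spinlock; the
sketch omits locking.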
Signed-off-by: Alex Elder <[email protected]>
---
drivers/net/ipa/gsi_trans.c | 604 ++++++++++++++++++++++++++++++++++++
drivers/net/ipa/gsi_trans.h | 106 +++++++
2 files changed, 710 insertions(+)
create mode 100644 drivers/net/ipa/gsi_trans.c
create mode 100644 drivers/net/ipa/gsi_trans.h
diff --git a/drivers/net/ipa/gsi_trans.c b/drivers/net/ipa/gsi_trans.c
new file mode 100644
index 000000000000..e35e8369d5d0
--- /dev/null
+++ b/drivers/net/ipa/gsi_trans.c
@@ -0,0 +1,604 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Copyright (c) 2012-2018, The Linux Foundation. All rights reserved.
+ * Copyright (C) 2019 Linaro Ltd.
+ */
+
+#include <linux/types.h>
+#include <linux/bits.h>
+#include <linux/bitfield.h>
+#include <linux/refcount.h>
+#include <linux/scatterlist.h>
+
+#include "gsi.h"
+#include "gsi_private.h"
+#include "gsi_trans.h"
+#include "ipa_gsi.h"
+#include "ipa_data.h"
+#include "ipa_cmd.h"
+
+/**
+ * DOC: GSI Transactions
+ *
+ * A GSI transaction abstracts the behavior of a GSI channel by representing
+ * everything about a related group of data transfers in a single structure.
+ * Most details of interaction with the GSI hardware are managed by the GSI
+ * transaction core, allowing users to simply describe transfers to be
+ * performed and optionally supply a callback function to run once the set
+ * of transfers has been completed.
+ *
+ * To perform a data transfer (or a related set of them), a user of the GSI
+ * transaction interface allocates a transaction, indicating the number of
+ * TREs required (one per data transfer). If sufficient TREs are available,
+ * they are reserved for use in the transaction and the allocation succeeds.
+ * This way exhaustion of the available TREs in a channel ring is detected
+ * as early as possible. All resources required to complete a transaction
+ * are allocated at transaction allocation time.
+ *
+ * Transfers performed as part of a transaction are represented in an array
+ * of Linux scatterlist structures. This array is allocated with the
+ * transaction, and its entries must be initialized using standard
+ * scatterlist functions (such as sg_init_one() or skb_to_sgvec()). The
+ * user must supply the total number of bytes represented by all transfers
+ * in the transaction.
+ *
+ * Once a transaction has been prepared, it is committed. The GSI transaction
+ * layer is responsible for DMA mapping (and unmapping) memory described in
+ * the transaction's scatterlist array. The only way committing a transaction
+ * fails is if this DMA mapping step returns an error. Otherwise, ownership
+ * of the entire transaction is transferred to the GSI transaction core. The
+ * GSI transaction code formats the content of the scatterlist array into the
+ * channel ring buffer and informs the hardware that new TREs are available
+ * to process.
+ *
+ * The last TRE in each transaction is marked to interrupt the AP when the
+ * GSI hardware has completed it. Because transfers described by TREs are
+ * performed strictly in order, signaling the completion of just the last
+ * TRE in the transaction is sufficient to indicate the full transaction
+ * is complete.
+ *
+ * When a transaction is complete, ipa_gsi_trans_complete() is called by the
+ * GSI code into the IPA layer, allowing it to perform any final cleanup
+ * required before the transaction is freed.
+ *
+ * It is possible to await the completion of a transaction; only immediate
+ * commands currently use this functionality.
+ */
+
+/* gsi_tre->flags mask values (in CPU byte order) */
+#define GSI_TRE_FLAGS_CHAIN_FMASK GENMASK(0, 0)
+#define GSI_TRE_FLAGS_IEOB_FMASK GENMASK(8, 8)
+#define GSI_TRE_FLAGS_IEOT_FMASK GENMASK(9, 9)
+#define GSI_TRE_FLAGS_BEI_FMASK GENMASK(10, 10)
+#define GSI_TRE_FLAGS_TYPE_FMASK GENMASK(23, 16)
+
+/* Hardware values representing a transfer element type */
+enum gsi_tre_type {
+ GSI_RE_XFER = 0x2,
+ GSI_RE_IMMD_CMD = 0x3,
+ GSI_RE_NOP = 0x4,
+};
+
+/* Map a given ring entry index to the transaction associated with it */
+static void gsi_channel_trans_map(struct gsi_channel *channel, u32 index,
+ struct gsi_trans *trans)
+{
+ channel->trans_info.map[index] = trans;
+}
+
+/* Return the transaction mapped to a given ring entry */
+struct gsi_trans *
+gsi_channel_trans_mapped(struct gsi_channel *channel, u32 index)
+{
+ return channel->trans_info.map[index];
+}
+
+/* Return the oldest completed transaction for a channel (or null) */
+struct gsi_trans *gsi_channel_trans_complete(struct gsi_channel *channel)
+{
+ return list_first_entry_or_null(&channel->trans_info.complete,
+ struct gsi_trans, links);
+}
+
+/* Move a transaction from the allocated list to the pending list */
+static void gsi_trans_move_pending(struct gsi_trans *trans)
+{
+ struct gsi_channel *channel = &trans->gsi->channel[trans->channel_id];
+ struct gsi_trans_info *trans_info = &channel->trans_info;
+ unsigned long flags;
+
+ spin_lock_irqsave(&trans_info->spinlock, flags);
+
+ list_move_tail(&trans->links, &trans_info->pending);
+
+ spin_unlock_irqrestore(&trans_info->spinlock, flags);
+}
+
+/* Move a transaction and all of its predecessors from the pending list
+ * to the completed list.
+ */
+void gsi_trans_move_complete(struct gsi_trans *trans)
+{
+ struct gsi_channel *channel = &trans->gsi->channel[trans->channel_id];
+ struct gsi_trans_info *trans_info;
+ struct list_head list;
+ unsigned long flags;
+
+ trans_info = &channel->trans_info;
+
+ spin_lock_irqsave(&trans_info->spinlock, flags);
+
+ /* Move this transaction and all predecessors to completed list */
+ list_cut_position(&list, &trans_info->pending, &trans->links);
+ list_splice_tail(&list, &trans_info->complete);
+
+ spin_unlock_irqrestore(&trans_info->spinlock, flags);
+}
+
+/* Move a transaction from the completed list to the polled list */
+void gsi_trans_move_polled(struct gsi_trans *trans)
+{
+ struct gsi_channel *channel = &trans->gsi->channel[trans->channel_id];
+ struct gsi_trans_info *trans_info = &channel->trans_info;
+ unsigned long flags;
+
+ spin_lock_irqsave(&trans_info->spinlock, flags);
+
+ list_move_tail(&trans->links, &trans_info->polled);
+
+ spin_unlock_irqrestore(&trans_info->spinlock, flags);
+}
+
+/* Allocate a GSI transaction on a channel */
+struct gsi_trans *
+gsi_channel_trans_alloc(struct gsi *gsi, u32 channel_id, u32 tre_count)
+{
+ struct gsi_channel *channel = &gsi->channel[channel_id];
+ struct gsi_trans_info *trans_info = &channel->trans_info;
+ struct gsi_trans *trans = NULL;
+ unsigned long flags;
+
+ /* Caller should know the limit is gsi_channel_trans_max() */
+ if (WARN_ON(tre_count > channel->data->tlv_count))
+ return NULL;
+
+ spin_lock_irqsave(&trans_info->spinlock, flags);
+
+ if (trans_info->tre_avail >= tre_count) {
+ u32 avail;
+
+ /* Allocate the transaction */
+ if (trans_info->pool_free == trans_info->pool_count)
+ trans_info->pool_free = 0;
+ trans = &trans_info->pool[trans_info->pool_free++];
+
+ /* Allocate the scatter/gather entries it will use. If
+ * what's needed would cross the end-of-pool boundary,
+ * allocate them from the beginning.
+ */
+ avail = trans_info->sg_pool_count - trans_info->sg_pool_free;
+ if (tre_count > avail)
+ trans_info->sg_pool_free = 0;
+ trans->sgl = &trans_info->sg_pool[trans_info->sg_pool_free];
+ trans->sgc = tre_count;
+ trans_info->sg_pool_free += tre_count;
+
+ /* We reserve the TREs now, but consume them at commit time */
+ trans_info->tre_avail -= tre_count;
+
+ list_add_tail(&trans->links, &trans_info->alloc);
+ }
+
+ spin_unlock_irqrestore(&trans_info->spinlock, flags);
+
+ if (!trans)
+ return NULL;
+
+ trans->gsi = gsi;
+ trans->channel_id = channel_id;
+ refcount_set(&trans->refcount, 1);
+ trans->tre_count = tre_count;
+ init_completion(&trans->completion);
+
+ /* We're reusing, so make sure all fields are reinitialized */
+ trans->dev = gsi->dev;
+ trans->result = 0; /* Success assumed unless overwritten */
+ trans->data = NULL;
+
+ return trans;
+}
+
+/* Free a previously-allocated transaction (used only in case of error) */
+void gsi_trans_free(struct gsi_trans *trans)
+{
+ struct gsi_trans_info *trans_info;
+ struct gsi_channel *channel;
+ unsigned long flags;
+
+ if (!refcount_dec_and_test(&trans->refcount))
+ return;
+
+ channel = &trans->gsi->channel[trans->channel_id];
+ trans_info = &channel->trans_info;
+
+ spin_lock_irqsave(&trans_info->spinlock, flags);
+
+ list_del(&trans->links);
+ trans_info->tre_avail += trans->tre_count;
+
+ spin_unlock_irqrestore(&trans_info->spinlock, flags);
+}
+
+/* Compute the length/opcode value to use for a TRE */
+static __le16 gsi_tre_len_opcode(enum ipa_cmd_opcode opcode, u32 len)
+{
+ return opcode == IPA_CMD_NONE ? cpu_to_le16((u16)len)
+ : cpu_to_le16((u16)opcode);
+}
+
+/* Compute the flags value to use for a given TRE */
+static __le32 gsi_tre_flags(bool last_tre, bool bei, enum ipa_cmd_opcode opcode)
+{
+ enum gsi_tre_type tre_type;
+ u32 tre_flags;
+
+ tre_type = opcode == IPA_CMD_NONE ? GSI_RE_XFER : GSI_RE_IMMD_CMD;
+ tre_flags = u32_encode_bits(tre_type, GSI_TRE_FLAGS_TYPE_FMASK);
+
+ /* Last TRE contains interrupt flags */
+ if (last_tre) {
+ /* All transactions end in a transfer completion interrupt */
+ tre_flags |= GSI_TRE_FLAGS_IEOT_FMASK;
+ /* Don't interrupt when outbound commands are acknowledged */
+ if (bei)
+ tre_flags |= GSI_TRE_FLAGS_BEI_FMASK;
+ } else { /* All others indicate there's more to come */
+ tre_flags |= GSI_TRE_FLAGS_CHAIN_FMASK;
+ }
+
+ return cpu_to_le32(tre_flags);
+}
+
+static void gsi_trans_tre_fill(struct gsi_tre *dest_tre, dma_addr_t addr,
+ u32 len, bool last_tre, bool bei,
+ enum ipa_cmd_opcode opcode)
+{
+ struct gsi_tre tre;
+
+ tre.addr = cpu_to_le64(addr);
+ tre.len_opcode = gsi_tre_len_opcode(opcode, len);
+ tre.reserved = 0;
+ tre.flags = gsi_tre_flags(last_tre, bei, opcode);
+
+ *dest_tre = tre; /* Write TRE as a single (16-byte) unit */
+}
+
+int gsi_trans_read_byte(struct gsi *gsi, u32 channel_id, dma_addr_t addr)
+{
+ struct gsi_channel *channel = &gsi->channel[channel_id];
+ struct gsi_trans_info *trans_info;
+ struct gsi_evt_ring *evt_ring;
+ struct gsi_tre *dest_tre;
+ bool exhausted = false;
+ unsigned long flags;
+ u32 wp_local;
+
+ /* assert(!channel->toward_ipa); */
+
+ trans_info = &channel->trans_info;
+ spin_lock_irqsave(&trans_info->spinlock, flags);
+
+ if (trans_info->tre_avail)
+ trans_info->tre_avail--;
+ else
+ exhausted = true;
+
+ spin_unlock_irqrestore(&trans_info->spinlock, flags);
+
+ if (exhausted)
+ return -EBUSY;
+
+ evt_ring = &gsi->evt_ring[channel->evt_ring_id];
+ spin_lock_irqsave(&evt_ring->ring.spinlock, flags);
+
+ wp_local = channel->tre_ring.wp_local;
+ if (wp_local == channel->tre_ring.end)
+ wp_local = channel->tre_ring.base;
+ dest_tre = gsi_ring_virt(&channel->tre_ring, wp_local);
+
+ gsi_trans_tre_fill(dest_tre, addr, 1, true, false, IPA_CMD_NONE);
+
+ gsi_ring_wp_local_add(&channel->tre_ring, 1);
+
+ gsi_channel_doorbell(channel);
+
+ spin_unlock_irqrestore(&evt_ring->ring.spinlock, flags);
+
+ return 0;
+}
+
+/**
+ * __gsi_trans_commit() - Common GSI transaction commit code
+ * @trans: Transaction to commit
+ * @opcode: Immediate command opcode, or IPA_CMD_NONE
+ * @ring_db: Whether to tell the hardware about these queued transfers
+ *
+ * Return: 0 if successful, or a negative error code
+ *
+ * Maps the transaction's scatterlist array for DMA, and returns -ENOMEM
+ * if that fails. Formats channel ring TRE entries based on the content of
+ * the scatterlist. Records the transaction pointer in the map entry for
+ * the last ring entry used for the transaction so it can be recovered when
+ * it completes, and moves the transaction to the pending list. Updates the
+ * channel ring pointer and optionally rings the doorbell.
+ */
+static int __gsi_trans_commit(struct gsi_trans *trans,
+ enum ipa_cmd_opcode opcode, bool ring_db)
+{
+ struct gsi_channel *channel = &trans->gsi->channel[trans->channel_id];
+ enum dma_data_direction direction;
+ bool bei = channel->toward_ipa;
+ struct gsi_evt_ring *evt_ring;
+ struct scatterlist *sg;
+ struct gsi_tre *dest_tre;
+ u32 queued_trans_count;
+ u32 queued_byte_count;
+ unsigned long flags;
+ u32 byte_count = 0;
+ u32 wp_local;
+ u32 index;
+ u32 avail;
+ int ret;
+ u32 i;
+
+ direction = channel->toward_ipa ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+ ret = dma_map_sg(trans->dev, trans->sgl, trans->sgc, direction);
+ if (!ret)
+ return -ENOMEM;
+
+ evt_ring = &channel->gsi->evt_ring[channel->evt_ring_id];
+
+ spin_lock_irqsave(&evt_ring->ring.spinlock, flags);
+
+ /* We'll consume the entries available at the end of the ring,
+ * switching to the beginning to finish if necessary.
+ */
+ wp_local = channel->tre_ring.wp_local;
+ dest_tre = gsi_ring_virt(&channel->tre_ring, wp_local);
+
+ avail = (channel->tre_ring.end - wp_local) / sizeof(*dest_tre);
+
+ for_each_sg(trans->sgl, sg, trans->sgc, i) {
+ bool last_tre = i == trans->tre_count - 1;
+ dma_addr_t addr = sg_dma_address(sg);
+ u32 len = sg_dma_len(sg);
+
+ byte_count += len;
+ if (!avail--)
+ dest_tre = gsi_ring_virt(&channel->tre_ring,
+ channel->tre_ring.base);
+
+ gsi_trans_tre_fill(dest_tre, addr, len, last_tre, bei, opcode);
+ dest_tre++;
+ }
+
+ if (channel->toward_ipa) {
+ /* We record TX bytes when they are sent */
+ trans->len = byte_count;
+ trans->trans_count = channel->trans_count;
+ trans->byte_count = channel->byte_count;
+ channel->trans_count++;
+ channel->byte_count += byte_count;
+ }
+
+ /* Advance the write pointer; record info for last used element */
+ gsi_ring_wp_local_add(&channel->tre_ring, trans->tre_count - 1);
+ index = ring_index(&channel->tre_ring, channel->tre_ring.wp_local);
+ gsi_channel_trans_map(channel, index, trans);
+ gsi_ring_wp_local_add(&channel->tre_ring, 1);
+
+ gsi_trans_move_pending(trans);
+
+ /* Ring doorbell if requested, or if all TREs are allocated */
+ if (ring_db || !channel->trans_info.tre_avail) {
+ /* Report what we're handing off to hardware for TX channels */
+ if (channel->toward_ipa) {
+ queued_trans_count = channel->trans_count -
+ channel->doorbell_trans_count;
+ queued_byte_count = channel->byte_count -
+ channel->doorbell_byte_count;
+ channel->doorbell_trans_count = channel->trans_count;
+ channel->doorbell_byte_count = channel->byte_count;
+
+ ipa_gsi_channel_tx_queued(trans->gsi,
+ gsi_channel_id(channel),
+ queued_trans_count,
+ queued_byte_count);
+ }
+
+ gsi_channel_doorbell(channel);
+ }
+
+ spin_unlock_irqrestore(&evt_ring->ring.spinlock, flags);
+
+ return 0;
+}
+
+/* Commit a GSI transaction */
+int gsi_trans_commit(struct gsi_trans *trans, bool ring_db)
+{
+ return __gsi_trans_commit(trans, IPA_CMD_NONE, ring_db);
+}
+
+/* Commit a GSI command transaction and wait for it to complete */
+int gsi_trans_commit_command(struct gsi_trans *trans,
+ enum ipa_cmd_opcode opcode)
+{
+ int ret;
+
+ refcount_inc(&trans->refcount);
+
+ ret = __gsi_trans_commit(trans, opcode, true);
+ if (ret)
+ goto out_free_trans;
+
+ wait_for_completion(&trans->completion);
+
+out_free_trans:
+ gsi_trans_free(trans);
+
+ return ret;
+}
+
+/* Commit a GSI command transaction, wait for it to complete, with timeout */
+int gsi_trans_commit_command_timeout(struct gsi_trans *trans,
+ enum ipa_cmd_opcode opcode,
+ unsigned long timeout)
+{
+ unsigned long timeout_jiffies = msecs_to_jiffies(timeout);
+ unsigned long remaining;
+ int ret;
+
+ refcount_inc(&trans->refcount);
+
+ ret = __gsi_trans_commit(trans, opcode, true);
+ if (ret)
+ goto out_free_trans;
+
+ remaining = wait_for_completion_timeout(&trans->completion,
+ timeout_jiffies);
+out_free_trans:
+ gsi_trans_free(trans);
+
+ return ret ? ret : remaining ? 0 : -ETIMEDOUT;
+}
+
+/* Process the completion of a transaction (unmap, notify IPA, free) */
+void gsi_trans_complete(struct gsi_trans *trans)
+{
+ struct gsi_channel *channel = &trans->gsi->channel[trans->channel_id];
+ enum dma_data_direction direction;
+
+ direction = channel->toward_ipa ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+
+ dma_unmap_sg(trans->dev, trans->sgl, trans->sgc, direction);
+
+ ipa_gsi_trans_complete(trans);
+
+ complete(&trans->completion);
+
+ gsi_trans_free(trans);
+}
+
+/* Cancel a channel's pending transactions */
+void gsi_channel_trans_cancel_pending(struct gsi_channel *channel)
+{
+ struct gsi_trans_info *trans_info = &channel->trans_info;
+ u32 evt_ring_id = channel->evt_ring_id;
+ struct gsi *gsi = channel->gsi;
+ struct gsi_evt_ring *evt_ring;
+ struct gsi_trans *trans;
+ unsigned long flags;
+
+ evt_ring = &gsi->evt_ring[evt_ring_id];
+
+ spin_lock_irqsave(&evt_ring->ring.spinlock, flags);
+
+ list_for_each_entry(trans, &trans_info->pending, links)
+ trans->result = -ECANCELED;
+
+ list_splice_tail_init(&trans_info->pending, &trans_info->complete);
+
+ spin_unlock_irqrestore(&evt_ring->ring.spinlock, flags);
+
+ spin_lock_irqsave(&gsi->spinlock, flags);
+
+ if (gsi->event_enable_bitmap & BIT(evt_ring_id))
+ gsi_event_handle(gsi, evt_ring_id);
+
+ spin_unlock_irqrestore(&gsi->spinlock, flags);
+}
+
+/* Initialize a channel's GSI transaction info */
+int gsi_channel_trans_init(struct gsi_channel *channel)
+{
+ struct gsi_trans_info *trans_info = &channel->trans_info;
+ u32 tre_count = channel->data->tre_count;
+
+ trans_info->map = kcalloc(tre_count, sizeof(*trans_info->map),
+ GFP_KERNEL);
+ if (!trans_info->map)
+ return -ENOMEM;
+
+ /* We will never need more transactions than there are TRE
+ * entries in the transfer ring. For that reason, we can
+ * preallocate an array of (at least) that many transactions,
+ * and use a single free index to determine the next one
+ * available for allocation.
+ */
+ trans_info->pool_count = tre_count;
+ trans_info->pool = kcalloc(trans_info->pool_count,
+ sizeof(*trans_info->pool), GFP_KERNEL);
+ if (!trans_info->pool)
+ goto err_free_map;
+ /* If we get extra memory from the allocator, use it */
+ trans_info->pool_count =
+ ksize(trans_info->pool) / sizeof(*trans_info->pool);
+ trans_info->pool_free = 0;
+
+ /* While transactions are allocated one at a time, a transaction
+ * can have multiple TREs. The number of TRE entries in a single
+ * transaction is limited by the number of TLV FIFO entries the
+ * channel has. We reserve TREs when a transaction is allocated,
+ * but we don't actually use them until the transaction is
+ * committed.
+ *
+ * A transaction uses a scatterlist array to represent the data
+ * transfers implemented by the transaction. Each scatterlist
+ * element is used to fill a single TRE when the transaction is
+ * committed. As a result, we need the same number of scatterlist
+ * elements as there are TREs in the transfer ring, and we can
+ * preallocate them in a pool.
+ *
+ * If we allocate a few (tlv_count - 1) extra entries in our pool
+ * we can always satisfy requests without ever worrying about
+ * straddling the end of the array. If there aren't enough
+ * entries starting at the free index, we just allocate free
+ * entries from the beginning of the pool.
+ */
+ trans_info->sg_pool_count = tre_count + channel->data->tlv_count - 1;
+ trans_info->sg_pool = kcalloc(trans_info->sg_pool_count,
+ sizeof(*trans_info->sg_pool), GFP_KERNEL);
+ if (!trans_info->sg_pool)
+ goto err_free_pool;
+ /* Use any extra memory we get from the allocator */
+ trans_info->sg_pool_count =
+ ksize(trans_info->sg_pool) / sizeof(*trans_info->sg_pool);
+ trans_info->sg_pool_free = 0;
+
+ spin_lock_init(&trans_info->spinlock);
+ trans_info->tre_avail = tre_count; /* maximum active */
+ INIT_LIST_HEAD(&trans_info->alloc);
+ INIT_LIST_HEAD(&trans_info->pending);
+ INIT_LIST_HEAD(&trans_info->complete);
+ INIT_LIST_HEAD(&trans_info->polled);
+
+ return 0;
+
+err_free_pool:
+ kfree(trans_info->pool);
+err_free_map:
+ kfree(trans_info->map);
+
+ return -ENOMEM;
+}
+
+/* Inverse of gsi_channel_trans_init() */
+void gsi_channel_trans_exit(struct gsi_channel *channel)
+{
+ struct gsi_trans_info *trans_info = &channel->trans_info;
+
+ kfree(trans_info->sg_pool);
+ kfree(trans_info->pool);
+ kfree(trans_info->map);
+}
diff --git a/drivers/net/ipa/gsi_trans.h b/drivers/net/ipa/gsi_trans.h
new file mode 100644
index 000000000000..160902b077a7
--- /dev/null
+++ b/drivers/net/ipa/gsi_trans.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* Copyright (c) 2012-2018, The Linux Foundation. All rights reserved.
+ * Copyright (C) 2019 Linaro Ltd.
+ */
+#ifndef _GSI_TRANS_H_
+#define _GSI_TRANS_H_
+
+#include <linux/types.h>
+#include <linux/refcount.h>
+#include <linux/completion.h>
+
+struct scatterlist;
+struct device;
+
+struct gsi;
+struct gsi_trans;
+enum ipa_cmd_opcode;
+
+struct gsi_trans {
+ struct list_head links; /* gsi_channel lists */
+
+ struct gsi *gsi;
+ u32 channel_id;
+
+ u32 tre_count; /* # TREs requested */
+ u32 len; /* total # bytes in sgl */
+ struct scatterlist *sgl;
+ u32 sgc; /* # entries in sgl[] */
+
+ struct completion completion;
+ refcount_t refcount;
+
+ /* fields above are internal only */
+
+ struct device *dev; /* Use this for DMA mapping */
+ long result; /* RX count, 0, or error code */
+
+ u64 byte_count; /* channel byte_count when committed */
+ u64 trans_count; /* channel trans_count when committed */
+
+ void *data;
+};
+
+/**
+ * gsi_channel_trans_alloc() - Allocate a GSI transaction on a channel
+ * @gsi: GSI pointer
+ * @channel_id: Channel the transaction is associated with
+ * @tre_count: Number of elements in the transaction
+ *
+ * Return: A GSI transaction structure, or a null pointer if all
+ * available transactions are in use
+ */
+struct gsi_trans *gsi_channel_trans_alloc(struct gsi *gsi, u32 channel_id,
+ u32 tre_count);
+
+/**
+ * gsi_trans_free() - Free a previously-allocated GSI transaction
+ * @trans: Transaction to be freed
+ *
+ * Note: this should only be used in error paths, before the transaction is
+ * committed or in the event committing the transaction produces an error.
+ * Successfully committing a transaction passes ownership of the structure
+ * to the core transaction code.
+ */
+void gsi_trans_free(struct gsi_trans *trans);
+
+/**
+ * gsi_trans_commit() - Commit a GSI transaction
+ * @trans: Transaction to commit
+ * @ring_db: Whether to tell the hardware about these queued transfers
+ * Return: 0 if successful, or a negative error code
+ */
+int gsi_trans_commit(struct gsi_trans *trans, bool ring_db);
+
+/**
+ * gsi_trans_commit_command() - Commit a GSI command transaction and wait
+ * for it to complete
+ * @trans: Transaction to commit
+ */
+int gsi_trans_commit_command(struct gsi_trans *trans,
+ enum ipa_cmd_opcode opcode);
+
+/**
+ * gsi_trans_commit_command_timeout() - Commit a GSI command transaction,
+ * wait for it to complete, with timeout
+ * @trans: Transaction to commit
+ * @opcode: Immediate command opcode
+ * @timeout: Timeout period (in milliseconds)
+ */
+int gsi_trans_commit_command_timeout(struct gsi_trans *trans,
+ enum ipa_cmd_opcode opcode,
+ unsigned long timeout);
+
+/**
+ * gsi_trans_read_byte() - Issue a single byte read TRE on a channel
+ * @gsi: GSI pointer
+ * @channel_id: Channel on which to read a byte
+ * @addr: DMA address into which to transfer the one byte
+ *
+ * This is not a transaction operation at all. It's defined here because
+ * it needs to be done in coordination with other transaction activity.
+ */
+int gsi_trans_read_byte(struct gsi *gsi, u32 channel_id, dma_addr_t addr);
+
+#endif /* _GSI_TRANS_H_ */
--
2.20.1
> +static void gsi_trans_tre_fill(struct gsi_tre *dest_tre, dma_addr_t addr,
> + u32 len, bool last_tre, bool bei,
> + enum ipa_cmd_opcode opcode)
> +{
> + struct gsi_tre tre;
> +
> + tre.addr = cpu_to_le64(addr);
> + tre.len_opcode = gsi_tre_len_opcode(opcode, len);
> + tre.reserved = 0;
> + tre.flags = gsi_tre_flags(last_tre, bei, opcode);
> +
> + *dest_tre = tre; /* Write TRE as a single (16-byte) unit */
> +}
Have you checked that the atomic write is actually what happens here,
by looking at the compiler output? You might need to add a 'volatile'
qualifier to the dest_tre argument so the temporary structure doesn't
get optimized away here.
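(For readers following along, the construct in question, reduced to a
standalone userspace sketch with plain integer types standing in for
the kernel's little-endian ones, is the whole-struct copy at the end;
whether it becomes one 16-byte store is up to the compiler.)

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the 16-byte TRE layout in the patch */
struct tre {
	uint64_t addr;
	uint16_t len_opcode;
	uint16_t reserved;
	uint32_t flags;
};

/* Build the TRE in a local temporary, then copy it out with a single
 * struct assignment.  A compiler *may* emit this as one 16-byte store
 * (e.g. an arm64 stp), but nothing in C guarantees that; the volatile
 * qualifier on dest is meant to keep the compiler from replacing the
 * copy with four separate member stores.
 */
static void tre_fill(volatile struct tre *dest, uint64_t addr,
		     uint16_t len, uint32_t flags)
{
	struct tre tre;

	tre.addr = addr;
	tre.len_opcode = len;
	tre.reserved = 0;
	tre.flags = flags;

	*dest = tre;	/* ideally a single 16-byte unit */
}
```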
> +/* Cancel a channel's pending transactions */
> +void gsi_channel_trans_cancel_pending(struct gsi_channel *channel)
> +{
> + struct gsi_trans_info *trans_info = &channel->trans_info;
> + u32 evt_ring_id = channel->evt_ring_id;
> + struct gsi *gsi = channel->gsi;
> + struct gsi_evt_ring *evt_ring;
> + struct gsi_trans *trans;
> + unsigned long flags;
> +
> + evt_ring = &gsi->evt_ring[evt_ring_id];
> +
> + spin_lock_irqsave(&evt_ring->ring.spinlock, flags);
> +
> + list_for_each_entry(trans, &trans_info->pending, links)
> + trans->result = -ECANCELED;
> +
> + list_splice_tail_init(&trans_info->pending, &trans_info->complete);
> +
> + spin_unlock_irqrestore(&evt_ring->ring.spinlock, flags);
> +
> + spin_lock_irqsave(&gsi->spinlock, flags);
> +
> + if (gsi->event_enable_bitmap & BIT(evt_ring_id))
> + gsi_event_handle(gsi, evt_ring_id);
> +
> + spin_unlock_irqrestore(&gsi->spinlock, flags);
> +}
That is a lot of irqsave()/irqrestore() operations. Do you actually call
all of these functions from hardirq context?
Arnd
On 5/15/19 2:34 AM, Arnd Bergmann wrote:
>> +static void gsi_trans_tre_fill(struct gsi_tre *dest_tre, dma_addr_t addr,
>> + u32 len, bool last_tre, bool bei,
>> + enum ipa_cmd_opcode opcode)
>> +{
>> + struct gsi_tre tre;
>> +
>> + tre.addr = cpu_to_le64(addr);
>> + tre.len_opcode = gsi_tre_len_opcode(opcode, len);
>> + tre.reserved = 0;
>> + tre.flags = gsi_tre_flags(last_tre, bei, opcode);
>> +
>> + *dest_tre = tre; /* Write TRE as a single (16-byte) unit */
>> +}
>
> Have you checked that the atomic write is actually what happens here,
> by looking at the compiler output? You might need to add a 'volatile'
> qualifier to the dest_tre argument so the temporary structure doesn't
> get optimized away here.
No, and I really should have checked, since I'm assuming that's
what will happen. I will check, and may well add the volatile
regardless.
>> +/* Cancel a channel's pending transactions */
>> +void gsi_channel_trans_cancel_pending(struct gsi_channel *channel)
>> +{
>> + struct gsi_trans_info *trans_info = &channel->trans_info;
>> + u32 evt_ring_id = channel->evt_ring_id;
>> + struct gsi *gsi = channel->gsi;
>> + struct gsi_evt_ring *evt_ring;
>> + struct gsi_trans *trans;
>> + unsigned long flags;
>> +
>> + evt_ring = &gsi->evt_ring[evt_ring_id];
>> +
>> + spin_lock_irqsave(&evt_ring->ring.spinlock, flags);
>> +
>> + list_for_each_entry(trans, &trans_info->pending, links)
>> + trans->result = -ECANCELED;
>> +
>> + list_splice_tail_init(&trans_info->pending, &trans_info->complete);
>> +
>> + spin_unlock_irqrestore(&evt_ring->ring.spinlock, flags);
>> +
>> + spin_lock_irqsave(&gsi->spinlock, flags);
>> +
>> + if (gsi->event_enable_bitmap & BIT(evt_ring_id))
>> + gsi_event_handle(gsi, evt_ring_id);
>> +
>> + spin_unlock_irqrestore(&gsi->spinlock, flags);
>> +}
>
> That is a lot of irqsave()/irqrestore() operations. Do you actually call
> all of these functions from hardirq context?
The transaction list is definitely updated in IRQ context,
but I think it is no longer updated in hardirq context (the
softirq was a recent change). This particular function is
definitely not called in a hardirq context, so I can remove
the irqsave/irqrestore.
I'll survey my spinlock use throughout the driver and will
remove any irqsave/irqrestore used in non-hardirq contexts.
Thanks.
-Alex
> Arnd
>
On Wed, May 15, 2019 at 2:26 PM Alex Elder <[email protected]> wrote:
> On 5/15/19 2:34 AM, Arnd Bergmann wrote:
> >> +/* Cancel a channel's pending transactions */
> >> +void gsi_channel_trans_cancel_pending(struct gsi_channel *channel)
> >> +{
> >> + struct gsi_trans_info *trans_info = &channel->trans_info;
> >> + u32 evt_ring_id = channel->evt_ring_id;
> >> + struct gsi *gsi = channel->gsi;
> >> + struct gsi_evt_ring *evt_ring;
> >> + struct gsi_trans *trans;
> >> + unsigned long flags;
> >> +
> >> + evt_ring = &gsi->evt_ring[evt_ring_id];
> >> +
> >> + spin_lock_irqsave(&evt_ring->ring.spinlock, flags);
> >> +
> >> + list_for_each_entry(trans, &trans_info->pending, links)
> >> + trans->result = -ECANCELED;
> >> +
> >> + list_splice_tail_init(&trans_info->pending, &trans_info->complete);
> >> +
> >> + spin_unlock_irqrestore(&evt_ring->ring.spinlock, flags);
> >> +
> >> + spin_lock_irqsave(&gsi->spinlock, flags);
> >> +
> >> + if (gsi->event_enable_bitmap & BIT(evt_ring_id))
> >> + gsi_event_handle(gsi, evt_ring_id);
> >> +
> >> + spin_unlock_irqrestore(&gsi->spinlock, flags);
> >> +}
> >
> > That is a lot of irqsave()/irqrestore() operations. Do you actually call
> > all of these functions from hardirq context?
>
> The transaction list is definitely updated in IRQ context,
> but I think it is no longer updated in hardirq context (the
> softirq was a recent change). This particular function is
> definitely not called in a hardirq context, so I can remove
> the irqsave/irqrestore.
If you want to protect against concurrent softirqs, you still
need spin_lock_bh(), which is cheaper than spin_lock_irqsave()
but still requires writing to the shared cache line for the
atomic update of the lock.
> I'll survey my spinlock use throughout the driver and will
> remove any irqsave/irqrestore used in non-hardirq contexts.
Ok. I actually hope that most of the spinlocks can be
removed from the data path entirely. I just replied on the
ring.spinlock, which I think can go away and be replaced
either with two atomic_t values (rp_local and wp_local
only; 'wp' appears to be unused), or even just an smp_rmb()/
smp_wmb() pair for each access. The gsi register spinlock
can probably be avoided as well if we stop disabling and
renabling the interrupts as I suggested.
gsi_trans_info->spinlock is harder to get rid of unfortunately,
as that would require changing the way you do the doubly linked
lists.
Arnd
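(As a rough userspace illustration of that suggestion, using C11
atomics in place of the kernel's atomic_t and smp_rmb()/smp_wmb(),
and entirely hypothetical names rather than driver code: the producer
publishes its write index with a release store, and the consumer reads
it with an acquire load, so no ring spinlock is needed for this pair.)

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8

/* Hypothetical single-producer/single-consumer ring indices */
struct ring {
	_Atomic uint32_t wp_local;	/* producer writes, consumer reads */
	uint32_t rp_local;		/* consumer-private read index */
	uint32_t slots[RING_SIZE];
};

static void ring_produce(struct ring *r, uint32_t val)
{
	uint32_t wp = atomic_load_explicit(&r->wp_local,
					   memory_order_relaxed);

	r->slots[wp % RING_SIZE] = val;
	/* Release: slot contents become visible before the new index */
	atomic_store_explicit(&r->wp_local, wp + 1, memory_order_release);
}

/* Drain all available entries; returns the number consumed */
static int ring_consume(struct ring *r, uint32_t *out)
{
	/* Acquire: pairs with the producer's release store */
	uint32_t wp = atomic_load_explicit(&r->wp_local,
					   memory_order_acquire);
	int n = 0;

	while (r->rp_local != wp)
		out[n++] = r->slots[r->rp_local++ % RING_SIZE];

	return n;
}
```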
On 5/17/19 1:33 PM, Arnd Bergmann wrote:
> On Fri, May 17, 2019 at 8:08 PM Alex Elder <[email protected]> wrote:
>>
>> On 5/15/19 2:34 AM, Arnd Bergmann wrote:
>>>> +static void gsi_trans_tre_fill(struct gsi_tre *dest_tre, dma_addr_t addr,
>>>> + u32 len, bool last_tre, bool bei,
>>>> + enum ipa_cmd_opcode opcode)
>>>> +{
>>>> + struct gsi_tre tre;
>>>> +
>>>> + tre.addr = cpu_to_le64(addr);
>>>> + tre.len_opcode = gsi_tre_len_opcode(opcode, len);
>>>> + tre.reserved = 0;
>>>> + tre.flags = gsi_tre_flags(last_tre, bei, opcode);
>>>> +
>>>> + *dest_tre = tre; /* Write TRE as a single (16-byte) unit */
>>>> +}
>>> Have you checked that the atomic write is actually what happens here,
>>> by looking at the compiler output? You might need to add a 'volatile'
>>> qualifier to the dest_tre argument so the temporary structure doesn't
>>> get optimized away here.
>>
>> Currently, the assignment *does* become a "stp" instruction.
>> But I don't know that we can *force* the compiler to write it
>> as a pair of registers, so I'll soften the comment with
>> "Attempt to write" or something similar.
>>
>> To my knowledge, adding a volatile qualifier only prevents the
>> compiler from performing funny optimizations, but that has no
>> effect on whether the 128-bit assignment is made as a single
>> unit. Do you know otherwise?
>
> I don't think you can force the 128-bit assignment to be
> atomic, but marking 'dest_tre' should serve to prevent a
> specific optimization that replaces the function with
>
> dest_tre->addr = ...
> dest_tre->len_opcode = ...
> dest_tre->reserved = ...
> dest_tre->flags = ...
>
> which it might find more efficient than the stp and is equivalent
> when the pointer is not marked volatile. We also have the WRITE_ONCE()
> macro that can help prevent this, but it does not work reliably beyond
> 64 bit assignments.
OK, I'll mark it volatile to avoid that potential result.
Thanks.
-Alex
>
> Arnd
>
On 5/15/19 2:34 AM, Arnd Bergmann wrote:
>> +static void gsi_trans_tre_fill(struct gsi_tre *dest_tre, dma_addr_t addr,
>> + u32 len, bool last_tre, bool bei,
>> + enum ipa_cmd_opcode opcode)
>> +{
>> + struct gsi_tre tre;
>> +
>> + tre.addr = cpu_to_le64(addr);
>> + tre.len_opcode = gsi_tre_len_opcode(opcode, len);
>> + tre.reserved = 0;
>> + tre.flags = gsi_tre_flags(last_tre, bei, opcode);
>> +
>> + *dest_tre = tre; /* Write TRE as a single (16-byte) unit */
>> +}
> Have you checked that the atomic write is actually what happens here,
> but looking at the compiler output? You might need to add a 'volatile'
> qualifier to the dest_tre argument so the temporary structure doesn't
> get optimized away here.
Currently, the assignment *does* become a "stp" instruction.
But I don't know that we can *force* the compiler to write it
as a pair of registers, so I'll soften the comment with
"Attempt to write" or something similar.
To my knowledge, adding a volatile qualifier only prevents the
compiler from performing funny optimizations, but that has no
effect on whether the 128-bit assignment is made as a single
unit. Do you know otherwise?
-Alex
On Fri, May 17, 2019 at 8:08 PM Alex Elder <[email protected]> wrote:
>
> On 5/15/19 2:34 AM, Arnd Bergmann wrote:
> >> +static void gsi_trans_tre_fill(struct gsi_tre *dest_tre, dma_addr_t addr,
> >> + u32 len, bool last_tre, bool bei,
> >> + enum ipa_cmd_opcode opcode)
> >> +{
> >> + struct gsi_tre tre;
> >> +
> >> + tre.addr = cpu_to_le64(addr);
> >> + tre.len_opcode = gsi_tre_len_opcode(opcode, len);
> >> + tre.reserved = 0;
> >> + tre.flags = gsi_tre_flags(last_tre, bei, opcode);
> >> +
> >> + *dest_tre = tre; /* Write TRE as a single (16-byte) unit */
> >> +}
> > Have you checked that the atomic write is actually what happens here,
> > but looking at the compiler output? You might need to add a 'volatile'
> > qualifier to the dest_tre argument so the temporary structure doesn't
> > get optimized away here.
>
> Currently, the assignment *does* become a "stp" instruction.
> But I don't know that we can *force* the compiler to write it
> as a pair of registers, so I'll soften the comment with
> "Attempt to write" or something similar.
>
> To my knowledge, adding a volatile qualifier only prevents the
> compiler from performing funny optimizations, but that has no
> effect on whether the 128-bit assignment is made as a single
> unit. Do you know otherwise?
I don't think you can force the 128-bit assignment to be
atomic, but marking 'dest_tre' should serve to prevent a
specific optimization that replaces the function with
dest_tre->addr = ...
dest_tre->len_opcode = ...
dest_tre->reserved = ...
dest_tre->flags = ...
which it might find more efficient than the stp and is equivalent
when the pointer is not marked volatile. We also have the WRITE_ONCE()
macro that can help prevent this, but it does not work reliably beyond
64 bit assignments.
Arnd
On 5/17/19 1:44 PM, Alex Elder wrote:
> On 5/17/19 1:33 PM, Arnd Bergmann wrote:
>> On Fri, May 17, 2019 at 8:08 PM Alex Elder <[email protected]>
>> wrote:
>>>
>>> On 5/15/19 2:34 AM, Arnd Bergmann wrote:
>>>>> +static void gsi_trans_tre_fill(struct gsi_tre *dest_tre, dma_addr_t addr,
>>>>> +                               u32 len, bool last_tre, bool bei,
>>>>> +                               enum ipa_cmd_opcode opcode)
>>>>> +{
>>>>> +        struct gsi_tre tre;
>>>>> +
>>>>> +        tre.addr = cpu_to_le64(addr);
>>>>> +        tre.len_opcode = gsi_tre_len_opcode(opcode, len);
>>>>> +        tre.reserved = 0;
>>>>> +        tre.flags = gsi_tre_flags(last_tre, bei, opcode);
>>>>> +
>>>>> +        *dest_tre = tre; /* Write TRE as a single (16-byte) unit */
>>>>> +}
>>>> Have you checked that the atomic write is actually what happens
>>>> here, but looking at the compiler output? You might need to add
>>>> a 'volatile' qualifier to the dest_tre argument so the
>>>> temporary structure doesn't get optimized away here.
>>>
>>> Currently, the assignment *does* become a "stp" instruction. But
>>> I don't know that we can *force* the compiler to write it as a
>>> pair of registers, so I'll soften the comment with "Attempt to
>>> write" or something similar.
>>>
>>> To my knowledge, adding a volatile qualifier only prevents the
>>> compiler from performing funny optimizations, but that has no
>>> effect on whether the 128-bit assignment is made as a single
>>> unit. Do you know otherwise?
>>
>> I don't think you can force the 128-bit assignment to be atomic,
>> but marking 'dest_tre' should serve to prevent a specific
>> optimization that replaces the function with
>>
>> dest_tre->addr = ...
>> dest_tre->len_opcode = ...
>> dest_tre->reserved = ...
>> dest_tre->flags = ...
>>
>> which it might find more efficient than the stp and is equivalent
>> when the pointer is not marked volatile. We also have the
>> WRITE_ONCE() macro that can help prevent this, but it does not work
>> reliably beyond 64 bit assignments.
>
> OK, I'll mark it volatile to avoid that potential result.
OK I got interesting results so I wanted to report back.
The way it is currently written (no volatile qualifier) is
the *only* way I get a 16-byte store instruction.
Specifically, with dest_tre having type (struct gsi_tre *):
*dest_tre = tre;
Produces this:
4ec: a9002013 stp x19, x8, [x0]
I attempted to make the assigned-to pointer volatile:
*(volatile struct gsi_tre *)dest_tre = tre;
But that apparently is interpreted as "assign things
in the destination in exactly the way they were assigned
above into the "tre" structure." Because it produced this:
748: f9000348 str x8, [x26]
74c: 7940c3e8 ldrh w8, [sp, #96]
750: 79001348 strh w8, [x26, #8]
754: 7940bbe8 ldrh w8, [sp, #92]
758: 79001748 strh w8, [x26, #10]
75c: b9405be8 ldr w8, [sp, #88]
760: b9000f48 str w8, [x26, #12]
From there I went back and changed the type of the dest_tre pointer
parameter to (volatile struct gsi_tre *), and changed the type
of the values passed through that argument in the two callers to
also have the volatile qualifier. This way there was no need to
use a cast in the left-hand side. That too produced a string of
separate assignments, not a single 128-bit one.
I also attempted this:
*dest_tre = *(volatile struct gsi_tre *)&tre;
And even this:
*(volatile struct gsi_tre *)dest_tre = *(volatile struct gsi_tre *)&tre;
But neither produced a single "stp" instruction; all produced
the same sequence of instructions above.
So it seems that I must *not* apply a volatile qualifier,
because doing so restricts the compiler from making the
single instruction optimization.
If I've missed something and you have another suggestion for
me to try let me know and I'll try it.
-Alex
> Thanks.
>
> -Alex
>
>>
>> Arnd
>>
>
On Sun, May 19, 2019 at 7:11 PM Alex Elder <[email protected]> wrote:
> On 5/17/19 1:44 PM, Alex Elder wrote:
> > On 5/17/19 1:33 PM, Arnd Bergmann wrote:
> >> On Fri, May 17, 2019 at 8:08 PM Alex Elder <[email protected]>
>
> So it seems that I must *not* apply a volatile qualifier,
> because doing so restricts the compiler from making the
> single instruction optimization.
Right, I guess that makes sense.
> If I've missed something and you have another suggestion for
> me to try let me know and I'll try it.
A memcpy() might do the right thing as well. Another idea would
be a cast to __int128 like
#ifdef CONFIG_ARCH_SUPPORTS_INT128
typedef __int128 tre128_t;
#else
typedef struct { __u64 a; __u64 b; } tre128_t;
#endif

static inline void set_tre(struct gsi_tre *dest_tre, struct gsi_tre *src_tre)
{
	*(volatile tre128_t *)dest_tre = *(tre128_t *)src_tre;
}
Arnd
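A runnable user-space version of this idea looks roughly like the following (the struct layout and names are illustrative, not the real TRE; `__SIZEOF_INT128__` stands in for the kernel's CONFIG_ARCH_SUPPORTS_INT128, and `may_alias` silences the strict-aliasing problem the casts would otherwise have):

```c
#include <assert.h>
#include <stdint.h>

/* 16-byte alignment helps the compiler use one aligned wide store */
struct tre {
	uint64_t addr;
	uint64_t word1;		/* len/opcode/flags packed; illustrative */
} __attribute__((aligned(16)));

#ifdef __SIZEOF_INT128__
typedef unsigned __int128 tre128_t __attribute__((may_alias));
#else
typedef struct { uint64_t a; uint64_t b; } tre128_t __attribute__((may_alias));
#endif

static inline void set_tre(struct tre *dest_tre, const struct tre *src_tre)
{
	/* one 128-bit copy where the type is supported */
	*(tre128_t *)dest_tre = *(const tre128_t *)src_tre;
}
```

The aliasing caveat is one more argument for the memcpy() spelling, which needs no attribute at all.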
On 5/20/19 4:25 AM, Arnd Bergmann wrote:
> On Sun, May 19, 2019 at 7:11 PM Alex Elder <[email protected]> wrote:
>> On 5/17/19 1:44 PM, Alex Elder wrote:
>>> On 5/17/19 1:33 PM, Arnd Bergmann wrote:
>>>> On Fri, May 17, 2019 at 8:08 PM Alex Elder <[email protected]>
>>
>> So it seems that I must *not* apply a volatile qualifier,
>> because doing so restricts the compiler from making the
>> single instruction optimization.
>
> Right, I guess that makes sense.
>
>> If I've missed something and you have another suggestion for
>> me to try let me know and I'll try it.
>
> A memcpy() might do the right thing as well. Another idea would
I find memcpy() does the right thing.
> be a cast to __int128 like
I find that my environment supports 128 bit integers. But...
> #ifdef CONFIG_ARCH_SUPPORTS_INT128
> typedef __int128 tre128_t;
> #else
> typedef struct { __u64 a; __u64 b; } tre128_t;
> #endif
>
> static inline void set_tre(struct gsi_tre *dest_tre, struct gsi_tre *src_tre)
> {
> *(volatile tre128_t *)dest_tre = *(tre128_t *)src_tre;
> }
...this produces two 8-bit assignments. Could it be because
it's implemented as two 64-bit values? I think so. Dropping
the volatile qualifier produces a single "stp" instruction.
The only other thing I thought I could do to encourage
the compiler to do the right thing is define the type (or
variables) to have 128-bit alignment. And doing that for
the original simple assignment didn't change the (desirable)
outcome, but I don't think it's really necessary in this
case, considering the single instruction uses two 64-bit
registers.
I'm going to leave it as it was originally; it's the simplest:
*dest_tre = tre;
I added a comment about structuring the code this way with
the intention of getting the single instruction. If a different
compiler produces a different result, we can revisit it then.
-Alex
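For reference, the memcpy() spelling Alex found to do the right thing can be sketched like this (struct and names are illustrative): with a constant 16-byte size, gcc and clang inline the call as the widest stores available, a single "stp" of two 64-bit registers on arm64, and unlike a volatile-qualified store it leaves the optimizer free to merge the member writes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct tre {
	uint64_t addr;
	uint64_t word1;		/* remaining fields packed; illustrative */
};

static inline void tre_write(struct tre *dst, const struct tre *src)
{
	memcpy(dst, src, sizeof(*dst));	/* constant 16 bytes, inlined */
}
```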
>
> Arnd
>
On 5/20/19 9:43 AM, Arnd Bergmann wrote:
> I have no idea how two 8-bit assignments could do that,
> it sounds like a serious gcc bug, unless you mean two
> 8-byte assignments, which would be within the range
> of expected behavior. If it's actually 8-bit stores, please
> open a bug against gcc with a minimized test case.
Sorry, it's 8 *byte* assignments, not 8 bit. -Alex
On Mon, May 20, 2019 at 2:50 PM Alex Elder <[email protected]> wrote:
>
> On 5/20/19 4:25 AM, Arnd Bergmann wrote:
> > On Sun, May 19, 2019 at 7:11 PM Alex Elder <[email protected]> wrote:
> >> On 5/17/19 1:44 PM, Alex Elder wrote:
> >>> On 5/17/19 1:33 PM, Arnd Bergmann wrote:
> >>>> On Fri, May 17, 2019 at 8:08 PM Alex Elder <[email protected]>
> >>
> >> So it seems that I must *not* apply a volatile qualifier,
> >> because doing so restricts the compiler from making the
> >> single instruction optimization.
> >
> > Right, I guess that makes sense.
> >
> >> If I've missed something and you have another suggestion for
> >> me to try let me know and I'll try it.
> >
> > A memcpy() might do the right thing as well. Another idea would
>
> I find memcpy() does the right thing.
>
> > be a cast to __int128 like
>
> I find that my environment supports 128 bit integers. But...
>
> > #ifdef CONFIG_ARCH_SUPPORTS_INT128
> > typedef __int128 tre128_t;
> > #else
> > typedef struct { __u64 a; __u64 b; } tre128_t;
> > #endif
> >
> > static inline void set_tre(struct gsi_tre *dest_tre, struct gsi_tre *src_tre)
> > {
> > *(volatile tre128_t *)dest_tre = *(tre128_t *)src_tre;
> > }
> ...this produces two 8-bit assignments. Could it be because
> it's implemented as two 64-bit values? I think so. Dropping
> the volatile qualifier produces a single "stp" instruction.
I have no idea how two 8-bit assignments could do that,
it sounds like a serious gcc bug, unless you mean two
8-byte assignments, which would be within the range
of expected behavior. If it's actually 8-bit stores, please
open a bug against gcc with a minimized test case.
> The only other thing I thought I could do to encourage
> the compiler to do the right thing is define the type (or
> variables) to have 128-bit alignment. And doing that for
> the original simple assignment didn't change the (desirable)
> outcome, but I don't think it's really necessary in this
> case, considering the single instruction uses two 64-bit
> registers.
>
> I'm going to leave it as it was originally; it's the simplest:
> *dest_tre = tre;
>
> I added a comment about structuring the code this way with
> the intention of getting the single instruction. If a different
> compiler produces a different result, we can revisit it then.
Ok, that's probably the best we can do then.
Arnd
On Mon, May 20, 2019 at 7:44 AM Alex Elder <[email protected]> wrote:
>
> On 5/20/19 9:43 AM, Arnd Bergmann wrote:
> > I have no idea how two 8-bit assignments could do that,
> > it sounds like a serious gcc bug, unless you mean two
> > 8-byte assignments, which would be within the range
> > of expected behavior. If it's actually 8-bit stores, please
> > open a bug against gcc with a minimized test case.
>
> Sorry, it's 8 *byte* assignments, not 8 bit. -Alex
Is it important to the hardware that you're writing all 128 bits of
this in an atomic manner? My understanding is that while you may get
different behaviors using various combinations of
volatile/aligned/packed, this is way deep in the compiler's "free to
do whatever I want" territory. If the hardware's going to misbehave if
you don't get an atomic 128-bit write, then I don't think this has
been nailed down enough, since I think Clang or even a different
version of GCC is within its rights to split the writes up differently.
Is filling out the TRE touching memory that the hardware is also
watching at the same time? Usually when the hardware cares about the
contents of a struct, there's a particular (smaller) field that can be
written out atomically. I remember USB having these structs that
needed to be filled out, but the hardware wouldn't actually slurp them
up until the smaller "token" part was non-zero. If the hardware is
scanning this struct, it might be safer to do it in two steps: 1)
flush out the filled out struct except for the field with the "go" bit
(which you'd have zeroed), then 2) set the field containing the "go"
bit.
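The two-step publish Evan describes can be sketched in user space roughly as follows (all names are invented, not from any real USB or GSI structure): fill the descriptor body with the "go" bit clear, then set the bit the hardware polls with release semantics so the body is guaranteed visible first:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct desc {
	uint64_t addr;
	uint32_t len;
	_Atomic uint32_t flags;		/* bit 0 = DESC_GO */
};

#define DESC_GO	0x1u

static void desc_publish(struct desc *d, uint64_t addr, uint32_t len)
{
	/* step 1: fill the body while GO is still clear */
	d->addr = addr;
	d->len = len;
	/* step 2: release store orders the body before the GO bit */
	atomic_store_explicit(&d->flags, DESC_GO, memory_order_release);
}
```

With this protocol the width and order of the body writes stop mattering; only the final flag store has to be a single ordered operation.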
On 5/20/19 11:34 AM, Evan Green wrote:
> On Mon, May 20, 2019 at 7:44 AM Alex Elder <[email protected]> wrote:
>>
>> On 5/20/19 9:43 AM, Arnd Bergmann wrote:
>>> I have no idea how two 8-bit assignments could do that,
>>> it sounds like a serious gcc bug, unless you mean two
>>> 8-byte assignments, which would be within the range
>>> of expected behavior. If it's actually 8-bit stores, please
>>> open a bug against gcc with a minimized test case.
>>
>> Sorry, it's 8 *byte* assignments, not 8 bit. -Alex
>
> Is it important to the hardware that you're writing all 128 bits of
No, it is not important in the ways you are describing.
We're just geeking out over how to get optimal performance.
A single 128-bit write is nicer than two 64-bit writes,
or several smaller ones.
The hardware won't touch the TRE until the doorbell gets
rung telling it that it is permitted to do so. The doorbell
is an I/O write, which implies a memory barrier, so the entire
TRE will be up-to-date in memory regardless of whether we
write it 128 bits or 8 bits at a time.
-Alex
> this in an atomic manner? My understanding is that while you may get
> different behaviors using various combinations of
> volatile/aligned/packed, this is way deep in the compiler's "free to
> do whatever I want" territory. If the hardware's going to misbehave if
> you don't get an atomic 128-bit write, then I don't think this has
> been nailed down enough, since I think Clang or even a different
> version of GCC is within its right to split the writes up differently.
>
> Is filling out the TRE touching memory that the hardware is also
> watching at the same time? Usually when the hardware cares about the
> contents of a struct, there's a particular (smaller) field that can be
> written out atomically. I remember USB having these structs that
> needed to be filled out, but the hardware wouldn't actually slurp them
> up until the smaller "token" part was non-zero. If the hardware is
> scanning this struct, it might be safer to do it in two steps: 1)
> flush out the filled out struct except for the field with the "go" bit
> (which you'd have zeroed), then 2) set the field containing the "go"
> bit.
>
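The ordering Alex relies on can be simulated in user space roughly like this (names invented; in the kernel the doorbell is a writel(), which already implies the write barrier, so the explicit release fence below plays the wmb() role before a relaxed "MMIO" store):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static uint64_t tre_slot[2];		/* stand-in for ring memory */
static _Atomic uint32_t doorbell;	/* stand-in for the doorbell register */

static void ring_doorbell(uint32_t index)
{
	/* TREs may have been written in any order/width before this;
	 * the fence orders all of them before the doorbell store. */
	atomic_thread_fence(memory_order_release);	/* wmb() */
	atomic_store_explicit(&doorbell, index, memory_order_relaxed);
}
```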
On Mon, May 20, 2019 at 9:50 AM Alex Elder <[email protected]> wrote:
>
> On 5/20/19 11:34 AM, Evan Green wrote:
> > On Mon, May 20, 2019 at 7:44 AM Alex Elder <[email protected]> wrote:
> >>
> >> On 5/20/19 9:43 AM, Arnd Bergmann wrote:
> >>> I have no idea how two 8-bit assignments could do that,
> >>> it sounds like a serious gcc bug, unless you mean two
> >>> 8-byte assignments, which would be within the range
> >>> of expected behavior. If it's actually 8-bit stores, please
> >>> open a bug against gcc with a minimized test case.
> >>
> >> Sorry, it's 8 *byte* assignments, not 8 bit. -Alex
> >
> > Is it important to the hardware that you're writing all 128 bits of
>
> No, it is not important in the ways you are describing.
>
> We're just geeking out over how to get optimal performance.
> A single 128-bit write is nicer than two 64-bit writes,
> or more smaller writes.
>
> The hardware won't touch the TRE until the doorbell gets
> rung telling it that it is permitted to do so. The doorbell
> is an I/O write, which implies a memory barrier, so the entire
> TRE will be up-to-date in memory regardless of whether we
> write it 128 bits or 8 bits at a time.
>
Ah, understood. Carry on!