2005-12-29 00:43:38

by Bryan O'Sullivan

Subject: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

Following Roland's submission of our InfiniPath InfiniBand HCA driver
earlier this month, we have responded to people's comments by making a
large number of changes to the driver.

Here is another set of driver patches for review. Roland is on
vacation until January 4, so I'm posting these in his place. Once
again, your comments are appreciated. We'd like to submit this driver
for inclusion in 2.6.16, so we'll be responding quickly to all
feedback.

A short summary of the changes we have made is as follows:

- sparse annotations (yes, it passes "make C=1")

- Removed x86_64 specificity from driver

- Introduced generic memcpy_toio32 for safe MMIO access

- Got rid of release and RCS IDs

- Use set_page_dirty_lock instead of SetPageDirty

- Fixed misuse of copy_from_user

- Removed all sysctls

- Removed stuff inside #ifndef __KERNEL__

- Use ALIGN() instead of round_up()

- Use static inlines instead of #defines, generally tidied inline
functions

- Renamed _BITS_PER_BYTE to BITS_PER_BYTE, and moved it into
linux/types.h

- Got rid of ipath_shortcopy

- Use fixed-size types for user/kernel communication (sketched below)

- Renamed ipath_mlock to ipath_get_user_pages, fixed some bugs
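
To make the fixed-size-types change concrete: structures shared between the
kernel and userspace now use the fixed-width __u32/__u64 types so that their
layout is identical for 32-bit and 64-bit callers. A minimal sketch only (the
struct and field names below are made up, not the driver's actual ABI):

/* Hypothetical example -- not one of the driver's real ABI structs. */
struct example_user_info {
	__u64 buf_addr;		/* address of a buffer shared with userspace */
	__u32 buf_len;		/* buffer length in bytes */
	__u32 flags;		/* feature bits */
};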

There are a few requested changes we have chosen to omit for now:

- The driver still uses EXPORT_SYMBOL, for consistency with other
code in drivers/infiniband

- Someone asked for the kernel's i2c infrastructure to be used, but
our i2c usage is very specialised, and it would be more of a mess
to use the kernel's

- We're still using ioctls instead of sysfs or configfs in some
cases, to maintain userspace compatibility

Please note that these patches require a set of OpenIB kernel patches
that are awaiting the 2.6.16 submission window in order to compile; in
other words, they really are for review only. I'll be happy to
provide a suitable jumbo OpenIB patch to anyone who feels a need to
compile-test these patches.



2005-12-29 00:39:10

by Bryan O'Sullivan

Subject: [PATCH 1 of 20] Introduce __memcpy_toio32

This routine is an arch-independent building block for memcpy_toio32.
It copies data to a memory-mapped I/O region, using 32-bit accesses.
This style of access is required by some devices.
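
As a purely illustrative sketch of how a caller would use it (the function and
variable names below are made up, and note that the count is in 32-bit words,
not bytes):

/* Hypothetical usage only -- not part of this patch. */
static void push_command(void __iomem *piobuf, const u32 *cmd)
{
	/* 16 dwords == 64 bytes; both pointers must be 32-bit aligned */
	__memcpy_toio32(piobuf, cmd, 16);
}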

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r a56fd6a8895d -r ef833f6712e7 include/asm-generic/iomap.h
--- a/include/asm-generic/iomap.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-generic/iomap.h Wed Dec 28 14:19:42 2005 -0800
@@ -56,6 +56,15 @@
extern void fastcall iowrite16_rep(void __iomem *port, const void *buf, unsigned long count);
extern void fastcall iowrite32_rep(void __iomem *port, const void *buf, unsigned long count);

+/*
+ * __memcpy_toio32 - copy data to MMIO space, in 32-bit units
+ *
+ * @to: destination, in MMIO space (must be 32-bit aligned)
+ * @from: source (must be 32-bit aligned)
+ * @count: number of 32-bit quantities to copy
+ */
+void fastcall __memcpy_toio32(void __iomem *to, const void *from, size_t count);
+
/* Create a virtual mapping cookie for an IO port range */
extern void __iomem *ioport_map(unsigned long port, unsigned int nr);
extern void ioport_unmap(void __iomem *);
diff -r a56fd6a8895d -r ef833f6712e7 lib/iomap.c
--- a/lib/iomap.c Wed Dec 28 14:19:42 2005 -0800
+++ b/lib/iomap.c Wed Dec 28 14:19:42 2005 -0800
@@ -187,6 +187,17 @@
EXPORT_SYMBOL(iowrite16_rep);
EXPORT_SYMBOL(iowrite32_rep);

+void fastcall __memcpy_toio32(void __iomem *d, const void *s, size_t count)
+{
+ u32 __iomem *dst = d;
+ const u32 *src = s;
+ size_t i;
+
+ for (i = 0; i < count; i++)
+ __raw_writel(*src++, dst++);
+ wmb();
+}
+
/* Create a virtual mapping cookie for an IO port range */
void __iomem *ioport_map(unsigned long port, unsigned int nr)
{

2005-12-29 00:39:41

by Bryan O'Sullivan

Subject: [PATCH 20 of 20] ipath - integrate driver into infiniband kbuild infrastructure

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r 07bf9f34e221 -r 914136b2b8ee drivers/infiniband/Kconfig
--- a/drivers/infiniband/Kconfig Wed Dec 28 14:19:43 2005 -0800
+++ b/drivers/infiniband/Kconfig Wed Dec 28 14:19:43 2005 -0800
@@ -30,6 +30,7 @@
<http://www.openib.org>.

source "drivers/infiniband/hw/mthca/Kconfig"
+source "drivers/infiniband/hw/ipath/Kconfig"

source "drivers/infiniband/ulp/ipoib/Kconfig"

diff -r 07bf9f34e221 -r 914136b2b8ee drivers/infiniband/Makefile
--- a/drivers/infiniband/Makefile Wed Dec 28 14:19:43 2005 -0800
+++ b/drivers/infiniband/Makefile Wed Dec 28 14:19:43 2005 -0800
@@ -1,4 +1,5 @@
obj-$(CONFIG_INFINIBAND) += core/
obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/
+obj-$(CONFIG_IPATH_CORE) += hw/ipath/
obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/
obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/

2005-12-29 00:39:11

by Bryan O'Sullivan

Subject: [PATCH 4 of 20] Define BITS_PER_BYTE

This can make some arithmetic expressions clearer.
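
For instance (illustrative only), sizing a bitmap in bytes reads more clearly
with the named constant than with a bare 8:

/* bytes needed to hold "nbits" bits, rounded up */
#define EXAMPLE_BITMAP_BYTES(nbits) \
	(((nbits) + BITS_PER_BYTE - 1) / BITS_PER_BYTE)

Patch 14, for example, uses the constant when sizing its QPN bitmap
(QPNMAP_ENTRIES).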

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r b792638cc4bc -r a3a00f637da6 include/linux/types.h
--- a/include/linux/types.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/linux/types.h Wed Dec 28 14:19:42 2005 -0800
@@ -8,6 +8,8 @@
(((bits)+BITS_PER_LONG-1)/BITS_PER_LONG)
#define DECLARE_BITMAP(name,bits) \
unsigned long name[BITS_TO_LONGS(bits)]
+
+#define BITS_PER_BYTE 8
#endif

#include <linux/posix_types.h>

2005-12-29 00:39:41

by Bryan O'Sullivan

Subject: [PATCH 14 of 20] ipath - infiniband verbs header

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r f9bcd9de3548 -r 26993cb5faee drivers/infiniband/hw/ipath/ipath_verbs.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,532 @@
+/*
+ * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#ifndef IPATH_VERBS_H
+#define IPATH_VERBS_H
+
+#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <linux/kernel.h>
+#include <linux/interrupt.h>
+#include <rdma/ib_pack.h>
+
+#include "ipath_kernel.h"
+#include "verbs_debug.h"
+
+#define CTL_IPATH_VERBS 0x70736e68 /* "spin" as a hex value, top level */
+#define CTL_IPATH_VERBS_FAULT 1
+#define CTL_IPATH_VERBS_DEBUG 2
+
+#define QPN_MAX (1 << 24)
+#define QPNMAP_ENTRIES (QPN_MAX / PAGE_SIZE / BITS_PER_BYTE)
+
+/*
+ * Increment this value if any changes that break userspace ABI
+ * compatibility are made.
+ */
+#define IPATH_UVERBS_ABI_VERSION 1
+
+/*
+ * Define an ib_cq_notify value that is not valid so we know when CQ
+ * notifications are armed.
+ */
+#define IB_CQ_NONE (IB_CQ_NEXT_COMP + 1)
+
+enum {
+ IB_RNR_NAK = 0x20,
+
+ IB_NAK_PSN_ERROR = 0x60,
+ IB_NAK_INVALID_REQUEST = 0x61,
+ IB_NAK_REMOTE_ACCESS_ERROR = 0x62,
+ IB_NAK_REMOTE_OPERATIONAL_ERROR = 0x63,
+ IB_NAK_INVALID_RD_REQUEST = 0x64
+};
+
+/* IB Performance Manager status values */
+enum {
+ IB_PMA_SAMPLE_STATUS_DONE = 0x00,
+ IB_PMA_SAMPLE_STATUS_STARTED = 0x01,
+ IB_PMA_SAMPLE_STATUS_RUNNING = 0x02
+};
+
+/* Mandatory IB performance counter select values. */
+#define IB_PMA_PORT_XMIT_DATA __constant_htons(0x0001)
+#define IB_PMA_PORT_RCV_DATA __constant_htons(0x0002)
+#define IB_PMA_PORT_XMIT_PKTS __constant_htons(0x0003)
+#define IB_PMA_PORT_RCV_PKTS __constant_htons(0x0004)
+#define IB_PMA_PORT_XMIT_WAIT __constant_htons(0x0005)
+
+struct ib_reth {
+ u64 vaddr;
+ u32 rkey;
+ u32 length;
+} __attribute__ ((packed));
+
+struct ib_atomic_eth {
+ u64 vaddr;
+ u32 rkey;
+ u64 swap_data;
+ u64 compare_data;
+} __attribute__ ((packed));
+
+struct ipath_other_headers {
+ u32 bth[3];
+ union {
+ struct {
+ u32 deth[2];
+ u32 imm_data;
+ } ud;
+ struct {
+ struct ib_reth reth;
+ u32 imm_data;
+ } rc;
+ struct {
+ u32 aeth;
+ u64 atomic_ack_eth;
+ } at;
+ u32 imm_data;
+ u32 aeth;
+ struct ib_atomic_eth atomic_eth;
+ } u;
+} __attribute__ ((packed));
+
+/*
+ * Note that UD packets with a GRH header are 8+40+12+8 = 68 bytes long
+ * (72 w/ imm_data).
+ * Only the first 56 bytes of the IB header will be in the
+ * eager header buffer. The remaining 12 or 16 bytes are in the data buffer.
+ */
+struct ipath_ib_header {
+ u16 lrh[4];
+ union {
+ struct {
+ struct ib_grh grh;
+ struct ipath_other_headers oth;
+ } l;
+ struct ipath_other_headers oth;
+ } u;
+} __attribute__ ((packed));
+
+/*
+ * There is one struct ipath_mcast for each multicast GID.
+ * All attached QPs are then stored as a list of
+ * struct ipath_mcast_qp.
+ */
+struct ipath_mcast_qp {
+ struct list_head list;
+ struct ipath_qp *qp;
+};
+
+struct ipath_mcast {
+ struct rb_node rb_node;
+ union ib_gid mgid;
+ struct list_head qp_list;
+ wait_queue_head_t wait;
+ atomic_t refcount;
+};
+
+/* Memory region */
+struct ipath_mr {
+ struct ib_mr ibmr;
+ struct ipath_mregion mr; /* must be last */
+};
+
+/* Fast memory region */
+struct ipath_fmr {
+ struct ib_fmr ibfmr;
+ u8 page_size;
+ struct ipath_mregion mr; /* must be last */
+};
+
+/* Protection domain */
+struct ipath_pd {
+ struct ib_pd ibpd;
+ int user; /* non-zero if created from user space */
+};
+
+/* Address Handle */
+struct ipath_ah {
+ struct ib_ah ibah;
+ struct ib_ah_attr attr;
+};
+
+/*
+ * Quick description of our CQ/QP locking scheme:
+ *
+ * We have one global lock that protects dev->cq/qp_table. Each
+ * struct ipath_cq/qp also has its own lock. An individual qp lock
+ * may be taken inside of an individual cq lock. Both cqs attached to
+ * a qp may be locked, with the send cq locked first. No other
+ * nesting should be done.
+ *
+ * Each struct ipath_cq/qp also has an atomic_t ref count. The
+ * pointer from the cq/qp_table to the struct counts as one reference.
+ * This reference also is good for access through the consumer API, so
+ * modifying the CQ/QP etc doesn't need to take another reference.
+ * Access because of a completion being polled does need a reference.
+ *
+ * Finally, each struct ipath_cq/qp has a wait_queue_head_t for the
+ * destroy function to sleep on.
+ *
+ * This means that access from the consumer API requires nothing but
+ * taking the struct's lock.
+ *
+ * Access because of a completion event should go as follows:
+ * - lock cq/qp_table and look up struct
+ * - increment ref count in struct
+ * - drop cq/qp_table lock
+ * - lock struct, do your thing, and unlock struct
+ * - decrement ref count; if zero, wake up waiters
+ *
+ * To destroy a CQ/QP, we can do the following:
+ * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock
+ * - decrement ref count
+ * - wait_event until ref count is zero
+ *
+ * It is the consumer's responsibility to make sure that no QP
+ * operations (WQE posting or state modification) are pending when the
+ * QP is destroyed. Also, the consumer must make sure that calls to
+ * qp_modify are serialized.
+ *
+ * Possible optimizations (wait for profile data to see if/where we
+ * have locks bouncing between CPUs):
+ * - split cq/qp table lock into n separate (cache-aligned) locks,
+ * indexed (say) by the page in the table
+ */
+
+struct ipath_cq {
+ struct ib_cq ibcq;
+ struct tasklet_struct comptask;
+ spinlock_t lock;
+ u8 notify;
+ u8 triggered;
+ u32 head; /* new records added to the head */
+ u32 tail; /* poll_cq() reads from here. */
+ struct ib_wc queue[1]; /* this is actually ibcq.cqe + 1 */
+};
+
+/*
+ * Send work request queue entry.
+ * The size of the sg_list is determined when the QP is created and stored
+ * in qp->s_max_sge.
+ */
+struct ipath_swqe {
+ struct ib_send_wr wr; /* don't use wr.sg_list */
+ u32 psn; /* first packet sequence number */
+ u32 lpsn; /* last packet sequence number */
+ u32 ssn; /* send sequence number */
+ u32 length; /* total length of data in sg_list */
+ struct ipath_sge sg_list[0];
+};
+
+/*
+ * Receive work request queue entry.
+ * The size of the sg_list is determined when the QP is created and stored
+ * in qp->r_max_sge.
+ */
+struct ipath_rwqe {
+ u64 wr_id;
+ u32 length; /* total length of data in sg_list */
+ u8 num_sge;
+ struct ipath_sge sg_list[0];
+};
+
+struct ipath_rq {
+ spinlock_t lock;
+ u32 head; /* new work requests posted to the head */
+ u32 tail; /* receives pull requests from here. */
+ u32 size; /* size of RWQE array */
+ u8 max_sge;
+ struct ipath_rwqe *wq; /* RWQE array */
+};
+
+struct ipath_srq {
+ struct ib_srq ibsrq;
+ struct ipath_rq rq;
+ u32 limit; /* send signal when number of RWQEs < limit */
+};
+
+/*
+ * Variables prefixed with s_ are for the requester (sender).
+ * Variables prefixed with r_ are for the responder (receiver).
+ * Variables prefixed with ack_ are for responder replies.
+ *
+ * Common variables are protected by both r_rq.lock and s_lock in that order
+ * which only happens in modify_qp() or changing the QP 'state'.
+ */
+struct ipath_qp {
+ struct ib_qp ibqp;
+ struct ipath_qp *next; /* link list for QPN hash table */
+ struct list_head piowait; /* link for wait PIO buf */
+ struct list_head timerwait; /* link for waiting for timeouts */
+ struct ib_ah_attr remote_ah_attr;
+ struct ipath_ib_header s_hdr; /* next packet header to send */
+ atomic_t refcount;
+ wait_queue_head_t wait;
+ struct tasklet_struct s_task;
+ struct ipath_sge_state *s_cur_sge;
+ struct ipath_sge_state s_sge; /* current send request data */
+ struct ipath_sge_state s_rdma_sge; /* current RDMA read send data */
+ struct ipath_sge_state r_sge; /* current receive data */
+ spinlock_t s_lock;
+ int s_flags;
+ u32 s_hdrwords; /* size of s_hdr in 32 bit words */
+ u32 s_cur_size; /* size of send packet in bytes */
+ u32 s_len; /* total length of s_sge */
+ u32 s_rdma_len; /* total length of s_rdma_sge */
+ u32 s_next_psn; /* PSN for next request */
+ u32 s_last_psn; /* last response PSN processed */
+ u32 s_psn; /* current packet sequence number */
+ u32 s_rnr_timeout; /* number of milliseconds for RNR timeout */
+ u32 s_ack_psn; /* PSN for next ACK or RDMA_READ */
+ u64 s_ack_atomic; /* data for atomic ACK */
+ u64 r_wr_id; /* ID for current receive WQE */
+ u64 r_atomic_data; /* data for last atomic op */
+ u32 r_atomic_psn; /* PSN of last atomic op */
+ u32 r_len; /* total length of r_sge */
+ u32 r_rcv_len; /* receive data len processed */
+ u32 r_psn; /* expected rcv packet sequence number */
+ u8 state; /* QP state */
+ u8 s_state; /* opcode of last packet sent */
+ u8 s_ack_state; /* opcode of packet to ACK */
+ u8 s_nak_state; /* non-zero if NAK is pending */
+ u8 r_state; /* opcode of last packet received */
+ u8 r_reuse_sge; /* for UC receive errors */
+ u8 r_sge_inx; /* current index into sg_list */
+ u8 s_max_sge; /* size of s_wq->sg_list */
+ u8 qp_access_flags;
+ u8 s_retry_cnt; /* number of times to retry */
+ u8 s_rnr_retry_cnt;
+ u8 s_min_rnr_timer;
+ u8 s_retry; /* requester retry counter */
+ u8 s_rnr_retry; /* requester RNR retry counter */
+ u8 s_pkey_index; /* PKEY index to use */
+ enum ib_mtu path_mtu;
+ atomic_t msn; /* message sequence number */
+ u32 remote_qpn;
+ u32 qkey; /* QKEY for this QP (for UD or RD) */
+ u32 s_size; /* send work queue size */
+ u32 s_head; /* new entries added here */
+ u32 s_tail; /* next entry to process */
+ u32 s_cur; /* current work queue entry */
+ u32 s_last; /* last un-ACK'ed entry */
+ u32 s_ssn; /* SSN of tail entry */
+ u32 s_lsn; /* limit sequence number (credit) */
+ struct ipath_swqe *s_wq; /* send work queue */
+ struct ipath_rq r_rq; /* receive work queue */
+};
+
+/*
+ * Bit definitions for s_flags.
+ */
+#define IPATH_S_BUSY 0
+#define IPATH_S_SIGNAL_REQ_WR 1
+
+/*
+ * Since struct ipath_swqe is not a fixed size, we can't simply index into
+ * struct ipath_qp.s_wq. This function does the array index computation.
+ */
+static inline struct ipath_swqe *get_swqe_ptr(struct ipath_qp *qp, unsigned n)
+{
+ return (struct ipath_swqe *)((char *) qp->s_wq +
+ (sizeof(struct ipath_swqe) +
+ qp->s_max_sge * sizeof(struct ipath_sge)) * n);
+}
+
+/*
+ * Since struct ipath_rwqe is not a fixed size, we can't simply index into
+ * struct ipath_rq.wq. This function does the array index computation.
+ */
+static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, unsigned n)
+{
+ return (struct ipath_rwqe *)((char *) rq->wq +
+ (sizeof(struct ipath_rwqe) +
+ rq->max_sge * sizeof(struct ipath_sge)) * n);
+}
+
+/*
+ * QPN-map pages start out as NULL, they get allocated upon
+ * first use and are never deallocated. This way,
+ * large bitmaps are not allocated unless large numbers of QPs are used.
+ */
+struct qpn_map {
+ atomic_t n_free;
+ void *page;
+};
+
+struct ipath_qp_table {
+ spinlock_t lock;
+ u32 last; /* last QP number allocated */
+ u32 max; /* size of the hash table */
+ u32 nmaps; /* size of the map table */
+ struct ipath_qp **table;
+ struct qpn_map map[QPNMAP_ENTRIES]; /* bit map of free numbers */
+};
+
+struct ipath_lkey_table {
+ spinlock_t lock;
+ u32 next; /* next unused index (speeds search) */
+ u32 gen; /* generation count */
+ u32 max; /* size of the table */
+ struct ipath_mregion **table;
+};
+
+struct ipath_opcode_stats {
+ u64 n_packets; /* number of packets */
+ u64 n_bytes; /* total number of bytes */
+};
+
+struct ipath_ibdev {
+ struct ib_device ibdev;
+ ipath_type ib_unit; /* This is the device number */
+ u16 sm_lid; /* in host order */
+ u8 sm_sl;
+ u8 mkeyprot_resv_lmc;
+ unsigned long mkey_lease_timeout; /* non-zero when timer is set */
+
+ /* The following fields are really per port. */
+ struct ipath_qp_table qp_table;
+ struct ipath_lkey_table lk_table;
+ struct list_head pending[3]; /* FIFO of QPs waiting for ACKs */
+ struct list_head piowait; /* list for wait PIO buf */
+ struct list_head rnrwait; /* list of QPs waiting for RNR timer */
+ spinlock_t pending_lock;
+ __be64 sys_image_guid; /* in network order */
+ __be64 gid_prefix; /* in network order */
+ __be64 mkey;
+ u64 ipath_sword; /* total dwords sent (sample result) */
+ u64 ipath_rword; /* total dwords received (sample result) */
+ u64 ipath_spkts; /* total packets sent (sample result) */
+ u64 ipath_rpkts; /* total packets received (sample result) */
+ u64 n_unicast_xmit; /* total unicast packets sent */
+ u64 n_unicast_rcv; /* total unicast packets received */
+ u64 n_multicast_xmit; /* total multicast packets sent */
+ u64 n_multicast_rcv; /* total multicast packets received */
+ u64 n_symbol_error_counter; /* starting count for PMA */
+ u64 n_link_error_recovery_counter; /* starting count for PMA */
+ u64 n_link_downed_counter; /* starting count for PMA */
+ u64 n_port_rcv_errors; /* starting count for PMA */
+ u64 n_port_rcv_remphys_errors; /* starting count for PMA */
+ u64 n_port_xmit_discards; /* starting count for PMA */
+ u64 n_port_xmit_data; /* starting count for PMA */
+ u64 n_port_rcv_data; /* starting count for PMA */
+ u64 n_port_xmit_packets; /* starting count for PMA */
+ u64 n_port_rcv_packets; /* starting count for PMA */
+ u32 n_rc_resends;
+ u32 n_rc_acks;
+ u32 n_rc_qacks;
+ u32 n_seq_naks;
+ u32 n_rdma_seq;
+ u32 n_rnr_naks;
+ u32 n_other_naks;
+ u32 n_timeouts;
+ u32 n_pkt_drops;
+ u32 n_wqe_errs;
+ u32 n_rdma_dup_busy;
+ u32 n_piowait;
+ u32 n_no_piobuf;
+ u32 port_cap_flags;
+ u32 pma_sample_start;
+ u32 pma_sample_interval;
+ __be16 pma_counter_select[5];
+ u16 pma_tag;
+ u16 qkey_violations;
+ u16 mkey_violations;
+ u16 mkey_lease_period;
+ u16 pending_index; /* which pending queue is active */
+ u8 pma_sample_status;
+ u8 subnet_timeout;
+ struct ipath_opcode_stats opstats[128];
+};
+
+struct ipath_ucontext {
+ struct ib_ucontext ibucontext;
+};
+
+static inline struct ipath_mr *to_imr(struct ib_mr *ibmr)
+{
+ return container_of(ibmr, struct ipath_mr, ibmr);
+}
+
+static inline struct ipath_fmr *to_ifmr(struct ib_fmr *ibfmr)
+{
+ return container_of(ibfmr, struct ipath_fmr, ibfmr);
+}
+
+static inline struct ipath_pd *to_ipd(struct ib_pd *ibpd)
+{
+ return container_of(ibpd, struct ipath_pd, ibpd);
+}
+
+static inline struct ipath_ah *to_iah(struct ib_ah *ibah)
+{
+ return container_of(ibah, struct ipath_ah, ibah);
+}
+
+static inline struct ipath_cq *to_icq(struct ib_cq *ibcq)
+{
+ return container_of(ibcq, struct ipath_cq, ibcq);
+}
+
+static inline struct ipath_srq *to_isrq(struct ib_srq *ibsrq)
+{
+ return container_of(ibsrq, struct ipath_srq, ibsrq);
+}
+
+static inline struct ipath_qp *to_iqp(struct ib_qp *ibqp)
+{
+ return container_of(ibqp, struct ipath_qp, ibqp);
+}
+
+static inline struct ipath_ibdev *to_idev(struct ib_device *ibdev)
+{
+ return container_of(ibdev, struct ipath_ibdev, ibdev);
+}
+
+int ipath_process_mad(struct ib_device *ibdev,
+ int mad_flags,
+ u8 port_num,
+ struct ib_wc *in_wc,
+ struct ib_grh *in_grh,
+ struct ib_mad *in_mad, struct ib_mad *out_mad);
+
+static inline struct ipath_ucontext *to_iucontext(struct ib_ucontext
+ *ibucontext)
+{
+ return container_of(ibucontext, struct ipath_ucontext, ibucontext);
+}
+
+#endif /* IPATH_VERBS_H */
diff -r f9bcd9de3548 -r 26993cb5faee drivers/infiniband/hw/ipath/verbs_debug.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/verbs_debug.h Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,106 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#ifndef _VERBS_DEBUG_H
+#define _VERBS_DEBUG_H
+
+/*
+ * This file contains tracing code for the ib_ipath kernel module.
+ */
+#ifndef _VERBS_DEBUGGING /* tracing enabled or not */
+#define _VERBS_DEBUGGING 1
+#endif
+
+extern unsigned ib_ipath_debug;
+
+#define _VERBS_ERROR(fmt,...) \
+ do { \
+ printk(KERN_ERR "%s: " fmt, "ib_ipath", ##__VA_ARGS__); \
+ } while(0)
+
+#define _VERBS_UNIT_ERROR(unit,fmt,...) \
+ do { \
+ printk(KERN_ERR "%s: " fmt, "ib_ipath", ##__VA_ARGS__); \
+ } while(0)
+
+#if _VERBS_DEBUGGING
+
+/*
+ * Mask values for debugging. The scheme allows us to compile out any of
+ * the debug tracing stuff, and if compiled in, to enable or disable dynamically.
+ * This can be set at modprobe time also:
+ * modprobe ib_ipath ib_ipath_debug=3
+ */
+
+#define __VERBS_INFO 0x1 /* generic low verbosity stuff */
+#define __VERBS_DBG 0x2 /* generic debug */
+#define __VERBS_VDBG 0x4 /* verbose debug */
+#define __VERBS_SMADBG 0x8000 /* sma packet debug */
+
+#define _VERBS_INFO(fmt,...) \
+ do { \
+ if(unlikely(ib_ipath_debug&__VERBS_INFO)) \
+ printk(KERN_INFO "%s: " fmt,"ib_ipath",##__VA_ARGS__); \
+ } while(0)
+
+#define _VERBS_DBG(fmt,...) \
+ do { \
+ if(unlikely(ib_ipath_debug&__VERBS_DBG)) \
+ printk(KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \
+ } while(0)
+
+#define _VERBS_VDBG(fmt,...) \
+ do { \
+ if(unlikely(ib_ipath_debug&__VERBS_VDBG)) \
+ printk(KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \
+ } while(0)
+
+#define _VERBS_SMADBG(fmt,...) \
+ do { \
+ if(unlikely(ib_ipath_debug&__VERBS_SMADBG)) \
+ printk(KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \
+ } while(0)
+
+#else /* ! _VERBS_DEBUGGING */
+
+#define _VERBS_INFO(fmt,...)
+#define _VERBS_DBG(fmt,...)
+#define _VERBS_VDBG(fmt,...)
+#define _VERBS_SMADBG(fmt,...)
+
+#endif /* _VERBS_DEBUGGING */
+
+#endif /* _VERBS_DEBUG_H */
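
The locking comment in ipath_verbs.h above describes the access pattern for
completion-time lookups: take the table lock, find the struct, bump its
reference count, drop the table lock, then work under the per-struct lock and
finally drop the reference. A rough, hypothetical C sketch of that pattern
(example_find_qp() is an illustrative name, not a function from this series;
the fields used do exist in struct ipath_ibdev and struct ipath_qp):

/* Hypothetical sketch of the pattern described in the locking comment. */
static void example_handle_completion(struct ipath_ibdev *dev, u32 qpn)
{
	struct ipath_qp *qp;
	unsigned long flags;

	spin_lock_irqsave(&dev->qp_table.lock, flags);
	qp = example_find_qp(&dev->qp_table, qpn);	/* illustrative lookup */
	if (qp)
		atomic_inc(&qp->refcount);		/* hold a reference */
	spin_unlock_irqrestore(&dev->qp_table.lock, flags);

	if (!qp)
		return;

	spin_lock_irqsave(&qp->s_lock, flags);
	/* ... process the completion ... */
	spin_unlock_irqrestore(&qp->s_lock, flags);

	/* drop the reference; wake ipath_destroy_qp() if it is waiting */
	if (atomic_dec_and_test(&qp->refcount))
		wake_up(&qp->wait);
}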

2005-12-29 00:39:47

by Bryan O'Sullivan

Subject: [PATCH 17 of 20] ipath - infiniband verbs support, part 3 of 3

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r fc067af322a1 -r 584777b6f4dc drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Dec 28 14:19:43 2005 -0800
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Dec 28 14:19:43 2005 -0800
@@ -4815,3 +4815,1393 @@
/* Call do_rc_send() in another thread. */
tasklet_schedule(&qp->s_task);
}
+
+/*
+ * This is called from ipath_ib_rcv() to process an incoming packet
+ * for the given QP.
+ * Called at interrupt level.
+ */
+static inline void ipath_qp_rcv(struct ipath_ibdev *dev,
+ struct ipath_ib_header *hdr, int has_grh,
+ void *data, u32 tlen, struct ipath_qp *qp)
+{
+ /* Check for valid receive state. */
+ if (!(state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) {
+ dev->n_pkt_drops++;
+ return;
+ }
+
+ switch (qp->ibqp.qp_type) {
+ case IB_QPT_SMI:
+ case IB_QPT_GSI:
+ case IB_QPT_UD:
+ ipath_ud_rcv(dev, hdr, has_grh, data, tlen, qp);
+ break;
+
+ case IB_QPT_RC:
+ ipath_rc_rcv(dev, hdr, has_grh, data, tlen, qp);
+ break;
+
+ case IB_QPT_UC:
+ ipath_uc_rcv(dev, hdr, has_grh, data, tlen, qp);
+ break;
+
+ default:
+ break;
+ }
+}
+
+/*
+ * This is called from ipath_kreceive() to process an incoming packet at
+ * interrupt level. Tlen is the length of the header + data + CRC in bytes.
+ */
+static void ipath_ib_rcv(const ipath_type t, void *rhdr, void *data, u32 tlen)
+{
+ struct ipath_ibdev *dev = ipath_devices[t];
+ struct ipath_ib_header *hdr = rhdr;
+ struct ipath_other_headers *ohdr;
+ struct ipath_qp *qp;
+ u32 qp_num;
+ int lnh;
+ u8 opcode;
+
+ if (dev == NULL)
+ return;
+
+ if (tlen < 24) { /* LRH+BTH+CRC */
+ dev->n_pkt_drops++;
+ return;
+ }
+
+ /* Check for GRH */
+ lnh = be16_to_cpu(hdr->lrh[0]) & 3;
+ if (lnh == IPS_LRH_BTH)
+ ohdr = &hdr->u.oth;
+ else if (lnh == IPS_LRH_GRH)
+ ohdr = &hdr->u.l.oth;
+ else {
+ dev->n_pkt_drops++;
+ return;
+ }
+
+ opcode = *(u8 *) (&ohdr->bth[0]);
+ dev->opstats[opcode].n_bytes += tlen;
+ dev->opstats[opcode].n_packets++;
+
+ /* Get the destination QP number. */
+ qp_num = be32_to_cpu(ohdr->bth[1]) & 0xFFFFFF;
+ if (qp_num == 0xFFFFFF) {
+ struct ipath_mcast *mcast;
+ struct ipath_mcast_qp *p;
+
+ mcast = ipath_mcast_find(&hdr->u.l.grh.dgid);
+ if (mcast == NULL) {
+ dev->n_pkt_drops++;
+ return;
+ }
+ dev->n_multicast_rcv++;
+ list_for_each_entry_rcu(p, &mcast->qp_list, list)
+ ipath_qp_rcv(dev, hdr, lnh == IPS_LRH_GRH, data, tlen,
+ p->qp);
+ /*
+ * Notify ipath_multicast_detach() if it is waiting for us
+ * to finish.
+ */
+ if (atomic_dec_return(&mcast->refcount) <= 1)
+ wake_up(&mcast->wait);
+ } else if ((qp = ipath_lookup_qpn(&dev->qp_table, qp_num)) != NULL) {
+ dev->n_unicast_rcv++;
+ ipath_qp_rcv(dev, hdr, lnh == IPS_LRH_GRH, data, tlen, qp);
+ /*
+ * Notify ipath_destroy_qp() if it is waiting for us to finish.
+ */
+ if (atomic_dec_and_test(&qp->refcount))
+ wake_up(&qp->wait);
+ } else
+ dev->n_pkt_drops++;
+}
+
+/*
+ * This is called from ipath_do_rcv_timer() at interrupt level
+ * to check for QPs which need retransmits and to collect performance numbers.
+ */
+static void ipath_ib_timer(const ipath_type t)
+{
+ struct ipath_ibdev *dev = ipath_devices[t];
+ struct ipath_qp *resend = NULL;
+ struct ipath_qp *rnr = NULL;
+ struct list_head *last;
+ struct ipath_qp *qp;
+ unsigned long flags;
+
+ if (dev == NULL)
+ return;
+
+ spin_lock_irqsave(&dev->pending_lock, flags);
+ /* Start filling the next pending queue. */
+ if (++dev->pending_index >= ARRAY_SIZE(dev->pending))
+ dev->pending_index = 0;
+ /* Save any requests still in the new queue, they have timed out. */
+ last = &dev->pending[dev->pending_index];
+ while (!list_empty(last)) {
+ qp = list_entry(last->next, struct ipath_qp, timerwait);
+ if (last->next == LIST_POISON1 ||
+ last->next != &qp->timerwait ||
+ qp->timerwait.prev != last) {
+ INIT_LIST_HEAD(last);
+ } else {
+ list_del(&qp->timerwait);
+ qp->timerwait.prev = (struct list_head *) resend;
+ resend = qp;
+ atomic_inc(&qp->refcount);
+ }
+ }
+ last = &dev->rnrwait;
+ if (!list_empty(last)) {
+ qp = list_entry(last->next, struct ipath_qp, timerwait);
+ if (--qp->s_rnr_timeout == 0) {
+ do {
+ if (last->next == LIST_POISON1 ||
+ last->next != &qp->timerwait ||
+ qp->timerwait.prev != last) {
+ INIT_LIST_HEAD(last);
+ break;
+ }
+ list_del(&qp->timerwait);
+ qp->timerwait.prev = (struct list_head *) rnr;
+ rnr = qp;
+ if (list_empty(last))
+ break;
+ qp = list_entry(last->next, struct ipath_qp,
+ timerwait);
+ } while (qp->s_rnr_timeout == 0);
+ }
+ }
+ /* We should only be in the started state if pma_sample_start != 0 */
+ if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_STARTED &&
+ --dev->pma_sample_start == 0) {
+ dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_RUNNING;
+ ipath_layer_snapshot_counters(dev->ib_unit, &dev->ipath_sword,
+ &dev->ipath_rword,
+ &dev->ipath_spkts,
+ &dev->ipath_rpkts);
+ }
+ if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_RUNNING) {
+ if (dev->pma_sample_interval == 0) {
+ u64 ta, tb, tc, td;
+
+ dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_DONE;
+ ipath_layer_snapshot_counters(dev->ib_unit,
+ &ta, &tb, &tc, &td);
+
+ dev->ipath_sword = ta - dev->ipath_sword;
+ dev->ipath_rword = tb - dev->ipath_rword;
+ dev->ipath_spkts = tc - dev->ipath_spkts;
+ dev->ipath_rpkts = td - dev->ipath_rpkts;
+ } else {
+ dev->pma_sample_interval--;
+ }
+ }
+ spin_unlock_irqrestore(&dev->pending_lock, flags);
+
+ /* XXX What if timer fires again while this is running? */
+ for (qp = resend; qp != NULL;
+ qp = (struct ipath_qp *) qp->timerwait.prev) {
+ struct ib_wc wc;
+
+ spin_lock_irqsave(&qp->s_lock, flags);
+ if (qp->s_last != qp->s_tail && qp->state == IB_QPS_RTS) {
+ dev->n_timeouts++;
+ ipath_restart_rc(qp, qp->s_last_psn + 1, &wc);
+ }
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+
+ /* Notify ipath_destroy_qp() if it is waiting. */
+ if (atomic_dec_and_test(&qp->refcount))
+ wake_up(&qp->wait);
+ }
+ for (qp = rnr; qp != NULL;
+ qp = (struct ipath_qp *) qp->timerwait.prev) {
+ tasklet_schedule(&qp->s_task);
+ }
+}
+
+/*
+ * This is called from ipath_intr() at interrupt level when a PIO buffer
+ * is available after ipath_verbs_send() returned an error that no
+ * buffers were available.
+ * Return 0 if we consumed all the PIO buffers and we still have QPs
+ * waiting for buffers (for now, just do a tasklet_schedule and return one).
+ */
+static int ipath_ib_piobufavail(const ipath_type t)
+{
+ struct ipath_ibdev *dev = ipath_devices[t];
+ struct ipath_qp *qp;
+ unsigned long flags;
+
+ if (dev == NULL)
+ return 1;
+
+ spin_lock_irqsave(&dev->pending_lock, flags);
+ while (!list_empty(&dev->piowait)) {
+ qp = list_entry(dev->piowait.next, struct ipath_qp, piowait);
+ list_del(&qp->piowait);
+ tasklet_schedule(&qp->s_task);
+ }
+ spin_unlock_irqrestore(&dev->pending_lock, flags);
+
+ return 1;
+}
+
+static struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
+ struct ib_qp_init_attr *init_attr,
+ struct ib_udata *udata)
+{
+ struct ipath_qp *qp;
+ int err;
+ struct ipath_swqe *swq = NULL;
+ struct ipath_ibdev *dev;
+ size_t sz;
+
+ if (init_attr->cap.max_send_sge > 255 ||
+ init_attr->cap.max_recv_sge > 255)
+ return ERR_PTR(-ENOMEM);
+
+ switch (init_attr->qp_type) {
+ case IB_QPT_UC:
+ case IB_QPT_RC:
+ sz = sizeof(struct ipath_sge) * init_attr->cap.max_send_sge +
+ sizeof(struct ipath_swqe);
+ swq = vmalloc((init_attr->cap.max_send_wr + 1) * sz);
+ if (swq == NULL)
+ return ERR_PTR(-ENOMEM);
+ /* FALLTHROUGH */
+ case IB_QPT_UD:
+ case IB_QPT_SMI:
+ case IB_QPT_GSI:
+ qp = kmalloc(sizeof(*qp), GFP_KERNEL);
+ if (!qp)
+ return ERR_PTR(-ENOMEM);
+ qp->r_rq.size = init_attr->cap.max_recv_wr + 1;
+ sz = sizeof(struct ipath_sge) * init_attr->cap.max_recv_sge +
+ sizeof(struct ipath_rwqe);
+ qp->r_rq.wq = vmalloc(qp->r_rq.size * sz);
+ if (!qp->r_rq.wq) {
+ kfree(qp);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /*
+ * ib_create_qp() will initialize qp->ibqp
+ * except for qp->ibqp.qp_num.
+ */
+ spin_lock_init(&qp->s_lock);
+ spin_lock_init(&qp->r_rq.lock);
+ atomic_set(&qp->refcount, 0);
+ init_waitqueue_head(&qp->wait);
+ tasklet_init(&qp->s_task,
+ init_attr->qp_type == IB_QPT_RC ? do_rc_send :
+ do_uc_send, (unsigned long)qp);
+ qp->piowait.next = LIST_POISON1;
+ qp->piowait.prev = LIST_POISON2;
+ qp->timerwait.next = LIST_POISON1;
+ qp->timerwait.prev = LIST_POISON2;
+ qp->state = IB_QPS_RESET;
+ qp->s_wq = swq;
+ qp->s_size = init_attr->cap.max_send_wr + 1;
+ qp->s_max_sge = init_attr->cap.max_send_sge;
+ qp->r_rq.max_sge = init_attr->cap.max_recv_sge;
+ qp->s_flags = init_attr->sq_sig_type == IB_SIGNAL_REQ_WR ?
+ 1 << IPATH_S_SIGNAL_REQ_WR : 0;
+ dev = to_idev(ibpd->device);
+ err = ipath_alloc_qpn(&dev->qp_table, qp, init_attr->qp_type);
+ if (err) {
+ vfree(swq);
+ vfree(qp->r_rq.wq);
+ kfree(qp);
+ return ERR_PTR(err);
+ }
+ ipath_reset_qp(qp);
+
+ /* Tell the core driver that the kernel SMA is present. */
+ if (qp->ibqp.qp_type == IB_QPT_SMI)
+ ipath_verbs_set_flags(dev->ib_unit,
+ IPATH_VERBS_KERNEL_SMA);
+ break;
+
+ default:
+ /* Don't support raw QPs */
+ return ERR_PTR(-ENOSYS);
+ }
+
+ init_attr->cap.max_inline_data = 0;
+
+ return &qp->ibqp;
+}
+
+/*
+ * Note that this can be called while the QP is actively sending or receiving!
+ */
+static int ipath_destroy_qp(struct ib_qp *ibqp)
+{
+ struct ipath_qp *qp = to_iqp(ibqp);
+ struct ipath_ibdev *dev = to_idev(ibqp->device);
+ unsigned long flags;
+
+ /* Tell the core driver that the kernel SMA is gone. */
+ if (qp->ibqp.qp_type == IB_QPT_SMI)
+ ipath_verbs_set_flags(dev->ib_unit, 0);
+
+ spin_lock_irqsave(&qp->r_rq.lock, flags);
+ spin_lock(&qp->s_lock);
+ qp->state = IB_QPS_ERR;
+ spin_unlock(&qp->s_lock);
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+
+ /* Stop the sending tasklet. */
+ tasklet_kill(&qp->s_task);
+
+ /* Make sure the QP isn't on the timeout list. */
+ spin_lock_irqsave(&dev->pending_lock, flags);
+ if (qp->timerwait.next != LIST_POISON1)
+ list_del(&qp->timerwait);
+ if (qp->piowait.next != LIST_POISON1)
+ list_del(&qp->piowait);
+ spin_unlock_irqrestore(&dev->pending_lock, flags);
+
+ /*
+ * Make sure that the QP is not in the QPN table so receive interrupts
+ * will discard packets for this QP.
+ * XXX Also remove QP from multicast table.
+ */
+ if (atomic_read(&qp->refcount) != 0)
+ ipath_free_qp(&dev->qp_table, qp);
+
+ vfree(qp->s_wq);
+ vfree(qp->r_rq.wq);
+ kfree(qp);
+ return 0;
+}
+
+static struct ib_srq *ipath_create_srq(struct ib_pd *ibpd,
+ struct ib_srq_init_attr *srq_init_attr,
+ struct ib_udata *udata)
+{
+ struct ipath_srq *srq;
+ u32 sz;
+
+ if (srq_init_attr->attr.max_sge < 1)
+ return ERR_PTR(-EINVAL);
+
+ srq = kmalloc(sizeof(*srq), GFP_KERNEL);
+ if (!srq)
+ return ERR_PTR(-ENOMEM);
+
+ /* Need to use vmalloc() if we want to support large #s of entries. */
+ srq->rq.size = srq_init_attr->attr.max_wr + 1;
+ sz = sizeof(struct ipath_sge) * srq_init_attr->attr.max_sge +
+ sizeof(struct ipath_rwqe);
+ srq->rq.wq = vmalloc(srq->rq.size * sz);
+ if (!srq->rq.wq) {
+ kfree(srq);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /*
+ * ib_create_srq() will initialize srq->ibsrq.
+ */
+ spin_lock_init(&srq->rq.lock);
+ srq->rq.head = 0;
+ srq->rq.tail = 0;
+ srq->rq.max_sge = srq_init_attr->attr.max_sge;
+ srq->limit = srq_init_attr->attr.srq_limit;
+
+ return &srq->ibsrq;
+}
+
+int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
+ enum ib_srq_attr_mask attr_mask)
+{
+ struct ipath_srq *srq = to_isrq(ibsrq);
+ unsigned long flags;
+
+ if (attr_mask & IB_SRQ_LIMIT) {
+ spin_lock_irqsave(&srq->rq.lock, flags);
+ srq->limit = attr->srq_limit;
+ spin_unlock_irqrestore(&srq->rq.lock, flags);
+ }
+ if (attr_mask & IB_SRQ_MAX_WR) {
+ u32 size = attr->max_wr + 1;
+ struct ipath_rwqe *wq, *p;
+ u32 n;
+ u32 sz;
+
+ if (attr->max_sge < srq->rq.max_sge)
+ return -EINVAL;
+
+ sz = sizeof(struct ipath_rwqe) +
+ attr->max_sge * sizeof(struct ipath_sge);
+ wq = vmalloc(size * sz);
+ if (!wq)
+ return -ENOMEM;
+
+ spin_lock_irqsave(&srq->rq.lock, flags);
+ if (srq->rq.head < srq->rq.tail)
+ n = srq->rq.size + srq->rq.head - srq->rq.tail;
+ else
+ n = srq->rq.head - srq->rq.tail;
+ if (size <= n || size <= srq->limit) {
+ spin_unlock_irqrestore(&srq->rq.lock, flags);
+ vfree(wq);
+ return -EINVAL;
+ }
+ n = 0;
+ p = wq;
+ while (srq->rq.tail != srq->rq.head) {
+ struct ipath_rwqe *wqe;
+ int i;
+
+ wqe = get_rwqe_ptr(&srq->rq, srq->rq.tail);
+ p->wr_id = wqe->wr_id;
+ p->length = wqe->length;
+ p->num_sge = wqe->num_sge;
+ for (i = 0; i < wqe->num_sge; i++)
+ p->sg_list[i] = wqe->sg_list[i];
+ n++;
+ p = (struct ipath_rwqe *)((char *) p + sz);
+ if (++srq->rq.tail >= srq->rq.size)
+ srq->rq.tail = 0;
+ }
+ vfree(srq->rq.wq);
+ srq->rq.wq = wq;
+ srq->rq.size = size;
+ srq->rq.head = n;
+ srq->rq.tail = 0;
+ srq->rq.max_sge = attr->max_sge;
+ spin_unlock_irqrestore(&srq->rq.lock, flags);
+ }
+ return 0;
+}
+
+static int ipath_destroy_srq(struct ib_srq *ibsrq)
+{
+ struct ipath_srq *srq = to_isrq(ibsrq);
+
+ vfree(srq->rq.wq);
+ kfree(srq);
+
+ return 0;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static int ipath_poll_cq(struct ib_cq *ibcq, int num_entries,
+ struct ib_wc *entry)
+{
+ struct ipath_cq *cq = to_icq(ibcq);
+ unsigned long flags;
+ int npolled;
+
+ spin_lock_irqsave(&cq->lock, flags);
+
+ for (npolled = 0; npolled < num_entries; ++npolled, ++entry) {
+ if (cq->tail == cq->head)
+ break;
+ *entry = cq->queue[cq->tail];
+ if (++cq->tail == cq->ibcq.cqe)
+ cq->tail = 0;
+ }
+
+ spin_unlock_irqrestore(&cq->lock, flags);
+
+ return npolled;
+}
+
+static struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries,
+ struct ib_ucontext *context,
+ struct ib_udata *udata)
+{
+ struct ipath_cq *cq;
+
+ /* Need to use vmalloc() if we want to support large #s of entries. */
+ cq = vmalloc(sizeof(*cq) + entries * sizeof(*cq->queue));
+ if (!cq)
+ return ERR_PTR(-ENOMEM);
+ /*
+ * ib_create_cq() will initialize cq->ibcq except for cq->ibcq.cqe.
+ * The number of entries should be >= the number requested or
+ * return an error.
+ */
+ cq->ibcq.cqe = entries + 1;
+ cq->notify = IB_CQ_NONE;
+ cq->triggered = 0;
+ spin_lock_init(&cq->lock);
+ tasklet_init(&cq->comptask, send_complete, (unsigned long)cq);
+ cq->head = 0;
+ cq->tail = 0;
+
+ return &cq->ibcq;
+}
+
+static int ipath_destroy_cq(struct ib_cq *ibcq)
+{
+ struct ipath_cq *cq = to_icq(ibcq);
+
+ tasklet_kill(&cq->comptask);
+ vfree(cq);
+
+ return 0;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+{
+ struct ipath_cq *cq = to_icq(ibcq);
+ unsigned long flags;
+
+ spin_lock_irqsave(&cq->lock, flags);
+ /*
+ * Don't change IB_CQ_NEXT_COMP to IB_CQ_SOLICITED but allow
+ * any other transitions.
+ */
+ if (cq->notify != IB_CQ_NEXT_COMP)
+ cq->notify = notify;
+ spin_unlock_irqrestore(&cq->lock, flags);
+ return 0;
+}
+
+static int ipath_query_device(struct ib_device *ibdev,
+ struct ib_device_attr *props)
+{
+ struct ipath_ibdev *dev = to_idev(ibdev);
+ uint32_t vendor, boardrev, majrev, minrev;
+
+ memset(props, 0, sizeof(*props));
+
+ props->device_cap_flags = IB_DEVICE_BAD_PKEY_CNTR |
+ IB_DEVICE_BAD_QKEY_CNTR | IB_DEVICE_SHUTDOWN_PORT |
+ IB_DEVICE_SYS_IMAGE_GUID;
+ ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev,
+ &majrev, &minrev);
+ props->vendor_id = vendor;
+ props->vendor_part_id = boardrev;
+ props->hw_ver = boardrev << 16 | majrev << 8 | minrev;
+
+ props->sys_image_guid = dev->sys_image_guid;
+ props->node_guid = ipath_layer_get_guid(dev->ib_unit);
+
+ props->max_mr_size = ~0ull;
+ props->max_qp = 0xffff;
+ props->max_qp_wr = 0xffff;
+ props->max_sge = 255;
+ props->max_cq = 0xffff;
+ props->max_cqe = 0xffff;
+ props->max_mr = 0xffff;
+ props->max_pd = 0xffff;
+ props->max_qp_rd_atom = 1;
+ props->max_qp_init_rd_atom = 1;
+ /* props->max_res_rd_atom */
+ props->max_srq = 0xffff;
+ props->max_srq_wr = 0xffff;
+ props->max_srq_sge = 255;
+ /* props->local_ca_ack_delay */
+ props->atomic_cap = IB_ATOMIC_HCA;
+ props->max_pkeys = ipath_layer_get_npkeys(dev->ib_unit);
+ props->max_mcast_grp = 0xffff;
+ props->max_mcast_qp_attach = 0xffff;
+ props->max_total_mcast_qp_attach = props->max_mcast_qp_attach *
+ props->max_mcast_grp;
+
+ return 0;
+}
+
+static int ipath_query_port(struct ib_device *ibdev,
+ u8 port, struct ib_port_attr *props)
+{
+ struct ipath_ibdev *dev = to_idev(ibdev);
+ uint32_t flags = ipath_layer_get_flags(dev->ib_unit);
+ enum ib_mtu mtu;
+ uint32_t l;
+ uint16_t lid = ipath_layer_get_lid(dev->ib_unit);
+
+ memset(props, 0, sizeof(*props));
+ props->lid = lid ? lid : IB_LID_PERMISSIVE;
+ props->lmc = dev->mkeyprot_resv_lmc & 7;
+ props->sm_lid = dev->sm_lid;
+ props->sm_sl = dev->sm_sl;
+ if (flags & IPATH_LINKDOWN)
+ props->state = IB_PORT_DOWN;
+ else if (flags & IPATH_LINKARMED)
+ props->state = IB_PORT_ARMED;
+ else if (flags & IPATH_LINKACTIVE)
+ props->state = IB_PORT_ACTIVE;
+ else if (flags & IPATH_LINK_SLEEPING)
+ props->state = IB_PORT_ACTIVE_DEFER;
+ else
+ props->state = IB_PORT_NOP;
+ /* See phys_state_show() */
+ props->phys_state = 5; /* LinkUp */
+ props->port_cap_flags = dev->port_cap_flags;
+ props->gid_tbl_len = 1;
+ props->max_msg_sz = 4096;
+ props->pkey_tbl_len = ipath_layer_get_npkeys(dev->ib_unit);
+ props->bad_pkey_cntr = ipath_layer_get_cr_errpkey(dev->ib_unit);
+ props->qkey_viol_cntr = dev->qkey_violations;
+ props->active_width = IB_WIDTH_4X;
+ /* See rate_show() */
+ props->active_speed = 1; /* Regular 10Mbs speed. */
+ props->max_vl_num = 1; /* VLCap = VL0 */
+ props->init_type_reply = 0;
+
+ props->max_mtu = IB_MTU_4096;
+ l = ipath_layer_get_ibmtu(dev->ib_unit);
+ switch (l) {
+ case 4096:
+ mtu = IB_MTU_4096;
+ break;
+ case 2048:
+ mtu = IB_MTU_2048;
+ break;
+ case 1024:
+ mtu = IB_MTU_1024;
+ break;
+ case 512:
+ mtu = IB_MTU_512;
+ break;
+ case 256:
+ mtu = IB_MTU_256;
+ break;
+ default:
+ mtu = IB_MTU_2048;
+ }
+ props->active_mtu = mtu;
+ props->subnet_timeout = dev->subnet_timeout;
+
+ return 0;
+}
+
+static int ipath_modify_device(struct ib_device *device,
+ int device_modify_mask,
+ struct ib_device_modify *device_modify)
+{
+ if (device_modify_mask & IB_DEVICE_MODIFY_SYS_IMAGE_GUID)
+ to_idev(device)->sys_image_guid = device_modify->sys_image_guid;
+
+ return 0;
+}
+
+static int ipath_modify_port(struct ib_device *ibdev,
+ u8 port, int port_modify_mask,
+ struct ib_port_modify *props)
+{
+ struct ipath_ibdev *dev = to_idev(ibdev);
+
+ atomic_set_mask(props->set_port_cap_mask, &dev->port_cap_flags);
+ atomic_clear_mask(props->clr_port_cap_mask, &dev->port_cap_flags);
+ if (port_modify_mask & IB_PORT_SHUTDOWN)
+ ipath_kset_linkstate(dev->ib_unit << 16 | IPATH_IB_LINKDOWN);
+ if (port_modify_mask & IB_PORT_RESET_QKEY_CNTR)
+ dev->qkey_violations = 0;
+ return 0;
+}
+
+static int ipath_query_pkey(struct ib_device *ibdev,
+ u8 port, u16 index, u16 *pkey)
+{
+ struct ipath_ibdev *dev = to_idev(ibdev);
+
+ if (index >= ipath_layer_get_npkeys(dev->ib_unit))
+ return -EINVAL;
+ *pkey = ipath_layer_get_pkey(dev->ib_unit, index);
+ return 0;
+}
+
+static int ipath_query_gid(struct ib_device *ibdev, u8 port,
+ int index, union ib_gid *gid)
+{
+ struct ipath_ibdev *dev = to_idev(ibdev);
+
+ if (index >= 1)
+ return -EINVAL;
+ gid->global.subnet_prefix = dev->gid_prefix;
+ gid->global.interface_id = ipath_layer_get_guid(dev->ib_unit);
+
+ return 0;
+}
+
+static struct ib_pd *ipath_alloc_pd(struct ib_device *ibdev,
+ struct ib_ucontext *context,
+ struct ib_udata *udata)
+{
+ struct ipath_pd *pd;
+
+ pd = kmalloc(sizeof *pd, GFP_KERNEL);
+ if (!pd)
+ return ERR_PTR(-ENOMEM);
+
+ /* ib_alloc_pd() will initialize pd->ibpd. */
+ pd->user = udata != NULL;
+
+ return &pd->ibpd;
+}
+
+static int ipath_dealloc_pd(struct ib_pd *ibpd)
+{
+ struct ipath_pd *pd = to_ipd(ibpd);
+
+ kfree(pd);
+
+ return 0;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static struct ib_ah *ipath_create_ah(struct ib_pd *pd,
+ struct ib_ah_attr *ah_attr)
+{
+ struct ipath_ah *ah;
+
+ ah = kmalloc(sizeof *ah, GFP_ATOMIC);
+ if (!ah)
+ return ERR_PTR(-ENOMEM);
+
+ /* ib_create_ah() will initialize ah->ibah. */
+ ah->attr = *ah_attr;
+
+ return &ah->ibah;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static int ipath_destroy_ah(struct ib_ah *ibah)
+{
+ struct ipath_ah *ah = to_iah(ibah);
+
+ kfree(ah);
+
+ return 0;
+}
+
+static struct ib_mr *ipath_get_dma_mr(struct ib_pd *pd, int acc)
+{
+ struct ipath_mr *mr;
+
+ mr = kmalloc(sizeof *mr, GFP_KERNEL);
+ if (!mr)
+ return ERR_PTR(-ENOMEM);
+
+ /* ib_get_dma_mr() will initialize mr->ibmr except for lkey and rkey. */
+ memset(mr, 0, sizeof *mr);
+ mr->mr.access_flags = acc;
+ return &mr->ibmr;
+}
+
+static struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd,
+ struct ib_phys_buf *buffer_list,
+ int num_phys_buf,
+ int acc, u64 *iova_start)
+{
+ struct ipath_mr *mr;
+ int n, m, i;
+
+ /* Allocate struct plus pointers to first level page tables. */
+ m = (num_phys_buf + IPATH_SEGSZ - 1) / IPATH_SEGSZ;
+ mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL);
+ if (!mr)
+ return ERR_PTR(-ENOMEM);
+
+ /* Allocate first level page tables. */
+ for (i = 0; i < m; i++) {
+ mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0], GFP_KERNEL);
+ if (!mr->mr.map[i]) {
+ while (i)
+ kfree(mr->mr.map[--i]);
+ kfree(mr);
+ return ERR_PTR(-ENOMEM);
+ }
+ }
+ mr->mr.mapsz = m;
+
+ /*
+ * ib_reg_phys_mr() will initialize mr->ibmr except for
+ * lkey and rkey.
+ */
+ if (!ipath_alloc_lkey(&to_idev(pd->device)->lk_table, &mr->mr)) {
+ while (i)
+ kfree(mr->mr.map[--i]);
+ kfree(mr);
+ return ERR_PTR(-ENOMEM);
+ }
+ mr->ibmr.rkey = mr->ibmr.lkey = mr->mr.lkey;
+ mr->mr.user_base = *iova_start;
+ mr->mr.iova = *iova_start;
+ mr->mr.length = 0;
+ mr->mr.offset = 0;
+ mr->mr.access_flags = acc;
+ mr->mr.max_segs = num_phys_buf;
+ m = 0;
+ n = 0;
+ for (i = 0; i < num_phys_buf; i++) {
+ mr->mr.map[m]->segs[n].vaddr =
+ phys_to_virt(buffer_list[i].addr);
+ mr->mr.map[m]->segs[n].length = buffer_list[i].size;
+ mr->mr.length += buffer_list[i].size;
+ if (++n == IPATH_SEGSZ) {
+ m++;
+ n = 0;
+ }
+ }
+ return &mr->ibmr;
+}
+
+static struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd,
+ struct ib_umem *region,
+ int mr_access_flags,
+ struct ib_udata *udata)
+{
+ struct ipath_mr *mr;
+ struct ib_umem_chunk *chunk;
+ int n, m, i;
+
+ n = 0;
+ list_for_each_entry(chunk, &region->chunk_list, list)
+ n += chunk->nents;
+
+ /* Allocate struct plus pointers to first level page tables. */
+ m = (n + IPATH_SEGSZ - 1) / IPATH_SEGSZ;
+ mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL);
+ if (!mr)
+ return ERR_PTR(-ENOMEM);
+
+ /* Allocate first level page tables. */
+ for (i = 0; i < m; i++) {
+ mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0], GFP_KERNEL);
+ if (!mr->mr.map[i]) {
+ while (i)
+ kfree(mr->mr.map[--i]);
+ kfree(mr);
+ return ERR_PTR(-ENOMEM);
+ }
+ }
+ mr->mr.mapsz = m;
+
+ /*
+ * ib_uverbs_reg_mr() will initialize mr->ibmr except for
+ * lkey and rkey.
+ */
+ if (!ipath_alloc_lkey(&to_idev(pd->device)->lk_table, &mr->mr)) {
+ while (i)
+ kfree(mr->mr.map[--i]);
+ kfree(mr);
+ return ERR_PTR(-ENOMEM);
+ }
+ mr->ibmr.rkey = mr->ibmr.lkey = mr->mr.lkey;
+ mr->mr.user_base = region->user_base;
+ mr->mr.iova = region->virt_base;
+ mr->mr.length = region->length;
+ mr->mr.offset = region->offset;
+ mr->mr.access_flags = mr_access_flags;
+ mr->mr.max_segs = n;
+ m = 0;
+ n = 0;
+ list_for_each_entry(chunk, &region->chunk_list, list) {
+ for (i = 0; i < chunk->nmap; i++) {
+ mr->mr.map[m]->segs[n].vaddr =
+ page_address(chunk->page_list[i].page);
+ mr->mr.map[m]->segs[n].length = region->page_size;
+ if (++n == IPATH_SEGSZ) {
+ m++;
+ n = 0;
+ }
+ }
+ }
+ return &mr->ibmr;
+}
+
+/*
+ * Note that this is called to free MRs created by
+ * ipath_get_dma_mr() or ipath_reg_user_mr().
+ */
+static int ipath_dereg_mr(struct ib_mr *ibmr)
+{
+ struct ipath_mr *mr = to_imr(ibmr);
+ int i;
+
+ ipath_free_lkey(&to_idev(ibmr->device)->lk_table, ibmr->lkey);
+ i = mr->mr.mapsz;
+ while (i)
+ kfree(mr->mr.map[--i]);
+ kfree(mr);
+ return 0;
+}
+
+static struct ib_fmr *ipath_alloc_fmr(struct ib_pd *pd,
+ int mr_access_flags,
+ struct ib_fmr_attr *fmr_attr)
+{
+ struct ipath_fmr *fmr;
+ int m, i;
+
+ /* Allocate struct plus pointers to first level page tables. */
+ m = (fmr_attr->max_pages + IPATH_SEGSZ - 1) / IPATH_SEGSZ;
+ fmr = kmalloc(sizeof *fmr + m * sizeof fmr->mr.map[0], GFP_KERNEL);
+ if (!fmr)
+ return ERR_PTR(-ENOMEM);
+
+ /* Allocate first level page tables. */
+ for (i = 0; i < m; i++) {
+ fmr->mr.map[i] = kmalloc(sizeof *fmr->mr.map[0], GFP_KERNEL);
+ if (!fmr->mr.map[i]) {
+ while (i)
+ kfree(fmr->mr.map[--i]);
+ kfree(fmr);
+ return ERR_PTR(-ENOMEM);
+ }
+ }
+ fmr->mr.mapsz = m;
+
+ /* ib_alloc_fmr() will initialize fmr->ibfmr except for lkey & rkey. */
+ if (!ipath_alloc_lkey(&to_idev(pd->device)->lk_table, &fmr->mr)) {
+ while (i)
+ kfree(fmr->mr.map[--i]);
+ kfree(fmr);
+ return ERR_PTR(-ENOMEM);
+ }
+ fmr->ibfmr.rkey = fmr->ibfmr.lkey = fmr->mr.lkey;
+ /* Resources are allocated but no valid mapping (RKEY can't be used). */
+ fmr->mr.user_base = 0;
+ fmr->mr.iova = 0;
+ fmr->mr.length = 0;
+ fmr->mr.offset = 0;
+ fmr->mr.access_flags = mr_access_flags;
+ fmr->mr.max_segs = fmr_attr->max_pages;
+ fmr->page_size = fmr_attr->page_size;
+ return &fmr->ibfmr;
+}
+
+/*
+ * This may be called from interrupt context.
+ * XXX Can we ever be called to map a portion of the RKEY space?
+ */
+static int ipath_map_phys_fmr(struct ib_fmr *ibfmr,
+ u64 * page_list, int list_len, u64 iova)
+{
+ struct ipath_fmr *fmr = to_ifmr(ibfmr);
+ struct ipath_lkey_table *rkt;
+ unsigned long flags;
+ int m, n, i;
+ u32 ps;
+
+ if (list_len > fmr->mr.max_segs)
+ return -EINVAL;
+ rkt = &to_idev(ibfmr->device)->lk_table;
+ spin_lock_irqsave(&rkt->lock, flags);
+ fmr->mr.user_base = iova;
+ fmr->mr.iova = iova;
+ ps = 1 << fmr->page_size;
+ fmr->mr.length = list_len * ps;
+ m = 0;
+ n = 0;
+ ps = 1 << fmr->page_size;
+ for (i = 0; i < list_len; i++) {
+ fmr->mr.map[m]->segs[n].vaddr = phys_to_virt(page_list[i]);
+ fmr->mr.map[m]->segs[n].length = ps;
+ if (++n == IPATH_SEGSZ) {
+ m++;
+ n = 0;
+ }
+ }
+ spin_unlock_irqrestore(&rkt->lock, flags);
+ return 0;
+}
+
+static int ipath_unmap_fmr(struct list_head *fmr_list)
+{
+ struct ipath_fmr *fmr;
+
+ list_for_each_entry(fmr, fmr_list, ibfmr.list) {
+ fmr->mr.user_base = 0;
+ fmr->mr.iova = 0;
+ fmr->mr.length = 0;
+ }
+ return 0;
+}
+
+static int ipath_dealloc_fmr(struct ib_fmr *ibfmr)
+{
+ struct ipath_fmr *fmr = to_ifmr(ibfmr);
+ int i;
+
+ ipath_free_lkey(&to_idev(ibfmr->device)->lk_table, ibfmr->lkey);
+ i = fmr->mr.mapsz;
+ while (i)
+ kfree(fmr->mr.map[--i]);
+ kfree(fmr);
+ return 0;
+}
+
+static ssize_t show_rev(struct class_device *cdev, char *buf)
+{
+ struct ipath_ibdev *dev =
+ container_of(cdev, struct ipath_ibdev, ibdev.class_dev);
+ int vendor, boardrev, majrev, minrev;
+
+ ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev,
+ &majrev, &minrev);
+ return sprintf(buf, "%d.%d\n", majrev, minrev);
+}
+
+static ssize_t show_hca(struct class_device *cdev, char *buf)
+{
+ struct ipath_ibdev *dev =
+ container_of(cdev, struct ipath_ibdev, ibdev.class_dev);
+ int vendor, boardrev, majrev, minrev;
+
+ ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev,
+ &majrev, &minrev);
+ ipath_get_boardname(dev->ib_unit, buf, 128);
+ strcat(buf, "\n");
+ return strlen(buf);
+}
+
+static ssize_t show_board(struct class_device *cdev, char *buf)
+{
+ struct ipath_ibdev *dev =
+ container_of(cdev, struct ipath_ibdev, ibdev.class_dev);
+ int vendor, boardrev, majrev, minrev;
+
+ ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev,
+ &majrev, &minrev);
+ ipath_get_boardname(dev->ib_unit, buf, 128);
+ strcat(buf, "\n");
+ return strlen(buf);
+}
+
+static ssize_t show_stats(struct class_device *cdev, char *buf)
+{
+ struct ipath_ibdev *dev =
+ container_of(cdev, struct ipath_ibdev, ibdev.class_dev);
+ char *p;
+ int i;
+
+ sprintf(buf,
+ "RC resends %d\n"
+ "RC QACKs %d\n"
+ "RC ACKs %d\n"
+ "RC SEQ NAKs %d\n"
+ "RC RDMA seq %d\n"
+ "RC RNR NAKs %d\n"
+ "RC OTH NAKs %d\n"
+ "RC timeouts %d\n"
+ "RC RDMA dup %d\n"
+ "piobuf wait %d\n"
+ "no piobuf %d\n"
+ "PKT drops %d\n"
+ "WQE errs %d\n",
+ dev->n_rc_resends, dev->n_rc_qacks, dev->n_rc_acks,
+ dev->n_seq_naks, dev->n_rdma_seq, dev->n_rnr_naks,
+ dev->n_other_naks, dev->n_timeouts, dev->n_rdma_dup_busy,
+ dev->n_piowait, dev->n_no_piobuf, dev->n_pkt_drops,
+ dev->n_wqe_errs);
+ p = buf;
+ for (i = 0; i < ARRAY_SIZE(dev->opstats); i++) {
+ if (!dev->opstats[i].n_packets && !dev->opstats[i].n_bytes)
+ continue;
+ p += strlen(p);
+ sprintf(p, "%02x %llu/%llu\n",
+ i, dev->opstats[i].n_packets, dev->opstats[i].n_bytes);
+ }
+ return strlen(buf);
+}
+
+static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL);
+static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL);
+static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL);
+static CLASS_DEVICE_ATTR(stats, S_IRUGO, show_stats, NULL);
+
+static struct class_device_attribute *ipath_class_attributes[] = {
+ &class_device_attr_hw_rev,
+ &class_device_attr_hca_type,
+ &class_device_attr_board_id,
+ &class_device_attr_stats
+};
+
+/*
+ * Allocate a ucontext.
+ */
+
+static struct ib_ucontext *ipath_alloc_ucontext(struct ib_device *ibdev,
+ struct ib_udata *udata)
+{
+ struct ipath_ucontext *context;
+
+ context = kmalloc(sizeof *context, GFP_KERNEL);
+ if (!context)
+ return ERR_PTR(-ENOMEM);
+
+ return &context->ibucontext;
+}
+
+static int ipath_dealloc_ucontext(struct ib_ucontext *context)
+{
+ kfree(to_iucontext(context));
+ return 0;
+}
+
+/*
+ * Register our device with the infiniband core.
+ */
+static int ipath_register_ib_device(const ipath_type t)
+{
+ struct ipath_ibdev *idev;
+ struct ib_device *dev;
+ int i;
+ int ret;
+
+ idev = (struct ipath_ibdev *)ib_alloc_device(sizeof *idev);
+ if (idev == NULL)
+ return -ENOMEM;
+
+ dev = &idev->ibdev;
+
+ /* Only need to initialize non-zero fields. */
+ spin_lock_init(&idev->qp_table.lock);
+ spin_lock_init(&idev->lk_table.lock);
+ idev->sm_lid = IB_LID_PERMISSIVE;
+ idev->gid_prefix = __constant_cpu_to_be64(0xfe80000000000000UL);
+ idev->qp_table.last = 1; /* QPN 0 and 1 are special. */
+ idev->qp_table.max = ib_ipath_qp_table_size;
+ idev->qp_table.nmaps = 1;
+ idev->qp_table.table = kmalloc(idev->qp_table.max *
+ sizeof(*idev->qp_table.table),
+ GFP_KERNEL);
+ if (idev->qp_table.table == NULL) {
+ ret = -ENOMEM;
+ goto err_qp;
+ }
+ memset(idev->qp_table.table, 0,
+ idev->qp_table.max * sizeof(*idev->qp_table.table));
+ for (i = 0; i < ARRAY_SIZE(idev->qp_table.map); i++) {
+ atomic_set(&idev->qp_table.map[i].n_free, BITS_PER_PAGE);
+ idev->qp_table.map[i].page = NULL;
+ }
+ /*
+ * The top ib_ipath_lkey_table_size bits are used to index the table.
+ * The lower 8 bits can be owned by the user (copied from the LKEY).
+ * The remaining bits act as a generation number or tag.
+ */
+ idev->lk_table.max = 1 << ib_ipath_lkey_table_size;
+ idev->lk_table.table = kmalloc(idev->lk_table.max *
+ sizeof(*idev->lk_table.table),
+ GFP_KERNEL);
+ if (idev->lk_table.table == NULL) {
+ ret = -ENOMEM;
+ goto err_lk;
+ }
+ memset(idev->lk_table.table, 0,
+ idev->lk_table.max * sizeof(*idev->lk_table.table));
+ spin_lock_init(&idev->pending_lock);
+ INIT_LIST_HEAD(&idev->pending[0]);
+ INIT_LIST_HEAD(&idev->pending[1]);
+ INIT_LIST_HEAD(&idev->pending[2]);
+ INIT_LIST_HEAD(&idev->piowait);
+ INIT_LIST_HEAD(&idev->rnrwait);
+ idev->pending_index = 0;
+ idev->port_cap_flags =
+ IB_PORT_SYS_IMAGE_GUID_SUP | IB_PORT_CLIENT_REG_SUP;
+ idev->pma_counter_select[0] = IB_PMA_PORT_XMIT_DATA;
+ idev->pma_counter_select[1] = IB_PMA_PORT_RCV_DATA;
+ idev->pma_counter_select[2] = IB_PMA_PORT_XMIT_PKTS;
+ idev->pma_counter_select[3] = IB_PMA_PORT_RCV_PKTS;
+	idev->pma_counter_select[4] = IB_PMA_PORT_XMIT_WAIT;
+
+ /*
+	 * The system image GUID is supposed to be the same for all
+ * IB HCAs in a single system.
+ * Note that this code assumes device zero is found first.
+ */
+ idev->sys_image_guid =
+	    t ? ipath_devices[0]->sys_image_guid : ipath_layer_get_guid(t);
+ idev->ib_unit = t;
+
+ strlcpy(dev->name, "ipath%d", IB_DEVICE_NAME_MAX);
+ dev->node_guid = ipath_layer_get_guid(t);
+ dev->uverbs_abi_ver = IPATH_UVERBS_ABI_VERSION;
+ dev->uverbs_cmd_mask =
+ (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+ (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+ (1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+ (1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+ (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+ (1ull << IB_USER_VERBS_CMD_CREATE_AH) |
+ (1ull << IB_USER_VERBS_CMD_DESTROY_AH) |
+ (1ull << IB_USER_VERBS_CMD_REG_MR) |
+ (1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+ (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+ (1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+ (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+ (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
+ (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+ (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+ (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+ (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+ (1ull << IB_USER_VERBS_CMD_POST_SEND) |
+ (1ull << IB_USER_VERBS_CMD_POST_RECV) |
+ (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) |
+ (1ull << IB_USER_VERBS_CMD_DETACH_MCAST) |
+ (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) |
+ (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) |
+ (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) |
+ (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV);
+ dev->node_type = IB_NODE_CA;
+ dev->phys_port_cnt = 1;
+ dev->dma_device = ipath_layer_get_pcidev(t);
+ dev->class_dev.dev = dev->dma_device;
+ dev->query_device = ipath_query_device;
+ dev->modify_device = ipath_modify_device;
+ dev->query_port = ipath_query_port;
+ dev->modify_port = ipath_modify_port;
+ dev->query_pkey = ipath_query_pkey;
+ dev->query_gid = ipath_query_gid;
+ dev->alloc_ucontext = ipath_alloc_ucontext;
+ dev->dealloc_ucontext = ipath_dealloc_ucontext;
+ dev->alloc_pd = ipath_alloc_pd;
+ dev->dealloc_pd = ipath_dealloc_pd;
+ dev->create_ah = ipath_create_ah;
+ dev->destroy_ah = ipath_destroy_ah;
+ dev->create_srq = ipath_create_srq;
+ dev->modify_srq = ipath_modify_srq;
+ dev->destroy_srq = ipath_destroy_srq;
+ dev->create_qp = ipath_create_qp;
+ dev->modify_qp = ipath_modify_qp;
+ dev->destroy_qp = ipath_destroy_qp;
+ dev->post_send = ipath_post_send;
+ dev->post_recv = ipath_post_receive;
+ dev->post_srq_recv = ipath_post_srq_receive;
+ dev->create_cq = ipath_create_cq;
+ dev->destroy_cq = ipath_destroy_cq;
+ dev->poll_cq = ipath_poll_cq;
+ dev->req_notify_cq = ipath_req_notify_cq;
+ dev->get_dma_mr = ipath_get_dma_mr;
+ dev->reg_phys_mr = ipath_reg_phys_mr;
+ dev->reg_user_mr = ipath_reg_user_mr;
+ dev->dereg_mr = ipath_dereg_mr;
+ dev->alloc_fmr = ipath_alloc_fmr;
+ dev->map_phys_fmr = ipath_map_phys_fmr;
+ dev->unmap_fmr = ipath_unmap_fmr;
+ dev->dealloc_fmr = ipath_dealloc_fmr;
+ dev->attach_mcast = ipath_multicast_attach;
+ dev->detach_mcast = ipath_multicast_detach;
+ dev->process_mad = ipath_process_mad;
+
+ ret = ib_register_device(dev);
+ if (ret)
+ goto err_reg;
+
+ /*
+	 * We don't need to register a MAD agent; we just need to create
+ * a linker dependency on ib_mad so the module is loaded before
+ * this module is initialized. The call to ib_register_device()
+ * above will then cause ib_mad to create QP 0 & 1.
+ */
+ (void) ib_register_mad_agent(dev, 1, (enum ib_qp_type) 2,
+ NULL, 0, NULL, NULL, NULL);
+
+ for (i = 0; i < ARRAY_SIZE(ipath_class_attributes); ++i) {
+ ret = class_device_create_file(&dev->class_dev,
+ ipath_class_attributes[i]);
+ if (ret)
+ goto err_class;
+ }
+
+ ipath_layer_enable_timer(t);
+
+ ipath_devices[t] = idev;
+ return 0;
+
+err_class:
+ ib_unregister_device(dev);
+err_reg:
+ kfree(idev->lk_table.table);
+err_lk:
+ kfree(idev->qp_table.table);
+err_qp:
+ ib_dealloc_device(dev);
+ return ret;
+}
+
+static void ipath_unregister_ib_device(struct ipath_ibdev *dev)
+{
+ struct ib_device *ibdev = &dev->ibdev;
+
+ ipath_layer_disable_timer(dev->ib_unit);
+
+ ib_unregister_device(ibdev);
+
+ if (!list_empty(&dev->pending[0]) || !list_empty(&dev->pending[1]) ||
+ !list_empty(&dev->pending[2]))
+ _VERBS_ERROR("ipath%d pending list not empty!\n", dev->ib_unit);
+ if (!list_empty(&dev->piowait))
+ _VERBS_ERROR("ipath%d piowait list not empty!\n", dev->ib_unit);
+ if (!list_empty(&dev->rnrwait))
+ _VERBS_ERROR("ipath%d rnrwait list not empty!\n", dev->ib_unit);
+ if (mcast_tree.rb_node != NULL)
+ _VERBS_ERROR("ipath%d multicast table memory leak!\n",
+ dev->ib_unit);
+ /*
+ * Note that ipath_unregister_ib_device() can be called before all
+ * the QPs are destroyed!
+ */
+ ipath_free_all_qps(&dev->qp_table);
+ kfree(dev->qp_table.table);
+ kfree(dev->lk_table.table);
+ ib_dealloc_device(ibdev);
+}
+
+int __init ipath_verbs_init(void)
+{
+ int i;
+
+ number_of_devices = ipath_layer_get_num_of_dev();
+ i = number_of_devices * sizeof(struct ipath_ibdev *);
+ ipath_devices = kmalloc(i, GFP_ATOMIC);
+ if (ipath_devices == NULL)
+ return -ENOMEM;
+
+ for (i = 0; i < number_of_devices; i++) {
+ int ret = ipath_verbs_register(i, ipath_ib_piobufavail,
+ ipath_ib_rcv, ipath_ib_timer);
+
+ if (ret == 0)
+ ipath_devices[i] = NULL;
+ else if ((ret = ipath_register_ib_device(i)) != 0) {
+ _VERBS_ERROR("ib_ipath%d cannot register ib device "
+ "(%d)!\n", i, ret);
+ ipath_verbs_unregister(i);
+ ipath_devices[i] = NULL;
+ }
+ }
+
+ return 0;
+}
+
+void __exit ipath_verbs_cleanup(void)
+{
+ int i;
+
+ for (i = 0; i < number_of_devices; i++)
+ if (ipath_devices[i]) {
+ ipath_unregister_ib_device(ipath_devices[i]);
+ ipath_verbs_unregister(i);
+ }
+
+ kfree(ipath_devices);
+}
+
+module_init(ipath_verbs_init);
+module_exit(ipath_verbs_cleanup);
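
As an aside for reviewers, here is a minimal userspace sketch (illustrative
only, not part of the patch) of the LKEY layout described in the lk_table
comment in ipath_register_ib_device() above: the top ib_ipath_lkey_table_size
bits index the table, the low 8 bits are owned by the user, and the bits in
between act as a generation tag.  The value 10 below is only an assumed
example for the number of index bits, not a value taken from the driver.

#include <stdint.h>
#include <stdio.h>

#define LKEY_TABLE_BITS	10	/* assumed example of ib_ipath_lkey_table_size */
#define LKEY_USER_BITS	8	/* low bits owned by the user, per the comment */
#define LKEY_GEN_BITS	(32 - LKEY_TABLE_BITS - LKEY_USER_BITS)

static void decode_lkey(uint32_t lkey)
{
	uint32_t index = lkey >> (32 - LKEY_TABLE_BITS);
	uint32_t gen = (lkey >> LKEY_USER_BITS) & ((1U << LKEY_GEN_BITS) - 1);
	uint32_t user = lkey & ((1U << LKEY_USER_BITS) - 1);

	printf("lkey 0x%08x -> index %u, generation %u, user bits 0x%02x\n",
	       lkey, index, gen, user);
}

int main(void)
{
	decode_lkey(0x00400123);	/* arbitrary example value */
	return 0;
}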

2005-12-29 00:41:47

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 12 of 20] ipath - misc driver support code

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_ht400.c
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_ht400.c Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,1137 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+/*
+ * The first part of this file is shared with the diags; the second
+ * part is used only in the kernel.
+ */
+
+#include <stddef.h> /* for offsetof */
+
+#include <linux/types.h>
+#include <linux/time.h>
+#include <linux/timer.h>
+#include <linux/wait.h>
+#include "ipath_kernel.h"
+
+#include "ipath_registers.h"
+#include "ipath_common.h"
+
+/*
+ * This lists the InfiniPath registers, in the actual chip layout. This
+ * structure should never be directly accessed. It is included by the
+ * user mode diags, and so must be able to be compiled in both user
+ * and kernel mode.
+ */
+struct _infinipath_do_not_use_kernel_regs {
+ unsigned long long Revision;
+ unsigned long long Control;
+ unsigned long long PageAlign;
+ unsigned long long PortCnt;
+ unsigned long long DebugPortSelect;
+ unsigned long long DebugPort;
+ unsigned long long SendRegBase;
+ unsigned long long UserRegBase;
+ unsigned long long CounterRegBase;
+ unsigned long long Scratch;
+ unsigned long long ReservedMisc1;
+ unsigned long long InterruptConfig;
+ unsigned long long IntBlocked;
+ unsigned long long IntMask;
+ unsigned long long IntStatus;
+ unsigned long long IntClear;
+ unsigned long long ErrorMask;
+ unsigned long long ErrorStatus;
+ unsigned long long ErrorClear;
+ unsigned long long HwErrMask;
+ unsigned long long HwErrStatus;
+ unsigned long long HwErrClear;
+ unsigned long long HwDiagCtrl;
+ unsigned long long MDIO;
+ unsigned long long IBCStatus;
+ unsigned long long IBCCtrl;
+ unsigned long long ExtStatus;
+ unsigned long long ExtCtrl;
+ unsigned long long GPIOOut;
+ unsigned long long GPIOMask;
+ unsigned long long GPIOStatus;
+ unsigned long long GPIOClear;
+ unsigned long long RcvCtrl;
+ unsigned long long RcvBTHQP;
+ unsigned long long RcvHdrSize;
+ unsigned long long RcvHdrCnt;
+ unsigned long long RcvHdrEntSize;
+ unsigned long long RcvTIDBase;
+ unsigned long long RcvTIDCnt;
+ unsigned long long RcvEgrBase;
+ unsigned long long RcvEgrCnt;
+ unsigned long long RcvBufBase;
+ unsigned long long RcvBufSize;
+ unsigned long long RxIntMemBase;
+ unsigned long long RxIntMemSize;
+ unsigned long long RcvPartitionKey;
+ unsigned long long ReservedRcv[10];
+ unsigned long long SendCtrl;
+ unsigned long long SendPIOBufBase;
+ unsigned long long SendPIOSize;
+ unsigned long long SendPIOBufCnt;
+ unsigned long long SendPIOAvailAddr;
+ unsigned long long TxIntMemBase;
+ unsigned long long TxIntMemSize;
+ unsigned long long ReservedSend[9];
+ unsigned long long SendBufferError;
+ unsigned long long SendBufferErrorCONT1;
+ unsigned long long SendBufferErrorCONT2;
+ unsigned long long SendBufferErrorCONT3;
+ unsigned long long ReservedSBE[4];
+ unsigned long long RcvHdrAddr0;
+ unsigned long long RcvHdrAddr1;
+ unsigned long long RcvHdrAddr2;
+ unsigned long long RcvHdrAddr3;
+ unsigned long long RcvHdrAddr4;
+ unsigned long long RcvHdrAddr5;
+ unsigned long long RcvHdrAddr6;
+ unsigned long long RcvHdrAddr7;
+ unsigned long long RcvHdrAddr8;
+ unsigned long long ReservedRHA[7];
+ unsigned long long RcvHdrTailAddr0;
+ unsigned long long RcvHdrTailAddr1;
+ unsigned long long RcvHdrTailAddr2;
+ unsigned long long RcvHdrTailAddr3;
+ unsigned long long RcvHdrTailAddr4;
+ unsigned long long RcvHdrTailAddr5;
+ unsigned long long RcvHdrTailAddr6;
+ unsigned long long RcvHdrTailAddr7;
+ unsigned long long RcvHdrTailAddr8;
+ unsigned long long ReservedRHTA[7];
+ unsigned long long Sync; /* Software only */
+ unsigned long long Dump; /* Software only */
+ unsigned long long SimVer; /* Software only */
+ unsigned long long ReservedSW[5];
+ unsigned long long SerdesConfig0;
+ unsigned long long SerdesConfig1;
+ unsigned long long SerdesStatus;
+ unsigned long long XGXSConfig;
+ unsigned long long ReservedSW2[4];
+};
+
+#define IPATH_KREG_OFFSET(field) (offsetof(struct \
+ _infinipath_do_not_use_kernel_regs, field) / sizeof(uint64_t))
+#define IPATH_CREG_OFFSET(field) (offsetof( \
+ struct infinipath_counters, field) / sizeof(uint64_t))
+
+ipath_kreg
+ kr_control = IPATH_KREG_OFFSET(Control),
+ kr_counterregbase = IPATH_KREG_OFFSET(CounterRegBase),
+ kr_debugport = IPATH_KREG_OFFSET(DebugPort),
+ kr_debugportselect = IPATH_KREG_OFFSET(DebugPortSelect),
+ kr_errorclear = IPATH_KREG_OFFSET(ErrorClear),
+ kr_errormask = IPATH_KREG_OFFSET(ErrorMask),
+ kr_errorstatus = IPATH_KREG_OFFSET(ErrorStatus),
+ kr_extctrl = IPATH_KREG_OFFSET(ExtCtrl),
+ kr_extstatus = IPATH_KREG_OFFSET(ExtStatus),
+ kr_gpio_clear = IPATH_KREG_OFFSET(GPIOClear),
+ kr_gpio_mask = IPATH_KREG_OFFSET(GPIOMask),
+ kr_gpio_out = IPATH_KREG_OFFSET(GPIOOut),
+ kr_gpio_status = IPATH_KREG_OFFSET(GPIOStatus),
+ kr_hwdiagctrl = IPATH_KREG_OFFSET(HwDiagCtrl),
+ kr_hwerrclear = IPATH_KREG_OFFSET(HwErrClear),
+ kr_hwerrmask = IPATH_KREG_OFFSET(HwErrMask),
+ kr_hwerrstatus = IPATH_KREG_OFFSET(HwErrStatus),
+ kr_ibcctrl = IPATH_KREG_OFFSET(IBCCtrl),
+ kr_ibcstatus = IPATH_KREG_OFFSET(IBCStatus),
+ kr_intblocked = IPATH_KREG_OFFSET(IntBlocked),
+ kr_intclear = IPATH_KREG_OFFSET(IntClear),
+ kr_interruptconfig = IPATH_KREG_OFFSET(InterruptConfig),
+ kr_intmask = IPATH_KREG_OFFSET(IntMask),
+ kr_intstatus = IPATH_KREG_OFFSET(IntStatus),
+ kr_mdio = IPATH_KREG_OFFSET(MDIO),
+ kr_pagealign = IPATH_KREG_OFFSET(PageAlign),
+ kr_partitionkey = IPATH_KREG_OFFSET(RcvPartitionKey),
+ kr_portcnt = IPATH_KREG_OFFSET(PortCnt),
+ kr_rcvbthqp = IPATH_KREG_OFFSET(RcvBTHQP),
+ kr_rcvbufbase = IPATH_KREG_OFFSET(RcvBufBase),
+ kr_rcvbufsize = IPATH_KREG_OFFSET(RcvBufSize),
+ kr_rcvctrl = IPATH_KREG_OFFSET(RcvCtrl),
+ kr_rcvegrbase = IPATH_KREG_OFFSET(RcvEgrBase),
+ kr_rcvegrcnt = IPATH_KREG_OFFSET(RcvEgrCnt),
+ kr_rcvhdrcnt = IPATH_KREG_OFFSET(RcvHdrCnt),
+ kr_rcvhdrentsize = IPATH_KREG_OFFSET(RcvHdrEntSize),
+ kr_rcvhdrsize = IPATH_KREG_OFFSET(RcvHdrSize),
+ kr_rcvintmembase = IPATH_KREG_OFFSET(RxIntMemBase),
+ kr_rcvintmemsize = IPATH_KREG_OFFSET(RxIntMemSize),
+ kr_rcvtidbase = IPATH_KREG_OFFSET(RcvTIDBase),
+ kr_rcvtidcnt = IPATH_KREG_OFFSET(RcvTIDCnt),
+ kr_revision = IPATH_KREG_OFFSET(Revision),
+ kr_scratch = IPATH_KREG_OFFSET(Scratch),
+ kr_sendbuffererror = IPATH_KREG_OFFSET(SendBufferError),
+ kr_sendbuffererror1 = IPATH_KREG_OFFSET(SendBufferErrorCONT1),
+ kr_sendbuffererror2 = IPATH_KREG_OFFSET(SendBufferErrorCONT2),
+ kr_sendbuffererror3 = IPATH_KREG_OFFSET(SendBufferErrorCONT3),
+ kr_sendctrl = IPATH_KREG_OFFSET(SendCtrl),
+ kr_sendpioavailaddr = IPATH_KREG_OFFSET(SendPIOAvailAddr),
+ kr_sendpiobufbase = IPATH_KREG_OFFSET(SendPIOBufBase),
+ kr_sendpiobufcnt = IPATH_KREG_OFFSET(SendPIOBufCnt),
+ kr_sendpiosize = IPATH_KREG_OFFSET(SendPIOSize),
+ kr_sendregbase = IPATH_KREG_OFFSET(SendRegBase),
+ kr_txintmembase = IPATH_KREG_OFFSET(TxIntMemBase),
+ kr_txintmemsize = IPATH_KREG_OFFSET(TxIntMemSize),
+ kr_userregbase = IPATH_KREG_OFFSET(UserRegBase),
+ kr_serdesconfig0 = IPATH_KREG_OFFSET(SerdesConfig0),
+ kr_serdesconfig1 = IPATH_KREG_OFFSET(SerdesConfig1),
+ kr_serdesstatus = IPATH_KREG_OFFSET(SerdesStatus),
+ kr_xgxsconfig = IPATH_KREG_OFFSET(XGXSConfig),
+ /*
+ * last valid direct use register other than diag-only registers
+ */
+ __kr_lastvaliddirect = IPATH_KREG_OFFSET(ReservedSW2[0]),
+ /* always invalid for initializing */
+ __kr_invalid = IPATH_KREG_OFFSET(ReservedSW2[0]) + 1,
+ /*
+ * These should not be used directly via ipath_kget_kreg64(),
+ * use them with ipath_kget_kreg64_port()
+ */
+ kr_rcvhdraddr = IPATH_KREG_OFFSET(RcvHdrAddr0), /* not for direct use */
+ /* not for direct use */
+ kr_rcvhdrtailaddr = IPATH_KREG_OFFSET(RcvHdrTailAddr0),
+ /* we define the full set for the diags, the kernel doesn't use them */
+ kr_rcvhdraddr1 = IPATH_KREG_OFFSET(RcvHdrAddr1),
+ kr_rcvhdraddr2 = IPATH_KREG_OFFSET(RcvHdrAddr2),
+ kr_rcvhdraddr3 = IPATH_KREG_OFFSET(RcvHdrAddr3),
+ kr_rcvhdraddr4 = IPATH_KREG_OFFSET(RcvHdrAddr4),
+ kr_rcvhdrtailaddr1 = IPATH_KREG_OFFSET(RcvHdrTailAddr1),
+ kr_rcvhdrtailaddr2 = IPATH_KREG_OFFSET(RcvHdrTailAddr2),
+ kr_rcvhdrtailaddr3 = IPATH_KREG_OFFSET(RcvHdrTailAddr3),
+ kr_rcvhdrtailaddr4 = IPATH_KREG_OFFSET(RcvHdrTailAddr4),
+ kr_rcvhdraddr5 = IPATH_KREG_OFFSET(RcvHdrAddr5),
+ kr_rcvhdraddr6 = IPATH_KREG_OFFSET(RcvHdrAddr6),
+ kr_rcvhdraddr7 = IPATH_KREG_OFFSET(RcvHdrAddr7),
+ kr_rcvhdraddr8 = IPATH_KREG_OFFSET(RcvHdrAddr8),
+ kr_rcvhdrtailaddr5 = IPATH_KREG_OFFSET(RcvHdrTailAddr5),
+ kr_rcvhdrtailaddr6 = IPATH_KREG_OFFSET(RcvHdrTailAddr6),
+ kr_rcvhdrtailaddr7 = IPATH_KREG_OFFSET(RcvHdrTailAddr7),
+ kr_rcvhdrtailaddr8 = IPATH_KREG_OFFSET(RcvHdrTailAddr8);
+
+/*
+ * first of the pioavail registers, the total number is
+ * (kr_sendpiobufcnt / 32); each buffer uses 2 bits.
+ * More properly, it's:
+ * (kr_sendpiobufcnt / ((sizeof(uint64_t)*BITS_PER_BYTE)/2))
+ */
+ipath_sreg sr_sendpioavail = 0;
+
+ipath_creg
+ cr_badformatcnt = IPATH_CREG_OFFSET(RxBadFormatCnt),
+ cr_erricrccnt = IPATH_CREG_OFFSET(RxICRCErrCnt),
+ cr_errlinkcnt = IPATH_CREG_OFFSET(RxLinkProblemCnt),
+ cr_errlpcrccnt = IPATH_CREG_OFFSET(RxLPCRCErrCnt),
+ cr_errpkey = IPATH_CREG_OFFSET(RxPKeyMismatchCnt),
+ cr_errrcvflowctrlcnt = IPATH_CREG_OFFSET(RxFlowCtrlErrCnt),
+ cr_err_rlencnt = IPATH_CREG_OFFSET(RxLenErrCnt),
+ cr_errslencnt = IPATH_CREG_OFFSET(TxLenErrCnt),
+ cr_errtidfull = IPATH_CREG_OFFSET(RxTIDFullErrCnt),
+ cr_errtidvalid = IPATH_CREG_OFFSET(RxTIDValidErrCnt),
+ cr_errvcrccnt = IPATH_CREG_OFFSET(RxVCRCErrCnt),
+ cr_ibstatuschange = IPATH_CREG_OFFSET(IBStatusChangeCnt),
+ /* calc from Reg_CounterRegBase + offset */
+ cr_intcnt = IPATH_CREG_OFFSET(LBIntCnt),
+ cr_invalidrlencnt = IPATH_CREG_OFFSET(RxMaxMinLenErrCnt),
+ cr_invalidslencnt = IPATH_CREG_OFFSET(TxMaxMinLenErrCnt),
+ cr_lbflowstallcnt = IPATH_CREG_OFFSET(LBFlowStallCnt),
+ cr_pktrcvcnt = IPATH_CREG_OFFSET(RxDataPktCnt),
+ cr_pktrcvflowctrlcnt = IPATH_CREG_OFFSET(RxFlowPktCnt),
+ cr_pktsendcnt = IPATH_CREG_OFFSET(TxDataPktCnt),
+ cr_pktsendflowcnt = IPATH_CREG_OFFSET(TxFlowPktCnt),
+ cr_portovflcnt = IPATH_CREG_OFFSET(RxP0HdrEgrOvflCnt),
+ cr_portovflcnt1 = IPATH_CREG_OFFSET(RxP1HdrEgrOvflCnt),
+ cr_portovflcnt2 = IPATH_CREG_OFFSET(RxP2HdrEgrOvflCnt),
+ cr_portovflcnt3 = IPATH_CREG_OFFSET(RxP3HdrEgrOvflCnt),
+ cr_portovflcnt4 = IPATH_CREG_OFFSET(RxP4HdrEgrOvflCnt),
+ cr_portovflcnt5 = IPATH_CREG_OFFSET(RxP5HdrEgrOvflCnt),
+ cr_portovflcnt6 = IPATH_CREG_OFFSET(RxP6HdrEgrOvflCnt),
+ cr_portovflcnt7 = IPATH_CREG_OFFSET(RxP7HdrEgrOvflCnt),
+ cr_portovflcnt8 = IPATH_CREG_OFFSET(RxP8HdrEgrOvflCnt),
+ cr_rcvebpcnt = IPATH_CREG_OFFSET(RxEBPCnt),
+ cr_rcvovflcnt = IPATH_CREG_OFFSET(RxBufOvflCnt),
+ cr_senddropped = IPATH_CREG_OFFSET(TxDroppedPktCnt),
+ cr_sendstallcnt = IPATH_CREG_OFFSET(TxFlowStallCnt),
+ cr_sendunderruncnt = IPATH_CREG_OFFSET(TxUnderrunCnt),
+ cr_wordrcvcnt = IPATH_CREG_OFFSET(RxDwordCnt),
+ cr_wordsendcnt = IPATH_CREG_OFFSET(TxDwordCnt),
+ cr_unsupvlcnt = IPATH_CREG_OFFSET(TxUnsupVLErrCnt),
+ cr_rxdroppktcnt = IPATH_CREG_OFFSET(RxDroppedPktCnt),
+ cr_iblinkerrrecovcnt = IPATH_CREG_OFFSET(IBLinkErrRecoveryCnt),
+ cr_iblinkdowncnt = IPATH_CREG_OFFSET(IBLinkDownedCnt),
+ cr_ibsymbolerrcnt = IPATH_CREG_OFFSET(IBSymbolErrCnt);
+
+/* kr_sendctrl bits */
+#define INFINIPATH_S_DISARMPIOBUF_MASK 0xFF
+
+/* kr_rcvctrl bits */
+#define INFINIPATH_R_PORTENABLE_MASK 0x1FF
+#define INFINIPATH_R_INTRAVAIL_MASK 0x1FF
+
+/* kr_intstatus, kr_intclear, kr_intmask bits */
+#define INFINIPATH_I_RCVURG_MASK 0x1FF
+#define INFINIPATH_I_RCVAVAIL_MASK 0x1FF
+
+/* kr_hwerrclear, kr_hwerrmask, kr_hwerrstatus, bits */
+#define INFINIPATH_HWE_HTCMEMPARITYERR_MASK 0x3FFFFFULL
+#define INFINIPATH_HWE_HTCLNKABYTE0CRCERR 0x0000000000800000ULL
+#define INFINIPATH_HWE_HTCLNKABYTE1CRCERR 0x0000000001000000ULL
+#define INFINIPATH_HWE_HTCLNKBBYTE0CRCERR 0x0000000002000000ULL
+#define INFINIPATH_HWE_HTCLNKBBYTE1CRCERR 0x0000000004000000ULL
+#define INFINIPATH_HWE_HTCMISCERR4 0x0000000008000000ULL
+#define INFINIPATH_HWE_HTCMISCERR5 0x0000000010000000ULL
+#define INFINIPATH_HWE_HTCMISCERR6 0x0000000020000000ULL
+#define INFINIPATH_HWE_HTCMISCERR7 0x0000000040000000ULL
+#define INFINIPATH_HWE_MEMBISTFAILED 0x0040000000000000ULL
+#define INFINIPATH_HWE_COREPLL_FBSLIP 0x0080000000000000ULL
+#define INFINIPATH_HWE_COREPLL_RFSLIP 0x0100000000000000ULL
+#define INFINIPATH_HWE_HTBPLL_FBSLIP 0x0200000000000000ULL
+#define INFINIPATH_HWE_HTBPLL_RFSLIP 0x0400000000000000ULL
+#define INFINIPATH_HWE_HTAPLL_FBSLIP 0x0800000000000000ULL
+#define INFINIPATH_HWE_HTAPLL_RFSLIP 0x1000000000000000ULL
+#define INFINIPATH_HWE_EXTSERDESPLLFAILED 0x2000000000000000ULL
+
+/* kr_hwdiagctrl bits */
+#define INFINIPATH_DC_NUMHTMEMS 22
+
+/* kr_extstatus bits */
+#define INFINIPATH_EXTS_FREQSEL 0x2
+#define INFINIPATH_EXTS_SERDESSEL 0x4
+#define INFINIPATH_EXTS_MEMBIST_ENDTEST 0x0000000000004000
+#define INFINIPATH_EXTS_MEMBIST_CORRECT 0x0000000000008000
+
+/* kr_extctrl bits */
+
+/*
+ * masks and bits that are different in different chips, or present only
+ * in one
+ */
+const uint32_t infinipath_i_rcvavail_mask = INFINIPATH_I_RCVAVAIL_MASK;
+const uint32_t infinipath_i_rcvurg_mask = INFINIPATH_I_RCVURG_MASK;
+const uint64_t infinipath_hwe_htcmemparityerr_mask =
+ INFINIPATH_HWE_HTCMEMPARITYERR_MASK;
+
+const uint64_t infinipath_hwe_spibdcmlockfailed_mask = 0ULL;
+const uint64_t infinipath_hwe_sphtdcmlockfailed_mask = 0ULL;
+const uint64_t infinipath_hwe_htcdcmlockfailed_mask = 0ULL;
+const uint64_t infinipath_hwe_htcdcmlockfailed_shift = 0ULL;
+const uint64_t infinipath_hwe_sphtdcmlockfailed_shift = 0ULL;
+const uint64_t infinipath_hwe_spibdcmlockfailed_shift = 0ULL;
+
+const uint64_t infinipath_hwe_htclnkabyte0crcerr =
+ INFINIPATH_HWE_HTCLNKABYTE0CRCERR;
+const uint64_t infinipath_hwe_htclnkabyte1crcerr =
+ INFINIPATH_HWE_HTCLNKABYTE1CRCERR;
+const uint64_t infinipath_hwe_htclnkbbyte0crcerr =
+ INFINIPATH_HWE_HTCLNKBBYTE0CRCERR;
+const uint64_t infinipath_hwe_htclnkbbyte1crcerr =
+ INFINIPATH_HWE_HTCLNKBBYTE1CRCERR;
+
+const uint64_t infinipath_c_bitsextant =
+ (INFINIPATH_C_FREEZEMODE | INFINIPATH_C_LINKENABLE);
+
+const uint64_t infinipath_s_bitsextant =
+ (INFINIPATH_S_ABORT | INFINIPATH_S_PIOINTBUFAVAIL |
+ INFINIPATH_S_PIOBUFAVAILUPD | INFINIPATH_S_PIOENABLE |
+ INFINIPATH_S_DISARM |
+ (INFINIPATH_S_DISARMPIOBUF_MASK << INFINIPATH_S_DISARMPIOBUF_SHIFT));
+
+const uint64_t infinipath_r_bitsextant =
+ ((INFINIPATH_R_PORTENABLE_MASK << INFINIPATH_R_PORTENABLE_SHIFT) |
+ (INFINIPATH_R_INTRAVAIL_MASK << INFINIPATH_R_INTRAVAIL_SHIFT) |
+ INFINIPATH_R_TAILUPD);
+
+const uint64_t infinipath_i_bitsextant =
+ ((INFINIPATH_I_RCVURG_MASK << INFINIPATH_I_RCVURG_SHIFT) |
+ (INFINIPATH_I_RCVAVAIL_MASK << INFINIPATH_I_RCVAVAIL_SHIFT) |
+ INFINIPATH_I_ERROR | INFINIPATH_I_SPIOSENT |
+ INFINIPATH_I_SPIOBUFAVAIL | INFINIPATH_I_GPIO);
+
+const uint64_t infinipath_e_bitsextant =
+ (INFINIPATH_E_RFORMATERR | INFINIPATH_E_RVCRC | INFINIPATH_E_RICRC |
+ INFINIPATH_E_RMINPKTLEN | INFINIPATH_E_RMAXPKTLEN |
+ INFINIPATH_E_RLONGPKTLEN | INFINIPATH_E_RSHORTPKTLEN |
+ INFINIPATH_E_RUNEXPCHAR | INFINIPATH_E_RUNSUPVL | INFINIPATH_E_REBP |
+ INFINIPATH_E_RIBFLOW | INFINIPATH_E_RBADVERSION |
+ INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL |
+ INFINIPATH_E_RBADTID | INFINIPATH_E_RHDRLEN |
+ INFINIPATH_E_RHDR | INFINIPATH_E_RIBLOSTLINK |
+ INFINIPATH_E_SMINPKTLEN | INFINIPATH_E_SMAXPKTLEN |
+ INFINIPATH_E_SUNDERRUN | INFINIPATH_E_SPKTLEN |
+ INFINIPATH_E_SDROPPEDSMPPKT | INFINIPATH_E_SDROPPEDDATAPKT |
+ INFINIPATH_E_SPIOARMLAUNCH | INFINIPATH_E_SUNEXPERRPKTNUM |
+ INFINIPATH_E_SUNSUPVL | INFINIPATH_E_IBSTATUSCHANGED |
+ INFINIPATH_E_INVALIDADDR | INFINIPATH_E_RESET | INFINIPATH_E_HARDWARE);
+
+const uint64_t infinipath_hwe_bitsextant =
+ (INFINIPATH_HWE_HTCMEMPARITYERR_MASK <<
+ INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT) |
+ (INFINIPATH_HWE_TXEMEMPARITYERR_MASK <<
+ INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT) |
+ (INFINIPATH_HWE_RXEMEMPARITYERR_MASK <<
+ INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT) |
+ INFINIPATH_HWE_HTCLNKABYTE0CRCERR |
+ INFINIPATH_HWE_HTCLNKABYTE1CRCERR | INFINIPATH_HWE_HTCLNKBBYTE0CRCERR |
+ INFINIPATH_HWE_HTCLNKBBYTE1CRCERR | INFINIPATH_HWE_HTCMISCERR4 |
+ INFINIPATH_HWE_HTCMISCERR5 | INFINIPATH_HWE_HTCMISCERR6 |
+ INFINIPATH_HWE_HTCMISCERR7 | INFINIPATH_HWE_HTCBUSTREQPARITYERR |
+ INFINIPATH_HWE_HTCBUSTRESPPARITYERR |
+ INFINIPATH_HWE_HTCBUSIREQPARITYERR |
+ INFINIPATH_HWE_RXDSYNCMEMPARITYERR | INFINIPATH_HWE_MEMBISTFAILED |
+ INFINIPATH_HWE_COREPLL_FBSLIP | INFINIPATH_HWE_COREPLL_RFSLIP |
+ INFINIPATH_HWE_HTBPLL_FBSLIP | INFINIPATH_HWE_HTBPLL_RFSLIP |
+ INFINIPATH_HWE_HTAPLL_FBSLIP | INFINIPATH_HWE_HTAPLL_RFSLIP |
+ INFINIPATH_HWE_EXTSERDESPLLFAILED |
+ INFINIPATH_HWE_IBCBUSTOSPCPARITYERR |
+ INFINIPATH_HWE_IBCBUSFRSPCPARITYERR;
+
+const uint64_t infinipath_dc_bitsextant =
+ (INFINIPATH_DC_FORCEHTCMEMPARITYERR_MASK <<
+ INFINIPATH_DC_FORCEHTCMEMPARITYERR_SHIFT) |
+ (INFINIPATH_DC_FORCETXEMEMPARITYERR_MASK <<
+ INFINIPATH_DC_FORCETXEMEMPARITYERR_SHIFT) |
+ (INFINIPATH_DC_FORCERXEMEMPARITYERR_MASK <<
+ INFINIPATH_DC_FORCERXEMEMPARITYERR_SHIFT) |
+ INFINIPATH_DC_FORCEHTCBUSTREQPARITYERR |
+ INFINIPATH_DC_FORCEHTCBUSTRESPPARITYERR |
+ INFINIPATH_DC_FORCEHTCBUSIREQPARITYERR |
+ INFINIPATH_DC_FORCERXDSYNCMEMPARITYERR |
+ INFINIPATH_DC_COUNTERDISABLE | INFINIPATH_DC_COUNTERWREN |
+ INFINIPATH_DC_FORCEIBCBUSTOSPCPARITYERR |
+ INFINIPATH_DC_FORCEIBCBUSFRSPCPARITYERR;
+
+const uint64_t infinipath_ibcc_bitsextant =
+ (INFINIPATH_IBCC_FLOWCTRLPERIOD_MASK <<
+ INFINIPATH_IBCC_FLOWCTRLPERIOD_SHIFT) |
+ (INFINIPATH_IBCC_FLOWCTRLWATERMARK_MASK <<
+ INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT) |
+ (INFINIPATH_IBCC_LINKINITCMD_MASK <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT) |
+ (INFINIPATH_IBCC_LINKCMD_MASK << INFINIPATH_IBCC_LINKCMD_SHIFT) |
+ (INFINIPATH_IBCC_MAXPKTLEN_MASK << INFINIPATH_IBCC_MAXPKTLEN_SHIFT) |
+ (INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK <<
+ INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) |
+ (INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK <<
+ INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) |
+ (INFINIPATH_IBCC_CREDITSCALE_MASK <<
+ INFINIPATH_IBCC_CREDITSCALE_SHIFT) |
+ INFINIPATH_IBCC_LOOPBACK | INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE;
+
+const uint64_t infinipath_mdio_bitsextant =
+ (INFINIPATH_MDIO_CLKDIV_MASK << INFINIPATH_MDIO_CLKDIV_SHIFT) |
+ (INFINIPATH_MDIO_COMMAND_MASK << INFINIPATH_MDIO_COMMAND_SHIFT) |
+ (INFINIPATH_MDIO_DEVADDR_MASK << INFINIPATH_MDIO_DEVADDR_SHIFT) |
+ (INFINIPATH_MDIO_REGADDR_MASK << INFINIPATH_MDIO_REGADDR_SHIFT) |
+ (INFINIPATH_MDIO_DATA_MASK << INFINIPATH_MDIO_DATA_SHIFT) |
+ INFINIPATH_MDIO_CMDVALID | INFINIPATH_MDIO_RDDATAVALID;
+
+const uint64_t infinipath_ibcs_bitsextant =
+ (INFINIPATH_IBCS_LINKTRAININGSTATE_MASK <<
+ INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) |
+ (INFINIPATH_IBCS_LINKSTATE_MASK << INFINIPATH_IBCS_LINKSTATE_SHIFT) |
+ INFINIPATH_IBCS_TXREADY | INFINIPATH_IBCS_TXCREDITOK;
+
+const uint64_t infinipath_extc_bitsextant =
+ (INFINIPATH_EXTC_GPIOINVERT_MASK << INFINIPATH_EXTC_GPIOINVERT_SHIFT) |
+ (INFINIPATH_EXTC_GPIOOE_MASK << INFINIPATH_EXTC_GPIOOE_SHIFT) |
+ INFINIPATH_EXTC_SERDESENABLE | INFINIPATH_EXTC_SERDESCONNECT |
+ INFINIPATH_EXTC_SERDESENTRUNKING | INFINIPATH_EXTC_SERDESDISRXFIFO |
+ INFINIPATH_EXTC_SERDESENPLPBK1 | INFINIPATH_EXTC_SERDESENPLPBK2 |
+ INFINIPATH_EXTC_SERDESENENCDEC | INFINIPATH_EXTC_LEDSECPORTGREENON |
+ INFINIPATH_EXTC_LEDSECPORTYELLOWON | INFINIPATH_EXTC_LEDPRIPORTGREENON |
+ INFINIPATH_EXTC_LEDPRIPORTYELLOWON | INFINIPATH_EXTC_LEDGBLOKGREENON |
+ INFINIPATH_EXTC_LEDGBLERRREDOFF;
+
+/* Start of Documentation block for SerDes registers
+ * serdes and xgxs register bits; not all have defines,
+ * since I haven't yet needed them all, and I'm lazy. Those that I needed
+ * are in ipath_registers.h
+
+serdesConfig0Out (R/W)
+ Default Value
+bit[3:0] - ResetA/B/C/D (4'b1111)
+bit[7:4] -L1PwrdnA/B/C/D (4'b0000)
+bit[11:8] - RxIdleEnX (4'b0000)
+bit[15:12] - TxIdleEnX (4'b0000)
+bit[19:16] - RxDetectEnX (4'b0000)
+bit[23:20] - BeaconTxEnX (4'b0000)
+bit[27:24] - RxTermEnX (4'b0000)
+bit[28] - ResetPLL (1'b0)
+bit[29] -L2Pwrdn (1'b0)
+bit[37:30] - Offset[7:0] (8'b00000000)
+bit[38] -OffsetEn (1'b0)
+bit[39] -ParLBPK (1'b0)
+bit[40] -ParReset (1'b0)
+bit[42:41] - RefSel (2'b10)
+bit[43] - PW (1'b0)
+bit[47:44] - LPBKA/B/C/D (4'b0000)
+bit[49:48] - ClkBufTermAdj (2'b0)
+bit[51:50] - RxTermAdj (2'b0)
+bit[53:52] - TxTermAdj (2'b0)
+bit[55:54] - RxEqCtl (2'b0)
+bit[63:56] - Reserved
+
+cce_wip_serdesConfig1Out[63:0] (R/W)
+bit[3:0] - HiDrvX (4'b0000)
+bit[7:4] - LoDrvX (4'b0000)
+bit[11:8] - DtxA[3:0] (4'b0000)
+bit[15:12] - DtxB[3:0] (4'b0000)
+bit[19:16] - DtxC[3:0] (4'b0000)
+bit[23:20] - DtxD[3:0] (4'b0000)
+bit[27:24] - DeqA[3:0] (4'b0000)
+bit[31:28] - DeqB[3:0] (4'b0000)
+bit[35:32] - DeqC[3:0] (4'b0000)
+bit[39:36] - DeqD[3:0] (4'b0000)
+Framer interface, bits 40-59, not used
+bit[44:40] - FmOffsetA[4:0] (5'b00000)
+bit[49:45] - FmOffsetB[4:0] (5'b00000)
+bit[54:50] - FmOffsetC[4:0] (5'b00000)
+bit[59:55] - FmOffsetD[4:0] (5'b00000)
+bit[63:60] - FmOffsetEnA/B/C/D (4'b0000)
+
+SerdesStatus[63:0] (RO)
+bit[3:0] - TxIdleDetectA/B/C/D
+bit[7:4] - RxDetectA/B/C/D
+bit[11:8] - BeaconDetectA/B/C/D
+bit[63:12] - Reserved
+
+XGXSConfigOut[63:0]
+bit[2:0] - Resets, init to 1; bit 0 unused?
+bit[3] - MDIO, select register bank for vendor specific register
+ (0x1e if set, else 0x1f); vendor-specific status in register 8
+ bits 0-3 lanes0-3 signal detect, 1 if detected
+ bits 4-7 lanes0-3 CTC fifo errors, 1 if detected (latched until read)
+bit[8:4] - MDIO port address
+bit[18:9] - lnk_sync_mask
+bit[22:19] - polarity inv
+
+Documentation end */
+
+/*
+ *
+ * General specs:
+ * ExtCtrl[63:48] = EXTC_GPIOOE[15:0]
+ * ExtCtrl[47:32] = EXTC_GPIOInvert[15:0]
+ * ExtStatus[63:48] = GpioIn[15:0]
+ *
+ * GPIO[1] = EEPROM_SDA
+ * GPIO[0] = EEPROM_SCL
+ */
+
+#define _IPATH_GPIO_SDA_NUM 1
+#define _IPATH_GPIO_SCL_NUM 0
+
+#define IPATH_GPIO_SDA \
+ (1UL << (_IPATH_GPIO_SDA_NUM+INFINIPATH_EXTC_GPIOOE_SHIFT))
+#define IPATH_GPIO_SCL \
+ (1UL << (_IPATH_GPIO_SCL_NUM+INFINIPATH_EXTC_GPIOOE_SHIFT))
+
+/*
+ * register bits for selecting i2c direction and values, used for I2C serial
+ * flash
+ */
+const uint16_t ipath_gpio_sda_num = _IPATH_GPIO_SDA_NUM;
+const uint16_t ipath_gpio_scl_num = _IPATH_GPIO_SCL_NUM;
+const uint64_t ipath_gpio_sda = IPATH_GPIO_SDA;
+const uint64_t ipath_gpio_scl = IPATH_GPIO_SCL;
+
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+
+/*
+ * This file contains all of the code that is specific to the InfiniPath
+ * HT-400 chip.
+ */
+
+/* we support up to 4 chips per system */
+const uint32_t infinipath_max = 4;
+struct ipath_devdata devdata[4];
+static const char *ipath_unit_names[4] = {
+ "infinipath0", "infinipath1", "infinipath2", "infinipath3"
+};
+
+const char *ipath_get_unit_name(int unit)
+{
+ return ipath_unit_names[unit];
+}
+
+static void ipath_check_htlink(ipath_type t);
+
+/*
+ * Display hardware errors.  Most hardware errors are catastrophic, but
+ * for right now we'll print them and continue.  We reuse the same
+ * message buffer as ipath_handle_errors() to avoid excessive stack use.
+ */
+void ipath_handle_hwerrors(const ipath_type t, char *msg, int msgl)
+{
+ uint64_t hwerrs = ipath_kget_kreg64(t, kr_hwerrstatus);
+ uint32_t bits, ctrl;
+ int isfatal = 0;
+ char bitsmsg[64];
+
+ if (!hwerrs) {
+ _IPATH_VDBG("Called but no hardware errors set\n");
+ /*
+		 * Better than printing confusing messages.
+ * This seems to be related to clearing the crc error, or
+ * the pll error during init.
+ */
+ return;
+ } else if (hwerrs == -1LL) {
+ _IPATH_UNIT_ERROR(t,
+ "Read of hardware error status failed (all bits set); ignoring\n");
+ return;
+ }
+ ipath_stats.sps_hwerrs++;
+
+ /*
+ * clear the error, regardless of whether we continue or stop using
+ * the chip.
+ */
+ ipath_kput_kreg(t, kr_hwerrclear, hwerrs);
+
+ hwerrs &= devdata[t].ipath_hwerrmask;
+
+ /*
+ * make sure we get this much out, unless told to be quiet,
+ * or it's occurred within the last 5 seconds
+ */
+ if ((hwerrs & ~devdata[t].ipath_lasthwerror) ||
+ (infinipath_debug & __IPATH_VERBDBG))
+ _IPATH_INFO("Hardware error: hwerr=0x%llx (cleared)\n", hwerrs);
+ devdata[t].ipath_lasthwerror |= hwerrs;
+
+ if (hwerrs & ~infinipath_hwe_bitsextant)
+ _IPATH_UNIT_ERROR(t,
+ "hwerror interrupt with unknown errors %llx set\n",
+ hwerrs & ~infinipath_hwe_bitsextant);
+
+ ctrl = ipath_kget_kreg32(t, kr_control);
+ if (ctrl & INFINIPATH_C_FREEZEMODE) {
+ if (hwerrs) {
+ /*
+			 * If any bits are set that we aren't ignoring, only
+			 * make the complaint once, in case it's stuck or
+			 * recurring, and we get here multiple times.
+ */
+ if (devdata[t].ipath_flags & IPATH_INITTED) {
+ _IPATH_UNIT_ERROR(t,
+ "Fatal Error (freezemode), no longer usable\n");
+ isfatal = 1;
+ }
+ *devdata[t].ipath_statusp &= ~IPATH_STATUS_IB_READY;
+ /* mark as having had error */
+ *devdata[t].ipath_statusp |= IPATH_STATUS_HWERROR;
+ /*
+ * mark as not usable, at a minimum until driver
+ * is reloaded, probably until reboot, since no
+ * other reset is possible.
+ */
+ devdata[t].ipath_flags &= ~IPATH_INITTED;
+ } else {
+ _IPATH_DBG
+ ("Clearing freezemode on ignored hardware error\n");
+ ctrl &= ~INFINIPATH_C_FREEZEMODE;
+ ipath_kput_kreg(t, kr_control, ctrl);
+ }
+ }
+
+ *msg = '\0';
+
+ /*
+ * may someday want to decode into which bits are which
+ * functional area for parity errors, etc.
+ */
+ if (hwerrs & (infinipath_hwe_htcmemparityerr_mask
+ << INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT)) {
+ bits = (uint32_t) ((hwerrs >>
+ INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT) &
+ INFINIPATH_HWE_HTCMEMPARITYERR_MASK);
+ snprintf(bitsmsg, sizeof bitsmsg, "[HTC Parity Errs %x] ",
+ bits);
+ strlcat(msg, bitsmsg, msgl);
+ }
+ if (hwerrs & (INFINIPATH_HWE_RXEMEMPARITYERR_MASK
+ << INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT)) {
+ bits = (uint32_t) ((hwerrs >>
+ INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT) &
+ INFINIPATH_HWE_RXEMEMPARITYERR_MASK);
+ snprintf(bitsmsg, sizeof bitsmsg, "[RXE Parity Errs %x] ",
+ bits);
+ strlcat(msg, bitsmsg, msgl);
+ }
+ if (hwerrs & (INFINIPATH_HWE_TXEMEMPARITYERR_MASK
+ << INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT)) {
+ bits = (uint32_t) ((hwerrs >>
+ INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT) &
+ INFINIPATH_HWE_TXEMEMPARITYERR_MASK);
+ snprintf(bitsmsg, sizeof bitsmsg, "[TXE Parity Errs %x] ",
+ bits);
+ strlcat(msg, bitsmsg, msgl);
+ }
+ if (hwerrs & INFINIPATH_HWE_IBCBUSTOSPCPARITYERR)
+ strlcat(msg, "[IB2IPATH Parity]", msgl);
+ if (hwerrs & INFINIPATH_HWE_IBCBUSFRSPCPARITYERR)
+ strlcat(msg, "[IPATH2IB Parity]", msgl);
+ if (hwerrs & INFINIPATH_HWE_HTCBUSIREQPARITYERR)
+ strlcat(msg, "[HTC Ireq Parity]", msgl);
+ if (hwerrs & INFINIPATH_HWE_HTCBUSTREQPARITYERR)
+ strlcat(msg, "[HTC Treq Parity]", msgl);
+ if (hwerrs & INFINIPATH_HWE_HTCBUSTRESPPARITYERR)
+ strlcat(msg, "[HTC Tresp Parity]", msgl);
+
+/* keep the code below somewhat more readable; not used elsewhere */
+#define _IPATH_HTLINK0_CRCBITS (infinipath_hwe_htclnkabyte0crcerr | \
+ infinipath_hwe_htclnkabyte1crcerr)
+#define _IPATH_HTLINK1_CRCBITS (infinipath_hwe_htclnkbbyte0crcerr | \
+ infinipath_hwe_htclnkbbyte1crcerr)
+#define _IPATH_HTLANE0_CRCBITS (infinipath_hwe_htclnkabyte0crcerr | \
+ infinipath_hwe_htclnkbbyte0crcerr)
+#define _IPATH_HTLANE1_CRCBITS (infinipath_hwe_htclnkabyte1crcerr | \
+ infinipath_hwe_htclnkbbyte1crcerr)
+ if (hwerrs & (_IPATH_HTLINK0_CRCBITS | _IPATH_HTLINK1_CRCBITS)) {
+ char bitsmsg[64];
+ uint64_t crcbits = hwerrs &
+ (_IPATH_HTLINK0_CRCBITS | _IPATH_HTLINK1_CRCBITS);
+ /* don't check if 8bit HT */
+ if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT0)
+ crcbits &= ~infinipath_hwe_htclnkabyte1crcerr;
+ /* don't check if 8bit HT */
+ if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT1)
+ crcbits &= ~infinipath_hwe_htclnkbbyte1crcerr;
+ /*
+ * we'll want to ignore link errors on link that is
+ * not in use, if any. For now, complain about both
+ */
+ if (crcbits) {
+ uint16_t ctrl0, ctrl1;
+ snprintf(bitsmsg, sizeof bitsmsg,
+ "[HT%s lane %s CRC (%llx); ignore till reload]",
+ !(crcbits & _IPATH_HTLINK1_CRCBITS) ?
+ "0 (A)" : (!(crcbits & _IPATH_HTLINK0_CRCBITS)
+ ? "1 (B)" : "0+1 (A+B)"),
+ !(crcbits & _IPATH_HTLANE1_CRCBITS) ? "0"
+ : (!(crcbits & _IPATH_HTLANE0_CRCBITS) ? "1" :
+ "0+1"), crcbits);
+ strlcat(msg, bitsmsg, msgl);
+
+ /*
+ * print extra info for debugging.
+ * slave/primary config word 4, 8 (link control 0, 1)
+ */
+
+ if (pci_read_config_word(devdata[t].pcidev,
+ devdata[t].ipath_ht_slave_off +
+ 0x4, &ctrl0))
+ _IPATH_INFO
+ ("Couldn't read linkctrl0 of slave/primary config block\n");
+ else if (!(ctrl0 & 1 << 6)) /* not if EOC bit set */
+ _IPATH_DBG("HT linkctrl0 0x%x%s%s\n", ctrl0,
+ ((ctrl0 >> 8) & 7) ? " CRC" : "",
+ ((ctrl0 >> 4) & 1) ? "linkfail" :
+ "");
+ if (pci_read_config_word
+ (devdata[t].pcidev,
+ devdata[t].ipath_ht_slave_off + 0x8, &ctrl1))
+ _IPATH_INFO
+ ("Couldn't read linkctrl1 of slave/primary config block\n");
+ else if (!(ctrl1 & 1 << 6)) /* not if EOC bit set */
+ _IPATH_DBG("HT linkctrl1 0x%x%s%s\n", ctrl1,
+ ((ctrl1 >> 8) & 7) ? " CRC" : "",
+ ((ctrl1 >> 4) & 1) ? "linkfail" :
+ "");
+
+ /* disable until driver reloaded */
+ devdata[t].ipath_hwerrmask &= ~crcbits;
+ ipath_kput_kreg(t, kr_hwerrmask,
+ devdata[t].ipath_hwerrmask);
+ _IPATH_DBG("HT crc errs: %s\n", msg);
+ } else
+ _IPATH_DBG
+ ("ignoring HT crc errors 0x%llx, not in use\n",
+ hwerrs & (_IPATH_HTLINK0_CRCBITS |
+ _IPATH_HTLINK1_CRCBITS));
+ }
+
+ if (hwerrs & INFINIPATH_HWE_HTCMISCERR5)
+ strlcat(msg, "[HT core Misc5]", msgl);
+ if (hwerrs & INFINIPATH_HWE_HTCMISCERR6)
+ strlcat(msg, "[HT core Misc6]", msgl);
+ if (hwerrs & INFINIPATH_HWE_HTCMISCERR7)
+ strlcat(msg, "[HT core Misc7]", msgl);
+ if (hwerrs & INFINIPATH_HWE_MEMBISTFAILED) {
+ strlcat(msg, "[Memory BIST test failed, HT-400 unusable]",
+ msgl);
+ /* ignore from now on, so disable until driver reloaded */
+ devdata[t].ipath_hwerrmask &= ~INFINIPATH_HWE_MEMBISTFAILED;
+ ipath_kput_kreg(t, kr_hwerrmask, devdata[t].ipath_hwerrmask);
+ }
+#define _IPATH_PLL_FAIL (INFINIPATH_HWE_COREPLL_FBSLIP | \
+ INFINIPATH_HWE_COREPLL_RFSLIP | \
+ INFINIPATH_HWE_HTBPLL_FBSLIP | \
+ INFINIPATH_HWE_HTBPLL_RFSLIP | \
+ INFINIPATH_HWE_HTAPLL_FBSLIP | \
+ INFINIPATH_HWE_HTAPLL_RFSLIP)
+
+ if (hwerrs & _IPATH_PLL_FAIL) {
+ snprintf(bitsmsg, sizeof bitsmsg,
+ "[PLL failed (%llx), HT-400 unusable]",
+ hwerrs & _IPATH_PLL_FAIL);
+ strlcat(msg, bitsmsg, msgl);
+ /* ignore from now on, so disable until driver reloaded */
+ devdata[t].ipath_hwerrmask &= ~(hwerrs & _IPATH_PLL_FAIL);
+ ipath_kput_kreg(t, kr_hwerrmask, devdata[t].ipath_hwerrmask);
+ }
+
+ if (hwerrs & INFINIPATH_HWE_EXTSERDESPLLFAILED) {
+ /*
+		 * If it occurs, it is left masked since the external interface
+ * is unused
+ */
+ devdata[t].ipath_hwerrmask &=
+ ~INFINIPATH_HWE_EXTSERDESPLLFAILED;
+ ipath_kput_kreg(t, kr_hwerrmask, devdata[t].ipath_hwerrmask);
+ }
+
+ if (hwerrs & INFINIPATH_HWE_RXDSYNCMEMPARITYERR)
+ strlcat(msg, "[Rx Dsync]", msgl);
+ if (hwerrs & INFINIPATH_HWE_SERDESPLLFAILED)
+ strlcat(msg, "[SerDes PLL]", msgl);
+
+ _IPATH_UNIT_ERROR(t, "%s hardware error\n", msg);
+ if (isfatal && (!ipath_diags_enabled)) {
+ if (devdata[t].ipath_freezemsg) {
+ /*
+			 * for the /proc status file; if no trailing } is
+			 * copied, we'll know it was truncated.
+ */
+ snprintf(devdata[t].ipath_freezemsg,
+ devdata[t].ipath_freezelen, "{%s}", msg);
+ }
+ }
+}
+
+/* fill in the board name, based on the board revision register */
+void ipath_ht_get_boardname(const ipath_type t, char *name, size_t namelen)
+{
+ char *n = NULL;
+ uint8_t boardrev = devdata[t].ipath_boardrev;
+
+ switch (boardrev) {
+ case 4: /* Ponderosa is one of the bringup boards */
+ n = "Ponderosa";
+ break;
+ case 5: /* HT-460 original production board */
+ n = "InfiniPath_HT-460";
+ break;
+ case 7: /* HT-460 small form factor production board */
+ n = "InfiniPath_HT-465";
+ break;
+ case 6:
+ n = "OEM_Board_3";
+ break;
+ case 8:
+ n = "LS/X-1";
+ break;
+ case 9: /* Comstock bringup test board */
+ n = "Comstock";
+ break;
+ case 10:
+ n = "OEM_Board_2";
+ break;
+ case 11:
+ n = "HT-470";
+ break;
+ default: /* don't know, just print the number */
+ _IPATH_ERROR("Don't yet know about board with ID %u\n",
+ boardrev);
+ snprintf(name, namelen, "UnknownBoardRev%u", boardrev);
+ break;
+ }
+ if (n)
+ snprintf(name, namelen, "%s", n);
+}
+
+int ipath_validate_rev(struct ipath_devdata * dd)
+{
+ if (dd->ipath_majrev != 3 || dd->ipath_minrev != 2) {
+ /*
+ * This version of the driver only supports the HT-400
+ * Rev 3.2
+ */
+ _IPATH_UNIT_ERROR(IPATH_UNIT(dd),
+ "Unsupported HT-400 revision %u.%u!\n",
+ dd->ipath_majrev, dd->ipath_minrev);
+ return 1;
+ }
+ if (dd->ipath_htspeed != 800)
+ _IPATH_UNIT_ERROR(IPATH_UNIT(dd),
+ "Incorrectly configured for HT @ %uMHz\n",
+ dd->ipath_htspeed);
+ if (dd->ipath_boardrev == 7 || dd->ipath_boardrev == 11 ||
+ dd->ipath_boardrev == 6)
+ dd->ipath_flags |= IPATH_GPIO_INTR;
+ else if (dd->ipath_boardrev == 8) { /* LS/X-1 */
+ uint64_t val;
+ val = ipath_kget_kreg64(dd->ipath_pd[0]->port_unit, kr_extstatus);
+ if (val & INFINIPATH_EXTS_SERDESSEL) { /* hardware disabled */
+ /* This means that the chip is hardware disabled, and will
+ * not be able to bring up the link, in any case. We special
+ * case this and abort early, to avoid later messages. We
+ * also set the DISABLED status bit
+ */
+ _IPATH_DBG("Unit %u is hardware-disabled\n",
+ dd->ipath_pd[0]->port_unit);
+ *dd->ipath_statusp |= IPATH_STATUS_DISABLED;
+ return 2; /* this value is handled differently */
+ }
+ }
+ return 0;
+}
+
+static void ipath_check_htlink(ipath_type t)
+{
+ uint8_t linkerr, link_off, i;
+
+ for (i = 0; i < 2; i++) {
+ link_off = devdata[t].ipath_ht_slave_off + i * 4 + 0xd;
+ if (pci_read_config_byte(devdata[t].pcidev, link_off, &linkerr))
+ _IPATH_INFO
+ ("Couldn't read linkerror%d of HT slave/primary block\n",
+ i);
+ else if (linkerr & 0xf0) {
+ _IPATH_VDBG("HT linkerr%d bits 0x%x set, clearing\n",
+				    i, linkerr >> 4);
+ /*
+ * writing the linkerr bits that are set should
+ * clear them
+ */
+ if (pci_write_config_byte
+ (devdata[t].pcidev, link_off, linkerr))
+ _IPATH_DBG
+ ("Failed write to clear HT linkerror%d\n",
+ i);
+ if (pci_read_config_byte
+ (devdata[t].pcidev, link_off, &linkerr))
+ _IPATH_INFO
+ ("Couldn't reread linkerror%d of HT slave/primary block\n",
+ i);
+ else if (linkerr & 0xf0)
+ _IPATH_INFO
+ ("HT linkerror%d bits 0x%x couldn't be cleared\n",
+ i, linkerr >> 4);
+ }
+ }
+}
+
+/*
+ * now that we have finished initializing everything that might reasonably
+ * cause a hardware error, and cleared those errors bits as they occur,
+ * we can enable hardware errors in the mask (potentially enabling
+ * freeze mode), and enable hardware errors as errors (along with
+ * everything else) in errormask
+ */
+void ipath_clear_init_hwerrs(ipath_type t)
+{
+ uint64_t val, extsval;
+
+ extsval = ipath_kget_kreg64(t, kr_extstatus);
+
+ if (!(extsval & INFINIPATH_EXTS_MEMBIST_ENDTEST))
+ _IPATH_UNIT_ERROR(t, "MemBIST did not complete!\n");
+
+ ipath_check_htlink(t);
+
+ /* barring bugs, all hwerrors become interrupts, which can */
+ val = -1LL;
+ /* don't look at crc lane1 if 8 bit */
+ if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT0)
+ val &= ~infinipath_hwe_htclnkabyte1crcerr;
+ /* don't look at crc lane1 if 8 bit */
+ if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT1)
+ val &= ~infinipath_hwe_htclnkbbyte1crcerr;
+
+ /*
+ * disable RXDSYNCMEMPARITY because external serdes is unused,
+ * and therefore the logic will never be used or initialized,
+ * and uninitialized state will normally result in this error
+	 * being asserted. Similarly for the external serdes pll
+ * lock signal.
+ */
+ val &=
+ ~(INFINIPATH_HWE_EXTSERDESPLLFAILED |
+ INFINIPATH_HWE_RXDSYNCMEMPARITYERR);
+
+ /*
+ * Disable MISCERR4 because of an inversion in the HT core
+ * logic checking for errors that cause this bit to be set.
+ * The errata can also cause the protocol error bit to be set
+ * in the HT config space linkerror register(s).
+ */
+ val &= ~INFINIPATH_HWE_HTCMISCERR4;
+
+ /*
+ * PLL ignored because MDIO interface has a logic problem for reads,
+ * on Comstock and Ponderosa. BRINGUP
+ */
+ if (devdata[t].ipath_boardrev == 4 || devdata[t].ipath_boardrev == 9)
+ val &= ~INFINIPATH_HWE_EXTSERDESPLLFAILED; /* BRINGUP */
+ devdata[t].ipath_hwerrmask = val;
+}
+
+/* bring up the serdes */
+int ipath_bringup_serdes(ipath_type t)
+{
+ uint64_t val, config1;
+ int ret = 0, change = 0;
+
+ _IPATH_DBG("Trying to bringup serdes\n");
+
+ if (ipath_kget_kreg64(t, kr_hwerrstatus) &
+ INFINIPATH_HWE_SERDESPLLFAILED) {
+ _IPATH_DBG
+ ("At start, serdes PLL failed bit set in hwerrstatus, clearing and continuing\n");
+ ipath_kput_kreg(t, kr_hwerrclear,
+ INFINIPATH_HWE_SERDESPLLFAILED);
+ }
+
+ val = ipath_kget_kreg64(t, kr_serdesconfig0);
+ config1 = ipath_kget_kreg64(t, kr_serdesconfig1);
+
+ _IPATH_VDBG
+ ("Initial serdes status is config0=%llx config1=%llx, sstatus=%llx xgxs %llx\n",
+ val, config1, ipath_kget_kreg64(t, kr_serdesstatus),
+ ipath_kget_kreg64(t, kr_xgxsconfig));
+
+ /* force reset on */
+ val |=
+ INFINIPATH_SERDC0_RESET_PLL /* | INFINIPATH_SERDC0_RESET_MASK */ ;
+ ipath_kput_kreg(t, kr_serdesconfig0, val);
+ udelay(15); /* need pll reset set at least for a bit */
+
+ if (val & INFINIPATH_SERDC0_RESET_PLL) {
+ uint64_t val2 = val &= ~INFINIPATH_SERDC0_RESET_PLL;
+ /* set lane resets, and tx idle, during pll reset */
+ val2 |= INFINIPATH_SERDC0_RESET_MASK | INFINIPATH_SERDC0_TXIDLE;
+ _IPATH_VDBG("Clearing serdes PLL reset (writing %llx)\n", val2);
+ ipath_kput_kreg(t, kr_serdesconfig0, val2);
+ /* be sure chip saw it */
+ val = ipath_kget_kreg64(t, kr_scratch);
+ /*
+ * need pll reset clear at least 11 usec before lane resets
+ * cleared; give it a few more
+ */
+ udelay(15);
+ val = val2; /* for check below */
+ }
+
+ if (val & (INFINIPATH_SERDC0_RESET_PLL | INFINIPATH_SERDC0_RESET_MASK
+ | INFINIPATH_SERDC0_TXIDLE)) {
+ val &=
+ ~(INFINIPATH_SERDC0_RESET_PLL | INFINIPATH_SERDC0_RESET_MASK
+ | INFINIPATH_SERDC0_TXIDLE);
+ ipath_kput_kreg(t, kr_serdesconfig0, val); /* clear them */
+ }
+
+ val = ipath_kget_kreg64(t, kr_xgxsconfig);
+ if (((val >> INFINIPATH_XGXS_MDIOADDR_SHIFT) &
+ INFINIPATH_XGXS_MDIOADDR_MASK) != 3) {
+ val &=
+ ~(INFINIPATH_XGXS_MDIOADDR_MASK <<
+ INFINIPATH_XGXS_MDIOADDR_SHIFT);
+ /* we use address 3 */
+ val |= 3ULL << INFINIPATH_XGXS_MDIOADDR_SHIFT;
+ change = 1;
+ }
+ if (val & INFINIPATH_XGXS_RESET) { /* normally true after boot */
+ val &= ~INFINIPATH_XGXS_RESET;
+ change = 1;
+ }
+ if (change)
+ ipath_kput_kreg(t, kr_xgxsconfig, val);
+
+ val = ipath_kget_kreg64(t, kr_serdesconfig0);
+
+ config1 &= ~0x0ffffffff00ULL; /* clear current and de-emphasis bits */
+ config1 |= 0x00000000000ULL; /* set current to 20ma */
+ config1 |= 0x0cccc000000ULL; /* set de-emphasis to -5.68dB */
+ ipath_kput_kreg(t, kr_serdesconfig1, config1);
+
+ _IPATH_VDBG
+ ("After setup: serdes status is config0=%llx config1=%llx, sstatus=%llx xgxs %llx\n",
+ val, config1, ipath_kget_kreg64(t, kr_serdesstatus),
+ ipath_kget_kreg64(t, kr_xgxsconfig));
+
+ if ((!ipath_waitfor_mdio_cmdready(t))) {
+ ipath_kput_kreg(t, kr_mdio, IPATH_MDIO_REQ(IPATH_MDIO_CMD_READ,
+ 31,
+ IPATH_MDIO_CTRL_XGXS_REG_8,
+ 0));
+ if (ipath_waitfor_complete
+ (t, kr_mdio, IPATH_MDIO_DATAVALID, &val))
+ _IPATH_DBG
+ ("Never got MDIO data for XGXS status read\n");
+ else
+ _IPATH_VDBG("MDIO Read reg8, 'bank' 31 %x\n",
+ (uint32_t) val);
+ } else
+ _IPATH_DBG("Never got MDIO cmdready for XGXS status read\n");
+
+ return ret; /* for now, say we always succeeded */
+}
+
+/* set serdes to txidle; driver is being unloaded */
+void ipath_quiet_serdes(const ipath_type t)
+{
+ uint64_t val = ipath_kget_kreg64(t, kr_serdesconfig0);
+
+ val |= INFINIPATH_SERDC0_TXIDLE;
+ _IPATH_DBG("Setting TxIdleEn on serdes (config0 = %llx)\n", val);
+ ipath_kput_kreg(t, kr_serdesconfig0, val);
+}
+
+EXPORT_SYMBOL(ipath_get_unit_name);
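
As an aside for reviewers, the pioavail layout described in the comment
above sr_sendpioavail (2 bits per send buffer, so 32 buffers per 64-bit
register) means the register index and bit offset for a given buffer can be
computed as in the following illustrative sketch.  This is not part of the
patch, and the names used here are hypothetical, not driver symbols.

#include <stdint.h>
#include <stdio.h>

#define PIOAVAIL_BITS_PER_BUF	2
#define PIOAVAIL_BUFS_PER_REG	(64 / PIOAVAIL_BITS_PER_BUF)	/* 32 */

static void pioavail_locate(unsigned int bufnum,
			    unsigned int *regnum, unsigned int *shift)
{
	/* which 64-bit pioavail register holds this buffer's 2 bits */
	*regnum = bufnum / PIOAVAIL_BUFS_PER_REG;
	/* bit offset of those 2 bits within that register */
	*shift = (bufnum % PIOAVAIL_BUFS_PER_REG) * PIOAVAIL_BITS_PER_BUF;
}

int main(void)
{
	unsigned int reg, shift;

	pioavail_locate(37, &reg, &shift);	/* example buffer number */
	printf("buffer 37 -> pioavail register %u, bits %u and %u\n",
	       reg, shift, shift + 1);
	return 0;
}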
+
diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_i2c.c
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_i2c.c Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,473 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#include <linux/config.h>
+#include <linux/version.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/timer.h>
+#include <linux/delay.h>
+
+#include "ipath_kernel.h"
+#include "ips_common.h"
+#include "ipath_layer.h"
+
+/*
+ * InfiniPath I2C Driver for a serial flash. Not a generic i2c
+ * interface. Requires software bitbanging, with readbacks from chip
+ * to ensure timing (simple udelay is not enough). Specialized enough
+ * that using the standard kernel i2c bitbanging interface appears as
+ * though it would make the code longer and harder to maintain, rather
+ * than simpler.
+ * Intended to work with parts similar to Atmel AT24C01 (a 1Kbit part,
+ * that uses no programmable address bits, with address 1010000b).
+ */
+
+typedef enum i2c_line_type_e {
+ i2c_line_scl = 0,
+ i2c_line_sda
+} ipath_i2c_type;
+
+typedef enum i2c_line_state_e {
+ i2c_line_low = 0,
+ i2c_line_high
+} ipath_i2c_state;
+
+#define READ_CMD 1
+#define WRITE_CMD 0
+
+static int ipath_eeprom_init;
+
+/*
+ * The gpioval manipulation really should be protected by spinlocks
+ * or be converted to use atomic operations (unfortunately, atomic.h
+ * doesn't cover 64 bit ops for some of them).
+ */
+
+int i2c_gpio_set(ipath_type dev, ipath_i2c_type line,
+ ipath_i2c_state new_line_state);
+int i2c_gpio_get(ipath_type dev, ipath_i2c_type line,
+ ipath_i2c_state * curr_statep);
+
+/*
+ * returns 0 if the line was set to the new state successfully, non-zero
+ * on error.
+ */
+int i2c_gpio_set(ipath_type dev, ipath_i2c_type line,
+ ipath_i2c_state new_line_state)
+{
+ uint64_t read_val, write_val, mask, *gpioval;
+
+ gpioval = &devdata[dev].ipath_gpio_out;
+ read_val = ipath_kget_kreg64(dev, kr_extctrl);
+ if (line == i2c_line_scl)
+ mask = ipath_gpio_scl;
+ else
+ mask = ipath_gpio_sda;
+
+ if (new_line_state == i2c_line_high)
+ /* tri-state the output rather than force high */
+ write_val = read_val & ~mask;
+ else
+ /* config line to be an output */
+ write_val = read_val | mask;
+ ipath_kput_kreg(dev, kr_extctrl, write_val);
+
+ /* set high and verify */
+ if (new_line_state == i2c_line_high)
+ write_val = 0x1UL;
+ else
+ write_val = 0x0UL;
+
+ if (line == i2c_line_scl) {
+ write_val <<= ipath_gpio_scl_num;
+ *gpioval = *gpioval & ~(1UL << ipath_gpio_scl_num);
+ *gpioval |= write_val;
+ } else {
+ write_val <<= ipath_gpio_sda_num;
+ *gpioval = *gpioval & ~(1UL << ipath_gpio_sda_num);
+ *gpioval |= write_val;
+ }
+ ipath_kput_kreg(dev, kr_gpio_out, *gpioval);
+
+ return 0;
+}
+
+/*
+ * returns 0 if the line state was read successfully, non-zero
+ * on error. curr_state is not set on error.
+ */
+int i2c_gpio_get(ipath_type dev, ipath_i2c_type line,
+ ipath_i2c_state * curr_statep)
+{
+ uint64_t read_val, write_val, mask;
+
+ /* check args */
+ if (curr_statep == NULL)
+ return 1;
+
+ read_val = ipath_kget_kreg64(dev, kr_extctrl);
+ /* config line to be an input */
+ if (line == i2c_line_scl)
+ mask = ipath_gpio_scl;
+ else
+ mask = ipath_gpio_sda;
+ write_val = read_val & ~mask;
+ ipath_kput_kreg(dev, kr_extctrl, write_val);
+ read_val = ipath_kget_kreg64(dev, kr_extstatus);
+
+ if (read_val & mask)
+ *curr_statep = i2c_line_high;
+ else
+ *curr_statep = i2c_line_low;
+
+ return 0;
+}
+
+/*
+ * We would prefer not to inline this, to avoid code bloat and simplify
+ * debugging, but when compiling against a 2.6.10 kernel tree it gets an
+ * error, so not for now.
+ */
+static void ipath_i2c_delay(ipath_type, int);
+
+/*
+ * we use this instead of udelay directly, so we can make sure
+ * that previous register writes have been flushed all the way
+ * to the chip. Since we are delaying anyway, the cost doesn't
+ * hurt, and makes the bit twiddling more regular.
+ * If delay is negative, we'll do the chip read, to be sure the write
+ * made it to the chip, but won't do udelay().
+ */
+static void ipath_i2c_delay(ipath_type dev, int dtime)
+{
+ /*
+ * This needs to be volatile, so that the compiler doesn't
+ * optimize away the read to the device's mapped memory.
+ */
+ volatile uint32_t read_val;
+ if (!dtime)
+ return;
+ read_val = ipath_kget_kreg32(dev, kr_scratch);
+ if (--dtime > 0) /* register read takes about .5 usec, itself */
+ udelay(dtime);
+}
+
+static void ipath_scl_out(ipath_type dev, uint8_t bit, int delay)
+{
+ i2c_gpio_set(dev, i2c_line_scl, bit ? i2c_line_high : i2c_line_low);
+
+ ipath_i2c_delay(dev, delay);
+}
+
+static void ipath_sda_out(ipath_type dev, uint8_t bit, int delay)
+{
+ i2c_gpio_set(dev, i2c_line_sda, bit ? i2c_line_high : i2c_line_low);
+
+ ipath_i2c_delay(dev, delay);
+}
+
+static uint8_t ipath_sda_in(ipath_type dev, int delay)
+{
+ ipath_i2c_state bit;
+
+ if (i2c_gpio_get(dev, i2c_line_sda, &bit))
+ _IPATH_DBG("get bit failed!\n");
+
+ ipath_i2c_delay(dev, delay);
+
+ return bit == i2c_line_high ? 1U : 0;
+}
+
+/* see if an ack was received following the write */
+static int ipath_i2c_ackrcv(ipath_type dev)
+{
+ uint8_t ack_received;
+
+ /* AT ENTRY SCL = LOW */
+ /* change direction, ignore data */
+ ack_received = ipath_sda_in(dev, 1);
+ ipath_scl_out(dev, i2c_line_high, 1);
+ ack_received = ipath_sda_in(dev, 1) == 0;
+ ipath_scl_out(dev, i2c_line_low, 1);
+ return ack_received;
+}
+
+/*
+ * write a byte, one bit at a time. Returns 0 if we got the following
+ * ack, otherwise 1
+ */
+static int ipath_wr_byte(ipath_type dev, uint8_t data)
+{
+ int bit_cntr;
+ uint8_t bit;
+
+ for (bit_cntr = 7; bit_cntr >= 0; bit_cntr--) {
+ bit = (data >> bit_cntr) & 1;
+ ipath_sda_out(dev, bit, 1);
+ ipath_scl_out(dev, i2c_line_high, 1);
+ ipath_scl_out(dev, i2c_line_low, 1);
+ }
+ if (!ipath_i2c_ackrcv(dev))
+ return 1;
+ return 0;
+}
+
+static void send_ack(ipath_type dev)
+{
+ ipath_sda_out(dev, i2c_line_low, 1);
+ ipath_scl_out(dev, i2c_line_high, 1);
+ ipath_scl_out(dev, i2c_line_low, 1);
+ ipath_sda_out(dev, i2c_line_high, 1);
+}
+
+/*
+ * ipath_i2c_startcmd - Transmit the start condition, followed by
+ * address/cmd
+ * (both clock/data high, clock high, data low while clock is high)
+ */
+static int ipath_i2c_startcmd(ipath_type dev, uint8_t offset_dir)
+{
+ int res;
+
+ /* issue start sequence */
+ ipath_sda_out(dev, i2c_line_high, 1);
+ ipath_scl_out(dev, i2c_line_high, 1);
+ ipath_sda_out(dev, i2c_line_low, 1);
+ ipath_scl_out(dev, i2c_line_low, 1);
+
+	/* issue the offset and direction byte */
+ res = ipath_wr_byte(dev, offset_dir);
+
+ if (res)
+ _IPATH_VDBG("No ack to complete start\n");
+ return res;
+}
+
+/*
+ * stop_cmd - Transmit the stop condition
+ * (both clock/data low, clock high, data high while clock is high)
+ */
+static void stop_cmd(ipath_type dev)
+{
+ ipath_scl_out(dev, i2c_line_low, 1);
+ ipath_sda_out(dev, i2c_line_low, 1);
+ ipath_scl_out(dev, i2c_line_high, 1);
+ ipath_sda_out(dev, i2c_line_high, 3);
+}
+
+/*
+ * ipath_eeprom_reset - reset I2C communication.
+ *
+ * eeprom: Atmel AT24C01
+ */
+
+static int ipath_eeprom_reset(ipath_type dev)
+{
+ int clock_cycles_left = 9;
+ uint64_t *gpioval = &devdata[dev].ipath_gpio_out;
+
+ ipath_eeprom_init = 1;
+ *gpioval = ipath_kget_kreg64(dev, kr_gpio_out);
+ _IPATH_VDBG("Resetting i2c flash; initial gpioout reg is %llx\n",
+ *gpioval);
+
+ /*
+ * This is to get the i2c into a known state, by first going low,
+ * then tristate sda (and then tristate scl as first thing in loop)
+ */
+ ipath_scl_out(dev, i2c_line_low, 1);
+ ipath_sda_out(dev, i2c_line_high, 1);
+
+ while (clock_cycles_left--) {
+ ipath_scl_out(dev, i2c_line_high, 1);
+
+ if (ipath_sda_in(dev, 0)) {
+ ipath_sda_out(dev, i2c_line_low, 1);
+ ipath_scl_out(dev, i2c_line_low, 1);
+ return 0;
+ }
+
+ ipath_scl_out(dev, i2c_line_low, 1);
+ }
+
+ return 1;
+}
+
+/*
+ * ipath_eeprom_read - receive len bytes from the eeprom via I2C,
+ * starting at eeprom_offset.
+ *
+ * eeprom: Atmel AT24C01
+ */
+
+int ipath_eeprom_read(ipath_type dev, uint8_t eeprom_offset, void *buffer,
+ int len)
+{
+ /* compiler complains unless initialized */
+ uint8_t single_byte = 0;
+ int bit_cntr;
+
+ if (!ipath_eeprom_init)
+ ipath_eeprom_reset(dev);
+
+ eeprom_offset = (eeprom_offset << 1) | READ_CMD;
+
+ if (ipath_i2c_startcmd(dev, eeprom_offset)) {
+ _IPATH_DBG("Failed startcmd\n");
+ stop_cmd(dev);
+ return 1;
+ }
+
+ /*
+ * flash keeps clocking data out as long as we ack, automatically
+ * incrementing the address.
+ */
+ while (len-- > 0) {
+ /* get data */
+ single_byte = 0;
+ for (bit_cntr = 8; bit_cntr; bit_cntr--) {
+ uint8_t bit;
+ ipath_scl_out(dev, i2c_line_high, 1);
+ bit = ipath_sda_in(dev, 0);
+ single_byte |= bit << (bit_cntr - 1);
+ ipath_scl_out(dev, i2c_line_low, 1);
+ }
+
+ /* send ack if not the last byte */
+ if (len)
+ send_ack(dev);
+
+		*(uint8_t *) buffer = single_byte;
+		buffer = (uint8_t *) buffer + 1;
+ }
+
+ stop_cmd(dev);
+
+ return 0;
+}
+
+/*
+ * ipath_eeprom_write - writes data to the eeprom via I2C.
+ */
+int ipath_eeprom_write(ipath_type dev, uint8_t eeprom_offset, void *buffer,
+ int len)
+{
+ uint8_t single_byte;
+ int sub_len;
+ uint8_t *bp = buffer;
+ int max_wait_time, i;
+
+ if (!ipath_eeprom_init)
+ ipath_eeprom_reset(dev);
+
+ while (len > 0) {
+ if (ipath_i2c_startcmd(dev, (eeprom_offset << 1) | WRITE_CMD)) {
+ _IPATH_DBG("Failed to start cmd offset %u\n",
+ eeprom_offset);
+ goto failed_write;
+ }
+
+ sub_len = min(len, 4);
+ eeprom_offset += sub_len;
+ len -= sub_len;
+
+ for (i = 0; i < sub_len; i++) {
+ if (ipath_wr_byte(dev, *bp++)) {
+ _IPATH_DBG
+ ("no ack after byte %u/%u (%u total remain)\n",
+ i, sub_len, len + sub_len - i);
+ goto failed_write;
+ }
+ }
+
+ stop_cmd(dev);
+
+		/*
+		 * Wait for the write to complete by waiting for a
+		 * successful read (the startcmd for the read will fail
+		 * the ack until the chip's internal write cycle has
+		 * completed).  We do this inline to avoid the debug
+		 * prints that are in the real read routine if the
+		 * startcmd fails.
+		 */
+ max_wait_time = 100;
+ while (ipath_i2c_startcmd(dev, READ_CMD)) {
+ stop_cmd(dev);
+ if (!--max_wait_time) {
+ _IPATH_DBG
+ ("Did not get successful read to complete write\n");
+ goto failed_write;
+ }
+ }
+ /* now read the zero byte */
+ for (i = single_byte = 0; i < 8; i++) {
+ uint8_t bit;
+ ipath_scl_out(dev, i2c_line_high, 1);
+ bit = ipath_sda_in(dev, 0);
+ ipath_scl_out(dev, i2c_line_low, 1);
+ single_byte <<= 1;
+ single_byte |= bit;
+ }
+ stop_cmd(dev);
+ }
+
+ return 0;
+
+failed_write:
+ stop_cmd(dev);
+ return 1;
+}
+
+uint8_t ipath_flash_csum(struct ipath_flash * ifp, int adjust)
+{
+ uint8_t *ip = (uint8_t *) ifp;
+ uint8_t csum = 0, len;
+
+ for (len = 0; len < ifp->if_length; len++)
+ csum += *ip++;
+ csum -= ifp->if_csum;
+ csum = ~csum;
+ if (adjust)
+ ifp->if_csum = csum;
+ return csum;
+}
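
(Aside for reviewers, not part of the patch: a minimal, hypothetical caller
showing how ipath_eeprom_read() and ipath_flash_csum() are meant to be used
together; the zero offset and the error codes are assumptions for
illustration only.)

static int example_read_and_check_flash(ipath_type dev)
{
	struct ipath_flash ifp;

	/* read the whole flash structure, starting at offset 0 (assumed) */
	if (ipath_eeprom_read(dev, 0, &ifp, sizeof(ifp)))
		return -EIO;

	/*
	 * ipath_flash_csum() recomputes the checksum over if_length bytes;
	 * with adjust == 0 it leaves if_csum alone, so compare the two.
	 */
	if (ipath_flash_csum(&ifp, 0) != ifp.if_csum)
		return -EINVAL;

	return 0;
}
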
diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_lib.c
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_lib.c Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,90 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+/*
+ * This is library code for the driver, similar to what's in libinfinipath for
+ * usermode code.
+ */
+
+#include <linux/config.h>
+#include <linux/version.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+#include <linux/timer.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <asm/io.h>
+#include <asm/byteorder.h>
+#include <asm/uaccess.h>
+
+#include "ipath_kernel.h"
+
+unsigned infinipath_debug = __IPATH_INFO;
+
+uint32_t _ipath_pico_per_cycle; /* always present, for now */
+
+/*
+ * This isn't perfect, but it's close enough for timing work. We want this
+ * to work on systems where the cycle counter isn't the same as the clock
+ * frequency. The one msec spin is OK, since we execute this only once
+ * when first loaded. We don't use CURRENT_TIME because on some systems
+ * it only has jiffy resolution; we just assume udelay is well calibrated
+ * and that we aren't likely to be rescheduled. Do it multiple times,
+ * with a yield in between, to try to make sure we get the "true minimum"
+ * value.
+ * _ipath_pico_per_cycle isn't going to lead to completely accurate
+ * conversions from timestamps to nanoseconds, but it's close enough
+ * for our purposes, which is mainly to allow people to show events with
+ * nsecs or usecs if desired, rather than cycles.
+ */
+void ipath_init_picotime(void)
+{
+ int i;
+	uint64_t ts, te, delta = -1ULL;
+
+ for (i = 0; i < 5; i++) {
+ ts = get_cycles();
+ udelay(250);
+ te = get_cycles();
+ if ((te - ts) < delta)
+ delta = te - ts;
+ yield();
+ }
+ _ipath_pico_per_cycle = 250000000 / delta;
+}
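
(Aside, not part of the patch: to make the intended use of
_ipath_pico_per_cycle concrete, here is a sketch of a conversion helper;
the helper name is invented, and a very large cycle delta could overflow
the 64-bit multiply.)

static inline uint64_t example_cycles_to_nsecs(uint64_t cycles)
{
	/* cycles * picoseconds-per-cycle gives picoseconds; 1000 ps per ns */
	return (cycles * _ipath_pico_per_cycle) / 1000;
}
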
diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_upages.c
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_upages.c Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,144 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#include <stddef.h>
+
+#include <linux/config.h>
+#include <linux/version.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/spinlock.h>
+
+#include <asm/page.h>
+#include <asm/io.h>
+
+#include "ipath_kernel.h"
+
+/*
+ * Our version of the kernel mlock function. This function is no longer
+ * exposed, so we need to do it ourselves. It takes a given start page
+ * (page aligned user virtual address) and pins it and the following specified
+ * number of pages.
+ * For now, num_pages is always 1, but that will probably change at some
+ * point (because caller is doing expected sends on a single virtually
+ * contiguous buffer, so we can do all pages at once).
+ */
+int ipath_get_upages(unsigned long start_page, size_t num_pages, struct page **p)
+{
+ int n;
+
+ _IPATH_VDBG("pin %lx pages from vaddr %lx\n", num_pages, start_page);
+ down_read(&current->mm->mmap_sem);
+ n = get_user_pages(current, current->mm, start_page, num_pages, 1, 1,
+ p, NULL);
+ up_read(&current->mm->mmap_sem);
+ if (n != num_pages) {
+ _IPATH_INFO
+		    ("get_user_pages (0x%lx pages starting at 0x%lx) failed with %d\n",
+ num_pages, start_page, n);
+ if (n < 0) /* it's an errno */
+ return n;
+ /*
+ * We may have gotten some pages, so unlock those.
+ * ipath_putpages() correctly handles n==0
+ */
+ ipath_putpages(n, p);
+ return -ENOMEM; /* no way to know actual error */
+ }
+
+ return 0;
+}
+
+/*
+ * this is similar to ipath_get_upages, but it's always one page, and we mark
+ * the page as locked for i/o, and shared. This is used for the user process
+ * page that contains the destination address for the rcvhdrq tail update,
+ * so we need to have the vma. If we don't do this, the page can be taken
+ * away from us on fork, even if the child never touches it, and then
+ * the user process never sees the tail register updates.
+ */
+int ipath_get_upages_nocopy(unsigned long start_page, struct page **p)
+{
+ int n;
+ struct vm_area_struct *vm = NULL;
+
+ down_read(&current->mm->mmap_sem);
+ n = get_user_pages(current, current->mm, start_page, 1, 1, 1, p, &vm);
+ up_read(&current->mm->mmap_sem);
+ if (n != 1) {
+ _IPATH_INFO("get_user_pages for 0x%lx failed with %d\n",
+ start_page, n);
+ if (n < 0) /* it's an errno */
+ return n;
+		/*
+		 * If we ever ask for more than a single page, we will have to
+		 * free the pages (if any) that we did get, via ipath_putpages()
+		 * or put_page() directly.
+		 */
+ return -ENOMEM; /* no way to know actual error */
+ }
+ vm->vm_flags |= VM_SHM | VM_LOCKED;
+
+ return 0;
+}
+
+/*
+ * Unpin the given array of pages: mark each one dirty and drop the
+ * reference taken by ipath_get_upages() or ipath_get_upages_nocopy().
+ */
+void ipath_putpages(size_t num_pages, struct page **p)
+{
+ int i;
+
+ for (i = 0; i < num_pages; i++) {
+ _IPATH_MMDBG("%u/%lu put_page %p\n", i, num_pages, p[i]);
+ set_page_dirty_lock(p[i]);
+ put_page(p[i]);
+ }
+}
+
+/*
+ * This routine frees up all the allocations made in this file; it's a nop
+ * now, but I'm leaving it in case we go back to a more sophisticated
+ * implementation later.
+ */
+void ipath_upages_cleanup(struct ipath_portdata * pd)
+{
+}
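
(Aside, not part of the patch: a hypothetical caller showing the intended
pin/use/unpin pairing of ipath_get_upages() and ipath_putpages(); the
single-page count and the placeholder DMA step are assumptions.)

static int example_pin_one_page(unsigned long uaddr, struct page **pages)
{
	int ret;

	/* uaddr must be a page aligned user virtual address */
	ret = ipath_get_upages(uaddr, 1, pages);
	if (ret)
		return ret;

	/* ... hand pages[0] to the hardware here ... */

	/* marks the page dirty and drops the pin reference */
	ipath_putpages(1, pages);
	return 0;
}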

2005-12-29 00:39:47

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 18 of 20] ipath - infiniband management datagram support

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r 584777b6f4dc -r e7cabc7a2e78 drivers/infiniband/hw/ipath/ipath_mad.c
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_mad.c Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,1144 @@
+/*
+ * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#include <linux/version.h>
+#include <rdma/ib_smi.h>
+
+#include "ips_common.h"
+#include "ipath_verbs.h"
+#include "ipath_layer.h"
+
+
+#define IB_SMP_INVALID_FIELD __constant_htons(0x001C)
+
+static int reply(struct ib_smp *smp, int line)
+{
+
+ /*
+ * The verbs framework will handle the directed/LID route
+ * packet changes.
+ */
+ smp->method = IB_MGMT_METHOD_GET_RESP;
+ if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
+ smp->status |= IB_SMP_DIRECTION;
+ return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY;
+}
+
+static inline int recv_subn_get_nodedescription(struct ib_smp *smp)
+{
+
+ strncpy(smp->data, "Infinipath", sizeof(smp->data));
+
+ return reply(smp, __LINE__);
+}
+
+struct nodeinfo {
+ u8 base_version;
+ u8 class_version;
+ u8 node_type;
+ u8 num_ports;
+ __be64 sys_guid;
+ __be64 node_guid;
+ __be64 port_guid;
+ __be16 partition_cap;
+ __be16 device_id;
+ __be32 revision;
+ u8 local_port_num;
+ u8 vendor_id[3];
+} __attribute__ ((packed));
+
+/*
+ * XXX The num_ports value will need a layer function to get the value
+ * if we ever have more than one IB port on a chip.
+ * We will also need to get the GUID for the port.
+ */
+static inline int recv_subn_get_nodeinfo(struct ib_smp *smp,
+ struct ib_device *ibdev, u8 port)
+{
+ struct nodeinfo *nip = (struct nodeinfo *)&smp->data;
+ ipath_type t = to_idev(ibdev)->ib_unit;
+ uint32_t vendor, boardid, majrev, minrev;
+
+ nip->base_version = 1;
+ nip->class_version = 1;
+ nip->node_type = 1; /* channel adapter */
+ nip->num_ports = 1;
+ /* This is already in network order */
+ nip->sys_guid = to_idev(ibdev)->sys_image_guid;
+ nip->node_guid = ipath_layer_get_guid(t);
+ nip->port_guid = nip->sys_guid;
+ nip->partition_cap = cpu_to_be16(ipath_layer_get_npkeys(t));
+ nip->device_id = cpu_to_be16(ipath_layer_get_deviceid(t));
+ ipath_layer_query_device(t, &vendor, &boardid, &majrev, &minrev);
+ nip->revision = cpu_to_be32((majrev << 16) | minrev);
+ nip->local_port_num = port;
+ nip->vendor_id[0] = 0;
+ nip->vendor_id[1] = vendor >> 8;
+ nip->vendor_id[2] = vendor;
+
+ return reply(smp, __LINE__);
+}
+
+static int recv_subn_get_guidinfo(struct ib_smp *smp, struct ib_device *ibdev)
+{
+ uint32_t t = to_idev(ibdev)->ib_unit;
+ u32 startgx = 8 * be32_to_cpu(smp->attr_mod);
+ u64 *p = (u64 *) smp->data;
+
+ /* 32 blocks of 8 64-bit GUIDs per block */
+
+ memset(smp->data, 0, sizeof(smp->data));
+
+ /*
+ * We only support one GUID for now.
+ * If this changes, the portinfo.guid_cap field needs to be updated too.
+ */
+ if (startgx == 0) {
+ /* The first is a copy of the read-only HW GUID. */
+ *p = ipath_layer_get_guid(t);
+ }
+
+ return reply(smp, __LINE__);
+}
+
+struct port_info {
+ __be64 mkey;
+ __be64 gid_prefix;
+ __be16 lid;
+ __be16 sm_lid;
+ __be32 cap_mask;
+ __be16 diag_code;
+ __be16 mkey_lease_period;
+ u8 local_port_num;
+ u8 link_width_enabled;
+ u8 link_width_supported;
+ u8 link_width_active;
+ u8 linkspeed_portstate; /* 4 bits, 4 bits */
+ u8 portphysstate_linkdown; /* 4 bits, 4 bits */
+ u8 mkeyprot_resv_lmc; /* 2 bits, 3 bits, 3 bits */
+ u8 linkspeedactive_enabled; /* 4 bits, 4 bits */
+ u8 neighbormtu_mastersmsl; /* 4 bits, 4 bits */
+ u8 vlcap_inittype; /* 4 bits, 4 bits */
+ u8 vl_high_limit;
+ u8 vl_arb_high_cap;
+ u8 vl_arb_low_cap;
+ u8 inittypereply_mtucap; /* 4 bits, 4 bits */
+ u8 vlstallcnt_hoqlife; /* 3 bits, 5 bits */
+ u8 operationalvl_pei_peo_fpi_fpo; /* 4 bits, 1, 1, 1, 1 */
+ __be16 mkey_violations;
+ __be16 pkey_violations;
+ __be16 qkey_violations;
+ u8 guid_cap;
+ u8 clientrereg_resv_subnetto; /* 1 bit, 2 bits, 5 bits */
+ u8 resv_resptimevalue; /* 3 bits, 5 bits */
+ u8 localphyerrors_overrunerrors; /* 4 bits, 4 bits */
+ __be16 max_credit_hint;
+ u8 resv;
+ u8 link_roundtrip_latency[3];
+} __attribute__ ((packed));
+
+static int recv_subn_get_portinfo(struct ib_smp *smp, struct ib_device *ibdev,
+ u8 port)
+{
+ u32 lportnum = be32_to_cpu(smp->attr_mod);
+ struct ipath_ibdev *dev;
+ struct port_info *pip = (struct port_info *)smp->data;
+ u32 tmp, tmp2;
+
+ if (lportnum == 0) {
+ lportnum = port;
+ smp->attr_mod = cpu_to_be32(lportnum);
+ }
+
+ if (lportnum < 1 || lportnum > ibdev->phys_port_cnt)
+ return IB_MAD_RESULT_FAILURE;
+
+ dev = to_idev(ibdev);
+
+ /* Clear all fields. Only set the non-zero fields. */
+ memset(smp->data, 0, sizeof(smp->data));
+
+ /* Only return the mkey if the protection field allows it. */
+ if ((dev->mkeyprot_resv_lmc >> 6) == 0)
+ pip->mkey = dev->mkey;
+ else
+ pip->mkey = 0;
+ pip->gid_prefix = dev->gid_prefix;
+ tmp = ipath_layer_get_lid(dev->ib_unit);
+ pip->lid = tmp ? cpu_to_be16(tmp) : IB_LID_PERMISSIVE;
+ pip->sm_lid = cpu_to_be16(dev->sm_lid);
+ pip->cap_mask = cpu_to_be32(dev->port_cap_flags);
+ /* pip->diag_code; */
+ pip->mkey_lease_period = cpu_to_be16(dev->mkey_lease_period);
+ pip->local_port_num = port;
+ pip->link_width_enabled = 2; /* 4x */
+ pip->link_width_supported = 3; /* 1x or 4x */
+ pip->link_width_active = 2; /* 4x */
+ pip->linkspeed_portstate = 0x10; /* 2.5Gbps */
+ tmp = ipath_layer_get_lastibcstat(dev->ib_unit) & 0xff;
+ tmp2 = 5; /* link up */
+ if (tmp == 0x11)
+ pip->linkspeed_portstate |= 2; /* initialize */
+ else if (tmp == 0x21)
+ pip->linkspeed_portstate |= 3; /* armed */
+ else if (tmp == 0x31)
+ pip->linkspeed_portstate |= 4; /* active */
+ else {
+ pip->linkspeed_portstate |= 1; /* down */
+ tmp2 = tmp & 0xf;
+ }
+ pip->portphysstate_linkdown = (tmp2 << 4) |
+ (ipath_layer_get_linkdowndefaultstate(dev->ib_unit) ? 1 : 2);
+ pip->mkeyprot_resv_lmc = dev->mkeyprot_resv_lmc;
+ pip->linkspeedactive_enabled = 0x11; /* 2.5Gbps, 2.5Gbps */
+ switch (ipath_layer_get_ibmtu(dev->ib_unit)) {
+ case 4096:
+ tmp = IB_MTU_4096;
+ break;
+ case 2048:
+ tmp = IB_MTU_2048;
+ break;
+ case 1024:
+ tmp = IB_MTU_1024;
+ break;
+ case 512:
+ tmp = IB_MTU_512;
+ break;
+ case 256:
+ tmp = IB_MTU_256;
+ break;
+ default: /* oops, something is wrong */
+ tmp = IB_MTU_2048;
+ break;
+ }
+ pip->neighbormtu_mastersmsl = (tmp << 4) | dev->sm_sl;
+ pip->vlcap_inittype = 0x10; /* VLCap = VL0, InitType = 0 */
+ /* pip->vl_high_limit; // only one VL */
+ /* pip->vl_arb_high_cap; // only one VL */
+ /* pip->vl_arb_low_cap; // only one VL */
+ pip->inittypereply_mtucap = IB_MTU_4096; /* InitTypeReply = 0 */
+ /* pip->vlstallcnt_hoqlife; // HCAs ignore VLStallCount and HOQLife */
+ pip->operationalvl_pei_peo_fpi_fpo = 0x10; /* OVLs = 1 */
+ pip->mkey_violations = cpu_to_be16(dev->mkey_violations);
+ /* P_KeyViolations are counted by hardware. */
+ tmp = ipath_layer_get_cr_errpkey(dev->ib_unit) & 0xFFFF;
+ pip->pkey_violations = cpu_to_be16(tmp);
+ pip->qkey_violations = cpu_to_be16(dev->qkey_violations);
+ /* Only the hardware GUID is supported for now */
+ pip->guid_cap = 1;
+ pip->clientrereg_resv_subnetto = dev->subnet_timeout;
+ /* 32.768 usec. response time (guessing) */
+ pip->resv_resptimevalue = 3;
+ pip->localphyerrors_overrunerrors =
+ (ipath_layer_get_phyerrthreshold(dev->ib_unit) << 4) |
+ ipath_layer_get_overrunthreshold(dev->ib_unit);
+ /* pip->max_credit_hint; */
+ /* pip->link_roundtrip_latency[3]; */
+
+ return reply(smp, __LINE__);
+}
+
+static int recv_subn_get_pkeytable(struct ib_smp *smp, struct ib_device *ibdev)
+{
+ u32 startpx = 32 * (be32_to_cpu(smp->attr_mod) & 0xffff);
+ u16 *p = (u16 *) smp->data;
+
+ /* 64 blocks of 32 16-bit P_Key entries */
+
+ memset(smp->data, 0, sizeof(smp->data));
+ if (startpx == 0)
+ ipath_layer_get_pkeys(to_idev(ibdev)->ib_unit, p);
+ else
+ smp->status |= IB_SMP_INVALID_FIELD;
+
+ return reply(smp, __LINE__);
+}
+
+static inline int recv_subn_set_guidinfo(struct ib_smp *smp,
+ struct ib_device *ibdev)
+{
+ /* The only GUID we support is the first read-only entry. */
+ return recv_subn_get_guidinfo(smp, ibdev);
+}
+
+static inline int recv_subn_set_portinfo(struct ib_smp *smp,
+ struct ib_device *ibdev, u8 port)
+{
+ struct port_info *pip = (struct port_info *)smp->data;
+ uint32_t lportnum = be32_to_cpu(smp->attr_mod);
+ struct ib_event event;
+ struct ipath_ibdev *dev;
+ uint32_t flags;
+ char clientrereg = 0;
+ u32 tmp;
+ u32 tmp2;
+ int ret;
+
+ if (lportnum == 0) {
+ lportnum = port;
+ smp->attr_mod = cpu_to_be32(lportnum);
+ }
+
+ if (lportnum < 1 || lportnum > ibdev->phys_port_cnt)
+ return IB_MAD_RESULT_FAILURE;
+
+ dev = to_idev(ibdev);
+ event.device = ibdev;
+ event.element.port_num = port;
+
+ if (dev->mkey != pip->mkey)
+ dev->mkey = pip->mkey;
+
+ if (pip->gid_prefix != dev->gid_prefix)
+ dev->gid_prefix = pip->gid_prefix;
+
+ tmp = be16_to_cpu(pip->lid);
+ if (tmp != ipath_layer_get_lid(dev->ib_unit)) {
+ ipath_set_sps_lid(dev->ib_unit, tmp);
+ event.event = IB_EVENT_LID_CHANGE;
+ ib_dispatch_event(&event);
+ }
+
+ tmp = be16_to_cpu(pip->sm_lid);
+ if (tmp != dev->sm_lid) {
+ dev->sm_lid = tmp;
+ event.event = IB_EVENT_SM_CHANGE;
+ ib_dispatch_event(&event);
+ }
+
+ dev->mkey_lease_period = be16_to_cpu(pip->mkey_lease_period);
+
+#if 0
+ tmp = pip->link_width_enabled;
+ if (tmp && (tmp != lpp->linkwidthenabled)) {
+ lpp->linkwidthenabled = tmp;
+ /* JAG - notify driver here */
+ }
+#endif
+
+ tmp = pip->portphysstate_linkdown & 0xF;
+ if (tmp == 1) {
+ /* SLEEP */
+ if (ipath_layer_set_linkdowndefaultstate(dev->ib_unit, 1))
+ return IB_MAD_RESULT_FAILURE;
+ } else if (tmp == 2) {
+ /* POLL */
+ if (ipath_layer_set_linkdowndefaultstate(dev->ib_unit, 0))
+ return IB_MAD_RESULT_FAILURE;
+ } else if (tmp)
+ return IB_MAD_RESULT_FAILURE;
+
+ dev->mkeyprot_resv_lmc = pip->mkeyprot_resv_lmc;
+
+#if 0
+ tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_LinkSpeedEnabled);
+ if (tmp && (tmp != lpp->linkspeedenabled)) {
+ lpp->linkspeedenabled = tmp;
+ /* JAG - notify driver here */
+ }
+#endif
+
+ switch ((pip->neighbormtu_mastersmsl >> 4) & 0xF) {
+ case IB_MTU_256:
+ tmp = 256;
+ break;
+ case IB_MTU_512:
+ tmp = 512;
+ break;
+ case IB_MTU_1024:
+ tmp = 1024;
+ break;
+ case IB_MTU_2048:
+ tmp = 2048;
+ break;
+ case IB_MTU_4096:
+ tmp = 4096;
+ break;
+ default:
+ /* XXX We have already partially updated our state! */
+ return IB_MAD_RESULT_FAILURE;
+ }
+ ipath_kset_mtu(dev->ib_unit << 16 | tmp);
+
+ dev->sm_sl = pip->neighbormtu_mastersmsl & 0xF;
+
+#if 0
+ tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_VLHighLimit);
+ if (tmp != lpp->vlhighlimit) {
+ lpp->vlhighlimit = tmp;
+ /* JAG - notify driver here */
+ }
+
+ lpp->inittypereply =
+ BF_GET(g.madp, iba_Subn_PortInfo, FIELD_InitTypeReply);
+
+ tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_OperationalVLs);
+ if (tmp && (tmp != lpp->operationalvls)) {
+ lpp->operationalvls = tmp;
+ /* JAG - notify driver here */
+ }
+#endif
+
+ if (pip->mkey_violations != 0)
+ dev->mkey_violations = 0;
+#if 0
+ /* XXX Hardware counter can't be reset. */
+ if (pip->pkey_violations != 0)
+ dev->pkey_violations = 0;
+#endif
+
+ if (pip->qkey_violations != 0)
+ dev->qkey_violations = 0;
+
+ tmp = (pip->localphyerrors_overrunerrors >> 4) & 0xF;
+ if (ipath_layer_set_phyerrthreshold(dev->ib_unit, tmp))
+ return IB_MAD_RESULT_FAILURE;
+
+ tmp = pip->localphyerrors_overrunerrors & 0xF;
+ if (ipath_layer_set_overrunthreshold(dev->ib_unit, tmp))
+ return IB_MAD_RESULT_FAILURE;
+
+ dev->subnet_timeout = pip->clientrereg_resv_subnetto & 0x1F;
+
+ if (pip->clientrereg_resv_subnetto & 0x80) {
+ clientrereg = 1;
+ event.event = IB_EVENT_LID_CHANGE;
+ ib_dispatch_event(&event);
+ }
+
+ /*
+ * Do the port state change now that the other link parameters
+ * have been set.
+ * Changing the port physical state only makes sense if the link
+ * is down or is being set to down.
+ */
+ tmp = pip->linkspeed_portstate & 0xF;
+ flags = ipath_layer_get_flags(dev->ib_unit);
+ tmp2 = (pip->portphysstate_linkdown >> 4) & 0xF;
+ if (tmp2) {
+ if (tmp != IB_PORT_DOWN && !(flags & IPATH_LINKDOWN))
+ return IB_MAD_RESULT_FAILURE;
+ tmp = IB_PORT_DOWN;
+ tmp2 = IB_PORT_NOP;
+ } else if (flags & IPATH_LINKDOWN)
+ tmp2 = IB_PORT_DOWN;
+ else if (flags & IPATH_LINKINIT)
+ tmp2 = IB_PORT_INIT;
+ else if (flags & IPATH_LINKARMED)
+ tmp2 = IB_PORT_ARMED;
+ else if (flags & IPATH_LINKACTIVE)
+ tmp2 = IB_PORT_ACTIVE;
+ else
+ tmp2 = IB_PORT_NOP;
+
+ if (tmp && tmp != tmp2) {
+ switch (tmp) {
+ case IB_PORT_DOWN:
+ tmp = (pip->portphysstate_linkdown >> 4) & 0xF;
+ if (tmp <= 1)
+ tmp = IPATH_IB_LINKDOWN;
+ else if (tmp == 2)
+ tmp = IPATH_IB_LINKDOWN_POLL;
+ else if (tmp == 3)
+ tmp = IPATH_IB_LINKDOWN_DISABLE;
+ else
+ return IB_MAD_RESULT_FAILURE;
+ ipath_kset_linkstate(dev->ib_unit << 16 | tmp);
+ if (tmp2 == IB_PORT_ACTIVE) {
+ event.event = IB_EVENT_PORT_ERR;
+ ib_dispatch_event(&event);
+ }
+ break;
+
+ case IB_PORT_INIT:
+ ipath_kset_linkstate(dev->ib_unit << 16 |
+ IPATH_IB_LINKINIT);
+ if (tmp2 == IB_PORT_ACTIVE) {
+ event.event = IB_EVENT_PORT_ERR;
+ ib_dispatch_event(&event);
+ }
+ break;
+
+ case IB_PORT_ARMED:
+ ipath_kset_linkstate(dev->ib_unit << 16 |
+ IPATH_IB_LINKARM);
+ if (tmp2 == IB_PORT_ACTIVE) {
+ event.event = IB_EVENT_PORT_ERR;
+ ib_dispatch_event(&event);
+ }
+ break;
+
+ case IB_PORT_ACTIVE:
+ ipath_kset_linkstate(dev->ib_unit << 16 |
+ IPATH_IB_LINKACTIVE);
+ event.event = IB_EVENT_PORT_ACTIVE;
+ ib_dispatch_event(&event);
+ break;
+
+ default:
+ /* XXX We have already partially updated our state! */
+ return IB_MAD_RESULT_FAILURE;
+ }
+ }
+
+ ret = recv_subn_get_portinfo(smp, ibdev, port);
+
+ if (clientrereg)
+ pip->clientrereg_resv_subnetto |= 0x80;
+
+ return ret;
+}
+
+static inline int recv_subn_set_pkeytable(struct ib_smp *smp,
+ struct ib_device *ibdev)
+{
+ u32 startpx = 32 * (be32_to_cpu(smp->attr_mod) & 0xffff);
+ u16 *p = (u16 *) smp->data;
+
+ if (startpx != 0 ||
+ ipath_layer_set_pkeys(to_idev(ibdev)->ib_unit, p) != 0)
+ smp->status |= IB_SMP_INVALID_FIELD;
+
+ return recv_subn_get_pkeytable(smp, ibdev);
+}
+
+#define IB_PMA_CLASS_PORT_INFO __constant_htons(0x0001)
+#define IB_PMA_PORT_SAMPLES_CONTROL __constant_htons(0x0010)
+#define IB_PMA_PORT_SAMPLES_RESULT __constant_htons(0x0011)
+#define IB_PMA_PORT_COUNTERS __constant_htons(0x0012)
+#define IB_PMA_PORT_COUNTERS_EXT __constant_htons(0x001D)
+#define IB_PMA_PORT_SAMPLES_RESULT_EXT __constant_htons(0x001E)
+
+struct ib_perf {
+ u8 base_version;
+ u8 mgmt_class;
+ u8 class_version;
+ u8 method;
+ __be16 status;
+ __be16 unused;
+ __be64 tid;
+ __be16 attr_id;
+ __be16 resv;
+ __be32 attr_mod;
+ u8 reserved[40];
+ u8 data[192];
+} __attribute__ ((packed));
+
+struct ib_pma_classportinfo {
+ u8 base_version;
+ u8 class_version;
+ __be16 cap_mask;
+ u8 reserved[3];
+ u8 resp_time_value; /* only lower 5 bits */
+ union ib_gid redirect_gid;
+ __be32 redirect_tc_sl_fl; /* 8, 4, 20 bits respectively */
+ __be16 redirect_lid;
+ __be16 redirect_pkey;
+ __be32 redirect_qp; /* only lower 24 bits */
+ __be32 redirect_qkey;
+ union ib_gid trap_gid;
+ __be32 trap_tc_sl_fl; /* 8, 4, 20 bits respectively */
+ __be16 trap_lid;
+ __be16 trap_pkey;
+ __be32 trap_hl_qp; /* 8, 24 bits respectively */
+ __be32 trap_qkey;
+} __attribute__ ((packed));
+
+struct ib_pma_portsamplescontrol {
+ u8 opcode;
+ u8 port_select;
+ u8 tick;
+ u8 counter_width; /* only lower 3 bits */
+ __be32 counter_mask0_9; /* 2, 10 * 3, bits */
+ __be16 counter_mask10_14; /* 1, 5 * 3, bits */
+ u8 sample_mechanisms;
+ u8 sample_status; /* only lower 2 bits */
+ __be64 option_mask;
+ __be64 vendor_mask;
+ __be32 sample_start;
+ __be32 sample_interval;
+ __be16 tag;
+ __be16 counter_select[15];
+} __attribute__ ((packed));
+
+struct ib_pma_portsamplesresult {
+ __be16 tag;
+ __be16 sample_status; /* only lower 2 bits */
+ __be32 counter[15];
+} __attribute__ ((packed));
+
+struct ib_pma_portsamplesresult_ext {
+ __be16 tag;
+ __be16 sample_status; /* only lower 2 bits */
+ __be32 extended_width; /* only upper 2 bits */
+ __be64 counter[15];
+} __attribute__ ((packed));
+
+struct ib_pma_portcounters {
+ u8 reserved;
+ u8 port_select;
+ __be16 counter_select;
+ __be16 symbol_error_counter;
+ u8 link_error_recovery_counter;
+ u8 link_downed_counter;
+ __be16 port_rcv_errors;
+ __be16 port_rcv_remphys_errors;
+ __be16 port_rcv_switch_relay_errors;
+ __be16 port_xmit_discards;
+ u8 port_xmit_constraint_errors;
+ u8 port_rcv_constraint_errors;
+ u8 reserved1;
+ u8 lli_ebor_errors; /* 4, 4, bits */
+ __be16 reserved2;
+ __be16 vl15_dropped;
+ __be32 port_xmit_data;
+ __be32 port_rcv_data;
+ __be32 port_xmit_packets;
+ __be32 port_rcv_packets;
+} __attribute__ ((packed));
+
+#define IB_PMA_SEL_SYMBOL_ERROR __constant_htons(0x0001)
+#define IB_PMA_SEL_LINK_ERROR_RECOVERY __constant_htons(0x0002)
+#define IB_PMA_SEL_LINK_DOWNED __constant_htons(0x0004)
+#define IB_PMA_SEL_PORT_RCV_ERRORS __constant_htons(0x0008)
+#define IB_PMA_SEL_PORT_RCV_REMPHYS_ERRORS __constant_htons(0x0010)
+#define IB_PMA_SEL_PORT_XMIT_DISCARDS __constant_htons(0x0040)
+#define IB_PMA_SEL_PORT_XMIT_DATA __constant_htons(0x1000)
+#define IB_PMA_SEL_PORT_RCV_DATA __constant_htons(0x2000)
+#define IB_PMA_SEL_PORT_XMIT_PACKETS __constant_htons(0x4000)
+#define IB_PMA_SEL_PORT_RCV_PACKETS __constant_htons(0x8000)
+
+struct ib_pma_portcounters_ext {
+ u8 reserved;
+ u8 port_select;
+ __be16 counter_select;
+ __be32 reserved1;
+ __be64 port_xmit_data;
+ __be64 port_rcv_data;
+ __be64 port_xmit_packets;
+ __be64 port_rcv_packets;
+ __be64 port_unicast_xmit_packets;
+ __be64 port_unicast_rcv_packets;
+ __be64 port_multicast_xmit_packets;
+ __be64 port_multicast_rcv_packets;
+} __attribute__ ((packed));
+
+#define IB_PMA_SELX_PORT_XMIT_DATA __constant_htons(0x0001)
+#define IB_PMA_SELX_PORT_RCV_DATA __constant_htons(0x0002)
+#define IB_PMA_SELX_PORT_XMIT_PACKETS __constant_htons(0x0004)
+#define IB_PMA_SELX_PORT_RCV_PACKETS __constant_htons(0x0008)
+#define IB_PMA_SELX_PORT_UNI_XMIT_PACKETS __constant_htons(0x0010)
+#define IB_PMA_SELX_PORT_UNI_RCV_PACKETS __constant_htons(0x0020)
+#define IB_PMA_SELX_PORT_MULTI_XMIT_PACKETS __constant_htons(0x0040)
+#define IB_PMA_SELX_PORT_MULTI_RCV_PACKETS __constant_htons(0x0080)
+
+static int recv_pma_get_classportinfo(struct ib_perf *pmp)
+{
+ /*
+ struct ib_pma_classportinfo *p =
+ (struct ib_pma_classportinfo *)pmp->data;
+ */
+
+ memset(pmp->data, 0, sizeof(pmp->data));
+
+ return reply((struct ib_smp *)pmp, __LINE__);
+}
+
+static int recv_pma_get_portsamplescontrol(struct ib_perf *pmp,
+ struct ib_device *ibdev, u8 port)
+{
+ struct ib_pma_portsamplescontrol *p =
+ (struct ib_pma_portsamplescontrol *)pmp->data;
+ struct ipath_ibdev *dev = to_idev(ibdev);
+ unsigned long flags;
+
+ memset(pmp->data, 0, sizeof(pmp->data));
+
+ p->port_select = port;
+ p->tick = 0xFA; /* 1 ms. */
+ p->counter_width = 4; /* 32 bit counters */
+ p->counter_mask0_9 = __constant_htonl(0x09248000); /* counters 0-4 */
+ spin_lock_irqsave(&dev->pending_lock, flags);
+ p->sample_status = dev->pma_sample_status;
+ p->sample_start = cpu_to_be32(dev->pma_sample_start);
+ p->sample_interval = cpu_to_be32(dev->pma_sample_interval);
+ p->tag = cpu_to_be16(dev->pma_tag);
+ p->counter_select[0] = dev->pma_counter_select[0];
+ p->counter_select[1] = dev->pma_counter_select[1];
+ p->counter_select[2] = dev->pma_counter_select[2];
+ p->counter_select[3] = dev->pma_counter_select[3];
+ p->counter_select[4] = dev->pma_counter_select[4];
+ spin_unlock_irqrestore(&dev->pending_lock, flags);
+
+ return reply((struct ib_smp *)pmp, __LINE__);
+}
+
+static int recv_pma_set_portsamplescontrol(struct ib_perf *pmp,
+ struct ib_device *ibdev, u8 port)
+{
+ struct ib_pma_portsamplescontrol *p =
+ (struct ib_pma_portsamplescontrol *)pmp->data;
+ struct ipath_ibdev *dev = to_idev(ibdev);
+ unsigned long flags;
+ u32 start = be32_to_cpu(p->sample_start);
+
+ if (pmp->attr_mod == 0 && p->port_select == port && start != 0) {
+ spin_lock_irqsave(&dev->pending_lock, flags);
+ if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_DONE) {
+ dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_STARTED;
+ dev->pma_sample_start = start;
+ dev->pma_sample_interval =
+ be32_to_cpu(p->sample_interval);
+ dev->pma_tag = be16_to_cpu(p->tag);
+ if (p->counter_select[0])
+ dev->pma_counter_select[0] =
+ p->counter_select[0];
+ if (p->counter_select[1])
+ dev->pma_counter_select[1] =
+ p->counter_select[1];
+ if (p->counter_select[2])
+ dev->pma_counter_select[2] =
+ p->counter_select[2];
+ if (p->counter_select[3])
+ dev->pma_counter_select[3] =
+ p->counter_select[3];
+ if (p->counter_select[4])
+ dev->pma_counter_select[4] =
+ p->counter_select[4];
+ }
+ spin_unlock_irqrestore(&dev->pending_lock, flags);
+ }
+ return recv_pma_get_portsamplescontrol(pmp, ibdev, port);
+}
+
+static u64 get_counter(struct ipath_ibdev *dev, __be16 sel)
+{
+ switch (sel) {
+ case IB_PMA_PORT_XMIT_DATA:
+ return dev->ipath_sword;
+ case IB_PMA_PORT_RCV_DATA:
+ return dev->ipath_rword;
+ case IB_PMA_PORT_XMIT_PKTS:
+ return dev->ipath_spkts;
+ case IB_PMA_PORT_RCV_PKTS:
+ return dev->ipath_rpkts;
+ case IB_PMA_PORT_XMIT_WAIT:
+ default:
+ return 0;
+ }
+}
+
+static int recv_pma_get_portsamplesresult(struct ib_perf *pmp,
+ struct ib_device *ibdev)
+{
+ struct ib_pma_portsamplesresult *p =
+ (struct ib_pma_portsamplesresult *)pmp->data;
+ struct ipath_ibdev *dev = to_idev(ibdev);
+ int i;
+
+ memset(pmp->data, 0, sizeof(pmp->data));
+ p->tag = cpu_to_be16(dev->pma_tag);
+ p->sample_status = cpu_to_be16(dev->pma_sample_status);
+ for (i = 0; i < ARRAY_SIZE(dev->pma_counter_select); i++)
+ p->counter[i] =
+ cpu_to_be32(get_counter(dev, dev->pma_counter_select[i]));
+
+ return reply((struct ib_smp *)pmp, __LINE__);
+}
+
+static int recv_pma_get_portsamplesresult_ext(struct ib_perf *pmp,
+ struct ib_device *ibdev)
+{
+ struct ib_pma_portsamplesresult_ext *p =
+ (struct ib_pma_portsamplesresult_ext *)pmp->data;
+ struct ipath_ibdev *dev = to_idev(ibdev);
+ int i;
+
+ memset(pmp->data, 0, sizeof(pmp->data));
+ p->tag = cpu_to_be16(dev->pma_tag);
+ p->sample_status = cpu_to_be16(dev->pma_sample_status);
+	p->extended_width = __constant_cpu_to_be32(0x80000000); /* 64 bits */
+ for (i = 0; i < ARRAY_SIZE(dev->pma_counter_select); i++)
+ p->counter[i] =
+ cpu_to_be64(get_counter(dev, dev->pma_counter_select[i]));
+
+ return reply((struct ib_smp *)pmp, __LINE__);
+}
+
+static int recv_pma_get_portcounters(struct ib_perf *pmp,
+ struct ib_device *ibdev, u8 port)
+{
+ struct ib_pma_portcounters *p = (struct ib_pma_portcounters *)pmp->data;
+ struct ipath_ibdev *dev = to_idev(ibdev);
+ struct ipath_layer_counters cntrs;
+
+ ipath_layer_get_counters(dev->ib_unit, &cntrs);
+
+ /* Adjust counters for any resets done. */
+ cntrs.symbol_error_counter -= dev->n_symbol_error_counter;
+ cntrs.link_error_recovery_counter -= dev->n_link_error_recovery_counter;
+ cntrs.link_downed_counter -= dev->n_link_downed_counter;
+ cntrs.port_rcv_errors -= dev->n_port_rcv_errors;
+ cntrs.port_rcv_remphys_errors -= dev->n_port_rcv_remphys_errors;
+ cntrs.port_xmit_discards -= dev->n_port_xmit_discards;
+ cntrs.port_xmit_data -= dev->n_port_xmit_data;
+ cntrs.port_rcv_data -= dev->n_port_rcv_data;
+ cntrs.port_xmit_packets -= dev->n_port_xmit_packets;
+ cntrs.port_rcv_packets -= dev->n_port_rcv_packets;
+
+ memset(pmp->data, 0, sizeof(pmp->data));
+ p->port_select = port;
+ if (cntrs.symbol_error_counter > 0xFFFFUL)
+ p->symbol_error_counter = 0xFFFF;
+ else
+ p->symbol_error_counter =
+ cpu_to_be16((u16)cntrs.symbol_error_counter);
+ if (cntrs.link_error_recovery_counter > 0xFFUL)
+ p->link_error_recovery_counter = 0xFF;
+ else
+ p->link_error_recovery_counter =
+ (u8)cntrs.link_error_recovery_counter;
+ if (cntrs.link_downed_counter > 0xFFUL)
+ p->link_downed_counter = 0xFF;
+ else
+ p->link_downed_counter = (u8)cntrs.link_downed_counter;
+ if (cntrs.port_rcv_errors > 0xFFFFUL)
+ p->port_rcv_errors = 0xFFFF;
+ else
+ p->port_rcv_errors = cpu_to_be16((u16)cntrs.port_rcv_errors);
+ if (cntrs.port_rcv_remphys_errors > 0xFFFFUL)
+ p->port_rcv_remphys_errors = 0xFFFF;
+ else
+ p->port_rcv_remphys_errors =
+ cpu_to_be16((u16)cntrs.port_rcv_remphys_errors);
+ if (cntrs.port_xmit_discards > 0xFFFFUL)
+ p->port_xmit_discards = 0xFFFF;
+ else
+ p->port_xmit_discards =
+ cpu_to_be16((u16)cntrs.port_xmit_discards);
+ if (cntrs.port_xmit_data > 0xFFFFFFFFUL)
+ p->port_xmit_data = 0xFFFFFFFF;
+ else
+ p->port_xmit_data = cpu_to_be32((u32)cntrs.port_xmit_data);
+ if (cntrs.port_rcv_data > 0xFFFFFFFFUL)
+ p->port_rcv_data = 0xFFFFFFFF;
+ else
+ p->port_rcv_data = cpu_to_be32((u32)cntrs.port_rcv_data);
+ if (cntrs.port_xmit_packets > 0xFFFFFFFFUL)
+ p->port_xmit_packets = 0xFFFFFFFF;
+ else
+ p->port_xmit_packets =
+ cpu_to_be32((u32)cntrs.port_xmit_packets);
+ if (cntrs.port_rcv_packets > 0xFFFFFFFFUL)
+ p->port_rcv_packets = 0xFFFFFFFF;
+ else
+ p->port_rcv_packets = cpu_to_be32((u32)cntrs.port_rcv_packets);
+
+ return reply((struct ib_smp *)pmp, __LINE__);
+}
+
+static int recv_pma_get_portcounters_ext(struct ib_perf *pmp,
+ struct ib_device *ibdev, u8 port)
+{
+ struct ib_pma_portcounters_ext *p =
+ (struct ib_pma_portcounters_ext *)pmp->data;
+ struct ipath_ibdev *dev = to_idev(ibdev);
+ u64 swords, rwords, spkts, rpkts;
+
+ ipath_layer_snapshot_counters(dev->ib_unit,
+ &swords, &rwords, &spkts, &rpkts);
+
+ /* Adjust counters for any resets done. */
+ swords -= dev->n_port_xmit_data;
+ rwords -= dev->n_port_rcv_data;
+ spkts -= dev->n_port_xmit_packets;
+ rpkts -= dev->n_port_rcv_packets;
+
+ memset(pmp->data, 0, sizeof(pmp->data));
+ p->port_select = port;
+ p->port_xmit_data = cpu_to_be64(swords);
+ p->port_rcv_data = cpu_to_be64(rwords);
+ p->port_xmit_packets = cpu_to_be64(spkts);
+ p->port_rcv_packets = cpu_to_be64(rpkts);
+ p->port_unicast_xmit_packets = cpu_to_be64(dev->n_unicast_xmit);
+ p->port_unicast_rcv_packets = cpu_to_be64(dev->n_unicast_rcv);
+ p->port_multicast_xmit_packets = cpu_to_be64(dev->n_multicast_xmit);
+ p->port_multicast_rcv_packets = cpu_to_be64(dev->n_multicast_rcv);
+
+ return reply((struct ib_smp *)pmp, __LINE__);
+}
+
+static int recv_pma_set_portcounters(struct ib_perf *pmp,
+ struct ib_device *ibdev, u8 port)
+{
+ struct ib_pma_portcounters *p = (struct ib_pma_portcounters *)pmp->data;
+ struct ipath_ibdev *dev = to_idev(ibdev);
+ struct ipath_layer_counters cntrs;
+
+ /*
+ * Since the HW doesn't support clearing counters, we save the
+ * current count and subtract it from future responses.
+ */
+ ipath_layer_get_counters(dev->ib_unit, &cntrs);
+
+ if (p->counter_select & IB_PMA_SEL_SYMBOL_ERROR)
+ dev->n_symbol_error_counter = cntrs.symbol_error_counter;
+
+ if (p->counter_select & IB_PMA_SEL_LINK_ERROR_RECOVERY)
+ dev->n_link_error_recovery_counter =
+ cntrs.link_error_recovery_counter;
+
+ if (p->counter_select & IB_PMA_SEL_LINK_DOWNED)
+ dev->n_link_downed_counter = cntrs.link_downed_counter;
+
+ if (p->counter_select & IB_PMA_SEL_PORT_RCV_ERRORS)
+ dev->n_port_rcv_errors = cntrs.port_rcv_errors;
+
+ if (p->counter_select & IB_PMA_SEL_PORT_RCV_REMPHYS_ERRORS)
+ dev->n_port_rcv_remphys_errors = cntrs.port_rcv_remphys_errors;
+
+ if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DISCARDS)
+ dev->n_port_xmit_discards = cntrs.port_xmit_discards;
+
+ if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DATA)
+ dev->n_port_xmit_data = cntrs.port_xmit_data;
+
+ if (p->counter_select & IB_PMA_SEL_PORT_RCV_DATA)
+ dev->n_port_rcv_data = cntrs.port_rcv_data;
+
+ if (p->counter_select & IB_PMA_SEL_PORT_XMIT_PACKETS)
+ dev->n_port_xmit_packets = cntrs.port_xmit_packets;
+
+ if (p->counter_select & IB_PMA_SEL_PORT_RCV_PACKETS)
+ dev->n_port_rcv_packets = cntrs.port_rcv_packets;
+
+ return recv_pma_get_portcounters(pmp, ibdev, port);
+}
+
+static int recv_pma_set_portcounters_ext(struct ib_perf *pmp,
+ struct ib_device *ibdev, u8 port)
+{
+ struct ib_pma_portcounters *p = (struct ib_pma_portcounters *)pmp->data;
+ struct ipath_ibdev *dev = to_idev(ibdev);
+ u64 swords, rwords, spkts, rpkts;
+
+ ipath_layer_snapshot_counters(dev->ib_unit,
+ &swords, &rwords, &spkts, &rpkts);
+
+ if (p->counter_select & IB_PMA_SELX_PORT_XMIT_DATA)
+ dev->n_port_xmit_data = swords;
+
+ if (p->counter_select & IB_PMA_SELX_PORT_RCV_DATA)
+ dev->n_port_rcv_data = rwords;
+
+ if (p->counter_select & IB_PMA_SELX_PORT_XMIT_PACKETS)
+ dev->n_port_xmit_packets = spkts;
+
+ if (p->counter_select & IB_PMA_SELX_PORT_RCV_PACKETS)
+ dev->n_port_rcv_packets = rpkts;
+
+ if (p->counter_select & IB_PMA_SELX_PORT_UNI_XMIT_PACKETS)
+ dev->n_unicast_xmit = 0;
+
+ if (p->counter_select & IB_PMA_SELX_PORT_UNI_RCV_PACKETS)
+ dev->n_unicast_rcv = 0;
+
+ if (p->counter_select & IB_PMA_SELX_PORT_MULTI_XMIT_PACKETS)
+ dev->n_multicast_xmit = 0;
+
+ if (p->counter_select & IB_PMA_SELX_PORT_MULTI_RCV_PACKETS)
+ dev->n_multicast_rcv = 0;
+
+ return recv_pma_get_portcounters_ext(pmp, ibdev, port);
+}
+
+static inline int process_subn(struct ib_device *ibdev, int mad_flags,
+ u8 port_num, struct ib_mad *in_mad,
+ struct ib_mad *out_mad)
+{
+ struct ib_smp *smp = (struct ib_smp *)out_mad;
+ struct ipath_ibdev *dev = to_idev(ibdev);
+
+ /* Is the mkey in the process of expiring? */
+ if (dev->mkey_lease_timeout && jiffies >= dev->mkey_lease_timeout) {
+ dev->mkey_lease_timeout = 0;
+ dev->mkeyprot_resv_lmc &= 0x3F;
+ }
+
+ /*
+ * M_Key checking depends on
+ * Portinfo:M_Key_protect_bits
+ */
+ if ((mad_flags & IB_MAD_IGNORE_MKEY) == 0 && dev->mkey != 0 &&
+ dev->mkey != smp->mkey && (smp->method != IB_MGMT_METHOD_GET ||
+ (dev->mkeyprot_resv_lmc >> 7) != 0)) {
+ if (dev->mkey_violations != 0xFFFF)
+ ++dev->mkey_violations;
+ if (dev->mkey_lease_timeout || dev->mkey_lease_period == 0)
+ return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED;
+ dev->mkey_lease_timeout = jiffies + dev->mkey_lease_period * HZ;
+ /* Future: Generate a trap notice. */
+ return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED;
+ }
+
+ *out_mad = *in_mad;
+ switch (smp->method) {
+ case IB_MGMT_METHOD_GET:
+ switch (smp->attr_id) {
+ case IB_SMP_ATTR_NODE_DESC:
+ return recv_subn_get_nodedescription(smp);
+
+ case IB_SMP_ATTR_NODE_INFO:
+ return recv_subn_get_nodeinfo(smp, ibdev, port_num);
+
+ case IB_SMP_ATTR_GUID_INFO:
+ return recv_subn_get_guidinfo(smp, ibdev);
+
+ case IB_SMP_ATTR_PORT_INFO:
+ return recv_subn_get_portinfo(smp, ibdev, port_num);
+
+ case IB_SMP_ATTR_PKEY_TABLE:
+ return recv_subn_get_pkeytable(smp, ibdev);
+
+ default:
+ break;
+ }
+ break;
+
+ case IB_MGMT_METHOD_SET:
+ switch (smp->attr_id) {
+ case IB_SMP_ATTR_GUID_INFO:
+ return recv_subn_set_guidinfo(smp, ibdev);
+
+ case IB_SMP_ATTR_PORT_INFO:
+ return recv_subn_set_portinfo(smp, ibdev, port_num);
+
+ case IB_SMP_ATTR_PKEY_TABLE:
+ return recv_subn_set_pkeytable(smp, ibdev);
+
+ default:
+ break;
+ }
+ break;
+
+ default:
+ break;
+ }
+ return IB_MAD_RESULT_FAILURE;
+}
+
+static inline int process_perf(struct ib_device *ibdev, u8 port_num,
+ struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+ struct ib_perf *pmp = (struct ib_perf *)out_mad;
+
+ *out_mad = *in_mad;
+ switch (pmp->method) {
+ case IB_MGMT_METHOD_GET:
+ switch (pmp->attr_id) {
+ case IB_PMA_CLASS_PORT_INFO:
+ return recv_pma_get_classportinfo(pmp);
+
+ case IB_PMA_PORT_SAMPLES_CONTROL:
+ return recv_pma_get_portsamplescontrol(pmp, ibdev,
+ port_num);
+
+ case IB_PMA_PORT_SAMPLES_RESULT:
+ return recv_pma_get_portsamplesresult(pmp, ibdev);
+
+ case IB_PMA_PORT_SAMPLES_RESULT_EXT:
+ return recv_pma_get_portsamplesresult_ext(pmp, ibdev);
+
+ case IB_PMA_PORT_COUNTERS:
+ return recv_pma_get_portcounters(pmp, ibdev, port_num);
+
+ case IB_PMA_PORT_COUNTERS_EXT:
+ return recv_pma_get_portcounters_ext(pmp, ibdev,
+ port_num);
+
+ default:
+ break;
+ }
+ break;
+
+ case IB_MGMT_METHOD_SET:
+ switch (pmp->attr_id) {
+ case IB_PMA_PORT_SAMPLES_CONTROL:
+ return recv_pma_set_portsamplescontrol(pmp, ibdev,
+ port_num);
+
+ case IB_PMA_PORT_COUNTERS:
+ return recv_pma_set_portcounters(pmp, ibdev, port_num);
+
+ case IB_PMA_PORT_COUNTERS_EXT:
+ return recv_pma_set_portcounters_ext(pmp, ibdev,
+ port_num);
+
+ default:
+ break;
+ }
+ break;
+
+ default:
+ break;
+ }
+ return IB_MAD_RESULT_FAILURE;
+}
+
+/*
+ * Note that the verbs framework has already done the MAD sanity checks,
+ * and hop count/pointer updating for IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE MADs.
+ *
+ * Return IB_MAD_RESULT_SUCCESS if this is a MAD that we are not interested
+ * in processing.
+ */
+int ipath_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num,
+ struct ib_wc *in_wc, struct ib_grh *in_grh,
+ struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+ switch (in_mad->mad_hdr.mgmt_class) {
+ case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE:
+ case IB_MGMT_CLASS_SUBN_LID_ROUTED:
+ return process_subn(ibdev, mad_flags, port_num,
+ in_mad, out_mad);
+
+ case IB_MGMT_CLASS_PERF_MGMT:
+ return process_perf(ibdev, port_num, in_mad, out_mad);
+
+ default:
+ return IB_MAD_RESULT_SUCCESS;
+ }
+}
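
(Aside, not part of the patch: ipath_process_mad() is meant to be installed
as the ib_device's process_mad hook when the device is registered with the
core, which happens elsewhere in this series.  A rough sketch of that wiring
follows; the setup function and the name of the embedded ib_device member
are assumptions for illustration.)

static void example_register_mad_handler(struct ipath_ibdev *idev)
{
	/* the MAD core will route SMA and PMA packets through this hook */
	idev->ibdev.process_mad = ipath_process_mad;
}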

2005-12-29 00:40:19

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 16 of 20] ipath - infiniband verbs support, part 2 of 3

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r 471b7a7a005c -r fc067af322a1 drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Dec 28 14:19:43 2005 -0800
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Dec 28 14:19:43 2005 -0800
@@ -2305,3 +2305,2513 @@
spin_unlock_irqrestore(&qp->s_lock, flags);
clear_bit(IPATH_S_BUSY, &qp->s_flags);
}
+
+static void send_rc_ack(struct ipath_qp *qp)
+{
+ struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+ u16 lrh0;
+ u32 bth0;
+ u32 hwords;
+ struct ipath_other_headers *ohdr;
+
+ /* Construct the header. */
+ ohdr = &qp->s_hdr.u.oth;
+ lrh0 = IPS_LRH_BTH;
+	/* header size in 32-bit words: LRH+BTH+AETH = (8+12+4)/4 */
+ hwords = 6;
+ if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) {
+ ohdr = &qp->s_hdr.u.l.oth;
+ /* Header size in 32-bit words. */
+ hwords += 10;
+ lrh0 = IPS_LRH_GRH;
+ qp->s_hdr.u.l.grh.version_tclass_flow =
+ cpu_to_be32((6 << 28) |
+ (qp->remote_ah_attr.grh.traffic_class << 20) |
+ qp->remote_ah_attr.grh.flow_label);
+ qp->s_hdr.u.l.grh.paylen =
+ cpu_to_be16(((hwords - 12) + SIZE_OF_CRC) << 2);
+ qp->s_hdr.u.l.grh.next_hdr = 0x1B;
+ qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit;
+ /* The SGID is 32-bit aligned. */
+ qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix;
+ qp->s_hdr.u.l.grh.sgid.global.interface_id =
+ ipath_layer_get_guid(dev->ib_unit);
+ qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid;
+ }
+ bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index);
+ ohdr->u.aeth = ipath_compute_aeth(qp);
+ if (qp->s_ack_state >= IB_OPCODE_RC_COMPARE_SWAP) {
+ bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24;
+ ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic);
+ hwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4;
+ } else {
+ bth0 |= IB_OPCODE_RC_ACKNOWLEDGE << 24;
+ }
+ lrh0 |= qp->remote_ah_attr.sl << 4;
+ qp->s_hdr.lrh[0] = cpu_to_be16(lrh0);
+ /* DEST LID */
+ qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid);
+ qp->s_hdr.lrh[2] = cpu_to_be16(hwords + SIZE_OF_CRC);
+ qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit));
+ ohdr->bth[0] = cpu_to_be32(bth0);
+ ohdr->bth[1] = cpu_to_be32(qp->remote_qpn);
+ ohdr->bth[2] = cpu_to_be32(qp->s_ack_psn & 0xFFFFFF);
+
+ /*
+ * If we can send the ACK, clear the ACK state.
+ */
+ if (ipath_verbs_send(dev->ib_unit, hwords, (uint32_t *) &qp->s_hdr,
+ 0, NULL) == 0) {
+ qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE;
+ dev->n_rc_qacks++;
+ dev->n_unicast_xmit++;
+ }
+}
+
+/*
+ * Back up the requester to resend the last un-ACKed request.
+ * The QP s_lock should be held.
+ */
+static void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc)
+{
+ struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last);
+ struct ipath_ibdev *dev;
+ u32 n;
+
+ /*
+ * If there are no requests pending, we are done.
+ */
+ if (cmp24(psn, qp->s_next_psn) >= 0 || qp->s_last == qp->s_tail)
+ goto done;
+
+ if (qp->s_retry == 0) {
+ wc->wr_id = wqe->wr.wr_id;
+ wc->status = IB_WC_RETRY_EXC_ERR;
+ wc->opcode = wc_opcode[wqe->wr.opcode];
+ wc->vendor_err = 0;
+ wc->byte_len = 0;
+ wc->qp_num = qp->ibqp.qp_num;
+ wc->src_qp = qp->remote_qpn;
+ wc->pkey_index = 0;
+ wc->slid = qp->remote_ah_attr.dlid;
+ wc->sl = qp->remote_ah_attr.sl;
+ wc->dlid_path_bits = 0;
+ wc->port_num = 0;
+ ipath_sqerror_qp(qp, wc);
+ return;
+ }
+ qp->s_retry--;
+
+ /*
+ * Remove the QP from the timeout queue.
+ * Note: it may already have been removed by ipath_ib_timer().
+ */
+ dev = to_idev(qp->ibqp.device);
+ spin_lock(&dev->pending_lock);
+ if (qp->timerwait.next != LIST_POISON1)
+ list_del(&qp->timerwait);
+ spin_unlock(&dev->pending_lock);
+
+ if (wqe->wr.opcode == IB_WR_RDMA_READ)
+ dev->n_rc_resends++;
+ else
+ dev->n_rc_resends += (int)qp->s_psn - (int)psn;
+
+ /*
+ * If we are starting the request from the beginning, let the
+ * normal send code handle initialization.
+ */
+ qp->s_cur = qp->s_last;
+ if (cmp24(psn, wqe->psn) <= 0) {
+ qp->s_state = IB_OPCODE_RC_SEND_LAST;
+ qp->s_psn = wqe->psn;
+ } else {
+ n = qp->s_cur;
+ for (;;) {
+ if (++n == qp->s_size)
+ n = 0;
+ if (n == qp->s_tail) {
+ if (cmp24(psn, qp->s_next_psn) >= 0) {
+ qp->s_cur = n;
+ wqe = get_swqe_ptr(qp, n);
+ }
+ break;
+ }
+ wqe = get_swqe_ptr(qp, n);
+ if (cmp24(psn, wqe->psn) < 0)
+ break;
+ qp->s_cur = n;
+ }
+ qp->s_psn = psn;
+
+ /*
+ * Reset the state to restart in the middle of a request.
+ * Don't change the s_sge, s_cur_sge, or s_cur_size.
+ * See do_rc_send().
+ */
+ switch (wqe->wr.opcode) {
+ case IB_WR_SEND:
+ case IB_WR_SEND_WITH_IMM:
+ qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST;
+ break;
+
+ case IB_WR_RDMA_WRITE:
+ case IB_WR_RDMA_WRITE_WITH_IMM:
+ qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST;
+ break;
+
+ case IB_WR_RDMA_READ:
+ qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE;
+ break;
+
+ default:
+			/*
+			 * This case shouldn't happen, since these opcodes
+			 * use only one PSN per request.
+			 */
+ qp->s_state = IB_OPCODE_RC_SEND_LAST;
+ }
+ }
+
+done:
+ tasklet_schedule(&qp->s_task);
+}
+
+/*
+ * Handle RC and UC post sends.
+ */
+static int ipath_post_rc_send(struct ipath_qp *qp, struct ib_send_wr *wr)
+{
+ struct ipath_swqe *wqe;
+ unsigned long flags;
+ u32 next;
+ int i, j;
+ int acc;
+
+ /*
+ * Don't allow RDMA reads or atomic operations on UC or
+ * undefined operations.
+ * Make sure buffer is large enough to hold the result for atomics.
+ */
+ if (qp->ibqp.qp_type == IB_QPT_UC) {
+ if ((unsigned) wr->opcode >= IB_WR_RDMA_READ)
+ return -EINVAL;
+ } else if ((unsigned) wr->opcode > IB_WR_ATOMIC_FETCH_AND_ADD)
+ return -EINVAL;
+ else if (wr->opcode >= IB_WR_ATOMIC_CMP_AND_SWP &&
+ (wr->num_sge == 0 || wr->sg_list[0].length < sizeof(u64) ||
+ wr->sg_list[0].addr & 0x7))
+ return -EINVAL;
+
+ /* IB spec says that num_sge == 0 is OK. */
+ if (wr->num_sge > qp->s_max_sge)
+ return -ENOMEM;
+
+ spin_lock_irqsave(&qp->s_lock, flags);
+ next = qp->s_head + 1;
+ if (next >= qp->s_size)
+ next = 0;
+ if (next == qp->s_last) {
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+ return -EINVAL;
+ }
+
+ wqe = get_swqe_ptr(qp, qp->s_head);
+ wqe->wr = *wr;
+ wqe->ssn = qp->s_ssn++;
+ wqe->sg_list[0].mr = NULL;
+ wqe->sg_list[0].vaddr = NULL;
+ wqe->sg_list[0].length = 0;
+ wqe->sg_list[0].sge_length = 0;
+ wqe->length = 0;
+ acc = wr->opcode >= IB_WR_RDMA_READ ? IB_ACCESS_LOCAL_WRITE : 0;
+ for (i = 0, j = 0; i < wr->num_sge; i++) {
+ if (to_ipd(qp->ibqp.pd)->user && wr->sg_list[i].lkey == 0) {
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+ return -EINVAL;
+ }
+ if (wr->sg_list[i].length == 0)
+ continue;
+ if (!ipath_lkey_ok(&to_idev(qp->ibqp.device)->lk_table,
+ &wqe->sg_list[j], &wr->sg_list[i], acc)) {
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+ return -EINVAL;
+ }
+ wqe->length += wr->sg_list[i].length;
+ j++;
+ }
+ wqe->wr.num_sge = j;
+ qp->s_head = next;
+ /*
+ * Wake up the send tasklet if the QP is not waiting
+ * for an RNR timeout.
+ */
+ next = qp->s_rnr_timeout;
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+
+ if (next == 0) {
+ if (qp->ibqp.qp_type == IB_QPT_UC)
+ do_uc_send((unsigned long) qp);
+ else
+ do_rc_send((unsigned long) qp);
+ }
+ return 0;
+}
+
+/*
+ * Note that we actually send the data as it is posted instead of putting
+ * the request into a ring buffer. If we wanted to use a ring buffer,
+ * we would need to save a reference to the destination address in the SWQE.
+ */
+static int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr)
+{
+ struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+ struct ipath_other_headers *ohdr;
+ struct ib_ah_attr *ah_attr;
+ struct ipath_sge_state ss;
+ struct ipath_sge *sg_list;
+ struct ib_wc wc;
+ u32 hwords;
+ u32 nwords;
+ u32 len;
+ u32 extra_bytes;
+ u32 bth0;
+ u16 lrh0;
+ u16 lid;
+ int i;
+
+ if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK))
+ return 0;
+
+ /* IB spec says that num_sge == 0 is OK. */
+ if (wr->num_sge > qp->s_max_sge)
+ return -EINVAL;
+
+ if (wr->num_sge > 1) {
+ sg_list = kmalloc((qp->s_max_sge - 1) * sizeof(*sg_list),
+ GFP_ATOMIC);
+ if (!sg_list)
+ return -ENOMEM;
+ } else
+ sg_list = NULL;
+
+ /* Check the buffer to send. */
+ ss.sg_list = sg_list;
+ ss.sge.mr = NULL;
+ ss.sge.vaddr = NULL;
+ ss.sge.length = 0;
+ ss.sge.sge_length = 0;
+ ss.num_sge = 0;
+ len = 0;
+ for (i = 0; i < wr->num_sge; i++) {
+ /* Check LKEY */
+ if (to_ipd(qp->ibqp.pd)->user && wr->sg_list[i].lkey == 0)
+ return -EINVAL;
+
+ if (wr->sg_list[i].length == 0)
+ continue;
+ if (!ipath_lkey_ok(&dev->lk_table, ss.num_sge ?
+ sg_list + ss.num_sge : &ss.sge,
+ &wr->sg_list[i], 0)) {
+ return -EINVAL;
+ }
+ len += wr->sg_list[i].length;
+ ss.num_sge++;
+ }
+ extra_bytes = (4 - len) & 3;
+ nwords = (len + extra_bytes) >> 2;
+
+ /* Construct the header. */
+ ah_attr = &to_iah(wr->wr.ud.ah)->attr;
+ if (ah_attr->dlid >= 0xC000 && ah_attr->dlid < 0xFFFF)
+ dev->n_multicast_xmit++;
+ else
+ dev->n_unicast_xmit++;
+ if (unlikely(ah_attr->dlid == ipath_layer_get_lid(dev->ib_unit))) {
+ /* Pass in an uninitialized ib_wc to save stack space. */
+ ipath_ud_loopback(qp, &ss, len, wr, &wc);
+ goto done;
+ }
+ if (ah_attr->ah_flags & IB_AH_GRH) {
+ /* Header size in 32-bit words. */
+ hwords = 17;
+ lrh0 = IPS_LRH_GRH;
+ ohdr = &qp->s_hdr.u.l.oth;
+ qp->s_hdr.u.l.grh.version_tclass_flow =
+ cpu_to_be32((6 << 28) |
+ (ah_attr->grh.traffic_class << 20) |
+ ah_attr->grh.flow_label);
+ qp->s_hdr.u.l.grh.paylen =
+ cpu_to_be16(((wr->opcode ==
+ IB_WR_SEND_WITH_IMM ? 6 : 5) + nwords +
+ SIZE_OF_CRC) << 2);
+ qp->s_hdr.u.l.grh.next_hdr = 0x1B;
+ qp->s_hdr.u.l.grh.hop_limit = ah_attr->grh.hop_limit;
+ /* The SGID is 32-bit aligned. */
+ qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix;
+ qp->s_hdr.u.l.grh.sgid.global.interface_id =
+ ipath_layer_get_guid(dev->ib_unit);
+ qp->s_hdr.u.l.grh.dgid = ah_attr->grh.dgid;
+ /*
+ * Don't worry about sending to locally attached
+ * multicast QPs; the IB spec leaves that behavior unspecified.
+ */
+ } else {
+ /* Header size in 32-bit words. */
+ hwords = 7;
+ lrh0 = IPS_LRH_BTH;
+ ohdr = &qp->s_hdr.u.oth;
+ }
+ if (wr->opcode == IB_WR_SEND_WITH_IMM) {
+ ohdr->u.ud.imm_data = wr->imm_data;
+ wc.imm_data = wr->imm_data;
+ hwords += 1;
+ bth0 = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE << 24;
+ } else if (wr->opcode == IB_WR_SEND) {
+ wc.imm_data = 0;
+ bth0 = IB_OPCODE_UD_SEND_ONLY << 24;
+ } else
+ return -EINVAL;
+ lrh0 |= ah_attr->sl << 4;
+ if (qp->ibqp.qp_type == IB_QPT_SMI)
+ lrh0 |= 0xF000; /* Set VL */
+ qp->s_hdr.lrh[0] = cpu_to_be16(lrh0);
+ qp->s_hdr.lrh[1] = cpu_to_be16(ah_attr->dlid); /* DEST LID */
+ qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC);
+ lid = ipath_layer_get_lid(dev->ib_unit);
+ qp->s_hdr.lrh[3] = lid ? cpu_to_be16(lid) : IB_LID_PERMISSIVE;
+ if (wr->send_flags & IB_SEND_SOLICITED)
+ bth0 |= 1 << 23;
+ bth0 |= extra_bytes << 20;
+ bth0 |= qp->ibqp.qp_type == IB_QPT_SMI ? IPS_DEFAULT_P_KEY :
+ ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index);
+ ohdr->bth[0] = cpu_to_be32(bth0);
+ ohdr->bth[1] = cpu_to_be32(wr->wr.ud.remote_qpn);
+ /* XXX Could lose a PSN count but not worth locking */
+ ohdr->bth[2] = cpu_to_be32(qp->s_psn++ & 0xFFFFFF);
+ /*
+ * Qkeys with the high order bit set mean use the
+ * qkey from the QP context instead of the WR (see 10.2.5).
+ */
+ ohdr->u.ud.deth[0] = cpu_to_be32((int)wr->wr.ud.remote_qkey < 0 ?
+ qp->qkey : wr->wr.ud.remote_qkey);
+ ohdr->u.ud.deth[1] = cpu_to_be32(qp->ibqp.qp_num);
+ if (ipath_verbs_send(dev->ib_unit, hwords, (uint32_t *) &qp->s_hdr,
+ len, &ss))
+ dev->n_no_piobuf++;
+
+done:
+ /* Queue the completion status entry. */
+ if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) ||
+ (wr->send_flags & IB_SEND_SIGNALED)) {
+ wc.wr_id = wr->wr_id;
+ wc.status = IB_WC_SUCCESS;
+ wc.vendor_err = 0;
+ wc.opcode = IB_WC_SEND;
+ wc.byte_len = len;
+ wc.qp_num = qp->ibqp.qp_num;
+ wc.src_qp = 0;
+ wc.wc_flags = 0;
+ /* XXX initialize other fields? */
+ ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0);
+ }
+ kfree(sg_list);
+
+ return 0;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static int ipath_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+ struct ib_send_wr **bad_wr)
+{
+ struct ipath_qp *qp = to_iqp(ibqp);
+ int err = 0;
+
+ /* Check that state is OK to post send. */
+ if (!(state_ops[qp->state] & IPATH_POST_SEND_OK)) {
+ *bad_wr = wr;
+ return -EINVAL;
+ }
+
+ for (; wr; wr = wr->next) {
+ switch (qp->ibqp.qp_type) {
+ case IB_QPT_UC:
+ case IB_QPT_RC:
+ err = ipath_post_rc_send(qp, wr);
+ break;
+
+ case IB_QPT_SMI:
+ case IB_QPT_GSI:
+ case IB_QPT_UD:
+ err = ipath_post_ud_send(qp, wr);
+ break;
+
+ default:
+ err = -EINVAL;
+ }
+ if (err) {
+ *bad_wr = wr;
+ break;
+ }
+ }
+ return err;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static int ipath_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+ struct ib_recv_wr **bad_wr)
+{
+ struct ipath_qp *qp = to_iqp(ibqp);
+ unsigned long flags;
+
+ /* Check that state is OK to post receive. */
+ if (!(state_ops[qp->state] & IPATH_POST_RECV_OK)) {
+ *bad_wr = wr;
+ return -EINVAL;
+ }
+
+ for (; wr; wr = wr->next) {
+ struct ipath_rwqe *wqe;
+ u32 next;
+ int i, j;
+
+ if (wr->num_sge > qp->r_rq.max_sge) {
+ *bad_wr = wr;
+ return -ENOMEM;
+ }
+
+ spin_lock_irqsave(&qp->r_rq.lock, flags);
+ next = qp->r_rq.head + 1;
+ if (next >= qp->r_rq.size)
+ next = 0;
+ if (next == qp->r_rq.tail) {
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ *bad_wr = wr;
+ return -ENOMEM;
+ }
+
+ wqe = get_rwqe_ptr(&qp->r_rq, qp->r_rq.head);
+ wqe->wr_id = wr->wr_id;
+ wqe->sg_list[0].mr = NULL;
+ wqe->sg_list[0].vaddr = NULL;
+ wqe->sg_list[0].length = 0;
+ wqe->sg_list[0].sge_length = 0;
+ wqe->length = 0;
+ for (i = 0, j = 0; i < wr->num_sge; i++) {
+ /* Check LKEY */
+ if (to_ipd(qp->ibqp.pd)->user &&
+ wr->sg_list[i].lkey == 0) {
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ *bad_wr = wr;
+ return -EINVAL;
+ }
+ if (wr->sg_list[i].length == 0)
+ continue;
+ if (!ipath_lkey_ok(&to_idev(qp->ibqp.device)->lk_table,
+ &wqe->sg_list[j], &wr->sg_list[i],
+ IB_ACCESS_LOCAL_WRITE)) {
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ *bad_wr = wr;
+ return -EINVAL;
+ }
+ wqe->length += wr->sg_list[i].length;
+ j++;
+ }
+ wqe->num_sge = j;
+ qp->r_rq.head = next;
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ }
+ return 0;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static int ipath_post_srq_receive(struct ib_srq *ibsrq, struct ib_recv_wr *wr,
+ struct ib_recv_wr **bad_wr)
+{
+ struct ipath_srq *srq = to_isrq(ibsrq);
+ struct ipath_ibdev *dev = to_idev(ibsrq->device);
+ unsigned long flags;
+
+ for (; wr; wr = wr->next) {
+ struct ipath_rwqe *wqe;
+ u32 next;
+ int i, j;
+
+ if (wr->num_sge > srq->rq.max_sge) {
+ *bad_wr = wr;
+ return -ENOMEM;
+ }
+
+ spin_lock_irqsave(&srq->rq.lock, flags);
+ next = srq->rq.head + 1;
+ if (next >= srq->rq.size)
+ next = 0;
+ if (next == srq->rq.tail) {
+ spin_unlock_irqrestore(&srq->rq.lock, flags);
+ *bad_wr = wr;
+ return -ENOMEM;
+ }
+
+ wqe = get_rwqe_ptr(&srq->rq, srq->rq.head);
+ wqe->wr_id = wr->wr_id;
+ wqe->sg_list[0].mr = NULL;
+ wqe->sg_list[0].vaddr = NULL;
+ wqe->sg_list[0].length = 0;
+ wqe->sg_list[0].sge_length = 0;
+ wqe->length = 0;
+ for (i = 0, j = 0; i < wr->num_sge; i++) {
+ /* Check LKEY */
+ if (to_ipd(srq->ibsrq.pd)->user &&
+ wr->sg_list[i].lkey == 0) {
+ spin_unlock_irqrestore(&srq->rq.lock, flags);
+ *bad_wr = wr;
+ return -EINVAL;
+ }
+ if (wr->sg_list[i].length == 0)
+ continue;
+ if (!ipath_lkey_ok(&dev->lk_table,
+ &wqe->sg_list[j], &wr->sg_list[i],
+ IB_ACCESS_LOCAL_WRITE)) {
+ spin_unlock_irqrestore(&srq->rq.lock, flags);
+ *bad_wr = wr;
+ return -EINVAL;
+ }
+ wqe->length += wr->sg_list[i].length;
+ j++;
+ }
+ wqe->num_sge = j;
+ srq->rq.head = next;
+ spin_unlock_irqrestore(&srq->rq.lock, flags);
+ }
+ return 0;
+}
+
+/*
+ * This is called from ipath_qp_rcv() to process an incoming UD packet
+ * for the given QP.
+ * Called at interrupt level.
+ */
+static void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
+ int has_grh, void *data, u32 tlen, struct ipath_qp *qp)
+{
+ struct ipath_other_headers *ohdr;
+ int opcode;
+ u32 hdrsize;
+ u32 pad;
+ unsigned long flags;
+ struct ib_wc wc;
+ u32 qkey;
+ u32 src_qp;
+ struct ipath_rq *rq;
+ struct ipath_srq *srq;
+ struct ipath_rwqe *wqe;
+
+ /* Check for GRH */
+ if (!has_grh) {
+ ohdr = &hdr->u.oth;
+ hdrsize = 8 + 12 + 8; /* LRH + BTH + DETH */
+ qkey = be32_to_cpu(ohdr->u.ud.deth[0]);
+ src_qp = be32_to_cpu(ohdr->u.ud.deth[1]);
+ } else {
+ ohdr = &hdr->u.l.oth;
+ hdrsize = 8 + 40 + 12 + 8; /* LRH + GRH + BTH + DETH */
+ /*
+ * The header with GRH is 68 bytes and the
+ * core driver sets the eager header buffer
+ * size to 56 bytes so the last 12 bytes of
+ * the IB header are in the data buffer.
+ */
+ qkey = be32_to_cpu(((u32 *) data)[1]);
+ src_qp = be32_to_cpu(((u32 *) data)[2]);
+ data += 12;
+ }
+ src_qp &= 0xFFFFFF;
+
+ /* Check that the qkey matches (except for QP0, see 9.6.1.4.1). */
+ if (unlikely(qp->ibqp.qp_num && qkey != qp->qkey)) {
+ /* XXX OK to lose a count once in a while. */
+ dev->qkey_violations++;
+ dev->n_pkt_drops++;
+ return;
+ }
+
+ /* Get the number of bytes the message was padded by. */
+ pad = (ohdr->bth[0] >> 12) & 3;
+ if (unlikely(tlen < (hdrsize + pad + 4))) {
+ /* Drop incomplete packets. */
+ dev->n_pkt_drops++;
+ return;
+ }
+
+ /*
+ * A GRH is expected to precede the data even if not
+ * present on the wire.
+ */
+ wc.byte_len = tlen - (hdrsize + pad + 4) + sizeof(struct ib_grh);
+
+ /*
+ * The opcode is in the low byte when it's in network order
+ * (top byte when in host order).
+ */
+ opcode = *(u8 *) (&ohdr->bth[0]);
+ if (opcode == IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE) {
+ if (has_grh) {
+ wc.imm_data = *(u32 *) data;
+ data += sizeof(u32);
+ } else
+ wc.imm_data = ohdr->u.ud.imm_data;
+ wc.wc_flags = IB_WC_WITH_IMM;
+ hdrsize += sizeof(u32);
+ } else if (opcode == IB_OPCODE_UD_SEND_ONLY) {
+ wc.imm_data = 0;
+ wc.wc_flags = 0;
+ } else {
+ dev->n_pkt_drops++;
+ return;
+ }
+
+ /*
+ * Get the next work request entry to find where to put the data.
+ * Note that it is safe to drop the lock after changing rq->tail
+ * since ipath_post_receive() won't fill the empty slot.
+ */
+ if (qp->ibqp.srq) {
+ srq = to_isrq(qp->ibqp.srq);
+ rq = &srq->rq;
+ } else {
+ srq = NULL;
+ rq = &qp->r_rq;
+ }
+ spin_lock_irqsave(&rq->lock, flags);
+ if (rq->tail == rq->head) {
+ spin_unlock_irqrestore(&rq->lock, flags);
+ dev->n_pkt_drops++;
+ return;
+ }
+ /* Silently drop packets which are too big. */
+ wqe = get_rwqe_ptr(rq, rq->tail);
+ if (wc.byte_len > wqe->length) {
+ spin_unlock_irqrestore(&rq->lock, flags);
+ dev->n_pkt_drops++;
+ return;
+ }
+ wc.wr_id = wqe->wr_id;
+ qp->r_sge.sge = wqe->sg_list[0];
+ qp->r_sge.sg_list = wqe->sg_list + 1;
+ qp->r_sge.num_sge = wqe->num_sge;
+ if (++rq->tail >= rq->size)
+ rq->tail = 0;
+ if (srq && srq->ibsrq.event_handler) {
+ u32 n;
+
+ if (rq->head < rq->tail)
+ n = rq->size + rq->head - rq->tail;
+ else
+ n = rq->head - rq->tail;
+ if (n < srq->limit) {
+ struct ib_event ev;
+
+ srq->limit = 0;
+ spin_unlock_irqrestore(&rq->lock, flags);
+ ev.device = qp->ibqp.device;
+ ev.element.srq = qp->ibqp.srq;
+ ev.event = IB_EVENT_SRQ_LIMIT_REACHED;
+ srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context);
+ } else
+ spin_unlock_irqrestore(&rq->lock, flags);
+ } else
+ spin_unlock_irqrestore(&rq->lock, flags);
+ if (has_grh) {
+ copy_sge(&qp->r_sge, &hdr->u.l.grh, sizeof(struct ib_grh));
+ wc.wc_flags |= IB_WC_GRH;
+ } else
+ skip_sge(&qp->r_sge, sizeof(struct ib_grh));
+ copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh));
+ wc.status = IB_WC_SUCCESS;
+ wc.opcode = IB_WC_RECV;
+ wc.vendor_err = 0;
+ wc.qp_num = qp->ibqp.qp_num;
+ wc.src_qp = src_qp;
+ /* XXX do we know which pkey matched? Only needed for GSI. */
+ wc.pkey_index = 0;
+ wc.slid = be16_to_cpu(hdr->lrh[3]);
+ wc.sl = (be16_to_cpu(hdr->lrh[0]) >> 4) & 0xF;
+ wc.dlid_path_bits = 0;
+ /* Signal completion event if the solicited bit is set. */
+ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc,
+ ohdr->bth[0] & __constant_cpu_to_be32(1 << 23));
+}
+
+/*
+ * This is called from ipath_post_ud_send() to forward a WQE addressed
+ * to the same HCA.
+ */
+static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss,
+ u32 length, struct ib_send_wr *wr,
+ struct ib_wc *wc)
+{
+ struct ipath_ibdev *dev = to_idev(sqp->ibqp.device);
+ struct ipath_qp *qp;
+ struct ib_ah_attr *ah_attr;
+ unsigned long flags;
+ struct ipath_rq *rq;
+ struct ipath_srq *srq;
+ struct ipath_sge_state rsge;
+ struct ipath_sge *sge;
+ struct ipath_rwqe *wqe;
+
+ qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn);
+ if (!qp)
+ return;
+
+ /*
+ * Check that the qkey matches (except for QP0, see 9.6.1.4.1).
+ * Qkeys with the high order bit set mean use the
+ * qkey from the QP context instead of the WR (see 10.2.5).
+ */
+ if (unlikely(qp->ibqp.qp_num && ((int)wr->wr.ud.remote_qkey < 0 ?
+ qp->qkey : wr->wr.ud.remote_qkey) != qp->qkey)) {
+ /* XXX OK to lose a count once in a while. */
+ dev->qkey_violations++;
+ dev->n_pkt_drops++;
+ goto done;
+ }
+
+ /*
+ * A GRH is expected to precede the data even if not
+ * present on the wire.
+ */
+ wc->byte_len = length + sizeof(struct ib_grh);
+
+ if (wr->opcode == IB_WR_SEND_WITH_IMM) {
+ wc->wc_flags = IB_WC_WITH_IMM;
+ wc->imm_data = wr->imm_data;
+ } else {
+ wc->wc_flags = 0;
+ wc->imm_data = 0;
+ }
+
+ /*
+ * Get the next work request entry to find where to put the data.
+ * Note that it is safe to drop the lock after changing rq->tail
+ * since ipath_post_receive() won't fill the empty slot.
+ */
+ if (qp->ibqp.srq) {
+ srq = to_isrq(qp->ibqp.srq);
+ rq = &srq->rq;
+ } else {
+ srq = NULL;
+ rq = &qp->r_rq;
+ }
+ spin_lock_irqsave(&rq->lock, flags);
+ if (rq->tail == rq->head) {
+ spin_unlock_irqrestore(&rq->lock, flags);
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ /* Silently drop packets which are too big. */
+ wqe = get_rwqe_ptr(rq, rq->tail);
+ if (wc->byte_len > wqe->length) {
+ spin_unlock_irqrestore(&rq->lock, flags);
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ wc->wr_id = wqe->wr_id;
+ rsge.sge = wqe->sg_list[0];
+ rsge.sg_list = wqe->sg_list + 1;
+ rsge.num_sge = wqe->num_sge;
+ if (++rq->tail >= rq->size)
+ rq->tail = 0;
+ if (srq && srq->ibsrq.event_handler) {
+ u32 n;
+
+ if (rq->head < rq->tail)
+ n = rq->size + rq->head - rq->tail;
+ else
+ n = rq->head - rq->tail;
+ if (n < srq->limit) {
+ struct ib_event ev;
+
+ srq->limit = 0;
+ spin_unlock_irqrestore(&rq->lock, flags);
+ ev.device = qp->ibqp.device;
+ ev.element.srq = qp->ibqp.srq;
+ ev.event = IB_EVENT_SRQ_LIMIT_REACHED;
+ srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context);
+ } else
+ spin_unlock_irqrestore(&rq->lock, flags);
+ } else
+ spin_unlock_irqrestore(&rq->lock, flags);
+ ah_attr = &to_iah(wr->wr.ud.ah)->attr;
+ if (ah_attr->ah_flags & IB_AH_GRH) {
+ copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh));
+ wc->wc_flags |= IB_WC_GRH;
+ } else
+ skip_sge(&rsge, sizeof(struct ib_grh));
+ sge = &ss->sge;
+ while (length) {
+ u32 len = sge->length;
+
+ if (len > length)
+ len = length;
+ BUG_ON(len == 0);
+ copy_sge(&rsge, sge->vaddr, len);
+ sge->vaddr += len;
+ sge->length -= len;
+ sge->sge_length -= len;
+ if (sge->sge_length == 0) {
+ if (--ss->num_sge)
+ *sge = *ss->sg_list++;
+ } else if (sge->length == 0 && sge->mr != NULL) {
+ if (++sge->n >= IPATH_SEGSZ) {
+ if (++sge->m >= sge->mr->mapsz)
+ break;
+ sge->n = 0;
+ }
+ sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr;
+ sge->length = sge->mr->map[sge->m]->segs[sge->n].length;
+ }
+ length -= len;
+ }
+ wc->status = IB_WC_SUCCESS;
+ wc->opcode = IB_WC_RECV;
+ wc->vendor_err = 0;
+ wc->qp_num = qp->ibqp.qp_num;
+ wc->src_qp = sqp->ibqp.qp_num;
+ /* XXX do we know which pkey matched? Only needed for GSI. */
+ wc->pkey_index = 0;
+ wc->slid = ipath_layer_get_lid(dev->ib_unit);
+ wc->sl = ah_attr->sl;
+ wc->dlid_path_bits = 0;
+ /* Signal completion event if the solicited bit is set. */
+ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc,
+ wr->send_flags & IB_SEND_SOLICITED);
+
+done:
+ if (atomic_dec_and_test(&qp->refcount))
+ wake_up(&qp->wait);
+}
+
+/*
+ * Copy the next RWQE from the receive queue into the QP's current RWQE state.
+ * Return zero if no RWQE is available.
+ * Called at interrupt level with the QP r_rq.lock held.
+ */
+static int get_rwqe(struct ipath_qp *qp, int wr_id_only)
+{
+ struct ipath_rq *rq;
+ struct ipath_srq *srq;
+ struct ipath_rwqe *wqe;
+
+ if (!qp->ibqp.srq) {
+ rq = &qp->r_rq;
+ if (unlikely(rq->tail == rq->head))
+ return 0;
+ wqe = get_rwqe_ptr(rq, rq->tail);
+ qp->r_wr_id = wqe->wr_id;
+ if (!wr_id_only) {
+ qp->r_sge.sge = wqe->sg_list[0];
+ qp->r_sge.sg_list = wqe->sg_list + 1;
+ qp->r_sge.num_sge = wqe->num_sge;
+ qp->r_len = wqe->length;
+ }
+ if (++rq->tail >= rq->size)
+ rq->tail = 0;
+ return 1;
+ }
+
+ srq = to_isrq(qp->ibqp.srq);
+ rq = &srq->rq;
+ spin_lock(&rq->lock);
+ if (unlikely(rq->tail == rq->head)) {
+ spin_unlock(&rq->lock);
+ return 0;
+ }
+ wqe = get_rwqe_ptr(rq, rq->tail);
+ qp->r_wr_id = wqe->wr_id;
+ if (!wr_id_only) {
+ qp->r_sge.sge = wqe->sg_list[0];
+ qp->r_sge.sg_list = wqe->sg_list + 1;
+ qp->r_sge.num_sge = wqe->num_sge;
+ qp->r_len = wqe->length;
+ }
+ if (++rq->tail >= rq->size)
+ rq->tail = 0;
+ if (srq->ibsrq.event_handler) {
+ struct ib_event ev;
+ u32 n;
+
+ if (rq->head < rq->tail)
+ n = rq->size + rq->head - rq->tail;
+ else
+ n = rq->head - rq->tail;
+ if (n < srq->limit) {
+ srq->limit = 0;
+ spin_unlock(&rq->lock);
+ ev.device = qp->ibqp.device;
+ ev.element.srq = qp->ibqp.srq;
+ ev.event = IB_EVENT_SRQ_LIMIT_REACHED;
+ srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context);
+ } else
+ spin_unlock(&rq->lock);
+ } else
+ spin_unlock(&rq->lock);
+ return 1;
+}
+
+/*
+ * This is called from ipath_qp_rcv() to process an incoming UC packet
+ * for the given QP.
+ * Called at interrupt level.
+ */
+static void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
+ int has_grh, void *data, u32 tlen, struct ipath_qp *qp)
+{
+ struct ipath_other_headers *ohdr;
+ int opcode;
+ u32 hdrsize;
+ u32 psn;
+ u32 pad;
+ unsigned long flags;
+ struct ib_wc wc;
+ u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu);
+ struct ib_reth *reth;
+
+ /* Check for GRH */
+ if (!has_grh) {
+ ohdr = &hdr->u.oth;
+ hdrsize = 8 + 12; /* LRH + BTH */
+ psn = be32_to_cpu(ohdr->bth[2]);
+ } else {
+ ohdr = &hdr->u.l.oth;
+ hdrsize = 8 + 40 + 12; /* LRH + GRH + BTH */
+ /*
+ * The header with GRH is 60 bytes and the
+ * core driver sets the eager header buffer
+ * size to 56 bytes so the last 4 bytes of
+ * the BTH header (PSN) are in the data buffer.
+ */
+ psn = be32_to_cpu(((u32 *) data)[0]);
+ data += sizeof(u32);
+ }
+ /*
+ * The opcode is in the low byte when it's in network order
+ * (top byte when in host order).
+ */
+ opcode = *(u8 *) (&ohdr->bth[0]);
+
+ wc.imm_data = 0;
+ wc.wc_flags = 0;
+
+ spin_lock_irqsave(&qp->r_rq.lock, flags);
+
+ /* Compare the PSN versus the expected PSN. */
+ if (unlikely(cmp24(psn, qp->r_psn) != 0)) {
+ /*
+ * Handle a sequence error.
+ * Silently drop any current message.
+ */
+ qp->r_psn = psn;
+ inv:
+ qp->r_state = IB_OPCODE_UC_SEND_LAST;
+ switch (opcode) {
+ case IB_OPCODE_UC_SEND_FIRST:
+ case IB_OPCODE_UC_SEND_ONLY:
+ case IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE:
+ goto send_first;
+
+ case IB_OPCODE_UC_RDMA_WRITE_FIRST:
+ case IB_OPCODE_UC_RDMA_WRITE_ONLY:
+ case IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE:
+ goto rdma_first;
+
+ default:
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ }
+
+ /* Check for opcode sequence errors. */
+ switch (qp->r_state) {
+ case IB_OPCODE_UC_SEND_FIRST:
+ case IB_OPCODE_UC_SEND_MIDDLE:
+ if (opcode == IB_OPCODE_UC_SEND_MIDDLE ||
+ opcode == IB_OPCODE_UC_SEND_LAST ||
+ opcode == IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE)
+ break;
+ goto inv;
+
+ case IB_OPCODE_UC_RDMA_WRITE_FIRST:
+ case IB_OPCODE_UC_RDMA_WRITE_MIDDLE:
+ if (opcode == IB_OPCODE_UC_RDMA_WRITE_MIDDLE ||
+ opcode == IB_OPCODE_UC_RDMA_WRITE_LAST ||
+ opcode == IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE)
+ break;
+ goto inv;
+
+ default:
+ if (opcode == IB_OPCODE_UC_SEND_FIRST ||
+ opcode == IB_OPCODE_UC_SEND_ONLY ||
+ opcode == IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE ||
+ opcode == IB_OPCODE_UC_RDMA_WRITE_FIRST ||
+ opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY ||
+ opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE)
+ break;
+ goto inv;
+ }
+
+ /* OK, process the packet. */
+ switch (opcode) {
+ case IB_OPCODE_UC_SEND_FIRST:
+ case IB_OPCODE_UC_SEND_ONLY:
+ case IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE:
+ send_first:
+ if (qp->r_reuse_sge) {
+ qp->r_reuse_sge = 0;
+ qp->r_sge = qp->s_rdma_sge;
+ } else if (!get_rwqe(qp, 0)) {
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ /* Save the WQE so we can reuse it in case of an error. */
+ qp->s_rdma_sge = qp->r_sge;
+ qp->r_rcv_len = 0;
+ if (opcode == IB_OPCODE_UC_SEND_ONLY)
+ goto send_last;
+ else if (opcode == IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE)
+ goto send_last_imm;
+ /* FALLTHROUGH */
+ case IB_OPCODE_UC_SEND_MIDDLE:
+ /* Check for invalid length PMTU or posted rwqe len. */
+ if (unlikely(tlen != (hdrsize + pmtu + 4))) {
+ qp->r_reuse_sge = 1;
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ qp->r_rcv_len += pmtu;
+ if (unlikely(qp->r_rcv_len > qp->r_len)) {
+ qp->r_reuse_sge = 1;
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ copy_sge(&qp->r_sge, data, pmtu);
+ break;
+
+ case IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE:
+ send_last_imm:
+ if (has_grh) {
+ wc.imm_data = *(u32 *) data;
+ data += sizeof(u32);
+ } else {
+ /* Immediate data comes after BTH */
+ wc.imm_data = ohdr->u.imm_data;
+ }
+ hdrsize += 4;
+ wc.wc_flags = IB_WC_WITH_IMM;
+ /* FALLTHROUGH */
+ case IB_OPCODE_UC_SEND_LAST:
+ send_last:
+ /* Get the number of bytes the message was padded by. */
+ pad = (ohdr->bth[0] >> 12) & 3;
+ /* Check for invalid length. */
+ /* XXX LAST len should be >= 1 */
+ if (unlikely(tlen < (hdrsize + pad + 4))) {
+ qp->r_reuse_sge = 1;
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ /* Don't count the CRC. */
+ tlen -= (hdrsize + pad + 4);
+ wc.byte_len = tlen + qp->r_rcv_len;
+ if (unlikely(wc.byte_len > qp->r_len)) {
+ qp->r_reuse_sge = 1;
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ /* XXX Need to free SGEs */
+ last_imm:
+ copy_sge(&qp->r_sge, data, tlen);
+ wc.wr_id = qp->r_wr_id;
+ wc.status = IB_WC_SUCCESS;
+ wc.opcode = IB_WC_RECV;
+ wc.vendor_err = 0;
+ wc.qp_num = qp->ibqp.qp_num;
+ wc.src_qp = qp->remote_qpn;
+ wc.pkey_index = 0;
+ wc.slid = qp->remote_ah_attr.dlid;
+ wc.sl = qp->remote_ah_attr.sl;
+ wc.dlid_path_bits = 0;
+ wc.port_num = 0;
+ /* Signal completion event if the solicited bit is set. */
+ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc,
+ ohdr->bth[0] & __constant_cpu_to_be32(1 << 23));
+ break;
+
+ case IB_OPCODE_UC_RDMA_WRITE_FIRST:
+ case IB_OPCODE_UC_RDMA_WRITE_ONLY:
+ case IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE: /* consume RWQE */
+ rdma_first:
+ /* RETH comes after BTH */
+ if (!has_grh)
+ reth = &ohdr->u.rc.reth;
+ else {
+ reth = (struct ib_reth *)data;
+ data += sizeof(*reth);
+ }
+ hdrsize += sizeof(*reth);
+ qp->r_len = be32_to_cpu(reth->length);
+ qp->r_rcv_len = 0;
+ if (qp->r_len != 0) {
+ u32 rkey = be32_to_cpu(reth->rkey);
+ u64 vaddr = be64_to_cpu(reth->vaddr);
+
+ /* Check rkey */
+ if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, qp->r_len,
+ vaddr, rkey,
+ IB_ACCESS_REMOTE_WRITE))) {
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ } else {
+ qp->r_sge.sg_list = NULL;
+ qp->r_sge.sge.mr = NULL;
+ qp->r_sge.sge.vaddr = NULL;
+ qp->r_sge.sge.length = 0;
+ qp->r_sge.sge.sge_length = 0;
+ }
+ if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_WRITE))) {
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ if (opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY)
+ goto rdma_last;
+ else if (opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE)
+ goto rdma_last_imm;
+ /* FALLTHROUGH */
+ case IB_OPCODE_UC_RDMA_WRITE_MIDDLE:
+ /* Check for invalid length PMTU or posted rwqe len. */
+ if (unlikely(tlen != (hdrsize + pmtu + 4))) {
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ qp->r_rcv_len += pmtu;
+ if (unlikely(qp->r_rcv_len > qp->r_len)) {
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ copy_sge(&qp->r_sge, data, pmtu);
+ break;
+
+ case IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE:
+ rdma_last_imm:
+ /* Get the number of bytes the message was padded by. */
+ pad = (ohdr->bth[0] >> 12) & 3;
+ /* Check for invalid length. */
+ /* XXX LAST len should be >= 1 */
+ if (unlikely(tlen < (hdrsize + pad + 4))) {
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ /* Don't count the CRC. */
+ tlen -= (hdrsize + pad + 4);
+ if (unlikely(tlen + qp->r_rcv_len != qp->r_len)) {
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ if (qp->r_reuse_sge) {
+ qp->r_reuse_sge = 0;
+ } else if (!get_rwqe(qp, 1)) {
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ if (has_grh) {
+ wc.imm_data = *(u32 *) data;
+ data += sizeof(u32);
+ } else {
+ /* Immediate data comes after BTH */
+ wc.imm_data = ohdr->u.imm_data;
+ }
+ hdrsize += 4;
+ wc.wc_flags = IB_WC_WITH_IMM;
+ wc.byte_len = 0;
+ goto last_imm;
+
+ case IB_OPCODE_UC_RDMA_WRITE_LAST:
+ rdma_last:
+ /* Get the number of bytes the message was padded by. */
+ pad = (ohdr->bth[0] >> 12) & 3;
+ /* Check for invalid length. */
+ /* XXX LAST len should be >= 1 */
+ if (unlikely(tlen < (hdrsize + pad + 4))) {
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ /* Don't count the CRC. */
+ tlen -= (hdrsize + pad + 4);
+ if (unlikely(tlen + qp->r_rcv_len != qp->r_len)) {
+ dev->n_pkt_drops++;
+ goto done;
+ }
+ copy_sge(&qp->r_sge, data, tlen);
+ break;
+
+ default:
+ /* Drop packet for unknown opcodes. */
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ dev->n_pkt_drops++;
+ return;
+ }
+ qp->r_psn++;
+ qp->r_state = opcode;
+done:
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+}
+
+/*
+ * Put this QP on the RNR timeout list for the device.
+ * XXX Use a simple list for now. We might need a priority
+ * queue if we have lots of QPs waiting for RNR timeouts
+ * but that should be rare.
+ */
+static void insert_rnr_queue(struct ipath_qp *qp)
+{
+ struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+ unsigned long flags;
+
+ spin_lock_irqsave(&dev->pending_lock, flags);
+ if (list_empty(&dev->rnrwait))
+ list_add(&qp->timerwait, &dev->rnrwait);
+ else {
+ struct list_head *l = &dev->rnrwait;
+ struct ipath_qp *nqp = list_entry(l->next, struct ipath_qp,
+ timerwait);
+
+ while (qp->s_rnr_timeout >= nqp->s_rnr_timeout) {
+ qp->s_rnr_timeout -= nqp->s_rnr_timeout;
+ l = l->next;
+ if (l->next == &dev->rnrwait)
+ break;
+ nqp = list_entry(l->next, struct ipath_qp, timerwait);
+ }
+ list_add(&qp->timerwait, l);
+ }
+ spin_unlock_irqrestore(&dev->pending_lock, flags);
+}
+
+/*
+ * This is called from do_uc_send() or do_rc_send() to forward a WQE addressed
+ * to the same HCA.
+ * Note that although we are single threaded due to the tasklet, we still
+ * have to protect against post_send(). We don't have to worry about
+ * receive interrupts since this is a connected protocol and all packets
+ * will pass through here.
+ */
+static void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc)
+{
+ struct ipath_ibdev *dev = to_idev(sqp->ibqp.device);
+ struct ipath_qp *qp;
+ struct ipath_swqe *wqe;
+ struct ipath_sge *sge;
+ unsigned long flags;
+ u64 sdata;
+
+ qp = ipath_lookup_qpn(&dev->qp_table, sqp->remote_qpn);
+ if (!qp) {
+ dev->n_pkt_drops++;
+ return;
+ }
+
+again:
+ spin_lock_irqsave(&sqp->s_lock, flags);
+
+ if (!(state_ops[sqp->state] & IPATH_PROCESS_SEND_OK)) {
+ spin_unlock_irqrestore(&sqp->s_lock, flags);
+ goto done;
+ }
+
+ /* Get the next send request. */
+ if (sqp->s_last == sqp->s_head) {
+ /* Send work queue is empty. */
+ spin_unlock_irqrestore(&sqp->s_lock, flags);
+ goto done;
+ }
+
+ /*
+ * We can rely on the entry not changing without the s_lock
+ * being held until we update s_last.
+ */
+ wqe = get_swqe_ptr(sqp, sqp->s_last);
+ spin_unlock_irqrestore(&sqp->s_lock, flags);
+
+ wc->wc_flags = 0;
+ wc->imm_data = 0;
+
+ sqp->s_sge.sge = wqe->sg_list[0];
+ sqp->s_sge.sg_list = wqe->sg_list + 1;
+ sqp->s_sge.num_sge = wqe->wr.num_sge;
+ sqp->s_len = wqe->length;
+ switch (wqe->wr.opcode) {
+ case IB_WR_SEND_WITH_IMM:
+ wc->wc_flags = IB_WC_WITH_IMM;
+ wc->imm_data = wqe->wr.imm_data;
+ /* FALLTHROUGH */
+ case IB_WR_SEND:
+ spin_lock_irqsave(&qp->r_rq.lock, flags);
+ if (!get_rwqe(qp, 0)) {
+ rnr_nak:
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ /* Handle RNR NAK */
+ if (qp->ibqp.qp_type == IB_QPT_UC)
+ goto send_comp;
+ if (sqp->s_rnr_retry == 0) {
+ wc->status = IB_WC_RNR_RETRY_EXC_ERR;
+ goto err;
+ }
+ if (sqp->s_rnr_retry_cnt < 7)
+ sqp->s_rnr_retry--;
+ dev->n_rnr_naks++;
+ sqp->s_rnr_timeout = rnr_table[sqp->s_min_rnr_timer];
+ insert_rnr_queue(sqp);
+ goto done;
+ }
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ break;
+
+ case IB_WR_RDMA_WRITE_WITH_IMM:
+ wc->wc_flags = IB_WC_WITH_IMM;
+ wc->imm_data = wqe->wr.imm_data;
+ spin_lock_irqsave(&qp->r_rq.lock, flags);
+ if (!get_rwqe(qp, 1))
+ goto rnr_nak;
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ /* FALLTHROUGH */
+ case IB_WR_RDMA_WRITE:
+ if (wqe->length == 0)
+ break;
+ if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, wqe->length,
+ wqe->wr.wr.rdma.remote_addr,
+ wqe->wr.wr.rdma.rkey,
+ IB_ACCESS_REMOTE_WRITE))) {
+ acc_err:
+ wc->status = IB_WC_REM_ACCESS_ERR;
+ err:
+ wc->wr_id = wqe->wr.wr_id;
+ wc->opcode = wc_opcode[wqe->wr.opcode];
+ wc->vendor_err = 0;
+ wc->byte_len = 0;
+ wc->qp_num = sqp->ibqp.qp_num;
+ wc->src_qp = sqp->remote_qpn;
+ wc->pkey_index = 0;
+ wc->slid = sqp->remote_ah_attr.dlid;
+ wc->sl = sqp->remote_ah_attr.sl;
+ wc->dlid_path_bits = 0;
+ wc->port_num = 0;
+ ipath_sqerror_qp(sqp, wc);
+ goto done;
+ }
+ break;
+
+ case IB_WR_RDMA_READ:
+ if (unlikely(!ipath_rkey_ok(dev, &sqp->s_sge, wqe->length,
+ wqe->wr.wr.rdma.remote_addr,
+ wqe->wr.wr.rdma.rkey,
+ IB_ACCESS_REMOTE_READ))) {
+ goto acc_err;
+ }
+ if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_READ)))
+ goto acc_err;
+ qp->r_sge.sge = wqe->sg_list[0];
+ qp->r_sge.sg_list = wqe->sg_list + 1;
+ qp->r_sge.num_sge = wqe->wr.num_sge;
+ break;
+
+ case IB_WR_ATOMIC_CMP_AND_SWP:
+ case IB_WR_ATOMIC_FETCH_AND_ADD:
+ if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, sizeof(u64),
+ wqe->wr.wr.rdma.remote_addr,
+ wqe->wr.wr.rdma.rkey,
+ IB_ACCESS_REMOTE_ATOMIC))) {
+ goto acc_err;
+ }
+ /* Perform atomic OP and save result. */
+ sdata = wqe->wr.wr.atomic.swap;
+ spin_lock_irqsave(&dev->pending_lock, flags);
+ qp->r_atomic_data = *(u64 *) qp->r_sge.sge.vaddr;
+ if (wqe->wr.opcode == IB_WR_ATOMIC_FETCH_AND_ADD) {
+ *(u64 *) qp->r_sge.sge.vaddr =
+ qp->r_atomic_data + sdata;
+ } else if (qp->r_atomic_data == wqe->wr.wr.atomic.compare_add) {
+ *(u64 *) qp->r_sge.sge.vaddr = sdata;
+ }
+ spin_unlock_irqrestore(&dev->pending_lock, flags);
+ *(u64 *) sqp->s_sge.sge.vaddr = qp->r_atomic_data;
+ goto send_comp;
+
+ default:
+ goto done;
+ }
+
+ sge = &sqp->s_sge.sge;
+ while (sqp->s_len) {
+ u32 len = sqp->s_len;
+
+ if (len > sge->length)
+ len = sge->length;
+ BUG_ON(len == 0);
+ copy_sge(&qp->r_sge, sge->vaddr, len);
+ sge->vaddr += len;
+ sge->length -= len;
+ sge->sge_length -= len;
+ if (sge->sge_length == 0) {
+ if (--sqp->s_sge.num_sge)
+ *sge = *sqp->s_sge.sg_list++;
+ } else if (sge->length == 0 && sge->mr != NULL) {
+ if (++sge->n >= IPATH_SEGSZ) {
+ if (++sge->m >= sge->mr->mapsz)
+ break;
+ sge->n = 0;
+ }
+ sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr;
+ sge->length = sge->mr->map[sge->m]->segs[sge->n].length;
+ }
+ sqp->s_len -= len;
+ }
+
+ if (wqe->wr.opcode == IB_WR_RDMA_WRITE ||
+ wqe->wr.opcode == IB_WR_RDMA_READ)
+ goto send_comp;
+
+ if (wqe->wr.opcode == IB_WR_RDMA_WRITE_WITH_IMM)
+ wc->opcode = IB_WC_RECV_RDMA_WITH_IMM;
+ else
+ wc->opcode = IB_WC_RECV;
+ wc->wr_id = qp->r_wr_id;
+ wc->status = IB_WC_SUCCESS;
+ wc->vendor_err = 0;
+ wc->byte_len = wqe->length;
+ wc->qp_num = qp->ibqp.qp_num;
+ wc->src_qp = qp->remote_qpn;
+ /* XXX do we know which pkey matched? Only needed for GSI. */
+ wc->pkey_index = 0;
+ wc->slid = qp->remote_ah_attr.dlid;
+ wc->sl = qp->remote_ah_attr.sl;
+ wc->dlid_path_bits = 0;
+ /* Signal completion event if the solicited bit is set. */
+ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc,
+ wqe->wr.send_flags & IB_SEND_SOLICITED);
+
+send_comp:
+ sqp->s_rnr_retry = sqp->s_rnr_retry_cnt;
+
+ if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &sqp->s_flags) ||
+ (wqe->wr.send_flags & IB_SEND_SIGNALED)) {
+ wc->wr_id = wqe->wr.wr_id;
+ wc->status = IB_WC_SUCCESS;
+ wc->opcode = wc_opcode[wqe->wr.opcode];
+ wc->vendor_err = 0;
+ wc->byte_len = wqe->length;
+ wc->qp_num = sqp->ibqp.qp_num;
+ wc->src_qp = 0;
+ wc->pkey_index = 0;
+ wc->slid = 0;
+ wc->sl = 0;
+ wc->dlid_path_bits = 0;
+ wc->port_num = 0;
+ ipath_cq_enter(to_icq(sqp->ibqp.send_cq), wc, 0);
+ }
+
+ /* Update s_last now that we are finished with the SWQE */
+ spin_lock_irqsave(&sqp->s_lock, flags);
+ if (++sqp->s_last >= sqp->s_size)
+ sqp->s_last = 0;
+ spin_unlock_irqrestore(&sqp->s_lock, flags);
+ goto again;
+
+done:
+ if (atomic_dec_and_test(&qp->refcount))
+ wake_up(&qp->wait);
+}
+
+/*
+ * Process the credit field of an incoming AETH and update the send credit limit.
+ * The QP s_lock should be held.
+ */
+static void ipath_get_credit(struct ipath_qp *qp, u32 aeth)
+{
+ u32 credit = (aeth >> 24) & 0x1F;
+
+ /*
+ * If credit == 0x1F, credit is invalid and we can send
+ * as many packets as we like. Otherwise, we have to
+ * honor the credit field.
+ */
+ if (credit == 0x1F) {
+ qp->s_lsn = (u32) -1;
+ } else if (qp->s_lsn != (u32) -1) {
+ /* Compute new LSN (i.e., MSN + credit) */
+ credit = (aeth + credit_table[credit]) & 0xFFFFFF;
+ if (cmp24(credit, qp->s_lsn) > 0)
+ qp->s_lsn = credit;
+ }
+
+ /* Restart sending if it was blocked due to lack of credits. */
+ if (qp->s_cur != qp->s_head &&
+ (qp->s_lsn == (u32) -1 ||
+ cmp24(get_swqe_ptr(qp, qp->s_cur)->ssn, qp->s_lsn + 1) <= 0)) {
+ tasklet_schedule(&qp->s_task);
+ }
+}
+
+/*
+ * This is called from ipath_rc_rcv() to process an incoming RC ACK
+ * for the given QP.
+ * Called at interrupt level with the QP s_lock held.
+ * Returns 1 if OK, 0 if current operation should be aborted (NAK).
+ */
+static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode)
+{
+ struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+ struct ib_wc wc;
+ struct ipath_swqe *wqe;
+
+ /*
+ * Remove the QP from the timeout queue (or RNR timeout queue).
+ * If ipath_ib_timer() has already removed it,
+ * it's OK since we hold the QP s_lock and ipath_restart_rc()
+ * just won't find anything to restart if we ACK everything.
+ */
+ spin_lock(&dev->pending_lock);
+ if (qp->timerwait.next != LIST_POISON1)
+ list_del(&qp->timerwait);
+ spin_unlock(&dev->pending_lock);
+
+ /*
+ * Note that NAKs implicitly ACK outstanding SEND and
+ * RDMA write requests and implicitly NAK RDMA read and
+ * atomic requests issued before the NAK'ed request.
+ * The MSN won't include the NAK'ed request but will include
+ * any ACK'ed requests.
+ */
+ wqe = get_swqe_ptr(qp, qp->s_last);
+
+ /* Nothing is pending to ACK/NAK. */
+ if (qp->s_last == qp->s_tail)
+ return 0;
+
+ /*
+ * The MSN might be for a later WQE than the PSN indicates so
+ * only complete WQEs that the PSN finishes.
+ */
+ while (cmp24(psn, wqe->lpsn) >= 0) {
+ /* If we are ACKing a WQE, the MSN should be >= the SSN. */
+ if (cmp24(aeth, wqe->ssn) < 0)
+ break;
+ /*
+ * If this request is a RDMA read or atomic, and the ACK is
+ * for a later operation, this ACK NAKs the RDMA read or atomic.
+ * In other words, only a RDMA_READ_LAST or ONLY can ACK
+ * a RDMA read and likewise for atomic ops.
+ * Note that the NAK case can only happen if relaxed ordering
+ * is used and requests are sent after an RDMA read
+ * or atomic is sent but before the response is received.
+ */
+ if ((wqe->wr.opcode == IB_WR_RDMA_READ &&
+ opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST) ||
+ ((wqe->wr.opcode == IB_WR_ATOMIC_CMP_AND_SWP ||
+ wqe->wr.opcode == IB_WR_ATOMIC_FETCH_AND_ADD) &&
+ (opcode != IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE ||
+ cmp24(wqe->psn, psn) != 0))) {
+ /* The last valid PSN seen is the previous request's. */
+ qp->s_last_psn = wqe->psn - 1;
+ /* Retry this request. */
+ ipath_restart_rc(qp, wqe->psn, &wc);
+ /*
+ * No need to process the ACK/NAK since we are
+ * restarting an earlier request.
+ */
+ return 0;
+ }
+ /* Post a send completion queue entry if requested. */
+ if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) ||
+ (wqe->wr.send_flags & IB_SEND_SIGNALED)) {
+ wc.wr_id = wqe->wr.wr_id;
+ wc.status = IB_WC_SUCCESS;
+ wc.opcode = wc_opcode[wqe->wr.opcode];
+ wc.vendor_err = 0;
+ wc.byte_len = wqe->length;
+ wc.qp_num = qp->ibqp.qp_num;
+ wc.src_qp = qp->remote_qpn;
+ wc.pkey_index = 0;
+ wc.slid = qp->remote_ah_attr.dlid;
+ wc.sl = qp->remote_ah_attr.sl;
+ wc.dlid_path_bits = 0;
+ wc.port_num = 0;
+ ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0);
+ }
+ qp->s_retry = qp->s_retry_cnt;
+ /*
+ * If we are completing a request which is in the process
+ * of being resent, we can stop resending it since we know
+ * the responder has already seen it.
+ */
+ if (qp->s_last == qp->s_cur) {
+ if (++qp->s_cur >= qp->s_size)
+ qp->s_cur = 0;
+ wqe = get_swqe_ptr(qp, qp->s_cur);
+ qp->s_state = IB_OPCODE_RC_SEND_LAST;
+ qp->s_psn = wqe->psn;
+ }
+ if (++qp->s_last >= qp->s_size)
+ qp->s_last = 0;
+ wqe = get_swqe_ptr(qp, qp->s_last);
+ if (qp->s_last == qp->s_tail)
+ break;
+ }
+
+ switch (aeth >> 29) {
+ case 0: /* ACK */
+ dev->n_rc_acks++;
+ /* If this is a partial ACK, reset the retransmit timer. */
+ if (qp->s_last != qp->s_tail) {
+ spin_lock(&dev->pending_lock);
+ list_add_tail(&qp->timerwait,
+ &dev->pending[dev->pending_index]);
+ spin_unlock(&dev->pending_lock);
+ }
+ ipath_get_credit(qp, aeth);
+ qp->s_rnr_retry = qp->s_rnr_retry_cnt;
+ qp->s_retry = qp->s_retry_cnt;
+ qp->s_last_psn = psn;
+ return 1;
+
+ case 1: /* RNR NAK */
+ dev->n_rnr_naks++;
+ if (qp->s_rnr_retry == 0) {
+ if (qp->s_last == qp->s_tail)
+ return 0;
+
+ wc.status = IB_WC_RNR_RETRY_EXC_ERR;
+ goto class_b;
+ }
+ if (qp->s_rnr_retry_cnt < 7)
+ qp->s_rnr_retry--;
+ if (qp->s_last == qp->s_tail)
+ return 0;
+
+ /* The last valid PSN seen is the previous request's. */
+ qp->s_last_psn = wqe->psn - 1;
+
+ /* Restart this request after the RNR timeout. */
+ wqe = get_swqe_ptr(qp, qp->s_last);
+
+ dev->n_rc_resends += (int)qp->s_psn - (int)psn;
+
+ /*
+ * If we are starting the request from the beginning, let the
+ * normal send code handle initialization.
+ */
+ qp->s_cur = qp->s_last;
+ if (cmp24(psn, wqe->psn) <= 0) {
+ qp->s_state = IB_OPCODE_RC_SEND_LAST;
+ qp->s_psn = wqe->psn;
+ } else {
+ u32 n;
+
+ n = qp->s_cur;
+ for (;;) {
+ if (++n == qp->s_size)
+ n = 0;
+ if (n == qp->s_tail) {
+ if (cmp24(psn, qp->s_next_psn) >= 0) {
+ qp->s_cur = n;
+ wqe = get_swqe_ptr(qp, n);
+ }
+ break;
+ }
+ wqe = get_swqe_ptr(qp, n);
+ if (cmp24(psn, wqe->psn) < 0)
+ break;
+ qp->s_cur = n;
+ }
+ qp->s_psn = psn;
+
+ /*
+ * Set the state to restart in the middle of a request.
+ * Don't change the s_sge, s_cur_sge, or s_cur_size.
+ * See do_rc_send().
+ */
+ switch (wqe->wr.opcode) {
+ case IB_WR_SEND:
+ case IB_WR_SEND_WITH_IMM:
+ qp->s_state =
+ IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST;
+ break;
+
+ case IB_WR_RDMA_WRITE:
+ case IB_WR_RDMA_WRITE_WITH_IMM:
+ qp->s_state =
+ IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST;
+ break;
+
+ case IB_WR_RDMA_READ:
+ qp->s_state =
+ IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE;
+ break;
+
+ default:
+ /*
+ * This case shouldn't happen since there is only
+ * one PSN per request.
+ */
+ qp->s_state = IB_OPCODE_RC_SEND_LAST;
+ }
+ }
+
+ qp->s_rnr_timeout = rnr_table[(aeth >> 24) & 0x1F];
+ insert_rnr_queue(qp);
+ return 0;
+
+ case 3: /* NAK */
+ /* The last valid PSN seen is the previous request's. */
+ if (qp->s_last != qp->s_tail)
+ qp->s_last_psn = wqe->psn - 1;
+ switch ((aeth >> 24) & 0x1F) {
+ case 0: /* PSN sequence error */
+ dev->n_seq_naks++;
+ /*
+ * Back up to the responder's expected PSN.
+ * XXX Note that we might get a NAK in the
+ * middle of an RDMA READ response which
+ * terminates the RDMA READ.
+ */
+ if (qp->s_last == qp->s_tail)
+ break;
+
+ if (cmp24(psn, wqe->psn) < 0) {
+ break;
+ }
+ /* Retry the request. */
+ ipath_restart_rc(qp, psn, &wc);
+ break;
+
+ case 1: /* Invalid Request */
+ wc.status = IB_WC_REM_INV_REQ_ERR;
+ dev->n_other_naks++;
+ goto class_b;
+
+ case 2: /* Remote Access Error */
+ wc.status = IB_WC_REM_ACCESS_ERR;
+ dev->n_other_naks++;
+ goto class_b;
+
+ case 3: /* Remote Operation Error */
+ wc.status = IB_WC_REM_OP_ERR;
+ dev->n_other_naks++;
+ class_b:
+ wc.wr_id = wqe->wr.wr_id;
+ wc.opcode = wc_opcode[wqe->wr.opcode];
+ wc.vendor_err = 0;
+ wc.byte_len = 0;
+ wc.qp_num = qp->ibqp.qp_num;
+ wc.src_qp = qp->remote_qpn;
+ wc.pkey_index = 0;
+ wc.slid = qp->remote_ah_attr.dlid;
+ wc.sl = qp->remote_ah_attr.sl;
+ wc.dlid_path_bits = 0;
+ wc.port_num = 0;
+ ipath_sqerror_qp(qp, &wc);
+ break;
+
+ default:
+ /* Ignore other reserved NAK error codes */
+ goto reserved;
+ }
+ qp->s_rnr_retry = qp->s_rnr_retry_cnt;
+ return 0;
+
+ default: /* 2: reserved */
+ reserved:
+ /* Ignore reserved NAK codes. */
+ return 0;
+ }
+}
+
+/*
+ * This is called from ipath_qp_rcv() to process an incoming RC packet
+ * for the given QP.
+ * Called at interrupt level.
+ */
+static void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
+ int has_grh, void *data, u32 tlen, struct ipath_qp *qp)
+{
+ struct ipath_other_headers *ohdr;
+ int opcode;
+ u32 hdrsize;
+ u32 psn;
+ u32 pad;
+ unsigned long flags;
+ struct ib_wc wc;
+ u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu);
+ int diff;
+ struct ib_reth *reth;
+
+ /* Check for GRH */
+ if (!has_grh) {
+ ohdr = &hdr->u.oth;
+ hdrsize = 8 + 12; /* LRH + BTH */
+ psn = be32_to_cpu(ohdr->bth[2]);
+ } else {
+ ohdr = &hdr->u.l.oth;
+ hdrsize = 8 + 40 + 12; /* LRH + GRH + BTH */
+ /*
+ * The header with GRH is 60 bytes and the
+ * core driver sets the eager header buffer
+ * size to 56 bytes so the last 4 bytes of
+ * the BTH header (PSN) are in the data buffer.
+ */
+ psn = be32_to_cpu(((u32 *) data)[0]);
+ data += sizeof(u32);
+ }
+ /*
+ * The opcode is in the low byte when it's in network order
+ * (top byte when in host order).
+ */
+ opcode = *(u8 *) (&ohdr->bth[0]);
+
+ /*
+ * Process responses (ACKs) before anything else.
+ * Note that the packet sequence number will be for something
+ * in the send work queue rather than the expected receive
+ * packet sequence number. In other words, this QP is the
+ * requester.
+ */
+ if (opcode >= IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST &&
+ opcode <= IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE) {
+
+ spin_lock_irqsave(&qp->s_lock, flags);
+
+ /* Ignore invalid responses. */
+ if (cmp24(psn, qp->s_next_psn) >= 0) {
+ goto ack_done;
+ }
+
+ /* Ignore duplicate responses. */
+ diff = cmp24(psn, qp->s_last_psn);
+ if (unlikely(diff <= 0)) {
+ /* Update credits for "ghost" ACKs */
+ if (diff == 0 && opcode == IB_OPCODE_RC_ACKNOWLEDGE) {
+ if (!has_grh) {
+ pad = be32_to_cpu(ohdr->u.aeth);
+ } else {
+ pad = be32_to_cpu(((u32 *) data)[0]);
+ data += sizeof(u32);
+ }
+ if ((pad >> 29) == 0) {
+ ipath_get_credit(qp, pad);
+ }
+ }
+ goto ack_done;
+ }
+
+ switch (opcode) {
+ case IB_OPCODE_RC_ACKNOWLEDGE:
+ case IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE:
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST:
+ if (!has_grh) {
+ pad = be32_to_cpu(ohdr->u.aeth);
+ } else {
+ pad = be32_to_cpu(((u32 *) data)[0]);
+ data += sizeof(u32);
+ }
+ if (opcode == IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE) {
+ *(u64 *) qp->s_sge.sge.vaddr = *(u64 *) data;
+ }
+ if (!do_rc_ack(qp, pad, psn, opcode) ||
+ opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST) {
+ goto ack_done;
+ }
+ hdrsize += 4;
+ /*
+ * do_rc_ack() has already checked the PSN so skip
+ * the sequence check.
+ */
+ goto rdma_read;
+
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
+ /* no AETH, no ACK */
+ if (unlikely(cmp24(psn, qp->s_last_psn + 1) != 0)) {
+ dev->n_rdma_seq++;
+ ipath_restart_rc(qp, qp->s_last_psn + 1, &wc);
+ goto ack_done;
+ }
+ rdma_read:
+ if (unlikely(qp->s_state !=
+ IB_OPCODE_RC_RDMA_READ_REQUEST))
+ goto ack_done;
+ if (unlikely(tlen != (hdrsize + pmtu + 4)))
+ goto ack_done;
+ if (unlikely(pmtu >= qp->s_len))
+ goto ack_done;
+ /* We got a response so update the timeout. */
+ if (unlikely(qp->s_last == qp->s_tail ||
+ get_swqe_ptr(qp, qp->s_last)->wr.opcode !=
+ IB_WR_RDMA_READ))
+ goto ack_done;
+ spin_lock(&dev->pending_lock);
+ if (qp->s_rnr_timeout == 0 &&
+ qp->timerwait.next != LIST_POISON1) {
+ list_move_tail(&qp->timerwait,
+ &dev->pending[dev->
+ pending_index]);
+ }
+ spin_unlock(&dev->pending_lock);
+ /*
+ * Update the RDMA receive state but do the copy w/o
+ * holding the locks and blocking interrupts.
+ * XXX Yet another place that affects relaxed
+ * RDMA order since we don't want s_sge modified.
+ */
+ qp->s_len -= pmtu;
+ qp->s_last_psn = psn;
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+ copy_sge(&qp->s_sge, data, pmtu);
+ return;
+
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST:
+ /* ACKs READ req. */
+ if (unlikely(cmp24(psn, qp->s_last_psn + 1) != 0)) {
+ dev->n_rdma_seq++;
+ ipath_restart_rc(qp, qp->s_last_psn + 1, &wc);
+ goto ack_done;
+ }
+ /* FALLTHROUGH */
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY:
+ if (unlikely(qp->s_state !=
+ IB_OPCODE_RC_RDMA_READ_REQUEST)) {
+ goto ack_done;
+ }
+ /* Get the number of bytes the message was padded by. */
+ pad = (ohdr->bth[0] >> 12) & 3;
+ /*
+ * Check that the data size is >= 1 && <= pmtu.
+ * Remember to account for the AETH header (4)
+ * and ICRC (4).
+ */
+ if (unlikely(tlen <= (hdrsize + pad + 8))) {
+ /* XXX Need to generate an error CQ entry. */
+ goto ack_done;
+ }
+ tlen -= hdrsize + pad + 8;
+ if (unlikely(tlen != qp->s_len)) {
+ /* XXX Need to generate an error CQ entry. */
+ goto ack_done;
+ }
+ if (!has_grh) {
+ pad = be32_to_cpu(ohdr->u.aeth);
+ } else {
+ pad = be32_to_cpu(((u32 *) data)[0]);
+ data += sizeof(u32);
+ }
+ copy_sge(&qp->s_sge, data, tlen);
+ if (do_rc_ack(qp, pad, psn,
+ IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST)) {
+ /*
+ * Change the state so we continue
+ * processing new requests.
+ */
+ qp->s_state = IB_OPCODE_RC_SEND_LAST;
+ }
+ goto ack_done;
+ }
+ ack_done:
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+ return;
+ }
+
+ spin_lock_irqsave(&qp->r_rq.lock, flags);
+
+ /* Compute 24 bits worth of difference. */
+ diff = cmp24(psn, qp->r_psn);
+ if (unlikely(diff)) {
+ if (diff > 0) {
+ /*
+ * Packet sequence error.
+ * A NAK will ACK earlier sends and RDMA writes.
+ * Don't queue the NAK if a RDMA read, atomic, or
+ * NAK is pending though.
+ */
+ spin_lock(&qp->s_lock);
+ if ((qp->s_ack_state >=
+ IB_OPCODE_RC_RDMA_READ_REQUEST &&
+ qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) ||
+ qp->s_nak_state != 0) {
+ spin_unlock(&qp->s_lock);
+ goto done;
+ }
+ qp->s_ack_state = IB_OPCODE_RC_SEND_ONLY;
+ qp->s_nak_state = IB_NAK_PSN_ERROR;
+ /* Use the expected PSN. */
+ qp->s_ack_psn = qp->r_psn;
+ goto resched;
+ }
+
+ /*
+ * Handle a duplicate request.
+ * Don't re-execute SEND, RDMA write or atomic op.
+ * Don't NAK errors, just silently drop the duplicate request.
+ * Note that r_sge, r_len, and r_rcv_len may be
+ * in use so don't modify them.
+ *
+ * We are supposed to ACK the earliest duplicate PSN
+ * but we can coalesce an outstanding duplicate ACK.
+ * We have to send the earliest so that RDMA reads
+ * can be restarted at the requester's expected PSN.
+ */
+ spin_lock(&qp->s_lock);
+ if (qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE &&
+ cmp24(psn, qp->s_ack_psn) >= 0) {
+ if (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST)
+ qp->s_ack_psn = psn;
+ spin_unlock(&qp->s_lock);
+ goto done;
+ }
+ switch (opcode) {
+ case IB_OPCODE_RC_RDMA_READ_REQUEST:
+ /*
+ * We have to be careful to not change s_rdma_sge
+ * while do_rc_send() is using it and not holding
+ * the s_lock.
+ */
+ if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE &&
+ qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) {
+ spin_unlock(&qp->s_lock);
+ dev->n_rdma_dup_busy++;
+ goto done;
+ }
+ /* RETH comes after BTH */
+ if (!has_grh)
+ reth = &ohdr->u.rc.reth;
+ else {
+ reth = (struct ib_reth *)data;
+ data += sizeof(*reth);
+ }
+ qp->s_rdma_len = be32_to_cpu(reth->length);
+ if (qp->s_rdma_len != 0) {
+ u32 rkey = be32_to_cpu(reth->rkey);
+ u64 vaddr = be64_to_cpu(reth->vaddr);
+
+ /*
+ * Address range must be a subset of the
+ * original request and start on pmtu
+ * boundaries.
+ */
+ if (unlikely(!ipath_rkey_ok(dev,
+ &qp->s_rdma_sge,
+ qp->s_rdma_len,
+ vaddr, rkey,
+ IB_ACCESS_REMOTE_READ)))
+ {
+ goto done;
+ }
+ } else {
+ qp->s_rdma_sge.sg_list = NULL;
+ qp->s_rdma_sge.num_sge = 0;
+ qp->s_rdma_sge.sge.mr = NULL;
+ qp->s_rdma_sge.sge.vaddr = NULL;
+ qp->s_rdma_sge.sge.length = 0;
+ qp->s_rdma_sge.sge.sge_length = 0;
+ }
+ break;
+
+ case IB_OPCODE_RC_COMPARE_SWAP:
+ case IB_OPCODE_RC_FETCH_ADD:
+ /*
+ * Check for the PSN of the last atomic operation
+ * performed and resend the result if found.
+ */
+ if ((psn & 0xFFFFFF) != qp->r_atomic_psn) {
+ spin_unlock(&qp->s_lock);
+ goto done;
+ }
+ qp->s_ack_atomic = qp->r_atomic_data;
+ break;
+ }
+ qp->s_ack_state = opcode;
+ qp->s_nak_state = 0;
+ qp->s_ack_psn = psn;
+ goto resched;
+ }
+
+ /* Check for opcode sequence errors. */
+ switch (qp->r_state) {
+ case IB_OPCODE_RC_SEND_FIRST:
+ case IB_OPCODE_RC_SEND_MIDDLE:
+ if (opcode == IB_OPCODE_RC_SEND_MIDDLE ||
+ opcode == IB_OPCODE_RC_SEND_LAST ||
+ opcode == IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE)
+ break;
+ nack_inv:
+ /*
+ * A NAK will ACK earlier sends and RDMA writes.
+ * Don't queue the NAK if a RDMA read, atomic, or
+ * NAK is pending though.
+ */
+ spin_lock(&qp->s_lock);
+ if (qp->s_ack_state >= IB_OPCODE_RC_RDMA_READ_REQUEST &&
+ qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) {
+ spin_unlock(&qp->s_lock);
+ goto done;
+ }
+ /* XXX Flush WQEs */
+ qp->state = IB_QPS_ERR;
+ qp->s_ack_state = IB_OPCODE_RC_SEND_ONLY;
+ qp->s_nak_state = IB_NAK_INVALID_REQUEST;
+ qp->s_ack_psn = qp->r_psn;
+ goto resched;
+
+ case IB_OPCODE_RC_RDMA_WRITE_FIRST:
+ case IB_OPCODE_RC_RDMA_WRITE_MIDDLE:
+ if (opcode == IB_OPCODE_RC_RDMA_WRITE_MIDDLE ||
+ opcode == IB_OPCODE_RC_RDMA_WRITE_LAST ||
+ opcode == IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE)
+ break;
+ goto nack_inv;
+
+ case IB_OPCODE_RC_RDMA_READ_REQUEST:
+ case IB_OPCODE_RC_COMPARE_SWAP:
+ case IB_OPCODE_RC_FETCH_ADD:
+ /*
+ * Drop all new requests until a response has been sent.
+ * A new request then ACKs the RDMA response we sent.
+ * Relaxed ordering would allow new requests to be
+ * processed but we would need to keep a queue
+ * of rwqe's for all that are in progress.
+ * Note that we can't RNR NAK this request since the RDMA
+ * READ or atomic response is already queued to be sent
+ * (unless we implement a response send queue).
+ */
+ goto done;
+
+ default:
+ if (opcode == IB_OPCODE_RC_SEND_MIDDLE ||
+ opcode == IB_OPCODE_RC_SEND_LAST ||
+ opcode == IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE ||
+ opcode == IB_OPCODE_RC_RDMA_WRITE_MIDDLE ||
+ opcode == IB_OPCODE_RC_RDMA_WRITE_LAST ||
+ opcode == IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE)
+ goto nack_inv;
+ break;
+ }
+
+ wc.imm_data = 0;
+ wc.wc_flags = 0;
+
+ /* OK, process the packet. */
+ switch (opcode) {
+ case IB_OPCODE_RC_SEND_FIRST:
+ if (!get_rwqe(qp, 0)) {
+ rnr_nak:
+ /*
+ * A RNR NAK will ACK earlier sends and RDMA writes.
+ * Don't queue the NAK if a RDMA read or atomic
+ * is pending though.
+ */
+ spin_lock(&qp->s_lock);
+ if (qp->s_ack_state >= IB_OPCODE_RC_RDMA_READ_REQUEST &&
+ qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) {
+ spin_unlock(&qp->s_lock);
+ goto done;
+ }
+ qp->s_ack_state = IB_OPCODE_RC_SEND_ONLY;
+ qp->s_nak_state = IB_RNR_NAK | qp->s_min_rnr_timer;
+ qp->s_ack_psn = qp->r_psn;
+ goto resched;
+ }
+ qp->r_rcv_len = 0;
+ /* FALLTHROUGH */
+ case IB_OPCODE_RC_SEND_MIDDLE:
+ case IB_OPCODE_RC_RDMA_WRITE_MIDDLE:
+ send_middle:
+ /* Check for invalid length PMTU or posted rwqe len. */
+ if (unlikely(tlen != (hdrsize + pmtu + 4))) {
+ goto nack_inv;
+ }
+ qp->r_rcv_len += pmtu;
+ if (unlikely(qp->r_rcv_len > qp->r_len)) {
+ goto nack_inv;
+ }
+ copy_sge(&qp->r_sge, data, pmtu);
+ break;
+
+ case IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE:
+ /* consume RWQE */
+ if (!get_rwqe(qp, 1))
+ goto rnr_nak;
+ goto send_last_imm;
+
+ case IB_OPCODE_RC_SEND_ONLY:
+ case IB_OPCODE_RC_SEND_ONLY_WITH_IMMEDIATE:
+ if (!get_rwqe(qp, 0))
+ goto rnr_nak;
+ qp->r_rcv_len = 0;
+ if (opcode == IB_OPCODE_RC_SEND_ONLY)
+ goto send_last;
+ /* FALLTHROUGH */
+ case IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE:
+ send_last_imm:
+ if (has_grh) {
+ wc.imm_data = *(u32 *) data;
+ data += sizeof(u32);
+ } else {
+ /* Immediate data comes after BTH */
+ wc.imm_data = ohdr->u.imm_data;
+ }
+ hdrsize += 4;
+ wc.wc_flags = IB_WC_WITH_IMM;
+ /* FALLTHROUGH */
+ case IB_OPCODE_RC_SEND_LAST:
+ case IB_OPCODE_RC_RDMA_WRITE_LAST:
+ send_last:
+ /* Get the number of bytes the message was padded by. */
+ pad = (ohdr->bth[0] >> 12) & 3;
+ /* Check for invalid length. */
+ /* XXX LAST len should be >= 1 */
+ if (unlikely(tlen < (hdrsize + pad + 4))) {
+ goto nack_inv;
+ }
+ /* Don't count the CRC. */
+ tlen -= (hdrsize + pad + 4);
+ wc.byte_len = tlen + qp->r_rcv_len;
+ if (unlikely(wc.byte_len > qp->r_len)) {
+ goto nack_inv;
+ }
+ /* XXX Need to free SGEs */
+ copy_sge(&qp->r_sge, data, tlen);
+ atomic_inc(&qp->msn);
+ if (opcode == IB_OPCODE_RC_RDMA_WRITE_LAST ||
+ opcode == IB_OPCODE_RC_RDMA_WRITE_ONLY)
+ break;
+ wc.wr_id = qp->r_wr_id;
+ wc.status = IB_WC_SUCCESS;
+ wc.opcode = IB_WC_RECV;
+ wc.vendor_err = 0;
+ wc.qp_num = qp->ibqp.qp_num;
+ wc.src_qp = qp->remote_qpn;
+ wc.pkey_index = 0;
+ wc.slid = qp->remote_ah_attr.dlid;
+ wc.sl = qp->remote_ah_attr.sl;
+ wc.dlid_path_bits = 0;
+ wc.port_num = 0;
+ /* Signal completion event if the solicited bit is set. */
+ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc,
+ ohdr->bth[0] & __constant_cpu_to_be32(1 << 23));
+ break;
+
+ case IB_OPCODE_RC_RDMA_WRITE_FIRST:
+ case IB_OPCODE_RC_RDMA_WRITE_ONLY:
+ case IB_OPCODE_RC_RDMA_WRITE_ONLY_WITH_IMMEDIATE:
+ /* consume RWQE */
+ /* RETH comes after BTH */
+ if (!has_grh)
+ reth = &ohdr->u.rc.reth;
+ else {
+ reth = (struct ib_reth *)data;
+ data += sizeof(*reth);
+ }
+ hdrsize += sizeof(*reth);
+ qp->r_len = be32_to_cpu(reth->length);
+ qp->r_rcv_len = 0;
+ if (qp->r_len != 0) {
+ u32 rkey = be32_to_cpu(reth->rkey);
+ u64 vaddr = be64_to_cpu(reth->vaddr);
+
+ /* Check rkey & NAK */
+ if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, qp->r_len,
+ vaddr, rkey,
+ IB_ACCESS_REMOTE_WRITE))) {
+ nack_acc:
+ /*
+ * A NAK will ACK earlier sends and RDMA
+ * writes.
+ * Don't queue the NAK if a RDMA read,
+ * atomic, or NAK is pending though.
+ */
+ spin_lock(&qp->s_lock);
+ if (qp->s_ack_state >=
+ IB_OPCODE_RC_RDMA_READ_REQUEST &&
+ qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) {
+ spin_unlock(&qp->s_lock);
+ goto done;
+ }
+ /* XXX Flush WQEs */
+ qp->state = IB_QPS_ERR;
+ qp->s_ack_state = IB_OPCODE_RC_RDMA_WRITE_ONLY;
+ qp->s_nak_state = IB_NAK_REMOTE_ACCESS_ERROR;
+ qp->s_ack_psn = qp->r_psn;
+ goto resched;
+ }
+ } else {
+ qp->r_sge.sg_list = NULL;
+ qp->r_sge.sge.mr = NULL;
+ qp->r_sge.sge.vaddr = NULL;
+ qp->r_sge.sge.length = 0;
+ qp->r_sge.sge.sge_length = 0;
+ }
+ if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_WRITE)))
+ goto nack_acc;
+ if (opcode == IB_OPCODE_RC_RDMA_WRITE_FIRST)
+ goto send_middle;
+ else if (opcode == IB_OPCODE_RC_RDMA_WRITE_ONLY)
+ goto send_last;
+ if (!get_rwqe(qp, 1))
+ goto rnr_nak;
+ goto send_last_imm;
+
+ case IB_OPCODE_RC_RDMA_READ_REQUEST:
+ /* RETH comes after BTH */
+ if (!has_grh)
+ reth = &ohdr->u.rc.reth;
+ else {
+ reth = (struct ib_reth *)data;
+ data += sizeof(*reth);
+ }
+ spin_lock(&qp->s_lock);
+ if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE &&
+ qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) {
+ spin_unlock(&qp->s_lock);
+ goto done;
+ }
+ qp->s_rdma_len = be32_to_cpu(reth->length);
+ if (qp->s_rdma_len != 0) {
+ u32 rkey = be32_to_cpu(reth->rkey);
+ u64 vaddr = be64_to_cpu(reth->vaddr);
+
+ /* Check rkey & NAK */
+ if (unlikely(!ipath_rkey_ok(dev, &qp->s_rdma_sge,
+ qp->s_rdma_len,
+ vaddr, rkey,
+ IB_ACCESS_REMOTE_READ))) {
+ spin_unlock(&qp->s_lock);
+ goto nack_acc;
+ }
+ /*
+ * Update the next expected PSN.
+ * We add 1 later below, so only add the remainder here.
+ */
+ if (qp->s_rdma_len > pmtu)
+ qp->r_psn += (qp->s_rdma_len - 1) / pmtu;
+ } else {
+ qp->s_rdma_sge.sg_list = NULL;
+ qp->s_rdma_sge.num_sge = 0;
+ qp->s_rdma_sge.sge.mr = NULL;
+ qp->s_rdma_sge.sge.vaddr = NULL;
+ qp->s_rdma_sge.sge.length = 0;
+ qp->s_rdma_sge.sge.sge_length = 0;
+ }
+ if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_READ)))
+ goto nack_acc;
+ /*
+ * We need to increment the MSN here instead of when we
+ * finish sending the result since a duplicate request would
+ * increment it more than once.
+ */
+ atomic_inc(&qp->msn);
+ qp->s_ack_state = opcode;
+ qp->s_nak_state = 0;
+ qp->s_ack_psn = psn;
+ qp->r_psn++;
+ qp->r_state = opcode;
+ goto rdmadone;
+
+ case IB_OPCODE_RC_COMPARE_SWAP:
+ case IB_OPCODE_RC_FETCH_ADD:{
+ struct ib_atomic_eth *ateth;
+ u64 vaddr;
+ u64 sdata;
+ u32 rkey;
+
+ if (!has_grh)
+ ateth = &ohdr->u.atomic_eth;
+ else {
+ ateth = (struct ib_atomic_eth *)data;
+ data += sizeof(*ateth);
+ }
+ vaddr = be64_to_cpu(ateth->vaddr);
+ if (unlikely(vaddr & 0x7))
+ goto nack_inv;
+ rkey = be32_to_cpu(ateth->rkey);
+ /* Check rkey & NAK */
+ if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge,
+ sizeof(u64), vaddr, rkey,
+ IB_ACCESS_REMOTE_ATOMIC))) {
+ goto nack_acc;
+ }
+ if (unlikely(!(qp->qp_access_flags &
+ IB_ACCESS_REMOTE_ATOMIC)))
+ goto nack_acc;
+ /* Perform atomic OP and save result. */
+ sdata = be64_to_cpu(ateth->swap_data);
+ spin_lock(&dev->pending_lock);
+ qp->r_atomic_data = *(u64 *) qp->r_sge.sge.vaddr;
+ if (opcode == IB_OPCODE_RC_FETCH_ADD) {
+ *(u64 *) qp->r_sge.sge.vaddr =
+ qp->r_atomic_data + sdata;
+ } else if (qp->r_atomic_data ==
+ be64_to_cpu(ateth->compare_data)) {
+ *(u64 *) qp->r_sge.sge.vaddr = sdata;
+ }
+ spin_unlock(&dev->pending_lock);
+ atomic_inc(&qp->msn);
+ qp->r_atomic_psn = psn & 0xFFFFFF;
+ psn |= 1 << 31;
+ break;
+ }
+
+ default:
+ /* Drop packet for unknown opcodes. */
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ return;
+ }
+ qp->r_psn++;
+ qp->r_state = opcode;
+ /* Send an ACK if requested or required. */
+ if (psn & (1 << 31)) {
+ /*
+ * Coalesce ACKs unless there is a RDMA READ or
+ * ATOMIC pending.
+ */
+ spin_lock(&qp->s_lock);
+ if (qp->s_ack_state == IB_OPCODE_RC_ACKNOWLEDGE ||
+ qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) {
+ qp->s_ack_state = opcode;
+ qp->s_nak_state = 0;
+ qp->s_ack_psn = psn;
+ qp->s_ack_atomic = qp->r_atomic_data;
+ goto resched;
+ }
+ spin_unlock(&qp->s_lock);
+ }
+done:
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ return;
+
+resched:
+ /* Try to send ACK right away but not if do_rc_send() is active. */
+ if (qp->s_hdrwords == 0 &&
+ (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST ||
+ qp->s_ack_state >= IB_OPCODE_COMPARE_SWAP))
+ send_rc_ack(qp);
+
+rdmadone:
+ spin_unlock(&qp->s_lock);
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+
+ /* Call do_rc_send() in another thread. */
+ tasklet_schedule(&qp->s_task);
+}
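
A note on the PSN arithmetic in the RDMA READ request case above: a read response of length len occupies roughly len/pmtu packets, so the receiver advances r_psn by (len - 1) / pmtu when it queues the request and by one more when the request is retired. A small standalone sketch of that calculation (the values and names here are illustrative only, not part of the driver):

#include <stdio.h>

/*
 * Illustrative only: how many response packets an RDMA READ of 'len'
 * bytes takes at path MTU 'pmtu', and how the expected PSN advances
 * (the driver adds the final +1 separately from the remainder).
 */
static unsigned int read_response_packets(unsigned int len, unsigned int pmtu)
{
	return len ? (len + pmtu - 1) / pmtu : 1;
}

int main(void)
{
	unsigned int len = 10000, pmtu = 4096, psn = 100;

	if (len > pmtu)
		psn += (len - 1) / pmtu;	/* remainder, as in the code above */
	psn++;					/* the +1 added later */
	printf("%u response packets, next expected PSN is %u\n",
	       read_response_packets(len, pmtu), psn);
	return 0;
}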

2005-12-29 00:39:41

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 19 of 20] ipath - kbuild infrastructure

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r e7cabc7a2e78 -r 07bf9f34e221 drivers/infiniband/hw/ipath/Kconfig
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/Kconfig Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,18 @@
+config IPATH_CORE
+ tristate "PathScale InfiniPath Driver"
+ depends on PCI_MSI
+ ---help---
+ This is a low-level driver for PathScale InfiniPath host
+ channel adapters (HCAs) based on the HT-400 chip, including the
+ InfiniPath HT-460, the small form factor InfiniPath HT-460,
+ the InfiniPath HT-470 and the Linux Networx LS/X.
+
+config INFINIBAND_IPATH
+ tristate "PathScale InfiniPath Verbs Driver"
+ depends on IPATH_CORE && INFINIBAND
+ ---help---
+ This is a driver that provides InfiniBand verbs support for
+ PathScale InfiniPath host channel adapters (HCAs). This
+ allows these devices to be used with both kernel upper level
+ protocols such as IP-over-InfiniBand as well as with userspace
+ applications (in conjunction with InfiniBand userspace access).
diff -r e7cabc7a2e78 -r 07bf9f34e221 drivers/infiniband/hw/ipath/Makefile
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/Makefile Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,7 @@
+obj-$(CONFIG_IPATH_CORE) += ipath_core.o
+obj-$(CONFIG_INFINIBAND_IPATH) += ib_ipath.o
+
+ipath_core-y := ipath_copy.o ipath_driver.o ipath_ht400.o ipath_i2c.o \
+ ipath_layer.o ipath_lib.o ipath_upages.o
+
+ib_ipath-y := ipath_mad.o ipath_verbs.o
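
For anyone who wants to build this once the prerequisite OpenIB patches are in place, only the two new options need enabling on top of the usual InfiniBand ones; a hypothetical .config fragment for a modular build would be:

CONFIG_PCI_MSI=y
CONFIG_INFINIBAND=m
CONFIG_IPATH_CORE=m
CONFIG_INFINIBAND_IPATH=m

IPATH_CORE depends on PCI_MSI, and INFINIBAND_IPATH depends on both IPATH_CORE and INFINIBAND, per the Kconfig above.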

2005-12-29 00:41:47

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 11 of 20] ipath - core driver, part 4 of 4

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r c37b118ef806 -r e8af3873b0d9 drivers/infiniband/hw/ipath/ipath_driver.c
--- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:43 2005 -0800
@@ -5408,3 +5408,1709 @@

return ret;
}
+
+/*
+ * implementation of the ioctl to get the stats values from the driver
+ * The argument is the user address to which we do the copy_to_user()
+ */
+static int ipath_get_stats(struct infinipath_stats __user *ustats)
+{
+ int ret = 0;
+
+ if ((ret = copy_to_user(ustats, &ipath_stats, sizeof(ipath_stats)))) {
+ _IPATH_DBG("copy_to_user error on driver stats\n");
+ ret = -EFAULT;
+ }
+
+ return ret;
+}
+
+/* set a partition key. We can have up to 4 active at a time (other than
+ * the default, which is always allowed). This is somewhat tricky, since
+ * multiple ports may set the same key, so we reference count them, and
+ * clean up at exit. All 4 partition keys are packed into a single
+ * infinipath register. It's an error for a process to set the same
+ * pkey multiple times. We provide no mechanism to de-allocate a pkey
+ * at this time; we may eventually need to do that.
+ * I've used atomic operations and no locking, and only make a single
+ * pass through what's available. This should be more than adequate
+ * for some time. I'll think about spinlocks or the like if and when
+ * it becomes necessary.
+ */
+static int ipath_set_partkey(struct ipath_portdata *pd, uint16_t key)
+{
+ struct ipath_devdata *dd;
+ int i, any = 0, pidx = -1;
+ uint16_t lkey = key & 0x7FFF;
+
+ dd = &devdata[pd->port_unit];
+
+ if (lkey == (IPS_DEFAULT_P_KEY & 0x7FFF)) {
+ /* nothing to do; this key always valid */
+ return 0;
+ }
+
+ _IPATH_VDBG
+ ("p%u try to set pkey %hx, current keys %hx:%x %hx:%x %hx:%x %hx:%x\n",
+ pd->port_port, key, dd->ipath_pkeys[0],
+ atomic_read(&dd->ipath_pkeyrefs[0]), dd->ipath_pkeys[1],
+ atomic_read(&dd->ipath_pkeyrefs[1]), dd->ipath_pkeys[2],
+ atomic_read(&dd->ipath_pkeyrefs[2]), dd->ipath_pkeys[3],
+ atomic_read(&dd->ipath_pkeyrefs[3]));
+
+ if (!lkey) {
+ _IPATH_PRDBG("p%u tries to set key 0, not allowed\n",
+ pd->port_port);
+ return -EINVAL;
+ }
+
+ /*
+ * Set the full membership bit, because it has to be
+ * set in the register or the packet, and it seems
+ * cleaner to set in the register than to force all
+ * callers to set it. (see bug 4331)
+ */
+ key |= 0x8000;
+
+ for (i = 0; i < ARRAY_SIZE(pd->port_pkeys); i++) {
+ if (!pd->port_pkeys[i] && pidx == -1)
+ pidx = i;
+ if (pd->port_pkeys[i] == key) {
+ _IPATH_VDBG
+ ("p%u tries to set same pkey (%x) more than once\n",
+ pd->port_port, key);
+ return -EEXIST;
+ }
+ }
+ if (pidx == -1) {
+ _IPATH_DBG
+ ("All pkeys for port %u already in use, can't set %x\n",
+ pd->port_port, key);
+ return -EBUSY;
+ }
+ for (any = i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) {
+ if (!dd->ipath_pkeys[i]) {
+ any++;
+ continue;
+ }
+ if (dd->ipath_pkeys[i] == key) {
+ if (atomic_inc_return(&dd->ipath_pkeyrefs[i]) > 1) {
+ pd->port_pkeys[pidx] = key;
+ _IPATH_VDBG
+ ("p%u set key %x matches #%d, count now %d\n",
+ pd->port_port, key, i,
+ atomic_read(&dd->ipath_pkeyrefs[i]));
+ return 0;
+ } else {
+ /* lost race, decrement count, catch below */
+ atomic_dec(&dd->ipath_pkeyrefs[i]);
+ _IPATH_VDBG
+ ("Lost race, count was 0, after dec, it's %d\n",
+ atomic_read(&dd->ipath_pkeyrefs[i]));
+ any++;
+ }
+ }
+ if ((dd->ipath_pkeys[i] & 0x7FFF) == lkey) {
+ /*
+ * It makes no sense to have both the limited and full
+ * membership PKEY set at the same time since the
+ * unlimited one will disable the limited one.
+ */
+ return -EEXIST;
+ }
+ }
+ if (!any) {
+ _IPATH_DBG
+ ("port %u, all pkeys already in use, can't set %x\n",
+ pd->port_port, key);
+ return -EBUSY;
+ }
+ for (any = i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) {
+ if (!dd->ipath_pkeys[i] &&
+ atomic_inc_return(&dd->ipath_pkeyrefs[i]) == 1) {
+ uint64_t pkey;
+
+ /* for ipathstats, etc. */
+ ipath_stats.sps_pkeys[i] = lkey;
+ pd->port_pkeys[pidx] = dd->ipath_pkeys[i] = key;
+ pkey =
+ (uint64_t) dd->ipath_pkeys[0] |
+ ((uint64_t) dd->ipath_pkeys[1] << 16) |
+ ((uint64_t) dd->ipath_pkeys[2] << 32) |
+ ((uint64_t) dd->ipath_pkeys[3] << 48);
+ _IPATH_PRDBG
+ ("p%u set key %x in #%d, portidx %d, new pkey reg %llx\n",
+ pd->port_port, key, i, pidx, pkey);
+ ipath_kput_kreg(pd->port_unit, kr_partitionkey, pkey);
+
+ return 0;
+ }
+ }
+ _IPATH_DBG
+ ("port %u, all pkeys already in use 2nd pass, can't set %x\n",
+ pd->port_port, key);
+ return -EBUSY;
+}
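
The packing of the four 16-bit keys into the single partition key register, done in the success path of ipath_set_partkey() above, is worth seeing in isolation. A standalone sketch (not driver code; the example key values are made up):

#include <stdint.h>
#include <stdio.h>

/*
 * Pack four 16-bit partition keys into one 64-bit value, matching the
 * layout written to kr_partitionkey in ipath_set_partkey() above.
 */
static uint64_t pack_pkeys(const uint16_t pkeys[4])
{
	return (uint64_t) pkeys[0] |
	       ((uint64_t) pkeys[1] << 16) |
	       ((uint64_t) pkeys[2] << 32) |
	       ((uint64_t) pkeys[3] << 48);
}

int main(void)
{
	/* 0xffff is the usual IB default pkey; 0x8001 is key 0x0001 with
	 * the full membership bit (0x8000) set, as the driver does before
	 * storing the key. */
	uint16_t pkeys[4] = { 0xffff, 0x8001, 0, 0 };

	printf("partition key register: %#llx\n",
	       (unsigned long long) pack_pkeys(pkeys));
	return 0;
}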
+
+/*
+ * start_stop == 0 disables receive on the port, for use in queue overflow
+ * conditions. start_stop == 1 re-enables; the in-memory tail is forced to 0,
+ * so the software copy of the head register can be re-initialized to match.
+ */
+
+static int ipath_manage_rcvq(struct ipath_portdata * pd, uint16_t start_stop)
+{
+ struct ipath_devdata *dd;
+ /*
+ * This needs to be volatile, so that the compiler doesn't
+ * optimize away the read to the device's mapped memory.
+ */
+ volatile uint64_t tval;
+
+ dd = &devdata[pd->port_unit];
+ _IPATH_PRDBG("%sabling rcv for unit %u port %u\n",
+ start_stop ? "en" : "dis", pd->port_unit, pd->port_port);
+ /* atomically clear receive enable port. */
+ if (start_stop) {
+ /*
+ * on enable, force in-memory copy of the tail register
+ * to 0, so that protocol code doesn't have to worry
+ * about whether or not the chip has yet updated
+ * the in-memory copy or not on return from the system
+		 * call. The chip always resets its tail register back
+ * to 0 on a transition from disabled to enabled.
+ * This could cause a problem if software was broken,
+ * and did the enable w/o the disable, but eventually
+ * the in-memory copy will be updated and correct
+ * itself, even in the face of software bugs.
+ */
+ *pd->port_rcvhdrtail_kvaddr = 0;
+ atomic_set_mask(1U <<
+ (INFINIPATH_R_PORTENABLE_SHIFT + pd->port_port),
+ &dd->ipath_rcvctrl);
+ } else
+ atomic_clear_mask(1U <<
+ (INFINIPATH_R_PORTENABLE_SHIFT +
+ pd->port_port), &dd->ipath_rcvctrl);
+ ipath_kput_kreg(pd->port_unit, kr_rcvctrl, dd->ipath_rcvctrl);
+ /* now be sure chip saw it before we return */
+ tval = ipath_kget_kreg64(pd->port_unit, kr_scratch);
+ if (start_stop) {
+ /*
+ * and try to be sure that tail reg update has happened
+ * too. This should in theory interlock with the RXE
+ * changes to the tail register. Don't assign it to
+ * the tail register in memory copy, since we could
+ * overwrite an update by the chip if we did.
+ */
+ tval =
+ ipath_kget_ureg32(pd->port_unit, ur_rcvhdrtail,
+ pd->port_port);
+ }
+ /* always; new head should be equal to new tail; see above */
+ return 0;
+}
+
+/*
+ * This routine is now quite different for user and kernel, because
+ * the kernel uses skbs for accelerated network performance.
+ * This is the user port version.
+ *
+ * Allocate the eager TID buffers and program them into infinipath.
+ * They are no longer completely contiguous; we do multiple
+ * alloc_pages() calls.
+ */
+static int ipath_create_user_egr(struct ipath_portdata * pd)
+{
+ char *buf;
+ struct ipath_devdata *dd = &devdata[pd->port_unit];
+ uint64_t __iomem *egrbase;
+ uint64_t egroff, lenvalid;
+ unsigned e, egrcnt, alloced, order, egrperchunk, chunk;
+ unsigned long pa, pent;
+
+ egrcnt = dd->ipath_rcvegrcnt;
+ egroff =
+ dd->ipath_rcvegrbase + pd->port_port * egrcnt * sizeof(*egrbase);
+ egrbase = (uint64_t __iomem *)
+ ((char __iomem *)(dd->ipath_kregbase) + egroff);
+ _IPATH_VDBG("Allocating %d egr buffers, at chip offset %llx (%p)\n",
+ egrcnt, egroff, egrbase);
+
+ /*
+ * to avoid wasting a lot of memory, we allocate 32KB chunks of
+ * physically contiguous memory, advance through it until used up
+ * and then allocate more. Of course, we need memory to store
+ * those extra pointers, now. Started out with 256KB, but under
+ * heavy memory pressure (creating large files and then copying
+ * them over NFS while doing lots of MPI jobs), we hit some
+ * alloc_pages() failures, even though we can sleep... (2.6.10)
+	 * Still get failures at 64K. 32K is the lowest we can go without
+	 * wasting more memory again. It seems likely that the coalescing
+ * in free_pages, etc. still has issues (as it has had previously
+ * during 2.6.x development).
+ */
+ order = get_order(0x8000);
+ alloced = ALIGN(dd->ipath_rcvegrbufsize * egrcnt,
+ (1 << order) * PAGE_SIZE);
+ egrperchunk = ((1 << order) * PAGE_SIZE) / dd->ipath_rcvegrbufsize;
+ chunk = (egrcnt + egrperchunk - 1) / egrperchunk;
+ pd->port_rcvegrbuf_chunks = chunk;
+ pd->port_rcvegrbufs_perchunk = egrperchunk;
+ pd->port_rcvegrbuf_order = order;
+ pd->port_rcvegrbuf_pages =
+ vmalloc(chunk * sizeof(pd->port_rcvegrbuf_pages[0]));
+ pd->port_rcvegrbuf_virt =
+ vmalloc(chunk * sizeof(pd->port_rcvegrbuf_virt[0]));
+	if (!pd->port_rcvegrbuf_pages || !pd->port_rcvegrbuf_virt) {
+ _IPATH_UNIT_ERROR(pd->port_unit,
+ "Unable to allocate %u EGR buffer array pointers\n",
+ chunk);
+ if (pd->port_rcvegrbuf_pages) {
+ vfree(pd->port_rcvegrbuf_pages);
+ pd->port_rcvegrbuf_pages = NULL;
+ }
+ return -ENOMEM;
+ }
+ for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) {
+ /*
+ * GFP_USER, but without GFP_FS, so buffer cache can
+ * be coalesced (we hope); otherwise, even at order 4, heavy
+ * filesystem activity makes these fail
+ */
+ if (!
+ (pd->port_rcvegrbuf_pages[e] =
+ alloc_pages(__GFP_WAIT | __GFP_IO, order))) {
+ _IPATH_UNIT_ERROR(pd->port_unit,
+ "Unable to allocate EGR buffer array %u/%u\n",
+ e, pd->port_rcvegrbuf_chunks);
+ vfree(pd->port_rcvegrbuf_pages);
+ pd->port_rcvegrbuf_pages = NULL;
+ vfree(pd->port_rcvegrbuf_virt);
+ pd->port_rcvegrbuf_virt = NULL;
+ return -ENOMEM;
+ }
+ }
+
+ /*
+	 * calculate the physical address (virt_to_phys()), so that we get an
+	 * address that fits in 44 bits; that way we can mmap64 both the chip
+	 * and these buffers from 32 bit programs, rather than using kernel
+	 * virtual addresses (mmap64 for 32 bit programs on i386 and x86_64
+	 * only has 44 bits of address, because it uses mmap2())
+ * We do this with the first chunk; We don't need a kernel
+ * virtually contiguous address to give the user virtually
+ * contiguous mappings. It just complicates the nopage routine
+ * a little tiny bit ;)
+ */
+ buf = page_address(pd->port_rcvegrbuf_pages[0]);
+ pa = virt_to_phys(buf);
+ pd->port_rcvegr_phys = pa;
+
+ /* in words */
+ lenvalid = (dd->ipath_rcvegrbufsize - pd->port_egrskip) >> 2;
+ _IPATH_VDBG
+ ("port%u egrbuf vaddr %p, cpu %d, egrskip %u, len %llx words\n",
+ pd->port_port, buf, smp_processor_id(), pd->port_egrskip,
+ lenvalid);
+ lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT;
+ lenvalid |= INFINIPATH_RT_VALID;
+
+ for (e = chunk = 0; chunk < pd->port_rcvegrbuf_chunks; chunk++) {
+ int i, n;
+ struct page *p;
+ p = pd->port_rcvegrbuf_pages[chunk];
+ pa = page_to_phys(p);
+ buf = page_address(p);
+ /*
+ * stash away for later use, since page_address() lookup
+ * is not cheap
+ */
+ pd->port_rcvegrbuf_virt[chunk] = buf;
+ if (pa & ~INFINIPATH_RT_ADDR_MASK)
+ _IPATH_INFO
+ ("physaddr %lx has more than 40 bits, using only 40!\n",
+ pa);
+ n = 1 << pd->port_rcvegrbuf_order;
+ for (i = 0; i < n; i++)
+ SetPageReserved(virt_to_page(buf + (i * PAGE_SIZE)));
+
+		/* clear buffer for security, sanity, and debugging */
+ memset(buf, 0, PAGE_SIZE * n);
+
+ for (i = 0; e < egrcnt && i < egrperchunk; e++, i++) {
+ pent = ((pa + pd->port_egrskip) &
+ INFINIPATH_RT_ADDR_MASK) | lenvalid;
+
+ ipath_kput_memq(pd->port_unit, &egrbase[e], pent);
+ _IPATH_VDBG("egr %u phys %lx val %lx\n", e, pa, pent);
+ pa += dd->ipath_rcvegrbufsize;
+ }
+ yield(); /* don't hog the cpu */
+ }
+
+ return 0;
+}
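
The chunking arithmetic at the top of ipath_create_user_egr() is easy to misread, so here is the same calculation as a standalone sketch; the buffer size and TID count are made-up example values, not taken from real hardware:

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* get_order() equivalent: smallest order whose page run covers 'size' */
static unsigned int order_for(unsigned long size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

int main(void)
{
	unsigned long bufsize = 2048 + 64;	/* example eager buffer size */
	unsigned int egrcnt = 512;		/* example eager TID count */
	unsigned int order = order_for(0x8000);	/* 32KB chunks, as above */
	unsigned long chunkbytes = PAGE_SIZE << order;
	unsigned int perchunk = chunkbytes / bufsize;
	unsigned int chunks = (egrcnt + perchunk - 1) / perchunk;

	printf("order %u: %u buffers per %lu-byte chunk, %u chunks for %u TIDs\n",
	       order, perchunk, chunkbytes, chunks, egrcnt);
	return 0;
}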
+
+/*
+ * This routine is now quite different for user and kernel, because
+ * the kernel uses skbs for accelerated network performance.
+ * This is the kernel (port0) version.
+ *
+ * Allocate the eager TID buffers and program them into infinipath.
+ * We use the network layer alloc_skb() allocator to allocate the memory, and
+ * either use the buffers as is for things like SMA packets, or pass
+ * the buffers up to the ipath layered driver and thence the network layer,
+ * replacing them as we do so (see ipath_kreceive())
+ */
+static int ipath_create_port0_egr(struct ipath_portdata * pd)
+{
+ int ret = 0;
+ uint64_t __iomem *egrbase;
+ uint64_t egroff;
+ unsigned e, egrcnt;
+ struct ipath_devdata *dd;
+ struct sk_buff **skbs;
+
+ dd = &devdata[pd->port_unit];
+ egrcnt = dd->ipath_rcvegrcnt;
+ egroff = dd->ipath_rcvegrbase +
+ pd->port_port * egrcnt * sizeof(*egrbase);
+ egrbase = (uint64_t __iomem *) ((char __iomem *)(dd->ipath_kregbase) +
+ egroff);
+ _IPATH_VDBG
+ ("unit%u Allocating %d egr buffers, at chip offset %llx (%p)\n",
+ pd->port_unit, egrcnt, egroff, egrbase);
+
+ skbs = vmalloc(sizeof(*dd->ipath_port0_skbs) * egrcnt);
+ if (skbs == NULL)
+ ret = -ENOMEM;
+ else {
+ for (e = 0; e < egrcnt; e++) {
+ /*
+ * This is a bit tricky in that we allocate
+ * extra space for 2 bytes of the 14 byte
+ * ethernet header. These two bytes are passed
+ * in the ipath header so the rest of the data
+ * is word aligned. We allocate 4 bytes so that the
+ * data buffer stays word aligned.
+ * See ipath_kreceive() for more details.
+ */
+ skbs[e] =
+ __dev_alloc_skb(dd->ipath_ibmaxlen + 4, GFP_KERNEL);
+ if (skbs[e] == NULL) {
+ _IPATH_UNIT_ERROR(pd->port_unit,
+ "SKB allocation error for eager TID %u\n",
+ e);
+ while (e != 0)
+ dev_kfree_skb(skbs[--e]);
+ ret = -ENOMEM;
+ break;
+ }
+ skb_reserve(skbs[e], 4);
+ }
+ }
+ /*
+ * after loop above, so we can test non-NULL
+ * to see if ready to use at receive, etc. Hope this fixes some
+ * panics.
+ */
+ dd->ipath_port0_skbs = skbs;
+
+ /*
+ * have to tell chip each time we init it
+ * even if we are re-using previous memory.
+ */
+ if (!ret) {
+ uint64_t lenvalid; /* in words */
+
+ lenvalid = (dd->ipath_ibmaxlen - pd->port_egrskip) >> 2;
+ lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT;
+ lenvalid |= INFINIPATH_RT_VALID;
+ for (e = 0; e < egrcnt; e++) {
+ unsigned long pa, pent;
+
+ pa = virt_to_phys(dd->ipath_port0_skbs[e]->data);
+ pa += pd->port_egrskip;
+ if (!e && (pa & ~INFINIPATH_RT_ADDR_MASK))
+ _IPATH_INFO
+ ("phys addr %lx has more than 40 bits, using only 40!!!\n",
+ pa);
+ pent = (pa & INFINIPATH_RT_ADDR_MASK) | lenvalid;
+ /*
+ * don't need this except extreme debugging,
+ * but leaving to save future typing.
+ * _IPATH_VDBG("egr[%d] %p <- %lx\n", e, &egrbase[e], pent);
+ */
+ ipath_kput_memq(pd->port_unit, &egrbase[e], pent);
+ }
+ yield(); /* don't hog the cpu */
+ }
+
+ return ret;
+}
+
+/*
+ * this *must* be physically contiguous memory, and for now,
+ * that limits it to what kmalloc can do.
+ */
+static int ipath_create_rcvhdrq(struct ipath_portdata * pd)
+{
+ int i, ret = 0, amt, order, pgs;
+ char *qt;
+ struct page *p;
+ unsigned long pa, pa0;
+
+	amt = ALIGN(devdata[pd->port_unit].ipath_rcvhdrcnt *
+		    devdata[pd->port_unit].ipath_rcvhdrentsize *
+		    sizeof(uint32_t), PAGE_SIZE);
+ if (!pd->port_rcvhdrq) {
+ order = get_order(amt);
+ /*
+ * not using REPEAT isn't viable; at 128KB, we can easily fail
+ * this. The problem with REPEAT is we can block here
+		 * "forever". There isn't an in-between, unfortunately.
+ * We could reduce the risk by never freeing the rcvhdrq
+ * except at unload, but even then, the first time a
+ * port is used, we could delay for some time...
+ */
+ p = alloc_pages(GFP_USER, order);
+ if (!p) {
+ _IPATH_UNIT_ERROR(pd->port_unit,
+ "attempt to allocate order %u memory for port %u rcvhdrq failed\n",
+ order, pd->port_port);
+ return -ENOMEM;
+ }
+
+ /*
+ * should use kmap (and later kunmap), even though high mem will
+ * always be mapped on x86_64, to play it safe, but for some
+ * bizarre reason these aren't exported symbols...
+ */
+ pd->port_rcvhdrq = page_address(p);
+ if (!virt_addr_valid(pd->port_rcvhdrq)) {
+ _IPATH_DBG
+ ("weird, virt_addr_valid false right after alloc_pages\n");
+ _IPATH_DBG("__pa(%p) is %lx, num_physpages %lx\n",
+ pd->port_rcvhdrq, __pa(pd->port_rcvhdrq),
+ num_physpages);
+ }
+ pd->port_rcvhdrq_phys = virt_to_phys(pd->port_rcvhdrq);
+ pd->port_rcvhdrq_order = order;
+
+ pa0 = pd->port_rcvhdrq_phys;
+ pgs = amt >> PAGE_SHIFT;
+ _IPATH_VDBG
+ ("%d pages at %p (phys %lx) order=%u for port %u rcvhdr Q\n",
+ pgs, pd->port_rcvhdrq, pa0, pd->port_rcvhdrq_order,
+ pd->port_port);
+
+ /*
+		 * verify it's really physically contiguous, to be paranoid;
+		 * also mark pages as reserved, to avoid problems when a
+		 * user process that has them mapped then exits.
+ */
+ qt = pd->port_rcvhdrq;
+ SetPageReserved(virt_to_page(qt));
+ qt += PAGE_SIZE;
+ for (pa = pa0, i = 1; i < pgs; i++, qt += PAGE_SIZE) {
+ SetPageReserved(virt_to_page(qt));
+ pa = virt_to_phys(qt);
+ if (pa != (pa0 + (i * PAGE_SIZE)))
+ _IPATH_INFO
+ ("pg %d at %p phys %lx not contiguous\n", i,
+ qt, pa);
+ else
+ _IPATH_VDBG("pg %d at %p phys %lx\n", i, qt,
+ pa);
+ }
+ }
+
+ /*
+ * clear for security, sanity, and/or debugging (each time we
+ * use/reuse)
+ */
+ memset(pd->port_rcvhdrq, 0, amt);
+
+ /*
+ * tell chip each time we init it, even if we are re-using previous
+ * memory (we zero it at process close)
+ */
+ _IPATH_VDBG("writing port %d rcvhdraddr as %lx\n", pd->port_port,
+ pd->port_rcvhdrq_phys);
+ ipath_kput_kreg_port(pd->port_unit, kr_rcvhdraddr, pd->port_port,
+ pd->port_rcvhdrq_phys);
+
+ return ret;
+}
+
+#ifdef _IPATH_EXTRA_DEBUG
+/*
+ * occasionally useful to dump the full set of kernel registers for debugging.
+ */
+static void ipath_dump_allregs(char *what, ipath_type t)
+{
+ uint16_t reg;
+ _IPATH_DBG("%s\n", what);
+ for (reg = 0; reg <= 0x100; reg++) {
+ uint64_t v = ipath_kget_kreg64(t, reg);
+ if (!(reg % 4))
+ printk("\n%3x: ", reg);
+ printk("%16llx ", v);
+ }
+ printk("\n");
+}
+#endif /* _IPATH_EXTRA_DEBUG */
+
+/*
+ * Do the actual initialization sequence on the chip. For the real
+ * hardware, this is done from the init routine called from the PCI
+ * infrastructure.
+ */
+int ipath_init_chip(const ipath_type t)
+{
+ int ret = 0, i;
+ uint32_t val32, kpiobufs;
+ uint64_t val, atmp;
+ uint32_t __iomem *piobuf;
+ uint32_t pioincr;
+ struct ipath_devdata *dd = &devdata[t];
+ struct ipath_portdata *pd;
+ struct page *vpage;
+ char boardn[32];
+
+ /* first time only, set after static version info */
+ if (!chip_driver_version) {
+ i = strlen(ipath_core_version);
+ chip_driver_version = ipath_core_version + i;
+ chip_driver_size = sizeof ipath_core_version - i;
+ }
+
+ /*
+ * have to clear shadow copies of registers at init that are not
+ * otherwise set here, or all kinds of bizarre things happen with
+ * driver on chip reset
+ */
+ dd->ipath_rcvhdrsize = 0;
+
+ /*
+ * don't clear ipath_flags as 8bit mode was set before entering
+ * this func. However, we do set the linkstate to unknown
+ */
+
+ /* so we can watch for a transition */
+ dd->ipath_flags |= IPATH_LINKUNK;
+ dd->ipath_flags &= ~(IPATH_LINKACTIVE | IPATH_LINKARMED | IPATH_LINKDOWN
+ | IPATH_LINKINIT);
+
+ _IPATH_VDBG("Try to read spc chip revision\n");
+ dd->ipath_revision = ipath_kget_kreg64(t, kr_revision);
+
+ /*
+ * set up fundamental info we need to use the chip; we assume if
+ * the revision reg and these regs are OK, we don't need to special
+ * case the rest
+ */
+ dd->ipath_sregbase = ipath_kget_kreg32(t, kr_sendregbase);
+ dd->ipath_cregbase = ipath_kget_kreg32(t, kr_counterregbase);
+ dd->ipath_uregbase = ipath_kget_kreg32(t, kr_userregbase);
+ _IPATH_VDBG("ipath_kregbase %p, sendbase %x usrbase %x, cntrbase %x\n",
+ dd->ipath_kregbase, dd->ipath_sregbase, dd->ipath_uregbase,
+ dd->ipath_cregbase);
+ if ((dd->ipath_revision & 0xffffffff) == 0xffffffff ||
+ (dd->ipath_sregbase & 0xffffffff) == 0xffffffff ||
+ (dd->ipath_cregbase & 0xffffffff) == 0xffffffff ||
+ (dd->ipath_uregbase & 0xffffffff) == 0xffffffff) {
+ _IPATH_UNIT_ERROR(t,
+ "Register read failures from chip, giving up initialization\n");
+ ret = -ENODEV;
+ goto done;
+ }
+
+ /* clear the initial reset flag, in case first driver load */
+ ipath_kput_kreg(t, kr_errorclear, INFINIPATH_E_RESET);
+
+ dd->ipath_portcnt = ipath_kget_kreg32(t, kr_portcnt);
+ if (!infinipath_cfgports)
+ dd->ipath_cfgports = dd->ipath_portcnt;
+ else if (infinipath_cfgports <= dd->ipath_portcnt) {
+ dd->ipath_cfgports = infinipath_cfgports;
+ _IPATH_DBG("Configured to use %u ports out of %u in chip\n",
+ dd->ipath_cfgports, dd->ipath_portcnt);
+ } else {
+ dd->ipath_cfgports = dd->ipath_portcnt;
+ _IPATH_DBG
+		    ("Tried to configure %u ports; chip only supports %u\n",
+ infinipath_cfgports, dd->ipath_portcnt);
+ }
+ dd->ipath_pd = kmalloc(sizeof(*dd->ipath_pd) * dd->ipath_cfgports,
+ GFP_KERNEL);
+ if (!dd->ipath_pd) {
+ _IPATH_UNIT_ERROR(t,
+ "Unable to allocate portdata array, failing\n");
+ ret = -ENOMEM;
+ goto done;
+ }
+ memset(dd->ipath_pd, 0, sizeof(*dd->ipath_pd) * dd->ipath_cfgports);
+
+ dd->ipath_lastegrheads = kmalloc(sizeof(*dd->ipath_lastegrheads)
+ * dd->ipath_cfgports, GFP_KERNEL);
+ dd->ipath_lastrcvhdrqtails = kmalloc(sizeof(*dd->ipath_lastrcvhdrqtails)
+ * dd->ipath_cfgports, GFP_KERNEL);
+ if (!dd->ipath_lastegrheads || !dd->ipath_lastrcvhdrqtails) {
+ _IPATH_UNIT_ERROR(t,
+ "Unable to allocate head arrays, failing\n");
+ ret = -ENOMEM;
+ goto done;
+ }
+ memset(dd->ipath_lastrcvhdrqtails, 0,
+ sizeof(*dd->ipath_lastrcvhdrqtails)
+ * dd->ipath_cfgports);
+ memset(dd->ipath_lastegrheads, 0, sizeof(*dd->ipath_lastegrheads)
+ * dd->ipath_cfgports);
+
+ dd->ipath_pd[0] = kmalloc(sizeof(struct ipath_portdata), GFP_KERNEL);
+ if (!dd->ipath_pd[0]) {
+ _IPATH_UNIT_ERROR(t,
+ "Unable to allocate portdata for port 0, failing\n");
+ ret = -ENOMEM;
+ goto done;
+ }
+ memset(dd->ipath_pd[0], 0, sizeof(struct ipath_portdata));
+
+ pd = dd->ipath_pd[0];
+ pd->port_unit = t;
+ pd->port_port = 0;
+ pd->port_cnt = 1;
+ /* The port 0 pkey table is used by the layer interface. */
+ pd->port_pkeys[0] = IPS_DEFAULT_P_KEY;
+
+ dd->ipath_rcvtidcnt = ipath_kget_kreg32(t, kr_rcvtidcnt);
+ dd->ipath_rcvtidbase = ipath_kget_kreg32(t, kr_rcvtidbase);
+ dd->ipath_rcvegrcnt = ipath_kget_kreg32(t, kr_rcvegrcnt);
+ dd->ipath_rcvegrbase = ipath_kget_kreg32(t, kr_rcvegrbase);
+ dd->ipath_palign = ipath_kget_kreg32(t, kr_pagealign);
+ dd->ipath_piobufbase = ipath_kget_kreg32(t, kr_sendpiobufbase);
+ dd->ipath_piosize = ipath_kget_kreg32(t, kr_sendpiosize);
+ dd->ipath_ibmtu = 4096; /* default to largest legal MTU */
+ dd->ipath_piobcnt = ipath_kget_kreg32(t, kr_sendpiobufcnt);
+ dd->ipath_piobase = (((char __iomem *) dd->ipath_kregbase) +
+ (dd->ipath_piobufbase & 0xffffffff));
+
+ _IPATH_VDBG
+ ("Revision %llx (PCI %x), %u ports, %u tids, %u egrtids, %u piobufs\n",
+ dd->ipath_revision, dd->ipath_pcirev, dd->ipath_portcnt,
+ dd->ipath_rcvtidcnt, dd->ipath_rcvegrcnt, dd->ipath_piobcnt);
+
+ if (((dd->ipath_revision >> INFINIPATH_R_SOFTWARE_SHIFT) & INFINIPATH_R_SOFTWARE_MASK) != IPATH_CHIP_SWVERSION) { /* >= maybe, someday */
+ _IPATH_UNIT_ERROR(t,
+				  "Driver only handles version %d, chip swversion is %d (%llx), failing\n",
+ IPATH_CHIP_SWVERSION,
+ (int)(dd->
+ ipath_revision >>
+ INFINIPATH_R_SOFTWARE_SHIFT) &
+ INFINIPATH_R_SOFTWARE_MASK,
+ dd->ipath_revision);
+ ret = -ENOSYS;
+ goto done;
+ }
+ dd->ipath_majrev = (uint8_t) ((dd->ipath_revision >>
+ INFINIPATH_R_CHIPREVMAJOR_SHIFT) &
+ INFINIPATH_R_CHIPREVMAJOR_MASK);
+ dd->ipath_minrev =
+ (uint8_t) ((dd->
+ ipath_revision >> INFINIPATH_R_CHIPREVMINOR_SHIFT) &
+ INFINIPATH_R_CHIPREVMINOR_MASK);
+ dd->ipath_boardrev =
+ (uint8_t) ((dd->
+ ipath_revision >> INFINIPATH_R_BOARDID_SHIFT) &
+ INFINIPATH_R_BOARDID_MASK);
+
+ ipath_get_boardname(t, boardn, sizeof boardn);
+
+ {
+ snprintf(chip_driver_version, chip_driver_size,
+ "Driver %u.%u, %s, InfiniPath%u %u.%u, PCI %u, SW Compat %u\n",
+ IPATH_CHIP_VERS_MAJ, IPATH_CHIP_VERS_MIN, boardn,
+ (unsigned)(dd->
+ ipath_revision >> INFINIPATH_R_ARCH_SHIFT) &
+ INFINIPATH_R_ARCH_MASK, dd->ipath_majrev,
+ dd->ipath_minrev, dd->ipath_pcirev,
+ (unsigned)(dd->
+ ipath_revision >>
+ INFINIPATH_R_SOFTWARE_SHIFT) &
+ INFINIPATH_R_SOFTWARE_MASK);
+
+ }
+
+ _IPATH_DBG("%s", chip_driver_version);
+
+ /*
+ * we ignore most issues after reporting them, but have to specially
+ * handle hardware-disabled chips.
+ */
+ if (ipath_validate_rev(dd) == 2) {
+ ret = -EPERM; /* unique error, known to infinipath_init_one() */
+ goto done;
+ }
+
+ /*
+ * zero all the TID entries at startup. We do this for sanity,
+ * in case of a previous driver crash of some kind, and also
+ * because the chip powers up with these memories in an unknown
+ * state. Use portcnt, not cfgports, since this is for the full chip,
+ * not for current (possibly different) configuration value
+ * Chip Errata bug 6447
+ */
+ for (val32 = 0; val32 < dd->ipath_portcnt; val32++)
+ ipath_clear_tids(t, val32);
+
+ dd->ipath_rcvhdrentsize = IPATH_RCVHDRENTSIZE;
+ /* we could bump this
+ * to allow for full rcvegrcnt + rcvtidcnt, but then it no
+ * longer nicely fits power of two, and since we now use
+ * alloc_pages, the rest would be wasted.
+ */
+ dd->ipath_rcvhdrcnt = dd->ipath_rcvegrcnt;
+ /*
+ * setup offset of last valid entry in rcvhdrq, for various tests, to
+ * avoid calculating each time we need it
+ */
+ dd->ipath_hdrqlast =
+ dd->ipath_rcvhdrentsize * (dd->ipath_rcvhdrcnt - 1);
+ ipath_kput_kreg(t, kr_rcvhdrentsize, dd->ipath_rcvhdrentsize);
+ ipath_kput_kreg(t, kr_rcvhdrcnt, dd->ipath_rcvhdrcnt);
+ /*
+ * not in ipath_rcvhdrsize, so user programs can set differently, but
+ * so any early packets see the default size.
+ */
+ ipath_kput_kreg(t, kr_rcvhdrsize, IPATH_DFLT_RCVHDRSIZE);
+
+ /*
+ * we "know" that this works
+ * out OK. It's actually a bit more than we need, but 2048+64 isn't
+ * quite enough for full size, and we want the +N to be a power of 2
+ * to give us reasonable alignment and fit within page_alloc()'ed
+ * memory
+ */
+ dd->ipath_rcvegrbufsize = dd->ipath_piosize;
+
+ /*
+ * the min() check here is currently a nop, but it may not always be,
+ * depending on just how we do ipath_rcvegrbufsize
+ */
+ dd->ipath_ibmaxlen = min(dd->ipath_piosize, dd->ipath_rcvegrbufsize);
+ dd->ipath_init_ibmaxlen = dd->ipath_ibmaxlen;
+
+ /*
+ * set up the shadow copies of the piobufavail registers, which
+ * we compare against the chip registers for now, and the in
+ * memory DMA'ed copies of the registers. This has to be done
+ * early, before we calculate lastport, etc.
+ */
+ val = dd->ipath_piobcnt;
+ /*
+ * calc number of pioavail registers, and save it; we have 2 bits
+ * per buffer
+ */
+	dd->ipath_pioavregs =
+	    ALIGN(val, sizeof(uint64_t) * BITS_PER_BYTE / 2) /
+	    (sizeof(uint64_t) * BITS_PER_BYTE / 2);
+ if (dd->ipath_pioavregs >
+ (sizeof(dd->ipath_pioavailshadow) /
+ sizeof(dd->ipath_pioavailshadow[0]))) {
+ dd->ipath_pioavregs =
+ sizeof(dd->ipath_pioavailshadow) /
+ sizeof(dd->ipath_pioavailshadow[0]);
+ dd->ipath_piobcnt = dd->ipath_pioavregs * sizeof(uint64_t) * BITS_PER_BYTE >> 1; /* 2 bits/reg */
+ _IPATH_INFO
+ ("Warning: %lld piobufs is too many to fit in shadow, only using %d\n",
+ val, dd->ipath_piobcnt);
+ }
+
+ if (!infinipath_kpiobufs) {
+ /* have to have at least one, for SMA */
+ kpiobufs = infinipath_kpiobufs = 1;
+ } else if (dd->ipath_piobcnt <
+ (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT)) {
+ _IPATH_INFO
+ ("Too few PIO buffers (%u) for %u ports to have %u each!\n",
+ dd->ipath_piobcnt, dd->ipath_cfgports,
+ IPATH_MIN_USER_PORT_BUFCNT);
+ kpiobufs = 1; /* reserve just the minimum for SMA/ether */
+ } else
+ kpiobufs = infinipath_kpiobufs;
+
+ if (kpiobufs >
+ (dd->ipath_piobcnt -
+ (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT))) {
+ i = dd->ipath_piobcnt -
+ (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT);
+ if (i < 0)
+ i = 0;
+ _IPATH_INFO
+ ("Allocating %d PIO bufs for kernel leaves too few for %d user ports (%d each); using %u\n",
+ kpiobufs, dd->ipath_cfgports - 1,
+ IPATH_MIN_USER_PORT_BUFCNT, i);
+ /*
+ * shouldn't change infinipath_kpiobufs, because could be
+ * different for different devices...
+ */
+ kpiobufs = i;
+ }
+ dd->ipath_lastport_piobuf = dd->ipath_piobcnt - kpiobufs;
+ dd->ipath_pbufsport = dd->ipath_cfgports > 1 ?
+ dd->ipath_lastport_piobuf / (dd->ipath_cfgports - 1) : 0;
+ val32 = dd->ipath_lastport_piobuf -
+ (dd->ipath_pbufsport * (dd->ipath_cfgports - 1));
+ if (val32 > 0) {
+ _IPATH_DBG
+ ("allocating %u pbufs/port leaves %u unused, add to kernel\n",
+ dd->ipath_pbufsport, val32);
+ dd->ipath_lastport_piobuf -= val32;
+ _IPATH_DBG("%u pbufs/port leaves %u unused, add to kernel\n",
+ dd->ipath_pbufsport, val32);
+ }
+ dd->ipath_lastpioindex = dd->ipath_lastport_piobuf;
+ _IPATH_VDBG
+ ("%d PIO bufs %u - %u, %u each for %u user ports\n",
+ kpiobufs, dd->ipath_lastport_piobuf, dd->ipath_piobcnt, dd->ipath_pbufsport,
+ dd->ipath_cfgports - 1);
+
+ /*
+	 * this has to be page aligned, and on a page of its own, so we
+ * can map it into user space. We also use it to give processes
+ * a copy of ipath_statusp, on a separate cacheline, followed by
+ * a copy of the freeze error string, if it's happened. Might also
+ * use that space for other things.
+ */
+ val = ALIGN(2 * L1_CACHE_BYTES + sizeof(*dd->ipath_statusp) +
+ dd->ipath_pioavregs * sizeof(uint64_t), 2 * PAGE_SIZE);
+ if (!(dd->ipath_pioavailregs_dma = kmalloc(val * sizeof(uint64_t),
+ GFP_KERNEL))) {
+ _IPATH_UNIT_ERROR(t,
+ "failed to allocate PIOavail reg area in memory\n");
+ ret = -ENOMEM;
+ goto done;
+ }
+ if ((PAGE_SIZE - 1) & (uint64_t) dd->ipath_pioavailregs_dma) {
+ dd->__ipath_pioavailregs_base = dd->ipath_pioavailregs_dma;
+ dd->ipath_pioavailregs_dma = (uint64_t *)
+ ALIGN((uint64_t) dd->ipath_pioavailregs_dma, PAGE_SIZE);
+ } else
+ dd->__ipath_pioavailregs_base = dd->ipath_pioavailregs_dma;
+ /*
+ * zero initial, since whole thing mapped
+ * into user space, and don't want info leak, or confusing garbage
+ */
+ memset((void *)dd->ipath_pioavailregs_dma, 0, PAGE_SIZE);
+
+ /*
+ * we really want L2 cache aligned, but for current CPUs of interest,
+ * they are the same.
+ */
+ dd->ipath_statusp = (uint64_t *) ((char *)dd->ipath_pioavailregs_dma +
+ ((2 * L1_CACHE_BYTES +
+ dd->ipath_pioavregs *
+ sizeof(uint64_t)) &
+ ~L1_CACHE_BYTES));
+ /* copy the current value now that it's really allocated */
+ *dd->ipath_statusp = dd->_ipath_status;
+ /*
+ * setup buffer to hold freeze msg, accessible to apps, following
+ * statusp
+ */
+ dd->ipath_freezemsg = (char *)&dd->ipath_statusp[1];
+	/* and its length */
+ dd->ipath_freezelen = L1_CACHE_BYTES - sizeof(dd->ipath_statusp[0]);
+
+ atmp = virt_to_phys(dd->ipath_pioavailregs_dma);
+ /* stash physical address for user progs */
+ dd->ipath_pioavailregs_phys = atmp;
+ (void)ipath_kput_kreg(t, kr_sendpioavailaddr, atmp);
+ /*
+ * this is to detect s/w errors, which the h/w works around by
+ * ignoring the low 6 bits of address, if it wasn't aligned.
+ */
+ val = ipath_kget_kreg64(t, kr_sendpioavailaddr);
+ if (val != atmp) {
+ _IPATH_UNIT_ERROR(t,
+ "Catastrophic software error, SendPIOAvailAddr written as %llx, read back as %llx\n",
+ atmp, val);
+ ret = -EINVAL;
+ goto done;
+ }
+
+ if (t * 64 > (sizeof(ipath_port0_rcvhdrtail) - 64)) {
+ _IPATH_UNIT_ERROR(t,
+ "unit %u too large for port 0 rcvhdrtail buffer size\n",
+ t);
+ ret = -ENODEV;
+ }
+
+ /*
+	 * kernel modules are loaded into vmalloc'ed memory; verify that
+	 * when we rely on that, mapping to phys and back to virt gives
+	 * the right contents, so we know we did the mapping right.
+ */
+ vpage = vmalloc_to_page((void *)ipath_port0_rcvhdrtail);
+ if (vpage == NOPAGE_SIGBUS || vpage == NOPAGE_OOM) {
+ _IPATH_UNIT_ERROR(t, "vmalloc_to_page for rcvhdrtail fails!\n");
+ ret = -ENOMEM;
+ goto done;
+ }
+
+ /*
+ * 64 is driven by cache line size, and also by chip requirement
+ * that low 6 bits be 0
+ */
+ val = page_to_phys(vpage) + t * 64;
+
+ /* verify that the alignment requirement was met */
+ ipath_kput_kreg_port(t, kr_rcvhdrtailaddr, 0, val);
+ atmp = ipath_kget_kreg64_port(t, kr_rcvhdrtailaddr, 0);
+ if (val != atmp) {
+ _IPATH_UNIT_ERROR(t,
+ "Catastrophic software error, RcvHdrTailAddr0 written as %llx, read back as %llx from %x\n",
+ val, atmp, kr_rcvhdrtailaddr);
+ ret = -EINVAL;
+ goto done;
+ }
+ /* so we can get current tail in ipath_kreceive(), per chip */
+ dd->ipath_hdrqtailptr =
+ &ipath_port0_rcvhdrtail[t *
+ (64 / sizeof(ipath_port0_rcvhdrtail[0]))];
+
+ ipath_kput_kreg(t, kr_rcvbthqp, IPATH_KD_QP);
+
+ /*
+ * make sure we are not in freeze, and PIO send enabled, so
+ * writes to pbc happen
+ */
+ ipath_kput_kreg(t, kr_hwerrmask, 0ULL);
+ ipath_kput_kreg(t, kr_hwerrclear, -1LL);
+ ipath_kput_kreg(t, kr_control, 0ULL);
+ ipath_kput_kreg(t, kr_sendctrl, INFINIPATH_S_PIOENABLE);
+
+ /*
+ * write the pbc of each buffer, to be sure it's initialized, then
+ * cancel all the buffers, and also abort any packets that might
+ * have been in flight for some reason (the latter is for driver
+ * unload/reload, but isn't a bad idea at first init).
+ * PIO send isn't enabled at this point, so there is no danger
+ * of sending these out on the wire.
+ * Chip Errata bug 6610
+ */
+ piobuf = (uint32_t __iomem *) (((char __iomem *)(dd->ipath_kregbase)) +
+ dd->ipath_piobufbase);
+ pioincr = devdata[t].ipath_palign / sizeof(*piobuf);
+ for (i = 0; i < dd->ipath_piobcnt; i++) {
+ writel(16, piobuf); /* reasonable word count, just to init pbc */
+ piobuf += pioincr;
+ }
+ /* self-clearing */
+ ipath_kput_kreg(t, kr_sendctrl, INFINIPATH_S_ABORT);
+
+ /*
+ * before error clears, since we expect serdes pll errors during
+ * this, the first time after reset
+ */
+ if (ipath_bringup_link(t)) {
+ _IPATH_INFO("Failed to bringup IB link\n");
+ ret = -ENETDOWN;
+ goto done;
+ }
+
+ /*
+ * clear any "expected" hwerrs from reset and/or initialization
+ * clear any that aren't enabled (at least this once), and then
+ * set the enable mask
+ */
+ ipath_clear_init_hwerrs(t);
+ ipath_kput_kreg(t, kr_hwerrclear, -1LL);
+ ipath_kput_kreg(t, kr_hwerrmask, dd->ipath_hwerrmask);
+
+ dd->ipath_maskederrs = dd->ipath_ignorederrs;
+ ipath_kput_kreg(t, kr_errorclear, -1LL); /* clear all */
+ /* enable errors that are masked, at least this first time. */
+ ipath_kput_kreg(t, kr_errormask, ~dd->ipath_maskederrs);
+	/* clear any interrupts up to this point (ints still not enabled) */
+ ipath_kput_kreg(t, kr_intclear, -1LL);
+
+ ipath_stats.sps_lid[t] = dd->ipath_lid;
+
+ /*
+ * allocate the shadow TID array, so we can ipath_putpages
+	 * previous entries. It may make more sense to move the pageshadow
+	 * to the port data structure, so we only allocate memory for ports
+	 * actually in use, since we are at 8KB per port now.
+ */
+ dd->ipath_pageshadow = (struct page **)
+ vmalloc(dd->ipath_cfgports * dd->ipath_rcvtidcnt *
+ sizeof(struct page *));
+ if (!dd->ipath_pageshadow)
+ _IPATH_UNIT_ERROR(t,
+ "failed to allocate shadow page * array, no expected sends!\n");
+ else
+ memset(dd->ipath_pageshadow, 0,
+ dd->ipath_cfgports * dd->ipath_rcvtidcnt *
+ sizeof(struct page *));
+
+ /* set up the port 0 (kernel) rcvhdr q and egr TIDs */
+ if (!(ret = ipath_create_rcvhdrq(dd->ipath_pd[0])))
+ ret = ipath_create_port0_egr(dd->ipath_pd[0]);
+ if (ret)
+ _IPATH_UNIT_ERROR(t,
+ "failed to allocate port 0 (kernel) rcvhdrq and/or egr bufs\n");
+ else {
+ init_waitqueue_head(&ipath_sma_wait);
+ init_waitqueue_head(&ipath_sma_state_wait);
+
+ ipath_kput_kreg(pd->port_unit, kr_rcvctrl, dd->ipath_rcvctrl);
+
+ ipath_kput_kreg(t, kr_rcvbthqp, IPATH_KD_QP);
+
+ /* Enable PIO send, and update of PIOavail regs to memory. */
+ dd->ipath_sendctrl = INFINIPATH_S_PIOENABLE
+ | INFINIPATH_S_PIOBUFAVAILUPD;
+ ipath_kput_kreg(t, kr_sendctrl, dd->ipath_sendctrl);
+
+ /*
+ * enable port 0 receive, and receive interrupt
+ * other ports done as user opens and inits them
+ */
+ dd->ipath_rcvctrl = INFINIPATH_R_TAILUPD |
+ (1ULL << INFINIPATH_R_PORTENABLE_SHIFT) |
+ (1ULL << INFINIPATH_R_INTRAVAIL_SHIFT);
+ ipath_kput_kreg(t, kr_rcvctrl, dd->ipath_rcvctrl);
+
+ /*
+ * now ready for use
+ * this should be cleared whenever we detect a reset, or
+ * initiate one.
+ */
+ dd->ipath_flags |= IPATH_INITTED;
+
+ /*
+ * init our shadow copies of head from tail values, and write
+ * head values to match
+ */
+ val32 = ipath_kget_ureg32(t, ur_rcvegrindextail, 0);
+ (void)ipath_kput_ureg(t, ur_rcvegrindexhead, val32, 0);
+ dd->ipath_port0head = ipath_kget_ureg32(t, ur_rcvhdrtail, 0);
+ (void)ipath_kput_ureg(t, ur_rcvhdrhead, dd->ipath_port0head, 0);
+
+ /*
+ * by now pioavail updates to memory should have occurred,
+ * so copy them into our working/shadow registers; this is
+ * in case something went wrong with abort, but mostly to
+ * get the initial values of the generation bit correct
+ */
+ for (i = 0; i < dd->ipath_pioavregs; i++) {
+ /*
+ * Chip Errata bug 6641; even and odd qwords>3
+ * are swapped
+ */
+ if (i > 3) {
+ if (i & 1)
+ dd->ipath_pioavailshadow[i] =
+ dd->ipath_pioavailregs_dma[i - 1];
+ else
+ dd->ipath_pioavailshadow[i] =
+ dd->ipath_pioavailregs_dma[i + 1];
+ } else
+ dd->ipath_pioavailshadow[i] =
+ dd->ipath_pioavailregs_dma[i];
+ }
+ /* can get counters, stats, etc. */
+ dd->ipath_flags |= IPATH_PRESENT;
+ }
+
+ /*
+ * cause retrigger of pending interrupts ignored during init, even if
+ * we had errors
+ */
+ ipath_kput_kreg(t, kr_intclear, 0ULL);
+
+ /*
+ * set up stats retrieval timer, even if we had errors in last
+ * portion of setup
+ */
+ init_timer(&dd->ipath_stats_timer);
+ dd->ipath_stats_timer.function = ipath_get_faststats;
+ dd->ipath_stats_timer.data = (unsigned long)t;
+ /* every 5 seconds; */
+ dd->ipath_stats_timer.expires = jiffies + 5 * HZ;
+	/* takes ~16 seconds to overflow at full IB 4x bandwidth */
+ add_timer(&dd->ipath_stats_timer);
+
+ dd->ipath_stats_timer_active = 1;
+
+done:
+ if (!ret) {
+ ipath_get_guid(t);
+ *dd->ipath_statusp |= IPATH_STATUS_CHIP_PRESENT;
+ if (!ipath_sma_data_spare) {
+ /* first init, setup SMA data structs */
+ ipath_sma_data_spare =
+ ipath_sma_data_bufs[IPATH_NUM_SMAPKTS];
+ for (i = 0; i < IPATH_NUM_SMAPKTS; i++)
+ ipath_sma_data[i].buf = ipath_sma_data_bufs[i];
+ }
+ /*
+ * sps_nports is a global, so, we set it to the highest
+ * number of ports of any of the chips we find; we never
+ * decrement it, at least for now.
+ */
+ if (dd->ipath_cfgports > ipath_stats.sps_nports)
+ ipath_stats.sps_nports = dd->ipath_cfgports;
+ }
+ /* if ret is non-zero, we probably should do some cleanup here... */
+ return ret;
+}
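
One detail of ipath_init_chip() worth spelling out is the pioavail shadow sizing: with 2 status bits per PIO buffer, each 64-bit pioavail register covers 32 buffers, so the register count is the buffer count rounded up to a multiple of 32 and divided by 32, which is what the ALIGN() expression computes. A standalone sketch with an illustrative buffer count:

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_BYTE 8
#define BUFS_PER_REG (sizeof(uint64_t) * BITS_PER_BYTE / 2)	/* 32 */

/*
 * Number of 64-bit pioavail registers needed to describe 'piobcnt'
 * PIO buffers at 2 bits per buffer, mirroring the calculation in
 * ipath_init_chip() above.
 */
static unsigned int pioavail_regs(unsigned int piobcnt)
{
	return (piobcnt + BUFS_PER_REG - 1) / BUFS_PER_REG;
}

int main(void)
{
	unsigned int piobcnt = 2048;	/* illustrative buffer count */

	printf("%u PIO buffers need %u pioavail registers\n",
	       piobcnt, pioavail_regs(piobcnt));
	return 0;
}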
+
+int ipath_waitfor_complete(const ipath_type t, ipath_kreg reg_id,
+ uint64_t bits_to_wait_for, uint64_t * valp)
+{
+ uint64_t timeout, lastval, val;
+
+ lastval = ipath_kget_kreg64(t, reg_id);
+ timeout = get_cycles() + 0x10000000ULL; /* <- ridiculously long time */
+ do {
+ val = ipath_kget_kreg64(t, reg_id);
+ *valp = val; /* so they have something, even on failures. */
+ if ((val & bits_to_wait_for) == bits_to_wait_for)
+ return 0;
+ if (val != lastval)
+ _IPATH_VDBG
+ ("Changed from %llx to %llx, waiting for %llx bits\n",
+ lastval, val, bits_to_wait_for);
+ yield();
+ if (get_cycles() > timeout) {
+ _IPATH_DBG
+ ("Didn't get bits %llx in register 0x%x, got %llx\n",
+ bits_to_wait_for, reg_id, *valp);
+ return ENODEV;
+ }
+ } while (1);
+}
+
+/*
+ * like ipath_waitfor_complete(), but we wait for the CMDVALID bit to go
+ * away, indicating the last command has completed. It doesn't return data.
+ */
+int ipath_waitfor_mdio_cmdready(const ipath_type t)
+{
+ uint64_t timeout;
+ uint64_t val;
+
+ timeout = get_cycles() + 0x10000000ULL; /* <- ridiculously long time */
+ do {
+ val = ipath_kget_kreg64(t, kr_mdio);
+ if (!(val & IPATH_MDIO_CMDVALID))
+ return 0;
+ yield();
+ if (get_cycles() > timeout) {
+ _IPATH_DBG("CMDVALID stuck in mdio reg? (%llx)\n", val);
+ return ENODEV;
+ }
+ } while (1);
+}
+
+void ipath_set_ib_lstate(const ipath_type t, int which)
+{
+ struct ipath_devdata *dd = &devdata[t];
+ char *what;
+
+ /*
+ * For all cases, we'll either be setting a new value of linkcmd, or
+ * we want it to be NOP, so clear it here.
+ * Similarly, we want the linkinitcmd to be NOP for everything
+	 * other than explicitly changing linkinitcmd,
+ * and for that case, we want to first clear any existing bits
+ */
+ dd->ipath_ibcctrl &= ~((INFINIPATH_IBCC_LINKCMD_MASK <<
+ INFINIPATH_IBCC_LINKCMD_SHIFT) |
+ (INFINIPATH_IBCC_LINKINITCMD_MASK <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT));
+
+ if (which == INFINIPATH_IBCC_LINKCMD_INIT) {
+ dd->ipath_flags &= ~(IPATH_LINK_TOARMED | IPATH_LINK_TOACTIVE
+ | IPATH_LINK_SLEEPING);
+ /* so we can watch for a transition */
+ dd->ipath_flags |= IPATH_LINKDOWN;
+ what = "INIT";
+ } else if (which == INFINIPATH_IBCC_LINKCMD_ARMED) {
+ dd->ipath_flags |= IPATH_LINK_TOARMED;
+ dd->ipath_flags &= ~(IPATH_LINK_TOACTIVE | IPATH_LINK_SLEEPING);
+ /*
+ * this is mainly for loopback testing. If INITCMD is
+ * NOP or SLEEP, the link won't ever come up in loopback...
+ */
+ if (!
+ (dd->
+ ipath_flags & (IPATH_LINKINIT | IPATH_LINKARMED |
+ IPATH_LINKACTIVE))) {
+ _IPATH_SMADBG
+ ("going to armed, but link not yet up, set POLL\n");
+ dd->ipath_ibcctrl |=
+ INFINIPATH_IBCC_LINKINITCMD_POLL <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT;
+ }
+ what = "ARMED";
+ } else if (which == INFINIPATH_IBCC_LINKCMD_ACTIVE) {
+ dd->ipath_flags |= IPATH_LINK_TOACTIVE;
+ dd->ipath_flags &= ~(IPATH_LINK_TOARMED | IPATH_LINK_SLEEPING);
+ what = "ACTIVE";
+ } else if (which & (INFINIPATH_IBCC_LINKINITCMD_MASK << INFINIPATH_IBCC_LINKINITCMD_SHIFT)) { /* down, disable, etc. */
+ dd->ipath_flags &= ~(IPATH_LINK_TOARMED | IPATH_LINK_TOACTIVE);
+ if (((which & INFINIPATH_IBCC_LINKINITCMD_MASK) >>
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT) ==
+ INFINIPATH_IBCC_LINKINITCMD_SLEEP) {
+ dd->ipath_flags |= IPATH_LINK_SLEEPING | IPATH_LINKDOWN;
+ } else
+ dd->ipath_flags |= IPATH_LINKDOWN;
+ dd->ipath_ibcctrl |=
+ which & (INFINIPATH_IBCC_LINKINITCMD_MASK <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT);
+ what = "DOWN";
+ } else {
+ what = "UNKNOWN";
+ _IPATH_INFO("Unknown link transition requested (which=0x%x)\n",
+ which);
+ }
+
+ dd->ipath_ibcctrl |= ((uint64_t) which & INFINIPATH_IBCC_LINKCMD_MASK)
+ << INFINIPATH_IBCC_LINKCMD_SHIFT;
+
+ _IPATH_SMADBG("Trying to move unit %u to %s, current ltstate is %s\n",
+ t, what, ipath_ibcstatus_str[(ipath_kget_kreg64(t, kr_ibcstatus)
+ >> INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT)
+ & INFINIPATH_IBCS_LINKTRAININGSTATE_MASK]);
+ ipath_kput_kreg(t, kr_ibcctrl, dd->ipath_ibcctrl);
+}
+
+static int ipath_bringup_link(const ipath_type t)
+{
+ struct ipath_devdata *dd = &devdata[t];
+ uint64_t val, ibc;
+ int ret = 0;
+
+ dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; /* hold IBC in reset */
+ ipath_kput_kreg(t, kr_control, dd->ipath_control);
+
+ /*
+ * Note that prior to try 14 or 15 of IB, the credit scaling
+ * wasn't working, because it was swapped for writes with the
+ * 1 bit default linkstate field
+ */
+
+ /* ignore pbc and align word */
+ val = dd->ipath_piosize - 2 * sizeof(uint32_t);
+ /*
+ * for ICRC, which we only send in diag test pkt mode, and we don't
+ * need to worry about that for mtu
+ */
+ val += 1;
+ /*
+ * set the IBC maxpktlength to the size of our pio buffers
+ * the maxpktlength is in words. This is *not* the IB data MTU
+ */
+ ibc = (val / sizeof(uint32_t)) << INFINIPATH_IBCC_MAXPKTLEN_SHIFT;
+ /* in KB */
+ ibc |= 0x5ULL << INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT;
+ /* how often flowctrl sent
+ * more or less in usecs; balance against watermark value, so that
+ * in theory senders always get a flow control update in time to not
+ * let the IB link go idle.
+ */
+ ibc |= 0x3ULL << INFINIPATH_IBCC_FLOWCTRLPERIOD_SHIFT;
+ /* max error tolerance */
+ ibc |= 0xfULL << INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT;
+ /* use "real" buffer space for */
+ ibc |= 4ULL << INFINIPATH_IBCC_CREDITSCALE_SHIFT;
+ /* IB credit flow control. */
+ ibc |= 0xfULL << INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT;
+ /* initially come up waiting for TS1, without sending anything. */
+ dd->ipath_ibcctrl = ibc;
+ /* don't put linkinitcmd in ipath_ibcctrl, want that to stay a NOP */
+ ibc |=
+ INFINIPATH_IBCC_LINKINITCMD_SLEEP <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT;
+ dd->ipath_flags |= IPATH_LINK_SLEEPING;
+ ipath_kput_kreg(t, kr_ibcctrl, ibc);
+
+ ret = ipath_bringup_serdes(t);
+
+ if (ret)
+ _IPATH_INFO("Could not initialize SerDes, not usable\n");
+ else {
+ dd->ipath_control |= INFINIPATH_C_LINKENABLE; /* enable IBC */
+ ipath_kput_kreg(t, kr_control, dd->ipath_control);
+ }
+
+ return ret;
+}
+
+/*
+ * called from ipath_shutdown_link(), and from sma doing a LINKDOWN.
+ * Left as a separate function for historical reasons; we may want
+ * it to do more than just call ipath_set_ib_lstate() again sometime
+ * in the future.
+ */
+void ipath_down_link(const ipath_type t)
+{
+ ipath_set_ib_lstate(t, INFINIPATH_IBCC_LINKINITCMD_SLEEP <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT);
+}
+
+/*
+ * do this when driver is being unloaded, or perhaps for diags, and
+ * maybe when we get an interrupt of a fatal link error that requires
+ * bringing the link down and back up
+ */
+static int ipath_shutdown_link(const ipath_type t)
+{
+ uint64_t val;
+ struct ipath_devdata *dd = &devdata[t];
+ int ret = 0;
+
+ _IPATH_DBG("Shutting down the link\n");
+ ipath_down_link(t);
+
+ /*
+ * we are shutting down, so tell the layered driver. We don't
+ * do this on just a link state change, much like ethernet,
+ * a cable unplug, etc. doesn't change driver state
+ */
+ if (dd->ipath_layer.l_intr)
+ dd->ipath_layer.l_intr(t, IPATH_LAYER_INT_IF_DOWN);
+
+ dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; /* disable IBC */
+ ipath_kput_kreg(t, kr_control, dd->ipath_control);
+
+ *dd->ipath_statusp &= ~(IPATH_STATUS_IB_CONF | IPATH_STATUS_IB_READY);
+
+ /*
+ * clear SerdesEnable and turn the leds off; do this here because
+ * we are unloading, so don't count on interrupts to move along
+ */
+
+ ipath_quiet_serdes(t);
+ val = dd->ipath_extctrl &
+ ~(INFINIPATH_EXTC_LEDPRIPORTGREENON |
+ INFINIPATH_EXTC_LEDPRIPORTYELLOWON);
+ dd->ipath_extctrl = val;
+ ipath_kput_kreg(t, kr_extctrl, val);
+
+ if (dd->ipath_stats_timer_active) {
+ del_timer_sync(&dd->ipath_stats_timer);
+ dd->ipath_stats_timer_active = 0;
+ }
+ if (*dd->ipath_statusp & IPATH_STATUS_CHIP_PRESENT) {
+ /* can't do anything more with chip */
+ /* needs re-init */
+ *dd->ipath_statusp &= ~IPATH_STATUS_CHIP_PRESENT;
+ if (dd->ipath_kregbase) {
+ /*
+			 * if we haven't already cleaned up before, these
+ * are to ensure any register reads/writes "fail"
+ * until re-init
+ */
+ dd->ipath_kregbase = NULL;
+ dd->ipath_kregvirt = NULL;
+ dd->ipath_uregbase = 0ULL;
+ dd->ipath_sregbase = 0ULL;
+ dd->ipath_cregbase = 0ULL;
+ dd->ipath_kregsize = 0;
+ }
+#ifdef CONFIG_MTRR
+ if (dd->ipath_mtrr) {
+ _IPATH_VDBG("undoing WCCOMB on pio buffers\n");
+ mtrr_del(dd->ipath_mtrr, 0, 0);
+ dd->ipath_mtrr = 0;
+ }
+#endif
+ }
+
+ return ret;
+}
+
+/*
+ * when closing, free up any allocated data for a port, if the
+ * reference count goes to zero
+ * Note: this also frees the portdata itself!
+ */
+void ipath_free_pddata(struct ipath_devdata * dd, uint32_t port, int freehdrq)
+{
+ struct ipath_portdata *pd = dd->ipath_pd[port];
+
+ if (!pd)
+ return;
+ if (freehdrq)
+ /*
+ * only clear and free portdata if we are going to
+ * also release the hdrq, otherwise we leak the hdrq on each
+ * open/close cycle
+ */
+ dd->ipath_pd[port] = NULL;
+ /* cleanup locked pages private data structures */
+ ipath_upages_cleanup(pd);
+ if (freehdrq && pd->port_rcvhdrq) {
+ int i, n = 1 << pd->port_rcvhdrq_order;
+ _IPATH_VDBG("free closed port %d rcvhdrq @ %p (order=%u)\n",
+ pd->port_port, pd->port_rcvhdrq,
+ pd->port_rcvhdrq_order);
+ for (i = 0; i < n; i++)
+ ClearPageReserved(virt_to_page
+ (pd->port_rcvhdrq + (i * PAGE_SIZE)));
+ free_pages((unsigned long)pd->port_rcvhdrq,
+ pd->port_rcvhdrq_order);
+ pd->port_rcvhdrq = NULL;
+ }
+ if (port && pd->port_rcvegrbuf_pages) { /* always free this, however */
+ void *virt;
+ unsigned e, i, n = 1 << pd->port_rcvegrbuf_order;
+ if (pd->port_rcvegrbuf_virt) {
+ for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) {
+ virt = pd->port_rcvegrbuf_virt[e];
+ for (i = 0; i < n; i++)
+ ClearPageReserved(virt_to_page
+ (virt +
+ (i * PAGE_SIZE)));
+ _IPATH_VDBG
+ ("egrbuf free_pages(%p, %x), chunk %u/%u\n",
+ virt, pd->port_rcvegrbuf_order, e,
+ pd->port_rcvegrbuf_chunks);
+ free_pages((unsigned long)virt,
+ pd->port_rcvegrbuf_order);
+ }
+ vfree(pd->port_rcvegrbuf_virt);
+ pd->port_rcvegrbuf_virt = NULL;
+ }
+ pd->port_rcvegrbuf_chunks = 0;
+ _IPATH_VDBG("free closed port %d rcvegrbufs ptr array\n",
+ pd->port_port);
+ /* now the pointer array. */
+ vfree(pd->port_rcvegrbuf_pages);
+ pd->port_rcvegrbuf_pages = NULL;
+ } else if (port == 0 && dd->ipath_port0_skbs) {
+ unsigned e;
+ struct sk_buff **skbs = dd->ipath_port0_skbs;
+
+ dd->ipath_port0_skbs = NULL;
+ _IPATH_VDBG("free closed port %d ipath_port0_skbs @ %p\n",
+ pd->port_port, skbs);
+ for (e = 0; e < dd->ipath_rcvegrcnt; e++)
+ if (skbs[e])
+ dev_kfree_skb(skbs[e]);
+ vfree(skbs);
+ }
+ if (freehdrq) {
+ kfree(pd->port_tid_pg_list);
+ kfree(pd);
+ }
+}
+
+int __init infinipath_init(void)
+{
+ int r = 0, i;
+
+ _IPATH_DBG(KERN_INFO DRIVER_LOAD_MSG "%s", ipath_core_version);
+
+ ipath_init_picotime(); /* init cycles -> pico conversion */
+
+ /*
+ * initialize the statusp to temporary storage so we can use it
+ * everywhere without first checking. When we "really" assign it,
+ * we copy from _ipath_status
+ */
+ for (i = 0; i < infinipath_max; i++)
+ devdata[i].ipath_statusp = &devdata[i]._ipath_status;
+
+ /*
+ * init these early, in case we take an interrupt as soon as the irq
+ * is setup. Saw a spinlock panic once that appeared to be due to that
+ * problem, when they were initted later on.
+ */
+ spin_lock_init(&ipath_pioavail_lock);
+ spin_lock_init(&ipath_sma_lock);
+
+ pci_register_driver(&infinipath_driver);
+
+ driver_create_file(&(infinipath_driver.driver), &driver_attr_version);
+
+ if ((r = register_chrdev(ipath_major, MODNAME, &ipath_fops)))
+ _IPATH_ERROR("Unable to register %s device\n", MODNAME);
+
+
+ /*
+ * never return an error, since we could have stuff registered,
+ * resources used, etc., even if no hardware found. This way we
+ * can clean up through unload.
+ */
+ return 0;
+}
+
+/*
+ * note: if for some reason the unload fails after this routine, and leaves
+ * the driver enterable by user code, we'll almost certainly crash and burn...
+ */
+static void __exit infinipath_cleanup(void)
+{
+ int r, m, port;
+
+ driver_remove_file(&(infinipath_driver.driver), &driver_attr_version);
+ if ((r = unregister_chrdev(ipath_major, MODNAME)))
+ _IPATH_DBG("unregister of device failed: %d\n", r);
+
+
+ /*
+ * turn off rcv, send, and interrupts for all ports, all drivers
+ * should also hard reset the chip here?
+ * free up port 0 (kernel) rcvhdr, egr bufs, and eventually tid bufs
+ * for all versions of the driver, if they were allocated
+ */
+ for (m = 0; m < infinipath_max; m++) {
+ uint64_t val;
+ struct ipath_devdata *dd = &devdata[m];
+ if (dd->ipath_kregbase) {
+ /* in case unload fails, be consistent */
+ dd->ipath_rcvctrl = 0U;
+ ipath_kput_kreg(m, kr_rcvctrl, dd->ipath_rcvctrl);
+
+ /*
+ * gracefully stop all sends allowing any in
+ * progress to trickle out first.
+ */
+ ipath_kput_kreg(m, kr_sendctrl, 0ULL);
+ val = ipath_kget_kreg64(m, kr_scratch); /* flush it */
+ /*
+ * enough for anything that's going to trickle
+ * out to have actually done so.
+ */
+ udelay(5);
+
+ /*
+ * abort any armed or launched PIO buffers that
+ * didn't go. (self clearing). Will cause any
+ * packet currently being transmitted to go out
+ * with an EBP, and may also cause a short packet
+ * error on the receiver.
+ */
+ ipath_kput_kreg(m, kr_sendctrl, INFINIPATH_S_ABORT);
+
+ /* mask interrupts, but not errors */
+ ipath_kput_kreg(m, kr_intmask, 0ULL);
+ ipath_shutdown_link(m);
+
+ /*
+ * clear all interrupts and errors. Next time
+ * driver is loaded, we know that whatever is
+ * set happened while we were unloaded
+ */
+ ipath_kput_kreg(m, kr_hwerrclear, -1LL);
+ ipath_kput_kreg(m, kr_errorclear, -1LL);
+ ipath_kput_kreg(m, kr_intclear, -1LL);
+ if (dd->__ipath_pioavailregs_base) {
+ kfree((void *)dd->__ipath_pioavailregs_base);
+ dd->__ipath_pioavailregs_base = NULL;
+ dd->ipath_pioavailregs_dma = NULL;
+ }
+
+ if (dd->ipath_pageshadow) {
+ struct page **tmpp = dd->ipath_pageshadow;
+ int i, cnt = 0;
+
+ _IPATH_VDBG
+ ("Unlocking any expTID pages still locked\n");
+ for (port = 0; port < dd->ipath_cfgports;
+ port++) {
+ int port_tidbase =
+ port * dd->ipath_rcvtidcnt;
+ int maxtid =
+ port_tidbase + dd->ipath_rcvtidcnt;
+ for (i = port_tidbase; i < maxtid; i++) {
+ if (tmpp[i]) {
+ ipath_putpages(1,
+ &tmpp[i]);
+ tmpp[i] = NULL;
+ cnt++;
+ }
+ }
+ }
+ if (cnt) {
+ ipath_stats.sps_pageunlocks += cnt;
+ _IPATH_VDBG
+ ("There were still %u expTID entries locked\n",
+ cnt);
+ }
+ if (ipath_stats.sps_pagelocks
+ || ipath_stats.sps_pageunlocks)
+ _IPATH_VDBG
+ ("%llu pages locked, %llu unlocked via ipath_m{un}lock\n",
+ ipath_stats.sps_pagelocks,
+ ipath_stats.sps_pageunlocks);
+
+ _IPATH_VDBG
+ ("Free shadow page tid array at %p\n",
+ dd->ipath_pageshadow);
+ vfree(dd->ipath_pageshadow);
+ dd->ipath_pageshadow = NULL;
+ }
+
+ /*
+ * free any resources still in use (usually just
+ * kernel ports) at unload
+ */
+ for (port = 0; port < dd->ipath_cfgports; port++)
+ ipath_free_pddata(dd, port, 1);
+ kfree(dd->ipath_pd);
+ /*
+ * debuggability, in case some cleanup path
+ * tries to use it after this
+ */
+ dd->ipath_pd = NULL;
+ }
+
+ if (dd->pcidev) {
+ if (dd->pcidev->irq) {
+ _IPATH_VDBG("unit %u free_irq of irq %x\n", m,
+ dd->pcidev->irq);
+ free_irq(dd->pcidev->irq, dd);
+ } else
+ _IPATH_DBG
+ ("irq is 0, not doing free_irq for unit %u\n",
+ m);
+ dd->pcidev = NULL;
+ }
+ if (dd->pci_registered) {
+ _IPATH_VDBG
+ ("Unregistering pci infrastructure unit %u\n", m);
+ pci_unregister_driver(&infinipath_driver);
+ dd->pci_registered = 0;
+ } else
+ _IPATH_VDBG
+ ("unit %u: no pci unreg, wasn't registered\n", m);
+ ipath_chip_cleanup(dd); /* clean up any per-chip chip-specific stuff */
+ }
+ /*
+ * clean up any chip-specific stuff for now, only one type of chip
+ * for any given driver
+ */
+ ipath_chip_done();
+
+ /* cleanup all our locked pages private data structures */
+ ipath_upages_cleanup(NULL);
+}
+
+/* This is a generic function here, so it can return device-specific
+ * info. This allows keeping in sync with the version that supports
+ * multiple chip types.
+*/
+void ipath_get_boardname(const ipath_type t, char *name, size_t namelen)
+{
+ ipath_ht_get_boardname(t, name, namelen);
+}
+
+module_init(infinipath_init);
+module_exit(infinipath_cleanup);
+
+EXPORT_SYMBOL(infinipath_debug);
+EXPORT_SYMBOL(ipath_get_boardname);
+

2005-12-29 00:42:19

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 13 of 20] ipath - routines used by upper layer driver code

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r 5e9b0b7876e2 -r f9bcd9de3548 drivers/infiniband/hw/ipath/ipath_layer.c
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_layer.c Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,1313 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+/*
+ * These are the routines used by layered drivers, currently just the
+ * layered ethernet driver and verbs layer.
+ */
+
+#include <linux/pci.h>
+
+#include "ipath_kernel.h"
+#include "ips_common.h"
+#include "ipath_layer.h"
+
+/* unit number is already validated in ipath_ioctl() */
+int ipath_kset_linkstate(uint32_t arg)
+{
+ ipath_type unit = 0xffff & (arg >> 16);
+ uint32_t lstate;
+ struct ipath_devdata *dd;
+ int tryarmed = 0;
+
+ if (unit >= infinipath_max ||
+ !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", unit);
+ return -ENODEV;
+ }
+
+ dd = &devdata[unit];
+ arg &= 0xffff;
+ switch (arg) {
+ case IPATH_IB_LINKDOWN:
+ ipath_down_link(unit); /* really moving it to idle */
+ lstate = IPATH_LINKDOWN | IPATH_LINK_SLEEPING;
+ break;
+
+ case IPATH_IB_LINKDOWN_POLL:
+ ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKINITCMD_POLL <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT);
+ lstate = IPATH_LINKDOWN;
+ break;
+
+ case IPATH_IB_LINKDOWN_DISABLE:
+ ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKINITCMD_DISABLE <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT);
+ lstate = IPATH_LINKDOWN;
+ break;
+
+ case IPATH_IB_LINKINIT:
+ ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKCMD_INIT);
+ lstate = IPATH_LINKINIT;
+ break;
+
+ case IPATH_IB_LINKARM:
+ ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKCMD_ARMED);
+ lstate = IPATH_LINKARMED;
+ break;
+
+ case IPATH_IB_LINKACTIVE:
+ /*
+ * because we sometimes go to ARMED, but then back to 0x11
+ * (initialized) before the SMA asks us to move to ACTIVE,
+ * we will try to advance state to ARMED here, if necessary
+ */
+ if (!(dd->ipath_flags &
+ (IPATH_LINKINIT | IPATH_LINKARMED | IPATH_LINKDOWN |
+ IPATH_LINK_SLEEPING | IPATH_LINKACTIVE))) {
+ /* this one is just paranoia */
+ _IPATH_DBG
+ ("don't know current state (flags 0x%x), try anyway\n",
+ dd->ipath_flags);
+ tryarmed = 1;
+
+ }
+ if (!(dd->ipath_flags & (IPATH_LINKARMED | IPATH_LINKACTIVE)))
+ tryarmed = 1;
+ if (tryarmed) {
+ ipath_set_ib_lstate(unit,
+ INFINIPATH_IBCC_LINKCMD_ARMED);
+ /*
+ * give it up to 2 seconds to get to ARMED or
+ * ACTIVE; continue afterwards even if we fail
+ */
+ if (ipath_wait_linkstate
+ (unit, IPATH_LINKARMED | IPATH_LINKACTIVE, 2000))
+ _IPATH_VDBG
+ ("try for active, even though didn't get to ARMED\n");
+ }
+
+ ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKCMD_ACTIVE);
+ lstate = IPATH_LINKACTIVE;
+ break;
+
+ default:
+ _IPATH_DBG("Unknown linkstate 0x%x requested\n", arg);
+ return -EINVAL;
+ }
+ return ipath_wait_linkstate(unit, lstate, 2000);
+}
+
+/*
+ * we can handle "any" incoming size, the issue here is whether we
+ * need to restrict our outgoing size. For now, we don't do any
+ * sanity checking on this, and we don't deal with what happens to
+ * programs that are already running when the size changes.
+ * unit number is already validated in ipath_ioctl()
+ * NOTE: changing the MTU will usually cause the IBC to go back to
+ * link initialize (0x11) state...
+ */
+int ipath_kset_mtu(uint32_t arg)
+{
+ unsigned unit = (arg >> 16) & 0xffff;
+ uint32_t piosize;
+ int changed = 0;
+
+ if (unit >= infinipath_max ||
+ !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", unit);
+ return -ENODEV;
+ }
+
+ arg &= 0xffff;
+ /*
+ * mtu is IB data payload max. It's the largest power of 2 less
+ * than piosize (or even larger, since it only really controls the
+ * largest we can receive; we can send the max of the mtu and piosize).
+ * We check that it's one of the valid IB sizes.
+ */
+ if (arg != 256 && arg != 512 && arg != 1024 && arg != 2048 &&
+ arg != 4096) {
+ _IPATH_DBG("Trying to set invalid mtu %u, failing\n", arg);
+ return -EINVAL;
+ }
+ if (devdata[unit].ipath_ibmtu == arg) {
+ return 0; /* same as current */
+ }
+
+ piosize = devdata[unit].ipath_ibmaxlen;
+ devdata[unit].ipath_ibmtu = arg;
+
+ /*
+ * the 128 is the max IB header size allowed for in our pio send
+ * buffers. If we are reducing the MTU below that, this doesn't
+ * completely make sense, but it's OK.
+ */
+ if (arg >= (piosize - 128)) {
+ /* hasn't been changed */
+ if (piosize == devdata[unit].ipath_init_ibmaxlen)
+ _IPATH_VDBG
+ ("mtu 0x%x >= ibmaxlen hardware max, nothing to do\n",
+ arg);
+ else {
+ _IPATH_VDBG
+ ("mtu 0x%x restores ibmaxlen to full amount 0x%x\n",
+ arg, piosize);
+ devdata[unit].ipath_ibmaxlen = piosize;
+ changed = 1;
+ }
+ } else if ((arg + 128) == devdata[unit].ipath_ibmaxlen)
+ _IPATH_VDBG("ibmaxlen %x same as current, no change\n", arg);
+ else {
+ piosize = arg + 128;
+ _IPATH_VDBG("ibmaxlen was 0x%x, setting to 0x%x (mtu 0x%x)\n",
+ devdata[unit].ipath_ibmaxlen, piosize, arg);
+ devdata[unit].ipath_ibmaxlen = piosize;
+ changed = 1;
+ }
+
+ if (changed) {
+ /*
+ * set the IBC maxpktlength to the size of our pio
+ * buffers in words
+ */
+ uint64_t ibc = devdata[unit].ipath_ibcctrl;
+ ibc &= ~(INFINIPATH_IBCC_MAXPKTLEN_MASK <<
+ INFINIPATH_IBCC_MAXPKTLEN_SHIFT);
+
+ piosize = piosize - 2 * sizeof(uint32_t); /* ignore pbc */
+ devdata[unit].ipath_ibmaxlen = piosize;
+ piosize /= sizeof(uint32_t); /* in words */
+ /*
+ * add one word for the ICRC, which we only send in diag test pkt
+ * mode, so we don't need to worry about it for the mtu
+ */
+ piosize += 1;
+
+ ibc |= piosize << INFINIPATH_IBCC_MAXPKTLEN_SHIFT;
+ devdata[unit].ipath_ibcctrl = ibc;
+ ipath_kput_kreg(unit, kr_ibcctrl, devdata[unit].ipath_ibcctrl);
+ }
+ return 0;
+}
+
+void ipath_set_sps_lid(const ipath_type unit, uint32_t arg)
+{
+ if (unit >= infinipath_max ||
+ !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", unit);
+ return;
+ }
+
+ ipath_stats.sps_lid[unit] = devdata[unit].ipath_lid = arg;
+ if (devdata[unit].ipath_layer.l_intr)
+ devdata[unit].ipath_layer.l_intr(unit, IPATH_LAYER_INT_LID);
+}
+
+/* XXX - need to inform anyone who cares this just happened. */
+int ipath_layer_set_guid(const ipath_type device, uint64_t guid)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return -ENODEV;
+ }
+ devdata[device].ipath_guid = guid;
+ return 0;
+}
+
+uint64_t ipath_layer_get_guid(const ipath_type device)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+ return devdata[device].ipath_guid;
+}
+
+uint32_t ipath_layer_get_nguid(const ipath_type device)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+ return devdata[device].ipath_nguid;
+}
+
+int ipath_layer_query_device(const ipath_type device, uint32_t * vendor,
+ uint32_t * boardrev, uint32_t * majrev,
+ uint32_t * minrev)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return -ENODEV;
+ }
+
+ *vendor = devdata[device].ipath_vendorid;
+ *boardrev = devdata[device].ipath_boardrev;
+ *majrev = devdata[device].ipath_majrev;
+ *minrev = devdata[device].ipath_minrev;
+
+ return 0;
+}
+
+uint32_t ipath_layer_get_flags(const ipath_type device)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+
+ return devdata[device].ipath_flags;
+}
+
+struct device *ipath_layer_get_pcidev(const ipath_type device)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return NULL;
+ }
+
+ return &(devdata[device].pcidev->dev);
+}
+
+uint16_t ipath_layer_get_deviceid(const ipath_type device)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+
+ return devdata[device].ipath_deviceid;
+}
+
+uint64_t ipath_layer_get_lastibcstat(const ipath_type device)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+
+ return devdata[device].ipath_lastibcstat;
+}
+
+uint32_t ipath_layer_get_ibmtu(const ipath_type device)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+
+ return devdata[device].ipath_ibmtu;
+}
+
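+/*
+ * Register the layered (ethernet) driver's interrupt and receive
+ * callbacks with the low-level driver; fails if a layered driver is
+ * already registered on this unit.
+ */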
+int ipath_layer_register(const ipath_type device,
+ int (*l_intr) (const ipath_type, uint32_t),
+ int (*l_rcv) (const ipath_type, void *,
+ struct sk_buff *), uint16_t l_rcv_opcode,
+ int (*l_rcv_lid) (const ipath_type, void *),
+ uint16_t l_rcv_lid_opcode)
+{
+ int ret = 0;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 1;
+ }
+ if (!(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_VDBG("%s not yet initialized, failing\n",
+ ipath_get_unit_name(device));
+ return 1;
+ }
+
+ _IPATH_VDBG("intr %p rx %p, rx_lid %p\n", l_intr, l_rcv, l_rcv_lid);
+ if (devdata[device].ipath_layer.l_intr
+ || devdata[device].ipath_layer.l_rcv) {
+ _IPATH_DBG
+ ("Layered device already registered on unit %u, failing\n",
+ device);
+ return 1;
+ }
+
+ if (!(*devdata[device].ipath_statusp & IPATH_STATUS_SMA))
+ *devdata[device].ipath_statusp |= IPATH_STATUS_OIB_SMA;
+ devdata[device].ipath_layer.l_intr = l_intr;
+ devdata[device].ipath_layer.l_rcv = l_rcv;
+ devdata[device].ipath_layer.l_rcv_lid = l_rcv_lid;
+ devdata[device].ipath_layer.l_rcv_opcode = l_rcv_opcode;
+ devdata[device].ipath_layer.l_rcv_lid_opcode = l_rcv_lid_opcode;
+
+ return ret;
+}
+
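+/*
+ * Per-unit timer: poll for received packets if port 0 interrupts are
+ * not available, run the verbs layer timeout callback, and reschedule
+ * for the next jiffy.
+ */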
+static void ipath_verbs_timer(unsigned long t)
+{
+ /*
+ * If port 0 receive packet interrupts are not available,
+ * check the receive queue.
+ */
+ if (!(devdata[t].ipath_flags & IPATH_GPIO_INTR))
+ ipath_kreceive(t);
+
+ /* Handle verbs layer timeouts. */
+ if (devdata[t].verbs_layer.l_timer_cb)
+ devdata[t].verbs_layer.l_timer_cb(t);
+
+ mod_timer(&devdata[t].verbs_layer.l_timer, jiffies + 1);
+}
+
+/* Verbs layer registration. */
+int ipath_verbs_register(const ipath_type device,
+ int (*l_piobufavail) (const ipath_type device),
+ void (*l_rcv) (const ipath_type device, void *rhdr,
+ void *data, uint32_t tlen),
+ void (*l_timer_cb) (const ipath_type device))
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+ if (!(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_VDBG("%s not yet initialized, failing\n",
+ ipath_get_unit_name(device));
+ return 0;
+ }
+
+ _IPATH_VDBG("piobufavail %p rx %p\n", l_piobufavail, l_rcv);
+ if (devdata[device].verbs_layer.l_piobufavail ||
+ devdata[device].verbs_layer.l_rcv) {
+ _IPATH_DBG("Verbs layer already registered on unit %u, "
+ "failing\n", device);
+ return 0;
+ }
+
+ devdata[device].verbs_layer.l_piobufavail = l_piobufavail;
+ devdata[device].verbs_layer.l_rcv = l_rcv;
+ devdata[device].verbs_layer.l_timer_cb = l_timer_cb;
+ devdata[device].verbs_layer.l_flags = 0;
+
+ return 1;
+}
+
+void ipath_verbs_unregister(const ipath_type device)
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return;
+ }
+ if (!(devdata[device].ipath_flags & IPATH_INITTED)) {
+ _IPATH_VDBG("%s not yet initialized, failing\n",
+ ipath_get_unit_name(device));
+ return;
+ }
+
+ *devdata[device].ipath_statusp &= ~IPATH_STATUS_OIB_SMA;
+ devdata[device].verbs_layer.l_piobufavail = NULL;
+ devdata[device].verbs_layer.l_rcv = NULL;
+ devdata[device].verbs_layer.l_timer_cb = NULL;
+ devdata[device].verbs_layer.l_flags = 0;
+}
+
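+/*
+ * Called by the layered driver after registering: set the receive
+ * header queue entry size, return the maximum packet size in *pktmax,
+ * and report the current link/LID state via the layer interrupt
+ * callback.
+ */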
+int ipath_layer_open(const ipath_type device, uint32_t * pktmax)
+{
+ int ret = 0;
+ uint32_t intval = 0;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 1;
+ }
+ if (!devdata[device].ipath_layer.l_intr
+ || !devdata[device].ipath_layer.l_rcv) {
+ _IPATH_DBG("layer not registered, failing\n");
+ return 1;
+ }
+
+ if ((ret =
+ ipath_setrcvhdrsize(device, NUM_OF_EKSTRA_WORDS_IN_HEADER_QUEUE)))
+ return ret;
+
+ *pktmax = devdata[device].ipath_ibmaxlen;
+
+ if (*devdata[device].ipath_statusp & IPATH_STATUS_IB_READY)
+ intval |= IPATH_LAYER_INT_IF_UP;
+ if (ipath_stats.sps_lid[device])
+ intval |= IPATH_LAYER_INT_LID;
+ if (ipath_stats.sps_mlid[device])
+ intval |= IPATH_LAYER_INT_BCAST;
+ /*
+ * do this on open, in case the low level is already up and
+ * just the layered driver was reloaded, etc.
+ */
+ if (intval)
+ devdata[device].ipath_layer.l_intr(device, intval);
+
+ return ret;
+}
+
+uint16_t ipath_layer_get_lid(const ipath_type device)
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+
+ _IPATH_VDBG("returning mylid 0x%x for layered dev %d\n",
+ devdata[device].ipath_lid, device);
+ return devdata[device].ipath_lid;
+}
+
+/*
+ * get the MAC address. This is the EUI-64 OUI octets (top 3), then
+ * skip the next 2 (which should both be zero or 0xff), then the low
+ * 3 octets.
+ * The returned MAC is in network order.
+ * mac points to at least 6 bytes of buffer.
+ * Returns 0 on error (to be consistent with get_lid and get_bcast),
+ * 1 on success.
+ * We assume that by the time the LID is set, the GUID is as valid
+ * as it's ever going to be, rather than adding yet another status bit.
+ */
+
+int ipath_layer_get_mac(const ipath_type device, uint8_t * mac)
+{
+ uint8_t *guid;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u, failing\n", device);
+ return 0;
+ }
+ guid = (uint8_t *) &devdata[device].ipath_guid;
+
+ mac[0] = guid[0];
+ mac[1] = guid[1];
+ mac[2] = guid[2];
+ mac[3] = guid[5];
+ mac[4] = guid[6];
+ mac[5] = guid[7];
+ if ((guid[3] || guid[4]) && !(guid[3] == 0xff && guid[4] == 0xff))
+ _IPATH_DBG("Warning, guid bytes 3 and 4 not 0 or 0xffff: %x %x\n",
+ guid[3], guid[4]);
+ _IPATH_VDBG("Returning %x:%x:%x:%x:%x:%x\n",
+ mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
+ return 1;
+}
+
+uint16_t ipath_layer_get_bcast(const ipath_type device)
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u, failing\n", device);
+ return 0;
+ }
+
+ _IPATH_VDBG("returning broadcast LID 0x%x for unit %u\n",
+ devdata[device].ipath_mlid, device);
+ return devdata[device].ipath_mlid;
+}
+
+int ipath_layer_get_num_of_dev(void)
+{
+ return infinipath_max;
+}
+
+int ipath_layer_get_cr_errpkey(const ipath_type device)
+{
+ return ipath_kget_creg32(device, cr_errpkey);
+}
+
+void ipath_layer_close(const ipath_type device)
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return;
+ }
+ if (!devdata[device].ipath_layer.l_intr
+ || !devdata[device].ipath_layer.l_rcv) {
+ /* normal if not all chips are present */
+ _IPATH_VDBG("layer close without open\n");
+ } else {
+ devdata[device].ipath_layer.l_intr = NULL;
+ devdata[device].ipath_layer.l_rcv = NULL;
+ devdata[device].ipath_layer.l_rcv_lid = NULL;
+ devdata[device].ipath_layer.l_rcv_opcode = 0;
+ devdata[device].ipath_layer.l_rcv_lid_opcode = 0;
+ }
+}
+
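+/*
+ * Copy dword-aligned SGE data into a PIO buffer using 32-bit MMIO
+ * writes. The last dword of the packet is the trigger word, so it is
+ * written separately, after a barrier that flushes everything else.
+ */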
+static inline void copy_aligned(uint32_t __iomem *piobuf,
+ struct ipath_sge_state *ss,
+ uint32_t length)
+{
+ struct ipath_sge *sge = &ss->sge;
+
+ while (length) {
+ uint32_t len = sge->length;
+ uint32_t w;
+
+ BUG_ON(len == 0);
+ if (len > length)
+ len = length;
+ /* Need to round up for the last dword in the packet. */
+ w = (len + 3) >> 2;
+ if (length == len) { /* last chunk, trigger word is special */
+ uint32_t *src32;
+ memcpy_toio32(piobuf, sge->vaddr, w - 1);
+ src32 = (uint32_t *) sge->vaddr + (w - 1);
+ mb(); /* must flush everything before the trigger word */
+ writel(*src32, piobuf + w - 1);
+ } else
+ memcpy_toio32(piobuf, sge->vaddr, w);
+ piobuf += w;
+ sge->vaddr += len;
+ sge->length -= len;
+ sge->sge_length -= len;
+ if (sge->sge_length == 0) {
+ if (--ss->num_sge)
+ *sge = *ss->sg_list++;
+ } else if (sge->length == 0 && sge->mr != NULL) {
+ if (++sge->n >= IPATH_SEGSZ) {
+ if (++sge->m >= sge->mr->mapsz)
+ break;
+ sge->n = 0;
+ }
+ sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr;
+ sge->length = sge->mr->map[sge->m]->segs[sge->n].length;
+ }
+ length -= len;
+ }
+}
+
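+/*
+ * Copy SGE data that is not dword-aligned: accumulate bytes into a
+ * 32-bit word and write each full word to the PIO buffer; any partial
+ * final word is zero-padded, with a barrier before the trigger write.
+ */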
+static inline void copy_unaligned(uint32_t __iomem *piobuf,
+ struct ipath_sge_state *ss,
+ uint32_t length)
+{
+ struct ipath_sge *sge = &ss->sge;
+ union {
+ uint8_t wbuf[4];
+ uint32_t w;
+ } u;
+ int extra = 0;
+
+ while (length) {
+ uint32_t len = sge->length;
+
+ BUG_ON(len == 0);
+ if (len > length)
+ len = length;
+ length -= len;
+ while (len) {
+ u.wbuf[extra++] = *(uint8_t *) sge->vaddr;
+ sge->vaddr++;
+ sge->length--;
+ sge->sge_length--;
+ if (extra >= 4) {
+ if (!length && len == 1)
+ mb(); /* flush all before the trigger word write */
+ writel(u.w, piobuf);
+ extra = 0;
+ }
+ len--;
+ }
+ if (sge->sge_length == 0) {
+ if (--ss->num_sge)
+ *sge = *ss->sg_list++;
+ } else if (sge->length == 0) {
+ if (++sge->n >= IPATH_SEGSZ) {
+ if (++sge->m >= sge->mr->mapsz)
+ break;
+ sge->n = 0;
+ }
+ sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr;
+ sge->length = sge->mr->map[sge->m]->segs[sge->n].length;
+ }
+ }
+ if (extra) {
+ while (extra < 4)
+ u.wbuf[extra++] = 0;
+ mb(); /* flush all before the trigger word write */
+ writel(u.w, piobuf);
+ }
+}
+
+/*
+ * This is like ipath_send_smapkt() in that we need to be able to send
+ * packets after the chip is initialized (MADs), but also like
+ * ipath_layer_send() since it's used by the verbs layer.
+ */
+int ipath_verbs_send(const ipath_type device, uint32_t hdrwords,
+ uint32_t *hdr, uint32_t len, struct ipath_sge_state *ss)
+{
+ struct ipath_devdata *dd = &devdata[device];
+ uint32_t __iomem *piobuf;
+ uint32_t plen;
+
+ if (device >= infinipath_max ||
+ !(dd->ipath_flags & IPATH_PRESENT) || !dd->ipath_kregbase) {
+ _IPATH_DBG("illegal unit %u\n", device);
+ return -ENODEV;
+ }
+ if (!(dd->ipath_flags & IPATH_INITTED)) {
+ /* no hardware, freeze, etc. */
+ _IPATH_DBG("unit %u not usable\n", device);
+ return -ENODEV;
+ }
+ /* +1 is for the qword padding of pbc */
+ plen = hdrwords + ((len + 3) >> 2) + 1;
+ if ((plen << 2) > dd->ipath_ibmaxlen) {
+ _IPATH_DBG("packet len 0x%x too long, failing\n", plen);
+ return -EINVAL;
+ }
+
+ /* Get a PIO buffer to use. */
+ if (!(piobuf = ipath_getpiobuf(device, NULL)))
+ return -EBUSY;
+
+ _IPATH_EPDBG("0x%x+1w pio %p\n", plen - 1, piobuf);
+
+ /* Write len to control qword, no flags.
+ * We have to flush after the PBC for correctness on some cpus,
+ * or the WC buffer can be written out of order. */
+ writeq(plen, piobuf);
+ mb();
+ piobuf += 2;
+ if (len == 0) {
+ /* if there is just the header portion, must flush before
+ * writing last word of header for correctness, and after
+ * the last header word (trigger word) */
+ memcpy_toio32(piobuf, hdr, hdrwords-1);
+ mb();
+ writel(hdr[hdrwords-1], piobuf+hdrwords-1);
+ mb();
+ return 0;
+ }
+ memcpy_toio32(piobuf, hdr, hdrwords);
+ piobuf += hdrwords;
+ /*
+ * If we really wanted to check everything, we would have to
+ * check that each segment starts on a dword boundary and is
+ * a dword multiple in length.
+ * Since there can be lots of segments, we only check for a simple
+ * common case where the amount to copy is contained in one segment.
+ */
+ if (ss->sge.length == len)
+ copy_aligned(piobuf, ss, len);
+ else
+ copy_unaligned(piobuf, ss, len);
+ mb(); /* be sure trigger word is written */
+ return 0;
+}
+
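+/*
+ * Snapshot the send and receive word and packet counters for the
+ * layered driver.
+ */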
+void ipath_layer_snapshot_counters(const ipath_type device, uint64_t * swords,
+ uint64_t * rwords, uint64_t * spkts, uint64_t * rpkts)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_PRESENT)) {
+ _IPATH_DBG("illegal unit %u\n", device);
+ return;
+ }
+ if (!(devdata[device].ipath_flags & IPATH_INITTED)) {
+ /* no hardware, freeze, etc. */
+ _IPATH_DBG("unit %u not usable\n", device);
+ return;
+ }
+ *swords = ipath_snap_cntr(device, cr_wordsendcnt);
+ *rwords = ipath_snap_cntr(device, cr_wordrcvcnt);
+ *spkts = ipath_snap_cntr(device, cr_pktsendcnt);
+ *rpkts = ipath_snap_cntr(device, cr_pktrcvcnt);
+}
+
+/*
+ * Return the counters needed by recv_pma_get_portcounters().
+ */
+void ipath_layer_get_counters(const ipath_type device,
+ struct ipath_layer_counters *cntrs)
+{
+ if (device >= infinipath_max ||
+ !(devdata[device].ipath_flags & IPATH_PRESENT)) {
+ _IPATH_DBG("illegal unit %u\n", device);
+ return;
+ }
+ if (!(devdata[device].ipath_flags & IPATH_INITTED)) {
+ /* no hardware, freeze, etc. */
+ _IPATH_DBG("unit %u not usable\n", device);
+ return;
+ }
+ cntrs->symbol_error_counter =
+ ipath_snap_cntr(device, cr_ibsymbolerrcnt);
+ cntrs->link_error_recovery_counter =
+ ipath_snap_cntr(device, cr_iblinkerrrecovcnt);
+ cntrs->link_downed_counter = ipath_snap_cntr(device, cr_iblinkdowncnt);
+ cntrs->port_rcv_errors = ipath_snap_cntr(device, cr_err_rlencnt) +
+ ipath_snap_cntr(device, cr_invalidrlencnt) +
+ ipath_snap_cntr(device, cr_erricrccnt) +
+ ipath_snap_cntr(device, cr_errvcrccnt) +
+ ipath_snap_cntr(device, cr_badformatcnt);
+ cntrs->port_rcv_remphys_errors = ipath_snap_cntr(device, cr_rcvebpcnt);
+ cntrs->port_xmit_discards = ipath_snap_cntr(device, cr_unsupvlcnt);
+ cntrs->port_xmit_data = ipath_snap_cntr(device, cr_wordsendcnt);
+ cntrs->port_rcv_data = ipath_snap_cntr(device, cr_wordrcvcnt);
+ cntrs->port_xmit_packets = ipath_snap_cntr(device, cr_pktsendcnt);
+ cntrs->port_rcv_packets = ipath_snap_cntr(device, cr_pktrcvcnt);
+}
+
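+/*
+ * Enable the PIO-buffer-available interrupt, so the caller can be
+ * notified when send buffers free up.
+ */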
+void ipath_layer_want_buffer(const ipath_type device)
+{
+ atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL,
+ &devdata[device].ipath_sendctrl);
+ ipath_kput_kreg(device, kr_sendctrl, devdata[device].ipath_sendctrl);
+}
+
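+/*
+ * Send a packet from the layered (ethernet) driver: after checking
+ * link state and header sanity, copy the ips header and payload into
+ * a PIO buffer.
+ */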
+int ipath_layer_send(const ipath_type device, void *hdr, void *data,
+ uint32_t datawords)
+{
+ int ret = 0;
+ uint32_t __iomem *piobuf;
+ uint32_t plen;
+ uint16_t vlsllnh;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u, failing\n", device);
+ return -EINVAL;
+ }
+ if (!(devdata[device].ipath_flags & IPATH_RCVHDRSZ_SET)) {
+ _IPATH_DBG("send while not open\n");
+ ret = -EINVAL;
+ } else
+ if ((devdata[device].ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN))
+ || devdata[device].ipath_lid == 0) {
+ /* lid check is for when sma hasn't yet configured */
+ ret = -ENETDOWN;
+ _IPATH_VDBG("send while not ready, mylid=%u, flags=0x%x\n",
+ devdata[device].ipath_lid,
+ devdata[device].ipath_flags);
+ }
+ /* +1 is for the qword padding of pbc */
+ plen = (sizeof(struct ips_message_header_typ) >> 2) + datawords + 1;
+ if (plen > (devdata[device].ipath_ibmaxlen >> 2)) {
+ _IPATH_DBG("packet len 0x%x too long, failing\n", plen);
+ ret = -EINVAL;
+ }
+ vlsllnh = *((uint16_t *) hdr);
+ if (vlsllnh != htons(IPS_LRH_BTH)) {
+ _IPATH_DBG("Warning: lrh[0] wrong (%x, not %x); not sending\n",
+ vlsllnh, htons(IPS_LRH_BTH));
+ ret = -EINVAL;
+ }
+ if (ret)
+ goto done;
+
+ /* Get a PIO buffer to use. */
+ if (!(piobuf = ipath_getpiobuf(device, NULL))) {
+ ret = -EBUSY;
+ goto done;
+ }
+
+ _IPATH_EPDBG("0x%x+1w pio %p\n", plen - 1, piobuf);
+
+ /* len to control qword, no flags */
+ writeq(plen, piobuf);
+ piobuf += 2;
+ memcpy_toio32(piobuf, hdr,
+ (sizeof(struct ips_message_header_typ) >> 2));
+ piobuf += (sizeof(struct ips_message_header_typ) >> 2);
+ memcpy_toio32(piobuf, data, datawords);
+
+ ipath_stats.sps_ether_spkts++; /* another ether packet sent */
+
+done:
+ return ret;
+}
+
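+/*
+ * Like ipath_layer_want_buffer(), but with unit validation: enable
+ * the PIO-buffer-available interrupt for the given unit.
+ */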
+void ipath_layer_set_piointbufavail_int(const ipath_type device)
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return;
+ }
+
+ atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL,
+ &devdata[device].ipath_sendctrl);
+
+ ipath_kput_kreg(device, kr_sendctrl, devdata[device].ipath_sendctrl);
+}
+
+void ipath_layer_enable_timer(const ipath_type device)
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return;
+ }
+
+ /*
+ * HT-400 has a design flaw where the chip and kernel idea
+ * of the tail register don't always agree, and therefore we won't
+ * get an interrupt on the next packet received.
+ * If the board supports per packet receive interrupts, use it.
+ * Otherwise, the timer function periodically checks for packets
+ * to cover this case.
+ * Either way, the timer is needed for verbs layer related
+ * processing.
+ */
+ if (devdata[device].ipath_flags & IPATH_GPIO_INTR) {
+ ipath_kput_kreg(device, kr_debugportselect, 0x2074076542310UL);
+ /* Enable GPIO bit 2 interrupt */
+ ipath_kput_kreg(device, kr_gpio_mask, (uint64_t)(1 << 2));
+ }
+
+ init_timer(&devdata[device].verbs_layer.l_timer);
+ devdata[device].verbs_layer.l_timer.function = ipath_verbs_timer;
+ devdata[device].verbs_layer.l_timer.data = (unsigned long)device;
+ devdata[device].verbs_layer.l_timer.expires = jiffies + 1;
+ add_timer(&devdata[device].verbs_layer.l_timer);
+}
+
+void ipath_layer_disable_timer(const ipath_type device)
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return;
+ }
+
+ /* Disable GPIO bit 2 interrupt */
+ if (devdata[device].ipath_flags & IPATH_GPIO_INTR)
+ ipath_kput_kreg(device, kr_gpio_mask, 0);
+
+ del_timer_sync(&devdata[device].verbs_layer.l_timer);
+}
+
+/*
+ * Get the verbs layer flags.
+ */
+unsigned ipath_verbs_get_flags(const ipath_type device)
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+
+ return devdata[device].verbs_layer.l_flags;
+}
+
+/*
+ * Set the verbs layer flags.
+ */
+void ipath_verbs_set_flags(const ipath_type device, unsigned flags)
+{
+ ipath_type s;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return;
+ }
+
+ devdata[device].verbs_layer.l_flags = flags;
+
+ for (s = 0; s < infinipath_max; s++) {
+ if (!(devdata[s].ipath_flags & IPATH_INITTED))
+ continue;
+ if ((flags & IPATH_VERBS_KERNEL_SMA) &&
+ !(*devdata[s].ipath_statusp & IPATH_STATUS_SMA)) {
+ *devdata[s].ipath_statusp |= IPATH_STATUS_OIB_SMA;
+ } else {
+ *devdata[s].ipath_statusp &= ~IPATH_STATUS_OIB_SMA;
+ }
+ }
+}
+
+/*
+ * Return the size of the PKEY table for port 0.
+ */
+unsigned ipath_layer_get_npkeys(const ipath_type device)
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+
+ return ARRAY_SIZE(devdata[device].ipath_pd[0]->port_pkeys);
+}
+
+/*
+ * Return the indexed PKEY from the port 0 PKEY table.
+ */
+unsigned ipath_layer_get_pkey(const ipath_type device, unsigned index)
+{
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return 0;
+ }
+ if (index >= ARRAY_SIZE(devdata[device].ipath_pd[0]->port_pkeys))
+ return 0;
+
+ return devdata[device].ipath_pd[0]->port_pkeys[index];
+}
+
+/*
+ * Return the PKEY table for port 0.
+ */
+void ipath_layer_get_pkeys(const ipath_type device, uint16_t *pkeys)
+{
+ struct ipath_portdata *pd;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return;
+ }
+
+ pd = devdata[device].ipath_pd[0];
+ memcpy(pkeys, pd->port_pkeys, sizeof(pd->port_pkeys));
+}
+
+/*
+ * Decrement the reference count for the given PKEY.
+ * Return true if this was the last reference and the hardware table entry
+ * needs to be changed.
+ */
+static inline int rm_pkey(struct ipath_devdata *dd, uint16_t key)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) {
+ if (dd->ipath_pkeys[i] != key)
+ continue;
+ if (atomic_dec_and_test(&dd->ipath_pkeyrefs[i])) {
+ dd->ipath_pkeys[i] = 0;
+ return 1;
+ }
+ break;
+ }
+ return 0;
+}
+
+/*
+ * Add the given PKEY to the hardware table.
+ * Return an error code if unable to add the entry, zero if no change,
+ * or 1 if the hardware PKEY register needs to be updated.
+ */
+static inline int add_pkey(struct ipath_devdata *dd, uint16_t key)
+{
+ int i;
+ uint16_t lkey = key & 0x7FFF;
+ int any = 0;
+
+ for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) {
+ if (!dd->ipath_pkeys[i]) {
+ any++;
+ continue;
+ }
+ /* If it matches exactly, try to increment the ref count */
+ if (dd->ipath_pkeys[i] == key) {
+ if (atomic_inc_return(&dd->ipath_pkeyrefs[i]) > 1)
+ return 0;
+ /* Lost the race. Look for an empty slot below. */
+ atomic_dec(&dd->ipath_pkeyrefs[i]);
+ any++;
+ }
+ /*
+ * It makes no sense to have both the limited and unlimited
+ * PKEY set at the same time since the unlimited one will
+ * disable the limited one.
+ */
+ if ((dd->ipath_pkeys[i] & 0x7FFF) == lkey)
+ return -EEXIST;
+ }
+ if (!any)
+ return -EBUSY;
+ for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) {
+ if (!dd->ipath_pkeys[i] &&
+ atomic_inc_return(&dd->ipath_pkeyrefs[i]) == 1) {
+ /* for ipathstats, etc. */
+ ipath_stats.sps_pkeys[i] = lkey;
+ dd->ipath_pkeys[i] = key;
+ return 1;
+ }
+ }
+ return -EBUSY;
+}
+
+/*
+ * Set the PKEY table for port 0.
+ */
+int ipath_layer_set_pkeys(const ipath_type device, uint16_t *pkeys)
+{
+ struct ipath_portdata *pd;
+ struct ipath_devdata *dd;
+ int i;
+ int changed = 0;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return -EINVAL;
+ }
+
+ dd = &devdata[device];
+ pd = dd->ipath_pd[0];
+
+ for (i = 0; i < ARRAY_SIZE(pd->port_pkeys); i++) {
+ uint16_t key = pkeys[i];
+ uint16_t okey = pd->port_pkeys[i];
+
+ if (key == okey)
+ continue;
+ /*
+ * The value of this PKEY table entry is changing.
+ * Remove the old entry in the hardware's array of PKEYs.
+ */
+ if (okey & 0x7FFF)
+ changed |= rm_pkey(dd, okey);
+ if (key & 0x7FFF) {
+ int ret = add_pkey(dd, key);
+
+ if (ret < 0)
+ key = 0;
+ else
+ changed |= ret;
+ }
+ pd->port_pkeys[i] = key;
+ }
+ if (changed) {
+ uint64_t pkey;
+
+ pkey = (uint64_t) dd->ipath_pkeys[0] |
+ ((uint64_t) dd->ipath_pkeys[1] << 16) |
+ ((uint64_t) dd->ipath_pkeys[2] << 32) |
+ ((uint64_t) dd->ipath_pkeys[3] << 48);
+ _IPATH_VDBG("p0 new pkey reg %llx\n", pkey);
+ ipath_kput_kreg(pd->port_unit, kr_partitionkey, pkey);
+ }
+ return 0;
+}
+
+/*
+ * Registers that vary with the chip implementation constants (port)
+ * use this routine.
+ */
+uint64_t ipath_kget_kreg64_port(const ipath_type stype, ipath_kreg regno,
+ unsigned port)
+{
+ ipath_kreg tmp =
+ (port < devdata[stype].ipath_portcnt && regno == kr_rcvhdraddr) ?
+ regno + port :
+ ((port < devdata[stype].ipath_portcnt
+ && regno == kr_rcvhdrtailaddr) ? regno + port : __kr_invalid);
+ return ipath_kget_kreg64(stype, tmp);
+}
+
+/*
+ * Registers that vary with the chip implementation constants (port)
+ * use this routine.
+ */
+void ipath_kput_kreg_port(const ipath_type stype, ipath_kreg regno,
+ unsigned port, uint64_t value)
+{
+ ipath_kreg tmp =
+ (port < devdata[stype].ipath_portcnt && regno == kr_rcvhdraddr) ?
+ regno + port :
+ ((port < devdata[stype].ipath_portcnt
+ && regno == kr_rcvhdrtailaddr) ? regno + port : __kr_invalid);
+ ipath_kput_kreg(stype, tmp, value);
+}
+
+/*
+ * Returns zero if the default is POLL, 1 if the default is SLEEP.
+ */
+int ipath_layer_get_linkdowndefaultstate(const ipath_type device)
+{
+ struct ipath_devdata *dd;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return -EINVAL;
+ }
+
+ dd = &devdata[device];
+ return (dd->ipath_ibcctrl & INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE) ?
+ 1 : 0;
+}
+
+/*
+ * Note that this will only take effect when the link state changes.
+ */
+int ipath_layer_set_linkdowndefaultstate(const ipath_type device, int sleep)
+{
+ struct ipath_devdata *dd;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return -EINVAL;
+ }
+
+ _IPATH_DBG("state %s\n", sleep ? "SLEEP" : "POLL");
+ dd = &devdata[device];
+ if (sleep)
+ dd->ipath_ibcctrl |= INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE;
+ else
+ dd->ipath_ibcctrl &= ~INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE;
+ return 0;
+}
+
+int ipath_layer_get_phyerrthreshold(const ipath_type device)
+{
+ struct ipath_devdata *dd;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return -EINVAL;
+ }
+
+ dd = &devdata[device];
+ return (dd->ipath_ibcctrl >> INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) &
+ INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK;
+}
+
+/*
+ * Note that this will only take effect when the link state changes.
+ */
+int ipath_layer_set_phyerrthreshold(const ipath_type device, unsigned n)
+{
+ struct ipath_devdata *dd;
+ unsigned v;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return -EINVAL;
+ }
+
+ dd = &devdata[device];
+ v = (dd->ipath_ibcctrl >> INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) &
+ INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK;
+ if (v != n) {
+ _IPATH_DBG("error threshold %u\n", n);
+ dd->ipath_ibcctrl &=
+ ~(INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK <<
+ INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT);
+ dd->ipath_ibcctrl |=
+ (uint64_t)n << INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT;
+ }
+ return 0;
+}
+
+int ipath_layer_get_overrunthreshold(const ipath_type device)
+{
+ struct ipath_devdata *dd;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return -EINVAL;
+ }
+
+ dd = &devdata[device];
+ return (dd->ipath_ibcctrl >> INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) &
+ INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK;
+}
+
+/*
+ * Note that this will only take effect when the link state changes.
+ */
+int ipath_layer_set_overrunthreshold(const ipath_type device, unsigned n)
+{
+ struct ipath_devdata *dd;
+ unsigned v;
+
+ if (device >= infinipath_max) {
+ _IPATH_DBG("Invalid unit %u\n", device);
+ return -EINVAL;
+ }
+
+ dd = &devdata[device];
+ v = (dd->ipath_ibcctrl >> INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) &
+ INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK;
+ if (v != n) {
+ _IPATH_DBG("overrun threshold %u\n", n);
+ dd->ipath_ibcctrl &=
+ ~(INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK <<
+ INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT);
+ dd->ipath_ibcctrl |=
+ (uint64_t)n << INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT;
+ }
+ return 0;
+}
+
+EXPORT_SYMBOL(ipath_kset_linkstate);
+EXPORT_SYMBOL(ipath_kset_mtu);
+EXPORT_SYMBOL(ipath_layer_close);
+EXPORT_SYMBOL(ipath_layer_get_bcast);
+EXPORT_SYMBOL(ipath_layer_get_cr_errpkey);
+EXPORT_SYMBOL(ipath_layer_get_deviceid);
+EXPORT_SYMBOL(ipath_layer_get_flags);
+EXPORT_SYMBOL(ipath_layer_get_guid);
+EXPORT_SYMBOL(ipath_layer_get_ibmtu);
+EXPORT_SYMBOL(ipath_layer_get_lastibcstat);
+EXPORT_SYMBOL(ipath_layer_get_lid);
+EXPORT_SYMBOL(ipath_layer_get_mac);
+EXPORT_SYMBOL(ipath_layer_get_nguid);
+EXPORT_SYMBOL(ipath_layer_get_num_of_dev);
+EXPORT_SYMBOL(ipath_layer_get_pcidev);
+EXPORT_SYMBOL(ipath_layer_open);
+EXPORT_SYMBOL(ipath_layer_query_device);
+EXPORT_SYMBOL(ipath_layer_register);
+EXPORT_SYMBOL(ipath_layer_send);
+EXPORT_SYMBOL(ipath_layer_set_guid);
+EXPORT_SYMBOL(ipath_layer_set_piointbufavail_int);
+EXPORT_SYMBOL(ipath_layer_snapshot_counters);
+EXPORT_SYMBOL(ipath_layer_get_counters);
+EXPORT_SYMBOL(ipath_layer_want_buffer);
+EXPORT_SYMBOL(ipath_verbs_register);
+EXPORT_SYMBOL(ipath_verbs_send);
+EXPORT_SYMBOL(ipath_verbs_unregister);
+EXPORT_SYMBOL(ipath_set_sps_lid);
+EXPORT_SYMBOL(ipath_layer_enable_timer);
+EXPORT_SYMBOL(ipath_layer_disable_timer);
+EXPORT_SYMBOL(ipath_verbs_get_flags);
+EXPORT_SYMBOL(ipath_verbs_set_flags);
+EXPORT_SYMBOL(ipath_layer_get_npkeys);
+EXPORT_SYMBOL(ipath_layer_get_pkey);
+EXPORT_SYMBOL(ipath_layer_get_pkeys);
+EXPORT_SYMBOL(ipath_layer_set_pkeys);
+EXPORT_SYMBOL(ipath_layer_get_linkdowndefaultstate);
+EXPORT_SYMBOL(ipath_layer_set_linkdowndefaultstate);
+EXPORT_SYMBOL(ipath_layer_get_phyerrthreshold);
+EXPORT_SYMBOL(ipath_layer_set_phyerrthreshold);
+EXPORT_SYMBOL(ipath_layer_get_overrunthreshold);
+EXPORT_SYMBOL(ipath_layer_set_overrunthreshold);

2005-12-29 00:42:28

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 8 of 20] ipath - core driver, part 1 of 4

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r ffbd416f30d4 -r ddd21709e12c drivers/infiniband/hw/ipath/ipath_driver.c
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,1879 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#include <linux/version.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/swap.h>
+#include <asm/mtrr.h>
+#include <linux/netdevice.h>
+
+#include <linux/crc32.h> /* we can generate our own crc's for testing */
+
+#include "ipath_kernel.h"
+#include "ips_common.h"
+#include "ipath_layer.h"
+
+/*
+ * Our LSB-assigned major number, so scripts can figure
+ * out how to make the entry in /dev.
+ */
+
+static int ipath_major = 233;
+
+/*
+ * number of buffers reserved for driver (layered drivers and SMA send).
+ * Reserved at end of buffer list.
+ */
+
+static uint infinipath_kpiobufs = 32;
+
+/*
+ * number of ports we are configured to use (to allow for more pio
+ * buffers per port, etc.). Zero means use the chip value.
+ */
+
+static uint infinipath_cfgports;
+
+/*
+ * number of units we are configured to use (to allow for bringup on
+ * multi-chip systems). Zero means use only one for now, but will
+ * eventually mean to use infinipath_max.
+ */
+
+static uint infinipath_cfgunits;
+
+uint64_t ipath_dummy_val_for_testing;
+
+static __kernel_pid_t ipath_sma_alive; /* PID of SMA, if it's running */
+static spinlock_t ipath_sma_lock; /* SMA receive */
+
+/* max SM received packets we'll queue; we keep the most recent packets. */
+
+#define IPATH_NUM_SMAPKTS 16
+
+#define IPATH_SMA_HDRSZ (8+12+8) /* LRH+BTH+DETH */
+
+static struct _ipath_sma_rpkt {
+ /* length of received packet; non-zero if queued */
+ uint32_t len;
+ /* unit number of interface packet was received from */
+ uint32_t unit;
+ uint8_t *buf;
+} ipath_sma_data[IPATH_NUM_SMAPKTS];
+
+static unsigned ipath_sma_first; /* oldest sma packet index */
+static unsigned ipath_sma_next; /* next sma packet index to use */
+
+/*
+ * ipath_sma_data_bufs has one extra, pointed to by ipath_sma_data_spare,
+ * so we can exchange buffers to do copy_to_user, and not hold the lock
+ * across the copy_to_user().
+ */
+
+#define SMA_MAX_PKTSZ (IPATH_SMA_HDRSZ+256) /* max len of an SMA packet */
+
+static uint8_t ipath_sma_data_bufs[IPATH_NUM_SMAPKTS + 1][SMA_MAX_PKTSZ];
+static uint8_t *ipath_sma_data_spare;
+/* sma waits globally on all units */
+static wait_queue_head_t ipath_sma_wait;
+static wait_queue_head_t ipath_sma_state_wait;
+
+struct infinipath_stats ipath_stats;
+
+/*
+ * this will only be used for diags, now that we have enabled the DMA
+ * of the sendpioavail regs to system memory.
+ */
+
+static inline uint64_t ipath_kget_sreg(const ipath_type stype,
+ ipath_sreg regno)
+{
+ uint64_t val;
+ uint64_t *sbase;
+
+ sbase = (uint64_t *) (devdata[stype].ipath_sregbase
+ + (char *)devdata[stype].ipath_kregbase);
+ val = sbase ? sbase[regno] : 0ULL;
+ return val;
+}
+
+static int ipath_do_user_init(struct ipath_portdata *,
+ struct ipath_user_info __user *);
+static int ipath_get_baseinfo(struct ipath_portdata *,
+ struct ipath_base_info __user *);
+static int ipath_get_units(void);
+static int ipath_wr_eeprom(struct ipath_portdata *,
+ struct ipath_eeprom_req __user *);
+static int ipath_wait_intr(struct ipath_portdata *, uint32_t);
+static int ipath_tid_update(struct ipath_portdata *, struct _tidupd __user *);
+static int ipath_tid_free(struct ipath_portdata *, struct _tidupd __user *);
+static int ipath_get_counters(ipath_type, struct infinipath_counters __user *);
+static int ipath_get_unit_counters(struct infinipath_getunitcounters __user *a);
+static int ipath_get_stats(struct infinipath_stats __user *);
+static int ipath_set_partkey(struct ipath_portdata *, uint16_t);
+static int ipath_manage_rcvq(struct ipath_portdata *, uint16_t);
+static void ipath_clean_partkey(struct ipath_portdata *,
+ struct ipath_devdata *);
+static void ipath_disarm_piobufs(const ipath_type, unsigned, unsigned);
+static int ipath_create_user_egr(struct ipath_portdata *);
+static int ipath_create_port0_egr(struct ipath_portdata *);
+static int ipath_create_rcvhdrq(struct ipath_portdata *);
+static void ipath_handle_errors(const ipath_type, uint64_t);
+static void ipath_update_pio_bufs(const ipath_type);
+static int ipath_shutdown_link(const ipath_type);
+static int ipath_bringup_link(const ipath_type);
+int ipath_bringup_serdes(const ipath_type);
+static void ipath_get_faststats(unsigned long);
+static int ipath_setup_htconfig(struct pci_dev *, uint64_t *, const ipath_type);
+static struct page *ipath_nopage(struct vm_area_struct *, unsigned long, int *);
+static irqreturn_t ipath_intr(int irq, void *devid, struct pt_regs *regs);
+static void ipath_decode_err(char *, size_t, uint64_t);
+void ipath_free_pddata(struct ipath_devdata *, uint32_t, int);
+static void ipath_clear_tids(const ipath_type, unsigned);
+static void ipath_get_guid(const ipath_type);
+static int ipath_sma_ioctl(struct file *, unsigned int, unsigned long);
+static int ipath_rcvsma_pkt(struct ipath_sendpkt __user *);
+static int ipath_kset_lid(uint32_t);
+static int ipath_kset_mlid(uint32_t);
+static int ipath_get_mlid(uint32_t __user *);
+static int ipath_get_devstatus(uint64_t __user *);
+static int ipath_kset_guid(struct ipath_setguid __user *);
+static int ipath_get_portinfo(uint32_t __user *);
+static int ipath_get_nodeinfo(uint32_t __user *);
+#ifdef _IPATH_EXTRA_DEBUG
+static void ipath_dump_allregs(char *, ipath_type);
+#endif
+
+static const char ipath_sma_name[] = "infinipath_SMA";
+
+/*
+ * is diags mode enabled? if it is, then things like auto bringup of
+ * links are disabled
+ */
+
+int ipath_diags_enabled = 0;
+
+void ipath_chip_done(void)
+{
+}
+
+void ipath_chip_cleanup(struct ipath_devdata * dd)
+{
+}
+
+/*
+ * Cache-aligned location where the port 0 rcvhdrtail register is
+ * written back; we also want nothing else sharing the cache line, so
+ * make it a cache line in size. Used for all units.
+ *
+ * This is volatile as it's the target of a DMA from the chip.
+ */
+
+static volatile uint64_t ipath_port0_rcvhdrtail[512]
+ __attribute__ ((aligned(4096)));
+
+#define MODNAME "ipath_core"
+#define DRIVER_LOAD_MSG "PathScale " MODNAME " loaded: "
+#define PFX MODNAME ": "
+
+/*
+ * min buffers we want to have per port, after driver
+ */
+
+#define IPATH_MIN_USER_PORT_BUFCNT 8
+
+/* The size has to be longer than this string, so we can
+ * append board/chip information to it in the init code.
+ */
+static char ipath_core_version[192] = IPATH_IDSTR;
+static char *chip_driver_version;
+static int chip_driver_size;
+
+/* mylid and lidbase are to deal with LIDs in "fabric", until SM is working */
+
+module_param(infinipath_debug, uint, 0644);
+module_param(infinipath_kpiobufs, uint, 0644);
+module_param(infinipath_cfgports, uint, 0644);
+module_param(infinipath_cfgunits, uint, 0644);
+
+MODULE_PARM_DESC(infinipath_debug, "mask for debug prints");
+MODULE_PARM_DESC(infinipath_cfgports, "Set max number of ports to use");
+MODULE_PARM_DESC(infinipath_cfgunits, "Set max number of devices to use");
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("PathScale <[email protected]>");
+MODULE_DESCRIPTION("Pathscale InfiniPath driver");
+
+#ifdef IPATH_DIAG
+static __kernel_pid_t ipath_diag_alive; /* PID of diags, if running */
+int ipath_diags_ioctl(struct file *, unsigned, unsigned long);
+static int ipath_opendiag(struct inode *, struct file *);
+#endif
+
+#if __IPATH_INFO || __IPATH_DBG
+static const char *ipath_ibcstatus_str[] = {
+ "Disabled",
+ "LinkUp",
+ "PollActive",
+ "PollQuiet",
+ "SleepDelay",
+ "SleepQuiet",
+ "LState6", /* unused */
+ "LState7", /* unused */
+ "CfgDebounce",
+ "CfgRcvfCfg",
+ "CfgWaitRmt",
+ "CfgIdle",
+ "RecovRetrain",
+ "LState0xD", /* unused */
+ "RecovWaitRmt",
+ "RecovIdle",
+};
+#endif
+
+static ssize_t show_version(struct device_driver *dev, char *buf)
+{
+ return snprintf(buf, PAGE_SIZE, "%s", ipath_core_version);
+}
+
+static ssize_t show_status(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct ipath_devdata *dd = dev_get_drvdata(dev);
+
+ if (!dd)
+ return -EINVAL;
+
+ if (!dd->ipath_statusp)
+ return -EINVAL;
+
+ return snprintf(buf, PAGE_SIZE, "%llx\n", *(dd->ipath_statusp));
+}
+
+static const char *ipath_status_str[] = {
+ "Initted",
+ "Disabled",
+ "4", /* unused */
+ "OIB_SMA",
+ "SMA",
+ "Present",
+ "IB_link_up",
+ "IB_configured",
+ "NoIBcable",
+ "Fatal_Hardware_Error",
+ NULL,
+};
+
+static ssize_t show_status_str(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct ipath_devdata *dd = dev_get_drvdata(dev);
+ int i, any;
+ uint64_t s;
+
+ if (!dd)
+ return -EINVAL;
+
+ if (!dd->ipath_statusp)
+ return -EINVAL;
+
+ s = *(dd->ipath_statusp);
+ *buf = '\0';
+ for (any = i = 0; s && ipath_status_str[i]; i++) {
+ if (s & 1) {
+ if (any && strlcat(buf, " ", PAGE_SIZE) >= PAGE_SIZE)
+ /* overflow */
+ break;
+ if (strlcat(buf, ipath_status_str[i],
+ PAGE_SIZE) >= PAGE_SIZE)
+ break;
+ any = 1;
+ }
+ s >>= 1;
+ }
+ if (any)
+ strlcat(buf, "\n", PAGE_SIZE);
+
+ return strlen(buf);
+}
+
+static ssize_t show_lid(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct ipath_devdata *dd = dev_get_drvdata(dev);
+
+ if (!dd)
+ return -EINVAL;
+
+ return snprintf(buf, PAGE_SIZE, "%x\n", dd->ipath_lid);
+}
+
+static ssize_t show_mlid(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct ipath_devdata *dd = dev_get_drvdata(dev);
+
+ if (!dd)
+ return -EINVAL;
+
+ return snprintf(buf, PAGE_SIZE, "%x\n", dd->ipath_mlid);
+}
+
+static ssize_t show_guid(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct ipath_devdata *dd = dev_get_drvdata(dev);
+ uint8_t *guid;
+
+ if (!dd)
+ return -EINVAL;
+
+ guid = (uint8_t *)&(dd->ipath_guid);
+
+ return snprintf(buf, PAGE_SIZE, "%x:%x:%x:%x:%x:%x:%x:%x\n",
+ guid[0], guid[1], guid[2], guid[3], guid[4], guid[5],
+ guid[6], guid[7]);
+}
+
+static ssize_t show_nguid(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct ipath_devdata *dd = dev_get_drvdata(dev);
+
+ if (!dd)
+ return -EINVAL;
+
+ return snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_nguid);
+}
+
+static ssize_t show_serial(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct ipath_devdata *dd = dev_get_drvdata(dev);
+
+ if (!dd)
+ return -EINVAL;
+
+ buf[sizeof dd->ipath_serial] = '\0';
+ memcpy(buf, dd->ipath_serial, sizeof dd->ipath_serial);
+ strcat(buf, "\n");
+ return strlen(buf);
+}
+
+static ssize_t show_unit(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct ipath_devdata *dd = dev_get_drvdata(dev);
+
+ if (!dd)
+ return -EINVAL;
+
+ snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_unit);
+ return strlen(buf);
+}
+
+static DRIVER_ATTR(version, S_IRUGO, show_version, NULL);
+static DEVICE_ATTR(status, S_IRUGO, show_status, NULL);
+static DEVICE_ATTR(status_str, S_IRUGO, show_status_str, NULL);
+static DEVICE_ATTR(lid, S_IRUGO, show_lid, NULL);
+static DEVICE_ATTR(mlid, S_IRUGO, show_mlid, NULL);
+static DEVICE_ATTR(guid, S_IRUGO, show_guid, NULL);
+static DEVICE_ATTR(nguid, S_IRUGO, show_nguid, NULL);
+static DEVICE_ATTR(serial, S_IRUGO, show_serial, NULL);
+static DEVICE_ATTR(unit, S_IRUGO, show_unit, NULL);
+
+/*
+ * called from add_timer and user counter read calls, to deal with
+ * counters that wrap in "human time". The words sent and received, and
+ * the packets sent and received are all that we worry about. For now,
+ * at least, we don't worry about error counters, because if they wrap
+ * that quickly, we probably don't care. We may eventually just make this
+ * handle all the counters. Word counters can wrap in about 20 seconds
+ * of full bandwidth traffic, packet counters in a few hours.
+ */
+
+uint64_t ipath_snap_cntr(const ipath_type t, ipath_creg creg)
+{
+ uint32_t val;
+ uint64_t val64, t0, t1;
+ struct ipath_devdata *dd = &devdata[t];
+ static uint64_t one_sec_in_cycles;
+ extern uint32_t _ipath_pico_per_cycle;
+
+ if (!one_sec_in_cycles && _ipath_pico_per_cycle)
+ one_sec_in_cycles = 1000000000000UL / _ipath_pico_per_cycle;
+
+ t0 = get_cycles();
+ val = ipath_kget_creg32(t, creg);
+ t1 = get_cycles();
+ if ((t1 - t0) > one_sec_in_cycles && val == -1) {
+ /*
+ * This is just a way to detect things that are quite broken.
+ * Normally this should take just a few cycles (the check is
+ * for long enough that we don't care if we get pre-empted.)
+ * An Opteron HT O read timeout is 4 seconds with normal
+ * NB values
+ */
+
+ _IPATH_UNIT_ERROR(t, "Error! Reading counter 0x%x timed out\n",
+ creg);
+ return 0ULL;
+ }
+
+ if (creg == cr_wordsendcnt) {
+ if (val != dd->ipath_lastsword) {
+ dd->ipath_sword += val - dd->ipath_lastsword;
+ dd->ipath_lastsword = val;
+ }
+ val64 = dd->ipath_sword;
+ } else if (creg == cr_wordrcvcnt) {
+ if (val != dd->ipath_lastrword) {
+ dd->ipath_rword += val - dd->ipath_lastrword;
+ dd->ipath_lastrword = val;
+ }
+ val64 = dd->ipath_rword;
+ } else if (creg == cr_pktsendcnt) {
+ if (val != dd->ipath_lastspkts) {
+ dd->ipath_spkts += val - dd->ipath_lastspkts;
+ dd->ipath_lastspkts = val;
+ }
+ val64 = dd->ipath_spkts;
+ } else if (creg == cr_pktrcvcnt) {
+ if (val != dd->ipath_lastrpkts) {
+ dd->ipath_rpkts += val - dd->ipath_lastrpkts;
+ dd->ipath_lastrpkts = val;
+ }
+ val64 = dd->ipath_rpkts;
+ } else
+ val64 = (uint64_t) val;
+
+ return val64;
+}
+
+/*
+ * print the delta of egrfull/hdrqfull errors for kernel ports no more
+ * than every 5 seconds. User processes are printed at close, but the
+ * kernel doesn't close, so... Separate routine so we may call it from
+ * other places someday, and so the function name is meaningful when
+ * printed by _IPATH_INFO.
+ */
+
+static void ipath_qcheck(const ipath_type t)
+{
+ static uint64_t last_tot_hdrqfull;
+ size_t blen = 0;
+ struct ipath_devdata *dd = &devdata[t];
+ char buf[128];
+
+ *buf = 0;
+ if (dd->ipath_pd[0]->port_hdrqfull != dd->ipath_p0_hdrqfull) {
+ blen = snprintf(buf, sizeof buf, "port 0 hdrqfull %u",
+ dd->ipath_pd[0]->port_hdrqfull -
+ dd->ipath_p0_hdrqfull);
+ dd->ipath_p0_hdrqfull = dd->ipath_pd[0]->port_hdrqfull;
+ }
+ if (ipath_stats.sps_etidfull != dd->ipath_last_tidfull) {
+ blen +=
+ snprintf(buf + blen, sizeof buf - blen, "%srcvegrfull %llu",
+ blen ? ", " : "",
+ ipath_stats.sps_etidfull - dd->ipath_last_tidfull);
+ dd->ipath_last_tidfull = ipath_stats.sps_etidfull;
+ }
+
+ /*
+ * this is actually the number of hdrq full interrupts, not actual
+ * events, but at the moment that's mostly what I'm interested in.
+ * Actual count, etc. is in the counters, if needed. For production
+ * users this won't ordinarily be printed.
+ */
+
+ if ((infinipath_debug & (__IPATH_PKTDBG | __IPATH_DBG)) &&
+ ipath_stats.sps_hdrqfull != last_tot_hdrqfull) {
+ blen +=
+ snprintf(buf + blen, sizeof buf - blen,
+ "%shdrqfull %llu (all ports)", blen ? ", " : "",
+ ipath_stats.sps_hdrqfull - last_tot_hdrqfull);
+ last_tot_hdrqfull = ipath_stats.sps_hdrqfull;
+ }
+ if (blen)
+ _IPATH_DBG("%s\n", buf);
+
+ if (*dd->ipath_hdrqtailptr != dd->ipath_port0head) {
+ if (dd->ipath_lastport0rcv_cnt == ipath_stats.sps_port0pkts) {
+ _IPATH_PDBG("missing rcv interrupts? port0 hd=%llx tl=%x; port0pkts %llx\n",
+ *dd->ipath_hdrqtailptr, dd->ipath_port0head, ipath_stats.sps_port0pkts);
+ ipath_kreceive(t);
+ }
+ dd->ipath_lastport0rcv_cnt = ipath_stats.sps_port0pkts;
+ }
+}
+
+/*
+ * called from add_timer to get word counters from chip before they
+ * can overflow
+ */
+
+static void ipath_get_faststats(unsigned long t)
+{
+ uint32_t val;
+ struct ipath_devdata *dd = &devdata[t];
+ static unsigned cnt;
+
+ /*
+ * don't access the chip while running diags, or memory diags
+ * can fail
+ */
+ if (!dd->ipath_kregbase || !(dd->ipath_flags & IPATH_PRESENT) ||
+ ipath_diags_enabled) {
+ /* but re-arm the timer, for diags case; won't hurt other */
+ goto done;
+ }
+
+ ipath_snap_cntr((ipath_type) t, cr_wordsendcnt);
+ ipath_snap_cntr((ipath_type) t, cr_wordrcvcnt);
+ ipath_snap_cntr((ipath_type) t, cr_pktsendcnt);
+ ipath_snap_cntr((ipath_type) t, cr_pktrcvcnt);
+
+ ipath_qcheck(t);
+
+ /*
+ * deal with repeat error suppression. Doesn't really matter if
+ * last error was almost a full interval ago, or just a few usecs
+ * ago; still won't get more than 2 per interval. We may want
+ * longer intervals for this eventually, could do with mod, counter
+ * or separate timer. Also see code in ipath_handle_errors() and
+ * ipath_handle_hwerrors().
+ */
+
+ if (dd->ipath_lasterror)
+ dd->ipath_lasterror = 0;
+ if (dd->ipath_lasthwerror)
+ dd->ipath_lasthwerror = 0;
+ if ((devdata[t].ipath_maskederrs & ~devdata[t].ipath_ignorederrs)
+ && get_cycles() > devdata[t].ipath_unmasktime) {
+ char ebuf[256];
+ ipath_decode_err(ebuf, sizeof ebuf,
+ (devdata[t].ipath_maskederrs & ~devdata[t].
+ ipath_ignorederrs));
+ if ((devdata[t].ipath_maskederrs & ~devdata[t].
+ ipath_ignorederrs)
+ & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL)) {
+ _IPATH_UNIT_ERROR(t, "Re-enabling masked errors (%s)\n",
+ ebuf);
+ } else {
+ /*
+ * rcvegrfull and rcvhdrqfull are "normal",
+ * for some types of processes (mostly benchmarks)
+ * that send huge numbers of messages, while
+ * not processing them. So only complain about
+ * these at debug level.
+ */
+ _IPATH_DBG
+ ("Disabling frequent queue full errors (%s)\n",
+ ebuf);
+ }
+ devdata[t].ipath_maskederrs = devdata[t].ipath_ignorederrs;
+ ipath_kput_kreg(t, kr_errormask, ~devdata[t].ipath_maskederrs);
+ }
+
+ if (dd->ipath_flags & IPATH_LINK_SLEEPING) {
+ uint64_t ibc;
+ _IPATH_VDBG("linkinitcmd SLEEP, move to POLL\n");
+ dd->ipath_flags &= ~IPATH_LINK_SLEEPING;
+ ibc = dd->ipath_ibcctrl;
+ /*
+ * don't put linkinitcmd in ipath_ibcctrl, want that to
+ * stay a NOP
+ */
+ ibc |=
+ INFINIPATH_IBCC_LINKINITCMD_POLL <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT;
+ ipath_kput_kreg(t, kr_ibcctrl, ibc);
+ }
+
+ /* limit qfull messages to ~one per minute per port */
+ if ((++cnt & 0x10)) {
+ for (val = devdata[t].ipath_cfgports - 1; ((int)val) >= 0;
+ val--) {
+ if (dd->ipath_lastegrheads[val] != -1)
+ dd->ipath_lastegrheads[val] = -1;
+ if (dd->ipath_lastrcvhdrqtails[val] != -1)
+ dd->ipath_lastrcvhdrqtails[val] = -1;
+ }
+ }
+
+ if (dd->ipath_nosma_bufs) {
+ dd->ipath_nosma_secs += 5;
+ if (dd->ipath_nosma_secs >= 30) {
+ _IPATH_SMADBG("No SMA bufs avail %u seconds; cancelling pending sends\n",
+ dd->ipath_nosma_secs);
+ ipath_disarm_piobufs(t, dd->ipath_lastport_piobuf,
+ dd->ipath_piobcnt - dd->ipath_lastport_piobuf);
+ dd->ipath_nosma_secs = 0; /* start again, if necessary */
+ }
+ else
+ _IPATH_SMADBG("No SMA bufs avail %u tries, after %u seconds\n",
+ dd->ipath_nosma_bufs, dd->ipath_nosma_secs);
+ }
+
+done:
+ mod_timer(&dd->ipath_stats_timer, jiffies + HZ * 5);
+}
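
A minimal sketch of how a self-rearming timer like this is typically set up with the 2.6-era timer API; the timer is presumably initialized elsewhere in the driver, so the helper below is illustrative only:

static void ipath_start_stats_timer(struct ipath_devdata *dd, unsigned long t)
{
	/* illustrative; fires ipath_get_faststats() every 5 seconds */
	init_timer(&dd->ipath_stats_timer);
	dd->ipath_stats_timer.function = ipath_get_faststats;
	dd->ipath_stats_timer.data = t;		/* unit number, as above */
	dd->ipath_stats_timer.expires = jiffies + HZ * 5;
	add_timer(&dd->ipath_stats_timer);
}
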
+
+
+static void __devexit infinipath_remove_one(struct pci_dev *);
+static int infinipath_init_one(struct pci_dev *, const struct pci_device_id *);
+
+/* Only needed for registration, nothing else needs this info */
+#define PCI_VENDOR_ID_PATHSCALE 0x1fc1
+#define PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT 0xd
+
+const struct pci_device_id infinipath_pci_tbl[] = {
+ {
+ PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT,
+ PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},
+ {0,}
+};
+
+MODULE_DEVICE_TABLE(pci, infinipath_pci_tbl);
+
+static struct pci_driver infinipath_driver = {
+ .name = MODNAME,
+ .driver.owner = THIS_MODULE,
+ .probe = infinipath_init_one,
+ .remove = __devexit_p(infinipath_remove_one),
+ .id_table = infinipath_pci_tbl,
+};
+
+#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC)
+int remap_area_pages(unsigned long address, unsigned long phys_addr,
+ unsigned long size, unsigned long flags);
+#endif
+
+static int infinipath_init_one(struct pci_dev *pdev,
+ const struct pci_device_id *ent)
+{
+ int ret, len, j;
+ static int chip_idx = -1;
+ unsigned long addr;
+ uint64_t intconfig;
+ uint8_t rev;
+ ipath_type dev;
+
+ /*
+ * XXX: Right now, we have a hardcoded array of devices. We'll
+ * change this in a future release, but not just yet. For the
+ * moment, we're limited to 4 infinipath devices per system.
+ */
+
+ dev = ++chip_idx;
+
+ _IPATH_VDBG("initializing unit #%u\n", dev);
+ if ((!infinipath_cfgunits && (dev >= 1)) ||
+ (infinipath_cfgunits && (dev >= infinipath_cfgunits)) ||
+ (dev >= infinipath_max)) {
+ _IPATH_ERROR("Trying to initialize unit %u, max is %u\n",
+ dev, infinipath_max - 1);
+ return -EINVAL;
+ }
+
+ devdata[dev].pci_registered = 1;
+ devdata[dev].ipath_unit = dev;
+
+ if ((ret = pci_enable_device(pdev))) {
+ _IPATH_DBG("pci_enable unit %u failed: %x\n", dev, ret);
+ }
+
+ if ((ret = pci_request_regions(pdev, MODNAME)))
+ _IPATH_INFO("pci_request_regions unit %u fails: %d\n", dev,
+ ret);
+
+ if ((ret = pci_set_dma_mask(pdev, DMA_64BIT_MASK)) != 0)
+ _IPATH_INFO("pci_set_dma_mask unit %u fails: %d\n", dev, ret);
+
+	pci_set_master(pdev);	/* probably not needed for HT */
+
+ addr = pci_resource_start(pdev, 0);
+ len = pci_resource_len(pdev, 0);
+ _IPATH_VDBG
+ ("regbase (0) %lx len %d irq %x, vend %x/%x driver_data %lx\n",
+ addr, len, pdev->irq, ent->vendor, ent->device, ent->driver_data);
+ devdata[dev].ipath_deviceid = ent->device; /* save for later use */
+ devdata[dev].ipath_vendorid = ent->vendor;
+ for (j = 0; j < 6; j++) {
+ if (!pdev->resource[j].start)
+ continue;
+ _IPATH_VDBG("BAR %d start %lx, end %lx, len %lx\n",
+ j, pdev->resource[j].start,
+ pdev->resource[j].end, pci_resource_len(pdev, j));
+ }
+
+ if (!addr) {
+ _IPATH_UNIT_ERROR(dev, "No valid address in BAR 0!\n");
+ return -ENODEV;
+ }
+
+ if ((ret = pci_read_config_byte(pdev, PCI_REVISION_ID, &rev))) {
+ _IPATH_UNIT_ERROR(dev,
+ "Failed to read PCI revision ID unit %u: %d\n",
+ dev, ret);
+ return ret; /* shouldn't ever happen */
+ } else
+ devdata[dev].ipath_pcirev = rev;
+
+ devdata[dev].ipath_kregbase = ioremap_nocache(addr, len);
+#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC)
+ printk("Remapping pages WC\n");
+ remap_area_pages((unsigned long) devdata[dev].ipath_kregbase +
+ 1024 * 1024, addr + 1024 * 1024, 1024 * 1024,
+ _PAGE_MA_WC);
+ /* devdata[dev].ipath_kregbase = __ioremap(addr, len, _PAGE_MA_WC); */
+#endif
+
+ if (!devdata[dev].ipath_kregbase) {
+ _IPATH_DBG("Unable to map io addr %lx to kvirt, failing\n",
+ addr);
+ ret = -ENOMEM;
+ goto fail;
+ }
+ devdata[dev].ipath_kregend = (uint64_t __iomem *)
+ ((void __iomem *) devdata[dev].ipath_kregbase + len);
+ devdata[dev].ipath_physaddr = addr; /* used for io_remap, etc. */
+ /* for user mmap */
+ devdata[dev].ipath_kregvirt = (uint64_t __iomem *) phys_to_virt(addr);
+ _IPATH_VDBG("mapped io addr %lx to kregbase %p kregvirt %p\n", addr,
+ devdata[dev].ipath_kregbase, devdata[dev].ipath_kregvirt);
+
+ /*
+ * set these up before registering the interrupt handler, just
+ * in case
+ */
+ devdata[dev].pcidev = pdev;
+ pci_set_drvdata(pdev, &(devdata[dev]));
+
+ /*
+ * set up our interrupt handler; SA_SHIRQ probably not needed,
+ * but won't hurt for now.
+ */
+
+ if (!pdev->irq) {
+ _IPATH_UNIT_ERROR(dev, "irq is 0, failing init\n");
+ ret = -EINVAL;
+ goto fail;
+ }
+ if ((ret = request_irq(pdev->irq, ipath_intr,
+ SA_SHIRQ, MODNAME, &devdata[dev]))) {
+ _IPATH_UNIT_ERROR(dev,
+ "Couldn't setup interrupt handler, irq=%u: %d\n",
+ pdev->irq, ret);
+ goto fail;
+ }
+
+ /*
+ * clear ipath_flags here instead of in ipath_init_chip as it is set
+ * by ipath_setup_htconfig.
+ */
+ devdata[dev].ipath_flags = 0;
+ if (ipath_setup_htconfig(pdev, &intconfig, dev))
+ _IPATH_DBG
+ ("Failed to setup HT config, continuing anyway for now\n");
+
+ ret = ipath_init_chip(dev); /* do the chip-specific init */
+ if (!ret) {
+#ifdef CONFIG_MTRR
+ uint64_t pioaddr, piolen;
+ unsigned bits;
+ /*
+ * Set the PIO buffers to be WCCOMB, so we get HT bursts
+ * to the chip. Linux (possibly the hardware) requires
+ * it to be on a power of 2 address matching the length
+ * (which has to be a power of 2). For rev1, that means
+ * the base address, for rev2, it will be just the PIO
+ * buffers themselves.
+ */
+ pioaddr = addr + devdata[dev].ipath_piobufbase;
+ piolen = devdata[dev].ipath_piobcnt *
+ ALIGN(devdata[dev].ipath_piosize,
+ devdata[dev].ipath_palign);
+
+ for (bits = 0; !(piolen & (1ULL << bits)); bits++)
+ /* do nothing */;
+
+ if (piolen != (1ULL << bits)) {
+ _IPATH_DBG("piolen 0x%llx not power of 2, bits=%u\n",
+ piolen, bits);
+ piolen >>= bits;
+ while (piolen >>= 1)
+ bits++;
+ piolen = 1ULL << (bits + 1);
+ _IPATH_DBG("Changed piolen to 0x%llx bits=%u\n", piolen,
+ bits);
+ }
+ if (pioaddr & (piolen - 1)) {
+ uint64_t atmp;
+ _IPATH_DBG
+ ("pioaddr %llx not on right boundary for size %llx, fixing\n",
+ pioaddr, piolen);
+ atmp = pioaddr & ~(piolen - 1);
+ if (atmp < addr || (atmp + piolen) > (addr + len)) {
+ _IPATH_UNIT_ERROR(dev,
+ "No way to align address/size (%llx/%llx), no WC mtrr\n",
+ atmp, piolen << 1);
+ ret = -ENODEV;
+ } else {
+ _IPATH_DBG
+ ("changing WC base from %llx to %llx, len from %llx to %llx\n",
+ pioaddr, atmp, piolen, piolen << 1);
+ pioaddr = atmp;
+ piolen <<= 1;
+ }
+ }
+
+ if (!ret) {
+ int cookie;
+ _IPATH_VDBG
+ ("Setting mtrr for chip to WC (addr %llx, len=0x%llx)\n",
+ pioaddr, piolen);
+ cookie = mtrr_add(pioaddr, piolen, MTRR_TYPE_WRCOMB, 0);
+ if (cookie < 0) {
+ _IPATH_INFO
+ ("mtrr_add(%llx,0x%llx,WC,0) failed (%d)\n",
+ pioaddr, piolen, cookie);
+ ret = -EINVAL;
+ } else {
+ _IPATH_VDBG
+ ("Set mtrr for chip to WC, cookie is %d\n",
+ cookie);
+ devdata[dev].ipath_mtrr = (uint32_t) cookie;
+ }
+ }
+#endif /* CONFIG_MTRR */
+ }
+
+ if (!ret && devdata[dev].ipath_kregbase && (devdata[dev].ipath_flags
+ & IPATH_PRESENT)) {
+ /*
+ * for the hardware, enable interrupts only after
+ * kr_interruptconfig is written, if we could set it up
+ */
+ if (intconfig) {
+ /* interrupt address */
+ ipath_kput_kreg(dev, kr_interruptconfig, intconfig);
+ /* enable all interrupts */
+ ipath_kput_kreg(dev, kr_intmask, -1LL);
+ /* force re-interrupt of any pending interrupts. */
+ ipath_kput_kreg(dev, kr_intclear, 0ULL);
+			/* OK, the chip is usable, mark it as initialized */
+ *devdata[dev].ipath_statusp |= IPATH_STATUS_INITTED;
+ } else
+ _IPATH_UNIT_ERROR(dev,
+ "No interrupts enabled, couldn't setup interrupt address\n");
+ } else if (ret != -EPERM)
+ _IPATH_INFO("Not configuring unit %u interrupts, init failed\n",
+ dev);
+
+ device_create_file(&(pdev->dev), &dev_attr_status);
+ device_create_file(&(pdev->dev), &dev_attr_status_str);
+ device_create_file(&(pdev->dev), &dev_attr_lid);
+ device_create_file(&(pdev->dev), &dev_attr_mlid);
+ device_create_file(&(pdev->dev), &dev_attr_guid);
+ device_create_file(&(pdev->dev), &dev_attr_nguid);
+ device_create_file(&(pdev->dev), &dev_attr_serial);
+ device_create_file(&(pdev->dev), &dev_attr_unit);
+
+ /*
+ * We used to cleanup here, with pci_release_regions, etc. but that
+ * can cause other problems if we want to run diags, etc., so instead
+ * defer that until driver unload.
+ */
+
+fail: /* after we've done at least some of the pci setup */
+ if (ret == -EPERM) /* disabled device, don't want module load error;
+ * just want to carry status through to this point */
+ ret = 0;
+
+ return ret;
+}
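
The MTRR setup above boils down to rounding the PIO region up to a power of two and, if the base is not aligned to that size, aligning it down and doubling the length. A minimal sketch of the rounding step, with an illustrative helper name:

/*
 * Round a region length up to the next power of two, as mtrr_add()
 * needs.  For example, piolen 0x30000 becomes 0x40000; if the base is
 * then not 0x40000-aligned (say it ends in 0xe0000), it is aligned
 * down (to ...0xc0000) and the length doubled to 0x80000, as in the
 * code above.
 */
static uint64_t ipath_mtrr_roundup(uint64_t piolen)
{
	uint64_t len = 1;

	while (len < piolen)
		len <<= 1;
	return len;
}
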
+
+
+
+#define HT_CAPABILITY_ID 0x08 /* HT capabilities not defined in kernel */
+#define HT_INTR_DISC_CONFIG 0x80 /* HT interrupt and discovery cap */
+#define HT_INTR_REG_INDEX 2 /* intconfig requires indirect accesses */
+
+/*
+ * setup the interruptconfig register from the HT config info.
+ * Also clear CRC errors in HT linkcontrol, if necessary.
+ * This is done only for the real hardware. It is done before
+ * chip address space is initted, so can't touch infinipath registers
+ */
+
+static int ipath_setup_htconfig(struct pci_dev *pdev, uint64_t * iaddr,
+ const ipath_type t)
+{
+ uint8_t cap_type;
+ uint32_t int_handler_addr_lower;
+ uint32_t int_handler_addr_upper;
+ uint64_t ihandler = 0;
+ int i, pos, ret = 0;
+
+ *iaddr = 0ULL; /* init to zero in case not able to configure */
+
+ /*
+ * Read the capability info to find the interrupt info, and also
+ * handle clearing CRC errors in linkctrl register if necessary.
+ * We do this early, before we ever enable errors or hardware errors,
+ * mostly to avoid causing the chip to enter freeze mode.
+ */
+ if (!(pos = pci_find_capability(pdev, HT_CAPABILITY_ID))) {
+ _IPATH_UNIT_ERROR(t,
+ "Couldn't find HyperTransport capability; no interrupts\n");
+ return -ENODEV;
+ }
+ do {
+ /* the HT capability type byte is 3 bytes after the
+ * capability byte.
+ */
+ if (pci_read_config_byte(pdev, pos+3, &cap_type)) {
+ _IPATH_INFO
+ ("Couldn't read config command @ %d\n", pos);
+ continue;
+ }
+ if (!(cap_type & 0xE0)) {
+ /* bits 13-15 of command==0 is slave/primary block.
+ * Clear any HT CRC errors. We only bother to
+ * do this at load time, because it's OK if it
+ * happened before we were loaded (first time
+ * after boot/reset), but any time after that,
+			 * it's fatal anyway. Also need to not check
+			 * for upper byte errors if we are in 8 bit mode,
+ * so figure out our width. For now, at least,
+ * also complain if it's 8 bit.
+ */
+ uint8_t linkwidth = 0, linkerr, link_a_b_off, link_off;
+ uint16_t linkctrl = 0;
+
+ devdata[t].ipath_ht_slave_off = pos;
+ /* command word, master_host bit */
+ if ((cap_type >> 2) & 1) /* master host || slave */
+ link_a_b_off = 4;
+ else
+ link_a_b_off = 0;
+ _IPATH_VDBG("HT%u (Link %c) connected to processor\n",
+ link_a_b_off ? 1 : 0,
+ link_a_b_off ? 'B' : 'A');
+
+ link_a_b_off += pos;
+
+ /*
+ * check both link control registers; clear both
+ * HT CRC sets if necessary.
+ */
+
+ for (i = 0; i < 2; i++) {
+ link_off = pos + i * 4 + 0x4;
+ if (pci_read_config_word
+ (pdev, link_off, &linkctrl))
+ _IPATH_UNIT_ERROR(t,
+ "Couldn't read HT link control%d register\n",
+ i);
+ else if (linkctrl & (0xf << 8)) {
+ _IPATH_VDBG
+ ("Clear linkctrl%d CRC Error bits %x\n",
+ i, linkctrl & (0xf << 8));
+ /*
+ * now write them back to clear
+ * the error.
+ */
+ pci_write_config_byte(pdev, link_off,
+ linkctrl & (0xf <<
+ 8));
+ }
+ }
+
+ /*
+ * As with HT CRC bits, same for protocol errors
+ * that might occur during boot.
+ */
+
+ for (i = 0; i < 2; i++) {
+ link_off = pos + i * 4 + 0xd;
+ if (pci_read_config_byte
+ (pdev, link_off, &linkerr))
+ _IPATH_INFO
+ ("Couldn't read linkerror%d of HT slave/primary block\n",
+ i);
+ else if (linkerr & 0xf0) {
+ _IPATH_VDBG
+ ("HT linkerr%d bits 0x%x set, clearing\n",
+					 i, linkerr >> 4);
+ /*
+ * writing the linkerr bits that
+ * are set will clear them
+ */
+ if (pci_write_config_byte
+ (pdev, link_off, linkerr))
+ _IPATH_DBG
+ ("Failed write to clear HT linkerror%d\n",
+ i);
+ if (pci_read_config_byte
+ (pdev, link_off, &linkerr))
+ _IPATH_INFO
+ ("Couldn't reread linkerror%d of HT slave/primary block\n",
+ i);
+ else if (linkerr & 0xf0)
+ _IPATH_INFO
+ ("HT linkerror%d bits 0x%x couldn't be cleared\n",
+ i, linkerr >> 4);
+ }
+ }
+
+ /*
+ * this is just for our link to the host, not
+ * devices connected through tunnel.
+ */
+
+ if (pci_read_config_byte
+ (pdev, link_a_b_off + 7, &linkwidth))
+ _IPATH_UNIT_ERROR(t,
+ "Couldn't read HT link width config register\n");
+ else {
+ uint32_t width;
+ switch (linkwidth & 7) {
+ case 5:
+ width = 4;
+ break;
+ case 4:
+ width = 2;
+ break;
+ case 3:
+ width = 32;
+ break;
+ case 1:
+ width = 16;
+ break;
+ case 0:
+ default: /* if wrong, assume 8 bit */
+ width = 8;
+ break;
+ }
+ ((struct ipath_devdata *) pci_get_drvdata(pdev))->ipath_htwidth = width;
+
+ if (linkwidth != 0x11) {
+ _IPATH_UNIT_ERROR(t,
+ "Not configured for 16 bit HT (%x)\n",
+ linkwidth);
+ if (!(linkwidth & 0xf)) {
+ _IPATH_DBG
+ ("Will ignore HT lane1 errors\n");
+ ((struct ipath_devdata *) pci_get_drvdata(pdev))->ipath_flags |= IPATH_8BIT_IN_HT0;
+ }
+ }
+ }
+
+ /*
+ * this is just for our link to the host, not
+ * devices connected through tunnel.
+ */
+
+ if (pci_read_config_byte
+ (pdev, link_a_b_off + 0xd, &linkwidth))
+ _IPATH_UNIT_ERROR(t,
+ "Couldn't read HT link frequency config register\n");
+ else {
+ uint32_t speed;
+ switch (linkwidth & 0xf) {
+ case 6:
+ speed = 1000;
+ break;
+ case 5:
+ speed = 800;
+ break;
+ case 4:
+ speed = 600;
+ break;
+ case 3:
+ speed = 500;
+ break;
+ case 2:
+ speed = 400;
+ break;
+ case 1:
+ speed = 300;
+ break;
+ default:
+ /*
+ * assume reserved and
+ * vendor-specific are 200...
+ */
+ case 0:
+ speed = 200;
+ break;
+ }
+ ((struct ipath_devdata *) pci_get_drvdata(pdev))->ipath_htspeed = speed;
+ }
+ } else if (cap_type == HT_INTR_DISC_CONFIG) {
+ /* use indirection register to get the intr handler */
+ uint32_t intvec;
+ pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX,
+ 0x10);
+ pci_read_config_dword(pdev, pos + 4,
+ &int_handler_addr_lower);
+
+ pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX,
+ 0x11);
+ pci_read_config_dword(pdev, pos + 4,
+ &int_handler_addr_upper);
+
+ ihandler = (uint64_t) int_handler_addr_lower |
+ ((uint64_t) int_handler_addr_upper << 32);
+
+ /*
+ * I'm unable to find an exported API to get
+			 * the actual vector, either from the PCI
+ * infrastructure, or from the APIC
+ * infrastructure. This heuristic seems to be
+ * valid for Opteron on 2.6.x kernels, for irq's > 2.
+ * It may not be universally true... Bug 2338
+ *
+ * Oh well; the heuristic doesn't work for the
+ * AMI/Iwill BIOS... But the good news is,
+ * somewhere by 2.6.9, when CONFIG_PCI_MSI is
+ * enabled, the irq field actually turned into
+			 * the vector number.
+ * We therefore require that MSI be enabled...
+ */
+
+ intvec = pdev->irq;
+ /*
+ * clear any bits there; normally not set but
+ * we'll overload this for some debug purposes
+ * (setting the HTC debug register value from
+ * software, rather than GPIOs), so it might be
+ * set on a driver reload.
+ */
+
+ ihandler &= ~0xff0000;
+ /* x86 vector goes in intrinfo[23:16] */
+ ihandler |= intvec << 16;
+ _IPATH_VDBG
+ ("ihandler lower %x, upper %x, intvec %x, interruptconfig %llx\n",
+ int_handler_addr_lower, int_handler_addr_upper,
+ intvec, ihandler);
+
+ /* return to caller, can't program yet. */
+ *iaddr = ihandler;
+ /*
+ * no break, have to be sure we find link control
+ * stuff also
+ */
+ }
+
+ } while ((pos=pci_find_next_capability(pdev, pos, HT_CAPABILITY_ID)));
+
+ if (!ihandler) {
+ _IPATH_UNIT_ERROR(t,
+ "Couldn't find interrupt handler in config space\n");
+ ret = -ENODEV;
+ }
+ return ret;
+}
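
A minimal sketch of how the interruptconfig value returned through *iaddr (and later written to kr_interruptconfig) is assembled; the helper name is illustrative:

static uint64_t ipath_pack_intconfig(uint32_t lo, uint32_t hi, unsigned vec)
{
	uint64_t ihandler = (uint64_t) lo | ((uint64_t) hi << 32);

	ihandler &= ~0xff0000ULL;			/* clear intrinfo[23:16] */
	ihandler |= (uint64_t) (vec & 0xff) << 16;	/* x86 vector */
	/* e.g. lo 0xfee00000, hi 0, vec 0x31 yields 0xfe310000 */
	return ihandler;
}
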
+
+/*
+ * get the GUID from the i2c device
+ * When we add the multi-chip support, we will probably have to add
+ * the ability to use the number of guids field, and get the guid from
+ * the first chip's flash, to use for all of them.
+ */
+
+static void ipath_get_guid(const ipath_type t)
+{
+ void *buf;
+ struct ipath_flash *ifp;
+ uint64_t guid;
+ int len;
+ uint8_t csum, *bguid;
+
+ if (t && devdata[0].ipath_nguid > 1 && t <= devdata[0].ipath_nguid) {
+ uint8_t oguid;
+ devdata[t].ipath_guid = devdata[0].ipath_guid;
+ bguid = (uint8_t *) & devdata[t].ipath_guid;
+
+ oguid = bguid[7];
+ bguid[7] += t;
+ if (oguid > bguid[7]) {
+ if (bguid[6] == 0xff) {
+ if (bguid[5] == 0xff) {
+ _IPATH_UNIT_ERROR(t,
+ "Can't set %s GUID from base GUID, wraps to OUI!\n",
+ ipath_get_unit_name
+ (t));
+ devdata[t].ipath_guid = 0;
+ return;
+ }
+ bguid[5]++;
+ }
+ bguid[6]++;
+ }
+ devdata[t].ipath_nguid = 1;
+
+ _IPATH_DBG
+ ("nguid %u, so adding %u to device 0 guid, for %llx (big-endian)\n",
+ devdata[0].ipath_nguid, t, devdata[t].ipath_guid);
+ return;
+ }
+
+ len = offsetof(struct ipath_flash, if_future);
+ if (!(buf = vmalloc(len))) {
+ _IPATH_UNIT_ERROR(t,
+ "Couldn't allocate memory to read %u bytes from eeprom for GUID\n",
+ len);
+ return;
+ }
+
+ if (ipath_eeprom_read(t, 0, buf, len)) {
+ _IPATH_UNIT_ERROR(t, "Failed reading GUID from eeprom\n");
+ goto done;
+ }
+ ifp = (struct ipath_flash *)buf;
+
+ csum = ipath_flash_csum(ifp, 0);
+ if (csum != ifp->if_csum) {
+ _IPATH_INFO("Bad I2C flash checksum: 0x%x, not 0x%x\n",
+ csum, ifp->if_csum);
+ goto done;
+ }
+ if (*(uint64_t *) ifp->if_guid == 0ULL
+ || *(uint64_t *) ifp->if_guid == -1LL) {
+ _IPATH_UNIT_ERROR(t, "Invalid GUID %llx from flash; ignoring\n",
+ *(uint64_t *) ifp->if_guid);
+ goto done; /* don't allow GUID if all 0 or all 1's */
+ }
+
+ /* complain, but allow it */
+ if (*(uint64_t *) ifp->if_guid == 0x100007511000000ULL)
+ _IPATH_INFO
+		    ("Warning, GUID %llx is default, probably not correct!\n",
+ *(uint64_t *) ifp->if_guid);
+
+ bguid = ifp->if_guid;
+ if (!bguid[0] && !bguid[1] && !bguid[2]) {
+		/* original incorrect GUID format in flash; fix in core copy by
+		 * shifting up 2 octets; no need to change the top octet, since
+		 * both it and the octet shifted into it are 0. */
+ bguid[1] = bguid[3];
+ bguid[2] = bguid[4];
+ bguid[3] = bguid[4] = 0;
+ guid = *(uint64_t *)ifp->if_guid;
+ _IPATH_VDBG("Old GUID format in flash, top 3 zero, shifting 2 octets\n");
+ }
+ else
+ guid = *(uint64_t *)ifp->if_guid;
+ devdata[t].ipath_guid = guid;
+ devdata[t].ipath_nguid = ifp->if_numguid;
+ memcpy(devdata[t].ipath_serial, ifp->if_serial, sizeof(ifp->if_serial));
+ _IPATH_VDBG("Initted GUID to %llx (big-endian) from i2c flash\n",
+ devdata[t].ipath_guid);
+
+done:
+ vfree(buf);
+}
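
For the multi-unit case at the top of ipath_get_guid(), a minimal sketch of the per-unit GUID derivation: the unit number is added to the low octet of the big-endian base GUID, carrying into octets 6 and 5 but never into the 3-octet OUI. The helper name is illustrative:

static int ipath_offset_guid(uint8_t guid[8], unsigned unit)
{
	uint8_t oguid = guid[7];

	guid[7] += unit;
	if (oguid > guid[7]) {			/* low octet wrapped */
		if (guid[6] == 0xff) {
			if (guid[5] == 0xff)
				return -1;	/* would spill into the OUI */
			guid[5]++;
		}
		guid[6]++;
	}
	return 0;
}
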
+
+static void __devexit infinipath_remove_one(struct pci_dev *pdev)
+{
+ struct ipath_devdata *dd;
+
+ _IPATH_VDBG("pci_release, pdev=%p\n", pdev);
+ if (pdev) {
+ device_remove_file(&(pdev->dev), &dev_attr_status);
+ device_remove_file(&(pdev->dev), &dev_attr_status_str);
+ device_remove_file(&(pdev->dev), &dev_attr_lid);
+ device_remove_file(&(pdev->dev), &dev_attr_mlid);
+ device_remove_file(&(pdev->dev), &dev_attr_guid);
+ device_remove_file(&(pdev->dev), &dev_attr_nguid);
+ device_remove_file(&(pdev->dev), &dev_attr_serial);
+ device_remove_file(&(pdev->dev), &dev_attr_unit);
+ dd = pci_get_drvdata(pdev);
+ pci_set_drvdata(pdev, NULL);
+ _IPATH_VDBG
+ ("Releasing pci memory regions, devdata %p, unit %u\n", dd,
+ (uint32_t) (dd - devdata));
+ if (dd && dd->ipath_kregbase) {
+ _IPATH_VDBG("Unmapping kregbase %p\n",
+ dd->ipath_kregbase);
+ iounmap((volatile void __iomem *) dd->ipath_kregbase);
+ dd->ipath_kregbase = NULL;
+ }
+ pci_release_regions(pdev);
+ _IPATH_VDBG("calling pci_disable_device\n");
+ pci_disable_device(pdev);
+ }
+}
+
+int ipath_open(struct inode *, struct file *);
+static int ipath_opensma(struct inode *, struct file *);
+#ifdef IPATH_DIAG
+static int ipath_opendiag(struct inode *, struct file *);
+#endif
+int ipath_close(struct inode *, struct file *);
+static unsigned int ipath_poll(struct file *, struct poll_table_struct *);
+long ipath_ioctl(struct file *, unsigned int, unsigned long);
+static loff_t ipath_llseek(struct file *, loff_t, int);
+static int ipath_mmap(struct file *, struct vm_area_struct *);
+
+static struct file_operations ipath_fops = {
+ .owner = THIS_MODULE,
+ .open = ipath_open,
+ .release = ipath_close,
+ .poll = ipath_poll,
+ /*
+ * all of ours are completely compatible and don't require the
+ * kernel lock
+ */
+ .compat_ioctl = ipath_ioctl,
+ /* we don't need kernel lock for our ioctls */
+ .unlocked_ioctl = ipath_ioctl,
+ .llseek = ipath_llseek,
+ .mmap = ipath_mmap
+};
+
+static DECLARE_MUTEX(ipath_mutex); /* general driver use */
+spinlock_t ipath_pioavail_lock;
+
+/*
+ * For now, at least (and probably forever), we don't require root
+ * or equivalent permissions to use the device.
+ */
+
+int ipath_open(struct inode *in, struct file *fp)
+{
+ int ret = 0, minor, i, prefunit=-1, devmax;
+ int maxofallports, npresent = 0, notup = 0;
+ ipath_type ndev;
+
+ down(&ipath_mutex);
+
+ minor = iminor(in);
+ _IPATH_VDBG("open on dev %lx (minor %d)\n", (long)in->i_rdev, minor);
+
+ /* This code is present to allow a knowledgeable person to specify the
+ * layout of processes to processors before opening this driver, and
+ * then we'll assign the process to the "closest" HT-400 to
+	 * that processor (we assume reasonable connectivity, for now).
+ * This code assumes that if affinity has been set before this
+ * point, that at most one cpu is set; for now this is reasonable.
+ * I check for both cpus_empty() and cpus_full(), in case some
+ * kernel variant sets none of the bits when no affinity is set.
+ * 2.6.11 and 12 kernels have all present cpus set.
+ * Some day we'll have to fix it up further to handle a cpu subset.
+ * This algorithm fails for two HT-400's connected in tunnel fashion.
+ * Eventually this needs real topology information.
+ * There may be some issues with dual core numbering as well. This
+ * needs more work prior to release.
+ */
+ if (minor != IPATH_SMA
+#ifdef IPATH_DIAG
+ && minor != IPATH_DIAG
+#endif
+ && minor != IPATH_CTRL
+ && !cpus_empty(current->cpus_allowed)
+ && !cpus_full(current->cpus_allowed)) {
+ int ncpus = num_online_cpus(), curcpu = -1;
+ for (i=0; i<ncpus; i++)
+ if (cpu_isset(i, current->cpus_allowed)) {
+ _IPATH_PRDBG("%s[%u] affinity set for cpu %d\n",
+ current->comm, current->pid, i);
+ curcpu = i;
+ }
+ if (curcpu != -1) {
+ for (ndev = 0; ndev < infinipath_max; ndev++)
+ if ((devdata[ndev].ipath_flags & IPATH_PRESENT)
+ && devdata[ndev].ipath_kregbase)
+ npresent++;
+ if (npresent) {
+ prefunit = curcpu/(ncpus/npresent);
+ _IPATH_DBG("%s[%u] %d chips, %d cpus, "
+ "%d cpus/chip, select unit %d\n",
+ current->comm, current->pid,
+ npresent, ncpus, ncpus/npresent,
+ prefunit);
+ }
+ }
+ }
+
+ if (minor == IPATH_SMA) {
+ ret = ipath_opensma(in, fp);
+ /* for ipath_ioctl */
+ fp->private_data = (void *)(unsigned long)minor;
+ goto done;
+ }
+#ifdef IPATH_DIAG
+ else if (minor == IPATH_DIAG) {
+ ret = ipath_opendiag(in, fp);
+ /* for ipath_ioctl */
+ fp->private_data = (void *)(unsigned long)minor;
+ goto done;
+ }
+#endif
+ else if (minor == IPATH_CTRL) {
+ /* for ipath_ioctl */
+ fp->private_data = (void *)(unsigned long)minor;
+ ret = 0;
+ goto done;
+ }
+ else if (minor) {
+ /*
+		 * minor number 0 is used for all chips; we choose an available
+		 * chip ourselves, rather than basing it on which node was opened.
+ */
+
+ _IPATH_DBG("open on invalid minor %u\n", minor);
+ ret = -ENXIO;
+ goto done;
+ }
+
+ /*
+ * for now, we use all ports on one, then all ports on the
+ * next, etc. Eventually we want to tweak this to be cpu/chip
+ * topology aware, and round-robin across chips that are
+ * configured and connected, placing processes on the closest
+ * available processor that isn't already over-allocated.
+ * multi-HT400 topology could be better handled
+ */
+
+ npresent = maxofallports = 0;
+ for (ndev = 0; ndev < infinipath_max; ndev++) {
+ if (!(devdata[ndev].ipath_flags & IPATH_PRESENT) ||
+ !devdata[ndev].ipath_kregbase)
+ continue;
+ npresent++;
+ if ((devdata[ndev].
+ ipath_flags & (IPATH_LINKDOWN | IPATH_LINKUNK))) {
+ _IPATH_VDBG("unit %u present, but link not ready\n",
+ ndev);
+ notup++;
+ continue;
+ } else if (!devdata[ndev].ipath_lid) {
+ _IPATH_VDBG
+ ("unit %u present, but LID not assigned, down\n",
+ ndev);
+ notup++;
+ continue;
+ }
+ if (devdata[ndev].ipath_cfgports > maxofallports)
+ maxofallports = devdata[ndev].ipath_cfgports;
+ }
+
+ /*
+ * user ports start at 1, kernel port is 0
+ * For now, we do round-robin access across all chips
+ */
+
+ devmax = prefunit!=-1 ? prefunit+1 : infinipath_max;
+recheck:
+ for (i = 1; i < maxofallports; i++) {
+ for (ndev = prefunit!=-1?prefunit:0; ndev < devmax; ndev++) {
+ if (!(devdata[ndev].ipath_flags & IPATH_PRESENT) ||
+ !devdata[ndev].ipath_kregbase
+ || !devdata[ndev].ipath_lid
+ || (devdata[ndev].
+ ipath_flags & (IPATH_LINKDOWN | IPATH_LINKUNK)))
+ break; /* can't use this chip */
+ if (i >= devdata[ndev].ipath_cfgports)
+ break; /* max'ed out on users of this chip */
+ if (!devdata[ndev].ipath_pd[i]) {
+ void *p, *ptmp;
+ p = kmalloc(sizeof(struct ipath_portdata),
+ GFP_KERNEL);
+
+ /*
+ * allocate memory for use in
+ * ipath_tid_update() just once at open,
+ * not per call. Reduces cost of expected
+ * send setup
+ */
+
+ ptmp =
+ kmalloc(devdata[ndev].ipath_rcvtidcnt *
+ sizeof(uint16_t)
+ +
+ devdata[ndev].ipath_rcvtidcnt *
+ sizeof(struct page **), GFP_KERNEL);
+ if (!p || !ptmp) {
+ _IPATH_UNIT_ERROR(ndev,
+ "Unable to allocate portdata memory, failing open\n");
+ ret = -ENOMEM;
+ kfree(p);
+ kfree(ptmp);
+ goto done;
+ }
+ memset(p, 0, sizeof(struct ipath_portdata));
+ devdata[ndev].ipath_pd[i] = p;
+ devdata[ndev].ipath_pd[i]->port_port = i;
+ devdata[ndev].ipath_pd[i]->port_unit = ndev;
+ devdata[ndev].ipath_pd[i]->port_tid_pg_list =
+ ptmp;
+ init_waitqueue_head(&devdata[ndev].ipath_pd[i]->
+ port_wait);
+ }
+ if (!devdata[ndev].ipath_pd[i]->port_cnt) {
+ devdata[ndev].ipath_pd[i]->port_cnt = 1;
+ fp->private_data =
+ (void *)devdata[ndev].ipath_pd[i];
+ _IPATH_PRDBG("%s[%u] opened unit:port %u:%u\n",
+ current->comm, current->pid, ndev,
+ i);
+ devdata[ndev].ipath_pd[i]->port_pid =
+ current->pid;
+ strncpy(devdata[ndev].ipath_pd[i]->port_comm,
+ current->comm,
+ sizeof(devdata[ndev].ipath_pd[i]->
+ port_comm));
+ ipath_stats.sps_ports++;
+ goto done;
+ }
+ }
+ }
+
+ if (npresent) {
+ if (notup) {
+ ret = -ENETDOWN;
+ _IPATH_DBG
+ ("No ports available (none initialized and ready)\n");
+ } else {
+ if (prefunit > 0) { /* if we started above unit 0, retry from 0 */
+ _IPATH_PRDBG("%s[%u] no ports on prefunit %d, clear and re-check\n",
+ current->comm, current->pid, prefunit);
+ devmax = infinipath_max;
+ prefunit = -1;
+ goto recheck;
+ }
+ ret = -EBUSY;
+ _IPATH_DBG("No ports available\n");
+ }
+ } else {
+ ret = -ENXIO;
+ _IPATH_DBG("No boards found\n");
+ }
+
+done:
+ up(&ipath_mutex);
+ return ret;
+}
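
A minimal sketch of the affinity heuristic used in ipath_open() above: with ncpus online cpus and npresent usable boards, a process bound to a single cpu prefers the unit that covers that cpu's slice of the cpu range. The helper name is illustrative:

static int ipath_pref_unit(int curcpu, int ncpus, int npresent)
{
	if (npresent <= 0 || ncpus < npresent)
		return -1;	/* no preference; fall back to round-robin */
	/* e.g. 8 cpus, 2 boards: cpus 0-3 prefer unit 0, cpus 4-7 unit 1 */
	return curcpu / (ncpus / npresent);
}
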
+
+static int ipath_opensma(struct inode *in, struct file *fp)
+{
+ ipath_type s;
+
+ if (ipath_sma_alive) {
+ _IPATH_DBG("SMA already running (pid %u), failing\n",
+ ipath_sma_alive);
+ return -EBUSY;
+ }
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM; /* all SMA functions are root-only */
+
+ for (s = 0; s < infinipath_max; s++) {
+ /* we need at least one infinipath device to be initialized. */
+ if (devdata[s].ipath_flags & IPATH_INITTED) {
+ ipath_sma_alive = current->pid;
+ *devdata[s].ipath_statusp |= IPATH_STATUS_SMA;
+ *devdata[s].ipath_statusp &= ~IPATH_STATUS_OIB_SMA;
+ }
+ }
+ if (ipath_sma_alive) {
+ _IPATH_SMADBG
+ ("SMA device now open, SMA active as PID %u\n",
+ ipath_sma_alive);
+ return 0;
+ }
+ _IPATH_DBG("No hardware yet found and initted, failing\n");
+ return -ENODEV;
+}
+
+
+#ifdef IPATH_DIAG
+static int ipath_opendiag(struct inode *in, struct file *fp)
+{
+ ipath_type s;
+
+ if (ipath_diag_alive) {
+ _IPATH_DBG("Diags already running (pid %u), failing\n",
+ ipath_diag_alive);
+ return -EBUSY;
+ }
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM; /* all diags functions are root-only */
+
+ for (s = 0; s < infinipath_max; s++)
+ /*
+ * we need at least one infinipath device to be present
+ * (don't use INITTED, because we want to be able to open
+		 * even if the device is in freeze mode, which cleared INITTED).
+		 * There is a small amount of risk to this, which is
+ * why we also verify kregbase is set.
+ */
+
+ if ((devdata[s].ipath_flags & IPATH_PRESENT)
+ && devdata[s].ipath_kregbase) {
+ ipath_diag_alive = current->pid;
+ _IPATH_DBG("diag device now open, active as PID %u\n",
+ ipath_diag_alive);
+ return 0;
+ }
+ _IPATH_DBG("No hardware yet found and initted, failing diags\n");
+ return -ENODEV;
+}
+#endif
+
+/*
+ * clear all TID entries for a port, expected and eager.
+ * Used from ipath_close(), and at chip initialization.
+ */
+
+static void ipath_clear_tids(const ipath_type t, unsigned port)
+{
+ uint64_t __iomem *tidbase;
+ int i;
+ struct ipath_devdata *dd;
+ uint64_t tidval;
+ dd = &devdata[t];
+
+ if (!dd->ipath_kregbase)
+ return;
+
+ /*
+ * chip errata bug 7358, try to work around it by marking invalid
+ * tids as having max length
+ */
+
+ tidval =
+ (-1LL & INFINIPATH_RT_BUFSIZE_MASK) << INFINIPATH_RT_BUFSIZE_SHIFT;
+
+ /*
+ * need to invalidate all of the expected TID entries for this
+ * port, so we don't have valid entries that might somehow get
+ * used (early in next use of this port, or through some bug)
+ * We don't bother with the eager, because they are initialized
+ * each time before receives are enabled; expected aren't
+ */
+
+ tidbase = (uint64_t __iomem *) ((char __iomem *)(dd->ipath_kregbase) +
+ dd->ipath_rcvtidbase +
+ port * dd->ipath_rcvtidcnt *
+ sizeof(*tidbase));
+ _IPATH_VDBG("Invalidate expected TIDs for port %u, tidbase=%p\n", port,
+ tidbase);
+ for (i = 0; i < dd->ipath_rcvtidcnt; i++)
+ ipath_kput_memq(t, &tidbase[i], tidval);
+ yield(); /* don't hog the cpu */
+
+ /* zero the eager TID entries */
+ tidbase = (uint64_t __iomem *)((char __iomem *)(dd->ipath_kregbase) +
+ dd->ipath_rcvegrbase +
+ port * dd->ipath_rcvegrcnt *
+ sizeof(*tidbase));
+
+ for (i = 0; i < dd->ipath_rcvegrcnt; i++)
+ ipath_kput_memq(t, &tidbase[i], tidval);
+ yield(); /* don't hog the cpu */
+}
+
+int ipath_close(struct inode *in, struct file *fp)
+{
+ int ret = 0;
+ struct ipath_portdata *pd;
+
+ _IPATH_VDBG("close on dev %lx, private data %p\n", (long)in->i_rdev,
+ fp->private_data);
+
+ down(&ipath_mutex);
+ if (iminor(in) == IPATH_SMA) {
+ ipath_type s;
+
+ ipath_sma_alive = 0;
+ _IPATH_SMADBG("Closing SMA device\n");
+ for (s = 0; s < infinipath_max; s++) {
+ if (!(devdata[s].ipath_flags & IPATH_INITTED))
+ continue;
+ *devdata[s].ipath_statusp &= ~IPATH_STATUS_SMA;
+ if (devdata[s].verbs_layer.l_flags &
+ IPATH_VERBS_KERNEL_SMA)
+ *devdata[s].ipath_statusp |=
+ IPATH_STATUS_OIB_SMA;
+ }
+ }
+#ifdef IPATH_DIAG
+ else if (iminor(in) == IPATH_DIAG) {
+ ipath_diag_alive = 0;
+ _IPATH_DBG("Closing DIAG device\n");
+ }
+#endif
+ else if (fp->private_data && 255UL < (unsigned long)fp->private_data) {
+ ipath_type t;
+ unsigned port;
+ struct ipath_devdata *dd;
+
+ pd = (struct ipath_portdata *) fp->private_data;
+ port = pd->port_port;
+ fp->private_data = NULL;
+ t = pd->port_unit;
+		if (t >= infinipath_max) {
+ _IPATH_ERROR
+ ("closing, fp %p, pd %p, but unit %x not valid!\n",
+ fp, pd, t);
+ goto done;
+ }
+ dd = &devdata[t];
+
+ if (pd->port_hdrqfull) {
+ _IPATH_PRDBG
+ ("%s[%u] had %u rcvhdrqfull errors during run\n",
+ pd->port_comm, pd->port_pid, pd->port_hdrqfull);
+ pd->port_hdrqfull = 0;
+ }
+
+ if (pd->port_rcvwait_to || pd->port_piowait_to
+ || pd->port_rcvnowait || pd->port_pionowait) {
+ _IPATH_VDBG
+ ("port%u, %u rcv, %u pio wait timeo; %u rcv %u, pio already\n",
+ pd->port_port, pd->port_rcvwait_to,
+ pd->port_piowait_to, pd->port_rcvnowait,
+ pd->port_pionowait);
+ pd->port_rcvwait_to = pd->port_piowait_to =
+ pd->port_rcvnowait = pd->port_pionowait = 0;
+ }
+ if (pd->port_flag) {
+ _IPATH_DBG("port %u port_flag still set to 0x%x\n",
+ pd->port_port, pd->port_flag);
+ pd->port_flag = 0;
+ }
+
+ if (devdata[t].ipath_kregbase) {
+ if (pd->port_rcvhdrtail_uaddr) {
+ pd->port_rcvhdrtail_uaddr = 0;
+ pd->port_rcvhdrtail_kvaddr = NULL;
+ ipath_putpages(1, &pd->port_rcvhdrtail_pagep);
+ pd->port_rcvhdrtail_pagep = NULL;
+ ipath_stats.sps_pageunlocks++;
+ }
+ ipath_kput_kreg_port(t, kr_rcvhdrtailaddr, port, 0ULL);
+ ipath_kput_kreg_port(pd->port_unit, kr_rcvhdraddr,
+ pd->port_port, 0);
+
+ /* clean up the pkeys for this port user */
+ ipath_clean_partkey(pd, dd);
+
+ if (port < dd->ipath_cfgports) {
+ int i = dd->ipath_pbufsport * (port - 1);
+ ipath_disarm_piobufs(t, i, dd->ipath_pbufsport);
+
+ /* atomically clear receive enable port. */
+ atomic_clear_mask(1U <<
+ (INFINIPATH_R_PORTENABLE_SHIFT
+ + port),
+ &devdata[t].ipath_rcvctrl);
+ ipath_kput_kreg(t, kr_rcvctrl,
+ devdata[t].ipath_rcvctrl);
+
+ if (dd->ipath_pageshadow) {
+ /*
+ * unlock any expected TID
+ * entries port still had in use
+ */
+ int port_tidbase =
+ pd->port_port * dd->ipath_rcvtidcnt;
+ int i, cnt = 0, maxtid =
+ port_tidbase + dd->ipath_rcvtidcnt;
+
+ _IPATH_VDBG
+ ("Port %u unlocking any locked expTID pages\n",
+ pd->port_port);
+ for (i = port_tidbase; i < maxtid; i++) {
+ if (dd->ipath_pageshadow[i]) {
+ ipath_putpages(1,
+ &dd->
+ ipath_pageshadow
+ [i]);
+ dd->ipath_pageshadow[i]
+ = NULL;
+ cnt++;
+ ipath_stats.
+ sps_pageunlocks++;
+ }
+ }
+ if (cnt)
+ _IPATH_VDBG
+ ("Port %u had %u expTID entries locked\n",
+ pd->port_port, cnt);
+ if (ipath_stats.sps_pagelocks
+ || ipath_stats.sps_pageunlocks)
+ _IPATH_VDBG
+ ("%llu pages locked, %llu unlocked with"
+ " ipath_m{un}lock\n",
+ ipath_stats.sps_pagelocks,
+ ipath_stats.
+ sps_pageunlocks);
+ }
+ ipath_stats.sps_ports--;
+ _IPATH_PRDBG("%s[%u] closed port %u:%u\n",
+ pd->port_comm, pd->port_pid, t,
+ port);
+ }
+ }
+
+ pd->port_cnt = 0;
+ pd->port_pid = 0;
+
+ ipath_clear_tids(t, pd->port_port);
+
+ ipath_free_pddata(dd, pd->port_port, 0);
+ }
+
+done:
+ up(&ipath_mutex);
+
+ return ret;
+}

2005-12-29 00:43:37

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 6 of 20] ipath - driver debugging headers

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r 2d9a3f27a10c -r 9e8d017ed298 drivers/infiniband/hw/ipath/ipath_debug.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_debug.h Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,98 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#ifndef _IPATH_DEBUG_H
+#define _IPATH_DEBUG_H
+
+#ifndef _IPATH_DEBUGGING /* debugging enabled or not */
+#define _IPATH_DEBUGGING 1
+#endif
+
+#if _IPATH_DEBUGGING
+
+/*
+ * Mask values for debugging. The scheme allows us to compile out any
+ * of the debug tracing stuff, and if compiled in, to enable or disable
+ * dynamically. This can be set at modprobe time also:
+ * modprobe infinipath.ko infinipath_debug=7
+ */
+
+#define __IPATH_INFO 0x1 /* generic low verbosity stuff */
+#define __IPATH_DBG 0x2 /* generic debug */
+#define __IPATH_TRSAMPLE 0x8 /* generate trace buffer sample entries */
+/* leave some low verbosity spots open */
+#define __IPATH_VERBDBG 0x40 /* very verbose debug */
+#define __IPATH_PKTDBG 0x80 /* print packet data */
+/* print process startup (init)/exit messages */
+#define __IPATH_PROCDBG 0x100
+/* print mmap/nopage stuff, not using VDBG any more */
+#define __IPATH_MMDBG 0x200
+#define __IPATH_USER_SEND 0x1000 /* use user mode send */
+#define __IPATH_KERNEL_SEND 0x2000 /* use kernel mode send */
+#define __IPATH_EPKTDBG 0x4000 /* print ethernet packet data */
+#define __IPATH_SMADBG 0x8000 /* sma packet debug */
+#define __IPATH_IPATHDBG 0x10000 /* Ethernet (IPATH) general debug on */
+#define __IPATH_IPATHWARN 0x20000 /* Ethernet (IPATH) warnings on */
+#define __IPATH_IPATHERR 0x40000 /* Ethernet (IPATH) errors on */
+#define __IPATH_IPATHPD 0x80000 /* Ethernet (IPATH) packet dump on */
+#define __IPATH_IPATHTABLE 0x100000 /* Ethernet (IPATH) table dump on */
+
+#else /* _IPATH_DEBUGGING */
+
+/*
+ * define all of these even with debugging off, for the few places that do
+ * if(infinipath_debug & _IPATH_xyzzy), but in a way that will make the
+ * compiler eliminate the code
+ */
+
+#define __IPATH_INFO 0x0 /* generic low verbosity stuff */
+#define __IPATH_DBG 0x0 /* generic debug */
+#define __IPATH_TRSAMPLE 0x0 /* generate trace buffer sample entries */
+#define __IPATH_VERBDBG 0x0 /* very verbose debug */
+#define __IPATH_PKTDBG 0x0 /* print packet data */
+#define __IPATH_PROCDBG 0x0 /* print process startup (init)/exit messages */
+/* print mmap/nopage stuff, not using VDBG any more */
+#define __IPATH_MMDBG 0x0
+#define __IPATH_EPKTDBG 0x0 /* print ethernet packet data */
+#define __IPATH_SMADBG 0x0 /* sma packet debug */
+#define __IPATH_IPATHDBG 0x0 /* Ethernet (IPATH) general debug on */
+#define __IPATH_IPATHWARN 0x0 /* Ethernet (IPATH) warnings on */
+#define __IPATH_IPATHERR 0x0 /* Ethernet (IPATH) errors on */
+#define __IPATH_IPATHPD 0x0 /* Ethernet (IPATH) packet dump on */
+#define __IPATH_IPATHTABLE 0x0 /* Ethernet (IPATH) table dump on */
+
+#endif /* _IPATH_DEBUGGING */
+
+#endif /* _IPATH_DEBUG_H */
diff -r 2d9a3f27a10c -r 9e8d017ed298 drivers/infiniband/hw/ipath/ipath_kdebug.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_kdebug.h Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,109 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#ifndef _IPATH_KDEBUG_H
+#define _IPATH_KDEBUG_H
+
+#include "ipath_debug.h"
+
+/*
+ * This file contains lightweight kernel tracing code.
+ */
+
+extern unsigned infinipath_debug;
+const char *ipath_get_unit_name(int unit);
+
+#if _IPATH_DEBUGGING
+
+#define _IPATH_UNIT_ERROR(unit,fmt,...) \
+ printk(KERN_ERR "%s: " fmt, ipath_get_unit_name(unit), ##__VA_ARGS__)
+
+#define _IPATH_ERROR(fmt,...) printk(KERN_ERR "infinipath: " fmt, ##__VA_ARGS__)
+
+#define _IPATH_INFO(fmt,...) \
+ do { \
+ if(unlikely(infinipath_debug & __IPATH_INFO)) \
+ printk(KERN_INFO "infinipath: " fmt, ##__VA_ARGS__); \
+ } while(0)
+
+#define __IPATH_DBG_WHICH(which,fmt,...) \
+ do { \
+ if(unlikely(infinipath_debug&(which))) \
+ printk(KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \
+ } while(0)
+
+#define _IPATH_DBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_DBG,fmt,##__VA_ARGS__)
+#define _IPATH_VDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_VERBDBG,fmt,##__VA_ARGS__)
+#define _IPATH_PDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_PKTDBG,fmt,##__VA_ARGS__)
+#define _IPATH_EPDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_EPKTDBG,fmt,##__VA_ARGS__)
+#define _IPATH_PRDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_PROCDBG,fmt,##__VA_ARGS__)
+#define _IPATH_MMDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_MMDBG,fmt,##__VA_ARGS__)
+#define _IPATH_SMADBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_SMADBG,fmt,##__VA_ARGS__)
+#define _IPATH_IPATHDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHDBG,fmt,##__VA_ARGS__)
+#define _IPATH_IPATHWARN(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHWARN,fmt,##__VA_ARGS__)
+#define _IPATH_IPATHERR(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHERR ,fmt,##__VA_ARGS__)
+#define _IPATH_IPATHPD(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHPD ,fmt,##__VA_ARGS__)
+#define _IPATH_IPATHTABLE(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHTABLE ,fmt,##__VA_ARGS__)
+
+#else /* ! _IPATH_DEBUGGING */
+
+#define _IPATH_UNIT_ERROR(unit,fmt,...) \
+ do { \
+ printk(KERN_ERR "%s" fmt, "",##__VA_ARGS__); \
+ } while(0)
+
+#define _IPATH_ERROR(fmt,...) \
+ do { \
+ printk (KERN_ERR "%s" fmt, "",##__VA_ARGS__); \
+ } while(0)
+
+#define _IPATH_INFO(fmt,...)
+#define _IPATH_DBG(fmt,...)
+#define _IPATH_PDBG(fmt,...)
+#define _IPATH_EPDBG(fmt,...)
+#define _IPATH_PRDBG(fmt,...)
+#define _IPATH_VDBG(fmt,...)
+#define _IPATH_MMDBG(fmt,...)
+#define _IPATH_SMADBG(fmt,...)
+#define _IPATH_IPATHDBG(fmt,...)
+#define _IPATH_IPATHWARN(fmt,...)
+#define _IPATH_IPATHERR(fmt,...)
+#define _IPATH_IPATHPD(fmt,...)
+#define _IPATH_IPATHTABLE(fmt,...)
+
+#endif /* _IPATH_DEBUGGING */
+
+#endif /* _IPATH_KDEBUG_H */
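
As a usage sketch of these two headers (illustrative only): the mask values OR together, so loading with infinipath_debug=0x8080 enables __IPATH_SMADBG | __IPATH_PKTDBG, and a call such as the one below prints only when its mask bit is set, and expands to nothing when _IPATH_DEBUGGING is 0:

static inline void ipath_trace_sma_pkt(unsigned unit, unsigned len)
{
	/* prints only if (infinipath_debug & __IPATH_SMADBG) is set */
	_IPATH_SMADBG("unit %u: SMA packet, len %u\n", unit, len);
}
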

2005-12-29 00:44:50

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 9 of 20] ipath - core driver, part 2 of 4

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r ddd21709e12c -r dad2e87e21f4 drivers/infiniband/hw/ipath/ipath_driver.c
--- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800
@@ -1877,3 +1877,2004 @@

return ret;
}
+
+/*
+ * cancel a range of PIO buffers, used when they might be armed, but
+ * not triggered. Used at init to ensure buffer state, and also user
+ * process close, in case it died while writing to a PIO buffer
+ */
+
+static void ipath_disarm_piobufs(const ipath_type t, unsigned first,
+ unsigned cnt)
+{
+ unsigned i, last = first + cnt;
+ uint64_t sendctrl;
+ for (i = first; i < last; i++) {
+ sendctrl = devdata[t].ipath_sendctrl | INFINIPATH_S_DISARM |
+ (i << INFINIPATH_S_DISARMPIOBUF_SHIFT);
+ ipath_kput_kreg(t, kr_sendctrl, sendctrl);
+ }
+}
+
+static void ipath_clean_partkey(struct ipath_portdata * pd,
+ struct ipath_devdata * dd)
+{
+ int i, j, pchanged = 0;
+ uint64_t oldpkey;
+
+ /* for debugging only */
+ oldpkey =
+ (uint64_t) dd->ipath_pkeys[0] | ((uint64_t) dd->
+ ipath_pkeys[1] << 16)
+ | ((uint64_t) dd->ipath_pkeys[2] << 32)
+ | ((uint64_t) dd->ipath_pkeys[3] << 48);
+
+ for (i = 0; i < (sizeof(pd->port_pkeys) / sizeof(pd->port_pkeys[0]));
+ i++) {
+ if (!pd->port_pkeys[i])
+ continue;
+ _IPATH_VDBG("look for key[%d] %hx in pkeys\n", i,
+ pd->port_pkeys[i]);
+ for (j = 0;
+ j < (sizeof(dd->ipath_pkeys) / sizeof(dd->ipath_pkeys[0]));
+ j++) {
+ /* check for match independent of the global bit */
+ if ((dd->ipath_pkeys[j] & 0x7fff) ==
+ (pd->port_pkeys[i] & 0x7fff)) {
+ if (atomic_dec_and_test(&dd->ipath_pkeyrefs[j])) {
+ _IPATH_VDBG
+ ("p%u clear key %x matches #%d\n",
+ pd->port_port, pd->port_pkeys[i],
+ j);
+ ipath_stats.sps_pkeys[j] =
+ dd->ipath_pkeys[j] = 0;
+ pchanged++;
+ } else
+ _IPATH_VDBG
+ ("p%u key %x matches #%d, but ref still %d\n",
+ pd->port_port, pd->port_pkeys[i],
+ j,
+ atomic_read(&dd->
+ ipath_pkeyrefs[j]));
+ break;
+ }
+ }
+ pd->port_pkeys[i] = 0;
+ }
+ if (pchanged) {
+ uint64_t pkey;
+ pkey =
+ (uint64_t) dd->ipath_pkeys[0] | ((uint64_t) dd->
+ ipath_pkeys[1] << 16)
+ | ((uint64_t) dd->ipath_pkeys[2] << 32)
+ | ((uint64_t) dd->ipath_pkeys[3] << 48);
+ _IPATH_VDBG("p%u old pkey reg %llx, new pkey reg %llx\n",
+ pd->port_port, oldpkey, pkey);
+ ipath_kput_kreg(pd->port_unit, kr_partitionkey, pkey);
+ }
+}
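
The packing done twice in ipath_clean_partkey() above (four 16-bit pkeys into the 64-bit partition key register) could be expressed as a small helper; a sketch with an illustrative name:

static uint64_t ipath_pack_pkeys(const uint16_t pkeys[4])
{
	return (uint64_t) pkeys[0] |
		((uint64_t) pkeys[1] << 16) |
		((uint64_t) pkeys[2] << 32) |
		((uint64_t) pkeys[3] << 48);
}
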
+
+static unsigned int ipath_poll(struct file *fp, struct poll_table_struct *pt)
+{
+ int ret;
+ struct ipath_portdata *pd;
+
+ pd = port_fp(fp);
+ /* nothing for select/poll in this driver, at least for now */
+ ret = 0;
+
+ return ret;
+}
+
+/*
+ * wait up to msecs milliseconds for IB link state change to occur
+ * for now, take the easy polling route. Currently used only by
+ * the SMA ioctls. Returns 0 if state reached, otherwise -ETIMEDOUT
+ * state can have multiple states set, for any of several transitions.
+ */
+
+int ipath_wait_linkstate(const ipath_type t, uint32_t state, int msecs)
+{
+ devdata[t].ipath_sma_state_wanted = state;
+ wait_event_interruptible_timeout(ipath_sma_state_wait,
+ (devdata[t].ipath_flags & state),
+ msecs_to_jiffies(msecs));
+ devdata[t].ipath_sma_state_wanted = 0;
+
+ if (!(devdata[t].ipath_flags & state))
+ _IPATH_DBG
+ ("Didn't reach linkstate %s within %u ms (ibcc %llx %s)\n",
+ /* test INIT ahead of DOWN, both can be set */
+ (state & IPATH_LINKINIT) ? "INIT" :
+ ((state & IPATH_LINKDOWN) ? "DOWN" :
+ ((state & IPATH_LINKARMED) ? "ARM" : "ACTIVE")),
+ msecs, ipath_kget_kreg64(t, kr_ibcctrl),
+ ipath_ibcstatus_str[ipath_kget_kreg64(t, kr_ibcstatus) &
+ 0xf]);
+ return (devdata[t].ipath_flags & state) ? 0 : -ETIMEDOUT;
+}
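
A usage sketch of ipath_wait_linkstate(), assuming a caller that has just issued the LINKCMD to move the link to ARM; the wrapper name is illustrative:

static int ipath_wait_for_arm(const ipath_type t)
{
	/* block up to 500 ms for the IB link to reach the ARM state */
	return ipath_wait_linkstate(t, IPATH_LINKARMED, 500);
}
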
+
+/* unit number is already validated in ipath_ioctl() */
+static int ipath_kset_lid(uint32_t arg)
+{
+ unsigned unit = (arg >> 16) & 0xffff;
+
+ if (unit >= infinipath_max
+ || !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", unit);
+ return -ENODEV;
+ }
+
+ arg &= 0xffff;
+ _IPATH_SMADBG("Unit %u setting lid to 0x%x, was 0x%x\n", unit, arg,
+ devdata[unit].ipath_lid);
+ ipath_set_sps_lid(unit, arg);
+ return 0;
+}
+
+static int ipath_kset_mlid(uint32_t arg)
+{
+ unsigned unit = (arg >> 16) & 0xffff;
+
+ if (unit >= infinipath_max
+ || !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", unit);
+ return -ENODEV;
+ }
+
+ arg &= 0xffff;
+ _IPATH_SMADBG("Unit %u setting mlid to 0x%x, was 0x%x\n", unit, arg,
+ devdata[unit].ipath_mlid);
+ ipath_stats.sps_mlid[unit] = devdata[unit].ipath_mlid = arg;
+ if (devdata[unit].ipath_layer.l_intr)
+ devdata[unit].ipath_layer.l_intr(unit, IPATH_LAYER_INT_BCAST);
+ return 0;
+}
+
+/* unit number is in incoming, overwritten on return with data */
+
+static int ipath_get_devstatus(uint64_t __user *a)
+{
+ int ret;
+ uint64_t unit64;
+ uint32_t unit;
+ uint64_t devstatus;
+
+ if ((ret = copy_from_user(&unit64, a, sizeof unit64))) {
+ _IPATH_DBG("Failed to copy in unit: %d\n", ret);
+ return -EFAULT;
+ }
+ unit = unit64;
+ if (unit >= infinipath_max
+ || !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", unit);
+ return -ENODEV;
+ }
+
+ devstatus = *devdata[unit].ipath_statusp;
+
+ if ((ret = copy_to_user(a, &devstatus, sizeof devstatus))) {
+ _IPATH_DBG("Failed to copy out device status: %d\n", ret);
+ ret = -EFAULT;
+ }
+ return ret;
+}
+
+/* unit number is in incoming, overwritten on return with data */
+
+static int ipath_get_mlid(uint32_t __user *a)
+{
+ int ret;
+ uint32_t unit;
+ uint32_t mlid;
+
+ if ((ret = copy_from_user(&unit, a, sizeof unit))) {
+ _IPATH_DBG("Failed to copy in mlid: %d\n", ret);
+ return -EFAULT;
+ }
+ if (unit >= infinipath_max
+ || !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", unit);
+ return -ENODEV;
+ }
+
+ mlid = devdata[unit].ipath_mlid;
+
+ if ((ret = copy_to_user(a, &mlid, sizeof mlid))) {
+ _IPATH_DBG("Failed to copy out MLID: %d\n", ret);
+ ret = -EFAULT;
+ }
+ return ret;
+}
+
+static int ipath_kset_guid(struct ipath_setguid __user *a)
+{
+ struct ipath_setguid setguid;
+ int ret;
+
+ if ((ret = copy_from_user(&setguid, a, sizeof setguid))) {
+ _IPATH_DBG("Failed to copy in guid info: %d\n", ret);
+ return -EFAULT;
+ }
+ if (setguid.sunit >= infinipath_max ||
+ !(devdata[setguid.sunit].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %llu\n", setguid.sunit);
+ return -ENODEV;
+ }
+ if (setguid.sguid == 0ULL || setguid.sguid == -1LL) {
+ /*
+ * use INFO, not DBG, because ipath_mux doesn't yet
+ * complain about errors on this
+ */
+
+ _IPATH_INFO("Ignoring attempt to set invalid GUID %llx\n",
+ setguid.sguid);
+ return -EINVAL;
+ }
+ devdata[setguid.sunit].ipath_guid = setguid.sguid;
+ devdata[setguid.sunit].ipath_nguid = 1;
+ _IPATH_DBG("SMA set hardware GUID unit %llu to %llx (network order)\n",
+ setguid.sunit, devdata[setguid.sunit].ipath_guid);
+ return 0;
+}
+
+/*
+ * receive an IB packet with QP 0 or 1. For now, we have no timeout implemented.
+ * We put the actual received count into the iov on return, and the unit we
+ * received from goes into the lower 16 bits of sps_flags.
+ * This receives from all/any of the active chips, and we currently do not
+ * allow specifying just one (we could, by filling in unit in the library
+ * before the syscall, and checking here).
+ */
+
+static int ipath_rcvsma_pkt(struct ipath_sendpkt __user *p)
+{
+ struct ipath_sendpkt rpkt;
+ int i, any, ret;
+ unsigned long flags;
+
+ if ((ret = copy_from_user(&rpkt, p, sizeof rpkt))) {
+ _IPATH_DBG("Failed to copy in pkt struct (%d)\n", ret);
+ return -EFAULT;
+ }
+ if (!ipath_sma_data_spare) {
+ _IPATH_DBG("can't do receive, sma not initialized\n");
+ return -ENETDOWN;
+ }
+
+ for (any = i = 0; i < infinipath_max; i++)
+ if (devdata[i].ipath_flags & IPATH_INITTED)
+ any++;
+ if (!any) { /* no hardware, freeze, etc. */
+ _IPATH_SMADBG("Didn't find any initialized and usable chips\n");
+ return -ENODEV;
+ }
+
+ wait_event_interruptible(ipath_sma_wait,
+ ipath_sma_data[ipath_sma_first].len);
+
+ spin_lock_irqsave(&ipath_sma_lock, flags);
+ if (ipath_sma_data[ipath_sma_first].len) {
+ int len;
+ uint32_t slen;
+ uint8_t *sdata;
+ struct _ipath_sma_rpkt *smpkt =
+ &ipath_sma_data[ipath_sma_first];
+
+ /*
+ * we swap out the buffer we are going to use with the
+ * spare buffer and set spare to that buffer. This code
+ * is the only code that ever manipulates spare, other
+ * than the initialization code. This code should never
+ * be entered by more than one process at a time, and
+ * if it is, the user code doing so deserves what it gets;
+ * it won't break anything in the driver by doing so.
+ * We do it this way to avoid holding a lock across the
+ * copy_to_user, which could fault, or delay a long time
+ * while paging occurs; ditto for printks
+ */
+
+ slen = smpkt->len;
+ sdata = smpkt->buf;
+ rpkt.sps_flags = smpkt->unit;
+ smpkt->buf = ipath_sma_data_spare;
+ ipath_sma_data_spare = sdata;
+ smpkt->len = 0; /* it's available again */
+ if (++ipath_sma_first >= IPATH_NUM_SMAPKTS)
+ ipath_sma_first = 0;
+ spin_unlock_irqrestore(&ipath_sma_lock, flags);
+
+ len = min((uint32_t) rpkt.sps_iov[0].iov_len, slen);
+ ret = copy_to_user((void __user *) rpkt.sps_iov[0].iov_base,
+ sdata, len);
+ _IPATH_VDBG("SMA packet (index=%d), len %d (actual %d) "
+ "buf %p, ubuf %llx\n", ipath_sma_first, slen,
+ len, sdata, rpkt.sps_iov[0].iov_base);
+ if (!ret) {
+ /* actual length read. */
+ rpkt.sps_iov[0].iov_len = len;
+ rpkt.sps_cnt = 1; /* received one packet */
+ if ((ret = copy_to_user(p, &rpkt, sizeof rpkt))) {
+ _IPATH_DBG("Failed to copy out pkt struct "
+ "(%d)\n", ret);
+ ret = -EFAULT;
+ }
+ } else {
+ _IPATH_DBG("copyout failed: %d\n", ret);
+ ret = -EFAULT;
+ }
+ } else {
+ /* usually means SMA process received a signal */
+ spin_unlock_irqrestore(&ipath_sma_lock, flags);
+ return -EAGAIN;
+ }
+
+ return ret;
+}
+
+/* unit number is in first word incoming, overwritten on return with data */
+static int ipath_get_portinfo(uint32_t __user *a)
+{
+ int ret;
+ uint32_t unit, tmp, tmp2;
+ struct ipath_devdata *dd;
+	uint32_t portinfo[13];	/* just the data for Portinfo, in host order */
+
+ if ((ret = copy_from_user(&unit, a, sizeof unit))) {
+ _IPATH_DBG("Failed to copy in portinfo: %d\n", ret);
+ return -EFAULT;
+ }
+ if (unit >= infinipath_max
+ || !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG("Invalid unit %u\n", unit);
+ return -ENODEV;
+ }
+ dd = &devdata[unit];
+ /* so we only initialize non-zero fields. */
+ memset(portinfo, 0, sizeof portinfo);
+
+ /*
+ * Notimpl yet M_Key (64)
+ * Notimpl yet GID (64)
+ */
+
+ portinfo[4] = (dd->ipath_lid << 16);
+
+ /*
+ * Notimpl yet SMLID (should we store this in the driver, in
+ * case SMA dies?)
+ * CapabilityMask is 0, we don't support any of these
+ * DiagCode is 0; we don't store any diag info for now
+ * Notimpl yet M_KeyLeasePeriod (we don't support M_Key)
+ */
+
+ /* LocalPortNum is whichever port number they ask for */
+ portinfo[7] = (unit << 24)
+ /* LinkWidthEnabled */
+ |(2 << 16)
+ /* LinkWidthSupported (really 2, but that's not IB valid...) */
+ |(3 << 8)
+ /* LinkWidthActive */
+ |(2 << 0);
+ tmp = dd->ipath_lastibcstat & 0xff;
+ tmp2 = 5;
+ if (tmp == 0x11)
+ tmp = 2;
+ else if (tmp == 0x21)
+ tmp = 3;
+ else if (tmp == 0x31)
+ tmp = 4;
+ else {
+ tmp = 0; /* down */
+ tmp2 = tmp & 0xf;
+ }
+ portinfo[8] = (1 << 28) /* LinkSpeedSupported */
+ |(tmp << 24) /* PortState */
+ |(tmp2 << 20) /* PortPhysicalState */
+ |(2 << 16)
+
+ /* LinkDownDefaultState */
+ /* M_KeyProtectBits == 0 */
+ /* NotImpl yet LMC == 0 (we can support all values) */
+ |(1 << 4) /* LinkSpeedActive */
+ |(1 << 0); /* LinkSpeedEnabled */
+ switch (dd->ipath_ibmtu) {
+ case 4096:
+ tmp = 5;
+ break;
+ case 2048:
+ tmp = 4;
+ break;
+ case 1024:
+ tmp = 3;
+ break;
+ case 512:
+ tmp = 2;
+ break;
+ case 256:
+ tmp = 1;
+ break;
+ default: /* oops, something is wrong */
+ _IPATH_DBG
+ ("Problem, ipath_ibmtu 0x%x not a valid IB MTU, treat as 2048\n",
+ dd->ipath_ibmtu);
+ tmp = 4;
+ break;
+ }
+ portinfo[9] = (tmp << 28)
+ /* NeighborMTU */
+ /* Notimpl MasterSMSL */
+ |(1 << 20)
+
+ /* VLCap */
+ /* Notimpl InitType (actually, an SMA decision) */
+ /* VLHighLimit is 0 (only one VL) */
+ ; /* VLArbitrationHighCap is 0 (only one VL) */
+ portinfo[10] = /* VLArbitrationLowCap is 0 (only one VL) */
+ /* InitTypeReply is SMA decision */
+ (5 << 16) /* MTUCap 4096 */
+ |(7 << 13) /* VLStallCount */
+ |(0x1f << 8) /* HOQLife */
+ |(1 << 4) /* OperationalVLs 0 */
+
+ /* PartitionEnforcementInbound */
+ /* PartitionEnforcementOutbound not enforced */
+ /* FilterRawinbound not enforced */
+ ; /* FilterRawOutbound not enforced */
+ /* M_KeyViolations are not counted by hardware, SMA can count */
+ tmp = ipath_kget_creg32(unit, cr_errpkey);
+ /* P_KeyViolations are counted by hardware. */
+ portinfo[11] = ((tmp & 0xffff) << 0);
+ portinfo[12] =
+ /* Q_KeyViolations are not counted by hardware */
+ (1 << 8)
+
+ /* GUIDCap */
+ /* SubnetTimeOut handled by SMA */
+ /* RespTimeValue handled by SMA */
+ ;
+ /* LocalPhyErrors are programmed to max */
+ portinfo[12] |= (0xf << 20)
+ |(0xf << 16) /* OverRunErrors are programmed to max */
+ ;
+
+ if ((ret = copy_to_user(a, portinfo, sizeof portinfo))) {
+ _IPATH_DBG("Failed to copy out portinfo: %d\n", ret);
+ ret = -EFAULT;
+ }
+ return ret;
+}
+
+/* unit number is in first word incoming, overwritten on return with data */
+static int ipath_get_nodeinfo(uint32_t __user *a)
+{
+ int ret;
+ uint32_t unit; /*, tmp, tmp2; */
+ struct ipath_devdata *dd;
+	uint32_t nodeinfo[10];	/* just the data for Nodeinfo, in host order */
+
+ if ((ret = copy_from_user(&unit, a, sizeof unit))) {
+ _IPATH_DBG("Failed to copy in nodeinfo: %d\n", ret);
+ return -EFAULT;
+ }
+ if (unit >= infinipath_max
+ || !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+ /* VDBG because sma normally probes for all possible units */
+ _IPATH_VDBG("Invalid unit %u\n", unit);
+ return -ENODEV;
+ }
+ dd = &devdata[unit];
+
+	/* zero everything, so below we only set the non-zero fields. */
+ memset(nodeinfo, 0, sizeof nodeinfo);
+
+ nodeinfo[0] = /* BaseVersion is SMA */
+ /* ClassVersion is SMA */
+ (1 << 8) /* NodeType */
+ |(1 << 0); /* NumPorts */
+ nodeinfo[1] = (uint32_t) (dd->ipath_guid >> 32);
+ nodeinfo[2] = (uint32_t) (dd->ipath_guid & 0xffffffff);
+ nodeinfo[3] = nodeinfo[1]; /* PortGUID == SystemImageGUID for us */
+ nodeinfo[4] = nodeinfo[2]; /* PortGUID == SystemImageGUID for us */
+ nodeinfo[5] = nodeinfo[3]; /* PortGUID == NodeGUID for us */
+ nodeinfo[6] = nodeinfo[4]; /* PortGUID == NodeGUID for us */
+ nodeinfo[7] = (4 << 16) /* we support 4 pkeys */
+ |(dd->ipath_deviceid << 0);
+ /* our chip version as 16 bits major, 16 bits minor */
+ nodeinfo[8] = dd->ipath_minrev | (dd->ipath_majrev << 16);
+ nodeinfo[9] = (unit << 24) | (dd->ipath_vendorid << 0);
+
+ if ((ret = copy_to_user(a, nodeinfo, sizeof nodeinfo))) {
+ _IPATH_DBG("Failed to copy out nodeinfo: %d\n", ret);
+ ret = -EFAULT;
+ }
+ return ret;
+}
+
+static int ipath_sma_ioctl(struct file *fp, unsigned int cmd, unsigned long a)
+{
+ int ret = 0;
+ switch (cmd) {
+ case IPATH_SEND_SMA_PKT: /* send SMA packet */
+ if (!(ret = ipath_send_smapkt((struct ipath_sendpkt __user *) a)))
+ /* another SMA packet sent */
+ ipath_stats.sps_sma_spkts++;
+ break;
+	case IPATH_RCV_SMA_PKT:	/* receive an SMA or MAD packet */
+ ret = ipath_rcvsma_pkt((struct ipath_sendpkt __user *) a);
+ break;
+ case IPATH_SET_LID: /* set our lid, (SMA) */
+ ret = ipath_kset_lid((uint32_t) a);
+ break;
+ case IPATH_SET_MTU: /* set the IB mtu (not maxpktlen) (SMA) */
+ ret = ipath_kset_mtu((uint32_t) a);
+ break;
+ case IPATH_SET_LINKSTATE:
+ /* walk through the linkstate states (SMA) */
+ ret = ipath_kset_linkstate((uint32_t) a);
+ break;
+ case IPATH_GET_PORTINFO: /* get the SMA portinfo */
+ ret = ipath_get_portinfo((uint32_t __user *) a);
+ break;
+ case IPATH_GET_NODEINFO: /* get the SMA nodeinfo */
+ ret = ipath_get_nodeinfo((uint32_t __user *) a);
+ break;
+ case IPATH_SET_GUID:
+ /*
+ * set our guid, (SMA). This is not normally
+ * used, but provides a way to set the GUID when the i2c flash
+ * has a problem, or for special testing.
+ */
+ ret = ipath_kset_guid((struct ipath_setguid __user *) a);
+ break;
+ case IPATH_SET_MLID: /* set multicast LID for ipath broadcast */
+ ret = ipath_kset_mlid((uint32_t) a);
+ break;
+ case IPATH_GET_MLID: /* get multicast LID for ipath broadcast */
+ ret = ipath_get_mlid((uint32_t __user *) a);
+ break;
+ case IPATH_GET_DEVSTATUS: /* get device status */
+ ret = ipath_get_devstatus((uint64_t __user *) a);
+ break;
+ default:
+ _IPATH_DBG("%x not a valid SMA ioctl for infinipath\n", cmd);
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static int ipath_get_unit_counters(struct infinipath_getunitcounters __user *a)
+{
+ struct infinipath_getunitcounters c;
+
+ if (copy_from_user(&c, a, sizeof c))
+ return -EFAULT;
+
+ if (c.unit >= infinipath_max ||
+ !(devdata[c.unit].ipath_flags & IPATH_PRESENT))
+ return -ENODEV;
+
+ return ipath_get_counters(c.unit,
+ (struct infinipath_counters __user *) c.data);
+}
+
+/*
+ * ioctls for the control device, which is useful when you don't want
+ * to open the main device and use up a port.
+ */
+
+static int ipath_ctrl_ioctl(struct file *fp, unsigned int cmd, unsigned long a)
+{
+ int ret = 0;
+
+ switch (cmd) {
+ case IPATH_GETSTATS: /* return driver stats */
+ ret = ipath_get_stats((struct infinipath_stats __user *) a);
+ break;
+ case IPATH_GETUNITCOUNTERS: /* return chip counters */
+ ret = ipath_get_unit_counters((struct infinipath_getunitcounters __user *) a);
+ break;
+ default:
+ _IPATH_DBG("%x not a valid CTRL ioctl for infinipath\n", cmd);
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+long ipath_ioctl(struct file *fp, unsigned int cmd, unsigned long a)
+{
+ int ret = 0;
+ struct ipath_portdata *pd;
+ ipath_type unit;
+ uint32_t tmp, i, nactive = 0;
+
+ if (cmd == IPATH_GETUNITS) {
+ /*
+ * Return number of units supported. This is called
+ * here as this ioctl is needed via both the normal and
+ * diags interface, and it does not need the device to
+ * be opened.
+ */
+ return ipath_get_units();
+ }
+
+ pd = port_fp(fp);
+ if (!pd) {
+ if (IPATH_SMA == (unsigned long)fp->private_data)
+ /* sma separate; no pd */
+ return (long)ipath_sma_ioctl(fp, cmd, a);
+#ifdef IPATH_DIAG
+ else if (IPATH_DIAG == (unsigned long)fp->private_data)
+ /* diags separate; no pd */
+ return (long)ipath_diags_ioctl(fp, cmd, a);
+#endif
+ else if (IPATH_CTRL == (unsigned long)fp->private_data)
+ /* ctrl separate; no pd */
+ return (long)ipath_ctrl_ioctl(fp, cmd, a);
+ else {
+ _IPATH_DBG("NULL pd from fp (%p), cmd=%x\n", fp, cmd);
+ return -ENODEV; /* bad; shouldn't ever happen */
+ }
+ }
+
+ unit = pd->port_unit;
+
+ if ((devdata[unit].ipath_flags & IPATH_PRESENT)
+ && (cmd == IPATH_GETCOUNTERS || cmd == IPATH_GETSTATS
+ || cmd == IPATH_READ_EEPROM || cmd == IPATH_WRITE_EEPROM)) {
+ /* allowed to do these, as long as chip is accessible */
+ } else if (!(devdata[unit].ipath_flags & IPATH_INITTED)) {
+ _IPATH_DBG
+ ("%s not initialized (flags=0x%x), failing ioctl #%u\n",
+ ipath_get_unit_name(unit), devdata[unit].ipath_flags,
+ _IOC_NR(cmd));
+ ret = -ENODEV;
+ } else
+ if ((devdata[unit].
+ ipath_flags & (IPATH_LINKDOWN | IPATH_LINKUNK))) {
+ _IPATH_DBG("%s link is down, failing ioctl #%u\n",
+ ipath_get_unit_name(unit), _IOC_NR(cmd));
+ ret = -ENETDOWN;
+ }
+
+ if (ret)
+ return ret;
+
+ switch (cmd) {
+ case IPATH_USERINIT:
+ /* real application is starting on a port */
+ ret = ipath_do_user_init(pd, (struct ipath_user_info __user *) a);
+ break;
+ case IPATH_BASEINFO:
+ /* it's done the init, now return the info it needs */
+ ret = ipath_get_baseinfo(pd, (struct ipath_base_info __user *) a);
+ break;
+ case IPATH_GETPORT:
+ /*
+ * just return the unit:port that we were assigned,
+		 * and the number of active chips. This is used for
+ * doing sched_setaffinity() before initialization.
+ */
+ for (i = 0; i < infinipath_max; i++)
+ if ((devdata[i].ipath_flags & IPATH_PRESENT)
+ && devdata[i].ipath_kregbase
+ && devdata[i].ipath_lid
+ && !(devdata[i].ipath_flags &
+ (IPATH_LINKDOWN | IPATH_LINKUNK)))
+ nactive++;
+ tmp = (nactive << 24) | (unit << 16) | pd->port_port;
+ if (copy_to_user((void __user *) a, &tmp, sizeof(tmp)))
+			ret = -EFAULT;
+ break;
+ case IPATH_GETLID:
+ /* get LID for given unit # */
+ ret = ipath_layer_get_lid(a);
+ break;
+ case IPATH_UPDM_TID: /* update expected TID entries */
+ ret = ipath_tid_update(pd, (struct _tidupd __user *) a);
+ break;
+ case IPATH_FREE_TID: /* free expected TID entries */
+ ret = ipath_tid_free(pd, (struct _tidupd __user *) a);
+ break;
+ case IPATH_GETCOUNTERS: /* return chip counters */
+ ret = ipath_get_counters(unit, (struct infinipath_counters __user *) a);
+ break;
+ case IPATH_GETSTATS: /* return driver stats */
+ ret = ipath_get_stats((struct infinipath_stats __user *) a);
+ break;
+ case IPATH_GETUNITCOUNTERS: /* return chip counters */
+ ret = ipath_get_unit_counters(
+ (struct infinipath_getunitcounters __user *) a);
+ break;
+ case IPATH_SET_PKEY: /* set a partition key */
+ ret = ipath_set_partkey(pd, (uint16_t) a);
+ break;
+ case IPATH_RCVCTRL: /* error handling to manage the rcvq */
+ ret = ipath_manage_rcvq(pd, (uint16_t) a);
+ break;
+ case IPATH_WRITE_EEPROM:
+ /* write the eeprom (for GUID) */
+ ret = ipath_wr_eeprom(pd,
+ (struct ipath_eeprom_req __user *) a);
+ break;
+ case IPATH_READ_EEPROM: /* read the eeprom (for GUID) */
+ ret = ipath_rd_eeprom(pd->port_unit,
+ (struct ipath_eeprom_req __user *) a);
+ break;
+ case IPATH_WAIT:
+ /*
+ * wait for a receive intr for this port, or PIO avail
+ */
+ ret = ipath_wait_intr(pd, (uint32_t) a);
+ break;
+
+ default:
+ _IPATH_DBG("cmd %x (%c,%u) not a valid ioctl\n", cmd,
+ _IOC_TYPE(cmd), _IOC_NR(cmd));
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+static loff_t ipath_llseek(struct file *fp, loff_t off, int whence)
+{
+ loff_t ret;
+
+ /* range checking is done where offset is used, not here. */
+ down(&fp->f_dentry->d_inode->i_sem);
+ if (!whence)
+ ret = fp->f_pos = off;
+ else if (whence == 1) {
+ fp->f_pos += off;
+ ret = fp->f_pos;
+ } else
+ ret = -EINVAL;
+ up(&fp->f_dentry->d_inode->i_sem);
+ _IPATH_DBG("New offset %llx from seek %llx whence=%d\n", fp->f_pos, off,
+ whence);
+
+ return ret;
+}
+
+/*
+ * We use this to have a shared buffer between the kernel and the user
+ * code for the rcvhdr queue, egr buffers, and the per-port user regs and pio
+ * buffers in the chip. We have the open and close entries so we can bump
+ * the ref count and keep the driver from being unloaded while still mapped.
+ */
+
+static struct vm_operations_struct ipath_vmops = {
+ .nopage = ipath_nopage,
+};
+
+static int ipath_mmap(struct file *fp, struct vm_area_struct *vm)
+{
+ int setlen = 0, ret = -EINVAL;
+ struct ipath_portdata *pd;
+
+ if (fp->private_data && 255UL < (unsigned long)fp->private_data) {
+ pd = port_fp(fp);
+ {
+ /*
+ * This is the ipath_do_user_init() code,
+ * mapping the shared buffers into the user
+ * process. The address referred to by vm_pgoff
+ * is the virtual, not physical, address; we only
+ * do one mmap for each space mapped.
+ */
+ uint64_t pgaddr, ureg;
+
+ pgaddr = vm->vm_pgoff << PAGE_SHIFT;
+
+ /*
+ * note that ureg does *NOT* have the kregvirt
+ * as part of it, to be sure that for 32 bit
+ * programs, we don't end up trying to map
+			 * a > 44 bit address. Has to match ipath_get_baseinfo()
+ * code that sets __spi_uregbase
+ */
+
+ ureg = devdata[pd->port_unit].ipath_uregbase +
+ devdata[pd->port_unit].ipath_palign * pd->port_port;
+
+ _IPATH_MMDBG
+ ("ushare: pgaddr %llx vm_start=%lx, vmlen %lx\n",
+ pgaddr, vm->vm_start, vm->vm_end - vm->vm_start);
+
+ if (pgaddr == ureg) {
+ /* it's the real hardware, so io_remap works */
+ unsigned long phys;
+ if ((vm->vm_end - vm->vm_start) > PAGE_SIZE) {
+ _IPATH_INFO
+ ("FAIL mmap userreg: reqlen %lx > PAGE\n",
+ vm->vm_end - vm->vm_start);
+ ret = -EFAULT;
+ } else {
+ phys =
+ devdata[pd->port_unit].
+ ipath_physaddr + ureg;
+ vm->vm_page_prot =
+ pgprot_noncached(vm->vm_page_prot);
+
+ vm->vm_flags |=
+ VM_DONTCOPY | VM_DONTEXPAND | VM_IO
+ | VM_SHM | VM_LOCKED;
+ ret =
+ io_remap_pfn_range(vm, vm->vm_start, phys >> PAGE_SHIFT,
+ vm->vm_end - vm->vm_start,
+ vm->vm_page_prot);
+ }
+ } else if (pgaddr == pd->port_piobufs) {
+ /*
+ * We use io_remap, so there is not a
+ * nopage handler for this case!
+ * when we map the PIO buffers, we want
+ * to map them as writeonly, no read possible.
+ */
+
+ unsigned long phys;
+ if ((vm->vm_end - vm->vm_start) >
+ (devdata[pd->port_unit].ipath_pbufsport *
+ devdata[pd->port_unit].ipath_palign)) {
+ _IPATH_INFO
+ ("FAIL mmap piobufs: reqlen %lx > PAGE\n",
+ vm->vm_end - vm->vm_start);
+ ret = -EFAULT;
+ } else {
+ phys =
+ devdata[pd->port_unit].
+ ipath_physaddr + pd->port_piobufs;
+ /*
+ * Do *NOT* mark this as
+ * non-cached (PWT bit), or we
+ * don't get the write combining
+ * behavior we want on the
+ * PIO buffers!
+ * vm->vm_page_prot = pgprot_noncached(vm->vm_page_prot);
+ */
+
+#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC)
+ /* Enable WC */
+ vm->vm_page_prot =
+ pgprot_writecombine(vm->
+ vm_page_prot);
+#endif
+
+ if (vm->vm_flags & VM_READ) {
+ _IPATH_INFO
+ ("Can't map piobufs as readable (flags=%lx)\n",
+ vm->vm_flags);
+ ret = -EPERM;
+ } else {
+ /*
+ * don't allow them to
+ * later change to readable
+ * with mprotect
+ */
+
+ vm->vm_flags &= ~VM_MAYWRITE;
+
+ vm->vm_flags |=
+ VM_DONTCOPY | VM_DONTEXPAND
+ | VM_IO | VM_SHM |
+ VM_LOCKED;
+ ret =
+ io_remap_pfn_range(vm, vm->vm_start, phys >> PAGE_SHIFT,
+ vm->vm_end - vm->vm_start,
+ vm->vm_page_prot);
+ }
+ }
+ } else if (pgaddr == (uint64_t) pd->port_rcvegr_phys) {
+ if (!pd->port_rcvegrbuf_virt)
+ return -EFAULT;
+ /*
+ * page_alloc'ed egr memory, not
+ * physically contiguous
+ * *BUT* to work around the 32 bit mmap64
+ * only handling 44 bits, we have remapped
+ * the first page to kernel virtual, so
+ * we have to do the conversion here to
+ * get back to the original virtual
+				 * address (not contiguous pages), and we
+				 * mark this for special handling.
+ */
+
+ /*
+ * not egrbufs * egrsize since they are
+ * no longer virtually contiguous.
+ */
+ setlen = pd->port_rcvegrbuf_chunks * PAGE_SIZE *
+ (1 << pd->port_rcvegrbuf_order);
+ if ((vm->vm_end - vm->vm_start) > setlen) {
+ _IPATH_INFO
+ ("FAIL on egr bufs: reqlen %lx > actual %x\n",
+ vm->vm_end - vm->vm_start, setlen);
+ ret = -EFAULT;
+ } else {
+ vm->vm_ops = &ipath_vmops;
+ vm->vm_private_data =
+ (void *)(3 | (uint64_t) pd);
+ if (vm->vm_flags & VM_WRITE) {
+ _IPATH_INFO
+ ("Can't map eager buffers as writable (flags=%lx)\n",
+ vm->vm_flags);
+ ret = -EPERM;
+ } else {
+ /*
+ * don't allow them to
+ * later change to writeable
+ * with mprotect
+ */
+
+ vm->vm_flags &= ~VM_MAYWRITE;
+ _IPATH_MMDBG
+ ("egrbufs, set private to %p, not %llx\n",
+ vm->vm_private_data,
+ pgaddr);
+ ret = 0;
+ }
+ }
+ } else if (pgaddr == (uint64_t) pd->port_rcvhdrq_phys) {
+ /*
+ * kmalloc'ed memory, physically
+ * contiguous; this is from
+ * spi_rcvhdr_base; we allow user to
+ * map read-write so they can write
+ * hdrq entries to allow protocol code
+ * to directly poll whether a hdrq entry
+ * has been written.
+ */
+ setlen = ALIGN(devdata[pd->port_unit].ipath_rcvhdrcnt * devdata[pd->port_unit].ipath_rcvhdrentsize * sizeof(uint32_t), PAGE_SIZE);
+ if ((vm->vm_end - vm->vm_start) > setlen) {
+ _IPATH_INFO
+ ("FAIL on rcvhdrq: reqlen %lx > actual %x\n",
+ vm->vm_end - vm->vm_start, setlen);
+ ret = -EFAULT;
+ } else {
+ vm->vm_ops = &ipath_vmops;
+ vm->vm_private_data =
+ (void *)(pgaddr | 1);
+ ret = 0;
+ }
+ }
+ /*
+ * when we map the PIO bufferavail registers,
+			 * we want to map them as readonly, no write
+			 * possible.
+ */
+ else if (pgaddr == devdata[pd->port_unit].ipath_pioavailregs_phys) {
+ /*
+ * kmalloc'ed memory, physically
+ * contiguous, one page only, readonly
+ */
+ setlen = PAGE_SIZE;
+ if ((vm->vm_end - vm->vm_start) > setlen) {
+ _IPATH_INFO
+ ("FAIL on pioavailregs_dma: reqlen %lx > actual %x\n",
+ vm->vm_end - vm->vm_start, setlen);
+ ret = -EFAULT;
+ } else if (vm->vm_flags & VM_WRITE) {
+ _IPATH_INFO
+ ("Can't map pioavailregs as writable (flags=%lx)\n",
+ vm->vm_flags);
+ ret = -EPERM;
+ } else {
+ /*
+ * don't allow them to later
+ * change with mprotect
+ */
+ vm->vm_flags &= ~VM_MAYWRITE;
+ vm->vm_ops = &ipath_vmops;
+ vm->vm_private_data =
+ (void *)(pgaddr | 2);
+ ret = 0;
+ }
+ }
+ if (!ret && setlen) {
+ /* keep page(s) from being swapped, etc. */
+ vm->vm_flags |=
+ VM_DONTEXPAND | VM_DONTCOPY | VM_RESERVED |
+ VM_IO | VM_SHM;
+ } else {
+ /* failure, or io_remap case */
+ vm->vm_private_data = NULL;
+ if (ret)
+ _IPATH_INFO
+ ("Failure %d, setlen %d, on addr %lx, off %lx\n",
+ ret, setlen, vm->vm_start,
+ vm->vm_pgoff);
+ }
+ }
+ } else /* something very wrong */
+		_IPATH_INFO("fp_private wasn't set, no mmapping\n");
+
+ return ret;
+}
+
+/* page fault handler. For each page that is first faulted in from the
+ * mmap'ed shared address buffer, this routine is called.
+ * It's always for a single page.
+ * We use the low bits of the private_data field to tell us which case
+ * we are dealing with.
+ */
+
+static struct page *ipath_nopage(struct vm_area_struct *vma, unsigned long addr,
+ int *type)
+{
+ unsigned long avirt, /* the original [kv]malloc virtual address */
+ paddr, /* physical address */
+ off; /* calculated page offset */
+ uint32_t which, chunk;
+ void *vaddr = NULL;
+ struct ipath_portdata *pd;
+ struct page *vpage = NOPAGE_SIGBUS;
+
+ if (!(avirt = (unsigned long)vma->vm_private_data)) {
+ _IPATH_DBG("NULL private_data, vm_pgoff %lx\n", vma->vm_pgoff);
+ which = 0; /* quiet incorrect gcc warning */
+ goto done;
+ }
+ which = avirt & 3;
+ avirt &= ~3ULL;
+
+ if (addr > vma->vm_end) {
+ _IPATH_DBG("trying to fault in addr %lx past end\n", addr);
+ goto done;
+ }
+
+ /*
+ * most of our memory is vmalloc'ed, but rcvhdr Q is physically
+	 * contiguous, either from kmalloc or alloc_pages().
+	 * pgoff is virtual.
+ */
+ switch (which) {
+ case 1: /* rcvhdrq_phys */
+ /* should always be 0 */
+ off = vma->vm_pgoff - (avirt >> PAGE_SHIFT);
+ paddr = addr - vma->vm_start + (off << PAGE_SHIFT) + avirt;
+ _IPATH_MMDBG("hdrq %lx (u=%lx)\n", paddr, addr);
+ vpage = pfn_to_page(paddr >> PAGE_SHIFT);
+ break;
+ case 2: /* PIO buffer avail regs */
+ /* should always be 0 */
+ off = vma->vm_pgoff - (avirt >> PAGE_SHIFT);
+ paddr = (addr - vma->vm_start + (off << PAGE_SHIFT) + avirt);
+ _IPATH_MMDBG("pioav %lx\n", paddr);
+ vpage = pfn_to_page(paddr >> PAGE_SHIFT);
+ break;
+ case 3:
+ /*
+ * rcvegrbufs; page_alloc()'ed like rcvhdrq, but we
+ * have to pick out which page_alloc()'ed chunk it is.
+ */
+ pd = (struct ipath_portdata *) avirt;
+ /* this should always be 0 */
+ off =
+ vma->vm_pgoff -
+ ((unsigned long)pd->port_rcvegr_phys >> PAGE_SHIFT);
+ off = (addr - vma->vm_start + (off << PAGE_SHIFT));
+
+ chunk = off / (PAGE_SIZE * (1 << pd->port_rcvegrbuf_order));
+ if (chunk > pd->port_rcvegrbuf_chunks)
+ _IPATH_DBG("Bad egrbuf chunk %u (max %u); off = %lx\n",
+ chunk, pd->port_rcvegrbuf_chunks, off);
+ vaddr = pd->port_rcvegrbuf_virt[chunk] +
+ off % (PAGE_SIZE * (1 << pd->port_rcvegrbuf_order));
+ paddr = virt_to_phys(vaddr);
+ vpage = pfn_to_page(paddr >> PAGE_SHIFT);
+ _IPATH_MMDBG("egrb %p,%lx\n", vaddr, paddr);
+ break;
+ default:
+ _IPATH_DBG
+ ("trying to fault in mmap addr %lx (avirt %lx) that isn't known (case %u)\n",
+ addr, avirt, which);
+ }
+
+done:
+ if (vpage != NOPAGE_SIGBUS && vpage != NOPAGE_OOM) {
+ if (which == 2)
+ /*
+ * media/video/video-buf.c doesn't do get_page() for
+ * buffer from alloc_page(). Hmmm.
+ *
+			 * keep it from being swapped, avoid complaints
+			 * if the process exits before we [vf]free it, etc.,
+			 * and keep shared page counts correct, etc.
+ */
+ get_page(vpage);
+ mark_page_accessed(vpage);
+ if (type)
+ *type = VM_FAULT_MINOR;
+ } else
+ _IPATH_DBG("faultin of addr %lx vaddr %p avirt %lx failed\n",
+ addr, vaddr, avirt);
+
+ return vpage;
+}
+
+/* this is separate to allow for better optimization of ipath_intr() */
+
+static void ipath_bad_intr(const ipath_type t, uint32_t * unexpectp)
+{
+ struct ipath_devdata *dd = &devdata[t];
+
+ /*
+	 * interrupts sometimes happen during driver init and unload; we
+	 * don't want to process any interrupts at that point
+ */
+
+ /* this is just a bandaid, not a fix, if something goes badly wrong */
+ if (++*unexpectp > 100) {
+ if (++*unexpectp > 105) {
+ /*
+ * ok, we must be taking somebody else's interrupts,
+ * due to a messed up mptable and/or PIRQ table, so
+ * unregister the interrupt. We've seen this
+ * during linuxbios development work, and it
+ * may happen in the future again.
+ */
+ if (dd->pcidev && dd->pcidev->irq) {
+ _IPATH_UNIT_ERROR(t,
+ "Now %u unexpected interrupts, unregistering interrupt handler\n",
+ *unexpectp);
+ _IPATH_DBG("free_irq of irq %x\n",
+ dd->pcidev->irq);
+ free_irq(dd->pcidev->irq, dd);
+ dd->pcidev->irq = 0;
+ }
+ }
+ if (ipath_kget_kreg32(t, kr_intmask)) {
+ _IPATH_UNIT_ERROR(t,
+ "%u unexpected interrupts, disabling interrupts completely\n",
+ *unexpectp);
+ /* disable all interrupts, something is very wrong */
+ ipath_kput_kreg(t, kr_intmask, 0ULL);
+ }
+ } else if (*unexpectp > 1)
+ _IPATH_DBG
+ ("Interrupt when not ready, should not happen, ignoring\n");
+}
+
+/* separate routine, for better optimization of ipath_intr() */
+
+static void ipath_bad_regread(const ipath_type t)
+{
+ static int allbits;
+ struct ipath_devdata *dd = &devdata[t];
+
+ /*
+ * We print the message and disable interrupts, in hope of
+ * having a better chance of debugging the problem.
+ */
+ _IPATH_UNIT_ERROR(t,
+ "Read of interrupt status failed (all bits set)\n");
+ if (allbits++) {
+ /* disable all interrupts, something is very wrong */
+ ipath_kput_kreg(t, kr_intmask, 0ULL);
+ if (allbits == 2) {
+ _IPATH_UNIT_ERROR(t,
+ "Still bad interrupt status, unregistering interrupt\n");
+ free_irq(dd->pcidev->irq, dd);
+ dd->pcidev->irq = 0;
+ } else if (allbits > 2) {
+ if ((allbits % 10000) == 0)
+ printk(".");
+ } else
+ _IPATH_UNIT_ERROR(t,
+ "Disabling interrupts, multiple errors\n");
+ }
+}
+
+static irqreturn_t ipath_intr(int irq, void *data, struct pt_regs *regs)
+{
+ struct ipath_devdata *dd = data;
+ const ipath_type t = IPATH_UNIT(dd);
+ uint32_t istat = ipath_kget_kreg32(t, kr_intstatus);
+ uint64_t estat = 0;
+ static unsigned unexpected = 0;
+
+ if (unlikely(!istat)) {
+ ipath_stats.sps_nullintr++;
+ /* not our interrupt, or already handled */
+ return IRQ_NONE;
+ }
+ if (unlikely(istat == -1)) {
+ ipath_bad_regread(t);
+ /* don't know if it was our interrupt or not */
+ return IRQ_NONE;
+ }
+
+ ipath_stats.sps_ints++;
+
+ /*
+ * this needs to be flags&initted, not statusp, so we keep
+ * taking interrupts even after link goes down, etc.
+ * Also, we *must* clear the interrupt at some point, or we won't
+ * take it again, which can be real bad for errors, etc...
+ */
+
+ if (!(dd->ipath_flags & IPATH_INITTED)) {
+ ipath_bad_intr(t, &unexpected);
+ return IRQ_NONE;
+ }
+ if (unexpected)
+ unexpected = 0;
+
+ if (istat & ~infinipath_i_bitsextant)
+ _IPATH_UNIT_ERROR(t,
+ "interrupt with unknown interrupts %x set\n",
+ istat & (uint32_t) ~ infinipath_i_bitsextant);
+
+ if (istat & INFINIPATH_I_ERROR) {
+ ipath_stats.sps_errints++;
+ estat = ipath_kget_kreg64(t, kr_errorstatus);
+ if (!estat)
+ _IPATH_INFO
+ ("error interrupt (%x), but no error bits set!\n",
+ istat);
+ else if (estat == -1LL)
+ /*
+ * should we try clearing all, or hope next read
+ * works?
+ */
+ _IPATH_UNIT_ERROR(t,
+ "Read of error status failed (all bits set); ignoring\n");
+ else
+ ipath_handle_errors(t, estat);
+ }
+
+ if (istat & INFINIPATH_I_GPIO) {
+ /* Clear GPIO status bit 2 */
+ ipath_kput_kreg(t, kr_gpio_clear, (uint64_t)(1 << 2));
+
+ /*
+ * Packets are available in the port 0 receive queue.
+ * Eventually this needs to be generalized to check
+ * IPATH_GPIO_INTR, and the specific GPIO bit, when
+ * GPIO interrupts start being used for other things.
+ * We skip that now to improve performance.
+ */
+ ipath_kreceive(t);
+ }
+
+ /*
+	 * clear the ones we will deal with on this round.
+ * We clear it early, mostly for receive interrupts, so we
+ * know the chip will have seen this by the time we process
+ * the queue, and will re-interrupt if necessary. The processor
+ * itself won't take the interrupt again until we return.
+ */
+ ipath_kput_kreg(t, kr_intclear, istat);
+
+ if (istat & INFINIPATH_I_SPIOBUFAVAIL) {
+ atomic_clear_mask(INFINIPATH_S_PIOINTBUFAVAIL,
+ &dd->ipath_sendctrl);
+ ipath_kput_kreg(t, kr_sendctrl, dd->ipath_sendctrl);
+
+ if (dd->ipath_portpiowait) {
+ uint32_t i;
+ /*
+ * start from port 1, since for now port 0 is
+ * never using wait_event for PIO
+ */
+ for (i = 1;
+ dd->ipath_portpiowait && i < dd->ipath_cfgports;
+ i++) {
+ if (dd->ipath_pd[i]
+ && dd->ipath_portpiowait & (1U << i)) {
+ atomic_clear_mask(1U << i,
+ &dd->
+ ipath_portpiowait);
+ if (dd->ipath_pd[i]->
+ port_flag & IPATH_PORT_WAITING_PIO)
+ {
+ dd->ipath_pd[i]->port_flag &=
+ ~IPATH_PORT_WAITING_PIO;
+ wake_up_interruptible(&dd->
+ ipath_pd
+ [i]->
+ port_wait);
+ }
+ }
+ }
+ }
+
+ if (dd->ipath_layer.l_intr) {
+ if (dd->ipath_layer.l_intr(t,
+ IPATH_LAYER_INT_SEND_CONTINUE)) {
+ atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL,
+ &dd->ipath_sendctrl);
+ ipath_kput_kreg(t, kr_sendctrl,
+ dd->ipath_sendctrl);
+ }
+ }
+
+ if (dd->verbs_layer.l_piobufavail) {
+ if (!dd->verbs_layer.l_piobufavail(t)) {
+ atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL,
+ &dd->ipath_sendctrl);
+ ipath_kput_kreg(t, kr_sendctrl,
+ dd->ipath_sendctrl);
+ }
+ }
+ }
+
+ /*
+ * we check for both transition from empty to non-empty, and urgent
+ * packets (those with the interrupt bit set in the header)
+ */
+
+ if (istat & ((infinipath_i_rcvavail_mask << INFINIPATH_I_RCVAVAIL_SHIFT)
+ | (infinipath_i_rcvurg_mask << INFINIPATH_I_RCVURG_SHIFT))) {
+ uint64_t portr;
+ int i;
+ uint32_t rcvdint = 0;
+
+ portr = ((istat >> INFINIPATH_I_RCVAVAIL_SHIFT) &
+ infinipath_i_rcvavail_mask)
+ | ((istat >> INFINIPATH_I_RCVURG_SHIFT) &
+ infinipath_i_rcvurg_mask);
+ for (i = 0; i < dd->ipath_cfgports; i++) {
+ if (portr & (1 << i) && dd->ipath_pd[i]) {
+ if (i == 0)
+ ipath_kreceive(t);
+ else if (dd->ipath_pd[i]->
+ port_flag & IPATH_PORT_WAITING_RCV) {
+ atomic_clear_mask
+ (IPATH_PORT_WAITING_RCV,
+ &dd->ipath_pd[i]->port_flag);
+ wake_up_interruptible(&dd->ipath_pd[i]->
+ port_wait);
+ rcvdint |= 1U << i;
+ }
+ }
+ }
+ if (rcvdint) {
+ /*
+ * only want to take one interrupt, so turn off
+ * the rcv interrupt for all the ports that we
+ * did the wakeup on (but never for kernel port)
+ */
+ atomic_clear_mask(rcvdint <<
+ INFINIPATH_R_INTRAVAIL_SHIFT,
+ &dd->ipath_rcvctrl);
+ ipath_kput_kreg(t, kr_rcvctrl, dd->ipath_rcvctrl);
+ }
+ }
+
+ return IRQ_HANDLED;
+}
+
+static void ipath_decode_err(char *buf, size_t blen, uint64_t err)
+{
+ *buf = '\0';
+ if (err & INFINIPATH_E_RHDRLEN)
+ strlcat(buf, "rhdrlen ", blen);
+ if (err & INFINIPATH_E_RBADTID)
+ strlcat(buf, "rbadtid ", blen);
+ if (err & INFINIPATH_E_RBADVERSION)
+ strlcat(buf, "rbadversion ", blen);
+ if (err & INFINIPATH_E_RHDR)
+ strlcat(buf, "rhdr ", blen);
+ if (err & INFINIPATH_E_RLONGPKTLEN)
+ strlcat(buf, "rlongpktlen ", blen);
+ if (err & INFINIPATH_E_RSHORTPKTLEN)
+ strlcat(buf, "rshortpktlen ", blen);
+ if (err & INFINIPATH_E_RMAXPKTLEN)
+ strlcat(buf, "rmaxpktlen ", blen);
+ if (err & INFINIPATH_E_RMINPKTLEN)
+ strlcat(buf, "rminpktlen ", blen);
+ if (err & INFINIPATH_E_RFORMATERR)
+ strlcat(buf, "rformaterr ", blen);
+ if (err & INFINIPATH_E_RUNSUPVL)
+ strlcat(buf, "runsupvl ", blen);
+ if (err & INFINIPATH_E_RUNEXPCHAR)
+ strlcat(buf, "runexpchar ", blen);
+ if (err & INFINIPATH_E_RIBFLOW)
+ strlcat(buf, "ribflow ", blen);
+ if (err & INFINIPATH_E_REBP)
+ strlcat(buf, "EBP ", blen);
+ if (err & INFINIPATH_E_SUNDERRUN)
+ strlcat(buf, "sunderrun ", blen);
+ if (err & INFINIPATH_E_SPIOARMLAUNCH)
+ strlcat(buf, "spioarmlaunch ", blen);
+ if (err & INFINIPATH_E_SUNEXPERRPKTNUM)
+ strlcat(buf, "sunexperrpktnum ", blen);
+ if (err & INFINIPATH_E_SDROPPEDDATAPKT)
+ strlcat(buf, "sdroppeddatapkt ", blen);
+ if (err & INFINIPATH_E_SDROPPEDSMPPKT)
+ strlcat(buf, "sdroppedsmppkt ", blen);
+ if (err & INFINIPATH_E_SMAXPKTLEN)
+ strlcat(buf, "smaxpktlen ", blen);
+ if (err & INFINIPATH_E_SMINPKTLEN)
+ strlcat(buf, "sminpktlen ", blen);
+ if (err & INFINIPATH_E_SUNSUPVL)
+ strlcat(buf, "sunsupVL ", blen);
+ if (err & INFINIPATH_E_SPKTLEN)
+ strlcat(buf, "spktlen ", blen);
+ if (err & INFINIPATH_E_INVALIDADDR)
+ strlcat(buf, "invalidaddr ", blen);
+ if (err & INFINIPATH_E_RICRC)
+ strlcat(buf, "CRC ", blen);
+ if (err & INFINIPATH_E_RVCRC)
+ strlcat(buf, "VCRC ", blen);
+ if (err & INFINIPATH_E_RRCVEGRFULL)
+ strlcat(buf, "rcvegrfull ", blen);
+ if (err & INFINIPATH_E_RRCVHDRFULL)
+ strlcat(buf, "rcvhdrfull ", blen);
+ if (err & INFINIPATH_E_IBSTATUSCHANGED)
+ strlcat(buf, "ibcstatuschg ", blen);
+ if (err & INFINIPATH_E_RIBLOSTLINK)
+ strlcat(buf, "riblostlink ", blen);
+ if (err & INFINIPATH_E_HARDWARE)
+ strlcat(buf, "hardware ", blen);
+ if (err & INFINIPATH_E_RESET)
+ strlcat(buf, "reset ", blen);
+}
+
+/* decode RHF errors; only used one place now, may want more later */
+static void get_rhf_errstring(uint32_t err, char *msg, size_t len)
+{
+	/* start with an empty string, so we don't need to check what's first */
+ *msg = '\0';
+
+ if (err & INFINIPATH_RHF_H_ICRCERR)
+ strlcat(msg, "icrcerr ", len);
+ if (err & INFINIPATH_RHF_H_VCRCERR)
+ strlcat(msg, "vcrcerr ", len);
+ if (err & INFINIPATH_RHF_H_PARITYERR)
+ strlcat(msg, "parityerr ", len);
+ if (err & INFINIPATH_RHF_H_LENERR)
+ strlcat(msg, "lenerr ", len);
+ if (err & INFINIPATH_RHF_H_MTUERR)
+ strlcat(msg, "mtuerr ", len);
+ if (err & INFINIPATH_RHF_H_IHDRERR)
+ /* infinipath hdr checksum error */
+ strlcat(msg, "ipathhdrerr ", len);
+ if (err & INFINIPATH_RHF_H_TIDERR)
+ strlcat(msg, "tiderr ", len);
+ if (err & INFINIPATH_RHF_H_MKERR)
+ /* bad port, offset, etc. */
+ strlcat(msg, "invalid ipathhdr ", len);
+ if (err & INFINIPATH_RHF_H_IBERR)
+ strlcat(msg, "iberr ", len);
+ if (err & INFINIPATH_RHF_L_SWA)
+ strlcat(msg, "swA ", len);
+ if (err & INFINIPATH_RHF_L_SWB)
+ strlcat(msg, "swB ", len);
+}
+
+static void ipath_handle_errors(const ipath_type t, uint64_t errs)
+{
+ char msg[512];
+ uint32_t piobcnt;
+ uint64_t sbuf[4], ignore_this_time = 0;
+ int i;
+ int chkerrpkts = 0, noprint = 0;
+ cycles_t nc;
+ static cycles_t nextmsg_time;
+ static unsigned nmsgs, supp_msgs;
+ struct ipath_devdata *dd = &devdata[t];
+
+#define E_SUM_PKTERRS (INFINIPATH_E_RHDRLEN | INFINIPATH_E_RBADTID \
+ | INFINIPATH_E_RBADVERSION \
+ | INFINIPATH_E_RHDR | INFINIPATH_E_RLONGPKTLEN | INFINIPATH_E_RSHORTPKTLEN \
+ | INFINIPATH_E_RMAXPKTLEN | INFINIPATH_E_RMINPKTLEN \
+ | INFINIPATH_E_RFORMATERR | INFINIPATH_E_RUNSUPVL | INFINIPATH_E_RUNEXPCHAR \
+ | INFINIPATH_E_REBP)
+
+#define E_SUM_ERRS ( INFINIPATH_E_SPIOARMLAUNCH \
+ | INFINIPATH_E_SUNEXPERRPKTNUM | INFINIPATH_E_SDROPPEDDATAPKT \
+ | INFINIPATH_E_SDROPPEDSMPPKT | INFINIPATH_E_SMAXPKTLEN \
+ | INFINIPATH_E_SUNSUPVL | INFINIPATH_E_SMINPKTLEN | INFINIPATH_E_SPKTLEN \
+ | INFINIPATH_E_INVALIDADDR)
+
+ /*
+ * throttle back "fast" messages to no more than 10 per 5 seconds
+ * (1.4-2GHz clock). This isn't perfect, but it's a reasonable
+	 * heuristic.
+	 * If we get more than 10, give a 5x longer delay.
+ */
+ nc = get_cycles();
+ if (nmsgs > 10) {
+ if (nc < nextmsg_time) {
+ noprint = 1;
+ if (!supp_msgs++)
+ nextmsg_time = nc + 50000000000ULL;
+ } else if (supp_msgs) {
+ /*
+ * Print the message unless it's ibc status
+ * change only, which happens so often we never
+ * want to count it.
+ */
+ if (dd->ipath_lasterror & ~INFINIPATH_E_IBSTATUSCHANGED) {
+ ipath_decode_err(msg, sizeof msg,
+ dd->
+ ipath_lasterror &
+ ~INFINIPATH_E_IBSTATUSCHANGED);
+ if (dd->
+ ipath_lasterror & ~(INFINIPATH_E_RRCVEGRFULL
+ |
+ INFINIPATH_E_RRCVHDRFULL))
+ _IPATH_UNIT_ERROR(t,
+ "Suppressed %u messages for fast-repeating errors (%s) (%llx)\n",
+ supp_msgs, msg,
+ dd->ipath_lasterror);
+ else {
+ /*
+ * rcvegrfull and rcvhdrqfull are
+ * "normal", for some types of
+ * processes (mostly benchmarks)
+ * that send huge numbers of
+ * messages, while not processing
+ * them. So only complain about
+ * these at debug level.
+ */
+ _IPATH_DBG
+ ("Suppressed %u messages for %s\n",
+ supp_msgs, msg);
+ }
+ }
+ supp_msgs = 0;
+ nmsgs = 0;
+ }
+ } else if (!nmsgs++ || nc > nextmsg_time) /* start timer */
+ nextmsg_time = nc + 10000000000ULL;
+
+ /*
+ * don't report errors that are masked (includes those always
+ * ignored)
+ */
+ errs &= ~dd->ipath_maskederrs;
+
+ /* do these first, they are most important */
+ if (errs & INFINIPATH_E_HARDWARE) {
+ /* reuse same msg buf */
+ ipath_handle_hwerrors(t, msg, sizeof msg);
+ }
+
+ if (!noprint && (errs & ~infinipath_e_bitsextant))
+ _IPATH_UNIT_ERROR(t,
+ "error interrupt with unknown errors %llx set\n",
+ errs & ~infinipath_e_bitsextant);
+
+ if (errs & E_SUM_ERRS) {
+		/* it's possible that sendbuffererror could be valid */
+ piobcnt = dd->ipath_piobcnt;
+ /* read these before writing errorclear */
+ sbuf[0] = ipath_kget_kreg64(t, kr_sendbuffererror);
+ sbuf[1] = ipath_kget_kreg64(t, kr_sendbuffererror + 1);
+ if (piobcnt > 128) {
+ sbuf[2] = ipath_kget_kreg64(t, kr_sendbuffererror + 2);
+ sbuf[3] = ipath_kget_kreg64(t, kr_sendbuffererror + 3);
+ }
+
+ if (sbuf[0] || sbuf[1]
+ || (piobcnt > 128 && (sbuf[2] || sbuf[3]))) {
+ _IPATH_PDBG("SendbufErrs %llx %llx ", sbuf[0], sbuf[1]);
+ if (infinipath_debug & __IPATH_PKTDBG && piobcnt > 128)
+ printk("%llx %llx ", sbuf[2], sbuf[3]);
+ for (i = 0; i < piobcnt; i++) {
+ if (test_bit(i, sbuf)) {
+ uint32_t sendctrl;
+ if (infinipath_debug & __IPATH_PKTDBG)
+ printk("%u ", i);
+ sendctrl =
+ dd->
+ ipath_sendctrl | INFINIPATH_S_DISARM
+ | (i <<
+ INFINIPATH_S_DISARMPIOBUF_SHIFT);
+ ipath_kput_kreg(t, kr_sendctrl,
+ sendctrl);
+ }
+ }
+ if (infinipath_debug & __IPATH_PKTDBG)
+ printk("\n");
+ }
+ if ((errs &
+ (INFINIPATH_E_SDROPPEDDATAPKT | INFINIPATH_E_SDROPPEDSMPPKT
+ | INFINIPATH_E_SMINPKTLEN))
+ && !(dd->ipath_flags & IPATH_LINKACTIVE)) {
+ /*
+ * This can happen when SMA is trying to bring
+ * the link up, but the IB link changes state
+ * at the "wrong" time. The IB logic then
+ * complains that the packet isn't valid.
+ * We don't want to confuse people, so we just
+ * don't print them, except at debug
+ */
+ _IPATH_DBG
+ ("Ignoring pktsend errors %llx, because not yet active\n",
+ errs);
+ ignore_this_time |=
+ INFINIPATH_E_SDROPPEDDATAPKT |
+ INFINIPATH_E_SDROPPEDSMPPKT |
+ INFINIPATH_E_SMINPKTLEN;
+ }
+ }
+
+ if (supp_msgs == 250000) {
+ /*
+		 * It's not entirely reasonable to assume that the errors
+		 * set in the last clear period are all responsible for
+		 * the problem, but the alternative is to assume it's only
+		 * the ones on this particular interrupt, which also isn't great
+ */
+ dd->ipath_maskederrs |= dd->ipath_lasterror | errs;
+ ipath_kput_kreg(t, kr_errormask, ~dd->ipath_maskederrs);
+ ipath_decode_err(msg, sizeof msg,
+ (dd->ipath_maskederrs & ~dd->
+ ipath_ignorederrs));
+
+ if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs)
+ & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL))
+ _IPATH_UNIT_ERROR(t,
+					  "Disabling error(s) %llx because occurring too frequently (%s)\n",
+ (dd->ipath_maskederrs & ~dd->
+ ipath_ignorederrs), msg);
+ else {
+ /*
+ * rcvegrfull and rcvhdrqfull are "normal",
+ * for some types of processes (mostly benchmarks)
+ * that send huge numbers of messages, while not
+ * processing them. So only complain about
+ * these at debug level.
+ */
+ _IPATH_DBG
+ ("Disabling frequent queue full errors (%s)\n",
+ msg);
+ }
+
+ /*
+		 * re-enable the masked errors after around 3 minutes,
+		 * in ipath_get_faststats(). If we have a series of
+ * fast repeating but different errors, the interval will keep
+ * stretching out, but that's OK, as that's pretty catastrophic.
+ */
+ dd->ipath_unmasktime = nc + 400000000000ULL;
+ }
+
+ ipath_kput_kreg(t, kr_errorclear, errs);
+ if (ignore_this_time)
+ errs &= ~ignore_this_time;
+ if (errs & ~dd->ipath_lasterror) {
+ errs &= ~dd->ipath_lasterror;
+ /* never suppress duplicate hwerrors or ibstatuschange */
+ dd->ipath_lasterror |= errs &
+ ~(INFINIPATH_E_HARDWARE | INFINIPATH_E_IBSTATUSCHANGED);
+ }
+ if (!errs)
+ return;
+
+ if (!noprint)
+ /* the ones we mask off are handled specially below or above */
+ ipath_decode_err(msg, sizeof msg,
+ errs & ~(INFINIPATH_E_IBSTATUSCHANGED |
+ INFINIPATH_E_RRCVEGRFULL |
+ INFINIPATH_E_RRCVHDRFULL |
+ INFINIPATH_E_HARDWARE));
+ else
+ /* so we don't need if (!noprint) at strlcat's below */
+ *msg = 0;
+
+ if (errs & E_SUM_PKTERRS) {
+ ipath_stats.sps_pkterrs++;
+ chkerrpkts = 1;
+ }
+ if (errs & E_SUM_ERRS)
+ ipath_stats.sps_errs++;
+
+ if (errs & (INFINIPATH_E_RICRC | INFINIPATH_E_RVCRC)) {
+ ipath_stats.sps_crcerrs++;
+ chkerrpkts = 1;
+ }
+
+ /*
+	 * We don't want to print these two as they happen, or we can make
+	 * the situation even worse, because it takes so long to print messages
+	 * to serial consoles. Kernel ports get printed from fast_stats, no
+	 * more than every 5 seconds; user ports get printed on close
+ */
+ if (errs & INFINIPATH_E_RRCVHDRFULL) {
+ int any;
+ uint32_t hd, tl;
+ ipath_stats.sps_hdrqfull++;
+ for (any = i = 0; i < dd->ipath_cfgports; i++) {
+ if (i == 0) {
+ hd = dd->ipath_port0head;
+ tl = *dd->ipath_hdrqtailptr;
+ } else if (dd->ipath_pd[i] &&
+ dd->ipath_pd[i]->port_rcvhdrtail_kvaddr) {
+ /*
+ * don't report same point multiple times,
+ * except kernel
+ */
+ tl = (uint32_t) *
+ dd->ipath_pd[i]->port_rcvhdrtail_kvaddr;
+ if (tl == dd->ipath_lastrcvhdrqtails[i])
+ continue;
+ hd = ipath_kget_ureg32(t, ur_rcvhdrhead, i);
+ } else
+ continue;
+ if (hd == (tl + 1) || (!hd && tl == dd->ipath_hdrqlast)) {
+ dd->ipath_lastrcvhdrqtails[i] = tl;
+ dd->ipath_pd[i]->port_hdrqfull++;
+ if (i == 0)
+ chkerrpkts = 1;
+ }
+ }
+ }
+ if (errs & INFINIPATH_E_RRCVEGRFULL) {
+ /*
+ * since this is of less importance and not likely to
+ * happen without also getting hdrfull, only count
+ * occurrences; don't check each port (or even the kernel
+ * vs user)
+ */
+ ipath_stats.sps_etidfull++;
+ if (dd->ipath_port0head != *dd->ipath_hdrqtailptr)
+ chkerrpkts = 1;
+ }
+
+ /*
+ * do this before IBSTATUSCHANGED, in case both bits set in a single
+	 * interrupt; we want the STATUSCHANGE to "win", so that our
+	 * internal copy of the state machine is updated correctly
+ */
+ if (errs & INFINIPATH_E_RIBLOSTLINK) {
+ /* force through block below */
+ errs |= INFINIPATH_E_IBSTATUSCHANGED;
+ ipath_stats.sps_iblink++;
+ dd->ipath_flags |= IPATH_LINKDOWN;
+ dd->ipath_flags &= ~(IPATH_LINKUNK | IPATH_LINKINIT
+ | IPATH_LINKARMED | IPATH_LINKACTIVE);
+ if (!noprint)
+ _IPATH_DBG("Lost link, link now down (%s)\n",
+ ipath_ibcstatus_str[ipath_kget_kreg64
+ (t,
+ kr_ibcstatus) & 0xf]);
+ }
+
+ if ((errs & INFINIPATH_E_IBSTATUSCHANGED) && (!ipath_diags_enabled)) {
+ uint64_t val;
+ uint32_t ltstate;
+
+ val = ipath_kget_kreg64(t, kr_ibcstatus);
+ ltstate = val & 0xff;
+ if (ltstate == 0x11 || ltstate == 0x21 || ltstate == 0x31)
+ _IPATH_DBG("Link state changed unit %u to 0x%x, last was 0x%llx\n",
+ t, ltstate, dd->ipath_lastibcstat);
+ else {
+ ltstate = dd->ipath_lastibcstat & 0xff;
+ if (ltstate == 0x11 || ltstate == 0x21 || ltstate == 0x31)
+ _IPATH_DBG("Link state unit %u changed to down state 0x%llx, last was 0x%llx\n",
+ t, val, dd->ipath_lastibcstat);
+ else
+ _IPATH_VDBG("Link state unit %u changed to 0x%llx from one of down states\n",
+ t, val);
+ }
+ ltstate = (val >> INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) &
+ INFINIPATH_IBCS_LINKTRAININGSTATE_MASK;
+
+ if (ltstate == 2 || ltstate == 3) {
+ uint32_t last_ltstate;
+
+ /*
+			 * ignore cycling back and forth from states 2 to 3
+			 * while waiting for the other end of the link to come
+			 * up, except that if it keeps happening, we switch
+			 * between linkinitstate SLEEP and POLL. While we
+			 * cycle back and forth between them, we aren't seeing
+			 * any other device: either no cable is plugged in,
+			 * the other device is powered off, the other device
+			 * is a switch that hasn't yet polled us, etc.
+ */
+ last_ltstate = (dd->ipath_lastibcstat >>
+ INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT)
+ & INFINIPATH_IBCS_LINKTRAININGSTATE_MASK;
+ if (last_ltstate == 2 || last_ltstate == 3) {
+ if (++dd->ipath_ibpollcnt > 4) {
+ uint64_t ibc;
+ dd->ipath_flags |=
+ IPATH_LINK_SLEEPING | IPATH_NOCABLE;
+ *dd->ipath_statusp |=
+ IPATH_STATUS_IB_NOCABLE;
+ _IPATH_VDBG
+ ("linkinitcmd POLL, move to SLEEP\n");
+ ibc = dd->ipath_ibcctrl;
+ ibc |= INFINIPATH_IBCC_LINKINITCMD_SLEEP
+ <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT;
+ /*
+ * don't put linkinitcmd in
+ * ipath_ibcctrl, want that to
+ * stay a NOP
+ */
+ ipath_kput_kreg(t, kr_ibcctrl, ibc);
+ dd->ipath_ibpollcnt = 0;
+ }
+ goto skip_ibchange;
+ }
+ }
+ /* some state other than 2 or 3 */
+ dd->ipath_ibpollcnt = 0;
+ ipath_stats.sps_iblink++;
+ /*
+ * Note: We try to match the Mellanox HCA LED behavior
+ * as best we can. That changed around Oct 2003.
+ * Green indicates link state (something is plugged in,
+ * and we can train). Amber indicates the link is
+ * logically up (ACTIVE). Mellanox further blinks the
+ * amber LED to indicate data packet activity, but we
+ * have no hardware support for that, so it would require
+ * waking up every 10-20 msecs and checking the counters
+ * on the chip, and then turning the LED off if
+ * appropriate. That's visible overhead, so not something
+ * we will do.
+ */
+ if (ltstate != 1 || ((dd->ipath_lastibcstat & 0x30) == 0x30 &&
+ (val & 0x30) != 0x30)) {
+ dd->ipath_flags |= IPATH_LINKDOWN;
+ dd->ipath_flags &= ~(IPATH_LINKUNK | IPATH_LINKINIT
+ | IPATH_LINKACTIVE |
+ IPATH_LINKARMED);
+ *dd->ipath_statusp &= ~IPATH_STATUS_IB_READY;
+ if (!noprint) {
+ if ((dd->ipath_lastibcstat & 0x30) == 0x30)
+ /* if from up to down be more vocal */
+ _IPATH_DBG("Link unit %u is now down (%s)\n",
+ t, ipath_ibcstatus_str
+ [ltstate]);
+ else
+ _IPATH_VDBG("Link unit %u is down (%s)\n",
+ t, ipath_ibcstatus_str
+ [ltstate]);
+ }
+
+ if (val & 0x30) {
+ /* leave just green on, 0x11 and 0x21 */
+ dd->ipath_extctrl &=
+ ~INFINIPATH_EXTC_LEDPRIPORTYELLOWON;
+ dd->ipath_extctrl |=
+ INFINIPATH_EXTC_LEDPRIPORTGREENON;
+ } else /* not up at all, so turn the leds off */
+ dd->ipath_extctrl &=
+ ~(INFINIPATH_EXTC_LEDPRIPORTGREENON |
+ INFINIPATH_EXTC_LEDPRIPORTYELLOWON);
+ ipath_kput_kreg(t, kr_extctrl,
+ (uint64_t) dd->ipath_extctrl);
+ if (ltstate == 1
+ && (dd->
+ ipath_flags & (IPATH_LINK_TOARMED |
+ IPATH_LINK_TOACTIVE))) {
+ ipath_set_ib_lstate(t,
+ INFINIPATH_IBCC_LINKCMD_INIT);
+ }
+ } else if ((val & 0x31) == 0x31) {
+ if (!noprint)
+ _IPATH_DBG("Link unit %u is now in active state\n", t);
+ dd->ipath_flags |= IPATH_LINKACTIVE;
+ dd->ipath_flags &=
+ ~(IPATH_LINKUNK | IPATH_LINKINIT | IPATH_LINKDOWN |
+ IPATH_LINKARMED | IPATH_NOCABLE |
+ IPATH_LINK_TOACTIVE | IPATH_LINK_SLEEPING);
+ *dd->ipath_statusp &= ~IPATH_STATUS_IB_NOCABLE;
+ *dd->ipath_statusp |=
+ IPATH_STATUS_IB_READY | IPATH_STATUS_IB_CONF;
+ /* set the externally visible LEDs to indicate state */
+ dd->ipath_extctrl |= INFINIPATH_EXTC_LEDPRIPORTGREENON
+ | INFINIPATH_EXTC_LEDPRIPORTYELLOWON;
+ ipath_kput_kreg(t, kr_extctrl,
+ (uint64_t) dd->ipath_extctrl);
+
+ /*
+			 * since we are now active, set the linkinitcmd
+			 * to NOP (0); it was probably either POLL or SLEEP
+ */
+ dd->ipath_ibcctrl &=
+ ~(INFINIPATH_IBCC_LINKINITCMD_MASK <<
+ INFINIPATH_IBCC_LINKINITCMD_SHIFT);
+ ipath_kput_kreg(t, kr_ibcctrl, dd->ipath_ibcctrl);
+
+ if (devdata[t].ipath_layer.l_intr)
+ devdata[t].ipath_layer.l_intr(t,
+ IPATH_LAYER_INT_IF_UP);
+ } else if ((val & 0x31) == 0x11) {
+ /*
+			 * set INIT and DOWN. Down is checked by
+ * most of the other code, but INIT is useful
+ * to know in a few places.
+ */
+ dd->ipath_flags |= IPATH_LINKINIT | IPATH_LINKDOWN;
+ dd->ipath_flags &=
+ ~(IPATH_LINKUNK | IPATH_LINKACTIVE | IPATH_LINKARMED
+ | IPATH_NOCABLE | IPATH_LINK_SLEEPING);
+ *dd->ipath_statusp &= ~(IPATH_STATUS_IB_NOCABLE
+ | IPATH_STATUS_IB_READY);
+
+ /* set the externally visible LEDs to indicate state */
+ dd->ipath_extctrl &=
+ ~INFINIPATH_EXTC_LEDPRIPORTYELLOWON;
+ dd->ipath_extctrl |= INFINIPATH_EXTC_LEDPRIPORTGREENON;
+ ipath_kput_kreg(t, kr_extctrl,
+ (uint64_t) dd->ipath_extctrl);
+ if (dd->
+ ipath_flags & (IPATH_LINK_TOARMED |
+ IPATH_LINK_TOACTIVE)) {
+ /*
+ * if we got here while trying to bring
+ * the link up, try again, but only once more!
+ */
+ ipath_set_ib_lstate(t,
+ INFINIPATH_IBCC_LINKCMD_ARMED);
+ dd->ipath_flags &=
+ ~(IPATH_LINK_TOARMED | IPATH_LINK_TOACTIVE);
+ }
+ } else if ((val & 0x31) == 0x21) {
+ dd->ipath_flags |= IPATH_LINKARMED;
+ dd->ipath_flags &=
+ ~(IPATH_LINKUNK | IPATH_LINKDOWN | IPATH_LINKINIT |
+ IPATH_LINKACTIVE | IPATH_NOCABLE |
+ IPATH_LINK_TOARMED | IPATH_LINK_SLEEPING);
+ *dd->ipath_statusp &= ~(IPATH_STATUS_IB_NOCABLE
+ | IPATH_STATUS_IB_READY);
+ /*
+ * set the externally visible LEDs to indicate
+ * state (same as 0x11)
+ */
+ dd->ipath_extctrl &=
+ ~INFINIPATH_EXTC_LEDPRIPORTYELLOWON;
+ dd->ipath_extctrl |= INFINIPATH_EXTC_LEDPRIPORTGREENON;
+ ipath_kput_kreg(t, kr_extctrl,
+ (uint64_t) dd->ipath_extctrl);
+ if (dd->ipath_flags & IPATH_LINK_TOACTIVE) {
+ /*
+ * if we got here while trying to bring
+ * the link up, try again, but only once more!
+ */
+ ipath_set_ib_lstate(t,
+ INFINIPATH_IBCC_LINKCMD_ACTIVE);
+ dd->ipath_flags &= ~IPATH_LINK_TOACTIVE;
+ }
+ } else {
+ if (dd->
+ ipath_flags & (IPATH_LINK_TOARMED |
+ IPATH_LINK_TOACTIVE))
+ ipath_set_ib_lstate(t,
+ INFINIPATH_IBCC_LINKCMD_INIT);
+ else if (!noprint)
+ _IPATH_DBG("IBstatuschange unit %u: %s\n",
+ t, ipath_ibcstatus_str[ltstate]);
+ }
+ dd->ipath_lastibcstat = val;
+ }
+
+skip_ibchange:
+
+ if (errs & INFINIPATH_E_RESET) {
+ if (!noprint)
+ _IPATH_UNIT_ERROR(t,
+ "Got reset, requires re-initialization (unload and reload driver)\n");
+ dd->ipath_flags &= ~IPATH_INITTED; /* needs re-init */
+ /* mark as having had error */
+ *dd->ipath_statusp |= IPATH_STATUS_HWERROR;
+ *dd->ipath_statusp &= ~IPATH_STATUS_IB_CONF;
+ }
+
+ if (!noprint && *msg)
+ _IPATH_UNIT_ERROR(t, "%s error\n", msg);
+ if (dd->ipath_sma_state_wanted & dd->ipath_flags) {
+ _IPATH_VDBG("sma wanted state %x, iflags now %x, waking\n",
+ dd->ipath_sma_state_wanted, dd->ipath_flags);
+ wake_up_interruptible(&ipath_sma_state_wait);
+ }
+
+ if (chkerrpkts)
+ /* process possible error packets in hdrq */
+ ipath_kreceive(t);
+}
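
A side note on the mmap()/nopage() interaction in this part of the driver:
ipath_mmap() distinguishes the three nopage-handled mappings (rcvhdrq, PIO
bufferavail registers, eager buffers) by storing a small tag in the low two
bits of vm_private_data, which works because the stored address or pointer
is always at least 4-byte aligned; ipath_nopage() then masks the tag back
off before using the value. A minimal standalone sketch of that low-bit
tagging idea follows -- the names and the example buffer are illustrative
only, not part of the patch:

	#include <stdint.h>
	#include <stdio.h>

	/* tag values, mirroring the 1/2/3 cases used by ipath_nopage() */
	enum { TAG_RCVHDRQ = 1, TAG_PIOAVAIL = 2, TAG_EGRBUFS = 3 };

	/* store a 2-bit tag in the low bits of an aligned pointer */
	static void *tag_ptr(void *p, unsigned int tag)
	{
		return (void *)((uintptr_t)p | (tag & 3));
	}

	/* recover the tag */
	static unsigned int ptr_tag(const void *p)
	{
		return (uintptr_t)p & 3;
	}

	/* recover the original aligned pointer */
	static void *ptr_untag(const void *p)
	{
		return (void *)((uintptr_t)p & ~(uintptr_t)3);
	}

	int main(void)
	{
		/* stand-in for a page-aligned receive header queue */
		static uint64_t rcvhdrq[512] __attribute__((aligned(4096)));
		void *priv = tag_ptr(rcvhdrq, TAG_RCVHDRQ);

		printf("tag=%u base=%p\n", ptr_tag(priv), ptr_untag(priv));
		return 0;
	}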

2005-12-29 00:42:29

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 10 of 20] ipath - core driver, part 3 of 4

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r dad2e87e21f4 -r c37b118ef806 drivers/infiniband/hw/ipath/ipath_driver.c
--- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800
@@ -3878,3 +3878,1533 @@
/* process possible error packets in hdrq */
ipath_kreceive(t);
}
+
+/* must only be called if ipath_pd[port] is known to be allocated */
+static inline void *ipath_get_egrbuf(const ipath_type t, uint32_t bufnum,
+ int err)
+{
+ return devdata[t].ipath_port0_skbs ?
+ (void *)devdata[t].ipath_port0_skbs[bufnum]->data : NULL;
+
+#ifdef _USE_FOR_DEBUGGING_ONLY
+ /*
+	 * we want this routine to be inlined and fast; this is here so that
+	 * if we do ports other than 0, I don't have to rewrite the code,
+	 * since it's slightly complicated
+ */
+ if (port != 1) {
+ void *chunkbase;
+ /*
+ * This calculation takes about 50 cycles. Could do
+ * what I did for protocol code, and have an array of
+ * addresses, getting it down to just a few cycles per
+ * lookup, at the cost of 16KB of memory.
+ */
+ if (!devdata[t].ipath_pd[port]->port_rcvegrbuf_virt)
+ return NULL;
+ chunkbase = devdata[t].ipath_pd[port]->port_rcvegrbuf_virt
+ [bufnum /
+ devdata[t].ipath_pd[port]->port_rcvegrbufs_perchunk];
+ return (void *)(chunkbase +
+ (bufnum %
+ devdata[t].ipath_pd[port]->
+ port_rcvegrbufs_perchunk)
+ * devdata[t].ipath_rcvegrbufsize);
+ }
+#endif
+}
+
+/* receive an sma packet. Separate for better overall optimization */
+static void ipath_rcv_sma(const ipath_type t, uint32_t tlen,
+ uint64_t * rc, void *ebuf)
+{
+ int sindex, slen, elen;
+ void *smbuf;
+ uint8_t pad, *bthbytes;
+
+ ipath_stats.sps_sma_rpkts++; /* another SMA packet received */
+
+ bthbytes = (uint8_t *)((struct ips_message_header_typ *) &rc[1])->bth;
+
+ pad = (bthbytes[1] >> 4) & 3;
+ elen = tlen - (IPATH_SMA_HDRSZ + pad + (uint32_t) sizeof(uint32_t));
+ if (elen > (SMA_MAX_PKTSZ - IPATH_SMA_HDRSZ))
+ elen = SMA_MAX_PKTSZ - IPATH_SMA_HDRSZ;
+
+ spin_lock_irq(&ipath_sma_lock);
+ sindex = ipath_sma_next;
+ smbuf = ipath_sma_data[sindex].buf;
+ ipath_sma_data[sindex].unit = t;
+ slen = ipath_sma_data[ipath_sma_next].len;
+ memcpy(smbuf, &rc[1], IPATH_SMA_HDRSZ);
+ memcpy(smbuf + IPATH_SMA_HDRSZ, ebuf, elen);
+ if (slen) {
+ /*
+		 * overwriting an as-yet-unread old one (buffer wrap); have to
+		 * advance ipath_sma_first to the next oldest
+ */
+
+ /* count OK packets that we drop */
+ ipath_stats.sps_krdrops++;
+ if (++ipath_sma_first >= IPATH_NUM_SMAPKTS)
+ ipath_sma_first = 0;
+ }
+ slen = ipath_sma_data[sindex].len = elen + IPATH_SMA_HDRSZ;
+ if (++ipath_sma_next >= IPATH_NUM_SMAPKTS)
+ ipath_sma_next = 0;
+ spin_unlock_irq(&ipath_sma_lock);
+}
+
+/*
+ * receive a packet for the layered (ethernet) driver.
+ * Separate routine for better overall optimization
+ */
+static void ipath_rcv_layer(const ipath_type t, uint32_t etail,
+ uint32_t tlen, struct ether_header_typ * hdr)
+{
+ uint32_t elen;
+ uint8_t pad, *bthbytes;
+ struct sk_buff *skb;
+ struct sk_buff *nskb;
+ struct ipath_devdata *dd = &devdata[t];
+ struct ipath_portdata *pd;
+ unsigned long pa, pent;
+ uint64_t __iomem *egrbase;
+ uint64_t lenvalid; /* in words */
+
+ if (dd->ipath_port0_skbs && hdr->sub_opcode == OPCODE_ENCAP) {
+ /*
+ * Allocate a new sk_buff to replace the one we give
+ * to the network stack.
+ */
+ if (!(nskb = dev_alloc_skb(dd->ipath_ibmaxlen + 4))) {
+ /* count OK packets that we drop */
+ ipath_stats.sps_krdrops++;
+ return;
+ }
+
+ bthbytes = (uint8_t *) hdr->bth;
+ pad = (bthbytes[1] >> 4) & 3;
+ /* +CRC32 */
+ elen = tlen - (sizeof(*hdr) + pad + sizeof(uint32_t));
+
+ skb_reserve(nskb, 4);
+
+ skb = dd->ipath_port0_skbs[etail];
+ dd->ipath_port0_skbs[etail] = nskb;
+ skb_put(skb, elen);
+
+ pd = dd->ipath_pd[0];
+ lenvalid = (dd->ipath_ibmaxlen - pd->port_egrskip) >> 2;
+ lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT;
+ lenvalid |= INFINIPATH_RT_VALID;
+ pa = virt_to_phys(nskb->data);
+ pa += pd->port_egrskip;
+ pent = (pa & INFINIPATH_RT_ADDR_MASK) | lenvalid;
+ /* This is simplified for port 0 */
+ egrbase = (uint64_t __iomem *)
+ ((char __iomem *)(dd->ipath_kregbase) +
+ dd->ipath_rcvegrbase);
+ ipath_kput_memq(t, &egrbase[etail], pent);
+
+ dd->ipath_layer.l_rcv(t, hdr, skb);
+
+ /* another ether packet received */
+ ipath_stats.sps_ether_rpkts++;
+ } else if (hdr->sub_opcode == OPCODE_LID_ARP) {
+ if (dd->ipath_layer.l_rcv_lid)
+ dd->ipath_layer.l_rcv_lid(t, hdr);
+ }
+
+}
+
+/* called from interrupt handler for errors or receive interrupt */
+void ipath_kreceive(const ipath_type t)
+{
+ uint64_t *rc;
+ void *ebuf;
+ struct ipath_devdata *dd = &devdata[t];
+ const uint32_t rsize = dd->ipath_rcvhdrentsize; /* words */
+ const uint32_t maxcnt = dd->ipath_rcvhdrcnt * rsize; /* in words */
+ uint32_t etail = -1, l, hdrqtail, sma_this_time = 0;
+ struct ips_message_header_typ *hdr;
+ uint32_t eflags, i, etype, tlen, pkttot=0;
+ static uint64_t totcalls; /* stats, may eventually remove */
+ char emsg[128];
+
+ if (!dd->ipath_hdrqtailptr) {
+ _IPATH_UNIT_ERROR(t,
+ "hdrqtailptr not set, can't do receives\n");
+ return;
+ }
+
+ if (test_and_set_bit(0, &dd->ipath_rcv_pending)) {
+ /* There is already a thread processing this queue. */
+ return;
+ }
+
+ if (dd->ipath_port0head == *dd->ipath_hdrqtailptr)
+ goto done;
+
+gotmore:
+ /*
+	 * read only once at start. If in a flood situation, this helps
+ * performance slightly. If more arrive while we are processing,
+ * we'll come back here and do them
+ */
+ hdrqtail = *dd->ipath_hdrqtailptr;
+
+ for (i = 0, l = dd->ipath_port0head; l != hdrqtail; i++) {
+ uint32_t qp;
+ uint8_t *bthbytes;
+
+
+ rc = (uint64_t *) (dd->ipath_pd[0]->port_rcvhdrq + (l << 2));
+ hdr = (struct ips_message_header_typ *) & rc[1];
+ /*
+ * could make a network order version of IPATH_KD_QP, and
+ * do the obvious shift before masking to speed this up.
+ */
+ qp = ntohl(hdr->bth[1]) & 0xffffff;
+ bthbytes = (uint8_t *) hdr->bth;
+
+ eflags = ips_get_hdr_err_flags((uint32_t*)rc);
+ etype = ips_get_rcv_type((uint32_t*)rc);
+ tlen = ips_get_length_in_bytes((uint32_t*)rc); /* total length */
+ ebuf = NULL;
+ if (etype != RCVHQ_RCV_TYPE_EXPECTED) {
+ /*
+			 * it turns out that the chip uses an eager buffer for
+ * all non-expected packets, whether it "needs"
+ * one or not. So always get the index, but
+ * don't set ebuf (so we try to copy data)
+ * unless the length requires it.
+ */
+ etail = ips_get_index((uint32_t*)rc);
+ if (tlen > sizeof(*hdr)
+ || etype == RCVHQ_RCV_TYPE_NON_KD) {
+ ebuf = ipath_get_egrbuf(t, etail, 0);
+ }
+ }
+
+ /*
+ * both tiderr and ipathhdrerr are set for all plain IB
+ * packets; only ipathhdrerr should be set.
+ */
+
+ if (etype != RCVHQ_RCV_TYPE_NON_KD
+ && etype != RCVHQ_RCV_TYPE_ERROR
+ && ips_get_ipath_ver(hdr->iph.ver_port_tid_offset) !=
+ IPS_PROTO_VERSION) {
+ _IPATH_PDBG("Bad InfiniPath protocol version %x\n",
+ etype);
+ }
+
+ if (eflags &
+ ~(INFINIPATH_RHF_H_TIDERR | INFINIPATH_RHF_H_IHDRERR)) {
+ get_rhf_errstring(eflags, emsg, sizeof emsg);
+ _IPATH_PDBG
+ ("RHFerrs %x hdrqtail=%x typ=%u tlen=%x opcode=%x egridx=%x: %s\n",
+ eflags, l, etype, tlen, bthbytes[0],
+ ips_get_index((uint32_t*)rc), emsg);
+ } else if (etype == RCVHQ_RCV_TYPE_NON_KD) {
+ /*
+ * If there is a userland SMA and this is a MAD packet,
+ * then pass it to the userland SMA.
+ */
+ if (ipath_sma_alive && qp <= 1) {
+ /*
+ * count OK packets that we drop because
+ * SMA isn't yet running, or because we
+ * are in an sma flood (no point in
+ * constantly acquiring the spin lock, and
+ * overwriting previous packets).
+ * Eventually things will recover.
+ * Similarly if the sma consumer is
+ * so far behind that we would overwrite
+ * (yes, it's outside the lock)
+ */
+ if (!ipath_sma_data_spare ||
+ ipath_sma_data[ipath_sma_next].len ||
+ ++sma_this_time > IPATH_NUM_SMAPKTS) {
+ ipath_stats.sps_krdrops++;
+ } else if (ebuf) {
+ ipath_rcv_sma(t, tlen, rc, ebuf);
+ }
+ } else if (dd->verbs_layer.l_rcv) {
+ dd->verbs_layer.l_rcv(t, rc + 1, ebuf, tlen);
+ } else {
+ _IPATH_VDBG("received IB packet, not SMA (QP=%x)\n",
+ qp);
+ }
+ } else if (etype == RCVHQ_RCV_TYPE_EAGER) {
+ if (qp == IPATH_KD_QP && bthbytes[0] ==
+ dd->ipath_layer.l_rcv_opcode && ebuf)
+ ipath_rcv_layer(t, etail, tlen,
+ (struct ether_header_typ *)hdr);
+ else
+ _IPATH_PDBG
+ ("typ %x, opcode %x (eager, qp=%x), len %x; ignored\n",
+ etype, bthbytes[0], qp, tlen);
+ } else if (etype == RCVHQ_RCV_TYPE_EXPECTED) {
+ _IPATH_DBG("Bug: Expected TID, opcode %x; ignored\n",
+ hdr->bth[0] & 0xff);
+ } else if (eflags &
+ (INFINIPATH_RHF_H_TIDERR | INFINIPATH_RHF_H_IHDRERR))
+ {
+ /*
+ * This is a type 3 packet, only the LRH is in
+ * the rcvhdrq, the rest of the header is in
+ * the eager buffer.
+ */
+ uint8_t opcode;
+ if (ebuf) {
+ bthbytes = (uint8_t *) ebuf;
+ opcode = *bthbytes;
+ } else
+ opcode = 0;
+ get_rhf_errstring(eflags, emsg, sizeof emsg);
+ _IPATH_DBG
+ ("Err %x (%s), opcode %x, egrbuf %x, len %x\n",
+ eflags, emsg, opcode, etail, tlen);
+ } else {
+ /*
+ * error packet, type of error unknown.
+ * Probably type 3, but we don't know, so don't
+ * even try to print the opcode, etc.
+ */
+ _IPATH_DBG
+ ("Error Pkt, but no eflags! egrbuf %x, len %x\n"
+ "hdrq@%lx;hdrq+%x rhf: %llx; hdr %llx %llx %llx %llx %llx\n",
+ etail, tlen, (unsigned long)rc, l, rc[0], rc[1],
+ rc[2], rc[3], rc[4], rc[5]);
+ }
+ l += rsize;
+ if (l >= maxcnt)
+ l = 0;
+ /*
+ * update for each packet, to help prevent overflows if we have
+ * lots of packets.
+ */
+ (void)ipath_kput_ureg(t, ur_rcvhdrhead, l, 0);
+ if (etype != RCVHQ_RCV_TYPE_EXPECTED)
+ (void)ipath_kput_ureg(t, ur_rcvegrindexhead, etail, 0);
+ }
+
+ pkttot += i;
+
+ dd->ipath_port0head = l;
+
+ if (hdrqtail != *dd->ipath_hdrqtailptr)
+ goto gotmore; /* more arrived while we handled first batch */
+
+ if (pkttot > ipath_stats.sps_maxpkts_call)
+ ipath_stats.sps_maxpkts_call = pkttot;
+ ipath_stats.sps_port0pkts += pkttot;
+ ipath_stats.sps_avgpkts_call = ipath_stats.sps_port0pkts / ++totcalls;
+
+ if (sma_this_time) /* only once at end, not each time */
+ wake_up_interruptible(&ipath_sma_wait);
+
+done:
+ clear_bit(0, &dd->ipath_rcv_pending);
+ smp_mb__after_clear_bit();
+}
+
+/*
+ * Update our shadow copy of the PIO availability register map, called
+ * whenever our local copy indicates we have run out of send buffers.
+ * NOTE: This can be called from interrupt context by ipath_bufavail()
+ * and from non-interrupt context by ipath_getpiobuf().
+ */
+
+static void ipath_update_pio_bufs(const ipath_type t)
+{
+ unsigned long flags;
+ int i;
+ const unsigned piobregs = (unsigned)devdata[t].ipath_pioavregs;
+
+ /* If the generation (check) bits have changed, then we update the
+ * busy bit for the corresponding PIO buffer. This algorithm will
+ * modify positions to the value they already have in some cases
+ * (i.e., no change), but it's faster than changing only the bits
+ * that have changed.
+ *
+	 * We would like to do this atomically, to avoid spinlocks in the
+ * critical send path, but that's not really possible, given the
+ * type of changes, and that this routine could be called on multiple
+ * cpu's simultaneously, so we lock in this routine only, to avoid
+ * conflicting updates; all we change is the shadow, and it's a
+ * single 64 bit memory location, so by definition the update is
+ * atomic in terms of what other cpu's can see in testing the
+ * bits. The spin_lock overhead isn't too bad, since it only
+ * happens when all buffers are in use, so only cpu overhead,
+ * not latency or bandwidth is affected.
+ */
+#define _IPATH_ALL_CHECKBITS 0x5555555555555555ULL
+ if (!devdata[t].ipath_pioavailregs_dma) {
+ _IPATH_DBG("Update shadow pioavail, but regs_dma NULL!\n");
+ return;
+ }
+ if (infinipath_debug & __IPATH_VERBDBG) {
+ /* only if packet debug and verbose */
+ _IPATH_PDBG("Refill avail, dma0=%llx shad0=%llx, "
+ "d1=%llx s1=%llx, d2=%llx s2=%llx, d3=%llx s3=%llx\n",
+ devdata[t].ipath_pioavailregs_dma[0],
+ devdata[t].ipath_pioavailshadow[0],
+ devdata[t].ipath_pioavailregs_dma[1],
+ devdata[t].ipath_pioavailshadow[1],
+ devdata[t].ipath_pioavailregs_dma[2],
+ devdata[t].ipath_pioavailshadow[2],
+ devdata[t].ipath_pioavailregs_dma[3],
+ devdata[t].ipath_pioavailshadow[3]);
+ if (piobregs > 4)
+ _IPATH_PDBG("2nd group, dma4=%llx shad4=%llx, "
+ "d5=%llx s5=%llx, d6=%llx s6=%llx, d7=%llx s7=%llx\n",
+ devdata[t].ipath_pioavailregs_dma[4],
+ devdata[t].ipath_pioavailshadow[4],
+ devdata[t].ipath_pioavailregs_dma[5],
+ devdata[t].ipath_pioavailshadow[5],
+ devdata[t].ipath_pioavailregs_dma[6],
+ devdata[t].ipath_pioavailshadow[6],
+ devdata[t].ipath_pioavailregs_dma[7],
+ devdata[t].ipath_pioavailshadow[7]);
+ }
+ spin_lock_irqsave(&ipath_pioavail_lock, flags);
+ for (i = 0; i < piobregs; i++) {
+ uint64_t pchbusy, pchg, piov, pnew;
+ /* Chip Errata: bug 6641; even and odd qwords>3 are swapped */
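+ /* (for i > 3, the i ^ 1 index swaps dma[4]<->dma[5] and dma[6]<->dma[7]) */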
+ piov = devdata[t].ipath_pioavailregs_dma[i > 3 ? i ^ 1 : i];
+ pchg =
+ _IPATH_ALL_CHECKBITS & ~(devdata[t].
+ ipath_pioavailshadow[i] ^ piov);
+ pchbusy = pchg << INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT;
+ if (pchg && (pchbusy & devdata[t].ipath_pioavailshadow[i])) {
+ pnew = devdata[t].ipath_pioavailshadow[i] & ~pchbusy;
+ pnew |= piov & pchbusy;
+ devdata[t].ipath_pioavailshadow[i] = pnew;
+ }
+ }
+ spin_unlock_irqrestore(&ipath_pioavail_lock, flags);
+}
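+
+/*
+ * Illustration only (not part of the driver): the shadow array keeps two
+ * bits per PIO buffer -- the even bit of each pair is the generation
+ * (check) bit, the odd bit is the busy bit.  A hypothetical helper to
+ * test whether buffer i is marked busy in the shadow might look like:
+ *
+ *	static inline int example_piobuf_busy(const unsigned long *shadow,
+ *					      unsigned i)
+ *	{
+ *		return test_bit(2 * i + 1, shadow);
+ *	}
+ *
+ * ipath_getpiobuf() below manipulates the same bit pair with
+ * test_and_set_bit() and change_bit().
+ */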
+
+static int ipath_do_user_init(struct ipath_portdata *pd,
+ struct ipath_user_info __user *uinfo)
+{
+ int ret = 0;
+ ipath_type t = pd->port_unit;
+ struct ipath_devdata *dd = &devdata[t];
+ struct ipath_user_info kinfo;
+
+ if (copy_from_user(&kinfo, uinfo, sizeof kinfo))
+ ret = -EFAULT;
+ else {
+ /* for now, if major version is different, bail */
+ if ((kinfo.spu_userversion >> 16) != IPATH_USER_SWMAJOR) {
+ _IPATH_INFO
+ ("User major version %d not same as driver major %d\n",
+ kinfo.spu_userversion >> 16, IPATH_USER_SWMAJOR);
+ ret = -ENODEV;
+ } else {
+ if ((kinfo.spu_userversion & 0xffff) !=
+ IPATH_USER_SWMINOR)
+ _IPATH_DBG
+ ("User minor version %d not same as driver minor %d\n",
+ kinfo.spu_userversion & 0xffff,
+ IPATH_USER_SWMINOR);
+ if (kinfo.spu_rcvhdrsize) {
+ if ((ret =
+ ipath_setrcvhdrsize(t,
+ kinfo.spu_rcvhdrsize)))
+ goto done;
+ } else if (!dd->ipath_rcvhdrsize) {
+ /*
+ * first user of field, kernel or user
+ * code, and using default
+ */
+ dd->ipath_rcvhdrsize = IPATH_DFLT_RCVHDRSIZE;
+ ipath_kput_kreg(pd->port_unit, kr_rcvhdrsize,
+ dd->ipath_rcvhdrsize);
+ _IPATH_VDBG
+ ("Use default protocol header size %u\n",
+ dd->ipath_rcvhdrsize);
+ }
+
+ pd->port_egrskip = kinfo.spu_egrskip;
+ if (pd->port_egrskip) {
+ if (pd->port_egrskip & 3) {
+ _IPATH_DBG
+ ("eager skip 0x%x invalid, must be word multiple; using 0x%x\n",
+ pd->port_egrskip,
+ pd->port_egrskip & ~3);
+ pd->port_egrskip &= ~3;
+ }
+ _IPATH_DBG
+ ("user reserves 0x%x bytes at start of eager TIDs\n",
+ pd->port_egrskip);
+ }
+
+ /*
+ * for now we do nothing with rcvhdrcnt:
+ * kinfo.spu_rcvhdrcnt
+ */
+
+ /*
+ * set up for the rcvhdr Q tail register writeback
+ * to user memory
+ */
+ if (kinfo.spu_rcvhdraddr &&
+ access_ok(VERIFY_WRITE,
+ (uint64_t __user *) kinfo.spu_rcvhdraddr,
+ sizeof(uint64_t))) {
+ uint64_t physaddr, uaddr, off, atmp;
+ struct page *pagep;
+ off = offset_in_page(kinfo.spu_rcvhdraddr);
+ uaddr =
+ PAGE_MASK & (unsigned long)kinfo.
+ spu_rcvhdraddr;
+ if ((ret = ipath_get_upages_nocopy(uaddr, &pagep))) {
+ _IPATH_INFO
+ ("Failed to lookup and lock address %llx for rcvhdrtail: errno %d\n",
+ kinfo.spu_rcvhdraddr, -ret);
+ goto done;
+ }
+ ipath_stats.sps_pagelocks++;
+ pd->port_rcvhdrtail_uaddr = uaddr;
+ pd->port_rcvhdrtail_pagep = pagep;
+ pd->port_rcvhdrtail_kvaddr =
+ page_address(pagep);
+ pd->port_rcvhdrtail_kvaddr += off;
+ physaddr = page_to_phys(pagep) + off;
+ _IPATH_VDBG
+ ("port %d user addr %llx hdrtailaddr, %llx physical (off=%llx)\n",
+ pd->port_port, kinfo.spu_rcvhdraddr,
+ physaddr, off);
+ ipath_kput_kreg_port(t, kr_rcvhdrtailaddr,
+ pd->port_port, physaddr);
+ atmp =
+ ipath_kget_kreg64_port(t, kr_rcvhdrtailaddr,
+ pd->port_port);
+ if (physaddr != atmp) {
+ _IPATH_UNIT_ERROR(t,
+ "Catastrophic software error, RcvHdrTailAddr%u written as %llx, read back as %llx\n",
+ pd->port_port,
+ physaddr, atmp);
+ ret = -EINVAL;
+ goto done;
+ }
+ } else {
+ _IPATH_DBG
+ ("Port %d rcvhdrtail addr %llx not valid\n",
+ pd->port_port, kinfo.spu_rcvhdraddr);
+ ret = -EINVAL;
+ goto done;
+ }
+
+ /*
+ * for right now, kernel piobufs are at end,
+ * so port 1 is at 0
+ */
+ pd->port_piobufs = dd->ipath_piobufbase +
+ dd->ipath_pbufsport * (pd->port_port -
+ 1) * dd->ipath_palign;
+ _IPATH_VDBG("Set base of piobufs for port %u to 0x%x\n",
+ pd->port_port, pd->port_piobufs);
+
+ /*
+ * Now allocate the rcvhdr Q and eager TIDs;
+ * skip the TID array for the time being.
+ * If pd->port_port > chip-supported, we will
+ * someday need extra handling here to overflow
+ * through port 0.
+ */
+ if (!(ret = ipath_create_rcvhdrq(pd)))
+ ret = ipath_create_user_egr(pd);
+ if (!ret) { /* enable receives now */
+ uint64_t head;
+ uint32_t head32;
+ /* atomically set enable bit for this port */
+ atomic_set_mask(1U <<
+ (INFINIPATH_R_PORTENABLE_SHIFT +
+ pd->port_port),
+ &dd->ipath_rcvctrl);
+
+ /*
+ * set the head registers for this port
+ * to the current values of the tail
+ * pointers, since we don't know if they
+ * were updated on last use of the port.
+ */
+ head32 =
+ ipath_kget_ureg32(t, ur_rcvhdrtail,
+ pd->port_port);
+ head = (uint64_t) head32;
+ ipath_kput_ureg(t, ur_rcvhdrhead, head,
+ pd->port_port);
+ head32 =
+ ipath_kget_ureg32(t, ur_rcvegrindextail,
+ pd->port_port);
+ ipath_kput_ureg(t, ur_rcvegrindexhead, head32,
+ pd->port_port);
+ dd->ipath_lastegrheads[pd->port_port] = -1;
+ dd->ipath_lastrcvhdrqtails[pd->port_port] = -1;
+ _IPATH_VDBG
+ ("Wrote port%d head %llx, egrhead %x from tail regs\n",
+ pd->port_port, head, head32);
+ /* start at beginning after open */
+ pd->port_tidcursor = 0;
+ {
+ /*
+ * now enable the port; the tail
+ * registers will be written to
+ * memory by the chip as soon
+ * as it sees the write to
+ * kr_rcvctrl. The update only
+ * happens on transition from 0
+ * to 1, so clear it first, then
+ * set it as part of enabling
+ * the port. This will (very
+ * briefly) affect any other open
+ * ports, but it shouldn't be long
+ * enough to be an issue.
+ */
+ ipath_kput_kreg(t, kr_rcvctrl,
+ dd->
+ ipath_rcvctrl &
+ ~INFINIPATH_R_TAILUPD);
+ ipath_kput_kreg(t, kr_rcvctrl,
+ dd->ipath_rcvctrl);
+ }
+ }
+ }
+ }
+
+done:
+ return ret;
+}
+
+static int ipath_get_baseinfo(struct ipath_portdata *pd,
+ struct ipath_base_info __user *ubase)
+{
+ int ret = 0;
+ struct ipath_base_info kbase;
+ struct ipath_devdata *dd = &devdata[pd->port_unit];
+
+ /* be sure anything we don't set is zeroed */
+ memset(&kbase, 0, sizeof kbase);
+ kbase.spi_rcvhdr_cnt = dd->ipath_rcvhdrcnt;
+ kbase.spi_rcvhdrent_size = dd->ipath_rcvhdrentsize;
+ kbase.spi_tidegrcnt = dd->ipath_rcvegrcnt;
+ kbase.spi_rcv_egrbufsize = dd->ipath_rcvegrbufsize;
+ kbase.spi_rcv_egrbuftotlen = pd->port_rcvegrbuf_chunks * PAGE_SIZE * (1 << pd->port_rcvegrbuf_order); /* have to mmap whole thing */
+ kbase.spi_rcv_egrperchunk = pd->port_rcvegrbufs_perchunk;
+ kbase.spi_rcv_egrchunksize = kbase.spi_rcv_egrbuftotlen /
+ pd->port_rcvegrbuf_chunks;
+ kbase.spi_tidcnt = dd->ipath_rcvtidcnt;
+ /*
+ * for this use, this may be ipath_cfgports summed over all chips
+ * that are configured and present
+ */
+ kbase.spi_nports = dd->ipath_cfgports;
+ kbase.spi_unit = pd->port_unit; /* unit (chip/board) our port is on */
+ /* for now, only a single page */
+ kbase.spi_tid_maxsize = PAGE_SIZE;
+
+ /*
+ * doing this per port, and based on the skip value, etc.
+ * This has to be the actual buffer size, since the protocol
+ * code treats it as an array.
+ *
+ * These have to be set to user addresses in the user code via mmap.
+ * These values are used on return to user code for the mmap target
+ * addresses only. For 32 bit, same 44 bit address problem, so use
+ * the physical address, not virtual. Before 2.6.11, using the
+ * page_address() macro worked, but in 2.6.11, even that returns
+ * the full 64 bit address (upper bits all 1's).
+ * So far, using the physical addresses (or chip offsets, for
+ * chip mapping) works, but no doubt some future kernel release
+ * will change that, and we'll be on to yet another method of
+ * dealing with this.
+ */
+ kbase.spi_rcvhdr_base = (uint64_t) pd->port_rcvhdrq_phys;
+ kbase.spi_rcv_egrbufs = (uint64_t) pd->port_rcvegr_phys;
+ kbase.spi_pioavailaddr = (uint64_t) dd->ipath_pioavailregs_phys;
+ kbase.spi_status = (uint64_t) kbase.spi_pioavailaddr +
+ (void *)dd->ipath_statusp - (void *)dd->ipath_pioavailregs_dma;
+ kbase.spi_piobufbase = (uint64_t) pd->port_piobufs;
+ kbase.__spi_uregbase =
+ dd->ipath_uregbase + dd->ipath_palign * pd->port_port;
+
+ kbase.spi_pioindex = dd->ipath_pbufsport * (pd->port_port - 1);
+ kbase.spi_piocnt = dd->ipath_pbufsport;
+ kbase.spi_pioalign = dd->ipath_palign;
+
+ kbase.spi_qpair = IPATH_KD_QP;
+ kbase.spi_piosize = dd->ipath_ibmaxlen;
+ kbase.spi_mtu = dd->ipath_ibmaxlen; /* maxlen, not ibmtu */
+ kbase.spi_port = pd->port_port;
+ kbase.spi_sw_version = IPATH_KERN_SWVERSION;
+ kbase.spi_hw_version = dd->ipath_revision;
+
+ if (copy_to_user(ubase, &kbase, sizeof kbase))
+ ret = -EFAULT;
+
+ return ret;
+}
+
+/*
+ * return number of units supported by driver. This is infinipath_max,
+ * unless there are no initted units.
+ */
+static int ipath_get_units(void)
+{
+ int i;
+
+ for (i = 0; i < infinipath_max; i++)
+ if (devdata[i].ipath_flags & IPATH_INITTED)
+ return infinipath_max;
+ return 0;
+}
+
+/* write data to the EEPROM on the board */
+static int ipath_wr_eeprom(struct ipath_portdata* pd,
+ struct ipath_eeprom_req __user *req)
+{
+ int ret = 0;
+ struct ipath_eeprom_req kreq;
+ void *buf = NULL;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM; /* not just any old user can write flash */
+ if (copy_from_user(&kreq, req, sizeof kreq))
+ return -EFAULT;
+ if (!kreq.addr || (kreq.offset + kreq.len) > 128) {
+ _IPATH_DBG
+ ("called with NULL addr %llx, or bad cnt %u or offset %u\n",
+ kreq.addr, kreq.len, kreq.offset);
+ return -EINVAL;
+ }
+
+ if (!(buf = vmalloc(kreq.len))) {
+ ret = -ENOMEM;
+ _IPATH_UNIT_ERROR(pd->port_unit,
+ "Couldn't allocate memory to write %u bytes from eeprom\n",
+ kreq.len);
+ goto done;
+ }
+ if (copy_from_user(buf, (void __user *) kreq.addr, kreq.len)) {
+ ret = -EFAULT;
+ goto done;
+ }
+ if (ipath_eeprom_write(pd->port_unit, kreq.offset, buf, kreq.len)) {
+ ret = -ENXIO;
+ _IPATH_UNIT_ERROR(pd->port_unit,
+ "Failed write to eeprom %u bytes offset %u\n",
+ kreq.len, kreq.offset);
+ }
+
+done:
+ if (buf)
+ vfree(buf);
+ return ret;
+}
+
+/* read data from the EEPROM on the board */
+int ipath_rd_eeprom(const ipath_type port_unit,
+ struct ipath_eeprom_req __user *req)
+{
+ int ret = 0;
+ struct ipath_eeprom_req kreq;
+ void *buf = NULL;
+
+ if (copy_from_user(&kreq, req, sizeof kreq))
+ return -EFAULT;
+ if (!kreq.addr || (kreq.offset + kreq.len) > 128) {
+ _IPATH_DBG
+ ("called with NULL addr %llx, or bad cnt %u or offset %u\n",
+ kreq.addr, kreq.len, kreq.offset);
+ return -EINVAL;
+ }
+
+ if (!(buf = vmalloc(kreq.len))) {
+ ret = -ENOMEM;
+ _IPATH_UNIT_ERROR(port_unit,
+ "Couldn't allocate memory to read %u bytes from eeprom\n",
+ kreq.len);
+ goto done;
+ }
+ if (ipath_eeprom_read(port_unit, kreq.offset, buf, kreq.len)) {
+ ret = -ENXIO;
+ _IPATH_UNIT_ERROR(port_unit,
+ "Failed reading %u bytes offset %u from eeprom\n",
+ kreq.len, kreq.offset);
+ }
+ if (copy_to_user((void __user *) kreq.addr, buf, kreq.len))
+ ret = -EFAULT;
+
+done:
+ if (buf)
+ vfree(buf);
+ return ret;
+}
+
+/*
+ * wait for something to happen on a port. Currently this is a
+ * PIO buffer becoming available, or a packet being received. For now,
+ * at least, we wait no longer than 1/2 second on rcv, 1 tick on PIO, so
+ * we recover from any bugs (or, as we see in ips.c init and close, cases
+ * where the other side isn't yet ready).
+ * NOTE: currently called only with PIO or RCV, never both, so the path
+ * with both has not been tested.
+ */
+static int ipath_wait_intr(struct ipath_portdata * pd, uint32_t flag)
+{
+ struct ipath_devdata *dd = &devdata[pd->port_unit];
+ /* stupid compiler can't tell it's initialized */
+ uint32_t im = 0;
+ uint32_t head, tail, timeo = 0, wflag = 0;
+
+ if (!(flag & (IPATH_WAIT_RCV | IPATH_WAIT_PIO)))
+ return -EINVAL;
+ if (flag & IPATH_WAIT_RCV) {
+ head = flag >> 16;
+ im = (1U << pd->port_port) << INFINIPATH_R_INTRAVAIL_SHIFT;
+ atomic_set_mask(im, &dd->ipath_rcvctrl);
+ /*
+ * now, before blocking, make sure that head is still == tail,
+ * reading from the chip, so we can be sure the interrupt enable
+ * has made it to the chip. If not equal, disable
+ * interrupt again and return immediately. This avoids
+ * races, and the overhead of the chip read doesn't
+ * matter much at this point, since we are waiting for
+ * something anyway.
+ */
+ ipath_kput_kreg(pd->port_unit, kr_rcvctrl, dd->ipath_rcvctrl);
+ tail =
+ ipath_kget_ureg32(pd->port_unit, ur_rcvhdrtail,
+ pd->port_port);
+ if (tail == head) {
+ timeo = HZ / 2;
+ wflag = IPATH_PORT_WAITING_RCV;
+ } else {
+ atomic_clear_mask(im, &dd->ipath_rcvctrl);
+ ipath_kput_kreg(pd->port_unit, kr_rcvctrl,
+ dd->ipath_rcvctrl);
+ }
+ }
+ if (flag & IPATH_WAIT_PIO) {
+ /*
+ * this one's a bit worse than the receive case, in that we
+ * can't really verify that at least one interrupt
+ * will happen...
+ * We do use a really short timeout, however
+ */
+ timeo = 1; /* if both, the short PIO timeout wins */
+ atomic_set_mask(1U << pd->port_port, &dd->ipath_portpiowait);
+ wflag |= IPATH_PORT_WAITING_PIO;
+ /*
+ * this has a possible race with the ipath stuff, so do
+ * it atomically
+ */
+ atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL,
+ &dd->ipath_sendctrl);
+ ipath_kput_kreg(pd->port_unit, kr_sendctrl, dd->ipath_sendctrl);
+ }
+ if (wflag) {
+ pd->port_flag |= wflag;
+ wait_event_interruptible_timeout(pd->port_wait,
+ (pd->port_flag & wflag) !=
+ wflag, timeo);
+ if (wflag & pd->port_flag & IPATH_PORT_WAITING_PIO) {
+ /* timed out, no PIO interrupts */
+ atomic_clear_mask(IPATH_PORT_WAITING_PIO,
+ &pd->port_flag);
+ pd->port_piowait_to++;
+ atomic_clear_mask(1U << pd->port_port,
+ &dd->ipath_portpiowait);
+ /*
+ * *don't* clear the pio interrupt enable;
+ * let that happen in the interrupt handler;
+ * else we have a race condition.
+ */
+ }
+ if (wflag & pd->port_flag & IPATH_PORT_WAITING_RCV) {
+ /* timed out, no packets received */
+ atomic_clear_mask(IPATH_PORT_WAITING_RCV,
+ &pd->port_flag);
+ pd->port_rcvwait_to++;
+ atomic_clear_mask(im, &dd->ipath_rcvctrl);
+ ipath_kput_kreg(pd->port_unit, kr_rcvctrl,
+ dd->ipath_rcvctrl);
+ }
+ } else {
+ /* else it's already happened, don't do wait_event overhead */
+ if (flag & IPATH_WAIT_RCV)
+ pd->port_rcvnowait++;
+ if (flag & IPATH_WAIT_PIO)
+ pd->port_pionowait++;
+ }
+ return 0;
+}
+
+/*
+ * The new implementation as of Oct 2004 is that the driver assigns
+ * the tid and returns it to the caller. To make it easier to
+ * catch bugs, and to reduce search time, we keep a cursor for
+ * each port, walking the shadow tid array to find one that's not
+ * in use.
+ *
+ * For now, if we can't allocate the full list, we fail, although
+ * in the long run, we'll allocate as many as we can, and the
+ * caller will deal with that by trying the remaining pages later.
+ * That means that when we fail, we have to mark the tids as not in
+ * use again, in our shadow copy.
+ *
+ * It's up to the caller to free the tids when they are done.
+ * We'll unlock the pages as they free them.
+ *
+ * Also, right now we are locking one page at a time, but since
+ * the intended use of this routine is for a single group of
+ * virtually contiguous pages, that should change to improve
+ * performance.
+ */
+static int ipath_tid_update(struct ipath_portdata * pd,
+ struct _tidupd __user *tidu)
+{
+ int ret = 0, ntids;
+ uint32_t tid, porttid, cnt, i, tidcnt;
+ struct _tidupd tu;
+ uint16_t *tidlist;
+ struct ipath_devdata *dd = &devdata[pd->port_unit];
+ uint64_t vaddr, physaddr, lenvalid;
+ uint64_t __iomem *tidbase;
+ uint64_t tidmap[8];
+ struct page **pagep = NULL;
+
+ tu.tidcnt = 0; /* for early errors */
+ if (!dd->ipath_pageshadow) {
+ ret = -ENOMEM;
+ goto done;
+ }
+ if (copy_from_user(&tu, tidu, sizeof tu)) {
+ ret = -EFAULT;
+ goto done;
+ }
+
+ if (!(cnt = tu.tidcnt)) {
+ _IPATH_DBG("After copyin, tidcnt 0, tidlist %llx\n",
+ tu.tidlist);
+ /* or should we treat as success? likely a bug */
+ ret = -EFAULT;
+ goto done;
+ }
+ tidcnt = dd->ipath_rcvtidcnt;
+ if (cnt >= tidcnt) { /* make sure it all fits in port_tid_pg_list */
+ _IPATH_INFO
+ ("Process tried to allocate %u TIDs, only trying max (%u)\n",
+ cnt, tidcnt);
+ cnt = tidcnt;
+ }
+ pagep = (struct page **)pd->port_tid_pg_list;
+ tidlist = (uint16_t *) (&pagep[cnt]);
+
+ memset(tidmap, 0, sizeof(tidmap));
+ tid = pd->port_tidcursor;
+ /* before decrement; chip actual # */
+ porttid = pd->port_port * tidcnt;
+ ntids = tidcnt;
+ tidbase = (uint64_t __iomem *)
+ (((char __iomem *) devdata[pd->port_unit].ipath_kregbase) +
+ devdata[pd->port_unit].ipath_rcvtidbase +
+ porttid * sizeof(*tidbase));
+
+ _IPATH_VDBG("Port%u %u tids, cursor %u, tidbase %p\n", pd->port_port,
+ cnt, tid, tidbase);
+
+ vaddr = tu.tidvaddr; /* virtual address of first page in transfer */
+ if (!access_ok(VERIFY_WRITE, (void __user *) vaddr, cnt * PAGE_SIZE)) {
+ _IPATH_DBG("Fail vaddr %llx, %u pages, !access_ok\n",
+ vaddr, cnt);
+ ret = -EFAULT;
+ goto done;
+ }
+ if ((ret = ipath_get_upages((unsigned long)vaddr, cnt, pagep))) {
+ if (ret == -EBUSY) {
+ _IPATH_DBG
+ ("Failed to lock addr %p, %u pages (already locked)\n",
+ (void *)vaddr, cnt);
+ /*
+ * for now, continue, and see what happens
+ * but with the new implementation, this should
+ * never happen, unless perhaps the user has
+ * mpin'ed the pages themselves (something we
+ * need to test)
+ */
+ ret = 0;
+ } else {
+ _IPATH_INFO
+ ("Failed to lock addr %p, %u pages: errno %d\n",
+ (void *)vaddr, cnt, -ret);
+ goto done;
+ }
+ }
+ for (i = 0; i < cnt; i++, vaddr += PAGE_SIZE) {
+ for (; ntids--; tid++) {
+ if (tid == tidcnt)
+ tid = 0;
+ if (!dd->ipath_pageshadow[porttid + tid])
+ break;
+ }
+ if (ntids < 0) {
+ /*
+ * oops, wrapped all the way through their TIDs,
+ * and didn't have enough free; see comments at
+ * start of routine
+ */
+ _IPATH_DBG
+ ("Not enough free TIDs for %u pages (index %d), failing\n",
+ cnt, i);
+ i--; /* last tidlist[i] not filled in */
+ ret = -ENOMEM;
+ break;
+ }
+ tidlist[i] = tid;
+ _IPATH_VDBG("Updating idx %u to TID %u, vaddr %llx\n",
+ i, tid, vaddr);
+ /* for now we "know" system pages and TID pages are same size */
+ /* for ipath_free_tid */
+ dd->ipath_pageshadow[porttid + tid] = pagep[i];
+ __set_bit(tid, tidmap); /* don't need atomic or its overhead */
+ physaddr = page_to_phys(pagep[i]);
+ ipath_stats.sps_pagelocks++;
+ _IPATH_VDBG("TID %u, vaddr %llx, physaddr %llx pgp %p\n",
+ tid, vaddr, physaddr, pagep[i]);
+ /*
+ * in words (fixed, full page). We could make it less for the very
+ * last page in the transfer, but for now we won't worry about it.
+ */
+ lenvalid = PAGE_SIZE >> 2;
+ lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT;
+ physaddr |= lenvalid | INFINIPATH_RT_VALID;
+ ipath_kput_memq(pd->port_unit, &tidbase[tid], physaddr);
+ /*
+ * don't check this tid in ipath_pageshadow, since we
+ * just filled it in; start with the next one.
+ */
+ tid++;
+ }
+
+ if (ret) {
+ uint32_t limit;
+ uint64_t tidval;
+ /*
+ * chip errata bug 7358, try to work around it by
+ * marking invalid tids as having max length
+ */
+ tidval =
+ (-1LL & INFINIPATH_RT_BUFSIZE_MASK) <<
+ INFINIPATH_RT_BUFSIZE_SHIFT;
+ cleanup:
+ /* jump here if copy out of updated info failed... */
+ _IPATH_DBG("After failure (ret=%d), undo %d of %d entries\n",
+ -ret, i, cnt);
+ /* same code that's in ipath_free_tid() */
+ if ((limit = sizeof(tidmap) * BITS_PER_BYTE) > tidcnt)
+ /* just in case size changes in future */
+ limit = tidcnt;
+ tid = find_first_bit((const unsigned long *)tidmap, limit);
+ /*
+ * chip errata bug 7358, try to work around it by
+ * marking invalid tids as having max length
+ */
+ tidval =
+ (-1LL & INFINIPATH_RT_BUFSIZE_MASK) <<
+ INFINIPATH_RT_BUFSIZE_SHIFT;
+ for (; tid < limit; tid++) {
+ if (!test_bit(tid, tidmap))
+ continue;
+ if (dd->ipath_pageshadow[porttid + tid]) {
+ _IPATH_VDBG("Freeing TID %u\n", tid);
+ ipath_kput_memq(pd->port_unit, &tidbase[tid],
+ tidval);
+ dd->ipath_pageshadow[porttid + tid] = NULL;
+ ipath_stats.sps_pageunlocks++;
+ }
+ }
+ ipath_putpages(cnt, pagep);
+ } else {
+ /*
+ * copy the updated array, with ipath_tid's filled in,
+ * back to user. Since we did the copy in already, this
+ * "should never fail"
+ * If it does, we have to clean up...
+ */
+ int r;
+ if ((r = copy_to_user((void __user *) tu.tidlist, tidlist,
+ cnt * sizeof(*tidlist)))) {
+ _IPATH_DBG("Failed to copy out %d TIDs (%lx bytes) "
+ "to %llx (ret %x)\n", cnt,
+ cnt * sizeof(*tidlist), tu.tidlist, r);
+ ret = -EFAULT;
+ goto cleanup;
+ }
+ if (copy_to_user((void __user *) tu.tidmap, tidmap,
+ sizeof tidmap)) {
+ _IPATH_DBG("Failed to copy out TID map to %llx\n",
+ tu.tidmap);
+ ret = -EFAULT;
+ goto cleanup;
+ }
+ if (tid == tidcnt)
+ tid = 0;
+ pd->port_tidcursor = tid;
+ }
+
+done:
+ if (ret)
+ _IPATH_DBG("Failed to map %u TID pages, failing with %d, "
+ "tidu %p\n", tu.tidcnt, -ret, tidu);
+ return ret;
+}
+
+/*
+ * right now we are unlocking one page at a time, but since
+ * the intended use of this routine is for a single group of
+ * virtually contiguous pages, that should change to improve
+ * performance. We check that the TID is in range for this port,
+ * but otherwise don't check validity; if the user has an error and
+ * frees the wrong tid, it's only their own data that can thereby
+ * be corrupted. We do check that the TID was in use, for sanity.
+ * We always use our idea of the saved address, not the address that
+ * they pass in to us.
+ */
+
+static int ipath_tid_free(struct ipath_portdata * pd,
+ struct _tidupd __user *tidu)
+{
+ int ret = 0;
+ uint32_t tid, porttid, cnt, limit, tidcnt;
+ struct _tidupd tu;
+ struct ipath_devdata *dd = &devdata[pd->port_unit];
+ uint64_t __iomem *tidbase;
+ uint64_t tidmap[8];
+ uint64_t tidval;
+
+ tu.tidcnt = 0; /* for early errors */
+ if (!dd->ipath_pageshadow) {
+ ret = -ENOMEM;
+ goto done;
+ }
+
+ if (copy_from_user(&tu, tidu, sizeof tu)) {
+ _IPATH_DBG("copy of tidupd structure failed\n");
+ ret = -EFAULT;
+ goto done;
+ }
+ if (copy_from_user(tidmap, (void __user *) tu.tidmap, sizeof tidmap)) {
+ _IPATH_DBG("copy of tidmap failed\n");
+ ret = -EFAULT;
+ goto done;
+ }
+
+ porttid = pd->port_port * dd->ipath_rcvtidcnt;
+ tidbase = (uint64_t __iomem *)
+ ((char __iomem *) (devdata[pd->port_unit].ipath_kregbase) +
+ devdata[pd->port_unit].ipath_rcvtidbase +
+ porttid * sizeof(*tidbase));
+
+ tidcnt = dd->ipath_rcvtidcnt;
+ if ((limit = sizeof(tidmap) * BITS_PER_BYTE) > tidcnt)
+ limit = tidcnt; /* just in case size changes in future */
+ tid = find_first_bit((const unsigned long *)tidmap, limit);
+ _IPATH_VDBG
+ ("Port%u free %u tids; first bit (max=%d) set is %d, porttid %u\n",
+ pd->port_port, tu.tidcnt, limit, tid, porttid);
+ /*
+ * chip errata bug 7358, try to work around it by marking invalid
+ * tids as having max length
+ */
+ tidval =
+ (-1LL & INFINIPATH_RT_BUFSIZE_MASK) << INFINIPATH_RT_BUFSIZE_SHIFT;
+ for (cnt = 0; tid < limit; tid++) {
+ /*
+ * small optimization; if we detect a run of 3 or so without
+ * any set, use find_first_bit again. That's mainly to
+ * accelerate the case where we wrapped, so we have some at
+ * the beginning, and some at the end, and a big gap
+ * in the middle.
+ */
+ if (!test_bit(tid, tidmap))
+ continue;
+ cnt++;
+ if (dd->ipath_pageshadow[porttid + tid]) {
+ _IPATH_VDBG("Freeing TID %u\n", tid);
+ ipath_kput_memq(pd->port_unit, &tidbase[tid], tidval);
+ ipath_putpages(1, &dd->ipath_pageshadow[porttid + tid]);
+ dd->ipath_pageshadow[porttid + tid] = NULL;
+ ipath_stats.sps_pageunlocks++;
+ } else
+ _IPATH_DBG("Unused tid %u, ignoring\n", tid);
+ }
+ if (cnt != tu.tidcnt)
+ _IPATH_DBG("passed in tidcnt %d, only %d bits set in map\n",
+ tu.tidcnt, cnt);
+done:
+ if (ret)
+ _IPATH_DBG("Failed to unmap %u TID pages, failing with %d\n",
+ tu.tidcnt, -ret);
+ return ret;
+}
+
+/* called from user init code, and also layered driver init */
+int ipath_setrcvhdrsize(const ipath_type mdev, unsigned rhdrsize)
+{
+ int ret = 0;
+ if (devdata[mdev].ipath_flags & IPATH_RCVHDRSZ_SET) {
+ if (devdata[mdev].ipath_rcvhdrsize != rhdrsize) {
+ _IPATH_INFO
+ ("Error: can't set protocol header size %u, already %u\n",
+ rhdrsize, devdata[mdev].ipath_rcvhdrsize);
+ ret = -EAGAIN;
+ } else
+ /* OK if set already, with same value, nothing to do */
+ _IPATH_VDBG("Reuse same protocol header size %u\n",
+ devdata[mdev].ipath_rcvhdrsize);
+ } else if (rhdrsize >
+ (devdata[mdev].ipath_rcvhdrentsize -
+ (sizeof(uint64_t) / sizeof(uint32_t)))) {
+ _IPATH_DBG
+ ("Error: can't set protocol header size %u (> max %u)\n",
+ rhdrsize,
+ devdata[mdev].ipath_rcvhdrentsize -
+ (uint32_t) (sizeof(uint64_t) / sizeof(uint32_t)));
+ ret = -EOVERFLOW;
+ } else {
+ devdata[mdev].ipath_flags |= IPATH_RCVHDRSZ_SET;
+ devdata[mdev].ipath_rcvhdrsize = rhdrsize;
+ ipath_kput_kreg(mdev, kr_rcvhdrsize,
+ devdata[mdev].ipath_rcvhdrsize);
+ _IPATH_VDBG("Set protocol header size to %u\n",
+ devdata[mdev].ipath_rcvhdrsize);
+ }
+ return ret;
+}
+
+
+/*
+ * find an available pio buffer, and do appropriate marking as busy, etc.
+ * Returns a pointer to the buffer, or NULL if none is available; when one
+ * is found, its number is returned through *pbufnum.
+ * Used by ipath_send_smapkt and ipath_layer_send
+ */
+uint32_t __iomem *ipath_getpiobuf(int mdev, uint32_t *pbufnum)
+{
+ int i, j, starti, updated = 0;
+ unsigned piobcnt, iter;
+ unsigned long flags;
+ struct ipath_devdata *dd = &devdata[mdev];
+ uint64_t *shadow = dd->ipath_pioavailshadow;
+ uint32_t __iomem *buf;
+
+ piobcnt = (unsigned)devdata[mdev].ipath_piobcnt;
+ starti = devdata[mdev].ipath_lastport_piobuf;
+ iter = piobcnt - starti;
+ if (dd->ipath_upd_pio_shadow) {
+ /*
+ * minor optimization. If we had no buffers on the last call, start
+ * out by doing the update; continue and do the scan even if no
+ * buffers were updated, to be paranoid.
+ */
+ ipath_update_pio_bufs(mdev);
+ updated = 1; /* we scanned here, don't do it at end of scan */
+ i = starti;
+ }
+ else
+ i = devdata[mdev].ipath_lastpioindex;
+
+rescan:
+ /*
+ * while test_and_set_bit() is atomic, we do that and then the
+ * change_bit(), and the pair is not atomic.
+ * See if this is the cause of the remaining armlaunch errors.
+ */
+ spin_lock_irqsave(&ipath_pioavail_lock, flags);
+ for (j = 0; j < iter; j++, i++) {
+ if (i >= piobcnt)
+ i = starti;
+ /*
+ * To avoid bus lock overhead, we first find a candidate
+ * buffer, then do the test and set, and continue if that fails.
+ */
+ if (test_bit((2 * i) + 1, shadow) ||
+ test_and_set_bit((2 * i) + 1, shadow)) {
+ continue;
+ }
+ /* flip generation bit */
+ change_bit(2 * i, shadow);
+ break;
+ }
+ spin_unlock_irqrestore(&ipath_pioavail_lock, flags);
+
+ if (j == iter) {
+ /*
+ * first time through; shadow exhausted, but real buffers may be
+ * available, so go see; if any were updated, rescan (once)
+ */
+ if (!updated) {
+ ipath_update_pio_bufs(mdev);
+ updated = 1;
+ i = starti;
+ goto rescan;
+ }
+ dd->ipath_upd_pio_shadow = 1;
+ /* not atomic, but if we lose one once in a while, that's OK */
+ ipath_stats.sps_nopiobufs++;
+ if (!(++dd->ipath_consec_nopiobuf % 100000)) {
+ _IPATH_DBG
+ ("%u pio sends with no bufavail; dmacopy: %llx %llx %llx %llx; shadow: %llx %llx %llx %llx\n",
+ dd->ipath_consec_nopiobuf,
+ dd->ipath_pioavailregs_dma[0],
+ dd->ipath_pioavailregs_dma[1],
+ dd->ipath_pioavailregs_dma[2],
+ dd->ipath_pioavailregs_dma[3],
+ shadow[0], shadow[1], shadow[2], shadow[3]);
+ /*
+ * 4 buffers per byte, 4 registers above, cover
+ * rest below
+ */
+ if (dd->ipath_piobcnt > (sizeof(shadow[0])
+ * 4 * 4))
+ _IPATH_DBG
+ ("2nd group: dmacopy: %llx %llx %llx %llx; shadow: %llx %llx %llx %llx\n",
+ devdata[mdev].ipath_pioavailregs_dma[4],
+ devdata[mdev].ipath_pioavailregs_dma[5],
+ devdata[mdev].ipath_pioavailregs_dma[6],
+ devdata[mdev].ipath_pioavailregs_dma[7],
+ shadow[4], shadow[5], shadow[6], shadow[7]);
+ }
+ return NULL;
+ }
+
+ if (updated && devdata[mdev].ipath_layer.l_intr) {
+ /*
+ * we had run out of bufs; some (at least the one we just got)
+ * are now available, so tell the layered driver.
+ */
+ dd->ipath_layer.l_intr(mdev, IPATH_LAYER_INT_SEND_CONTINUE);
+ }
+
+ /*
+ * set next starting place. Since it's just an optimization,
+ * it doesn't matter who wins on this, so no locking
+ */
+ dd->ipath_lastpioindex = i + 1;
+ if (dd->ipath_upd_pio_shadow)
+ dd->ipath_upd_pio_shadow = 0;
+ if (dd->ipath_consec_nopiobuf)
+ dd->ipath_consec_nopiobuf = 0;
+ buf = (uint32_t __iomem *)(dd->ipath_piobase + i * dd->ipath_palign);
+ _IPATH_VDBG("Return piobuf %u @ %p\n", i, buf);
+ if (pbufnum)
+ *pbufnum = i;
+ return buf;
+}
+
+/*
+ * this is like ipath_getpiobuf(), except it just probes to see if a buffer
+ * is available. If it returns that there is one, it's not allocated,
+ * and so may not still be available when the caller tries to send.
+ * NOTE: This can be called from interrupt context by ipath_intr()
+ * and from non-interrupt context by layer_send_getpiobuf().
+ */
+int ipath_bufavail(int mdev)
+{
+ int i;
+ unsigned piobcnt;
+ uint64_t *shadow = devdata[mdev].ipath_pioavailshadow;
+
+ piobcnt = (unsigned)devdata[mdev].ipath_piobcnt;
+
+ for (i = devdata[mdev].ipath_lastport_piobuf; i < piobcnt; i++)
+ if (!test_bit((2 * i) + 1, shadow))
+ return 1;
+
+ /* if none, check for update and rescan if we updated */
+ ipath_update_pio_bufs(mdev);
+ for (i = devdata[mdev].ipath_lastport_piobuf; i < piobcnt; i++)
+ if (!test_bit((2 * i) + 1, shadow))
+ return 1;
+ _IPATH_PDBG("No bufs avail\n");
+ return 0;
+}
+
+/*
+ * This routine is no longer on any critical paths; it is used only
+ * for sending SMA packets, and for some diagnostic uses.
+ * Because it's currently SMA-only, there are no checks to see if the
+ * link is up; SMA must be able to send while not fully initialized.
+ */
+int ipath_send_smapkt(struct ipath_sendpkt __user *upkt)
+{
+ int i, ret = 0;
+ uint32_t __iomem *piobuf;
+ uint32_t plen = 0, clen, pbufn;
+ struct ipath_sendpkt kpkt;
+ struct ipath_iovec *iov = kpkt.sps_iov;
+ ipath_type t;
+ uint32_t *tmpbuf = NULL;
+
+ if (unlikely((copy_from_user(&kpkt, upkt, sizeof kpkt))))
+ ret = -EFAULT;
+ if (ret) {
+ _IPATH_VDBG("Send failed: error %d\n", -ret);
+ goto done;
+ }
+ t = kpkt.sps_flags;
+ if (t >= infinipath_max || !(devdata[t].ipath_flags & IPATH_PRESENT) ||
+ !devdata[t].ipath_kregbase) {
+ _IPATH_SMADBG("illegal unit %u for sma send\n", t);
+ return -ENODEV;
+ }
+ if (!(devdata[t].ipath_flags & IPATH_INITTED)) {
+ /* no hardware, freeze, etc. */
+ _IPATH_SMADBG("unit %u not usable\n", t);
+ return -ENODEV;
+ }
+
+ /* need total length before first word written */
+ plen = sizeof(uint32_t); /* +1 word is for the qword padding */
+ for (i = 0; i < kpkt.sps_cnt; i++)
+ /* each must be dword multiple */
+ plen += kpkt.sps_iov[i].iov_len;
+
+ if ((plen + 4) > devdata[t].ipath_ibmaxlen) {
+ _IPATH_DBG("Pkt len 0x%x > ibmaxlen %x\n",
+ plen - 4, devdata[t].ipath_ibmaxlen);
+ ret = -EINVAL;
+ goto done; /* before writing pbc */
+ }
+ if (!(tmpbuf = vmalloc(plen))) {
+ _IPATH_INFO("Unable to allocate tmp buffer, failing\n");
+ ret = -ENOMEM;
+ goto done;
+ }
+ plen >>= 2; /* in words */
+
+ piobuf = ipath_getpiobuf(t, &pbufn);
+ if (!piobuf) {
+ ret = -EBUSY;
+ devdata[t].ipath_nosma_bufs++;
+ _IPATH_SMADBG("No PIO buffers available unit %u %u times\n",
+ t, devdata[t].ipath_nosma_bufs);
+ goto done;
+ }
+ if (devdata[t].ipath_nosma_bufs) {
+ _IPATH_SMADBG(
+ "Unit %u got SMA send buffer after %u failures, %u seconds\n",
+ t, devdata[t].ipath_nosma_bufs, devdata[t].ipath_nosma_secs);
+ devdata[t].ipath_nosma_bufs = 0;
+ devdata[t].ipath_nosma_secs = 0;
+ }
+ if ((devdata[t].ipath_lastibcstat & 0x11) != 0x11 &&
+ (devdata[t].ipath_lastibcstat & 0x21) != 0x21) {
+ /* we need to be at least at INIT for SMA packets to go out. If we
+ * aren't, something has gone wrong, and SMA hasn't noticed.
+ * Therefore we'll try to go to INIT here, in hopes of fixing up the
+ * problem. First we verify that indeed the state is still "bad"
+ * (that is, that lastibcstat isn't "stale") */
+ uint64_t val;
+ val = ipath_kget_kreg64(t, kr_ibcstatus);
+ if ((val & 0x11) != 0x11 && (val & 0x21) != 0x21) {
+ _IPATH_SMADBG("Invalid Link state 0x%llx unit %u for send, try INIT\n",
+ val, t);
+ ipath_set_ib_lstate(t, INFINIPATH_IBCC_LINKCMD_INIT);
+ val = ipath_kget_kreg64(t, kr_ibcstatus);
+ if ((val & 0x11) != 0x11 && (val & 0x21) != 0x21)
+ _IPATH_SMADBG("Link state still not OK unit %u (0x%llx) after INIT\n",
+ t, val);
+ else
+ _IPATH_SMADBG("Link state OK unit %u (0x%llx) after INIT\n",
+ t, val);
+ }
+ /* and continue, regardless */
+ }
+
+ if (infinipath_debug & __IPATH_PKTDBG) /* SMA and PKT, both */
+ _IPATH_SMADBG("unit %u 0x%x+1w pio%d, (scnt %d)\n",
+ t, plen - 1, pbufn, kpkt.sps_cnt);
+
+ /* we have to flush after the PBC for correctness on some CPUs,
+ * or the WC buffer can be written out of order */
+ writeq(plen, piobuf);
+ mb();
+ ret = 0;
+ for (clen=i=0; i < kpkt.sps_cnt; i++) {
+ if (unlikely(copy_from_user(tmpbuf + clen,
+ (void __user *) iov->iov_base,
+ iov->iov_len)))
+ ret = -EFAULT; /* no break */
+ clen += iov->iov_len >> 2;
+ iov++;
+ }
+ /* copy all but the trigger word, then flush, so it's written
+ * to the chip before the trigger word, then write the trigger word,
+ * then flush again, so the packet is sent. */
+ memcpy_toio32(piobuf+2, tmpbuf, clen-1);
+ mb();
+ writel(tmpbuf[clen-1], piobuf+clen+1);
+ mb();
+
+ if (ret) {
+ /*
+ * Packet is bad, so we need to use the PIO abort mechanism to
+ * abort the packet
+ */
+ uint32_t sendctrl;
+ sendctrl = devdata[t].ipath_sendctrl | INFINIPATH_S_DISARM |
+ (pbufn << INFINIPATH_S_DISARMPIOBUF_SHIFT);
+ _IPATH_DBG("Doing PIO abort on buffer %u after error\n",
+ pbufn);
+ ipath_kput_kreg(t, kr_sendctrl, sendctrl);
+ }
+
+done:
+ vfree(tmpbuf);
+ return ret;
+}
+
+/*
+ * implementation of the ioctl to get the counter values from the chip.
+ * For the time being, we get all of them when asked, no shadowing.
+ * We need to shadow the byte counters at a minimum, because otherwise
+ * they will wrap in just a few seconds at full bandwidth.
+ * The second argument is the user address to which we do the copy_to_user().
+ */
+static int ipath_get_counters(ipath_type t,
+ struct infinipath_counters __user *ucounters)
+{
+ int ret = 0;
+ uint64_t val;
+ uint64_t __user *ucreg;
+ uint16_t vcreg;
+
+ ucreg = (uint64_t __user *) ucounters;
+ /*
+ * for now, let's do this one at a time. It's not the most
+ * efficient method, but it is simple, and has no intermediate
+ * memory requirements.
+ */
+ for (vcreg = 0;
+ vcreg < (sizeof(struct infinipath_counters) / sizeof(val));
+ vcreg++, ucreg++) {
+ ipath_creg creg = vcreg;
+ val = ipath_snap_cntr(t, creg);
+ if ((ret = copy_to_user(ucreg, &val, sizeof(val)))) {
+ _IPATH_DBG("copy_to_user error on counter %d\n", creg);
+ ret = -EFAULT;
+ break;
+ }
+ }
+
+ return ret;
+}

2005-12-29 00:44:50

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 5 of 20] ipath - driver core header files

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r a3a00f637da6 -r 2d9a3f27a10c drivers/infiniband/hw/ipath/ipath_common.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_common.h Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,704 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#ifndef _IPATH_COMMON_H
+#define _IPATH_COMMON_H
+
+/*
+ * This file contains defines, structures, etc. that are used
+ * to communicate between kernel and user code.
+ */
+
+#define IPATH_IDSTR "PathScale 1.1\n"
+
+typedef uint8_t ipath_type;
+
+/* This is the IEEE-assigned OUI for PathScale, Inc. */
+#define IPATH_SRC_OUI_1 0x00
+#define IPATH_SRC_OUI_2 0x11
+#define IPATH_SRC_OUI_3 0x75
+
+/* version of protocol header (known to chip also). In the long run,
+ * we should be able to generate and accept a range of version numbers;
+ * for now we only accept one, and it's compiled in.
+ */
+#define IPS_PROTO_VERSION 2
+
+/*
+ * These are compile time constants that you may want to enable or disable
+ * if you are trying to debug problems with code or performance.
+ * IPATH_VERBOSE_TRACING define as 1 if you want additional tracing in
+ * fastpath code
+ * IPATH_TRACE_REGWRITES define as 1 if you want register writes to be
+ * traced in fastpath code
+ * _IPATH_TRACING define as 0 if you want to remove all tracing in a
+ * compilation unit
+ * _IPATH_DEBUGGING define as 0 if you want to remove debug prints
+ */
+
+
+/*
+ * These tell the driver which ioctls belong to the diags interface.
+ * As above, don't use them elsewhere.
+ */
+#define _IPATH_DIAG_IOCTL_LOW 100
+#define _IPATH_DIAG_IOCTL_HIGH 109
+
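+/*
+ * addr holds a user buffer pointer (kept in a 64-bit field for 32/64-bit
+ * compatibility); offset and len select a range within the first 128
+ * bytes of the flash (see ipath_rd_eeprom() and ipath_wr_eeprom()).
+ */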
+struct ipath_eeprom_req {
+ long long addr;
+ uint16_t len;
+ uint16_t offset;
+};
+
+/*
+ * NOTE: We use compatible ioctls, the same ioctl code for both 32 and 64
+ * bit user mode. For that reason, all structures, etc. used in these
+ * ioctls must have the same size and offsets, in both 32 and 64 bit mode.
+ * We normally use uint64_t to hold pointers for this reason, doing appropriate
+ * casts on both sides before using the data value.
+ */
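+
+/*
+ * For example (hypothetical user-space usage, not part of the driver),
+ * a pointer is placed in one of these uint64_t fields by casting through
+ * unsigned long before issuing the ioctl; here tidarray is the uint16_t
+ * TID array and bitmap the TID bitmap described by struct _tidupd below:
+ *
+ *	struct _tidupd tu;
+ *	tu.tidcnt = n;
+ *	tu.tidvaddr = (uint64_t) (unsigned long) vaddr;
+ *	tu.tidlist = (uint64_t) (unsigned long) tidarray;
+ *	tu.tidmap = (uint64_t) (unsigned long) bitmap;
+ *	ioctl(fd, IPATH_UPDM_TID, &tu);
+ */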
+
+/* init; user params to kernel */
+#define IPATH_USERINIT _IOW('s', 16, struct ipath_user_info)
+/* init; kernel/chip params to user */
+#define IPATH_BASEINFO _IOR('s', 17, struct ipath_base_info)
+/* send a packet */
+#define IPATH_SENDPKT _IOW('s', 18, struct ipath_sendpkt)
+/*
+ * if arg is 0, disable port, used when flushing after a hdrq overflow.
+ * If arg is 1, re-enable, and return new value of head register
+ */
+#define IPATH_RCVCTRL _IOR('s', 19, uint32_t)
+#define IPATH_READ_EEPROM _IOWR('s', 20, struct ipath_eeprom_req)
+/* set an accepted partition key; up to 4 pkeys can be active at once */
+#define IPATH_SET_PKEY _IOW('s', 21, uint16_t)
+#define IPATH_WRITE_EEPROM _IOWR('s', 22, struct ipath_eeprom_req)
+/* set LID for interface (SMA) */
+#define IPATH_SET_LID _IOW('s', 23, uint32_t)
+/* set IB MTU for interface (SMA) */
+#define IPATH_SET_MTU _IOW('s', 24, uint32_t)
+/* set IB link state for interface (SMA) */
+#define IPATH_SET_LINKSTATE _IOW('s', 25, uint32_t)
+/* send an SMA packet, sps_flags contains "normal" SMA unit and minor number. */
+#define IPATH_SEND_SMA_PKT _IOW('s', 26, struct ipath_sendpkt)
+/* receive an SMA packet */
+#define IPATH_RCV_SMA_PKT _IOW('s', 27, struct ipath_sendpkt)
+/* get the portinfo data (SMA)
+ * takes array of 13, returns port info fields. Data is in host order,
+ * not network order; SMA-only fields are not filled in
+ */
+#define IPATH_GET_PORTINFO _IOWR('s', 28, uint32_t *)
+/*
+ * get the nodeinfo data (SMA)
+ * takes an array of 10, returns nodeinfo fields in host order
+ */
+#define IPATH_GET_NODEINFO _IOWR('s', 29, uint32_t *)
+/* set GUID on interface (SMA; GUID given in network order) */
+#define IPATH_SET_GUID _IOW('s', 30, struct ipath_setguid)
+/* set MLID for interface (SMA) */
+#define IPATH_SET_MLID _IOW('s', 31, uint32_t)
+#define IPATH_GET_MLID _IOWR('s', 32, uint32_t *) /* get the MLID (SMA) */
+/* update expected TID entries */
+#define IPATH_UPDM_TID _IOWR('s', 33, struct _tidupd)
+/* free expected TID entries */
+#define IPATH_FREE_TID _IOW('s', 34, struct _tidupd)
+/* return assigned unit:port */
+#define IPATH_GETPORT _IOR('s', 35, uint32_t)
+/* wait for rcv pkt or pioavail */
+#define IPATH_WAIT _IOW('s', 36, uint32_t)
+/* return LID for passed in unit */
+#define IPATH_GETLID _IOR('s', 37, uint16_t)
+/* return # of units supported by driver */
+#define IPATH_GETUNITS _IO('s', 38)
+/* get the device status */
+#define IPATH_GET_DEVSTATUS _IOWR('s', 39, uint64_t *)
+
+/* available for reuse ('s', 48) */
+
+/* diagnostic read */
+#define IPATH_DIAGREAD _IOR('s', 100, struct ipath_diag_info)
+/* diagnostic write */
+#define IPATH_DIAGWRITE _IOW('s', 101, struct ipath_diag_info)
+/* HT Config read */
+#define IPATH_DIAG_HTREAD _IOR('s', 102, struct ipath_diag_info)
+/* HT config write */
+#define IPATH_DIAG_HTWRITE _IOW('s', 103, struct ipath_diag_info)
+#define IPATH_DIAGENTER _IO('s', 104) /* Enter diagnostic mode */
+#define IPATH_DIAGLEAVE _IO('s', 105) /* Leave diagnostic mode */
+/* send a packet, sps_flags contains unit and minor number. */
+#define IPATH_SEND_DIAG_PKT _IOW('s', 106, struct ipath_sendpkt)
+/*
+ * read I2C FLASH
+ * NOTE: To read the I2C device, the _uaddress field should contain
+ * a pointer to struct ipath_eeprom_req, and _unit must be valid
+ */
+#define IPATH_DIAG_RD_I2C _IOW('s', 107, struct ipath_diag_info)
+
+/*
+ * Monitoring ioctls. All of these work with the main device
+ * (/dev/ipath), if you don't mind using a port (e.g. you already have
+ * the device open.) IPATH_GETSTATS and IPATH_GETUNITCOUNTERS also
+ * work with the control device (/dev/ipath_ctrl), if you don't want to
+ * use a port.
+ */
+
+/* return chip counters for current unit. */
+#define IPATH_GETCOUNTERS _IOR('s', 40, struct infinipath_counters)
+/* return chip stats */
+#define IPATH_GETSTATS _IOR('s', 41, struct infinipath_stats)
+/* return chip counters for a particular unit. */
+#define IPATH_GETUNITCOUNTERS _IOR('s', 43, struct infinipath_getunitcounters)
+
+/*
+ * unit is incoming unit number.
+ * data is a pointer to the infinipath_counters structure.
+ */
+struct infinipath_getunitcounters {
+ uint16_t unit;
+ uint16_t fill[3]; /* required for same size struct 32/64 bit */
+ uint64_t data;
+};
+
+/*
+ * The value in the BTH QP field that InfiniPath uses to differentiate
+ * an infinipath protocol IB packet vs standard IB transport
+ */
+#define IPATH_KD_QP 0x656b79
+
+/*
+ * valid states passed to ipath_set_linkstate() user call
+ * (IPATH_SET_LINKSTATE ioctl)
+ */
+#define IPATH_IB_LINKDOWN 0
+#define IPATH_IB_LINKARM 1
+#define IPATH_IB_LINKACTIVE 2
+#define IPATH_IB_LINKINIT 3
+#define IPATH_IB_LINKDOWN_POLL 4
+#define IPATH_IB_LINKDOWN_DISABLE 5
+
+/*
+ * stats maintained by the driver. For now, at least, this is global
+ * to all minor devices.
+ */
+struct infinipath_stats {
+ uint64_t sps_ints; /* number of interrupts taken */
+ uint64_t sps_errints; /* number of interrupts for errors */
+ /* number of errors from chip (not including packet errors or CRC) */
+ uint64_t sps_errs;
+ /* number of packet errors from chip other than CRC */
+ uint64_t sps_pkterrs;
+ /* number of packets with CRC errors (ICRC and VCRC) */
+ uint64_t sps_crcerrs;
+ /* number of hardware errors reported (parity, etc.) */
+ uint64_t sps_hwerrs;
+ /* number of times IB link changed state unexpectedly */
+ uint64_t sps_iblink;
+ uint64_t sps_unused3; /* no longer used; left for compatibility */
+ uint64_t sps_port0pkts; /* number of kernel (port0) packets received */
+ /* number of "ethernet" packets sent by driver */
+ uint64_t sps_ether_spkts;
+ /* number of "ethernet" packets received by driver */
+ uint64_t sps_ether_rpkts;
+ uint64_t sps_sma_spkts; /* number of SMA packets sent by driver */
+ uint64_t sps_sma_rpkts; /* number of SMA packets received by driver */
+ /* number of times all ports rcvhdrq was full and packet dropped */
+ uint64_t sps_hdrqfull;
+ /* number of times all ports egrtid was full and packet dropped */
+ uint64_t sps_etidfull;
+ /*
+ * number of times we tried to send from driver, but no pio
+ * buffers avail
+ */
+ uint64_t sps_nopiobufs;
+ uint64_t sps_ports; /* number of ports currently open */
+ /* list of pkeys (other than default) accepted (0 means not set) */
+ uint16_t sps_pkeys[4];
+ /* lids for up to 4 infinipaths, indexed by infinipath # */
+ uint16_t sps_lid[4];
+ /* number of user ports per chip (not IB ports) */
+ uint32_t sps_nports;
+ uint32_t sps_nullintr; /* not our interrupt, or already handled */
+ uint32_t sps_maxpkts_call; /* max number of packets handled per receive call */
+ uint32_t sps_avgpkts_call; /* avg number of packets handled per receive call */
+ uint64_t sps_pagelocks; /* total number of pages locked */
+ uint64_t sps_pageunlocks; /* total number of pages unlocked */
+ /*
+ * Number of packets dropped in kernel other than errors
+ * (ether packets if ipath not configured, sma/mad, etc.)
+ */
+ uint64_t sps_krdrops;
+ /* mlids for up to 4 infinipaths, indexed by infinipath # */
+ uint16_t sps_mlid[4];
+ uint64_t __sps_pad[45]; /* pad for future growth */
+};
+
+/*
+ * These are the status bits returned (in ascii form, 64bit value)
+ * by the IPATH_GETSTATS ioctl.
+ */
+#define IPATH_STATUS_INITTED 0x1 /* basic driver initialization done */
+#define IPATH_STATUS_DISABLED 0x2 /* hardware disabled */
+#define IPATH_STATUS_UNUSED 0x4 /* available */
+#define IPATH_STATUS_OIB_SMA 0x8 /* ipath_mad kernel SMA running */
+#define IPATH_STATUS_SMA 0x10 /* user SMA running */
+/* Chip has been found and initted */
+#define IPATH_STATUS_CHIP_PRESENT 0x20
+#define IPATH_STATUS_IB_READY 0x40 /* IB link is at ACTIVE, has LID,
+ * usable for all VL's */
+/* after link up, LID,MTU,etc. has been configured */
+#define IPATH_STATUS_IB_CONF 0x80
+/* no link established, probably no cable */
+#define IPATH_STATUS_IB_NOCABLE 0x100
+/* A Fatal hardware error has occurred. */
+#define IPATH_STATUS_HWERROR 0x200
+
+/* The list of usermode accessible registers. Also see Reg_* later in file */
+typedef enum _ipath_ureg {
+ ur_rcvhdrtail = 0, /* (RO) DMA RcvHdr to be used next. */
+ /* (RW) RcvHdr entry to be processed next by host. */
+ ur_rcvhdrhead = 1,
+ ur_rcvegrindextail = 2, /* (RO) Index of next Eager index to use. */
+ ur_rcvegrindexhead = 3, /* (RW) Eager TID to be processed next */
+ /* For internal use only; max register number. */
+ _IPATH_UregMax
+} ipath_ureg;
+
+/* SMA minor# no portinfo, one for all instances */
+#define IPATH_SMA 128
+
+/* Control minor# no portinfo, one for all instances */
+#define IPATH_CTRL 130
+
+/*
+ * This structure is returned by ipath_userinit() immediately after open
+ * to get implementation-specific info, and info specific to this
+ * instance.
+ */
+struct ipath_base_info {
+ /* version of hardware, for feature checking. */
+ uint32_t spi_hw_version;
+ /* version of software, for feature checking. */
+ uint32_t spi_sw_version;
+ /* InfiniPath port assigned, goes into sent packets */
+ uint32_t spi_port;
+ /*
+ * IB MTU; a packet's IB data must be less than this.
+ * The MTU is in bytes, and will be a multiple of 4 bytes.
+ */
+ uint32_t spi_mtu;
+ /*
+ * size of a PIO buffer. Any given packet's total
+ * size must be less than this (in words). Included is the
+ * starting control word, so if 513 is returned, then total
+ * pkt size is 512 words or less.
+ */
+ uint32_t spi_piosize;
+ /* size of the TID cache in infinipath, in entries */
+ uint32_t spi_tidcnt;
+ /* size of the TID Eager list in infinipath, in entries */
+ uint32_t spi_tidegrcnt;
+ /* size of a single receive header queue entry. */
+ uint32_t spi_rcvhdrent_size;
+ /* Count of receive header queue entries allocated.
+ * This may be less than the spu_rcvhdrcnt passed in!
+ */
+ uint32_t spi_rcvhdr_cnt;
+
+ uint32_t __32_bit_compatibility_pad; /* DO NOT MOVE OR REMOVE */
+
+ /* address where receive buffer queue is mapped into user program */
+ uint64_t spi_rcvhdr_base;
+
+ /*
+ * base address of eager TID receive buffers; allocated by
+ * initialization code, not by protocol.
+ */
+ uint64_t spi_rcv_egrbufs;
+
+ /* size of each TID buffer in host memory,
+ * starting at spi_rcv_egrbufs. It includes spu_egrskip, and is
+ * at least spi_mtu bytes, and the buffers are virtually contiguous
+ */
+ uint32_t spi_rcv_egrbufsize;
+ /*
+ * The special QP (queue pair) value that identifies an infinipath
+ * protocol packet from standard IB packets. More, probably much
+ * more, to be added.
+ */
+ uint32_t spi_qpair;
+
+ /*
+ * user register base for init code, not to be used directly by
+ * protocol or applications
+ */
+ uint64_t __spi_uregbase;
+ /*
+ * maximum buffer size in bytes that can be used in a
+ * single TID entry (assuming the buffer is aligned to this boundary).
+ * This is the minimum of what the hardware and software support.
+ * Guaranteed to be a power of 2.
+ */
+ uint32_t spi_tid_maxsize;
+ /*
+ * alignment of each pio send buffer (byte count
+ * to add to spi_piobufbase to get to second buffer)
+ */
+ uint32_t spi_pioalign;
+ /*
+ * the index of the first pio buffer available
+ * to this process; needed to do lookup in spi_pioavailaddr; not added
+ * to spi_piobufbase
+ */
+ uint32_t spi_pioindex;
+ uint32_t spi_piocnt; /* number of buffers mapped for this process */
+
+ /*
+ * base address of writeonly pio buffers for this process.
+ * Each buffer has spi_piosize words, and is aligned on spi_pioalign
+ * boundaries. spi_piocnt buffers are mapped from this address
+ */
+ uint64_t spi_piobufbase;
+
+ /*
+ * base address of readonly memory copy of the pioavail registers.
+ * There are 2 bits for each buffer.
+ */
+ uint64_t spi_pioavailaddr;
+
+ /*
+ * Address where the driver updates a copy of the interface and
+ * driver status (IPATH_STATUS_*) as a 64-bit value.
+ * It's followed by a string indicating hardware error, if there was one.
+ */
+ uint64_t spi_status;
+
+ /* number of chip ports available to user processes */
+ uint32_t spi_nports;
+ uint32_t spi_unit; /* unit number of chip we are using */
+ uint32_t spi_rcv_egrperchunk; /* num bufs in each contiguous set */
+ /* size in bytes of each contiguous set */
+ uint32_t spi_rcv_egrchunksize;
+ /* total size of mmap to cover full rcvegrbuffers */
+ uint32_t spi_rcv_egrbuftotlen;
+ /*
+ * ioctl cmd includes struct size, so pad out, and adjust down as
+ * new fields are added to keep size constant
+ */
+ uint32_t __spi_pad[19];
+} __attribute__ ((aligned(8)));
+
+#define IPATH_WAIT_RCV 0x1 /* IPATH_WAIT, receive */
+#define IPATH_WAIT_PIO 0x2 /* IPATH_WAIT, PIO */
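+
+/*
+ * For IPATH_WAIT_RCV, the upper 16 bits of the ioctl argument carry the
+ * caller's current rcvhdrq head index (see ipath_wait_intr()).
+ */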
+
+/*
+ * This version number is given to the driver by the user code during
+ * initialization in the spu_userversion field of ipath_user_info, so
+ * the driver can check for compatibility with user code.
+ *
+ * The major version changes when data structures
+ * change in an incompatible way. The driver must be the same or higher
+ * for initialization to succeed. In some cases, a higher version
+ * driver will not interoperate with older software, and initialization
+ * will return an error.
+ */
+#define IPATH_USER_SWMAJOR 1
+
+/*
+ * Minor version differences are always compatible
+ * within a major version; however, if the user software is newer
+ * than the driver software, some new features and/or structure fields
+ * may not be implemented; the user code must deal with this if it
+ * cares, or it must abort after initialization reports the difference.
+ */
+#define IPATH_USER_SWMINOR 2
+
+#define IPATH_USER_SWVERSION ((IPATH_USER_SWMAJOR<<16) | IPATH_USER_SWMINOR)
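+
+/*
+ * The driver compares (spu_userversion >> 16) against IPATH_USER_SWMAJOR
+ * and (spu_userversion & 0xffff) against IPATH_USER_SWMINOR in
+ * ipath_do_user_init(); only a major version mismatch is fatal, a minor
+ * mismatch is merely logged.
+ */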
+
+#define IPATH_KERN_TYPE 0
+
+/* Similarly, this is the kernel version going back to the user. It's slightly
+ * different, in that we want to tell if the driver was built as part of a
+ * PathScale release, or from the driver from OpenIB, kernel.org, or a
+ * standard distribution, for support reasons. The high bit is 0 for
+ * non-PathScale, and 1 for PathScale-built/supplied.
+ *
+ * It's returned by the driver to the user code during initialization
+ * in the spi_sw_version field of ipath_base_info, so the user code can
+ * in turn check for compatibility with the kernel.
+*/
+#define IPATH_KERN_SWVERSION ((IPATH_KERN_TYPE<<31) | IPATH_USER_SWVERSION)
+
+/*
+ * This structure is passed to ipath_userinit() to tell the driver where
+ * user code buffers are, sizes, etc.
+ */
+struct ipath_user_info {
+ /*
+ * version of user software, to detect compatibility issues.
+ * Should be set to IPATH_USER_SWVERSION.
+ */
+ uint32_t spu_userversion;
+
+ /* desired number of receive header queue entries */
+ uint32_t spu_rcvhdrcnt;
+
+ /*
+ * Leave this much unused space at the start of
+ * each eager buffer for software use. Similar in effect to
+ * setting K_Offset to this value. Needs to be 'small', on the
+ * order of one or two cachelines.
+ */
+ uint32_t spu_egrskip;
+
+ /*
+ * number of words in KD protocol header
+ * This tells InfiniPath how many words to copy to rcvhdrq. If 0,
+ * kernel uses a default. Once set, attempts to set any other value
+ * are an error (EAGAIN) until driver is reloaded.
+ */
+ uint32_t spu_rcvhdrsize;
+
+ /*
+ * cache line aligned (64 byte) user address to
+ * which the rcvhdrtail register will be written by infinipath
+ * whenever it changes, so that no chip registers are read in
+ * the performance path.
+ */
+ uint64_t spu_rcvhdraddr;
+
+ /*
+ * ioctl cmd includes struct size, so pad out,
+ * and adjust down as new fields are added to keep size constant
+ */
+ uint32_t __spu_pad[6];
+} __attribute__ ((aligned(8)));
+
+struct ipath_iovec {
+ /* Pointer to data, but same size 32 and 64 bit */
+ uint64_t iov_base;
+
+ /*
+ * Length of data; don't need 64 bits, but want
+ * ipath_sendpkt to remain same size as before 32 bit changes, so...
+ */
+ uint64_t iov_len;
+};
+
+/*
+ * Describes a single packet for send. Each packet can have one or more
+ * buffers, but the total length (exclusive of IB headers) must be less
+ * than the MTU, and if using the PIO method, the entire packet length,
+ * including IB headers, must be less than the ipath_piosize value (words).
+ * Use of this necessitates including sys/uio.h
+ */
+struct ipath_sendpkt {
+ uint32_t sps_flags; /* flags for packet (TBD) */
+ uint32_t sps_cnt; /* number of entries to use in sps_iov */
+ /* array of iov's describing packet. TEMPORARY */
+ struct ipath_iovec sps_iov[4];
+};
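+
+/*
+ * Hypothetical user-space sketch (not part of the driver) of filling this
+ * structure for a single-fragment SMA send; ipath_send_smapkt() reads the
+ * unit from sps_flags, and each iov_len must be a dword multiple:
+ *
+ *	struct ipath_sendpkt pkt;
+ *	memset(&pkt, 0, sizeof pkt);
+ *	pkt.sps_flags = unit;
+ *	pkt.sps_cnt = 1;
+ *	pkt.sps_iov[0].iov_base = (uint64_t) (unsigned long) buf;
+ *	pkt.sps_iov[0].iov_len = len;
+ *	ioctl(fd, IPATH_SEND_SMA_PKT, &pkt);
+ */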
+
+struct _tidupd { /* used only in inlined function for ioctl. */
+ uint32_t tidcnt;
+ uint32_t tid__unused; /* make structure same size in 32 and 64 bit */
+ uint64_t tidvaddr; /* virtual address of first page in transfer */
+ /* pointer (same size 32/64 bit) to uint16_t tid array */
+ uint64_t tidlist;
+
+ /*
+ * pointer (same size 32/64 bit) to bitmap of TIDs used
+ * for this call; checked for being large enough at open
+ */
+ uint64_t tidmap;
+};
+
+struct ipath_setguid { /* set GUID for interface */
+ uint64_t sguid; /* in network order */
+ uint64_t sunit; /* unit number of interface */
+};
+
+/*
+ * Structure used to send data to and receive data from a diags ioctl.
+ *
+ * NOTE: For HT reads and writes, we only support byte, word (16bits) and
+ * dword (32bits). All other sizes for HT are invalid.
+ */
+struct ipath_diag_info {
+ uint64_t _base_offset; /* register to start reading from */
+ uint64_t _num_bytes; /* number of bytes to read or write */
+ /*
+	 * Address in user space.
+	 * For reads, this is the address to store the read result(s).
+	 * For writes, it is the address to get the write data from.
+	 * This memory must be valid in user space.
+ */
+ uint64_t _uaddress;
+ uint64_t _unit; /* Unit ID of chip we are accessing. */
+ uint64_t _pad[15];
+};
+
+/*
+ * Data layout in I2C flash (for GUID, etc.)
+ * All fields are little-endian binary unless otherwise stated
+ */
+#define IPATH_FLASH_VERSION 1
+struct ipath_flash {
+ uint8_t if_fversion; /* flash layout version (IPATH_FLASH_VERSION) */
+ uint8_t if_csum; /* checksum protecting if_length bytes */
+ /*
+ * valid length (in use, protected by if_csum), including if_fversion
+	 * and if_csum themselves
+ */
+ uint8_t if_length;
+ uint8_t if_guid[8]; /* the GUID, in network order */
+ /* number of GUIDs to use, starting from if_guid */
+ uint8_t if_numguid;
+ char if_serial[12]; /* the board serial number, in ASCII */
+ char if_mfgdate[8]; /* board mfg date (YYYYMMDD ASCII) */
+ /* last board rework/test date (YYYYMMDD ASCII) */
+ char if_testdate[8];
+ uint8_t if_errcntp[4]; /* logging of error counts, TBD */
+ /* powered on hours, updated at driver unload */
+ uint8_t if_powerhour[2];
+ char if_comment[32]; /* ASCII free-form comment field */
+ uint8_t if_future[50]; /* 78 bytes used, min flash size is 128 bytes */
+};
+
+uint8_t ipath_flash_csum(struct ipath_flash *, int);
+
+/*
+ * These are the counters implemented in the chip, and are listed in order.
+ * They are returned in this order by the IPATH_GETCOUNTERS ioctl
+ */
+struct infinipath_counters {
+ unsigned long long LBIntCnt;
+ unsigned long long LBFlowStallCnt;
+ unsigned long long Reserved1;
+ unsigned long long TxUnsupVLErrCnt;
+ unsigned long long TxDataPktCnt;
+ unsigned long long TxFlowPktCnt;
+ unsigned long long TxDwordCnt;
+ unsigned long long TxLenErrCnt;
+ unsigned long long TxMaxMinLenErrCnt;
+ unsigned long long TxUnderrunCnt;
+ unsigned long long TxFlowStallCnt;
+ unsigned long long TxDroppedPktCnt;
+ unsigned long long RxDroppedPktCnt;
+ unsigned long long RxDataPktCnt;
+ unsigned long long RxFlowPktCnt;
+ unsigned long long RxDwordCnt;
+ unsigned long long RxLenErrCnt;
+ unsigned long long RxMaxMinLenErrCnt;
+ unsigned long long RxICRCErrCnt;
+ unsigned long long RxVCRCErrCnt;
+ unsigned long long RxFlowCtrlErrCnt;
+ unsigned long long RxBadFormatCnt;
+ unsigned long long RxLinkProblemCnt;
+ unsigned long long RxEBPCnt;
+ unsigned long long RxLPCRCErrCnt;
+ unsigned long long RxBufOvflCnt;
+ unsigned long long RxTIDFullErrCnt;
+ unsigned long long RxTIDValidErrCnt;
+ unsigned long long RxPKeyMismatchCnt;
+ unsigned long long RxP0HdrEgrOvflCnt;
+ unsigned long long RxP1HdrEgrOvflCnt;
+ unsigned long long RxP2HdrEgrOvflCnt;
+ unsigned long long RxP3HdrEgrOvflCnt;
+ unsigned long long RxP4HdrEgrOvflCnt;
+ unsigned long long RxP5HdrEgrOvflCnt;
+ unsigned long long RxP6HdrEgrOvflCnt;
+ unsigned long long RxP7HdrEgrOvflCnt;
+ unsigned long long RxP8HdrEgrOvflCnt;
+ unsigned long long Reserved6;
+ unsigned long long Reserved7;
+ unsigned long long IBStatusChangeCnt;
+ unsigned long long IBLinkErrRecoveryCnt;
+ unsigned long long IBLinkDownedCnt;
+ unsigned long long IBSymbolErrCnt;
+};
+
+/*
+ * The next set of defines are for packet headers, and chip register
+ * and memory bits that are visible to and/or used by user-mode software
+ * The other bits that are used only by the driver or diags are in
+ * ipath_registers.h
+ */
+
+/* RcvHdrFlags bits */
+#define INFINIPATH_RHF_LENGTH_MASK 0x7FF
+#define INFINIPATH_RHF_LENGTH_SHIFT 0
+#define INFINIPATH_RHF_RCVTYPE_MASK 0x7
+#define INFINIPATH_RHF_RCVTYPE_SHIFT 11
+#define INFINIPATH_RHF_EGRINDEX_MASK 0x7FF
+#define INFINIPATH_RHF_EGRINDEX_SHIFT 16
+#define INFINIPATH_RHF_H_ICRCERR 0x80000000
+#define INFINIPATH_RHF_H_VCRCERR 0x40000000
+#define INFINIPATH_RHF_H_PARITYERR 0x20000000
+#define INFINIPATH_RHF_H_LENERR 0x10000000
+#define INFINIPATH_RHF_H_MTUERR 0x08000000
+#define INFINIPATH_RHF_H_IHDRERR 0x04000000
+#define INFINIPATH_RHF_H_TIDERR 0x02000000
+#define INFINIPATH_RHF_H_MKERR 0x01000000
+#define INFINIPATH_RHF_H_IBERR 0x00800000
+#define INFINIPATH_RHF_L_SWA 0x00008000
+#define INFINIPATH_RHF_L_SWB 0x00004000
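+
+/*
+ * The mask/shift pairs above are used in the usual way; e.g., for a
+ * hypothetical receive header flags word rhf, the length field is
+ *
+ *	(rhf >> INFINIPATH_RHF_LENGTH_SHIFT) & INFINIPATH_RHF_LENGTH_MASK
+ *
+ * and the other fields are extracted the same way.
+ */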
+
+/* infinipath header fields */
+#define INFINIPATH_I_VERS_MASK 0xF
+#define INFINIPATH_I_VERS_SHIFT 28
+#define INFINIPATH_I_PORT_MASK 0xF
+#define INFINIPATH_I_PORT_SHIFT 24
+#define INFINIPATH_I_TID_MASK 0x7FF
+#define INFINIPATH_I_TID_SHIFT 13
+#define INFINIPATH_I_OFFSET_MASK 0x1FFF
+#define INFINIPATH_I_OFFSET_SHIFT 0
+
+/* K_PktFlags bits */
+#define INFINIPATH_KPF_INTR 0x1
+
+/* SendPIO per-buffer control */
+#define INFINIPATH_SP_LENGTHP1_MASK 0x3FF
+#define INFINIPATH_SP_LENGTHP1_SHIFT 0
+#define INFINIPATH_SP_INTR 0x80000000
+#define INFINIPATH_SP_TEST 0x40000000
+#define INFINIPATH_SP_TESTEBP 0x20000000
+
+/* SendPIOAvail bits */
+#define INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT 1
+#define INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT 0
+
+#endif /* _IPATH_COMMON_H */
diff -r a3a00f637da6 -r 2d9a3f27a10c drivers/infiniband/hw/ipath/ipath_kernel.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,697 @@
+#ifndef _IPATH_KERNEL_H
+#define _IPATH_KERNEL_H
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+
+
+/*
+ * This header file is the base header file for infinipath kernel code
+ * ipath_user.h serves a similar purpose for user code.
+ */
+
+#include "ipath_common.h"
+#include "ipath_kdebug.h"
+#include "ipath_registers.h"
+#include <linux/timex.h>
+#include <asm/io.h>
+
+/* only s/w major version of InfiniPath we can handle */
+#define IPATH_CHIP_VERS_MAJ 2U
+
+#define IPATH_CHIP_VERS_MIN 0U /* don't care about this except printing */
+
+extern struct infinipath_stats ipath_stats; /* temporary, maybe always */
+
+/* only s/w version of chip we can handle for now */
+#define IPATH_CHIP_SWVERSION IPATH_CHIP_VERS_MAJ
+
+struct ipath_portdata {
+ /* minor number of devices, for ipath_type use */
+ unsigned port_unit;
+ /* array of struct page pointers */
+ struct page **port_rcvegrbuf_pages;
+ /* array of virtual addresses (from above) */
+ void **port_rcvegrbuf_virt;
+ void *port_rcvhdrq; /* rcvhdrq base, needs mmap before useful */
+ /* kernel virtual address where hdrqtail is updated */
+ uint64_t *port_rcvhdrtail_kvaddr;
+ struct page *port_rcvhdrtail_pagep; /* page * used for uaddr */
+ /*
+ * temp buffer for expected send setup, allocated at open, instead
+ * of each setup call
+ */
+ void *port_tid_pg_list;
+ wait_queue_head_t port_wait; /* when waiting for rcv or pioavail */
+ /*
+	 * rcvegr bufs base, physical; must fit in 44 bits so that
+	 * mmap64 from 32 bit programs works (44 bit addresses)
+ */
+ unsigned long port_rcvegr_phys;
+ /* for mmap of hdrq, must fit in 44 bits */
+ unsigned long port_rcvhdrq_phys;
+ /*
+ * the actual user address that we locked, so we can
+ * unlock it at close
+ */
+ unsigned long port_rcvhdrtail_uaddr;
+ /*
+ * number of opens on this instance (0 or 1; ignoring forks, dup,
+ * etc. for now)
+ */
+ int port_cnt;
+ /*
+ * how much space to leave at start of eager TID entries for protocol
+ * use, on each TID
+ */
+ unsigned port_egrskip;
+ unsigned port_port; /* instead of calculating it */
+ uint32_t port_piobufs; /* chip offset of PIO buffers for this port */
+ /* how many alloc_pages() chunks in port_rcvegrbuf_pages */
+ uint32_t port_rcvegrbuf_chunks;
+ uint32_t port_rcvegrbufs_perchunk; /* how many egrbufs per chunk */
+ /* order used with port_rcvegrbuf_pages */
+ uint32_t port_rcvegrbuf_order;
+ uint32_t port_rcvhdrq_order; /* rcvhdrq order (for free_pages) */
+ /* next expected TID to check when looking for free */
+ uint32_t port_tidcursor;
+	/* port state flags (IPATH_PORT_WAITING_*, etc.) */
+ uint32_t port_flag;
+ /* WAIT_RCV that timed out, no interrupt */
+ uint32_t port_rcvwait_to;
+ /* WAIT_PIO that timed out, no interrupt */
+ uint32_t port_piowait_to;
+ uint32_t port_rcvnowait; /* WAIT_RCV already happened, no wait */
+ uint32_t port_pionowait; /* WAIT_PIO already happened, no wait */
+ uint32_t port_hdrqfull; /* total number of rcvhdrqfull errors */
+ pid_t port_pid; /* pid of process using this port */
+ /* same size as task_struct .comm[], but no define */
+ char port_comm[16];
+ uint16_t port_pkeys[4]; /* pkeys set by this use of this port */
+};
+
+struct sk_buff;
+
+/*
+ * control information for layered drivers
+ * This is used only as part of devdata, via ipath_layer.
+ */
+struct _ipath_layer {
+ int (*l_intr) (const ipath_type, uint32_t);
+ int (*l_rcv) (const ipath_type, void *, struct sk_buff *);
+ int (*l_rcv_lid) (const ipath_type, void *);
+ uint16_t l_rcv_opcode;
+ uint16_t l_rcv_lid_opcode;
+};
+
+/* Verbs layer interface */
+struct _verbs_layer {
+ int (*l_piobufavail) (const ipath_type);
+ void (*l_rcv) (const ipath_type, void *, void *, uint32_t);
+ void (*l_timer_cb) (const ipath_type);
+ struct timer_list l_timer;
+ unsigned l_flags;
+};
+
+/*
+ * These are the fields that only exist for port 0, not per port, so
+ * they aren't in ipath_portdata
+ */
+struct ipath_devdata {
+ /* driver data structures */
+	/*
+	 * mem-mapped pointer to base of chip regs; accesses should always
+	 * use readl/readq and writel/writeq, or memcpy32() for PIO buffers
+	 */
+ uint64_t __iomem *ipath_kregbase;
+ /* end of mem-mapped chip space; range checking */
+ uint64_t __iomem *ipath_kregend;
+ /* physical address of chip for io_remap, etc. */
+ unsigned long ipath_physaddr;
+	/* base of memory allocated for ipath_kregbase, kept for freeing */
+ uint64_t *ipath_kregalloc;
+ /*
+ * version of kregbase that doesn't have high bits set (for 32 bit
+ * programs, so mmap64 44 bit works)
+ */
+ uint64_t __iomem *ipath_kregvirt;
+ struct ipath_portdata **ipath_pd; /* ipath_cfgports pointers */
+ /* sk_buffs used by port 0 eager receive queue */
+ struct sk_buff **ipath_port0_skbs;
+ void __iomem *ipath_piobase; /* kvirt address of 1st pio buffer */
+
+ /*
+	 * virtual address where the port0 rcvhdrqtail is updated by the chip
+	 * via DMA; volatile because we want to be sure the compiler always
+	 * makes a memory reference when we dereference it.
+ */
+ volatile uint64_t *ipath_hdrqtailptr;
+ /*
+	 * points to the area where the PIOavail registers will be DMA'ed. Has
+	 * to be on a page of its own, because the page will be mapped into
+	 * user program space. Updated by the chip via DMA, treated as readonly
+	 * by software. Volatile because we want to be sure the compiler always
+	 * makes a memory reference when we dereference it.
+ */
+ volatile uint64_t *ipath_pioavailregs_dma;
+
+ /* original address for kfree */
+ volatile uint64_t *__ipath_pioavailregs_base;
+ /* physical address where updates occur */
+ unsigned long ipath_pioavailregs_phys;
+ struct _ipath_layer ipath_layer;
+ struct _verbs_layer verbs_layer;
+ /* total dwords sent (summed from counter) */
+ uint64_t ipath_sword;
+ /* total dwords received (summed from counter) */
+ uint64_t ipath_rword;
+ /* total packets sent (summed from counter) */
+ uint64_t ipath_spkts;
+ /* total packets received (summed from counter) */
+ uint64_t ipath_rpkts;
+ /* to make the receive interrupt failsafe */
+ uint64_t ipath_lastqtail;
+ uint64_t _ipath_status; /* ipath_statusp initially points to this. */
+ uint64_t ipath_guid; /* GUID for this interface, in network order */
+ /*
+	 * aggregate of error bits reported since
+ * last cleared, for limiting of error reporting
+ */
+ uint64_t ipath_lasterror;
+ /*
+	 * aggregate of error bits reported
+ * since last cleared, for limiting of hwerror reporting
+ */
+ uint64_t ipath_lasthwerror;
+ /*
+ * errors masked because they occur too fast,
+ * also includes errors that are always ignored (ipath_ignorederrs)
+ */
+ uint64_t ipath_maskederrs;
+ /* time at which to re-enable maskederrs */
+ cycles_t ipath_unmasktime;
+ /*
+ * errors always ignored (masked), at least
+ * for a given chip/device, because they are wrong or not useful
+ */
+ uint64_t ipath_ignorederrs;
+ /* count of egrfull errors, combined for all ports */
+ uint64_t ipath_last_tidfull;
+ uint64_t ipath_lastport0rcv_cnt; /* for ipath_qcheck() */
+
+ uint32_t ipath_kregsize; /* size of memory at ipath_kregbase */
+ /* number of registers used for pioavail */
+ uint32_t ipath_pioavregs;
+ uint32_t ipath_flags; /* IPATH_POLL, etc. */
+ /* ipath_flags sma is waiting for */
+ uint32_t ipath_sma_state_wanted;
+	/* last buffer for user use; the first buffer for kernel use is at this index */
+ uint32_t ipath_lastport_piobuf;
+ uint32_t pci_registered; /* driver is a registered pci device */
+ uint32_t ipath_stats_timer_active; /* is a stats timer active */
+ /* dwords sent read from infinipath counter */
+ uint32_t ipath_lastsword;
+ /* dwords received read from infinipath counter */
+ uint32_t ipath_lastrword;
+ /* sent packets read from infinipath counter */
+ uint32_t ipath_lastspkts;
+ /* received packets read from infinipath counter */
+ uint32_t ipath_lastrpkts;
+ uint32_t ipath_pbufsport; /* pio bufs allocated per port */
+ /*
+ * number of ports configured as max; zero is
+ * set to number chip supports, less gives more pio bufs/port, etc.
+ */
+ uint32_t ipath_cfgports;
+ /* our idea of the port0 rcvhdrq head offset */
+ uint32_t ipath_port0head;
+ uint32_t ipath_p0_hdrqfull; /* count of port 0 hdrqfull errors */
+
+ /*
+ * (*cfgports) used to suppress multiple instances of same port
+ * staying stuck at same point
+ */
+ uint32_t *ipath_lastrcvhdrqtails;
+ /*
+ * (*cfgports) used to suppress multiple instances of same port
+ * staying stuck at same point
+ */
+ uint32_t *ipath_lastegrheads;
+ /*
+ * index of last piobuffer we used. Speeds up searching, by starting
+ * at this point. Doesn't matter if multiple cpu's use and update,
+ * last updater is only write that matters. Whenever it wraps,
+ * we update shadow copies. Need a copy per device when we get to
+ * multiple devices
+ */
+ uint32_t ipath_lastpioindex;
+ uint32_t ipath_freezelen; /* max length of freezemsg */
+ uint32_t ipath_consec_nopiobuf; /* consecutive times we wanted a PIO buffer
+ * but were unable to get one */
+ uint32_t ipath_upd_pio_shadow; /* hint that we should update
+ * ipath_pioavailshadow before looking for a PIO buffer */
+ uint32_t ipath_nosma_bufs; /* sequential tries for SMA send and no bufs */
+ uint32_t ipath_nosma_secs; /* duration (seconds) ipath_nosma_bufs set */
+ /* HT/PCI Vendor ID (here for NodeInfo) */
+ uint16_t ipath_vendorid;
+ /* HT/PCI Device ID (here for NodeInfo) */
+ uint16_t ipath_deviceid;
+ /* offset in HT config space of slave/primary interface block */
+ uint8_t ipath_ht_slave_off;
+	int ipath_mtrr;	/* MTRR registration handle, for enabling write-combining */
+ /* ref count of how many users set each pkey */
+ atomic_t ipath_pkeyrefs[4];
+ /* shadow copy of all exptids physaddr; used only by funcsim */
+ uint64_t *ipath_tidsimshadow;
+ /* shadow copy of struct page *'s for exp tid pages */
+ struct page **ipath_pageshadow;
+ /*
+ * IPATH_STATUS_*
+ * this address is mapped readonly into user processes so they can
+ * get status cheaply, whenever they want.
+ */
+ uint64_t *ipath_statusp;
+ char *ipath_freezemsg; /* freeze msg if hw error put chip in freeze */
+ struct pci_dev *pcidev; /* pci access data structure */
+ /* timer used to prevent stats overflow, error throttling, etc. */
+ struct timer_list ipath_stats_timer;
+ /* only allow one interrupt at a time. */
+ unsigned long ipath_rcv_pending;
+
+ /*
+	 * shadow copies of registers; size indicates read access size.
+	 * Most of them are readonly, but some are write-only registers, where
+	 * we manipulate the bits in the shadow copy, and then write the shadow
+	 * copy to infinipath.
+	 * We deliberately make most of these 32 bits, since they have a
+	 * restricted range, and for any that we read we want to generate
+	 * 32 bit accesses, since the Opteron will generate 2 separate 32 bit
+	 * HT transactions for a 64 bit read, and we want to avoid unnecessary
+	 * HT transactions.
+ */
+
+ /* This is the 64 bit group */
+ /*
+ * shadow of pioavail, check to be sure it's large enough at
+ * init time.
+ */
+ uint64_t ipath_pioavailshadow[8];
+ uint64_t ipath_gpio_out; /* shadow of kr_gpio_out, for rmw ops */
+ /* kr_revision value (also see ipath_majrev) */
+ uint64_t ipath_revision;
+ /* shadow of ibcctrl, for interrupt handling of link changes, etc. */
+ uint64_t ipath_ibcctrl;
+ /*
+ * last ibcstatus, to suppress "duplicate" status change messages,
+ * mostly from 2 to 3
+ */
+ uint64_t ipath_lastibcstat;
+ /* mask of hardware errors that are enabled */
+ uint64_t ipath_hwerrmask;
+ uint64_t ipath_extctrl; /* shadow the gpio output contents */
+
+ /* these are the "32 bit" regs */
+ /*
+ * number of GUIDs in the flash for this interface; may need some
+ * rethinking for setting on other ifaces
+ */
+ uint32_t ipath_nguid;
+ uint32_t ipath_rcvctrl; /* shadow kr_rcvctrl */
+ uint32_t ipath_sendctrl; /* shadow kr_sendctrl */
+ uint32_t ipath_rcvhdrcnt; /* value we put in kr_rcvhdrcnt */
+ uint32_t ipath_rcvhdrsize; /* value we put in kr_rcvhdrsize */
+ uint32_t ipath_rcvhdrentsize; /* value we put in kr_rcvhdrentsize */
+ /* byte offset of last entry in rcvhdrq */
+ uint32_t ipath_hdrqlast;
+ uint32_t ipath_portcnt; /* kr_portcnt value */
+ uint32_t ipath_palign; /* kr_pagealign value */
+ uint32_t ipath_piobcnt; /* kr_sendpiobufcnt value */
+ uint32_t ipath_piobufbase; /* kr_sendpiobufbase value */
+ uint32_t ipath_piosize; /* kr_sendpiosize */
+ uint32_t ipath_rcvegrbase; /* kr_rcvegrbase value */
+ uint32_t ipath_rcvegrcnt; /* kr_rcvegrcnt value */
+ uint32_t ipath_rcvtidbase; /* kr_rcvtidbase value */
+ uint32_t ipath_rcvtidcnt; /* kr_rcvtidcnt value */
+ uint32_t ipath_sregbase; /* kr_sendregbase */
+ uint32_t ipath_uregbase; /* kr_userregbase */
+ uint32_t ipath_cregbase; /* kr_counterregbase */
+ uint32_t ipath_control; /* shadow the control register contents */
+ uint32_t ipath_pcirev; /* PCI revision register (HTC rev on FPGA) */
+
+ uint32_t ipath_ibmtu; /* The MTU programmed for this unit */
+ /*
+	 * The max size IB packet, including IB headers, that we can send.
+ * Starts same as ipath_piosize, but is affected when ibmtu is
+ * changed, or by size of eager buffers
+ */
+ uint32_t ipath_ibmaxlen;
+ /*
+ * ibmaxlen at init time, limited by chip and by receive buffer size.
+ * Not changed after init.
+ */
+ uint32_t ipath_init_ibmaxlen;
+ /* size we allocate for each rcvegrbuffer */
+ uint32_t ipath_rcvegrbufsize;
+ uint32_t ipath_htwidth; /* width (2,4,8,16,32) from HT config reg */
+ uint32_t ipath_htspeed; /* HT speed (200,400,800,1000) from HT config */
+ /* bitmap of ports waiting for PIO avail intr */
+ uint32_t ipath_portpiowait;
+ /*
+	 * number of sequential ibcstatus changes for polling active/quiet
+ * (i.e., link not coming up).
+ */
+ uint32_t ipath_ibpollcnt;
+ uint16_t ipath_mlid; /* MLID programmed for this instance */
+ uint16_t ipath_lid; /* LID programmed for this instance */
+ /* list of pkeys programmed; 0 means not set */
+ uint16_t ipath_pkeys[4];
+ uint8_t ipath_serial[12]; /* ASCII serial number, from flash */
+ uint8_t ipath_majrev; /* chip major rev, from ipath_revision */
+ uint8_t ipath_minrev; /* chip minor rev, from ipath_revision */
+ uint8_t ipath_boardrev; /* board rev, from ipath_revision */
+ uint8_t ipath_unit; /* Unit number for this chip */
+};
+
+/*
+ * A segment is a linear region of low physical memory.
+ * XXX Maybe we should use phys addr here and kmap()/kunmap()
+ * Used by the verbs layer.
+ */
+struct ipath_seg {
+ void *vaddr;
+ size_t length;
+};
+
+/* The number of ipath_segs that fit in a page. */
+#define IPATH_SEGSZ (PAGE_SIZE / sizeof (struct ipath_seg))
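+
+/*
+ * For example, with a 4 KB PAGE_SIZE and a 16 byte struct ipath_seg
+ * (8 byte pointer plus 8 byte size_t on 64 bit), IPATH_SEGSZ works out
+ * to 256; the exact value is architecture dependent.
+ */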
+
+struct ipath_segarray {
+ struct ipath_seg segs[IPATH_SEGSZ];
+};
+
+/*
+ * Used by the verbs layer.
+ */
+struct ipath_mregion {
+ uint64_t user_base; /* User's address for this region */
+ uint64_t iova; /* IB start address of this region */
+ size_t length;
+ uint32_t lkey;
+ uint32_t offset; /* offset (bytes) to start of region */
+ int access_flags;
+ uint32_t max_segs; /* number of ipath_segs in all the arrays */
+ uint32_t mapsz; /* size of the map array */
+ struct ipath_segarray *map[0]; /* the segments */
+};
+
+/*
+ * These keep track of the copy progress within a memory region.
+ * Used by the verbs layer.
+ */
+struct ipath_sge {
+ struct ipath_mregion *mr;
+ void *vaddr; /* current pointer into the segment */
+ uint32_t sge_length; /* length of the SGE */
+ uint32_t length; /* remaining length of the segment */
+ uint16_t m; /* current index: mr->map[m] */
+ uint16_t n; /* current index: mr->map[m]->segs[n] */
+};
+
+struct ipath_sge_state {
+ struct ipath_sge *sg_list; /* next SGE to be used if any */
+ struct ipath_sge sge; /* progress state for the current SGE */
+ uint8_t num_sge;
+};
+
+extern struct ipath_devdata devdata[];
+#define IPATH_UNIT(p) ((p)-devdata)
+extern const uint32_t infinipath_max; /* number of units (chips) supported */
+extern const char *ipath_minor_names[];
+
+extern int ipath_diags_enabled; /* is diags mode enabled? */
+
+/* clean up any per-chip chip-specific stuff */
+void ipath_chip_cleanup(struct ipath_devdata *);
+void ipath_chip_done(void); /* clean up any chip type-specific stuff */
+void ipath_handle_hwerrors(const ipath_type, char *, int);
+int ipath_validate_rev(struct ipath_devdata *);
+void ipath_clear_init_hwerrs(const ipath_type);
+
+/*
+ * This is here to simplify compatibility with source that supports
+ * multiple chip types
+ */
+void ipath_ht_get_boardname(const ipath_type t, char *name, size_t namelen);
+
+/* these are primarily for SMA, but are also used by diags */
+int ipath_send_smapkt(struct ipath_sendpkt __user *);
+
+int ipath_wait_linkstate(const ipath_type, uint32_t, int);
+void ipath_down_link(const ipath_type);
+void ipath_set_ib_lstate(const ipath_type, int);
+void ipath_kreceive(const ipath_type);
+int ipath_setrcvhdrsize(const ipath_type, unsigned);
+
+/* for use in system calls, where we want to know device type, etc. */
+#define port_fp(fp) \
+	(((fp)->private_data > (void *) 255UL) ? \
+	 ((struct ipath_portdata *) (fp)->private_data) : NULL)
+
+/*
+ * values for ipath_flags
+ */
+#define IPATH_INITTED 0x2 /* The chip is up and initted */
+#define IPATH_RCVHDRSZ_SET 0x4 /* set if any user code has set kr_rcvhdrsize */
+/* The chip is present and valid for accesses */
+#define IPATH_PRESENT 0x8
+/* HT link0 is only 8 bits wide, ignore upper byte crc errors, etc. */
+#define IPATH_8BIT_IN_HT0 0x10
+/* HT link1 is only 8 bits wide, ignore upper byte crc errors, etc. */
+#define IPATH_8BIT_IN_HT1 0x20
+/* The link is down (or not yet up 0x11 or earlier) */
+#define IPATH_LINKDOWN 0x40
+#define IPATH_LINKINIT 0x80 /* The link level is up (0x11) */
+/* The link is in the armed (0x21) state */
+#define IPATH_LINKARMED 0x100
+/* The link is in the active (0x31) state */
+#define IPATH_LINKACTIVE 0x200
+/* The link was taken down, but no interrupt yet */
+#define IPATH_LINKUNK 0x400
+/* link being moved to armed (0x21) state */
+#define IPATH_LINK_TOARMED 0x800
+/* link being moved to active (0x31) state */
+#define IPATH_LINK_TOACTIVE 0x1000
+/* linkinit cmd is SLEEP, move to POLL */
+#define IPATH_LINK_SLEEPING 0x2000
+/* no IB cable, or no device on IB cable */
+#define IPATH_NOCABLE 0x4000
+/* Supports port zero per packet receive interrupts via GPIO */
+#define IPATH_GPIO_INTR 0x8000
+
+/* portdata flag values */
+#define IPATH_PORT_WAITING_RCV 0x4 /* waiting for a packet to arrive */
+/* waiting for a PIO buffer to be available */
+#define IPATH_PORT_WAITING_PIO 0x8
+
+int ipath_init_chip(const ipath_type);
+/* free up any allocated data at closes */
+void ipath_free_data(struct ipath_portdata *dd);
+void ipath_init_picotime(void); /* init cycles to picosecs conversion */
+int ipath_bringup_serdes(const ipath_type);
+int ipath_waitfor_mdio_cmdready(const ipath_type);
+int ipath_waitfor_complete(const ipath_type, ipath_kreg, uint64_t, uint64_t *);
+void ipath_quiet_serdes(const ipath_type);
+void ipath_get_boardname(uint8_t, char *, size_t);
+uint32_t __iomem *ipath_getpiobuf(int, uint32_t *);
+int ipath_bufavail(int);
+int ipath_rd_eeprom(const ipath_type port_unit,
+ struct ipath_eeprom_req __user *);
+uint64_t ipath_snap_cntr(const ipath_type, ipath_creg);
+
+/*
+ * these should be somewhat dynamic someday, although they are fixed
+ * for all users of the device on any given load.
+ */
+/* (words) room for all IB headers and KD proto header */
+#define IPATH_RCVHDRENTSIZE 16
+/*
+ * 64K, which is about all you can hope to get contiguous. API allows
+ * users to request a size, for now I'm ignoring that.
+ */
+#define IPATH_RCVHDRCNT 1024
+
+/*
+ * number of words in KD protocol header if not set by ipath_userinit();
+ * this uses the full 64 bytes of rcvhdrentry
+ */
+#define IPATH_DFLT_RCVHDRSIZE 9
+
+#define IPATH_MDIO_CMD_WRITE 1
+#define IPATH_MDIO_CMD_READ 2
+#define IPATH_MDIO_CLD_DIV 25 /* to get 2.5 MHz mdio clock */
+#define IPATH_MDIO_CMDVALID 0x40000000 /* bit 30 */
+#define IPATH_MDIO_DATAVALID 0x80000000 /* bit 31 */
+#define IPATH_MDIO_CTRL_STD 0x0
+
+#define IPATH_MDIO_REQ(cmd,dev,reg,data) ( (((uint64_t)IPATH_MDIO_CLD_DIV) << 32) | \
+ ((cmd) << 26) | ((dev)<<21) | ((reg) << 16) | ((data) & 0xFFFF))
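+
+/*
+ * Illustrative use (dev is a hypothetical MDIO device address): a read
+ * of one of the control registers below could be requested with
+ *
+ *	IPATH_MDIO_REQ(IPATH_MDIO_CMD_READ, dev, IPATH_MDIO_CTRL_8355_REG_1, 0)
+ *
+ * where the data argument is presumably ignored for reads.
+ */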
+
+#define IPATH_MDIO_CTRL_XGXS_REG_8 0x8 /* signal and fifo status, in bank 31 */
+
+/* controls loopback, redundancy */
+#define IPATH_MDIO_CTRL_8355_REG_1 0x10
+#define IPATH_MDIO_CTRL_8355_REG_2 0x11 /* premph, encdec, etc. */
+#define IPATH_MDIO_CTRL_8355_REG_6 0x15 /* Kchars, etc. */
+#define IPATH_MDIO_CTRL_8355_REG_9 0x18
+#define IPATH_MDIO_CTRL_8355_REG_10 0x1D
+
+/*
+ * ipath_get_upages() is used to pin an address range (if not already
+ * pinned), and optionally return the list of physical addresses.
+ * ipath_putpages() does the obvious, and ipath_upages_cleanup() cleans up
+ * all remaining private memory, used at driver unload.
+ * ipath_get_upages_nocopy() is similar to ipath_get_upages(), but pins only
+ * 1 page, and marks the vm so the page isn't taken away on a fork.
+ */
+int ipath_get_upages(unsigned long, size_t, struct page **);
+int ipath_get_upages_nocopy(unsigned long, struct page **);
+void ipath_putpages(size_t, struct page **);
+void ipath_upages_cleanup(struct ipath_portdata *);
+int ipath_eeprom_read(const ipath_type, uint8_t, void *, int);
+int ipath_eeprom_write(const ipath_type, uint8_t, void *, int);
+
+/* these are used for the registers that vary with port */
+void ipath_kput_kreg_port(const ipath_type, ipath_kreg, unsigned, uint64_t);
+uint64_t ipath_kget_kreg64_port(const ipath_type, ipath_kreg, unsigned);
+
+/*
+ * we could have a single register get/put routine, that takes a group
+ * type, but this is somewhat clearer and cleaner. It also gives us some
+ * error checking. 64 bit register reads should always work, but are
+ * inefficient on opteron (the northbridge always generates 2 separate
+ * HT 32 bit reads), so we use kreg32 wherever possible.
+ * User register and counter register reads are always 32 bit reads, so only
+ * one form of those routines.
+ */
+
+/*
+ * At the moment, none of the s-registers are writable, so no ipath_kput_sreg()
+ * At the moment, none of the c-registers are writable, so no ipath_kput_creg()
+ */
+
+/*
+ * Return the contents of a register that is virtualized to be per port.
+ * Returns 0 if the chip is not present (not distinguishable from
+ * valid contents at runtime; we may add a separate error variable at
+ * some point).
+ * This is normally not used by the kernel, but may be for debugging,
+ * and has a different implementation than user mode, which is why
+ * it's not in _common.h
+ */
+static inline uint32_t ipath_kget_ureg32(const ipath_type stype,
+ ipath_ureg regno, int port)
+{
+ if (!devdata[stype].ipath_kregbase)
+ return 0;
+
+ return readl(regno + (uint64_t __iomem *)
+ (devdata[stype].ipath_uregbase +
+ (char __iomem *) devdata[stype].ipath_kregbase +
+ devdata[stype].ipath_palign * port));
+}
+
+/*
+ * Change the contents of a register that is virtualized to be per port.
+ * Does nothing if the chip is not present.
+ */
+static inline void ipath_kput_ureg(const ipath_type stype, ipath_ureg regno,
+ uint64_t value, int port)
+{
+ uint64_t __iomem *ubase;
+
+ ubase = (uint64_t __iomem *)
+ (devdata[stype].ipath_uregbase +
+ (char __iomem *) devdata[stype].ipath_kregbase +
+ devdata[stype].ipath_palign * port);
+	if (devdata[stype].ipath_kregbase)
+ writeq(value, &ubase[regno]);
+}
+
+static inline uint32_t ipath_kget_kreg32(const ipath_type stype,
+ ipath_kreg regno)
+{
+ if (!devdata[stype].ipath_kregbase)
+ return -1;
+ return readl((uint32_t __iomem *) &devdata[stype].ipath_kregbase[regno]);
+}
+
+static inline uint64_t ipath_kget_kreg64(const ipath_type stype,
+ ipath_kreg regno)
+{
+ if (!devdata[stype].ipath_kregbase)
+ return -1;
+
+ return readq(&devdata[stype].ipath_kregbase[regno]);
+}
+
+static inline void ipath_kput_kreg(const ipath_type stype,
+ ipath_kreg regno, uint64_t value)
+{
+ if (devdata[stype].ipath_kregbase)
+ writeq(value, &devdata[stype].ipath_kregbase[regno]);
+}
+
+static inline uint32_t ipath_kget_creg32(const ipath_type stype,
+	 ipath_creg regno)
+{
+	if (!devdata[stype].ipath_kregbase)
+ return 0;
+ return readl(regno + (uint64_t __iomem *)
+ (devdata[stype].ipath_cregbase +
+ (char __iomem *) devdata[stype].ipath_kregbase));
+}
+
+/*
+ * caddr is the destination chip address (full pointer, not offset),
+ * val is the qword to write there. We only handle a single qword (8 bytes).
+ * This is not used for copies to the PIO buffer, just TID updates, etc.
+ * This function localizes all chip mem (as opposed to register) writes.
+ */
+static inline void ipath_kput_memq(const ipath_type stype,
+ uint64_t __iomem *caddr, uint64_t val)
+{
+ if (devdata[stype].ipath_kregbase)
+ writeq(val, caddr);
+}
+
+
+#endif /* _IPATH_KERNEL_H */
diff -r a3a00f637da6 -r 2d9a3f27a10c drivers/infiniband/hw/ipath/ipath_layer.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_layer.h Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,134 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#ifndef _IPATH_LAYER_H
+#define _IPATH_LAYER_H
+
+/*
+ * This header file is for symbols shared between the infinipath driver
+ * and drivers layered upon it (such as ipath_ether and the verbs layer).
+ */
+
+struct sk_buff;
+struct ipath_sge_state;
+
+struct ipath_layer_counters {
+ uint64_t symbol_error_counter;
+ uint64_t link_error_recovery_counter;
+ uint64_t link_downed_counter;
+ uint64_t port_rcv_errors;
+ uint64_t port_rcv_remphys_errors;
+ uint64_t port_xmit_discards;
+ uint64_t port_xmit_data;
+ uint64_t port_rcv_data;
+ uint64_t port_xmit_packets;
+ uint64_t port_rcv_packets;
+};
+
+int ipath_layer_register(const ipath_type device,
+ int (*l_intr) (const ipath_type, uint32_t),
+ int (*l_rcv) (const ipath_type, void *,
+ struct sk_buff *),
+ uint16_t rcv_opcode,
+ int (*l_rcv_lid) (const ipath_type, void *),
+ uint16_t rcv_lid_opcode);
+int ipath_verbs_register(const ipath_type device,
+ int (*l_piobufavail) (const ipath_type device),
+ void (*l_rcv) (const ipath_type device,
+ void *rhdr, void *data,
+ uint32_t tlen),
+ void (*l_timer_cb) (const ipath_type device));
+void ipath_verbs_unregister(const ipath_type device);
+int ipath_layer_open(const ipath_type device, uint32_t * pktmax);
+uint16_t ipath_layer_get_lid(const ipath_type device);
+int ipath_layer_get_mac(const ipath_type device, uint8_t *);
+uint16_t ipath_layer_get_bcast(const ipath_type device);
+int ipath_layer_get_num_of_dev(void);
+int ipath_layer_get_cr_errpkey(const ipath_type device);
+int ipath_kset_linkstate(uint32_t arg);
+int ipath_kset_mtu(uint32_t);
+void ipath_set_sps_lid(const ipath_type, uint32_t);
+void ipath_layer_close(const ipath_type device);
+int ipath_layer_send(const ipath_type device, void *hdr, void *data,
+ uint32_t datalen);
+int ipath_verbs_send(const ipath_type device, uint32_t hdrwords,
+ uint32_t *hdr, uint32_t len,
+ struct ipath_sge_state *ss);
+int ipath_layer_send_skb(struct copy_data_s *cdata);
+void ipath_layer_set_piointbufavail_int(const ipath_type device);
+void ipath_get_boardname(const ipath_type, char *name, size_t namelen);
+void ipath_layer_snapshot_counters(const ipath_type t, uint64_t * swords,
+ uint64_t * rwords, uint64_t * spkts, uint64_t * rpkts);
+void ipath_layer_get_counters(const ipath_type device,
+ struct ipath_layer_counters *cntrs);
+void ipath_layer_want_buffer(const ipath_type t);
+int ipath_layer_set_guid(const ipath_type t, uint64_t guid);
+uint64_t ipath_layer_get_guid(const ipath_type t);
+uint32_t ipath_layer_get_nguid(const ipath_type t);
+int ipath_layer_query_device(const ipath_type t, uint32_t * vendor,
+ uint32_t * boardrev, uint32_t * majrev,
+ uint32_t * minrev);
+uint32_t ipath_layer_get_flags(const ipath_type t);
+struct device *ipath_layer_get_pcidev(const ipath_type t);
+uint16_t ipath_layer_get_deviceid(const ipath_type t);
+uint64_t ipath_layer_get_lastibcstat(const ipath_type t);
+uint32_t ipath_layer_get_ibmtu(const ipath_type t);
+void ipath_layer_enable_timer(const ipath_type t);
+void ipath_layer_disable_timer(const ipath_type t);
+unsigned ipath_verbs_get_flags(const ipath_type device);
+void ipath_verbs_set_flags(const ipath_type device, unsigned flags);
+unsigned ipath_layer_get_npkeys(const ipath_type device);
+unsigned ipath_layer_get_pkey(const ipath_type device, unsigned index);
+void ipath_layer_get_pkeys(const ipath_type device, uint16_t *pkeys);
+int ipath_layer_set_pkeys(const ipath_type device, uint16_t *pkeys);
+int ipath_layer_get_linkdowndefaultstate(const ipath_type device);
+int ipath_layer_set_linkdowndefaultstate(const ipath_type device, int sleep);
+int ipath_layer_get_phyerrthreshold(const ipath_type device);
+int ipath_layer_set_phyerrthreshold(const ipath_type device, unsigned n);
+int ipath_layer_get_overrunthreshold(const ipath_type device);
+int ipath_layer_set_overrunthreshold(const ipath_type device, unsigned n);
+
+/* ipath_ether interrupt values */
+#define IPATH_LAYER_INT_IF_UP 0x2
+#define IPATH_LAYER_INT_IF_DOWN 0x4
+#define IPATH_LAYER_INT_LID 0x8
+#define IPATH_LAYER_INT_SEND_CONTINUE 0x10
+#define IPATH_LAYER_INT_BCAST 0x40
+
+/* _verbs_layer.l_flags */
+#define IPATH_VERBS_KERNEL_SMA 0x1
+
+#endif /* _IPATH_LAYER_H */
diff -r a3a00f637da6 -r 2d9a3f27a10c drivers/infiniband/hw/ipath/ipath_registers.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_registers.h Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,355 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#ifndef _IPATH_REGISTERS_H
+#define _IPATH_REGISTERS_H
+
+/*
+ * This file should only be included by kernel source, and by the diags.
+ * It defines the registers, and their contents, for the InfiniPath HT-400 chip
+ */
+
+/*
+ * These are the InfiniPath register and buffer bit definitions
+ * that are visible to software, and needed only by the kernel
+ * and diag code. A few that are visible to protocol and user
+ * code are in ipath_common.h. Some bits are specific
+ * to a given chip implementation, and have been moved to the
+ * chip-specific source file.
+ */
+
+/* kr_revision bits */
+#define INFINIPATH_R_CHIPREVMINOR_MASK 0xFF
+#define INFINIPATH_R_CHIPREVMINOR_SHIFT 0
+#define INFINIPATH_R_CHIPREVMAJOR_MASK 0xFF
+#define INFINIPATH_R_CHIPREVMAJOR_SHIFT 8
+#define INFINIPATH_R_ARCH_MASK 0xFF
+#define INFINIPATH_R_ARCH_SHIFT 16
+#define INFINIPATH_R_SOFTWARE_MASK 0xFF
+#define INFINIPATH_R_SOFTWARE_SHIFT 24
+#define INFINIPATH_R_BOARDID_MASK 0xFF
+#define INFINIPATH_R_BOARDID_SHIFT 32
+
+/* kr_control bits */
+#define INFINIPATH_C_FREEZEMODE 0x00000002
+#define INFINIPATH_C_LINKENABLE 0x00000004
+
+/* kr_sendctrl bits */
+#define INFINIPATH_S_DISARMPIOBUF_SHIFT 16
+#define INFINIPATH_S_ABORT 0x00000001U
+#define INFINIPATH_S_PIOINTBUFAVAIL 0x00000002U
+#define INFINIPATH_S_PIOBUFAVAILUPD 0x00000004U
+#define INFINIPATH_S_PIOENABLE 0x00000008U
+#define INFINIPATH_S_DISARM 0x80000000U
+
+/* kr_rcvctrl bits */
+#define INFINIPATH_R_PORTENABLE_SHIFT 0
+#define INFINIPATH_R_INTRAVAIL_SHIFT 16
+#define INFINIPATH_R_TAILUPD 0x80000000
+
+/* kr_intstatus, kr_intclear, kr_intmask bits */
+#define INFINIPATH_I_RCVURG_SHIFT 0
+#define INFINIPATH_I_RCVAVAIL_SHIFT 12
+#define INFINIPATH_I_ERROR 0x80000000
+#define INFINIPATH_I_SPIOSENT 0x40000000
+#define INFINIPATH_I_SPIOBUFAVAIL 0x20000000
+#define INFINIPATH_I_GPIO 0x10000000
+
+/* kr_errorstatus, kr_errorclear, kr_errormask bits */
+#define INFINIPATH_E_RFORMATERR 0x0000000000000001ULL
+#define INFINIPATH_E_RVCRC 0x0000000000000002ULL
+#define INFINIPATH_E_RICRC 0x0000000000000004ULL
+#define INFINIPATH_E_RMINPKTLEN 0x0000000000000008ULL
+#define INFINIPATH_E_RMAXPKTLEN 0x0000000000000010ULL
+#define INFINIPATH_E_RLONGPKTLEN 0x0000000000000020ULL
+#define INFINIPATH_E_RSHORTPKTLEN 0x0000000000000040ULL
+#define INFINIPATH_E_RUNEXPCHAR 0x0000000000000080ULL
+#define INFINIPATH_E_RUNSUPVL 0x0000000000000100ULL
+#define INFINIPATH_E_REBP 0x0000000000000200ULL
+#define INFINIPATH_E_RIBFLOW 0x0000000000000400ULL
+#define INFINIPATH_E_RBADVERSION 0x0000000000000800ULL
+#define INFINIPATH_E_RRCVEGRFULL 0x0000000000001000ULL
+#define INFINIPATH_E_RRCVHDRFULL 0x0000000000002000ULL
+#define INFINIPATH_E_RBADTID 0x0000000000004000ULL
+#define INFINIPATH_E_RHDRLEN 0x0000000000008000ULL
+#define INFINIPATH_E_RHDR 0x0000000000010000ULL
+#define INFINIPATH_E_RIBLOSTLINK 0x0000000000020000ULL
+#define INFINIPATH_E_SMINPKTLEN 0x0000000020000000ULL
+#define INFINIPATH_E_SMAXPKTLEN 0x0000000040000000ULL
+#define INFINIPATH_E_SUNDERRUN 0x0000000080000000ULL
+#define INFINIPATH_E_SPKTLEN 0x0000000100000000ULL
+#define INFINIPATH_E_SDROPPEDSMPPKT 0x0000000200000000ULL
+#define INFINIPATH_E_SDROPPEDDATAPKT 0x0000000400000000ULL
+#define INFINIPATH_E_SPIOARMLAUNCH 0x0000000800000000ULL
+#define INFINIPATH_E_SUNEXPERRPKTNUM 0x0000001000000000ULL
+#define INFINIPATH_E_SUNSUPVL 0x0000002000000000ULL
+#define INFINIPATH_E_IBSTATUSCHANGED 0x0001000000000000ULL
+#define INFINIPATH_E_INVALIDADDR 0x0002000000000000ULL
+#define INFINIPATH_E_RESET 0x0004000000000000ULL
+#define INFINIPATH_E_HARDWARE 0x0008000000000000ULL
+
+/* kr_hwerrclear, kr_hwerrmask, kr_hwerrstatus bits */
+#define INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT 0
+#define INFINIPATH_HWE_TXEMEMPARITYERR_MASK 0xFULL
+#define INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT 40
+#define INFINIPATH_HWE_RXEMEMPARITYERR_MASK 0x7FULL
+#define INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT 44
+#define INFINIPATH_HWE_HTCBUSTREQPARITYERR 0x0000000080000000ULL
+#define INFINIPATH_HWE_HTCBUSTRESPPARITYERR 0x0000000100000000ULL
+#define INFINIPATH_HWE_HTCBUSIREQPARITYERR 0x0000000200000000ULL
+#define INFINIPATH_HWE_RXDSYNCMEMPARITYERR 0x0000000400000000ULL
+#define INFINIPATH_HWE_SERDESPLLFAILED 0x2000000000000000ULL
+#define INFINIPATH_HWE_IBCBUSTOSPCPARITYERR 0x4000000000000000ULL
+#define INFINIPATH_HWE_IBCBUSFRSPCPARITYERR 0x8000000000000000ULL
+
+/* kr_hwdiagctrl bits */
+#define INFINIPATH_DC_FORCEHTCENABLE 0x20
+#define INFINIPATH_DC_FORCEHTCMEMPARITYERR_MASK 0x3FULL
+#define INFINIPATH_DC_FORCEHTCMEMPARITYERR_SHIFT 0
+#define INFINIPATH_DC_FORCETXEMEMPARITYERR_MASK 0xFULL
+#define INFINIPATH_DC_FORCETXEMEMPARITYERR_SHIFT 40
+#define INFINIPATH_DC_FORCERXEMEMPARITYERR_MASK 0x7FULL
+#define INFINIPATH_DC_FORCERXEMEMPARITYERR_SHIFT 44
+#define INFINIPATH_DC_FORCEHTCBUSTREQPARITYERR 0x0000000080000000ULL
+#define INFINIPATH_DC_FORCEHTCBUSTRESPPARITYERR 0x0000000100000000ULL
+#define INFINIPATH_DC_FORCEHTCBUSIREQPARITYERR 0x0000000200000000ULL
+#define INFINIPATH_DC_FORCERXDSYNCMEMPARITYERR 0x0000000400000000ULL
+#define INFINIPATH_DC_COUNTERDISABLE 0x1000000000000000ULL
+#define INFINIPATH_DC_COUNTERWREN 0x2000000000000000ULL
+#define INFINIPATH_DC_FORCEIBCBUSTOSPCPARITYERR 0x4000000000000000ULL
+#define INFINIPATH_DC_FORCEIBCBUSFRSPCPARITYERR 0x8000000000000000ULL
+
+/* kr_ibcctrl bits */
+#define INFINIPATH_IBCC_FLOWCTRLPERIOD_MASK 0xFFULL
+#define INFINIPATH_IBCC_FLOWCTRLPERIOD_SHIFT 0
+#define INFINIPATH_IBCC_FLOWCTRLWATERMARK_MASK 0xFFULL
+#define INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT 8
+#define INFINIPATH_IBCC_LINKINITCMD_MASK 0x3ULL
+#define INFINIPATH_IBCC_LINKINITCMD_DISABLE 1
+/* cycle through TS1/TS2 till OK */
+#define INFINIPATH_IBCC_LINKINITCMD_POLL 2
+#define INFINIPATH_IBCC_LINKINITCMD_SLEEP 3 /* wait for TS1, then go on */
+#define INFINIPATH_IBCC_LINKINITCMD_SHIFT 16
+#define INFINIPATH_IBCC_LINKCMD_MASK 0x3ULL
+#define INFINIPATH_IBCC_LINKCMD_INIT 1 /* move to 0x11 */
+#define INFINIPATH_IBCC_LINKCMD_ARMED 2 /* move to 0x21 */
+#define INFINIPATH_IBCC_LINKCMD_ACTIVE 3 /* move to 0x31 */
+#define INFINIPATH_IBCC_LINKCMD_SHIFT 18
+#define INFINIPATH_IBCC_MAXPKTLEN_MASK 0x7FFULL
+#define INFINIPATH_IBCC_MAXPKTLEN_SHIFT 20
+#define INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK 0xFULL
+#define INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT 32
+#define INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK 0xFULL
+#define INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT 36
+#define INFINIPATH_IBCC_CREDITSCALE_MASK 0x7ULL
+#define INFINIPATH_IBCC_CREDITSCALE_SHIFT 40
+#define INFINIPATH_IBCC_LOOPBACK 0x8000000000000000ULL
+#define INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE 0x4000000000000000ULL
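+
+/*
+ * Illustrative sketch: a "move link to armed" command would be composed
+ * from the fields above roughly as
+ *
+ *	(INFINIPATH_IBCC_LINKCMD_ARMED & INFINIPATH_IBCC_LINKCMD_MASK)
+ *		<< INFINIPATH_IBCC_LINKCMD_SHIFT
+ *
+ * before being OR'ed into the ibcctrl shadow and written to the chip.
+ */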
+
+/* kr_ibcstatus bits */
+#define INFINIPATH_IBCS_LINKTRAININGSTATE_MASK 0xF
+#define INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT 0
+#define INFINIPATH_IBCS_LINKSTATE_MASK 0x7
+#define INFINIPATH_IBCS_LINKSTATE_SHIFT 4
+#define INFINIPATH_IBCS_TXREADY 0x40000000
+#define INFINIPATH_IBCS_TXCREDITOK 0x80000000
+
+/* kr_extstatus bits */
+#define INFINIPATH_EXTS_SERDESPLLLOCK 0x1
+#define INFINIPATH_EXTS_GPIOIN_MASK 0xFFFFULL
+#define INFINIPATH_EXTS_GPIOIN_SHIFT 48
+
+/* kr_extctrl bits */
+#define INFINIPATH_EXTC_GPIOINVERT_MASK 0xFFFFULL
+#define INFINIPATH_EXTC_GPIOINVERT_SHIFT 32
+#define INFINIPATH_EXTC_GPIOOE_MASK 0xFFFFULL
+#define INFINIPATH_EXTC_GPIOOE_SHIFT 48
+#define INFINIPATH_EXTC_SERDESENABLE 0x80000000ULL
+#define INFINIPATH_EXTC_SERDESCONNECT 0x40000000ULL
+#define INFINIPATH_EXTC_SERDESENTRUNKING 0x20000000ULL
+#define INFINIPATH_EXTC_SERDESDISRXFIFO 0x10000000ULL
+#define INFINIPATH_EXTC_SERDESENPLPBK1 0x08000000ULL
+#define INFINIPATH_EXTC_SERDESENPLPBK2 0x04000000ULL
+#define INFINIPATH_EXTC_SERDESENENCDEC 0x02000000ULL
+#define INFINIPATH_EXTC_LEDSECPORTGREENON 0x00000020ULL
+#define INFINIPATH_EXTC_LEDSECPORTYELLOWON 0x00000010ULL
+#define INFINIPATH_EXTC_LEDPRIPORTGREENON 0x00000008ULL
+#define INFINIPATH_EXTC_LEDPRIPORTYELLOWON 0x00000004ULL
+#define INFINIPATH_EXTC_LEDGBLOKGREENON 0x00000002ULL
+#define INFINIPATH_EXTC_LEDGBLERRREDOFF 0x00000001ULL
+
+/* kr_mdio bits */
+#define INFINIPATH_MDIO_CLKDIV_MASK 0x7FULL
+#define INFINIPATH_MDIO_CLKDIV_SHIFT 32
+#define INFINIPATH_MDIO_COMMAND_MASK 0x7ULL
+#define INFINIPATH_MDIO_COMMAND_SHIFT 26
+#define INFINIPATH_MDIO_DEVADDR_MASK 0x1FULL
+#define INFINIPATH_MDIO_DEVADDR_SHIFT 21
+#define INFINIPATH_MDIO_REGADDR_MASK 0x1FULL
+#define INFINIPATH_MDIO_REGADDR_SHIFT 16
+#define INFINIPATH_MDIO_DATA_MASK 0xFFFFULL
+#define INFINIPATH_MDIO_DATA_SHIFT 0
+#define INFINIPATH_MDIO_CMDVALID 0x0000000040000000ULL
+#define INFINIPATH_MDIO_RDDATAVALID 0x0000000080000000ULL
+
+/* kr_partitionkey bits */
+#define INFINIPATH_PKEY_SIZE 16
+#define INFINIPATH_PKEY_MASK 0xFFFF
+#define INFINIPATH_PKEY_DEFAULT_PKEY 0xFFFF
+
+/* kr_serdesconfig0 bits */
+#define INFINIPATH_SERDC0_RESET_MASK 0xfULL /* overall reset bits */
+#define INFINIPATH_SERDC0_RESET_PLL 0x10000000ULL /* pll reset */
+#define INFINIPATH_SERDC0_TXIDLE 0xF000ULL /* tx idle enables (per lane) */
+
+/* kr_xgxsconfig bits */
+#define INFINIPATH_XGXS_RESET 0x7ULL
+#define INFINIPATH_XGXS_MDIOADDR_MASK 0xfULL
+#define INFINIPATH_XGXS_MDIOADDR_SHIFT 4
+
+/* TID entries (memory) */
+#define INFINIPATH_RT_VALID 0x8000000000000000ULL
+#define INFINIPATH_RT_ADDR_MASK 0xFFFFFFFFFFULL
+#define INFINIPATH_RT_ADDR_SHIFT 0
+#define INFINIPATH_RT_BUFSIZE_MASK 0x3FFF
+#define INFINIPATH_RT_BUFSIZE_SHIFT 48
+
+/* mask of defined bits for various registers */
+extern const uint64_t infinipath_c_bitsextant,
+ infinipath_s_bitsextant, infinipath_r_bitsextant,
+ infinipath_i_bitsextant, infinipath_e_bitsextant,
+ infinipath_hwe_bitsextant, infinipath_dc_bitsextant,
+ infinipath_extc_bitsextant, infinipath_mdio_bitsextant,
+ infinipath_ibcs_bitsextant, infinipath_ibcc_bitsextant;
+
+/* masks that are different in different chips */
+extern const uint32_t infinipath_i_rcvavail_mask, infinipath_i_rcvurg_mask;
+extern const uint64_t infinipath_hwe_htcmemparityerr_mask;
+extern const uint64_t infinipath_hwe_spibdcmlockfailed_mask;
+extern const uint64_t infinipath_hwe_sphtdcmlockfailed_mask;
+extern const uint64_t infinipath_hwe_htcdcmlockfailed_mask;
+extern const uint64_t infinipath_hwe_htcdcmlockfailed_shift;
+extern const uint64_t infinipath_hwe_sphtdcmlockfailed_shift;
+extern const uint64_t infinipath_hwe_spibdcmlockfailed_shift;
+
+extern const uint64_t infinipath_hwe_htclnkabyte0crcerr;
+extern const uint64_t infinipath_hwe_htclnkabyte1crcerr;
+extern const uint64_t infinipath_hwe_htclnkbbyte0crcerr;
+extern const uint64_t infinipath_hwe_htclnkbbyte1crcerr;
+
+/*
+ * These are the infinipath general register numbers (not offsets).
+ * The kernel registers are used directly; those beyond the kernel
+ * registers are calculated from one of the base registers. The use of
+ * an integer type doesn't allow type-checking as thorough as, say,
+ * an enum but allows for better hiding of chip differences.
+ */
+typedef const uint16_t
+ ipath_kreg, /* kernel-only, infinipath general registers */
+ ipath_creg, /* kernel-only, infinipath counter registers */
+ ipath_sreg; /* kernel-only, infinipath send registers */
+
+/*
+ * These are all implemented such that 64 bit accesses work.
+ * Some implement no more than 32 bits. Because 64 bit reads
+ * require 2 HT cmds on opteron, we access those with 32 bit
+ * reads for efficiency (they are written as 64 bits, since
+ * the extra 32 bits are nearly free on writes, and it slightly reduces
+ * complexity). The rest are all accessed as 64 bits.
+ */
+extern ipath_kreg
+ /* These are the 32 bit group */
+ kr_control, kr_counterregbase, kr_intmask, kr_intstatus,
+ kr_pagealign, kr_portcnt, kr_rcvtidbase, kr_rcvtidcnt,
+ kr_rcvegrbase, kr_rcvegrcnt, kr_scratch, kr_sendctrl,
+ kr_sendpiobufbase, kr_sendpiobufcnt, kr_sendpiosize,
+ kr_sendregbase, kr_userregbase,
+ /* These are the 64 bit group */
+ kr_debugport, kr_debugportselect, kr_errorclear, kr_errormask,
+ kr_errorstatus, kr_extctrl, kr_extstatus, kr_gpio_clear, kr_gpio_mask,
+ kr_gpio_out, kr_gpio_status, kr_hwdiagctrl, kr_hwerrclear,
+ kr_hwerrmask, kr_hwerrstatus, kr_ibcctrl, kr_ibcstatus, kr_intblocked,
+ kr_intclear, kr_interruptconfig, kr_mdio, kr_partitionkey, kr_rcvbthqp,
+ kr_rcvbufbase, kr_rcvbufsize, kr_rcvctrl, kr_rcvhdrcnt,
+ kr_rcvhdrentsize, kr_rcvhdrsize, kr_rcvintmembase, kr_rcvintmemsize,
+ kr_revision, kr_sendbuffererror, kr_sendbuffererror1,
+ kr_sendbuffererror2, kr_sendbuffererror3, kr_sendpioavailaddr,
+ kr_serdesconfig0, kr_serdesconfig1, kr_serdesstatus, kr_txintmembase,
+ kr_txintmemsize, kr_xgxsconfig,
+ __kr_invalid, /* a marker for debug, don't use them directly */
+ /* a marker for debug, don't use them directly */
+ __kr_lastvaliddirect,
+ /* use only with ipath_k*_kreg64_port(), not *kreg64() */
+ kr_rcvhdraddr,
+ /* use only with ipath_k*_kreg64_port(), not *kreg64() */
+ kr_rcvhdrtailaddr,
+ /* we define the full set for the diags, the kernel doesn't use them */
+ kr_rcvhdraddr1, kr_rcvhdraddr2, kr_rcvhdraddr3, kr_rcvhdraddr4,
+ kr_rcvhdraddr5, kr_rcvhdraddr6, kr_rcvhdraddr7, kr_rcvhdraddr8,
+ kr_rcvhdrtailaddr1, kr_rcvhdrtailaddr2, kr_rcvhdrtailaddr3,
+ kr_rcvhdrtailaddr4, kr_rcvhdrtailaddr5, kr_rcvhdrtailaddr6,
+ kr_rcvhdrtailaddr7, kr_rcvhdrtailaddr8;
+
+/*
+ * first of the pioavail registers, the total number is
+ * (kr_sendpiobufcnt / 32); each buffer uses 2 bits
+ */
+extern ipath_sreg sr_sendpioavail;
+
+extern ipath_creg cr_badformatcnt, cr_erricrccnt, cr_errlinkcnt,
+ cr_errlpcrccnt, cr_errpkey, cr_errrcvflowctrlcnt,
+ cr_err_rlencnt, cr_errslencnt, cr_errtidfull,
+ cr_errtidvalid, cr_errvcrccnt, cr_ibstatuschange,
+ cr_intcnt, cr_invalidrlencnt, cr_invalidslencnt,
+ cr_lbflowstallcnt, cr_iblinkdowncnt, cr_iblinkerrrecovcnt,
+ cr_ibsymbolerrcnt, cr_pktrcvcnt, cr_pktrcvflowctrlcnt,
+ cr_pktsendcnt, cr_pktsendflowcnt, cr_portovflcnt,
+ cr_portovflcnt1, cr_portovflcnt2, cr_portovflcnt3, cr_portovflcnt4,
+ cr_portovflcnt5, cr_portovflcnt6, cr_portovflcnt7, cr_portovflcnt8,
+ cr_rcvebpcnt, cr_rcvovflcnt, cr_rxdroppktcnt,
+ cr_senddropped, cr_sendstallcnt, cr_sendunderruncnt,
+ cr_unsupvlcnt, cr_wordrcvcnt, cr_wordsendcnt;
+
+/*
+ * register bits for selecting i2c direction and values, used for I2C serial
+ * flash
+ */
+extern const uint16_t ipath_gpio_sda_num;
+extern const uint16_t ipath_gpio_scl_num;
+extern const uint64_t ipath_gpio_sda;
+extern const uint64_t ipath_gpio_scl;
+
+#endif /* _IPATH_REGISTERS_H */
diff -r a3a00f637da6 -r 2d9a3f27a10c drivers/infiniband/hw/ipath/ips_common.h
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ips_common.h Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,249 @@
+#ifndef IPS_COMMON_H
+#define IPS_COMMON_H
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#include "ipath_common.h"
+
+struct ipath_header_typ {
+ /*
+ * Version - 4 bits, Port - 4 bits, TID - 10 bits and Offset - 14 bits
+ * before ECO change ~28 Dec 03.
+ * After that, Vers 4, Port 3, TID 11, offset 14.
+ */
+ uint32_t ver_port_tid_offset;
+ uint16_t chksum;
+ uint16_t pkt_flags;
+};
+
+struct ips_message_header_typ {
+ uint16_t lrh[4];
+ uint32_t bth[3];
+ struct ipath_header_typ iph;
+ uint8_t sub_opcode;
+ uint8_t flags;
+ uint16_t src_rank;
+	/* 24 bits. The upper 8 bits are available for other use */
+ union {
+ struct {
+ unsigned ack_seq_num : 24;
+ unsigned port : 4;
+ unsigned unused : 4;
+ };
+ uint32_t ack_seq_num_org;
+ };
+ uint8_t expected_tid_session_id;
+ uint8_t tinylen; /* to aid MPI */
+ uint16_t tag; /* to aid MPI */
+ union {
+ uint32_t mpi[4]; /* to aid MPI */
+ uint32_t data[4];
+ struct {
+ uint16_t mtu;
+ uint8_t major_ver;
+ uint8_t minor_ver;
+			uint32_t not_used; /* free */
+ uint32_t run_id;
+ uint32_t client_ver;
+ };
+ };
+};
+
+struct ether_header_typ {
+ uint16_t lrh[4];
+ uint32_t bth[3];
+ struct ipath_header_typ iph;
+ uint8_t sub_opcode;
+ uint8_t cmd;
+ uint16_t lid;
+ uint16_t mac[3];
+ uint8_t frag_num;
+ uint8_t seq_num;
+ uint32_t len;
+	/* MUST be of word size due to PIO write requirements */
+ uint32_t csum;
+ uint16_t csum_offset;
+ uint16_t flags;
+ uint16_t first_2_bytes;
+ uint8_t unused[2]; /* currently unused */
+};
+
+/*
+ * The PIO buffer used for sending infinipath messages must only be written
+ * in 32-bit words, all the data must be written, and no writes can occur
+ * after the last word is written (which transfers "ownership" of the buffer
+ * to the chip and triggers the message to be sent).
+ * Since a Linux sk_buff can be recursive, unaligned, and hold any number
+ * of bytes in each segment, we use the following structure to keep
+ * information about the overall state of the copy operation.
+ * This is used to save the information needed to store the checksum
+ * in the right place before sending the last word to the hardware and
+ * to buffer the last 0-3 bytes of non-word sized segments.
+ */
+struct copy_data_s {
+ struct ether_header_typ *hdr;
+ uint32_t __iomem *csum_pio; /* addr of PIO buf to write csum to */
+ uint32_t __iomem *to; /* addr of PIO buf to write data to */
+ uint32_t device; /* which device to allocate PIO bufs from */
+ int error; /* set if there is an error. */
+ int extra; /* amount of data saved in u.buf below */
+ unsigned int len; /* total length to send in bytes */
+ unsigned int flen; /* fragment length in words */
+ unsigned int csum; /* partial IP checksum */
+ unsigned int pos; /* position for partial checksum */
+ unsigned int offset; /* offset to where data currently starts */
+ int checksum_calc; /* set to 'true' when the checksum has been calculated */
+ struct sk_buff *skb;
+ union {
+ uint32_t w;
+ uint8_t buf[4];
+ } u;
+};
+
+/* IB - LRH header consts */
+#define IPS_LRH_GRH 0x0003 /* 1st word of IB LRH - next header: GRH */
+#define IPS_LRH_BTH 0x0002 /* 1st word of IB LRH - next header: BTH */
+
+#define IPS_OFFSET 0
+
+/*
+ * defines the cut-off point between the header queue and eager/expected
+ * TID queue
+ */
+#define NUM_OF_EKSTRA_WORDS_IN_HEADER_QUEUE ((sizeof(struct ips_message_header_typ) - offsetof(struct ips_message_header_typ, iph)) >> 2)
+
+/* OpCodes */
+#define OPCODE_IPS 0xC0
+#define OPCODE_ITH4X 0xC1
+
+/* OpCode 30 is used by stand-alone test programs */
+#define OPCODE_RAW_DATA 0xDE
+/* last OpCode (31) is reserved for test */
+#define OPCODE_TEST 0xDF
+
+/* sub OpCodes - ips */
+#define OPCODE_SEQ_DATA 0x01
+#define OPCODE_SEQ_CTRL 0x02
+
+#define OPCODE_ACK 0x10
+#define OPCODE_NAK 0x11
+
+#define OPCODE_ERR_CHK 0x20
+#define OPCODE_ERR_CHK_PLS 0x21
+
+#define OPCODE_STARTUP 0x30
+#define OPCODE_STARTUP_ACK 0x31
+#define OPCODE_STARTUP_NAK 0x32
+
+#define OPCODE_STARTUP_EXT 0x34
+#define OPCODE_STARTUP_ACK_EXT 0x35
+#define OPCODE_STARTUP_NAK_EXT 0x36
+
+#define OPCODE_TIDS_RELEASE 0x40
+#define OPCODE_TIDS_RELEASE_CONFIRM 0x41
+
+#define OPCODE_CLOSE 0x50
+#define OPCODE_CLOSE_ACK 0x51
+/*
+ * like OPCODE_CLOSE, but no complaint if the other side has already closed.
+ * Used when doing abort(), MPI_Abort(), etc.
+ */
+#define OPCODE_ABORT 0x52
+
+/* sub OpCodes - ith4x */
+#define OPCODE_ENCAP 0x81
+#define OPCODE_LID_ARP 0x82
+
+/* Receive Header Queue: receive type (from infinipath) */
+#define RCVHQ_RCV_TYPE_EXPECTED 0
+#define RCVHQ_RCV_TYPE_EAGER 1
+#define RCVHQ_RCV_TYPE_NON_KD 2
+#define RCVHQ_RCV_TYPE_ERROR 3
+
+/* misc. */
+#define SIZE_OF_CRC 1
+
+#define EAGER_TID_ID INFINIPATH_I_TID_MASK
+
+#define IPS_DEFAULT_P_KEY 0xFFFF
+
+/* functions for extracting fields from rcvhdrq entries */
+static inline uint32_t ips_get_hdr_err_flags(uint32_t *rbuf)
+{
+ return rbuf[1];
+}
+
+static inline uint32_t ips_get_index(uint32_t *rbuf)
+{
+ return (rbuf[0] >> INFINIPATH_RHF_EGRINDEX_SHIFT)
+ & INFINIPATH_RHF_EGRINDEX_MASK;
+}
+
+static inline uint32_t ips_get_rcv_type(uint32_t *rbuf)
+{
+ return (rbuf[0] >> INFINIPATH_RHF_RCVTYPE_SHIFT)
+ & INFINIPATH_RHF_RCVTYPE_MASK;
+}
+
+static inline uint32_t ips_get_length_in_bytes(uint32_t *rbuf)
+{
+ return ((rbuf[0] >> INFINIPATH_RHF_LENGTH_SHIFT)
+ & INFINIPATH_RHF_LENGTH_MASK) << 2;
+}
+
+static inline void *ips_get_first_protocol_header(uint32_t *rbuf)
+{
+ return (void *)&rbuf[2];
+}
+
+static inline struct ips_message_header_typ *ips_get_ips_header(uint32_t *rbuf)
+{
+ return (struct ips_message_header_typ *)&rbuf[2];
+}
+
+static inline uint32_t ips_get_ipath_ver(uint32_t hdrword)
+{
+ return (hdrword >> INFINIPATH_I_VERS_SHIFT)
+ & INFINIPATH_I_VERS_MASK;
+}
+
+/*
+ * Copy routine that is guaranteed to work in terms of aligned 32-bit
+ * quantities.
+ */
+void ipath_dwordcpy(uint32_t *dest, uint32_t *src, uint32_t ndwords);
+
+#endif /* IPS_COMMON_H */
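The header above only declares ipath_dwordcpy(); its definition lives elsewhere
in the driver. As a rough sketch of the contract the comment describes (every
access issued as an aligned 32-bit word, so the chip never sees byte or
partial-word stores), a plain C fallback might look like the following; this is
an illustrative sketch under that assumption, not the routine the patch
actually ships:

/*
 * Sketch only: copy ndwords 32-bit quantities using nothing but aligned
 * word-sized loads and stores. A real MMIO destination would normally go
 * through writel()-style accessors rather than a plain pointer dereference.
 */
void ipath_dwordcpy(uint32_t *dest, uint32_t *src, uint32_t ndwords)
{
	while (ndwords--)
		*dest++ = *src++;
}

This matters for the PIO send buffers described above, which must only be
written in whole 32-bit words, and for header-queue entries laid out as packed
32-bit words.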

2005-12-29 00:44:33

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 15 of 20] ipath - infiniband verbs support, part 1 of 3

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r 26993cb5faee -r 471b7a7a005c drivers/infiniband/hw/ipath/ipath_verbs.c
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,2307 @@
+/*
+ * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#include <linux/config.h>
+#include <linux/version.h>
+#include <linux/pci.h>
+#include <linux/err.h>
+#include <rdma/ib_pack.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_mad.h>
+#include <rdma/ib_user_verbs.h>
+
+#include <asm/uaccess.h>
+#include <asm-generic/bug.h>
+
+#include "ips_common.h"
+#include "ipath_layer.h"
+#include "ipath_verbs.h"
+
+/*
+ * Compare the lower 24 bits of the two values.
+ * Returns an integer less than, equal to, or greater than zero.
+ */
+static inline int cmp24(u32 a, u32 b)
+{
+ return (((int) a) - ((int) b)) << 8;
+}
+
+#define MODNAME "ib_ipath"
+#define DRIVER_LOAD_MSG "PathScale " MODNAME " loaded: "
+#define PFX MODNAME ": "
+
+
+#define BITS_PER_PAGE (PAGE_SIZE*BITS_PER_BYTE)
+#define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1)
+#define mk_qpn(qpt, map, off) (((map) - (qpt)->map)*BITS_PER_PAGE + (off))
+#define find_next_offset(map, off) \
+ find_next_zero_bit((map)->page, BITS_PER_PAGE, off)
+
+/* Not static, because we don't want the compiler removing it */
+const char ipath_verbs_version[] = "ipath_verbs " IPATH_IDSTR;
+
+unsigned int ib_ipath_qp_table_size = 251;
+module_param(ib_ipath_qp_table_size, uint, 0444);
+MODULE_PARM_DESC(ib_ipath_qp_table_size, "QP table size");
+
+unsigned int ib_ipath_lkey_table_size = 12;
+module_param(ib_ipath_lkey_table_size, uint, 0444);
+MODULE_PARM_DESC(ib_ipath_lkey_table_size,
+ "LKEY table size in bits (2^n, 1 <= n <= 23)");
+
+unsigned int ib_ipath_debug; /* debug mask */
+module_param(ib_ipath_debug, uint, 0644);
+MODULE_PARM_DESC(ib_ipath_debug, "Verbs debug mask");
+
+
+static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss,
+ u32 len, struct ib_send_wr *wr, struct ib_wc *wc);
+static void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc);
+static int ipath_destroy_qp(struct ib_qp *ibqp);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("PathScale <[email protected]>");
+MODULE_DESCRIPTION("PathScale InfiniPath driver");
+
+enum {
+ IPATH_FAULT_RC_DROP_SEND_F = 1,
+ IPATH_FAULT_RC_DROP_SEND_M,
+ IPATH_FAULT_RC_DROP_SEND_L,
+ IPATH_FAULT_RC_DROP_SEND_O,
+ IPATH_FAULT_RC_DROP_RDMA_WRITE_F,
+ IPATH_FAULT_RC_DROP_RDMA_WRITE_M,
+ IPATH_FAULT_RC_DROP_RDMA_WRITE_L,
+ IPATH_FAULT_RC_DROP_RDMA_WRITE_O,
+ IPATH_FAULT_RC_DROP_RDMA_READ_RESP_F,
+ IPATH_FAULT_RC_DROP_RDMA_READ_RESP_M,
+ IPATH_FAULT_RC_DROP_RDMA_READ_RESP_L,
+ IPATH_FAULT_RC_DROP_RDMA_READ_RESP_O,
+ IPATH_FAULT_RC_DROP_ACK,
+};
+
+enum {
+ IPATH_TRANS_INVALID = 0,
+ IPATH_TRANS_ANY2RST,
+ IPATH_TRANS_RST2INIT,
+ IPATH_TRANS_INIT2INIT,
+ IPATH_TRANS_INIT2RTR,
+ IPATH_TRANS_RTR2RTS,
+ IPATH_TRANS_RTS2RTS,
+ IPATH_TRANS_SQERR2RTS,
+ IPATH_TRANS_ANY2ERR,
+ IPATH_TRANS_RTS2SQD, /* XXX Wait for expected ACKs & signal event */
+ IPATH_TRANS_SQD2SQD, /* error if not drained & parameter change */
+ IPATH_TRANS_SQD2RTS, /* error if not drained */
+};
+
+enum {
+ IPATH_POST_SEND_OK = 0x0001,
+ IPATH_POST_RECV_OK = 0x0002,
+ IPATH_PROCESS_RECV_OK = 0x0004,
+ IPATH_PROCESS_SEND_OK = 0x0008,
+};
+
+static int state_ops[IB_QPS_ERR + 1] = {
+ [IB_QPS_RESET] = 0,
+ [IB_QPS_INIT] = IPATH_POST_RECV_OK,
+ [IB_QPS_RTR] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK,
+ [IB_QPS_RTS] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK |
+ IPATH_POST_SEND_OK | IPATH_PROCESS_SEND_OK,
+ [IB_QPS_SQD] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK |
+ IPATH_POST_SEND_OK,
+ [IB_QPS_SQE] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK,
+ [IB_QPS_ERR] = 0,
+};
+
+/*
+ * Convert the AETH credit code into the number of credits.
+ */
+static u32 credit_table[31] = {
+ 0, /* 0 */
+ 1, /* 1 */
+ 2, /* 2 */
+ 3, /* 3 */
+ 4, /* 4 */
+ 6, /* 5 */
+ 8, /* 6 */
+ 12, /* 7 */
+ 16, /* 8 */
+ 24, /* 9 */
+ 32, /* A */
+ 48, /* B */
+ 64, /* C */
+ 96, /* D */
+ 128, /* E */
+ 192, /* F */
+ 256, /* 10 */
+ 384, /* 11 */
+ 512, /* 12 */
+ 768, /* 13 */
+ 1024, /* 14 */
+ 1536, /* 15 */
+ 2048, /* 16 */
+ 3072, /* 17 */
+ 4096, /* 18 */
+ 6144, /* 19 */
+ 8192, /* 1A */
+ 12288, /* 1B */
+ 16384, /* 1C */
+ 24576, /* 1D */
+ 32768 /* 1E */
+};
+
+/*
+ * Convert the AETH RNR timeout code into the number of milliseconds.
+ */
+static u32 rnr_table[32] = {
+ 656, /* 0 */
+ 1, /* 1 */
+ 1, /* 2 */
+ 1, /* 3 */
+ 1, /* 4 */
+ 1, /* 5 */
+ 1, /* 6 */
+ 1, /* 7 */
+ 1, /* 8 */
+ 1, /* 9 */
+ 1, /* A */
+ 1, /* B */
+ 1, /* C */
+ 1, /* D */
+ 2, /* E */
+ 2, /* F */
+ 3, /* 10 */
+ 4, /* 11 */
+ 6, /* 12 */
+ 8, /* 13 */
+ 11, /* 14 */
+ 16, /* 15 */
+ 21, /* 16 */
+ 31, /* 17 */
+ 41, /* 18 */
+ 62, /* 19 */
+ 82, /* 1A */
+ 123, /* 1B */
+ 164, /* 1C */
+ 246, /* 1D */
+ 328, /* 1E */
+ 492 /* 1F */
+};
+
+/*
+ * Translate ib_wr_opcode into ib_wc_opcode.
+ */
+static enum ib_wc_opcode wc_opcode[] = {
+ [IB_WR_RDMA_WRITE] = IB_WC_RDMA_WRITE,
+ [IB_WR_RDMA_WRITE_WITH_IMM] = IB_WC_RDMA_WRITE,
+ [IB_WR_SEND] = IB_WC_SEND,
+ [IB_WR_SEND_WITH_IMM] = IB_WC_SEND,
+ [IB_WR_RDMA_READ] = IB_WC_RDMA_READ,
+ [IB_WR_ATOMIC_CMP_AND_SWP] = IB_WC_COMP_SWAP,
+ [IB_WR_ATOMIC_FETCH_AND_ADD] = IB_WC_FETCH_ADD
+};
+
+/*
+ * Number of attached devices and array of device pointers.
+ */
+static uint32_t number_of_devices;
+static struct ipath_ibdev **ipath_devices;
+
+/*
+ * Global table of GID to attached QPs.
+ * The table is global to all ipath devices since a send from one QP/device
+ * needs to be locally routed to any locally attached QPs on the same
+ * or different device.
+ */
+static struct rb_root mcast_tree;
+static spinlock_t mcast_lock = SPIN_LOCK_UNLOCKED;
+
+/*
+ * Allocate a structure to link a QP to the multicast GID structure.
+ */
+static struct ipath_mcast_qp *ipath_mcast_qp_alloc(struct ipath_qp *qp)
+{
+ struct ipath_mcast_qp *mqp;
+
+ mqp = kmalloc(sizeof(*mqp), GFP_KERNEL);
+ if (!mqp)
+ return NULL;
+
+ mqp->qp = qp;
+ atomic_inc(&qp->refcount);
+
+ return mqp;
+}
+
+static void ipath_mcast_qp_free(struct ipath_mcast_qp *mqp)
+{
+ struct ipath_qp *qp = mqp->qp;
+
+ /* Notify ipath_destroy_qp() if it is waiting. */
+ if (atomic_dec_and_test(&qp->refcount))
+ wake_up(&qp->wait);
+
+ kfree(mqp);
+}
+
+/*
+ * Allocate a structure for the multicast GID.
+ * A list of QPs will be attached to this structure.
+ */
+static struct ipath_mcast *ipath_mcast_alloc(union ib_gid *mgid)
+{
+ struct ipath_mcast *mcast;
+
+ mcast = kmalloc(sizeof(*mcast), GFP_KERNEL);
+ if (!mcast)
+ return NULL;
+
+ mcast->mgid = *mgid;
+ INIT_LIST_HEAD(&mcast->qp_list);
+ init_waitqueue_head(&mcast->wait);
+ atomic_set(&mcast->refcount, 0);
+
+ return mcast;
+}
+
+static void ipath_mcast_free(struct ipath_mcast *mcast)
+{
+ struct ipath_mcast_qp *p, *tmp;
+
+ list_for_each_entry_safe(p, tmp, &mcast->qp_list, list)
+ ipath_mcast_qp_free(p);
+
+ kfree(mcast);
+}
+
+/*
+ * Search the global table for the given multicast GID.
+ * Return it or NULL if not found.
+ * The caller is responsible for decrementing the reference count if found.
+ */
+static struct ipath_mcast *ipath_mcast_find(union ib_gid *mgid)
+{
+ struct rb_node *n;
+ unsigned long flags;
+
+ spin_lock_irqsave(&mcast_lock, flags);
+ n = mcast_tree.rb_node;
+ while (n) {
+ struct ipath_mcast *mcast;
+ int ret;
+
+ mcast = rb_entry(n, struct ipath_mcast, rb_node);
+
+ ret = memcmp(mgid->raw, mcast->mgid.raw, sizeof(union ib_gid));
+ if (ret < 0)
+ n = n->rb_left;
+ else if (ret > 0)
+ n = n->rb_right;
+ else {
+ atomic_inc(&mcast->refcount);
+ spin_unlock_irqrestore(&mcast_lock, flags);
+ return mcast;
+ }
+ }
+ spin_unlock_irqrestore(&mcast_lock, flags);
+
+ return NULL;
+}
+
+/*
+ * Insert the multicast GID into the table and
+ * attach the QP structure.
+ * Return zero if both were added.
+ * Return EEXIST if the GID was already in the table but the QP was added.
+ * Return ESRCH if the QP was already attached and neither structure was added.
+ */
+static int ipath_mcast_add(struct ipath_mcast *mcast,
+ struct ipath_mcast_qp *mqp)
+{
+ struct rb_node **n = &mcast_tree.rb_node;
+ struct rb_node *pn = NULL;
+ unsigned long flags;
+
+ spin_lock_irqsave(&mcast_lock, flags);
+
+ while (*n) {
+ struct ipath_mcast *tmcast;
+ struct ipath_mcast_qp *p;
+ int ret;
+
+ pn = *n;
+ tmcast = rb_entry(pn, struct ipath_mcast, rb_node);
+
+ ret = memcmp(mcast->mgid.raw, tmcast->mgid.raw,
+ sizeof(union ib_gid));
+ if (ret < 0) {
+ n = &pn->rb_left;
+ continue;
+ }
+ if (ret > 0) {
+ n = &pn->rb_right;
+ continue;
+ }
+
+ /* Search the QP list to see if this is already there. */
+ list_for_each_entry_rcu(p, &tmcast->qp_list, list) {
+ if (p->qp == mqp->qp) {
+ spin_unlock_irqrestore(&mcast_lock, flags);
+ return ESRCH;
+ }
+ }
+ list_add_tail_rcu(&mqp->list, &tmcast->qp_list);
+ spin_unlock_irqrestore(&mcast_lock, flags);
+ return EEXIST;
+ }
+
+ list_add_tail_rcu(&mqp->list, &mcast->qp_list);
+
+ atomic_inc(&mcast->refcount);
+ rb_link_node(&mcast->rb_node, pn, n);
+ rb_insert_color(&mcast->rb_node, &mcast_tree);
+
+ spin_unlock_irqrestore(&mcast_lock, flags);
+
+ return 0;
+}
+
+static int ipath_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid,
+ u16 lid)
+{
+ struct ipath_qp *qp = to_iqp(ibqp);
+ struct ipath_mcast *mcast;
+ struct ipath_mcast_qp *mqp;
+
+ /*
+ * Allocate data structures since it's better to do this outside of
+ * spin locks and it will most likely be needed.
+ */
+ mcast = ipath_mcast_alloc(gid);
+ if (mcast == NULL)
+ return -ENOMEM;
+ mqp = ipath_mcast_qp_alloc(qp);
+ if (mqp == NULL) {
+ ipath_mcast_free(mcast);
+ return -ENOMEM;
+ }
+ switch (ipath_mcast_add(mcast, mqp)) {
+ case ESRCH:
+ /* Neither was used: can't attach the same QP twice. */
+ ipath_mcast_qp_free(mqp);
+ ipath_mcast_free(mcast);
+ return -EINVAL;
+ case EEXIST: /* The mcast wasn't used */
+ ipath_mcast_free(mcast);
+ break;
+ default:
+ break;
+ }
+ return 0;
+}
+
+static int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid,
+ u16 lid)
+{
+ struct ipath_qp *qp = to_iqp(ibqp);
+ struct ipath_mcast *mcast = NULL;
+ struct ipath_mcast_qp *p, *tmp;
+ struct rb_node *n;
+ unsigned long flags;
+ int last = 0;
+
+ spin_lock_irqsave(&mcast_lock, flags);
+
+ /* Find the GID in the mcast table. */
+ n = mcast_tree.rb_node;
+ while (1) {
+ int ret;
+
+ if (n == NULL) {
+ spin_unlock_irqrestore(&mcast_lock, flags);
+ return 0;
+ }
+
+ mcast = rb_entry(n, struct ipath_mcast, rb_node);
+ ret = memcmp(gid->raw, mcast->mgid.raw, sizeof(union ib_gid));
+ if (ret < 0)
+ n = n->rb_left;
+ else if (ret > 0)
+ n = n->rb_right;
+ else
+ break;
+ }
+
+ /* Search the QP list. */
+ list_for_each_entry_safe(p, tmp, &mcast->qp_list, list) {
+ if (p->qp != qp)
+ continue;
+ /*
+ * We found it, so remove it, but don't poison the forward link
+ * until we are sure there are no list walkers.
+ */
+ list_del_rcu(&p->list);
+
+ /* If this was the last attached QP, remove the GID too. */
+ if (list_empty(&mcast->qp_list)) {
+ rb_erase(&mcast->rb_node, &mcast_tree);
+ last = 1;
+ }
+ break;
+ }
+
+ spin_unlock_irqrestore(&mcast_lock, flags);
+
+ if (p) {
+ /*
+ * Wait for any list walkers to finish before freeing the
+ * list element.
+ */
+ wait_event(mcast->wait, atomic_read(&mcast->refcount) <= 1);
+ ipath_mcast_qp_free(p);
+ }
+ if (last) {
+ atomic_dec(&mcast->refcount);
+ wait_event(mcast->wait, !atomic_read(&mcast->refcount));
+ ipath_mcast_free(mcast);
+ }
+
+ return 0;
+}
+
+/*
+ * Copy data to SGE memory.
+ */
+static void copy_sge(struct ipath_sge_state *ss, void *data, u32 length)
+{
+ struct ipath_sge *sge = &ss->sge;
+
+ while (length) {
+ u32 len = sge->length;
+
+ BUG_ON(len == 0);
+ if (len > length)
+ len = length;
+ memcpy(sge->vaddr, data, len);
+ sge->vaddr += len;
+ sge->length -= len;
+ sge->sge_length -= len;
+ if (sge->sge_length == 0) {
+ if (--ss->num_sge)
+ *sge = *ss->sg_list++;
+ } else if (sge->length == 0 && sge->mr != NULL) {
+ if (++sge->n >= IPATH_SEGSZ) {
+ if (++sge->m >= sge->mr->mapsz)
+ break;
+ sge->n = 0;
+ }
+ sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr;
+ sge->length = sge->mr->map[sge->m]->segs[sge->n].length;
+ }
+ data += len;
+ length -= len;
+ }
+}
+
+/*
+ * Skip over length bytes of SGE memory.
+ */
+static void skip_sge(struct ipath_sge_state *ss, u32 length)
+{
+ struct ipath_sge *sge = &ss->sge;
+
+ while (length > sge->sge_length) {
+ length -= sge->sge_length;
+ ss->sge = *ss->sg_list++;
+ }
+ while (length) {
+ u32 len = sge->length;
+
+ BUG_ON(len == 0);
+ if (len > length)
+ len = length;
+ sge->vaddr += len;
+ sge->length -= len;
+ sge->sge_length -= len;
+ if (sge->sge_length == 0) {
+ if (--ss->num_sge)
+ *sge = *ss->sg_list++;
+ } else if (sge->length == 0 && sge->mr != NULL) {
+ if (++sge->n >= IPATH_SEGSZ) {
+ if (++sge->m >= sge->mr->mapsz)
+ break;
+ sge->n = 0;
+ }
+ sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr;
+ sge->length = sge->mr->map[sge->m]->segs[sge->n].length;
+ }
+ length -= len;
+ }
+}
+
+static inline u32 alloc_qpn(struct ipath_qp_table *qpt)
+{
+ u32 i, offset, max_scan, qpn;
+ struct qpn_map *map;
+
+ qpn = qpt->last + 1;
+ if (qpn >= QPN_MAX)
+ qpn = 2;
+ offset = qpn & BITS_PER_PAGE_MASK;
+ map = &qpt->map[qpn / BITS_PER_PAGE];
+ max_scan = qpt->nmaps - !offset;
+ for (i = 0;;) {
+ if (unlikely(!map->page)) {
+ unsigned long page = get_zeroed_page(GFP_KERNEL);
+ unsigned long flags;
+
+ /*
+ * Free the page if someone raced with us
+ * installing it:
+ */
+ spin_lock_irqsave(&qpt->lock, flags);
+ if (map->page)
+ free_page(page);
+ else
+ map->page = (void *)page;
+ spin_unlock_irqrestore(&qpt->lock, flags);
+ if (unlikely(!map->page))
+ break;
+ }
+ if (likely(atomic_read(&map->n_free))) {
+ do {
+ if (!test_and_set_bit(offset, map->page)) {
+ atomic_dec(&map->n_free);
+ qpt->last = qpn;
+ return qpn;
+ }
+ offset = find_next_offset(map, offset);
+ qpn = mk_qpn(qpt, map, offset);
+ /*
+ * This test differs from alloc_pidmap().
+ * If find_next_offset() does find a zero bit,
+ * we don't need to check for QPN wrapping
+ * around past our starting QPN. We
+ * just need to be sure we don't loop forever.
+ */
+ } while (offset < BITS_PER_PAGE && qpn < QPN_MAX);
+ }
+ /*
+ * In order to keep the number of pages allocated to a minimum,
+ * we scan all the existing pages before increasing the size
+ * of the bitmap table.
+ */
+ if (++i > max_scan) {
+ if (qpt->nmaps == QPNMAP_ENTRIES)
+ break;
+ map = &qpt->map[qpt->nmaps++];
+ offset = 0;
+ } else if (map < &qpt->map[qpt->nmaps]) {
+ ++map;
+ offset = 0;
+ } else {
+ map = &qpt->map[0];
+ offset = 2;
+ }
+ qpn = mk_qpn(qpt, map, offset);
+ }
+ return 0;
+}
+
+static inline void free_qpn(struct ipath_qp_table *qpt, u32 qpn)
+{
+ struct qpn_map *map;
+
+ map = qpt->map + qpn / BITS_PER_PAGE;
+ if (map->page)
+ clear_bit(qpn & BITS_PER_PAGE_MASK, map->page);
+ atomic_inc(&map->n_free);
+}
+
+/*
+ * Allocate the next available QPN and put the QP into the hash table.
+ * The hash table holds a reference to the QP.
+ */
+static int ipath_alloc_qpn(struct ipath_qp_table *qpt, struct ipath_qp *qp,
+ enum ib_qp_type type)
+{
+ unsigned long flags;
+ u32 qpn;
+
+ if (type == IB_QPT_SMI)
+ qpn = 0;
+ else if (type == IB_QPT_GSI)
+ qpn = 1;
+ else {
+ /* Allocate the next available QPN */
+ qpn = alloc_qpn(qpt);
+ if (qpn == 0)
+ return -ENOMEM;
+ }
+ qp->ibqp.qp_num = qpn;
+
+ /* Add the QP to the hash table. */
+ spin_lock_irqsave(&qpt->lock, flags);
+
+ qpn %= qpt->max;
+ qp->next = qpt->table[qpn];
+ qpt->table[qpn] = qp;
+ atomic_inc(&qp->refcount);
+
+ spin_unlock_irqrestore(&qpt->lock, flags);
+ return 0;
+}
+
+/*
+ * Remove the QP from the table so it can't be found asynchronously by
+ * the receive interrupt routine.
+ */
+static void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp)
+{
+ struct ipath_qp *q, **qpp;
+ unsigned long flags;
+ int fnd = 0;
+
+ spin_lock_irqsave(&qpt->lock, flags);
+
+ /* Remove QP from the hash table. */
+ qpp = &qpt->table[qp->ibqp.qp_num % qpt->max];
+ for (; (q = *qpp) != NULL; qpp = &q->next) {
+ if (q == qp) {
+ *qpp = qp->next;
+ qp->next = NULL;
+ atomic_dec(&qp->refcount);
+ fnd = 1;
+ break;
+ }
+ }
+
+ spin_unlock_irqrestore(&qpt->lock, flags);
+
+ if (!fnd)
+ return;
+
+ /* If QPN is not reserved, mark QPN free in the bitmap. */
+ if (qp->ibqp.qp_num > 1)
+ free_qpn(qpt, qp->ibqp.qp_num);
+
+ wait_event(qp->wait, !atomic_read(&qp->refcount));
+}
+
+/*
+ * Remove all QPs from the table.
+ */
+static void ipath_free_all_qps(struct ipath_qp_table *qpt)
+{
+ unsigned long flags;
+ struct ipath_qp *qp, *nqp;
+ u32 n;
+
+ for (n = 0; n < qpt->max; n++) {
+ spin_lock_irqsave(&qpt->lock, flags);
+ qp = qpt->table[n];
+ qpt->table[n] = NULL;
+ spin_unlock_irqrestore(&qpt->lock, flags);
+
+ while (qp) {
+ nqp = qp->next;
+ if (qp->ibqp.qp_num > 1)
+ free_qpn(qpt, qp->ibqp.qp_num);
+ if (!atomic_dec_and_test(&qp->refcount) ||
+ !ipath_destroy_qp(&qp->ibqp))
+ _VERBS_INFO("QP memory leak!\n");
+ qp = nqp;
+ }
+ }
+
+ for (n = 0; n < ARRAY_SIZE(qpt->map); n++) {
+ if (qpt->map[n].page)
+ free_page((unsigned long)qpt->map[n].page);
+ }
+}
+
+/*
+ * Return the QP with the given QPN.
+ * The caller is responsible for decrementing the QP reference count when done.
+ */
+static struct ipath_qp *ipath_lookup_qpn(struct ipath_qp_table *qpt, u32 qpn)
+{
+ unsigned long flags;
+ struct ipath_qp *qp;
+
+ spin_lock_irqsave(&qpt->lock, flags);
+
+ for (qp = qpt->table[qpn % qpt->max]; qp; qp = qp->next) {
+ if (qp->ibqp.qp_num == qpn) {
+ atomic_inc(&qp->refcount);
+ break;
+ }
+ }
+
+ spin_unlock_irqrestore(&qpt->lock, flags);
+ return qp;
+}
+
+static int ipath_alloc_lkey(struct ipath_lkey_table *rkt,
+ struct ipath_mregion *mr)
+{
+ unsigned long flags;
+ u32 r;
+ u32 n;
+
+ spin_lock_irqsave(&rkt->lock, flags);
+
+ /* Find the next available LKEY */
+ r = n = rkt->next;
+ for (;;) {
+ if (rkt->table[r] == NULL)
+ break;
+ r = (r + 1) & (rkt->max - 1);
+ if (r == n) {
+ spin_unlock_irqrestore(&rkt->lock, flags);
+ _VERBS_INFO("LKEY table full\n");
+ return 0;
+ }
+ }
+ rkt->next = (r + 1) & (rkt->max - 1);
+ /*
+ * Make sure lkey is never zero, which is reserved to indicate an
+ * unrestricted LKEY.
+ */
+ rkt->gen++;
+ mr->lkey = (r << (32 - ib_ipath_lkey_table_size)) |
+ ((((1 << (24 - ib_ipath_lkey_table_size)) - 1) & rkt->gen) << 8);
+ if (mr->lkey == 0) {
+ mr->lkey |= 1 << 8;
+ rkt->gen++;
+ }
+ rkt->table[r] = mr;
+ spin_unlock_irqrestore(&rkt->lock, flags);
+
+ return 1;
+}
+
+static void ipath_free_lkey(struct ipath_lkey_table *rkt, u32 lkey)
+{
+ unsigned long flags;
+ u32 r;
+
+ if (lkey == 0)
+ return;
+ r = lkey >> (32 - ib_ipath_lkey_table_size);
+ spin_lock_irqsave(&rkt->lock, flags);
+ rkt->table[r] = NULL;
+ spin_unlock_irqrestore(&rkt->lock, flags);
+}
+
+/*
+ * Check the IB SGE for validity and initialize our internal version of it.
+ * Return 1 if OK, else zero.
+ */
+static int ipath_lkey_ok(struct ipath_lkey_table *rkt, struct ipath_sge *isge,
+ struct ib_sge *sge, int acc)
+{
+ struct ipath_mregion *mr;
+ size_t off;
+
+ /*
+ * We use LKEY == zero to mean a physical kmalloc() address.
+ * This is a bit of a hack since we rely on dma_map_single()
+ * being reversible by calling bus_to_virt().
+ */
+ if (sge->lkey == 0) {
+ isge->mr = NULL;
+ isge->vaddr = bus_to_virt(sge->addr);
+ isge->length = sge->length;
+ isge->sge_length = sge->length;
+ return 1;
+ }
+ spin_lock(&rkt->lock);
+ mr = rkt->table[(sge->lkey >> (32 - ib_ipath_lkey_table_size))];
+ spin_unlock(&rkt->lock);
+ if (unlikely(mr == NULL || mr->lkey != sge->lkey))
+ return 0;
+
+ off = sge->addr - mr->user_base;
+ if (unlikely(sge->addr < mr->user_base ||
+ off + sge->length > mr->length ||
+ (mr->access_flags & acc) != acc))
+ return 0;
+
+ off += mr->offset;
+ isge->mr = mr;
+ isge->m = 0;
+ isge->n = 0;
+ while (off >= mr->map[isge->m]->segs[isge->n].length) {
+ off -= mr->map[isge->m]->segs[isge->n].length;
+ if (++isge->n >= IPATH_SEGSZ) {
+ isge->m++;
+ isge->n = 0;
+ }
+ }
+ isge->vaddr = mr->map[isge->m]->segs[isge->n].vaddr + off;
+ isge->length = mr->map[isge->m]->segs[isge->n].length - off;
+ isge->sge_length = sge->length;
+ return 1;
+}
+
+/*
+ * Initialize the qp->s_sge after a restart.
+ * The QP s_lock should be held.
+ */
+static void ipath_init_restart(struct ipath_qp *qp, struct ipath_swqe *wqe)
+{
+ struct ipath_ibdev *dev;
+ u32 len;
+
+ len = ((qp->s_psn - wqe->psn) & 0xFFFFFF) *
+ ib_mtu_enum_to_int(qp->path_mtu);
+ qp->s_sge.sge = wqe->sg_list[0];
+ qp->s_sge.sg_list = wqe->sg_list + 1;
+ qp->s_sge.num_sge = wqe->wr.num_sge;
+ skip_sge(&qp->s_sge, len);
+ qp->s_len = wqe->length - len;
+ dev = to_idev(qp->ibqp.device);
+ spin_lock(&dev->pending_lock);
+ if (qp->timerwait.next == LIST_POISON1)
+ list_add_tail(&qp->timerwait,
+ &dev->pending[dev->pending_index]);
+ spin_unlock(&dev->pending_lock);
+}
+
+/*
+ * Check the IB virtual address, length, and RKEY.
+ * Return 1 if OK, else zero.
+ * The QP r_rq.lock should be held.
+ */
+static int ipath_rkey_ok(struct ipath_ibdev *dev, struct ipath_sge_state *ss,
+ u32 len, u64 vaddr, u32 rkey, int acc)
+{
+ struct ipath_lkey_table *rkt = &dev->lk_table;
+ struct ipath_sge *sge = &ss->sge;
+ struct ipath_mregion *mr;
+ size_t off;
+
+ spin_lock(&rkt->lock);
+ mr = rkt->table[(rkey >> (32 - ib_ipath_lkey_table_size))];
+ spin_unlock(&rkt->lock);
+ if (unlikely(mr == NULL || mr->lkey != rkey))
+ return 0;
+
+ off = vaddr - mr->iova;
+ if (unlikely(vaddr < mr->iova || off + len > mr->length ||
+ (mr->access_flags & acc) == 0))
+ return 0;
+
+ off += mr->offset;
+ sge->mr = mr;
+ sge->m = 0;
+ sge->n = 0;
+ while (off >= mr->map[sge->m]->segs[sge->n].length) {
+ off -= mr->map[sge->m]->segs[sge->n].length;
+ if (++sge->n >= IPATH_SEGSZ) {
+ sge->m++;
+ sge->n = 0;
+ }
+ }
+ sge->vaddr = mr->map[sge->m]->segs[sge->n].vaddr + off;
+ sge->length = mr->map[sge->m]->segs[sge->n].length - off;
+ sge->sge_length = len;
+ ss->sg_list = NULL;
+ ss->num_sge = 1;
+ return 1;
+}
+
+/*
+ * Add a new entry to the completion queue.
+ * This may be called with one of the qp->s_lock or qp->r_rq.lock held.
+ */
+static void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int sig)
+{
+ unsigned long flags;
+ u32 next;
+
+ spin_lock_irqsave(&cq->lock, flags);
+
+ cq->queue[cq->head] = *entry;
+ next = cq->head + 1;
+ if (next == cq->ibcq.cqe)
+ next = 0;
+ if (likely(next != cq->tail))
+ cq->head = next;
+ else {
+ spin_unlock_irqrestore(&cq->lock, flags);
+ if (cq->ibcq.event_handler) {
+ struct ib_event ev;
+
+ ev.device = cq->ibcq.device;
+ ev.element.cq = &cq->ibcq;
+ ev.event = IB_EVENT_CQ_ERR;
+ cq->ibcq.event_handler(&ev, cq->ibcq.cq_context);
+ }
+ return;
+ }
+
+ if (cq->notify == IB_CQ_NEXT_COMP ||
+ (cq->notify == IB_CQ_SOLICITED && sig)) {
+ cq->notify = IB_CQ_NONE;
+ cq->triggered++;
+ /*
+ * This will cause send_complete() to be called in
+ * another thread.
+ */
+ tasklet_schedule(&cq->comptask);
+ }
+
+ spin_unlock_irqrestore(&cq->lock, flags);
+
+ if (entry->status != IB_WC_SUCCESS)
+ to_idev(cq->ibcq.device)->n_wqe_errs++;
+}
+
+static void send_complete(unsigned long data)
+{
+ struct ipath_cq *cq = (struct ipath_cq *)data;
+
+ /*
+ * The completion handler will most likely rearm the notification
+ * and poll for all pending entries. If a new completion entry
+ * is added while we are in this routine, tasklet_schedule()
+ * won't call us again until we return, so we check triggered to
+ * see if we need to call the handler again.
+ */
+ for (;;) {
+ u8 triggered = cq->triggered;
+
+ cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
+
+ if (cq->triggered == triggered)
+ return;
+ }
+}
+
+/*
+ * This is the QP state transition table.
+ * See ipath_modify_qp() for details.
+ */
+static const struct {
+ int trans;
+ u32 req_param[IB_QPT_RAW_IPV6];
+ u32 opt_param[IB_QPT_RAW_IPV6];
+} qp_state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = {
+ [IB_QPS_RESET] = {
+ [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST },
+ [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR },
+ [IB_QPS_INIT] = {
+ .trans = IPATH_TRANS_RST2INIT,
+ .req_param = {
+ [IB_QPT_SMI] = (IB_QP_PKEY_INDEX |
+ IB_QP_QKEY),
+ [IB_QPT_GSI] = (IB_QP_PKEY_INDEX |
+ IB_QP_QKEY),
+ [IB_QPT_UD] = (IB_QP_PKEY_INDEX |
+ IB_QP_PORT |
+ IB_QP_QKEY),
+ [IB_QPT_UC] = (IB_QP_PKEY_INDEX |
+ IB_QP_PORT |
+ IB_QP_ACCESS_FLAGS),
+ [IB_QPT_RC] = (IB_QP_PKEY_INDEX |
+ IB_QP_PORT |
+ IB_QP_ACCESS_FLAGS),
+ },
+ },
+ },
+ [IB_QPS_INIT] = {
+ [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST },
+ [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR },
+ [IB_QPS_INIT] = {
+ .trans = IPATH_TRANS_INIT2INIT,
+ .opt_param = {
+ [IB_QPT_SMI] = (IB_QP_PKEY_INDEX |
+ IB_QP_QKEY),
+ [IB_QPT_GSI] = (IB_QP_PKEY_INDEX |
+ IB_QP_QKEY),
+ [IB_QPT_UD] = (IB_QP_PKEY_INDEX |
+ IB_QP_PORT |
+ IB_QP_QKEY),
+ [IB_QPT_UC] = (IB_QP_PKEY_INDEX |
+ IB_QP_PORT |
+ IB_QP_ACCESS_FLAGS),
+ [IB_QPT_RC] = (IB_QP_PKEY_INDEX |
+ IB_QP_PORT |
+ IB_QP_ACCESS_FLAGS),
+ }
+ },
+ [IB_QPS_RTR] = {
+ .trans = IPATH_TRANS_INIT2RTR,
+ .req_param = {
+ [IB_QPT_UC] = (IB_QP_AV |
+ IB_QP_PATH_MTU |
+ IB_QP_DEST_QPN |
+ IB_QP_RQ_PSN),
+ [IB_QPT_RC] = (IB_QP_AV |
+ IB_QP_PATH_MTU |
+ IB_QP_DEST_QPN |
+ IB_QP_RQ_PSN |
+ IB_QP_MAX_DEST_RD_ATOMIC |
+ IB_QP_MIN_RNR_TIMER),
+ },
+ .opt_param = {
+ [IB_QPT_SMI] = (IB_QP_PKEY_INDEX |
+ IB_QP_QKEY),
+ [IB_QPT_GSI] = (IB_QP_PKEY_INDEX |
+ IB_QP_QKEY),
+ [IB_QPT_UD] = (IB_QP_PKEY_INDEX |
+ IB_QP_QKEY),
+ [IB_QPT_UC] = (IB_QP_ALT_PATH |
+ IB_QP_ACCESS_FLAGS |
+ IB_QP_PKEY_INDEX),
+ [IB_QPT_RC] = (IB_QP_ALT_PATH |
+ IB_QP_ACCESS_FLAGS |
+ IB_QP_PKEY_INDEX),
+ }
+ }
+ },
+ [IB_QPS_RTR] = {
+ [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST },
+ [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR },
+ [IB_QPS_RTS] = {
+ .trans = IPATH_TRANS_RTR2RTS,
+ .req_param = {
+ [IB_QPT_SMI] = IB_QP_SQ_PSN,
+ [IB_QPT_GSI] = IB_QP_SQ_PSN,
+ [IB_QPT_UD] = IB_QP_SQ_PSN,
+ [IB_QPT_UC] = IB_QP_SQ_PSN,
+ [IB_QPT_RC] = (IB_QP_TIMEOUT |
+ IB_QP_RETRY_CNT |
+ IB_QP_RNR_RETRY |
+ IB_QP_SQ_PSN |
+ IB_QP_MAX_QP_RD_ATOMIC),
+ },
+ .opt_param = {
+ [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_UC] = (IB_QP_CUR_STATE |
+ IB_QP_ALT_PATH |
+ IB_QP_ACCESS_FLAGS |
+ IB_QP_PKEY_INDEX |
+ IB_QP_PATH_MIG_STATE),
+ [IB_QPT_RC] = (IB_QP_CUR_STATE |
+ IB_QP_ALT_PATH |
+ IB_QP_ACCESS_FLAGS |
+ IB_QP_PKEY_INDEX |
+ IB_QP_MIN_RNR_TIMER |
+ IB_QP_PATH_MIG_STATE),
+ }
+ }
+ },
+ [IB_QPS_RTS] = {
+ [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST },
+ [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR },
+ [IB_QPS_RTS] = {
+ .trans = IPATH_TRANS_RTS2RTS,
+ .opt_param = {
+ [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_UC] = (IB_QP_ACCESS_FLAGS |
+ IB_QP_ALT_PATH |
+ IB_QP_PATH_MIG_STATE),
+ [IB_QPT_RC] = (IB_QP_ACCESS_FLAGS |
+ IB_QP_ALT_PATH |
+ IB_QP_PATH_MIG_STATE |
+ IB_QP_MIN_RNR_TIMER),
+ }
+ },
+ [IB_QPS_SQD] = {
+ .trans = IPATH_TRANS_RTS2SQD,
+ },
+ },
+ [IB_QPS_SQD] = {
+ [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST },
+ [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR },
+ [IB_QPS_RTS] = {
+ .trans = IPATH_TRANS_SQD2RTS,
+ .opt_param = {
+ [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_UC] = (IB_QP_CUR_STATE |
+ IB_QP_ALT_PATH |
+ IB_QP_ACCESS_FLAGS |
+ IB_QP_PATH_MIG_STATE),
+ [IB_QPT_RC] = (IB_QP_CUR_STATE |
+ IB_QP_ALT_PATH |
+ IB_QP_ACCESS_FLAGS |
+ IB_QP_MIN_RNR_TIMER |
+ IB_QP_PATH_MIG_STATE),
+ }
+ },
+ [IB_QPS_SQD] = {
+ .trans = IPATH_TRANS_SQD2SQD,
+ .opt_param = {
+ [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_UD] = (IB_QP_PKEY_INDEX | IB_QP_QKEY),
+ [IB_QPT_UC] = (IB_QP_AV |
+ IB_QP_TIMEOUT |
+ IB_QP_CUR_STATE |
+ IB_QP_ALT_PATH |
+ IB_QP_ACCESS_FLAGS |
+ IB_QP_PKEY_INDEX |
+ IB_QP_PATH_MIG_STATE),
+ [IB_QPT_RC] = (IB_QP_AV |
+ IB_QP_TIMEOUT |
+ IB_QP_RETRY_CNT |
+ IB_QP_RNR_RETRY |
+ IB_QP_MAX_QP_RD_ATOMIC |
+ IB_QP_MAX_DEST_RD_ATOMIC |
+ IB_QP_CUR_STATE |
+ IB_QP_ALT_PATH |
+ IB_QP_ACCESS_FLAGS |
+ IB_QP_PKEY_INDEX |
+ IB_QP_MIN_RNR_TIMER |
+ IB_QP_PATH_MIG_STATE),
+ }
+ }
+ },
+ [IB_QPS_SQE] = {
+ [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST },
+ [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR },
+ [IB_QPS_RTS] = {
+ .trans = IPATH_TRANS_SQERR2RTS,
+ .opt_param = {
+ [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY),
+ [IB_QPT_UC] = IB_QP_CUR_STATE,
+ [IB_QPT_RC] = (IB_QP_CUR_STATE |
+ IB_QP_MIN_RNR_TIMER),
+ }
+ }
+ },
+ [IB_QPS_ERR] = {
+ [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST },
+ [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }
+ }
+};
+
+/*
+ * Initialize the QP state to the reset state.
+ */
+static void ipath_reset_qp(struct ipath_qp *qp)
+{
+ qp->remote_qpn = 0;
+ qp->qkey = 0;
+ qp->qp_access_flags = 0;
+ qp->s_hdrwords = 0;
+ qp->s_psn = 0;
+ qp->r_psn = 0;
+ atomic_set(&qp->msn, 0);
+ if (qp->ibqp.qp_type == IB_QPT_RC) {
+ qp->s_state = IB_OPCODE_RC_SEND_LAST;
+ qp->r_state = IB_OPCODE_RC_SEND_LAST;
+ } else {
+ qp->s_state = IB_OPCODE_UC_SEND_LAST;
+ qp->r_state = IB_OPCODE_UC_SEND_LAST;
+ }
+ qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE;
+ qp->s_nak_state = 0;
+ qp->s_rnr_timeout = 0;
+ qp->s_head = 0;
+ qp->s_tail = 0;
+ qp->s_cur = 0;
+ qp->s_last = 0;
+ qp->s_ssn = 1;
+ qp->s_lsn = 0;
+ qp->r_rq.head = 0;
+ qp->r_rq.tail = 0;
+ qp->r_reuse_sge = 0;
+}
+
+/*
+ * Flush send work queue.
+ * The QP s_lock should be held.
+ */
+static void ipath_sqerror_qp(struct ipath_qp *qp, struct ib_wc *wc)
+{
+ struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+ struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last);
+
+ _VERBS_INFO("Send queue error on QP%d/%d: err: %d\n",
+ qp->ibqp.qp_num, qp->remote_qpn, wc->status);
+
+ spin_lock(&dev->pending_lock);
+ /* XXX What if it's already removed by the timeout code? */
+ if (qp->timerwait.next != LIST_POISON1)
+ list_del(&qp->timerwait);
+ if (qp->piowait.next != LIST_POISON1)
+ list_del(&qp->piowait);
+ spin_unlock(&dev->pending_lock);
+
+ ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1);
+ if (++qp->s_last >= qp->s_size)
+ qp->s_last = 0;
+
+ wc->status = IB_WC_WR_FLUSH_ERR;
+
+ while (qp->s_last != qp->s_head) {
+ wc->wr_id = wqe->wr.wr_id;
+ wc->opcode = wc_opcode[wqe->wr.opcode];
+ ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1);
+ if (++qp->s_last >= qp->s_size)
+ qp->s_last = 0;
+ wqe = get_swqe_ptr(qp, qp->s_last);
+ }
+ qp->s_cur = qp->s_tail = qp->s_head;
+ qp->state = IB_QPS_SQE;
+}
+
+/*
+ * Flush both send and receive work queues.
+ * QP r_rq.lock and s_lock should be held.
+ */
+static void ipath_error_qp(struct ipath_qp *qp)
+{
+ struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+ struct ib_wc wc;
+
+ _VERBS_INFO("QP%d/%d in error state\n",
+ qp->ibqp.qp_num, qp->remote_qpn);
+
+ spin_lock(&dev->pending_lock);
+ /* XXX What if it's already removed by the timeout code? */
+ if (qp->timerwait.next != LIST_POISON1)
+ list_del(&qp->timerwait);
+ if (qp->piowait.next != LIST_POISON1)
+ list_del(&qp->piowait);
+ spin_unlock(&dev->pending_lock);
+
+ wc.status = IB_WC_WR_FLUSH_ERR;
+ wc.vendor_err = 0;
+ wc.byte_len = 0;
+ wc.imm_data = 0;
+ wc.qp_num = qp->ibqp.qp_num;
+ wc.src_qp = 0;
+ wc.wc_flags = 0;
+ wc.pkey_index = 0;
+ wc.slid = 0;
+ wc.sl = 0;
+ wc.dlid_path_bits = 0;
+ wc.port_num = 0;
+
+ while (qp->s_last != qp->s_head) {
+ struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last);
+
+ wc.wr_id = wqe->wr.wr_id;
+ wc.opcode = wc_opcode[wqe->wr.opcode];
+ if (++qp->s_last >= qp->s_size)
+ qp->s_last = 0;
+ ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1);
+ }
+ qp->s_cur = qp->s_tail = qp->s_head;
+ qp->s_hdrwords = 0;
+ qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE;
+
+ wc.opcode = IB_WC_RECV;
+ while (qp->r_rq.tail != qp->r_rq.head) {
+ wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id;
+ if (++qp->r_rq.tail >= qp->r_rq.size)
+ qp->r_rq.tail = 0;
+ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1);
+ }
+}
+
+static int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+ int attr_mask)
+{
+ struct ipath_qp *qp = to_iqp(ibqp);
+ enum ib_qp_state cur_state, new_state;
+ u32 req_param, opt_param;
+ unsigned long flags;
+
+ if (attr_mask & IB_QP_CUR_STATE) {
+ cur_state = attr->cur_qp_state;
+ if (cur_state != IB_QPS_RTR &&
+ cur_state != IB_QPS_RTS &&
+ cur_state != IB_QPS_SQD && cur_state != IB_QPS_SQE)
+ return -EINVAL;
+ spin_lock_irqsave(&qp->r_rq.lock, flags);
+ spin_lock(&qp->s_lock);
+ } else {
+ spin_lock_irqsave(&qp->r_rq.lock, flags);
+ spin_lock(&qp->s_lock);
+ cur_state = qp->state;
+ }
+
+ if (attr_mask & IB_QP_STATE) {
+ new_state = attr->qp_state;
+ if (new_state < 0 || new_state > IB_QPS_ERR)
+ goto inval;
+ } else
+ new_state = cur_state;
+
+ switch (qp_state_table[cur_state][new_state].trans) {
+ case IPATH_TRANS_INVALID:
+ goto inval;
+
+ case IPATH_TRANS_ANY2RST:
+ ipath_reset_qp(qp);
+ break;
+
+ case IPATH_TRANS_ANY2ERR:
+ ipath_error_qp(qp);
+ break;
+
+ }
+
+ req_param =
+ qp_state_table[cur_state][new_state].req_param[qp->ibqp.qp_type];
+ opt_param =
+ qp_state_table[cur_state][new_state].opt_param[qp->ibqp.qp_type];
+
+ if ((req_param & attr_mask) != req_param)
+ goto inval;
+
+ if (attr_mask & ~(req_param | opt_param | IB_QP_STATE))
+ goto inval;
+
+ if (attr_mask & IB_QP_PKEY_INDEX) {
+ struct ipath_ibdev *dev = to_idev(ibqp->device);
+
+ if (attr->pkey_index >= ipath_layer_get_npkeys(dev->ib_unit))
+ goto inval;
+ qp->s_pkey_index = attr->pkey_index;
+ }
+
+ if (attr_mask & IB_QP_DEST_QPN)
+ qp->remote_qpn = attr->dest_qp_num;
+
+ if (attr_mask & IB_QP_SQ_PSN) {
+ qp->s_next_psn = attr->sq_psn;
+ qp->s_last_psn = qp->s_next_psn - 1;
+ }
+
+ if (attr_mask & IB_QP_RQ_PSN)
+ qp->r_psn = attr->rq_psn;
+
+ if (attr_mask & IB_QP_ACCESS_FLAGS)
+ qp->qp_access_flags = attr->qp_access_flags;
+
+ if (attr_mask & IB_QP_AV)
+ qp->remote_ah_attr = attr->ah_attr;
+
+ if (attr_mask & IB_QP_PATH_MTU)
+ qp->path_mtu = attr->path_mtu;
+
+ if (attr_mask & IB_QP_RETRY_CNT)
+ qp->s_retry = qp->s_retry_cnt = attr->retry_cnt;
+
+ if (attr_mask & IB_QP_RNR_RETRY) {
+ qp->s_rnr_retry = attr->rnr_retry;
+ if (qp->s_rnr_retry > 7)
+ qp->s_rnr_retry = 7;
+ qp->s_rnr_retry_cnt = qp->s_rnr_retry;
+ }
+
+ if (attr_mask & IB_QP_MIN_RNR_TIMER)
+ qp->s_min_rnr_timer = attr->min_rnr_timer & 0x1F;
+
+ if (attr_mask & IB_QP_QKEY)
+ qp->qkey = attr->qkey;
+
+ if (attr_mask & IB_QP_PKEY_INDEX)
+ qp->s_pkey_index = attr->pkey_index;
+
+ qp->state = new_state;
+ spin_unlock(&qp->s_lock);
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+
+ /*
+ * Try to move to ARMED if QP1 changed to the RTS state.
+ */
+ if (qp->ibqp.qp_num == 1 && new_state == IB_QPS_RTS) {
+ struct ipath_ibdev *dev = to_idev(ibqp->device);
+
+ /*
+ * Bounce the link even if it was active so the SM will
+ * reinitialize the SMA's state.
+ */
+ ipath_kset_linkstate((dev->ib_unit << 16) | IPATH_IB_LINKDOWN);
+ ipath_kset_linkstate((dev->ib_unit << 16) | IPATH_IB_LINKARM);
+ }
+ return 0;
+
+inval:
+ spin_unlock(&qp->s_lock);
+ spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ return -EINVAL;
+}
+
+/*
+ * Compute the AETH (syndrome + MSN).
+ * The QP s_lock should be held.
+ */
+static u32 ipath_compute_aeth(struct ipath_qp *qp)
+{
+ u32 aeth = atomic_read(&qp->msn) & 0xFFFFFF;
+
+ if (qp->s_nak_state) {
+ aeth |= qp->s_nak_state << 24;
+ } else if (qp->ibqp.srq) {
+ /* Shared receive queues don't generate credits. */
+ aeth |= 0x1F << 24;
+ } else {
+ u32 min, max, x;
+ u32 credits;
+
+ /*
+ * Compute the number of credits available (RWQEs).
+ * XXX Not holding the r_rq.lock here so there is a small
+ * chance that the pair of reads is not atomic.
+ */
+ credits = qp->r_rq.head - qp->r_rq.tail;
+ if ((int)credits < 0)
+ credits += qp->r_rq.size;
+ /* Binary search the credit table to find the code to use. */
+ min = 0;
+ max = 31;
+ for (;;) {
+ x = (min + max) / 2;
+ if (credit_table[x] == credits)
+ break;
+ if (credit_table[x] > credits)
+ max = x;
+ else if (min == x)
+ break;
+ else
+ min = x;
+ }
+ aeth |= x << 24;
+ }
+ return cpu_to_be32(aeth);
+}
+
+
+static void no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&dev->pending_lock, flags);
+ if (qp->piowait.next == LIST_POISON1)
+ list_add_tail(&qp->piowait, &dev->piowait);
+ spin_unlock_irqrestore(&dev->pending_lock, flags);
+ /*
+ * Note that as soon as ipath_layer_want_buffer() is called and
+ * possibly before it returns, ipath_ib_piobufavail()
+ * could be called. If we are still in the tasklet function,
+ * tasklet_schedule() will not call us until the next time
+ * tasklet_schedule() is called.
+ * We clear the tasklet flag now since we are committing to return
+ * from the tasklet function.
+ */
+ tasklet_unlock(&qp->s_task);
+ ipath_layer_want_buffer(dev->ib_unit);
+ dev->n_piowait++;
+}
+
+/*
+ * Process entries in the send work queue until the queue is exhausted.
+ * Only allow one CPU to send a packet per QP (tasklet).
+ * Otherwise, after we drop the QP lock, two threads could send
+ * packets out of order.
+ * This is similar to do_rc_send() below except we don't have timeouts or
+ * resends.
+ */
+static void do_uc_send(unsigned long data)
+{
+ struct ipath_qp *qp = (struct ipath_qp *)data;
+ struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+ struct ipath_swqe *wqe;
+ unsigned long flags;
+ u16 lrh0;
+ u32 hwords;
+ u32 nwords;
+ u32 extra_bytes;
+ u32 bth0;
+ u32 bth2;
+ u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu);
+ u32 len;
+ struct ipath_other_headers *ohdr;
+ struct ib_wc wc;
+
+ if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags))
+ return;
+
+ if (unlikely(qp->remote_ah_attr.dlid ==
+ ipath_layer_get_lid(dev->ib_unit))) {
+ /* Pass in an uninitialized ib_wc to save stack space. */
+ ipath_ruc_loopback(qp, &wc);
+ clear_bit(IPATH_S_BUSY, &qp->s_flags);
+ return;
+ }
+
+ ohdr = &qp->s_hdr.u.oth;
+ if (qp->remote_ah_attr.ah_flags & IB_AH_GRH)
+ ohdr = &qp->s_hdr.u.l.oth;
+
+again:
+ /* Check for a constructed packet to be sent. */
+ if (qp->s_hdrwords != 0) {
+ /*
+ * If no PIO bufs are available, return.
+ * An interrupt will call ipath_ib_piobufavail()
+ * when one is available.
+ */
+ if (ipath_verbs_send(dev->ib_unit, qp->s_hdrwords,
+ (uint32_t *) &qp->s_hdr,
+ qp->s_cur_size, qp->s_cur_sge)) {
+ no_bufs_available(qp, dev);
+ return;
+ }
+ dev->n_unicast_xmit++;
+ /* Record that we sent the packet and s_hdr is empty. */
+ qp->s_hdrwords = 0;
+ }
+
+ lrh0 = IPS_LRH_BTH;
+ /* header size in 32-bit words LRH+BTH = (8+12)/4. */
+ hwords = 5;
+
+ /*
+ * The lock is needed to synchronize between
+ * setting qp->s_ack_state and post_send().
+ */
+ spin_lock_irqsave(&qp->s_lock, flags);
+
+ if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK))
+ goto done;
+
+ bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index);
+
+ /* Send a request. */
+ wqe = get_swqe_ptr(qp, qp->s_last);
+ switch (qp->s_state) {
+ default:
+ /* Signal the completion of the last send (if there is one). */
+ if (qp->s_last != qp->s_tail) {
+ if (++qp->s_last == qp->s_size)
+ qp->s_last = 0;
+ if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) ||
+ (wqe->wr.send_flags & IB_SEND_SIGNALED)) {
+ wc.wr_id = wqe->wr.wr_id;
+ wc.status = IB_WC_SUCCESS;
+ wc.opcode = wc_opcode[wqe->wr.opcode];
+ wc.vendor_err = 0;
+ wc.byte_len = wqe->length;
+ wc.qp_num = qp->ibqp.qp_num;
+ wc.src_qp = qp->remote_qpn;
+ wc.pkey_index = 0;
+ wc.slid = qp->remote_ah_attr.dlid;
+ wc.sl = qp->remote_ah_attr.sl;
+ wc.dlid_path_bits = 0;
+ wc.port_num = 0;
+ ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc,
+ 0);
+ }
+ wqe = get_swqe_ptr(qp, qp->s_last);
+ }
+ /* Check if send work queue is empty. */
+ if (qp->s_tail == qp->s_head)
+ goto done;
+ /*
+ * Start a new request.
+ */
+ qp->s_psn = wqe->psn = qp->s_next_psn;
+ qp->s_sge.sge = wqe->sg_list[0];
+ qp->s_sge.sg_list = wqe->sg_list + 1;
+ qp->s_sge.num_sge = wqe->wr.num_sge;
+ qp->s_len = len = wqe->length;
+ switch (wqe->wr.opcode) {
+ case IB_WR_SEND:
+ case IB_WR_SEND_WITH_IMM:
+ if (len > pmtu) {
+ qp->s_state = IB_OPCODE_UC_SEND_FIRST;
+ len = pmtu;
+ break;
+ }
+ if (wqe->wr.opcode == IB_WR_SEND) {
+ qp->s_state = IB_OPCODE_UC_SEND_ONLY;
+ } else {
+ qp->s_state =
+ IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE;
+ /* Immediate data comes after the BTH */
+ ohdr->u.imm_data = wqe->wr.imm_data;
+ hwords += 1;
+ }
+ if (wqe->wr.send_flags & IB_SEND_SOLICITED)
+ bth0 |= 1 << 23;
+ break;
+
+ case IB_WR_RDMA_WRITE:
+ case IB_WR_RDMA_WRITE_WITH_IMM:
+ ohdr->u.rc.reth.vaddr =
+ cpu_to_be64(wqe->wr.wr.rdma.remote_addr);
+ ohdr->u.rc.reth.rkey =
+ cpu_to_be32(wqe->wr.wr.rdma.rkey);
+ ohdr->u.rc.reth.length = cpu_to_be32(len);
+ hwords += sizeof(struct ib_reth) / 4;
+ if (len > pmtu) {
+ qp->s_state = IB_OPCODE_UC_RDMA_WRITE_FIRST;
+ len = pmtu;
+ break;
+ }
+ if (wqe->wr.opcode == IB_WR_RDMA_WRITE) {
+ qp->s_state = IB_OPCODE_UC_RDMA_WRITE_ONLY;
+ } else {
+ qp->s_state =
+ IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE;
+ /* Immediate data comes after the RETH */
+ ohdr->u.rc.imm_data = wqe->wr.imm_data;
+ hwords += 1;
+ if (wqe->wr.send_flags & IB_SEND_SOLICITED)
+ bth0 |= 1 << 23;
+ }
+ break;
+
+ default:
+ goto done;
+ }
+ if (++qp->s_tail >= qp->s_size)
+ qp->s_tail = 0;
+ break;
+
+ case IB_OPCODE_UC_SEND_FIRST:
+ qp->s_state = IB_OPCODE_UC_SEND_MIDDLE;
+ /* FALLTHROUGH */
+ case IB_OPCODE_UC_SEND_MIDDLE:
+ len = qp->s_len;
+ if (len > pmtu) {
+ len = pmtu;
+ break;
+ }
+ if (wqe->wr.opcode == IB_WR_SEND)
+ qp->s_state = IB_OPCODE_UC_SEND_LAST;
+ else {
+ qp->s_state = IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE;
+ /* Immediate data comes after the BTH */
+ ohdr->u.imm_data = wqe->wr.imm_data;
+ hwords += 1;
+ }
+ if (wqe->wr.send_flags & IB_SEND_SOLICITED)
+ bth0 |= 1 << 23;
+ break;
+
+ case IB_OPCODE_UC_RDMA_WRITE_FIRST:
+ qp->s_state = IB_OPCODE_UC_RDMA_WRITE_MIDDLE;
+ /* FALLTHROUGH */
+ case IB_OPCODE_UC_RDMA_WRITE_MIDDLE:
+ len = qp->s_len;
+ if (len > pmtu) {
+ len = pmtu;
+ break;
+ }
+ if (wqe->wr.opcode == IB_WR_RDMA_WRITE)
+ qp->s_state = IB_OPCODE_UC_RDMA_WRITE_LAST;
+ else {
+ qp->s_state =
+ IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE;
+ /* Immediate data comes after the BTH */
+ ohdr->u.imm_data = wqe->wr.imm_data;
+ hwords += 1;
+ if (wqe->wr.send_flags & IB_SEND_SOLICITED)
+ bth0 |= 1 << 23;
+ }
+ break;
+ }
+ bth2 = qp->s_next_psn++ & 0xFFFFFF;
+ qp->s_len -= len;
+ bth0 |= qp->s_state << 24;
+
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+
+ /* Construct the header. */
+ extra_bytes = (4 - len) & 3;
+ nwords = (len + extra_bytes) >> 2;
+ if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) {
+ /* Header size in 32-bit words. */
+ hwords += 10;
+ lrh0 = IPS_LRH_GRH;
+ qp->s_hdr.u.l.grh.version_tclass_flow =
+ cpu_to_be32((6 << 28) |
+ (qp->remote_ah_attr.grh.traffic_class << 20) |
+ qp->remote_ah_attr.grh.flow_label);
+ qp->s_hdr.u.l.grh.paylen =
+ cpu_to_be16(((hwords - 12) + nwords + SIZE_OF_CRC) << 2);
+ qp->s_hdr.u.l.grh.next_hdr = 0x1B;
+ qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit;
+ /* The SGID is 32-bit aligned. */
+ qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix;
+ qp->s_hdr.u.l.grh.sgid.global.interface_id =
+ ipath_layer_get_guid(dev->ib_unit);
+ qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid;
+ }
+ qp->s_hdrwords = hwords;
+ qp->s_cur_sge = &qp->s_sge;
+ qp->s_cur_size = len;
+ lrh0 |= qp->remote_ah_attr.sl << 4;
+ qp->s_hdr.lrh[0] = cpu_to_be16(lrh0);
+ /* DEST LID */
+ qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid);
+ qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC);
+ qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit));
+ bth0 |= extra_bytes << 20;
+ ohdr->bth[0] = cpu_to_be32(bth0);
+ ohdr->bth[1] = cpu_to_be32(qp->remote_qpn);
+ ohdr->bth[2] = cpu_to_be32(bth2);
+
+ /* Check for more work to do. */
+ goto again;
+
+done:
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+ clear_bit(IPATH_S_BUSY, &qp->s_flags);
+}
+
+/*
+ * Process entries in the send work queue until credit or queue is exhausted.
+ * Only allow one CPU to send a packet per QP (tasklet).
+ * Otherwise, after we drop the QP s_lock, two threads could send
+ * packets out of order.
+ */
+static void do_rc_send(unsigned long data)
+{
+ struct ipath_qp *qp = (struct ipath_qp *)data;
+ struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+ struct ipath_swqe *wqe;
+ struct ipath_sge_state *ss;
+ unsigned long flags;
+ u16 lrh0;
+ u32 hwords;
+ u32 nwords;
+ u32 extra_bytes;
+ u32 bth0;
+ u32 bth2;
+ u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu);
+ u32 len;
+ struct ipath_other_headers *ohdr;
+ char newreq;
+
+ if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags))
+ return;
+
+ if (unlikely(qp->remote_ah_attr.dlid ==
+ ipath_layer_get_lid(dev->ib_unit))) {
+ struct ib_wc wc;
+
+ /*
+ * Pass in an uninitialized ib_wc to be consistent with
+ * other places where ipath_ruc_loopback() is called.
+ */
+ ipath_ruc_loopback(qp, &wc);
+ clear_bit(IPATH_S_BUSY, &qp->s_flags);
+ return;
+ }
+
+ ohdr = &qp->s_hdr.u.oth;
+ if (qp->remote_ah_attr.ah_flags & IB_AH_GRH)
+ ohdr = &qp->s_hdr.u.l.oth;
+
+again:
+ /* Check for a constructed packet to be sent. */
+ if (qp->s_hdrwords != 0) {
+ /*
+ * If no PIO bufs are available, return.
+ * An interrupt will call ipath_ib_piobufavail()
+ * when one is available.
+ */
+ if (ipath_verbs_send(dev->ib_unit, qp->s_hdrwords,
+ (uint32_t *) &qp->s_hdr,
+ qp->s_cur_size, qp->s_cur_sge)) {
+ no_bufs_available(qp, dev);
+ return;
+ }
+ dev->n_unicast_xmit++;
+ /* Record that we sent the packet and s_hdr is empty. */
+ qp->s_hdrwords = 0;
+ }
+
+ lrh0 = IPS_LRH_BTH;
+ /* header size in 32-bit words LRH+BTH = (8+12)/4. */
+ hwords = 5;
+
+ /*
+ * The lock is needed to synchronize between
+ * setting qp->s_ack_state, resend timer, and post_send().
+ */
+ spin_lock_irqsave(&qp->s_lock, flags);
+
+ bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index);
+
+ /* Sending responses has higher priority than sending requests. */
+ if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE) {
+ /*
+ * Send a response.
+ * Note that we are in the responder's side of the QP context.
+ */
+ switch (qp->s_ack_state) {
+ case IB_OPCODE_RC_RDMA_READ_REQUEST:
+ ss = &qp->s_rdma_sge;
+ len = qp->s_rdma_len;
+ if (len > pmtu) {
+ len = pmtu;
+ qp->s_ack_state =
+ IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST;
+ } else {
+ qp->s_ack_state =
+ IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY;
+ }
+ qp->s_rdma_len -= len;
+ bth0 |= qp->s_ack_state << 24;
+ ohdr->u.aeth = ipath_compute_aeth(qp);
+ hwords++;
+ break;
+
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST:
+ qp->s_ack_state =
+ IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE;
+ /* FALLTHROUGH */
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
+ ss = &qp->s_rdma_sge;
+ len = qp->s_rdma_len;
+ if (len > pmtu) {
+ len = pmtu;
+ } else {
+ ohdr->u.aeth = ipath_compute_aeth(qp);
+ hwords++;
+ qp->s_ack_state =
+ IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST;
+ }
+ qp->s_rdma_len -= len;
+ bth0 |= qp->s_ack_state << 24;
+ break;
+
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST:
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY:
+ /*
+ * We have to prevent new requests from changing
+ * the r_sge state while an ipath_verbs_send()
+ * is in progress.
+ * Changing r_state allows the receiver
+ * to continue processing new packets.
+ * We do it here now instead of above so
+ * that we are sure the packet was sent before
+ * changing the state.
+ */
+ qp->r_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST;
+ qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE;
+ goto send_req;
+
+ case IB_OPCODE_RC_COMPARE_SWAP:
+ case IB_OPCODE_RC_FETCH_ADD:
+ ss = NULL;
+ len = 0;
+ qp->r_state = IB_OPCODE_RC_SEND_LAST;
+ qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE;
+ bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24;
+ ohdr->u.at.aeth = ipath_compute_aeth(qp);
+ ohdr->u.at.atomic_ack_eth =
+ cpu_to_be64(qp->s_ack_atomic);
+ hwords += sizeof(ohdr->u.at) / 4;
+ break;
+
+ default:
+ /* Send a regular ACK. */
+ ss = NULL;
+ len = 0;
+ qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE;
+ bth0 |= qp->s_ack_state << 24;
+ ohdr->u.aeth = ipath_compute_aeth(qp);
+ hwords++;
+ }
+ bth2 = qp->s_ack_psn++ & 0xFFFFFF;
+ } else {
+ send_req:
+ if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK) ||
+ qp->s_rnr_timeout)
+ goto done;
+
+ /* Send a request. */
+ wqe = get_swqe_ptr(qp, qp->s_cur);
+ switch (qp->s_state) {
+ default:
+ /*
+ * Resend an old request or start a new one.
+ *
+ * We keep track of the current SWQE so that
+ * we don't reset the "furthest progress" state
+ * if we need to back up.
+ */
+ newreq = 0;
+ if (qp->s_cur == qp->s_tail) {
+ /* Check if send work queue is empty. */
+ if (qp->s_tail == qp->s_head)
+ goto done;
+ qp->s_psn = wqe->psn = qp->s_next_psn;
+ newreq = 1;
+ }
+ /*
+ * Note that we have to be careful not to modify the
+ * original work request since we may need to resend
+ * it.
+ */
+ qp->s_sge.sge = wqe->sg_list[0];
+ qp->s_sge.sg_list = wqe->sg_list + 1;
+ qp->s_sge.num_sge = wqe->wr.num_sge;
+ qp->s_len = len = wqe->length;
+ ss = &qp->s_sge;
+ bth2 = 0;
+ switch (wqe->wr.opcode) {
+ case IB_WR_SEND:
+ case IB_WR_SEND_WITH_IMM:
+ /* If no credit, return. */
+ if (qp->s_lsn != (u32) -1 &&
+ cmp24(wqe->ssn, qp->s_lsn + 1) > 0) {
+ goto done;
+ }
+ wqe->lpsn = wqe->psn;
+ if (len > pmtu) {
+ wqe->lpsn += (len - 1) / pmtu;
+ qp->s_state = IB_OPCODE_RC_SEND_FIRST;
+ len = pmtu;
+ break;
+ }
+ if (wqe->wr.opcode == IB_WR_SEND) {
+ qp->s_state = IB_OPCODE_RC_SEND_ONLY;
+ } else {
+ qp->s_state =
+ IB_OPCODE_RC_SEND_ONLY_WITH_IMMEDIATE;
+ /* Immediate data comes after the BTH */
+ ohdr->u.imm_data = wqe->wr.imm_data;
+ hwords += 1;
+ }
+ if (wqe->wr.send_flags & IB_SEND_SOLICITED)
+ bth0 |= 1 << 23;
+ bth2 = 1 << 31; /* Request ACK. */
+ if (++qp->s_cur == qp->s_size)
+ qp->s_cur = 0;
+ break;
+
+ case IB_WR_RDMA_WRITE:
+ if (newreq)
+ qp->s_lsn++;
+ /* FALLTHROUGH */
+ case IB_WR_RDMA_WRITE_WITH_IMM:
+ /* If no credit, return. */
+ if (qp->s_lsn != (u32) -1 &&
+ cmp24(wqe->ssn, qp->s_lsn + 1) > 0) {
+ goto done;
+ }
+ ohdr->u.rc.reth.vaddr =
+ cpu_to_be64(wqe->wr.wr.rdma.remote_addr);
+ ohdr->u.rc.reth.rkey =
+ cpu_to_be32(wqe->wr.wr.rdma.rkey);
+ ohdr->u.rc.reth.length = cpu_to_be32(len);
+ hwords += sizeof(struct ib_reth) / 4;
+ wqe->lpsn = wqe->psn;
+ if (len > pmtu) {
+ wqe->lpsn += (len - 1) / pmtu;
+ qp->s_state =
+ IB_OPCODE_RC_RDMA_WRITE_FIRST;
+ len = pmtu;
+ break;
+ }
+ if (wqe->wr.opcode == IB_WR_RDMA_WRITE) {
+ qp->s_state =
+ IB_OPCODE_RC_RDMA_WRITE_ONLY;
+ } else {
+ qp->s_state =
+ IB_OPCODE_RC_RDMA_WRITE_ONLY_WITH_IMMEDIATE;
+ /* Immediate data comes after RETH */
+ ohdr->u.rc.imm_data = wqe->wr.imm_data;
+ hwords += 1;
+ if (wqe->wr.send_flags & IB_SEND_SOLICITED)
+ bth0 |= 1 << 23;
+ }
+ bth2 = 1 << 31; /* Request ACK. */
+ if (++qp->s_cur == qp->s_size)
+ qp->s_cur = 0;
+ break;
+
+ case IB_WR_RDMA_READ:
+ ohdr->u.rc.reth.vaddr =
+ cpu_to_be64(wqe->wr.wr.rdma.remote_addr);
+ ohdr->u.rc.reth.rkey =
+ cpu_to_be32(wqe->wr.wr.rdma.rkey);
+ ohdr->u.rc.reth.length = cpu_to_be32(len);
+ qp->s_state = IB_OPCODE_RC_RDMA_READ_REQUEST;
+ hwords += sizeof(ohdr->u.rc.reth) / 4;
+ if (newreq) {
+ qp->s_lsn++;
+ /*
+ * Adjust s_next_psn to count the
+ * expected number of responses.
+ */
+ if (len > pmtu)
+ qp->s_next_psn +=
+ (len - 1) / pmtu;
+ wqe->lpsn = qp->s_next_psn++;
+ }
+ ss = NULL;
+ len = 0;
+ if (++qp->s_cur == qp->s_size)
+ qp->s_cur = 0;
+ break;
+
+ case IB_WR_ATOMIC_CMP_AND_SWP:
+ case IB_WR_ATOMIC_FETCH_AND_ADD:
+ qp->s_state =
+ wqe->wr.opcode == IB_WR_ATOMIC_CMP_AND_SWP ?
+ IB_OPCODE_RC_COMPARE_SWAP :
+ IB_OPCODE_RC_FETCH_ADD;
+ ohdr->u.atomic_eth.vaddr =
+ cpu_to_be64(wqe->wr.wr.atomic.remote_addr);
+ ohdr->u.atomic_eth.rkey =
+ cpu_to_be32(wqe->wr.wr.atomic.rkey);
+ ohdr->u.atomic_eth.swap_data =
+ cpu_to_be64(wqe->wr.wr.atomic.swap);
+ ohdr->u.atomic_eth.compare_data =
+ cpu_to_be64(wqe->wr.wr.atomic.compare_add);
+ hwords += sizeof(struct ib_atomic_eth) / 4;
+ if (newreq) {
+ qp->s_lsn++;
+ wqe->lpsn = wqe->psn;
+ }
+ if (++qp->s_cur == qp->s_size)
+ qp->s_cur = 0;
+ ss = NULL;
+ len = 0;
+ break;
+
+ default:
+ goto done;
+ }
+ if (newreq) {
+ if (++qp->s_tail >= qp->s_size)
+ qp->s_tail = 0;
+ }
+ bth2 |= qp->s_psn++ & 0xFFFFFF;
+ if ((int)(qp->s_psn - qp->s_next_psn) > 0)
+ qp->s_next_psn = qp->s_psn;
+ spin_lock(&dev->pending_lock);
+ if (qp->timerwait.next == LIST_POISON1) {
+ list_add_tail(&qp->timerwait,
+ &dev->pending[dev->pending_index]);
+ }
+ spin_unlock(&dev->pending_lock);
+ break;
+
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST:
+ /*
+ * This case can only happen if a send is
+ * restarted. See ipath_restart_rc().
+ */
+ ipath_init_restart(qp, wqe);
+ /* FALLTHROUGH */
+ case IB_OPCODE_RC_SEND_FIRST:
+ qp->s_state = IB_OPCODE_RC_SEND_MIDDLE;
+ /* FALLTHROUGH */
+ case IB_OPCODE_RC_SEND_MIDDLE:
+ bth2 = qp->s_psn++ & 0xFFFFFF;
+ if ((int)(qp->s_psn - qp->s_next_psn) > 0)
+ qp->s_next_psn = qp->s_psn;
+ ss = &qp->s_sge;
+ len = qp->s_len;
+ if (len > pmtu) {
+ /*
+ * Request an ACK every 1/2 MB to avoid
+ * retransmit timeouts.
+ */
+ if (((wqe->length - len) % (512 * 1024)) == 0)
+ bth2 |= 1 << 31;
+ len = pmtu;
+ break;
+ }
+ if (wqe->wr.opcode == IB_WR_SEND)
+ qp->s_state = IB_OPCODE_RC_SEND_LAST;
+ else {
+ qp->s_state =
+ IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE;
+ /* Immediate data comes after the BTH */
+ ohdr->u.imm_data = wqe->wr.imm_data;
+ hwords += 1;
+ }
+ if (wqe->wr.send_flags & IB_SEND_SOLICITED)
+ bth0 |= 1 << 23;
+ bth2 |= 1 << 31; /* Request ACK. */
+ if (++qp->s_cur >= qp->s_size)
+ qp->s_cur = 0;
+ break;
+
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST:
+ /*
+ * This case can only happen if a RDMA write is
+ * restarted. See ipath_restart_rc().
+ */
+ ipath_init_restart(qp, wqe);
+ /* FALLTHROUGH */
+ case IB_OPCODE_RC_RDMA_WRITE_FIRST:
+ qp->s_state = IB_OPCODE_RC_RDMA_WRITE_MIDDLE;
+ /* FALLTHROUGH */
+ case IB_OPCODE_RC_RDMA_WRITE_MIDDLE:
+ bth2 = qp->s_psn++ & 0xFFFFFF;
+ if ((int)(qp->s_psn - qp->s_next_psn) > 0)
+ qp->s_next_psn = qp->s_psn;
+ ss = &qp->s_sge;
+ len = qp->s_len;
+ if (len > pmtu) {
+ /*
+ * Request an ACK every 1/2 MB to avoid
+ * retransmit timeouts.
+ */
+ if (((wqe->length - len) % (512 * 1024)) == 0)
+ bth2 |= 1 << 31;
+ len = pmtu;
+ break;
+ }
+ if (wqe->wr.opcode == IB_WR_RDMA_WRITE)
+ qp->s_state = IB_OPCODE_RC_RDMA_WRITE_LAST;
+ else {
+ qp->s_state =
+ IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE;
+ /* Immediate data comes after the BTH */
+ ohdr->u.imm_data = wqe->wr.imm_data;
+ hwords += 1;
+ if (wqe->wr.send_flags & IB_SEND_SOLICITED)
+ bth0 |= 1 << 23;
+ }
+ bth2 |= 1 << 31; /* Request ACK. */
+ if (++qp->s_cur >= qp->s_size)
+ qp->s_cur = 0;
+ break;
+
+ case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
+ /*
+ * This case can only happen if a RDMA read is
+ * restarted. See ipath_restart_rc().
+ */
+ ipath_init_restart(qp, wqe);
+ len = ((qp->s_psn - wqe->psn) & 0xFFFFFF) * pmtu;
+ ohdr->u.rc.reth.vaddr =
+ cpu_to_be64(wqe->wr.wr.rdma.remote_addr + len);
+ ohdr->u.rc.reth.rkey =
+ cpu_to_be32(wqe->wr.wr.rdma.rkey);
+ ohdr->u.rc.reth.length = cpu_to_be32(qp->s_len);
+ qp->s_state = IB_OPCODE_RC_RDMA_READ_REQUEST;
+ hwords += sizeof(ohdr->u.rc.reth) / 4;
+ bth2 = qp->s_psn++ & 0xFFFFFF;
+ if ((int)(qp->s_psn - qp->s_next_psn) > 0)
+ qp->s_next_psn = qp->s_psn;
+ ss = NULL;
+ len = 0;
+ if (++qp->s_cur == qp->s_size)
+ qp->s_cur = 0;
+ break;
+
+ case IB_OPCODE_RC_RDMA_READ_REQUEST:
+ case IB_OPCODE_RC_COMPARE_SWAP:
+ case IB_OPCODE_RC_FETCH_ADD:
+ /*
+ * We shouldn't start anything new until this request
+ * is finished. The ACK will handle rescheduling us.
+ * XXX The number of outstanding ones is negotiated
+ * at connection setup time (see pg. 258,289)?
+ * XXX Also, if we support multiple outstanding
+ * requests, we need to check the WQE IB_SEND_FENCE
+ * flag and not send a new request if a RDMA read or
+ * atomic is pending.
+ */
+ goto done;
+ }
+ qp->s_len -= len;
+ bth0 |= qp->s_state << 24;
+ /* XXX queue resend timeout. */
+ }
+ /* Make sure it is non-zero before dropping the lock. */
+ qp->s_hdrwords = hwords;
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+
+ /* Construct the header. */
+ extra_bytes = (4 - len) & 3;
+ nwords = (len + extra_bytes) >> 2;
+ if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) {
+ /* Header size in 32-bit words. */
+ hwords += 10;
+ lrh0 = IPS_LRH_GRH;
+ qp->s_hdr.u.l.grh.version_tclass_flow =
+ cpu_to_be32((6 << 28) |
+ (qp->remote_ah_attr.grh.traffic_class << 20) |
+ qp->remote_ah_attr.grh.flow_label);
+ qp->s_hdr.u.l.grh.paylen =
+ cpu_to_be16(((hwords - 12) + nwords + SIZE_OF_CRC) << 2);
+ qp->s_hdr.u.l.grh.next_hdr = 0x1B;
+ qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit;
+ /* The SGID is 32-bit aligned. */
+ qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix;
+ qp->s_hdr.u.l.grh.sgid.global.interface_id =
+ ipath_layer_get_guid(dev->ib_unit);
+ qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid;
+ qp->s_hdrwords = hwords;
+ }
+ qp->s_cur_sge = ss;
+ qp->s_cur_size = len;
+ lrh0 |= qp->remote_ah_attr.sl << 4;
+ qp->s_hdr.lrh[0] = cpu_to_be16(lrh0);
+ /* DEST LID */
+ qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid);
+ qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC);
+ qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit));
+ bth0 |= extra_bytes << 20;
+ ohdr->bth[0] = cpu_to_be32(bth0);
+ ohdr->bth[1] = cpu_to_be32(qp->remote_qpn);
+ ohdr->bth[2] = cpu_to_be32(bth2);
+
+ /* Check for more work to do. */
+ goto again;
+
+done:
+ spin_unlock_irqrestore(&qp->s_lock, flags);
+ clear_bit(IPATH_S_BUSY, &qp->s_flags);
+}

2005-12-29 00:42:29

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 7 of 20] ipath - MMIO copy routines

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r 9e8d017ed298 -r ffbd416f30d4 drivers/infiniband/hw/ipath/ipath_copy.c
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_copy.c Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,612 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+/*
+ * This file provides support for doing sk_buff buffer swapping between
+ * the low level driver eager buffers, and the network layer. It's part
+ * of the core driver, rather than the ether driver, because it relies
+ * on variables and functions in the core driver. It exports a single
+ * entry point for use in the ipath_ether module.
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <asm/io.h>
+#include <asm/byteorder.h>
+#include <asm/bitops.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+
+#include <linux/crc32.h> /* we can generate our own crc's for testing */
+
+#include "ipath_kernel.h"
+#include "ips_common.h"
+#include "ipath_layer.h"
+
+/*
+ * Allocate a PIO send buffer, initialize the header and copy it out.
+ */
+static int layer_send_getpiobuf(struct copy_data_s *cdp)
+{
+ uint32_t device = cdp->device;
+ uint32_t extra_bytes;
+ uint32_t len, nwords;
+ uint32_t __iomem *piobuf;
+
+ if (!(piobuf = ipath_getpiobuf(device, NULL))) {
+ cdp->error = -EBUSY;
+ return cdp->error;
+ }
+
+ /*
+ * Compute the max amount of data that can fit into a PIO buffer.
+ * buffer size - header size - trigger qword length & flags - CRC
+ */
+ len = devdata[device].ipath_ibmaxlen -
+ sizeof(struct ether_header_typ) - 8 - (SIZE_OF_CRC << 2);
+ if (len > devdata[device].ipath_rcvegrbufsize)
+ len = devdata[device].ipath_rcvegrbufsize;
+ if (len > (cdp->len + cdp->extra))
+ len = (cdp->len + cdp->extra);
+ /* Compute word alignment (i.e., (len & 3) ? 4 - (len & 3) : 0) */
+ extra_bytes = (4 - len) & 3;
+ nwords = (sizeof(struct ether_header_typ) + len + extra_bytes) >> 2;
+ cdp->hdr->lrh[2] = htons(nwords + SIZE_OF_CRC);
+ cdp->hdr->bth[0] = htonl((OPCODE_ITH4X << 24) + (extra_bytes << 20) +
+ IPS_DEFAULT_P_KEY);
+ cdp->hdr->sub_opcode = OPCODE_ENCAP;
+
+ cdp->hdr->bth[2] = 0;
+ /* Generate an interrupt on the receive side for the last fragment. */
+ cdp->hdr->iph.pkt_flags = ((cdp->len+cdp->extra) == len) ? INFINIPATH_KPF_INTR : 0;
+ cdp->hdr->iph.chksum = (uint16_t) IPS_LRH_BTH +
+ (uint16_t) (nwords + SIZE_OF_CRC) -
+ (uint16_t) ((cdp->hdr->iph.ver_port_tid_offset >> 16)&0xFFFF) -
+ (uint16_t) (cdp->hdr->iph.ver_port_tid_offset & 0xFFFF) -
+ (uint16_t) cdp->hdr->iph.pkt_flags;
+
+ _IPATH_VDBG("send %d (%x %x %x %x %x %x %x)\n", nwords,
+ cdp->hdr->lrh[0], cdp->hdr->lrh[1],
+ cdp->hdr->lrh[2], cdp->hdr->lrh[3],
+ cdp->hdr->bth[0], cdp->hdr->bth[1], cdp->hdr->bth[2]);
+ /*
+ * Write len to control qword, no flags.
+ * +1 is for the qword padding of pbc.
+ */
+ writeq(nwords + 1ULL, (uint64_t __iomem *) piobuf);
+ /* we have to flush after the PBC for correctness on some cpus,
+ * or the WC buffer can be written out of order */
+ mb();
+ piobuf += 2;
+ memcpy_toio32(piobuf, cdp->hdr, sizeof(struct ether_header_typ) >> 2);
+ cdp->csum_pio = &((struct ether_header_typ __iomem *) piobuf)->csum;
+ cdp->to = piobuf + (sizeof(struct ether_header_typ) >> 2);
+ cdp->flen = nwords - (sizeof(struct ether_header_typ) >> 2);
+ cdp->hdr->frag_num++;
+ return 0;
+}
+
+/*
+ * copy the last full dword when that's the "extra" word, preceding it
+ * with a memory fence, so that all prior data is written to the PIO
+ * buffer before the trigger word, to enforce the correct bus ordering
+ * of the WC buffer contents on the bus.
+ */
+static inline unsigned copy_extra_dword(struct copy_data_s *cdp, unsigned dosum)
+{
+ if (!cdp->flen && layer_send_getpiobuf(cdp) < 0)
+ return 1;
+ /* write the checksum before the last PIO write, if requested. */
+ if (dosum && cdp->flen == 1)
+ writel(csum_fold(cdp->csum), cdp->csum_pio);
+ mb();
+ writel(cdp->u.w, cdp->to++);
+ mb();
+ cdp->extra = 0;
+ cdp->flen -= 1;
+ return 0;
+}
+
+/*
+ * copy a PIO buffer's worth (or the skb fragment, at least) to the PIO
+ * buffer, adding a memory fence before the last word. We need the fence
+ * as part of forcing the WC ordering on some cpus, for the cases where
+ * it will be the trigger word. The final fence after the trigger word
+ * will be done either at the next chunk, or on final return from the caller
+ * Takes max byte count, returns byte count actually done (always rounded
+ * to dword multiple).
+ */
+static uint32_t copy_a_buffer(struct copy_data_s *cdp, void *p, uint32_t n,
+ unsigned dosum)
+{
+ uint32_t *p32;
+
+ if (!cdp->flen && layer_send_getpiobuf(cdp) < 0)
+ return -1;
+ if (n > cdp->flen)
+ n = cdp->flen;
+ if (dosum && cdp->flen == n)
+ writel(csum_fold(cdp->csum), cdp->csum_pio);
+ p32 = p;
+ memcpy_toio32(cdp->to, p32, n-1);
+ cdp->to += n-1;
+ mb();
+ writel(p32[n-1], cdp->to++);
+ mb();
+ _IPATH_PDBG("trigger write to pio %p\n", &p32[n-1]);
+ cdp->flen -= n;
+ n <<= 2;
+ cdp->offset += n;
+ cdp->len -= n;
+ return n;
+}
+
+/*
+ * Copy data out of one or a chain of sk_buffs, into the PIO buffer.
+ * Fragment an sk_buff into multiple IB packets if the amount of data is
+ * more than a single eager send.
+ * Offset and len are in bytes.
+ * Note that this function is recursive!
+ */
+static void copy_bits(const struct sk_buff *skb, unsigned int offset,
+ unsigned int len, struct copy_data_s *cdp)
+{
+ unsigned int start = skb_headlen(skb);
+ unsigned int i, copy;
+ uint32_t n;
+ uint8_t *p;
+
+ /* Copy header. */
+ if ((int)(copy = start - offset) > 0) {
+ if (copy > len)
+ copy = len;
+ p = skb->data + offset;
+ offset += copy;
+ len -= copy;
+ /* If the alignment buffer is not empty, fill it and write it out. */
+ if (cdp->extra) {
+ if (cdp->extra == 4) {
+ if (copy_extra_dword(cdp, 0))
+ return;
+ }
+ else while (copy != 0) {
+ cdp->u.buf[cdp->extra] = *p++;
+ copy--;
+ cdp->offset++;
+ cdp->len--;
+
+ if (++cdp->extra == 4) {
+ if (copy_extra_dword(cdp, 0))
+ return;
+ break;
+ }
+ }
+ }
+ while (copy >= 4) {
+ n = copy_a_buffer(cdp, p, copy>>2, 0);
+ if (n == -1)
+ return;
+ p += n;
+ copy -= n;
+ }
+ /*
+ * Either cdp->extra is zero or copy is zero which means that
+ * the loop here can't cause the alignment buffer to fill up.
+ */
+ while (copy != 0) {
+ cdp->u.buf[cdp->extra++] = *p++;
+ copy--;
+ cdp->offset++;
+ cdp->len--;
+
+ }
+ if (len == 0)
+ return;
+ }
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ unsigned int end;
+
+ end = start + frag->size;
+ if ((int)(copy = end - offset) > 0) {
+ uint8_t *vaddr;
+
+ if (copy > len)
+ copy = len;
+ vaddr = kmap_skb_frag(frag);
+ p = vaddr + frag->page_offset + offset - start;
+ offset += copy;
+ len -= copy;
+ /* If the alignment buffer is not empty, fill it and write it out. */
+ if (cdp->extra) {
+ if (cdp->extra == 4) {
+ if (copy_extra_dword(cdp, 0))
+ return;
+ }
+ else while (copy != 0) {
+ cdp->u.buf[cdp->extra] = *p++;
+ copy--;
+ cdp->offset++;
+ cdp->len--;
+
+ if (++cdp->extra == 4) {
+ if (copy_extra_dword(cdp, 0))
+ return;
+ break;
+ }
+ }
+ }
+ while (copy >= 4) {
+ n = copy_a_buffer(cdp, p, copy>>2, 0);
+ if (n == -1)
+ return;
+ p += n;
+ copy -= n;
+ }
+ /*
+ * Either cdp->extra is zero or copy is zero which means that
+ * the loop here can't cause the alignment buffer to fill up.
+ */
+ while (copy != 0) {
+ cdp->u.buf[cdp->extra++] = *p++;
+ copy--;
+ cdp->offset++;
+ cdp->len--;
+ }
+ kunmap_skb_frag(vaddr);
+
+ if (len == 0)
+ return;
+ }
+ start = end;
+ }
+
+ if (skb_shinfo(skb)->frag_list) {
+ struct sk_buff *list = skb_shinfo(skb)->frag_list;
+
+ for (; list; list = list->next) {
+ unsigned int end;
+
+ end = start + list->len;
+ if ((int)(copy = end - offset) > 0) {
+ if (copy > len)
+ copy = len;
+ copy_bits(list, offset - start, copy, cdp);
+ if (cdp->error || (len -= copy) == 0)
+ return;
+ }
+ start = end;
+ }
+ }
+ if (len)
+ cdp->error = -EFAULT;
+}
+
+/*
+ * Copy data out of one or a chain of sk_buffs, into the PIO buffer, generating
+ * the checksum as we go.
+ * Fragment an sk_buff into multiple IB packets if the amount of data is
+ * more than a single eager send.
+ * Offset and len are in bytes.
+ * Note that this function is recursive!
+ */
+static void copy_and_csum_bits(const struct sk_buff *skb, unsigned int offset,
+ unsigned int len, struct copy_data_s *cdp)
+{
+ unsigned int start = skb_headlen(skb);
+ unsigned int i, copy;
+ unsigned int csum2;
+ uint32_t n;
+ uint8_t *p;
+
+ /* Copy header. */
+ if ((int)(copy = start - offset) > 0) {
+ if (copy > len)
+ copy = len;
+ p = skb->data + offset;
+ offset += copy;
+ len -= copy;
+ if (!cdp->checksum_calc) {
+ cdp->checksum_calc = 1;
+
+ csum2 = csum_partial(p, copy, 0);
+ cdp->csum = csum_block_add(cdp->csum, csum2, cdp->pos);
+ cdp->pos += copy;
+ }
+ /* If the alignment buffer is not empty, fill it and write it out. */
+ if (cdp->extra) {
+ if (cdp->extra == 4) {
+ if (copy_extra_dword(cdp, 1))
+ goto done;
+ }
+ else while (copy != 0) {
+ cdp->u.buf[cdp->extra] = *p++;
+ copy--;
+ cdp->offset++;
+ cdp->len--;
+ if (++cdp->extra == 4) {
+ if (copy_extra_dword(cdp, 1))
+ goto done;
+ break;
+ }
+ }
+ }
+
+ while (copy >= 4) {
+ n = copy_a_buffer(cdp, p, copy>>2, 1);
+ if (n == -1)
+ goto done;
+ p += n;
+ copy -= n;
+ }
+ /*
+ * Either cdp->extra is zero or copy is zero which means that
+ * the loop here can't cause the alignment buffer to fill up.
+ */
+ while (copy != 0) {
+ cdp->u.buf[cdp->extra++] = *p++;
+ copy--;
+ cdp->offset++;
+ cdp->len--;
+ }
+
+ cdp->checksum_calc = 0;
+
+ if (len == 0)
+ goto done;
+ }
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ unsigned int end;
+
+ end = start + frag->size;
+ if ((int)(copy = end - offset) > 0) {
+ uint8_t *vaddr;
+
+ if (copy > len)
+ copy = len;
+ vaddr = kmap_skb_frag(frag);
+ p = vaddr + frag->page_offset + offset - start;
+ offset += copy;
+ len -= copy;
+
+ if (!cdp->checksum_calc) {
+ cdp->checksum_calc = 1;
+
+ csum2 = csum_partial(p, copy, 0);
+ cdp->csum = csum_block_add(cdp->csum, csum2,
+ cdp->pos);
+ cdp->pos += copy;
+ }
+ /* If the alignment buffer is not empty, fill it and write it out. */
+ if (cdp->extra) {
+ if (cdp->extra == 4) {
+ if (copy_extra_dword(cdp, 1)) {
+ kunmap_skb_frag(vaddr);
+ goto done;
+ }
+ }
+ else while (copy != 0) {
+ cdp->u.buf[cdp->extra] = *p++;
+ copy--;
+ cdp->offset++;
+ cdp->len--;
+
+ if (++cdp->extra == 4) {
+ if (copy_extra_dword(cdp, 1)) {
+ kunmap_skb_frag(vaddr);
+ goto done;
+ }
+ break;
+ }
+ }
+ }
+ while (copy >= 4) {
+ n = copy_a_buffer(cdp, p, copy>>2, 1);
+ if (n == -1) {
+ kunmap_skb_frag(vaddr);
+ goto done;
+ }
+ p += n;
+ copy -= n;
+ }
+ /*
+ * Either cdp->extra is zero or copy is zero which means that
+ * the loop here can't cause the alignment buffer to fill up.
+ */
+ while (copy != 0) {
+ cdp->u.buf[cdp->extra++] = *p++;
+ copy--;
+ cdp->offset++;
+ cdp->len--;
+ }
+ kunmap_skb_frag(vaddr);
+
+ cdp->checksum_calc = 0;
+
+ if (len == 0)
+ goto done;
+ }
+ start = end;
+ }
+
+ if (skb_shinfo(skb)->frag_list) {
+ struct sk_buff *list = skb_shinfo(skb)->frag_list;
+
+ for (; list; list = list->next) {
+ unsigned int end;
+
+ end = start + list->len;
+ if ((int)(copy = end - offset) > 0) {
+ if (copy > len)
+ copy = len;
+ copy_and_csum_bits(list, offset - start, copy, cdp);
+ if (cdp->error || (len -= copy) == 0)
+ goto done;
+ offset += copy;
+ }
+ start = end;
+ }
+ }
+ if (len)
+ cdp->error = -EFAULT;
+done:
+ /* we have to flush after the trigger word for correctness on some cpus,
+ * or the WC buffer can be written out of order; needed even if
+ * there was an error */
+ mb();
+}
+
+/*
+ * Note that the header should have the unchanging parts
+ * initialized but the rest of the header is computed as needed in
+ * order to break up skb data buffers larger than the hardware MTU.
+ * In other words, the Linux network stack MTU can be larger than the
+ * hardware MTU.
+ */
+int ipath_layer_send_skb(struct copy_data_s *cdata)
+{
+ int ret = 0;
+ uint16_t vlsllnh;
+ int device = cdata->device;
+
+ if (device >= infinipath_max) {
+ _IPATH_INFO("Invalid unit %u, failing\n", device);
+ return -EINVAL;
+ }
+ if (!(devdata[device].ipath_flags & IPATH_RCVHDRSZ_SET)) {
+ _IPATH_INFO("send while not open\n");
+ ret = -EINVAL;
+ }
+ else if ((devdata[device].ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN))
+ || devdata[device].ipath_lid == 0) {
+ /* lid check is for when sma hasn't yet configured */
+ ret = -ENETDOWN;
+ _IPATH_VDBG("send while not ready, mylid=%u, flags=0x%x\n",
+ devdata[device].ipath_lid, devdata[device].ipath_flags);
+ }
+ vlsllnh = *((uint16_t *) cdata->hdr);
+ if (vlsllnh != htons(IPS_LRH_BTH)) {
+ _IPATH_DBG("Warning: lrh[0] wrong (%x, not %x); not sending\n",
+ vlsllnh, htons(IPS_LRH_BTH));
+ ret = -EINVAL;
+ }
+ if (ret)
+ goto done;
+
+ cdata->error = 0; /* clear last calls error */
+
+ if (cdata->skb->ip_summed == CHECKSUM_HW) {
+ unsigned int csstart = cdata->skb->h.raw - cdata->skb->data;
+
+ /*
+ * Computing the checksum is a bit tricky since if we fragment
+ * the packet, the fragment that should contain the checksum
+ * will have already been sent. The solution is to store the checksum
+ * in the header of the last fragment just before we write the
+ * last data word which triggers the last fragment to be sent.
+ * The receiver will check the header "tag" field, see that
+ * there is a checksum, and store the checksum back into the packet.
+ *
+ * Save the offset of the two byte checksum.
+ * Note that we have to add 2 to account for the two bytes of the
+ * ethernet address we stripped from the packet and put in the header.
+ */
+ cdata->hdr->csum_offset = csstart + cdata->skb->csum + 2;
+
+ if (cdata->offset < csstart)
+ copy_bits(cdata->skb, cdata->offset,
+ csstart - cdata->offset, cdata);
+
+ if (cdata->error) {
+ ret = cdata->error;
+ goto done;
+ }
+
+ if (cdata->offset < cdata->skb->len)
+ copy_and_csum_bits(cdata->skb, cdata->offset,
+ cdata->skb->len - cdata->offset, cdata);
+
+ if (cdata->error) {
+ ret = cdata->error;
+ goto done;
+ }
+
+ if (cdata->extra) {
+ while (cdata->extra < 4)
+ cdata->u.buf[cdata->extra++] = 0;
+ (void)copy_extra_dword(cdata, 1);
+ }
+ }
+ else {
+ copy_bits(cdata->skb, cdata->offset,
+ cdata->skb->len - cdata->offset, cdata);
+
+ if (cdata->error) {
+ ret = cdata->error;
+ goto done;
+ }
+
+ if (cdata->extra) {
+ while (cdata->extra < 4)
+ cdata->u.buf[cdata->extra++] = 0;
+ (void)copy_extra_dword(cdata, 1);
+ }
+ }
+
+ if (cdata->error) {
+ ret = cdata->error;
+ if (cdata->error != -EBUSY)
+ _IPATH_UNIT_ERROR(device,
+ "layer_send copy_bits failed with error %d\n",
+ -ret);
+ }
+
+ ipath_stats.sps_ether_spkts++; /* another ether packet sent */
+
+done:
+ /* we have to flush after the trigger word for correctness on some cpus,
+ * or the WC buffer can be written out of order; needed even if
+ * there was an error */
+ mb();
+ return ret;
+}
+
+EXPORT_SYMBOL(ipath_layer_send_skb);
+

2005-12-29 00:44:03

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 2 of 20] memcpy32 for x86_64

Introduce an x86_64-specific memcpy32 routine. The routine is similar
to memcpy, but is guaranteed to work in units of 32 bits at a time.

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r ef833f6712e7 -r 801287704e40 arch/x86_64/kernel/x8664_ksyms.c
--- a/arch/x86_64/kernel/x8664_ksyms.c Wed Dec 28 14:19:42 2005 -0800
+++ b/arch/x86_64/kernel/x8664_ksyms.c Wed Dec 28 14:19:42 2005 -0800
@@ -164,6 +164,8 @@
EXPORT_SYMBOL(memcpy);
EXPORT_SYMBOL(__memcpy);

+EXPORT_SYMBOL_GPL(memcpy32);
+
#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
/* prototypes are wrong, these are assembly with custom calling functions */
extern void rwsem_down_read_failed_thunk(void);
diff -r ef833f6712e7 -r 801287704e40 arch/x86_64/lib/Makefile
--- a/arch/x86_64/lib/Makefile Wed Dec 28 14:19:42 2005 -0800
+++ b/arch/x86_64/lib/Makefile Wed Dec 28 14:19:42 2005 -0800
@@ -9,4 +9,4 @@
lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \
usercopy.o getuser.o putuser.o \
thunk.o clear_page.o copy_page.o bitstr.o bitops.o
-lib-y += memcpy.o memmove.o memset.o copy_user.o
+lib-y += memcpy.o memcpy32.o memmove.o memset.o copy_user.o
diff -r ef833f6712e7 -r 801287704e40 include/asm-x86_64/string.h
--- a/include/asm-x86_64/string.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-x86_64/string.h Wed Dec 28 14:19:42 2005 -0800
@@ -45,6 +45,15 @@
#define __HAVE_ARCH_MEMMOVE
void * memmove(void * dest,const void *src,size_t count);

+/*
+ * memcpy32 - copy data, 32 bits at a time
+ *
+ * @dst: destination (must be 32-bit aligned)
+ * @src: source (must be 32-bit aligned)
+ * @count: number of 32-bit quantities to copy
+ */
+void memcpy32(void *dst, const void *src, size_t count);
+
/* Use C out of line version for memcmp */
#define memcmp __builtin_memcmp
int memcmp(const void * cs,const void * ct,size_t count);
diff -r ef833f6712e7 -r 801287704e40 arch/x86_64/lib/memcpy32.S
--- /dev/null Thu Jan 1 00:00:00 1970 +0000
+++ b/arch/x86_64/lib/memcpy32.S Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2003, 2004, 2005 PathScale, Inc.
+ */
+
+/*
+ * memcpy32 - Copy a memory block, 32 bits at a time.
+ *
+ * Count is number of dwords; it need not be a qword multiple.
+ * Input:
+ * rdi destination
+ * rsi source
+ * rdx count
+ */
+
+ .globl memcpy32
+memcpy32:
+ movl %edx,%ecx
+ shrl $1,%ecx
+ andl $1,%edx
+ rep movsq
+ movl %edx,%ecx
+ rep movsd
+ ret

2005-12-29 00:44:34

by Bryan O'Sullivan

[permalink] [raw]
Subject: [PATCH 3 of 20] Add memcpy_toio32 to each arch

Most arches use the generic __memcpy_toio32 routine, while x86_64
uses memcpy32, which is substantially faster.

Signed-off-by: Bryan O'Sullivan <[email protected]>

diff -r 801287704e40 -r b792638cc4bc include/asm-alpha/io.h
--- a/include/asm-alpha/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-alpha/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -504,6 +504,8 @@
extern void memcpy_toio(volatile void __iomem *, const void *, long);
extern void _memset_c_io(volatile void __iomem *, unsigned long, long);

+#define memcpy_toio32 __memcpy_toio32
+
static inline void memset_io(volatile void __iomem *addr, u8 c, long len)
{
_memset_c_io(addr, 0x0101010101010101UL * c, len);
diff -r 801287704e40 -r b792638cc4bc include/asm-arm/io.h
--- a/include/asm-arm/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-arm/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -184,6 +184,8 @@
#define memset_io(c,v,l) _memset_io(__mem_pci(c),(v),(l))
#define memcpy_fromio(a,c,l) _memcpy_fromio((a),__mem_pci(c),(l))
#define memcpy_toio(c,a,l) _memcpy_toio(__mem_pci(c),(a),(l))
+
+#define memcpy_toio32 __memcpy_toio32

#define eth_io_copy_and_sum(s,c,l,b) \
eth_copy_and_sum((s),__mem_pci(c),(l),(b))
diff -r 801287704e40 -r b792638cc4bc include/asm-cris/io.h
--- a/include/asm-cris/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-cris/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -121,6 +121,8 @@
#define memcpy_fromio(a,b,c) memcpy((a),(void *)(b),(c))
#define memcpy_toio(a,b,c) memcpy((void *)(a),(b),(c))

+#define memcpy_toio32 __memcpy_toio32
+
/*
* Again, CRIS does not require mem IO specific function.
*/
diff -r 801287704e40 -r b792638cc4bc include/asm-frv/io.h
--- a/include/asm-frv/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-frv/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -124,6 +124,8 @@
memcpy((void __force *) dst, src, count);
}

+#define memcpy_toio32 __memcpy_toio32
+
static inline uint8_t inb(unsigned long addr)
{
return __builtin_read8((void *)addr);
diff -r 801287704e40 -r b792638cc4bc include/asm-h8300/io.h
--- a/include/asm-h8300/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-h8300/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -209,6 +209,8 @@
#define memcpy_fromio(a,b,c) memcpy((a),(void *)(b),(c))
#define memcpy_toio(a,b,c) memcpy((void *)(a),(b),(c))

+#define memcpy_toio32 __memcpy_toio32
+
#define mmiowb()

#define inb(addr) ((h8300_buswidth(addr))?readw((addr) & ~1) & 0xff:readb(addr))
diff -r 801287704e40 -r b792638cc4bc include/asm-i386/io.h
--- a/include/asm-i386/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-i386/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -203,6 +203,9 @@
{
__memcpy((void __force *) dst, src, count);
}
+
+#define memcpy_toio32 __memcpy_toio32
+

/*
* ISA space is 'always mapped' on a typical x86 system, no need to
diff -r 801287704e40 -r b792638cc4bc include/asm-ia64/io.h
--- a/include/asm-ia64/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-ia64/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -443,6 +443,8 @@
extern void memcpy_toio(volatile void __iomem *dst, const void *src, long n);
extern void memset_io(volatile void __iomem *s, int c, long n);

+#define memcpy_toio32 __memcpy_toio32
+
#define dma_cache_inv(_start,_size) do { } while (0)
#define dma_cache_wback(_start,_size) do { } while (0)
#define dma_cache_wback_inv(_start,_size) do { } while (0)
diff -r 801287704e40 -r b792638cc4bc include/asm-m32r/io.h
--- a/include/asm-m32r/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-m32r/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -216,6 +216,8 @@
memcpy((void __force *) dst, src, count);
}

+#define memcpy_toio32 __memcpy_toio32
+
/*
* Convert a physical pointer to a virtual kernel pointer for /dev/mem
* access
diff -r 801287704e40 -r b792638cc4bc include/asm-m68knommu/io.h
--- a/include/asm-m68knommu/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-m68knommu/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -113,6 +113,8 @@
#define memcpy_fromio(a,b,c) memcpy((a),(void *)(b),(c))
#define memcpy_toio(a,b,c) memcpy((void *)(a),(b),(c))

+#define memcpy_toio32 __memcpy_toio32
+
#define inb(addr) readb(addr)
#define inw(addr) readw(addr)
#define inl(addr) readl(addr)
diff -r 801287704e40 -r b792638cc4bc include/asm-mips/io.h
--- a/include/asm-mips/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-mips/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -534,6 +534,8 @@
memcpy((void __force *) dst, src, count);
}

+#define memcpy_toio32 __memcpy_toio32
+
/*
* Memory Mapped I/O
*/
diff -r 801287704e40 -r b792638cc4bc include/asm-parisc/io.h
--- a/include/asm-parisc/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-parisc/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -294,6 +294,8 @@
void memcpy_fromio(void *dst, const volatile void __iomem *src, int count);
void memcpy_toio(volatile void __iomem *dst, const void *src, int count);

+#define memcpy_toio32 __memcpy_toio32
+
/* Support old drivers which don't ioremap.
* NB this interface is scheduled to disappear in 2.5
*/
diff -r 801287704e40 -r b792638cc4bc include/asm-powerpc/io.h
--- a/include/asm-powerpc/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-powerpc/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -63,6 +63,8 @@
#define memcpy_fromio(a,b,c) iSeries_memcpy_fromio((a), (b), (c))
#define memcpy_toio(a,b,c) iSeries_memcpy_toio((a), (b), (c))

+#define memcpy_toio32 __memcpy_toio32
+
#define inb(addr) readb(((void __iomem *)(long)(addr)))
#define inw(addr) readw(((void __iomem *)(long)(addr)))
#define inl(addr) readl(((void __iomem *)(long)(addr)))
diff -r 801287704e40 -r b792638cc4bc include/asm-ppc/io.h
--- a/include/asm-ppc/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-ppc/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -367,6 +367,8 @@
}
#endif

+#define memcpy_toio32 __memcpy_toio32
+
#define eth_io_copy_and_sum(a,b,c,d) eth_copy_and_sum((a),(void __force *)(void __iomem *)(b),(c),(d))

/*
diff -r 801287704e40 -r b792638cc4bc include/asm-s390/io.h
--- a/include/asm-s390/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-s390/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -99,6 +99,8 @@
#define memcpy_fromio(a,b,c) memcpy((a),__io_virt(b),(c))
#define memcpy_toio(a,b,c) memcpy(__io_virt(a),(b),(c))

+#define memcpy_toio32 __memcpy_toio32
+
#define inb_p(addr) readb(addr)
#define inb(addr) readb(addr)

diff -r 801287704e40 -r b792638cc4bc include/asm-sh/io.h
--- a/include/asm-sh/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-sh/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -177,6 +177,8 @@
extern void memcpy_toio(unsigned long, const void *, unsigned long);
extern void memset_io(unsigned long, int, unsigned long);

+#define memcpy_toio32 __memcpy_toio32
+
/* SuperH on-chip I/O functions */
static __inline__ unsigned char ctrl_inb(unsigned long addr)
{
diff -r 801287704e40 -r b792638cc4bc include/asm-sh64/io.h
--- a/include/asm-sh64/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-sh64/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -125,6 +125,8 @@

void memcpy_toio(void __iomem *to, const void *from, long count);
void memcpy_fromio(void *to, void __iomem *from, long count);
+
+#define memcpy_toio32 __memcpy_toio32

#define mmiowb()

diff -r 801287704e40 -r b792638cc4bc include/asm-sparc/io.h
--- a/include/asm-sparc/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-sparc/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -239,6 +239,8 @@

#define memcpy_toio(d,s,sz) _memcpy_toio(d,s,sz)

+#define memcpy_toio32 __memcpy_toio32
+
#ifdef __KERNEL__

/*
diff -r 801287704e40 -r b792638cc4bc include/asm-sparc64/io.h
--- a/include/asm-sparc64/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-sparc64/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -440,6 +440,8 @@

#define memcpy_toio(d,s,sz) _memcpy_toio(d,s,sz)

+#define memcpy_toio32 __memcpy_toio32
+
static inline int check_signature(void __iomem *io_addr,
const unsigned char *signature,
int length)
diff -r 801287704e40 -r b792638cc4bc include/asm-v850/io.h
--- a/include/asm-v850/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-v850/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -130,6 +130,8 @@
#define memcpy_fromio(dst, src, len) memcpy (dst, (void *)src, len)
#define memcpy_toio(dst, src, len) memcpy ((void *)dst, src, len)

+#define memcpy_toio32 __memcpy_toio32
+
/*
* Convert a physical pointer to a virtual kernel pointer for /dev/mem
* access
diff -r 801287704e40 -r b792638cc4bc include/asm-x86_64/io.h
--- a/include/asm-x86_64/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-x86_64/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -252,6 +252,13 @@
__memcpy_toio((unsigned long)to,from,len);
}

+#include <asm/string.h>
+
+static inline void memcpy_toio32(void __iomem *dst, const void *src, size_t count)
+{
+ memcpy32((void __force *) dst, src, count);
+}
+
void memset_io(volatile void __iomem *a, int b, size_t c);

/*
diff -r 801287704e40 -r b792638cc4bc include/asm-xtensa/io.h
--- a/include/asm-xtensa/io.h Wed Dec 28 14:19:42 2005 -0800
+++ b/include/asm-xtensa/io.h Wed Dec 28 14:19:42 2005 -0800
@@ -159,6 +159,8 @@
#define memcpy_fromio(a,b,c) memcpy((a),(void *)(b),(c))
#define memcpy_toio(a,b,c) memcpy((void *)(a),(b),(c))

+#define memcpy_toio32 __memcpy_toio32
+
/* At this point the Xtensa doesn't provide byte swap instructions */

#ifdef __XTENSA_EB__

2005-12-29 02:19:55

by Roland Dreier

[permalink] [raw]
Subject: Re: [openib-general] [PATCH 11 of 20] ipath - core driver, part 4 of 4

I didn't notice this before:

> + * This is volatile as it's the target of a DMA from the chip.
> + */
> +
> +static volatile uint64_t ipath_port0_rcvhdrtail[512]
> + __attribute__ ((aligned(4096)));

... and then much later ...

> + /*
> + * kernel modules loaded into vmalloc'ed memory,
> + * verify that when we assume that, map to phys, and back to virt,
> + * that we get the right contents, so we did the mapping right.
> + */
> + vpage = vmalloc_to_page((void *)ipath_port0_rcvhdrtail);
> + if (vpage == NOPAGE_SIGBUS || vpage == NOPAGE_OOM) {
> + _IPATH_UNIT_ERROR(t, "vmalloc_to_page for rcvhdrtail fails!\n");
> + ret = -ENOMEM;
> + goto done;
> + }

This seems very wrong to me: there's no guarantee that a module will
be loaded into memory that can be used as a DMA target. For example,
on a non-cache-coherent architecture, I think this memory must be
accessed through a non-cached mapping.

I think the correct solution is to allocate a buffer for each device
with pci_alloc_consistent() (or maybe dma_alloc_coherent()).
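
Roughly, I mean something like this at per-device init time (the
rcvhdrtail_base/rcvhdrtail_dma names are made up, just to sketch the
idea; the interface lives in <linux/dma-mapping.h>):

	dd->rcvhdrtail_base = dma_alloc_coherent(&pdev->dev,
						 512 * sizeof(u64),
						 &dd->rcvhdrtail_dma,
						 GFP_KERNEL);
	if (!dd->rcvhdrtail_base)
		return -ENOMEM;
	/* hand dd->rcvhdrtail_dma to the chip as the DMA target */

	/* ... and at teardown ... */
	dma_free_coherent(&pdev->dev, 512 * sizeof(u64),
			  dd->rcvhdrtail_base, dd->rcvhdrtail_dma);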

(As a general comment, I'm still unhappy about how your driver has a
static, fixed-size table of devices rather than allocating per-device
data structures dynamically)

- R.

2005-12-29 08:18:06

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 5 of 20] ipath - driver core header files

Hi Bryan,

On 12/29/05, Bryan O'Sullivan <[email protected]> wrote:
> +/*
> + * Copy routine that is guaranteed to work in terms of aligned 32-bit
> + * quantities.
> + */
> +void ipath_dwordcpy(uint32_t *dest, uint32_t *src, uint32_t ndwords);

Wasn't this supposed to be killed?

Pekka

2005-12-29 08:21:06

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 14 of 20] ipath - infiniband verbs header

On 12/29/05, Bryan O'Sullivan <[email protected]> wrote:
> diff -r f9bcd9de3548 -r 26993cb5faee drivers/infiniband/hw/ipath/verbs_debug.h
> --- /dev/null Thu Jan 1 00:00:00 1970 +0000
> +++ b/drivers/infiniband/hw/ipath/verbs_debug.h Wed Dec 28 14:19:43 2005 -0800
> +#ifndef _VERBS_DEBUG_H
> +#define _VERBS_DEBUG_H
> +
> +/*
> + * This file contains tracing code for the ib_ipath kernel module.
> + */
> +#ifndef _VERBS_DEBUGGING /* tracing enabled or not */
> +#define _VERBS_DEBUGGING 1
> +#endif
> +
> +extern unsigned ib_ipath_debug;
> +
> +#define _VERBS_ERROR(fmt,...) \
> + do { \
> + printk(KERN_ERR "%s: " fmt, "ib_ipath", ##__VA_ARGS__); \
> + } while(0)

[snip, snip]

Please consider using dev_dbg, dev_err, and friends from <linux/device.h>.
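
For example (with pdev being whatever pci_dev the driver keeps around,
and the messages just illustrative):

	dev_err(&pdev->dev, "lrh[0] wrong, not sending\n");
	dev_dbg(&pdev->dev, "sent %u words\n", nwords);

That way every message gets a consistent driver/device prefix for free.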

Pekka

2005-12-29 08:22:28

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 6 of 20] ipath - driver debugging headers

On 12/29/05, Bryan O'Sullivan <[email protected]> wrote:
> +#endif /* _IPATH_DEBUG_H */
> diff -r 2d9a3f27a10c -r 9e8d017ed298 drivers/infiniband/hw/ipath/ipath_kdebug.h
> --- /dev/null Thu Jan 1 00:00:00 1970 +0000
> +++ b/drivers/infiniband/hw/ipath/ipath_kdebug.h Wed Dec 28 14:19:42 2005 -0800
> @@ -0,0 +1,109 @@
> +#ifndef _IPATH_KDEBUG_H
> +#define _IPATH_KDEBUG_H
> +
> +#include "ipath_debug.h"
> +
> +/*
> + * This file contains lightweight kernel tracing code.
> + */
> +
> +extern unsigned infinipath_debug;
> +const char *ipath_get_unit_name(int unit);
> +
> +#if _IPATH_DEBUGGING
> +
> +#define _IPATH_UNIT_ERROR(unit,fmt,...) \
> + printk(KERN_ERR "%s: " fmt, ipath_get_unit_name(unit), ##__VA_ARGS__)
> +
> +#define _IPATH_ERROR(fmt,...) printk(KERN_ERR "infinipath: " fmt, ##__VA_ARGS__)
> +
> +#define _IPATH_INFO(fmt,...) \
> + do { \
> + if(unlikely(infinipath_debug & __IPATH_INFO)) \
> + printk(KERN_INFO "infinipath: " fmt, ##__VA_ARGS__); \
> + } while(0)
> +

[snip, snip]

Please consider using dev_dbg, dev_err, et al from <linux/device.h>.

Pekka

2005-12-29 14:15:57

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 5 of 20] ipath - driver core header files

On Thu, 2005-12-29 at 10:18 +0200, Pekka Enberg wrote:

> > +void ipath_dwordcpy(uint32_t *dest, uint32_t *src, uint32_t ndwords);

> Wasn't this supposed to be killed?

The routine itself is dead, but the prototype survived. Thanks for
spotting that.

<b

2005-12-29 14:21:49

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [openib-general] [PATCH 11 of 20] ipath - core driver, part 4 of 4

On Wed, 2005-12-28 at 18:19 -0800, Roland Dreier wrote:
> This seems very wrong to me: there's no guarantee that a module will
> be loaded into memory that can be used as a DMA target.

Right. I think we're just getting lucky on x86_64. I'll fix this.

Thanks,

<b

--
Bryan O'Sullivan <[email protected]>

2005-12-29 14:23:12

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 14 of 20] ipath - infiniband verbs header

On Thu, 2005-12-29 at 10:21 +0200, Pekka Enberg wrote:

> Please consider using dev_dbg, dev_err, and friends from <linux/device.h>.

Will do, thanks.

<b

2005-12-29 19:02:13

by Horst H. von Brand

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

Bryan O'Sullivan <[email protected]> wrote:
> Following Roland's submission of our InfiniPath InfiniBand HCA driver
> earlier this month, we have responded to people's comments by making a
> large number of changes to the driver.

Many thanks!

> Here is another set of driver patches for review. Roland is on
> vacation until January 4, so I'm posting these in his place. Once
> again, your comments are appreciated. We'd like to submit this driver
> for inclusion in 2.6.16, so we'll be responding quickly to all
> feedback.
>
> A short summary of the changes we have made is as follows:

Some comments, just based on this:

[...]

> - Renamed _BITS_PER_BYTE to BITS_PER_BYTE, and moved it into
> linux/types.h

Haven't come across anything with this not 8 for a /long/ time now, and no
Linux on that in sight.

[...]

> There are a few requested changes we have chosen to omit for now:
>
> - The driver still uses EXPORT_SYMBOL, for consistency with other
> code in drivers/infiniband

I'd suppose that is your choice...

> - Someone asked for the kernel's i2c infrastructure to be used, but
> our i2c usage is very specialised, and it would be more of a mess
> to use the kernel's

Problem with that is that if everybody and Aunt Tillie does the same, the
kernel as a whole gets to be a mess.

> - We're still using ioctls instead of sysfs or configfs in some
> cases, to maintain userspace compatibility

With what? You can very well ask people to upgrade to the latest userland
utilities, and even make them run the old versions when they find that the
new interface isn't there. Happened recently with modprobe/modutils.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2005-12-29 19:20:38

by Lee Revell

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Thu, 2005-12-29 at 16:01 -0300, Horst von Brand wrote:
> > - Someone asked for the kernel's i2c infrastructure to be used, but
> > our i2c usage is very specialised, and it would be more of a mess
> > to use the kernel's
>
> Problem with that is that if everybody and Aunt Tillie does the same,
> the kernel as a whole gets to be a mess.

ALSA does the exact same thing for the exact same reason. Maybe an
indication that the kernel's i2c layer is too heavy?

Lee

2005-12-29 19:24:40

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 17 of 20] ipath - infiniband verbs support, part 3 of 3

Hi,

[Copy-paste reuse alert!]

On 12/29/05, Bryan O'Sullivan <[email protected]> wrote:
> +static struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd,
> + struct ib_phys_buf *buffer_list,
> + int num_phys_buf,
> + int acc, u64 *iova_start)
> +{
> + struct ipath_mr *mr;
> + int n, m, i;
> +
> + /* Allocate struct plus pointers to first level page tables. */
> + m = (num_phys_buf + IPATH_SEGSZ - 1) / IPATH_SEGSZ;
> + mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL);
> + if (!mr)
> + return ERR_PTR(-ENOMEM);
> +
> + /* Allocate first level page tables. */
> + for (i = 0; i < m; i++) {
> + mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0], GFP_KERNEL);
> + if (!mr->mr.map[i]) {
> + while (i)
> + kfree(mr->mr.map[--i]);
> + kfree(mr);
> + return ERR_PTR(-ENOMEM);
> + }
> + }
> + mr->mr.mapsz = m;

[snip, snip]

> +static struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd,
> + struct ib_umem *region,
> + int mr_access_flags,
> + struct ib_udata *udata)
> +{
> + struct ipath_mr *mr;
> + struct ib_umem_chunk *chunk;
> + int n, m, i;
> +
> + n = 0;
> + list_for_each_entry(chunk, &region->chunk_list, list)
> + n += chunk->nents;
> +
> + /* Allocate struct plus pointers to first level page tables. */
> + m = (n + IPATH_SEGSZ - 1) / IPATH_SEGSZ;
> + mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL);
> + if (!mr)
> + return ERR_PTR(-ENOMEM);
> +
> + /* Allocate first level page tables. */
> + for (i = 0; i < m; i++) {
> + mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0], GFP_KERNEL);
> + if (!mr->mr.map[i]) {
> + while (i)
> + kfree(mr->mr.map[--i]);
> + kfree(mr);
> + return ERR_PTR(-ENOMEM);
> + }
> + }
> + mr->mr.mapsz = m;

[snip, more duplicate code]

The above fragment is repeated at least three times. Please factor out
the common code into separate functions.
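
Something along these lines (alloc_mr is just a suggested name, and the
error handling mirrors what you already have) would let all the call
sites share it:

	static struct ipath_mr *alloc_mr(int count)
	{
		struct ipath_mr *mr;
		int m, i;

		/* Allocate struct plus pointers to first level page tables. */
		m = (count + IPATH_SEGSZ - 1) / IPATH_SEGSZ;
		mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL);
		if (!mr)
			return NULL;

		/* Allocate first level page tables. */
		for (i = 0; i < m; i++) {
			mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0],
						GFP_KERNEL);
			if (!mr->mr.map[i])
				goto bail;
		}
		mr->mr.mapsz = m;
		return mr;

	bail:
		while (i)
			kfree(mr->mr.map[--i]);
		kfree(mr);
		return NULL;
	}

Then each caller is just mr = alloc_mr(n); if (!mr) return ERR_PTR(-ENOMEM);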

Pekka

2005-12-30 03:19:53

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 17 of 20] ipath - infiniband verbs support, part 3 of 3

On Thu, 2005-12-29 at 21:24 +0200, Pekka Enberg wrote:

> [Copy-paste reuse alert!]

Yep, thanks for pointing that out. The source file in question is about
to go on a serious diet :-)

<b

2005-12-30 03:17:50

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Thu, 2005-12-29 at 16:01 -0300, Horst von Brand wrote:

> > - Renamed _BITS_PER_BYTE to BITS_PER_BYTE, and moved it into
> > linux/types.h

> Haven't come across anything with this not 8 for a /long/ time now, and no
> Linux on that in sight.

The point isn't that it might change, but that it makes code clearer to
use BITS_PER_BYTE in arithmetic than to have the magic number 8
sprinkled around mysteriously.
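
For example, a (made-up) line like

	u32 nbits = nbytes * BITS_PER_BYTE;

says what it means at a glance, where a bare "* 8" doesn't.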

<b

2005-12-30 17:55:48

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Wed, Dec 28, 2005 at 04:31:19PM -0800, Bryan O'Sullivan wrote:
>
> There are a few requested changes we have chosen to omit for now:
>
> - The driver still uses EXPORT_SYMBOL, for consistency with other
> code in drivers/infiniband

Why would that matter?

> - Someone asked for the kernel's i2c infrastructure to be used, but
> our i2c usage is very specialised, and it would be more of a mess
> to use the kernel's

Why is this? What is so messy about the in-kernel i2c interfaces?
(yeah, I know that there are some oddities, just want to know what you
specifically are not liking...)

> - We're still using ioctls instead of sysfs or configfs in some
> cases, to maintain userspace compatibility

Compatibility with what? The driver isn't in the kernel tree yet, so
there are no old kernel versions to remain compatible with :)

I also noticed that you are still using the uint64_t type variable
types, can you please switch to the proper kernel types instead (u64 in
this specific example.)

thanks,

greg k-h

2005-12-30 17:55:47

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 12 of 20] ipath - misc driver support code

On Wed, Dec 28, 2005 at 04:31:31PM -0800, Bryan O'Sullivan wrote:
> Signed-off-by: Bryan O'Sullivan <[email protected]>

No description of what the patch does?

> +struct _infinipath_do_not_use_kernel_regs {
> + unsigned long long Revision;

u64?

> + unsigned long long Control;
> + unsigned long long PageAlign;
> + unsigned long long PortCnt;

And what's with the InterCapsNamingScheme of these variables?

> +/*
> + * would prefer to not inline this, to avoid code bloat, and simplify debugging
> + * But when compiling against 2.6.10 kernel tree, it gets an error, so
> + * not for now.
> + */
> +static void ipath_i2c_delay(ipath_type, int);

You aren't compiling this for a 2.6.10 kernel anymore :)

> +/*
> + * we use this instead of udelay directly, so we can make sure
> + * that previous register writes have been flushed all the way
> + * to the chip. Since we are delaying anyway, the cost doesn't
> + * hurt, and makes the bit twiddling more regular
> + * If delay is negative, we'll do the chip read, to be sure write made it
> + * to our chip, but won't do udelay()
> + */
> +static void ipath_i2c_delay(ipath_type dev, int dtime)
> +{
> + /*
> + * This needs to be volatile, so that the compiler doesn't
> + * optimize away the read to the device's mapped memory.
> + */
> + volatile uint32_t read_val;
> + if (!dtime)
> + return;
> + read_val = ipath_kget_kreg32(dev, kr_scratch);
> + if (--dtime > 0) /* register read takes about .5 usec, itself */
> + udelay(dtime);
> +}

Huh? After reading your comment, I still don't understand why you can't
just use udelay(). Or are you counting on calling this function with
only "1" being set for dtime?

Ah, in looking at your code, that is exactly what is happening. That's
a mess, just delay and everything will work properly on the next rev of
the hardware where the time to read that register will have dropped to
1/8 the time it does today...
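
In other words, assuming I'm reading the intent right, something as
simple as this should do:

	/* flush previous register writes all the way to the chip */
	(void) ipath_kget_kreg32(dev, kr_scratch);
	udelay(dtime);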

> +/*
> + * write a byte, one bit at a time. Returns 0 if we got the following
> + * ack, otherwise 1
> + */
> +static int ipath_wr_byte(ipath_type dev, uint8_t data)
> +{
> + int bit_cntr;
> + uint8_t bit;
> +
> + for (bit_cntr = 7; bit_cntr >= 0; bit_cntr--) {
> + bit = (data >> bit_cntr) & 1;
> + ipath_sda_out(dev, bit, 1);
> + ipath_scl_out(dev, i2c_line_high, 1);
> + ipath_scl_out(dev, i2c_line_low, 1);
> + }
> + if (!ipath_i2c_ackrcv(dev))
> + return 1;
> + return 0;
> +}

Ah, isn't it fun to write bit-banging functions... And the in-kernel
i2c code is messier than doing this by hand?

> +/*
> + * ipath_eeprom_read - Receives x # byte from the eeprom via I2C.
> + *
> + * eeprom: Atmel AT24C01
> + *
> + */
> +
> +int ipath_eeprom_read(ipath_type dev, uint8_t eeprom_offset, void *buffer,
> + int len)

Odd function comment style. Please fix this to be in kerneldoc format.
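
i.e. something along these lines (parameter descriptions guessed from
context):

	/**
	 * ipath_eeprom_read - receive bytes from the eeprom via I2C
	 * @dev: the infinipath unit to read from
	 * @eeprom_offset: offset within the eeprom (an Atmel AT24C01)
	 * @buffer: where to place the bytes read
	 * @len: number of bytes to read
	 */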

> diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_lib.c
> --- /dev/null Thu Jan 1 00:00:00 1970 +0000
> +++ b/drivers/infiniband/hw/ipath/ipath_lib.c Wed Dec 28 14:19:43 2005 -0800
> @@ -0,0 +1,90 @@
> +/*
> + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * - Redistributions of source code must retain the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer.
> + *
> + * - Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer in the documentation and/or other materials
> + * provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Patent licenses, if any, provided herein do not apply to
> + * combinations of this program with other software, or any other
> + * product whatsoever.
> + */
> +
> +/*
> + * This is library code for the driver, similar to what's in libinfinipath for
> + * usermode code.
> + */
> +
> +#include <linux/config.h>
> +#include <linux/version.h>
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/types.h>
> +#include <linux/string.h>
> +#include <linux/delay.h>
> +#include <linux/timer.h>
> +#include <linux/fs.h>
> +#include <linux/poll.h>
> +#include <asm/io.h>
> +#include <asm/byteorder.h>
> +#include <asm/uaccess.h>

Are you _sure_ you need all of these for the one function in this file?

> +
> +#include "ipath_kernel.h"
> +
> +unsigned infinipath_debug = __IPATH_INFO;
> +
> +uint32_t _ipath_pico_per_cycle; /* always present, for now */
> +
> +/*
> + * This isn't perfect, but it's close enough for timing work. We want this
> + * to work on systems where the cycle counter isn't the same as the clock
> + * frequency. The one msec spin is OK, since we execute this only once
> + * when first loaded. We don't use CURRENT_TIME because on some systems
> + * it only has jiffy resolution; we just assume udelay is well calibrated
> + * and that we aren't likely to be rescheduled. Do it multiple times,
> + * with a yield in between, to try to make sure we get the "true minimum"
> + * value.
> + * _ipath_pico_per_cycle isn't going to lead to completely accurate
> + * conversions from timestamps to nanoseconds, but it's close enough
> + * for our purposes, which is mainly to allow people to show events with
> + * nsecs or usecs if desired, rather than cycles.
> + */
> +void ipath_init_picotime(void)
> +{
> + int i;
> + u_int64_t ts, te, delta = -1ULL;
> +
> + for (i = 0; i < 5; i++) {
> + ts = get_cycles();
> + udelay(250);
> + te = get_cycles();
> + if ((te - ts) < delta)
> + delta = te - ts;
> + yield();
> + }
> + _ipath_pico_per_cycle = 250000000 / delta;
> +}

Ick. A whole file for one function and 2 public variables? And a
horrible timing function too? Please just use the core kernel timing
functions, which will work all the time on all arches...


> diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_upages.c
> --- /dev/null Thu Jan 1 00:00:00 1970 +0000
> +++ b/drivers/infiniband/hw/ipath/ipath_upages.c Wed Dec 28 14:19:43 2005 -0800
> @@ -0,0 +1,144 @@
> +/*
> + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * - Redistributions of source code must retain the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer.
> + *
> + * - Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer in the documentation and/or other materials
> + * provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Patent licenses, if any, provided herein do not apply to
> + * combinations of this program with other software, or any other
> + * product whatsoever.
> + */
> +
> +#include <stddef.h>

Where is this file being pulled in from?

> +
> +#include <linux/config.h>
> +#include <linux/version.h>
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/string.h>
> +#include <linux/delay.h>
> +#include <linux/slab.h>
> +#include <linux/mm.h>
> +#include <linux/spinlock.h>
> +
> +#include <asm/page.h>
> +#include <asm/io.h>
> +
> +#include "ipath_kernel.h"
> +
> +/*
> + * Our version of the kernel mlock function. This function is no longer
> + * exposed, so we need to do it ourselves.

Woah, um, don't you think that you should either export the main mlock
function itself, or fix your code to not need it? Rolling it yourself
isn't a good idea...

thanks,

greg k-h

2005-12-30 18:15:57

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 12 of 20] ipath - misc driver support code

> +int ipath_get_upages_nocopy(unsigned long start_page, struct page **p)
> +{
> + int n;
> + struct vm_area_struct *vm = NULL;
> +
> + down_read(&current->mm->mmap_sem);
> + n = get_user_pages(current, current->mm, start_page, 1, 1, 1, p, &vm);
> + up_read(&current->mm->mmap_sem);
> + if (n != 1) {
> + _IPATH_INFO("get_user_pages for 0x%lx failed with %d\n",
> + start_page, n);
> + if (n < 0) /* it's an errno */
> + return n;
> + /*
> + * If we ever ask for more than a single page, we will have to
> + * free the pages (if any) that we did get, via ipath_get_upages()
> + * or put_page() directly.
> + */
> + return -ENOMEM; /* no way to know actual error */
> + }
> + vm->vm_flags |= VM_SHM | VM_LOCKED;
> +
> + return 0;
> +}


I hope you're not depending on the VM_LOCKED thing.. since the user can
just undo that easily!

(this is also why all this "sys_mlock from the driver" is traditionally
buggy to the point of being a roothole, things like some of the binary
3D drivers have had this security hole for a long time, as did some of
the early infiniband drivers)
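
For illustration only (not from the patch), a minimal sketch of what proper
pinning looks like: take a reference on each page via get_user_pages() so the
pages survive munmap()/exit, and drop those references yourself on teardown.
Helper names here are made up, and the usual mm/io includes are assumed:

static int pin_user_buffer(unsigned long start, int npages, struct page **pages)
{
	int got, i;

	down_read(&current->mm->mmap_sem);
	/* takes a reference on every page it returns */
	got = get_user_pages(current, current->mm, start, npages,
			     1 /* write */, 0 /* no force */, pages, NULL);
	up_read(&current->mm->mmap_sem);

	if (got == npages)
		return 0;

	/* partial success: drop the references we did take */
	for (i = 0; i < got; i++)
		put_page(pages[i]);
	return got < 0 ? got : -EFAULT;
}

static void unpin_user_buffer(struct page **pages, int npages)
{
	int i;

	for (i = 0; i < npages; i++) {
		set_page_dirty_lock(pages[i]);	/* device may have written */
		put_page(pages[i]);
	}
}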

2005-12-30 18:46:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 10 of 20] ipath - core driver, part 3 of 4



All your user page lookup/pinning code is terminally broken.

You can't do it that way. You have several major conceptual bugs, like
keeping track of pages without incrementing their page count, and just
expecting that they are magically "pinned" even though you do nothing at
all to pin them. The process exits or does an munmap, and the page will be
used for something else, and you'll just corrupt totally random memory.

Similarly, you do page_address() on the page, which just can't work on
highmem pages.

Crap like this must not be merged. Drivers aren't supposed to play VM
tricks in the first place - even if they were to get it right (which they
never do). Don't do it.

Linus

2005-12-30 22:48:30

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 8 of 20] ipath - core driver, part 1 of 4

On Wed, Dec 28, 2005 at 04:31:27PM -0800, Bryan O'Sullivan wrote:
> Signed-off-by: Bryan O'Sullivan <[email protected]>
>
> diff -r ffbd416f30d4 -r ddd21709e12c drivers/infiniband/hw/ipath/ipath_driver.c
> --- /dev/null Thu Jan 1 00:00:00 1970 +0000
> +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800
> @@ -0,0 +1,1879 @@
> +/*
> + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * - Redistributions of source code must retain the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer.
> + *
> + * - Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer in the documentation and/or other materials
> + * provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Patent licenses, if any, provided herein do not apply to
> + * combinations of this program with other software, or any other
> + * product whatsoever.
> + */
> +
> +#include <linux/version.h>
> +#include <linux/pci.h>
> +#include <linux/delay.h>
> +#include <linux/swap.h>
> +#include <asm/mtrr.h>
> +#include <linux/netdevice.h>
> +
> +#include <linux/crc32.h> /* we can generate our own crc's for testing */
> +
> +#include "ipath_kernel.h"
> +#include "ips_common.h"
> +#include "ipath_layer.h"
> +
> +/*
> + * Our LSB-assigned major number, so scripts can figure
> + * out how to make entry in /dev.
> + */
> +
> +static int ipath_major = 233;
> +
> +/*
> + * number of buffers reserved for driver (layered drivers and SMA send).
> + * Reserved at end of buffer list.
> + */
> +
> +static uint infinipath_kpiobufs = 32;
> +
> +/*
> + * number of ports we are configured to use (to allow for more pio
> + * buffers per port, etc.) Zero means use chip value.
> + */
> +
> +static uint infinipath_cfgports;
> +
> +/*
> + * number of units we are configured to use (to allow for bringup on
> + * multi-chip systems) Zero means use only one for now, but eventually
> + * will mean to use infinipath_max
> + */
> +
> +static uint infinipath_cfgunits;
> +
> +uint64_t ipath_dummy_val_for_testing;
> +
> +static __kernel_pid_t ipath_sma_alive; /* PID of SMA, if it's running */
> +static spinlock_t ipath_sma_lock; /* SMA receive */
> +
> +/* max SM received packets we'll queue; we keep the most recent packets. */
> +
> +#define IPATH_NUM_SMAPKTS 16
> +
> +#define IPATH_SMA_HDRSZ (8+12+8) /* LRH+BTH+DETH */
> +
> +static struct _ipath_sma_rpkt {
> + /* length of received packet; non-zero if queued */
> + uint32_t len;
> + /* unit number of interface packet was received from */
> + uint32_t unit;
> + uint8_t *buf;
> +} ipath_sma_data[IPATH_NUM_SMAPKTS];
> +
> +static unsigned ipath_sma_first; /* oldest sma packet index */
> +static unsigned ipath_sma_next; /* next sma packet index to use */
> +
> +/*
> + * ipath_sma_data_bufs has one extra, pointed to by ipath_sma_data_spare,
> + * so we can exchange buffers to do copy_to_user, and not hold the lock
> + * across the copy_to_user().
> + */
> +
> +#define SMA_MAX_PKTSZ (IPATH_SMA_HDRSZ+256) /* max len of an SMA packet */
> +
> +static uint8_t ipath_sma_data_bufs[IPATH_NUM_SMAPKTS + 1][SMA_MAX_PKTSZ];
> +static uint8_t *ipath_sma_data_spare;
> +/* sma waits globally on all units */
> +static wait_queue_head_t ipath_sma_wait;
> +static wait_queue_head_t ipath_sma_state_wait;
> +
> +struct infinipath_stats ipath_stats;
> +
> +/*
> + * this will only be used for diags, now that we have enabled the DMA
> + * of the sendpioavail regs to system memory.
> + */
> +
> +static inline uint64_t ipath_kget_sreg(const ipath_type stype,
> + ipath_sreg regno)
> +{
> + uint64_t val;
> + uint64_t *sbase;
> +
> + sbase = (uint64_t *) (devdata[stype].ipath_sregbase
> + + (char *)devdata[stype].ipath_kregbase);
> + val = sbase ? sbase[regno] : 0ULL;
> + return val;
> +}
> +
> +static int ipath_do_user_init(struct ipath_portdata *,
> + struct ipath_user_info __user *);
> +static int ipath_get_baseinfo(struct ipath_portdata *,
> + struct ipath_base_info __user *);
> +static int ipath_get_units(void);
> +static int ipath_wr_eeprom(struct ipath_portdata *,
> + struct ipath_eeprom_req __user *);
> +static int ipath_wait_intr(struct ipath_portdata *, uint32_t);
> +static int ipath_tid_update(struct ipath_portdata *, struct _tidupd __user *);
> +static int ipath_tid_free(struct ipath_portdata *, struct _tidupd __user *);
> +static int ipath_get_counters(ipath_type, struct infinipath_counters __user *);
> +static int ipath_get_unit_counters(struct infinipath_getunitcounters __user *a);
> +static int ipath_get_stats(struct infinipath_stats __user *);
> +static int ipath_set_partkey(struct ipath_portdata *, uint16_t);
> +static int ipath_manage_rcvq(struct ipath_portdata *, uint16_t);
> +static void ipath_clean_partkey(struct ipath_portdata *,
> + struct ipath_devdata *);
> +static void ipath_disarm_piobufs(const ipath_type, unsigned, unsigned);
> +static int ipath_create_user_egr(struct ipath_portdata *);
> +static int ipath_create_port0_egr(struct ipath_portdata *);
> +static int ipath_create_rcvhdrq(struct ipath_portdata *);
> +static void ipath_handle_errors(const ipath_type, uint64_t);
> +static void ipath_update_pio_bufs(const ipath_type);
> +static int ipath_shutdown_link(const ipath_type);
> +static int ipath_bringup_link(const ipath_type);
> +int ipath_bringup_serdes(const ipath_type);
> +static void ipath_get_faststats(unsigned long);
> +static int ipath_setup_htconfig(struct pci_dev *, uint64_t *, const ipath_type);
> +static struct page *ipath_nopage(struct vm_area_struct *, unsigned long, int *);
> +static irqreturn_t ipath_intr(int irq, void *devid, struct pt_regs *regs);
> +static void ipath_decode_err(char *, size_t, uint64_t);
> +void ipath_free_pddata(struct ipath_devdata *, uint32_t, int);
> +static void ipath_clear_tids(const ipath_type, unsigned);
> +static void ipath_get_guid(const ipath_type);
> +static int ipath_sma_ioctl(struct file *, unsigned int, unsigned long);
> +static int ipath_rcvsma_pkt(struct ipath_sendpkt __user *);
> +static int ipath_kset_lid(uint32_t);
> +static int ipath_kset_mlid(uint32_t);
> +static int ipath_get_mlid(uint32_t __user *);
> +static int ipath_get_devstatus(uint64_t __user *);
> +static int ipath_kset_guid(struct ipath_setguid __user *);
> +static int ipath_get_portinfo(uint32_t __user *);
> +static int ipath_get_nodeinfo(uint32_t __user *);
> +#ifdef _IPATH_EXTRA_DEBUG
> +static void ipath_dump_allregs(char *, ipath_type);
> +#endif
> +
> +static const char ipath_sma_name[] = "infinipath_SMA";
> +
> +/*
> + * is diags mode enabled? if it is, then things like auto bringup of
> + * links is disabled
> + */
> +
> +int ipath_diags_enabled = 0;
> +
> +void ipath_chip_done(void)
> +{
> +}
> +
> +void ipath_chip_cleanup(struct ipath_devdata * dd)
> +{
> +}

What are these two empty functions for?

> +/*
> + * cache aligned location
> + *
> + * where port 0 rcvhdrtail register is written back; also want
> + * nothing else sharing the cache line, so make it a cache line in size
> + * used for all units
> + *
> + * This is volatile as it's the target of a DMA from the chip.
> + */
> +
> +static volatile uint64_t ipath_port0_rcvhdrtail[512]
> + __attribute__ ((aligned(4096)));
> +
> +#define MODNAME "ipath_core"
> +#define DRIVER_LOAD_MSG "PathScale " MODNAME " loaded: "
> +#define PFX MODNAME ": "
> +
> +/*
> + * min buffers we want to have per port, after driver
> + */
> +
> +#define IPATH_MIN_USER_PORT_BUFCNT 8
> +
> +/* The size has to be longer than this string, so we can
> + * append board/chip information to it in the init code.
> + */
> +static char ipath_core_version[192] = IPATH_IDSTR;
> +static char *chip_driver_version;
> +static int chip_driver_size;
> +
> +/* mylid and lidbase are to deal with LIDs in "fabric", until SM is working */
> +
> +module_param(infinipath_debug, uint, 0644);
> +module_param(infinipath_kpiobufs, uint, 0644);
> +module_param(infinipath_cfgports, uint, 0644);
> +module_param(infinipath_cfgunits, uint, 0644);
> +
> +MODULE_PARM_DESC(infinipath_debug, "mask for debug prints");
> +MODULE_PARM_DESC(infinipath_cfgports, "Set max number of ports to use");
> +MODULE_PARM_DESC(infinipath_cfgunits, "Set max number of devices to use");
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("PathScale <[email protected]>");
> +MODULE_DESCRIPTION("Pathscale InfiniPath driver");
> +
> +#ifdef IPATH_DIAG
> +static __kernel_pid_t ipath_diag_alive; /* PID of diags, if running */
> +int ipath_diags_ioctl(struct file *, unsigned, unsigned long);
> +static int ipath_opendiag(struct inode *, struct file *);
> +#endif
> +
> +#if __IPATH_INFO || __IPATH_DBG
> +static const char *ipath_ibcstatus_str[] = {
> + "Disabled",
> + "LinkUp",
> + "PollActive",
> + "PollQuiet",
> + "SleepDelay",
> + "SleepQuiet",
> + "LState6", /* unused */
> + "LState7", /* unused */
> + "CfgDebounce",
> + "CfgRcvfCfg",
> + "CfgWaitRmt",
> + "CfgIdle",
> + "RecovRetrain",
> + "LState0xD", /* unused */
> + "RecovWaitRmt",
> + "RecovIdle",
> +};
> +#endif
> +
> +static ssize_t show_version(struct device_driver *dev, char *buf)
> +{
> + return snprintf(buf, PAGE_SIZE, "%s", ipath_core_version);
> +}
> +
> +static ssize_t show_status(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct ipath_devdata *dd = dev_get_drvdata(dev);
> +
> + if (!dd)
> + return -EINVAL;
> +
> + if (!dd->ipath_statusp)
> + return -EINVAL;
> +
> + return snprintf(buf, PAGE_SIZE, "%llx\n", *(dd->ipath_statusp));
> +}
> +
> +static const char *ipath_status_str[] = {
> + "Initted",
> + "Disabled",
> + "4", /* unused */
> + "OIB_SMA",
> + "SMA",
> + "Present",
> + "IB_link_up",
> + "IB_configured",
> + "NoIBcable",
> + "Fatal_Hardware_Error",
> + NULL,
> +};
> +
> +static ssize_t show_status_str(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct ipath_devdata *dd = dev_get_drvdata(dev);
> + int i, any;
> + uint64_t s;
> +
> + if (!dd)
> + return -EINVAL;
> +
> + if (!dd->ipath_statusp)
> + return -EINVAL;
> +
> + s = *(dd->ipath_statusp);
> + *buf = '\0';
> + for (any = i = 0; s && ipath_status_str[i]; i++) {
> + if (s & 1) {
> + if (any && strlcat(buf, " ", PAGE_SIZE) >= PAGE_SIZE)
> + /* overflow */
> + break;
> + if (strlcat(buf, ipath_status_str[i],
> + PAGE_SIZE) >= PAGE_SIZE)
> + break;
> + any = 1;
> + }
> + s >>= 1;
> + }
> + if (any)
> + strlcat(buf, "\n", PAGE_SIZE);
> +
> + return strlen(buf);
> +}

how big can this "status string" be? If it's even getting close to
PAGE_SIZE, this doesn't need to be a sysfs attribute, but you should
break it up into its individual pieces.

Based on the table above, this function can get much simpler...
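
Something along these lines, perhaps (sketch only, reusing the
ipath_status_str[] table quoted above; the error value and exact formatting
are illustrative):

static ssize_t show_status_str(struct device *dev,
			       struct device_attribute *attr, char *buf)
{
	struct ipath_devdata *dd = dev_get_drvdata(dev);
	ssize_t len = 0;
	uint64_t s;
	int i;

	if (!dd || !dd->ipath_statusp)
		return -ENODEV;

	s = *dd->ipath_statusp;
	/* walk the bit names directly; scnprintf never overruns the page */
	for (i = 0; ipath_status_str[i]; i++)
		if (s & (1ULL << i))
			len += scnprintf(buf + len, PAGE_SIZE - len, "%s%s",
					 len ? " " : "", ipath_status_str[i]);
	len += scnprintf(buf + len, PAGE_SIZE - len, "\n");
	return len;
}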

> +
> +static ssize_t show_lid(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct ipath_devdata *dd = dev_get_drvdata(dev);
> +
> + if (!dd)
> + return -EINVAL;
> +
> + return snprintf(buf, PAGE_SIZE, "%x\n", dd->ipath_lid);
> +}
> +
> +static ssize_t show_mlid(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct ipath_devdata *dd = dev_get_drvdata(dev);
> +
> + if (!dd)
> + return -EINVAL;
> +
> + return snprintf(buf, PAGE_SIZE, "%x\n", dd->ipath_mlid);
> +}
> +
> +static ssize_t show_guid(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct ipath_devdata *dd = dev_get_drvdata(dev);
> + uint8_t *guid;
> +
> + if (!dd)
> + return -EINVAL;
> +
> + guid = (uint8_t *)&(dd->ipath_guid);
> +
> + return snprintf(buf, PAGE_SIZE, "%x:%x:%x:%x:%x:%x:%x:%x\n",
> + guid[0], guid[1], guid[2], guid[3], guid[4], guid[5],
> + guid[6], guid[7]);
> +}
> +
> +static ssize_t show_nguid(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct ipath_devdata *dd = dev_get_drvdata(dev);
> +
> + if (!dd)
> + return -EINVAL;
> +
> + return snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_nguid);
> +}
> +
> +static ssize_t show_serial(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct ipath_devdata *dd = dev_get_drvdata(dev);
> +
> + if (!dd)
> + return -EINVAL;
> +
> + buf[sizeof dd->ipath_serial] = '\0';
> + memcpy(buf, dd->ipath_serial, sizeof dd->ipath_serial);
> + strcat(buf, "\n");
> + return strlen(buf);
> +}
> +
> +static ssize_t show_unit(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct ipath_devdata *dd = dev_get_drvdata(dev);
> +
> + if (!dd)
> + return -EINVAL;

Don't you mean -ENODEV?

> +
> + snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_unit);
> + return strlen(buf);

return the snprintf() call instead of calling strlen() all the time
please.
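
i.e. roughly (sketch, with the -ENODEV fix from above folded in):

static ssize_t show_unit(struct device *dev,
			 struct device_attribute *attr, char *buf)
{
	struct ipath_devdata *dd = dev_get_drvdata(dev);

	if (!dd)
		return -ENODEV;

	return snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_unit);
}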

> +}
> +
> +static DRIVER_ATTR(version, S_IRUGO, show_version, NULL);
> +static DEVICE_ATTR(status, S_IRUGO, show_status, NULL);
> +static DEVICE_ATTR(status_str, S_IRUGO, show_status_str, NULL);
> +static DEVICE_ATTR(lid, S_IRUGO, show_lid, NULL);
> +static DEVICE_ATTR(mlid, S_IRUGO, show_mlid, NULL);
> +static DEVICE_ATTR(guid, S_IRUGO, show_guid, NULL);
> +static DEVICE_ATTR(nguid, S_IRUGO, show_nguid, NULL);
> +static DEVICE_ATTR(serial, S_IRUGO, show_serial, NULL);
> +static DEVICE_ATTR(unit, S_IRUGO, show_unit, NULL);
> +
> +/*
> + * called from add_timer and user counter read calls, to deal with
> + * counters that wrap in "human time". The words sent and received, and
> + * the packets sent and received are all that we worry about. For now,
> + * at least, we don't worry about error counters, because if they wrap
> + * that quickly, we probably don't care. We may eventually just make this
> + * handle all the counters. word counters can wrap in about 20 seconds
> + * of full bandwidth traffic, packet counters in a few hours.
> + */
> +
> +uint64_t ipath_snap_cntr(const ipath_type t, ipath_creg creg)
> +{
> + uint32_t val;
> + uint64_t val64, t0, t1;
> + struct ipath_devdata *dd = &devdata[t];
> + static uint64_t one_sec_in_cycles;
> + extern uint32_t _ipath_pico_per_cycle;
> +
> + if (!one_sec_in_cycles && _ipath_pico_per_cycle)
> + one_sec_in_cycles = 1000000000000UL / _ipath_pico_per_cycle;
> +
> + t0 = get_cycles();
> + val = ipath_kget_creg32(t, creg);
> + t1 = get_cycles();
> + if ((t1 - t0) > one_sec_in_cycles && val == -1) {
> + /*
> + * This is just a way to detect things that are quite broken.
> + * Normally this should take just a few cycles (the check is
> + * for long enough that we don't care if we get pre-empted.)
> + * An Opteron HT O read timeout is 4 seconds with normal
> + * NB values
> + */
> +
> + _IPATH_UNIT_ERROR(t, "Error! Reading counter 0x%x timed out\n",
> + creg);
> + return 0ULL;
> + }
> +
> + if (creg == cr_wordsendcnt) {
> + if (val != dd->ipath_lastsword) {
> + dd->ipath_sword += val - dd->ipath_lastsword;
> + dd->ipath_lastsword = val;
> + }
> + val64 = dd->ipath_sword;
> + } else if (creg == cr_wordrcvcnt) {
> + if (val != dd->ipath_lastrword) {
> + dd->ipath_rword += val - dd->ipath_lastrword;
> + dd->ipath_lastrword = val;
> + }
> + val64 = dd->ipath_rword;
> + } else if (creg == cr_pktsendcnt) {
> + if (val != dd->ipath_lastspkts) {
> + dd->ipath_spkts += val - dd->ipath_lastspkts;
> + dd->ipath_lastspkts = val;
> + }
> + val64 = dd->ipath_spkts;
> + } else if (creg == cr_pktrcvcnt) {
> + if (val != dd->ipath_lastrpkts) {
> + dd->ipath_rpkts += val - dd->ipath_lastrpkts;
> + dd->ipath_lastrpkts = val;
> + }
> + val64 = dd->ipath_rpkts;
> + } else
> + val64 = (uint64_t) val;
> +
> + return val64;
> +}
> +
> +/*
> + * print the delta of egrfull/hdrqfull errors for kernel ports no more
> + * than every 5 seconds. User processes are printed at close, but kernel
> + * doesn't close, so... Separate routine so may call from other places
> + * someday, and so function name when printed by _IPATH_INFO is meaningfull
> + */
> +
> +static void ipath_qcheck(const ipath_type t)
> +{
> + static uint64_t last_tot_hdrqfull;
> + size_t blen = 0;
> + struct ipath_devdata *dd = &devdata[t];
> + char buf[128];
> +
> + *buf = 0;
> + if (dd->ipath_pd[0]->port_hdrqfull != dd->ipath_p0_hdrqfull) {
> + blen = snprintf(buf, sizeof buf, "port 0 hdrqfull %u",
> + dd->ipath_pd[0]->port_hdrqfull -
> + dd->ipath_p0_hdrqfull);
> + dd->ipath_p0_hdrqfull = dd->ipath_pd[0]->port_hdrqfull;
> + }
> + if (ipath_stats.sps_etidfull != dd->ipath_last_tidfull) {
> + blen +=
> + snprintf(buf + blen, sizeof buf - blen, "%srcvegrfull %llu",
> + blen ? ", " : "",
> + ipath_stats.sps_etidfull - dd->ipath_last_tidfull);
> + dd->ipath_last_tidfull = ipath_stats.sps_etidfull;
> + }
> +
> + /*
> + * this is actually the number of hdrq full interrupts, not actual
> + * events, but at the moment that's mostly what I'm interested in.
> + * Actual count, etc. is in the counters, if needed. For production
> + * users this won't ordinarily be printed.
> + */
> +
> + if ((infinipath_debug & (__IPATH_PKTDBG | __IPATH_DBG)) &&
> + ipath_stats.sps_hdrqfull != last_tot_hdrqfull) {
> + blen +=
> + snprintf(buf + blen, sizeof buf - blen,
> + "%shdrqfull %llu (all ports)", blen ? ", " : "",
> + ipath_stats.sps_hdrqfull - last_tot_hdrqfull);
> + last_tot_hdrqfull = ipath_stats.sps_hdrqfull;
> + }
> + if (blen)
> + _IPATH_DBG("%s\n", buf);
> +
> + if (*dd->ipath_hdrqtailptr != dd->ipath_port0head) {
> + if (dd->ipath_lastport0rcv_cnt == ipath_stats.sps_port0pkts) {
> + _IPATH_PDBG("missing rcv interrupts? port0 hd=%llx tl=%x; port0pkts %llx\n",
> + *dd->ipath_hdrqtailptr, dd->ipath_port0head,ipath_stats.sps_port0pkts);
> + ipath_kreceive(t);
> + }
> + dd->ipath_lastport0rcv_cnt = ipath_stats.sps_port0pkts;
> + }
> +}
> +
> +/*
> + * called from add_timer to get word counters from chip before they
> + * can overflow
> + */
> +
> +static void ipath_get_faststats(unsigned long t)
> +{
> + uint32_t val;
> + struct ipath_devdata *dd = &devdata[t];
> + static unsigned cnt;
> +
> + /*
> + * don't access the chip while running diags, or memory diags
> + * can fail
> + */
> + if (!dd->ipath_kregbase || !(dd->ipath_flags & IPATH_PRESENT) ||
> + ipath_diags_enabled) {
> + /* but re-arm the timer, for diags case; won't hurt other */
> + goto done;
> + }
> +
> + ipath_snap_cntr((ipath_type) t, cr_wordsendcnt);
> + ipath_snap_cntr((ipath_type) t, cr_wordrcvcnt);
> + ipath_snap_cntr((ipath_type) t, cr_pktsendcnt);
> + ipath_snap_cntr((ipath_type) t, cr_pktrcvcnt);
> +
> + ipath_qcheck(t);
> +
> + /*
> + * deal with repeat error suppression. Doesn't really matter if
> + * last error was almost a full interval ago, or just a few usecs
> + * ago; still won't get more than 2 per interval. We may want
> + * longer intervals for this eventually, could do with mod, counter
> + * or separate timer. Also see code in ipath_handle_errors() and
> + * ipath_handle_hwerrors().
> + */
> +
> + if (dd->ipath_lasterror)
> + dd->ipath_lasterror = 0;
> + if (dd->ipath_lasthwerror)
> + dd->ipath_lasthwerror = 0;
> + if ((devdata[t].ipath_maskederrs & ~devdata[t].ipath_ignorederrs)
> + && get_cycles() > devdata[t].ipath_unmasktime) {
> + char ebuf[256];
> + ipath_decode_err(ebuf, sizeof ebuf,
> + (devdata[t].ipath_maskederrs & ~devdata[t].
> + ipath_ignorederrs));
> + if ((devdata[t].ipath_maskederrs & ~devdata[t].
> + ipath_ignorederrs)
> + & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL)) {
> + _IPATH_UNIT_ERROR(t, "Re-enabling masked errors (%s)\n",
> + ebuf);
> + } else {
> + /*
> + * rcvegrfull and rcvhdrqfull are "normal",
> + * for some types of processes (mostly benchmarks)
> + * that send huge numbers of messages, while
> + * not processing them. So only complain about
> + * these at debug level.
> + */
> + _IPATH_DBG
> + ("Disabling frequent queue full errors (%s)\n",
> + ebuf);
> + }
> + devdata[t].ipath_maskederrs = devdata[t].ipath_ignorederrs;
> + ipath_kput_kreg(t, kr_errormask, ~devdata[t].ipath_maskederrs);
> + }
> +
> + if (dd->ipath_flags & IPATH_LINK_SLEEPING) {
> + uint64_t ibc;
> + _IPATH_VDBG("linkinitcmd SLEEP, move to POLL\n");
> + dd->ipath_flags &= ~IPATH_LINK_SLEEPING;
> + ibc = dd->ipath_ibcctrl;
> + /*
> + * don't put linkinitcmd in ipath_ibcctrl, want that to
> + * stay a NOP
> + */
> + ibc |=
> + INFINIPATH_IBCC_LINKINITCMD_POLL <<
> + INFINIPATH_IBCC_LINKINITCMD_SHIFT;
> + ipath_kput_kreg(t, kr_ibcctrl, ibc);
> + }
> +
> + /* limit qfull messages to ~one per minute per port */
> + if ((++cnt & 0x10)) {
> + for (val = devdata[t].ipath_cfgports - 1; ((int)val) >= 0;
> + val--) {
> + if (dd->ipath_lastegrheads[val] != -1)
> + dd->ipath_lastegrheads[val] = -1;
> + if (dd->ipath_lastrcvhdrqtails[val] != -1)
> + dd->ipath_lastrcvhdrqtails[val] = -1;
> + }
> + }
> +
> + if (dd->ipath_nosma_bufs) {
> + dd->ipath_nosma_secs += 5;
> + if (dd->ipath_nosma_secs >= 30) {
> + _IPATH_SMADBG("No SMA bufs avail %u seconds; cancelling pending sends\n",
> + dd->ipath_nosma_secs);
> + ipath_disarm_piobufs(t, dd->ipath_lastport_piobuf,
> + dd->ipath_piobcnt - dd->ipath_lastport_piobuf);
> + dd->ipath_nosma_secs = 0; /* start again, if necessary */
> + }
> + else
> + _IPATH_SMADBG("No SMA bufs avail %u tries, after %u seconds\n",
> + dd->ipath_nosma_bufs, dd->ipath_nosma_secs);
> + }
> +
> +done:
> + mod_timer(&dd->ipath_stats_timer, jiffies + HZ * 5);
> +}
> +
> +
> +static void __devexit infinipath_remove_one(struct pci_dev *);
> +static int infinipath_init_one(struct pci_dev *, const struct pci_device_id *);
> +
> +/* Only needed for registration, nothing else needs this info */
> +#define PCI_VENDOR_ID_PATHSCALE 0x1fc1
> +#define PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT 0xd
> +
> +const struct pci_device_id infinipath_pci_tbl[] = {
> + {
> + PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT,
> + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},

PCI_DEVICE() instead?

> + {0,}

{},
is all that is needed here.
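
i.e. something like (sketch):

static const struct pci_device_id infinipath_pci_tbl[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_PATHSCALE,
		     PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT) },
	{ }
};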

> +};
> +
> +MODULE_DEVICE_TABLE(pci, infinipath_pci_tbl);
> +
> +static struct pci_driver infinipath_driver = {
> + .name = MODNAME,
> + .driver.owner = THIS_MODULE,

This line is not needed, you can remove it.

> + .probe = infinipath_init_one,
> + .remove = __devexit_p(infinipath_remove_one),
> + .id_table = infinipath_pci_tbl,
> +};
> +
> +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC)
> +int remap_area_pages(unsigned long address, unsigned long phys_addr,
> + unsigned long size, unsigned long flags);
> +#endif
> +
> +static int infinipath_init_one(struct pci_dev *pdev,
> + const struct pci_device_id *ent)
> +{
> + int ret, len, j;
> + static int chip_idx = -1;
> + unsigned long addr;
> + uint64_t intconfig;
> + uint8_t rev;
> + ipath_type dev;
> +
> + /*
> + * XXX: Right now, we have a hardcoded array of devices. We'll
> + * change this in a future release, but not just yet. For the
> + * moment, we're limited to 4 infinipath devices per system.
> + */
> +
> + dev = ++chip_idx;
> +
> + _IPATH_VDBG("initializing unit #%u\n", dev);
> + if ((!infinipath_cfgunits && (dev >= 1)) ||
> + (infinipath_cfgunits && (dev >= infinipath_cfgunits)) ||
> + (dev >= infinipath_max)) {
> + _IPATH_ERROR("Trying to initialize unit %u, max is %u\n",
> + dev, infinipath_max - 1);
> + return -EINVAL;
> + }
> +
> + devdata[dev].pci_registered = 1;
> + devdata[dev].ipath_unit = dev;
> +
> + if ((ret = pci_enable_device(pdev))) {
> + _IPATH_DBG("pci_enable unit %u failed: %x\n", dev, ret);
> + }

{} not needed here.

> +
> + if ((ret = pci_request_regions(pdev, MODNAME)))
> + _IPATH_INFO("pci_request_regions unit %u fails: %d\n", dev,
> + ret);
> +
> + if ((ret = pci_set_dma_mask(pdev, DMA_64BIT_MASK)) != 0)
> + _IPATH_INFO("pci_set_dma_mask unit %u fails: %d\n", dev, ret);
> +
> + pci_set_master(pdev); /* probably not be needed for HT */
> +
> + addr = pci_resource_start(pdev, 0);
> + len = pci_resource_len(pdev, 0);
> + _IPATH_VDBG
> + ("regbase (0) %lx len %d irq %x, vend %x/%x driver_data %lx\n",
> + addr, len, pdev->irq, ent->vendor, ent->device, ent->driver_data);
> + devdata[dev].ipath_deviceid = ent->device; /* save for later use */
> + devdata[dev].ipath_vendorid = ent->vendor;
> + for (j = 0; j < 6; j++) {
> + if (!pdev->resource[j].start)
> + continue;
> + _IPATH_VDBG("BAR %d start %lx, end %lx, len %lx\n",
> + j, pdev->resource[j].start,
> + pdev->resource[j].end, pci_resource_len(pdev, j));
> + }
> +
> + if (!addr) {
> + _IPATH_UNIT_ERROR(dev, "No valid address in BAR 0!\n");
> + return -ENODEV;
> + }
> +
> + if ((ret = pci_read_config_byte(pdev, PCI_REVISION_ID, &rev))) {
> + _IPATH_UNIT_ERROR(dev,
> + "Failed to read PCI revision ID unit %u: %d\n",
> + dev, ret);
> + return ret; /* shouldn't ever happen */
> + } else
> + devdata[dev].ipath_pcirev = rev;
> +
> + devdata[dev].ipath_kregbase = ioremap_nocache(addr, len);
> +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC)
> + printk("Remapping pages WC\n");

No KERN_ level?

> + remap_area_pages((unsigned long) devdata[dev].ipath_kregbase +
> + 1024 * 1024, addr + 1024 * 1024, 1024 * 1024,
> + _PAGE_MA_WC);
> + /* devdata[dev].ipath_kregbase = __ioremap(addr, len, _PAGE_MA_WC); */
> +#endif
> +
> + if (!devdata[dev].ipath_kregbase) {
> + _IPATH_DBG("Unable to map io addr %lx to kvirt, failing\n",
> + addr);
> + ret = -ENOMEM;
> + goto fail;
> + }
> + devdata[dev].ipath_kregend = (uint64_t __iomem *)
> + ((void __iomem *) devdata[dev].ipath_kregbase + len);
> + devdata[dev].ipath_physaddr = addr; /* used for io_remap, etc. */
> + /* for user mmap */
> + devdata[dev].ipath_kregvirt = (uint64_t __iomem *) phys_to_virt(addr);
> + _IPATH_VDBG("mapped io addr %lx to kregbase %p kregvirt %p\n", addr,
> + devdata[dev].ipath_kregbase, devdata[dev].ipath_kregvirt);
> +
> + /*
> + * set these up before registering the interrupt handler, just
> + * in case
> + */
> + devdata[dev].pcidev = pdev;
> + pci_set_drvdata(pdev, &(devdata[dev]));

It's not a "just in case" type thing, you have to do this before you
register that interrupt handler, as you can be instantly called here.

Are you sure everything else is set up properly here before calling that
function?

> +
> + /*
> + * set up our interrupt handler; SA_SHIRQ probably not needed,
> + * but won't hurt for now.
> + */
> +
> + if (!pdev->irq) {
> + _IPATH_UNIT_ERROR(dev, "irq is 0, failing init\n");
> + ret = -EINVAL;
> + goto fail;
> + }
> + if ((ret = request_irq(pdev->irq, ipath_intr,
> + SA_SHIRQ, MODNAME, &devdata[dev]))) {
> + _IPATH_UNIT_ERROR(dev,
> + "Couldn't setup interrupt handler, irq=%u: %d\n",
> + pdev->irq, ret);
> + goto fail;
> + }
> +
> + /*
> + * clear ipath_flags here instead of in ipath_init_chip as it is set
> + * by ipath_setup_htconfig.
> + */
> + devdata[dev].ipath_flags = 0;
> + if (ipath_setup_htconfig(pdev, &intconfig, dev))
> + _IPATH_DBG
> + ("Failed to setup HT config, continuing anyway for now\n");
> +
> + ret = ipath_init_chip(dev); /* do the chip-specific init */
> + if (!ret) {
> +#ifdef CONFIG_MTRR
> + uint64_t pioaddr, piolen;
> + unsigned bits;
> + /*
> + * Set the PIO buffers to be WCCOMB, so we get HT bursts
> + * to the chip. Linux (possibly the hardware) requires
> + * it to be on a power of 2 address matching the length
> + * (which has to be a power of 2). For rev1, that means
> + * the base address, for rev2, it will be just the PIO
> + * buffers themselves.
> + */
> + pioaddr = addr + devdata[dev].ipath_piobufbase;
> + piolen = devdata[dev].ipath_piobcnt *
> + ALIGN(devdata[dev].ipath_piosize,
> + devdata[dev].ipath_palign);
> +
> + for (bits = 0; !(piolen & (1ULL << bits)); bits++)
> + /* do nothing */;
> +
> + if (piolen != (1ULL << bits)) {
> + _IPATH_DBG("piolen 0x%llx not power of 2, bits=%u\n",
> + piolen, bits);
> + piolen >>= bits;
> + while (piolen >>= 1)
> + bits++;
> + piolen = 1ULL << (bits + 1);
> + _IPATH_DBG("Changed piolen to 0x%llx bits=%u\n", piolen,
> + bits);
> + }
> + if (pioaddr & (piolen - 1)) {
> + uint64_t atmp;
> + _IPATH_DBG
> + ("pioaddr %llx not on right boundary for size %llx, fixing\n",
> + pioaddr, piolen);
> + atmp = pioaddr & ~(piolen - 1);
> + if (atmp < addr || (atmp + piolen) > (addr + len)) {
> + _IPATH_UNIT_ERROR(dev,
> + "No way to align address/size (%llx/%llx), no WC mtrr\n",
> + atmp, piolen << 1);
> + ret = -ENODEV;
> + } else {
> + _IPATH_DBG
> + ("changing WC base from %llx to %llx, len from %llx to %llx\n",
> + pioaddr, atmp, piolen, piolen << 1);
> + pioaddr = atmp;
> + piolen <<= 1;
> + }
> + }
> +
> + if (!ret) {
> + int cookie;
> + _IPATH_VDBG
> + ("Setting mtrr for chip to WC (addr %llx, len=0x%llx)\n",
> + pioaddr, piolen);
> + cookie = mtrr_add(pioaddr, piolen, MTRR_TYPE_WRCOMB, 0);
> + if (cookie < 0) {
> + _IPATH_INFO
> + ("mtrr_add(%llx,0x%llx,WC,0) failed (%d)\n",
> + pioaddr, piolen, cookie);
> + ret = -EINVAL;
> + } else {
> + _IPATH_VDBG
> + ("Set mtrr for chip to WC, cookie is %d\n",
> + cookie);
> + devdata[dev].ipath_mtrr = (uint32_t) cookie;
> + }
> + }
> +#endif /* CONFIG_MTRR */
> + }
> +
> + if (!ret && devdata[dev].ipath_kregbase && (devdata[dev].ipath_flags
> + & IPATH_PRESENT)) {
> + /*
> + * for the hardware, enable interrupts only after
> + * kr_interruptconfig is written, if we could set it up
> + */
> + if (intconfig) {
> + /* interrupt address */
> + ipath_kput_kreg(dev, kr_interruptconfig, intconfig);
> + /* enable all interrupts */
> + ipath_kput_kreg(dev, kr_intmask, -1LL);
> + /* force re-interrupt of any pending interrupts. */
> + ipath_kput_kreg(dev, kr_intclear, 0ULL);
> + /* OK, the chip is usable, marked it as initialized */
> + *devdata[dev].ipath_statusp |= IPATH_STATUS_INITTED;
> + } else
> + _IPATH_UNIT_ERROR(dev,
> + "No interrupts enabled, couldn't setup interrupt address\n");
> + } else if (ret != -EPERM)
> + _IPATH_INFO("Not configuring unit %u interrupts, init failed\n",
> + dev);
> +
> + device_create_file(&(pdev->dev), &dev_attr_status);
> + device_create_file(&(pdev->dev), &dev_attr_status_str);
> + device_create_file(&(pdev->dev), &dev_attr_lid);
> + device_create_file(&(pdev->dev), &dev_attr_mlid);
> + device_create_file(&(pdev->dev), &dev_attr_guid);
> + device_create_file(&(pdev->dev), &dev_attr_nguid);
> + device_create_file(&(pdev->dev), &dev_attr_serial);
> + device_create_file(&(pdev->dev), &dev_attr_unit);

Why not use an attribute array? Makes for proper error handling if one
of those calls does not work...
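
For example, a sketch using an attribute group, so creation is a single call
whose return value you can actually check (the dev_attr_* names reuse the
DEVICE_ATTRs quoted above):

static struct attribute *ipath_dev_attrs[] = {
	&dev_attr_status.attr,
	&dev_attr_status_str.attr,
	&dev_attr_lid.attr,
	&dev_attr_mlid.attr,
	&dev_attr_guid.attr,
	&dev_attr_nguid.attr,
	&dev_attr_serial.attr,
	&dev_attr_unit.attr,
	NULL
};

static struct attribute_group ipath_attr_group = {
	.attrs = ipath_dev_attrs
};

	/* in infinipath_init_one(): */
	ret = sysfs_create_group(&pdev->dev.kobj, &ipath_attr_group);
	if (ret)
		goto fail;

	/* in infinipath_remove_one(): */
	sysfs_remove_group(&pdev->dev.kobj, &ipath_attr_group);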

> + /*
> + * We used to cleanup here, with pci_release_regions, etc. but that
> + * can cause other problems if we want to run diags, etc., so instead
> + * defer that until driver unload.
> + */

So memory leaks are acceptable?

> +fail: /* after we've done at least some of the pci setup */
> + if (ret == -EPERM) /* disabled device, don't want module load error;
> + * just want to carry status through to this point */
> + ret = 0;

A module load error does not happen no matter what return value you send
back from this function. So the comment is wrong, and papering over the
fact that you failed to initialize the device is also wrong; please don't
do this.

thanks,

greg k-h

2005-12-30 22:48:29

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 11 of 20] ipath - core driver, part 4 of 4

On Wed, Dec 28, 2005 at 04:31:30PM -0800, Bryan O'Sullivan wrote:
> Signed-off-by: Bryan O'Sullivan <[email protected]>
>
> diff -r c37b118ef806 -r e8af3873b0d9 drivers/infiniband/hw/ipath/ipath_driver.c
> --- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800
> +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:43 2005 -0800
> @@ -5408,3 +5408,1709 @@

Clever use of 4 patches to just add onto the same file. This has grown
into a huge file, can't you split it up into smaller pieces?

> +int __init infinipath_init(void)
> +{
> + int r = 0, i;
> +
> + _IPATH_DBG(KERN_INFO DRIVER_LOAD_MSG "%s", ipath_core_version);
> +
> + ipath_init_picotime(); /* init cycles -> pico conversion */
> +
> + /*
> + * initialize the statusp to temporary storage so we can use it
> + * everywhere without first checking. When we "really" assign it,
> + * we copy from _ipath_status
> + */
> + for (i = 0; i < infinipath_max; i++)
> + devdata[i].ipath_statusp = &devdata[i]._ipath_status;
> +
> + /*
> + * init these early, in case we take an interrupt as soon as the irq
> + * is setup. Saw a spinlock panic once that appeared to be due to that
> + * problem, when they were initted later on.
> + */
> + spin_lock_init(&ipath_pioavail_lock);
> + spin_lock_init(&ipath_sma_lock);
> +
> + pci_register_driver(&infinipath_driver);
> +
> + driver_create_file(&(infinipath_driver.driver), &driver_attr_version);
> +
> + if ((r = register_chrdev(ipath_major, MODNAME, &ipath_fops)))
> + _IPATH_ERROR("Unable to register %s device\n", MODNAME);

Why even save off the return value if you don't do anything with it?

And please don't put assignments in the middle of if statements, that's
just messy and harder to read (the fact that gcc made you put an extra
() should be a hint that you were doing something wrong...)

And does your driver work with udev? I didn't see where you were
exporting the major:minor number of your devices to sysfs, but I might
have missed it.

> + /*
> + * never return an error, since we could have stuff registered,
> + * resources used, etc., even if no hardware found. This way we
> + * can clean up through unload.
> + */
> + return 0;
> +}

Are you sure that's a good idea? Please do the proper thing and tear
down your infrastructure if something fails, that's the correct thing to
do. That way you can actually recover if something that you call in
this function fails (like driver_create_file(), or
pci_register_driver().) Functions don't return error values just so you
can ignore them :)
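
For instance, a rough sketch of the unwinding (names taken from the quoted
patch; the exact ordering is up to you):

static int __init infinipath_init(void)
{
	int ret;

	ret = pci_register_driver(&infinipath_driver);
	if (ret)
		return ret;

	ret = driver_create_file(&infinipath_driver.driver,
				 &driver_attr_version);
	if (ret)
		goto err_pci;

	ret = register_chrdev(ipath_major, MODNAME, &ipath_fops);
	if (ret < 0) {
		_IPATH_ERROR("Unable to register %s device\n", MODNAME);
		goto err_attr;
	}

	return 0;

err_attr:
	driver_remove_file(&infinipath_driver.driver, &driver_attr_version);
err_pci:
	pci_unregister_driver(&infinipath_driver);
	return ret;
}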

> +/*
> + * note: if for some reason the unload fails after this routine, and leaves
> + * the driver enterable by user code, we'll almost certainly crash and burn...
> + */

See, you admit that what you are doing isn't the wisest thing, which
should tell you something...

> +static void __exit infinipath_cleanup(void)
> +{
> + int r, m, port;
> +
> + driver_remove_file(&(infinipath_driver.driver), &driver_attr_version);
> + if ((r = unregister_chrdev(ipath_major, MODNAME)))
> + _IPATH_DBG("unregister of device failed: %d\n", r);
> +
> +
> + /*
> + * turn off rcv, send, and interrupts for all ports, all drivers
> + * should also hard reset the chip here?
> + * free up port 0 (kernel) rcvhdr, egr bufs, and eventually tid bufs
> + * for all versions of the driver, if they were allocated
> + */
> + for (m = 0; m < infinipath_max; m++) {
> + uint64_t val;
> + struct ipath_devdata *dd = &devdata[m];
> + if (dd->ipath_kregbase) {
> + /* in case unload fails, be consistent */
> + dd->ipath_rcvctrl = 0U;
> + ipath_kput_kreg(m, kr_rcvctrl, dd->ipath_rcvctrl);
> +
> + /*
> + * gracefully stop all sends allowing any in
> + * progress to trickle out first.
> + */
> + ipath_kput_kreg(m, kr_sendctrl, 0ULL);
> + val = ipath_kget_kreg64(m, kr_scratch); /* flush it */
> + /*
> + * enough for anything that's going to trickle
> + * out to have actually done so.
> + */
> + udelay(5);
> +
> + /*
> + * abort any armed or launched PIO buffers that
> + * didn't go. (self clearing). Will cause any
> + * packet currently being transmitted to go out
> + * with an EBP, and may also cause a short packet
> + * error on the receiver.
> + */
> + ipath_kput_kreg(m, kr_sendctrl, INFINIPATH_S_ABORT);
> +
> + /* mask interrupts, but not errors */
> + ipath_kput_kreg(m, kr_intmask, 0ULL);
> + ipath_shutdown_link(m);
> +
> + /*
> + * clear all interrupts and errors. Next time
> + * driver is loaded, we know that whatever is
> + * set happened while we were unloaded
> + */
> + ipath_kput_kreg(m, kr_hwerrclear, -1LL);
> + ipath_kput_kreg(m, kr_errorclear, -1LL);
> + ipath_kput_kreg(m, kr_intclear, -1LL);
> + if (dd->__ipath_pioavailregs_base) {
> + kfree((void *)dd->__ipath_pioavailregs_base);
> + dd->__ipath_pioavailregs_base = NULL;
> + dd->ipath_pioavailregs_dma = NULL;
> + }
> +
> + if (dd->ipath_pageshadow) {
> + struct page **tmpp = dd->ipath_pageshadow;
> + int i, cnt = 0;
> +
> + _IPATH_VDBG
> + ("Unlocking any expTID pages still locked\n");
> + for (port = 0; port < dd->ipath_cfgports;
> + port++) {
> + int port_tidbase =
> + port * dd->ipath_rcvtidcnt;
> + int maxtid =
> + port_tidbase + dd->ipath_rcvtidcnt;
> + for (i = port_tidbase; i < maxtid; i++) {
> + if (tmpp[i]) {
> + ipath_putpages(1,
> + &tmpp[i]);
> + tmpp[i] = NULL;
> + cnt++;
> + }
> + }
> + }
> + if (cnt) {
> + ipath_stats.sps_pageunlocks += cnt;
> + _IPATH_VDBG
> + ("There were still %u expTID entries locked\n",
> + cnt);
> + }
> + if (ipath_stats.sps_pagelocks
> + || ipath_stats.sps_pageunlocks)
> + _IPATH_VDBG
> + ("%llu pages locked, %llu unlocked via ipath_m{un}lock\n",
> + ipath_stats.sps_pagelocks,
> + ipath_stats.sps_pageunlocks);
> +
> + _IPATH_VDBG
> + ("Free shadow page tid array at %p\n",
> + dd->ipath_pageshadow);
> + vfree(dd->ipath_pageshadow);
> + dd->ipath_pageshadow = NULL;
> + }
> +
> + /*
> + * free any resources still in use (usually just
> + * kernel ports) at unload
> + */
> + for (port = 0; port < dd->ipath_cfgports; port++)
> + ipath_free_pddata(dd, port, 1);
> + kfree(dd->ipath_pd);
> + /*
> + * debuggability, in case some cleanup path
> + * tries to use it after this
> + */
> + dd->ipath_pd = NULL;
> + }
> +
> + if (dd->pcidev) {
> + if (dd->pcidev->irq) {
> + _IPATH_VDBG("unit %u free_irq of irq %x\n", m,
> + dd->pcidev->irq);
> + free_irq(dd->pcidev->irq, dd);
> + } else
> + _IPATH_DBG
> + ("irq is 0, not doing free_irq for unit %u\n",
> + m);
> + dd->pcidev = NULL;
> + }
> + if (dd->pci_registered) {
> + _IPATH_VDBG
> + ("Unregistering pci infrastructure unit %u\n", m);
> + pci_unregister_driver(&infinipath_driver);

This is the call that should have cleaned up all of the memory and other
stuff that you do above. If not, then your driver will not work in any
hotplug pci systems, which would not be a good thing. Please do like
Roland says and put your resources and stuff in the device specific
structures, like the rest of the kernel drivers do. You know, we do
things like this for a reason, not just because we like to be difficult
:)
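
A rough sketch of that shape: per-device state allocated in probe and freed
in remove, so hotplug add/remove cycles work. Field and function names follow
the quoted patch, but the details are purely illustrative:

static int __devinit infinipath_init_one(struct pci_dev *pdev,
					 const struct pci_device_id *ent)
{
	struct ipath_devdata *dd;
	int ret;

	dd = kzalloc(sizeof(*dd), GFP_KERNEL);
	if (!dd)
		return -ENOMEM;

	ret = pci_enable_device(pdev);
	if (ret)
		goto err_free;

	/* ... per-device setup, storing everything in *dd ... */

	pci_set_drvdata(pdev, dd);
	return 0;

err_free:
	kfree(dd);
	return ret;
}

static void __devexit infinipath_remove_one(struct pci_dev *pdev)
{
	struct ipath_devdata *dd = pci_get_drvdata(pdev);

	/* ... per-device teardown, mirroring probe ... */
	pci_disable_device(pdev);
	kfree(dd);
}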

> + dd->pci_registered = 0;
> + } else
> + _IPATH_VDBG
> + ("unit %u: no pci unreg, wasn't registered\n", m);
> + ipath_chip_cleanup(dd); /* clean up any per-chip chip-specific stuff */
> + }
> + /*
> + * clean up any chip-specific stuff for now, only one type of chip
> + * for any given driver
> + */
> + ipath_chip_done();
> +
> + /* cleanup all our locked pages private data structures */
> + ipath_upages_cleanup(NULL);
> +}
> +
> +/* This is a generic function here, so it can return device-specific
> + * info. This allows keeping in sync with the version that supports
> + * multiple chip types.
> +*/
> +void ipath_get_boardname(const ipath_type t, char *name, size_t namelen)
> +{
> + ipath_ht_get_boardname(t, name, namelen);
> +}

Why not just export ipath_ht_get_boardname instead?

> +module_init(infinipath_init);
> +module_exit(infinipath_cleanup);
> +
> +EXPORT_SYMBOL(infinipath_debug);
> +EXPORT_SYMBOL(ipath_get_boardname);

EXPORT_SYMBOL_GPL() ?
And put them next to the functions themselves, it's easier to notice
that way.
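
i.e. roughly (using the function quoted above):

void ipath_get_boardname(const ipath_type t, char *name, size_t namelen)
{
	ipath_ht_get_boardname(t, name, namelen);
}
EXPORT_SYMBOL_GPL(ipath_get_boardname);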

thanks,

greg k-h

2005-12-30 23:10:16

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 12 of 20] ipath - misc driver support code

On Fri, 2005-12-30 at 00:25 -0800, Greg KH wrote:

> No description of what the patch does?

Ahem. Oops.

> > +struct _infinipath_do_not_use_kernel_regs {
> > + unsigned long long Revision;
>
> u64?

Right.

> > + unsigned long long Control;
> > + unsigned long long PageAlign;
> > + unsigned long long PortCnt;
>
> And what's with the InterCapsNamingScheme of these variables?

They're taken straight from the register names in our chip spec. I can
squish them to lowercase-only, if that seems important.

> > +/*
> > + * would prefer to not inline this, to avoid code bloat, and simplify debugging
> > + * But when compiling against 2.6.10 kernel tree, it gets an error, so
> > + * not for now.
> > + */
> > +static void ipath_i2c_delay(ipath_type, int);
>
> You aren't compiling this for a 2.6.10 kernel anymore :)

Yes, that hunk is redundant. Thanks for spotting it.

> > +static void ipath_i2c_delay(ipath_type dev, int dtime)

> Huh? After reading your comment, I still don't understand why you can't
> just use udelay(). Or are you counting on calling this function with
> only "1" being set for dtime?

It's usually called with a dtime of 1, but there's an added delay in one
place.

I just rewrote that routine, so it's now a one-liner that does a read
which waits for writes to the chip to complete. The sole caller that
wanted an added wait calls udelay itself now.

> Ah, isn't it fun to write bit-banging functions... And the in-kernel
> i2c code is messier than doing this by hand?

From looking at it, it will make the i2c part of the driver longer,
rather than shorter. There's nothing objectionable about the kernel i2c
interfaces per se, but our bit-banging code is pretty small and
specialised.

> Odd function comment style. Please fix this to be in kerneldoc format.

Sure.

> Are you _sure_ you need all of these for the one function in this file?

That file will be taken out and put to sleep.

> > +#include <stddef.h>
>
> Where is this file being pulled in from?

Ugh, braino.

> Woah, um, don't you think that you should either export the main mlock
> function itself, or fix your code to not need it? Rolling it yourself
> isn't a good idea...

Other people have pointed out that our page-pinning code is horked.
We'll find a saner alternative.

Thanks for the comments, Greg.

<b

2005-12-30 23:11:46

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Fri, 2005-12-30 at 00:00 -0800, Greg KH wrote:

> > - The driver still uses EXPORT_SYMBOL, for consistency with other
> > code in drivers/infiniband
>
> Why would that matter?

I don't want to do something gratuitously different to the prevailing
set of code in which it lives.

> > - We're still using ioctls instead of sysfs or configfs in some
> > cases, to maintain userspace compatibility
>
> Compatibility with what? The driver isn't in the kernel tree yet, so
> there are no old kernel versions to remain compatible with :)

We already ship userspace code to customers that relies on the ioctl
interfaces.

> I also noticed that you are still using the uint64_t type variable
> types, can you please switch to the proper kernel types instead (u64 in
> this specific example.)

Yes, we'll use u64 for internal variables, and __u64 for stuff exported
to userspace, etc.

<b

2005-12-30 23:17:56

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 11 of 20] ipath - core driver, part 4 of 4

On Fri, 2005-12-30 at 00:12 -0800, Greg KH wrote:

> This has grown
> into a huge file, can't you split it up into smaller pieces?

Absolutely.

> Why even save off the return value if you don't do anything with it?

I think that's just a throwback to an earlier rev of the driver.

> And please don't put assignments in the middle of if statements, that's
> just messy and harder to read (the fact that gcc made you put an extra
> () should be a hint that you were doing something wrong...)

OK.

> And does your driver work with udev? I didn't see where you were
> exporting the major:minor number of your devices to sysfs, but I might
> have missed it.

It was written in a pre-udev world, so it still uses a fixed major and
minor number. How important is this to you? Is it "nice to have", or
"blocker"? :-)

> Are you sure that's a good idea? Please do the proper thing and tear
> down your infrastructure if something fails, that's the correct thing to
> do. That way you can actually recover if something that you call in
> this function fails (like driver_create_file(), or
> pci_register_driver().) Functions don't return error values just so you
> can ignore them :)

This will take a bit of cleaning up, but it's a reasonable request.

> > +/*
> > + * note: if for some reason the unload fails after this routine, and leaves
> > + * the driver enterable by user code, we'll almost certainly crash and burn...
> > + */
>
> See, you admit that what you are doing isn't the wisest thing, which
> should tell you something...

Indeed.

> This is the call that should have cleaned up all of the memory and other
> stuff that you do above. If not, then your driver will not work in any
> hotplug pci systems, which would not be a good thing. Please do like
> Roland says and put your resources and stuff in the device specific
> structures, like the rest of the kernel drivers do.

I'm working on the appropriate hearts and minds as we speak :-)

> Why not just export ipath_ht_get_boardname instead?

Because that's too specific to HT for my personal liking.

> > +module_init(infinipath_init);
> > +module_exit(infinipath_cleanup);
> > +
> > +EXPORT_SYMBOL(infinipath_debug);
> > +EXPORT_SYMBOL(ipath_get_boardname);
>
> EXPORT_SYMBOL_GPL() ?

I don't see a problem with that.

> And put them next to the functions themselves, it's easier to notice
> that way.

OK.

Thanks again for the review,

<b

2005-12-30 23:47:14

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 8 of 20] ipath - core driver, part 1 of 4

On Fri, 2005-12-30 at 00:39 -0800, Greg KH wrote:

> > +void ipath_chip_done(void)
> > +{
> > +}
> > +
> > +void ipath_chip_cleanup(struct ipath_devdata * dd)
> > +{
> > +}
>
> What are these two empty functions for?

They're just as dead as they look.

> > +static ssize_t show_status_str(struct device *dev,

> how big can this "status string" be?

Just a few dozen bytes.

> If it's even getting close to
> PAGE_SIZE, this doesn't need to be a sysfs attribute, but you should
> break it up into its individual pieces.

Do you think that's still warranted, given this?

> > +static ssize_t show_unit(struct device *dev,

> Don't you mean -ENODEV?

Yes, thanks.

> > + snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_unit);
> > + return strlen(buf);
>
> return the snprintf() call instead of calling strlen() all the time
> please.

OK.

> > +const struct pci_device_id infinipath_pci_tbl[] = {
> > + {
> > + PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT,
> > + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},
>
> PCI_DEVICE() instead?

OK.

> > + {0,}
>
> {},
> is all that is needed here.

OK.

> > + .driver.owner = THIS_MODULE,
>
> This line is not needed, you can remove it.

OK.

> {} not needed here.

OK.

> > +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC)
> > + printk("Remapping pages WC\n");
>
> No KERN_ level?

That should just become a debug statement.

> > + /*
> > + * set these up before registering the interrupt handler, just
> > + * in case
> > + */
> > + devdata[dev].pcidev = pdev;
> > + pci_set_drvdata(pdev, &(devdata[dev]));
>
> It's not a "just in case" type thing, you have to do this before you
> register that interrupt handler, as you can be instantly called here.

OK, I'll remove the misleading comment.

> Are you sure everything else is set up properly here before calling that
> function?

I believe so. I'll double check.

> > + device_create_file(&(pdev->dev), &dev_attr_status);
> > + device_create_file(&(pdev->dev), &dev_attr_status_str);
> > + device_create_file(&(pdev->dev), &dev_attr_lid);
> > + device_create_file(&(pdev->dev), &dev_attr_mlid);
> > + device_create_file(&(pdev->dev), &dev_attr_guid);
> > + device_create_file(&(pdev->dev), &dev_attr_nguid);
> > + device_create_file(&(pdev->dev), &dev_attr_serial);
> > + device_create_file(&(pdev->dev), &dev_attr_unit);
>
> Why not use an attribute array? Makes for proper error handling if one
> of those calls does not work...

OK, thanks.

> > + /*
> > + * We used to cleanup here, with pci_release_regions, etc. but that
> > + * can cause other problems if we want to run diags, etc., so instead
> > + * defer that until driver unload.
> > + */
>
> So memory leaks are acceptable?

That clearly needs a bit of attention.

> > +fail: /* after we've done at least some of the pci setup */
> > + if (ret == -EPERM) /* disabled device, don't want module load error;
> > + * just want to carry status through to this point */
> > + ret = 0;
>
> Module load error does not happen no matter what kind of return value
> you send back from this function. So the comment is wrong, and the fact
> that you failed initializing the device is also wrong, please don't do
> this.

OK.

Thanks for the extensive comments,

<b

2005-12-30 23:50:19

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 10 of 20] ipath - core driver, part 3 of 4

On Fri, 2005-12-30 at 10:46 -0800, Linus Torvalds wrote:

> All your user page lookup/pinning code is terminally broken.

Yes, this has been pointed out by a few others.

> Crap like this must not be merged.

I'm already busy decrappifying it...

<b

2005-12-31 00:19:10

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 8 of 20] ipath - core driver, part 1 of 4

On Fri, Dec 30, 2005 at 03:47:07PM -0800, Bryan O'Sullivan wrote:
> On Fri, 2005-12-30 at 00:39 -0800, Greg KH wrote:
>
> > > +void ipath_chip_done(void)
> > > +{
> > > +}
> > > +
> > > +void ipath_chip_cleanup(struct ipath_devdata * dd)
> > > +{
> > > +}
> >
> > What are these two empty functions for?
>
> They're just as dead as they look.

Then you might want to remove them :)

> > > +static ssize_t show_status_str(struct device *dev,
>
> > how big can this "status string" be?
>
> Just a few dozen bytes.
>
> > If it's even getting close to
> > PAGE_SIZE, this doesn't need to be a sysfs attribute, but you should
> > break it up into its individual pieces.
>
> Do you think that's still warranted, given this?

No I don't, unless you think that message will grow somehow...

thanks,

greg k-h

2005-12-31 00:19:09

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 12 of 20] ipath - misc driver support code

On Fri, Dec 30, 2005 at 03:10:09PM -0800, Bryan O'Sullivan wrote:
> On Fri, 2005-12-30 at 00:25 -0800, Greg KH wrote:
> > > + unsigned long long Control;
> > > + unsigned long long PageAlign;
> > > + unsigned long long PortCnt;
> >
> > And what's with the InterCapsNamingScheme of these variables?
>
> They're taken straight from the register names in our chip spec. I can
> squish them to lowercase-only, if that seems important.

No, but document that this is the reason for it (along with a pointer
to your chip spec, if possible).

thanks,

greg k-h

2005-12-31 00:19:33

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 11 of 20] ipath - core driver, part 4 of 4

On Fri, Dec 30, 2005 at 03:17:55PM -0800, Bryan O'Sullivan wrote:
> On Fri, 2005-12-30 at 00:12 -0800, Greg KH wrote:
> > And does your driver work with udev? I didn't see where you were
> > exporting the major:minor number of your devices to sysfs, but I might
> > have missed it.
>
> It was written in a pre-udev world, so it still uses a fixed major and
> minor number. How important is this to you? Is it "nice to have", or
> "blocker"? :-)

Well, depends on if you want your driver to work with any of the major
distros that rely on udev (RHEL, SLES, etc...) If not, fine, you don't
need it :)

thanks,

greg k-h

2005-12-31 00:19:32

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Fri, Dec 30, 2005 at 03:11:44PM -0800, Bryan O'Sullivan wrote:
> On Fri, 2005-12-30 at 00:00 -0800, Greg KH wrote:
> > > - The driver still uses EXPORT_SYMBOL, for consistency with other
> > > code in drivers/infiniband
> >
> > Why would that matter?
>
> I don't want to do something gratuitously different to the prevailing
> set of code in which it lives.
>
> > > - We're still using ioctls instead of sysfs or configfs in some
> > > cases, to maintain userspace compatibility
> >
> > Compatibility with what? The driver isn't in the kernel tree yet, so
> > there are no old kernel versions to remain compatible with :)
>
> We already ship userspace code to customers that relies on the ioctl
> interfaces.

But we (the kernel community), don't really accept that as a valid
reason to accept this kind of code, sorry.

Why not just update your userspace code and ship that out to your
customers, as you know exactly who they are due to the lack of the
driver in the mainline kernel tree :)

thanks,

greg k-h

2005-12-31 01:40:56

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Fri, 2005-12-30 at 16:10 -0800, Greg KH wrote:

> But we (the kernel community), don't really accept that as a valid
> reason to accept this kind of code, sorry.

Fair enough. I'd like some guidance in that case. Some of our ioctls
access the hardware more or less directly, while others do things like
read or reset counters.

Which of these kinds of operations are appropriate to retain as ioctls,
in your eyes, and which are best converted to sysfs or configfs
alternatives?

As an example, take a look at ipath_sma_ioctl. It seems to me that
receiving or sending subnet management packets ought to remain as
ioctls, while getting port or node data could be turned into sysfs
attributes. Lane identification could live in configfs. If you think
otherwise, please let me know what's more appropriate.

The less blind I am in doing these conversions, the fewer rounds we'll
have to go in reviewing humongous driver submission patches :-)

Thanks,

<b

2005-12-31 05:36:38

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

>> > - Someone asked for the kernel's i2c infrastructure to be used, but
>> > our i2c usage is very specialised, and it would be more of a mess
>> > to use the kernel's
>>
>> Problem with that is that if everybody and Aunt Tillie does the same,
>> the kernel as a whole gets to be a mess.
>
>ALSA does the exact same thing for the exact same reason. Maybe an
>indication that the kernel's i2c layer is too heavy?

Sounds like a discussion a while back about why jfs/xfs/reiser3/reiser4
all have their own journalling, compared to ext3-jbd.


Jan Engelhardt
--

2005-12-31 08:36:30

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 10 of 20] ipath - core driver, part 3 of 4

On Fri, 2005-12-30 at 15:50 -0800, Bryan O'Sullivan wrote:
> On Fri, 2005-12-30 at 10:46 -0800, Linus Torvalds wrote:
>
> > All your user page lookup/pinning code is terminally broken.
>
> Yes, this has been pointed out by a few others.
>
> > Crap like this must not be merged.
>
> I'm already busy decrappifying it...

the point, I think, was also that the fact that it exists is already wrong :)

Makes it easier for you... "rm" is a very powerful decrappify tool, as is
"block delete" in just about any editor ;)


2006-01-02 16:06:21

by Horst H. von Brand

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

Lee Revell <[email protected]> wrote:
> On Thu, 2005-12-29 at 16:01 -0300, Horst von Brand wrote:
> > > - Someone asked for the kernel's i2c infrastructure to be used, but
> > > our i2c usage is very specialised, and it would be more of a mess
> > > to use the kernel's

> > Problem with that is that if everybody and Aunt Tillie does the same,
> > the kernel as a whole gets to be a mess.

> ALSA does the exact same thing for the exact same reason. Maybe an
> indication that the kernel's i2c layer is too heavy?

That would mean that the respective teams should put their heads together
and (re)design it to their needs...
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2006-01-02 16:22:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Mon, Jan 02, 2006 at 01:05:43PM -0300, Horst von Brand wrote:
> > > Problem with that is that if everybody and Aunt Tillie does the same,
> > > the kernel as a whole gets to be a mess.
>
> > ALSA does the exact same thing for the exact same reason. Maybe an
> > indication that the kernel's i2c layer is too heavy?
>
> That would mean that the respective teams should put their heads together
> and (re)design it to their needs...

Exactly. We got quite a few developers to help adjust the i2c stack to
their needs and improve it. The i2c stack started out being used only for
hardware monitoring chips, and later for multimedia devices. Help to make
it more useful for other users is always appreciated.

2006-01-02 20:35:34

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

"Bryan O'Sullivan" <[email protected]> writes:

> On Fri, 2005-12-30 at 16:10 -0800, Greg KH wrote:
>
>> But we (the kernel community) don't really accept that as a valid
>> reason to accept this kind of code, sorry.
>
> Fair enough. I'd like some guidance in that case. Some of our ioctls
> access the hardware more or less directly, while others do things like
> read or reset counters.

As a general rule, a driver should push as much functionality as possible
into libraries and infrastructure code.

> Which of these kinds of operations are appropriate to retain as ioctls,
> in your eyes, and which are best converted to sysfs or configfs
> alternatives?
>
> As an example, take a look at ipath_sma_ioctl. It seems to me that
> receiving or sending subnet management packets ought to remain as
> ioctls, while getting port or node data could be turned into sysfs
> attributes. Lane identification could live in configfs. If you think
> otherwise, please let me know what's more appropriate.

I haven't looked closely enough at the state of the openib tree but
you should not need an additional interface to send/receive standard
IB subnet management packets. That is something that should be provided
the same way by all infiniband drivers.

The only case I can think of where this might not already exist
is the code that responds to the subnet manager. If the current
interfaces are not sufficient then the infiniband layer needs
more work.

> The less blind I am in doing these conversions, the fewer rounds we'll
> have to go in reviewing humongous driver submission patches :-)

Given Linus's comments and looking at where you are getting stuck I
would recommend you split out support for the nonstandard ipath
protocol from the rest of the driver. If the standard infiniband
interfaces for kernel bypass are not sufficient for flinging packets
then we need to re-examine them.

Eric

2006-01-02 22:22:44

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Mon, 2006-01-02 at 13:35 -0700, Eric W. Biederman wrote:

> I haven't looked closely enough at the state of the openib tree but
> you should not need an additional interface to send/receive standard
> IB subnet management packets. That is something that should be provided
> the same way by all infiniband drivers.

We provide the standard OpenIB mechanisms for doing that, of course.
However, our driver is layered. The OpenIB layer uses facilities
provided by the main driver (via ipath_layer.c). The main driver can
stand alone, without the OpenIB code compiled into the kernel or
available as a module at all. In that case, a userland subnet
management agent must still be able to send and receive management
packets.

> Given Linus's comments and looking at where you are getting stuck I
> would recommend you split out support for the nonstandard ipath
> protocol from the rest of the driver.

While we can split the main driver source file up along those lines, we
are not planning to make the ipath protocol optional. We are planning
to submit another non-OpenIB network driver that depends on the ipath
protocol support.

<b

2006-01-03 17:54:25

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Fri, Dec 30, 2005 at 05:40:50PM -0800, Bryan O'Sullivan wrote:
> On Fri, 2005-12-30 at 16:10 -0800, Greg KH wrote:
>
> > But we (the kernel community) don't really accept that as a valid
> > reason to accept this kind of code, sorry.
>
> Fair enough. I'd like some guidance in that case. Some of our ioctls
> access the hardware more or less directly, while others do things like
> read or reset counters.
>
> Which of these kinds of operations are appropriate to retain as ioctls,
> in your eyes, and which are best converted to sysfs or configfs
> alternatives?

Ideally, nothing should be new ioctls. But in the end, it all depends on
exactly what you are trying to do with each different one.

> As an example, take a look at ipath_sma_ioctl. It seems to me that
> receiving or sending subnet management packets ought to remain as
> ioctls, while getting port or node data could be turned into sysfs
> attributes. Lane identification could live in configfs. If you think
> otherwise, please let me know what's more appropriate.

I really don't know what the subnet management stuff involves, sorry.
But doesn't the open-ib layer handle that all for you already?

thanks,

greg k-h

2006-01-03 20:55:10

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Tue, 2006-01-03 at 09:27 -0800, Greg KH wrote:

> Ideally, nothing should be new ioctls. But in the end, it all depends on
> exactly what you are trying to do with each different one.

Fair enough.

> I really don't know what the subnet management stuff involves, sorry.
> But doesn't the open-ib layer handle that all for you already?

It does when our OpenIB driver is being used. But our lower level
driver is independent of OpenIB (and is often used without the
infiniband stuff even configured into the kernel), and needs to provide
some way for a userspace subnet management agent to send and receive
packets.

<b

2006-01-03 20:57:38

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Tue, 2006-01-03 at 12:54 -0800, Bryan O'Sullivan wrote:
> On Tue, 2006-01-03 at 09:27 -0800, Greg KH wrote:
>
> > Ideally, nothing should be new ioctls. But in the end, it all depends on
> > exactly what you are trying to do with each different one.
>
> Fair enough.
>
> > I really don't know what the subnet management stuff involves, sorry.
> > But doesn't the open-ib layer handle that all for you already?
>
> It does when our OpenIB driver is being used. But our lower level
> driver is independent of OpenIB (and is often used without the
> infiniband stuff even configured into the kernel), and needs to provide
> some way for a userspace subnet management agent to send and receive
> packets.

that sounds like your driver should mimic the openIB userspace ABI for
this *exactly* so that you can use the same management tools for either
scenario...


2006-01-03 21:26:05

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Tue, 2006-01-03 at 21:57 +0100, Arjan van de Ven wrote:

> that sounds like your driver should mimic the openIB userspace ABI for
> this *exactly* so that you can use the same management tools for either
> scenario...

The OpenIB userspace ABI is huge and complex, and the OpenIB subnet
management agent (OpenSM) is even more so. Our low-level subnet
management agent has vastly simpler needs, so it really is better served
with 300 lines of specialised code (I don't care what the ABI actually
is) than 15,000 lines introduced for the sake of unneeded compatibility.

Perhaps read/write on the character device file would be preferable to
ioctls for sending and receiving these management packets? We don't
implement those file methods at the moment, so it's not like we'd be
displacing anything.
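
Just to make the comparison concrete, this is a minimal sketch of what I
mean; the packet structure and the two helper functions below are invented
for illustration, and none of this is in the current driver:

#include <linux/fs.h>
#include <linux/module.h>
#include <linux/types.h>
#include <asm/uaccess.h>

/* hypothetical fixed-size management packet buffer */
struct ipath_sma_pkt {
        u32 len;
        u8 data[256];
};

/* hypothetical helpers implemented elsewhere in the driver */
int ipath_sma_next_packet(struct ipath_sma_pkt *pkt);
int ipath_sma_send(struct ipath_sma_pkt *pkt);

static ssize_t ipath_sma_read(struct file *fp, char __user *buf,
                              size_t count, loff_t *off)
{
        struct ipath_sma_pkt pkt;
        int ret;

        /* block until the next inbound management packet arrives */
        ret = ipath_sma_next_packet(&pkt);
        if (ret)
                return ret;
        if (count < pkt.len)
                return -EINVAL;
        if (copy_to_user(buf, pkt.data, pkt.len))
                return -EFAULT;
        return pkt.len;
}

static ssize_t ipath_sma_write(struct file *fp, const char __user *buf,
                               size_t count, loff_t *off)
{
        struct ipath_sma_pkt pkt;
        int ret;

        if (count > sizeof(pkt.data))
                return -EINVAL;
        if (copy_from_user(pkt.data, buf, count))
                return -EFAULT;
        pkt.len = count;

        /* hand the packet to the hardware */
        ret = ipath_sma_send(&pkt);
        return ret ? ret : count;
}

static struct file_operations ipath_sma_fops = {
        .owner = THIS_MODULE,
        .read  = ipath_sma_read,
        .write = ipath_sma_write,
};

That would keep the ABI down to one packet per read or write call.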

<b

2006-01-03 21:27:07

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver


> Perhaps read/write on the character device file would be preferable to
> ioctls for sending and receiving these management packets? We don't
> implement those file methods at the moment, so it's not like we'd be
> displacing anything.

if it's just data packets, you could implement a device that offers the
SG_IO interface. Yes, it's ioctls, but it's a preexisting ABI, so I
suspect that's not too big a deal (and maybe you can even leverage a lot
of existing code for this).
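
for reference, the userspace side of SG_IO looks roughly like this. The
device node name and the command block here are placeholders; whether a
management packet maps onto this cleanly is exactly the open question:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* send one management packet via SG_IO; returns 0 on success */
int send_mgmt_packet(const char *devnode, void *pkt, size_t len)
{
        unsigned char cmd[16] = { 0 };  /* placeholder command block */
        struct sg_io_hdr hdr;
        int fd, ret;

        fd = open(devnode, O_RDWR);
        if (fd < 0)
                return -1;

        memset(&hdr, 0, sizeof(hdr));
        hdr.interface_id = 'S';
        hdr.dxfer_direction = SG_DXFER_TO_DEV;
        hdr.cmd_len = sizeof(cmd);
        hdr.cmdp = cmd;
        hdr.dxfer_len = len;
        hdr.dxferp = pkt;
        hdr.timeout = 1000;             /* milliseconds */

        ret = ioctl(fd, SG_IO, &hdr);
        close(fd);
        return ret;
}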


2006-01-04 03:34:05

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Tue, 2006-01-03 at 22:26 +0100, Arjan van de Ven wrote:
> > Perhaps read/write on the character device file would be preferable to
> > ioctls for sending and receiving these management packets? We don't
> > implement those file methods at the moment, so it's not like we'd be
> > displacing anything.
>
> if it's just data packets.. you could implement a device that offers the
> SG_IO interface.

OK, thanks for the pointer. I'll take a look.

<b

2006-01-04 21:27:04

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

Eric> Given Linus's comments and looking at where you are getting
Eric> stuck I would recommend you split out support for the
Eric> nonstandard ipath protocol from the rest of the driver. If
Eric> the standard infiniband interfaces for kernel bypass are not
Eric> sufficient for flinging packets then we need to re-examine
Eric> them.

Yes, this might be a good idea. The "core" driver looks like it is
suffering from really being several things stuck together. It would
probably make things a lot cleaner and easier to maintain if the core
driver just handled synchronizing access to the low-level hardware,
with other stuff split into its own driver. It seems there might even
be enough stuff to split "core" into three drivers: the real core, the
ultra-high-performance MPI transport, and the management/diagnostics
stuff.

Also, there are APIs in the "core" driver that are only exported for a
single user outside the driver -- it would probably make sense to move
that logic directly to where it's used. I'm thinking of things like
ipath_verbs_send() and the whole ipath_copy.c file.

- R.

2006-01-04 21:28:36

by Roland Dreier

[permalink] [raw]
Subject: Re: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

Bryan> It does when our OpenIB driver is being used. But our
Bryan> lower level driver is independent of OpenIB (and is often
Bryan> used without the infiniband stuff even configured into the
Bryan> kernel), and needs to provide some way for a userspace
Bryan> subnet management agent to send and receive packets.

Isn't there some way you can use the same SMA (subnet management
agent) interface in all the cases? Can ipath_mad.c just go away in
favor of your userspace SMA?

- R.

2006-01-05 15:28:50

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Wed, 2006-01-04 at 13:26 -0800, Roland Dreier wrote:

> Yes, this might be a good idea. The "core" driver looks like it is
> suffering from really being several things stuck together.

Yes, this is undoubtedly the case; we developed it organically based on
our evolving needs, and we're only now (maybe a bit belatedly) stepping
back to take a breath and see how things should be logically split out.

> Also, there are APIs in the "core" driver that are only exported for a
> single user outside the driver -- it would probably make sense to move
> that logic directly to where it's used.

Right. The purpose of the whole ipath_layer.c file has perhaps been
unclear; we've been holding back a network driver that makes use of it,
to keep the size of the review patches down. Some of the other
verbs-related routines in the core driver are in the process of finding
a new home, as you suggested.

As ever, thanks for the comments.

<b

2006-01-05 15:31:20

by Bryan O'Sullivan

[permalink] [raw]
Subject: Re: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver

On Wed, 2006-01-04 at 13:28 -0800, Roland Dreier wrote:

> Isn't there some way you can use the same SMA (subnet management
> agent) interface in all the cases?

I'll look into it, but I rather doubt it.

> Can ipath_mad.c just go away in
> favor of your userspace SMA?

Our userspace SMA is a tiny shrivelled thing that expects there to be a
real subnet manager out there, so it only needs a very simple interface,
and it's decoupled from OpenIB entirely. ipath_mad.c is part of our
OpenIB layer, so it can't really go away.

<b