Ondrej, would you please test this new series?
Changed since v1:
- PDMA transfer residual is calculated earlier.
- End of DMA flag check is now polled (if there is any residual).
Changed since v2:
- Bail out of transfer loops when Gated IRQ gets asserted.
- Make udelay conditional on board type.
- Drop sg_tablesize patch due to performance regression.
Finn Thain (1):
g_NCR5380: Cleanup comments and whitespace
Ondrej Zary (3):
g_NCR5380: Fix PDMA transfer size
g_NCR5380: End PDMA transfer correctly on target disconnection
g_NCR5380: Re-work PDMA loops
drivers/scsi/g_NCR5380.c | 239 +++++++++++++++++++++++------------------------
1 file changed, 116 insertions(+), 123 deletions(-)
--
2.13.0
From: Ondrej Zary <[email protected]>
The polling loops in pread() and pwrite() can easily become infinite
loops and hang the machine.
On DTC chips, IRQ can arrive late and we miss it because we only check
once. Merge the IRQ check into host buffer wait and add polling limit.
Also place a limit on polling for 53C80 registers accessibility.
[Use NCR5380_poll_politely2() for register polling. Rely on polling for
gated IRQ rather than polling for phase error, like the algorithm in the
datasheet. Calculate residual from block count register instead of the
loop counter. Factor-out common code as wait_for_53c80_access(). -- F.T.]
Signed-off-by: Ondrej Zary <[email protected]>
Signed-off-by: Finn Thain <[email protected]>
---
drivers/scsi/g_NCR5380.c | 160 ++++++++++++++++++++++-------------------------
1 file changed, 75 insertions(+), 85 deletions(-)
diff --git a/drivers/scsi/g_NCR5380.c b/drivers/scsi/g_NCR5380.c
index 9f18082415c4..28233ec49fdc 100644
--- a/drivers/scsi/g_NCR5380.c
+++ b/drivers/scsi/g_NCR5380.c
@@ -45,7 +45,8 @@
int c400_blk_cnt; \
int c400_host_buf; \
int io_width; \
- int pdma_residual
+ int pdma_residual; \
+ int board
#define NCR5380_dma_xfer_len generic_NCR5380_dma_xfer_len
#define NCR5380_dma_recv_setup generic_NCR5380_pread
@@ -316,6 +317,7 @@ static int generic_NCR5380_init_one(struct scsi_host_template *tpnt,
}
hostdata = shost_priv(instance);
+ hostdata->board = board;
hostdata->io = iomem;
hostdata->region_size = region_size;
@@ -480,6 +482,31 @@ static void generic_NCR5380_release_resources(struct Scsi_Host *instance)
release_mem_region(base, region_size);
}
+/* wait_for_53c80_access - wait for 53C80 registers to become accessible
+ * @hostdata: scsi host private data
+ *
+ * The registers within the 53C80 logic block are inaccessible until
+ * bit 7 in the 53C400 control status register gets asserted.
+ */
+
+static int wait_for_53c80_access(struct NCR5380_hostdata *hostdata)
+{
+ int count = 10000;
+
+ do {
+ if (hostdata->board == BOARD_DTC3181E)
+ udelay(4); /* DTC436 chip hangs without this */
+ if (NCR5380_read(hostdata->c400_ctl_status) & CSR_53C80_REG)
+ return 0;
+ } while (--count > 0);
+
+ scmd_printk(KERN_ERR, hostdata->connected,
+ "53c80 registers not accessible, device will be reset\n");
+ NCR5380_write(hostdata->c400_ctl_status, CSR_RESET);
+ NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
+ return -1;
+}
+
/**
* generic_NCR5380_pread - pseudo DMA receive
* @hostdata: scsi host private data
@@ -492,18 +519,23 @@ static void generic_NCR5380_release_resources(struct Scsi_Host *instance)
static inline int generic_NCR5380_pread(struct NCR5380_hostdata *hostdata,
unsigned char *dst, int len)
{
- int blocks = len / 128;
- int start = 0;
+ int result;
+ int start;
NCR5380_write(hostdata->c400_ctl_status, CSR_BASE | CSR_TRANS_DIR);
- NCR5380_write(hostdata->c400_blk_cnt, blocks);
- while (1) {
- if (NCR5380_read(hostdata->c400_blk_cnt) == 0)
+ NCR5380_write(hostdata->c400_blk_cnt, len / 128);
+
+ for (start = 0; start < len; start += 128) {
+ if (NCR5380_poll_politely2(hostdata, hostdata->c400_ctl_status,
+ CSR_HOST_BUF_NOT_RDY, 0,
+ hostdata->c400_ctl_status,
+ CSR_GATED_53C80_IRQ,
+ CSR_GATED_53C80_IRQ, HZ / 64) < 0)
+ break;
+
+ if (NCR5380_read(hostdata->c400_ctl_status) &
+ CSR_GATED_53C80_IRQ)
break;
- if (NCR5380_read(hostdata->c400_ctl_status) & CSR_GATED_53C80_IRQ)
- goto out_wait;
- while (NCR5380_read(hostdata->c400_ctl_status) & CSR_HOST_BUF_NOT_RDY)
- ; /* FIXME - no timeout */
if (hostdata->io_port && hostdata->io_width == 2)
insw(hostdata->io_port + hostdata->c400_host_buf,
@@ -514,46 +546,20 @@ static inline int generic_NCR5380_pread(struct NCR5380_hostdata *hostdata,
else
memcpy_fromio(dst + start,
hostdata->io + NCR53C400_host_buffer, 128);
-
- start += 128;
- blocks--;
- }
-
- if (blocks) {
- while (NCR5380_read(hostdata->c400_ctl_status) & CSR_HOST_BUF_NOT_RDY)
- ; /* FIXME - no timeout */
-
- if (hostdata->io_port && hostdata->io_width == 2)
- insw(hostdata->io_port + hostdata->c400_host_buf,
- dst + start, 64);
- else if (hostdata->io_port)
- insb(hostdata->io_port + hostdata->c400_host_buf,
- dst + start, 128);
- else
- memcpy_fromio(dst + start,
- hostdata->io + NCR53C400_host_buffer, 128);
-
- start += 128;
- blocks--;
}
- if (!(NCR5380_read(hostdata->c400_ctl_status) & CSR_GATED_53C80_IRQ))
- printk("53C400r: no 53C80 gated irq after transfer");
-
-out_wait:
- hostdata->pdma_residual = len - start;
+ hostdata->pdma_residual = NCR5380_read(hostdata->c400_blk_cnt) * 128;
- /* wait for 53C80 registers to be available */
- while (!(NCR5380_read(hostdata->c400_ctl_status) & CSR_53C80_REG))
- ;
+ result = wait_for_53c80_access(hostdata);
- if (NCR5380_poll_politely(hostdata, BUS_AND_STATUS_REG,
+ if (hostdata->pdma_residual == 0 &&
+ NCR5380_poll_politely(hostdata, BUS_AND_STATUS_REG,
BASR_END_DMA_TRANSFER, BASR_END_DMA_TRANSFER,
HZ / 64) < 0)
scmd_printk(KERN_ERR, hostdata->connected, "%s: End of DMA timeout (%d)\n",
__func__, hostdata->pdma_residual);
- return 0;
+ return result;
}
/**
@@ -568,36 +574,23 @@ static inline int generic_NCR5380_pread(struct NCR5380_hostdata *hostdata,
static inline int generic_NCR5380_pwrite(struct NCR5380_hostdata *hostdata,
unsigned char *src, int len)
{
- int blocks = len / 128;
- int start = 0;
+ int result;
+ int start;
NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
- NCR5380_write(hostdata->c400_blk_cnt, blocks);
- while (1) {
- if (NCR5380_read(hostdata->c400_ctl_status) & CSR_GATED_53C80_IRQ)
- goto out_wait;
-
- if (NCR5380_read(hostdata->c400_blk_cnt) == 0)
+ NCR5380_write(hostdata->c400_blk_cnt, len / 128);
+
+ for (start = 0; start < len; start += 128) {
+ if (NCR5380_poll_politely2(hostdata, hostdata->c400_ctl_status,
+ CSR_HOST_BUF_NOT_RDY, 0,
+ hostdata->c400_ctl_status,
+ CSR_GATED_53C80_IRQ,
+ CSR_GATED_53C80_IRQ, HZ / 64) < 0)
break;
- while (NCR5380_read(hostdata->c400_ctl_status) & CSR_HOST_BUF_NOT_RDY)
- ; // FIXME - timeout
-
- if (hostdata->io_port && hostdata->io_width == 2)
- outsw(hostdata->io_port + hostdata->c400_host_buf,
- src + start, 64);
- else if (hostdata->io_port)
- outsb(hostdata->io_port + hostdata->c400_host_buf,
- src + start, 128);
- else
- memcpy_toio(hostdata->io + NCR53C400_host_buffer,
- src + start, 128);
- start += 128;
- blocks--;
- }
- if (blocks) {
- while (NCR5380_read(hostdata->c400_ctl_status) & CSR_HOST_BUF_NOT_RDY)
- ; // FIXME - no timeout
+ if (NCR5380_read(hostdata->c400_ctl_status) &
+ CSR_GATED_53C80_IRQ)
+ break;
if (hostdata->io_port && hostdata->io_width == 2)
outsw(hostdata->io_port + hostdata->c400_host_buf,
@@ -608,30 +601,27 @@ static inline int generic_NCR5380_pwrite(struct NCR5380_hostdata *hostdata,
else
memcpy_toio(hostdata->io + NCR53C400_host_buffer,
src + start, 128);
-
- start += 128;
- blocks--;
}
-out_wait:
- hostdata->pdma_residual = len - start;
+ hostdata->pdma_residual = NCR5380_read(hostdata->c400_blk_cnt) * 128;
- /* wait for 53C80 registers to be available */
- while (!(NCR5380_read(hostdata->c400_ctl_status) & CSR_53C80_REG)) {
- udelay(4); /* DTC436 chip hangs without this */
- /* FIXME - no timeout */
- }
+ result = wait_for_53c80_access(hostdata);
- while (!(NCR5380_read(TARGET_COMMAND_REG) & TCR_LAST_BYTE_SENT))
- ; // TIMEOUT
+ if (hostdata->pdma_residual == 0) {
+ if (NCR5380_poll_politely(hostdata, TARGET_COMMAND_REG,
+ TCR_LAST_BYTE_SENT, TCR_LAST_BYTE_SENT,
+ HZ / 64) < 0)
+ scmd_printk(KERN_ERR, hostdata->connected,
+ "%s: Last Byte Sent timeout\n", __func__);
- if (NCR5380_poll_politely(hostdata, BUS_AND_STATUS_REG,
- BASR_END_DMA_TRANSFER, BASR_END_DMA_TRANSFER,
- HZ / 64) < 0)
- scmd_printk(KERN_ERR, hostdata->connected, "%s: End of DMA timeout (%d)\n",
- __func__, hostdata->pdma_residual);
+ if (NCR5380_poll_politely(hostdata, BUS_AND_STATUS_REG,
+ BASR_END_DMA_TRANSFER, BASR_END_DMA_TRANSFER,
+ HZ / 64) < 0)
+ scmd_printk(KERN_ERR, hostdata->connected, "%s: End of DMA timeout (%d)\n",
+ __func__, hostdata->pdma_residual);
+ }
- return 0;
+ return result;
}
static int generic_NCR5380_dma_xfer_len(struct NCR5380_hostdata *hostdata,
--
2.13.0
From: Ondrej Zary <[email protected]>
generic_NCR5380_dma_xfer_len() incorrectly uses cmd->transfersize
which causes rescan-scsi-bus and CD-ROM access to hang the system.
Use cmd->SCp.this_residual instead, like other NCR5380 drivers.
Signed-off-by: Ondrej Zary <[email protected]>
Signed-off-by: Finn Thain <[email protected]>
---
drivers/scsi/g_NCR5380.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/drivers/scsi/g_NCR5380.c b/drivers/scsi/g_NCR5380.c
index 67c8dac321ad..14ef4e8c4713 100644
--- a/drivers/scsi/g_NCR5380.c
+++ b/drivers/scsi/g_NCR5380.c
@@ -76,6 +76,7 @@
#define IRQ_AUTO 254
#define MAX_CARDS 8
+#define DMA_MAX_SIZE 32768
/* old-style parameters for compatibility */
static int ncr_irq = -1;
@@ -629,23 +630,16 @@ static inline int generic_NCR5380_pwrite(struct NCR5380_hostdata *hostdata,
static int generic_NCR5380_dma_xfer_len(struct NCR5380_hostdata *hostdata,
struct scsi_cmnd *cmd)
{
- int transfersize = cmd->transfersize;
+ int transfersize = cmd->SCp.this_residual;
if (hostdata->flags & FLAG_NO_PSEUDO_DMA)
return 0;
- /* Limit transfers to 32K, for xx400 & xx406
- * pseudoDMA that transfers in 128 bytes blocks.
- */
- if (transfersize > 32 * 1024 && cmd->SCp.this_residual &&
- !(cmd->SCp.this_residual % transfersize))
- transfersize = 32 * 1024;
-
/* 53C400 datasheet: non-modulo-128-byte transfers should use PIO */
if (transfersize % 128)
transfersize = 0;
- return transfersize;
+ return min(transfersize, DMA_MAX_SIZE);
}
/*
--
2.13.0
From: Ondrej Zary <[email protected]>
When an IRQ arrives during PDMA transfer, pread() and pwrite() return
without waiting for the 53C80 registers to be ready and this ends up
messing up the chip state. This was observed with SONY CDU-55S which is
slow enough to disconnect during 4096-byte reads.
IRQ during PDMA is not an error so don't return -1. Instead, store the
remaining byte count for use by NCR5380_dma_residual().
[Poll for the BASR_END_DMA_TRANSFER condition rather than remove the
error message -- F.T.]
Signed-off-by: Ondrej Zary <[email protected]>
Signed-off-by: Finn Thain <[email protected]>
---
drivers/scsi/g_NCR5380.c | 48 +++++++++++++++++++++++++++++++-----------------
1 file changed, 31 insertions(+), 17 deletions(-)
diff --git a/drivers/scsi/g_NCR5380.c b/drivers/scsi/g_NCR5380.c
index 14ef4e8c4713..911a4300ea51 100644
--- a/drivers/scsi/g_NCR5380.c
+++ b/drivers/scsi/g_NCR5380.c
@@ -44,12 +44,13 @@
int c400_ctl_status; \
int c400_blk_cnt; \
int c400_host_buf; \
- int io_width
+ int io_width; \
+ int pdma_residual
#define NCR5380_dma_xfer_len generic_NCR5380_dma_xfer_len
#define NCR5380_dma_recv_setup generic_NCR5380_pread
#define NCR5380_dma_send_setup generic_NCR5380_pwrite
-#define NCR5380_dma_residual NCR5380_dma_residual_none
+#define NCR5380_dma_residual generic_NCR5380_dma_residual
#define NCR5380_intr generic_NCR5380_intr
#define NCR5380_queue_command generic_NCR5380_queue_command
@@ -500,10 +501,8 @@ static inline int generic_NCR5380_pread(struct NCR5380_hostdata *hostdata,
while (1) {
if (NCR5380_read(hostdata->c400_blk_cnt) == 0)
break;
- if (NCR5380_read(hostdata->c400_ctl_status) & CSR_GATED_53C80_IRQ) {
- printk(KERN_ERR "53C400r: Got 53C80_IRQ start=%d, blocks=%d\n", start, blocks);
- return -1;
- }
+ if (NCR5380_read(hostdata->c400_ctl_status) & CSR_GATED_53C80_IRQ)
+ goto out_wait;
while (NCR5380_read(hostdata->c400_ctl_status) & CSR_HOST_BUF_NOT_RDY)
; /* FIXME - no timeout */
@@ -542,13 +541,19 @@ static inline int generic_NCR5380_pread(struct NCR5380_hostdata *hostdata,
if (!(NCR5380_read(hostdata->c400_ctl_status) & CSR_GATED_53C80_IRQ))
printk("53C400r: no 53C80 gated irq after transfer");
+out_wait:
+ hostdata->pdma_residual = len - start;
+
/* wait for 53C80 registers to be available */
while (!(NCR5380_read(hostdata->c400_ctl_status) & CSR_53C80_REG))
;
- if (!(NCR5380_read(BUS_AND_STATUS_REG) & BASR_END_DMA_TRANSFER))
- printk(KERN_ERR "53C400r: no end dma signal\n");
-
+ if (NCR5380_poll_politely(hostdata, BUS_AND_STATUS_REG,
+ BASR_END_DMA_TRANSFER, BASR_END_DMA_TRANSFER,
+ HZ / 64) < 0)
+ scmd_printk(KERN_ERR, hostdata->connected, "%s: End of DMA timeout (%d)\n",
+ __func__, hostdata->pdma_residual);
+
return 0;
}
@@ -571,10 +576,8 @@ static inline int generic_NCR5380_pwrite(struct NCR5380_hostdata *hostdata,
NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
NCR5380_write(hostdata->c400_blk_cnt, blocks);
while (1) {
- if (NCR5380_read(hostdata->c400_ctl_status) & CSR_GATED_53C80_IRQ) {
- printk(KERN_ERR "53C400w: Got 53C80_IRQ start=%d, blocks=%d\n", start, blocks);
- return -1;
- }
+ if (NCR5380_read(hostdata->c400_ctl_status) & CSR_GATED_53C80_IRQ)
+ goto out_wait;
if (NCR5380_read(hostdata->c400_blk_cnt) == 0)
break;
@@ -612,18 +615,24 @@ static inline int generic_NCR5380_pwrite(struct NCR5380_hostdata *hostdata,
blocks--;
}
+out_wait:
+ hostdata->pdma_residual = len - start;
+
/* wait for 53C80 registers to be available */
while (!(NCR5380_read(hostdata->c400_ctl_status) & CSR_53C80_REG)) {
udelay(4); /* DTC436 chip hangs without this */
/* FIXME - no timeout */
}
- if (!(NCR5380_read(BUS_AND_STATUS_REG) & BASR_END_DMA_TRANSFER)) {
- printk(KERN_ERR "53C400w: no end dma signal\n");
- }
-
while (!(NCR5380_read(TARGET_COMMAND_REG) & TCR_LAST_BYTE_SENT))
; // TIMEOUT
+
+ if (NCR5380_poll_politely(hostdata, BUS_AND_STATUS_REG,
+ BASR_END_DMA_TRANSFER, BASR_END_DMA_TRANSFER,
+ HZ / 64) < 0)
+ scmd_printk(KERN_ERR, hostdata->connected, "%s: End of DMA timeout (%d)\n",
+ __func__, hostdata->pdma_residual);
+
return 0;
}
@@ -642,6 +651,11 @@ static int generic_NCR5380_dma_xfer_len(struct NCR5380_hostdata *hostdata,
return min(transfersize, DMA_MAX_SIZE);
}
+static int generic_NCR5380_dma_residual(struct NCR5380_hostdata *hostdata)
+{
+ return hostdata->pdma_residual;
+}
+
/*
* Include the NCR5380 core code that we build our driver around
*/
--
2.13.0
Signed-off-by: Finn Thain <[email protected]>
---
drivers/scsi/g_NCR5380.c | 61 ++++++++++++++++++++++--------------------------
1 file changed, 28 insertions(+), 33 deletions(-)
diff --git a/drivers/scsi/g_NCR5380.c b/drivers/scsi/g_NCR5380.c
index 911a4300ea51..9f18082415c4 100644
--- a/drivers/scsi/g_NCR5380.c
+++ b/drivers/scsi/g_NCR5380.c
@@ -1,17 +1,17 @@
/*
* Generic Generic NCR5380 driver
- *
+ *
* Copyright 1993, Drew Eckhardt
- * Visionary Computing
- * (Unix and Linux consulting and custom programming)
- * [email protected]
- * +1 (303) 440-4894
+ * Visionary Computing
+ * (Unix and Linux consulting and custom programming)
+ * [email protected]
+ * +1 (303) 440-4894
*
* NCR53C400 extensions (c) 1994,1995,1996, Kevin Lentin
- * [email protected]
+ * [email protected]
*
* NCR53C400A extensions (c) 1996, Ingmar Baumgart
- * [email protected]
+ * [email protected]
*
* DTC3181E extensions (c) 1997, Ronald van Cuijlenborg
* [email protected] or [email protected]
@@ -481,15 +481,14 @@ static void generic_NCR5380_release_resources(struct Scsi_Host *instance)
}
/**
- * generic_NCR5380_pread - pseudo DMA read
- * @hostdata: scsi host private data
- * @dst: buffer to read into
- * @len: buffer length
+ * generic_NCR5380_pread - pseudo DMA receive
+ * @hostdata: scsi host private data
+ * @dst: buffer to write into
+ * @len: transfer size
*
- * Perform a pseudo DMA mode read from an NCR53C400 or equivalent
- * controller
+ * Perform a pseudo DMA mode receive from a 53C400 or equivalent device.
*/
-
+
static inline int generic_NCR5380_pread(struct NCR5380_hostdata *hostdata,
unsigned char *dst, int len)
{
@@ -508,10 +507,10 @@ static inline int generic_NCR5380_pread(struct NCR5380_hostdata *hostdata,
if (hostdata->io_port && hostdata->io_width == 2)
insw(hostdata->io_port + hostdata->c400_host_buf,
- dst + start, 64);
+ dst + start, 64);
else if (hostdata->io_port)
insb(hostdata->io_port + hostdata->c400_host_buf,
- dst + start, 128);
+ dst + start, 128);
else
memcpy_fromio(dst + start,
hostdata->io + NCR53C400_host_buffer, 128);
@@ -558,13 +557,12 @@ static inline int generic_NCR5380_pread(struct NCR5380_hostdata *hostdata,
}
/**
- * generic_NCR5380_pwrite - pseudo DMA write
- * @hostdata: scsi host private data
- * @dst: buffer to read into
- * @len: buffer length
+ * generic_NCR5380_pwrite - pseudo DMA send
+ * @hostdata: scsi host private data
+ * @src: buffer to read from
+ * @len: transfer size
*
- * Perform a pseudo DMA mode read from an NCR53C400 or equivalent
- * controller
+ * Perform a pseudo DMA mode send to a 53C400 or equivalent device.
*/
static inline int generic_NCR5380_pwrite(struct NCR5380_hostdata *hostdata,
@@ -603,10 +601,10 @@ static inline int generic_NCR5380_pwrite(struct NCR5380_hostdata *hostdata,
if (hostdata->io_port && hostdata->io_width == 2)
outsw(hostdata->io_port + hostdata->c400_host_buf,
- src + start, 64);
+ src + start, 64);
else if (hostdata->io_port)
outsb(hostdata->io_port + hostdata->c400_host_buf,
- src + start, 128);
+ src + start, 128);
else
memcpy_toio(hostdata->io + NCR53C400_host_buffer,
src + start, 128);
@@ -656,10 +654,8 @@ static int generic_NCR5380_dma_residual(struct NCR5380_hostdata *hostdata)
return hostdata->pdma_residual;
}
-/*
- * Include the NCR5380 core code that we build our driver around
- */
-
+/* Include the core driver code. */
+
#include "NCR5380.c"
static struct scsi_host_template driver_template = {
@@ -679,11 +675,10 @@ static struct scsi_host_template driver_template = {
.max_sectors = 128,
};
-
static int generic_NCR5380_isa_match(struct device *pdev, unsigned int ndev)
{
int ret = generic_NCR5380_init_one(&driver_template, pdev, base[ndev],
- irq[ndev], card[ndev]);
+ irq[ndev], card[ndev]);
if (ret) {
if (base[ndev])
printk(KERN_WARNING "Card not found at address 0x%03x\n",
@@ -695,7 +690,7 @@ static int generic_NCR5380_isa_match(struct device *pdev, unsigned int ndev)
}
static int generic_NCR5380_isa_remove(struct device *pdev,
- unsigned int ndev)
+ unsigned int ndev)
{
generic_NCR5380_release_resources(dev_get_drvdata(pdev));
dev_set_drvdata(pdev, NULL);
@@ -718,7 +713,7 @@ static struct pnp_device_id generic_NCR5380_pnp_ids[] = {
MODULE_DEVICE_TABLE(pnp, generic_NCR5380_pnp_ids);
static int generic_NCR5380_pnp_probe(struct pnp_dev *pdev,
- const struct pnp_device_id *id)
+ const struct pnp_device_id *id)
{
int base, irq;
@@ -729,7 +724,7 @@ static int generic_NCR5380_pnp_probe(struct pnp_dev *pdev,
irq = pnp_irq(pdev, 0);
return generic_NCR5380_init_one(&driver_template, &pdev->dev, base, irq,
- id->driver_data);
+ id->driver_data);
}
static void generic_NCR5380_pnp_remove(struct pnp_dev *pdev)
--
2.13.0
On Monday 26 June 2017 09:30:33 Finn Thain wrote:
> Ondrej, would you please test this new series?
>
> Changed since v1:
> - PDMA transfer residual is calculated earlier.
> - End of DMA flag check is now polled (if there is any residual).
>
> Changed since v2:
> - Bail out of transfer loops when Gated IRQ gets asserted.
> - Make udelay conditional on board type.
> - Drop sg_tablesize patch due to performance regression.
>
>
> Finn Thain (1):
> g_NCR5380: Cleanup comments and whitespace
>
> Ondrej Zary (3):
> g_NCR5380: Fix PDMA transfer size
> g_NCR5380: End PDMA transfer correctly on target disconnection
> g_NCR5380: Re-work PDMA loops
>
> drivers/scsi/g_NCR5380.c | 239
> +++++++++++++++++++++++------------------------ 1 file changed, 116
> insertions(+), 123 deletions(-)
No apparent change in behavior, the first write test resulted in:
[ 842.830802] sd 2:0:1:0: [sdb] tag#0 53c80 registers not accessible, device will be reset
[ 842.830802] sd 2:0:1:0: [sdb] tag#0 switching to slow handshake
Checking for IRQ after poll_politely2 does not seem right because we can
miss the buffer ready signal.
According to my tests, buffer ready signal is most important - if there is
any data to read/write, do the transfer. If not, only then check why - maybe
we got an IRQ (that terminated PDMA). Or no IRQ, sometimes the wait for buffer
ready times out - we need to terminate PDMA manually then (reset).
Then 53C80 registers should become ready.
This is a log from writing 230 MB file using my code with some debug
prints, on a NCR53C400. No 53C80 timeouts, only disconnects and some host
buffer timeouts (maybe the drive sometimes just slows down without
disconnecting?)
[ 3378.503828] basr=0x10 csr=0xd5 at start=512
[ 3461.257973] w basr=0x10 csr=0x95 at start=3840
[ 3461.838225] w basr=0x10 csr=0x95 at start=3840
[ 3462.683446] w basr=0x10 csr=0x95 at start=3840
[ 3463.416911] w basr=0x10 csr=0x95 at start=3840
[ 3465.117804] scsi host2: 53C400w: host buffer not ready in time
[ 3465.276375] w basr=0x10 csr=0x95 at start=3328
[ 3466.457701] w basr=0x10 csr=0x95 at start=1792
[ 3467.019643] scsi host2: 53C400w: host buffer not ready in time
[ 3467.619058] scsi host2: 53C400w: host buffer not ready in time
[ 3467.799619] w basr=0x10 csr=0x95 at start=3840
[ 3552.123501] w basr=0x10 csr=0x95 at start=2304
[ 3552.771223] w basr=0x10 csr=0x95 at start=1280
[ 3554.556451] w basr=0x10 csr=0x95 at start=2816
[ 3555.229646] w basr=0x10 csr=0x95 at start=1792
[ 3555.630632] scsi host2: 53C400w: host buffer not ready in time
[ 3555.774560] w basr=0x10 csr=0x95 at start=768
[ 3625.541608] w basr=0x10 csr=0x95 at start=3328
[ 3640.099861] w basr=0x10 csr=0x95 at start=1792
[ 3641.442671] w basr=0x10 csr=0x95 at start=2816
[ 3641.865469] w basr=0x10 csr=0x95 at start=768
[ 3642.939223] w basr=0x10 csr=0x95 at start=1280
[ 3643.356858] w basr=0x10 csr=0x95 at start=3328
[ 3643.701636] w basr=0x10 csr=0x95 at start=3840
[ 3645.153405] w basr=0x10 csr=0x95 at start=2304
[ 3646.135642] w basr=0x10 csr=0x95 at start=1280
[ 3647.007321] w basr=0x10 csr=0x95 at start=2816
[ 3648.065874] w basr=0x10 csr=0x95 at start=3328
[ 3650.071961] w basr=0x10 csr=0x95 at start=1280
[ 3650.827630] w basr=0x10 csr=0x95 at start=1792
[ 3651.827011] w basr=0x10 csr=0x95 at start=2816
[ 3652.559984] w basr=0x10 csr=0x95 at start=2816
[ 3653.203566] w basr=0x10 csr=0x95 at start=3328
[ 3653.594376] w basr=0x10 csr=0x95 at start=1280
[ 3653.903437] w basr=0x10 csr=0x95 at start=3840
[ 3654.305753] w basr=0x10 csr=0x95 at start=1792
[ 3654.676009] w basr=0x10 csr=0x95 at start=2304
[ 3655.367686] w basr=0x10 csr=0x95 at start=2816
[ 3655.733854] w basr=0x10 csr=0x95 at start=768
[ 3656.075023] w basr=0x10 csr=0x95 at start=3328
[ 3656.493046] w basr=0x10 csr=0x95 at start=2816
[ 3657.208089] w basr=0x10 csr=0x95 at start=1280
[ 3657.537223] w basr=0x10 csr=0x95 at start=1280
And this is from reading the file back:
[ 3799.053067] basr=0x10 csr=0xd5 at start=512
[ 3801.056337] basr=0x10 csr=0xd5 at start=3584
[ 3976.323836] scsi host2: 53C400r: host buffer not ready in time
[ 3976.404699] basr=0x10 csr=0xd5 at start=512
[ 3977.800647] basr=0x10 csr=0xd5 at start=512
[ 3979.240611] scsi host2: 53C400r: host buffer not ready in time
[ 3979.320698] basr=0x10 csr=0xd5 at start=512
[ 3980.040220] scsi host2: 53C400r: host buffer not ready in time
[ 3980.096401] basr=0x10 csr=0xd5 at start=512
[ 3980.394854] scsi host2: 53C400r: host buffer not ready in time
--
Ondrej Zary
On Mon, 26 Jun 2017, Ondrej Zary wrote:
>
> No apparent change in behavior, the first write test resulted in:
> [ 842.830802] sd 2:0:1:0: [sdb] tag#0 53c80 registers not accessible, device will be reset
> [ 842.830802] sd 2:0:1:0: [sdb] tag#0 switching to slow handshake
>
> Checking for IRQ after poll_politely2 does not seem right because we can
> miss the buffer ready signal.
>
How so? As long as there's no gated IRQ, we poll for buffer readiness
until timeout. And when there is a gated IRQ, we break both the polling
loop and the transfer loop immediately. Your code and mine are basically
in agreement here.
> According to my tests, buffer ready signal is most important - if there
> is any data to read/write, do the transfer. If not, only then check why
> - maybe we got an IRQ (that terminated PDMA). Or no IRQ, sometimes the
> wait for buffer ready times out - we need to terminate PDMA manually
> then (reset).
>
> Then 53C80 registers should become ready.
>
You seem to be saying that we should ignore the IRQ signal if the buffers
have become ready. Maybe so. Can we try simply resetting the block
counter? (I could imagine that the 53c400 core might leave the 53c80
registers inaccessible unless we keep accessing the buffers in the 53c400
core until the transfer is done.)
BTW, with regard to your patch, note that this construct is race prone:
while (1) { /* monitor IRQ while waiting for host buffer */
csr = NCR5380_read(hostdata->c400_ctl_status);
if (!(csr & CSR_HOST_BUF_NOT_RDY))
break;
if (csr & CSR_GATED_53C80_IRQ) {
basr = NCR5380_read(BUS_AND_STATUS_REG);
if (!(basr & BASR_PHASE_MATCH) ||
(basr & BASR_BUSY_ERROR)) {
printk("basr=0x%02x csr=0x%02x at start=%d\n", basr, csr, start);
goto out_wait;
}
}
if (retries-- < 1) {
shost_printk(KERN_ERR, hostdata->host, "53C400r: host buffer not ready in time\n");
NCR5380_write(hostdata->c400_ctl_status, CSR_RESET);
NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
goto out_wait;
}
}
This code can "goto out_wait" when !(csr & CSR_HOST_BUF_NOT_RDY). It
depends on timing. This would seem to be contrary to your stated aim.
Moreover, this code can also "break" when (csr & CSR_GATED_53C80_IRQ).
That depends on timing too. But this may be an improvement on my code if
it allows the 53c80 registers to become accessible, by allowing the block
counter to be decremented.
The uncertainty here was one of the reasons I reworked this code.
> This is a log from writing 230 MB file using my code with some debug
> prints, on a NCR53C400. No 53C80 timeouts, only disconnects and some
> host buffer timeouts (maybe the drive sometimes just slows down without
> disconnecting?)
>
> [ 3378.503828] basr=0x10 csr=0xd5 at start=512
> [ 3461.257973] w basr=0x10 csr=0x95 at start=3840
> [ 3461.838225] w basr=0x10 csr=0x95 at start=3840
> [ 3462.683446] w basr=0x10 csr=0x95 at start=3840
> [ 3463.416911] w basr=0x10 csr=0x95 at start=3840
> [ 3465.117804] scsi host2: 53C400w: host buffer not ready in time
> [ 3465.276375] w basr=0x10 csr=0x95 at start=3328
> [ 3466.457701] w basr=0x10 csr=0x95 at start=1792
> [ 3467.019643] scsi host2: 53C400w: host buffer not ready in time
> [ 3467.619058] scsi host2: 53C400w: host buffer not ready in time
> [ 3467.799619] w basr=0x10 csr=0x95 at start=3840
> [ 3552.123501] w basr=0x10 csr=0x95 at start=2304
> [ 3552.771223] w basr=0x10 csr=0x95 at start=1280
> [ 3554.556451] w basr=0x10 csr=0x95 at start=2816
> [ 3555.229646] w basr=0x10 csr=0x95 at start=1792
> [ 3555.630632] scsi host2: 53C400w: host buffer not ready in time
> [ 3555.774560] w basr=0x10 csr=0x95 at start=768
> [ 3625.541608] w basr=0x10 csr=0x95 at start=3328
> [ 3640.099861] w basr=0x10 csr=0x95 at start=1792
> [ 3641.442671] w basr=0x10 csr=0x95 at start=2816
> [ 3641.865469] w basr=0x10 csr=0x95 at start=768
> [ 3642.939223] w basr=0x10 csr=0x95 at start=1280
> [ 3643.356858] w basr=0x10 csr=0x95 at start=3328
> [ 3643.701636] w basr=0x10 csr=0x95 at start=3840
> [ 3645.153405] w basr=0x10 csr=0x95 at start=2304
> [ 3646.135642] w basr=0x10 csr=0x95 at start=1280
> [ 3647.007321] w basr=0x10 csr=0x95 at start=2816
> [ 3648.065874] w basr=0x10 csr=0x95 at start=3328
> [ 3650.071961] w basr=0x10 csr=0x95 at start=1280
> [ 3650.827630] w basr=0x10 csr=0x95 at start=1792
> [ 3651.827011] w basr=0x10 csr=0x95 at start=2816
> [ 3652.559984] w basr=0x10 csr=0x95 at start=2816
> [ 3653.203566] w basr=0x10 csr=0x95 at start=3328
> [ 3653.594376] w basr=0x10 csr=0x95 at start=1280
> [ 3653.903437] w basr=0x10 csr=0x95 at start=3840
> [ 3654.305753] w basr=0x10 csr=0x95 at start=1792
> [ 3654.676009] w basr=0x10 csr=0x95 at start=2304
> [ 3655.367686] w basr=0x10 csr=0x95 at start=2816
> [ 3655.733854] w basr=0x10 csr=0x95 at start=768
> [ 3656.075023] w basr=0x10 csr=0x95 at start=3328
> [ 3656.493046] w basr=0x10 csr=0x95 at start=2816
> [ 3657.208089] w basr=0x10 csr=0x95 at start=1280
> [ 3657.537223] w basr=0x10 csr=0x95 at start=1280
>
> And this is from reading the file back:
> [ 3799.053067] basr=0x10 csr=0xd5 at start=512
> [ 3801.056337] basr=0x10 csr=0xd5 at start=3584
> [ 3976.323836] scsi host2: 53C400r: host buffer not ready in time
> [ 3976.404699] basr=0x10 csr=0xd5 at start=512
> [ 3977.800647] basr=0x10 csr=0xd5 at start=512
> [ 3979.240611] scsi host2: 53C400r: host buffer not ready in time
> [ 3979.320698] basr=0x10 csr=0xd5 at start=512
> [ 3980.040220] scsi host2: 53C400r: host buffer not ready in time
> [ 3980.096401] basr=0x10 csr=0xd5 at start=512
> [ 3980.394854] scsi host2: 53C400r: host buffer not ready in time
>
The register values look normal (?)
Anyway, there are only a few material differences between your code and
this patch series.
1) Your code does not break the transfer loop for any Gated IRQ, but only
for a phase error IRQ. My version responds to any Gated IRQ, as per the
algorithm in the datasheet. Your code seems to assume that the 53c80
registers are accessible whenever gated IRQ is set, which seems
unlikely.
2) My code fails the transfer when the 53c80 registers don't become
accessible, which is why you see a fallback to PIO (slow handshake).
3) We use different timeouts.
4) In my code, the race condition applies only after the timeout has
already expired, so it's inconsequential.
Please apply the patch below (on top of this series) so we can perhaps
narrow this down a bit.
diff --git a/drivers/scsi/g_NCR5380.c b/drivers/scsi/g_NCR5380.c
index 28233ec49fdc..2e8ff001af46 100644
--- a/drivers/scsi/g_NCR5380.c
+++ b/drivers/scsi/g_NCR5380.c
@@ -526,6 +526,8 @@ static inline int generic_NCR5380_pread(struct NCR5380_hostdata *hostdata,
NCR5380_write(hostdata->c400_blk_cnt, len / 128);
for (start = 0; start < len; start += 128) {
+ udelay(500);
+
if (NCR5380_poll_politely2(hostdata, hostdata->c400_ctl_status,
CSR_HOST_BUF_NOT_RDY, 0,
hostdata->c400_ctl_status,
@@ -549,7 +551,7 @@ static inline int generic_NCR5380_pread(struct NCR5380_hostdata *hostdata,
}
hostdata->pdma_residual = NCR5380_read(hostdata->c400_blk_cnt) * 128;
-
+ NCR5380_write(hostdata->c400_blk_cnt, 0);
result = wait_for_53c80_access(hostdata);
if (hostdata->pdma_residual == 0 &&
@@ -581,6 +583,8 @@ static inline int generic_NCR5380_pwrite(struct NCR5380_hostdata *hostdata,
NCR5380_write(hostdata->c400_blk_cnt, len / 128);
for (start = 0; start < len; start += 128) {
+ udelay(500);
+
if (NCR5380_poll_politely2(hostdata, hostdata->c400_ctl_status,
CSR_HOST_BUF_NOT_RDY, 0,
hostdata->c400_ctl_status,
@@ -604,7 +608,7 @@ static inline int generic_NCR5380_pwrite(struct NCR5380_hostdata *hostdata,
}
hostdata->pdma_residual = NCR5380_read(hostdata->c400_blk_cnt) * 128;
-
+ NCR5380_write(hostdata->c400_blk_cnt, 0);
result = wait_for_53c80_access(hostdata);
if (hostdata->pdma_residual == 0) {
On Monday 26 June 2017, Ondrej Zary wrote:
> On Monday 26 June 2017 09:30:33 Finn Thain wrote:
> > Ondrej, would you please test this new series?
> >
> > Changed since v1:
> > - PDMA transfer residual is calculated earlier.
> > - End of DMA flag check is now polled (if there is any residual).
> >
> > Changed since v2:
> > - Bail out of transfer loops when Gated IRQ gets asserted.
> > - Make udelay conditional on board type.
> > - Drop sg_tablesize patch due to performance regression.
> >
> >
> > Finn Thain (1):
> > g_NCR5380: Cleanup comments and whitespace
> >
> > Ondrej Zary (3):
> > g_NCR5380: Fix PDMA transfer size
> > g_NCR5380: End PDMA transfer correctly on target disconnection
> > g_NCR5380: Re-work PDMA loops
> >
> > drivers/scsi/g_NCR5380.c | 239
> > +++++++++++++++++++++++------------------------ 1 file changed, 116
> > insertions(+), 123 deletions(-)
BTW. I've probably found the DTC write corruption.
Added the following check (13 is host buffer index register) - and it triggers
sometimes: the value is 1 instead of 0. As we use only 16-bit writes, I don't
see how the value could ever be odd. Looks like a bug in the chip.
The index register corrupts during the transfer, not after IRQ or timeout. The
same check at beginning of pwrite() did not trigger.
The index register is not writable so we must(?) reset the PDMA engine to
recover. However, this quick attempt to fix does not work, maybe we should
reload the block count and continue?
--- a/drivers/scsi/g_NCR5380.c
+++ b/drivers/scsi/g_NCR5380.c
@@ -595,7 +603,13 @@ static inline int generic_NCR5380_pwrite(struct
NCR5380_hostdata *hostdata,
goto out_wait;
}
}
-
+ idx = NCR5380_read(13);
+ if (idx != 0) {
+ printk("host idx=%d, start=%d\n", idx, start);
+ NCR5380_write(hostdata->c400_ctl_status, CSR_RESET);
+ NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
+ goto out_wait;
+ }
if (hostdata->io_port && hostdata->io_width == 2)
outsw(hostdata->io_port + hostdata->c400_host_buf,
src + start, 64);
--
Ondrej Zary
Ondrej,
could this be a partial write (target did not transfer the last byte)?
One would suppose the chip posts a phase mismatch in that case ...
Cheers,
Michael
Am 27.06.2017 um 18:28 schrieb Ondrej Zary:
> On Monday 26 June 2017, Ondrej Zary wrote:
>> On Monday 26 June 2017 09:30:33 Finn Thain wrote:
>>> Ondrej, would you please test this new series?
>>>
>>> Changed since v1:
>>> - PDMA transfer residual is calculated earlier.
>>> - End of DMA flag check is now polled (if there is any residual).
>>>
>>> Changed since v2:
>>> - Bail out of transfer loops when Gated IRQ gets asserted.
>>> - Make udelay conditional on board type.
>>> - Drop sg_tablesize patch due to performance regression.
>>>
>>>
>>> Finn Thain (1):
>>> g_NCR5380: Cleanup comments and whitespace
>>>
>>> Ondrej Zary (3):
>>> g_NCR5380: Fix PDMA transfer size
>>> g_NCR5380: End PDMA transfer correctly on target disconnection
>>> g_NCR5380: Re-work PDMA loops
>>>
>>> drivers/scsi/g_NCR5380.c | 239
>>> +++++++++++++++++++++++------------------------ 1 file changed, 116
>>> insertions(+), 123 deletions(-)
>
> BTW. I've probably found the DTC write corruption.
> Added the following check (13 is host buffer index register) - and it triggers
> sometimes: the value is 1 instead of 0. As we use only 16-bit writes, I don't
> see how the value could ever be odd. Looks like a bug in the chip.
> The index register corrupts during the transfer, not after IRQ or timeout. The
> same check at beginning of pwrite() did not trigger.
>
> The index register is not writable so we must(?) reset the PDMA engine to
> recover. However, this quick attempt to fix does not work, maybe we should
> reload the block count and continue?
>
> --- a/drivers/scsi/g_NCR5380.c
> +++ b/drivers/scsi/g_NCR5380.c
> @@ -595,7 +603,13 @@ static inline int generic_NCR5380_pwrite(struct
> NCR5380_hostdata *hostdata,
> goto out_wait;
> }
> }
> -
> + idx = NCR5380_read(13);
> + if (idx != 0) {
> + printk("host idx=%d, start=%d\n", idx, start);
> + NCR5380_write(hostdata->c400_ctl_status, CSR_RESET);
> + NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
> + goto out_wait;
> + }
> if (hostdata->io_port && hostdata->io_width == 2)
> outsw(hostdata->io_port + hostdata->c400_host_buf,
> src + start, 64);
>
>
On Tue, 27 Jun 2017, Ondrej Zary wrote:
> BTW. I've probably found the DTC write corruption. Added the following
> check (13 is host buffer index register) -
That register is not mentioned in my 53c400 datasheet.
> and it triggers sometimes: the value is 1 instead of 0. As we use only
> 16-bit writes, I don't see how the value could ever be odd. Looks like a
> bug in the chip. The index register corrupts during the transfer, not
> after IRQ or timeout. The same check at beginning of pwrite() did not
> trigger.
>
Are you reading this register at the right moment? Have you tried waiting
for it to reach zero, as in,
if (NCR5380_poll_politely(hostdata, 13, 0xff, 0, HZ / 64) < 0)
/* printk, reset etc */;
Even if this is a reliable way to detect a short transfer, it would be
nice to know the root cause. But I'm being unrealistic: the DTC436 vendor
never responded to my requests for technical documentation.
> The index register is not writable so we must(?) reset the PDMA engine
> to recover. However, this quick attempt to fix does not work, maybe we
> should reload the block count and continue?
>
I don't know if it is possible to recover. If the last byte never reached
the scsi bus, then once you reset the 53c400 core, you need the driver to
perform a single-byte PIO transfer after the short PDMA transfer. This
would require that you set the residual appropriately (though in my
experience that may not be sufficient).
It may be better to simply limit the transfer to 512 bytes instead of
attempting to recover based on an undocumented (?) register, etc. Seems
like a bit of a hack.
> --- a/drivers/scsi/g_NCR5380.c
> +++ b/drivers/scsi/g_NCR5380.c
> @@ -595,7 +603,13 @@ static inline int generic_NCR5380_pwrite(struct
> NCR5380_hostdata *hostdata,
> goto out_wait;
> }
> }
> -
> + idx = NCR5380_read(13);
> + if (idx != 0) {
> + printk("host idx=%d, start=%d\n", idx, start);
> + NCR5380_write(hostdata->c400_ctl_status, CSR_RESET);
> + NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
> + goto out_wait;
> + }
> if (hostdata->io_port && hostdata->io_width == 2)
> outsw(hostdata->io_port + hostdata->c400_host_buf,
> src + start, 64);
>
I find it hard to reason about this code. For example, out_wait is to be
removed. Let's get the preceding patches working and signed-off. Please go
ahead and use a 512 B transfer for DTC436 testing if that will help get
this patch series over the line.
--
On Tue, 27 Jun 2017, Michael Schmitz wrote:
> Ondrej,
>
> could this be a partial write (target did not transfer the last byte)?
>
We do wait for TCR_LAST_BYTE_SENT, but only when there is no residual.
Perhaps we should wait for TCR_LAST_BYTE_SENT whenever the 53c400 asserts
/EOP. That is, whenever BASR_END_DMA_TRANSFER is set.
--
> One would suppose the chip posts a phase mismatch in that case ...
>
> Cheers,
>
> Michael
>
On Tuesday 27 June 2017 14:42:29 Finn Thain wrote:
> On Tue, 27 Jun 2017, Ondrej Zary wrote:
> > BTW. I've probably found the DTC write corruption. Added the following
> > check (13 is host buffer index register) -
>
> That register is not mentioned in my 53c400 datasheet.
Yes, it's not there. But we don't have 53C400A and DTC436 datasheets (this
register works on both).
> > and it triggers sometimes: the value is 1 instead of 0. As we use only
> > 16-bit writes, I don't see how the value could ever be odd. Looks like a
> > bug in the chip. The index register corrupts during the transfer, not
> > after IRQ or timeout. The same check at beginning of pwrite() did not
> > trigger.
>
> Are you reading this register at the right moment? Have you tried waiting
> for it to reach zero, as in,
>
> if (NCR5380_poll_politely(hostdata, 13, 0xff, 0, HZ / 64) < 0)
> /* printk, reset etc */;
I have not but will try (expecting that it will not change by itself).
> Even if this is a reliable way to detect a short transfer, it would be
> nice to know the root cause. But I'm being unrealistic: the DTC436 vendor
> never responded to my requests for technical documentation.
According to the data corruption observed, it's not a short transfer. The
corruption is always the same: one byte missing at the beginning of a 128 B
block. It happens only with slow Quantum LPS 240 drive, not with faster IBM
DORS-32160.
> > The index register is not writable so we must(?) reset the PDMA engine
> > to recover. However, this quick attempt to fix does not work, maybe we
> > should reload the block count and continue?
>
> I don't know if it is possible to recover. If the last byte never reached
> the scsi bus, then once you reset the 53c400 core, you need the driver to
> perform a single-byte PIO transfer after the short PDMA transfer. This
> would require that you set the residual appropriately (though in my
> experience that may not be sufficient).
>
> It may be better to simply limit the transfer to 512 bytes instead of
> attempting to recover based on an undocumented (?) register, etc. Seems
> like a bit of a hack.
>
> > --- a/drivers/scsi/g_NCR5380.c
> > +++ b/drivers/scsi/g_NCR5380.c
> > @@ -595,7 +603,13 @@ static inline int generic_NCR5380_pwrite(struct
> > NCR5380_hostdata *hostdata,
> > goto out_wait;
> > }
> > }
> > -
> > + idx = NCR5380_read(13);
> > + if (idx != 0) {
> > + printk("host idx=%d, start=%d\n", idx, start);
> > + NCR5380_write(hostdata->c400_ctl_status,
> > CSR_RESET); +
> > NCR5380_write(hostdata->c400_ctl_status, CSR_BASE); +
> > goto out_wait;
> > + }
> > if (hostdata->io_port && hostdata->io_width == 2)
> > outsw(hostdata->io_port +
> > hostdata->c400_host_buf, src + start, 64);
>
> I find it hard to reason about this code. For example, out_wait is to be
> removed. Let's get the preceding patches working and signed-off. Please go
> ahead and use a 512 B transfer for DTC436 testing if that will help get
> this patch series over the line.
OK, I agree. Let's fix the problems first and leave this hack for later.
--
Ondrej Zary
On Tuesday 27 June 2017 03:49:16 Finn Thain wrote:
> On Mon, 26 Jun 2017, Ondrej Zary wrote:
> > No apparent change in behavior, the first write test resulted in:
> > [ 842.830802] sd 2:0:1:0: [sdb] tag#0 53c80 registers not accessible,
> > device will be reset [ 842.830802] sd 2:0:1:0: [sdb] tag#0 switching to
> > slow handshake
> >
> > Checking for IRQ after poll_politely2 does not seem right because we can
> > miss the buffer ready signal.
>
> How so? As long as there's no gated IRQ, we poll for buffer readiness
> until timeout. And when there is a gated IRQ, we break both the polling
> loop and the transfer loop immediately. Your code and mine are basically
> in agreement here.
Yes, it stops transfer when an IRQ arrives. But the host buffer could be ready
at the same time. The IRQ can be an "end-of-DMA" IRQ (IIRC DTC chips assert
this earlier than 53C400). Or just a disconnect that occured right now but the
chip already read a buffer full of data.
> > According to my tests, buffer ready signal is most important - if there
> > is any data to read/write, do the transfer. If not, only then check why
> > - maybe we got an IRQ (that terminated PDMA). Or no IRQ, sometimes the
> > wait for buffer ready times out - we need to terminate PDMA manually
> > then (reset).
> >
> > Then 53C80 registers should become ready.
>
> You seem to be saying that we should ignore the IRQ signal if the buffers
> have become ready. Maybe so. Can we try simply resetting the block
> counter? (I could imagine that the 53c400 core might leave the 53c80
> registers inaccessible unless we keep accessing the buffers in the 53c400
> core until the transfer is done.)
We can't reset the block counter because 0 means 256 blocks to transfer (page
13 in datasheet).
Yes, the 53C80 registers seem to become available only when the PDMA transfer
ends by either:
1. transferring all blocks (block counter decrementing to zero)
2. IRQ
3. reset
> BTW, with regard to your patch, note that this construct is race prone:
>
> while (1) { /* monitor IRQ while waiting for host buffer */
> csr = NCR5380_read(hostdata->c400_ctl_status);
> if (!(csr & CSR_HOST_BUF_NOT_RDY))
> break;
> if (csr & CSR_GATED_53C80_IRQ) {
> basr = NCR5380_read(BUS_AND_STATUS_REG);
> if (!(basr & BASR_PHASE_MATCH) ||
> (basr & BASR_BUSY_ERROR)) {
> printk("basr=0x%02x csr=0x%02x at start=%d\n", basr, csr, start);
> goto out_wait;
> }
> }
> if (retries-- < 1) {
> shost_printk(KERN_ERR, hostdata->host, "53C400r: host buffer not ready in
> time\n"); NCR5380_write(hostdata->c400_ctl_status, CSR_RESET);
> NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
> goto out_wait;
> }
> }
>
> This code can "goto out_wait" when !(csr & CSR_HOST_BUF_NOT_RDY). It
> depends on timing. This would seem to be contrary to your stated aim.
>
> Moreover, this code can also "break" when (csr & CSR_GATED_53C80_IRQ).
> That depends on timing too. But this may be an improvement on my code if
> it allows the 53c80 registers to become accessible, by allowing the block
> counter to be decremented.
Yes, it continue the transfer even if the IRQ is asserted - as long as the
buffer is ready. That's intended.
> The uncertainty here was one of the reasons I reworked this code.
My version reads CSR only once per loop but that probably does not help at all
as the HW state could change anytime. The chip's design seems to be very
race-prone.
> > This is a log from writing 230 MB file using my code with some debug
> > prints, on a NCR53C400. No 53C80 timeouts, only disconnects and some
> > host buffer timeouts (maybe the drive sometimes just slows down without
> > disconnecting?)
> >
> > [ 3378.503828] basr=0x10 csr=0xd5 at start=512
> > [ 3461.257973] w basr=0x10 csr=0x95 at start=3840
> > [ 3461.838225] w basr=0x10 csr=0x95 at start=3840
> > [ 3462.683446] w basr=0x10 csr=0x95 at start=3840
> > [ 3463.416911] w basr=0x10 csr=0x95 at start=3840
> > [ 3465.117804] scsi host2: 53C400w: host buffer not ready in time
> > [ 3465.276375] w basr=0x10 csr=0x95 at start=3328
> > [ 3466.457701] w basr=0x10 csr=0x95 at start=1792
> > [ 3467.019643] scsi host2: 53C400w: host buffer not ready in time
> > [ 3467.619058] scsi host2: 53C400w: host buffer not ready in time
> > [ 3467.799619] w basr=0x10 csr=0x95 at start=3840
> > [ 3552.123501] w basr=0x10 csr=0x95 at start=2304
> > [ 3552.771223] w basr=0x10 csr=0x95 at start=1280
> > [ 3554.556451] w basr=0x10 csr=0x95 at start=2816
> > [ 3555.229646] w basr=0x10 csr=0x95 at start=1792
> > [ 3555.630632] scsi host2: 53C400w: host buffer not ready in time
> > [ 3555.774560] w basr=0x10 csr=0x95 at start=768
> > [ 3625.541608] w basr=0x10 csr=0x95 at start=3328
> > [ 3640.099861] w basr=0x10 csr=0x95 at start=1792
> > [ 3641.442671] w basr=0x10 csr=0x95 at start=2816
> > [ 3641.865469] w basr=0x10 csr=0x95 at start=768
> > [ 3642.939223] w basr=0x10 csr=0x95 at start=1280
> > [ 3643.356858] w basr=0x10 csr=0x95 at start=3328
> > [ 3643.701636] w basr=0x10 csr=0x95 at start=3840
> > [ 3645.153405] w basr=0x10 csr=0x95 at start=2304
> > [ 3646.135642] w basr=0x10 csr=0x95 at start=1280
> > [ 3647.007321] w basr=0x10 csr=0x95 at start=2816
> > [ 3648.065874] w basr=0x10 csr=0x95 at start=3328
> > [ 3650.071961] w basr=0x10 csr=0x95 at start=1280
> > [ 3650.827630] w basr=0x10 csr=0x95 at start=1792
> > [ 3651.827011] w basr=0x10 csr=0x95 at start=2816
> > [ 3652.559984] w basr=0x10 csr=0x95 at start=2816
> > [ 3653.203566] w basr=0x10 csr=0x95 at start=3328
> > [ 3653.594376] w basr=0x10 csr=0x95 at start=1280
> > [ 3653.903437] w basr=0x10 csr=0x95 at start=3840
> > [ 3654.305753] w basr=0x10 csr=0x95 at start=1792
> > [ 3654.676009] w basr=0x10 csr=0x95 at start=2304
> > [ 3655.367686] w basr=0x10 csr=0x95 at start=2816
> > [ 3655.733854] w basr=0x10 csr=0x95 at start=768
> > [ 3656.075023] w basr=0x10 csr=0x95 at start=3328
> > [ 3656.493046] w basr=0x10 csr=0x95 at start=2816
> > [ 3657.208089] w basr=0x10 csr=0x95 at start=1280
> > [ 3657.537223] w basr=0x10 csr=0x95 at start=1280
> >
> > And this is from reading the file back:
> > [ 3799.053067] basr=0x10 csr=0xd5 at start=512
> > [ 3801.056337] basr=0x10 csr=0xd5 at start=3584
> > [ 3976.323836] scsi host2: 53C400r: host buffer not ready in time
> > [ 3976.404699] basr=0x10 csr=0xd5 at start=512
> > [ 3977.800647] basr=0x10 csr=0xd5 at start=512
> > [ 3979.240611] scsi host2: 53C400r: host buffer not ready in time
> > [ 3979.320698] basr=0x10 csr=0xd5 at start=512
> > [ 3980.040220] scsi host2: 53C400r: host buffer not ready in time
> > [ 3980.096401] basr=0x10 csr=0xd5 at start=512
> > [ 3980.394854] scsi host2: 53C400r: host buffer not ready in time
>
> The register values look normal (?)
>
> Anyway, there are only a few material differences between your code and
> this patch series.
>
> 1) Your code does not break the transfer loop for any Gated IRQ, but only
> for a phase error IRQ. My version responds to any Gated IRQ, as per the
> algorithm in the datasheet. Your code seems to assume that the 53c80
> registers are accessible whenever gated IRQ is set, which seems
> unlikely.
BASR seems to be always readable. Nothing in the datasheet about that but I
needed it to determine the IRQ type and it works.
> 2) My code fails the transfer when the 53c80 registers don't become
> accessible, which is why you see a fallback to PIO (slow handshake).
>
> 3) We use different timeouts.
>
> 4) In my code, the race condition applies only after the timeout has
> already expired, so it's inconsequential.
>
> Please apply the patch below (on top of this series) so we can perhaps
> narrow this down a bit.
It breaks immediately bacause 0 means 256 blocks:
[ 658.359022] sd 2:0:1:0: [sdb] 479350 512-byte logical blocks: (245 MB/234 MiB)
[ 658.367683] sd 2:0:1:0: [sdb] Write Protect is off
[ 658.371668] sd 2:0:1:0: [sdb] Mode Sense: 8b 00 00 08
[ 658.377314] sd 2:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 658.444177] sdb: sdb1
[ 658.478658] sd 2:0:1:0: [sdb] Attached SCSI disk
[ 659.382887] sd 2:0:1:0: [sdb] tag#0 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.420862] sd 2:0:1:0: [sdb] tag#0 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.513015] sd 2:0:1:0: [sdb] tag#0 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.550943] sd 2:0:1:0: [sdb] tag#0 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.588817] sd 2:0:1:0: [sdb] tag#0 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.627071] sd 2:0:1:0: [sdb] tag#0 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.664872] sd 2:0:1:0: [sdb] tag#0 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.702776] sd 2:0:1:0: [sdb] tag#0 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.758947] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.796598] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.834615] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.872358] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.910164] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.947651] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 659.985239] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 660.022676] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 660.060230] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 660.098229] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 660.135808] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 660.173513] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 660.210801] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 660.248064] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
[ 660.285459] sd 2:0:1:0: [sdb] tag#1 generic_NCR5380_pread: End of DMA timeout (0)
...
I was able to mount the filesystem, haven't tried anything else.
--
Ondrej Zary
On Tue, 27 Jun 2017, Ondrej Zary wrote:
> On Tuesday 27 June 2017 03:49:16 Finn Thain wrote:
> >
> > ... As long as there's no gated IRQ, we poll for buffer readiness
> > until timeout. And when there is a gated IRQ, we break both the
> > polling loop and the transfer loop immediately. Your code and mine are
> > basically in agreement here.
>
> Yes, it stops transfer when an IRQ arrives. But the host buffer could be
> ready at the same time. The IRQ can be an "end-of-DMA" IRQ (IIRC DTC
> chips assert this earlier than 53C400). Or just a disconnect that
> occured right now but the chip already read a buffer full of data.
>
The IRQ should not normally arise during the loop. A BASR_END_DMA_TRANSFER
interrupt could only happen after the loop has finished sending/receiving,
which is when /EOP becomes active.
The BASR_PHASE_MATCH interrupt could happen during the transfer if the
target disconnects suddenly.
It is possible that the 53c400 core would assert /EOP upon
BASR_PHASE_MATCH interrupt, which could then cause the 53c80 to raise a
BASR_END_DMA_TRANSFER interrupt too. But who knows?
> > > According to my tests, buffer ready signal is most important - if
> > > there is any data to read/write, do the transfer. If not, only then
> > > check why - maybe we got an IRQ (that terminated PDMA). Or no IRQ,
> > > sometimes the wait for buffer ready times out - we need to terminate
> > > PDMA manually then (reset).
> > >
> > > Then 53C80 registers should become ready.
> >
> > You seem to be saying that we should ignore the IRQ signal if the
> > buffers have become ready. Maybe so. Can we try simply resetting the
> > block counter? (I could imagine that the 53c400 core might leave the
> > 53c80 registers inaccessible unless we keep accessing the buffers in
> > the 53c400 core until the transfer is done.)
>
> We can't reset the block counter because 0 means 256 blocks to transfer
> (page 13 in datasheet).
I forgot about that. How awful.
> Yes, the 53C80 registers seem to become available only when the PDMA
> transfer ends by either:
> 1. transferring all blocks (block counter decrementing to zero)
> 2. IRQ
I don't think that Gated IRQ is sufficient to make the 53c80 registers
available again. If it was, you probably wouldn't have seen "switching to
slow handshake" when you tested my earlier patch series.
> 3. reset
>
Maybe we need to do the reset whenever IRQ is detected. I'll put this in
v4. Please try commenting it out, to see what difference that makes.
> > BTW, with regard to your patch, note that this construct is race prone:
> >
> > while (1) { /* monitor IRQ while waiting for host buffer */
> > csr = NCR5380_read(hostdata->c400_ctl_status);
> > if (!(csr & CSR_HOST_BUF_NOT_RDY))
> > break;
> > if (csr & CSR_GATED_53C80_IRQ) {
> > basr = NCR5380_read(BUS_AND_STATUS_REG);
> > if (!(basr & BASR_PHASE_MATCH) ||
> > (basr & BASR_BUSY_ERROR)) {
> > printk("basr=0x%02x csr=0x%02x at start=%d\n", basr, csr, start);
> > goto out_wait;
> > }
> > }
> > if (retries-- < 1) {
> > shost_printk(KERN_ERR, hostdata->host, "53C400r: host buffer not ready in
> > time\n"); NCR5380_write(hostdata->c400_ctl_status, CSR_RESET);
> > NCR5380_write(hostdata->c400_ctl_status, CSR_BASE);
> > goto out_wait;
> > }
> > }
> >
> > This code can "goto out_wait" when !(csr & CSR_HOST_BUF_NOT_RDY). It
> > depends on timing. This would seem to be contrary to your stated aim.
> >
> > Moreover, this code can also "break" when (csr & CSR_GATED_53C80_IRQ).
> > That depends on timing too. But this may be an improvement on my code
> > if it allows the 53c80 registers to become accessible, by allowing the
> > block counter to be decremented.
>
> Yes, it continue the transfer even if the IRQ is asserted - as long as
> the buffer is ready. That's intended.
>
If we continue to try to send when there is a phase mismatch (i.e. sudden
disconnection) we'll probably end up with a buffer ready timeout. And we
may also have trouble calculating the residual correctly.
Hence my version of your patch always breaks out of the transfer loop as
soon as any Gated IRQ is detected. If that then means a compulsory reset
of the 53c400 core, I guess I can live with that.
> > The uncertainty here was one of the reasons I reworked this code.
>
> My version reads CSR only once per loop but that probably does not help
> at all as the HW state could change anytime. The chip's design seems to
> be very race-prone.
>
> > > This is a log from writing 230 MB file using my code with some debug
> > > prints, on a NCR53C400. No 53C80 timeouts, only disconnects and some
> > > host buffer timeouts (maybe the drive sometimes just slows down
> > > without disconnecting?)
> > >
> > > [ 3378.503828] basr=0x10 csr=0xd5 at start=512
> > > [ 3461.257973] w basr=0x10 csr=0x95 at start=3840
> > > [ 3461.838225] w basr=0x10 csr=0x95 at start=3840
> > > [ 3462.683446] w basr=0x10 csr=0x95 at start=3840
> > > [ 3463.416911] w basr=0x10 csr=0x95 at start=3840
> > > [ 3465.117804] scsi host2: 53C400w: host buffer not ready in time
> > > [ 3465.276375] w basr=0x10 csr=0x95 at start=3328
> > > [ 3466.457701] w basr=0x10 csr=0x95 at start=1792
> > > [ 3467.019643] scsi host2: 53C400w: host buffer not ready in time
> > > [ 3467.619058] scsi host2: 53C400w: host buffer not ready in time
> > > [ 3467.799619] w basr=0x10 csr=0x95 at start=3840
> > > [ 3552.123501] w basr=0x10 csr=0x95 at start=2304
> > > [ 3552.771223] w basr=0x10 csr=0x95 at start=1280
> > > [ 3554.556451] w basr=0x10 csr=0x95 at start=2816
> > > [ 3555.229646] w basr=0x10 csr=0x95 at start=1792
> > > [ 3555.630632] scsi host2: 53C400w: host buffer not ready in time
> > > [ 3555.774560] w basr=0x10 csr=0x95 at start=768
> > > [ 3625.541608] w basr=0x10 csr=0x95 at start=3328
> > > [ 3640.099861] w basr=0x10 csr=0x95 at start=1792
> > > [ 3641.442671] w basr=0x10 csr=0x95 at start=2816
> > > [ 3641.865469] w basr=0x10 csr=0x95 at start=768
> > > [ 3642.939223] w basr=0x10 csr=0x95 at start=1280
> > > [ 3643.356858] w basr=0x10 csr=0x95 at start=3328
> > > [ 3643.701636] w basr=0x10 csr=0x95 at start=3840
> > > [ 3645.153405] w basr=0x10 csr=0x95 at start=2304
> > > [ 3646.135642] w basr=0x10 csr=0x95 at start=1280
> > > [ 3647.007321] w basr=0x10 csr=0x95 at start=2816
> > > [ 3648.065874] w basr=0x10 csr=0x95 at start=3328
> > > [ 3650.071961] w basr=0x10 csr=0x95 at start=1280
> > > [ 3650.827630] w basr=0x10 csr=0x95 at start=1792
> > > [ 3651.827011] w basr=0x10 csr=0x95 at start=2816
> > > [ 3652.559984] w basr=0x10 csr=0x95 at start=2816
> > > [ 3653.203566] w basr=0x10 csr=0x95 at start=3328
> > > [ 3653.594376] w basr=0x10 csr=0x95 at start=1280
> > > [ 3653.903437] w basr=0x10 csr=0x95 at start=3840
> > > [ 3654.305753] w basr=0x10 csr=0x95 at start=1792
> > > [ 3654.676009] w basr=0x10 csr=0x95 at start=2304
> > > [ 3655.367686] w basr=0x10 csr=0x95 at start=2816
> > > [ 3655.733854] w basr=0x10 csr=0x95 at start=768
> > > [ 3656.075023] w basr=0x10 csr=0x95 at start=3328
> > > [ 3656.493046] w basr=0x10 csr=0x95 at start=2816
> > > [ 3657.208089] w basr=0x10 csr=0x95 at start=1280
> > > [ 3657.537223] w basr=0x10 csr=0x95 at start=1280
> > >
> > > And this is from reading the file back:
> > > [ 3799.053067] basr=0x10 csr=0xd5 at start=512
> > > [ 3801.056337] basr=0x10 csr=0xd5 at start=3584
> > > [ 3976.323836] scsi host2: 53C400r: host buffer not ready in time
> > > [ 3976.404699] basr=0x10 csr=0xd5 at start=512
> > > [ 3977.800647] basr=0x10 csr=0xd5 at start=512
> > > [ 3979.240611] scsi host2: 53C400r: host buffer not ready in time
> > > [ 3979.320698] basr=0x10 csr=0xd5 at start=512
> > > [ 3980.040220] scsi host2: 53C400r: host buffer not ready in time
> > > [ 3980.096401] basr=0x10 csr=0xd5 at start=512
> > > [ 3980.394854] scsi host2: 53C400r: host buffer not ready in time
> >
> > The register values look normal (?)
> >
> > Anyway, there are only a few material differences between your code
> > and this patch series.
> >
> > 1) Your code does not break the transfer loop for any Gated IRQ, but
> > only for a phase error IRQ. My version responds to any Gated IRQ,
> > as per the algorithm in the datasheet. Your code seems to assume
> > that the 53c80 registers are accessible whenever gated IRQ is set,
> > which seems unlikely.
>
> BASR seems to be always readable. Nothing in the datasheet about that
> but I needed it to determine the IRQ type and it works.
>
So, the BASR_PHASE_MATCH bit appears to be set... but that doesn't mean
that the register is actually accessable and valid.
--
On Tue, 27 Jun 2017, Ondrej Zary wrote:
> On Tuesday 27 June 2017 14:42:29 Finn Thain wrote:
>
> > > ... it triggers sometimes: the value is 1 instead of 0. As we use
> > > only 16-bit writes, I don't see how the value could ever be odd.
> > > Looks like a bug in the chip. The index register corrupts during the
> > > transfer, not after IRQ or timeout. The same check at beginning of
> > > pwrite() did not trigger.
> >
> > Are you reading this register at the right moment? Have you tried
> > waiting for it to reach zero, as in,
> >
> > if (NCR5380_poll_politely(hostdata, 13, 0xff, 0, HZ / 64) < 0)
> > /* printk, reset etc */;
>
> I have not but will try (expecting that it will not change by itself).
>
Now that I know that it is the byte at the beginning of the block that
went missing, I agree that there's no point waiting for the byte count to
change.
I've included a patch with your 512 B limit in v4.
Thanks.
> > Even if this is a reliable way to detect a short transfer, it would be
> > nice to know the root cause. But I'm being unrealistic: the DTC436
> > vendor never responded to my requests for technical documentation.
>
> According to the data corruption observed, it's not a short transfer.
> The corruption is always the same: one byte missing at the beginning of
> a 128 B block. It happens only with slow Quantum LPS 240 drive, not with
> faster IBM DORS-32160.
>
--