2006-10-24 21:54:45

by linas

[permalink] [raw]
Subject: [PATCH] s2io: add PCI error recovery support


Koushik, Raju,
Please review, comment, and if you find this acceptable,
please forward upstream.

--linas

This patch adds PCI error recovery support to the
s2io 10-Gigabit ethernet device driver.

Tested, seems to work well.

Signed-off-by: Linas Vepstas <[email protected]>
Cc: Raghavendra Koushik <[email protected]>
Cc: Ananda Raju <[email protected]>
Cc: Wen Xiong <[email protected]>

----
drivers/net/s2io.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/net/s2io.h | 5 +++
2 files changed, 82 insertions(+)

Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c
===================================================================
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-20 12:24:17.000000000 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-24 16:19:49.000000000 -0500
@@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _

MODULE_DEVICE_TABLE(pci, s2io_tbl);

+static struct pci_error_handlers s2io_err_handler = {
+ .error_detected = s2io_io_error_detected,
+ .slot_reset = s2io_io_slot_reset,
+ .resume = s2io_io_resume,
+};
+
static struct pci_driver s2io_driver = {
.name = "S2IO",
.id_table = s2io_tbl,
.probe = s2io_init_nic,
.remove = __devexit_p(s2io_rem_nic),
+ .err_handler = &s2io_err_handler,
};

/* A simplifier macro used both by init and free shared_mem Fns(). */
@@ -7564,3 +7571,73 @@ static void lro_append_pkt(nic_t *sp, lr
sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++;
return;
}
+
+/**
+ * s2io_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci conneection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ netif_device_detach(netdev);
+
+ if (netif_running(netdev)) {
+ /* Reset card */
+ s2io_card_down(sp);
+ sp->device_close_flag = TRUE; /* Device is shut down. */
+ }
+ pci_disable_device(pdev);
+
+ return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * s2io_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ if (pci_enable_device(pdev)) {
+ printk(KERN_ERR "s2io: Cannot re-enable PCI device after reset.\n");
+ return PCI_ERS_RESULT_DISCONNECT;
+ }
+
+ pci_set_master(pdev);
+ s2io_reset(sp);
+
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * s2io_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * its OK to resume normal operation.
+ */
+static void s2io_io_resume(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ if (netif_running(netdev)) {
+ if (s2io_card_up(sp)) {
+ printk(KERN_ERR "s2io: can't bring device back up after reset\n");
+ return;
+ }
+ }
+
+ netif_device_attach(netdev);
+ netif_wake_queue(netdev);
+}
Index: linux-2.6.19-rc1-git11/drivers/net/s2io.h
===================================================================
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.h 2006-10-20 12:24:17.000000000 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.h 2006-10-20 12:41:39.000000000 -0500
@@ -1013,6 +1013,11 @@ static void queue_rx_frame(struct sk_buf
static void update_L3L4_header(nic_t *sp, lro_t *lro);
static void lro_append_pkt(nic_t *sp, lro_t *lro, struct sk_buff *skb, u32 tcp_len);

+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state);
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev);
+static void s2io_io_resume(struct pci_dev *pdev);
+
#define s2io_tcp_mss(skb) skb_shinfo(skb)->gso_size
#define s2io_udp_mss(skb) skb_shinfo(skb)->gso_size
#define s2io_offload_type(skb) skb_shinfo(skb)->gso_type


2006-10-25 06:33:00

by Ananda Raju

[permalink] [raw]
Subject: RE: [PATCH] s2io: add PCI error recovery support

Hi,

s2io_card_down() will do few BAR0 read/write. As per
pci-error-recovery.txt Documentation we are not supposed to do any new
IO in error_detected().

Can you try using

atomic_set(&sp->card_state, CARD_DOWN);

instead of s2io_card_down().

Also we have to add following if statement in beginning of s2io_isr().

if (atomic_read(&nic->card_state) == CARD_DOWN)
return IRQ_NOTHANDLED.

If it is ok to do BAR0 read/write in error_detected() then patch is OK.

Ananda
-----Original Message-----
From: Linas Vepstas [mailto:[email protected]]
Sent: Tuesday, October 24, 2006 2:55 PM
To: Raghavendra Koushik; Ananda Raju; Wen Xiong
Cc: [email protected]; [email protected];
[email protected]; Jeff Garzik; Andrew Morton
Subject: [PATCH] s2io: add PCI error recovery support


Koushik, Raju,
Please review, comment, and if you find this acceptable,
please forward upstream.

--linas

This patch adds PCI error recovery support to the
s2io 10-Gigabit ethernet device driver.

Tested, seems to work well.

Signed-off-by: Linas Vepstas <[email protected]>
Cc: Raghavendra Koushik <[email protected]>
Cc: Ananda Raju <[email protected]>
Cc: Wen Xiong <[email protected]>

----
drivers/net/s2io.c | 77
+++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/net/s2io.h | 5 +++
2 files changed, 82 insertions(+)

Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c
===================================================================
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-20
12:24:17.000000000 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-24
16:19:49.000000000 -0500
@@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _

MODULE_DEVICE_TABLE(pci, s2io_tbl);

+static struct pci_error_handlers s2io_err_handler = {
+ .error_detected = s2io_io_error_detected,
+ .slot_reset = s2io_io_slot_reset,
+ .resume = s2io_io_resume,
+};
+
static struct pci_driver s2io_driver = {
.name = "S2IO",
.id_table = s2io_tbl,
.probe = s2io_init_nic,
.remove = __devexit_p(s2io_rem_nic),
+ .err_handler = &s2io_err_handler,
};

/* A simplifier macro used both by init and free shared_mem Fns(). */
@@ -7564,3 +7571,73 @@ static void lro_append_pkt(nic_t *sp, lr
sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++;
return;
}
+
+/**
+ * s2io_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci conneection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ netif_device_detach(netdev);
+
+ if (netif_running(netdev)) {
+ /* Reset card */
+ s2io_card_down(sp);
+ sp->device_close_flag = TRUE; /* Device is shut down.
*/
+ }
+ pci_disable_device(pdev);
+
+ return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * s2io_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ if (pci_enable_device(pdev)) {
+ printk(KERN_ERR "s2io: Cannot re-enable PCI device after
reset.\n");
+ return PCI_ERS_RESULT_DISCONNECT;
+ }
+
+ pci_set_master(pdev);
+ s2io_reset(sp);
+
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * s2io_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * its OK to resume normal operation.
+ */
+static void s2io_io_resume(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ if (netif_running(netdev)) {
+ if (s2io_card_up(sp)) {
+ printk(KERN_ERR "s2io: can't bring device back
up after reset\n");
+ return;
+ }
+ }
+
+ netif_device_attach(netdev);
+ netif_wake_queue(netdev);
+}
Index: linux-2.6.19-rc1-git11/drivers/net/s2io.h
===================================================================
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.h 2006-10-20
12:24:17.000000000 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.h 2006-10-20
12:41:39.000000000 -0500
@@ -1013,6 +1013,11 @@ static void queue_rx_frame(struct sk_buf
static void update_L3L4_header(nic_t *sp, lro_t *lro);
static void lro_append_pkt(nic_t *sp, lro_t *lro, struct sk_buff *skb,
u32 tcp_len);

+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
+
pci_channel_state_t state);
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev);
+static void s2io_io_resume(struct pci_dev *pdev);
+
#define s2io_tcp_mss(skb) skb_shinfo(skb)->gso_size
#define s2io_udp_mss(skb) skb_shinfo(skb)->gso_size
#define s2io_offload_type(skb) skb_shinfo(skb)->gso_type

2006-10-25 15:11:28

by linas

[permalink] [raw]
Subject: Re: [PATCH] s2io: add PCI error recovery support

On Wed, Oct 25, 2006 at 02:29:33AM -0400, Ananda Raju wrote:
> Hi,
>
> s2io_card_down() will do few BAR0 read/write. As per
> pci-error-recovery.txt Documentation we are not supposed to do any new
> IO in error_detected().

Hmm, actually, its harmless to do further i/o. The s2io driver barks
(as it should) because the result of reads is always 0xffffffff.

> Can you try using
>
> atomic_set(&sp->card_state, CARD_DOWN);
>
> instead of s2io_card_down().

I will try that. I was mostly concerned that s2io_card_down()
also may do some other "important" things with regards to the driver
state, things which might be needed to keep the down-up sequence
symmetrical. I wasn't sure, so I took the conservative route.

> Also we have to add following if statement in beginning of s2io_isr().
>
> if (atomic_read(&nic->card_state) == CARD_DOWN)
> return IRQ_NOTHANDLED.

Right! Will revise this patch shortly.

> If it is ok to do BAR0 read/write in error_detected() then patch is OK.

Its OK on pSeries, and I beleive it will be OK on PCIE, but I do
not quite know enough about actual PCIE chipsets.

--linas

2006-10-25 20:56:00

by linas

[permalink] [raw]
Subject: Re: [PATCH] s2io: add PCI error recovery support

On Wed, Oct 25, 2006 at 10:11:24AM -0500, Linas Vepstas wrote:
>
> > Also we have to add following if statement in beginning of s2io_isr().

Done, below,

> > If it is ok to do BAR0 read/write in error_detected() then patch is OK.

I re-wrote that section to avoid doing I/O. It seems to work well,
and generates a few less messages in the process. New, improved patch
below, please ack and send upstream if you like it.

--linas

This patch adds PCI error recovery support to the
s2io 10-Gigabit ethernet device driver.

Tested, seems to work well.

Signed-off-by: Linas Vepstas <[email protected]>
Cc: Raghavendra Koushik <[email protected]>
Cc: Ananda Raju <[email protected]>
Cc: Wen Xiong <[email protected]>

----
drivers/net/s2io.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/net/s2io.h | 5 ++
2 files changed, 108 insertions(+)

Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c
===================================================================
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-25 14:09:47.000000000 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-25 15:18:25.000000000 -0500
@@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _

MODULE_DEVICE_TABLE(pci, s2io_tbl);

+static struct pci_error_handlers s2io_err_handler = {
+ .error_detected = s2io_io_error_detected,
+ .slot_reset = s2io_io_slot_reset,
+ .resume = s2io_io_resume,
+};
+
static struct pci_driver s2io_driver = {
.name = "S2IO",
.id_table = s2io_tbl,
.probe = s2io_init_nic,
.remove = __devexit_p(s2io_rem_nic),
+ .err_handler = &s2io_err_handler,
};

/* A simplifier macro used both by init and free shared_mem Fns(). */
@@ -4171,6 +4178,11 @@ static irqreturn_t s2io_isr(int irq, voi
mac_info_t *mac_control;
struct config_param *config;

+ if ((sp->pdev->error_state != pci_channel_io_normal) &&
+ (sp->pdev->error_state != 0)) {
+ return IRQ_HANDLED;
+ }
+
atomic_inc(&sp->isr_cnt);
mac_control = &sp->mac_control;
config = &sp->config;
@@ -7564,3 +7576,94 @@ static void lro_append_pkt(nic_t *sp, lr
sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++;
return;
}
+
+/**
+ * s2io_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci conneection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, pci_channel_state_t state)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ netif_device_detach(netdev);
+
+ if (netif_running(netdev)) {
+ unsigned long flags;
+
+ /* The folowing is an abreviated subset of the
+ * steps taken by s2io_card_down(), avoiding
+ * steps that touch the card itself.
+ */
+ del_timer_sync(&sp->alarm_timer);
+ atomic_set(&sp->card_state, CARD_DOWN);
+
+ /* Kill tasklet. */
+ tasklet_kill(&sp->task);
+
+ /* Free all Tx buffers */
+ spin_lock_irqsave(&sp->tx_lock, flags);
+ free_tx_buffers(sp);
+ spin_unlock_irqrestore(&sp->tx_lock, flags);
+
+ /* Free all Rx buffers */
+ spin_lock_irqsave(&sp->rx_lock, flags);
+ free_rx_buffers(sp);
+ spin_unlock_irqrestore(&sp->rx_lock, flags);
+
+ clear_bit(0, &(sp->link_state));
+ sp->device_close_flag = TRUE; /* Device is shut down. */
+ }
+ pci_disable_device(pdev);
+
+ return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * s2io_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ if (pci_enable_device(pdev)) {
+ printk(KERN_ERR "s2io: Cannot re-enable PCI device after reset.\n");
+ return PCI_ERS_RESULT_DISCONNECT;
+ }
+
+ pci_set_master(pdev);
+ s2io_reset(sp);
+
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * s2io_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * its OK to resume normal operation.
+ */
+static void s2io_io_resume(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ if (netif_running(netdev)) {
+ if (s2io_card_up(sp)) {
+ printk(KERN_ERR "s2io: can't bring device back up after reset\n");
+ return;
+ }
+ }
+
+ netif_device_attach(netdev);
+ netif_wake_queue(netdev);
+}
Index: linux-2.6.19-rc1-git11/drivers/net/s2io.h
===================================================================
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.h 2006-10-25 14:09:47.000000000 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.h 2006-10-25 14:11:05.000000000 -0500
@@ -1013,6 +1013,11 @@ static void queue_rx_frame(struct sk_buf
static void update_L3L4_header(nic_t *sp, lro_t *lro);
static void lro_append_pkt(nic_t *sp, lro_t *lro, struct sk_buff *skb, u32 tcp_len);

+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state);
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev);
+static void s2io_io_resume(struct pci_dev *pdev);
+
#define s2io_tcp_mss(skb) skb_shinfo(skb)->gso_size
#define s2io_udp_mss(skb) skb_shinfo(skb)->gso_size
#define s2io_offload_type(skb) skb_shinfo(skb)->gso_type

2006-10-26 09:58:10

by Ananda Raju

[permalink] [raw]
Subject: RE: [PATCH] s2io: add PCI error recovery support

Hi,
Can you try attached patch. The attached patch is simple. We set card
state as down in error_detecct() so that all entry points return error
and don't proceed further.

In slot_reset() we do s2io_card_down() will reset adapter.
In io_resume() we bringup the driver.

Ananda

-----Original Message-----
From: Linas Vepstas [mailto:[email protected]]
Sent: Wednesday, October 25, 2006 1:55 PM
To: Ananda Raju
Cc: Wen Xiong; [email protected];
[email protected]; [email protected]; Jeff Garzik;
Andrew Morton
Subject: Re: [PATCH] s2io: add PCI error recovery support

On Wed, Oct 25, 2006 at 10:11:24AM -0500, Linas Vepstas wrote:
>
> > Also we have to add following if statement in beginning of
s2io_isr().

Done, below,

> > If it is ok to do BAR0 read/write in error_detected() then patch is
OK.

I re-wrote that section to avoid doing I/O. It seems to work well,
and generates a few less messages in the process. New, improved patch
below, please ack and send upstream if you like it.

--linas

This patch adds PCI error recovery support to the
s2io 10-Gigabit ethernet device driver.

Tested, seems to work well.

Signed-off-by: Linas Vepstas <[email protected]>
Cc: Raghavendra Koushik <[email protected]>
Cc: Ananda Raju <[email protected]>
Cc: Wen Xiong <[email protected]>

----
drivers/net/s2io.c | 103
+++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/net/s2io.h | 5 ++
2 files changed, 108 insertions(+)

Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c
===================================================================
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-25
14:09:47.000000000 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-25
15:18:25.000000000 -0500
@@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _

MODULE_DEVICE_TABLE(pci, s2io_tbl);

+static struct pci_error_handlers s2io_err_handler = {
+ .error_detected = s2io_io_error_detected,
+ .slot_reset = s2io_io_slot_reset,
+ .resume = s2io_io_resume,
+};
+
static struct pci_driver s2io_driver = {
.name = "S2IO",
.id_table = s2io_tbl,
.probe = s2io_init_nic,
.remove = __devexit_p(s2io_rem_nic),
+ .err_handler = &s2io_err_handler,
};

/* A simplifier macro used both by init and free shared_mem Fns(). */
@@ -4171,6 +4178,11 @@ static irqreturn_t s2io_isr(int irq, voi
mac_info_t *mac_control;
struct config_param *config;

+ if ((sp->pdev->error_state != pci_channel_io_normal) &&
+ (sp->pdev->error_state != 0)) {
+ return IRQ_HANDLED;
+ }
+
atomic_inc(&sp->isr_cnt);
mac_control = &sp->mac_control;
config = &sp->config;
@@ -7564,3 +7576,94 @@ static void lro_append_pkt(nic_t *sp, lr
sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++;
return;
}
+
+/**
+ * s2io_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci conneection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ netif_device_detach(netdev);
+
+ if (netif_running(netdev)) {
+ unsigned long flags;
+
+ /* The folowing is an abreviated subset of the
+ * steps taken by s2io_card_down(), avoiding
+ * steps that touch the card itself.
+ */
+ del_timer_sync(&sp->alarm_timer);
+ atomic_set(&sp->card_state, CARD_DOWN);
+
+ /* Kill tasklet. */
+ tasklet_kill(&sp->task);
+
+ /* Free all Tx buffers */
+ spin_lock_irqsave(&sp->tx_lock, flags);
+ free_tx_buffers(sp);
+ spin_unlock_irqrestore(&sp->tx_lock, flags);
+
+ /* Free all Rx buffers */
+ spin_lock_irqsave(&sp->rx_lock, flags);
+ free_rx_buffers(sp);
+ spin_unlock_irqrestore(&sp->rx_lock, flags);
+
+ clear_bit(0, &(sp->link_state));
+ sp->device_close_flag = TRUE; /* Device is shut down.
*/
+ }
+ pci_disable_device(pdev);
+
+ return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * s2io_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ if (pci_enable_device(pdev)) {
+ printk(KERN_ERR "s2io: Cannot re-enable PCI device after
reset.\n");
+ return PCI_ERS_RESULT_DISCONNECT;
+ }
+
+ pci_set_master(pdev);
+ s2io_reset(sp);
+
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * s2io_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * its OK to resume normal operation.
+ */
+static void s2io_io_resume(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ if (netif_running(netdev)) {
+ if (s2io_card_up(sp)) {
+ printk(KERN_ERR "s2io: can't bring device back
up after reset\n");
+ return;
+ }
+ }
+
+ netif_device_attach(netdev);
+ netif_wake_queue(netdev);
+}
Index: linux-2.6.19-rc1-git11/drivers/net/s2io.h
===================================================================
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.h 2006-10-25
14:09:47.000000000 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.h 2006-10-25
14:11:05.000000000 -0500
@@ -1013,6 +1013,11 @@ static void queue_rx_frame(struct sk_buf
static void update_L3L4_header(nic_t *sp, lro_t *lro);
static void lro_append_pkt(nic_t *sp, lro_t *lro, struct sk_buff *skb,
u32 tcp_len);

+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t
state);
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev);
+static void s2io_io_resume(struct pci_dev *pdev);
+
#define s2io_tcp_mss(skb) skb_shinfo(skb)->gso_size
#define s2io_udp_mss(skb) skb_shinfo(skb)->gso_size
#define s2io_offload_type(skb) skb_shinfo(skb)->gso_type


Attachments:
pci_err_rec.patch (4.33 kB)
pci_err_rec.patch

2006-10-26 22:51:59

by linas

[permalink] [raw]
Subject: Re: [PATCH] s2io: add PCI error recovery support

Hi.

On Thu, Oct 26, 2006 at 05:56:34AM -0400, Ananda Raju wrote:
> Hi,
> Can you try attached patch. The attached patch is simple. We set card
> state as down in error_detecct() so that all entry points return error
> and don't proceed further.
>
> In slot_reset() we do s2io_card_down() will reset adapter.
> In io_resume() we bringup the driver.

Simplicity is always better. However, some questions/comments:

> @@ -4175,6 +4186,10 @@ static irqreturn_t s2io_isr(int irq, voi
> mac_info_t *mac_control;
> struct config_param *config;
>
> + if (atomic_read(&sp->card_state) == CARD_DOWN) {
> + return IRQ_NONE;
> + }

I used

if ((sp->pdev->error_state != pci_channel_io_normal)

here for a reason: the pdev->error_state is set even in an interrupt
context, that is, it gets set even if interrups are disabled, and
so it represents the actual state immediately. By contrast, the
error callbacks do not get called until possibly much later,
and so sp->card_state = CARD_DOWN might not get set for a while.

If, for any reason, e.g. some obscure corner case, the s2io
generates zillions of interupts, this could result in a soft-lockup.
I actually saw this in the symbios device driver, which will
regenerate an interrupt until its acknowledged -- and so it
sat there, spinning. :-(

I was returning IRQ_HANDLED instead of IRQ_NONE, so as to avoid
falling into handle_bad_irq() or report_bad_irq(). I haven't
seen this happen on s2io, but thought it would still be wise.

If this can't happen, then there's no problem here.

> +/**
> + * s2io_io_slot_reset - called after the pci bus has been reset.
> + * @pdev: Pointer to PCI device
> + *
> + * Restart the card from scratch, as if from a cold-boot.
> + */
> +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
> +{

At this point, the card has just experienced a hardware reset,
(the #RST wire was held low for 250 millisecs, followed by
a settle time of 2 seconds, followed by whatever BIOS thinks
it needed to do, followed by a restore of the pci config space
to what it was after a cold boot. So the card is in a "fresh"
state; in theory its identitcal to a cold boot. So ...
are you sure you want to "down" at this point?

> + s2io_card_down(sp);
> + sp->device_close_flag = TRUE; /* Device is shut down. */


One problem I'm having is that the watchdog timer sometimes
pops and tries to reset the card before s2io_card_down()
has a chance to run. I fixed this ...

======================
So -- just for grins, I thought to myself, "Maybe I can make
s2io be the first adapter ever to fully recover without
a hard reset of the card."

The idea is simple:

1) enable MMIO,
2) call s2io_card_down()
3) enable DMA
4) cal s2io_card_up()

I have a patch that does this, but then hit a few more snags.
I haven't yet nailed down all the trouble spots, maybe tommorrow.

--linas

2006-10-27 11:38:57

by Ananda Raju

[permalink] [raw]
Subject: RE: [PATCH] s2io: add PCI error recovery support

Looking at all scenarios I feel the first patch is OK. Can you add the
watchdog timer fix to first initial patch and resubmit.

-----Original Message-----
From: Linas Vepstas [mailto:[email protected]]
Sent: Thursday, October 26, 2006 3:52 PM
To: Ananda Raju
Cc: Wen Xiong; [email protected];
[email protected]; [email protected]; Jeff Garzik;
Andrew Morton
Subject: Re: [PATCH] s2io: add PCI error recovery support

Hi.

On Thu, Oct 26, 2006 at 05:56:34AM -0400, Ananda Raju wrote:
> Hi,
> Can you try attached patch. The attached patch is simple. We set card
> state as down in error_detecct() so that all entry points return error
> and don't proceed further.
>
> In slot_reset() we do s2io_card_down() will reset adapter.
> In io_resume() we bringup the driver.

Simplicity is always better. However, some questions/comments:

> @@ -4175,6 +4186,10 @@ static irqreturn_t s2io_isr(int irq, voi
> mac_info_t *mac_control;
> struct config_param *config;
>
> + if (atomic_read(&sp->card_state) == CARD_DOWN) {
> + return IRQ_NONE;
> + }

I used

if ((sp->pdev->error_state != pci_channel_io_normal)

here for a reason: the pdev->error_state is set even in an interrupt
context, that is, it gets set even if interrups are disabled, and
so it represents the actual state immediately. By contrast, the
error callbacks do not get called until possibly much later,
and so sp->card_state = CARD_DOWN might not get set for a while.

If, for any reason, e.g. some obscure corner case, the s2io
generates zillions of interupts, this could result in a soft-lockup.
I actually saw this in the symbios device driver, which will
regenerate an interrupt until its acknowledged -- and so it
sat there, spinning. :-(

I was returning IRQ_HANDLED instead of IRQ_NONE, so as to avoid
falling into handle_bad_irq() or report_bad_irq(). I haven't
seen this happen on s2io, but thought it would still be wise.

If this can't happen, then there's no problem here.

> +/**
> + * s2io_io_slot_reset - called after the pci bus has been reset.
> + * @pdev: Pointer to PCI device
> + *
> + * Restart the card from scratch, as if from a cold-boot.
> + */
> +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
> +{

At this point, the card has just experienced a hardware reset,
(the #RST wire was held low for 250 millisecs, followed by
a settle time of 2 seconds, followed by whatever BIOS thinks
it needed to do, followed by a restore of the pci config space
to what it was after a cold boot. So the card is in a "fresh"
state; in theory its identitcal to a cold boot. So ...
are you sure you want to "down" at this point?

> + s2io_card_down(sp);
> + sp->device_close_flag = TRUE; /* Device is shut down.
*/


One problem I'm having is that the watchdog timer sometimes
pops and tries to reset the card before s2io_card_down()
has a chance to run. I fixed this ...

======================
So -- just for grins, I thought to myself, "Maybe I can make
s2io be the first adapter ever to fully recover without
a hard reset of the card."

The idea is simple:

1) enable MMIO,
2) call s2io_card_down()
3) enable DMA
4) cal s2io_card_up()

I have a patch that does this, but then hit a few more snags.
I haven't yet nailed down all the trouble spots, maybe tommorrow.

--linas


2006-10-27 19:33:24

by linas

[permalink] [raw]
Subject: Re: [PATCH] s2io: add PCI error recovery support

On Fri, Oct 27, 2006 at 07:35:18AM -0400, Ananda Raju wrote:
> Looking at all scenarios I feel the first patch is OK. Can you add the
> watchdog timer fix to first initial patch and resubmit.

Appended below.

> So -- just for grins, I thought to myself, "Maybe I can make
> s2io be the first adapter ever to fully recover without
> a hard reset of the card."

... I couldn't quite make this work. Since the patch below
already works, I didn't see much point exterting myself further.

--linas

This patch adds PCI error recovery support to the
s2io 10-Gigabit ethernet device driver. Third revision,
blocks interrupts and the watchdog.

Tested, seems to work well.

Signed-off-by: Linas Vepstas <[email protected]>
Cc: Raghavendra Koushik <[email protected]>
Cc: Ananda Raju <[email protected]>
Cc: Wen Xiong <[email protected]>

----
drivers/net/s2io.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/net/s2io.h | 5 ++
2 files changed, 126 insertions(+)

Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c
===================================================================
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-27 10:49:07.000000000 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-27 13:55:01.000000000 -0500
@@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _

MODULE_DEVICE_TABLE(pci, s2io_tbl);

+static struct pci_error_handlers s2io_err_handler = {
+ .error_detected = s2io_io_error_detected,
+ .slot_reset = s2io_io_slot_reset,
+ .resume = s2io_io_resume,
+};
+
static struct pci_driver s2io_driver = {
.name = "S2IO",
.id_table = s2io_tbl,
.probe = s2io_init_nic,
.remove = __devexit_p(s2io_rem_nic),
+ .err_handler = &s2io_err_handler,
};

/* A simplifier macro used both by init and free shared_mem Fns(). */
@@ -3159,6 +3166,11 @@ static void alarm_intr_handler(struct s2
register u64 val64 = 0, err_reg = 0;
u64 cnt;
int i;
+
+ if ((nic->pdev->error_state != pci_channel_io_normal) &&
+ (nic->pdev->error_state != 0))
+ return;
+
nic->mac_control.stats_info->sw_stat.ring_full_cnt = 0;
/* Handling the XPAK counters update */
if(nic->mac_control.stats_info->xpak_stat.xpak_timer_count < 72000) {
@@ -4171,6 +4183,11 @@ static irqreturn_t s2io_isr(int irq, voi
mac_info_t *mac_control;
struct config_param *config;

+ /* Pretend we handled any irq's from a disconnected card */
+ if ((sp->pdev->error_state != pci_channel_io_normal) &&
+ (sp->pdev->error_state != 0))
+ return IRQ_HANDLED;
+
atomic_inc(&sp->isr_cnt);
mac_control = &sp->mac_control;
config = &sp->config;
@@ -7564,3 +7581,107 @@ static void lro_append_pkt(nic_t *sp, lr
sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++;
return;
}
+
+/**
+ * s2io_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci conneection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ netif_device_detach(netdev);
+
+ if (netif_running(netdev)) {
+ unsigned long flags;
+
+ /* The folowing is an abreviated subset of the
+ * steps taken by s2io_card_down(), avoiding
+ * steps that touch the card itself.
+ */
+ del_timer_sync(&sp->alarm_timer);
+ atomic_set(&sp->card_state, CARD_DOWN);
+
+ /* Kill tasklet. */
+ tasklet_kill(&sp->task);
+
+ /* Free all Tx buffers */
+ spin_lock_irqsave(&sp->tx_lock, flags);
+ free_tx_buffers(sp);
+ spin_unlock_irqrestore(&sp->tx_lock, flags);
+
+ /* Free all Rx buffers */
+ spin_lock_irqsave(&sp->rx_lock, flags);
+ free_rx_buffers(sp);
+ spin_unlock_irqrestore(&sp->rx_lock, flags);
+
+ clear_bit(0, &(sp->link_state));
+ sp->device_close_flag = TRUE; /* Device is shut down. */
+ }
+ pci_disable_device(pdev);
+
+ return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * s2io_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ * At this point, the card has exprienced a hard reset,
+ * followed by fixups by BIOS, and has its config space
+ * set up identically to what it was at cold boot.
+ */
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ if (pci_enable_device(pdev)) {
+ printk(KERN_ERR "s2io: "
+ "Cannot re-enable PCI device after reset.\n");
+ return PCI_ERS_RESULT_DISCONNECT;
+ }
+
+ pci_set_master(pdev);
+ s2io_reset(sp);
+
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * s2io_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells
+ * us that its OK to resume normal operation.
+ */
+static void s2io_io_resume(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ nic_t *sp = netdev->priv;
+
+ if (netif_running(netdev)) {
+ if (s2io_card_up(sp)) {
+ printk(KERN_ERR "s2io: "
+ "Can't bring device back up after reset.\n");
+ return;
+ }
+
+ if (s2io_set_mac_addr(netdev, netdev->dev_addr) == FAILURE) {
+ s2io_card_down(sp);
+ printk(KERN_ERR "s2io: "
+ "Can't resetore mac addr after reset.\n");
+ return;
+ }
+ }
+
+ netif_device_attach(netdev);
+ netif_wake_queue(netdev);
+}
Index: linux-2.6.19-rc1-git11/drivers/net/s2io.h
===================================================================
--- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.h 2006-10-27 10:49:07.000000000 -0500
+++ linux-2.6.19-rc1-git11/drivers/net/s2io.h 2006-10-27 10:50:53.000000000 -0500
@@ -1013,6 +1013,11 @@ static void queue_rx_frame(struct sk_buf
static void update_L3L4_header(nic_t *sp, lro_t *lro);
static void lro_append_pkt(nic_t *sp, lro_t *lro, struct sk_buff *skb, u32 tcp_len);

+static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state);
+static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev);
+static void s2io_io_resume(struct pci_dev *pdev);
+
#define s2io_tcp_mss(skb) skb_shinfo(skb)->gso_size
#define s2io_udp_mss(skb) skb_shinfo(skb)->gso_size
#define s2io_offload_type(skb) skb_shinfo(skb)->gso_type

2007-01-10 20:08:14

by Ramkrishna Vepa

[permalink] [raw]
Subject: RE: [PATCH] s2io: add PCI error recovery support

Linas,

We reviewed this patch and it looks good.

Thanks,
Ram
> > -----Original Message-----
> > From: [email protected]
[mailto:[email protected]]
> > On Behalf Of Linas Vepstas
> > Sent: Tuesday, October 24, 2006 2:55 PM
> > To: Raghavendra Koushik; Ananda Raju; Wen Xiong
> > Cc: [email protected];
[email protected];
> > [email protected]; Jeff Garzik; Andrew Morton
> > Subject: [PATCH] s2io: add PCI error recovery support
> >
> >
> > Koushik, Raju,
> > Please review, comment, and if you find this acceptable,
> > please forward upstream.
> >
> > --linas
> >
> > This patch adds PCI error recovery support to the
> > s2io 10-Gigabit ethernet device driver.
> >
> > Tested, seems to work well.
> >
> > Signed-off-by: Linas Vepstas <[email protected]>
> > Cc: Raghavendra Koushik <[email protected]>
> > Cc: Ananda Raju <[email protected]>
> > Cc: Wen Xiong <[email protected]>
> >
> > ----
> > drivers/net/s2io.c | 77
> > +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > drivers/net/s2io.h | 5 +++
> > 2 files changed, 82 insertions(+)
> >
> > Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c
> > ===================================================================
> > --- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-20
> > 12:24:17.000000000 -0500
> > +++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-24
> > 16:19:49.000000000 -0500
> > @@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _
> >
> > MODULE_DEVICE_TABLE(pci, s2io_tbl);
> >
> > +static struct pci_error_handlers s2io_err_handler = {
> > + .error_detected = s2io_io_error_detected,
> > + .slot_reset = s2io_io_slot_reset,
> > + .resume = s2io_io_resume,
> > +};
> > +
> > static struct pci_driver s2io_driver = {
> > .name = "S2IO",
> > .id_table = s2io_tbl,
> > .probe = s2io_init_nic,
> > .remove = __devexit_p(s2io_rem_nic),
> > + .err_handler = &s2io_err_handler,
> > };
> >
> > /* A simplifier macro used both by init and free shared_mem Fns().
*/
> > @@ -7564,3 +7571,73 @@ static void lro_append_pkt(nic_t *sp, lr
> > sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++;
> > return;
> > }
> > +
> > +/**
> > + * s2io_io_error_detected - called when PCI error is detected
> > + * @pdev: Pointer to PCI device
> > + * @state: The current pci conneection state
> > + *
> > + * This function is called after a PCI bus error affecting
> > + * this device has been detected.
> > + */
> > +static pci_ers_result_t s2io_io_error_detected(struct pci_dev
*pdev,
> > pci_channel_state_t state)
> > +{
> > + struct net_device *netdev = pci_get_drvdata(pdev);
> > + nic_t *sp = netdev->priv;
> > +
> > + netif_device_detach(netdev);
> > +
> > + if (netif_running(netdev)) {
> > + /* Reset card */
> > + s2io_card_down(sp);
> > + sp->device_close_flag = TRUE; /* Device is shut down.
*/
> > + }
> > + pci_disable_device(pdev);
> > +
> > + return PCI_ERS_RESULT_NEED_RESET;
> > +}
> > +
> > +/**
> > + * s2io_io_slot_reset - called after the pci bus has been reset.
> > + * @pdev: Pointer to PCI device
> > + *
> > + * Restart the card from scratch, as if from a cold-boot.
> > + */
> > +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev)
> > +{
> > + struct net_device *netdev = pci_get_drvdata(pdev);
> > + nic_t *sp = netdev->priv;
> > +
> > + if (pci_enable_device(pdev)) {
> > + printk(KERN_ERR "s2io: Cannot re-enable PCI device after
> > reset.\n");
> > + return PCI_ERS_RESULT_DISCONNECT;
> > + }
> > +
> > + pci_set_master(pdev);
> > + s2io_reset(sp);
> > +
> > + return PCI_ERS_RESULT_RECOVERED;
> > +}
> > +
> > +/**
> > + * s2io_io_resume - called when traffic can start flowing again.
> > + * @pdev: Pointer to PCI device
> > + *
> > + * This callback is called when the error recovery driver tells us
that
> > + * its OK to resume normal operation.
> > + */
> > +static void s2io_io_resume(struct pci_dev *pdev)
> > +{
> > + struct net_device *netdev = pci_get_drvdata(pdev);
> > + nic_t *sp = netdev->priv;
> > +
> > + if (netif_running(netdev)) {
> > + if (s2io_card_up(sp)) {
> > + printk(KERN_ERR "s2io: can't bring device back
up after
> > reset\n");
> > + return;
> > + }
> > + }
> > +
> > + netif_device_attach(netdev);
> > + netif_wake_queue(netdev);
> > +}
> > Index: linux-2.6.19-rc1-git11/drivers/net/s2io.h
> > ===================================================================
> > --- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.h 2006-10-20
> > 12:24:17.000000000 -0500
> > +++ linux-2.6.19-rc1-git11/drivers/net/s2io.h 2006-10-20
> > 12:41:39.000000000 -0500
> > @@ -1013,6 +1013,11 @@ static void queue_rx_frame(struct sk_buf
> > static void update_L3L4_header(nic_t *sp, lro_t *lro);
> > static void lro_append_pkt(nic_t *sp, lro_t *lro, struct sk_buff
*skb,
> > u32 tcp_len);
> >
> > +static pci_ers_result_t s2io_io_error_detected(struct pci_dev
*pdev,
> > +
pci_channel_state_t
> > state);
> > +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev);
> > +static void s2io_io_resume(struct pci_dev *pdev);
> > +
> > #define s2io_tcp_mss(skb) skb_shinfo(skb)->gso_size
> > #define s2io_udp_mss(skb) skb_shinfo(skb)->gso_size
> > #define s2io_offload_type(skb) skb_shinfo(skb)->gso_type
> > -
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html