2024-01-26 17:43:25

by Brett Creeley

Subject: [PATCH v2 net-next 0/10] pds_core: Various improvements and AQ race condition cleanup

This series includes the following changes:

There can be many users of pds_core's adminq. This includes
pds_core's own uses and any clients that depend on it. When the pds_core
device goes through a reset for any reason, the adminq is freed
and reconfigured. There are some gaps in the current implementation
that will cause crashes during reset if any of the previously mentioned
users of the adminq attempt to use it after it has been freed.

Fixes for issues around how resets are handled, specifically regarding
the driver's error handlers.

Some general cleanups.

v1:
https://lore.kernel.org/netdev/[email protected]/

v2:
- Combined the RCT clean-ups with an incorrect goto label fix
- Added a couple more patches related to reset flows
- Slightly updated the cover letter to mention the extra patches that
were added
- Changed a function used only once to be static

Brett Creeley (10):
pds_core: Prevent health thread from running during reset/remove
pds_core: Cancel AQ work on teardown
pds_core: Use struct pdsc for the pdsc_adminq_isr private data
pds_core: Prevent race issues involving the adminq
pds_core: Clear BARs on reset
pds_core: Don't assign interrupt index/bound_intr to notifyq
pds_core: Unmask adminq interrupt in work thread
pds_core: Fix up some minor issues
pds_core: Rework teardown/setup flow to be more common
pds_core: Clean up init/uninit flows to be more readable

drivers/net/ethernet/amd/pds_core/adminq.c | 74 +++++++----
drivers/net/ethernet/amd/pds_core/core.c | 130 ++++++++++++--------
drivers/net/ethernet/amd/pds_core/core.h | 3 +-
drivers/net/ethernet/amd/pds_core/debugfs.c | 12 +-
drivers/net/ethernet/amd/pds_core/dev.c | 30 +++--
drivers/net/ethernet/amd/pds_core/devlink.c | 3 +-
drivers/net/ethernet/amd/pds_core/fw.c | 3 +
drivers/net/ethernet/amd/pds_core/main.c | 26 +++-
8 files changed, 187 insertions(+), 94 deletions(-)

--
2.17.1



2024-01-26 17:43:57

by Brett Creeley

Subject: [PATCH v2 net-next 01/10] pds_core: Prevent health thread from running during reset/remove

The PCIe reset handlers can run at the same time as the
health thread. This can cause the health thread to
stomp on the PCIe reset. Fix this by preventing the
health thread from running while a PCIe reset is happening.

As part of this, use timer_shutdown_sync() during reset and
remove to make sure the timer never gets rearmed.

Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
---
drivers/net/ethernet/amd/pds_core/main.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/main.c b/drivers/net/ethernet/amd/pds_core/main.c
index 3080898d7b95..5172a5ad8ec6 100644
--- a/drivers/net/ethernet/amd/pds_core/main.c
+++ b/drivers/net/ethernet/amd/pds_core/main.c
@@ -293,7 +293,7 @@ static int pdsc_init_pf(struct pdsc *pdsc)
err_out_teardown:
pdsc_teardown(pdsc, PDSC_TEARDOWN_REMOVING);
err_out_unmap_bars:
- del_timer_sync(&pdsc->wdtimer);
+ timer_shutdown_sync(&pdsc->wdtimer);
if (pdsc->wq)
destroy_workqueue(pdsc->wq);
mutex_destroy(&pdsc->config_lock);
@@ -420,7 +420,7 @@ static void pdsc_remove(struct pci_dev *pdev)
*/
pdsc_sriov_configure(pdev, 0);

- del_timer_sync(&pdsc->wdtimer);
+ timer_shutdown_sync(&pdsc->wdtimer);
if (pdsc->wq)
destroy_workqueue(pdsc->wq);

@@ -445,10 +445,24 @@ static void pdsc_remove(struct pci_dev *pdev)
devlink_free(dl);
}

+static void pdsc_stop_health_thread(struct pdsc *pdsc)
+{
+ timer_shutdown_sync(&pdsc->wdtimer);
+ if (pdsc->health_work.func)
+ cancel_work_sync(&pdsc->health_work);
+}
+
+static void pdsc_restart_health_thread(struct pdsc *pdsc)
+{
+ timer_setup(&pdsc->wdtimer, pdsc_wdtimer_cb, 0);
+ mod_timer(&pdsc->wdtimer, jiffies + 1);
+}
+
void pdsc_reset_prepare(struct pci_dev *pdev)
{
struct pdsc *pdsc = pci_get_drvdata(pdev);

+ pdsc_stop_health_thread(pdsc);
pdsc_fw_down(pdsc);

pci_free_irq_vectors(pdev);
@@ -486,6 +500,7 @@ void pdsc_reset_done(struct pci_dev *pdev)
}

pdsc_fw_up(pdsc);
+ pdsc_restart_health_thread(pdsc);
}

static const struct pci_error_handlers pdsc_err_handler = {
--
2.17.1
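
For reference, the key API distinction this patch leans on:
del_timer_sync() stops a pending timer but leaves it free to be
re-armed, while timer_shutdown_sync() also rejects any further
re-arming until the timer is set up again. A minimal sketch of the
stop/restart pattern, using hypothetical my_* names rather than the
driver's own structures:

#include <linux/timer.h>
#include <linux/jiffies.h>

struct my_dev {
	struct timer_list wdtimer;
};

static void my_stop_watchdog(struct my_dev *d)
{
	/* Stop the timer and refuse any later mod_timer() attempts */
	timer_shutdown_sync(&d->wdtimer);
}

static void my_restart_watchdog(struct my_dev *d,
				void (*cb)(struct timer_list *))
{
	/* After a shutdown the timer must be re-initialized before use */
	timer_setup(&d->wdtimer, cb, 0);
	mod_timer(&d->wdtimer, jiffies + 1);
}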


2024-01-26 17:44:23

by Brett Creeley

Subject: [PATCH v2 net-next 03/10] pds_core: Use struct pdsc for the pdsc_adminq_isr private data

The initial design for the adminq interrupt was done based
on client drivers having their own adminq and adminq
interrupt. So, each client driver's adminq ISR would use
its specific adminqcq for the private data struct. For the
time being the design has changed to only use a single
adminq for all clients. So, instead use the struct pdsc for
the private data to simplify things a bit.

This also has the benefit of not dereferencing the adminqcq
to access the pdsc struct when the PDSC_S_STOPPING_DRIVER bit
is set and the adminqcq has actually been cleared/freed.

Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
---
drivers/net/ethernet/amd/pds_core/adminq.c | 5 +++--
drivers/net/ethernet/amd/pds_core/core.c | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index 5beadabc2136..68be5ea251fc 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -135,8 +135,8 @@ void pdsc_work_thread(struct work_struct *work)

irqreturn_t pdsc_adminq_isr(int irq, void *data)
{
- struct pdsc_qcq *qcq = data;
- struct pdsc *pdsc = qcq->pdsc;
+ struct pdsc *pdsc = data;
+ struct pdsc_qcq *qcq;

/* Don't process AdminQ when shutting down */
if (pdsc->state & BIT_ULL(PDSC_S_STOPPING_DRIVER)) {
@@ -145,6 +145,7 @@ irqreturn_t pdsc_adminq_isr(int irq, void *data)
return IRQ_HANDLED;
}

+ qcq = &pdsc->adminqcq;
queue_work(pdsc->wq, &qcq->work);
pds_core_intr_mask(&pdsc->intr_ctrl[qcq->intx], PDS_CORE_INTR_MASK_CLEAR);

diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index b582729331eb..0356e56a6e99 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -125,7 +125,7 @@ static int pdsc_qcq_intr_alloc(struct pdsc *pdsc, struct pdsc_qcq *qcq)

snprintf(name, sizeof(name), "%s-%d-%s",
PDS_CORE_DRV_NAME, pdsc->pdev->bus->number, qcq->q.name);
- index = pdsc_intr_alloc(pdsc, name, pdsc_adminq_isr, qcq);
+ index = pdsc_intr_alloc(pdsc, name, pdsc_adminq_isr, pdsc);
if (index < 0)
return index;
qcq->intx = index;
--
2.17.1


2024-01-26 17:44:27

by Brett Creeley

Subject: [PATCH v2 net-next 02/10] pds_core: Cancel AQ work on teardown

There is a small window where pdsc_work_thread()
calls pdsc_process_adminq() and pdsc_process_adminq()
passes the PDSC_S_STOPPING_DRIVER check and starts
to process adminq/notifyq work and then the driver
starts a fw_down cycle. This could cause some
undefined behavior if the notifyqcq/adminqcq are
freed while pdsc_process_adminq() is running. Use
cancel_work_sync() on the adminqcq's work struct
to make sure any pending work items are cancelled
and any in progress work items are completed.

Also, make sure to not call cancel_work_sync() if
the work item has not been initialized. Without this,
traces will happen in cases where a reset fails and
teardown is called again or if reset fails and the
driver is removed.

Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
---
drivers/net/ethernet/amd/pds_core/core.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 0d2091e9eb28..b582729331eb 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -464,6 +464,8 @@ void pdsc_teardown(struct pdsc *pdsc, bool removing)

if (!pdsc->pdev->is_virtfn)
pdsc_devcmd_reset(pdsc);
+ if (pdsc->adminqcq.work.func)
+ cancel_work_sync(&pdsc->adminqcq.work);
pdsc_qcq_free(pdsc, &pdsc->notifyqcq);
pdsc_qcq_free(pdsc, &pdsc->adminqcq);

--
2.17.1
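
A side note on the ->func check above: cancel_work_sync() should only
ever see a work item that has been through INIT_WORK(), and a
never-initialized work_struct (e.g. one that is still all zeroes from
allocation) has a NULL ->func, so that field doubles as a cheap
"was this ever set up?" test. A small illustrative sketch with a
hypothetical my_dev, not the driver's own code:

#include <linux/workqueue.h>

struct my_dev {
	struct work_struct aq_work;	/* zeroed at allocation time */
};

static void my_teardown(struct my_dev *d)
{
	/* Skip the cancel if INIT_WORK() never ran on this item; handing
	 * an uninitialized work_struct to cancel_work_sync() can trigger
	 * the kind of stack traces the commit message mentions.
	 */
	if (d->aq_work.func)
		cancel_work_sync(&d->aq_work);
}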


2024-01-26 17:44:56

by Brett Creeley

Subject: [PATCH v2 net-next 05/10] pds_core: Clear BARs on reset

During reset the BARs might be accessed when they are
unmapped. This can cause unexpected issues, so fix it by
clearing the cached BAR values so they are not accessed
until they are re-mapped.

Also, make sure any code paths that could access the BARs
while they are NULL are guarded against doing so.

Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
---
drivers/net/ethernet/amd/pds_core/adminq.c | 28 +++++++++++++++------
drivers/net/ethernet/amd/pds_core/core.c | 8 +++++-
drivers/net/ethernet/amd/pds_core/dev.c | 9 ++++++-
drivers/net/ethernet/amd/pds_core/devlink.c | 3 ++-
drivers/net/ethernet/amd/pds_core/fw.c | 3 +++
drivers/net/ethernet/amd/pds_core/main.c | 5 ++++
6 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index 5edff33d56f3..ea773cfa0af6 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -191,10 +191,16 @@ static int __pdsc_adminq_post(struct pdsc *pdsc,

/* Check that the FW is running */
if (!pdsc_is_fw_running(pdsc)) {
- u8 fw_status = ioread8(&pdsc->info_regs->fw_status);
-
- dev_info(pdsc->dev, "%s: post failed - fw not running %#02x:\n",
- __func__, fw_status);
+ if (pdsc->info_regs) {
+ u8 fw_status =
+ ioread8(&pdsc->info_regs->fw_status);
+
+ dev_info(pdsc->dev, "%s: post failed - fw not running %#02x:\n",
+ __func__, fw_status);
+ } else {
+ dev_info(pdsc->dev, "%s: post failed - BARs not setup\n",
+ __func__);
+ }
ret = -ENXIO;

goto err_out_unlock;
@@ -266,10 +272,16 @@ int pdsc_adminq_post(struct pdsc *pdsc,
break;

if (!pdsc_is_fw_running(pdsc)) {
- u8 fw_status = ioread8(&pdsc->info_regs->fw_status);
-
- dev_dbg(pdsc->dev, "%s: post wait failed - fw not running %#02x:\n",
- __func__, fw_status);
+ if (pdsc->info_regs) {
+ u8 fw_status =
+ ioread8(&pdsc->info_regs->fw_status);
+
+ dev_dbg(pdsc->dev, "%s: post wait failed - fw not running %#02x:\n",
+ __func__, fw_status);
+ } else {
+ dev_dbg(pdsc->dev, "%s: post wait failed - BARs not setup\n",
+ __func__);
+ }
err = -ENXIO;
break;
}
diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index f44333bd1256..65c8a7072e35 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -600,7 +600,13 @@ void pdsc_fw_up(struct pdsc *pdsc)

static void pdsc_check_pci_health(struct pdsc *pdsc)
{
- u8 fw_status = ioread8(&pdsc->info_regs->fw_status);
+ u8 fw_status;
+
+ /* some sort of teardown already in progress */
+ if (!pdsc->info_regs)
+ return;
+
+ fw_status = ioread8(&pdsc->info_regs->fw_status);

/* is PCI broken? */
if (fw_status != PDS_RC_BAD_PCI)
diff --git a/drivers/net/ethernet/amd/pds_core/dev.c b/drivers/net/ethernet/amd/pds_core/dev.c
index 31940b857e0e..62a38e0a8454 100644
--- a/drivers/net/ethernet/amd/pds_core/dev.c
+++ b/drivers/net/ethernet/amd/pds_core/dev.c
@@ -57,6 +57,9 @@ int pdsc_err_to_errno(enum pds_core_status_code code)

bool pdsc_is_fw_running(struct pdsc *pdsc)
{
+ if (!pdsc->info_regs)
+ return false;
+
pdsc->fw_status = ioread8(&pdsc->info_regs->fw_status);
pdsc->last_fw_time = jiffies;
pdsc->last_hb = ioread32(&pdsc->info_regs->fw_heartbeat);
@@ -182,13 +185,17 @@ int pdsc_devcmd_locked(struct pdsc *pdsc, union pds_core_dev_cmd *cmd,
{
int err;

+ if (!pdsc->cmd_regs)
+ return -ENXIO;
+
memcpy_toio(&pdsc->cmd_regs->cmd, cmd, sizeof(*cmd));
pdsc_devcmd_dbell(pdsc);
err = pdsc_devcmd_wait(pdsc, cmd->opcode, max_seconds);
- memcpy_fromio(comp, &pdsc->cmd_regs->comp, sizeof(*comp));

if ((err == -ENXIO || err == -ETIMEDOUT) && pdsc->wq)
queue_work(pdsc->wq, &pdsc->health_work);
+ else
+ memcpy_fromio(comp, &pdsc->cmd_regs->comp, sizeof(*comp));

return err;
}
diff --git a/drivers/net/ethernet/amd/pds_core/devlink.c b/drivers/net/ethernet/amd/pds_core/devlink.c
index e9948ea5bbcd..54864f27c87a 100644
--- a/drivers/net/ethernet/amd/pds_core/devlink.c
+++ b/drivers/net/ethernet/amd/pds_core/devlink.c
@@ -111,7 +111,8 @@ int pdsc_dl_info_get(struct devlink *dl, struct devlink_info_req *req,

mutex_lock(&pdsc->devcmd_lock);
err = pdsc_devcmd_locked(pdsc, &cmd, &comp, pdsc->devcmd_timeout * 2);
- memcpy_fromio(&fw_list, pdsc->cmd_regs->data, sizeof(fw_list));
+ if (!err)
+ memcpy_fromio(&fw_list, pdsc->cmd_regs->data, sizeof(fw_list));
mutex_unlock(&pdsc->devcmd_lock);
if (err && err != -EIO)
return err;
diff --git a/drivers/net/ethernet/amd/pds_core/fw.c b/drivers/net/ethernet/amd/pds_core/fw.c
index 90a811f3878a..fa626719e68d 100644
--- a/drivers/net/ethernet/amd/pds_core/fw.c
+++ b/drivers/net/ethernet/amd/pds_core/fw.c
@@ -107,6 +107,9 @@ int pdsc_firmware_update(struct pdsc *pdsc, const struct firmware *fw,

dev_info(pdsc->dev, "Installing firmware\n");

+ if (!pdsc->cmd_regs)
+ return -ENXIO;
+
dl = priv_to_devlink(pdsc);
devlink_flash_update_status_notify(dl, "Preparing to flash",
NULL, 0, 0);
diff --git a/drivers/net/ethernet/amd/pds_core/main.c b/drivers/net/ethernet/amd/pds_core/main.c
index 5172a5ad8ec6..05fdeb235e5f 100644
--- a/drivers/net/ethernet/amd/pds_core/main.c
+++ b/drivers/net/ethernet/amd/pds_core/main.c
@@ -37,6 +37,11 @@ static void pdsc_unmap_bars(struct pdsc *pdsc)
struct pdsc_dev_bar *bars = pdsc->bars;
unsigned int i;

+ pdsc->info_regs = NULL;
+ pdsc->cmd_regs = NULL;
+ pdsc->intr_status = NULL;
+ pdsc->intr_ctrl = NULL;
+
for (i = 0; i < PDS_CORE_BARS_MAX; i++) {
if (bars[i].vaddr)
pci_iounmap(pdsc->pdev, bars[i].vaddr);
--
2.17.1


2024-01-26 17:44:59

by Brett Creeley

Subject: [PATCH v2 net-next 06/10] pds_core: Don't assign interrupt index/bound_intr to notifyq

The notifyq rides on the adminq's interrupt, so there's
no need to setup and/or access the notifyq's interrupt
index or bound_intr. The driver sets the bound_intr
using qcq->intx = -1 for the notifyq, but luckily
nothing accesses that field for notifyq. Instead of
expecting that remains the case, just clean up
the notifyq's interrupt index and bound_intr fields.

Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
---
drivers/net/ethernet/amd/pds_core/core.c | 5 +----
drivers/net/ethernet/amd/pds_core/debugfs.c | 3 ++-
2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 65c8a7072e35..fe7e1d10224c 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -129,6 +129,7 @@ static int pdsc_qcq_intr_alloc(struct pdsc *pdsc, struct pdsc_qcq *qcq)
if (index < 0)
return index;
qcq->intx = index;
+ qcq->cq.bound_intr = &pdsc->intr_info[index];

return 0;
}
@@ -222,7 +223,6 @@ int pdsc_qcq_alloc(struct pdsc *pdsc, unsigned int type, unsigned int index,
goto err_out_free_irq;
}

- qcq->cq.bound_intr = &pdsc->intr_info[qcq->intx];
qcq->cq.num_descs = num_descs;
qcq->cq.desc_size = cq_desc_size;
qcq->cq.tail_idx = 0;
@@ -433,9 +433,6 @@ int pdsc_setup(struct pdsc *pdsc, bool init)
if (err)
goto err_out_teardown;

- /* NotifyQ rides on the AdminQ interrupt */
- pdsc->notifyqcq.intx = pdsc->adminqcq.intx;
-
/* Set up the Core with the AdminQ and NotifyQ info */
err = pdsc_core_init(pdsc);
if (err)
diff --git a/drivers/net/ethernet/amd/pds_core/debugfs.c b/drivers/net/ethernet/amd/pds_core/debugfs.c
index 8ec392299b7d..dd6aaeecfe0a 100644
--- a/drivers/net/ethernet/amd/pds_core/debugfs.c
+++ b/drivers/net/ethernet/amd/pds_core/debugfs.c
@@ -105,7 +105,6 @@ void pdsc_debugfs_add_qcq(struct pdsc *pdsc, struct pdsc_qcq *qcq)
struct dentry *qcq_dentry, *q_dentry, *cq_dentry;
struct dentry *intr_dentry;
struct debugfs_regset32 *intr_ctrl_regset;
- struct pdsc_intr_info *intr = &pdsc->intr_info[qcq->intx];
struct pdsc_queue *q = &qcq->q;
struct pdsc_cq *cq = &qcq->cq;

@@ -143,6 +142,8 @@ void pdsc_debugfs_add_qcq(struct pdsc *pdsc, struct pdsc_qcq *qcq)
debugfs_create_u16("tail", 0400, cq_dentry, &cq->tail_idx);

if (qcq->flags & PDS_CORE_QCQ_F_INTR) {
+ struct pdsc_intr_info *intr = &pdsc->intr_info[qcq->intx];
+
intr_dentry = debugfs_create_dir("intr", qcq->dentry);
if (IS_ERR_OR_NULL(intr_dentry))
return;
--
2.17.1


2024-01-26 17:45:56

by Brett Creeley

Subject: [PATCH v2 net-next 07/10] pds_core: Unmask adminq interrupt in work thread

Unmasking the interrupt during pdsc_adminq_isr() is a bit
early and could cause unnecessary interrupts.
Instead always unmask after processing the adminq
and notifyq in pdsc_work_thread()->pdsc_process_adminq().
Also, since we are always unmasking, there's no need
for the local credits variable in pdsc_process_adminq().

Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
---
drivers/net/ethernet/amd/pds_core/adminq.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index ea773cfa0af6..c83a0a80d533 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -82,7 +82,6 @@ void pdsc_process_adminq(struct pdsc_qcq *qcq)
unsigned long irqflags;
int nq_work = 0;
int aq_work = 0;
- int credits;

/* Don't process AdminQ when it's not up */
if (!pdsc_adminq_inc_if_up(pdsc)) {
@@ -128,11 +127,9 @@ void pdsc_process_adminq(struct pdsc_qcq *qcq)

credits:
/* Return the interrupt credits, one for each completion */
- credits = nq_work + aq_work;
- if (credits)
- pds_core_intr_credits(&pdsc->intr_ctrl[qcq->intx],
- credits,
- PDS_CORE_INTR_CRED_REARM);
+ pds_core_intr_credits(&pdsc->intr_ctrl[qcq->intx],
+ nq_work + aq_work,
+ PDS_CORE_INTR_CRED_REARM);
refcount_dec(&pdsc->adminq_refcnt);
}

@@ -157,7 +154,6 @@ irqreturn_t pdsc_adminq_isr(int irq, void *data)

qcq = &pdsc->adminqcq;
queue_work(pdsc->wq, &qcq->work);
- pds_core_intr_mask(&pdsc->intr_ctrl[qcq->intx], PDS_CORE_INTR_MASK_CLEAR);
refcount_dec(&pdsc->adminq_refcnt);

return IRQ_HANDLED;
--
2.17.1


2024-01-26 17:48:12

by Brett Creeley

Subject: [PATCH v2 net-next 08/10] pds_core: Fix up some minor issues

Running xmastree.py against the driver found some
reverse Christmas tree (RCT) ordering issues, so fix them.

Also, if allocating pdsc->intr_info in pdsc_dev_init()
fails, the driver still tries to free pdsc->intr_info.
Fix this by just returning -ENOMEM since there's
nothing to free at this point of failure.

Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
---
drivers/net/ethernet/amd/pds_core/debugfs.c | 5 ++---
drivers/net/ethernet/amd/pds_core/dev.c | 6 ++----
2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/debugfs.c b/drivers/net/ethernet/amd/pds_core/debugfs.c
index dd6aaeecfe0a..d56fdbb4cdb9 100644
--- a/drivers/net/ethernet/amd/pds_core/debugfs.c
+++ b/drivers/net/ethernet/amd/pds_core/debugfs.c
@@ -32,8 +32,8 @@ void pdsc_debugfs_del_dev(struct pdsc *pdsc)

static int identity_show(struct seq_file *seq, void *v)
{
- struct pdsc *pdsc = seq->private;
struct pds_core_dev_identity *ident;
+ struct pdsc *pdsc = seq->private;
int vt;

ident = &pdsc->dev_ident;
@@ -102,8 +102,7 @@ static const struct debugfs_reg32 intr_ctrl_regs[] = {

void pdsc_debugfs_add_qcq(struct pdsc *pdsc, struct pdsc_qcq *qcq)
{
- struct dentry *qcq_dentry, *q_dentry, *cq_dentry;
- struct dentry *intr_dentry;
+ struct dentry *qcq_dentry, *q_dentry, *cq_dentry, *intr_dentry;
struct debugfs_regset32 *intr_ctrl_regset;
struct pdsc_queue *q = &qcq->q;
struct pdsc_cq *cq = &qcq->cq;
diff --git a/drivers/net/ethernet/amd/pds_core/dev.c b/drivers/net/ethernet/amd/pds_core/dev.c
index 62a38e0a8454..b237cea65a5e 100644
--- a/drivers/net/ethernet/amd/pds_core/dev.c
+++ b/drivers/net/ethernet/amd/pds_core/dev.c
@@ -348,10 +348,8 @@ int pdsc_dev_init(struct pdsc *pdsc)

/* Get intr_info struct array for tracking */
pdsc->intr_info = kcalloc(nintrs, sizeof(*pdsc->intr_info), GFP_KERNEL);
- if (!pdsc->intr_info) {
- err = -ENOMEM;
- goto err_out;
- }
+ if (!pdsc->intr_info)
+ return -ENOMEM;

err = pci_alloc_irq_vectors(pdsc->pdev, nintrs, nintrs, PCI_IRQ_MSIX);
if (err != nintrs) {
--
2.17.1


2024-01-26 17:49:02

by Brett Creeley

Subject: [PATCH v2 net-next 04/10] pds_core: Prevent race issues involving the adminq

There are multiple paths that can result in using the pdsc's
adminq.

[1] pdsc_adminq_isr and the resulting work from queue_work(),
i.e. pdsc_work_thread()->pdsc_process_adminq()

[2] pdsc_adminq_post()

When the device goes through reset via PCIe reset and/or
a fw_down/fw_up cycle due to bad PCIe state or bad device
state, the adminq is destroyed and recreated.

A NULL pointer dereference can happen if [1] or [2] happens
after the adminq is already destroyed.

In order to fix this, add some further state checks and
implement reference counting for adminq uses. Reference
counting was used because multiple threads can attempt to
access the adminq at the same time via [1] or [2]. Additionally,
multiple clients (i.e. pds-vfio-pci) can be using [2]
at the same time.

The adminq_refcnt is initialized to 1 when the adminq has been
allocated and is ready to use. Users/clients of the adminq
(i.e. [1] and [2]) will increment the refcnt when they are using
the adminq. When the driver goes into a fw_down cycle it will
set the PDSC_S_FW_DEAD bit and then wait for the adminq_refcnt
to hit 1. Setting the PDSC_S_FW_DEAD bit before waiting prevents
any further adminq_refcnt increments. Waiting for the
adminq_refcnt to hit 1 allows for any current users of the adminq
to finish before the driver frees the adminq. Once the
adminq_refcnt hits 1 the driver clears the refcnt to signify that
the adminq is deleted and cannot be used. On the fw_up cycle the
driver will once again initialize the adminq_refcnt to 1 allowing
the adminq to be used again.

Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
---
drivers/net/ethernet/amd/pds_core/adminq.c | 31 +++++++++++++++++-----
drivers/net/ethernet/amd/pds_core/core.c | 21 +++++++++++++++
drivers/net/ethernet/amd/pds_core/core.h | 1 +
3 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index 68be5ea251fc..5edff33d56f3 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -63,6 +63,15 @@ static int pdsc_process_notifyq(struct pdsc_qcq *qcq)
return nq_work;
}

+static bool pdsc_adminq_inc_if_up(struct pdsc *pdsc)
+{
+ if (pdsc->state & BIT_ULL(PDSC_S_STOPPING_DRIVER) ||
+ pdsc->state & BIT_ULL(PDSC_S_FW_DEAD))
+ return false;
+
+ return refcount_inc_not_zero(&pdsc->adminq_refcnt);
+}
+
void pdsc_process_adminq(struct pdsc_qcq *qcq)
{
union pds_core_adminq_comp *comp;
@@ -75,9 +84,9 @@ void pdsc_process_adminq(struct pdsc_qcq *qcq)
int aq_work = 0;
int credits;

- /* Don't process AdminQ when shutting down */
- if (pdsc->state & BIT_ULL(PDSC_S_STOPPING_DRIVER)) {
- dev_err(pdsc->dev, "%s: called while PDSC_S_STOPPING_DRIVER\n",
+ /* Don't process AdminQ when it's not up */
+ if (!pdsc_adminq_inc_if_up(pdsc)) {
+ dev_err(pdsc->dev, "%s: called while adminq is unavailable\n",
__func__);
return;
}
@@ -124,6 +133,7 @@ void pdsc_process_adminq(struct pdsc_qcq *qcq)
pds_core_intr_credits(&pdsc->intr_ctrl[qcq->intx],
credits,
PDS_CORE_INTR_CRED_REARM);
+ refcount_dec(&pdsc->adminq_refcnt);
}

void pdsc_work_thread(struct work_struct *work)
@@ -138,9 +148,9 @@ irqreturn_t pdsc_adminq_isr(int irq, void *data)
struct pdsc *pdsc = data;
struct pdsc_qcq *qcq;

- /* Don't process AdminQ when shutting down */
- if (pdsc->state & BIT_ULL(PDSC_S_STOPPING_DRIVER)) {
- dev_err(pdsc->dev, "%s: called while PDSC_S_STOPPING_DRIVER\n",
+ /* Don't process AdminQ when it's not up */
+ if (!pdsc_adminq_inc_if_up(pdsc)) {
+ dev_err(pdsc->dev, "%s: called while adminq is unavailable\n",
__func__);
return IRQ_HANDLED;
}
@@ -148,6 +158,7 @@ irqreturn_t pdsc_adminq_isr(int irq, void *data)
qcq = &pdsc->adminqcq;
queue_work(pdsc->wq, &qcq->work);
pds_core_intr_mask(&pdsc->intr_ctrl[qcq->intx], PDS_CORE_INTR_MASK_CLEAR);
+ refcount_dec(&pdsc->adminq_refcnt);

return IRQ_HANDLED;
}
@@ -231,6 +242,12 @@ int pdsc_adminq_post(struct pdsc *pdsc,
int err = 0;
int index;

+ if (!pdsc_adminq_inc_if_up(pdsc)) {
+ dev_dbg(pdsc->dev, "%s: preventing adminq cmd %u\n",
+ __func__, cmd->opcode);
+ return -ENXIO;
+ }
+
wc.qcq = &pdsc->adminqcq;
index = __pdsc_adminq_post(pdsc, &pdsc->adminqcq, cmd, comp, &wc);
if (index < 0) {
@@ -286,6 +303,8 @@ int pdsc_adminq_post(struct pdsc *pdsc,
queue_work(pdsc->wq, &pdsc->health_work);
}

+ refcount_dec(&pdsc->adminq_refcnt);
+
return err;
}
EXPORT_SYMBOL_GPL(pdsc_adminq_post);
diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 0356e56a6e99..f44333bd1256 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -450,6 +450,7 @@ int pdsc_setup(struct pdsc *pdsc, bool init)
pdsc_debugfs_add_viftype(pdsc);
}

+ refcount_set(&pdsc->adminq_refcnt, 1);
clear_bit(PDSC_S_FW_DEAD, &pdsc->state);
return 0;

@@ -514,6 +515,24 @@ void pdsc_stop(struct pdsc *pdsc)
PDS_CORE_INTR_MASK_SET);
}

+static void pdsc_adminq_wait_and_dec_once_unused(struct pdsc *pdsc)
+{
+ /* The driver initializes the adminq_refcnt to 1 when the adminq is
+ * allocated and ready for use. Other users/requesters will increment
+ * the refcnt while in use. If the refcnt is down to 1 then the adminq
+ * is not in use and the refcnt can be cleared and adminq freed. Before
+ * calling this function the driver will set PDSC_S_FW_DEAD, which
+ * prevent subsequent attempts to use the adminq and increment the
+ * refcnt to fail. This guarantees that this function will eventually
+ * exit.
+ */
+ while (!refcount_dec_if_one(&pdsc->adminq_refcnt)) {
+ dev_dbg_ratelimited(pdsc->dev, "%s: adminq in use\n",
+ __func__);
+ cpu_relax();
+ }
+}
+
void pdsc_fw_down(struct pdsc *pdsc)
{
union pds_core_notifyq_comp reset_event = {
@@ -529,6 +548,8 @@ void pdsc_fw_down(struct pdsc *pdsc)
if (pdsc->pdev->is_virtfn)
return;

+ pdsc_adminq_wait_and_dec_once_unused(pdsc);
+
/* Notify clients of fw_down */
if (pdsc->fw_reporter)
devlink_health_report(pdsc->fw_reporter, "FW down reported", pdsc);
diff --git a/drivers/net/ethernet/amd/pds_core/core.h b/drivers/net/ethernet/amd/pds_core/core.h
index e35d3e7006bf..cbd5716f46e6 100644
--- a/drivers/net/ethernet/amd/pds_core/core.h
+++ b/drivers/net/ethernet/amd/pds_core/core.h
@@ -184,6 +184,7 @@ struct pdsc {
struct mutex devcmd_lock; /* lock for dev_cmd operations */
struct mutex config_lock; /* lock for configuration operations */
spinlock_t adminq_lock; /* lock for adminq operations */
+ refcount_t adminq_refcnt;
struct pds_core_dev_info_regs __iomem *info_regs;
struct pds_core_dev_cmd_regs __iomem *cmd_regs;
struct pds_core_intr __iomem *intr_ctrl;
--
2.17.1
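
For readers following the refcount scheme described in the commit
message, here is a condensed sketch of the gate-and-drain pattern. It
uses the same kernel refcount primitives as the patch, but the my_*
names and the MY_FW_DEAD bit are hypothetical stand-ins for the
driver's own structures:

#include <linux/refcount.h>
#include <linux/bitops.h>
#include <asm/processor.h>	/* cpu_relax() */

#define MY_FW_DEAD	0

struct my_dev {
	unsigned long state;
	refcount_t adminq_refcnt;
};

/* Setup/fw_up: mark the queue usable with an initial count of 1 */
static void my_adminq_ready(struct my_dev *d)
{
	refcount_set(&d->adminq_refcnt, 1);
	clear_bit(MY_FW_DEAD, &d->state);
}

/* Users take a reference only while the queue is considered alive */
static bool my_adminq_get(struct my_dev *d)
{
	if (test_bit(MY_FW_DEAD, &d->state))
		return false;
	return refcount_inc_not_zero(&d->adminq_refcnt);
}

static void my_adminq_put(struct my_dev *d)
{
	refcount_dec(&d->adminq_refcnt);
}

/* fw_down: set MY_FW_DEAD first so no new references can be taken,
 * then wait for in-flight users to drop the count back to 1 and take
 * it to 0, after which the queue can be freed safely.
 */
static void my_adminq_drain(struct my_dev *d)
{
	set_bit(MY_FW_DEAD, &d->state);
	while (!refcount_dec_if_one(&d->adminq_refcnt))
		cpu_relax();
}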


2024-01-26 17:49:18

by Brett Creeley

Subject: [PATCH v2 net-next 10/10] pds_core: Clean up init/uninit flows to be more readable

The setup and teardown flows are somewhat hard to follow regarding
pdsc_core_init()/pdsc_dev_init() and their corresponding teardown
flows being in pdsc_teardown(). Improve the readability by adding
new pdsc_core_uninit()/pdsc_dev_uninit() functions that mirror their
init counterparts. Also, move the notify and admin qcq allocations
into pdsc_core_init(), so they can be freed in pdsc_core_uninit().

Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
---
drivers/net/ethernet/amd/pds_core/core.c | 87 ++++++++++++------------
drivers/net/ethernet/amd/pds_core/core.h | 1 +
drivers/net/ethernet/amd/pds_core/dev.c | 16 +++++
3 files changed, 61 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 41507ade3570..1234a4a8a4ae 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -300,6 +300,17 @@ int pdsc_qcq_alloc(struct pdsc *pdsc, unsigned int type, unsigned int index,
return err;
}

+static void pdsc_core_uninit(struct pdsc *pdsc)
+{
+ pdsc_qcq_free(pdsc, &pdsc->notifyqcq);
+ pdsc_qcq_free(pdsc, &pdsc->adminqcq);
+
+ if (pdsc->kern_dbpage) {
+ iounmap(pdsc->kern_dbpage);
+ pdsc->kern_dbpage = NULL;
+ }
+}
+
static int pdsc_core_init(struct pdsc *pdsc)
{
union pds_core_dev_comp comp = {};
@@ -310,9 +321,32 @@ static int pdsc_core_init(struct pdsc *pdsc)
struct pds_core_dev_init_data_in cidi;
u32 dbid_count;
u32 dbpage_num;
+ int numdescs;
size_t sz;
int err;

+ /* Scale the descriptor ring length based on number of CPUs and VFs */
+ numdescs = max_t(int, PDSC_ADMINQ_MIN_LENGTH, num_online_cpus());
+ numdescs += 2 * pci_sriov_get_totalvfs(pdsc->pdev);
+ numdescs = roundup_pow_of_two(numdescs);
+ err = pdsc_qcq_alloc(pdsc, PDS_CORE_QTYPE_ADMINQ, 0, "adminq",
+ PDS_CORE_QCQ_F_CORE | PDS_CORE_QCQ_F_INTR,
+ numdescs,
+ sizeof(union pds_core_adminq_cmd),
+ sizeof(union pds_core_adminq_comp),
+ 0, &pdsc->adminqcq);
+ if (err)
+ return err;
+
+ err = pdsc_qcq_alloc(pdsc, PDS_CORE_QTYPE_NOTIFYQ, 0, "notifyq",
+ PDS_CORE_QCQ_F_NOTIFYQ,
+ PDSC_NOTIFYQ_LENGTH,
+ sizeof(struct pds_core_notifyq_cmd),
+ sizeof(union pds_core_notifyq_comp),
+ 0, &pdsc->notifyqcq);
+ if (err)
+ goto err_out_uninit;
+
cidi.adminq_q_base = cpu_to_le64(pdsc->adminqcq.q_base_pa);
cidi.adminq_cq_base = cpu_to_le64(pdsc->adminqcq.cq_base_pa);
cidi.notifyq_cq_base = cpu_to_le64(pdsc->notifyqcq.cq.base_pa);
@@ -336,7 +370,7 @@ static int pdsc_core_init(struct pdsc *pdsc)
if (err) {
dev_err(pdsc->dev, "Device init command failed: %pe\n",
ERR_PTR(err));
- return err;
+ goto err_out_uninit;
}

pdsc->hw_index = le32_to_cpu(cido.core_hw_index);
@@ -346,7 +380,8 @@ static int pdsc_core_init(struct pdsc *pdsc)
pdsc->kern_dbpage = pdsc_map_dbpage(pdsc, dbpage_num);
if (!pdsc->kern_dbpage) {
dev_err(pdsc->dev, "Cannot map dbpage, aborting\n");
- return -ENOMEM;
+ err = -ENOMEM;
+ goto err_out_uninit;
}

pdsc->adminqcq.q.hw_type = cido.adminq_hw_type;
@@ -359,6 +394,10 @@ static int pdsc_core_init(struct pdsc *pdsc)

pdsc->last_eid = 0;

+ return 0;
+
+err_out_uninit:
+ pdsc_core_uninit(pdsc);
return err;
}

@@ -401,35 +440,12 @@ static int pdsc_viftypes_init(struct pdsc *pdsc)

int pdsc_setup(struct pdsc *pdsc, bool init)
{
- int numdescs;
int err;

err = pdsc_dev_init(pdsc);
if (err)
return err;

- /* Scale the descriptor ring length based on number of CPUs and VFs */
- numdescs = max_t(int, PDSC_ADMINQ_MIN_LENGTH, num_online_cpus());
- numdescs += 2 * pci_sriov_get_totalvfs(pdsc->pdev);
- numdescs = roundup_pow_of_two(numdescs);
- err = pdsc_qcq_alloc(pdsc, PDS_CORE_QTYPE_ADMINQ, 0, "adminq",
- PDS_CORE_QCQ_F_CORE | PDS_CORE_QCQ_F_INTR,
- numdescs,
- sizeof(union pds_core_adminq_cmd),
- sizeof(union pds_core_adminq_comp),
- 0, &pdsc->adminqcq);
- if (err)
- goto err_out_teardown;
-
- err = pdsc_qcq_alloc(pdsc, PDS_CORE_QTYPE_NOTIFYQ, 0, "notifyq",
- PDS_CORE_QCQ_F_NOTIFYQ,
- PDSC_NOTIFYQ_LENGTH,
- sizeof(struct pds_core_notifyq_cmd),
- sizeof(union pds_core_notifyq_comp),
- 0, &pdsc->notifyqcq);
- if (err)
- goto err_out_teardown;
-
/* Set up the Core with the AdminQ and NotifyQ info */
err = pdsc_core_init(pdsc);
if (err)
@@ -455,35 +471,20 @@ int pdsc_setup(struct pdsc *pdsc, bool init)

void pdsc_teardown(struct pdsc *pdsc, bool removing)
{
- int i;
-
if (!pdsc->pdev->is_virtfn)
pdsc_devcmd_reset(pdsc);
if (pdsc->adminqcq.work.func)
cancel_work_sync(&pdsc->adminqcq.work);
- pdsc_qcq_free(pdsc, &pdsc->notifyqcq);
- pdsc_qcq_free(pdsc, &pdsc->adminqcq);
+
+ pdsc_core_uninit(pdsc);

if (removing) {
kfree(pdsc->viftype_status);
pdsc->viftype_status = NULL;
}

- if (pdsc->intr_info) {
- for (i = 0; i < pdsc->nintrs; i++)
- pdsc_intr_free(pdsc, i);
-
- kfree(pdsc->intr_info);
- pdsc->intr_info = NULL;
- pdsc->nintrs = 0;
- }
-
- if (pdsc->kern_dbpage) {
- iounmap(pdsc->kern_dbpage);
- pdsc->kern_dbpage = NULL;
- }
+ pdsc_dev_uninit(pdsc);

- pci_free_irq_vectors(pdsc->pdev);
set_bit(PDSC_S_FW_DEAD, &pdsc->state);
}

diff --git a/drivers/net/ethernet/amd/pds_core/core.h b/drivers/net/ethernet/amd/pds_core/core.h
index 110c4b826b22..3468a63f5ae9 100644
--- a/drivers/net/ethernet/amd/pds_core/core.h
+++ b/drivers/net/ethernet/amd/pds_core/core.h
@@ -282,6 +282,7 @@ int pdsc_devcmd_locked(struct pdsc *pdsc, union pds_core_dev_cmd *cmd,
int pdsc_devcmd_init(struct pdsc *pdsc);
int pdsc_devcmd_reset(struct pdsc *pdsc);
int pdsc_dev_init(struct pdsc *pdsc);
+void pdsc_dev_uninit(struct pdsc *pdsc);

void pdsc_reset_prepare(struct pci_dev *pdev);
void pdsc_reset_done(struct pci_dev *pdev);
diff --git a/drivers/net/ethernet/amd/pds_core/dev.c b/drivers/net/ethernet/amd/pds_core/dev.c
index 7dc102a31185..e494e1298dc9 100644
--- a/drivers/net/ethernet/amd/pds_core/dev.c
+++ b/drivers/net/ethernet/amd/pds_core/dev.c
@@ -316,6 +316,22 @@ static int pdsc_identify(struct pdsc *pdsc)
return 0;
}

+void pdsc_dev_uninit(struct pdsc *pdsc)
+{
+ if (pdsc->intr_info) {
+ int i;
+
+ for (i = 0; i < pdsc->nintrs; i++)
+ pdsc_intr_free(pdsc, i);
+
+ kfree(pdsc->intr_info);
+ pdsc->intr_info = NULL;
+ pdsc->nintrs = 0;
+ }
+
+ pci_free_irq_vectors(pdsc->pdev);
+}
+
int pdsc_dev_init(struct pdsc *pdsc)
{
unsigned int nintrs;
--
2.17.1


2024-01-26 17:49:22

by Brett Creeley

Subject: [PATCH v2 net-next 09/10] pds_core: Rework teardown/setup flow to be more common

Currently the teardown/setup flow for driver probe/remove is quite
a bit different from the reset flows in pdsc_fw_down()/pdsc_fw_up().
One key piece that's missing is the pairing of pci_alloc_irq_vectors()
and pci_free_irq_vectors() calls. The PCIe reset case calls
pci_free_irq_vectors() on reset_prepare, but does not call the
corresponding pci_alloc_irq_vectors() on reset_done. This causes
being accidentally put into legacy interrupt mode. Also, the
pci_alloc_irq_vectors()/pci_free_irq_vectors() functions are being
called directly in probe/remove respectively.

Fix this inconsistency by making the following changes:
1. Always call pdsc_dev_init() in pdsc_setup(), which calls
pci_alloc_irq_vectors(), and get rid of the now unused
pdsc_dev_reinit().
2. Always free/clear the pdsc->intr_info in pdsc_teardown()
since this structure will get re-allocated in pdsc_setup().
3. Move the calls of pci_free_irq_vectors() to pdsc_teardown()
since pci_alloc_irq_vectors() will always be called in
pdsc_setup()->pdsc_dev_init() for both the probe/remove and
reset flows.
4. Make sure to only create the debugfs "identity" entry when it
doesn't already exist, which it will in the reset case because
it's already been created in the initial call to pdsc_dev_init().

Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Shannon Nelson <[email protected]>
---
drivers/net/ethernet/amd/pds_core/core.c | 13 +++++--------
drivers/net/ethernet/amd/pds_core/core.h | 1 -
drivers/net/ethernet/amd/pds_core/debugfs.c | 4 ++++
drivers/net/ethernet/amd/pds_core/dev.c | 7 -------
drivers/net/ethernet/amd/pds_core/main.c | 2 --
5 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index fe7e1d10224c..41507ade3570 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -404,10 +404,7 @@ int pdsc_setup(struct pdsc *pdsc, bool init)
int numdescs;
int err;

- if (init)
- err = pdsc_dev_init(pdsc);
- else
- err = pdsc_dev_reinit(pdsc);
+ err = pdsc_dev_init(pdsc);
if (err)
return err;

@@ -476,10 +473,9 @@ void pdsc_teardown(struct pdsc *pdsc, bool removing)
for (i = 0; i < pdsc->nintrs; i++)
pdsc_intr_free(pdsc, i);

- if (removing) {
- kfree(pdsc->intr_info);
- pdsc->intr_info = NULL;
- }
+ kfree(pdsc->intr_info);
+ pdsc->intr_info = NULL;
+ pdsc->nintrs = 0;
}

if (pdsc->kern_dbpage) {
@@ -487,6 +483,7 @@ void pdsc_teardown(struct pdsc *pdsc, bool removing)
pdsc->kern_dbpage = NULL;
}

+ pci_free_irq_vectors(pdsc->pdev);
set_bit(PDSC_S_FW_DEAD, &pdsc->state);
}

diff --git a/drivers/net/ethernet/amd/pds_core/core.h b/drivers/net/ethernet/amd/pds_core/core.h
index cbd5716f46e6..110c4b826b22 100644
--- a/drivers/net/ethernet/amd/pds_core/core.h
+++ b/drivers/net/ethernet/amd/pds_core/core.h
@@ -281,7 +281,6 @@ int pdsc_devcmd_locked(struct pdsc *pdsc, union pds_core_dev_cmd *cmd,
union pds_core_dev_comp *comp, int max_seconds);
int pdsc_devcmd_init(struct pdsc *pdsc);
int pdsc_devcmd_reset(struct pdsc *pdsc);
-int pdsc_dev_reinit(struct pdsc *pdsc);
int pdsc_dev_init(struct pdsc *pdsc);

void pdsc_reset_prepare(struct pci_dev *pdev);
diff --git a/drivers/net/ethernet/amd/pds_core/debugfs.c b/drivers/net/ethernet/amd/pds_core/debugfs.c
index d56fdbb4cdb9..6bdd02b7aa6d 100644
--- a/drivers/net/ethernet/amd/pds_core/debugfs.c
+++ b/drivers/net/ethernet/amd/pds_core/debugfs.c
@@ -64,6 +64,10 @@ DEFINE_SHOW_ATTRIBUTE(identity);

void pdsc_debugfs_add_ident(struct pdsc *pdsc)
{
+ /* This file will already exist in the reset flow */
+ if (debugfs_lookup("identity", pdsc->dentry))
+ return;
+
debugfs_create_file("identity", 0400, pdsc->dentry,
pdsc, &identity_fops);
}
diff --git a/drivers/net/ethernet/amd/pds_core/dev.c b/drivers/net/ethernet/amd/pds_core/dev.c
index b237cea65a5e..7dc102a31185 100644
--- a/drivers/net/ethernet/amd/pds_core/dev.c
+++ b/drivers/net/ethernet/amd/pds_core/dev.c
@@ -316,13 +316,6 @@ static int pdsc_identify(struct pdsc *pdsc)
return 0;
}

-int pdsc_dev_reinit(struct pdsc *pdsc)
-{
- pdsc_init_devinfo(pdsc);
-
- return pdsc_identify(pdsc);
-}
-
int pdsc_dev_init(struct pdsc *pdsc)
{
unsigned int nintrs;
diff --git a/drivers/net/ethernet/amd/pds_core/main.c b/drivers/net/ethernet/amd/pds_core/main.c
index 05fdeb235e5f..cdbf053b5376 100644
--- a/drivers/net/ethernet/amd/pds_core/main.c
+++ b/drivers/net/ethernet/amd/pds_core/main.c
@@ -438,7 +438,6 @@ static void pdsc_remove(struct pci_dev *pdev)
mutex_destroy(&pdsc->config_lock);
mutex_destroy(&pdsc->devcmd_lock);

- pci_free_irq_vectors(pdev);
pdsc_unmap_bars(pdsc);
pci_release_regions(pdev);
}
@@ -470,7 +469,6 @@ void pdsc_reset_prepare(struct pci_dev *pdev)
pdsc_stop_health_thread(pdsc);
pdsc_fw_down(pdsc);

- pci_free_irq_vectors(pdev);
pdsc_unmap_bars(pdsc);
pci_release_regions(pdev);
pci_disable_device(pdev);
--
2.17.1
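
For context, the symmetry this patch establishes is simply that
whatever path tears the device down frees the vectors, and whatever
path brings it back up re-allocates them. A rough sketch with
hypothetical my_setup()/my_teardown() standing in for
pdsc_setup()/pdsc_teardown():

#include <linux/pci.h>

/* Setup (probe and fw_up/reset_done alike) always allocates MSI-X */
static int my_setup(struct pci_dev *pdev, unsigned int nintrs)
{
	int nvecs;

	nvecs = pci_alloc_irq_vectors(pdev, nintrs, nintrs, PCI_IRQ_MSIX);
	if (nvecs < 0)
		return nvecs;

	/* queue and interrupt setup would follow here */
	return 0;
}

/* Teardown (remove and fw_down/reset_prepare alike) always frees the
 * vectors, so neither half of the pairing can be skipped on one path
 * but not the other.
 */
static void my_teardown(struct pci_dev *pdev)
{
	pci_free_irq_vectors(pdev);
}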


2024-01-27 04:44:36

by Jakub Kicinski

Subject: Re: [PATCH v2 net-next 0/10] pds_core: Various improvements and AQ race condition cleanup

On Fri, 26 Jan 2024 09:42:45 -0800 Brett Creeley wrote:
> This series includes the following changes:
>
> There can be many users of the pds_core's adminq. This includes
> pds_core's uses and any clients that depend on it. When the pds_core
> device goes through a reset for any reason the adminq is freed
> and reconfigured. There are some gaps in the current implementation
> that will cause crashes during reset if any of the previously mentioned
> users of the adminq attempt to use it after it's been freed.
>
> Issues around how resets are handled, specifically regarding the driver's
> error handlers.

Patches 1, 2 and 4 look like fixes. Is there any reason these are
targeting net-next? If someone deploys this device at scale rare
things will happen a lot..

2024-01-29 17:31:53

by Brett Creeley

Subject: Re: [PATCH v2 net-next 0/10] pds_core: Various improvements and AQ race condition cleanup

On 1/26/2024 8:44 PM, Jakub Kicinski wrote:
>
> On Fri, 26 Jan 2024 09:42:45 -0800 Brett Creeley wrote:
>> This series includes the following changes:
>>
>> There can be many users of the pds_core's adminq. This includes
>> pds_core's uses and any clients that depend on it. When the pds_core
>> device goes through a reset for any reason the adminq is freed
>> and reconfigured. There are some gaps in the current implementation
>> that will cause crashes during reset if any of the previously mentioned
>> users of the adminq attempt to use it after it's been freed.
>>
>> Issues around how resets are handled, specifically regarding the driver's
>> error handlers.
>
> Patches 1, 2 and 4 look like fixes. Is there any reason these are
> targeting net-next? If someone deploys this device at scale rare
> things will happen a lot..

No reason, just an oversight on my part. I actually think patches 1, 2,
3, 4, 5, and 9 could all go to net. Unfortunately some of these patches
are intertwined (i.e. patch 10 depends on patch 9).

If I push the previously mentioned patches to net and they get accepted,
how soon are fixes typically added to the net-next tree so I can
rebase/re-push the remaining patches?

Thank for the review,

Brett

2024-01-29 20:06:34

by Jakub Kicinski

Subject: Re: [PATCH v2 net-next 0/10] pds_core: Various improvements and AQ race condition cleanup

On Mon, 29 Jan 2024 09:27:21 -0800 Brett Creeley wrote:
> > On Fri, 26 Jan 2024 09:42:45 -0800 Brett Creeley wrote:
> >> This series includes the following changes:
> >>
> >> There can be many users of the pds_core's adminq. This includes
> >> pds_core's uses and any clients that depend on it. When the pds_core
> >> device goes through a reset for any reason the adminq is freed
> >> and reconfigured. There are some gaps in the current implementation
> >> that will cause crashes during reset if any of the previously mentioned
> >> users of the adminq attempt to use it after it's been freed.
> >>
> >> Issues around how resets are handled, specifically regarding the driver's
> >> error handlers.
> >
> > Patches 1, 2 and 4 look like fixes. Is there any reason these are
> > targeting net-next? If someone deploys this device at scale rare
> > things will happen a lot..
>
> No reason, just an oversight on my part. I actually think patches 1, 2,
> 3, 4, 5, and 9 could all go to net. Unfortunately some of these patches
> are intertwined (i.e. patch 10 depends on patch 9).
>
> If I push the previously mentioned patches to net and they get accepted,
> how soon are fixes typically added to the net-next tree so I can
> rebase/re-push the remaining patches?

net gets merged into net-next every Thursday; the exact timing depends on
how quickly Linus pulls from us.

2024-01-29 21:13:00

by Brett Creeley

Subject: Re: [PATCH v2 net-next 0/10] pds_core: Various improvements and AQ race condition cleanup

On 1/29/2024 12:05 PM, Jakub Kicinski wrote:
> net gets merged into net-next every Thursday; the exact timing depends on
> how quickly Linus pulls from us.

Okay, then I will work on splitting this series up between net and net-next.

Thanks again,

Brett