When processing an interrupt, the driver reads the device's status
registers to decide what work to do. If the device is in an unexpected
state, the driver may wrongly enter the DMADone branch. At that point
the DMA buffer has not been allocated yet, which results in a NULL
pointer dereference.

Fix this by checking that vp->tx_skb_dma is set before handling the
DMA completion.

The following log shows the crash:

BUG: kernel NULL pointer dereference, address: 0000000000000070
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.12.4-g70e7f0549188-dirty #88
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:_vortex_interrupt+0x323/0x670
Code: 84 d4 00 00 00 e8 bd e9 60 fe 48 8b 45 d8 48 83 c0 0c 48 89 c6 bf 00 10 00 00 e8 98 d0 f0 fe 48 8b 45 d0 48 8b 80 d8 01 00 00 <8b> 40 70 83 c0 03 89 c0 83 e0 fc 48 89 c2 48 8b 45 d0 48 8b b0 e0
RSP: 0018:ffffc900001a4dd0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff888115da0580 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81bf710e RDI: 0000000000001000
RBP: ffffc900001a4e30 R08: ffff8881008edbc0 R09: 00000000fffffffe
R10: 0000000000000001 R11: 00000000a5c81234 R12: ffff8881049530a8
R13: 0000000000000000 R14: ffffffff87313288 R15: ffff888108c92000
FS: 0000000000000000(0000) GS:ffff88817b200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000070 CR3: 00000001198c2000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
? _raw_spin_lock_irqsave+0x81/0xa0
vortex_boomerang_interrupt+0x56/0xc10
? __this_cpu_preempt_check+0x1c/0x20
__handle_irq_event_percpu+0x58/0x3e0
handle_irq_event_percpu+0x3a/0x90
handle_irq_event+0x3e/0x60
handle_fasteoi_irq+0xc7/0x1d0
__common_interrupt+0x84/0x150
common_interrupt+0xb4/0xd0
</IRQ>
asm_common_interrupt+0x1e/0x40
RIP: 0010:native_safe_halt+0x17/0x20
Code: 07 0f 00 2d 3b 3e 4b 00 f4 5d c3 0f 1f 84 00 00 00 00 00 8b 05 42 a9 72 02 55 48 89 e5 85 c0 7e 07 0f 00 2d 1b 3e 4b 00 fb f4 <5d> c3 cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 e5 e8 92 4a ff
RSP: 0018:ffffc900000afe90 EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000005 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8666cafb RDI: ffffffff865058de
RBP: ffffc900000afe90 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff87313288
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881008ed1c0
default_idle+0xe/0x20
arch_cpu_idle+0xf/0x20
default_idle_call+0x73/0x250
do_idle+0x1f5/0x2d0
cpu_startup_entry+0x1d/0x20
start_secondary+0x11f/0x160
secondary_startup_64_no_verify+0xb0/0xbb
Modules linked in:
Dumping ftrace buffer:
(ftrace buffer empty)
CR2: 0000000000000070
---[ end trace 0735407a540147e1 ]---
RIP: 0010:_vortex_interrupt+0x323/0x670
Code: 84 d4 00 00 00 e8 bd e9 60 fe 48 8b 45 d8 48 83 c0 0c 48 89 c6 bf 00 10 00 00 e8 98 d0 f0 fe 48 8b 45 d0 48 8b 80 d8 01 00 00 <8b> 40 70 83 c0 03 89 c0 83 e0 fc 48 89 c2 48 8b 45 d0 48 8b b0 e0
RSP: 0018:ffffc900001a4dd0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff888115da0580 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81bf710e RDI: 0000000000001000
RBP: ffffc900001a4e30 R08: ffff8881008edbc0 R09: 00000000fffffffe
R10: 0000000000000001 R11: 00000000a5c81234 R12: ffff8881049530a8
R13: 0000000000000000 R14: ffffffff87313288 R15: ffff888108c92000
FS: 0000000000000000(0000) GS:ffff88817b200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000070 CR3: 00000001198c2000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Kernel panic - not syncing: Fatal exception in interrupt
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 1 seconds..
Signed-off-by: Zheyu Ma <[email protected]>
---
drivers/net/ethernet/3com/3c59x.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c
index 741c67e546d4..e27901ded7a0 100644
--- a/drivers/net/ethernet/3com/3c59x.c
+++ b/drivers/net/ethernet/3com/3c59x.c
@@ -2300,7 +2300,7 @@ _vortex_interrupt(int irq, struct net_device *dev)
}
if (status & DMADone) {
- if (ioread16(ioaddr + Wn7_MasterStatus) & 0x1000) {
+ if ((ioread16(ioaddr + Wn7_MasterStatus) & 0x1000) && vp->tx_skb_dma) {
iowrite16(0x1000, ioaddr + Wn7_MasterStatus); /* Ack the event. */
dma_unmap_single(vp->gendev, vp->tx_skb_dma, (vp->tx_skb->len + 3) & ~3, DMA_TO_DEVICE);
pkts_compl++;
--
2.17.6
On 6/12/21 4:56 AM, Zheyu Ma wrote:
> When the driver is processing the interrupt, it will read the value of
> the register to determine the status of the device. If the device is in
> an incorrect state, the driver may mistakenly enter this branch. At this
> time, the dma buffer has not been allocated, which will result in a null
> pointer dereference.
>
> Fix this by checking whether the buffer is allocated.
>
[...]
> diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c
> index 741c67e546d4..e27901ded7a0 100644
> --- a/drivers/net/ethernet/3com/3c59x.c
> +++ b/drivers/net/ethernet/3com/3c59x.c
> @@ -2300,7 +2300,7 @@ _vortex_interrupt(int irq, struct net_device *dev)
> }
>
> if (status & DMADone) {
> - if (ioread16(ioaddr + Wn7_MasterStatus) & 0x1000) {
> + if ((ioread16(ioaddr + Wn7_MasterStatus) & 0x1000) && vp->tx_skb_dma) {
> iowrite16(0x1000, ioaddr + Wn7_MasterStatus); /* Ack the event. */
> dma_unmap_single(vp->gendev, vp->tx_skb_dma, (vp->tx_skb->len + 3) & ~3, DMA_TO_DEVICE);
> pkts_compl++;
This means you won't be ack'ing the event - is this unacknowledged event
going to cause an issue later?
If the error is because the buffer doesn't exist, then can you simply
put the buffer check on the dma_unmap_single() and allow the rest of the
handling to happen?
sln
On Tue, Jun 15, 2021 at 12:48 AM Shannon Nelson <[email protected]> wrote:
>
> On 6/12/21 4:56 AM, Zheyu Ma wrote:
> > When the driver is processing the interrupt, it will read the value of
> > the register to determine the status of the device. If the device is in
> > an incorrect state, the driver may mistakenly enter this branch. At this
> > time, the dma buffer has not been allocated, which will result in a null
> > pointer dereference.
> >
> > Fix this by checking whether the buffer is allocated.
> >
[...]
> > diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c
> > index 741c67e546d4..e27901ded7a0 100644
> > --- a/drivers/net/ethernet/3com/3c59x.c
> > +++ b/drivers/net/ethernet/3com/3c59x.c
> > @@ -2300,7 +2300,7 @@ _vortex_interrupt(int irq, struct net_device *dev)
> > }
> >
> > if (status & DMADone) {
> > - if (ioread16(ioaddr + Wn7_MasterStatus) & 0x1000) {
> > + if ((ioread16(ioaddr + Wn7_MasterStatus) & 0x1000) && vp->tx_skb_dma) {
> > iowrite16(0x1000, ioaddr + Wn7_MasterStatus); /* Ack the event. */
> > dma_unmap_single(vp->gendev, vp->tx_skb_dma, (vp->tx_skb->len + 3) & ~3, DMA_TO_DEVICE);
> > pkts_compl++;
>
> This means you won't be ack'ing the event - is this unacknowledged event
> going to cause an issue later?
>

First, I'm not a networking expert, but I don't think this will cause a
problem. When the driver enters this branch, it assumes that the
hardware has already completed a DMA operation and that only the
follow-up work remains. That is not the case here: 'vp->tx_skb_dma' is
still unset, so there is no follow-up work to do, and it is appropriate
not to do anything at this point.

> If the error is because the buffer doesn't exist, then can you simply
> put the buffer check on the dma_unmap_single() and allow the rest of the
> handling to happen?

The problem is not only that the DMA buffer is missing. 'vp->tx_skb' is
also NULL at this point, because both are set up together in
'vortex_start_xmit()', so a check around 'dma_unmap_single()' alone is
not enough.

Thanks,
Zheyu Ma
On Tue, 15 Jun 2021, Zheyu Ma wrote:
> > > When the driver is processing the interrupt, it will read the value of
> > > the register to determine the status of the device. If the device is in
> > > an incorrect state, the driver may mistakenly enter this branch. At this
> > > time, the dma buffer has not been allocated, which will result in a null
> > > pointer dereference.
[...]
> > > diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c
> > > index 741c67e546d4..e27901ded7a0 100644
> > > --- a/drivers/net/ethernet/3com/3c59x.c
> > > +++ b/drivers/net/ethernet/3com/3c59x.c
> > > @@ -2300,7 +2300,7 @@ _vortex_interrupt(int irq, struct net_device *dev)
> > > }
> > >
> > > if (status & DMADone) {
> > > - if (ioread16(ioaddr + Wn7_MasterStatus) & 0x1000) {
> > > + if ((ioread16(ioaddr + Wn7_MasterStatus) & 0x1000) && vp->tx_skb_dma) {
> > > iowrite16(0x1000, ioaddr + Wn7_MasterStatus); /* Ack the event. */
> > > dma_unmap_single(vp->gendev, vp->tx_skb_dma, (vp->tx_skb->len + 3) & ~3, DMA_TO_DEVICE);
> > > pkts_compl++;
> >
> > This means you won't be ack'ing the event - is this unacknowledged event
> > going to cause an issue later?
> >
>
> First, I'm not an expert in networking, but from my perspective, I
> don't think this will cause a problem. Because when the driver enters
> this branch, It means that it thinks that the hardware has already
> performed a DMA operation, and the driver only needs to do some
> follow-up work, but this is not the case. At this time,
> 'vp->tx_skb_dma' is still a null pointer, so there is no need for
> follow-up work at this time, it is meaningless, and it is appropriate
> not to perform any operations at this time.
What are the circumstances you observe this behaviour under? The state
of hardware is supposed to be consistent with the state of the driver. If
an inconsistency happens, then there are various possible causes such as:
1. The driver has a bug (in which case you need to track the bug down and
fix it).
2. The hardware does not behave as specified, e.g. due to an erratum (in
which case you need to track the problem down and work it around in the
driver).
3. The hardware may have been disturbed, e.g. due to EMC interference (in
which case you may implement a recovery attempt by reinitialising the
hardware once an odd state has been discovered).
4. The hardware is broken (throw it away).
For #4 the solution is obvious. For #3 you might want to implement a
hardware reset path rather than ignoring the inconsistent state and only
prevent the driver from crashing. If you have a way to reproduce the
issue, which I gather you do, then it's likely not #3 as that would be
intermittent, and then you'll have to investigate what is causing the
problem to see if it is #1 or #2 (or maybe #4), and act accordingly.
Someone more familiar with this hardware (is there a spec available?)
might be able to assist you once you have figured out what the exact
scenario leading to the failure you have observed is.
HTH,
Maciej