2002-09-04 16:49:43

by Libershteyn, Vladimir

[permalink] [raw]
Subject: Problem on a kernel driver(SuSE, SMP)

Hi, All,

I'm looking for a help on a kernel module programming issue.

We have a hardware device I have a driver written for.
It works without any problem on any of Red Hat, SMP machine or not. The same driver does work with
SuSE 7.x non-SMP, but has a problem with SMP machines.

Here is a problem:

For synchronization I use semaphores. Semaphores initializes with init_MUTEX_LOCKED function
which basically set the atomic value of a count to 0.
In a routine that looking for a data I use a function down_interruptible(&sem) which should
check the count value to be 0 and if it so the thread goes to slip until the ISR will set the
semaphore count back to initial value with a command up(&sem);

What happens with my driver, when it comes to the command down_interruptible(...) the whole
system just hangs. I could not find any restrictions on using this mechanism, but just for the testing
I switched from using semaphores to wait queues. The result is exactly the same.

What am I doing wrong? The kernel I'm running is 2.4.7-64G... and the OS is SuSE7.x Enterprise
Again, It works fine on any Red Hat OS

Regards,
Vladimir Libershteyn
Hewlett-Packard
tel. 408-285-5270
fax. 408-285-2044



2002-09-04 17:23:16

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Problem on a kernel driver(SuSE, SMP)

On Wed, 4 Sep 2002, Libershteyn, Vladimir wrote:

> Hi, All,
>
> I'm looking for a help on a kernel module programming issue.
>
> We have a hardware device I have a driver written for.
> It works without any problem on any of Red Hat, SMP machine or not. The same driver does work with
> SuSE 7.x non-SMP, but has a problem with SMP machines.
>
> Here is a problem:
>
> For synchronization I use semaphores. Semaphores initializes with init_MUTEX_LOCKED function
> which basically set the atomic value of a count to 0.
> In a routine that looking for a data I use a function down_interruptible(&sem) which should
> check the count value to be 0 and if it so the thread goes to slip until the ISR will set the
> semaphore count back to initial value with a command up(&sem);
>
> What happens with my driver, when it comes to the command down_interruptible(...) the whole
> system just hangs. I could not find any restrictions on using this mechanism, but just for the testing
> I switched from using semaphores to wait queues. The result is exactly the same.
>
> What am I doing wrong? The kernel I'm running is 2.4.7-64G... and the OS is SuSE7.x Enterprise
> Again, It works fine on any Red Hat OS
>

You are not too specific, which makes it hard to understand what may
be going wrong so I'll assume that you probably did a bad thing.

(1) You cannot sleep in an interrupt, which means you can't use
down_*() and friends inside an ISR.

(2) Wait queues should work fine from the 'user-side' of a driver,
but again, you cannot ever sleep in an interrupt service routine.
Look in ../linux/drivers/* for examples of code that works.

(3) You can't use any wait-queue or sleep on a semaphore while
holding a spin-lock or while the interrupts are disabled. You can
manage your own lock against re-entry in your procedure, but you
can't allow two tasks to try the same semaphore at (nearly) the
same time or you can dead-lock.


(4) The fact that 'down' hangs means that there is nothing
that the CPU can do. This is direct evidence that you have the
interrupts disabled when down executes.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-04 17:35:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Problem on a kernel driver(SuSE, SMP)

On Wed, Sep 04, 2002 at 09:54:15AM -0700, Libershteyn, Vladimir wrote:
> Hi, All,
>
> I'm looking for a help on a kernel module programming issue.
>
> We have a hardware device I have a driver written for.
> It works without any problem on any of Red Hat, SMP machine or not. The same driver does work with
> SuSE 7.x non-SMP, but has a problem with SMP machines.

Where can I find that driver source to take a look?

2002-09-04 17:51:37

by Libershteyn, Vladimir

[permalink] [raw]
Subject: RE: Problem on a kernel driver(SuSE, SMP)

Hi, Cristoph
Attached are two related functions,
I don't know if I can attach the files to the message,
but I can place them in a message body.
Please, let me know

----------------------------------------------------
function when thread go to sleep, if data not ready
---------------------------------------------------
/
// Get the response back
//
static int axl_get_response(axl_interconnect *buf, int device)
{
int enumerator, len;
unsigned long flags;
axl_info_t *a;

#ifdef AXL300_DEBUG
printk("<1> get_response\n");
#endif

a = axl[device];

//
// sanity check
//
if (!buf || !buf->CommBuffer ||
(buf->Enumerator >= MAX_LOGICAL_DEV) || (buf->Enumerator < 0) ||
(buf->DataContext1 != a->ld_unique[buf->Enumerator]))
{
return -EINVAL;
}

//
// extract input data
//
enumerator = buf->Enumerator % MAX_LOGICAL_DEV;

//
// sleep until data is ready
//
down_interruptible(&a->sem[enumerator]);


//
// data is here
//
spin_lock_irqsave(&a->lock, flags);

len = axl_copy_buffer(buf->Enumerator, buf->CommBuffer, device);

spin_unlock_irqrestore(&a->lock, flags);

//
// return length in bytes
//
buf->Length = len;

//
// raise data ready flag
//
a->data_ready[enumerator] = 1;


#ifdef AXL300_DEBUG
printk("<1> get_response_exit\n");
#endif

return 0;
}

----------------------------------------------------
function when thread wakes up
---------------------------------------------------
//
// Interrupt Service Routine.
//
static void axl_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
axl_info_t *a = dev_id;
int istat;
unsigned long flags;
volatile unsigned long *board_address;
__u32 length, enumerator;

#ifdef AXL300_DEBUG
printk("<1> axl_interrupt enter.\n");
printk("<1> device - %x.\n", a->ctlr);
#endif

//
// check if it's Colusa interrupt
//
istat = atql5032_check_interrupt(a->ctlr);
if (istat)
{

#ifdef AXL300_DEBUG
printk("<1> INTS-31 has not been asserted.\n");
#endif
goto axl_isr_exit;
}

#ifdef AXL300_DEBUG
printk("<1> interrupt received.\n");
printk("<1> device - %x.\n", a->ctlr);
#endif

//
// Clear an int
//
atql5032_clear_interrupt(a->ctlr);

//
// create a board FIFO length count address
//
board_address = ((unsigned long *)((unsigned char *)a->vaddr + OutputQueueFilled));
length = *board_address;

if (length <= 4)
{
number_reqs++;
#ifdef NEW_CODE_DEBUG
printk("<1> bad FIFO length: %x.\n", length);
#endif
goto axl_isr_exit;
}

//
// create a board FIFO address
//
board_address = ((unsigned long *)((unsigned char *)a->vaddr + OutputQueueData));

//
// read the length of the message from the card
// do the sanity check, length should be > 2 ans < 256
//
length = *board_address;

if ((length <= 2) || ((length & 0x0fffffff) > MAXIMUM_COMMAND_LENGTH_WORDS))
{

#ifdef NEW_CODE_DEBUG
printk("<1> bad length: %x.\n", (length & 0x0fffffff));
#endif

goto axl_isr_exit;
}

#ifdef AXL300_DEBUG
printk("<1> got length: %x.\n", length);
#endif

//
// read the enumerator from the message from the card
// do the sanity check, enumerator should be < MAX_LOGICAL_DEV
//
enumerator = *board_address;
if (enumerator > MAX_LOGICAL_DEV)
{

#ifdef NEW_CODE_DEBUG
printk("<1> bad enumerator in ISR: %x.\n", enumerator);
#endif

goto axl_isr_exit;
}

#ifdef AXL300_DEBUG
printk("<1> got enumerator in ISR: %x.\n", enumerator);
#endif

//
// save data to be passed to application
//
a->length[enumerator] = length;

//
// Wake up the LD that interrupted us and setup "data ready" flag
//
up(&a->sem[enumerator]);

axl_isr_exit:
//
// Clear an int
//
atql5032_clear_interrupt(a->ctlr);
}

2002-09-04 17:49:31

by Libershteyn, Vladimir

[permalink] [raw]
Subject: RE: Problem on a kernel driver(SuSE, SMP)

>
> You are not too specific, which makes it hard to understand what may
> be going wrong so I'll assume that you probably did a bad thing.

I think I gave all the specific information

>
> (1) You cannot sleep in an interrupt, which means you can't use
> down_*() and friends inside an ISR.

I DO NOT sleep in ISR, the up*() routine is inside the ISR, dut not down*()
This is a standard use of the mechanism

>
> (2) Wait queues should work fine from the 'user-side' of a driver,
> but again, you cannot ever sleep in an interrupt service routine.
> Look in ../linux/drivers/* for examples of code that works.
>

Again there is NO sleep in ISR, and I know that queues should work fine,
but they don't, that why I have a problem, but problem only on
SMP machine.

> (3) You can't use any wait-queue or sleep on a semaphore while
> holding a spin-lock or while the interrupts are disabled. You can
> manage your own lock against re-entry in your procedure, but you
> can't allow two tasks to try the same semaphore at (nearly) the
> same time or you can dead-lock.

I DO NOT hold any spinlocks, while use down_interruptible

>
>
> (4) The fact that 'down' hangs means that there is nothing
> that the CPU can do. This is direct evidence that you have the
> interrupts disabled when down executes.

To be more specific, here is a code:
----------------------------------------------------
function when thread go to sleep, if data not ready
---------------------------------------------------
/
// Get the response back
//
static int axl_get_response(axl_interconnect *buf, int device)
{
int enumerator, len;
unsigned long flags;
axl_info_t *a;

#ifdef AXL300_DEBUG
printk("<1> get_response\n");
#endif

a = axl[device];

//
// sanity check
//
if (!buf || !buf->CommBuffer ||
(buf->Enumerator >= MAX_LOGICAL_DEV) || (buf->Enumerator < 0) ||
(buf->DataContext1 != a->ld_unique[buf->Enumerator]))
{
return -EINVAL;
}

//
// extract input data
//
enumerator = buf->Enumerator % MAX_LOGICAL_DEV;

//
// sleep until data is ready
//
down_interruptible(&a->sem[enumerator]);


//
// data is here
//
spin_lock_irqsave(&a->lock, flags);

len = axl_copy_buffer(buf->Enumerator, buf->CommBuffer, device);

spin_unlock_irqrestore(&a->lock, flags);

//
// return length in bytes
//
buf->Length = len;

//
// raise data ready flag
//
a->data_ready[enumerator] = 1;


#ifdef AXL300_DEBUG
printk("<1> get_response_exit\n");
#endif

return 0;
}

----------------------------------------------------
function when thread wakes up
---------------------------------------------------
//
// Interrupt Service Routine.
//
static void axl_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
axl_info_t *a = dev_id;
int istat;
unsigned long flags;
volatile unsigned long *board_address;
__u32 length, enumerator;

#ifdef AXL300_DEBUG
printk("<1> axl_interrupt enter.\n");
printk("<1> device - %x.\n", a->ctlr);
#endif

//
// check if it's Colusa interrupt
//
istat = atql5032_check_interrupt(a->ctlr);
if (istat)
{

#ifdef AXL300_DEBUG
printk("<1> INTS-31 has not been asserted.\n");
#endif
goto axl_isr_exit;
}

#ifdef AXL300_DEBUG
printk("<1> interrupt received.\n");
printk("<1> device - %x.\n", a->ctlr);
#endif

//
// Clear an int
//
atql5032_clear_interrupt(a->ctlr);

//
// create a board FIFO length count address
//
board_address = ((unsigned long *)((unsigned char *)a->vaddr + OutputQueueFilled));
length = *board_address;

if (length <= 4)
{
number_reqs++;
#ifdef NEW_CODE_DEBUG
printk("<1> bad FIFO length: %x.\n", length);
#endif
goto axl_isr_exit;
}

//
// create a board FIFO address
//
board_address = ((unsigned long *)((unsigned char *)a->vaddr + OutputQueueData));

//
// read the length of the message from the card
// do the sanity check, length should be > 2 ans < 256
//
length = *board_address;

if ((length <= 2) || ((length & 0x0fffffff) > MAXIMUM_COMMAND_LENGTH_WORDS))
{

#ifdef NEW_CODE_DEBUG
printk("<1> bad length: %x.\n", (length & 0x0fffffff));
#endif

goto axl_isr_exit;
}

#ifdef AXL300_DEBUG
printk("<1> got length: %x.\n", length);
#endif

//
// read the enumerator from the message from the card
// do the sanity check, enumerator should be < MAX_LOGICAL_DEV
//
enumerator = *board_address;
if (enumerator > MAX_LOGICAL_DEV)
{

#ifdef NEW_CODE_DEBUG
printk("<1> bad enumerator in ISR: %x.\n", enumerator);
#endif

goto axl_isr_exit;
}

#ifdef AXL300_DEBUG
printk("<1> got enumerator in ISR: %x.\n", enumerator);
#endif

//
// save data to be passed to application
//
a->length[enumerator] = length;

//
// Wake up the LD that interrupted us and setup "data ready" flag
//
up(&a->sem[enumerator]);

axl_isr_exit:
//
// Clear an int
//
atql5032_clear_interrupt(a->ctlr);
}

2002-09-04 18:20:50

by Richard B. Johnson

[permalink] [raw]
Subject: RE: Problem on a kernel driver(SuSE, SMP)

On Wed, 4 Sep 2002, Libershteyn, Vladimir wrote:

> >
> > You are not too specific, which makes it hard to understand what may
> > be going wrong so I'll assume that you probably did a bad thing.
>
> I think I gave all the specific information
>
> >
> > (1) You cannot sleep in an interrupt, which means you can't use
> > down_*() and friends inside an ISR.
>
> I DO NOT sleep in ISR, the up*() routine is inside the ISR, dut not down*()
> This is a standard use of the mechanism
>
> >
> > (2) Wait queues should work fine from the 'user-side' of a driver,
> > but again, you cannot ever sleep in an interrupt service routine.
> > Look in ../linux/drivers/* for examples of code that works.
> >
>
> Again there is NO sleep in ISR, and I know that queues should work fine,
> but they don't, that why I have a problem, but problem only on
> SMP machine.
>
> > (3) You can't use any wait-queue or sleep on a semaphore while
> > holding a spin-lock or while the interrupts are disabled. You can
> > manage your own lock against re-entry in your procedure, but you
> > can't allow two tasks to try the same semaphore at (nearly) the
> > same time or you can dead-lock.
>
> I DO NOT hold any spinlocks, while use down_interruptible
>
> >
> >
> > (4) The fact that 'down' hangs means that there is nothing
> > that the CPU can do. This is direct evidence that you have the
> > interrupts disabled when down executes.
>
> To be more specific, here is a code:
> ----------------------------------------------------
> function when thread go to sleep, if data not ready
> ---------------------------------------------------

Snipped code.

How do you know that
up(&a->sem[enumerator]);
in the ISR and...
down_interruptible(&a->sem[enumerator]);
In axl_get_response...
... have the same value of 'enumerator', therefore the same
semaphore?

One comes from a modulus and another from indirection off from 'board
address'?

I think there is a bug there.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-04 18:30:57

by Libershteyn, Vladimir

[permalink] [raw]
Subject: RE: Problem on a kernel driver(SuSE, SMP)



> -----Original Message-----
> From: Richard B. Johnson [mailto:[email protected]]
> Sent: Wednesday, September 04, 2002 11:25 AM
> To: Libershteyn, Vladimir
> Cc: [email protected]
> Subject: RE: Problem on a kernel driver(SuSE, SMP)
>
>
> On Wed, 4 Sep 2002, Libershteyn, Vladimir wrote:
>
> > >
> > > You are not too specific, which makes it hard to
> understand what may
> > > be going wrong so I'll assume that you probably did a bad thing.
> >
> > I think I gave all the specific information
> >
> > >
> > > (1) You cannot sleep in an interrupt, which means you can't use
> > > down_*() and friends inside an ISR.
> >
> > I DO NOT sleep in ISR, the up*() routine is inside the ISR,
> dut not down*()
> > This is a standard use of the mechanism
> >
> > >
> > > (2) Wait queues should work fine from the 'user-side' of
> a driver,
> > > but again, you cannot ever sleep in an interrupt service routine.
> > > Look in ../linux/drivers/* for examples of code that works.
> > >
> >
> > Again there is NO sleep in ISR, and I know that queues
> should work fine,
> > but they don't, that why I have a problem, but problem only on
> > SMP machine.
> >
> > > (3) You can't use any wait-queue or sleep on a semaphore while
> > > holding a spin-lock or while the interrupts are disabled. You can
> > > manage your own lock against re-entry in your procedure, but you
> > > can't allow two tasks to try the same semaphore at (nearly) the
> > > same time or you can dead-lock.
> >
> > I DO NOT hold any spinlocks, while use down_interruptible
> >
> > >
> > >
> > > (4) The fact that 'down' hangs means that there is nothing
> > > that the CPU can do. This is direct evidence that you have the
> > > interrupts disabled when down executes.
> >
> > To be more specific, here is a code:
> > ----------------------------------------------------
> > function when thread go to sleep, if data not ready
> > ---------------------------------------------------
>
> Snipped code.
>
> How do you know that
> up(&a->sem[enumerator]);
> in the ISR and...
> down_interruptible(&a->sem[enumerator]);
> In axl_get_response...
> ... have the same value of 'enumerator', therefore the same
> semaphore?
>
> One comes from a modulus and another from indirection off from 'board
> address'?
>
> I think there is a bug there.
>

First of all it does not even come to the ISR routine.
The hang appears at down_interruptible. the interrupt don't even occur yet.
I know that.

To prove it I did limited possible resources to one card (card 0) and to only
one logical device (enumerator 0). The probelm is the same.

Regards,
Vlad

P.S. This product has been shiped for 3 years with RH and SuSE, There was no problems
with any but SuSE SMP machines.

2002-09-04 21:32:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Problem on a kernel driver(SuSE, SMP)

On Wed, Sep 04, 2002 at 10:56:00AM -0700, Libershteyn, Vladimir wrote:
> Hi, Cristoph
> Attached are two related functions,
> I don't know if I can attach the files to the message,
> but I can place them in a message body.
> Please, let me know

Except of your failure to handle the return value of down_interruptible
I can't find any obvious bugs. Does it work on plain 2.4.7 or redhats
2.4.7 rpm? If so I'd fill a bug report against suses patch collection.

2002-09-04 22:16:49

by Libershteyn, Vladimir

[permalink] [raw]
Subject: RE: Problem on a kernel driver(SuSE, SMP)

Hi, Cristoph.

It does work fine on a plain 2.4.7 and on any Red Hat kernel versions.

Regards,
Vlad

> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Wednesday, September 04, 2002 2:37 PM
> To: Libershteyn, Vladimir
> Cc: [email protected]
> Subject: Re: Problem on a kernel driver(SuSE, SMP)
>
>
> On Wed, Sep 04, 2002 at 10:56:00AM -0700, Libershteyn, Vladimir wrote:
> > Hi, Cristoph
> > Attached are two related functions,
> > I don't know if I can attach the files to the message,
> > but I can place them in a message body.
> > Please, let me know
>
> Except of your failure to handle the return value of
> down_interruptible
> I can't find any obvious bugs. Does it work on plain 2.4.7 or redhats
> 2.4.7 rpm? If so I'd fill a bug report against suses patch
> collection.
>
>

2002-09-04 23:39:50

by Alan

[permalink] [raw]
Subject: RE: Problem on a kernel driver(SuSE, SMP)

On Wed, 2002-09-04 at 18:56, Libershteyn, Vladimir wrote:

> //
> // sleep until data is ready
> //
> down_interruptible(&a->sem[enumerator]);

Suppose its interrupted. You dont check that and handle it..


> board_address = ((unsigned long *)((unsigned char *)a->vaddr + OutputQueueFilled));
> length = *board_address;

You can't poke around in memory directly either. Yes it works on x86 but
unless you use ioremap combined with readl and friends it wont work they
way you expect on ia64, x86-64, ...


2002-09-04 23:47:58

by Libershteyn, Vladimir

[permalink] [raw]
Subject: RE: Problem on a kernel driver(SuSE, SMP)



> -----Original Message-----
> From: Alan Cox [mailto:[email protected]]
> Sent: Wednesday, September 04, 2002 4:45 PM
> To: Libershteyn, Vladimir
> Cc: Christoph Hellwig; [email protected]
> Subject: RE: Problem on a kernel driver(SuSE, SMP)
>
>
> On Wed, 2002-09-04 at 18:56, Libershteyn, Vladimir wrote:
>
> > //
> > // sleep until data is ready
> > //
> > down_interruptible(&a->sem[enumerator]);
>
> Suppose its interrupted. You dont check that and handle it..
>

Agree, I'll fix it(thanks for noticing that), but it's not the point.
The point is THIS INSTRUCTION HANGS. NO RETURN FROM IT.

>
> > board_address = ((unsigned long *)((unsigned char
> *)a->vaddr + OutputQueueFilled));
> > length = *board_address;
>
> You can't poke around in memory directly either. Yes it works
> on x86 but
> unless you use ioremap combined with readl and friends it
> wont work they
> way you expect on ia64, x86-64, ...
>

I did exactly what you've said. You have to understand,
attached is a PART of the code. All the right initializations were made
prior to that and AGAIN, EVERYTHING WORKS FINE ON A RED HAT AND ON SUSE
NON_SMP

Regards,
Vlad

2002-09-05 00:00:15

by Alan

[permalink] [raw]
Subject: RE: Problem on a kernel driver(SuSE, SMP)

> Agree, I'll fix it(thanks for noticing that), but it's not the point.
> The point is THIS INSTRUCTION HANGS. NO RETURN FROM IT.

I only deal with people who use CAPITAL LETTER RANTS when paid 8)


Next guess is you didnt init the semaphore structure and it happened to
come out ok, but I find that unlikely to have worked in other cases.


2002-09-05 00:22:14

by Libershteyn, Vladimir

[permalink] [raw]
Subject: RE: Problem on a kernel driver(SuSE, SMP)

I do apologize for using capital letters. I did not have
any intention to offend you.
It was a long week for me

Sorry again
Regards,
Vlad

> -----Original Message-----
> From: Alan Cox [mailto:[email protected]]
> Sent: Wednesday, September 04, 2002 5:06 PM
> To: Libershteyn, Vladimir
> Cc: [email protected]
> Subject: RE: Problem on a kernel driver(SuSE, SMP)
>
>
> > Agree, I'll fix it(thanks for noticing that), but it's not
> the point.
> > The point is THIS INSTRUCTION HANGS. NO RETURN FROM IT.
>
> I only deal with people who use CAPITAL LETTER RANTS when paid 8)
>
>
> Next guess is you didnt init the semaphore structure and it
> happened to
> come out ok, but I find that unlikely to have worked in other cases.
>
>
>