2009-10-11 06:13:47

by Artem Bityutskiy

[permalink] [raw]
Subject: [PATCH] panic.c: export panic_on_oops

From: Artem Bityutskiy <[email protected]>

We use the mtdoops module which stores oopses on the Flash partition,
in order to make it possible to analyze them later. And mtdoops needs
to know whether 'panic_on_oops' is on or off. Thus, we need this
variable to be exported.

Signed-off-by: Artem Bityutskiy <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Aaro Koskinen <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Simon Kagstrom <[email protected]>
Cc: linux-mtd <[email protected]>
Cc: LKML <[email protected]>
---
kernel/panic.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index bcdef26..6d29be3 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -24,6 +24,8 @@
#include <linux/dmi.h>

int panic_on_oops;
+EXPORT_SYMBOL_GPL(panic_on_oops);
+
static unsigned long tainted_mask;
static int pause_on_oops;
static int pause_on_oops_flag;
--
1.6.2.5


2009-10-12 11:16:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Artem Bityutskiy <[email protected]> wrote:

> From: Artem Bityutskiy <[email protected]>
>
> We use the mtdoops module which stores oopses on the Flash partition,
> in order to make it possible to analyze them later. And mtdoops needs
> to know whether 'panic_on_oops' is on or off. Thus, we need this
> variable to be exported.

hm, why does it need it? Could you send the patch please that makes use
of it (as an easy way to explain the need for the export) - current
upstream drivers/mtd/mtdoops.c doesnt need it so it must be some pending
change.

Ingo

2009-10-12 11:23:56

by Simon Kagstrom

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 12 Oct 2009 13:15:45 +0200
Ingo Molnar <[email protected]> wrote:

> * Artem Bityutskiy <[email protected]> wrote:
>
> > We use the mtdoops module which stores oopses on the Flash partition,
> > in order to make it possible to analyze them later. And mtdoops needs
> > to know whether 'panic_on_oops' is on or off. Thus, we need this
> > variable to be exported.
>
> hm, why does it need it? Could you send the patch please that makes use
> of it (as an easy way to explain the need for the export) - current
> upstream drivers/mtd/mtdoops.c doesnt need it so it must be some pending
> change.

You have the patch that requires it here:

http://patchwork.ozlabs.org/patch/34732/

it's a fix to allow mtdoops to actually store oopses which happen when
panic_on_oops is set (before these would be missed as the write would
never be done).

// Simon

2009-10-12 11:27:33

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 2009-10-12 at 13:15 +0200, ext Ingo Molnar wrote:
> > We use the mtdoops module which stores oopses on the Flash partition,
> > in order to make it possible to analyze them later. And mtdoops needs
> > to know whether 'panic_on_oops' is on or off. Thus, we need this
> > variable to be exported.
>
> hm, why does it need it? Could you send the patch please that makes use
> of it (as an easy way to explain the need for the export) - current
> upstream drivers/mtd/mtdoops.c doesnt need it so it must be some pending
> change.

Yes, current drivers/mtd/mtdoops.c does not need it, but we have this
patch:

http://lists.infradead.org/pipermail/linux-mtd/2009-October/027450.html

which makes sure mtdoops writes the oops log immediately, because we
have 'panic_on_oops' set so this is our last chance to save the oops.

Here is the patch inlined as well:

Author: Aaro Koskinen <[email protected]>
Date: Thu Oct 1 18:16:55 2009 +0300

mtd: mtdoops: do not schedule work if we are going to die

If panic_on_oops is set, the log should be written right away
because this is not going to be a second chance.

Signed-off-by: Aaro Koskinen <[email protected]>
Signed-off-by: Artem Bityutskiy <[email protected]>

diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index 1060337..ac67833 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -335,7 +335,7 @@ static void mtdoops_console_sync(void)
cxt->ready = 0;
spin_unlock_irqrestore(&cxt->writecount_lock, flags);

- if (mtd->panic_write && in_interrupt())
+ if (mtd->panic_write && (in_interrupt() || panic_on_oops))
/* Interrupt context, we're going to panic so try and log */
mtdoops_write(cxt, 1);
else

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-10-12 11:38:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Artem Bityutskiy <[email protected]> wrote:

> On Mon, 2009-10-12 at 13:15 +0200, ext Ingo Molnar wrote:
> > > We use the mtdoops module which stores oopses on the Flash partition,
> > > in order to make it possible to analyze them later. And mtdoops needs
> > > to know whether 'panic_on_oops' is on or off. Thus, we need this
> > > variable to be exported.
> >
> > hm, why does it need it? Could you send the patch please that makes use
> > of it (as an easy way to explain the need for the export) - current
> > upstream drivers/mtd/mtdoops.c doesnt need it so it must be some pending
> > change.
>
> Yes, current drivers/mtd/mtdoops.c does not need it, but we have this
> patch:
>
> http://lists.infradead.org/pipermail/linux-mtd/2009-October/027450.html
>
> which makes sure mtdoops writes the oops log immediately, because we
> have 'panic_on_oops' set so this is our last chance to save the oops.
>
> Here is the patch inlined as well:
>
> Author: Aaro Koskinen <[email protected]>
> Date: Thu Oct 1 18:16:55 2009 +0300
>
> mtd: mtdoops: do not schedule work if we are going to die
>
> If panic_on_oops is set, the log should be written right away
> because this is not going to be a second chance.
>
> Signed-off-by: Aaro Koskinen <[email protected]>
> Signed-off-by: Artem Bityutskiy <[email protected]>
>
> diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
> index 1060337..ac67833 100644
> --- a/drivers/mtd/mtdoops.c
> +++ b/drivers/mtd/mtdoops.c
> @@ -335,7 +335,7 @@ static void mtdoops_console_sync(void)
> cxt->ready = 0;
> spin_unlock_irqrestore(&cxt->writecount_lock, flags);
>
> - if (mtd->panic_write && in_interrupt())
> + if (mtd->panic_write && (in_interrupt() || panic_on_oops))
> /* Interrupt context, we're going to panic so try and log */
> mtdoops_write(cxt, 1);

Hm, the code seems to be somewhat confused about this. It tries to guess
when it's panic-ing, right? in_interrupt() is the wrong test for that.

mtdoops_console_sync() gets called by regular printk() as well via:

static void
mtdoops_console_write(struct console *co, const char *s, unsigned int count)
{
struct mtdoops_context *cxt = co->data;
struct mtd_info *mtd = cxt->mtd;
unsigned long flags;

if (!oops_in_progress) {
mtdoops_console_sync();
return;

I think this is all layered incorrectly - because mtdoops (which is a
_VERY_ cool debugging facility btw - i wish generic x86 had something
like that) tries to be a 'driver' and tries to achieve these things
without modifying the core kernel.

But it really should do that. We can certainly enhance the core kernel
logic to tell the console driver more clearly when printk should go out
immediately.

Instead of using oops_in_progress it might be better to go into 'sync
immeditately' mode if the kernel gets tainted. Add a callback for that
to struct console (in include/linux/console.h). The ->unblank() callback
is already such a "user attention needed immediately" callback. We could
add a ->kernel_bug() callback to that.

Ingo

2009-10-12 12:02:42

by Simon Kagstrom

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

(Risking that Artem also replies, I'll bite on this one! Let's hope we
agree at least :-))

On Mon, 12 Oct 2009 13:37:58 +0200
Ingo Molnar <[email protected]> wrote:

> > - if (mtd->panic_write && in_interrupt())
> > + if (mtd->panic_write && (in_interrupt() || panic_on_oops))
> > /* Interrupt context, we're going to panic so try and log */
> > mtdoops_write(cxt, 1);
>
> Hm, the code seems to be somewhat confused about this. It tries to guess
> when it's panic-ing, right? in_interrupt() is the wrong test for that.

Well, the main reason is to get the write done directly if we know
we're going to crash. The rest of the code around the patch looks like
this:

if (mtd->panic_write && (in_interrupt() || panic_on_oops))
/* Interrupt context, we're going to panic so try and log */
mtdoops_write(cxt, 1);
else
schedule_work(&cxt->work_write);

so if we're oopsing in interrupt context or are going to panic, we just
write directly. mtdoops_write will then use mtd->panic_write if it's
available to get the write done immediately without sleeping.

> mtdoops_console_sync() gets called by regular printk() as well via:
>
> static void
> mtdoops_console_write(struct console *co, const char *s, unsigned int count)
> {
> struct mtdoops_context *cxt = co->data;
> struct mtd_info *mtd = cxt->mtd;
> unsigned long flags;
>
> if (!oops_in_progress) {
> mtdoops_console_sync();
> return;
>
> I think this is all layered incorrectly - because mtdoops (which is a
> _VERY_ cool debugging facility btw - i wish generic x86 had something
> like that) tries to be a 'driver' and tries to achieve these things
> without modifying the core kernel.
>
> But it really should do that. We can certainly enhance the core kernel
> logic to tell the console driver more clearly when printk should go out
> immediately.
>
> Instead of using oops_in_progress it might be better to go into 'sync
> immeditately' mode if the kernel gets tainted. Add a callback for that
> to struct console (in include/linux/console.h). The ->unblank() callback
> is already such a "user attention needed immediately" callback. We could
> add a ->kernel_bug() callback to that.

I sent another patch which changes the mtdoops behavior a bit:

http://patchwork.ozlabs.org/patch/35750/

(the thread with the patch series is here:

http://lists.infradead.org/pipermail/linux-mtd/2009-October/027576.html)

the change is basically to log all messages in a circular buffer (the
current one only logs messages after an oops has started). It also
needed some restructuring, and I believe parts of the code becomes
simpler that way.

The console write now only adds messages to the buffer (never calls
mtdoops_console_sync), and mtdoops_console_sync (i.e., the ->unblank()
callback) simply becomes

static void mtdoops_console_sync(void)
{
struct mtdoops_context *cxt = &oops_cxt;

/* Write out the buffer if we are called during an oops */
if (oops_in_progress)
schedule_work(&cxt->work_write);
}

To handle the panic case, I've simply added a panic notifier which does

static int mtdoops_panic(struct notifier_block *this, unsigned long event,
void *ptr)
{
struct mtdoops_context *cxt = &oops_cxt;

cancel_work_sync(&cxt->work_write);
cxt->ready = 0;
if (cxt->mtd->panic_write)
mtdoops_write(cxt, 1);
else
printk(KERN_WARNING "mtdoops: panic_write is not defined, "
"cannot store dump from panic\n");

return NOTIFY_DONE;
}

So with this one, the exported panic_on_oops is no longer needed, and
normal oopses are handled by the scheduled work while panic_on_oopses
are handled by the panic handler.


Anyway, this particular patch is still up for discussion.

// Simon

2009-10-12 12:10:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Simon Kagstrom <[email protected]> wrote:

> (Risking that Artem also replies, I'll bite on this one! Let's hope we
> agree at least :-))
>
> On Mon, 12 Oct 2009 13:37:58 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > > - if (mtd->panic_write && in_interrupt())
> > > + if (mtd->panic_write && (in_interrupt() || panic_on_oops))
> > > /* Interrupt context, we're going to panic so try and log */
> > > mtdoops_write(cxt, 1);
> >
> > Hm, the code seems to be somewhat confused about this. It tries to guess
> > when it's panic-ing, right? in_interrupt() is the wrong test for that.
>
> Well, the main reason is to get the write done directly if we know
> we're going to crash. The rest of the code around the patch looks like
> this:
>
> if (mtd->panic_write && (in_interrupt() || panic_on_oops))
> /* Interrupt context, we're going to panic so try and log */
> mtdoops_write(cxt, 1);
> else
> schedule_work(&cxt->work_write);
>
> so if we're oopsing in interrupt context or are going to panic, we
> just write directly. mtdoops_write will then use mtd->panic_write if
> it's available to get the write done immediately without sleeping.

but i'm not sure that code achieves your intention.

in_interrupt() is a generic test. It will be true whenever you printk in
irq context - be that a panic or not a panic.

Also, the panic_on_oops usage looks wrong as well: it is set on a system
that wants a panic on oops - but the flag will be set all the time, even
when we are not oopsing.

I suppose the intention is to add a logic like this:

- buffer writes to the MTD async writeout thread for regular printks

- if we are in some sort of emergency, write to the MTD device directly
as we cannot buffer anymore.

Correct?

> [...]
>
> To handle the panic case, I've simply added a panic notifier which
> does
>
> static int mtdoops_panic(struct notifier_block *this, unsigned long event,
> void *ptr)
> {
> struct mtdoops_context *cxt = &oops_cxt;
>
> cancel_work_sync(&cxt->work_write);
> cxt->ready = 0;
> if (cxt->mtd->panic_write)
> mtdoops_write(cxt, 1);
> else
> printk(KERN_WARNING "mtdoops: panic_write is not defined, "
> "cannot store dump from panic\n");
>
> return NOTIFY_DONE;
> }
>
> So with this one, the exported panic_on_oops is no longer needed, and
> normal oopses are handled by the scheduled work while panic_on_oopses
> are handled by the panic handler.

Yes, that looks like the better direction - but 'panic' is still the
wrong trigger condition i think. We generally just crash and dont panic.
Often we'll display a kernel warning and then hang. Etc.

Also, would it be possible to just simplify the thing and not do any
buffering at all? Extra buffering complexity in a console driver is only
asking for trouble. Or is flash storage write cycles optimization that
important in this case?

Ingo

2009-10-12 12:16:39

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> Also, would it be possible to just simplify the thing and not do any
> buffering at all? Extra buffering complexity in a console driver is only
> asking for trouble. Or is flash storage write cycles optimization that
> important in this case?

That and the fact that on NAND flash you have to write full pages at a
time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. So
we really do want to buffer it where we can.

We don't want to write a 2KiB page for every line of printk output.

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-10-12 12:21:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* David Woodhouse <[email protected]> wrote:

> On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> > Also, would it be possible to just simplify the thing and not do any
> > buffering at all? Extra buffering complexity in a console driver is only
> > asking for trouble. Or is flash storage write cycles optimization that
> > important in this case?
>
> That and the fact that on NAND flash you have to write full pages at a
> time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip.
> So we really do want to buffer it where we can.
>
> We don't want to write a 2KiB page for every line of printk output.

Then i think the buffering is at the wrong place: we should instead
buffer in the generic layer and pass it to lowlevel if we know that we
have gone past a 2K boundary.

The size of the generic log buffer is always a power of two so detecting
2K boundaries is very easy. On any emergency the generic console layer
will do faster flushes - this is nothing the console driver itself
should bother with.

And that would avoid the whole workqueue logic - which is fragile to be
done in a printk to begin with.

So what we need is an extension to struct console that sets a buffering
limit. Zero (the default) means unbuffered.

(Btw., things like netconsole might make use of such buffering too.)

Agreed?

Ingo

2009-10-12 12:28:11

by Simon Kagstrom

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 12 Oct 2009 14:09:51 +0200
Ingo Molnar <[email protected]> wrote:

> > Well, the main reason is to get the write done directly if we know
> > we're going to crash. The rest of the code around the patch looks like
> > this:
> >
> > if (mtd->panic_write && (in_interrupt() || panic_on_oops))
> > /* Interrupt context, we're going to panic so try and log */
> > mtdoops_write(cxt, 1);
> > else
> > schedule_work(&cxt->work_write);
> >
> > so if we're oopsing in interrupt context or are going to panic, we
> > just write directly. mtdoops_write will then use mtd->panic_write if
> > it's available to get the write done immediately without sleeping.
>
> but i'm not sure that code achieves your intention.
>
> in_interrupt() is a generic test. It will be true whenever you printk in
> irq context - be that a panic or not a panic.

Well, this particular code is only called when oops_in_progress is set,
so we know we have an oops chasing us. And if we oops in an interrupt,
we're going to panic and the same thing goes if panic_on_oops is set.

So the current code only logs messages which happen when an oops is in
progress.

> Also, the panic_on_oops usage looks wrong as well: it is set on a system
> that wants a panic on oops - but the flag will be set all the time, even
> when we are not oopsing.
>
> I suppose the intention is to add a logic like this:
>
> - buffer writes to the MTD async writeout thread for regular printks
>
> - if we are in some sort of emergency, write to the MTD device directly
> as we cannot buffer anymore.
>
> Correct?

Almost :-). During an oops, it stores in a memory buffer until we have
either filled the memory buffer, the oops is over, or unblank() is called,
and after that it will write out the buffer to flash.

I believe the other implementation (outlined below) is a bit simpler in
that respect. That will also log all messages (well the last 4-8KB or
so of them), but only write it out when a panic or oops occurs.

> > So with this one, the exported panic_on_oops is no longer needed, and
> > normal oopses are handled by the scheduled work while panic_on_oopses
> > are handled by the panic handler.
>
> Yes, that looks like the better direction - but 'panic' is still the
> wrong trigger condition i think. We generally just crash and dont panic.
> Often we'll display a kernel warning and then hang. Etc.

But how can we detect that? The code above will write to the MTD device
either if an oops happens, or if we panic for some reason. If the
kernel just hangs (and is reset by the watchdog, if we have one), how
should we know when to write the log out?


For me personally, the panic case is the important one. Regular oopses
(without panic_on_oops) can go to syslog, but it's important to be able
to catch the cases when the kernel panics as well.

// Simon

2009-10-12 12:33:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Simon Kagstrom <[email protected]> wrote:

> > Yes, that looks like the better direction - but 'panic' is still the
> > wrong trigger condition i think. We generally just crash and dont
> > panic. Often we'll display a kernel warning and then hang. Etc.
>
> But how can we detect that? The code above will write to the MTD
> device either if an oops happens, or if we panic for some reason. If
> the kernel just hangs (and is reset by the watchdog, if we have one),
> how should we know when to write the log out?

You shouldnt need to care about that in a console driver - it's up to
higher layers.

See my reply to David Woodhouse, i think we should add support for
buffering in kernel/printk.c and that would both fix your problems,
would simplify the driver (significantly!) and would expose the generic
buffering capability to other console drivers as well.

Ingo

2009-10-12 12:34:10

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 2009-10-12 at 14:20 +0200, Ingo Molnar wrote:
> * David Woodhouse <[email protected]> wrote:
>
> > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> > > Also, would it be possible to just simplify the thing and not do any
> > > buffering at all? Extra buffering complexity in a console driver is only
> > > asking for trouble. Or is flash storage write cycles optimization that
> > > important in this case?
> >
> > That and the fact that on NAND flash you have to write full pages at a
> > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip.
> > So we really do want to buffer it where we can.
> >
> > We don't want to write a 2KiB page for every line of printk output.
>
> Then i think the buffering is at the wrong place: we should instead
> buffer in the generic layer and pass it to lowlevel if we know that we
> have gone past a 2K boundary.
>
> The size of the generic log buffer is always a power of two so detecting
> 2K boundaries is very easy. On any emergency the generic console layer
> will do faster flushes - this is nothing the console driver itself
> should bother with.
>
> And that would avoid the whole workqueue logic - which is fragile to be
> done in a printk to begin with.
>
> So what we need is an extension to struct console that sets a buffering
> limit. Zero (the default) means unbuffered.
>
> (Btw., things like netconsole might make use of such buffering too.)
>
> Agreed?

Makes some sense, yes.

We also use the workqueue logic to allow us to co-ordinate access to the
hardware properly -- taking locks where appropriate, etc. We can't do
that directly from a console ->write() method.

Some device drivers do provide a ->panic_write() function which breaks
all the locks and just resets the hardware because it knows we're
panicking, but we don't want to do that in the common case.

--
dwmw2

2009-10-12 12:37:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* David Woodhouse <[email protected]> wrote:

> On Mon, 2009-10-12 at 14:20 +0200, Ingo Molnar wrote:
> > * David Woodhouse <[email protected]> wrote:
> >
> > > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> > > > Also, would it be possible to just simplify the thing and not do any
> > > > buffering at all? Extra buffering complexity in a console driver is only
> > > > asking for trouble. Or is flash storage write cycles optimization that
> > > > important in this case?
> > >
> > > That and the fact that on NAND flash you have to write full pages at a
> > > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip.
> > > So we really do want to buffer it where we can.
> > >
> > > We don't want to write a 2KiB page for every line of printk output.
> >
> > Then i think the buffering is at the wrong place: we should instead
> > buffer in the generic layer and pass it to lowlevel if we know that we
> > have gone past a 2K boundary.
> >
> > The size of the generic log buffer is always a power of two so detecting
> > 2K boundaries is very easy. On any emergency the generic console layer
> > will do faster flushes - this is nothing the console driver itself
> > should bother with.
> >
> > And that would avoid the whole workqueue logic - which is fragile to be
> > done in a printk to begin with.
> >
> > So what we need is an extension to struct console that sets a buffering
> > limit. Zero (the default) means unbuffered.
> >
> > (Btw., things like netconsole might make use of such buffering too.)
> >
> > Agreed?
>
> Makes some sense, yes.
>
> We also use the workqueue logic to allow us to co-ordinate access to
> the hardware properly -- taking locks where appropriate, etc. We can't
> do that directly from a console ->write() method.
>
> Some device drivers do provide a ->panic_write() function which breaks
> all the locks and just resets the hardware because it knows we're
> panicking, but we don't want to do that in the common case.

Well other than not using sleeping locks in that codepath it should be
properly serializable. If the serial driver, netconsole, fbcon and all
the other non-trivial console drivers can do it then MTD should be able
to do it too.

Printk via workqueue ... lets not go there, really. (I know you already
have that but i think it's a mistake on a fundamental level.)

Ingo

2009-10-12 12:48:56

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 2009-10-12 at 14:36 +0200, Ingo Molnar wrote:
> Well other than not using sleeping locks in that codepath it should be
> properly serializable. If the serial driver, netconsole, fbcon and all
> the other non-trivial console drivers can do it then MTD should be able
> to do it too.

It's distinctly non-trivial. A 'real' user of the hardware may have just
triggered a block erase, which could take hundreds (or even thousands)
of milliseconds to complete. We can't always suspend that erase; do we
really want to _wait_ for it?

If we're actually in a _panic_ situation, then yes -- we want to wait,
and that's what the device driver's panic_write() method should do.

But for less critical output? I'm not so sure.

Of course, I'm currently looking at revamping the MTD APIs so they're
not entirely synchronous, and are queue-based like the block device API.
Then the console ->write() method could just queue the write, and we
need some other logic to ensure that it really does _happen_ when the
system crashes. A kind of 'purge queue of console writes' function in a
panic handler... except that as you point out, sometimes we actually
want that even when we haven't actually panicked.

--
dwmw2

2009-10-12 13:07:44

by Simon Kagstrom

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 12 Oct 2009 14:20:23 +0200
Ingo Molnar <[email protected]> wrote:

> * David Woodhouse <[email protected]> wrote:
>
> > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> > > Also, would it be possible to just simplify the thing and not do any
> > > buffering at all? Extra buffering complexity in a console driver is only
> > > asking for trouble. Or is flash storage write cycles optimization that
> > > important in this case?
> >
> > That and the fact that on NAND flash you have to write full pages at a
> > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip.
> > So we really do want to buffer it where we can.
> >
> > We don't want to write a 2KiB page for every line of printk output.
>
> Then i think the buffering is at the wrong place: we should instead
> buffer in the generic layer and pass it to lowlevel if we know that we
> have gone past a 2K boundary.
>
> The size of the generic log buffer is always a power of two so detecting
> 2K boundaries is very easy. On any emergency the generic console layer
> will do faster flushes - this is nothing the console driver itself
> should bother with.

But this is only part of the mtdoops problem (the reason why we don't
write all the time). The current code only stores messages printed
during an oops, and this behavior will surely change if the console
driver gets large buffers of output - or it would have to take in the
output unbuffered anyway.

My patch changes this behavior, and with that I don't think buffered
output would be a problem - it would indeed make it more simple as you
say - assuming there is something like ->kernel_bug() that would flush
the last 4KiB or so of messages to mtdoops when there is an oops or
panic.

> And that would avoid the whole workqueue logic - which is fragile to be
> done in a printk to begin with.

I'm afraid I don't really see this issue. The workqueue is used to
write the buffer to the mtd device if we are not in a panic or
interrupt context - in which case we do it directly.

So it's only used when an oops is ongoing.

// Simon

2009-10-12 13:08:12

by Alan

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

> See my reply to David Woodhouse, i think we should add support for
> buffering in kernel/printk.c and that would both fix your problems,
> would simplify the driver (significantly!) and would expose the generic
> buffering capability to other console drivers as well.

Buffering printk in general is bad. Given a driver needs only to provide
about 6 lines of code using a kfifo is it really that hard for the odd
code that wants to buffer to do that ?

2009-10-12 13:16:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Simon Kagstrom <[email protected]> wrote:

> On Mon, 12 Oct 2009 14:20:23 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > * David Woodhouse <[email protected]> wrote:
> >
> > > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> > > > Also, would it be possible to just simplify the thing and not do any
> > > > buffering at all? Extra buffering complexity in a console driver is only
> > > > asking for trouble. Or is flash storage write cycles optimization that
> > > > important in this case?
> > >
> > > That and the fact that on NAND flash you have to write full pages at a
> > > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip.
> > > So we really do want to buffer it where we can.
> > >
> > > We don't want to write a 2KiB page for every line of printk output.
> >
> > Then i think the buffering is at the wrong place: we should instead
> > buffer in the generic layer and pass it to lowlevel if we know that we
> > have gone past a 2K boundary.
> >
> > The size of the generic log buffer is always a power of two so
> > detecting 2K boundaries is very easy. On any emergency the generic
> > console layer will do faster flushes - this is nothing the console
> > driver itself should bother with.
>
> But this is only part of the mtdoops problem (the reason why we don't
> write all the time). The current code only stores messages printed
> during an oops, and this behavior will surely change if the console
> driver gets large buffers of output - or it would have to take in the
> output unbuffered anyway.
>
> My patch changes this behavior, and with that I don't think buffered
> output would be a problem - it would indeed make it more simple as you
> say - assuming there is something like ->kernel_bug() that would flush
> the last 4KiB or so of messages to mtdoops when there is an oops or
> panic.
>
> > And that would avoid the whole workqueue logic - which is fragile to
> > be done in a printk to begin with.
>
> I'm afraid I don't really see this issue. The workqueue is used to
> write the buffer to the mtd device if we are not in a panic or
> interrupt context - in which case we do it directly.
>
> So it's only used when an oops is ongoing.

This fixation on 'panic' is so wrong!

90% of the bugs users care about dont involve any panic. And even if
there is a panic down the line, most of the interesting messages are in
the stream leading up to the panic - now tucked away in that async
workqueue mechanism and not visible.

There's two clean solutions i think:

1) add some new "ok, there's trouble!" callback to struct console and
the console driver could via that mechanism send out the _last_ 2KB
(or more) of kernel log messages. Basically we can go back in time by
looking at the dmesg buffer. The low level console driver does not
need to 'follow' the high level console state - it only wants to
print in case of trouble anyway.

2) or add buffered (flash-friendly) writes for all printk output - panic
and non-panic alike. This would be useful to debug suspend/resume
bugs for example. This would also optimize the packets of netconsole
output. (last i checked we sent a packet per line.)

The workqueue looks wrong in both variants. If we are panic-ing (or
hanging, or ...) then we are halting the machine - the workqueue has no
chance to actually execute.

Ingo

2009-10-12 13:26:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Alan Cox <[email protected]> wrote:

> > See my reply to David Woodhouse, i think we should add support for
> > buffering in kernel/printk.c and that would both fix your problems,
> > would simplify the driver (significantly!) and would expose the
> > generic buffering capability to other console drivers as well.
>
> Buffering printk in general is bad. [...]

The general (and default) case would be 0 buffering - i.e. finegrained
per line calls to ->console_write().

This is the common case indeed, we want to get console output out as
soon as possible and as finegrained as possible - as we dont know when a
failure mode removes our ability to print anything else.

My argument is that instead of a complicated dance of workqueue versus
non-workqueue printk support in the MTD code in drivers/mtd/mtdoops.c,
this should be done at the generic console level.

It's not hard, nor complex if done at the right level, nor does it
impact the regular zero-buffering codepath in a significant way - and
the end result would be a significantly simpler (and, in turn, more
robust) MTD printk driver.

I care about this because i still havent given up hope that the company
you are working for will finally give us some permanent storage in the
CPU itself, so that we can have cross-reboot printk buffering ;-) If
that storage is in the form of Flash, then buffering (to optimize write
cycles) is probably a must.

> [...] Given a driver needs only to provide about 6 lines of code using
> a kfifo is it really that hard for the odd code that wants to buffer
> to do that ?

Avoiding a workqueue in printk is about the critical path of failure and
about complexity in general.

I dont see how kfifo helps here much - kfifo is really just a relatively
simple dynamic memory buffer abstraction - while most of the complexity
here is elsewhere. Could you explain what you meant with kfifo here
please?

Ingo

2009-10-12 13:33:23

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 2009-10-12 at 15:25 +0200, Ingo Molnar wrote:
>
> I care about this because i still havent given up hope that the company
> you are working for will finally give us some permanent storage in the
> CPU itself, so that we can have cross-reboot printk buffering ;-)

Why would it have to be in the CPU? If we had decent firmware, couldn't
we just look at the old kernel's log_buf in RAM?

On machines where coreboot is supported (which is mostly machines
_other_ than those from my employer, unfortunately), you ought to be
able to do this today.

--
dwmw2

2009-10-12 13:40:32

by Simon Kagstrom

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

OK, I don't think we understand each other. Sorry if I'm being slow
here, please tell me if I'm misunderstanding something fundamental
below.

On Mon, 12 Oct 2009 15:15:29 +0200
Ingo Molnar <[email protected]> wrote:

> > I'm afraid I don't really see this issue. The workqueue is used to
> > write the buffer to the mtd device if we are not in a panic or
> > interrupt context - in which case we do it directly.
> >
> > So it's only used when an oops is ongoing.
>
> This fixation on 'panic' is so wrong!
>
> 90% of the bugs users care about dont involve any panic. And even if
> there is a panic down the line, most of the interesting messages are in
> the stream leading up to the panic - now tucked away in that async
> workqueue mechanism and not visible.

Well, this is what my patch [1] aims to fix. What it does is to put all
messages in a circular buffer, and when an oops or panic occurs it
writes them out. The current version only collects messages _during_ an
oops. I'll rework it with using kfifo as per Alans suggestion though.

Neither the current code nor the new patch has them stored in the work
queue during a panic though. If this happens, they will call
panic_write (if it's available) to write it out directly.

> There's two clean solutions i think:
>
> 1) add some new "ok, there's trouble!" callback to struct console and
> the console driver could via that mechanism send out the _last_ 2KB
> (or more) of kernel log messages. Basically we can go back in time by
> looking at the dmesg buffer. The low level console driver does not
> need to 'follow' the high level console state - it only wants to
> print in case of trouble anyway.
>
> 2) or add buffered (flash-friendly) writes for all printk output - panic
> and non-panic alike. This would be useful to debug suspend/resume
> bugs for example. This would also optimize the packets of netconsole
> output. (last i checked we sent a packet per line.)

Well, suspend/resume hangs is one of the cases which mtdoops won't
catch. But at least on NAND flash, I'd be a bit weary about logging all
printk output for fear of wearing out the flash.

> The workqueue looks wrong in both variants. If we are panic-ing (or
> hanging, or ...) then we are halting the machine - the workqueue has no
> chance to actually execute.

but then we are using mtd->panic_write to write it out directly, not
via the work queue.

// Simon

[1] http://patchwork.ozlabs.org/patch/35750/

2009-10-12 14:12:54

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 12 Oct 2009 13:37:58 +0200
Ingo Molnar <[email protected]> wrote:

>
> * Artem Bityutskiy <[email protected]> wrote:
>
> > On Mon, 2009-10-12 at 13:15 +0200, ext Ingo Molnar wrote:
> > > > We use the mtdoops module which stores oopses on the Flash
> > > > partition, in order to make it possible to analyze them later.
> > > > And mtdoops needs to know whether 'panic_on_oops' is on or off.
> > > > Thus, we need this variable to be exported.
> > >
> > > hm, why does it need it? Could you send the patch please that
> > > makes use of it (as an easy way to explain the need for the
> > > export) - current upstream drivers/mtd/mtdoops.c doesnt need it
> > > so it must be some pending change.
> >
> > Yes, current drivers/mtd/mtdoops.c does not need it, but we have
> > this patch:
> >
> > http://lists.infradead.org/pipermail/linux-mtd/2009-October/027450.html
> >
> > which makes sure mtdoops writes the oops log immediately, because we
> > have 'panic_on_oops' set so this is our last chance to save the
> > oops.
> >
> > Here is the patch inlined as well:
> >
> > Author: Aaro Koskinen <[email protected]>
> > Date: Thu Oct 1 18:16:55 2009 +0300
> >
> > mtd: mtdoops: do not schedule work if we are going to die
> >
> > If panic_on_oops is set, the log should be written right away
> > because this is not going to be a second chance.
> >
> > Signed-off-by: Aaro Koskinen <[email protected]>
> > Signed-off-by: Artem Bityutskiy <[email protected]>
> >
> > diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
> > index 1060337..ac67833 100644
> > --- a/drivers/mtd/mtdoops.c
> > +++ b/drivers/mtd/mtdoops.c
> > @@ -335,7 +335,7 @@ static void mtdoops_console_sync(void)
> > cxt->ready = 0;
> > spin_unlock_irqrestore(&cxt->writecount_lock, flags);
> >
> > - if (mtd->panic_write && in_interrupt())
> > + if (mtd->panic_write && (in_interrupt() || panic_on_oops))
> > /* Interrupt context, we're going to panic so try
> > and log */ mtdoops_write(cxt, 1);
>
> Hm, the code seems to be somewhat confused about this. It tries to
> guess when it's panic-ing, right? in_interrupt() is the wrong test
> for that.
>
> mtdoops_console_sync() gets called by regular printk() as well via:
>
> static void
> mtdoops_console_write(struct console *co, const char *s, unsigned int
> count) {
> struct mtdoops_context *cxt = co->data;
> struct mtd_info *mtd = cxt->mtd;
> unsigned long flags;
>
> if (!oops_in_progress) {
> mtdoops_console_sync();
> return;
>
> I think this is all layered incorrectly - because mtdoops (which is a
> _VERY_ cool debugging facility btw - i wish generic x86 had something
> like that) tries to be a 'driver' and tries to achieve these things
> without modifying the core kernel.
>
> But it really should do that. We can certainly enhance the core
> kernel logic to tell the console driver more clearly when printk
> should go out immediately.
>
> Instead of using oops_in_progress it might be better to go into 'sync
> immeditately' mode if the kernel gets tainted.

actually there are very nice "begin of oops" and "end of oops"
functions... they are the perfect places for this ;-)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-10-12 14:27:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* David Woodhouse <[email protected]> wrote:

> On Mon, 2009-10-12 at 15:25 +0200, Ingo Molnar wrote:
> >
> > I care about this because i still havent given up hope that the
> > company you are working for will finally give us some permanent
> > storage in the CPU itself, so that we can have cross-reboot printk
> > buffering ;-)
>
> Why would it have to be in the CPU? If we had decent firmware,
> couldn't we just look at the old kernel's log_buf in RAM?

Not if the failure is say a s2ram hang that requires a power cycle. Also
there are certain classes of bugs that only occur on cold boot. Plus
there's the "need to unplug the battery to revive the system" class of
bugs (but they are rare).

So i think the MTD / flash stuff is powerful.

Ingo

2009-10-12 14:31:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Simon Kagstrom <[email protected]> wrote:

> OK, I don't think we understand each other. Sorry if I'm being slow
> here, please tell me if I'm misunderstanding something fundamental
> below.

[ it could easily be me being confused - i dont know the mtdoops code
that well - i just raised an eyebrow at the export request, which
yelled 'layering violation' at me ;-) ]

> On Mon, 12 Oct 2009 15:15:29 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > > I'm afraid I don't really see this issue. The workqueue is used to
> > > write the buffer to the mtd device if we are not in a panic or
> > > interrupt context - in which case we do it directly.
> > >
> > > So it's only used when an oops is ongoing.
> >
> > This fixation on 'panic' is so wrong!
> >
> > 90% of the bugs users care about dont involve any panic. And even if
> > there is a panic down the line, most of the interesting messages are in
> > the stream leading up to the panic - now tucked away in that async
> > workqueue mechanism and not visible.
>
> Well, this is what my patch [1] aims to fix. What it does is to put
> all messages in a circular buffer, and when an oops or panic occurs it
> writes them out. The current version only collects messages _during_
> an oops. I'll rework it with using kfifo as per Alans suggestion
> though.
>
> Neither the current code nor the new patch has them stored in the work
> queue during a panic though. If this happens, they will call
> panic_write (if it's available) to write it out directly.
>
> > There's two clean solutions i think:
> >
> > 1) add some new "ok, there's trouble!" callback to struct console and
> > the console driver could via that mechanism send out the _last_ 2KB
> > (or more) of kernel log messages. Basically we can go back in time by
> > looking at the dmesg buffer. The low level console driver does not
> > need to 'follow' the high level console state - it only wants to
> > print in case of trouble anyway.
> >
> > 2) or add buffered (flash-friendly) writes for all printk output - panic
> > and non-panic alike. This would be useful to debug suspend/resume
> > bugs for example. This would also optimize the packets of netconsole
> > output. (last i checked we sent a packet per line.)
>
> Well, suspend/resume hangs is one of the cases which mtdoops won't
> catch. [...]

( Sidenote: i see no reason why that wouldnt be possible if it's
implemented properly. )

> [...] But at least on NAND flash, I'd be a bit weary about logging all
> printk output for fear of wearing out the flash.

Clearly should be optional - like the s2ram debug hack to RTC registers
is optional on x86.

> > The workqueue looks wrong in both variants. If we are panic-ing (or
> > hanging, or ...) then we are halting the machine - the workqueue has
> > no chance to actually execute.
>
> but then we are using mtd->panic_write to write it out directly, not
> via the work queue.

... i might be confused, but in which case _is_ the workqueue used?

It clearly shows up in the codepaths i've read, but maybe i've
misinterpreted what it does.

Ingo

2009-10-12 14:37:31

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 2009-10-12 at 16:26 +0200, Ingo Molnar wrote:
> Not if the failure is say a s2ram hang that requires a power cycle. Also
> there are certain classes of bugs that only occur on cold boot. Plus
> there's the "need to unplug the battery to revive the system" class of
> bugs (but they are rare).

So you need to build in enough ECC to cope with the decay which happens
when RAM isn't being refreshed for a few seconds... :)

> So i think the MTD / flash stuff is powerful.

Yeah, definitely. I was just pointing out that we can actually do a lot
better on today's commodity hardware too.

--
dwmw2

2009-10-12 15:02:33

by Simon Kagstrom

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 12 Oct 2009 16:30:17 +0200
Ingo Molnar <[email protected]> wrote:

> * Simon Kagstrom <[email protected]> wrote:
>
> > OK, I don't think we understand each other. Sorry if I'm being slow
> > here, please tell me if I'm misunderstanding something fundamental
> > below.
>
> [ it could easily be me being confused - i dont know the mtdoops code
> that well - i just raised an eyebrow at the export request, which
> yelled 'layering violation' at me ;-) ]

I sent the same patch (export panic_on_oops) to LKML a week ago, but
got no replies. I now know who to Cc next time ;-)

> > > 2) or add buffered (flash-friendly) writes for all printk output - panic
> > > and non-panic alike. This would be useful to debug suspend/resume
> > > bugs for example. This would also optimize the packets of netconsole
> > > output. (last i checked we sent a packet per line.)
> >
> > Well, suspend/resume hangs is one of the cases which mtdoops won't
> > catch. [...]
>
> ( Sidenote: i see no reason why that wouldnt be possible if it's
> implemented properly. )

Provided there is a callback for "really_dump_the_console" which gets
called from interesting places (e.g., suspend/resume) it should be easy
to do with the patched mtdoops without much changes, at least with my
proposed circular buffer patch it will be trivial.

> > [...] But at least on NAND flash, I'd be a bit weary about logging all
> > printk output for fear of wearing out the flash.
>
> Clearly should be optional - like the s2ram debug hack to RTC registers
> is optional on x86.

That would be OK for me at least, and again should not be very difficult
to implement even with todays code.

> > > The workqueue looks wrong in both variants. If we are panic-ing (or
> > > hanging, or ...) then we are halting the machine - the workqueue has
> > > no chance to actually execute.
> >
> > but then we are using mtd->panic_write to write it out directly, not
> > via the work queue.
>
> ... i might be confused, but in which case _is_ the workqueue used?
>
> It clearly shows up in the codepaths i've read, but maybe i've
> misinterpreted what it does.

With the code in mainline, it basically works like this.

- When oops_in_progress is not set, mtdoops_console_write will not put
anything in the buffer. It will call mtdoops_console_sync(), but
since the buffer will be empty (cxt->writecount is 0), no writes will
be done.

- When oops_in_progress _is_ set, mtdoops_console_write will start
putting things in the buffer.

- mtdoops_console_sync is then called either when the buffer has been
filled, or when ->unblank() is called, or when oops_in_progress is no
longer true.

This is the place when the workqueue _can_ be used (in the cases when
this is not in interrupt context or panic_on_oops is unset).

With my patch it instead works like this:

- mtdoops_console_write continuously writes messages to the buffer, but
never calls mtdoops_console_sync() itself.

- mtdoops_console_sync (i.e., the ->unblank() callback) will schedule
work if oops_in_progress is set.

- if we have a panic, it will call mtdoops_write directly (if
mtd->panic_write is set, otherwise we are out of luck). This is also
the code path on oopses in interrupt context.

So the workqueue only gets used on unblank() from oopses. I think the
second implementation is simpler, but it also changes the behavior of
mtdoops a bit to include messages before the oops/panic as well.

// Simon

2009-10-12 15:15:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* David Woodhouse <[email protected]> wrote:

> On Mon, 2009-10-12 at 16:26 +0200, Ingo Molnar wrote:

> > Not if the failure is say a s2ram hang that requires a power cycle.
> > Also there are certain classes of bugs that only occur on cold boot.
> > Plus there's the "need to unplug the battery to revive the system"
> > class of bugs (but they are rare).
>
> So you need to build in enough ECC to cope with the decay which
> happens when RAM isn't being refreshed for a few seconds... :)

[ hey, i think you should line up with BIOS writers at that wall ;-) ]

> > So i think the MTD / flash stuff is powerful.
>
> Yeah, definitely. I was just pointing out that we can actually do a
> lot better on today's commodity hardware too.

I wish it worked on any of the 10+ x86 systems i have. Is there anyone
who'd be interested in exploring whether warm BIOS reboots work
_anywhere_?

A simple patch with a new (default-off) CONFIG_DEBUG_ feature that just
puts a signature into a predictable spot in RAM, switches the reboot
method over to warm reboot (reboot=w) and prints some friendly "yay,
this BIOS rocks!" message if the signature is still there after a reboot
and not zeroed out.

If that works _anywhere_ we could complete it: we could cache the dmesg
buffer address (__log_buf[]) across reboots (and maybe the printk tail
offset (log_end)), and that would be an _excellent_ debuggability
feature for a large class of otherwise undebuggable crashes ...

We could use that to preserve a kernel function trace (or a branch
execution hardware trace using BTS on Intel CPUs) across crashes, etc.
etc.

Ingo

2009-10-12 15:24:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Simon Kagstrom <[email protected]> wrote:

> With my patch it instead works like this:
>
> - mtdoops_console_write continuously writes messages to the buffer, but
> never calls mtdoops_console_sync() itself.
>
> - mtdoops_console_sync (i.e., the ->unblank() callback) will schedule
> work if oops_in_progress is set.
>
> - if we have a panic, it will call mtdoops_write directly (if
> mtd->panic_write is set, otherwise we are out of luck). This is also
> the code path on oopses in interrupt context.
>
> So the workqueue only gets used on unblank() from oopses. I think the
> second implementation is simpler, but it also changes the behavior of
> mtdoops a bit to include messages before the oops/panic as well.

The main printk principle is simplicity:

- The simpler the printk codepath, the higher the chances that we still
are within the window of opportunity to get anything out to the user.

- Furthermore, the simpler the printk codepath, the larger the window of
opportunity is to begin with: we rely on less external state, so we
have a smaller surface of interaction that might break printk.

In that sense, i think your modified workqueue use is less wrong than
what is in current mainline, but i'm afraid it's still wrong.

Why use a workqueue on unblank()? Why use a workqueue _at all_? (If we
piggyback to any kernel thread we could as well piggyback to syslog
itself - and we all know how often syslog fails at capturing oopses.)

The mtdoops console driver only seems to act if there's an emergency -
and in emergencies we really _never_ want to do complex things like
using a workqueue thread.

( Sidenote: i proffer that we dont want to use a workqueue in the
'regular' printk case either - but that seems to be irrelevant here as
mtdoops does not seem to save / care about regular non-emergency
printks. )

Ingo

2009-10-12 15:38:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops



On Mon, 12 Oct 2009, Simon Kagstrom wrote:
>
> Well, this is what my patch [1] aims to fix. What it does is to put all
> messages in a circular buffer, and when an oops or panic occurs it
> writes them out.

Umm. That's wrong. We already _have_ the circular buffer.

It's called 'log_buf'.

I agree with the "save kernel buffer on panic" thing, but I disagree with
making it anything new, and hooking into "printk()" or the console
subsystem AT ALL. That's just bogus, stupid, and WRONG.

What you can do is to just flush the 'log_buf' buffer (or as much of it as
you want - the buffer may be a megabyte in size, and maybe you only want
to flush the last 8kB or something like that) on oops. And _not_ mix this
up with anything else. It's a really simple circular buffer, which just
has

- log_buf: buffer start
- log_end: number of characters ever seen
- log_buf_len: size of buffer (guaranteed to be a power-of-2)

so it's literally as easy as looking at those three values (there's a few
more that you _can_ look at, but they'd not likely be relevant for a
"panic_on_oops" thing)

> The current version only collects messages _during_ an
> oops. I'll rework it with using kfifo as per Alans suggestion though.

Don't. kfifo's aren't going to help. You're doing this at all the wrong
levels ENTIRELY, and we already have the buffer you want to flush.

Linus

2009-10-12 15:46:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops



On Mon, 12 Oct 2009, Linus Torvalds wrote:
>
> Don't. kfifo's aren't going to help. You're doing this at all the wrong
> levels ENTIRELY, and we already have the buffer you want to flush.

Btw, a few simple rules:

- if you need to make your device look like a "console device" for
dumping at oops time, you're doing things wrong. You don't want line
buffered output to begin with, and you don't want to see each line, you
only want this at exceptional points.

- if you need to look at "in_interrupt()" or "panic_on_oops", you're
doing things wrong.

- if you add your own buffers, you're doing things wrong. I have
CONFIG_LOG_BUF_SHIFT=18 in my kernel, so I've already got 256kB worth
of memory allocated for kernel messages. Sometimes I increase that
further, just because I do some silly printk debugging.

IOW, just add a very simple "flush the dmesg buffer on oops" callback to
the end of the oops printout code (just a single call after the oops thing
is now known to be in the dmesg buffers!)

It's not just oopses, btw. Maybe people would like to do this as the last
stage of a reboot/shutdown too. Because some of the final printouts from
the shutdown will never make it to disk, because 'ksyslogd' has been
killed, and the root filesystem has been turned read-only.

Linus

2009-10-12 17:31:33

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 2009-10-12 at 08:44 -0700, Linus Torvalds wrote:
>
> On Mon, 12 Oct 2009, Linus Torvalds wrote:
> >
> > Don't. kfifo's aren't going to help. You're doing this at all the wrong
> > levels ENTIRELY, and we already have the buffer you want to flush.
>
> Btw, a few simple rules:
>
> - if you need to make your device look like a "console device" for
> dumping at oops time, you're doing things wrong. You don't want line
> buffered output to begin with, and you don't want to see each line, you
> only want this at exceptional points.
>
> - if you need to look at "in_interrupt()" or "panic_on_oops", you're
> doing things wrong.
>
> - if you add your own buffers, you're doing things wrong. I have
> CONFIG_LOG_BUF_SHIFT=18 in my kernel, so I've already got 256kB worth
> of memory allocated for kernel messages. Sometimes I increase that
> further, just because I do some silly printk debugging.
>
> IOW, just add a very simple "flush the dmesg buffer on oops" callback to
> the end of the oops printout code (just a single call after the oops thing
> is now known to be in the dmesg buffers!)
>
> It's not just oopses, btw. Maybe people would like to do this as the last
> stage of a reboot/shutdown too. Because some of the final printouts from
> the shutdown will never make it to disk, because 'ksyslogd' has been
> killed, and the root filesystem has been turned read-only.

Yes, it is difficult to disagree with this. As Ingo also pointed,
pretending to be a console driver is ugly, although I think it
was a clever way to avoid touching generic code.

But mtdoops tries to solves the following problem. What if we are
oopsing in an interrupt, which interrupted the mtd driver, so we have
all the locks held, and the mtd driver is in a unexpected stage ATM? Or
what if we are oopsing in the mtd driver, or in something which was
called by the MTD driver.?

This is what all that "workqueue vs. mtd->panic_write()" logic is about,
which Ingo dislikes.

The logic of mtdoops (my interpretation):

1. OK, there was an oops, we have it in a buffer and we need to write
it to flash now.
2. If we are in interrupt context, we have to write right now, because
the system is about to die very soon. And must use special
'mtd->panic_write()' call.
3. If we have 'panic_on_oops' set, the system will be stopped soon. So
we also have to write now, because this is our last chance. And we
also have to use 'mtd->panic_write()'.
4. Otherwise, it is probable that system will be somewhat alive, and we
may schedule the write instead of using 'mtd->panic_write()', because
'mtd->panic_write()' abuses locking and chip state, and will screw up
the mtd flash driver. So we prefer to schedule the write instead.

I guess similar could be done in 'panic()' instead. Or you have a more
elegant solution?

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-10-12 17:45:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops



On Mon, 12 Oct 2009, Artem Bityutskiy wrote:
>
> But mtdoops tries to solves the following problem. What if we are
> oopsing in an interrupt, which interrupted the mtd driver, so we have
> all the locks held, and the mtd driver is in a unexpected stage ATM? Or
> what if we are oopsing in the mtd driver, or in something which was
> called by the MTD driver.?

Well, quite frankly, if you have an oops while holding a spinlock, then
the machine is dead _anyway_.

So what I would suggest is to just ignore the above problem. No amount of
workqueue logic will help it - if the oops happened while an interrupt
held a critical mtd lock, that lock will _never_ be released, so exactly
what would be helped?

Now, I realize that _if_ you treat mtdoops as a 'console' layer, then you
need to do that crazy thing, because you still want the oops to print out
to the other consoles, and you're only getting data one line at a time.
But since that was the wrong thing to do for a lot of other reasons
anyway, that's not a very good argument.

Once you do the final flush in a controlled place _after_ you've printed
out all the oops information, you simply don't care about locks any more.
Because if you were holding critical locks, you're done anyway.

Sure, maybe you want to do a "trylock()" and skip the oops flush entirely
in the mtd layer if you can't do it, but it's the "let's use a workqueue"
or something that doesn't seem to make a lot of sense to me.

Linus

2009-10-12 17:48:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops



On Mon, 12 Oct 2009, Linus Torvalds wrote:
>
> Once you do the final flush in a controlled place _after_ you've printed
> out all the oops information, you simply don't care about locks any more.
> Because if you were holding critical locks, you're done anyway.
>
> Sure, maybe you want to do a "trylock()" and skip the oops flush entirely
> in the mtd layer if you can't do it, but it's the "let's use a workqueue"
> or something that doesn't seem to make a lot of sense to me.

Side note: I don't actually care deeply. Once it's all inside some driver,
and once it's not messing up the console layer, I don't think the small
details matter all that much. I just think it's likely to be a sign of
something wrong if you need to use workqueues to flush - it probably means
that the most interesting oopses will never make it to the mtd device in
the first place.

Linus

2009-10-12 18:11:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 12 Oct 2009 08:36:38 -0700 (PDT) Linus Torvalds <[email protected]> wrote:

> I agree with the "save kernel buffer on panic" thing, but I disagree with
> making it anything new, and hooking into "printk()" or the console
> subsystem AT ALL. That's just bogus, stupid, and WRONG.
>
> What you can do is to just flush the 'log_buf' buffer (or as much of it as
> you want - the buffer may be a megabyte in size, and maybe you only want
> to flush the last 8kB or something like that) on oops. And _not_ mix this
> up with anything else.

What he said. I did it that way in the Digeo kernel back in 2002.
Worked good.

Doing it via a console is rather weird. It will need core kernel
changes to do it properly.

Perhaps oops_enter() is a good place to mark the start of the log, and
flush it within oops_exit().

2009-10-12 18:25:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Andrew Morton <[email protected]> wrote:

> On Mon, 12 Oct 2009 08:36:38 -0700 (PDT) Linus Torvalds <[email protected]> wrote:
>
> > I agree with the "save kernel buffer on panic" thing, but I disagree with
> > making it anything new, and hooking into "printk()" or the console
> > subsystem AT ALL. That's just bogus, stupid, and WRONG.
> >
> > What you can do is to just flush the 'log_buf' buffer (or as much of it as
> > you want - the buffer may be a megabyte in size, and maybe you only want
> > to flush the last 8kB or something like that) on oops. And _not_ mix this
> > up with anything else.
>
> What he said. I did it that way in the Digeo kernel back in 2002.
> Worked good.
>
> Doing it via a console is rather weird. It will need core kernel
> changes to do it properly.
>
> Perhaps oops_enter() is a good place to mark the start of the log, and
> flush it within oops_exit().

Simplest would be to do the last 2K in oops_exit()? That gives the oops,
and the history leading up to it. Since the blocking is 2K, the extra
log output is for free.

(unless the oops is larger than 2K - but that is rather rare.)

Ingo

2009-10-12 18:33:19

by Carl-Daniel Hailfinger

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On 12.10.2009 17:14, Ingo Molnar wrote:
> * David Woodhouse <[email protected]> wrote:
>
>> On Mon, 2009-10-12 at 16:26 +0200, Ingo Molnar wrote:
>>
>>> Not if the failure is say a s2ram hang that requires a power cycle.
>>> Also there are certain classes of bugs that only occur on cold boot.
>>> Plus there's the "need to unplug the battery to revive the system"
>>> class of bugs (but they are rare).
>>>
>> So you need to build in enough ECC to cope with the decay which
>> happens when RAM isn't being refreshed for a few seconds... :)
>>
>
> [ hey, i think you should line up with BIOS writers at that wall ;-) ]
>

Not all of us x86 firmware writers are evil.


>>> So i think the MTD / flash stuff is powerful.
>>>
>> Yeah, definitely. I was just pointing out that we can actually do a
>> lot better on today's commodity hardware too.
>>
>
> I wish it worked on any of the 10+ x86 systems i have. Is there anyone
> who'd be interested in exploring whether warm BIOS reboots work
> _anywhere_?
>

AFAIK memory clearing is default off in coreboot for non-ECC RAM and
default on for ECC RAM (to avoid parity errors on read, but that can
probably be worked around). Unless I'm mistaken, the SeaBIOS BIOS
compatibility layer on top of coreboot doesn't erase RAM at all, so
contents can survive.
No idea about classic AMI/Award/Phoenix/Insyde/whatever BIOS, though.


> A simple patch with a new (default-off) CONFIG_DEBUG_ feature that just
> puts a signature into a predictable spot in RAM, switches the reboot
> method over to warm reboot (reboot=w) and prints some friendly "yay,
> this BIOS rocks!" message if the signature is still there after a reboot
> and not zeroed out.
>
> If that works _anywhere_ we could complete it: we could cache the dmesg
> buffer address (__log_buf[]) across reboots (and maybe the printk tail
> offset (log_end)), and that would be an _excellent_ debuggability
> feature for a large class of otherwise undebuggable crashes ...
>
> We could use that to preserve a kernel function trace (or a branch
> execution hardware trace using BTS on Intel CPUs) across crashes, etc.
> etc.
>

Since we're discussing log buffers anyway, does it make sense to have
this feature interact with the coreboot log buffer which is passed on to
the OS (no official patches for that one, yet)? Basically, coreboot has
its own log buffer where it stores the hardware init diagnostic messages
(very similar to the Linux kernel message buffer) and it could make
sense to use the same memory area for both purposes (if you can deal
with coreboot messages being absent after a kernel Oops).

Thoughts?

Regards,
Carl-Daniel

--
Developer quote of the week:
"We are juggling too many chainsaws and flaming arrows and tigers."

2009-10-12 18:38:58

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 12 Oct 2009 20:23:46 +0200 Ingo Molnar <[email protected]> wrote:

>
> * Andrew Morton <[email protected]> wrote:
>
> > On Mon, 12 Oct 2009 08:36:38 -0700 (PDT) Linus Torvalds <[email protected]> wrote:
> >
> > > I agree with the "save kernel buffer on panic" thing, but I disagree with
> > > making it anything new, and hooking into "printk()" or the console
> > > subsystem AT ALL. That's just bogus, stupid, and WRONG.
> > >
> > > What you can do is to just flush the 'log_buf' buffer (or as much of it as
> > > you want - the buffer may be a megabyte in size, and maybe you only want
> > > to flush the last 8kB or something like that) on oops. And _not_ mix this
> > > up with anything else.
> >
> > What he said. I did it that way in the Digeo kernel back in 2002.
> > Worked good.
> >
> > Doing it via a console is rather weird. It will need core kernel
> > changes to do it properly.
> >
> > Perhaps oops_enter() is a good place to mark the start of the log, and
> > flush it within oops_exit().
>
> Simplest would be to do the last 2K in oops_exit()? That gives the oops,
> and the history leading up to it. Since the blocking is 2K, the extra
> log output is for free.
>
> (unless the oops is larger than 2K - but that is rather rare.)
>

Some oops traces can be pretty large. Perhaps "oops_enter minus 1k up
to oops_exit".

The digeo kernel later got changed to package the oops into a binary
record format for logging to nvram. iirc that was for space reasons,
and for ease of downstream analysis.

That kernel used to be downloadable but I can't immediately find it.
Probably not very interesting anyway.

2009-10-12 18:47:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops



On Mon, 12 Oct 2009, Ingo Molnar wrote:
>
> * Andrew Morton <[email protected]> wrote:
> >
> > Perhaps oops_enter() is a good place to mark the start of the log, and
> > flush it within oops_exit().
>
> Simplest would be to do the last 2K in oops_exit()? That gives the oops,
> and the history leading up to it. Since the blocking is 2K, the extra
> log output is for free.

I agree, except I don't think it should be fixed to 2k.

We should just dump as much as is "appropriate" for the dump device. It
might be the last 2kB, it might be 8kB, it might be 64kB. We don't know,
we don't care. The device may have its own per-device limits. Any extra
data we get from before the oops is just gravy (often there might be
interestign warning messages leadign up to the dump), and if the oops is
too big for the dump device, it's not something we can do anything about
anyway.

So the logic should literally be something like this:

- kernel/printk.c:

void dump_kmsg(void)
{
unsigned long len = ACCESS_ONCE(log_end);
struct dump_device *dump;
const char *s1, *s2;
unsigned long l1, l2;

s1 = "";
l1 = 0;
s2 = log_buf;
l2 = len;

/* Have we rotated around the circular buffer? */
if (len > log_buf_len) {
unsigned long pos = (len & LOG_BUF_MASK);

s1 = log_buf + pos;
l1 = log_buf_len - pos;

s2 = log_buf;
l2 = pos;
}

list_for_each_entry (dump, dump_list, list) {
dump->fn(s1, l1, s2, l2);
}
}

ie we just always give the whole buffer (as two "sections", since it's a
circular buffer) to the dumper, and then the dumper can decide how much of
those buffers it is able to dump sanely.

If the dumper has some hardware limitation where it can only dump a single
aligned block, it can decide to just look at <s2,l2> which is the "latter"
part of the dump. It's usually going to be fine, and it should be
page-aligned (and you can even round up the size to a page, since the
worst that can happen is that you'll see some old printk output at the
end)

Look ma, no locking, no buffer allocations, no nothing.

Linus

2009-10-12 19:15:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Linus Torvalds <[email protected]> wrote:

>
>
> On Mon, 12 Oct 2009, Ingo Molnar wrote:
> >
> > * Andrew Morton <[email protected]> wrote:
> > >
> > > Perhaps oops_enter() is a good place to mark the start of the log, and
> > > flush it within oops_exit().
> >
> > Simplest would be to do the last 2K in oops_exit()? That gives the
> > oops, and the history leading up to it. Since the blocking is 2K,
> > the extra log output is for free.
>
> I agree, except I don't think it should be fixed to 2k.

Yeah - i cited 2K only because that is what mtdoops uses.

> void dump_kmsg(void)
> [...]
>
> Look ma, no locking, no buffer allocations, no nothing.

Neat ...

This could also be used for a warm-reboot preserve-memory thing as well.
A well-known 4K (or so) area to preserve and print out during the next
bootup after a crash. dump_kmsg() could copy the kernel's last will out
to that area, or so.

That would be cross-kernel compatible and the newly booted kernel image
wouldnt overwrite this area. (which it does currently via its
__log_buf[])

Ingo

2009-10-12 19:19:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops


* Carl-Daniel Hailfinger <[email protected]> wrote:

> > I wish it worked on any of the 10+ x86 systems i have. Is there
> > anyone who'd be interested in exploring whether warm BIOS reboots
> > work _anywhere_?
>
> AFAIK memory clearing is default off in coreboot for non-ECC RAM and
> default on for ECC RAM (to avoid parity errors on read, but that can
> probably be worked around). Unless I'm mistaken, the SeaBIOS BIOS
> compatibility layer on top of coreboot doesn't erase RAM at all, so
> contents can survive.
>
> No idea about classic AMI/Award/Phoenix/Insyde/whatever BIOS, though.

I wouldnt mind to support this for coreboot too, but it will only be
practical if there's at least one standard BIOS out of the many i run
that supports warm reboot in practice ...

Can try a test-patch on all of those systems btw. - we do support
warm-reboot on x86 in arch/x86/kernel/reboot.c:

/* Write 0x1234 to absolute memory location 0x472. The BIOS reads
this on booting to tell it to "Bypass memory test (also warm
boot)". This seems like a fairly standard thing that gets set by
REBOOT.COM programs, and the previous reset routine did this
too. */
*((unsigned short *)0x472) = reboot_mode;

But someone would have to come up with a test-patch to prove whether it
works really works.

Ingo

2009-10-12 19:20:18

by Dirk Hohndel

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

Here's what I get for traveling and not reading mail for two days...

On Mon, 2009-10-12 at 11:45 -0700, Linus Torvalds wrote:
>
> On Mon, 12 Oct 2009, Ingo Molnar wrote:
> >
> > * Andrew Morton <[email protected]> wrote:
> > >
> > > Perhaps oops_enter() is a good place to mark the start of the log, and
> > > flush it within oops_exit().
> >
> > Simplest would be to do the last 2K in oops_exit()? That gives the oops,
> > and the history leading up to it. Since the blocking is 2K, the extra
> > log output is for free.
>
> I agree, except I don't think it should be fixed to 2k.
>
> We should just dump as much as is "appropriate" for the dump device. It
> might be the last 2kB, it might be 8kB, it might be 64kB. We don't know,
> we don't care. The device may have its own per-device limits. Any extra
> data we get from before the oops is just gravy (often there might be
> interestign warning messages leadign up to the dump), and if the oops is
> too big for the dump device, it's not something we can do anything about
> anyway.

I'm working on something different but related - also using the ring
buffer and just getting as much from its tail as I am able to conserve.

The approach in my case is to write a 2D bar code to the screen and have
the user take a picture / submit the picture to kerneloops.org where it
then gets decoded back into the oops message. This is intended for
situations where you don't have access to other storage / network - or
where a picture of the screen is actually the easiest way to get to the
information.

Right now the project is slightly stalled as I am running into an
unexpected project on the decode side, but I'd love to make sure that
the core changes I'm doing integrate cleanly with this project...

>
> So the logic should literally be something like this:
>
> - kernel/printk.c:
>
> void dump_kmsg(void)
> {
> unsigned long len = ACCESS_ONCE(log_end);
> struct dump_device *dump;
> const char *s1, *s2;
> unsigned long l1, l2;
>
> s1 = "";
> l1 = 0;
> s2 = log_buf;
> l2 = len;
>
> /* Have we rotated around the circular buffer? */
> if (len > log_buf_len) {
> unsigned long pos = (len & LOG_BUF_MASK);
>
> s1 = log_buf + pos;
> l1 = log_buf_len - pos;
>
> s2 = log_buf;
> l2 = pos;
> }
>
> list_for_each_entry (dump, dump_list, list) {
> dump->fn(s1, l1, s2, l2);
> }
> }
>
> ie we just always give the whole buffer (as two "sections", since it's a
> circular buffer) to the dumper, and then the dumper can decide how much of
> those buffers it is able to dump sanely.

That's pretty close to what I do - only in my case the information then
doesn't get written to a device but instead gets compressed, encoded and
displayed on the framebuffer...

/D

--
Dirk Hohndel
Intel Open Source Technology Center

2009-10-13 07:58:59

by Simon Kagstrom

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Mon, 12 Oct 2009 11:45:15 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

> So the logic should literally be something like this:
>
> - kernel/printk.c:
>
> void dump_kmsg(void)
> {
> unsigned long len = ACCESS_ONCE(log_end);

OK, I'll get back with a few patches that does this later today.

// Simon

2009-10-13 09:00:46

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH] panic.c: export panic_on_oops

On Tue, 2009-10-13 at 09:58 +0200, Simon Kagstrom wrote:
> On Mon, 12 Oct 2009 11:45:15 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
>
> > So the logic should literally be something like this:
> >
> > - kernel/printk.c:
> >
> > void dump_kmsg(void)
> > {
> > unsigned long len = ACCESS_ONCE(log_end);
>
> OK, I'll get back with a few patches that does this later today.

Yeah, I think it is good idea to work on a generic support, and simplify
mtdoops.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-10-13 13:18:45

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH/RFC v5 0/5]: mtdoops: fixes and improvements

Hi!

Here are a couple of patches to mtdoops which I've been working on for
a while and which were recently discussed on the MTD list. They are
applied on top of Artems tree,

http://git.infradead.org/users/dedekind/l2-mtd-2.6.git

which apart from the mainline tree contains a panic fix from Aaro and
Artems mtdoops style cleanup patch.


Now, I'd like to warn sensitive readers that patch 5 *does* contain work
queue elements. However, I believe there is a good reason for this: The
mtd->write call is not good to call while oopsing, and mtd->panic_write
is sort of a last resort. Now, if we panic anyhow, mtd->panic_write
will be called, so the oops will still be written and the scheduled
work won't run anyway.


The patches are:

1. Avoid erasing already empty areas (the area would be erased at each
module load otherwise)

2. Keep track of mtdoops page cleanliness in an array. This allows
mtdoops_inc_counter to be called during panic (which fails in my
case with the current code in mtd->read, although I believe this is
MTD-driver dependent).

3. Make page size configurable to support longer messages. Mainly
needed for patch 3, but also allows longer messages to be stored
during panics (when the next oops area cannot be erased).

4. (kernel/printk.c) Add a dump_device as per Linus suggestion which
includes a dump_kmsg function which dumps the kernel log buffer to
registered devices.

5. Refactor mtdoops as a dump_device device instead of a console device.

ChangeLog:

v2:
* Patches have been reordered to keep the least controversial (?) first.
* Implement Artems suggestion to keep cleanliness in an array
* Use Aaros patch to fix the panic_on_oops problem

v3:
* Correct checkpatch issues

v4: Review corrections
* Rebase on top of "[PATCH] mtd: mtdoops: several minor cleanups"
* Patch 1 has been added
* Use 1 bit per oops page, and rename clean/dirty unused/used
* Don't initialize oops_page_dirty (it's a global structure anyway so
it will be cleared together with the rest of BSS)
* Rename mtdoops_page_size record_size (although only for the page
size setting)

v5: Corrections after panic_on_oops discussion
* Add dump_device
* Refactor mtdoops as a dump device.

2009-10-13 13:22:49

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH/RFC v5 1/5]: mtdoops: avoid erasing already empty areas

After having scanned the entire mtdoops area, mtdoops will erase it if
there are no mtdoops headers in it. However, empty and already erased
areas (i.e., without mtdoops headers) will therefore also be erased at
each startup.

This patch counts the number of unclean pages (neither empty nor with
the mtdoops header) and only erases if no headers are found and the area
is still unclean.

Signed-off-by: Simon Kagstrom <[email protected]>
---
drivers/mtd/mtdoops.c | 18 ++++++++++++++----
1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index 18c6c96..c785e1a 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -225,7 +225,7 @@ static void mtdoops_workfunc_write(struct work_struct *work)
static void find_next_position(struct mtdoops_context *cxt)
{
struct mtd_info *mtd = cxt->mtd;
- int ret, page, maxpos = 0;
+ int ret, page, maxpos = 0, unclean_pages = 0;
u32 count[2], maxcount = 0xffffffff;
size_t retlen;

@@ -237,10 +237,13 @@ static void find_next_position(struct mtdoops_context *cxt)
continue;
}

- if (count[1] != MTDOOPS_KERNMSG_MAGIC)
- continue;
if (count[0] == 0xffffffff)
continue;
+ if (count[1] != MTDOOPS_KERNMSG_MAGIC) {
+ /* Page is neither clean nor empty */
+ unclean_pages++;
+ continue;
+ }
if (maxcount == 0xffffffff) {
maxcount = count[0];
maxpos = page;
@@ -259,7 +262,14 @@ static void find_next_position(struct mtdoops_context *cxt)
if (maxcount == 0xffffffff) {
cxt->nextpage = 0;
cxt->nextcount = 1;
- schedule_work(&cxt->work_erase);
+ if (unclean_pages != 0) {
+ printk(KERN_INFO "mtdoops: cleaning area\n");
+ schedule_work(&cxt->work_erase);
+ } else {
+ printk(KERN_DEBUG "mtdoops: ready %d, %d (clean)\n",
+ cxt->nextpage, cxt->nextcount);
+ cxt->ready = 1;
+ }
return;
}

--
1.6.0.4

2009-10-13 13:23:16

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH/RFC v5 2/5]: mtdoops: Keep track of used/unused mtdoops pages in an array

This patch makes mtdoops keep track of used/unused pages in an array
instead of scanning the flash after a write. The advantage with this
approach is that it avoids calling mtd->read on a panic, which is not
possible for all mtd drivers.

Signed-off-by: Simon Kagstrom <[email protected]>
---
drivers/mtd/mtdoops.c | 72 ++++++++++++++++++++++++++++++++++++------------
1 files changed, 54 insertions(+), 18 deletions(-)

diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index c785e1a..6af340e 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -44,6 +44,7 @@ static struct mtdoops_context {
int oops_pages;
int nextpage;
int nextcount;
+ u32 *oops_page_used;
char *name;

void *oops_buf;
@@ -54,18 +55,47 @@ static struct mtdoops_context {
int writecount;
} oops_cxt;

+static void mark_page_used(struct mtdoops_context *cxt, int page)
+{
+ u32 *p = cxt->oops_page_used + page / (8 * sizeof(u32));
+ u32 idx = page % (8 * sizeof(u32));
+
+ *p = *p | (1 << idx);
+}
+
+static void mark_page_unused(struct mtdoops_context *cxt, int page)
+{
+ u32 *p = cxt->oops_page_used + page / (8 * sizeof(u32));
+ u32 idx = page % (8 * sizeof(u32));
+
+ *p = *p & ~(1 << idx);
+}
+
+static int page_is_used(struct mtdoops_context *cxt, int page)
+{
+ u32 *p = cxt->oops_page_used + page / (8 * sizeof(u32));
+ u32 idx = page % (8 * sizeof(u32));
+
+ return *p & (1 << idx);
+}
+
static void mtdoops_erase_callback(struct erase_info *done)
{
wait_queue_head_t *wait_q = (wait_queue_head_t *)done->priv;
wake_up(wait_q);
}

-static int mtdoops_erase_block(struct mtd_info *mtd, int offset)
+static int mtdoops_erase_block(struct mtdoops_context *cxt, int offset)
{
+ struct mtd_info *mtd = cxt->mtd;
+ u32 start_page_offset = mtd_div_by_eb(offset, mtd) * mtd->erasesize;
+ u32 start_page = start_page_offset / OOPS_PAGE_SIZE;
+ u32 erase_pages = mtd->erasesize / OOPS_PAGE_SIZE;
struct erase_info erase;
DECLARE_WAITQUEUE(wait, current);
wait_queue_head_t wait_q;
int ret;
+ int page;

init_waitqueue_head(&wait_q);
erase.mtd = mtd;
@@ -90,16 +120,15 @@ static int mtdoops_erase_block(struct mtd_info *mtd, int offset)
schedule(); /* Wait for erase to finish. */
remove_wait_queue(&wait_q, &wait);

+ /* Mark pages as unused */
+ for (page = start_page; page < start_page + erase_pages; page++)
+ mark_page_unused(cxt, page);
+
return 0;
}

static void mtdoops_inc_counter(struct mtdoops_context *cxt)
{
- struct mtd_info *mtd = cxt->mtd;
- size_t retlen;
- u32 count;
- int ret;
-
cxt->nextpage++;
if (cxt->nextpage >= cxt->oops_pages)
cxt->nextpage = 0;
@@ -107,17 +136,7 @@ static void mtdoops_inc_counter(struct mtdoops_context *cxt)
if (cxt->nextcount == 0xffffffff)
cxt->nextcount = 0;

- ret = mtd->read(mtd, cxt->nextpage * OOPS_PAGE_SIZE, 4,
- &retlen, (u_char *) &count);
- if (retlen != 4 || (ret < 0 && ret != -EUCLEAN)) {
- printk(KERN_ERR "mtdoops: read failure at %d (%td of 4 read), err %d\n",
- cxt->nextpage * OOPS_PAGE_SIZE, retlen, ret);
- schedule_work(&cxt->work_erase);
- return;
- }
-
- /* See if we need to erase the next block */
- if (count != 0xffffffff) {
+ if (page_is_used(cxt, cxt->nextpage)) {
schedule_work(&cxt->work_erase);
return;
}
@@ -168,7 +187,7 @@ badblock:
}

for (j = 0, ret = -1; (j < 3) && (ret < 0); j++)
- ret = mtdoops_erase_block(mtd, cxt->nextpage * OOPS_PAGE_SIZE);
+ ret = mtdoops_erase_block(cxt, cxt->nextpage * OOPS_PAGE_SIZE);

if (ret >= 0) {
printk(KERN_DEBUG "mtdoops: ready %d, %d\n",
@@ -209,6 +228,7 @@ static void mtdoops_write(struct mtdoops_context *cxt, int panic)
if (retlen != OOPS_PAGE_SIZE || ret < 0)
printk(KERN_ERR "mtdoops: write failure at %d (%td of %d written), error %d\n",
cxt->nextpage * OOPS_PAGE_SIZE, retlen, OOPS_PAGE_SIZE, ret);
+ mark_page_used(cxt, cxt->nextpage);

mtdoops_inc_counter(cxt);
}
@@ -230,6 +250,8 @@ static void find_next_position(struct mtdoops_context *cxt)
size_t retlen;

for (page = 0; page < cxt->oops_pages; page++) {
+ /* Assume the page is used */
+ mark_page_used(cxt, page);
ret = mtd->read(mtd, page * OOPS_PAGE_SIZE, 8, &retlen, (u_char *) &count[0]);
if (retlen != 8 || (ret < 0 && ret != -EUCLEAN)) {
printk(KERN_ERR "mtdoops: read failure at %d (%td of 8 read), err %d\n",
@@ -237,6 +259,8 @@ static void find_next_position(struct mtdoops_context *cxt)
continue;
}

+ if (count[0] == 0xffffffff && count[1] == 0xffffffff)
+ mark_page_unused(cxt, page);
if (count[0] == 0xffffffff)
continue;
if (count[1] != MTDOOPS_KERNMSG_MAGIC) {
@@ -283,6 +307,9 @@ static void find_next_position(struct mtdoops_context *cxt)
static void mtdoops_notify_add(struct mtd_info *mtd)
{
struct mtdoops_context *cxt = &oops_cxt;
+ u64 mtdoops_pages = mtd->size;
+
+ do_div(mtdoops_pages, OOPS_PAGE_SIZE);

if (cxt->name && !strcmp(mtd->name, cxt->name))
cxt->mtd_index = mtd->index;
@@ -302,6 +329,13 @@ static void mtdoops_notify_add(struct mtd_info *mtd)
return;
}

+ /* oops_page_used is a bit field */
+ cxt->oops_page_used = vmalloc(DIV_ROUND_UP(mtdoops_pages,
+ 8 * sizeof(u32)));
+ if (!cxt->oops_page_used) {
+ printk(KERN_ERR "Could not allocate page array\n");
+ return;
+ }
cxt->mtd = mtd;
if (mtd->size > INT_MAX)
cxt->oops_pages = INT_MAX / OOPS_PAGE_SIZE;
@@ -454,6 +488,8 @@ static void __exit mtdoops_console_exit(void)
unregister_console(&mtdoops_console);
kfree(cxt->name);
vfree(cxt->oops_buf);
+ if (cxt->oops_page_used)
+ vfree(cxt->oops_page_used);
}


--
1.6.0.4

2009-10-13 13:23:42

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH/RFC v5 3/5]: mtdoops: Make page (record) size configurable

The main justification for this is to allow catching long messages
during a panic, where the top part might otherwise be lost since moving
to the next block can require a flash erase.

Signed-off-by: Simon Kagstrom <[email protected]>
---
drivers/mtd/mtdoops.c | 77 ++++++++++++++++++++++++++++--------------------
1 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index 6af340e..c2a3b5c 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -32,9 +32,14 @@
#include <linux/spinlock.h>
#include <linux/interrupt.h>
#include <linux/mtd/mtd.h>
+#include <linux/log2.h>

#define MTDOOPS_KERNMSG_MAGIC 0x5d005d00
-#define OOPS_PAGE_SIZE 4096
+
+static unsigned long record_size = 4096;
+module_param(record_size, ulong, 0);
+MODULE_PARM_DESC(record_size,
+ "record size for MTD OOPS pages in bytes (default 4096)");

static struct mtdoops_context {
int mtd_index;
@@ -89,8 +94,8 @@ static int mtdoops_erase_block(struct mtdoops_context *cxt, int offset)
{
struct mtd_info *mtd = cxt->mtd;
u32 start_page_offset = mtd_div_by_eb(offset, mtd) * mtd->erasesize;
- u32 start_page = start_page_offset / OOPS_PAGE_SIZE;
- u32 erase_pages = mtd->erasesize / OOPS_PAGE_SIZE;
+ u32 start_page = start_page_offset / record_size;
+ u32 erase_pages = mtd->erasesize / record_size;
struct erase_info erase;
DECLARE_WAITQUEUE(wait, current);
wait_queue_head_t wait_q;
@@ -158,15 +163,15 @@ static void mtdoops_workfunc_erase(struct work_struct *work)
if (!mtd)
return;

- mod = (cxt->nextpage * OOPS_PAGE_SIZE) % mtd->erasesize;
+ mod = (cxt->nextpage * record_size) % mtd->erasesize;
if (mod != 0) {
- cxt->nextpage = cxt->nextpage + ((mtd->erasesize - mod) / OOPS_PAGE_SIZE);
+ cxt->nextpage = cxt->nextpage + ((mtd->erasesize - mod) / record_size);
if (cxt->nextpage >= cxt->oops_pages)
cxt->nextpage = 0;
}

while (mtd->block_isbad) {
- ret = mtd->block_isbad(mtd, cxt->nextpage * OOPS_PAGE_SIZE);
+ ret = mtd->block_isbad(mtd, cxt->nextpage * record_size);
if (!ret)
break;
if (ret < 0) {
@@ -174,20 +179,20 @@ static void mtdoops_workfunc_erase(struct work_struct *work)
return;
}
badblock:
- printk(KERN_WARNING "mtdoops: bad block at %08x\n",
- cxt->nextpage * OOPS_PAGE_SIZE);
+ printk(KERN_WARNING "mtdoops: bad block at %08lx\n",
+ cxt->nextpage * record_size);
i++;
- cxt->nextpage = cxt->nextpage + (mtd->erasesize / OOPS_PAGE_SIZE);
+ cxt->nextpage = cxt->nextpage + (mtd->erasesize / record_size);
if (cxt->nextpage >= cxt->oops_pages)
cxt->nextpage = 0;
- if (i == cxt->oops_pages / (mtd->erasesize / OOPS_PAGE_SIZE)) {
+ if (i == cxt->oops_pages / (mtd->erasesize / record_size)) {
printk(KERN_ERR "mtdoops: all blocks bad!\n");
return;
}
}

for (j = 0, ret = -1; (j < 3) && (ret < 0); j++)
- ret = mtdoops_erase_block(cxt, cxt->nextpage * OOPS_PAGE_SIZE);
+ ret = mtdoops_erase_block(cxt, cxt->nextpage * record_size);

if (ret >= 0) {
printk(KERN_DEBUG "mtdoops: ready %d, %d\n",
@@ -197,7 +202,7 @@ badblock:
}

if (mtd->block_markbad && ret == -EIO) {
- ret = mtd->block_markbad(mtd, cxt->nextpage * OOPS_PAGE_SIZE);
+ ret = mtd->block_markbad(mtd, cxt->nextpage * record_size);
if (ret < 0) {
printk(KERN_ERR "mtdoops: block_markbad failed, aborting\n");
return;
@@ -212,22 +217,22 @@ static void mtdoops_write(struct mtdoops_context *cxt, int panic)
size_t retlen;
int ret;

- if (cxt->writecount < OOPS_PAGE_SIZE)
+ if (cxt->writecount < record_size)
memset(cxt->oops_buf + cxt->writecount, 0xff,
- OOPS_PAGE_SIZE - cxt->writecount);
+ record_size - cxt->writecount);

if (panic)
- ret = mtd->panic_write(mtd, cxt->nextpage * OOPS_PAGE_SIZE,
- OOPS_PAGE_SIZE, &retlen, cxt->oops_buf);
+ ret = mtd->panic_write(mtd, cxt->nextpage * record_size,
+ record_size, &retlen, cxt->oops_buf);
else
- ret = mtd->write(mtd, cxt->nextpage * OOPS_PAGE_SIZE,
- OOPS_PAGE_SIZE, &retlen, cxt->oops_buf);
+ ret = mtd->write(mtd, cxt->nextpage * record_size,
+ record_size, &retlen, cxt->oops_buf);

cxt->writecount = 0;

- if (retlen != OOPS_PAGE_SIZE || ret < 0)
- printk(KERN_ERR "mtdoops: write failure at %d (%td of %d written), error %d\n",
- cxt->nextpage * OOPS_PAGE_SIZE, retlen, OOPS_PAGE_SIZE, ret);
+ if (retlen != record_size || ret < 0)
+ printk(KERN_ERR "mtdoops: write failure at %ld (%td of %ld written), error %d\n",
+ cxt->nextpage * record_size, retlen, record_size, ret);
mark_page_used(cxt, cxt->nextpage);

mtdoops_inc_counter(cxt);
@@ -252,10 +257,10 @@ static void find_next_position(struct mtdoops_context *cxt)
for (page = 0; page < cxt->oops_pages; page++) {
/* Assume the page is used */
mark_page_used(cxt, page);
- ret = mtd->read(mtd, page * OOPS_PAGE_SIZE, 8, &retlen, (u_char *) &count[0]);
+ ret = mtd->read(mtd, page * record_size, 8, &retlen, (u_char *) &count[0]);
if (retlen != 8 || (ret < 0 && ret != -EUCLEAN)) {
- printk(KERN_ERR "mtdoops: read failure at %d (%td of 8 read), err %d\n",
- page * OOPS_PAGE_SIZE, retlen, ret);
+ printk(KERN_ERR "mtdoops: read failure at %ld (%td of 8 read), err %d\n",
+ page * record_size, retlen, ret);
continue;
}

@@ -309,7 +314,7 @@ static void mtdoops_notify_add(struct mtd_info *mtd)
struct mtdoops_context *cxt = &oops_cxt;
u64 mtdoops_pages = mtd->size;

- do_div(mtdoops_pages, OOPS_PAGE_SIZE);
+ do_div(mtdoops_pages, record_size);

if (cxt->name && !strcmp(mtd->name, cxt->name))
cxt->mtd_index = mtd->index;
@@ -323,7 +328,7 @@ static void mtdoops_notify_add(struct mtd_info *mtd)
return;
}

- if (mtd->erasesize < OOPS_PAGE_SIZE) {
+ if (mtd->erasesize < record_size) {
printk(KERN_ERR "mtdoops: eraseblock size of MTD partition %d too small\n",
mtd->index);
return;
@@ -338,9 +343,9 @@ static void mtdoops_notify_add(struct mtd_info *mtd)
}
cxt->mtd = mtd;
if (mtd->size > INT_MAX)
- cxt->oops_pages = INT_MAX / OOPS_PAGE_SIZE;
+ cxt->oops_pages = INT_MAX / record_size;
else
- cxt->oops_pages = (int)mtd->size / OOPS_PAGE_SIZE;
+ cxt->oops_pages = (int)mtd->size / record_size;

find_next_position(cxt);

@@ -417,15 +422,15 @@ mtdoops_console_write(struct console *co, const char *s, unsigned int count)
cxt->writecount = 8;
}

- if (count + cxt->writecount > OOPS_PAGE_SIZE)
- count = OOPS_PAGE_SIZE - cxt->writecount;
+ if (count + cxt->writecount > record_size)
+ count = record_size - cxt->writecount;

memcpy(cxt->oops_buf + cxt->writecount, s, count);
cxt->writecount += count;

spin_unlock_irqrestore(&cxt->writecount_lock, flags);

- if (cxt->writecount == OOPS_PAGE_SIZE)
+ if (cxt->writecount == record_size)
mtdoops_console_sync();
}

@@ -464,8 +469,16 @@ static int __init mtdoops_console_init(void)
{
struct mtdoops_context *cxt = &oops_cxt;

+ if (!is_power_of_2(record_size)) {
+ printk(KERN_ERR "mtdoops: record_size must be a power of two\n");
+ return -EINVAL;
+ }
+ if (record_size < 4096) {
+ printk(KERN_ERR "mtdoops: record_size must be over 4096 bytes\n");
+ return -EINVAL;
+ }
cxt->mtd_index = -1;
- cxt->oops_buf = vmalloc(OOPS_PAGE_SIZE);
+ cxt->oops_buf = vmalloc(record_size);
if (!cxt->oops_buf) {
printk(KERN_ERR "mtdoops: failed to allocate buffer workspace\n");
return -ENOMEM;
--
1.6.0.4

2009-10-13 13:24:10

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

The core functionality is implemented as per Linus suggestion from

http://lists.infradead.org/pipermail/linux-mtd/2009-October/027620.html

a struct dump_device type has been added which contains callbacks to
dump the kernel log buffers on crashes. The dump_kmsg function gets
called from oops_exit() and panic() and invokes these callbacks.

Signed-off-by: Simon Kagstrom <[email protected]>
---
include/linux/dump_device.h | 33 +++++++++++++++++
kernel/panic.c | 3 ++
kernel/printk.c | 83 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 119 insertions(+), 0 deletions(-)
create mode 100644 include/linux/dump_device.h

diff --git a/include/linux/dump_device.h b/include/linux/dump_device.h
new file mode 100644
index 0000000..8c3f378
--- /dev/null
+++ b/include/linux/dump_device.h
@@ -0,0 +1,33 @@
+/*
+ * linux/include/dump_device.h
+ *
+ * Copyright (C) 2009 Net Insight
+ *
+ * Author: Simon Kagstrom <[email protected]>
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of this archive
+ * for more details.
+ */
+#ifndef _LINUX_DUMP_DEVICE_H
+#define _LINUX_DUMP_DEVICE_H
+
+#include <linux/list.h>
+
+struct dump_device {
+ void (*oops)(struct dump_device *, const char *, unsigned long,
+ const char *, unsigned long);
+ void (*panic)(struct dump_device *, const char *, unsigned long,
+ const char *, unsigned long);
+ int (*setup)(struct dump_device *);
+ void *priv;
+ struct list_head list;
+};
+
+void dump_kmsg(int panic);
+
+int register_dumpdevice(struct dump_device *dump, void *priv);
+
+void unregister_dumpdevice(struct dump_device *dump);
+
+#endif /* _LINUX_DUMP_DEVICE_H */
diff --git a/kernel/panic.c b/kernel/panic.c
index c0b33b8..e7dbf2b 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -8,6 +8,7 @@
* This function is used through-out the kernel (including mm and fs)
* to indicate a major problem.
*/
+#include <linux/dump_device.h>
#include <linux/debug_locks.h>
#include <linux/interrupt.h>
#include <linux/kallsyms.h>
@@ -76,6 +77,7 @@ NORET_TYPE void panic(const char * fmt, ...)
dump_stack();
#endif

+ dump_kmsg(1);
/*
* If we have crashed and we have a crash kernel loaded let it handle
* everything else.
@@ -341,6 +343,7 @@ void oops_exit(void)
{
do_oops_enter_exit();
print_oops_end_marker();
+ dump_kmsg(0);
}

#ifdef WANT_WARN_ON_SLOWPATH
diff --git a/kernel/printk.c b/kernel/printk.c
index f38b07f..5e1fe73 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -33,6 +33,7 @@
#include <linux/bootmem.h>
#include <linux/syscalls.h>
#include <linux/kexec.h>
+#include <linux/dump_device.h>

#include <asm/uaccess.h>

@@ -1405,3 +1406,85 @@ bool printk_timed_ratelimit(unsigned long *caller_jiffies,
}
EXPORT_SYMBOL(printk_timed_ratelimit);
#endif
+
+static LIST_HEAD(dump_list);
+static DEFINE_RWLOCK(dump_list_lock);
+
+/**
+ * register_dump_device - register a dumpdevice.
+ * @dump: pointer to the dump structure
+ * @priv: private data for the structure
+ *
+ * Adds a dump device to the system. The oops and panic callbacks
+ * in the structure will be called when the kernel oopses and panics
+ * respectively. At least one of these must be set. Returns zero on
+ * success and -EINVAL otherwise.
+ */
+int register_dumpdevice(struct dump_device *dump, void *priv)
+{
+ /* We need at least one of these set */
+ if (!dump->oops && !dump->panic)
+ return -EINVAL;
+ if (dump->setup && dump->setup(dump) != 0)
+ return -EINVAL;
+ dump->priv = priv;
+
+ INIT_LIST_HEAD(&dump->list);
+ write_lock(&dump_list_lock);
+ list_add(&dump->list, &dump_list);
+ write_unlock(&dump_list_lock);
+ return 0;
+}
+EXPORT_SYMBOL(register_dumpdevice);
+
+/**
+ * unregister_dump_device - unregister a dumpdevice.
+ * @dump: pointer to the dump structure
+ *
+ * Removes a dump device from the system.
+ */
+void unregister_dumpdevice(struct dump_device *dump)
+{
+ write_lock(&dump_list_lock);
+ list_del(&dump->list);
+ write_unlock(&dump_list_lock);
+}
+EXPORT_SYMBOL(unregister_dumpdevice);
+
+/**
+ * dump_kmsg - dump kernel log to dump devices.
+ * @panic: if this is a panic (1) or an oops (0)
+ *
+ * Iterate through each of the dump devices and call the oops/panic
+ * callbacks with the log buffer.
+ */
+void dump_kmsg(int panic)
+{
+ unsigned long len = ACCESS_ONCE(log_end);
+ struct dump_device *dump;
+ const char *s1, *s2;
+ unsigned long l1, l2;
+
+ s1 = "";
+ l1 = 0;
+ s2 = log_buf;
+ l2 = len;
+
+ /* Have we rotated around the circular buffer? */
+ if (len > log_buf_len) {
+ unsigned long pos = (len & LOG_BUF_MASK);
+
+ s1 = log_buf + pos;
+ l1 = log_buf_len - pos;
+
+ s2 = log_buf;
+ l2 = pos;
+ }
+
+ list_for_each_entry(dump, &dump_list, list) {
+ if (panic && dump->panic)
+ dump->panic(dump, s1, l1, s2, l2);
+ else if (!panic && dump->oops)
+ dump->oops(dump, s1, l1, s2, l2);
+ }
+}
--
1.6.0.4

2009-10-13 13:24:36

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH/RFC v5 5/5]: mtdoops: refactor as a dump_device

The last messages which happens before a crash might contain interesting
information about the crash. This patch reworks mtdoops using the
dump_device support instead of a console, which simplifies the code and
includes also messages before the oops started.

It registers handlers for both oopses and panics. On oopses, the write
is scheduled in a work queue (to be able to use the regular mtd->write
call), while panics call mtd->panic_write directly. Thus, if
panic_on_oops is set, the oops will be written out during the panic.

A parameter to specify which mtd device to use (number or name), as well
as a flag to toggle wheter to dump oopses or only panics (since oopses
can often be handled by regular syslog).

Signed-off-by: Simon Kagstrom <[email protected]>
---
drivers/mtd/mtdoops.c | 218 +++++++++++++++++++++----------------------------
1 files changed, 92 insertions(+), 126 deletions(-)

diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index c2a3b5c..920ce4f 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -29,10 +29,10 @@
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/delay.h>
-#include <linux/spinlock.h>
#include <linux/interrupt.h>
#include <linux/mtd/mtd.h>
#include <linux/log2.h>
+#include <linux/dump_device.h>

#define MTDOOPS_KERNMSG_MAGIC 0x5d005d00

@@ -41,7 +41,19 @@ module_param(record_size, ulong, 0);
MODULE_PARM_DESC(record_size,
"record size for MTD OOPS pages in bytes (default 4096)");

+static char mtddev[80];
+module_param_string(mtddev, mtddev, 80, 0600);
+MODULE_PARM_DESC(mtddev,
+ "name or index number of the MTD device to use");
+
+static int dump_oops = 1;
+module_param(dump_oops, int, 0);
+MODULE_PARM_DESC(dump_oops,
+ "set to 1 to dump oopses, 0 to only dump panics (default 1)");
+
static struct mtdoops_context {
+ struct dump_device dump;
+
int mtd_index;
struct work_struct work_erase;
struct work_struct work_write;
@@ -50,14 +62,9 @@ static struct mtdoops_context {
int nextpage;
int nextcount;
u32 *oops_page_used;
- char *name;
+ const char *name;

void *oops_buf;
-
- /* writecount and disabling ready are spin lock protected */
- spinlock_t writecount_lock;
- int ready;
- int writecount;
} oops_cxt;

static void mark_page_used(struct mtdoops_context *cxt, int page)
@@ -148,7 +155,6 @@ static void mtdoops_inc_counter(struct mtdoops_context *cxt)

printk(KERN_DEBUG "mtdoops: ready %d, %d (no erase)\n",
cxt->nextpage, cxt->nextcount);
- cxt->ready = 1;
}

/* Scheduled work - when we can't proceed without erasing a block */
@@ -197,7 +203,6 @@ badblock:
if (ret >= 0) {
printk(KERN_DEBUG "mtdoops: ready %d, %d\n",
cxt->nextpage, cxt->nextcount);
- cxt->ready = 1;
return;
}

@@ -215,11 +220,13 @@ static void mtdoops_write(struct mtdoops_context *cxt, int panic)
{
struct mtd_info *mtd = cxt->mtd;
size_t retlen;
+ u32 *stamp;
int ret;

- if (cxt->writecount < record_size)
- memset(cxt->oops_buf + cxt->writecount, 0xff,
- record_size - cxt->writecount);
+ /* Add mtdoops header to the buffer */
+ stamp = cxt->oops_buf;
+ *stamp++ = cxt->nextcount;
+ *stamp = MTDOOPS_KERNMSG_MAGIC;

if (panic)
ret = mtd->panic_write(mtd, cxt->nextpage * record_size,
@@ -228,8 +235,6 @@ static void mtdoops_write(struct mtdoops_context *cxt, int panic)
ret = mtd->write(mtd, cxt->nextpage * record_size,
record_size, &retlen, cxt->oops_buf);

- cxt->writecount = 0;
-
if (retlen != record_size || ret < 0)
printk(KERN_ERR "mtdoops: write failure at %ld (%td of %ld written), error %d\n",
cxt->nextpage * record_size, retlen, record_size, ret);
@@ -238,7 +243,6 @@ static void mtdoops_write(struct mtdoops_context *cxt, int panic)
mtdoops_inc_counter(cxt);
}

-
static void mtdoops_workfunc_write(struct work_struct *work)
{
struct mtdoops_context *cxt =
@@ -294,11 +298,9 @@ static void find_next_position(struct mtdoops_context *cxt)
if (unclean_pages != 0) {
printk(KERN_INFO "mtdoops: cleaning area\n");
schedule_work(&cxt->work_erase);
- } else {
+ } else
printk(KERN_DEBUG "mtdoops: ready %d, %d (clean)\n",
cxt->nextpage, cxt->nextcount);
- cxt->ready = 1;
- }
return;
}

@@ -308,13 +310,55 @@ static void find_next_position(struct mtdoops_context *cxt)
mtdoops_inc_counter(cxt);
}

+static void mtdoops_copy_buffer(struct mtdoops_context *cxt,
+ const char *s1, unsigned long l1,
+ const char *s2, unsigned long l2)
+{
+ unsigned long l1_cpy = min(l1, record_size - 8);
+ unsigned long l2_cpy = min(l2 - l1_cpy, record_size - l1_cpy - 8);
+ char *dst = cxt->oops_buf + 8; /* Skip the header */
+
+ memcpy(dst, s1, l1_cpy);
+ memcpy(dst + l1_cpy, s2, l2_cpy);
+}
+
+static void mtdoops_oops(struct dump_device *dump,
+ const char *s1, unsigned long l1,
+ const char *s2, unsigned long l2)
+{
+ struct mtdoops_context *cxt = dump->priv;
+
+ mtdoops_copy_buffer(cxt, s1, l1, s2, l2);
+
+ /* Write when done */
+ schedule_work(&cxt->work_write);
+}
+
+static void mtdoops_panic(struct dump_device *dump,
+ const char *s1, unsigned long l1,
+ const char *s2, unsigned long l2)
+{
+ struct mtdoops_context *cxt = dump->priv;
+
+ mtdoops_copy_buffer(cxt, s1, l1, s2, l2);
+ /* Must be written immediately */
+ if (cxt->mtd->panic_write)
+ mtdoops_write(cxt, 1);
+ else
+ printk(KERN_ERR "mtdoops: Cannot write from panic without panic_write\n");
+}
+
+static int mtdoops_setup(struct dump_device *dump)
+{
+ return 0;
+}

static void mtdoops_notify_add(struct mtd_info *mtd)
{
struct mtdoops_context *cxt = &oops_cxt;
- u64 mtdoops_pages = mtd->size;
+ u64 mtdoops_records = mtd->size;

- do_div(mtdoops_pages, record_size);
+ do_div(mtdoops_records, record_size);

if (cxt->name && !strcmp(mtd->name, cxt->name))
cxt->mtd_index = mtd->index;
@@ -335,7 +379,7 @@ static void mtdoops_notify_add(struct mtd_info *mtd)
}

/* oops_page_used is a bit field */
- cxt->oops_page_used = vmalloc(DIV_ROUND_UP(mtdoops_pages,
+ cxt->oops_page_used = vmalloc(DIV_ROUND_UP(mtdoops_records,
8 * sizeof(u32)));
if (!cxt->oops_page_used) {
printk(KERN_ERR "Could not allocate page array\n");
@@ -349,6 +393,11 @@ static void mtdoops_notify_add(struct mtd_info *mtd)

find_next_position(cxt);

+ cxt->dump.setup = mtdoops_setup;
+ if (dump_oops)
+ cxt->dump.oops = mtdoops_oops;
+ cxt->dump.panic = mtdoops_panic;
+ register_dumpdevice(&cxt->dump, cxt);
printk(KERN_INFO "mtdoops: Attached to MTD device %d\n", mtd->index);
}

@@ -359,116 +408,27 @@ static void mtdoops_notify_remove(struct mtd_info *mtd)
if (mtd->index != cxt->mtd_index || cxt->mtd_index < 0)
return;

+ unregister_dumpdevice(&cxt->dump);
cxt->mtd = NULL;
flush_scheduled_work();
}

-static void mtdoops_console_sync(void)
-{
- struct mtdoops_context *cxt = &oops_cxt;
- struct mtd_info *mtd = cxt->mtd;
- unsigned long flags;
-
- if (!cxt->ready || !mtd || cxt->writecount == 0)
- return;
-
- /*
- * Once ready is 0 and we've held the lock no further writes to the
- * buffer will happen
- */
- spin_lock_irqsave(&cxt->writecount_lock, flags);
- if (!cxt->ready) {
- spin_unlock_irqrestore(&cxt->writecount_lock, flags);
- return;
- }
- cxt->ready = 0;
- spin_unlock_irqrestore(&cxt->writecount_lock, flags);
-
- if (mtd->panic_write && (in_interrupt() || panic_on_oops))
- /* Interrupt context, we're going to panic so try and log */
- mtdoops_write(cxt, 1);
- else
- schedule_work(&cxt->work_write);
-}
-
-static void
-mtdoops_console_write(struct console *co, const char *s, unsigned int count)
-{
- struct mtdoops_context *cxt = co->data;
- struct mtd_info *mtd = cxt->mtd;
- unsigned long flags;
-
- if (!oops_in_progress) {
- mtdoops_console_sync();
- return;
- }
-
- if (!cxt->ready || !mtd)
- return;
-
- /* Locking on writecount ensures sequential writes to the buffer */
- spin_lock_irqsave(&cxt->writecount_lock, flags);
-
- /* Check ready status didn't change whilst waiting for the lock */
- if (!cxt->ready) {
- spin_unlock_irqrestore(&cxt->writecount_lock, flags);
- return;
- }
-
- if (cxt->writecount == 0) {
- u32 *stamp = cxt->oops_buf;
- *stamp++ = cxt->nextcount;
- *stamp = MTDOOPS_KERNMSG_MAGIC;
- cxt->writecount = 8;
- }
-
- if (count + cxt->writecount > record_size)
- count = record_size - cxt->writecount;
-
- memcpy(cxt->oops_buf + cxt->writecount, s, count);
- cxt->writecount += count;
-
- spin_unlock_irqrestore(&cxt->writecount_lock, flags);
-
- if (cxt->writecount == record_size)
- mtdoops_console_sync();
-}
-
-static int __init mtdoops_console_setup(struct console *co, char *options)
-{
- struct mtdoops_context *cxt = co->data;
-
- if (cxt->mtd_index != -1 || cxt->name)
- return -EBUSY;
- if (options) {
- cxt->name = kstrdup(options, GFP_KERNEL);
- return 0;
- }
- if (co->index == -1)
- return -EINVAL;
-
- cxt->mtd_index = co->index;
- return 0;
-}

static struct mtd_notifier mtdoops_notifier = {
.add = mtdoops_notify_add,
.remove = mtdoops_notify_remove,
};

-static struct console mtdoops_console = {
- .name = "ttyMTD",
- .write = mtdoops_console_write,
- .setup = mtdoops_console_setup,
- .unblank = mtdoops_console_sync,
- .index = -1,
- .data = &oops_cxt,
-};
-
-static int __init mtdoops_console_init(void)
+static int __init mtdoops_init(void)
{
struct mtdoops_context *cxt = &oops_cxt;
+ int mtd_index;
+ char *endp;

+ if (strlen(mtddev) == 0) {
+ printk(KERN_ERR "mtdoops: mtd device must be supplied\n");
+ return -EINVAL;
+ }
if (!is_power_of_2(record_size)) {
printk(KERN_ERR "mtdoops: record_size must be a power of two\n");
return -EINVAL;
@@ -477,37 +437,43 @@ static int __init mtdoops_console_init(void)
printk(KERN_ERR "mtdoops: record_size must be over 4096 bytes\n");
return -EINVAL;
}
+
+ /* Setup the MTD device to use */
+ cxt->name = mtddev;
cxt->mtd_index = -1;
+ mtd_index = simple_strtoul(mtddev, &endp, 0);
+ if (endp != mtddev)
+ cxt->mtd_index = mtd_index;
+
cxt->oops_buf = vmalloc(record_size);
if (!cxt->oops_buf) {
- printk(KERN_ERR "mtdoops: failed to allocate buffer workspace\n");
+ printk(KERN_ERR "Failed to allocate mtdoops write buffer workspace\n");
+ vfree(cxt->oops_buf);
return -ENOMEM;
}
+ memset(cxt->oops_buf, 0xff, record_size);

- spin_lock_init(&cxt->writecount_lock);
INIT_WORK(&cxt->work_erase, mtdoops_workfunc_erase);
INIT_WORK(&cxt->work_write, mtdoops_workfunc_write);

- register_console(&mtdoops_console);
register_mtd_user(&mtdoops_notifier);
return 0;
}

-static void __exit mtdoops_console_exit(void)
+static void __exit mtdoops_exit(void)
{
struct mtdoops_context *cxt = &oops_cxt;

unregister_mtd_user(&mtdoops_notifier);
- unregister_console(&mtdoops_console);
kfree(cxt->name);
- vfree(cxt->oops_buf);
if (cxt->oops_page_used)
vfree(cxt->oops_page_used);
+ vfree(cxt->oops_buf);
}


-subsys_initcall(mtdoops_console_init);
-module_exit(mtdoops_console_exit);
+module_init(mtdoops_init);
+module_exit(mtdoops_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Richard Purdie <[email protected]>");
--
1.6.0.4

2009-10-13 15:38:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics



On Tue, 13 Oct 2009, Simon Kagstrom wrote:
> +
> +struct dump_device {

I know I used that name myself in the example code I sent out, but
thinking about it more, I hate the name.

Most people think a "dump" is something much more than just the kernel
messages, so "dump_device" ends up being misleading. And the "device" part
is kind of pointless and redundant (it need not be a real device).

So I suspect it would be better to name it by what it does, and make it
clear what that is.

Maybe something like "struct kmsg_dumper" or similar. That is pretty
unambiguous, and nobody is going to get confused about what the semantics
are.

> + void (*oops)(struct dump_device *, const char *, unsigned long,
> + const char *, unsigned long);
> + void (*panic)(struct dump_device *, const char *, unsigned long,
> + const char *, unsigned long);

I don't much like this. Why separate 'oops' and 'panic' functions,
especially since we migth have many more causes to do a kmsg dump in the
future (ie 'shutdown', 'sysrq', 'suspend' etc etc)?

So separating them out as two different functions is just wrong. Make it
one function, and then perhaps you can add a

enum kmsg_dump_reason {
kmsg_dump_panic,
kmsg_dump_oops,
..
};

and pass it as an argument.

> + int (*setup)(struct dump_device *);

Why?

There seems to be no reason for this. Who ever registers the dumper should
just do the setup call itself. Why would we have a callback that just gets
called immediately, rather than have the registration code just do the
call itself?

> +int register_dumpdevice(struct dump_device *dump, void *priv)
> +{
> + /* We need at least one of these set */
> + if (!dump->oops && !dump->panic)
> + return -EINVAL;
> + if (dump->setup && dump->setup(dump) != 0)
> + return -EINVAL;

So the above two tests should be pointless.

> + dump->priv = priv;
> +
> + INIT_LIST_HEAD(&dump->list);

Don't do "INIT_LIST_HEAD()" here. It's pointless as far as I can tell (the
list_add() will initialize it), but even in general we should basically
have basic initialization done by callers if needed.

And judging by historical problems in areas like that, we should protect
against people registering the same dumper twice. One way to do that would
be to perhaps _require_ that the caller has initialized it, and then do a

if (!list_empty(&dump->list))
return -EBUSY;

(but I could also see using just a "registered" flag)

> + write_lock(&dump_list_lock);

This looks dubious. Dumping can obviously happen from interrupts, so _if_
you were to protect against dumpers, you'd need to use an interrupt-safe
lock.

Of course, you do not actually take the lock at dump time (which may be
intentional, and that is not necessarily wrong - taking locks at oops time
is generally not a good thing to do, and it may be entirely reasonable to
say "we take the risk of not locking properly in order to _maybe_ work
even if the lock is scrogged").

But if you don't take the lock at dump time (or, perhaps preferably, if
you make the dump-time lock be a "try_lock()" - maybe the oops is due to
dump list corruption, and if the dump_list_lock is held thew oopser
should simply not dump!), then you should probably use a spinlock rather
than an rwlock.

> + list_for_each_entry(dump, &dump_list, list) {
> + if (panic && dump->panic)
> + dump->panic(dump, s1, l1, s2, l2);
> + else if (!panic && dump->oops)
> + dump->oops(dump, s1, l1, s2, l2);
> + }

So this would just become

list_for_each_entry(dump, &dump_list, list)
dump->fn(dump, s1, l1, s2, l2, reason);

or something.

Linus

2009-10-14 13:35:54

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH v6 0/5]: mtdoops: fixes and improvements

Hi!

Here are a couple of patches to mtdoops which I've been working on for
a while and which were recently discussed on the MTD list. They are
applied on top of Artems tree,

http://git.infradead.org/users/dedekind/l2-mtd-2.6.git

which apart from the mainline tree contains a panic fix from Aaro and
Artems mtdoops style cleanup patch.


Now, I'd like to warn sensitive readers that patch 5 *does* contain work
queue elements. However, I believe there is a good reason for this: The
mtd->write call is not good to call while oopsing, and mtd->panic_write
is sort of a last resort. Now, if we panic anyhow, mtd->panic_write
will be called, so the oops will still be written and the scheduled
work won't run anyway.


The patches are:

1. Avoid erasing already empty areas (the area would be erased at each
module load otherwise)

2. Keep track of mtdoops page cleanliness in an array. This allows
mtdoops_inc_counter to be called during panic (which fails in my
case with the current code in mtd->read, although I believe this is
MTD-driver dependent).

3. Make page size configurable to support longer messages. Mainly
needed for patch 4, but also allows longer messages to be stored
during panics (when the next oops area cannot be erased).

4. (kernel/printk.c) Add a dump_device as per Linus suggestion which
includes a dump_kmsg function which dumps the kernel log buffer to
registered devices.

5. Refactor mtdoops as a dump_device device instead of a console device.

ChangeLog:

v2:
* Patches have been reordered to keep the least controversial (?) first.
* Implement Artems suggestion to keep cleanliness in an array
* Use Aaros patch to fix the panic_on_oops problem

v3:
* Correct checkpatch issues

v4: Review corrections
* Rebase on top of "[PATCH] mtd: mtdoops: several minor cleanups"
* Patch 1 has been added
* Use 1 bit per oops page, and rename clean/dirty unused/used
* Don't initialize oops_page_dirty (it's a global structure anyway so
it will be cleared together with the rest of BSS)
* Rename mtdoops_page_size record_size (although only for the page
size setting)

v5: Corrections after panic_on_oops discussion
* Add dump_device
* Refactor mtdoops as a dump device.

v6: More corrections (panic_on_oops and Anders Grafström, my room-mate at work)
* Pathces 2-5 modified
* Use set_bit/clear_bit primitives in patch 2
* Set permissions for mtdoops arguments correct in patch 3 and 5
* Rename files and structures after Linus suggestions in patch 4
* Correct output under some conditions in patch 5

// Simon

2009-10-14 13:41:46

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH v6 1/5]: mtdoops: avoid erasing already empty areas

After having scanned the entire mtdoops area, mtdoops will erase it if
there are no mtdoops headers in it. However, empty and already erased
areas (i.e., without mtdoops headers) will therefore also be erased at
each startup.

This patch counts the number of unclean pages (neither empty nor with
the mtdoops header) and only erases if no headers are found and the area
is still unclean.

Signed-off-by: Simon Kagstrom <[email protected]>
---
drivers/mtd/mtdoops.c | 18 ++++++++++++++----
1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index 18c6c96..c785e1a 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -225,7 +225,7 @@ static void mtdoops_workfunc_write(struct work_struct *work)
static void find_next_position(struct mtdoops_context *cxt)
{
struct mtd_info *mtd = cxt->mtd;
- int ret, page, maxpos = 0;
+ int ret, page, maxpos = 0, unclean_pages = 0;
u32 count[2], maxcount = 0xffffffff;
size_t retlen;

@@ -237,10 +237,13 @@ static void find_next_position(struct mtdoops_context *cxt)
continue;
}

- if (count[1] != MTDOOPS_KERNMSG_MAGIC)
- continue;
if (count[0] == 0xffffffff)
continue;
+ if (count[1] != MTDOOPS_KERNMSG_MAGIC) {
+ /* Page is neither clean nor empty */
+ unclean_pages++;
+ continue;
+ }
if (maxcount == 0xffffffff) {
maxcount = count[0];
maxpos = page;
@@ -259,7 +262,14 @@ static void find_next_position(struct mtdoops_context *cxt)
if (maxcount == 0xffffffff) {
cxt->nextpage = 0;
cxt->nextcount = 1;
- schedule_work(&cxt->work_erase);
+ if (unclean_pages != 0) {
+ printk(KERN_INFO "mtdoops: cleaning area\n");
+ schedule_work(&cxt->work_erase);
+ } else {
+ printk(KERN_DEBUG "mtdoops: ready %d, %d (clean)\n",
+ cxt->nextpage, cxt->nextcount);
+ cxt->ready = 1;
+ }
return;
}

--
1.6.0.4

2009-10-14 13:42:12

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH v6 2/5]: mtdoops: Keep track of used/unused mtdoops pages in an array

This patch makes mtdoops keep track of used/unused pages in an array
instead of scanning the flash after a write. The advantage with this
approach is that it avoids calling mtd->read on a panic, which is not
possible for all mtd drivers.

Signed-off-by: Simon Kagstrom <[email protected]>
Reviewed-by: Anders Grafstrom <[email protected]>
---
ChangeLog:
Review comments from Anders Grafström:
* Use set_bit/clear_bit and test_bit to check the page array
* Make oops_page_used a unsigned long and use BITS_PER_LONG
* Skip NULL-test for vfree

drivers/mtd/mtdoops.c | 62 ++++++++++++++++++++++++++++++++++--------------
1 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index c785e1a..08627c2 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -44,6 +44,7 @@ static struct mtdoops_context {
int oops_pages;
int nextpage;
int nextcount;
+ unsigned long *oops_page_used;
char *name;

void *oops_buf;
@@ -54,18 +55,38 @@ static struct mtdoops_context {
int writecount;
} oops_cxt;

+static void mark_page_used(struct mtdoops_context *cxt, int page)
+{
+ set_bit(page, cxt->oops_page_used);
+}
+
+static void mark_page_unused(struct mtdoops_context *cxt, int page)
+{
+ clear_bit(page, cxt->oops_page_used);
+}
+
+static int page_is_used(struct mtdoops_context *cxt, int page)
+{
+ return test_bit(page, cxt->oops_page_used);
+}
+
static void mtdoops_erase_callback(struct erase_info *done)
{
wait_queue_head_t *wait_q = (wait_queue_head_t *)done->priv;
wake_up(wait_q);
}

-static int mtdoops_erase_block(struct mtd_info *mtd, int offset)
+static int mtdoops_erase_block(struct mtdoops_context *cxt, int offset)
{
+ struct mtd_info *mtd = cxt->mtd;
+ u32 start_page_offset = mtd_div_by_eb(offset, mtd) * mtd->erasesize;
+ u32 start_page = start_page_offset / OOPS_PAGE_SIZE;
+ u32 erase_pages = mtd->erasesize / OOPS_PAGE_SIZE;
struct erase_info erase;
DECLARE_WAITQUEUE(wait, current);
wait_queue_head_t wait_q;
int ret;
+ int page;

init_waitqueue_head(&wait_q);
erase.mtd = mtd;
@@ -90,16 +111,15 @@ static int mtdoops_erase_block(struct mtd_info *mtd, int offset)
schedule(); /* Wait for erase to finish. */
remove_wait_queue(&wait_q, &wait);

+ /* Mark pages as unused */
+ for (page = start_page; page < start_page + erase_pages; page++)
+ mark_page_unused(cxt, page);
+
return 0;
}

static void mtdoops_inc_counter(struct mtdoops_context *cxt)
{
- struct mtd_info *mtd = cxt->mtd;
- size_t retlen;
- u32 count;
- int ret;
-
cxt->nextpage++;
if (cxt->nextpage >= cxt->oops_pages)
cxt->nextpage = 0;
@@ -107,17 +127,7 @@ static void mtdoops_inc_counter(struct mtdoops_context *cxt)
if (cxt->nextcount == 0xffffffff)
cxt->nextcount = 0;

- ret = mtd->read(mtd, cxt->nextpage * OOPS_PAGE_SIZE, 4,
- &retlen, (u_char *) &count);
- if (retlen != 4 || (ret < 0 && ret != -EUCLEAN)) {
- printk(KERN_ERR "mtdoops: read failure at %d (%td of 4 read), err %d\n",
- cxt->nextpage * OOPS_PAGE_SIZE, retlen, ret);
- schedule_work(&cxt->work_erase);
- return;
- }
-
- /* See if we need to erase the next block */
- if (count != 0xffffffff) {
+ if (page_is_used(cxt, cxt->nextpage)) {
schedule_work(&cxt->work_erase);
return;
}
@@ -168,7 +178,7 @@ badblock:
}

for (j = 0, ret = -1; (j < 3) && (ret < 0); j++)
- ret = mtdoops_erase_block(mtd, cxt->nextpage * OOPS_PAGE_SIZE);
+ ret = mtdoops_erase_block(cxt, cxt->nextpage * OOPS_PAGE_SIZE);

if (ret >= 0) {
printk(KERN_DEBUG "mtdoops: ready %d, %d\n",
@@ -209,6 +219,7 @@ static void mtdoops_write(struct mtdoops_context *cxt, int panic)
if (retlen != OOPS_PAGE_SIZE || ret < 0)
printk(KERN_ERR "mtdoops: write failure at %d (%td of %d written), error %d\n",
cxt->nextpage * OOPS_PAGE_SIZE, retlen, OOPS_PAGE_SIZE, ret);
+ mark_page_used(cxt, cxt->nextpage);

mtdoops_inc_counter(cxt);
}
@@ -230,6 +241,8 @@ static void find_next_position(struct mtdoops_context *cxt)
size_t retlen;

for (page = 0; page < cxt->oops_pages; page++) {
+ /* Assume the page is used */
+ mark_page_used(cxt, page);
ret = mtd->read(mtd, page * OOPS_PAGE_SIZE, 8, &retlen, (u_char *) &count[0]);
if (retlen != 8 || (ret < 0 && ret != -EUCLEAN)) {
printk(KERN_ERR "mtdoops: read failure at %d (%td of 8 read), err %d\n",
@@ -237,6 +250,8 @@ static void find_next_position(struct mtdoops_context *cxt)
continue;
}

+ if (count[0] == 0xffffffff && count[1] == 0xffffffff)
+ mark_page_unused(cxt, page);
if (count[0] == 0xffffffff)
continue;
if (count[1] != MTDOOPS_KERNMSG_MAGIC) {
@@ -283,6 +298,9 @@ static void find_next_position(struct mtdoops_context *cxt)
static void mtdoops_notify_add(struct mtd_info *mtd)
{
struct mtdoops_context *cxt = &oops_cxt;
+ u64 mtdoops_pages = mtd->size;
+
+ do_div(mtdoops_pages, OOPS_PAGE_SIZE);

if (cxt->name && !strcmp(mtd->name, cxt->name))
cxt->mtd_index = mtd->index;
@@ -302,6 +320,13 @@ static void mtdoops_notify_add(struct mtd_info *mtd)
return;
}

+ /* oops_page_used is a bit field */
+ cxt->oops_page_used = vmalloc(DIV_ROUND_UP(mtdoops_pages,
+ BITS_PER_LONG));
+ if (!cxt->oops_page_used) {
+ printk(KERN_ERR "Could not allocate page array\n");
+ return;
+ }
cxt->mtd = mtd;
if (mtd->size > INT_MAX)
cxt->oops_pages = INT_MAX / OOPS_PAGE_SIZE;
@@ -454,6 +479,7 @@ static void __exit mtdoops_console_exit(void)
unregister_console(&mtdoops_console);
kfree(cxt->name);
vfree(cxt->oops_buf);
+ vfree(cxt->oops_page_used);
}


--
1.6.0.4

2009-10-14 13:42:40

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH v6 3/5]: mtdoops: Make page (record) size configurable

The main justification for this is to allow catching long messages
during a panic, where the top part might otherwise be lost since moving
to the next block can require a flash erase.

Signed-off-by: Simon Kagstrom <[email protected]>
Reviewed-by: Anders Grafstrom <[email protected]>
---
ChangeLog:
Review comments from Anders Grafström and own fixes:
* Set record_size parameter permissions to 0400
* Don't require power-of-two sizes

drivers/mtd/mtdoops.c | 76 ++++++++++++++++++++++++++++--------------------
1 files changed, 44 insertions(+), 32 deletions(-)

diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index 08627c2..2870a11 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -34,7 +34,11 @@
#include <linux/mtd/mtd.h>

#define MTDOOPS_KERNMSG_MAGIC 0x5d005d00
-#define OOPS_PAGE_SIZE 4096
+
+static unsigned long record_size = 4096;
+module_param(record_size, ulong, 0400);
+MODULE_PARM_DESC(record_size,
+ "record size for MTD OOPS pages in bytes (default 4096)");

static struct mtdoops_context {
int mtd_index;
@@ -80,8 +84,8 @@ static int mtdoops_erase_block(struct mtdoops_context *cxt, int offset)
{
struct mtd_info *mtd = cxt->mtd;
u32 start_page_offset = mtd_div_by_eb(offset, mtd) * mtd->erasesize;
- u32 start_page = start_page_offset / OOPS_PAGE_SIZE;
- u32 erase_pages = mtd->erasesize / OOPS_PAGE_SIZE;
+ u32 start_page = start_page_offset / record_size;
+ u32 erase_pages = mtd->erasesize / record_size;
struct erase_info erase;
DECLARE_WAITQUEUE(wait, current);
wait_queue_head_t wait_q;
@@ -149,15 +153,15 @@ static void mtdoops_workfunc_erase(struct work_struct *work)
if (!mtd)
return;

- mod = (cxt->nextpage * OOPS_PAGE_SIZE) % mtd->erasesize;
+ mod = (cxt->nextpage * record_size) % mtd->erasesize;
if (mod != 0) {
- cxt->nextpage = cxt->nextpage + ((mtd->erasesize - mod) / OOPS_PAGE_SIZE);
+ cxt->nextpage = cxt->nextpage + ((mtd->erasesize - mod) / record_size);
if (cxt->nextpage >= cxt->oops_pages)
cxt->nextpage = 0;
}

while (mtd->block_isbad) {
- ret = mtd->block_isbad(mtd, cxt->nextpage * OOPS_PAGE_SIZE);
+ ret = mtd->block_isbad(mtd, cxt->nextpage * record_size);
if (!ret)
break;
if (ret < 0) {
@@ -165,20 +169,20 @@ static void mtdoops_workfunc_erase(struct work_struct *work)
return;
}
badblock:
- printk(KERN_WARNING "mtdoops: bad block at %08x\n",
- cxt->nextpage * OOPS_PAGE_SIZE);
+ printk(KERN_WARNING "mtdoops: bad block at %08lx\n",
+ cxt->nextpage * record_size);
i++;
- cxt->nextpage = cxt->nextpage + (mtd->erasesize / OOPS_PAGE_SIZE);
+ cxt->nextpage = cxt->nextpage + (mtd->erasesize / record_size);
if (cxt->nextpage >= cxt->oops_pages)
cxt->nextpage = 0;
- if (i == cxt->oops_pages / (mtd->erasesize / OOPS_PAGE_SIZE)) {
+ if (i == cxt->oops_pages / (mtd->erasesize / record_size)) {
printk(KERN_ERR "mtdoops: all blocks bad!\n");
return;
}
}

for (j = 0, ret = -1; (j < 3) && (ret < 0); j++)
- ret = mtdoops_erase_block(cxt, cxt->nextpage * OOPS_PAGE_SIZE);
+ ret = mtdoops_erase_block(cxt, cxt->nextpage * record_size);

if (ret >= 0) {
printk(KERN_DEBUG "mtdoops: ready %d, %d\n",
@@ -188,7 +192,7 @@ badblock:
}

if (mtd->block_markbad && ret == -EIO) {
- ret = mtd->block_markbad(mtd, cxt->nextpage * OOPS_PAGE_SIZE);
+ ret = mtd->block_markbad(mtd, cxt->nextpage * record_size);
if (ret < 0) {
printk(KERN_ERR "mtdoops: block_markbad failed, aborting\n");
return;
@@ -203,22 +207,22 @@ static void mtdoops_write(struct mtdoops_context *cxt, int panic)
size_t retlen;
int ret;

- if (cxt->writecount < OOPS_PAGE_SIZE)
+ if (cxt->writecount < record_size)
memset(cxt->oops_buf + cxt->writecount, 0xff,
- OOPS_PAGE_SIZE - cxt->writecount);
+ record_size - cxt->writecount);

if (panic)
- ret = mtd->panic_write(mtd, cxt->nextpage * OOPS_PAGE_SIZE,
- OOPS_PAGE_SIZE, &retlen, cxt->oops_buf);
+ ret = mtd->panic_write(mtd, cxt->nextpage * record_size,
+ record_size, &retlen, cxt->oops_buf);
else
- ret = mtd->write(mtd, cxt->nextpage * OOPS_PAGE_SIZE,
- OOPS_PAGE_SIZE, &retlen, cxt->oops_buf);
+ ret = mtd->write(mtd, cxt->nextpage * record_size,
+ record_size, &retlen, cxt->oops_buf);

cxt->writecount = 0;

- if (retlen != OOPS_PAGE_SIZE || ret < 0)
- printk(KERN_ERR "mtdoops: write failure at %d (%td of %d written), error %d\n",
- cxt->nextpage * OOPS_PAGE_SIZE, retlen, OOPS_PAGE_SIZE, ret);
+ if (retlen != record_size || ret < 0)
+ printk(KERN_ERR "mtdoops: write failure at %ld (%td of %ld written), error %d\n",
+ cxt->nextpage * record_size, retlen, record_size, ret);
mark_page_used(cxt, cxt->nextpage);

mtdoops_inc_counter(cxt);
@@ -243,10 +247,10 @@ static void find_next_position(struct mtdoops_context *cxt)
for (page = 0; page < cxt->oops_pages; page++) {
/* Assume the page is used */
mark_page_used(cxt, page);
- ret = mtd->read(mtd, page * OOPS_PAGE_SIZE, 8, &retlen, (u_char *) &count[0]);
+ ret = mtd->read(mtd, page * record_size, 8, &retlen, (u_char *) &count[0]);
if (retlen != 8 || (ret < 0 && ret != -EUCLEAN)) {
- printk(KERN_ERR "mtdoops: read failure at %d (%td of 8 read), err %d\n",
- page * OOPS_PAGE_SIZE, retlen, ret);
+ printk(KERN_ERR "mtdoops: read failure at %ld (%td of 8 read), err %d\n",
+ page * record_size, retlen, ret);
continue;
}

@@ -300,7 +304,7 @@ static void mtdoops_notify_add(struct mtd_info *mtd)
struct mtdoops_context *cxt = &oops_cxt;
u64 mtdoops_pages = mtd->size;

- do_div(mtdoops_pages, OOPS_PAGE_SIZE);
+ do_div(mtdoops_pages, record_size);

if (cxt->name && !strcmp(mtd->name, cxt->name))
cxt->mtd_index = mtd->index;
@@ -314,7 +318,7 @@ static void mtdoops_notify_add(struct mtd_info *mtd)
return;
}

- if (mtd->erasesize < OOPS_PAGE_SIZE) {
+ if (mtd->erasesize < record_size) {
printk(KERN_ERR "mtdoops: eraseblock size of MTD partition %d too small\n",
mtd->index);
return;
@@ -329,9 +333,9 @@ static void mtdoops_notify_add(struct mtd_info *mtd)
}
cxt->mtd = mtd;
if (mtd->size > INT_MAX)
- cxt->oops_pages = INT_MAX / OOPS_PAGE_SIZE;
+ cxt->oops_pages = INT_MAX / record_size;
else
- cxt->oops_pages = (int)mtd->size / OOPS_PAGE_SIZE;
+ cxt->oops_pages = (int)mtd->size / record_size;

find_next_position(cxt);

@@ -408,15 +412,15 @@ mtdoops_console_write(struct console *co, const char *s, unsigned int count)
cxt->writecount = 8;
}

- if (count + cxt->writecount > OOPS_PAGE_SIZE)
- count = OOPS_PAGE_SIZE - cxt->writecount;
+ if (count + cxt->writecount > record_size)
+ count = record_size - cxt->writecount;

memcpy(cxt->oops_buf + cxt->writecount, s, count);
cxt->writecount += count;

spin_unlock_irqrestore(&cxt->writecount_lock, flags);

- if (cxt->writecount == OOPS_PAGE_SIZE)
+ if (cxt->writecount == record_size)
mtdoops_console_sync();
}

@@ -455,8 +459,16 @@ static int __init mtdoops_console_init(void)
{
struct mtdoops_context *cxt = &oops_cxt;

+ if ((record_size & 4095) != 0) {
+ printk(KERN_ERR "mtdoops: record_size must be a multiple of 4096\n");
+ return -EINVAL;
+ }
+ if (record_size < 4096) {
+ printk(KERN_ERR "mtdoops: record_size must be over 4096 bytes\n");
+ return -EINVAL;
+ }
cxt->mtd_index = -1;
- cxt->oops_buf = vmalloc(OOPS_PAGE_SIZE);
+ cxt->oops_buf = vmalloc(record_size);
if (!cxt->oops_buf) {
printk(KERN_ERR "mtdoops: failed to allocate buffer workspace\n");
return -ENOMEM;
--
1.6.0.4

2009-10-14 13:43:07

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH v6 4/5]: core: Add kernel message dumper to call on oopses and panics

The core functionality is implemented as per Linus suggestion from

http://lists.infradead.org/pipermail/linux-mtd/2009-October/027620.html

(with the dump_kmsg implementation by Linus). A struct kmsg_dumper has
been added which contains a callback to dump the kernel log buffers on
crashes. The dump_kmsg function gets called from oops_exit() and panic()
and invokes this callbacks with the crash reason.

Signed-off-by: Simon Kagstrom <[email protected]>
Reviewed-by: Anders Grafstrom <[email protected]>
---
ChangeLog:
Review comments from Linus Torvalds and Anders Grafström:
* Rename structures and file names
* Remove setup callback and unify panic/oops callbacks and
instead add a reason parameter
* Use a regular spinlock and try it when dumping (fail
if held)
* Check if the dumper is already registered
* Various style fixes/cleanup

include/linux/kmsg_dump.h | 37 ++++++++++++++++
kernel/panic.c | 3 +
kernel/printk.c | 105 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 145 insertions(+), 0 deletions(-)
create mode 100644 include/linux/kmsg_dump.h

diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
new file mode 100644
index 0000000..a0d6b55
--- /dev/null
+++ b/include/linux/kmsg_dump.h
@@ -0,0 +1,37 @@
+/*
+ * linux/include/kmsg_dump.h
+ *
+ * Copyright (C) 2009 Net Insight AB
+ *
+ * Author: Simon Kagstrom <[email protected]>
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of this archive
+ * for more details.
+ */
+#ifndef _LINUX_KMSG_DUMP_H
+#define _LINUX_KMSG_DUMP_H
+
+#include <linux/list.h>
+
+enum kmsg_dump_reason {
+ kmsg_dump_oops,
+ kmsg_dump_panic,
+};
+
+struct kmsg_dumper {
+ void (*dump)(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
+ const char *s1, unsigned long l1,
+ const char *s2, unsigned long l2);
+ void *priv;
+ struct list_head list;
+ int registered;
+};
+
+void dump_kmsg(enum kmsg_dump_reason reason);
+
+int register_kmsg_dumper(struct kmsg_dumper *dumper, void *priv);
+
+void unregister_kmsg_dumper(struct kmsg_dumper *dumper);
+
+#endif /* _LINUX_DUMP_DEVICE_H */
diff --git a/kernel/panic.c b/kernel/panic.c
index c0b33b8..3a5a93f 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -10,6 +10,7 @@
*/
#include <linux/debug_locks.h>
#include <linux/interrupt.h>
+#include <linux/kmsg_dump.h>
#include <linux/kallsyms.h>
#include <linux/notifier.h>
#include <linux/module.h>
@@ -76,6 +77,7 @@ NORET_TYPE void panic(const char * fmt, ...)
dump_stack();
#endif

+ dump_kmsg(kmsg_dump_panic);
/*
* If we have crashed and we have a crash kernel loaded let it handle
* everything else.
@@ -341,6 +343,7 @@ void oops_exit(void)
{
do_oops_enter_exit();
print_oops_end_marker();
+ dump_kmsg(kmsg_dump_oops);
}

#ifdef WANT_WARN_ON_SLOWPATH
diff --git a/kernel/printk.c b/kernel/printk.c
index f38b07f..a97abb0 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -33,6 +33,8 @@
#include <linux/bootmem.h>
#include <linux/syscalls.h>
#include <linux/kexec.h>
+#include <linux/kmsg_dump.h>
+#include <linux/spinlock.h>

#include <asm/uaccess.h>

@@ -1405,3 +1407,106 @@ bool printk_timed_ratelimit(unsigned long *caller_jiffies,
}
EXPORT_SYMBOL(printk_timed_ratelimit);
#endif
+
+static LIST_HEAD(dump_list);
+static DEFINE_SPINLOCK(dump_list_lock);
+
+/**
+ * register_dump_device - register a kernel log dumper.
+ * @dump: pointer to the dump structure
+ * @priv: private data for the structure
+ *
+ * Adds a kernel log dumper to the system. The dump callback in the
+ * structure will be called when the kernel oopses or panics and must be
+ * set. Returns zero on success and -EINVAL or -EBUSY otherwise.
+ */
+int register_kmsg_dumper(struct kmsg_dumper *dumper, void *priv)
+{
+ unsigned long flags;
+
+ /* The dump callback needs to be set */
+ if (!dumper->dump)
+ return -EINVAL;
+
+ /* Don't allow registering multiple times */
+ if (dumper->registered)
+ return -EBUSY;
+
+ dumper->priv = priv;
+ dumper->registered = 1;
+
+ spin_lock_irqsave(&dump_list_lock, flags);
+ list_add(&dumper->list, &dump_list);
+ spin_unlock_irqrestore(&dump_list_lock, flags);
+ return 0;
+}
+EXPORT_SYMBOL(register_kmsg_dumper);
+
+/**
+ * unregister_dump_device - unregister a dumpdevice.
+ * @dump: pointer to the dump structure
+ *
+ * Removes a dump device from the system.
+ */
+void unregister_kmsg_dumper(struct kmsg_dumper *dumper)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&dump_list_lock, flags);
+ list_del(&dumper->list);
+ spin_unlock_irqrestore(&dump_list_lock, flags);
+}
+EXPORT_SYMBOL(unregister_kmsg_dumper);
+
+static const char *kmsg_reasons[] = {
+ [kmsg_dump_oops] = "oops",
+ [kmsg_dump_panic] = "panic",
+};
+
+static const char *kmsg_to_str(enum kmsg_dump_reason reason)
+{
+ if (reason > ARRAY_SIZE(kmsg_reasons) || reason < 0)
+ return "unknown";
+
+ return kmsg_reasons[reason];
+}
+
+/**
+ * dump_kmsg - dump kernel log to kernel message dumpers.
+ * @reason: the reason (oops, panic etc) for dumping
+ *
+ * Iterate through each of the dump devices and call the oops/panic
+ * callbacks with the log buffer.
+ */
+void dump_kmsg(enum kmsg_dump_reason reason)
+{
+ unsigned long len = ACCESS_ONCE(log_end);
+ struct kmsg_dumper *dumper;
+ const char *s1, *s2;
+ unsigned long l1, l2;
+
+ s1 = "";
+ l1 = 0;
+ s2 = log_buf;
+ l2 = len;
+
+ /* Have we rotated around the circular buffer? */
+ if (len > log_buf_len) {
+ unsigned long pos = len & LOG_BUF_MASK;
+
+ s1 = log_buf + pos;
+ l1 = log_buf_len - pos;
+
+ s2 = log_buf;
+ l2 = pos;
+ }
+
+ if (!spin_trylock(&dump_list_lock)) {
+ printk(KERN_ERR "dump_kmsg: dump list lock is held during %s, skipping dump\n",
+ kmsg_to_str(reason));
+ return;
+ }
+ list_for_each_entry(dumper, &dump_list, list)
+ dumper->dump(dumper, reason, s1, l1, s2, l2);
+ spin_unlock(&dump_list_lock);
+}
--
1.6.0.4

2009-10-14 13:43:33

by Simon Kagstrom

[permalink] [raw]
Subject: [PATCH v6 5/5]: mtdoops: refactor as a kmsg_dumper

The last messages which happens before a crash might contain interesting
information about the crash. This patch reworks mtdoops using the
kmsg_dumper support instead of a console, which simplifies the code and
also includes the messages before the oops started.

On oops callbacks, the MTD device write is scheduled in a work queue (to
be able to use the regular mtd->write call), while panics call
mtd->panic_write directly. Thus, if panic_on_oops is set, the oops will
be written out during the panic.

A parameter to specify which mtd device to use (number or name), as well
as a flag, writable at runtime, to toggle wheter to dump oopses or only
panics (since oopses can often be handled by regular syslog).

Signed-off-by: Simon Kagstrom <[email protected]>
Reviewed-by: Anders Grafstrom <[email protected]>
---
ChangeLog:
Review comments from Anders Grafström and myself:
* Fix mtddev and dump_oops parameter permissions (dump_oops is
now runtime-settable)
* Style cleanup here and there
* Add a MTDOOPS_HEADER_SIZE define instead of hardcoding the size
* Fix a double-free on unregistration
* Rework the patch according to patch 4
* Fix output problems for some conditions

drivers/mtd/mtdoops.c | 207 ++++++++++++++++++++-----------------------------
1 files changed, 84 insertions(+), 123 deletions(-)

diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index 2870a11..817265d 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -29,18 +29,31 @@
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/delay.h>
-#include <linux/spinlock.h>
#include <linux/interrupt.h>
#include <linux/mtd/mtd.h>
+#include <linux/kmsg_dump.h>

#define MTDOOPS_KERNMSG_MAGIC 0x5d005d00
+#define MTDOOPS_HEADER_SIZE 8

static unsigned long record_size = 4096;
module_param(record_size, ulong, 0400);
MODULE_PARM_DESC(record_size,
"record size for MTD OOPS pages in bytes (default 4096)");

+static char mtddev[80];
+module_param_string(mtddev, mtddev, 80, 0400);
+MODULE_PARM_DESC(mtddev,
+ "name or index number of the MTD device to use");
+
+static int dump_oops = 1;
+module_param(dump_oops, int, 0600);
+MODULE_PARM_DESC(dump_oops,
+ "set to 1 to dump oopses, 0 to only dump panics (default 1)");
+
static struct mtdoops_context {
+ struct kmsg_dumper dump;
+
int mtd_index;
struct work_struct work_erase;
struct work_struct work_write;
@@ -49,14 +62,9 @@ static struct mtdoops_context {
int nextpage;
int nextcount;
unsigned long *oops_page_used;
- char *name;
+ const char *name;

void *oops_buf;
-
- /* writecount and disabling ready are spin lock protected */
- spinlock_t writecount_lock;
- int ready;
- int writecount;
} oops_cxt;

static void mark_page_used(struct mtdoops_context *cxt, int page)
@@ -138,7 +146,6 @@ static void mtdoops_inc_counter(struct mtdoops_context *cxt)

printk(KERN_DEBUG "mtdoops: ready %d, %d (no erase)\n",
cxt->nextpage, cxt->nextcount);
- cxt->ready = 1;
}

/* Scheduled work - when we can't proceed without erasing a block */
@@ -187,7 +194,6 @@ badblock:
if (ret >= 0) {
printk(KERN_DEBUG "mtdoops: ready %d, %d\n",
cxt->nextpage, cxt->nextcount);
- cxt->ready = 1;
return;
}

@@ -205,11 +211,13 @@ static void mtdoops_write(struct mtdoops_context *cxt, int panic)
{
struct mtd_info *mtd = cxt->mtd;
size_t retlen;
+ u32 *hdr;
int ret;

- if (cxt->writecount < record_size)
- memset(cxt->oops_buf + cxt->writecount, 0xff,
- record_size - cxt->writecount);
+ /* Add mtdoops header to the buffer */
+ hdr = cxt->oops_buf;
+ hdr[0] = cxt->nextcount;
+ hdr[1] = MTDOOPS_KERNMSG_MAGIC;

if (panic)
ret = mtd->panic_write(mtd, cxt->nextpage * record_size,
@@ -218,17 +226,15 @@ static void mtdoops_write(struct mtdoops_context *cxt, int panic)
ret = mtd->write(mtd, cxt->nextpage * record_size,
record_size, &retlen, cxt->oops_buf);

- cxt->writecount = 0;
-
if (retlen != record_size || ret < 0)
printk(KERN_ERR "mtdoops: write failure at %ld (%td of %ld written), error %d\n",
cxt->nextpage * record_size, retlen, record_size, ret);
mark_page_used(cxt, cxt->nextpage);
+ memset(cxt->oops_buf, 0xff, record_size);

mtdoops_inc_counter(cxt);
}

-
static void mtdoops_workfunc_write(struct work_struct *work)
{
struct mtdoops_context *cxt =
@@ -247,10 +253,13 @@ static void find_next_position(struct mtdoops_context *cxt)
for (page = 0; page < cxt->oops_pages; page++) {
/* Assume the page is used */
mark_page_used(cxt, page);
- ret = mtd->read(mtd, page * record_size, 8, &retlen, (u_char *) &count[0]);
- if (retlen != 8 || (ret < 0 && ret != -EUCLEAN)) {
- printk(KERN_ERR "mtdoops: read failure at %ld (%td of 8 read), err %d\n",
- page * record_size, retlen, ret);
+ ret = mtd->read(mtd, page * record_size, MTDOOPS_HEADER_SIZE,
+ &retlen, (u_char *) &count[0]);
+ if (retlen != MTDOOPS_HEADER_SIZE ||
+ (ret < 0 && ret != -EUCLEAN)) {
+ printk(KERN_ERR "mtdoops: read failure at %ld (%td of %d read), err %d\n",
+ page * record_size, retlen,
+ MTDOOPS_HEADER_SIZE, ret);
continue;
}

@@ -287,7 +296,6 @@ static void find_next_position(struct mtdoops_context *cxt)
} else {
printk(KERN_DEBUG "mtdoops: ready %d, %d (clean)\n",
cxt->nextpage, cxt->nextcount);
- cxt->ready = 1;
}
return;
}
@@ -298,6 +306,41 @@ static void find_next_position(struct mtdoops_context *cxt)
mtdoops_inc_counter(cxt);
}

+static void mtdoops_do_dump(struct kmsg_dumper *dump,
+ enum kmsg_dump_reason reason, const char *s1, unsigned long l1,
+ const char *s2, unsigned long l2)
+{
+ struct mtdoops_context *cxt = dump->priv;
+ unsigned long s1_start, s2_start;
+ unsigned long l1_cpy, l2_cpy;
+ char *dst;
+
+ /* Only dump oopses if dump_oops is set */
+ if (reason == kmsg_dump_oops && !dump_oops)
+ return;
+
+ dst = cxt->oops_buf + MTDOOPS_HEADER_SIZE; /* Skip the header */
+ l2_cpy = min(l2, record_size - MTDOOPS_HEADER_SIZE);
+ l1_cpy = min(l1, record_size - MTDOOPS_HEADER_SIZE - l2_cpy);
+
+ s2_start = l2 - l2_cpy;
+ s1_start = l1 - l1_cpy;
+
+ memcpy(dst, s1 + s1_start, l1_cpy);
+ memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
+
+ /* Panics must be written immediately */
+ if (reason == kmsg_dump_panic) {
+ if (!cxt->mtd->panic_write)
+ printk(KERN_ERR "mtdoops: Cannot write from panic without panic_write\n");
+ else
+ mtdoops_write(cxt, 1);
+ return;
+ }
+
+ /* For other cases, schedule work to write it "nicely" */
+ schedule_work(&cxt->work_write);
+}

static void mtdoops_notify_add(struct mtd_info *mtd)
{
@@ -339,6 +382,8 @@ static void mtdoops_notify_add(struct mtd_info *mtd)

find_next_position(cxt);

+ cxt->dump.dump = mtdoops_do_dump;
+ register_kmsg_dumper(&cxt->dump, cxt);
printk(KERN_INFO "mtdoops: Attached to MTD device %d\n", mtd->index);
}

@@ -349,116 +394,27 @@ static void mtdoops_notify_remove(struct mtd_info *mtd)
if (mtd->index != cxt->mtd_index || cxt->mtd_index < 0)
return;

+ unregister_kmsg_dumper(&cxt->dump);
cxt->mtd = NULL;
flush_scheduled_work();
}

-static void mtdoops_console_sync(void)
-{
- struct mtdoops_context *cxt = &oops_cxt;
- struct mtd_info *mtd = cxt->mtd;
- unsigned long flags;
-
- if (!cxt->ready || !mtd || cxt->writecount == 0)
- return;
-
- /*
- * Once ready is 0 and we've held the lock no further writes to the
- * buffer will happen
- */
- spin_lock_irqsave(&cxt->writecount_lock, flags);
- if (!cxt->ready) {
- spin_unlock_irqrestore(&cxt->writecount_lock, flags);
- return;
- }
- cxt->ready = 0;
- spin_unlock_irqrestore(&cxt->writecount_lock, flags);
-
- if (mtd->panic_write && (in_interrupt() || panic_on_oops))
- /* Interrupt context, we're going to panic so try and log */
- mtdoops_write(cxt, 1);
- else
- schedule_work(&cxt->work_write);
-}
-
-static void
-mtdoops_console_write(struct console *co, const char *s, unsigned int count)
-{
- struct mtdoops_context *cxt = co->data;
- struct mtd_info *mtd = cxt->mtd;
- unsigned long flags;
-
- if (!oops_in_progress) {
- mtdoops_console_sync();
- return;
- }
-
- if (!cxt->ready || !mtd)
- return;
-
- /* Locking on writecount ensures sequential writes to the buffer */
- spin_lock_irqsave(&cxt->writecount_lock, flags);
-
- /* Check ready status didn't change whilst waiting for the lock */
- if (!cxt->ready) {
- spin_unlock_irqrestore(&cxt->writecount_lock, flags);
- return;
- }
-
- if (cxt->writecount == 0) {
- u32 *stamp = cxt->oops_buf;
- *stamp++ = cxt->nextcount;
- *stamp = MTDOOPS_KERNMSG_MAGIC;
- cxt->writecount = 8;
- }
-
- if (count + cxt->writecount > record_size)
- count = record_size - cxt->writecount;
-
- memcpy(cxt->oops_buf + cxt->writecount, s, count);
- cxt->writecount += count;
-
- spin_unlock_irqrestore(&cxt->writecount_lock, flags);
-
- if (cxt->writecount == record_size)
- mtdoops_console_sync();
-}
-
-static int __init mtdoops_console_setup(struct console *co, char *options)
-{
- struct mtdoops_context *cxt = co->data;
-
- if (cxt->mtd_index != -1 || cxt->name)
- return -EBUSY;
- if (options) {
- cxt->name = kstrdup(options, GFP_KERNEL);
- return 0;
- }
- if (co->index == -1)
- return -EINVAL;
-
- cxt->mtd_index = co->index;
- return 0;
-}

static struct mtd_notifier mtdoops_notifier = {
.add = mtdoops_notify_add,
.remove = mtdoops_notify_remove,
};

-static struct console mtdoops_console = {
- .name = "ttyMTD",
- .write = mtdoops_console_write,
- .setup = mtdoops_console_setup,
- .unblank = mtdoops_console_sync,
- .index = -1,
- .data = &oops_cxt,
-};
-
-static int __init mtdoops_console_init(void)
+static int __init mtdoops_init(void)
{
struct mtdoops_context *cxt = &oops_cxt;
+ int mtd_index;
+ char *endp;

+ if (strlen(mtddev) == 0) {
+ printk(KERN_ERR "mtdoops: mtd device must be supplied\n");
+ return -EINVAL;
+ }
if ((record_size & 4095) != 0) {
printk(KERN_ERR "mtdoops: record_size must be a multiple of 4096\n");
return -EINVAL;
@@ -467,36 +423,41 @@ static int __init mtdoops_console_init(void)
printk(KERN_ERR "mtdoops: record_size must be over 4096 bytes\n");
return -EINVAL;
}
+
+ /* Setup the MTD device to use */
+ cxt->name = mtddev;
cxt->mtd_index = -1;
+ mtd_index = simple_strtoul(mtddev, &endp, 0);
+ if (endp != mtddev)
+ cxt->mtd_index = mtd_index;
+
cxt->oops_buf = vmalloc(record_size);
if (!cxt->oops_buf) {
printk(KERN_ERR "mtdoops: failed to allocate buffer workspace\n");
return -ENOMEM;
}
+ memset(cxt->oops_buf, 0xff, record_size);

- spin_lock_init(&cxt->writecount_lock);
INIT_WORK(&cxt->work_erase, mtdoops_workfunc_erase);
INIT_WORK(&cxt->work_write, mtdoops_workfunc_write);

- register_console(&mtdoops_console);
register_mtd_user(&mtdoops_notifier);
return 0;
}

-static void __exit mtdoops_console_exit(void)
+static void __exit mtdoops_exit(void)
{
struct mtdoops_context *cxt = &oops_cxt;

unregister_mtd_user(&mtdoops_notifier);
- unregister_console(&mtdoops_console);
kfree(cxt->name);
vfree(cxt->oops_buf);
vfree(cxt->oops_page_used);
}


-subsys_initcall(mtdoops_console_init);
-module_exit(mtdoops_console_exit);
+module_init(mtdoops_init);
+module_exit(mtdoops_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Richard Purdie <[email protected]>");
--
1.6.0.4

2009-10-14 15:13:35

by Simon Kagstrom

[permalink] [raw]
Subject: Re: [PATCH v6 5/5]: mtdoops: refactor as a kmsg_dumper

On Wed, 14 Oct 2009 15:41:24 +0200
Simon Kagstrom <[email protected]> wrote:

> -static void __exit mtdoops_console_exit(void)
> +static void __exit mtdoops_exit(void)
> {
> struct mtdoops_context *cxt = &oops_cxt;
>
> unregister_mtd_user(&mtdoops_notifier);
> - unregister_console(&mtdoops_console);
> kfree(cxt->name);

The kfree should go as well since cxt->name is no longer allocated. I'll
await further comments and send a new patch tomorrow.

// Simon

2009-10-14 16:51:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v6 4/5]: core: Add kernel message dumper to call on oopses and panics



On Wed, 14 Oct 2009, Simon Kagstrom wrote:
>
> ChangeLog:
> Review comments from Linus Torvalds and Anders Grafstr?m:
> * Rename structures and file names
> * Remove setup callback and unify panic/oops callbacks and
> instead add a reason parameter
> * Use a regular spinlock and try it when dumping (fail
> if held)
> * Check if the dumper is already registered
> * Various style fixes/cleanup

Ok, looks fine to me now.

I do end up having one minor nit: let's change the calling convention of
the dump function to either be:

void (*dump)(void *priv, enum kmsg_dump_reason reason,
const char *s1, unsigned long l1,
const char *s2, unsigned long l2);

or let's just remove the 'priv' data from the dump entirely.

Right now, you pass in the whole 'kmsg_dumper' data structure, and then
you seem to expect that users look up their private context by looking
into that data structure with 'dumper->priv'. That's just ugly.

So if you want to have a callback value, just pass that in for the
callback.

And if you want to pass in the whole 'kmsg_dumper' data structure, then
use that pointer _itself_ as the context (ie you would embed the
'kmsg_dumper' in some data structure, and then you do

struct my_data *my_data = container_of(dumper, struct my_data, dumper);

or something like that.

But your current implementation mixes _both_ of the above approaches.
Which one are people going to use? Both work, and no, "there are multiple
ways to do the same thing" is not an advantage, it just leads to
confusion. And confusion isn't good, whatever the perl people say.

Linus

2009-10-15 05:12:37

by Vimal Singh

[permalink] [raw]
Subject: Re: [PATCH v6 5/5]: mtdoops: refactor as a kmsg_dumper

On Wed, Oct 14, 2009 at 8:42 PM, Simon Kagstrom
<[email protected]> wrote:
> On Wed, 14 Oct 2009 15:41:24 +0200
> Simon Kagstrom <[email protected]> wrote:
>
>> -static void __exit mtdoops_console_exit(void)
>> +static void __exit mtdoops_exit(void)
>> ?{
>> ? ? ? struct mtdoops_context *cxt = &oops_cxt;
>>
>> ? ? ? unregister_mtd_user(&mtdoops_notifier);
>> - ? ? unregister_console(&mtdoops_console);
>> ? ? ? kfree(cxt->name);
>
> The kfree should go as well since cxt->name is no longer allocated. I'll
> await further comments and send a new patch tomorrow.
>
Help description for 'MTD_OOPS' in Kconfig needs update after this
patch. Are you planning to add a patch for same in this series?

-vimal

> // Simon
>
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
>

2009-11-26 09:37:10

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

Just stumbled across this patch.

On Tue, 13 October 2009 15:22:35 +0200, Simon Kagstrom wrote:
> +void dump_kmsg(int panic)
> +{
> + unsigned long len = ACCESS_ONCE(log_end);
> + struct dump_device *dump;
> + const char *s1, *s2;
> + unsigned long l1, l2;
> +
> + s1 = "";
> + l1 = 0;
> + s2 = log_buf;
> + l2 = len;
> +
> + /* Have we rotated around the circular buffer? */
> + if (len > log_buf_len) {

I believe this bit is wrong. log_end is an unsigned int, so it can
wrap relatively quickly. If log_end just wrapped to 0 and dump_kmsg is
called, the amount of printk buffer stored appears to be 0 as well.

To avoid this case one could either declare log_end and len as u64,
causing extra computational costs. Or one could just use the
conditional code below unconditionally. That could write random or
zeroed printk buffer directly after bootup, but would never miss
information.

> + unsigned long pos = (len & LOG_BUF_MASK);
> +
> + s1 = log_buf + pos;
> + l1 = log_buf_len - pos;
> +
> + s2 = log_buf;
> + l2 = pos;
> + }
> +
> + list_for_each_entry(dump, &dump_list, list) {
> + if (panic && dump->panic)
> + dump->panic(dump, s1, l1, s2, l2);
> + else if (!panic && dump->oops)
> + dump->oops(dump, s1, l1, s2, l2);
> + }
> +}
> --
> 1.6.0.4

Jörn

--
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein

2009-11-30 07:28:59

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Thu, 2009-11-26 at 10:36 +0100, Jörn Engel wrote:
> Just stumbled across this patch.
>
> On Tue, 13 October 2009 15:22:35 +0200, Simon Kagstrom wrote:
> > +void dump_kmsg(int panic)
> > +{
> > + unsigned long len = ACCESS_ONCE(log_end);
> > + struct dump_device *dump;
> > + const char *s1, *s2;
> > + unsigned long l1, l2;
> > +
> > + s1 = "";
> > + l1 = 0;
> > + s2 = log_buf;
> > + l2 = len;
> > +
> > + /* Have we rotated around the circular buffer? */
> > + if (len > log_buf_len) {
>
> I believe this bit is wrong. log_end is an unsigned int, so it can
> wrap relatively quickly. If log_end just wrapped to 0 and dump_kmsg is
> called, the amount of printk buffer stored appears to be 0 as well.

To me it looks like 'log_end' is not supposed to wrap. What makes you
think it can? In which cases it can?

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-11-30 07:46:16

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Mon, 30 November 2009 09:27:51 +0200, Artem Bityutskiy wrote:
>
> To me it looks like 'log_end' is not supposed to wrap. What makes you
> think it can? In which cases it can?

It is a 32bit variable. Would do you expect happens once you reach
0xffffffff and add 1?

Jörn

--
This above all: to thine own self be true.
-- Shakespeare

2009-11-30 08:52:56

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Mon, 2009-11-30 at 08:46 +0100, Jörn Engel wrote:
> On Mon, 30 November 2009 09:27:51 +0200, Artem Bityutskiy wrote:
> >
> > To me it looks like 'log_end' is not supposed to wrap. What makes you
> > think it can? In which cases it can?
>
> It is a 32bit variable. Would do you expect happens once you reach
> 0xffffffff and add 1?

Yes, now I see log_end is an ever increasing variable.

How about this patch on top of the existing one (untested):

diff --git a/kernel/printk.c b/kernel/printk.c
index f711b99..66995ca 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1486,28 +1486,27 @@ static const char *kmsg_to_str(enum kmsg_dump_reason reason)
*/
void kmsg_dump(enum kmsg_dump_reason reason)
{
- unsigned long len = ACCESS_ONCE(log_end);
+ unsigned long end = ACCESS_ONCE(log_end) & LOG_BUF_MASK;
struct kmsg_dumper *dumper;
const char *s1, *s2;
unsigned long l1, l2;
unsigned long flags;

- s1 = "";
- l1 = 0;
- s2 = log_buf;
- l2 = len;
-
- /* Have we rotated around the circular buffer? */
- if (len > log_buf_len) {
- unsigned long pos = len & LOG_BUF_MASK;
-
- s1 = log_buf + pos;
- l1 = log_buf_len - pos;
-
- s2 = log_buf;
- l2 = pos;
+ /*
+ * Have we ever rotated around the circular buffer? If we never did,
+ * we have to have zeroes at the end.
+ */
+ if (log_buf[end]) {
+ s1 = log_buf + end;
+ l1 = log_buf_len - end;
+ } else {
+ s1 = "";
+ l1 = 0;
}

+ s2 = log_buf;
+ l2 = end;
+
if (!spin_trylock_irqsave(&dump_list_lock, flags)) {
printk(KERN_ERR "dump_kmsg: dump list lock is held during %s, skipping dump\n",
kmsg_to_str(reason));


Then the whole function will look like this:

/**
* kmsg_dump - dump kernel log to kernel message dumpers.
* @reason: the reason (oops, panic etc) for dumping
*
* Iterate through each of the dump devices and call the oops/panic
* callbacks with the log buffer.
*/
void kmsg_dump(enum kmsg_dump_reason reason)
{
unsigned long end = ACCESS_ONCE(log_end) & LOG_BUF_MASK;
struct kmsg_dumper *dumper;
const char *s1, *s2;
unsigned long l1, l2;
unsigned long flags;

/*
* Have we ever rotated around the circular buffer? If we never did,
* we have to have zeroes at the end.
*/
if (log_buf[end]) {
s1 = log_buf + end;
l1 = log_buf_len - end;
} else {
s1 = "";
l1 = 0;
}

s2 = log_buf;
l2 = end;

if (!spin_trylock_irqsave(&dump_list_lock, flags)) {
printk(KERN_ERR "dump_kmsg: dump list lock is held during %s, skipping dump\n",
kmsg_to_str(reason));
return;
}
list_for_each_entry(dumper, &dump_list, list)
dumper->dump(dumper, reason, s1, l1, s2, l2);
spin_unlock_irqrestore(&dump_list_lock, flags);
}

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-11-30 09:11:01

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Thu, 2009-11-26 at 10:36 +0100, Jörn Engel wrote:
> Just stumbled across this patch.
>
> On Tue, 13 October 2009 15:22:35 +0200, Simon Kagstrom wrote:
> > +void dump_kmsg(int panic)
> > +{
> > + unsigned long len = ACCESS_ONCE(log_end);
> > + struct dump_device *dump;
> > + const char *s1, *s2;
> > + unsigned long l1, l2;
> > +
> > + s1 = "";
> > + l1 = 0;
> > + s2 = log_buf;
> > + l2 = len;
> > +
> > + /* Have we rotated around the circular buffer? */
> > + if (len > log_buf_len) {
>
> I believe this bit is wrong. log_end is an unsigned int, so it can
> wrap relatively quickly. If log_end just wrapped to 0 and dump_kmsg is
> called, the amount of printk buffer stored appears to be 0 as well.

Simon, if you want your patches to be merged to 2.6.33, we should
address this issue ASAP, otherwise it'll be too late.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-11-30 09:28:33

by Simon Kagstrom

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Mon, 30 Nov 2009 11:09:47 +0200
Artem Bityutskiy <[email protected]> wrote:

> > I believe this bit is wrong. log_end is an unsigned int, so it can
> > wrap relatively quickly. If log_end just wrapped to 0 and dump_kmsg is
> > called, the amount of printk buffer stored appears to be 0 as well.
>
> Simon, if you want your patches to be merged to 2.6.33, we should
> address this issue ASAP, otherwise it'll be too late.

I'm testing your patch, I'll get back when I have verified that it works.

// Simon

2009-11-30 09:35:51

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Mon, 30 November 2009 10:51:58 +0200, Artem Bityutskiy wrote:
>
> How about this patch on top of the existing one (untested):
>
> + /*
> + * Have we ever rotated around the circular buffer? If we never did,
> + * we have to have zeroes at the end.
> + */
> + if (log_buf[end]) {
> + s1 = log_buf + end;
> + l1 = log_buf_len - end;
> + } else {
> + s1 = "";
> + l1 = 0;

So now you are assuming that a) the buffer is initially zeroed and b)
noone ever writes NUL to it. Is that correct?

I'm not sure whether those assumptions are valid. If they are, then
this will obviously work. Otherwise we can just always assume the
wrapped case.

Jörn

--
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug MacIlroy

2009-11-30 09:41:55

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Mon, 2009-11-30 at 10:35 +0100, Jörn Engel wrote:
> On Mon, 30 November 2009 10:51:58 +0200, Artem Bityutskiy wrote:
> >
> > How about this patch on top of the existing one (untested):
> >
> > + /*
> > + * Have we ever rotated around the circular buffer? If we never did,
> > + * we have to have zeroes at the end.
> > + */
> > + if (log_buf[end]) {
> > + s1 = log_buf + end;
> > + l1 = log_buf_len - end;
> > + } else {
> > + s1 = "";
> > + l1 = 0;
>
> So now you are assuming that a) the buffer is initially zeroed and b)
> noone ever writes NUL to it. Is that correct?

a) seems to be true because the buffer is either a static array or a
bootmem alloc, which seems to memzero the buffers it returns, at least
AFAICS. But I did not test this.

vs b). well, the printk ring buffer should contain ASCII, so I assumed
binary zeroes should not be possible there.

> I'm not sure whether those assumptions are valid. If they are, then
> this will obviously work. Otherwise we can just always assume the
> wrapped case.

Of course someone who has more knowlege about the printk buffer should
comment on this.

The other alternative I was thinking about was to introduce a boolean
flag, and set it to one as soon as 'lon_end' becomes larger than
'log_buf_len'.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

2009-11-30 09:53:29

by Simon Kagstrom

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Mon, 30 Nov 2009 11:40:55 +0200
Artem Bityutskiy <[email protected]> wrote:

> > > + /*
> > > + * Have we ever rotated around the circular buffer? If we never did,
> > > + * we have to have zeroes at the end.
> > > + */
> > > + if (log_buf[end]) {
> > > + s1 = log_buf + end;
> > > + l1 = log_buf_len - end;
> > > + } else {
> > > + s1 = "";
> > > + l1 = 0;
> >
> > So now you are assuming that a) the buffer is initially zeroed and b)
> > noone ever writes NUL to it. Is that correct?
>
> a) seems to be true because the buffer is either a static array or a
> bootmem alloc, which seems to memzero the buffers it returns, at least
> AFAICS. But I did not test this.

True as far as I can see.

> vs b). well, the printk ring buffer should contain ASCII, so I assumed
> binary zeroes should not be possible there.

Yes, if this would be printed with printk it sounds like a bug to me.

I've tested your patch, and it seems to work fine both 1) before the
buffer is filled, 2) when the buffer has wrapped and 3) when log_end
wraps (although that part was a quick hack).

So

Tested-by: Simon Kagstrom <[email protected]>

// Simon

2009-11-30 09:55:01

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Mon, 30 November 2009 11:40:55 +0200, Artem Bityutskiy wrote:
>
> The other alternative I was thinking about was to introduce a boolean
> flag, and set it to one as soon as 'lon_end' becomes larger than
> 'log_buf_len'.

I've thought about that as well. The drawback is that we would have to
check in the regular printk path - by the time the panic notifier is
running it is too late. So it adds cost to the regular case.

Jörn

--
There are three principal ways to lose money: wine, women, and engineers.
While the first two are more pleasant, the third is by far the more certain.
-- Baron Rothschild

2009-11-30 10:23:40

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Mon, 2009-11-30 at 10:51 +0200, Artem Bityutskiy wrote:
> On Mon, 2009-11-30 at 08:46 +0100, Jörn Engel wrote:
> > On Mon, 30 November 2009 09:27:51 +0200, Artem Bityutskiy wrote:
> > >
> > > To me it looks like 'log_end' is not supposed to wrap. What makes you
> > > think it can? In which cases it can?
> >
> > It is a 32bit variable. Would do you expect happens once you reach
> > 0xffffffff and add 1?
>
> Yes, now I see log_end is an ever increasing variable.
>
> How about this patch on top of the existing one (untested):

I suspect you should be modelling this more closely on the code in
do_syslog() for case 3. The end is at (log_end & LOG_BUF_MASK), and you
have (logged_chars) to write.

This mean that you won't write messages to the log which were 'cleared'
by 'dmesg -c', but that's acceptable, I think.

Why are you setting s1 to "" in the case where l1 is zero, btw? Can't it
be NULL?

diff --git a/kernel/printk.c b/kernel/printk.c
index f711b99..d150c57 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1486,26 +1486,33 @@ static const char *kmsg_to_str(enum kmsg_dump_reason reason)
*/
void kmsg_dump(enum kmsg_dump_reason reason)
{
- unsigned long len = ACCESS_ONCE(log_end);
+ unsigned long end;
+ unsigned chars;
struct kmsg_dumper *dumper;
const char *s1, *s2;
unsigned long l1, l2;
unsigned long flags;

- s1 = "";
- l1 = 0;
- s2 = log_buf;
- l2 = len;
-
- /* Have we rotated around the circular buffer? */
- if (len > log_buf_len) {
- unsigned long pos = len & LOG_BUF_MASK;
+ /* Theoretically, the log could move on after we do this, but
+ there's not a log we can do about that. The new messages
+ will overwrite the start of what we dump. */
+ spin_lock_irqsave(&logbuf_lock, flags);
+ end = log_end & LOG_BUF_MASK;
+ chars = logged_chars;
+ spin_unlock_irqrestore(&logbuf_lock, flags);

- s1 = log_buf + pos;
- l1 = log_buf_len - pos;
+ if (logged_chars > end) {
+ s1 = log_buf + log_buf_len - logged_chars + end;
+ l1 = logged_chars - end;

s2 = log_buf;
- l2 = pos;
+ l2 = end;
+ } else {
+ s1 = "";
+ l1 = 0;
+
+ s2 = log_buf + end - logged_chars;
+ l2 = logged_chars;
}

if (!spin_trylock_irqsave(&dump_list_lock, flags)) {


--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-11-30 10:27:33

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH/RFC v5 4/5]: core: Add dump device to call on oopses and panics

On Mon, 2009-11-30 at 10:23 +0000, David Woodhouse wrote:
> + /* Theoretically, the log could move on after we do this, but
> + there's not a log we can do about that. The new messages
> + will overwrite the start of what we dump. */
> + spin_lock_irqsave(&logbuf_lock, flags);
> + end = log_end & LOG_BUF_MASK;
> + chars = logged_chars;
> + spin_unlock_irqrestore(&logbuf_lock, flags);

Actually that's not true -- we _could_ hold the logbuf_lock until the
end of the function. Not entirely sure we _want_ to though...

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation