2000-12-17 15:05:42

by Lars Marowsky-Bree

Subject: Monitoring filesystems / blockdevice for errors

Good morning,

currently, there is no way for an external application to monitor whether a
filesystem or underlying block device has hit an error condition - internal
inconsistency, read or write error, whatever.

Short of parsing syslog messages, which isn't particularly great.

This is necessary for server monitoring in general.

I don't have a real idea how this could be added, short of adding a field to
/proc/partitions (error count) or something similar.
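
To make the error-count idea concrete, here is a minimal sketch of how a monitor might consume such a field. The extra "errors" column is purely hypothetical (the real /proc/partitions has no such field), and the sample text and function names are made up for illustration.

```python
# Sketch only: the trailing "errors" column is hypothetical; the real
# /proc/partitions does not provide it.

def parse_error_counts(text):
    """Return {device_name: error_count} from /proc/partitions-style text."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose first field is not a number.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        _major, _minor, _blocks, name, errors = fields
        counts[name] = int(errors)
    return counts


def new_errors(previous, current):
    """Return {device_name: increase} for devices whose count went up."""
    increases = {}
    for name, count in current.items():
        delta = count - previous.get(name, 0)
        if delta > 0:
            increases[name] = delta
    return increases


# Two made-up snapshots, as a monitor might read them a few seconds apart.
SAMPLE_BEFORE = """\
major minor  #blocks  name  errors

   8     0   8890560 sda 0
   8     1   8883913 sda1 0
"""

SAMPLE_AFTER = """\
major minor  #blocks  name  errors

   8     0   8890560 sda 2
   8     1   8883913 sda1 0
"""

if __name__ == "__main__":
    before = parse_error_counts(SAMPLE_BEFORE)
    after = parse_error_counts(SAMPLE_AFTER)
    print(new_errors(before, after))  # prints {'sda': 2}
```

A monitor built this way would have to re-read the file periodically and compare snapshots, so it reports that errors happened but does not say what kind they were.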

Comments?

Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl


2000-12-17 18:55:05

by Mark Hahn

Subject: Re: Monitoring filesystems / blockdevice for errors

> currently, there is no way for an external application to monitor whether a
> filesystem or underlying block device has hit an error condition - internal
> inconsistency, read or write error, whatever.
>
> Short of parsing syslog messages, which isn't particularly great.

what's wrong with it? reinventing /proc/kmsg and klogd would be tre gross.

> I don't have a real idea how this could be added, short of adding a field to
> /proc/partitions (error count) or something similar.

for reporting errors, that might be OK, but it's not a particularly nice
_notification_ mechanism...

regards, mark hahn.

2000-12-17 19:14:31

by Lars Marowsky-Bree

Subject: Re: Monitoring filesystems / blockdevice for errors

On 2000-12-17T13:23:52,
Mark Hahn <[email protected]> said:

> > Short of parsing syslog messages, which isn't particularly great.
> what's wrong with it?

Because it means having to know about all potential messages the filesystems
might dump out.
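
A small sketch of why log scraping is fragile: the monitor only recognises messages it already has a pattern for. The two patterns and the sample lines below are illustrative, not an authoritative list of kernel messages, and the "somefs" line is invented.

```python
import re

# Illustrative patterns only. A real monitor would need an entry for every
# message any driver or filesystem might ever print.
KNOWN_PATTERNS = [
    ("fs_error", re.compile(r"EXT2-fs error \(device (?P<dev>[^)]+)\)")),
    ("io_error", re.compile(r"I/O error, dev (?P<dev>[^,\s]+)")),
]


def classify(line):
    """Return (kind, device) for a recognised log line, else None."""
    for kind, pattern in KNOWN_PATTERNS:
        match = pattern.search(line)
        if match:
            return kind, match.group("dev")
    return None


if __name__ == "__main__":
    samples = [
        "EXT2-fs error (device sda1): bad block bitmap",
        "end_request: I/O error, dev sda, sector 1234",
        "somefs: metadata checksum mismatch on sdb1",  # invented, no pattern
    ]
    for line in samples:
        print(classify(line), "<-", line)
```

The third sample is a genuine error report, yet it is silently ignored because nobody wrote a pattern for it.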

> reinventing /proc/kmsg and klogd would be tre gross.

Well, only one process can read kmsg and get notified about new messages at
any time, so that makes the monitoring depend on klogd/syslogd working, which
given a write error by syslog might not be the case...

> > I don't have a real idea how this could be added, short of adding a field to
> > /proc/partitions (error count) or something similar.
> for reporting errors, that might be OK, but it's not a particularly nice
> _notification_ mechanism...

Well, yes.

Sincerely,
Lars Marowsky-Brée <[email protected]>

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl

2000-12-18 05:59:45

by Peter Samuelson

Subject: Re: Monitoring filesystems / blockdevice for errors


[Mark Hahn]
> > reinventing /proc/kmsg and klogd would be tre gross.

[Lars Marowsky-Bree]
> Well, only one process can read kmsg and get notified about new
> messages at any time, so that makes the monitoring depend on
> klogd/syslogd working, which given a write error by syslog might not
> be the case...

So rewrite klogd to do something much simpler for serious errors (yes
they will be tagged as such) before trying to pass them on to syslogd.
Or does it already do this? It's a userspace problem.
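
As a rough userspace sketch of that idea (not how klogd actually works): kernel messages carry a "<N>" priority prefix, with 0 the most severe and 7 debug, so a reader can hand serious lines to a simple handler first and only then forward everything to the normal logging path. The function names, the threshold, and the fallback level for lines without a prefix are assumptions made for illustration.

```python
import re

PRIORITY_RE = re.compile(r"^<(\d)>(.*)$")
KERN_ERR = 3  # printk levels run from 0 (emergency) to 7 (debug)


def split_priority(raw_line, default=4):
    """Split '<3>text' into (3, 'text'); 'default' is an assumed fallback."""
    match = PRIORITY_RE.match(raw_line)
    if match:
        return int(match.group(1)), match.group(2)
    return default, raw_line


def dispatch(raw_lines, on_serious, on_normal, threshold=KERN_ERR):
    """Call on_serious first for severe lines, then on_normal for every line."""
    for raw in raw_lines:
        level, message = split_priority(raw)
        if level <= threshold:
            # Simple path first: it must not depend on syslogd being able to write.
            on_serious(level, message)
        on_normal(level, message)


if __name__ == "__main__":
    lines = ["<3>I/O error on sda", "<6>eth0: link up", "<0>system is going down"]
    dispatch(
        lines,
        on_serious=lambda lvl, msg: print("SERIOUS", lvl, msg),
        on_normal=lambda lvl, msg: print("log", lvl, msg),
    )
```

Filtering on priority alone says how bad a message is, not what failed, so the handler would still have to interpret the free-form text.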

Peter

2000-12-18 09:17:23

by Jan-Benedict Glaw

Subject: Re: Monitoring filesystems / blockdevice for errors

On Sun, Dec 17, 2000 at 11:28:46PM -0600, Peter Samuelson wrote:
> [Mark Hahn]
> > > reinventing /proc/kmsg and klogd would be tre gross.
>
> [Lars Marowsky-Bree]
> > Well, only one process can read kmsg and get notified about new
> > messages at any time, so that makes the monitoring depend on
> > klogd/syslogd working, which given a write error by syslog might not
> > be the case...
>
> So rewrite klogd to do something much simpler for serious errors (yes
> they will be tagged as such) before trying to pass them on to syslogd.
> Or does it already do this? It's a userspace problem.

Hmmm... Even if LMB and I are often of quite different opinions, I
think only modifying klogd is not enough. LMB stated that a userspace
tool would need to know every error message that could possibly be
generated. Cleaning up all messages would be the first step to prepare
for failure reports to userspace. I.e., what errors do we have?

- Sense errors (recoverable)              \
-   "     "    (unrecoverable)             > for all kinds of devices
- Complete device failure (HDD is gone)   /
- Data failure (wrong ext2 bitmaps)         for all FS
- RAM's ECC/parity errors
- possibly some more;)

Cleaning up all error messages (maybe using exactly two lines: one for kind
of failure, one for device/RAM/fs specific messages) could help a lot
and doesn't hurt badly (code doesn't get really slower as these paths
are more-or-less never taken; but there is a little bit more bloat...).

With such an infrastructure, klogd could pass those lines to an external
helper (and additionally to syslog).
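
One way the two-line scheme could look from the userspace side, as a sketch under loud assumptions: the "FAILURE: " and "DETAIL: " tags, the failure-kind names, and the helper path are all invented here, since no such message format is defined anywhere in this thread.

```python
from collections import namedtuple

FailureReport = namedtuple("FailureReport", "kind detail")

# Hypothetical tags for the two lines of a cleaned-up error message.
KIND_TAG = "FAILURE: "
DETAIL_TAG = "DETAIL: "


def pair_reports(lines):
    """Pair each FAILURE line with the DETAIL line that directly follows it."""
    reports = []
    pending_kind = None
    for line in lines:
        if line.startswith(KIND_TAG):
            pending_kind = line[len(KIND_TAG):].strip()
        elif line.startswith(DETAIL_TAG) and pending_kind is not None:
            detail = line[len(DETAIL_TAG):].strip()
            reports.append(FailureReport(pending_kind, detail))
            pending_kind = None
        else:
            # Any other line in between discards a half-finished pair.
            pending_kind = None
    return reports


def helper_argv(report, helper="/sbin/failure-helper"):
    """Build the argument list for an external helper (path is hypothetical)."""
    return [helper, report.kind, report.detail]


if __name__ == "__main__":
    log = [
        "FAILURE: device-gone",
        "DETAIL: sda",
        "eth0: link up",
        "FAILURE: fs-data",
        "DETAIL: ext2 bitmap on sda1",
    ]
    for report in pair_reports(log):
        print(helper_argv(report))
```

With a fixed first line per failure class, the helper only has to know the handful of kinds listed above, not every driver-specific message.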

MfG, JBG

--
Fehler eingestehen, Größe zeigen: Nehmt die Rechtschreibreform zurück!!!
/* Jan-Benedict Glaw <[email protected]> -- +49-177-5601720 */
keyID=0x8399E1BB fingerprint=250D 3BCF 7127 0D8C A444 A961 1DBD 5E75 8399 E1BB
"insmod vi.o and there we go..." (Alexander Viro on linux-kernel)

