On 5/2/06, Rik van Riel <[email protected]> wrote:
> On Mon, 1 May 2006, Circuitsoft Development wrote:
>
> > I was actually planning on a 5msec timeout to ignore that computer,
> > for now, then if I don't get a response within 100msec, ping them,
> > and permanently remove them from the list of peers and broadcast a
> > "this peer is dead" message to the network if the ping times out at
> > 500msec.
>
> How are you going to prevent your "dead" peer from writing
> to the disk anyway ?
>
> --
> All Rights Reversed
>
I'm not. They also need to get permission from the network before they
write to the disk, and they're not going to get permission without
hearing back from everybody. Besides, since the same network is used
to connect to the disks as is used to connect the computers to each
other, how would it be able to access the disks without being able to
access other computers which also connect to the disks?
(Sorry for the repeat, Rik)
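
A minimal sketch of the timeout escalation quoted at the top of this
message (5 msec, 100 msec, 500 msec). The names PeerState and
classify_peer are hypothetical, and it assumes all three thresholds are
measured from the moment the original request was sent, which the
message does not actually specify.

```python
from enum import Enum


class PeerState(Enum):
    OK = "ok"            # replied in time
    IGNORED = "ignored"  # silent past 5 msec: skip this peer for now
    PINGING = "pinging"  # silent past 100 msec: send an explicit ping
    DEAD = "dead"        # silent past 500 msec: remove peer, broadcast


# Thresholds taken from the message; measuring all of them from the
# original request is an assumption.
IGNORE_AFTER_MS = 5
PING_AFTER_MS = 100
DEAD_AFTER_MS = 500


def classify_peer(silent_ms: float) -> PeerState:
    """Map how long a peer has been silent to an escalation state."""
    if silent_ms < 0:
        raise ValueError("silent_ms must be non-negative")
    if silent_ms >= DEAD_AFTER_MS:
        return PeerState.DEAD
    if silent_ms >= PING_AFTER_MS:
        return PeerState.PINGING
    if silent_ms >= IGNORE_AFTER_MS:
        return PeerState.IGNORED
    return PeerState.OK
```

This only models what the surviving peers conclude about a silent node.
It does nothing to stop that node from writing, which is what Rik's
question is about.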
On 2006-05-08T16:17:07, Circuitsoft Development <[email protected]> wrote:
> I'm not. They also need to get permission from the network before they
> write to the disk, and they're not going to get permission without
> hearing back from everybody. Besides, since the same network is used
> to connect to the disks as is used to connect the computers to each
> other, how would it be able to access the disks without being able to
> access other computers which also connect to the disks?

You really should read up about split-brain scenarios, quorum, IO
fencing, cluster membership algorithms and the amazing variety of
different types of crashes.

Sincerely,
Lars Marowsky-Brée
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
 -- Charles Darwin
On 5/8/06, Lars Marowsky-Bree <[email protected]> wrote:
> On 2006-05-08T16:17:07, Circuitsoft Development <[email protected]> wrote:
>
> > I'm not. They also need to get permission from the network before they
> > write to the disk, and they're not going to get permission without
> > hearing back from everybody. Besides, since the same network is used
> > to connect to the disks as is used to connect the computers to each
> > other, how would it be able to access the disks without being able to
> > access other computers which also connect to the disks?
>
> You really should read up about split-brain scenarios
I don't see how split-brain can happen if the heartbeat/management
traffic runs over the same network that is used to connect to the disks.
> quorum
I'm aware of the idea. I think that a static quorum would be best, and
that it should be configured by the cluster administrator. See
http://lists.osdl.org/pipermail/osdlcluster/2004-January/000071.html
for a description of the problem with dynamic quorum.
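
A minimal sketch of a static quorum check, assuming "static" means the
threshold is a strict majority of the node count the administrator
configured, no matter how many nodes are currently alive. The function
name has_quorum and that majority rule are assumptions, not something
stated in this thread.

```python
def has_quorum(configured_nodes: int, reachable_nodes: int) -> bool:
    """Return True if this partition may act on behalf of the cluster.

    configured_nodes is fixed by the administrator (assumption) and
    never shrinks when peers are declared dead, so two disjoint
    partitions cannot both hold a strict majority.
    """
    if configured_nodes < 1:
        raise ValueError("configured_nodes must be at least 1")
    if not 0 <= reachable_nodes <= configured_nodes:
        raise ValueError("reachable_nodes out of range")
    # Count this node itself in reachable_nodes.
    return reachable_nodes > configured_nodes // 2
```

With five configured nodes split 3/2, only the three-node side passes
the check. With four nodes split 2/2, neither side does, so the whole
cluster stops writing rather than risk both halves continuing.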
> IO fencing
The primary target storage protocol is ATA-over-Ethernet; the second is
iSCSI. As far as I know, it should be relatively simple, in both
cases, to tell the storage blade to cut off a computer until
it correctly re-registers itself with the cluster. Otherwise, a Cisco
managed switch should also be able to cut off a computer if it stops
responding to the cluster.
> cluster membership algorithms
I'm having trouble finding many details on these. I'll keep looking,
but some pointers would be helpful.
> and the amazing variety of different types of crashes.
I figured that IO fencing combined with STONITH (Shoot The Other Node
In The Head) could solve the problems caused by most crashes.
>
> Sincerely,
> Lars Marowsky-Brée
>
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business
> "Ignorance more frequently begets confidence than does knowledge"
>  -- Charles Darwin
>
These thoughts are based on my best understanding so far of how this
all works. Any further comments would be greatly appreciated.
- Alex