Subject: [RFC] High availability in KVM

We are trying to improve the integration of KVM with the most common
HA stacks, but we would like to share with the community what we are
trying to achieve and how before we take a wrong turn.

This is a pretty long write-up, but please bear with me.
---


Virtualization has boosted flexibility in the data center, allowing
for efficient usage of computing resources, increased server
consolidation, load balancing on a per-virtual-machine basis -- you
name it. However, we feel there is an aspect of virtualization that
has not been fully exploited so far: high availability (HA).

Traditional HA solutions can be classified in two groups: fault
tolerant servers, and software clustering.

Broadly speaking, fault tolerant servers protect us against hardware
failures and, generally, rely on redundant hardware (often
proprietary), and hardware failure detection to trigger fail-over.

On the other hand, software clustering, as its name indicates, takes
care of software failures and usually requires a standby server
whose software configuration for the part we are trying to make
fault tolerant must be identical to that of the active server.

Existing open source HA stacks such as pacemaker/corosync and Red
Hat Cluster Suite rely on software clustering techniques to detect
both hardware failures and software failures, and employ fencing to
avoid split-brain situations which, in turn, makes it possible to
perform failover safely. However, when applied to virtualization
environments these solutions show some limitations:

- Hardware detection relies on polling mechanisms (for example
pinging a network interface to check for network connectivity),
imposing a trade off between failover time and the cost of
polling. The alternative is having the failing system send an
alarm to the HA software to trigger failover. The latter
approach is preferable but it is not always applicable when
dealing with bare-metal; depending on the failure type the
hardware may not be able to get a message out to notify the HA
software. However, when it comes to virtualization environments
we can certainly do better. If a hardware failure, be it real
hardware or virtual hardware, is fully contained within a
virtual machine the host or hypervisor can detect that and
notify the HA software safely using clean resources.

- In most cases, when a hardware failure is detected the state of
the failing node is not known which means that some kind of
fencing is needed to lock resources away from that
node. Depending on the hardware and the cluster configuration
fencing can be a pretty expensive operation that contributes to
system downtime. Virtualization can help here. Upon failure
detection the host or hypervisor could put the virtual machine
in a quiesced state and release its hardware resources before
notifying the HA software, so that it can start failover
immediately without having to meddle with the failing virtual
machine (we now know that it is in a known quiesced state). Of
course this only makes sense in the event-driven failover case
described above.

- Fencing operations commonly involve killing the virtual machine,
thus depriving us of potentially critical debugging information:
a dump of the virtual machine itself. This issue could be solved
by providing a virtual machine control that puts the virtual
machine in a known quiesced state, releases its hardware
resources, but keeps the guest and device model in memory so
that forensics can be conducted offline after failover. Polling
HA resource agents should use this new command if postmortem
analysis is important.

We are pursuing a scenario where current polling-based HA resource
agents are complemented with an event-driven failure notification
mechanism that allows for faster failover times by eliminating the
delay introduced by polling and by doing without fencing. This would
benefit traditional software clustering stacks and bring a feature
that is essential for fault tolerance solutions such as Kemari.

Additionally, for those who want or need to stick with a polling
model we would like to provide a virtual machine control that
freezes a virtual machine into a failover-safe state without killing
it, so that postmortem analysis is still possible.

In the following sections we discuss the RAS-HA integration
challenges and the changes that need to be made to each component of
the qemu-KVM stack to realize this vision. While at it we will also
delve into some of the limitations of the current hardware error
subsystems of the Linux kernel.


HARDWARE ERRORS AND HIGH AVAILABILITY

The major open source software stacks for Linux rely on polling
mechanisms to detect both software errors and hardware failures. For
example, ping or an equivalent is widely used to check for network
connectivity interruptions. This is enough to get the job done in
most cases but one is forced to make a trade off between service
disruption time and the burden imposed by the polling resource
agent.

On the hardware side of things, the situation can be improved if we
take advantage of CPU and chipset RAS capabilities to trigger
failover in the event of a non-recoverable error or, even better, do
it preventively when hardware informs us things might go awry. The
premise is that RAS features such as hardware failure notification
can be leveraged to minimize or even eliminate service
down-times.

Generally speaking, hardware errors reported to the operating system
can be classified into two broad categories: corrected errors and
uncorrected errors. The latter are not necessarily critical errors
that require a system restart; depending on the hardware and the
software running on the affected system resource such errors may be
recoverable. The picture looks like this (definitions taken from
"Advanced Configuration and Power Interface Specification, Revision
4.0a" and slightly modified to get rid of ACPI jargon):

- Corrected error: Hardware error condition that has been
corrected by the hardware or by the firmware by the time the
kernel is notified about the existence of an error condition.

- Uncorrected error: Hardware error condition that cannot be
corrected by the hardware or by the firmware. Uncorrected errors
are either fatal or non-fatal.

o A fatal hardware error is an uncorrected or uncontained
error condition that is determined to be unrecoverable by
the hardware. When a fatal uncorrected error occurs, the
system is usually restarted to prevent propagation of the
error.

o A non-fatal hardware error is an uncorrected error condition
from which the kernel can attempt recovery by trying to
correct the error. These are also referred to as correctable
or recoverable errors.

Corrected errors are inoffensive in principle, but they may be
harbingers of fatal non-recoverable errors. It is thus reasonable in
some cases to do preventive failover or live migration when a
certain threshold is reached. However, this is arguably the job of
systems management software, not the HA stack, so this case will not be
discussed in detail here.

Uncorrected errors are the ones HA software cares about.

When a fatal hardware error occurs the firmware may decide to
restart the hardware. If the fatal error is relayed to the kernel
instead the safest thing to do is to panic to avoid further
damage. Even though it is theoretically possible to send a
notification from the kernel's error or panic handler, this is an
extremely hardware-dependent operation and will not be considered
here. To detect this type of failure, one's old reliable
polling-based resource agent is the way to go.

Non-fatal or recoverable errors are the most interesting in the
pack. Detection should ideally be performed in a non-intrusive way
and feed the policy engine with enough information about the error
to make the right call. If the policy engine decides that the error
might compromise service continuity it should notify the HA stack so
that failover can be started immediately.


REQUIREMENTS

* Linux kernel

One of the main goals is to notify HA software about hardware errors
as soon as they are detected so that service downtime can be
minimized. For this a hardware error subsystem that follows an
event-driven model is preferable because it allows us to eliminate
the cost associated with polling. A file-based API that provides a
sys_poll interface and process signaling both fit the bill (the
latter is pretty limited in its semantics and may not be adequate to
communicate non-memory-type errors).

The hardware error subsystem should provide enough information to be
able to map error sources (memory, PCI devices, etc) to processes or
virtual machines, so that errors can be contained. For example, if a
memory failure occurs but only affects user-space addresses being
used by a regular process or a KVM guest there is no need to bring
down the whole machine.

In some cases, when a failure is detected in a hardware resource in
use by one or more virtual machines it might be necessary to put
them in a quiesced state before notifying the associated qemu
process.

Unfortunately there is no generic hardware error layer inside the
kernel, which means that each hardware error subsystem does its own
thing and there is even some overlap between them. See HARDWARE
ERRORS IN LINUX below for a brief description of the current mess.

* qemu-kvm

Currently KVM is only notified about memory errors detected by the
MCE subsystem. When running on newer x86 hardware, if MCE detects an
error on user-space it signals the corresponding process with
SIGBUS. Qemu, upon receiving the signal, checks the problematic
address which the kernel stored in siginfo and decides whether to
inject the MCE into the virtual machine.
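
To illustrate the mechanism in isolation, here is a minimal
standalone sketch (not qemu code) of a process installing a SIGBUS
handler with SA_SIGINFO and reading the faulting address from
siginfo; BUS_MCEERR_AR and BUS_MCEERR_AO are the si_code values
recent kernels use for action-required and action-optional memory
errors. A VMM receiving such a signal has to map si_addr back to a
guest physical address before deciding what to do with the error.

/* Minimal sketch of SIGBUS-based memory error notification, the same
 * mechanism qemu relies on. Standalone example, not qemu code. */
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4   /* action required: error at si_addr */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5   /* action optional: error at si_addr */
#endif

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    /* si->si_addr holds the poisoned user-space address; a VMM would
     * map it back to a guest physical address and then decide whether
     * to inject an MCE or, as proposed here, quiesce the guest. */
    if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO)
        write(STDERR_FILENO, "memory error reported via SIGBUS\n", 33);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);

    pause();                  /* wait for a (simulated) error */
    return 0;
}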

An obvious limitation is that we would like to be notified about
other types of error too and, as suggested before, a file-based
interface that can be sys_poll'ed might be needed for that.

On a different note, in a HA environment the qemu policy described
above is not adequate; when a notification of a hardware error that
our policy determines to be serious arrives the first thing we want
to do is to put the virtual machine in a quiesced state to avoid
further wreckage. If we injected the error into the guest we would
risk a guest panic that might be detectable only by polling or,
worse, the guest being killed by the kernel, which means that
postmortem analysis of the guest would not be possible. Once we had
the guest in a quiesced state, where all the buffers have been
flushed and the hardware resources released, we would have two modes
of operation that can be
used together and complement each other.

- Proactive: A qmp event describing the error (severity, topology,
etc) is emitted. The HA software would have to register to
receive hardware error events, possibly using the libvirt
bindings. Upon receiving the event the HA software would know
that the guest is in a failover-safe quiesced state, so it could
do without fencing and proceed to the failover stage directly
(see the event-listener sketch after this list).

- Passive: Polling resource agents that need to check the state of
the guest generally use libvirt or a wrapper such as virsh. When
the state is SHUTOFF or CRASHED the resource agent proceeds to
the fencing stage, which might be expensive and usually involves
killing the qemu process. We propose adding a new state that
indicates the failover-safe state described before. In this
state the HA software would not need to use fencing techniques
and since the qemu process is not killed postmortem analysis of
the virtual machine is still possible.
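
To make the proactive mode more concrete, the sketch below shows how
an HA daemon could consume such notifications through the libvirt
event API. The hardware error event proposed here does not exist yet,
neither in qmp nor in libvirt, so the existing lifecycle event is
registered as a stand-in; a real agent would register for the new
event ID once it is defined.

/* Sketch of an event-driven HA listener using the libvirt event API.
 * The proposed hardware error event does not exist yet, so the
 * lifecycle event is used here as a stand-in. */
#include <stdio.h>
#include <libvirt/libvirt.h>

static int lifecycle_cb(virConnectPtr conn, virDomainPtr dom,
                        int event, int detail, void *opaque)
{
    (void)conn; (void)detail; (void)opaque;
    /* A real HA agent would check for the (hypothetical) hardware
     * error / failover-safe event here and trigger failover directly,
     * knowing the guest is already quiesced. */
    printf("domain %s: event %d\n", virDomainGetName(dom), event);
    return 0;
}

int main(void)
{
    if (virEventRegisterDefaultImpl() < 0)
        return 1;

    virConnectPtr conn = virConnectOpenReadOnly("qemu:///system");
    if (!conn)
        return 1;

    int cb = virConnectDomainEventRegisterAny(
            conn, NULL /* all domains */,
            VIR_DOMAIN_EVENT_ID_LIFECYCLE,
            VIR_DOMAIN_EVENT_CALLBACK(lifecycle_cb),
            NULL, NULL);
    if (cb < 0)
        return 1;

    for (;;) {                      /* event loop */
        if (virEventRunDefaultImpl() < 0)
            break;
    }

    virConnectDomainEventDeregisterAny(conn, cb);
    virConnectClose(conn);
    return 0;
}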


HARDWARE ERRORS IN LINUX

In modern x86 machines there is a plethora of error sources:

- Processor machine check exceptions.
- Chipset error message signals.
- APEI (ACPI4).
- NMI.
- PCIe AER.
- Non-platform devices (SCSI errors, ATA errors, etc).

Detection of processor, memory, PCI express, and platform errors in
the Linux kernel is currently provided by the MCE, the EDAC, and the
PCIe AER subsystems, which cover the first five items in the list
above. There is some overlap between them with regard to the errors
they can detect and the hardware they poke into, but they are
essentially independent systems with completely different
architectures. To make things worse, there is no standard mechanism
to notify about non-platform devices beyond the venerable printk().

Regarding the user space notification mechanism, things do not get
any better. Each error notification subsystem does its own thing:

- MCE: Communicates with user space through the /dev/mcelog
special device and
/sys/devices/system/machinecheck/machinecheckN/. mcelog is
usually the tool that hooks into /dev/mcelog (this device can be
polled; see the sketch after this list) to collect and decode the
machine check errors. Alternatively,
/sys/devices/system/machinecheck/machinecheckN/trigger can be
used to set a program to be run when a machine check event is
detected. Additionally, when a machine check error affects only
user-space processes, those processes are signaled with SIGBUS.

The MCE subsystem used to deal only with CPU errors, but it was
extended to handle memory errors too and there is also initial
support for ACPI4's APEI. The current MCE APEI implementation
reaps memory errors notified through SCI, but support for other
errors (platform, PCIe) and transports covered in the
specification is in the works.

- EDAC: Exports memory errors, ECC errors from non-memory devices
(L1, L2 and L3 caches, DMA engines, etc), and PCI bus parity and
SERR errors through /sys/devices/system/edac/*.

- NMI: Uses printk() to write to the system log. When EDAC is
enabled the NMI handler can also instruct EDAC to check for
potential ECC errors.

- PCIe AER subsystem: Notifies PCI-core and AER-capable drivers
about errors in the PCI bus and uses printk() to write to the
system log.
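
As a concrete example of the /dev/mcelog path mentioned in the MCE
item above, the following sketch waits for machine check records with
poll(2), roughly what the mcelog daemon does. The records are the
kernel's binary struct mce, so decoding is left to mcelog; the
example only reads raw bytes.

/* Sketch: waiting for machine check records on /dev/mcelog with
 * poll(2), roughly what the mcelog daemon does. Decoding of the
 * binary records (the kernel's struct mce) is left to mcelog. */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mcelog", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/mcelog");
        return 1;
    }

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char buf[4096];

    for (;;) {
        if (poll(&pfd, 1, -1) < 0) {    /* block until records arrive */
            perror("poll");
            break;
        }
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            /* A real consumer would split the buffer into struct mce
             * records and decode them; here we only report activity. */
            fprintf(stderr, "read %zd bytes of machine check data\n", n);
    }
    close(fd);
    return 0;
}
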
---


I would appreciate your comments and advice on any of the issues
presented here.

Thanks,
Fernando


2010-06-21 14:19:54

by Luiz Capitulino

Subject: Re: [RFC] High availability in KVM

On Thu, 17 Jun 2010 12:15:20 +0900
Fernando Luis Vazquez Cao <[email protected]> wrote:

> * qemu-kvm
>
> Currently KVM is only notified about memory errors detected by the
> MCE subsystem. When running on newer x86 hardware, if MCE detects an
> error on user-space it signals the corresponding process with
> SIGBUS. Qemu, upon receiving the signal, checks the problematic
> address which the kernel stored in siginfo and decides whether to
> inject the MCE to the virtual machine.
>
> An obvious limitation is that we would like to be notified about
> other types of error too and, as suggested before, a file-based
> interface that can be sys_poll'ed might be needed for that.
>
> On a different note, in a HA environment the qemu policy described
> above is not adequate; when a notification of a hardware error that
> our policy determines to be serious arrives the first thing we want
> to do is to put the virtual machine in a quiesced state to avoid
> further wreckage. If we injected the error into the guest we would
> risk a guest panic that might detectable only by polling or, worse,
> being killed by the kernel, which means that postmortem analysis of
> the guest is not possible. Once we had the guests in a quiesced
> state, where all the buffers have been flushed and the hardware
> sources released, we would have two modes of operation that can be
> used together and complement each other.
>
> - Proactive: A qmp event describing the error (severity, topology,
> etc) is emitted. The HA software would have to register to
> receive hardware error events, possibly using the libvirt
> bindings. Upon receiving the event the HA software would know
> that the guest is in a failover-safe quiesced state so it could
> do without fencing and proceed to the failover stage directly.

This seems to match the BLOCK_IO_ERROR event we have today: when a disk error
happens, an event is emitted and the virtual machine can be automatically
stopped (there's a configuration option for this).

On the other hand, there's a number of ways to do this differently. I think
the first thing to do is to agree on what qemu's behavior is going to be, then
we decide how to expose this info to qmp clients.

> - Passive: Polling resource agents that need to check the state of
> the guest generally use libvirt or a wrapper such as virsh. When
> the state is SHUTOFF or CRASHED the resource agent proceeds to
> the facing stage, which might be expensive and usually involves
> killing the qemu process. We propose adding a new state that
> indicates the failover-safe state described before. In this
> state the HA software would not need to use fencing techniques
> and since the qemu process is not killed postmortem analysis of
> the virtual machine is still possible.

It wouldn't be polling, I guess. We already have events for most state changes.
So, when the machine stops, reboots, etc.. the client would be notified and
then it could inspect the virtual machine by using query commands.

This method would be preferable in case we also want this information available
in the user Monitor and/or if the event gets too messy because of the amount of
information we want to put in it.
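
As a rough illustration of the event-plus-query approach described
above, the sketch below connects to a qmp socket and issues
query-status, an existing qmp command; the socket path is only an
example, and JSON parsing and error handling are omitted.

/* Sketch: inspecting a guest over QMP after receiving an event.
 * Assumes qemu was started with something like
 *   -qmp unix:/tmp/qmp.sock,server,nowait
 * (the socket path is an example). */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static void cmd(int fd, const char *json)
{
    char buf[4096];
    write(fd, json, strlen(json));
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("%s\n", buf);
    }
}

int main(void)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return 1;

    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, "/tmp/qmp.sock", sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    char greeting[4096];
    read(fd, greeting, sizeof(greeting));          /* QMP banner */

    cmd(fd, "{\"execute\": \"qmp_capabilities\"}\n");
    cmd(fd, "{\"execute\": \"query-status\"}\n");  /* running/paused? */

    close(fd);
    return 0;
}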

2010-06-22 01:39:10

by Takuya Yoshikawa

Subject: Re: [RFC] High availability in KVM

(2010/06/21 23:19), Luiz Capitulino wrote:
>> On a different note, in a HA environment the qemu policy described
>> above is not adequate; when a notification of a hardware error that
>> our policy determines to be serious arrives the first thing we want
>> to do is to put the virtual machine in a quiesced state to avoid
>> further wreckage. If we injected the error into the guest we would
>> risk a guest panic that might detectable only by polling or, worse,
>> being killed by the kernel, which means that postmortem analysis of
>> the guest is not possible. Once we had the guests in a quiesced
>> state, where all the buffers have been flushed and the hardware
>> sources released, we would have two modes of operation that can be
>> used together and complement each other.
>>
>> - Proactive: A qmp event describing the error (severity, topology,
>> etc) is emitted. The HA software would have to register to
>> receive hardware error events, possibly using the libvirt
>> bindings. Upon receiving the event the HA software would know
>> that the guest is in a failover-safe quiesced state so it could
>> do without fencing and proceed to the failover stage directly.
>
> This seems to match the BLOCK_IO_ERROR event we have today: when a disk error
> happens, an event is emitted and the virtual machine can be automatically
> stopped (there's a configuration option for this).
>
> On the other hand, there's a number of ways to do this differently. I think
> the first thing to do is to agree on what qemu's behavior is going to be, then
> we decide how to expose this info to qmp clients.

I would like the same framework to cover qemu/KVM bugs too.

Even though there are other ways to debug, the easiest and most reliable one would
be using the frozen state of the guest at the moment the bug happened.


We've already experienced some qemu crashes in our test environment which seemed
to be caused by a KVM emulation failure. Although we could guess what happened
by checking some messages like the exit reason, the guest state would have been
more helpful.

So what I want to get is:

- new qemu/KVM mode in which guests are automatically stopped in a failover-safe
state when qemu/KVM becomes unable to continue,

- new interface between qemu and HA to handle the failover-safe state,

Although I personally don't mind whether the interface is event-based or
polling-based, one important problem from the HA's point of view would be:

* how to treat errors raised in different layers uniformly.

E.g. if the problem is caused by the guest side, qemu may exit normally without
sending any events to the HA software. So a polling interface may be helpful even
if we choose an event-driven one.


Takuya


>
>> - Passive: Polling resource agents that need to check the state of
>> the guest generally use libvirt or a wrapper such as virsh. When
>> the state is SHUTOFF or CRASHED the resource agent proceeds to
>> the facing stage, which might be expensive and usually involves
>> killing the qemu process. We propose adding a new state that
>> indicates the failover-safe state described before. In this
>> state the HA software would not need to use fencing techniques
>> and since the qemu process is not killed postmortem analysis of
>> the virtual machine is still possible.
>
> It wouldn't be polling, I guess. We already have events for most state changes.
> So, when the machine stops, reboots, etc.. the client would be notified and
> then it could inspect the virtual machine by using query commands.
>
> This method would be preferable in case we also want this information available
> in the user Monitor and/or if the event gets too messy because of the amount of
> information we want to put in it.

2010-07-10 22:37:24

by David Lang

Subject: Re: [RFC] High availability in KVM

On Thu, 17 Jun 2010, Fernando Luis Vazquez Cao wrote:

> Existing open source HA stacks such as pacemaker/corosync and Red
> Hat Cluster Suite rely on software clustering techniques to detect
> both hardware failures and software failures, and employ fencing to
> avoid split-brain situations which, in turn, makes it possible to
> perform failover safely. However, when applied to virtualization
> environments these solutions show some limitations:
>
> - Hardware detection relies on polling mechanisms (for example
> pinging a network interface to check for network connectivity),
> imposing a trade off between failover time and the cost of
> polling. The alternative is having the failing system send an
> alarm to the HA software to trigger failover. The latter
> approach is preferable but it is not always applicable when
> dealing with bare-metal; depending on the failure type the
> hardware may not able to get a message out to notify the HA
> software. However, when it comes to virtualization environments
> we can certainly do better. If a hardware failure, be it real
> hardware or virtual hardware, is fully contained within a
> virtual machine the host or hypervisor can detect that and
> notify the HA software safely using clean resources.

you still need to detect failures that you won't be notified of.

what if a network cable goes bad and your data isn't getting through? you
won't get any notification of this without doing polling, even in a
virtualized environment.

also, in a virtualized environment you may have firewall rules between
virtual hosts, if those get misconfigured you may have 'virtual physical
connectivity' still, but not the logical connectivity that you need.

> - In most cases, when a hardware failure is detected the state of
> the failing node is not known which means that some kind of
> fencing is needed to lock resources away from that
> node. Depending on the hardware and the cluster configuration
> fencing can be a pretty expensive operation that contributes to
> system downtime. Virtualization can help here. Upon failure
> detection the host or hypervisor could put the virtual machine
> in a quiesced state and release its hardware resources before
> notifying the HA software, so that it can start failover
> immediately without having to mingle with the failing virtual
> machine (we now know that it is in a known quiesced state). Of
> course this only makes sense in the event-driven failover case
> described above.
>
> - Fencing operations commonly involve killing the virtual machine,
> thus depriving us of potentially critical debugging information:
> a dump of the virtual machine itself. This issue could be solved
> by providing a virtual machine control that puts the virtual
> machine in a known quiesced state, releases its hardware
> resources, but keeps the guest and device model in memory so
> that forensics can be conducted offline after failover. Polling
> HA resource agents should use this new command if postmortem
> analysis is important.

I don't see this as the job of the virtualization hypervisor. the software
HA stacks include the ability to run external scripts to perform these
tasks. These scripts can perform whatever calls to the hypervisor that
are appropriate to freeze, shutdown, or disconnect the virtual server (and
what is appropriate will vary from implementation to implementation)

providing sample scripts that do this for the various HA stacks makes
sense as it gives people examples of what can be done and lets them tailor
exactly what does happen to their needs.

> We are pursuing a scenario where current polling-based HA resource
> agents are complemented with an event-driven failure notification
> mechanism that allows for faster failover times by eliminating the
> delay introduced by polling and by doing without fencing. This would
> benefit traditional software clustering stacks and bring a feature
> that is essential for fault tolerance solutions such as Kemari.

heartbeat/pacemaker has been able to do sub-second failovers for several
years, I'm not sure that notification is really needed.

that being said the HA stacks do allow for commands to be fed into the HA
system to tell a machine to go active/passive already, so why don't you
have your notification just call scripts to make the appropriate calls?

> Additionally, for those who want or need to stick with a polling
> model we would like to provide a virtual machine control that
> freezes a virtual machine into a failover-safe state without killing
> it, so that postmortem analysis is still possible.

how is this different from simply pausing the virtual machine?

> In the following sections we discuss the RAS-HA integration
> challenges and the changes that need to be made to each component of
> the qemu-KVM stack to realize this vision. While at it we will also
> delve into some of the limitations of the current hardware error
> subsystems of the Linux kernel.
>
>
> HARDWARE ERRORS AND HIGH AVAILABILITY
>
> The major open source software stacks for Linux rely on polling
> mechanisms to detect both software errors and hardware failures. For
> example, ping or an equivalent is widely used to check for network
> connectivity interruptions. This is enough to get the job done in
> most cases but one is forced to make a trade off between service
> disruption time and the burden imposed by the polling resource
> agent.
>
> On the hardware side of things, the situation can be improved if we
> take advantage of CPU and chipset RAS capabilities to trigger
> failover in the event of a non-recoverable error or, even better, do
> it preventively when hardware informs us things might go awry. The
> premise is that RAS features such as hardware failure notification
> can be leveraged to minimize or even eliminate service
> down-times.

having run dozens of sets of HA systems for about 10 years, I find that
very few of the failures that I have experienced would have been helped by
this. hardware very seldom gives me any indication that it's about to
fail, and even when it does fail it's usually only discovered due to the
side effects of other things I am trying to do not working.

> Generally speaking, hardware errors reported to the operating system
> can be classified into two broad categories: corrected errors and
> uncorrected errors. The later are not necessarily critical errors
> that require a system restart; depending on the hardware and the
> software running on the affected system resource such errors may be
> recoverable. The picture looks like this (definitions taken from
> "Advanced Configuration and Power Interface Specification, Revision
> 4.0a" and slightly modified to get rid of ACPI jargon):
>
> - Corrected error: Hardware error condition that has been
> corrected by the hardware or by the firmware by the time the
> kernel is notified about the existence of an error condition.
>
> - Uncorrected error: Hardware error condition that cannot be
> corrected by the hardware or by the firmware. Uncorrected errors
> are either fatal or non-fatal.
>
> o A fatal hardware error is an uncorrected or uncontained
> error condition that is determined to be unrecoverable by
> the hardware. When a fatal uncorrected error occurs, the
> system is usually restarted to prevent propagation of the
> error.
>
> o A non-fatal hardware error is an uncorrected error condition
> from which the kernel can attempt recovery by trying to
> correct the error. These are also referred to as correctable
> or recoverable errors.
>
> Corrected errors are inoffensive in principle, but they may be
> harbingers of fatal non-recoverable errors. It is thus reasonable in
> some cases to do preventive failover or live migration when a
> certain threshold is reached. However this is arguably the job
> systems management software, not the HA, so this case will not be
> discussed in detail here.

the easiest way to do this is to log the correctable errors and let normal
log analysis tools notice these errors and decide to take action. trying
to make the hypervisor do something here is putting policy in the wrong
place.

> Uncorrected errors are the ones HA software cares about.
>
> When a fatal hardware error occurs the firmware may decide to
> restart the hardware. If the fatal error is relayed to the kernel
> instead the safest thing to do is to panic to avoid further
> damage. Even though it is theoretically possible to send a
> notification from the kernel's error or panic handler, this is a
> extremely hardware-dependent operation and will not be considered
> here. To detect this type of failures one's old reliable
> polling-based resource agent is the way to go.

and in this case you probably cannot trust the system to send notification
without damaging things further, simply halting is probably the only safe
thing to do.

> Non-fatal or recoverable errors are the most interesting in the
> pack. Detection should ideally be performed in a non-intrusive way
> and feed the policy engine with enough information about the error
> to make the right call. If the policy engine decides that the error
> might compromise service continuity it should notify the HA stack so
> that failover can be started immediately.

again, log the errors and let existing log analysis/alerting tools decide
what action to take.

> Currently KVM is only notified about memory errors detected by the
> MCE subsystem. When running on newer x86 hardware, if MCE detects an
> error on user-space it signals the corresponding process with
> SIGBUS. Qemu, upon receiving the signal, checks the problematic
> address which the kernel stored in siginfo and decides whether to
> inject the MCE to the virtual machine.
>
> An obvious limitation is that we would like to be notified about
> other types of error too and, as suggested before, a file-based
> interface that can be sys_poll'ed might be needed for that.
> On a different note, in a HA environment the qemu policy described
> above is not adequate; when a notification of a hardware error that
> our policy determines to be serious arrives the first thing we want
> to do is to put the virtual machine in a quiesced state to avoid
> further wreckage. If we injected the error into the guest we would
> risk a guest panic that might detectable only by polling or, worse,
> being killed by the kernel, which means that postmortem analysis of
> the guest is not possible. Once we had the guests in a quiesced
> state, where all the buffers have been flushed and the hardware
> sources released, we would have two modes of operation that can be
> used together and complement each other.

it sounds like you really need to be running HA at two layers

1. on the host layer to detect problems with the host and decide to
freeze/migrate virtual machines to another system

2. inside the guests to make sure that the guests that are running (on
multiple real machines) continue to provide services.

but what is your alternative to sending the error into the guest?
depending on what the error is you may or may not be able to freeze the
guest (it makes no sense to try and flush buffers to a drive that won't
accept writes for example)

> - Proactive: A qmp event describing the error (severity, topology,
> etc) is emitted. The HA software would have to register to
> receive hardware error events, possibly using the libvirt
> bindings. Upon receiving the event the HA software would know
> that the guest is in a failover-safe quiesced state so it could
> do without fencing and proceed to the failover stage directly.

if it's not a fatal error then the system can continue to run (for at
least a few more seconds ;-), let such errors get written to syslog and
let a tool like SEC (simple event correlator) see the logs and decide what
to do. there's no need to modify the kernel/KVM for this.

> - Passive: Polling resource agents that need to check the state of
> the guest generally use libvirt or a wrapper such as virsh. When
> the state is SHUTOFF or CRASHED the resource agent proceeds to
> the facing stage, which might be expensive and usually involves
> killing the qemu process. We propose adding a new state that
> indicates the failover-safe state described before. In this
> state the HA software would not need to use fencing techniques
> and since the qemu process is not killed postmortem analysis of
> the virtual machine is still possible.

how do you define failover-safe states? why would the HA software (with
the assistance of a log watcher) not be able to do the job itself?

I do think that it's significant that all the HA solutions out there
prefer to test if the functionality works rather than watching for log
events to say there may be a problem, but there's nothing preventing this
from easily being done.

David Lang

2010-07-12 06:17:23

by Takuya Yoshikawa

Subject: Re: [RFC] High availability in KVM

(2010/07/11 7:36), [email protected] wrote:
> On Thu, 17 Jun 2010, Fernando Luis Vazquez Cao wrote:
>
>> Existing open source HA stacks such as pacemaker/corosync and Red
>> Hat Cluster Suite rely on software clustering techniques to detect
>> both hardware failures and software failures, and employ fencing to
>> avoid split-brain situations which, in turn, makes it possible to
>> perform failover safely. However, when applied to virtualization
>> environments these solutions show some limitations:
>>
>> - Hardware detection relies on polling mechanisms (for example
>> pinging a network interface to check for network connectivity),
>> imposing a trade off between failover time and the cost of
>> polling. The alternative is having the failing system send an
>> alarm to the HA software to trigger failover. The latter
>> approach is preferable but it is not always applicable when
>> dealing with bare-metal; depending on the failure type the
>> hardware may not able to get a message out to notify the HA
>> software. However, when it comes to virtualization environments
>> we can certainly do better. If a hardware failure, be it real
>> hardware or virtual hardware, is fully contained within a
>> virtual machine the host or hypervisor can detect that and
>> notify the HA software safely using clean resources.
>
> you still need to detect failures that you won't be notified of.
>
> what if a network cable goes bad and your data isn't getting through?
> you won't get any notification of this without doing polling, even in a
> virtualized environment.


I agree that we need polling anyway.


>
> also, in a virtualized environment you may have firewall rules between
> virtual hosts, if those get misconfigured you may have 'virual physical
> connectivity' still, but not the logical connectivity that you need.
>
>> - In most cases, when a hardware failure is detected the state of
>> the failing node is not known which means that some kind of
>> fencing is needed to lock resources away from that
>> node. Depending on the hardware and the cluster configuration
>> fencing can be a pretty expensive operation that contributes to
>> system downtime. Virtualization can help here. Upon failure
>> detection the host or hypervisor could put the virtual machine
>> in a quiesced state and release its hardware resources before
>> notifying the HA software, so that it can start failover
>> immediately without having to mingle with the failing virtual
>> machine (we now know that it is in a known quiesced state). Of
>> course this only makes sense in the event-driven failover case
>> described above.
>>
>> - Fencing operations commonly involve killing the virtual machine,
>> thus depriving us of potentially critical debugging information:
>> a dump of the virtual machine itself. This issue could be solved
>> by providing a virtual machine control that puts the virtual
>> machine in a known quiesced state, releases its hardware
>> resources, but keeps the guest and device model in memory so
>> that forensics can be conducted offline after failover. Polling
>> HA resource agents should use this new command if postmortem
>> analysis is important.
>
> I don't see this as the job of the virtualization hypervisor. the
> software HA stacks include the ability to run external scripts to
> perform these tasks. These scripts can perform whatever calls to the
> hypervisor that are appropriate to freeze, shutdown, or disconnect the
> virtual server (and what is appropriate will vary from implementation to
> implementation)


I see that it can be done with HA plus external scripts.

But don't you think we need a way to confirm that the VM is in a known quiesced
state?

Although it might not be the exact same scenario, here is what we are planning
as one possible next step (polling case):

==============================================================================
A. Current management: "Qemu/KVM + HA using libvirt interface"

- Pacemaker interacts with RA(Resource Agent) through OCF interface.
- RA interacts with Qemu using virsh commands, IOW through libvirt interface.

Pacemaker...(OCF)....RA...(libvirt)...Qemu
    |                 |                 |
    |                 |                 |
1:  +---- start ----->+---------------->+  state=RUNNING
    |                 |                 |
    +---- monitor --->+---- domstate -->+
2:  |                 |                 |
    +<---- "OK" ------+<--- "RUNNING" --+
    |                 |                 |
    |                 |                 |
    |                 |  * Error: state=SHUTOFF, or ...
    |                 |                 |
    |                 |                 |
    +---- monitor --->+---- domstate -->+
3:  |                 |                 |
    +<-- "STOPPED" ---+<--- "SHUTOFF" --+
    |                 |                 |
    +---- stop ------>+---- shutdown -->+  VM killed (if still alive)
4:  |                 |                 |
    +<---- "OK" ------+<--- "SHUTOFF" --+
    |                 |                 |
    |                 |                 |

1: Pacemaker starts Qemu.

2: Pacemaker checks the state of Qemu via RA.
RA checks the state of Qemu using virsh(libvirt).
Qemu replies to RA "RUNNING"(normally executing), (*1)
and RA returns the state to Pacemaker as it's running correctly.

(*1): libvirt defines the following domain states:

enum virDomainState {

VIR_DOMAIN_NOSTATE = 0 : no state
VIR_DOMAIN_RUNNING = 1 : the domain is running
VIR_DOMAIN_BLOCKED = 2 : the domain is blocked on resource
VIR_DOMAIN_PAUSED = 3 : the domain is paused by user
VIR_DOMAIN_SHUTDOWN = 4 : the domain is being shut down
VIR_DOMAIN_SHUTOFF = 5 : the domain is shut off
VIR_DOMAIN_CRASHED = 6 : the domain is crashed

}

We took the most common case, RUNNING, as an example, but this could be any
other state except for the failover targets, SHUTOFF and CRASHED?

--- SOME ERROR HAPPENS ---

3: Pacemaker checks the state of Qemu via RA.
RA checks the state of Qemu using virsh(libvirt).
Qemu replies to RA "SHUTOFF", (*2)
and RA returns the state to Pacemaker as it's already stopped.

(*2): Currently we are checking for the "shut off" answer from the domstate
command. Yes, we should handle both SHUTOFF and CRASHED if possible.

4: Pacemaker finally tries to confirm that it can safely start failover by
sending a stop command. After killing Qemu, RA replies to Pacemaker
"OK" so that Pacemaker can start failover.

Problem: We lose debugging information from the VM, such as the contents of
guest memory.



B. Our proposal: "introduce a new domain state to indicate failover-safe"

Pacemaker...(OCF)....RA...(libvirt)...Qemu
    |                 |                 |
    |                 |                 |
1:  +---- start ----->+---------------->+  state=RUNNING
    |                 |                 |
    +---- monitor --->+---- domstate -->+
2:  |                 |                 |
    +<---- "OK" ------+<--- "RUNNING" --+
    |                 |                 |
    |                 |                 |
    |                 |  * Error: state=FROZEN
    |                 |                 |    Qemu releases resources
    |                 |                 |    and VM gets frozen. (*3)
    +---- monitor --->+---- domstate -->+
3:  |                 |                 |
    +<-- "STOPPED" ---+<--- "FROZEN" ---+
    |                 |                 |
    +---- stop ------>+---- domstate -->+
4:  |                 |                 |
    +<---- "OK" ------+<--- "FROZEN" ---+
    |                 |                 |
    |                 |                 |


1: Pacemaker starts Qemu.

2: Pacemaker checks the state of Qemu via RA.
RA checks the state of Qemu using virsh(libvirt).
Qemu replies to RA "RUNNING"(normally executing), (*1)
and RA returns the state to Pacemaker as it's running correctly.

--- SOME ERROR HAPPENS ---

3: Pacemaker checks the state of Qemu via RA.
RA checks the state of Qemu using virsh(libvirt).
Qemu replies to RA "FROZEN"(VM stopped in a failover-safe state), (*3)
and RA keeps it in mind, then replies to Pacemaker "STOPPED".

(*3): this is what we want to introduce as a new state. Failover-safe means
that Qemu has released the external resources, including some namespaces, so
that they are available to another instance.

4: Pacemaker finally tries to confirm that it can safely start failover by
sending a stop command. Knowing that the VM has already stopped in a
failover-safe state, RA does not kill the Qemu process and replies to Pacemaker
"OK" so that Pacemaker can start failover as usual.

==============================================================================
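
For illustration, here is a rough sketch of what the RA's monitor
logic could look like using the libvirt C API. VIR_DOMAIN_FROZEN
stands in for the proposed new failover-safe state and does not exist
in libvirt today; the OCF exit codes are the standard ones
interpreted by Pacemaker, and the domain name is an example.

/* Sketch of the RA "monitor" logic using the libvirt C API.
 * VIR_DOMAIN_FROZEN is the proposed (non-existent) failover-safe
 * state; everything else uses existing libvirt interfaces. */
#include <libvirt/libvirt.h>

#define OCF_SUCCESS      0   /* resource is running */
#define OCF_ERR_GENERIC  1
#define OCF_NOT_RUNNING  7   /* resource is cleanly stopped */

#define VIR_DOMAIN_FROZEN 7  /* hypothetical new virDomainState value */

static int monitor(virConnectPtr conn, const char *name)
{
    virDomainPtr dom = virDomainLookupByName(conn, name);
    if (!dom)
        return OCF_NOT_RUNNING;

    virDomainInfo info;
    if (virDomainGetInfo(dom, &info) < 0) {
        virDomainFree(dom);
        return OCF_ERR_GENERIC;
    }
    virDomainFree(dom);

    switch (info.state) {
    case VIR_DOMAIN_SHUTOFF:
    case VIR_DOMAIN_CRASHED:
        return OCF_NOT_RUNNING;       /* scenario A: stop, then failover */
    case VIR_DOMAIN_FROZEN:
        /* Scenario B: guest is quiesced and its resources released,
         * so report "stopped" without killing the qemu process and
         * keep it around for postmortem analysis. */
        return OCF_NOT_RUNNING;
    default:
        return OCF_SUCCESS;           /* RUNNING, PAUSED, ... */
    }
}

int main(void)
{
    virConnectPtr conn = virConnectOpenReadOnly("qemu:///system");
    if (!conn)
        return OCF_ERR_GENERIC;
    int rc = monitor(conn, "guest1");   /* domain name is an example */
    virConnectClose(conn);
    return rc;
}

In scenario B the only difference from a crash, as far as the RA is
concerned, is that the qemu process is left in place for postmortem
analysis after being reported as stopped.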

In any case, we want to confirm that failover can be done safely.

- I could not find any such API in libvirt.


>
> providing sample scripts that do this for the various HA stacks makes
> sense as it gives people examples of what can be done and lets them
> tailor exactly what does happen to their needs.
>
>> We are pursuing a scenario where current polling-based HA resource
>> agents are complemented with an event-driven failure notification
>> mechanism that allows for faster failover times by eliminating the
>> delay introduced by polling and by doing without fencing. This would
>> benefit traditional software clustering stacks and bring a feature
>> that is essential for fault tolerance solutions such as Kemari.
>
> heartbeat/pacemaker has been able to do sub-second failovers for several
> years, I'm not sure that notification is really needed.
>
> that being said the HA stacks do allow for commands to be fed into the
> HA system to tell a machine to go active/passive already, so why don't
> you have your notification just call scripts to make the appropriate calls?
>
>> Additionally, for those who want or need to stick with a polling
>> model we would like to provide a virtual machine control that
>> freezes a virtual machine into a failover-safe state without killing
>> it, so that postmortem analysis is still possible.
>
> how is this different from simply pausing the virtual machine?


I think it is almost the same as pause, but more precisely: all vcpus are
stopped and all resources except memory are released. This of course should
take care of namespaces.


Takuya



>
>> In the following sections we discuss the RAS-HA integration
>> challenges and the changes that need to be made to each component of
>> the qemu-KVM stack to realize this vision. While at it we will also
>> delve into some of the limitations of the current hardware error
>> subsystems of the Linux kernel.
>>
>>
>> HARDWARE ERRORS AND HIGH AVAILABILITY
>>
>> The major open source software stacks for Linux rely on polling
>> mechanisms to detect both software errors and hardware failures. For
>> example, ping or an equivalent is widely used to check for network
>> connectivity interruptions. This is enough to get the job done in
>> most cases but one is forced to make a trade off between service
>> disruption time and the burden imposed by the polling resource
>> agent.
>>
>> On the hardware side of things, the situation can be improved if we
>> take advantage of CPU and chipset RAS capabilities to trigger
>> failover in the event of a non-recoverable error or, even better, do
>> it preventively when hardware informs us things might go awry. The
>> premise is that RAS features such as hardware failure notification
>> can be leveraged to minimize or even eliminate service
>> down-times.
>
> having run dozens of sets of HA systems for about 10 years, I find that
> very few of the failures that I have experianced would have been helped
> by this. hardware very seldom gives me any indication that it's about to
> fail, and even when it does fail it's usually only discovered due to the
> side effects of other things I am trying to do not working.
>
>> Generally speaking, hardware errors reported to the operating system
>> can be classified into two broad categories: corrected errors and
>> uncorrected errors. The later are not necessarily critical errors
>> that require a system restart; depending on the hardware and the
>> software running on the affected system resource such errors may be
>> recoverable. The picture looks like this (definitions taken from
>> "Advanced Configuration and Power Interface Specification, Revision
>> 4.0a" and slightly modified to get rid of ACPI jargon):
>>
>> - Corrected error: Hardware error condition that has been
>> corrected by the hardware or by the firmware by the time the
>> kernel is notified about the existence of an error condition.
>>
>> - Uncorrected error: Hardware error condition that cannot be
>> corrected by the hardware or by the firmware. Uncorrected errors
>> are either fatal or non-fatal.
>>
>> o A fatal hardware error is an uncorrected or uncontained
>> error condition that is determined to be unrecoverable by
>> the hardware. When a fatal uncorrected error occurs, the
>> system is usually restarted to prevent propagation of the
>> error.
>>
>> o A non-fatal hardware error is an uncorrected error condition
>> from which the kernel can attempt recovery by trying to
>> correct the error. These are also referred to as correctable
>> or recoverable errors.
>>
>> Corrected errors are inoffensive in principle, but they may be
>> harbingers of fatal non-recoverable errors. It is thus reasonable in
>> some cases to do preventive failover or live migration when a
>> certain threshold is reached. However this is arguably the job
>> systems management software, not the HA, so this case will not be
>> discussed in detail here.
>
> the easiest way to do this is to log the correctable errors and let
> normal log analysis tools notice these errors and decide to take action.
> trying to make the hypervisor do something here is putting policy in the
> wrong place.
>
>> Uncorrected errors are the ones HA software cares about.
>>
>> When a fatal hardware error occurs the firmware may decide to
>> restart the hardware. If the fatal error is relayed to the kernel
>> instead the safest thing to do is to panic to avoid further
>> damage. Even though it is theoretically possible to send a
>> notification from the kernel's error or panic handler, this is a
>> extremely hardware-dependent operation and will not be considered
>> here. To detect this type of failures one's old reliable
>> polling-based resource agent is the way to go.
>
> and in this case you probably cannot trust the system to send
> notification without damaging things further, simply halting is probably
> the only safe thing to do.
>
>> Non-fatal or recoverable errors are the most interesting in the
>> pack. Detection should ideally be performed in a non-intrusive way
>> and feed the policy engine with enough information about the error
>> to make the right call. If the policy engine decides that the error
>> might compromise service continuity it should notify the HA stack so
>> that failover can be started immediately.
>
> again, log the errors and let existing log analysis/alerting tools
> decide what action to take.
>
>> Currently KVM is only notified about memory errors detected by the
>> MCE subsystem. When running on newer x86 hardware, if MCE detects an
>> error on user-space it signals the corresponding process with
>> SIGBUS. Qemu, upon receiving the signal, checks the problematic
>> address which the kernel stored in siginfo and decides whether to
>> inject the MCE to the virtual machine.
>>
>> An obvious limitation is that we would like to be notified about
>> other types of error too and, as suggested before, a file-based
>> interface that can be sys_poll'ed might be needed for that. On a
>> different note, in a HA environment the qemu policy described
>> above is not adequate; when a notification of a hardware error that
>> our policy determines to be serious arrives the first thing we want
>> to do is to put the virtual machine in a quiesced state to avoid
>> further wreckage. If we injected the error into the guest we would
>> risk a guest panic that might detectable only by polling or, worse,
>> being killed by the kernel, which means that postmortem analysis of
>> the guest is not possible. Once we had the guests in a quiesced
>> state, where all the buffers have been flushed and the hardware
>> sources released, we would have two modes of operation that can be
>> used together and complement each other.
>
> it sounds like you really need to be running HA at two layers
>
> 1. on the host layer to detect problems with the host and decide to
> freeze/migrate virtual machines to another system
>
> 2. inside the guests to make sure that the guests that are running (on
> multiple real machines) continue to provide services.
>
> but what is your alturnative to sending the error into the guest?
> depending on what the error is you may or may not be able to freeze the
> guest (it makes no sense to try and flush buffers to a drive that won't
> accept writes for example)
>
>> - Proactive: A qmp event describing the error (severity, topology,
>> etc) is emitted. The HA software would have to register to
>> receive hardware error events, possibly using the libvirt
>> bindings. Upon receiving the event the HA software would know
>> that the guest is in a failover-safe quiesced state so it could
>> do without fencing and proceed to the failover stage directly.
>
> if it's not a fatal error then the system can continue to run (for at
> least a few more seconds ;-), let such errors get written to syslog and
> let a tool like SEC (simple event correlator) see the logs and deicde
> what to do. there's no need to modify the kernel/KVM for this.
>
>> - Passive: Polling resource agents that need to check the state of
>> the guest generally use libvirt or a wrapper such as virsh. When
>> the state is SHUTOFF or CRASHED the resource agent proceeds to
>> the facing stage, which might be expensive and usually involves
>> killing the qemu process. We propose adding a new state that
>> indicates the failover-safe state described before. In this
>> state the HA software would not need to use fencing techniques
>> and since the qemu process is not killed postmortem analysis of
>> the virtual machine is still possible.
>
> how do you define failover-safe states? why would the HA software (with
> the assistance of a log watcher) not be able to do the job itself?
>
> I do think that it's significant that all the HA solutions out there
> prefer to test if the functionality works rather than watching for log
> events to say there may be a problem, but there's nothing preventing
> this from easily being done.
>
> David Lang
>

2010-07-12 09:50:14

by David Lang

Subject: Re: [RFC] High availability in KVM

On Mon, 12 Jul 2010, Takuya Yoshikawa wrote:

>
> I see that it can be done with HA plus external scripts.
>
> But don't you think we need a way to confirm that vm is in a known quiesced
> state?
>
> Although might not be the exact same scenario, here is what we are planning
> as one possible next step (polling case):
>
> ==============================================================================
> A. Current management: "Qemu/KVM + HA using libvirt interface"
>
> - Pacemaker interacts with RA(Resource Agent) through OCF interface.
> - RA interacts with Qemu using virsh commands, IOW through libvirt interface.
>
> Pacemaker...(OCF)....RA...(libvirt)...Qemu
>     |                 |                 |
>     |                 |                 |
> 1:  +---- start ----->+---------------->+  state=RUNNING
>     |                 |                 |
>     +---- monitor --->+---- domstate -->+
> 2:  |                 |                 |
>     +<---- "OK" ------+<--- "RUNNING" --+
>     |                 |                 |
>     |                 |                 |
>     |                 |  * Error: state=SHUTOFF, or ...
>     |                 |                 |
>     |                 |                 |
>     +---- monitor --->+---- domstate -->+
> 3:  |                 |                 |
>     +<-- "STOPPED" ---+<--- "SHUTOFF" --+
>     |                 |                 |
>     +---- stop ------>+---- shutdown -->+  VM killed (if still alive)
> 4:  |                 |                 |
>     +<---- "OK" ------+<--- "SHUTOFF" --+
>     |                 |                 |
>     |                 |                 |
>
> 1: Pacemaker starts Qemu.
>
> 2: Pacemaker checks the state of Qemu via RA.
> RA checks the state of Qemu using virsh(libvirt).
> Qemu replies to RA "RUNNING"(normally executing), (*1)
> and RA returns the state to Pacemaker as it's running correctly.
>
> (*1): libvirt defines the following domain states:
>
> enum virDomainState {
>
> VIR_DOMAIN_NOSTATE = 0 : no state
> VIR_DOMAIN_RUNNING = 1 : the domain is running
> VIR_DOMAIN_BLOCKED = 2 : the domain is blocked on resource
> VIR_DOMAIN_PAUSED = 3 : the domain is paused by user
> VIR_DOMAIN_SHUTDOWN = 4 : the domain is being shut down
> VIR_DOMAIN_SHUTOFF = 5 : the domain is shut off
> VIR_DOMAIN_CRASHED = 6 : the domain is crashed
>
> }
>
> We took the most common case RUNNING as an example, but this might be
> other states except for failover targets: SHUTOFF and CRASHED ?
>
> --- SOME ERROR HAPPENS ---
>
> 3: Pacemaker checks the state of Qemu via RA.
> RA checks the state of Qemu using virsh(libvirt).
> Qemu replies to RA "SHUTOFF", (*2)

why would it return 'shutoff' if an error happened instead of 'crashed'?

> and RA returns the state to Pacemaker as it's already stopped.
>
> (*2): Currently we are checking "shut off" answer from domstate command.
> Yes, we should care about both SHUTOFF and CRASHED if possible.
>
> 4: Pacemaker finally tries to confirm if it can safely start failover by
> sending stop command. After killing Qemu, RA replies to Pacemaker
> "OK" so that Pacemaker can start failover.
>
> Problems: We lose debuggable information of VM such as the contents of
> guest memory.

the OCF interface has start, stop, status (running or not) or an error
(plus API info)

what I would do in this case is have the script notice that it's in
crashed status and return an error if it's told to start it. This will
cause pacemaker to start the service on another system.

if it's told to stop it, do whatever you can to save state, but definitely
pause/freeze the instance and return 'stopped'



no need to define some additional state. As far as pacemaker is concerned
it's safe as long as there is no chance of it changing the state of any
shared resources that the other system would use, so simply pausing the
instance will make it safe. It will be interesting when someone wants to
investigate what's going on inside the instance (you need to have it be
functional, but not able to use the network or any shared
drives/filesystems), but I don't believe that you can get that right in a
generic manner, the details of what will cause grief and what won't will
vary from site to site.


> B. Our proposal: "introduce a new domain state to indicate failover-safe"
>
> Pacemaker...(OCF)....RA...(libvirt)...Qemu
>     |                 |                 |
>     |                 |                 |
> 1:  +---- start ----->+---------------->+  state=RUNNING
>     |                 |                 |
>     +---- monitor --->+---- domstate -->+
> 2:  |                 |                 |
>     +<---- "OK" ------+<--- "RUNNING" --+
>     |                 |                 |
>     |                 |                 |
>     |                 |  * Error: state=FROZEN
>     |                 |                 |    Qemu releases resources
>     |                 |                 |    and VM gets frozen. (*3)
>     +---- monitor --->+---- domstate -->+
> 3:  |                 |                 |
>     +<-- "STOPPED" ---+<--- "FROZEN" ---+
>     |                 |                 |
>     +---- stop ------>+---- domstate -->+
> 4:  |                 |                 |
>     +<---- "OK" ------+<--- "FROZEN" ---+
>     |                 |                 |
>     |                 |                 |
>
>
> 1: Pacemaker starts Qemu.
>
> 2: Pacemaker checks the state of Qemu via RA.
> RA checks the state of Qemu using virsh(libvirt).
> Qemu replies to RA "RUNNING"(normally executing), (*1)
> and RA returns the state to Pacemaker as it's running correctly.
>
> --- SOME ERROR HAPPENS ---
>
> 3: Pacemaker checks the state of Qemu via RA.
> RA checks the state of Qemu using virsh(libvirt).
> Qemu replies to RA "FROZEN"(VM stopped in a failover-safe state), (*3)
> and RA keeps it in mind, then replies to Pacemaker "STOPPED".
>
> (*3): this is what we want to introduce as a new state. Failover-safe means
> that Qemu released the external resources, including some namespaces, to
> be
> available from another instance.

it doesn't need to release the resources. It just needs to not be able to
modify them.

pacemaker on the host won't try to start another instance on the same
host, it will try to start an instance on another host. so you don't need
to worry about releasing memory, file locks, etc locally. for remote
resources you _can't_ release them gracefully if you crash, so your apps
already need to be able to handle that situation. there's no difference to
the other instances between a machine that gets powered off via STONITH
and a virtual system that gets paused.

David Lang

> 4: Pacemaker finally tries to confirm if it can safely start failover by
> sending stop command. Knowing that VM has already stopped in a failover-safe
> state, RA does not kill Qemu process and replies to Pacemaker "OK" so that
> Pacemaker can start failover as usual.
>
> ==============================================================================
>
> In any case, we want to confirm that fail over can be safely done.
>
> - I could not find any such API in libvirt.
>
>
>>
>> providing sample scripts that do this for the various HA stacks makes
>> sense as it gives people examples of what can be done and lets them
>> tailor exactly what does happen to their needs.
>>
>>> We are pursuing a scenario where current polling-based HA resource
>>> agents are complemented with an event-driven failure notification
>>> mechanism that allows for faster failover times by eliminating the
>>> delay introduced by polling and by doing without fencing. This would
>>> benefit traditional software clustering stacks and bring a feature
>>> that is essential for fault tolerance solutions such as Kemari.
>>
>> heartbeat/pacemaker has been able to do sub-second failovers for several
>> years, I'm not sure that notification is really needed.
>>
>> that being said the HA stacks do allow for commands to be fed into the
>> HA system to tell a machine to go active/passive already, so why don't
>> you have your notification just call scripts to make the appropriate calls?
>>
>>> Additionally, for those who want or need to stick with a polling
>>> model we would like to provide a virtual machine control that
>>> freezes a virtual machine into a failover-safe state without killing
>>> it, so that postmortem analysis is still possible.
>>
>> how is this different from simply pausing the virtual machine?
>
>
> I think it is almost the same as pause, but more precisely all vcpus are
> stopped and all resources except memory are released. This of course should
> take care of namespaces.
>
>
> Takuya
>
>
>
>>
>>> In the following sections we discuss the RAS-HA integration
>>> challenges and the changes that need to be made to each component of
>>> the qemu-KVM stack to realize this vision. While at it we will also
>>> delve into some of the limitations of the current hardware error
>>> subsystems of the Linux kernel.
>>>
>>>
>>> HARDWARE ERRORS AND HIGH AVAILABILITY
>>>
>>> The major open source software stacks for Linux rely on polling
>>> mechanisms to detect both software errors and hardware failures. For
>>> example, ping or an equivalent is widely used to check for network
>>> connectivity interruptions. This is enough to get the job done in
>>> most cases but one is forced to make a trade off between service
>>> disruption time and the burden imposed by the polling resource
>>> agent.
>>>
>>> On the hardware side of things, the situation can be improved if we
>>> take advantage of CPU and chipset RAS capabilities to trigger
>>> failover in the event of a non-recoverable error or, even better, do
>>> it preventively when hardware informs us things might go awry. The
>>> premise is that RAS features such as hardware failure notification
>>> can be leveraged to minimize or even eliminate service
>>> down-times.
>>
>> having run dozens of sets of HA systems for about 10 years, I find that
>> very few of the failures that I have experienced would have been helped
>> by this. hardware very seldom gives me any indication that it's about to
>> fail, and even when it does fail it's usually only discovered due to the
>> side effects of other things I am trying to do not working.
>>
>>> Generally speaking, hardware errors reported to the operating system
>>> can be classified into two broad categories: corrected errors and
>>> uncorrected errors. The latter are not necessarily critical errors
>>> that require a system restart; depending on the hardware and the
>>> software running on the affected system resource such errors may be
>>> recoverable. The picture looks like this (definitions taken from
>>> "Advanced Configuration and Power Interface Specification, Revision
>>> 4.0a" and slightly modified to get rid of ACPI jargon):
>>>
>>> - Corrected error: Hardware error condition that has been
>>> corrected by the hardware or by the firmware by the time the
>>> kernel is notified about the existence of an error condition.
>>>
>>> - Uncorrected error: Hardware error condition that cannot be
>>> corrected by the hardware or by the firmware. Uncorrected errors
>>> are either fatal or non-fatal.
>>>
>>> o A fatal hardware error is an uncorrected or uncontained
>>> error condition that is determined to be unrecoverable by
>>> the hardware. When a fatal uncorrected error occurs, the
>>> system is usually restarted to prevent propagation of the
>>> error.
>>>
>>> o A non-fatal hardware error is an uncorrected error condition
>>> from which the kernel can attempt recovery by trying to
>>> correct the error. These are also referred to as correctable
>>> or recoverable errors.
>>>
>>> Corrected errors are inoffensive in principle, but they may be
>>> harbingers of fatal non-recoverable errors. It is thus reasonable in
>>> some cases to do preventive failover or live migration when a
>>> certain threshold is reached. However this is arguably the job of
>>> systems management software, not the HA stack, so this case will not be
>>> discussed in detail here.
>>
>> the easiest way to do this is to log the correctable errors and let
>> normal log analysis tools notice these errors and decide to take action.
>> trying to make the hypervisor do something here is putting policy in the
>> wrong place.
>>
>>> Uncorrected errors are the ones HA software cares about.
>>>
>>> When a fatal hardware error occurs the firmware may decide to
>>> restart the hardware. If the fatal error is relayed to the kernel
>>> instead the safest thing to do is to panic to avoid further
>>> damage. Even though it is theoretically possible to send a
>>> notification from the kernel's error or panic handler, this is a
>>> extremely hardware-dependent operation and will not be considered
>>> here. To detect this type of failures one's old reliable
>>> polling-based resource agent is the way to go.
>>
>> and in this case you probably cannot trust the system to send
>> notification without damaging things further, simply halting is probably
>> the only safe thing to do.
>>
>>> Non-fatal or recoverable errors are the most interesting in the
>>> pack. Detection should ideally be performed in a non-intrusive way
>>> and feed the policy engine with enough information about the error
>>> to make the right call. If the policy engine decides that the error
>>> might compromise service continuity it should notify the HA stack so
>>> that failover can be started immediately.
>>
>> again, log the errors and let existing log analysis/alerting tools
>> decide what action to take.
>>
>>> Currently KVM is only notified about memory errors detected by the
>>> MCE subsystem. When running on newer x86 hardware, if MCE detects an
>>> error on user-space it signals the corresponding process with
>>> SIGBUS. Qemu, upon receiving the signal, checks the problematic
>>> address which the kernel stored in siginfo and decides whether to
>>> inject the MCE to the virtual machine.
>>>
>>> An obvious limitation is that we would like to be notified about
>>> other types of error too and, as suggested before, a file-based
>>> interface that can be sys_poll'ed might be needed for that. On a
>>> different note, in a HA environment the qemu policy described
>>> above is not adequate; when a notification of a hardware error that
>>> our policy determines to be serious arrives the first thing we want
>>> to do is to put the virtual machine in a quiesced state to avoid
>>> further wreckage. If we injected the error into the guest we would
>>> risk a guest panic that might be detectable only by polling or, worse,
>>> being killed by the kernel, which means that postmortem analysis of
>>> the guest is not possible. Once we had the guests in a quiesced
>>> state, where all the buffers have been flushed and the hardware
>>> resources released, we would have two modes of operation that can be
>>> used together and complement each other.
>>
>> it sounds like you really need to be running HA at two layers
>>
>> 1. on the host layer to detect problems with the host and decide to
>> freeze/migrate virtual machines to another system
>>
>> 2. inside the guests to make sure that the guests that are running (on
>> multiple real machines) continue to provide services.
>>
>> but what is your alternative to sending the error into the guest?
>> depending on what the error is you may or may not be able to freeze the
>> guest (it makes no sense to try and flush buffers to a drive that won't
>> accept writes for example)
>>
>>> - Proactive: A qmp event describing the error (severity, topology,
>>> etc) is emitted. The HA software would have to register to
>>> receive hardware error events, possibly using the libvirt
>>> bindings. Upon receiving the event the HA software would know
>>> that the guest is in a failover-safe quiesced state so it could
>>> do without fencing and proceed to the failover stage directly.
>>
>> if it's not a fatal error then the system can continue to run (for at
>> least a few more seconds ;-), let such errors get written to syslog and
>> let a tool like SEC (simple event correlator) see the logs and decide
>> what to do. there's no need to modify the kernel/KVM for this.
>>
>>> - Passive: Polling resource agents that need to check the state of
>>> the guest generally use libvirt or a wrapper such as virsh. When
>>> the state is SHUTOFF or CRASHED the resource agent proceeds to
>>> the fencing stage, which might be expensive and usually involves
>>> killing the qemu process. We propose adding a new state that
>>> indicates the failover-safe state described before. In this
>>> state the HA software would not need to use fencing techniques
>>> and since the qemu process is not killed postmortem analysis of
>>> the virtual machine is still possible.
>>
>> how do you define failover-safe states? why would the HA software (with
>> the assistance of a log watcher) not be able to do the job itself?
>>
>> I do think that it's significant that all the HA solutions out there
>> prefer to test if the functionality works rather than watching for log
>> events to say there may be a problem, but there's nothing preventing
>> this from easily being done.
>>
>> David Lang
>>
>
>

2010-07-13 07:50:24

by Takuya Yoshikawa

Subject: Re: [RFC] High availability in KVM

On Mon, 12 Jul 2010 02:49:55 -0700 (PDT)
[email protected] wrote:

> On Mon, 12 Jul 2010, Takuya Yoshikawa wrote:
>

[...]

> > 1: Pacemaker starts Qemu.
> >
> > 2: Pacemaker checks the state of Qemu via RA.
> > RA checks the state of Qemu using virsh(libvirt).
> > Qemu replies to RA "RUNNING"(normally executing), (*1)
> > and RA returns the state to Pacemaker as it's running correctly.
> >
> > (*1): libvirt defines the following domain states:
> >
> > enum virDomainState {
> >
> > VIR_DOMAIN_NOSTATE = 0 : no state
> > VIR_DOMAIN_RUNNING = 1 : the domain is running
> > VIR_DOMAIN_BLOCKED = 2 : the domain is blocked on resource
> > VIR_DOMAIN_PAUSED = 3 : the domain is paused by user
> > VIR_DOMAIN_SHUTDOWN = 4 : the domain is being shut down
> > VIR_DOMAIN_SHUTOFF = 5 : the domain is shut off
> > VIR_DOMAIN_CRASHED = 6 : the domain is crashed
> >
> > }
> >
> > We took the most common case, RUNNING, as an example, but this might be
> > other states, except for the failover targets SHUTOFF and CRASHED?
> >
> > --- SOME ERROR HAPPENS ---
> >
> > 3: Pacemaker checks the state of Qemu via RA.
> > RA checks the state of Qemu using virsh(libvirt).
> > Qemu replies to RA "SHUTOFF", (*2)
>
> why would it return 'shutoff' if an error happened instead of 'crashed'?


Yes, it would be 'crashed'.

But 'shutoff' may also be returned, I think: it depends on the type of the
error and how KVM/qemu handles it.

I have in mind not only hardware errors but also virtualization-specific
errors such as emulation errors.
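
For what it's worth, libvirt versions that have virDomainGetState also report
a reason code next to the state, so an RA can at least tell a crash-like
shutoff from a clean shutdown. A rough sketch with the Python bindings (the
helper name is made up):

    # sketch: use the (state, reason) pair to tell a crash from a clean shutdown
    import libvirt

    def looks_crashed(dom):
        # dom is a libvirt domain handle for the guest in question
        state, reason = dom.state()
        if state == libvirt.VIR_DOMAIN_CRASHED:
            return True
        # a SHUTOFF domain may still be the result of qemu dying
        return (state == libvirt.VIR_DOMAIN_SHUTOFF and
                reason == libvirt.VIR_DOMAIN_SHUTOFF_CRASHED)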


>
> > and RA returns the state to Pacemaker as it's already stopped.
> >
> > (*2): Currently we are checking "shut off" answer from domstate command.
> > Yes, we should care about both SHUTOFF and CRASHED if possible.
> >
> > 4: Pacemaker finally tries to confirm if it can safely start failover by
> > sending stop command. After killing Qemu, RA replies to Pacemaker
> > "OK" so that Pacemaker can start failover.
> >
> > Problems: We lose debuggable information of VM such as the contents of
> > guest memory.
>
> the OCF interface has start, stop, status (running or not) or an error
> (plus API info)
>
> what I would do in this case is have the script notice that it's in
> crashed status and return an error if it's told to start it. This will
> cause pacemaker to start the service on another system.


I see.
So the key point is how to check the target status, crashed in this case.

From the HA point of view, we need qemu to guarantee that:
- the guest never starts again
- the VM never modifies external resources

But I'm not so sure whether qemu currently guarantees such conditions in a
generic manner.



Generally I agree that we always start the guest on another node for
failover. But are there any benefits if we can start the guest on the
same node?


>
> if it's told to stop it, do whatever you can to save state, but definitely
> pause/freeze the instance and return 'stopped'
>
>
>
> no need to define some additional state. As far as pacemaker is concerned
> it's safe as long as there is no chance of it changing the state of any
> shared resources that the other system would use, so simply pausing the
> instance will make it safe. It will be interesting when someone wants to
> investigate what's going on inside the instance (you need to have it be
> functional, but not able to use the network or any shared
> drives/filesystems), but I don't believe that you can get that right in a
> generic manner, the details of what will cause grief and what won't will
> vary from site to site.


If we cannot say in a generic manner, we usually choose the most conservative
one: memory and ... preservation only.

What concerns us the most is whether qemu actually guarantees the conditions
we are discussing in this thread.



>
>
> > B. Our proposal: "introduce a new domain state to indicate failover-safe"
> >
> > Pacemaker...(OCF)....RA...(libvirt)...Qemu
> >     |                 |                 |
> >     |                 |                 |
> > 1:  +---- start ----->+---------------->+   state=RUNNING
> >     |                 |                 |
> >     +---- monitor --->+---- domstate -->+
> > 2:  |                 |                 |
> >     +<---- "OK" ------+<--- "RUNNING" --+
> >     |                 |                 |
> >     |                 |                 |
> >     |                 |                 *   Error: state=FROZEN
> >     |                 |                 |     Qemu releases resources
> >     |                 |                 |     and VM gets frozen. (*3)
> >     +---- monitor --->+---- domstate -->+
> > 3:  |                 |                 |
> >     +<-- "STOPPED" ---+<--- "FROZEN" ---+
> >     |                 |                 |
> >     +---- stop ------>+---- domstate -->+
> > 4:  |                 |                 |
> >     +<---- "OK" ------+<--- "FROZEN" ---+
> >     |                 |                 |
> >     |                 |                 |
> >
> >
> > 1: Pacemaker starts Qemu.
> >
> > 2: Pacemaker checks the state of Qemu via RA.
> > RA checks the state of Qemu using virsh(libvirt).
> > Qemu replies to RA "RUNNING"(normally executing), (*1)
> > and RA returns the state to Pacemaker as it's running correctly.
> >
> > --- SOME ERROR HAPPENS ---
> >
> > 3: Pacemaker checks the state of Qemu via RA.
> > RA checks the state of Qemu using virsh(libvirt).
> > Qemu replies to RA "FROZEN"(VM stopped in a failover-safe state), (*3)
> > and RA keeps it in mind, then replies to Pacemaker "STOPPED".
> >
> > (*3): this is what we want to introduce as a new state. Failover-safe means
> > that Qemu released the external resources, including some namespaces, to be
> > available from another instance.
>
> it doesn't need to release the resources. It just needs to not be able to
> modify them.
>
> pacemaker on the host won't try to start another instance on the same
> host, it will try to start an instance on another host. so you don't need
> to worry about releasing memory, file locks, etc locally. for remote
> resources you _can't_ release them gracefully if you crash, so your apps
> already need to be able to handle that situation. there's no difference to
> the other instances between a machine that gets powered off via STONITH
> and a virtual system that gets paused.


Can't pacemaker start another instance on the same host by configuration?
Of course I agree that it may not be valuable in most situations.


Takuya


2010-07-13 08:53:16

by David Lang

Subject: Re: [RFC] High availability in KVM

On Tue, 13 Jul 2010, Takuya Yoshikawa wrote:

> On Mon, 12 Jul 2010 02:49:55 -0700 (PDT)
> [email protected] wrote:
>
>> On Mon, 12 Jul 2010, Takuya Yoshikawa wrote:
>>
>>
>>> and RA returns the state to Pacemaker as it's already stopped.
>>>
>>> (*2): Currently we are checking "shut off" answer from domstate command.
>>> Yes, we should care about both SHUTOFF and CRASHED if possible.
>>>
>>> 4: Pacemaker finally tries to confirm if it can safely start failover by
>>> sending stop command. After killing Qemu, RA replies to Pacemaker
>>> "OK" so that Pacemaker can start failover.
>>>
>>> Problems: We lose debuggable information of VM such as the contents of
>>> guest memory.
>>
>> the OCF interface has start, stop, status (running or not) or an error
>> (plus API info)
>>
>> what I would do in this case is have the script notice that it's in
>> crashed status and return an error if it's told to start it. This will
>> cause pacemaker to start the service on another system.
>
>
> I see.
> So the key point is how to check the target status, crashed in this case.
>
> From the HA point of view, we need qemu to guarantee that:
> - the guest never starts again
> - the VM never modifies external resources
>
> But I'm not so sure whether qemu currently guarantees such conditions in a
> generic manner.

you don't have to depend on the return from qemu. there are many OCF
scripts that maintain state internally (look at the e-mail script as an
example). if your OCF script thinks that it should be running and it
isn't, mark it as crashed and don't try to start it again until external
actions clear the status (and you can have a boot do so in case you have
an unclean shutdown)
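
A small sketch of that bookkeeping, with the marker path and helper names
invented for the example:

    # sketch: RA-side state so "should be running but isn't" reads as crashed
    import os

    MARKER = "/var/run/ha-guest.started"       # hypothetical marker file

    def mark_started():
        open(MARKER, "w").close()              # called from the start action

    def clear_marker():
        if os.path.exists(MARKER):
            os.unlink(MARKER)                  # clean stop, or boot-time cleanup

    def should_be_running():
        return os.path.exists(MARKER)

    # in the monitor action: if should_be_running() is true but the domain is
    # not RUNNING, treat the guest as crashed and keep failing the start
    # action until an admin (or a boot-time cleanup) clears the marker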

> Generally I agree that we always start the guest on another node for
> failover. But are there any benefits if we can start the guest on the
> same node?

I don't believe that pacemaker supports this concept.

however, if you wanted to, you could have the OCF script know that there is
a 'crashed' instance and, instead of trying to start it, start a fresh copy.

>
>>
>> if it's told to stop it, do whatever you can to save state, but definitely
>> pause/freeze the instance and return 'stopped'
>>
>>
>>
>> no need to define some additional state. As far as pacemaker is concerned
>> it's safe as long as there is no chance of it changing the state of any
>> shared resources that the other system would use, so simply pausing the
>> instance will make it safe. It will be interesting when someone wants to
>> investigate what's going on inside the instance (you need to have it be
>> functional, but not able to use the network or any shared
>> drives/filesystems), but I don't believe that you can get that right in a
>> generic manner, the details of what will cause grief and what won't will
>> vary from site to site.
>
>
> If we cannot say in a generic manner, we usually choose the most conservative
> one: memory and ... preservation only.
>
> What concerns us the most is whether qemu actually guarantees the conditions
> we are discussing in this thread.

I'll admit that I'm not familiar with using qemu/KVM, but vmware/virtual
box/XEN all have an option to freeze all activity and save the ram to a
disk file for a future restart. the OCF script can trigger such an action
easily.
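
For qemu/KVM the closest equivalent is probably "virsh save" (virDomainSave),
which stops the guest and writes its memory image to a file that can be
restored later; a sketch via the Python bindings, with the path and domain
name made up:

    # sketch: save the guest's memory image to disk and stop it, roughly the
    # qemu/KVM counterpart of the freeze-to-file feature mentioned above
    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("ha-guest")                # hypothetical name
    dom.save("/var/lib/libvirt/save/ha-guest.img")     # guest is now shut off
    # "virsh restore /var/lib/libvirt/save/ha-guest.img" brings it back later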

>>> B. Our proposal: "introduce a new domain state to indicate failover-safe"
>>>
>>> Pacemaker...(OCF)....RA...(libvirt)...Qemu
>>>     |                 |                 |
>>>     |                 |                 |
>>> 1:  +---- start ----->+---------------->+   state=RUNNING
>>>     |                 |                 |
>>>     +---- monitor --->+---- domstate -->+
>>> 2:  |                 |                 |
>>>     +<---- "OK" ------+<--- "RUNNING" --+
>>>     |                 |                 |
>>>     |                 |                 |
>>>     |                 |                 *   Error: state=FROZEN
>>>     |                 |                 |     Qemu releases resources
>>>     |                 |                 |     and VM gets frozen. (*3)
>>>     +---- monitor --->+---- domstate -->+
>>> 3:  |                 |                 |
>>>     +<-- "STOPPED" ---+<--- "FROZEN" ---+
>>>     |                 |                 |
>>>     +---- stop ------>+---- domstate -->+
>>> 4:  |                 |                 |
>>>     +<---- "OK" ------+<--- "FROZEN" ---+
>>>     |                 |                 |
>>>     |                 |                 |
>>>
>>>
>>> 1: Pacemaker starts Qemu.
>>>
>>> 2: Pacemaker checks the state of Qemu via RA.
>>> RA checks the state of Qemu using virsh(libvirt).
>>> Qemu replies to RA "RUNNING"(normally executing), (*1)
>>> and RA returns the state to Pacemaker as it's running correctly.
>>>
>>> --- SOME ERROR HAPPENS ---
>>>
>>> 3: Pacemaker checks the state of Qemu via RA.
>>> RA checks the state of Qemu using virsh(libvirt).
>>> Qemu replies to RA "FROZEN"(VM stopped in a failover-safe state), (*3)
>>> and RA keeps it in mind, then replies to Pacemaker "STOPPED".
>>>
>>> (*3): this is what we want to introduce as a new state. Failover-safe means
>>> that Qemu released the external resources, including some namespaces, to be
>>> available from another instance.
>>
>> it doesn't need to release the resources. It just needs to not be able to
>> modify them.
>>
>> pacemaker on the host won't try to start another instance on the same
>> host, it will try to start an instance on another host. so you don't need
>> to worry about releasing memory, file locks, etc locally. for remote
>> resources you _can't_ release them gracefully if you crash, so your apps
>> already need to be able to handle that situation. there's no difference to
>> the other instances between a machine that gets powered off via STONITH
>> and a virtual system that gets paused.
>
>
> Can't pacemaker start another instance on the same host by configuration?

I don't think so. If you think about it from the pacemaker/heartbeat point
of view (where they don't know anything about virtual servers, they just
see them as applications), there are two choices for handling a failed
service.

1. issue a start command to try and bring it back up (as I note above, the
OCF script could be written to have this start a new copy instead of
restarting the old copy)

2. decide that if applications are crashing there may be something
wrong with the host and migrate services to another server


> Of course I agree that it may not be valuable in most situations.

a combination of this and the fact that this can be done so easily (and
flexibly) with scripts in the existing tools makes me question the value
of modifying the kernel.

David Lang