Date: Mon, 12 Jul 2010 15:22:01 +0900
From: Takuya Yoshikawa
To: david@lang.hm
CC: Fernando Luis Vazquez Cao, kvm@vger.kernel.org,
    Linux Kernel Mailing List, mori.keisuke@oss.ntt.co.jp,
    Chris Wright, Dor Laor, Lon Hohberger, "Perry N. Myers",
    Luiz Capitulino, berrange@redhat.com
Subject: Re: [RFC] High availability in KVM

(2010/07/11 7:36), david@lang.hm wrote:
> On Thu, 17 Jun 2010, Fernando Luis Vazquez Cao wrote:
>
>> Existing open source HA stacks such as pacemaker/corosync and Red
>> Hat Cluster Suite rely on software clustering techniques to detect
>> both hardware failures and software failures, and employ fencing to
>> avoid split-brain situations which, in turn, makes it possible to
>> perform failover safely. However, when applied to virtualization
>> environments these solutions show some limitations:
>>
>>  - Hardware detection relies on polling mechanisms (for example
>>    pinging a network interface to check for network connectivity),
>>    imposing a trade-off between failover time and the cost of
>>    polling. The alternative is having the failing system send an
>>    alarm to the HA software to trigger failover. The latter
>>    approach is preferable but it is not always applicable when
>>    dealing with bare metal; depending on the failure type the
>>    hardware may not be able to get a message out to notify the HA
>>    software. However, when it comes to virtualization environments
>>    we can certainly do better. If a hardware failure, be it real
>>    hardware or virtual hardware, is fully contained within a
>>    virtual machine, the host or hypervisor can detect that and
>>    notify the HA software safely using clean resources.
>
> you still need to detect failures that you won't be notified of.
>
> what if a network cable goes bad and your data isn't getting through?
> you won't get any notification of this without doing polling, even in
> a virtualized environment.

I agree that we need polling anyway.

> also, in a virtualized environment you may have firewall rules
> between virtual hosts; if those get misconfigured you may have
> 'virtual physical connectivity' still, but not the logical
> connectivity that you need.
>
>>  - In most cases, when a hardware failure is detected the state of
>>    the failing node is not known, which means that some kind of
>>    fencing is needed to lock resources away from that node.
>>    Depending on the hardware and the cluster configuration, fencing
>>    can be a pretty expensive operation that contributes to system
>>    downtime. Virtualization can help here.
>>    Upon failure detection the host or hypervisor could put the
>>    virtual machine in a quiesced state and release its hardware
>>    resources before notifying the HA software, so that it can start
>>    failover immediately without having to meddle with the failing
>>    virtual machine (we now know that it is in a known quiesced
>>    state). Of course this only makes sense in the event-driven
>>    failover case described above.
>>
>>  - Fencing operations commonly involve killing the virtual machine,
>>    thus depriving us of potentially critical debugging information:
>>    a dump of the virtual machine itself. This issue could be solved
>>    by providing a virtual machine control that puts the virtual
>>    machine in a known quiesced state and releases its hardware
>>    resources, but keeps the guest and device model in memory so
>>    that forensics can be conducted offline after failover. Polling
>>    HA resource agents should use this new command if postmortem
>>    analysis is important.
>
> I don't see this as the job of the virtualization hypervisor. the
> software HA stacks include the ability to run external scripts to
> perform these tasks. These scripts can perform whatever calls to the
> hypervisor that are appropriate to freeze, shutdown, or disconnect
> the virtual server (and what is appropriate will vary from
> implementation to implementation)

I see that it can be done with HA plus external scripts. But don't you
think we need a way to confirm that the VM is in a known quiesced
state?

Although it might not be the exact same scenario, here is what we are
planning as one possible next step (polling case):

==============================================================================
A. Current management: "Qemu/KVM + HA using libvirt interface"

  - Pacemaker interacts with the RA (Resource Agent) through the OCF
    interface.
  - The RA interacts with Qemu using virsh commands, IOW through the
    libvirt interface.

    Pacemaker...(OCF)....RA...(libvirt)....Qemu
        |                 |                   |
 1:     +---- start ----->+------------------>+   state=RUNNING
        |                 |                   |
        +---- monitor --->+---- domstate ---->+
 2:     |                 |                   |
        +<---- "OK" ------+<--- "RUNNING" ----+
        |                 |                   |
        |                 |                   |   * Error: state=SHUTOFF, or ...
        |                 |                   |
        +---- monitor --->+---- domstate ---->+
 3:     |                 |                   |
        +<-- "STOPPED" ---+<--- "SHUTOFF" ----+
        |                 |                   |
        +---- stop ------>+---- shutdown ---->+   VM killed (if still alive)
 4:     |                 |                   |
        +<---- "OK" ------+<--- "SHUTOFF" ----+
        |                 |                   |

1: Pacemaker starts Qemu.

2: Pacemaker checks the state of Qemu via the RA. The RA checks the
   state of Qemu using virsh (libvirt). Qemu replies to the RA
   "RUNNING" (normally executing), (*1) and the RA reports to
   Pacemaker that it is running correctly.

   (*1): libvirt defines the following domain states:

         enum virDomainState {
             VIR_DOMAIN_NOSTATE  = 0 : no state
             VIR_DOMAIN_RUNNING  = 1 : the domain is running
             VIR_DOMAIN_BLOCKED  = 2 : the domain is blocked on resource
             VIR_DOMAIN_PAUSED   = 3 : the domain is paused by user
             VIR_DOMAIN_SHUTDOWN = 4 : the domain is being shut down
             VIR_DOMAIN_SHUTOFF  = 5 : the domain is shut off
             VIR_DOMAIN_CRASHED  = 6 : the domain is crashed
         }

         We took the most common case, RUNNING, as the example, but it
         could be any state other than the failover targets, SHUTOFF
         and CRASHED.

--- SOME ERROR HAPPENS ---

3: Pacemaker checks the state of Qemu via the RA. The RA checks the
   state of Qemu using virsh (libvirt). Qemu replies to the RA
   "SHUTOFF", (*2) and the RA reports to Pacemaker that it has
   already stopped.

   (*2): Currently we only check for the "shut off" answer from the
         domstate command. Yes, we should care about both SHUTOFF and
         CRASHED if possible.

4: Pacemaker finally tries to confirm that it can safely start
   failover by sending a stop command. After killing Qemu, the RA
   replies to Pacemaker "OK" so that Pacemaker can start failover.

Problem: we lose debuggable information about the VM, such as the
contents of guest memory.
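(For reference, the monitor in steps 2 and 3 boils down to the
domain-state query sketched below, written against the libvirt C API
that virsh wraps. The monitor() helper, the "guest1" default name and
the OCF return-code mapping are only for illustration, not our actual
RA code.)

    /* monitor-sketch.c: roughly what "virsh domstate" gives the RA in
     * steps 2 and 3, mapped to OCF return codes.
     * Build with: gcc monitor-sketch.c -lvirt                         */
    #include <libvirt/libvirt.h>

    #define OCF_SUCCESS      0   /* resource is running                */
    #define OCF_ERR_GENERIC  1   /* unexpected error                   */
    #define OCF_NOT_RUNNING  7   /* resource is cleanly stopped        */

    static int monitor(const char *domname)
    {
        virConnectPtr conn = virConnectOpenReadOnly("qemu:///system");
        if (conn == NULL)
            return OCF_ERR_GENERIC;

        virDomainPtr dom = virDomainLookupByName(conn, domname);
        if (dom == NULL) {                /* not defined / not active  */
            virConnectClose(conn);
            return OCF_NOT_RUNNING;
        }

        virDomainInfo info;
        int rc = OCF_ERR_GENERIC;
        if (virDomainGetInfo(dom, &info) == 0) {
            switch (info.state) {
            case VIR_DOMAIN_RUNNING:
            case VIR_DOMAIN_BLOCKED:      /* blocked on a resource, alive */
                rc = OCF_SUCCESS;
                break;
            case VIR_DOMAIN_SHUTOFF:      /* the two failover targets  */
            case VIR_DOMAIN_CRASHED:
                rc = OCF_NOT_RUNNING;
                break;
            default:                      /* NOSTATE, PAUSED, SHUTDOWN */
                break;
            }
        }

        virDomainFree(dom);
        virConnectClose(conn);
        return rc;
    }

    int main(int argc, char **argv)
    {
        return monitor(argc > 1 ? argv[1] : "guest1");
    }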
B. Our proposal: introduce a new domain state to indicate "failover-safe"

    Pacemaker...(OCF)....RA...(libvirt)....Qemu
        |                 |                   |
 1:     +---- start ----->+------------------>+   state=RUNNING
        |                 |                   |
        +---- monitor --->+---- domstate ---->+
 2:     |                 |                   |
        +<---- "OK" ------+<--- "RUNNING" ----+
        |                 |                   |
        |                 |                   |   * Error: state=FROZEN
        |                 |                   |     Qemu releases resources
        |                 |                   |     and the VM gets frozen. (*3)
        |                 |                   |
        +---- monitor --->+---- domstate ---->+
 3:     |                 |                   |
        +<-- "STOPPED" ---+<--- "FROZEN" -----+
        |                 |                   |
        +---- stop ------>+---- domstate ---->+
 4:     |                 |                   |
        +<---- "OK" ------+<--- "FROZEN" -----+
        |                 |                   |

1: Pacemaker starts Qemu.

2: Pacemaker checks the state of Qemu via the RA. The RA checks the
   state of Qemu using virsh (libvirt). Qemu replies to the RA
   "RUNNING" (normally executing), (*1) and the RA reports to
   Pacemaker that it is running correctly.

--- SOME ERROR HAPPENS ---

3: Pacemaker checks the state of Qemu via the RA. The RA checks the
   state of Qemu using virsh (libvirt). Qemu replies to the RA
   "FROZEN" (VM stopped in a failover-safe state), (*3) and the RA,
   remembering this, replies to Pacemaker "STOPPED".

   (*3): This is what we want to introduce as a new state.
         "Failover-safe" means that Qemu has released the external
         resources, including some namespaces, so that they are
         available to another instance.

4: Pacemaker finally tries to confirm that it can safely start
   failover by sending a stop command. Knowing that the VM has
   already stopped in a failover-safe state, the RA does not kill the
   Qemu process and replies to Pacemaker "OK" so that Pacemaker can
   start failover as usual.
==============================================================================

In any case, we want to confirm that failover can be done safely, and
I could not find any API for that in libvirt.
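To make the proposal concrete, the RA-visible logic might change as
sketched below. Note that VIR_DOMAIN_FROZEN, its value and both
helpers are hypothetical; no such state exists in libvirt today:

    /* Entirely hypothetical: VIR_DOMAIN_FROZEN exists in no libvirt
     * release; the name, the value and both helpers are made up to
     * illustrate the proposal.                                        */
    #include <libvirt/libvirt.h>

    #define VIR_DOMAIN_FROZEN  7  /* vcpus stopped, external resources
                                     released, guest memory kept       */
    #define OCF_SUCCESS        0
    #define OCF_ERR_GENERIC    1
    #define OCF_NOT_RUNNING    7

    /* Step 3: report "stopped" to Pacemaker, but remember that the
     * guest image is still in memory for postmortem analysis.         */
    static int monitor_b(unsigned char state, int *frozen)
    {
        *frozen = (state == VIR_DOMAIN_FROZEN);
        if (*frozen || state == VIR_DOMAIN_SHUTOFF ||
            state == VIR_DOMAIN_CRASHED)
            return OCF_NOT_RUNNING;
        if (state == VIR_DOMAIN_RUNNING || state == VIR_DOMAIN_BLOCKED)
            return OCF_SUCCESS;
        return OCF_ERR_GENERIC;
    }

    /* Step 4: stop must NOT kill a frozen qemu process, so failover
     * can start while the guest dump is still available.              */
    static int stop_b(virDomainPtr dom, int frozen)
    {
        if (frozen)
            return OCF_SUCCESS;       /* already failover-safe         */
        return virDomainDestroy(dom) == 0 ? OCF_SUCCESS
                                          : OCF_ERR_GENERIC;
    }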
> providing sample scripts that do this for the various HA stacks makes
> sense as it gives people examples of what can be done and lets them
> tailor exactly what does happen to their needs.
>
>> We are pursuing a scenario where current polling-based HA resource
>> agents are complemented with an event-driven failure notification
>> mechanism that allows for faster failover times by eliminating the
>> delay introduced by polling and by doing without fencing. This would
>> benefit traditional software clustering stacks and bring a feature
>> that is essential for fault tolerance solutions such as Kemari.
>
> heartbeat/pacemaker has been able to do sub-second failovers for
> several years, I'm not sure that notification is really needed.
>
> that being said the HA stacks do allow for commands to be fed into
> the HA system to tell a machine to go active/passive already, so why
> don't you have your notification just call scripts to make the
> appropriate calls?
>
>> Additionally, for those who want or need to stick with a polling
>> model we would like to provide a virtual machine control that
>> freezes a virtual machine into a failover-safe state without killing
>> it, so that postmortem analysis is still possible.
>
> how is this different from simply pausing the virtual machine?

I think it is almost the same as pause but, more precisely, with all
vcpus stopped and all resources except memory released. This, of
course, should take care of namespaces.

Takuya

>
>> In the following sections we discuss the RAS-HA integration
>> challenges and the changes that need to be made to each component of
>> the qemu-KVM stack to realize this vision. While at it, we will also
>> delve into some of the limitations of the current hardware error
>> subsystems of the Linux kernel.
>>
>>
>> HARDWARE ERRORS AND HIGH AVAILABILITY
>>
>> The major open source software stacks for Linux rely on polling
>> mechanisms to detect both software errors and hardware failures. For
>> example, ping or an equivalent is widely used to check for network
>> connectivity interruptions. This is enough to get the job done in
>> most cases but one is forced to make a trade-off between service
>> disruption time and the burden imposed by the polling resource
>> agent.
>>
>> On the hardware side of things, the situation can be improved if we
>> take advantage of CPU and chipset RAS capabilities to trigger
>> failover in the event of a non-recoverable error or, even better, do
>> it preventively when hardware informs us things might go awry. The
>> premise is that RAS features such as hardware failure notification
>> can be leveraged to minimize or even eliminate service downtime.
>
> having run dozens of sets of HA systems for about 10 years, I find
> that very few of the failures that I have experienced would have been
> helped by this. hardware very seldom gives me any indication that
> it's about to fail, and even when it does fail it's usually only
> discovered due to the side effects of other things I am trying to do
> not working.
>
>> Generally speaking, hardware errors reported to the operating system
>> can be classified into two broad categories: corrected errors and
>> uncorrected errors. The latter are not necessarily critical errors
>> that require a system restart; depending on the hardware and the
>> software running on the affected system resource, such errors may be
>> recoverable. The picture looks like this (definitions taken from
>> "Advanced Configuration and Power Interface Specification, Revision
>> 4.0a" and slightly modified to get rid of ACPI jargon):
>>
>>  - Corrected error: Hardware error condition that has been
>>    corrected by the hardware or by the firmware by the time the
>>    kernel is notified about the existence of an error condition.
>>
>>  - Uncorrected error: Hardware error condition that cannot be
>>    corrected by the hardware or by the firmware. Uncorrected errors
>>    are either fatal or non-fatal.
>>
>>     o A fatal hardware error is an uncorrected or uncontained
>>       error condition that is determined to be unrecoverable by
>>       the hardware. When a fatal uncorrected error occurs, the
>>       system is usually restarted to prevent propagation of the
>>       error.
>>
>>     o A non-fatal hardware error is an uncorrected error condition
>>       from which the kernel can attempt recovery by trying to
>>       correct the error. These are also referred to as correctable
>>       or recoverable errors.
>>
>> Corrected errors are inoffensive in principle, but they may be
>> harbingers of fatal non-recoverable errors. It is thus reasonable in
>> some cases to do preventive failover or live migration when a
>> certain threshold is reached. However, this is arguably the job of
>> systems management software, not the HA stack, so this case will not
>> be discussed in detail here.
>
> the easiest way to do this is to log the correctable errors and let
> normal log analysis tools notice these errors and decide to take
> action. trying to make the hypervisor do something here is putting
> policy in the wrong place.
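For what it's worth, the kind of threshold policy mentioned above is
easy to keep out of the hypervisor. A toy sketch of what a log-watcher
or management-daemon rule boils down to; the constants and the
corrected_error_seen() helper are invented for illustration:

    #include <time.h>

    #define CE_THRESHOLD 10            /* corrected errors ...          */
    #define CE_WINDOW    (24 * 3600)   /* ... per 24h window (seconds)  */

    static time_t       window_start;
    static unsigned int ce_count;

    /* Call once per corrected-error event (e.g. per matching log line);
     * returns 1 when preventive migration/failover looks advisable.    */
    int corrected_error_seen(void)
    {
        time_t now = time(NULL);

        if (ce_count == 0 || now - window_start > CE_WINDOW) {
            window_start = now;        /* open a fresh accounting window */
            ce_count = 0;
        }
        return ++ce_count >= CE_THRESHOLD;
    }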
>> Uncorrected errors are the ones HA software cares about.
>>
>> When a fatal hardware error occurs the firmware may decide to
>> restart the hardware. If the fatal error is relayed to the kernel
>> instead, the safest thing to do is to panic to avoid further
>> damage. Even though it is theoretically possible to send a
>> notification from the kernel's error or panic handler, this is an
>> extremely hardware-dependent operation and will not be considered
>> here. To detect this type of failure, one's old reliable
>> polling-based resource agent is the way to go.
>
> and in this case you probably cannot trust the system to send
> notification without damaging things further, simply halting is
> probably the only safe thing to do.
>
>> Non-fatal or recoverable errors are the most interesting of the
>> pack. Detection should ideally be performed in a non-intrusive way
>> and feed the policy engine with enough information about the error
>> to make the right call. If the policy engine decides that the error
>> might compromise service continuity it should notify the HA stack so
>> that failover can be started immediately.
>
> again, log the errors and let existing log analysis/alerting tools
> decide what action to take.
>
>> Currently KVM is only notified about memory errors detected by the
>> MCE subsystem. When running on newer x86 hardware, if MCE detects an
>> error in user space it signals the corresponding process with
>> SIGBUS. Qemu, upon receiving the signal, checks the problematic
>> address, which the kernel stored in siginfo, and decides whether to
>> inject the MCE into the virtual machine.
>>
>> An obvious limitation is that we would like to be notified about
>> other types of error too and, as suggested before, a file-based
>> interface that can be sys_poll'ed might be needed for that. On a
>> different note, in an HA environment the qemu policy described
>> above is not adequate; when a notification of a hardware error that
>> our policy determines to be serious arrives, the first thing we want
>> to do is to put the virtual machine in a quiesced state to avoid
>> further wreckage. If we injected the error into the guest we would
>> risk a guest panic that might be detectable only by polling or,
>> worse, the guest being killed by the kernel, which means that
>> postmortem analysis of the guest is not possible. Once we had the
>> guests in a quiesced state, where all the buffers have been flushed
>> and the hardware resources released, we would have two modes of
>> operation that can be used together and complement each other.
>
> it sounds like you really need to be running HA at two layers
>
> 1. on the host layer to detect problems with the host and decide to
> freeze/migrate virtual machines to another system
>
> 2. inside the guests to make sure that the guests that are running
> (on multiple real machines) continue to provide services.
>
> but what is your alternative to sending the error into the guest?
> depending on what the error is you may or may not be able to freeze
> the guest (it makes no sense to try and flush buffers to a drive that
> won't accept writes for example)
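By the way, the SIGBUS path described above can be sketched as follows
(simplified, and not qemu's actual code; the handler body is
illustrative). BUS_MCEERR_* are the hwpoison si_code values the kernel
uses, defined here in case the libc headers lack them:

    #include <signal.h>

    #ifndef BUS_MCEERR_AR
    #define BUS_MCEERR_AR 4   /* action required: poisoned data consumed */
    #endif
    #ifndef BUS_MCEERR_AO
    #define BUS_MCEERR_AO 5   /* action optional: poisoned page not yet used */
    #endif

    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;

        if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
            void *poisoned = si->si_addr;  /* address the kernel recorded */
            (void)poisoned;
            /* qemu maps this back to a guest physical address and decides
             * whether to inject an MCE; an HA-aware policy could instead
             * quiesce the VM here to keep it debuggable.                 */
        }
    }

    int main(void)
    {
        struct sigaction act;
        act.sa_sigaction = sigbus_handler;
        act.sa_flags = SA_SIGINFO;
        sigemptyset(&act.sa_mask);
        sigaction(SIGBUS, &act, NULL);
        /* ... run the vcpu threads ... */
        return 0;
    }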
>>  - Proactive: A qmp event describing the error (severity, topology,
>>    etc.) is emitted. The HA software would have to register to
>>    receive hardware error events, possibly using the libvirt
>>    bindings. Upon receiving the event the HA software would know
>>    that the guest is in a failover-safe quiesced state, so it could
>>    do without fencing and proceed to the failover stage directly.
>
> if it's not a fatal error then the system can continue to run (for at
> least a few more seconds ;-), let such errors get written to syslog
> and let a tool like SEC (simple event correlator) see the logs and
> decide what to do. there's no need to modify the kernel/KVM for this.
>
>>  - Passive: Polling resource agents that need to check the state of
>>    the guest generally use libvirt or a wrapper such as virsh. When
>>    the state is SHUTOFF or CRASHED the resource agent proceeds to
>>    the fencing stage, which might be expensive and usually involves
>>    killing the qemu process. We propose adding a new state that
>>    indicates the failover-safe state described before. In this
>>    state the HA software would not need to use fencing techniques
>>    and, since the qemu process is not killed, postmortem analysis of
>>    the virtual machine is still possible.
>
> how do you define failover-safe states? why would the HA software
> (with the assistance of a log watcher) not be able to do the job
> itself?
>
> I do think that it's significant that all the HA solutions out there
> prefer to test if the functionality works rather than watching for
> log events to say there may be a problem, but there's nothing
> preventing this from easily being done.
>
> David Lang