MIME-Version: 1.0
In-Reply-To: <53B6897F.6020802@schoebel-theuer.de>
References: <53B6897F.6020802@schoebel-theuer.de>
Date: Fri, 4 Jul 2014 15:04:07 +0200
Message-ID: <CAFLxGvynFuCh0puda06kTH1P3i=701dQO-+opjLnQrd2KqUTHA@mail.gmail.com>
Subject: Re: Selling Points for MARS Light
From: Richard Weinberger <richard.weinberger@gmail.com>
To: =?UTF-8?Q?Thomas_Sch=C3=B6bel=2DTheuer?= 
	<thomas@schoebel-theuer.de>
Cc: LKML <linux-kernel@vger.kernel.org>, Christoph Hellwig <hch@infradead.org>,
        Greg KH <gregkh@linuxfoundation.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org

On Fri, Jul 4, 2014 at 1:01 PM, Thomas Schöbel-Theuer
<thomas@schoebel-theuer.de> wrote:
> Hi together,
>
> since not all people have attended my presentations on MARS Light, and
> since probably not all are willing to click at links such as to the
> presentation slides
> https://github.com/schoebel/mars/blob/master/docu/MARS_LinuxTag2014.pdf?raw=true
> , I will try to explain something which goes beyond the presentations,
> but nevertheless I don't want to repeat too much from the slides.
>
> So the following is hopefully also interesting for people who attended
> the presentations.
>
> My main selling point:
>
> To my knowledge, MARS Light is the only enterprise-grade opensource
> solution constructed for mass replication of whole datacenters
> (thousands to tenthousands of instances) over _long_ _distances_, aka
> _geo-redundancy_.

I think hch meant not a high level marketing-drone -ready design document
by "explain and sell your design".
We are interested in the real low level design ideas.
i.e. Why do you need all syscalls exported?

> Lets take a guided tour through important keywords given by this sentence.
>
> 1) geo-redundancy
>
> There exist some big companies operating datacenters, typically equipped
> with many thousands of servers each (in some cases it goes up to the
> order of hundred-thousand servers per datacenter). Many big companies
> have more than one datacenter.
>
> CEOs are not sleeping well at night when they dream of artificial or
> natural disasters such as terrorist bombing, earthquakes, fire, or
> simply power blackouts due to human error or technical failures of
> components which should protect against that, but don't work at the
> wrong moment.
>
> In addition, when companies are traded at the stock exchange, they are
> ranked not only by the market, but also by some rating agencies.
>
> If you (as a company) have implemented geo-redundancy as a means for
> failover to a backup datacenter, your rating will improve.
>
> If the _existence_ of your company is _depending_ on IT and it is
> potentially vulnerable by such disasters, geo-redundanancy is not even a
> matter of ranking, but simply a _must_.
>
> 2) long-distance
>
> In theory, this is just a simple consequence from geo-redundancy: the
> longer the distances between your datacenters, the better the protection
> against large-scale disasters such as earthquakes.
>
> There are further reasons for long distances. Probably the most
> important is the cost of electricity, which is much cheaper (by
> _factors_) when building your datacenters near mountains with cheap
> hydro-electric power, but far away from overcrowded cities. The cost for
> electricity is one of the main costs (I know some outlines of business
> cases).
>
> Further reasons are the structure of the internet: the alternative
> routing / multipathing does not always work well, e.g. in case of mass
> DDOS attacks. You don't want to concentrate too big masses of
> datacenters nearby at the globe, and you don't want to go too far away
> from the big masses of your customers / web surfers if you are a
> webhosting company such as 1&1.
>
> When you are world-wide company, distances between your datacenters are
> likely to be high, naturally.
>
> In practice, long distances raise very difficult technical problems,
> which have to do with
>
> a) Einstein's law, the speed of light (aka network latencies), and
>
> b) network bottlenecks due to multiplexing masses of parallel data
> traffic through single wide-distance fibres via routers (aka packet loss),
>
> c) frequent short network outages, and even long-lasting network outages
> _do_ occur from time to time. See the daily press.
>
> Notice that in TCP/IP, packet loss is the _only_ means for congestion
> control.
>
> It is _impossible_ to avoid packet loss over shared long distance lines
> in practice. You would need to _guarantee_ an extremely high total
> bandwidth, capable of delivering every single packet from a packet
> storm, or from a DDOS attack. Even if this were technically possible by
> building up extremely high bandwidth capacities (which isn't in many
> cases), it would be so expensive that it is essentially unaffordable.
>
> 3) replication of whole datacenters
>
> 3a) block layer
>
> Many datacenters are hosting a wide spectrum of heterogeneous
> applications. Many of them carry a _history_ (sometimes dating back over
> decades), i.e. there exist many "running systems" which should "never be
> touched" in sysadmin speak, or even be re-engineered just for the sake
> of adding replication in order to improve your stock market ranking.
> That would be expensive, and it could go wrong.
>
> Therefore, long-distance replication is needed on block device layer.
>
> Doing it there is _no_ _fundamental_ change of the overall architecture,
> because almost every contemporary standard server has hard disks or SSDs
> with a block layer interface.
>
> Only in case of implementing _new_ applications, you could try to do
> long-distance replication _always_ (consistently thoughout your whole
> datacenter, even when it contains some Windows boxes) at higher levels
> such as filesystem level or application level (e.g. by programming it
> yourself), but I wish you good luck. There is a too wide spectrum of
> applications, operating systems, storage systems, IO load behaviour, etc
> for allowing general statements such as "do long-distance replication
> _always_ on filesystem level". Maybe that it is _sometimes_ both
> possible and affordable (and maybe sometimes even advantageous) to do
> long-distance replication at filesystem or even at application level,
> but not generally.
>
> 3b) synchronous vs asynchronous replication
>
> Probably most of you already know that "synchronous" replication means
> to apply a write-through strategy. A write operation is only signalled
> back to the initiator as "completed" when it has been commited to all
> replicas, that means transferred from the active datacenter side to the
> passive one (or to _all_passive ones when having k > 2 replicas), and
> after an answer indicating successful completion has been transferred
> back over the network again (ping-pong game).
>
> Some years ago, I evaluated DRBD (which works synchronously, and which
> was already in productive use for _short_ distances for some years where
> it worked well), but this time for long-distance replication between one
> of our datacenters located in Europe and another one in USA over a 10Gig
> line which is fully administrated by our WAN department, and which was
> only lowly loaded at that time.
>
> Guess the result?
>
> Well, I did a detailed analysis using an early prototype version of my
> blkreplay toolkit (see www.blkreplay.org). I used a real-life load
> recorded by blktrace from shared hosting in our datacenter. Blkreplay
> generates very detailed analysis graphics on latency behaviour, as well
> as on many other interesting properties like workingset behaviour of the
> application etc.
>
> Ok, what do you guess?
>
> At that time, I asked our sysadmins internally: nobody was betting that
> there was a chance that it could work. All were believing that the
> network latencies of ~120ms (ping) would kill the idea.
>
> They were right with respect to the final result. It was easy to see
> from the so-called "sonar diagram" graphics, even for our management,
> that it does not work. I tried lots of network settings such as TCP send
> buffer tuning, all available TCP congestion control algorithms including
> the historic ones, etc, but nothing could help.
>
> I strongly believe that it is not the fault of the DRBD implementation,
> but the fault of synchronous replication at _concept_ level. I got oral
> reports from other people inside 1&1 that similiar long-distance setups
> with commercial storage boxes in synchronous mode also don't work, at
> least for them.
>
> Well, when re-looking at the old graphics with my present experience, I
> should modify my former conclusions a little bit. Not the basic
> latencies as such were likely killing the idea of using synchronous
> replication for long distances, but the packet loss (and what's
> happening after that). Details are out of scope of this discussion (you
> can ask me in personal conversations).
>
> OK, can I assume in this audience that I don't need to provide a
> stronger proof that asynchronous replication is _mandatory_ for
> long-distance replication?
>
> With help from Philipp Reisner from Linbit, I also evaluated the
> combination DRBD + proxy which does asynchronous replication. Since
> proxy is under a commercial license, it is probably out of scope of this
> discussion in this forum. For interested people: a feature comparison
> between MARS Light and DRBD+proxy is on slide 4 of the LinuxTag2014
> presentation.
>
> In short: proxy uses RAM for asynchronous buffering. There is no
> persistence at that point. So you can imagine why we decided not to use
> it for replication of whole datacenters. Details are out of scope here.
>
> 4) enterprise grade
>
> I am arguing on feature level here, not on maturity (where MARS Light
> certainly has to catch up more, at least when compared to
> long-established solutions like DRBD).
>
> At feature level, you need at least all of the following things when
> replicating whole datacenters over long distances:
>
> a) fully automatic handover / switchover
>
> Not only in case of disasters, but also for operational reasons such as
> necessary reboots because of security fixes or kernel upgrades, you need
> to switch the primary side in a controlled manner. Whenever possible,
> you should avoid creating a split-brain.
>
> Notice that the _problem space_ long-distance / asynchronous replication
> bears some hidden pitfalls which are not obvious even when you are
> already familiar with synchronous replication. For example, it is
> impossible to eliminate split brains in general, you can only do your
> best to avoid it.
>
> b) very close-automatic resolution of split brain
>
> That sounds easy for DRBD users, but it isn't when having huge diverging
> backlogs caused by long-lasting network disruptions or other problems,
> and/or when this is combined with more than 2 replicas. You may argue
> that you should avoid (or even try to prohibit wherever you can) any
> split brain situations in advance because you loose data upon
> resolution, but such an argumentation is not practical. Shit happens in
> practice.
>
> You must be able to clean up the shit once it has happened, as
> automatically as possible.
>
> c) shared storage for the backlogs (called "transaction logs" in MARS Light)
>
> When you have to have n resources on the same storage server (which is a
> _must_ for storage consolidation in big datacenters), and n is at least
> in the order of 10 (or even is planned to grow to other powers of 10 in
> future), you will notice a problem which is probably not yet clear to
> everyone here.
>
> Let me try to explain it.
>
> Probably at least the filesystem experts present in this audience will
> know the 90/10 rule, sometimes also called the 80/20 rule, or other
> naming variants.
>
> It is an _observation_ from _practice_.
>
> The rule is not only occurring in filesystems, but also in many other
> places. Well-known instances are "90% of accesses go to 10% of all the
> data", "10% of all files consume 90% of all storage space because they
> are so big", "10% of all users create 90% of the system load", "10% of
> all customers create 90% of our revenue", "90% of all complaints are
> caused by 10% of all customers", and so on. The list can be extended
> almost indefinitely.
>
> Please don't argue whether 90%/10% ist "correct" or not; in some cases
> the "correct" numbers are 95%/5% or even 99%/1%, while in other cases
> its only 70%/30%, or whatever. It is just a practical rule of thumb
> where exact numbers don't matter. In many cases the so-called "Zipf's
> law" appears to be the reason for the rule, but even Zipf's law is an
> observation from practice where no full scientific proof exists up to
> now AFAIK.
>
> My point is: the 90%/10% thumb rule also applies, you guess it, to what?
>
> When you subdivide your available storage storage space into n
> statically allocated ring buffers as a means for storing the backlogs
> (which are _needed_ for asynchronous replication), the cited rule of
> thumb says that you are wasting about one order of magnitude of storage
> space when the first ring buffer is full, but all other (n - 1) ring
> buffers are not yet full. The rule says that for sufficiently large n,
> about 10% of all ring buffers will contain about 90% of all waste.
>
> Imagine: translate this into € or $ for thousands or tenthousands of
> servers, then you probably will get the point.
>
> Like there is no scientific proof for Zipf's law, I cannot prove this in
> strong sense. But this is just what I observe whenever I log into an
> arbitrary MARS node having some long-lasting network problem, and when I
> type "du -s /mars/resource-* | sort -n".
>
> IMNSHO, this observation retrospectively justifies why an
> enterprise-class asynchronous replication system needs not only a
> _shared_ storage for _masses_ of backlogs, but also a _dynamic_ storage.
>
> Some additional justification for the latter: notice that Zipf's law and
> friends are not only observerable in the space dimension, but also in
> the time dimension when looking at the throughput, at least when
> choosing the right scale.
>
> Consequence from this: block layer concepts, which are supposed to deal
> with rather static or near-static storage spaces, are not neccesarily
> the appropriate concepts when implementing huge masses of backlogs for
> huge masses of resources, which in turn is needed for enterprise-class
> reasons.
>
> 5) "only"
>
> I know that there exist some opensource prototype implementations of
> asynchronous long-distance replication, more or less in form of concept
> studies or similar.
>
> I don't know of any one which implements the enterprise-grade
> requirements for thousands of servers / whole datacenters as pointed out
> above.
>
> --
>
> OK, I am at the end of this posting, and I deliberately omitted the
> other main selling point from the slides called "Anytime Consistency".
> When interested, you can ask questions,which I will try to answer as
> best as I can.
>
> Cheers,
>
> Thomas
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


-- 
Thanks,
//richard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/