Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932398AbaGDNEN (ORCPT ); Fri, 4 Jul 2014 09:04:13 -0400 Received: from mail-vc0-f176.google.com ([209.85.220.176]:65167 "EHLO mail-vc0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754302AbaGDNEI convert rfc822-to-8bit (ORCPT ); Fri, 4 Jul 2014 09:04:08 -0400 MIME-Version: 1.0 In-Reply-To: <53B6897F.6020802@schoebel-theuer.de> References: <53B6897F.6020802@schoebel-theuer.de> Date: Fri, 4 Jul 2014 15:04:07 +0200 Message-ID: Subject: Re: Selling Points for MARS Light From: Richard Weinberger To: =?UTF-8?Q?Thomas_Sch=C3=B6bel=2DTheuer?= Cc: LKML , Christoph Hellwig , Greg KH Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 4, 2014 at 1:01 PM, Thomas Schöbel-Theuer wrote: > Hi together, > > since not all people have attended my presentations on MARS Light, and > since probably not all are willing to click at links such as to the > presentation slides > https://github.com/schoebel/mars/blob/master/docu/MARS_LinuxTag2014.pdf?raw=true > , I will try to explain something which goes beyond the presentations, > but nevertheless I don't want to repeat too much from the slides. > > So the following is hopefully also interesting for people who attended > the presentations. > > My main selling point: > > To my knowledge, MARS Light is the only enterprise-grade opensource > solution constructed for mass replication of whole datacenters > (thousands to tenthousands of instances) over _long_ _distances_, aka > _geo-redundancy_. I think hch meant not a high level marketing-drone -ready design document by "explain and sell your design". We are interested in the real low level design ideas. i.e. Why do you need all syscalls exported? > Lets take a guided tour through important keywords given by this sentence. > > 1) geo-redundancy > > There exist some big companies operating datacenters, typically equipped > with many thousands of servers each (in some cases it goes up to the > order of hundred-thousand servers per datacenter). Many big companies > have more than one datacenter. > > CEOs are not sleeping well at night when they dream of artificial or > natural disasters such as terrorist bombing, earthquakes, fire, or > simply power blackouts due to human error or technical failures of > components which should protect against that, but don't work at the > wrong moment. > > In addition, when companies are traded at the stock exchange, they are > ranked not only by the market, but also by some rating agencies. > > If you (as a company) have implemented geo-redundancy as a means for > failover to a backup datacenter, your rating will improve. > > If the _existence_ of your company is _depending_ on IT and it is > potentially vulnerable by such disasters, geo-redundanancy is not even a > matter of ranking, but simply a _must_. > > 2) long-distance > > In theory, this is just a simple consequence from geo-redundancy: the > longer the distances between your datacenters, the better the protection > against large-scale disasters such as earthquakes. > > There are further reasons for long distances. Probably the most > important is the cost of electricity, which is much cheaper (by > _factors_) when building your datacenters near mountains with cheap > hydro-electric power, but far away from overcrowded cities. The cost for > electricity is one of the main costs (I know some outlines of business > cases). > > Further reasons are the structure of the internet: the alternative > routing / multipathing does not always work well, e.g. in case of mass > DDOS attacks. You don't want to concentrate too big masses of > datacenters nearby at the globe, and you don't want to go too far away > from the big masses of your customers / web surfers if you are a > webhosting company such as 1&1. > > When you are world-wide company, distances between your datacenters are > likely to be high, naturally. > > In practice, long distances raise very difficult technical problems, > which have to do with > > a) Einstein's law, the speed of light (aka network latencies), and > > b) network bottlenecks due to multiplexing masses of parallel data > traffic through single wide-distance fibres via routers (aka packet loss), > > c) frequent short network outages, and even long-lasting network outages > _do_ occur from time to time. See the daily press. > > Notice that in TCP/IP, packet loss is the _only_ means for congestion > control. > > It is _impossible_ to avoid packet loss over shared long distance lines > in practice. You would need to _guarantee_ an extremely high total > bandwidth, capable of delivering every single packet from a packet > storm, or from a DDOS attack. Even if this were technically possible by > building up extremely high bandwidth capacities (which isn't in many > cases), it would be so expensive that it is essentially unaffordable. > > 3) replication of whole datacenters > > 3a) block layer > > Many datacenters are hosting a wide spectrum of heterogeneous > applications. Many of them carry a _history_ (sometimes dating back over > decades), i.e. there exist many "running systems" which should "never be > touched" in sysadmin speak, or even be re-engineered just for the sake > of adding replication in order to improve your stock market ranking. > That would be expensive, and it could go wrong. > > Therefore, long-distance replication is needed on block device layer. > > Doing it there is _no_ _fundamental_ change of the overall architecture, > because almost every contemporary standard server has hard disks or SSDs > with a block layer interface. > > Only in case of implementing _new_ applications, you could try to do > long-distance replication _always_ (consistently thoughout your whole > datacenter, even when it contains some Windows boxes) at higher levels > such as filesystem level or application level (e.g. by programming it > yourself), but I wish you good luck. There is a too wide spectrum of > applications, operating systems, storage systems, IO load behaviour, etc > for allowing general statements such as "do long-distance replication > _always_ on filesystem level". Maybe that it is _sometimes_ both > possible and affordable (and maybe sometimes even advantageous) to do > long-distance replication at filesystem or even at application level, > but not generally. > > 3b) synchronous vs asynchronous replication > > Probably most of you already know that "synchronous" replication means > to apply a write-through strategy. A write operation is only signalled > back to the initiator as "completed" when it has been commited to all > replicas, that means transferred from the active datacenter side to the > passive one (or to _all_passive ones when having k > 2 replicas), and > after an answer indicating successful completion has been transferred > back over the network again (ping-pong game). > > Some years ago, I evaluated DRBD (which works synchronously, and which > was already in productive use for _short_ distances for some years where > it worked well), but this time for long-distance replication between one > of our datacenters located in Europe and another one in USA over a 10Gig > line which is fully administrated by our WAN department, and which was > only lowly loaded at that time. > > Guess the result? > > Well, I did a detailed analysis using an early prototype version of my > blkreplay toolkit (see www.blkreplay.org). I used a real-life load > recorded by blktrace from shared hosting in our datacenter. Blkreplay > generates very detailed analysis graphics on latency behaviour, as well > as on many other interesting properties like workingset behaviour of the > application etc. > > Ok, what do you guess? > > At that time, I asked our sysadmins internally: nobody was betting that > there was a chance that it could work. All were believing that the > network latencies of ~120ms (ping) would kill the idea. > > They were right with respect to the final result. It was easy to see > from the so-called "sonar diagram" graphics, even for our management, > that it does not work. I tried lots of network settings such as TCP send > buffer tuning, all available TCP congestion control algorithms including > the historic ones, etc, but nothing could help. > > I strongly believe that it is not the fault of the DRBD implementation, > but the fault of synchronous replication at _concept_ level. I got oral > reports from other people inside 1&1 that similiar long-distance setups > with commercial storage boxes in synchronous mode also don't work, at > least for them. > > Well, when re-looking at the old graphics with my present experience, I > should modify my former conclusions a little bit. Not the basic > latencies as such were likely killing the idea of using synchronous > replication for long distances, but the packet loss (and what's > happening after that). Details are out of scope of this discussion (you > can ask me in personal conversations). > > OK, can I assume in this audience that I don't need to provide a > stronger proof that asynchronous replication is _mandatory_ for > long-distance replication? > > With help from Philipp Reisner from Linbit, I also evaluated the > combination DRBD + proxy which does asynchronous replication. Since > proxy is under a commercial license, it is probably out of scope of this > discussion in this forum. For interested people: a feature comparison > between MARS Light and DRBD+proxy is on slide 4 of the LinuxTag2014 > presentation. > > In short: proxy uses RAM for asynchronous buffering. There is no > persistence at that point. So you can imagine why we decided not to use > it for replication of whole datacenters. Details are out of scope here. > > 4) enterprise grade > > I am arguing on feature level here, not on maturity (where MARS Light > certainly has to catch up more, at least when compared to > long-established solutions like DRBD). > > At feature level, you need at least all of the following things when > replicating whole datacenters over long distances: > > a) fully automatic handover / switchover > > Not only in case of disasters, but also for operational reasons such as > necessary reboots because of security fixes or kernel upgrades, you need > to switch the primary side in a controlled manner. Whenever possible, > you should avoid creating a split-brain. > > Notice that the _problem space_ long-distance / asynchronous replication > bears some hidden pitfalls which are not obvious even when you are > already familiar with synchronous replication. For example, it is > impossible to eliminate split brains in general, you can only do your > best to avoid it. > > b) very close-automatic resolution of split brain > > That sounds easy for DRBD users, but it isn't when having huge diverging > backlogs caused by long-lasting network disruptions or other problems, > and/or when this is combined with more than 2 replicas. You may argue > that you should avoid (or even try to prohibit wherever you can) any > split brain situations in advance because you loose data upon > resolution, but such an argumentation is not practical. Shit happens in > practice. > > You must be able to clean up the shit once it has happened, as > automatically as possible. > > c) shared storage for the backlogs (called "transaction logs" in MARS Light) > > When you have to have n resources on the same storage server (which is a > _must_ for storage consolidation in big datacenters), and n is at least > in the order of 10 (or even is planned to grow to other powers of 10 in > future), you will notice a problem which is probably not yet clear to > everyone here. > > Let me try to explain it. > > Probably at least the filesystem experts present in this audience will > know the 90/10 rule, sometimes also called the 80/20 rule, or other > naming variants. > > It is an _observation_ from _practice_. > > The rule is not only occurring in filesystems, but also in many other > places. Well-known instances are "90% of accesses go to 10% of all the > data", "10% of all files consume 90% of all storage space because they > are so big", "10% of all users create 90% of the system load", "10% of > all customers create 90% of our revenue", "90% of all complaints are > caused by 10% of all customers", and so on. The list can be extended > almost indefinitely. > > Please don't argue whether 90%/10% ist "correct" or not; in some cases > the "correct" numbers are 95%/5% or even 99%/1%, while in other cases > its only 70%/30%, or whatever. It is just a practical rule of thumb > where exact numbers don't matter. In many cases the so-called "Zipf's > law" appears to be the reason for the rule, but even Zipf's law is an > observation from practice where no full scientific proof exists up to > now AFAIK. > > My point is: the 90%/10% thumb rule also applies, you guess it, to what? > > When you subdivide your available storage storage space into n > statically allocated ring buffers as a means for storing the backlogs > (which are _needed_ for asynchronous replication), the cited rule of > thumb says that you are wasting about one order of magnitude of storage > space when the first ring buffer is full, but all other (n - 1) ring > buffers are not yet full. The rule says that for sufficiently large n, > about 10% of all ring buffers will contain about 90% of all waste. > > Imagine: translate this into € or $ for thousands or tenthousands of > servers, then you probably will get the point. > > Like there is no scientific proof for Zipf's law, I cannot prove this in > strong sense. But this is just what I observe whenever I log into an > arbitrary MARS node having some long-lasting network problem, and when I > type "du -s /mars/resource-* | sort -n". > > IMNSHO, this observation retrospectively justifies why an > enterprise-class asynchronous replication system needs not only a > _shared_ storage for _masses_ of backlogs, but also a _dynamic_ storage. > > Some additional justification for the latter: notice that Zipf's law and > friends are not only observerable in the space dimension, but also in > the time dimension when looking at the throughput, at least when > choosing the right scale. > > Consequence from this: block layer concepts, which are supposed to deal > with rather static or near-static storage spaces, are not neccesarily > the appropriate concepts when implementing huge masses of backlogs for > huge masses of resources, which in turn is needed for enterprise-class > reasons. > > 5) "only" > > I know that there exist some opensource prototype implementations of > asynchronous long-distance replication, more or less in form of concept > studies or similar. > > I don't know of any one which implements the enterprise-grade > requirements for thousands of servers / whole datacenters as pointed out > above. > > -- > > OK, I am at the end of this posting, and I deliberately omitted the > other main selling point from the slides called "Anytime Consistency". > When interested, you can ask questions,which I will try to answer as > best as I can. > > Cheers, > > Thomas > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Thanks, //richard -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/