Notes on Distributed Systems Development

by Shubheksha

Distributed Computing in a Nutshell: How Distributed Systems Work

This post distills the material presented in the paper titled “A Note on Distributed Systems” published in 1994 by Jim Waldo and others.

The paper presents the differences between local and distributed computing in the context of Object Oriented Programming. It explains why treating them the same is incorrect and leads to applications that aren’t robust or reliable.

Introduction

The paper kicks off by stating that the current work in distributed systems is modeled around objects — more specifically, a unified view of objects. Objects are defined by their supported interfaces and the operations they support.

Naturally, this can be extended to imply that objects in the same address space, or in a different address space on the same machine, or on a different machine, all behave in a similar manner. Their location is an implementation detail.

Let’s define the most common terms in this paper:

Local Computing

It deals with programs that are confined to a single address space only.

Distributed Computing

It deals with programs that can make calls to objects in different address spaces either on the same machine or on a different machine.

The Vision of Unified Objects

Implicit in this vision is that the system will be “objects all the way down.” This means that all current invocations, or calls for system services, will eventually be converted into calls that might be made to an object residing on some other machine. There is a single paradigm of object use and communication used no matter what the location of the object might be.

This refers to the assumption that all objects are defined only in terms of their interfaces. Their implementation, which includes the location of the object, is independent of those interfaces and hidden from the programmer.

As far as the programmer is concerned, they write the same type of call for every object, whether local or remote. The system takes care of delivering the message, using underlying mechanisms that are not visible to the programmer writing the application.
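To make the unified-object vision concrete, here is a minimal sketch of caller code that cannot tell a local object from a remote one. Every name in it (`Counter`, `CounterProxy`, `make_in_process_server`, `bump_twice`) is hypothetical, and the "remote" side is simulated inside the same process with JSON messages, so it illustrates the idea rather than any real RPC library.

```python
import json


class Counter:
    """A plain local object living in the caller's address space."""

    def __init__(self):
        self.value = 0

    def increment(self, amount):
        self.value += amount
        return self.value


class CounterProxy:
    """Stand-in with the same interface that forwards each call as a message.

    `send` is a hypothetical transport. Here it is an ordinary function;
    in a real system it would cross a process or machine boundary.
    """

    def __init__(self, send):
        self._send = send

    def increment(self, amount):
        request = json.dumps({"method": "increment", "args": [amount]})
        reply = self._send(request)
        return json.loads(reply)["result"]


def make_in_process_server(target):
    """Simulate the remote side: decode the request, call the real object, encode the reply."""

    def handle(request):
        message = json.loads(request)
        result = getattr(target, message["method"])(*message["args"])
        return json.dumps({"result": result})

    return handle


def bump_twice(counter):
    """Caller code: identical whether `counter` is local or a proxy."""
    counter.increment(1)
    return counter.increment(1)


local_result = bump_twice(Counter())
remote_result = bump_twice(CounterProxy(make_in_process_server(Counter())))
print(local_result, remote_result)
```

Both calls return the same value here only because the simulated transport never delays, drops, or reorders a message. The rest of the post is about what happens when that assumption stops holding.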

The hard problems in distributed computing are not the problems of how to get things on and off the wire.

The paper goes on to define the toughest challenges of building a distributed system:

Latency
Memory access
Partial failure and concurrency

Ensuring reasonable performance while dealing with all of the above doesn’t make the life of a distributed systems engineer any easier. The lack of any central resource or state manager adds to the challenge. Let’s look at each of these in turn.

Latency

This is the fundamental difference between local and distributed object invocation.

The paper claims that a remote call is four to five orders of magnitude slower than a local call. If the design of a system fails to recognize this fundamental difference, it is bound to suffer from serious performance problems, especially if it relies heavily on remote communication.

You need to have a thorough understanding of the application being designed so you can decide which objects should be kept together and which can be placed remotely.

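As a rough illustration of why object placement and interface shape matter, the sketch below counts simulated round trips instead of measuring real time. `RemoteInventory`, its methods, and the `REMOTE_CALL_COST` figure are all assumptions made up for this example and are not measurements.

```python
REMOTE_CALL_COST = 10_000  # assumed cost units per round trip, illustration only


class RemoteInventory:
    """Simulated remote object that charges a fixed cost for every round trip."""

    def __init__(self, stock):
        self._stock = dict(stock)
        self.cost = 0

    def get(self, item):
        # One round trip per item.
        self.cost += REMOTE_CALL_COST
        return self._stock[item]

    def get_many(self, items):
        # One round trip for the whole batch.
        self.cost += REMOTE_CALL_COST
        return {item: self._stock[item] for item in items}


def total_chatty(inventory, items):
    """Written as if the object were local: one call per item."""
    return sum(inventory.get(item) for item in items)


def total_batched(inventory, items):
    """Written with the remote boundary in mind: one call for all items."""
    return sum(inventory.get_many(items).values())


stock = {"bolt": 5, "nut": 7, "washer": 3}
items = ["bolt", "nut", "washer"]

chatty = RemoteInventory(stock)
batched = RemoteInventory(stock)
chatty_total = total_chatty(chatty, items)
batched_total = total_batched(batched, items)
print(chatty_total, chatty.cost)
print(batched_total, batched.cost)
```

Both functions compute the same total, but the chatty version pays the round-trip cost once per item. A design that ignores where an object lives tends to end up with the chatty shape by default.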

If the goal is to unify the difference in latency, then we have two options:

Rely on hardware getting faster over time to eliminate the difference in efficiency.
Develop tools that let us visualize communication patterns between objects and move the objects around as required. Since location is an implementation detail, this shouldn’t be too hard to achieve.

Memory

Another difference that’s very relevant to the design of distributed systems is the pattern of memory access between local and remote objects. A pointer in the local address space isn’t valid in a remote address space.

We’re left with two choices:

Make the developer aware of the difference between the access patterns.
Unify local and remote access by letting the system handle all aspects of access to memory.

There are several ways to do that:

Distributed shared memory
Using the OOP (object-oriented programming) paradigm, compose a system entirely of objects, one that deals only with object references.

With the second approach, the transfer of data between address spaces is handled by the layer underneath, which marshals and unmarshals the data. This approach, however, makes the use of address-space-relative pointers obsolete.
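A small sketch of what marshalling does to references, using Python's standard `json` module as a stand-in for the layer underneath. The `account` record and the variable names are invented for illustration, and both "address spaces" are simulated in one process.

```python
import json

account = {"owner": "alice", "tags": ["savings"]}
alias = account  # a local reference: both names point at the same object

# Marshal to bytes as if sending over the wire, then unmarshal on the "other side".
wire_bytes = json.dumps(account).encode("utf-8")
received = json.loads(wire_bytes.decode("utf-8"))

alias["tags"].append("local-edit")       # visible through `account`
received["tags"].append("remote-edit")   # changes only the unmarshalled copy

print(account["tags"])
print(received["tags"])
print(received is account)
```

The unmarshalled value starts out equal in content but is a separate object, so an edit on one side is never seen on the other. A local reference shares state; a marshalled copy does not, which is exactly the difference that gets hidden when local and remote access are presented as identical.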

The danger lies in promoting the myth that “remote access and local access are exactly the same.” We should not reinforce this myth. An underlying mechanism that does not unify all memory accesses while still promoting this myth is both misleading and prone to error.

It’s important for programmers to be made aware of the various differences between accessing local and remote objects. We don’t want them to get bitten by not knowing what’s happening under the covers.

Partial Failure and Concurrency

Partial failure is a central reality of distributed computing.

The paper argues that both local and distributed systems are subject to failure. But it’s harder to discover what went wrong in the case of distributed systems.

For a local system, either everything is shut down or there is some central authority which can detect what went wrong (the OS, for example).

Yet in a distributed system, there is no global state or resource manager available to keep track of everything happening in and across the system. So there is no reliable way to tell the components that are still functioning correctly which other components have failed. Components in a distributed system fail independently.
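One common way to cope with the absence of a central authority is for each component to keep its own view of its peers, for example by tracking heartbeats. This technique is not taken from the paper. The sketch below uses a hypothetical `HeartbeatMonitor` class and passes the clock in explicitly so the behavior is deterministic.

```python
class HeartbeatMonitor:
    """Local, best-effort view of which peers might have failed.

    A peer is only ever "suspected": silence can mean a crash,
    a slow node, or a lost message, and this class cannot tell which.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self._last_seen = {}

    def heartbeat(self, node, now):
        # Record the most recent time we heard from this node.
        self._last_seen[node] = now

    def suspected(self, now):
        # Nodes that have been silent for longer than the timeout.
        return sorted(
            node
            for node, seen in self._last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=5)
monitor.heartbeat("a", now=0)
monitor.heartbeat("b", now=0)
monitor.heartbeat("a", now=4)

suspects_at_8 = monitor.suspected(now=8)   # "b" has been silent for 8 units

# "b" was only slow: its heartbeat arrives late, and "a" reports again too.
monitor.heartbeat("b", now=9)
monitor.heartbeat("a", now=9)
suspects_at_10 = monitor.suspected(now=10)

print(suspects_at_8, suspects_at_10)
```

The monitor wrongly suspects `b` at time 8 and changes its mind once the late heartbeat arrives. Two monitors on different machines could disagree at the same moment, which is the lack of global state showing up in practice.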

A central problem in distributed computing is insuring that the state of the whole system is consistent after such a failure. This is a problem that simply does not occur in local computing.

For a system to withstand partial failure, it must deal with indeterminacy, and its objects must react to that indeterminacy in a consistent manner. The interfaces must be able to state the cause of a failure where possible, and must allow a “reasonable state” to be reconstructed when the cause can’t be determined.
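The indeterminacy is easiest to see when a reply is lost: the caller cannot know whether the operation ran. The sketch below simulates that case and recovers a reasonable state with request IDs that make retries safe. This is one common technique, not something the paper prescribes, and `Account`, `ReplyLost`, and `deposit_with_retry` are hypothetical names.

```python
class ReplyLost(Exception):
    """Raised when no reply arrives; the caller cannot tell whether the call ran."""


class Account:
    """Simulated remote object that remembers which requests it has already applied."""

    def __init__(self):
        self.balance = 0
        self._applied = {}            # request_id -> balance after that request
        self.drop_next_reply = False  # test hook that simulates a lost reply

    def deposit(self, request_id, amount):
        # Apply each request at most once, however often it is retried.
        if request_id not in self._applied:
            self.balance += amount
            self._applied[request_id] = self.balance
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise ReplyLost("no reply received; outcome unknown")
        return self._applied[request_id]


def deposit_with_retry(account, request_id, amount, attempts=3):
    """Retry with the same request ID so a repeated call cannot deposit twice."""
    for _ in range(attempts):
        try:
            return account.deposit(request_id, amount)
        except ReplyLost:
            continue
    raise ReplyLost("gave up; the state must be reconciled some other way")


account = Account()
account.drop_next_reply = True  # the first call is applied, but its reply is lost
balance = deposit_with_retry(account, "req-1", 50)
print(balance, account.balance)
```

Without the request ID, the retry would have deposited the amount twice. The interface has to expose the "outcome unknown" case explicitly, because no local calling convention has an equivalent of it.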

The question is not “can you make remote method invocation look like local method invocation,” but rather “what is the price of making remote method invocation identical to local method invocation?”

Two approaches come to mind:

Treat all interfaces and objects as local. The problem with this approach is that it doesn’t take into account the failure modes associated with distributed systems, so the resulting behavior is nondeterministic.
Treat all interfaces and objects as remote. The flaw with this approach is that it over-complicates local computing. It adds a lot of work for objects that are never accessed remotely.

A better approach is to accept that there are irreconcilable differences between local and distributed computing, and to be conscious of those differences at all stages of the design and implementation of distributed applications.
