HybridTime - Accessible Global Consistency with High Clock Uncertainty

Amazon’s Dynamo [9] and Facebook’s Cassandra [13], relax the consistency model，and offer only eventual consistency.

Others such as HBase [1] and BigTable [4] offer strong consistency only for operations touching a single partition, but not across the database as a whole.

Dynamo和Cassandra都是基于Vector Clocks和Gossip算法，強調高可用性，可以達到eventual consistency

而HBase，可以提供單分區的強一致性，但對于跨分區，或跨數據庫，就無法保證；

for example, in a geo-replicated service, a user may first be routed to a local datacenter and perform some action such as sending an email message.?
If they then reload their browser and are routed to another datacenter backend, they would still expect to see their sent message in their email outbox.

這個例子比較典型，要達到異地的一致性

一般要解決分布式問題的一致性問題的最簡單的辦法就是用synchronized clocks across all servers，但這基本是達不到的

the traditional approach to global consistency has been to use logical clocks.?
Logical clocks, such as Lamport Clocks [15] or Vector Clocks [10, 20].

所以傳統的做法就是用邏輯時鐘

https://www.zhihu.com/question/19994133

貼個回答，寫的不錯

當然可以，但是不適合在分布式數據庫里面。為什么呢？首先，需要先確定你這個時間戳取自哪里。1、所有的時間戳都取自一個中心點節點，這樣讓時間戳得到一個全序關系，這是最簡單的實現，如果你的分布式數據庫只是部署在一個數據中心的話，但是這樣就增加了一個中心節點，影響數據庫的擴展性，而且任何一個取時間戳的動作都要增加上網絡延遲的開銷。如果分布式數據庫部署需要跨數據中心，這種方案變的不可行，跨數據中心網絡時延太大。從而影響整體性能2、所有的時間戳取自已所在的節點。這就涉及到時鐘同步的問題，就像Dongdong說的，如果取的物理時間那么這些節點并沒有一個一致的時間，總會有一點誤差，這種解決方案就有兩種思路：?
一、就是現在以Leslie Lamport 提出的邏輯時鐘的方法，重新定義了一種分布式的全序關系。vector clock就是以邏輯時鐘為基礎上發展而來的。?
二、使用物理時鐘，就是google spanner使用gps 加原子鐘進行時間同步，不同節點之間的時間是不能精確同步的，只能把不同節點之間的時間同步到一個范圍之內，spanner一個節點上獲得一個時間戳，不是一個時間值，而是一個時間值加上一個誤差范圍，也就是一個時間窗，如果兩個時間窗有重合，就無法比較大小，從而無法比較出來時間窗代表的事件的先后關系。而時間同步算法就是讓分布式各節點之間的時間的誤差盡量的小，這樣取時間戳這個動作效率是最高的，只在本節點取一下物理時間即可，但是為了同步時間，各個節點肯定在周期與其它節點進行通訊來同步物理時間（google時間同步的算法好像沒有開源）。時間同步算法只能讓時間盡量的精確。分布式數據庫里面是不能直接使用NTP來同步時間的，因為NTP時間同步協議里面允許節點的時間能夠回退，而數據庫里面要求本地的時鐘必須是遞增的。?
三、還有一種混合邏輯時鐘(Logical Physical Clocks)，也是由邏輯時鐘演變而來，但是時鐘的值比較接近于物理時鐘，而且不依賴于下面物理時鐘的同步的算法，它的時間戳都可以在本地取，而且取出來是一個時間值，而不像spanner一樣，取出來是一個時間窗。詳細可以參見Logical Physical Clocks?and Consistent Snapshots in Globally Distributed Databases這篇論文

關于Lamport或向量時鐘，

Lamport時鐘就是邏輯時鐘，即不依賴于物理時鐘，而依賴event之間的因果關系來定義的一種偏序

參考，全序, 分布式一致性的本質

而Vector時鐘是，邏輯時鐘的一種實現，具體的邏輯看下文，

Why Vector Clock are Easy or Hard?

摘自上面的知乎回答

“dynamo paper是七、八年前寫的了，現在的amazon dynamo早已摒棄了version vector，而采用了synchronous replication（類似paxos的protocol）,每一個partition會有一個leader負責寫入。其實version vector并不scale，因為對于一個key來說，隨著writer數量的增加，version vector數量成指數級的增長”?

邏輯時鐘的缺點，

first, they require that all clients must propagate clock data to achieve consistent views,?
and second, the assigned timestamps have no relation to physical time.

Spanner introduced?commit-wait, a way of ensuring physical-time based consistent global state by forcing operations to wait long enough so that all participants agree that the operation’s timestamp has passed based on worst case synchronization error bounds.

While innovative, the system performance becomes highly dependent on the quality of the time synchronization infrastructure, and thus may have unacceptable performance absent specialized hardware such as atomic clocks and GPS receivers.

Spanner利用原子鐘和GPS接收器，實現了一個較為精確的時鐘，這個時鐘叫做TrueTime，每次調用TrueTime API返回的是一個時間區間，而不是一個具體的值，這個TrueTime保證的是真實時間(absolute time/real time)一定在這個區間內，這個區間范圍通常大約14ms，甚至更小。

所以只有每個transaction的時間區間，都不重合，那么就可以比較容易的做到全局有序；

所以這里需要commit-wait，

可以看到，我在開始transaction的時候，取一次true time，[t1.earliest, t1.latest]

那么什么時候，我可以把這個transaction commit掉，即釋放掉鎖，即別人可以開始新的transaction

當我再取一次true time，[t2.earliest, t2.latest]，當t2.earliest > t1.latest的時候，即可

因為這樣兩個transaction的時間區間就不可能重疊了，當然代價就是每個transaction都有等待大概2個error bound的時間

當然，spanner的實現依賴硬件的infra設備，普通用戶無法復制；

HybridTime(HT), a hybrid between physical and logical clocks and we show how HT can be used to achieve the same semantics as Spanner, but with good performance even with commonly available time synchronization.

Like Spanner, our approach relies on physical time measurements with bounded error to assign HybridTime timestamps to events that occur in the system.?
However, unlike Spanner, our approach does not usually require the error to be waited out, thus allowing for usage in common deployment scenarios where clocks are synchronized through common protocols such as the Network Time Protocol, in which clock synchronization error is often higher than with Spanner’s TrueTime.

The trade-off is that, in order to avoid commit-wait, HybridTime requires that timestamps be propagated across machines to achieve the same consistency semantics as Spanner.?
Contrary to vector clocks, which can expand as the number of participants in the cluster grows, HybridTime timestamps have constant and small size.

HybridTime說是physical and logical clocks的結合，和spanner一樣，也可以使用帶error bound的物理時間?
但由于他沒有spanner的硬件設施，所以做不到低延遲的時間同步，所以他依然使用邏輯時鐘來達到一致性；并且他解決了vector clocks隨著client的數量增加而會size膨脹的問題

HybridTime clocks follow similar update rules to Lamport clocks, but the time values are not purely logical:?
each time value has both a logical component, which helps in guaranteeing the same properties as a Lamport Clocks, and a physical component which allows the event to be associated with a physical point-in-time.

One of Spanner’s key benefits is that is?externally consistent, which is defined as fully linearizable, even in the presence of hidden channels.

Additionally we use the term?globally consistent?to describe a system which provides the same linearizability semantics, provided that there are no hidden channels present.

區分為，是否有hidden channels？意思是clients間存在因果或先后關系，比如通過發message，都是這個因果關系對于系統是不可知的，比如A write，然后發送消息給B，然后B write，那么因果關系上，B write一定后于A write，但是對于數據庫而言他并不知道，所以并不能保證A write先寫入；

spanner是可以保證externally consistent，由于他通過 commit wait，來保證每個transaction之間一定是全局有序的

相對的globally consistent，更簡單些，因為并沒有hidden channels

Spanner’s key innovation is that timestamps assigned by the system can be used to achieve?external consistency, but also have?physical meaning.

HybridTime is always?globally consistent, and through selective application of commit-wait is externally consistent.

Spanner的主要創新在于實現external一致性的，同時還保留了物理時間的

而HybridTime，默認情況下是可以實現globally consistent，即偏序，因為他本身就是使用lamport時鐘，并且當你選擇commit-wait時，也可以保證externally consistent；

HybridTime Assumptions

1. HybridTime assumes that machines have a?reasonably accurate physical clock, represented by the?PCi(e)?function, which outputs the numeric timestamp returned by the physical clock as read by?process i for event e, that is able to provide absolute time measurements (usually in milli- or microseconds since 1 January 1970).

2. keeps the physical clocks across different servers synchronized with regard to a reference server,?the “reference” time, represented by the?PCref(e)?function which outputs the numeric timestamp returned by the “reference” process for event e；

Additionally, we assume that such a substrate is able to provide an?error bound?along with each time measurement, denoted by the?Ei(e)?function, which outputs the numeric value ε error of process i at the time e occurred

3. we make no assumptions on the actual accuracy of the clocks, i.e. the physical timestamps returned by server’s clocks may have an arbitrarily large but finite error, as long as this error’s bound is known

說白了，假設

有個相對靠譜的物理時鐘，一個理想的參照時鐘，以及他們之間相差的，error bound

最后，我們并不假設這個error bound會很小，只要有限就可以

假設1，我們有有限的error bound

假設2，進程級別的物理時鐘單調遞增，注意是進程級別

基于以上假設，提出HybridTime Clock and Protocol

HybridTime Clock and Protocol

HybridTime clock(HTC) is a pair?(physical,logical)?where the first component is a representation of the physical time at which the event occurred and the second component is a logical sequence number.

定義其實顯而易見。。。

Algorithm 2 depicts the HTC algorithm.

HTC算法如上，兩部分，NOW和UPDATE

now就是取當前的HybridTime clock

upate就是根據in 來更新當前的clock

Algorithm 2 implements a Lamport Clock, with the additional advantage that generated timestamps have physical meaning and are accurate representations of physical time within a bound error.

算法本身，其實就是Lamport Clock，只是增加了physical clock的部分

給出一個例子，

To order the events timestamped using the HybridTime Clock algorithm we use Definition 1.

Definition 1. HCT(e) < HCT( f ) is defined as the lexicographical ordering of the timestamp two-tuple (physical,logical)

Theorem 1. The HybriTime clock happened-before relation forms a total order of events

Theorem 2. For any event in a causal chain f , the physical component of a HTC timestamp approximates the “real” time the event occurred, with a error defined and bounded by

Implementation

No Consistency - In this mode there are no external consistency guarantees, transactions are assigned timestamps from each server’s physical clock and no guarantee is made that reads are consistent or repeatable.

直接用local的物理時間，不保證一致性

HybridTime Consistency - In this mode our implementation guarantees the global consistency as Spanner, absent hidden channels, but using HybridTime instead of commit-wait.?
Clients choosing this consistency mode on writes must make sure that the timestamp that is received from the server is propagated to other servers and/or clients.?
Within the same client process, timestamps are automatically propagated on behalf of the user.

這個其實和邏輯時鐘，沒有區別，就是保證偏序

Commit-wait Consistency - In this mode our implementation guarantees the same external consistency semantics as Spanner by also using commit-wait in the way described in the original paper.?
However instead of using TrueTime, which is a proprietary and private API, we implemented commit-wait on top of the widely used Network Time Protocol (NTP). Hence, in this consistency mode we support hidden channels.

這個估計沒啥用，在沒有spanner硬件保證的情況下，commit-wait，性能不能忍吧