How to Use Google's Protocol Buffers in Python

When people who speak different languages get together and talk, they try to use a language that everyone in the group understands.

To achieve this, everyone has to translate their thoughts, which are usually in their native language, into the language of the group. This “encoding and decoding” of language, however, leads to a loss of efficiency, speed, and precision.

The same concept is present in computer systems and their components. Why should we send data in XML, JSON, or any other human-readable format if there is no need for us to understand directly what the systems are saying to each other? We can still translate the data into a human-readable format when that is explicitly needed.

Protocol Buffers are a way to encode data before transportation, which efficiently shrinks data blocks and therefore increases speed when sending them. They abstract data into a language- and platform-neutral format.

Table of Contents

Why do we need Protocol Buffers?
What are Protocol Buffers and how do they work?
Protocol Buffers in Python
Final notes

Why Protocol Buffers?

The initial purpose of Protocol Buffers was to simplify the work with request/response protocols. Before ProtoBuf, Google used a different format which required additional handling of marshaling for the messages sent.

In addition to that, new versions of the previous format required the developers to make sure that new versions are understood before replacing old ones, making it a hassle to work with.

This overhead motivated Google to design an interface that solves precisely those problems.

ProtoBuf allows changes to the protocol to be introduced without breaking compatibility. Also, servers can pass around the data and execute read operations on the data without modifying its content.

Since the .proto definition describes the structure of every message, ProtoBuf is used as a base for automatic code generation for serializers and deserializers.

Another interesting use case is how Google uses it for short-lived Remote Procedure Calls (RPC) and to persistently store data in Bigtable. Due to their specific use case, they integrated RPC interfaces into ProtoBuf. This allows for quick and straightforward code stub generation that can be used as starting points for the actual implementation. (More on ProtoBuf RPC.)

Other examples of where ProtoBuf can be useful are for IoT devices that are connected through mobile networks in which the amount of sent data has to be kept small or for applications in countries where high bandwidths are still rare. Sending payloads in optimized, binary formats can lead to noticeable differences in operation cost and speed.

Using gzip compression in your HTTPS communication can further improve those metrics.

What are Protocol Buffers and how do they work?

Generally speaking, Protocol Buffers are a defined interface for the serialization of structured data. They define a normalized way to communicate that is entirely independent of languages and platforms.

Google advertises its ProtoBuf like this:

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once …

The ProtoBuf interface describes the structure of the data to be sent. Payload structures are defined as “messages” in what are called proto files. Those files always end with a .proto extension.

For example, the basic structure of a todolist.proto file looks like this. We will also look at a complete example in the next section.

syntax = "proto3";

// Not necessary for Python, should still be declared to avoid name collisions
// in the Protocol Buffers namespace and non-Python languages
package protoblog;

message TodoList {
  // Elements of the todo list will be defined here...
}

Those files are then used to generate integration classes or stubs for the language of your choice using code generators within the protoc compiler. The current version, Proto3, already supports all the major programming languages. The community supports many more in third-party open-source implementations.

Generated classes are the core elements of Protocol Buffers. They allow the creation of elements by instantiating new messages, based on the .proto files, which are then used for serialization. We’ll look at how this is done with Python in detail in the next section.

Independent of the language for serialization, the messages are serialized into a non-self-describing, binary format that is pretty useless without the initial structure definition.
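
To make "non-self-describing" a little more concrete, here is a minimal sketch in plain Python (no protobuf library involved) that scans a binary payload without knowing its schema. The helper names `read_varint` and `scan_fields` and the sample payload are our own illustrations, and the sketch only handles the varint and length-delimited wire types. All it can recover are field numbers and raw values. The field names and the meaning of the values live only in the .proto file.

```python
from __future__ import annotations


def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def scan_fields(data: bytes) -> list[tuple[int, int, object]]:
    """List (field_number, wire_type, raw_value) without knowing the schema."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited: string, bytes, or nested message
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


# Illustrative payload: field 1 holds the varint 1234, field 2 holds b"Tim".
payload = b"\x08\xd2\x09\x12\x03Tim"
for entry in scan_fields(payload):
    print(entry)
```

Nothing in the output says whether field 2 is a name, a title, or a nested message that happens to look like text, which is why the structure definition is needed on the receiving side.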

The binary data can then be stored, sent over the network, and used in any other way that human-readable data like JSON or XML can be. After transmission or storage, the byte stream can be deserialized and restored using any language-specific, compiled protobuf class we generate from the .proto file.

Using Python as an example, the process could look something like this:

First, we create a new todo list and fill it with some tasks. This todo list is then serialized and sent over the network, saved in a file, or persistently stored in a database.

The sent byte stream is deserialized using the parse method of our language-specific, compiled class.

Most current architectures and infrastructures, especially microservices, are based on REST, WebSockets, or GraphQL communication. However, when speed and efficiency are essential, low-level RPCs can make a huge difference.

Instead of high-overhead protocols, we can use a fast and compact way to move data between the different entities within our service without wasting many resources.

But why isn’t it used everywhere yet?

Protocol Buffers are a bit more complicated than human-readable formats. This makes them comparatively harder to debug and to integrate into your applications.

Iteration times in engineering also tend to increase since updates in the data require updating the proto files before usage.

Careful considerations have to be made since ProtoBuf might be an over-engineered solution in many cases.

What alternatives do I have?

Several projects take a similar approach to Google’s Protocol Buffers.

Google’s FlatBuffers and a third-party implementation called Cap’n Proto focus on removing the parsing and unpacking step that is necessary to access the actual data when using ProtoBuf. They have been designed explicitly for performance-critical applications, making them even faster and more memory efficient than ProtoBuf.

When focusing on the RPC capabilities of ProtoBuf (used with gRPC), there are projects from other large companies, like Facebook (Apache Thrift) or Microsoft (Bond), that can offer alternatives.

Python and Protocol Buffers

Python already provides a way of persisting data using pickling. Pickling is useful in Python-only applications. It's not well suited for more complex scenarios where data sharing with other languages or changing schemas is involved.

Protocol Buffers, in contrast, are developed for exactly those scenarios.

The .proto files we briefly covered earlier allow the user to generate code for many supported languages.

To compile the .proto file to the language class of our choice, we use protoc, the proto compiler.

If you don’t have the protoc compiler installed, there are excellent guides on how to do that:

MacOS / Linux
Windows

Once we’ve installed protoc on our system, we can use an extended example of our todo list structure from before and generate the Python integration class from it.

syntax = "proto3";

// Not necessary for Python but should still be declared to avoid name collisions
// in the Protocol Buffers namespace and non-Python languages
package protoblog;

// Style guide prefers prefixing enum values instead of surrounding
// with an enclosing message
enum TaskState {
  TASK_OPEN = 0;
  TASK_IN_PROGRESS = 1;
  TASK_POST_PONED = 2;
  TASK_CLOSED = 3;
  TASK_DONE = 4;
}

message TodoList {
  int32 owner_id = 1;
  string owner_name = 2;

  message ListItems {
    TaskState state = 1;
    string task = 2;
    string due_date = 3;
  }

  repeated ListItems todos = 3;
}

Let’s take a more detailed look at the structure of the .proto file to understand it.

In the first line of the proto file, we define whether we’re using Proto2 or Proto3. In this case, we’re using Proto3.

The most unusual elements of proto files are the numbers assigned to each field of a message. Those dedicated numbers make each field unique and are used to identify the fields in the binary encoded output.

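One important concept to grasp is that field numbers 1-15 need only one byte to encode the field's tag (the field number combined with its wire type), while field numbers 16 and above need at least two. This is useful to understand, because it means we should reserve the low numbers for the most frequently used fields and assign higher numbers to the less frequently used ones. The numbers define neither the order of encoding nor the position of the given attribute in the encoded message.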
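
The size effect of field numbers can be checked with a few lines of plain Python. This is a sketch of the tag encoding as described in the protobuf encoding documentation (field number shifted left by three bits, combined with the wire type, written as a varint). The helper names `encode_varint` and `encode_tag` are our own and not part of any library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A field's tag is (field_number << 3) | wire_type, stored as a varint."""
    return encode_varint((field_number << 3) | wire_type)


# Wire type 0 (varint) is used here; the tag size depends only on the field number.
for number in (1, 15, 16, 2047, 2048):
    print(number, len(encode_tag(number, 0)))
```

Field numbers up to 15 fit into a single tag byte, 16 through 2047 take two bytes, and 2048 is the first number that needs three.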

The package definition helps prevent name clashes. In Python, packages are defined by their directory. Therefore providing a package attribute doesn’t have any effect on the generated Python code.

Please note that this should still be declared to avoid protocol buffer related name collisions and for other languages like Java.

Enumerations are simple listings of possible values for a given variable. In this case, we define an Enum for the possible states of each task on the todo list. We’ll see how to use them in a bit when we look at the usage in Python.

As we can see in the example, we can also nest messages inside messages. If we, for example, want to have a list of todos associated with a given todo list, we can use the repeated keyword, which is comparable to dynamically sized arrays.
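
For readers who think in Python first, the three concepts map roughly onto familiar standard-library constructs. The following is only an analogy written by hand with `IntEnum` and `dataclasses`. It is not what protoc generates, and it has no serialization logic at all.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class TaskState(IntEnum):
    """Plays the role of the enum: a fixed set of named integer values."""
    TASK_OPEN = 0
    TASK_IN_PROGRESS = 1
    TASK_POST_PONED = 2
    TASK_CLOSED = 3
    TASK_DONE = 4


@dataclass
class ListItems:
    """Plays the role of the nested message."""
    state: TaskState = TaskState.TASK_OPEN
    task: str = ""
    due_date: str = ""


@dataclass
class TodoList:
    """Plays the role of the outer message; `todos` mimics a repeated field."""
    owner_id: int = 0
    owner_name: str = ""
    todos: list = field(default_factory=list)


todo_list = TodoList(owner_id=1234, owner_name="Tim")
todo_list.todos.append(
    ListItems(state=TaskState.TASK_DONE, task="Test ProtoBuf for Python")
)
print(todo_list)
```

The defaults chosen here (0, empty string, empty list, first enum value) mirror how Proto3 treats fields that were never set.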

To generate usable integration code, we use the proto compiler, which compiles a given .proto file into language-specific integration classes. In our case, we use the --python_out argument to generate Python-specific code.

protoc -I=. --python_out=. ./todolist.proto

In the terminal, we invoke the protocol compiler with three parameters:

-I: defines the directory where we search for any dependencies (we use . which is the current directory)
--python_out: defines the location we want to generate a Python integration class in (again we use . which is the current directory)
The last unnamed parameter defines the .proto file that will be compiled (we use the todolist.proto file in the current directory)

This creates a new Python file called <name_of_proto_file>_pb2.py. In our case, it is todolist_pb2.py. When taking a closer look at this file, we won’t be able to understand much about its structure immediately.
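
If the compile step should be part of a build script rather than typed by hand, the same call can be wrapped in Python. This is a minimal sketch, assuming protoc is installed and on the PATH and that the .proto file sits directly in the include directory. The function names `build_protoc_command` and `compile_proto` are our own helpers, not part of any protobuf package.

```python
import shutil
import subprocess
from pathlib import Path


def build_protoc_command(proto_file, include_dir=".", out_dir="."):
    """Assemble the same protoc call used in the terminal, as an argument list."""
    return ["protoc", f"-I={include_dir}", f"--python_out={out_dir}", proto_file]


def compile_proto(proto_file, include_dir=".", out_dir="."):
    """Run protoc and return the path where the generated module is expected."""
    if shutil.which("protoc") is None:
        raise RuntimeError("protoc was not found on the PATH")
    subprocess.run(build_protoc_command(proto_file, include_dir, out_dir), check=True)
    # Assumption: the .proto file lies directly in include_dir, so the generated
    # module lands directly in out_dir as <name_of_proto_file>_pb2.py.
    return Path(out_dir) / f"{Path(proto_file).stem}_pb2.py"


# Only prints the command; call compile_proto("./todolist.proto") to execute it.
print(build_protoc_command("./todolist.proto"))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` turns a failed compilation into a Python exception instead of a silently missing file.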

This is because the generator doesn’t produce direct data access elements, but further abstracts away the complexity using metaclasses and descriptors for each attribute. They describe how a class behaves instead of each instance of that class.

The more exciting part is how to use this generated code to create, build, and serialize data. A straightforward integration done with our recently generated class is seen in the following:

import todolist_pb2 as TodoList

my_list = TodoList.TodoList()
my_list.owner_id = 1234
my_list.owner_name = "Tim"

first_item = my_list.todos.add()
first_item.state = TodoList.TaskState.Value("TASK_DONE")
first_item.task = "Test ProtoBuf for Python"
first_item.due_date = "31.10.2019"

print(my_list)

It merely creates a new todo list and adds one item to it. We then print the todo list element itself and can see the non-binary, non-serialized version of the data we just defined in our script.

owner_id: 1234
owner_name: "Tim"
todos {
  state: TASK_DONE
  task: "Test ProtoBuf for Python"
  due_date: "31.10.2019"
}

Each Protocol Buffer class has methods for reading and writing messages using a Protocol Buffer-specific encoding that encodes messages into binary format.

Those two methods are SerializeToString() and ParseFromString().

import todolist_pb2 as TodoList

my_list = TodoList.TodoList()
my_list.owner_id = 1234

# ...

with open("./serializedFile", "wb") as fd:
    fd.write(my_list.SerializeToString())

my_list = TodoList.TodoList()
with open("./serializedFile", "rb") as fd:
    my_list.ParseFromString(fd.read())

print(my_list)

In the code example above, we write the serialized bytes into a file that we opened in binary write mode (the wb flags).

Since we have already written the file, we can read back the content and parse it using ParseFromString. We open the file in binary read mode (the rb flags) and call ParseFromString on a new instance of our generated class, which fills that instance with the parsed data.

If we serialize this message and print it in the console, we get the byte representation which looks like this.

b'\x08\xd2\t\x12\x03Tim\x1a(\x08\x04\x12\x18Test ProtoBuf for Python\x1a\n31.10.2019'

Note the b in front of the quotes. This indicates that the following string is composed of byte octets in Python.
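
To see where each of those bytes comes from, we can rebuild the same byte string by hand in plain Python, without the generated class. This is a sketch based on the documented wire format (varints for integers and enums, length-delimited encoding for strings and nested messages). The helper names are our own, and the field numbers are taken from the todolist.proto file above.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def varint_field(field_number: int, value: int) -> bytes:
    """Tag with wire type 0, followed by the value as a varint."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def bytes_field(field_number: int, payload: bytes) -> bytes:
    """Tag with wire type 2, followed by the payload length and the payload."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


# The nested ListItems message: state = 1, task = 2, due_date = 3.
item = (
    varint_field(1, 4)  # TASK_DONE = 4
    + bytes_field(2, "Test ProtoBuf for Python".encode("utf-8"))
    + bytes_field(3, "31.10.2019".encode("utf-8"))
)

# The outer TodoList message: owner_id = 1, owner_name = 2, todos = 3.
message = varint_field(1, 1234) + bytes_field(2, b"Tim") + bytes_field(3, item)
print(message)
```

The printed value matches the serialized output shown above. The `(` after `\x1a` is simply the length of the nested item, 40, which happens to be a printable ASCII character.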

If we directly compare this to, e.g., XML, we can see the impact ProtoBuf serialization has on the size.

<todolist>
  <owner_id>1234</owner_id>
  <owner_name>Tim</owner_name>
  <todos>
    <todo>
      <state>TASK_DONE</state>
      <task>Test ProtoBuf for Python</task>
      <due_date>31.10.2019</due_date>
    </todo>
  </todos>
</todolist>

The JSON representation, non-uglified, would look like this.

{
  "todoList": {
    "ownerId": "1234",
    "ownerName": "Tim",
    "todos": [
      {
        "state": "TASK_DONE",
        "task": "Test ProtoBuf for Python",
        "dueDate": "31.10.2019"
      }
    ]
  }
}

Judging the different formats only by the total number of bytes used, ignoring the memory needed for the overhead of formatting it, we can of course see the difference.

But in addition to the memory used for the data, we also have 12 extra bytes in ProtoBuf for formatting serialized data. Comparing that to XML, we have 171 extra bytes in XML for formatting serialized data.

Without Schema, we need 136 extra bytes in JSON for formatting serialized data.
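
A quick way to check such numbers yourself is to measure both payloads with the standard library. The sketch below compares the serialized ProtoBuf bytes from above with a compact JSON encoding of the same data. Because `json.dumps` is told to drop all optional whitespace here, the JSON figure it prints is smaller than for the pretty-printed version, so it will not line up with the overhead counts quoted above.

```python
import json

# The serialized todo list, copied from the ProtoBuf output shown earlier.
protobuf_bytes = (
    b"\x08\xd2\t\x12\x03Tim\x1a(\x08\x04\x12\x18"
    b"Test ProtoBuf for Python\x1a\n31.10.2019"
)

# The same data as JSON, encoded without any optional whitespace.
json_bytes = json.dumps(
    {
        "todoList": {
            "ownerId": "1234",
            "ownerName": "Tim",
            "todos": [
                {
                    "state": "TASK_DONE",
                    "task": "Test ProtoBuf for Python",
                    "dueDate": "31.10.2019",
                }
            ],
        }
    },
    separators=(",", ":"),
).encode("utf-8")

print("ProtoBuf:", len(protobuf_bytes), "bytes")
print("JSON:    ", len(json_bytes), "bytes")
```

The ProtoBuf payload is 50 bytes. Even the whitespace-free JSON is considerably larger, since every key name is repeated as text inside the message.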

If we’re talking about several thousands of messages sent over the network or stored on disk, ProtoBuf can make a difference.

However, there is a catch. The platform Auth0.com created an extensive comparison between ProtoBuf and JSON. It shows that, when compressed, the size difference between the two can be marginal (only around 9%).

If you’re interested in the exact numbers, please refer to the full article, which gives a detailed analysis of several factors like size and speed.

An interesting side note is that each data type has a default value. If attributes are not assigned or changed, they will maintain the default values. In our case, if we don’t change the TaskState of a ListItem, it has the state of “TASK_OPEN” by default. The significant advantage of this is that non-set values are not serialized, saving additional space.

If we, for example, change the state of our task from TASK_DONE to TASK_OPEN, the state field will not be serialized, because TASK_OPEN is the default value.

owner_id: 1234
owner_name: "Tim"
todos {
  task: "Test ProtoBuf for Python"
  due_date: "31.10.2019"
}

b'\x08\xd2\t\x12\x03Tim\x1a&\x12\x18Test ProtoBuf for Python\x1a\n31.10.2019'
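
The two saved bytes can be reproduced with a small hand-written encoder that skips default values. This is a sketch in plain Python that imitates the Proto3 behavior for the scalar fields of one list item. It is not the library's implementation, and the helper names are our own.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def varint_field(field_number: int, value: int) -> bytes:
    """Encode an integer or enum field; the default value 0 is left out."""
    if value == 0:
        return b""
    return encode_varint(field_number << 3) + encode_varint(value)


def string_field(field_number: int, text: str) -> bytes:
    """Encode a string field; the default empty string is left out."""
    if not text:
        return b""
    payload = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def encode_item(state: int, task: str, due_date: str) -> bytes:
    """Encode one list item: state = 1, task = 2, due_date = 3."""
    return varint_field(1, state) + string_field(2, task) + string_field(3, due_date)


done_item = encode_item(4, "Test ProtoBuf for Python", "31.10.2019")  # TASK_DONE
open_item = encode_item(0, "Test ProtoBuf for Python", "31.10.2019")  # TASK_OPEN
print(len(done_item), len(open_item))
```

The item with TASK_DONE takes 40 bytes and the one with TASK_OPEN takes 38, which is why the length byte in the output above changes from `(` to `&`. The receiver restores the default when it finds no state field in the data.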

Final Notes

As we have seen, Protocol Buffers are quite handy when it comes to speed and efficiency when working with data. Due to its powerful nature, it can take some time to get used to the ProtoBuf system, even though the syntax for defining new messages is straightforward.

As a last note, I want to point out that there have been ongoing discussions about whether Protocol Buffers are “useful” for regular applications. They were developed explicitly for problems Google had in mind.

If you have any questions or feedback, feel free to reach out to me on any social media like Twitter or email :)
