by Yan Cui
How to schedule ad-hoc tasks with DynamoDB TTL and Lambda
CloudWatch Events let you easily create cron jobs with Lambda. However, it’s not designed for running lots of ad-hoc tasks, each to be executed once, at a specific time. The default limit on CloudWatch Events is a lowly 100 rules per region per account. It’s a soft limit, so it’s possible to request a limit increase. But the low initial limit suggests it’s not designed for use cases where you need to schedule millions of ad-hoc tasks.
CloudWatch Events is designed for executing recurring tasks.
The Problem
It’s possible to do this in just about every programming language. For example, .Net has the Timer class and JavaScript has the setInterval function. But I often find myself wanting a service abstraction to work with. There are many use cases for such a service, for example:
- A tournament system for games would need to execute business logic when the tournament starts and finishes.
- An event system (think eventbrite.com or meetup.com) would need a mechanism to send out timely reminders to attendees.
- A to-do tracker (think Wunderlist) would need a mechanism to send out reminders when a to-do task is due.
However, AWS does not offer a service for this type of workload. CloudWatch Events is the closest thing, but as discussed above, it’s not intended for these use cases. You can implement them with cron jobs, but such implementations come with challenges of their own.
I have implemented such a service abstraction a few times in my career already. I experimented with a number of different approaches:
- a cron job (with CloudWatch Events)
- wrapping the .Net Timer class as an HTTP endpoint
- using SQS Visibility Timeout to hide tasks until they’re due
And lately, I have seen a number of folks use DynamoDB Time-To-Live (TTL) to implement these ad-hoc tasks. In this post, we will take a look at this approach and see where it might be applicable for you.
How do we measure the approach?
For this type of ad-hoc task, we normally care about:
- Precision: how close to my scheduled time is the task executed? The closer the better.
- Scale (number of open tasks): can the solution scale to support many open tasks, i.e. tasks that are scheduled but not yet executed?
- Scale (hotspots): can the solution scale to execute many tasks around the same time? E.g. millions of people set a timer to remind themselves to watch the Superbowl, so all the timers fire within close proximity to kickoff time.
DynamoDB TTL as a scheduling mechanism
From a high level, this approach looks like this:
- A scheduled_items DynamoDB table which holds all the tasks that are scheduled for execution.
- A scheduler function that writes the scheduled task into the scheduled_items table, with the TTL set to the scheduled execution time (see the sketch below).
- An execute-on-schedule function that subscribes to the DynamoDB Stream for scheduled_items and reacts to REMOVE events. These events correspond to when items have been deleted from the table.
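To make the scheduler side concrete, here is a minimal sketch in Python with boto3. The table name scheduled_items comes from the post, but the key schema, the ttl attribute name, and the payload shape are assumptions made for illustration, not something the post prescribes.

```python
import os
import time
import uuid

import boto3

# Assumed table and attribute names; adjust to your own setup.
TABLE_NAME = os.environ.get("TABLE_NAME", "scheduled_items")
table = boto3.resource("dynamodb").Table(TABLE_NAME)


def schedule_task(task_type, payload, execute_at_epoch):
    """Write a task item whose TTL is the scheduled execution time.

    DynamoDB TTL expects a Number attribute holding an epoch timestamp in
    seconds. When the item expires and is deleted, a REMOVE event shows up
    on the table's stream, which triggers the execute-on-schedule function.
    """
    item = {
        "id": str(uuid.uuid4()),        # partition key (assumed schema)
        "task_type": task_type,
        "payload": payload,
        "scheduled_for": int(execute_at_epoch),
        "ttl": int(execute_at_epoch),   # the attribute configured as the table's TTL
    }
    table.put_item(Item=item)
    return item["id"]


# Example: schedule a reminder for 10 minutes from now.
if __name__ == "__main__":
    schedule_task("send-reminder", {"user": "alice"}, time.time() + 600)
```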
Scalability (number of open tasks)
Since the number of open tasks just translates to the number of items in the scheduled_items table, this approach can scale to millions of open tasks.
DynamoDB can handle large throughputs (thousands of TPS) too. So this approach can also be applied to scenarios where thousands of items are scheduled per second.
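As a rough sketch of the setup this relies on (again, not prescribed by the post), the table can be created with a stream and TTL enabled as below. On-demand billing is one way to absorb a spiky write rate without capacity planning; provisioned throughput with auto scaling would work too.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Assumed table definition: a simple string partition key called "id".
dynamodb.create_table(
    TableName="scheduled_items",
    AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity for spiky write rates
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",  # the consumer reads the OldImage
    },
)
dynamodb.get_waiter("table_exists").wait(TableName="scheduled_items")

# Tell DynamoDB which attribute holds the expiry timestamp.
dynamodb.update_time_to_live(
    TableName="scheduled_items",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "ttl"},
)
```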
Scalability (hotspots)
When many items are deleted at the same time, they are simply queued in the DynamoDB Stream. AWS also auto-scales the number of shards in the stream, so as throughput increases, the number of shards goes up accordingly.
But events within each shard are processed in sequence. So it can take some time for your function to process an event, depending on:
- its position in the stream, and
- how long it takes to process each event.
So, while this approach can scale to support many tasks all expiring at the same time, it cannot guarantee that tasks are executed on time.
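For completeness, here is a minimal sketch of what the execute-on-schedule handler might look like. It reacts only to REMOVE events raised by the TTL process (these carry a userIdentity with principalId dynamodb.amazonaws.com, which distinguishes them from application-initiated deletes) and logs how far behind schedule each task is picked up. The attribute names match the hypothetical scheduler sketch above.

```python
import time


def handler(event, context):
    """Lambda handler subscribed to the scheduled_items DynamoDB Stream."""
    now = time.time()
    for record in event["Records"]:
        if record["eventName"] != "REMOVE":
            continue  # only deleted (i.e. expired) items are of interest

        # TTL deletions are performed by the DynamoDB service principal;
        # skip deletes made by the application itself.
        identity = record.get("userIdentity", {})
        if identity.get("principalId") != "dynamodb.amazonaws.com":
            continue

        # With an OLD_IMAGE or NEW_AND_OLD_IMAGES stream view, the deleted
        # item's attributes are available in DynamoDB JSON form.
        old_image = record["dynamodb"]["OldImage"]
        scheduled_for = float(old_image["ttl"]["N"])
        lag_seconds = now - scheduled_for
        print(f"task {old_image['id']['S']} picked up {lag_seconds:.0f}s after its scheduled time")

        # ... execute the actual business logic for the task here ...
```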
Precision
This is the big question mark over this approach. According to the official documentation, expired items are deleted within 48 hours. That is a huge margin of error!
As an experiment, I set up a Step Functions state machine to:
- add a configurable number of items to the scheduled_items table, with TTL expiring between 1 and 10 mins (a sketch of this step follows the list)
- track the time the task is scheduled for and when it’s actually picked up by the execute-on-schedule function
- wait for all the items to be deleted
The state machine looks like this:
I performed several test runs. The results were consistent regardless of the number of items in the table. A quick glimpse at the results table tells you that, on average, a task is executed over 11 mins AFTER its scheduled time.
I repeated the experiments in several other AWS regions:
I don’t know why there is such a marked difference between US-EAST-1 and the other regions. One explanation is that the TTL process requires a bit of time to kick in after a table is created. Since I was developing against the US-EAST-1 region initially, its TTL process had been “warmed up” compared to the other regions.
Conclusions
Based on the results of my experiment, it would appear that using DynamoDB TTL as a scheduling mechanism cannot guarantee reasonable precision.
On the one hand, the approach scales very well. But on the other, the scheduled tasks are executed at least several minutes behind, which renders it unsuitable for many use cases.
Translated from: https://www.freecodecamp.org/news/how-to-schedule-ad-hoc-tasks-with-dynamodb-ttl-and-lambda-421fa5778993/