Key Challenges with Quasi Experiments at Netflix


Kamer Toker-Yildiz, Colin McFarland, Julia Glick


At Netflix, when we can’t run A/B experiments we run quasi experiments! We run quasi experiments with various objectives such as non-member experiments focusing on acquisition, member experiments focusing on member engagement, or video streaming experiments focusing on content delivery. Consolidating on one methodology could be a challenge, as we may face different design or data constraints or optimization goals. We discuss some key challenges and approaches Netflix has been using to handle small sample size and limited pre-intervention data in quasi experiments.


Design and Randomization

We face various business problems where we cannot run individual-level A/B tests but can benefit from quasi experiments. For instance, consider the case where we want to measure the impact of TV or billboard advertising on member engagement. It is impossible for us to have identical treatment and control groups at the member level, as we cannot hold back individuals from such forms of advertising. Our solution is to randomize our member base at the smallest possible level. For instance, in most countries TV advertising can be bought only at the TV media-market level, which usually involves groups of cities in close geographic proximity.


One of the major problems we face in quasi experiments is having small sample size where asymptotic properties may not practically hold. We typically have a small number of geographic units due to test limitations and also use broader or distant groups of units to minimize geographic spillovers. We are also more likely to face high variation and uneven distributions in treatment and control groups due to heterogeneity across units. For example, let’s say we are interested in measuring the impact of marketing the Lost in Space series on sci-fi viewing in the UK. London, with its high population, is randomly assigned to the treatment cell, and people in London love sci-fi much more than other cities. If we ignore the latter fact, we will overestimate the true impact of marketing — which is now confounded. In summary, the simple randomization and mean comparison we typically utilize in A/B testing with millions of members may not work well for quasi experiments.

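The London example can be made concrete with a toy simulation. All numbers below (city baselines, the true lift, the noise level) are invented for illustration; the point is only that a naive treated-minus-control comparison absorbs London’s higher baseline into the estimated “effect”:

```python
import random

random.seed(0)

# Hypothetical per-capita sci-fi viewing baselines (hours/week) for UK cities.
# London's baseline is far higher than the others -- a source of confounding.
baselines = {"London": 3.0, "Manchester": 1.0, "Birmingham": 1.1, "Leeds": 0.9}
true_lift = 0.2  # the (made-up) real effect of the marketing campaign

# Suppose the randomization happened to put London in the treatment cell.
treatment, control = ["London", "Manchester"], ["Birmingham", "Leeds"]

def observed(city, treated):
    """Observed viewing = inherent baseline + treatment effect + noise."""
    noise = random.gauss(0, 0.05)
    return baselines[city] + (true_lift if treated else 0.0) + noise

naive_treated = sum(observed(c, True) for c in treatment) / len(treatment)
naive_control = sum(observed(c, False) for c in control) / len(control)
naive_estimate = naive_treated - naive_control

print(f"true lift: {true_lift:.2f}, naive estimate: {naive_estimate:.2f}")
```

With these numbers the naive estimate is several times the true lift, because London’s baseline difference, not the marketing, dominates the comparison.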

Completely tackling these problems during the design phase may not be possible. We use some statistical approaches during design and analysis to minimize bias and maximize precision of our estimates. During design, one approach we utilize is running repeated randomizations, i.e. ‘re-randomization’. In particular, we keep randomizing until we find a randomization that gives us the maximum desired level of balance on key variables across test cells. This approach generally enables us to define more similar test groups (i.e. getting closer to an apples-to-apples comparison). However, we may still face two issues: 1) we can only simultaneously balance on a limited number of observed variables, and it is very difficult to find identical geographic units on all dimensions, and 2) we can still face noisy results with large confidence intervals due to small sample size. We next discuss some of our analysis approaches to further tackle these problems.

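A minimal sketch of the re-randomization idea follows. The unit names, covariates, and balance threshold are hypothetical stand-ins, not our actual implementation: we keep drawing random splits and stop once cell-level covariate means are balanced enough (returning the best split found if no draw beats the threshold):

```python
import random

def imbalance(assignment, covariates):
    """Max absolute difference in covariate means between the two cells."""
    worst = 0.0
    for key in next(iter(covariates.values())):
        t = [covariates[u][key] for u, cell in assignment.items() if cell == "treat"]
        c = [covariates[u][key] for u, cell in assignment.items() if cell == "control"]
        worst = max(worst, abs(sum(t) / len(t) - sum(c) / len(c)))
    return worst

def rerandomize(units, covariates, threshold, max_tries=10_000, seed=42):
    """Keep drawing random 50/50 splits until balance beats `threshold`."""
    rng = random.Random(seed)
    best = None
    for _ in range(max_tries):
        shuffled = units[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        assignment = {u: ("treat" if i < half else "control")
                      for i, u in enumerate(shuffled)}
        score = imbalance(assignment, covariates)
        if best is None or score < best[0]:
            best = (score, assignment)
        if score <= threshold:
            break
    return best  # best (score, assignment) seen, even if threshold never met

# Hypothetical media-market covariates: population (millions), baseline signups.
covs = {
    "dma_a": {"pop": 8.0, "signups": 120.0},
    "dma_b": {"pop": 2.5, "signups": 40.0},
    "dma_c": {"pop": 7.5, "signups": 110.0},
    "dma_d": {"pop": 2.8, "signups": 45.0},
}
score, assignment = rerandomize(list(covs), covs, threshold=3.0)
```

With these four units, the only split that clears the threshold pairs the two large markets against each other (and likewise the two small ones), which is exactly the balance we are after.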

Analysis

Going Beyond Simple Comparisons

Difference in differences (diff-in-diff or DID) comparison is a very common approach used in quasi experiments. In diff-in-diff, we usually consider two time periods: pre- and post-intervention. We utilize the pre-intervention period to generate baselines for our metrics, and normalize post-intervention values by the baseline. This normalization is a simple but very powerful way of controlling for inherent differences between treatment and control groups. For example, let’s say our success metric is signups and we are running a quasi experiment in France. We have Paris and Lyon in two test cells. We cannot directly compare signups in the two cities as their populations are very different. Normalizing with respect to pre-intervention signups reduces variation and helps us make comparisons at the same scale. Although the diff-in-diff approach generally works reasonably well, we have observed some cases where it may not be as applicable, as we discuss next.

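The Paris/Lyon example can be sketched as follows. The signup counts are invented, and a real analysis would also produce confidence intervals; the sketch shows both the additive diff-in-diff estimate and the ratio form that normalizes each cell by its own pre-period baseline:

```python
# Hypothetical daily signup counts for a two-cell geo test in France.
pre_treat, post_treat = [200, 210, 190, 205], [260, 250, 255, 265]
pre_ctrl,  post_ctrl  = [80, 85, 75, 82], [88, 90, 86, 92]

mean = lambda xs: sum(xs) / len(xs)

# Each cell's post-period change relative to its own pre-period baseline...
delta_treat = mean(post_treat) - mean(pre_treat)
delta_ctrl = mean(post_ctrl) - mean(pre_ctrl)

# ...and the difference of those differences is the treatment estimate.
did_estimate = delta_treat - delta_ctrl

# Ratio form: normalize by baseline, so cells of very different size are
# compared at the same scale (a relative lift rather than absolute counts).
relative_lift = (mean(post_treat) / mean(pre_treat)) \
    / (mean(post_ctrl) / mean(pre_ctrl)) - 1

print(f"diff-in-diff: {did_estimate:.1f} signups/day, "
      f"relative lift: {relative_lift:.1%}")
```

Note that the raw post-period gap between the two cells is enormous simply because one city is bigger; differencing against each cell’s own baseline removes that inherent gap before comparing.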

Success Metrics With Historical Observations But Small Sample Size

In our non-member focused tests, we can observe historical acquisition metrics, e.g. signup counts; however, we don’t typically observe any other information about non-members. High variation in outcome metrics combined with small sample size makes it difficult to design a well-powered experiment using traditional diff-in-diff-like approaches. To tackle this problem, we try to implement designs involving multiple interventions in each unit over an extended period of time whenever possible (i.e. instead of a typical experiment with a single intervention period). This can help us gather enough evidence to run a well-powered experiment even with a very small sample size (i.e. few geographic units).


In particular, we turn the intervention (e.g. advertising) “on” and “off” repeatedly over time in different patterns and geographic units to capture short-term effects. Every time we “toggle” the intervention, it gives us another chance to read the effect of the test. So even if we only have a few geographic units, we can eventually read a reasonably precise estimate of the effect size (although, of course, results may not be generalizable to others if we have very few units). As our analysis approach, we can use observations from steady-state units to estimate what would otherwise have happened in units that are changing. To estimate the treatment effect, we fit a dynamic linear model (aka DLM), a type of state space model where the observations are conditionally Gaussian. DLMs are a very flexible category of models, but we only use a narrow subset of possible DLM structures to keep things simple. We currently have a robust internal package embedded in our internal tool, Quasimodo, to cover experiments that have similar structure. Our model is comparable to Google’s CausalImpact package, but uses a multivariate structure to let us analyze more than a single point-in-time intervention in a single region.

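Our production models are DLMs, but the core idea — borrowing units that are “off” at a given time to estimate the counterfactual for units that are “on” — can be shown with a deliberately simplified, noiseless sketch. The unit names, baselines, toggle patterns, and the plain averaging scheme below are illustrative stand-ins, not the actual model (which also estimates baselines and handles noise via the state space structure):

```python
# Hypothetical daily signups for three markets over 12 days; advertising is
# toggled on/off in different patterns per unit (1 = on, 0 = off).
days = list(range(12))
toggles = {
    "dma_a": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    "dma_b": [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
    "dma_c": [0] * 12,  # steady-state unit, never treated
}
# Simulated observations: unit baseline + shared daily pattern + effect.
baseline = {"dma_a": 100.0, "dma_b": 60.0, "dma_c": 80.0}
shared = [0, 2, 4, 6, 4, 2, 0, 2, 4, 6, 4, 2]  # common day-of-week-like swing
true_effect = 5.0
obs = {u: [baseline[u] + shared[t] + true_effect * toggles[u][t]
           for t in days] for u in toggles}

# Step 1: estimate the shared daily pattern from units that are "off" that
# day -- this is the counterfactual for whatever units are "on" that day.
common = []
for t in days:
    off = [obs[u][t] - baseline[u] for u in toggles if toggles[u][t] == 0]
    common.append(sum(off) / len(off))

# Step 2: average the residual uplift across all "on" unit-days. Because the
# toggles repeat, every cycle contributes another read of the effect.
uplifts = [obs[u][t] - baseline[u] - common[t]
           for u in toggles for t in days if toggles[u][t] == 1]
effect_estimate = sum(uplifts) / len(uplifts)
```

In this noiseless setting the estimate recovers the effect exactly; with noise, each additional on/off cycle tightens the estimate, which is why the repeated-toggle design buys power that a single intervention period cannot.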

Success Metrics Without Historical Observations

In our member focused tests, we sometimes face cases where we don’t have success metrics with historical observations. For example, Netflix promotes its new shows that are yet to be launched on service to increase member engagement once the show is available. For a new show, we start observing metrics only when the show launches. As a result, our success metrics inherently don’t have any historical observations making it impossible to utilize the benefits of similar time series based approaches.


In these cases, we utilize the benefits of richer member data to measure and control for members’ inherent engagement or interest in the show. We do this by using relevant pre-treatment proxies, e.g. viewing of similar shows, interest in Netflix originals or similar genres. We have observed that controlling for geographic as well as individual-level differences works best in minimizing confounding effects and improving precision. For example, if members in Toronto watch more Netflix originals than members in other cities in Canada, we should then control for pre-treatment Netflix originals viewing at both the individual and city level to capture within- and between-unit variation separately.


This is in essence very similar to covariate adjustment. However, we do more than just running a simple regression with a large set of control variables. At Netflix, we have worked on developing approaches at the intersection of regression covariate adjustment and machine-learning-based propensity score matching by using a wide set of relevant member features. Such combined approaches help us explicitly control for members’ inherent interest in the new show using hundreds of features while minimizing linearity assumptions and degrees-of-freedom challenges we may face. We thus gain significant wins in both reducing potential confounding effects as well as maximizing precision to more accurately capture the treatment effect we are interested in.

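A stripped-down illustration of why adjusting for a pre-treatment proxy matters follows. The data are noiseless and invented, and a single-covariate least-squares fit is far simpler than our actual approach (which combines regression adjustment with ML-based propensity scores over hundreds of features) — but the mechanics are the same: predict each treated member’s counterfactual from the control fit, then average the residuals:

```python
# Hypothetical member-level data: pre-treatment "similar title" viewing hours
# (a proxy for inherent interest) and post-launch hours of the new show.
# Treated members happened to skew toward higher inherent interest.
treated = [(10, 6.0), (8, 5.0), (9, 5.5), (3, 2.5)]  # (proxy, outcome)
controls = [(2, 1.0), (4, 2.0), (6, 3.0), (8, 4.0)]

mean = lambda xs: sum(xs) / len(xs)

# Naive comparison ignores the imbalance in the proxy:
naive = mean([y for _, y in treated]) - mean([y for _, y in controls])

# Covariate adjustment: fit outcome ~ proxy on controls (least squares),
# then compare each treated member to their predicted counterfactual.
xs, ys = zip(*controls)
xbar, ybar = mean(xs), mean(ys)
slope = sum((x - xbar) * (y - ybar) for x, y in controls) \
    / sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar
adjusted = mean([y - (intercept + slope * x) for x, y in treated])

print(f"naive: {naive:.2f}, covariate-adjusted: {adjusted:.2f}")
```

The naive estimate more than doubles the adjusted one here, because the treated group’s higher inherent interest is mistaken for a treatment effect unless the proxy is controlled for.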

Next Steps

We have excelled in the quasi experimentation space with many measurement strategies now in play across Netflix for various use cases. However, we are not done yet! We can expand methodologies to more use cases and continue to improve the measurement. As an example, another exciting area we have yet to explore is combining these approaches for those metrics where we can use both time series approaches and a rich set of internal features (e.g. general member engagement metrics). If you’re interested in working on these and other causal inference problems, join our dream team!


Translated from: https://netflixtechblog.com/key-challenges-with-quasi-experiments-at-netflix-89b4f234b852
