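The _source field contains the original JSON document body that was passed in at index time. The _source field itself is not indexed (and is therefore not searchable), but it is stored so that it can be returned when executing fetch requests such as get or search.
If disk usage is a concern, consider the following options:
- Use synthetic _source, which reconstructs the source content at retrieval time instead of storing it on disk. This reduces disk usage, but makes access to _source slower in Get and Search queries.
- Disable the _source field entirely. This reduces disk usage, but disables features that rely on _source.
What is synthetic _source?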
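When a document is indexed, the values of some fields (for example, fields that need doc_values or stored fields) are copied from _source, according to their data type, into separate small lists called doc_values (a different on-disk data structure used for pattern matching), so that those values can be searched independently. Once the required values are found in these small lists, the original document is returned. Because only the small lists are searched, rather than every field value of every document, the search takes less time. This speeds things up, but the same data ends up stored twice: once in the small lists and once in the original document.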
Further reading:
- Elasticsearch: inverted index, doc_values, and source
- Elasticsearch: understanding the store attribute in mappings
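Synthetic _source is an index configuration mode that changes how documents are processed at ingest time in order to save storage space and avoid duplicating data. It still creates the separate lists, but it does not keep the original raw document. Instead, once the values are found, the _source content is rebuilt from the data in the small lists. Because only the lists are stored on disk and the original document is not, a large amount of storage space can be saved.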
PUT idx
{"settings": {"index": {"mapping": {"source": {"mode": "synthetic"}}}}
}
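To make the idea concrete, here is a minimal Python sketch of a toy store that keeps one small list per field and rebuilds a document on demand. It is only a mental model: the `ToyColumnStore` class and its methods are invented for illustration and do not reflect how Lucene or Elasticsearch actually lay out doc_values on disk.

```python
from collections import defaultdict


class ToyColumnStore:
    """Toy model: one small list (column) per field, keyed by document id."""

    def __init__(self):
        # field name -> {doc_id: value}
        self.columns = defaultdict(dict)

    def index(self, doc_id, doc):
        # Copy every field value into its own column.
        # The original document is deliberately NOT kept.
        for field, value in doc.items():
            self.columns[field][doc_id] = value

    def search(self, field, value):
        # Only the column for `field` is scanned, not whole documents.
        column = self.columns.get(field, {})
        return sorted(doc_id for doc_id, v in column.items() if v == value)

    def synthetic_source(self, doc_id):
        # Rebuild the document on the fly from the per-field columns.
        return {
            field: self.columns[field][doc_id]
            for field in sorted(self.columns)
            if doc_id in self.columns[field]
        }


store = ToyColumnStore()
store.index(1, {"host": "Host_01", "status": 200})
store.index(2, {"host": "Host_02", "status": 404})

hits = store.search("status", 404)
rebuilt = store.synthetic_source(hits[0])
print(hits, rebuilt)
```

The rebuilt document has the same field values as the one that was indexed, but it is assembled at read time, which is where the extra retrieval cost comes from.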
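Note that because the _source value is rebuilt on the fly when a document is retrieved, the rebuild takes extra time. Storage space is saved at the cost of slower retrieval: on-the-fly reconstruction is usually slower than saving the source document as is and loading it at query time. The extra latency can be avoided by not loading the _source field when it is not needed.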
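As a rough illustration of not loading _source, the Python sketch below builds a search request body that sets `"_source": false` and asks for selected values through `docvalue_fields` instead. Both are options of the search API; the helper name `build_search_body` is hypothetical, and the field names are borrowed from the test index used later in this post, assuming `keyword_field` keeps its default doc_values.

```python
import json


def build_search_body(field, text, docvalue_fields=None):
    """Hypothetical helper: a match query that skips loading _source."""
    body = {
        "query": {"match": {field: text}},
        # Do not load (or rebuild) _source for the hits.
        "_source": False,
    }
    if docvalue_fields:
        # Return only these fields, read directly from doc_values.
        body["docvalue_fields"] = list(docvalue_fields)
    return body


body = build_search_body("multi_field", "info", ["keyword_field"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a `GET test_synthetic/_search` request.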
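Synthetic _source is currently widely used in logsdb and TSDB, where it can save a great deal of disk space.
See also: Elasticsearch 8.17 Logsdb: a powerful tool for enterprises to cut costs and improve efficiency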
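Supported fields
Synthetic _source is supported by all field types. Depending on implementation details, field types have different properties when used with synthetic _source.
Most field types construct synthetic _source from existing data, most commonly doc_values and stored fields. For these field types, no additional space is needed to store the contents of the _source field. Because of the storage layout of doc_values, the generated _source field is modified compared with the original document.
For all other field types, the original value of the field is stored as is, in the same way as the _source field in non-synthetic mode. In this case there are no modifications, and the field data in _source is identical to the original document. Similarly, malformed values of fields that use ignore_malformed or ignore_above need to be stored as is. This approach is less storage efficient, because the data needed for _source reconstruction is stored in addition to the other data required to index the field (such as doc_values).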
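Synthetic _source restrictions
Some field types have additional restrictions. These restrictions are documented in the synthetic _source section of each field type's documentation.
Synthetic _source is not supported in source-only snapshot repositories. To store indices that use synthetic _source, choose a different repository type.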
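Synthetic _source modifications
When synthetic _source is enabled, retrieved documents undergo some modifications compared with the original JSON.
Arrays moved to leaf fields
Because _source values are rebuilt from the values in the doc values lists, arrays in synthetic _source are moved to the leaf fields. For example: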
PUT idx/_doc/1
{"foo": [{"bar": 1},{"bar": 2}]
}
becomes:
{"foo": {"bar": [1, 2]}
}
This can cause some arrays to vanish:
PUT idx/_doc/1
{"foo": [{"bar": 1},{"baz": 2}]
}
becomes:
{"foo": {"bar": 1,"baz": 2}
}
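The two transformations above can be approximated in a few lines of Python. This is a simplified sketch meant only to show the reshaping: `move_arrays_to_leaves` and `_unwrap` are made-up names, and the sketch ignores other effects of the doc_values layout.

```python
def _unwrap(values):
    # A leaf that collected a single value is no longer an array.
    return values[0] if len(values) == 1 else values


def move_arrays_to_leaves(value):
    """Approximate how an array of objects is pushed down to its leaf fields."""
    if isinstance(value, dict):
        return {key: move_arrays_to_leaves(child) for key, child in value.items()}
    if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
        collected = {}
        for item in value:
            for key, child in item.items():
                # A child that is itself a list contributes all its elements.
                parts = child if isinstance(child, list) else [child]
                collected.setdefault(key, []).extend(parts)
        return {
            key: move_arrays_to_leaves(_unwrap(children))
            for key, children in collected.items()
        }
    return value


first = move_arrays_to_leaves({"foo": [{"bar": 1}, {"bar": 2}]})
second = move_arrays_to_leaves({"foo": [{"bar": 1}, {"baz": 2}]})
print(first)
print(second)
```

In the second case each leaf ends up with a single value, so neither `bar` nor `baz` is an array any more, which is how the array "vanishes".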
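Fields named as they are mapped
Synthetic _source uses the original field names as they appear in the mapping. When used with dynamic mapping, fields with dots (.) in their names are, by default, interpreted as multiple objects, whereas dots in field names are preserved within objects that have subobjects disabled. For example: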
PUT idx/_doc/1
{"foo.bar.baz": 1
}
becomes:
{"foo": {"bar": {"baz": 1}}
}
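The default behaviour (dots interpreted as nested objects) can be sketched in Python as follows. The function name `expand_dotted_names` is hypothetical, and the sketch only handles dotted names at the top level of the document.

```python
def expand_dotted_names(doc):
    """Turn {"a.b.c": 1} into {"a": {"b": {"c": 1}}} (top-level keys only)."""
    expanded = {}
    for name, value in doc.items():
        *parents, leaf = name.split(".")
        node = expanded
        for part in parents:
            # Walk down, creating intermediate objects as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return expanded


result = expand_dotted_names({"foo.bar.baz": 1})
print(result)
```

With subobjects disabled on the parent object, this expansion would not happen and the dotted name would be kept as is.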
How to configure an index for synthetic _source mode
Test code: in this test, an index in synthetic _source mode is compared with a standard index.
PUT index
{"settings": {"index": {"mapping": {"source": {"mode": "synthetic"}}}}
}
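Testing
The standard index uses a multi-field to show how documents can be retrieved through full-text search and aggregations, and its _source content includes the value of the disabled field.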
PUT test_standard
{"mappings": {"properties": {"disabled_field": {"enabled": false},"multi_field": {"type": "text","fields": {"keyword": {"type": "keyword"}}}}}
}
Let's index a few sample documents:
PUT test_standard/_doc/1
{"multi_field": "Host_01","disabled_field" : "Required for storage 01"
}
PUT test_standard/_doc/2
{"multi_field": "Host_02","disabled_field" : "Required for storage 02"
}
PUT test_standard/_doc/3
{"multi_field": "Host_03","disabled_field" : "Required for storage 03"
}
A full-text search retrieves the documents together with their _source content:
GET test_standard/_search
{"query": {"match": {"multi_field": "host_01"}}
}
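Result: the document is retrieved by a full-text search against the analyzed field. The returned result contains all values in _source, including the disabled field: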
{"took": 17,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 1,"relation": "eq"},"max_score": 0.9808291,"hits": [{"_index": "test_standard","_id": "1","_score": 0.9808291,"_source": {"multi_field": "Host_01","disabled_field": "Required for storage 01"}}]}
}
Here, the index in synthetic _source mode uses multi-fields to show how the "text" data type works with the "doc values" lists, and how values in a disabled field are handled.
PUT test_synthetic
{"settings": {"index": {"mapping": {"source": {"mode": "synthetic"}}}},"mappings": {"properties": {"keyword_field": {"type": "keyword"},"multi_field": {"type": "text","fields": {"keyword": {"type": "keyword"}}},"text_field": {"type": "text"},"disabled_field": {"enabled": false},"skills_array_field": {"properties": {"language": {"type": "text"},"level": {"type": "text"}}}}}
}
Let's index a few sample documents:
PUT test_synthetic/_doc/1
{"keyword_field": "Host_01","disabled_field": "Required for storage 01","multi_field": "Some info about computer 1","text_field": "This is a text field 1","skills_array_field": [{"language": "ruby","level": "expert"},{"language": "javascript","level": "beginner"}],"foo": [{"bar": 1},{"bar": 2}],"foo1.bar.baz": 1
}
PUT test_synthetic/_doc/2
{"keyword_field": "Host_02","disabled_field": "Required for storage 02","multi_field": "Some info about computer 2","text_field": "This is a text field 2","skills_array_field": [{"language": "C","level": "guru"},{"language": "javascript","level": "beginner"}],"foo": [{"bar": 1},{"bar": 2}],"foo1.bar.baz": 2
}
PUT test_synthetic/_doc/3
{"keyword_field": "Host_03","disabled_field": "Required for storage 03","multi_field": "Some info about computer 3","text_field": "This is a text field 3","skills_array_field": [{"language": "golang","level": "beginner"}],"foo": [{"bar": 1},{"bar": 2}],"foo1.bar.baz": 3
}
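Searching a "keyword" field requires an exact match. Note that in the response shown below, the value of the disabled field is still returned in _source.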
GET test_synthetic/_search
{"query": {"match": {"keyword_field": "Host_01"}}
}
Response:
{"took": 1,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 1,"relation": "eq"},"max_score": 0.9808291,"hits": [{"_index": "test_synthetic","_id": "1","_score": 0.9808291,"_source": {"keyword_field": "Host_01","disabled_field": "Required for storage 01","multi_field": "Some info about computer 1","text_field": "This is a text field 1","skills_array_field": [{"language": "ruby","level": "expert"},{"language": "javascript","level": "beginner"}],"foo": [{"bar": 1},{"bar": 2}],"foo1.bar.baz": 1}}]}
}
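The difference between the two searches so far (lowercase `host_01` matching on the analyzed text field, while the keyword field needs the exact value `Host_01`) can be mimicked with a small Python sketch. The `analyze` function is a crude stand-in for the standard analyzer (lowercasing and splitting on non-word characters), not its real implementation, and all function names here are made up for illustration.

```python
import re


def analyze(text):
    """Crude stand-in for the standard analyzer: lowercase, then split into word tokens."""
    return re.findall(r"\w+", text.lower())


def text_match(query, stored):
    # A match query on a text field analyzes both sides and, by default,
    # matches when any query token is present.
    stored_tokens = set(analyze(stored))
    return any(token in stored_tokens for token in analyze(query))


def keyword_match(query, stored):
    # A keyword field is not analyzed, so the whole value must be identical.
    return query == stored


print(text_match("host_01", "Host_01"))
print(keyword_match("host_01", "Host_01"))
print(keyword_match("Host_01", "Host_01"))
```

Under this simplified model, the lowercase query matches the text field but not the keyword field.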
Let's run another search:
GET test_synthetic/_search
{"query": {"match": {"multi_field": "info"}}
}
The response is:
{"took": 1,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 3,"relation": "eq"},"max_score": 0.13353139,"hits": [{"_index": "test_synthetic","_id": "2","_score": 0.13353139,"_source": {"keyword_field": "Host_02","disabled_field": "Required for storage 02","multi_field": "Some info about computer 2","text_field": "This is a text field 2","skills_array_field": [{"language": "C","level": "guru"},{"language": "javascript","level": "beginner"}],"foo": [{"bar": 1},{"bar": 2}],"foo1.bar.baz": 2}},{"_index": "test_synthetic","_id": "3","_score": 0.13353139,"_source": {"keyword_field": "Host_03","disabled_field": "Required for storage 03","multi_field": "Some info about computer 3","text_field": "This is a text field 3","skills_array_field": [{"language": "golang","level": "beginner"}],"foo": [{"bar": 1},{"bar": 2}],"foo1.bar.baz": 3}},{"_index": "test_synthetic","_id": "1","_score": 0.13353139,"_source": {"keyword_field": "Host_01","disabled_field": "Required for storage 01","multi_field": "Some info about computer 1","text_field": "This is a text field 1","skills_array_field": [{"language": "ruby","level": "expert"},{"language": "javascript","level": "beginner"}],"foo": [{"bar": 1},{"bar": 2}],"foo1.bar.baz": 1}}]}
}
For further reading, see the official documentation: _source field | Elastic Documentation