trie樹查找前綴串
by Julia Geist
Julia·蓋斯特(Julia Geist)
A
Trie
, (also known as a prefix tree) is a special type of tree used to store associative data structures
Trie
(也稱為前綴樹)是一種特殊類型的樹,用于存儲關聯數據結構
A trie (pronounced try) gets its name from retrieval — its structure makes it a stellar matching algorithm.
Trie(發音為try)的名稱取自rev val -其結構使其成為出色的匹配算法。
語境 (Context)
Write your own shuffle method to randomly shuffle characters in a string.
Use the words text file, located at /usr/share/dict/words, and your shuffle method to create an anagram generator that only produces real words.
Given a string as a command line argument, print one of its anagrams.
I was presented with this challenge this week at Make School’s Product Academy.
本周,我在Make School的產品學院里接受了這一挑戰。
The words in the text file are separated by new lines. Its formatting makes it a lot easier to put the words into a data structure. For now, I’m storing them in a list — each element being a single word from the file.
文本文件中的單詞用新行分隔。 它的格式使將單詞放入數據結構變得容易得多。 現在,我將它們存儲在列表中-每個元素都是文件中的單個單詞。
One approach to this challenge is to:
解決這一挑戰的一種方法是:
- randomly shuffle the characters in the string 隨機隨機播放字符串中的字符
then, check it against all words that were in /usr/share/dict/words to verify that it’s a real word.
然后,對照/ usr / share / dict / words中的所有單詞檢查它,以確認它是真實單詞。
However, this approach requires that I check that the randomly shuffled characters in the new string matches one of 235,887 words in that file — that means 235,887 operations for each string that I want to verify as a real word.
但是,這種方法要求我檢查新字符串中隨機混洗的字符是否與該文件中的235887個單詞匹配-這意味著要驗證為真實單詞的每個字符串需要進行235887次操作 。
This was an unacceptable solution for me. I first looked up libraries that had already been implemented to check if words exist in a language, and found pyenchant. I first completed the challenge using the library, in a few lines of code.
對我來說,這是不可接受的解決方案。 我首先查找已經實現的庫,以檢查語言中是否存在單詞,然后找到pyenchant 。 我首先使用庫通過幾行代碼完成了挑戰。
def generateAnagram(string, language="en_US"): languageDict = enchant.Dict(language) numOfPossibleCombinationsForString = math.factorial(len(string)) for i in range(0, numOfPossibleCombinationsForString): wordWithShuffledCharacters = shuffleCharactersOf(string)
if languageDict.check(wordWithShuffledCharacters): return wordWithShuffledCharacters return "There is no anagram in %s for %s." % (language, string)
Using a couple of library functions in my code was a quick and easy solution. However, I didn’t learn much by finding a library to solve the problem for me.
在我的代碼中使用幾個庫函數是一個快速簡便的解決方案。 但是,我沒有找到可以為我解決問題的圖書館,因此學不到很多東西。
I was positive that the library wasn’t using the approach I mentioned earlier. I was curious and dug through the source code — I found a trie.
我很肯定圖書館沒有使用我前面提到的方法。 我很好奇并仔細研究了源代碼,發現了一個特里。
特里 (Trie)
A trie stores data in “steps”. Each step is a node in the trie.
特里樹以“步驟”存儲數據。 每個步驟都是特里樹中的一個節點。
Storing words is a perfect use case for this kind of tree, since there are a finite amount of letters that can be put together to make a string.
對于此類樹而言,存儲單詞是一個完美的用例,因為可以將一定數量的字母放在一起構成一個字符串。
Each step, or node, in a language trie will represent one letter of a word. The steps begin to branch off when the order of the letters diverge from the other words in the trie, or when a word ends.
語言樹中的每個步驟或節點將代表一個單詞的一個字母。 當字母的順序與特里中的其他單詞不同或單詞結束時,步驟開始分支。
I created a trie out of directories on my Desktop to visualize stepping down through nodes. This is a trie that contains two words: apple and app.
我從桌面上的目錄中創建了一個Trie,以可視化逐步通過節點。 這是一個trie,包含兩個詞:apple和app。
Note that the end of a word is denoted with a ‘$’. I’m using ‘$’ because it is a unique character that will not be present in any word in any language.
注意,單詞的結尾用“ $”表示。 我使用的是“ $”,因為它是一個唯一字符,不會以任何語言出現在任何單詞中。
If I were to add the word ‘aperture’ to this trie, I would loop over the letters in the word ‘aperture’ while simultaneously stepping down the nodes in the trie. If the letter exists as a child of the current node, step down into it. If the letter does not exist as a child of the current node, create it and then step down into it. To visualize these steps using my directories:
如果要在此Trie中添加單詞“ aperture”,則將遍歷單詞“ aperture”中的字母,同時降低Trie中的節點。 如果該字母作為當前節點的子代存在,請下移至該節點。 如果該字母不作為當前節點的子代存在,請創建該字母,然后逐步降低該字母。 要使用我的目錄可視化這些步驟:
While stepping down the trie, the first two letters of ‘aperture’ are already present in the trie, so I step down into those nodes.
在下移Trie時,“ aperture”的前兩個字母已經存在于該Trie中,因此我下移到這些節點。
The third letter, ‘e’, however, is not a child of the ‘p’ node. A new node is created to represent the letter ‘e’, branching off from the other words in the trie. New nodes for the letters that follow are created as well.
但是,第三個字母“ e”不是“ p”節點的子代。 創建一個新的節點來表示字母“ e”,并從該樹中的其他單詞分支出來。 還創建了隨后字母的新節點。
To generate a trie from a words file, this process will happen for each word, until all combinations for every word are stored.
為了從單詞文件生成特里樹,將對每個單詞進行此過程,直到存儲每個單詞的所有組合。
You might be thinking: “Wait, won’t it take really long to generate the trie from that text file with 235,887 words in it? What’s the point of looping over every single character in every single word?”
您可能會想:“等等,從包含235,887個單詞的文本文件生成trie是否真的需要很長時間? 是什么在每一個字,遍歷每一個字符的意義呢?”
Yes, iterating over each character of every word to generate a trie does take some time. However, the time taken to create the trie is well worth it — because to check if a word exists in the text file, it takes at most, as many operations as the length of the word itself. Much better than the 235,887 operations it was going to take before.
是的,遍歷每個單詞的每個字符以生成特里確實需要一些時間。 但是,創建特里樹所花的時間是非常值得的 -因為檢查文本文件中是否存在單詞,它最多花費與單詞本身長度一樣多的操作。 比之前要進行的235,887次操作好得多 。
I wrote the simplest version of a trie, using nested dictionaries. This isn’t the most efficient way to implement one, but it is a good exercise to understand the logic behind a trie.
我使用嵌套詞典編寫了最簡單的trie版本。 這不是實現一個的最有效方法,但是了解一個trie背后的邏輯是一個很好的練習。
endOfWord = "$"
def generateTrieFromWordsArray(words): root = {} for word in words: currentDict = root for letter in word: currentDict = currentDict.setdefault(letter, {}) currentDict[endOfWord] = endOfWord return root
def isWordPresentInTrie(trie, word): currentDict = trie for letter in word: if letter in currentDict: currentDict = currentDict[letter] else: return False return endOfWord in currentDict
You can see my solution for the anagram generator on my Github. Since exploring this algorithm, I’ve decided to make this blog post one of many — each post covering one algorithm or data structure. The code is available on my Algorithms and Data Structures repo — star it to stay updated!
您可以在我的Github上看到我的字謎生成器解決方案 。 自探索該算法以來,我決定將該博客文章設為眾多文章之一-每個文章都涉及一種算法或數據結構。 該代碼在我的算法和數據結構存儲庫中可用-對其加注星標以保持更新!
下一步 (Next Steps)
I suggest checking out Ray Wenderlich’s trie repo. Although written in Swift, it’s a valuable source for explanations of various algorithms.
我建議您查看雷·溫德利希(Ray Wenderlich)的trie repo 。 盡管是用Swift編寫的,但它是解釋各種算法的寶貴資源。
Similar to the trie (but more memory efficient) is a suffix tree, or radix. In short, instead of storing single characters at every node, the end of a word, its suffix, is stored and the paths are created relatively.
后綴樹或基數與特里樹類似(但具有更高的存儲效率)。 簡而言之,不是在每個節點上都存儲單個字符,而是存儲單詞的結尾及其后綴,并且相對地創建路徑。
However, a radix is more complicated to implement than a trie. I suggest taking a look at Ray Wenderlich’s radix repo if you’re interested.
但是,基數的實現比trie的實現更為復雜。 如果您有興趣,我建議您看看Ray Wenderlich的基數存儲庫 。
This is the first post of my algorithm and data structures series. In each post, I’ll present a problem that can be better solved with an algorithm or data structure to illustrate the algorithm/data structure itself.
這是我的算法和數據結構系列的第一篇文章。 在每篇文章中,我將介紹一個可以通過算法或數據結構更好地解決的問題,以說明算法/數據結構本身。
Star my algorithms repo on Github and follow me on Twitter if you’d like to follow along!
在Github上為我的算法存儲庫加注星標,如果您想跟隨我,在Twitter上關注我!
Did you gain value by reading this article? Click here to share it on Twitter! If you’d like to see content like this more often, follow me on Medium and subscribe to my once-a-month newsletter below. Feel free to buy me a coffee too.
您通過閱讀本文獲得了價值嗎? 單擊此處在Twitter上分享! 如果您想經常看到這樣的內容,請在Medium上關注我,并訂閱下面的每月一次的新聞通訊。 也可以給我買杯咖啡 。
翻譯自: https://www.freecodecamp.org/news/trie-prefix-tree-algorithm-ee7ab3fe3413/
trie樹查找前綴串