Monday, January 30, 2023

Towards A Token-Free Future In NLP (2022)

For an AI model to understand language, the most prominent approach so far is to train a so-called language model on a massive amount of text and let the model learn from context what words mean and how they compose sentences. For the model to recall and learn the relations between words, a vocabulary is embedded in the model, and these mappings are stored as parameters.

Since the English dictionary contains over 170,000 words and the model needs to learn weights for each word it knows, it is not feasible to store all of them in the vocabulary.

To decrease the number of words to learn, words are instead split into sub-words, or tokens. A so-called tokenizer decides where these splits should be made in the text. A tokenizer is trained to identify a certain number of sub-words from a large corpus: the tokens it learns are the longest and most common sub-words, up to some fixed token vocabulary size, usually around 50,000 tokens.
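As a rough sketch of what this looks like in practice, the snippet below trains a BPE tokenizer with a 50,000-token vocabulary using the Hugging Face tokenizers library; the library choice and the corpus.txt file are assumptions made here for illustration, not details from the original post.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty BPE model and split on whitespace before learning merges.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Learn the most frequent sub-words from the corpus, up to a fixed vocabulary size.
trainer = BpeTrainer(vocab_size=50000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a hypothetical corpus file

tokenizer.save("bpe-50k.json")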

When we train a language model, the tokenizer lets us shrink the vocabulary the model needs to store in memory and therefore reduce the model size significantly. This matters because a smaller model means faster training, lower training cost, and less expensive hardware.
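To see why vocabulary size matters for model size, here is a back-of-the-envelope calculation for the embedding table alone; the hidden size of 768 is an assumed, GPT-2-small-like value, not a figure from the post.

hidden_size = 768  # assumed embedding dimension

word_vocab = 170_000   # roughly one embedding per dictionary word
token_vocab = 50_000   # typical sub-word tokenizer vocabulary

# Each vocabulary entry needs one embedding vector of hidden_size parameters.
print(f"word-level embeddings: {word_vocab * hidden_size:,} parameters")   # 130,560,000
print(f"sub-word embeddings:   {token_vocab * hidden_size:,} parameters")  # 38,400,000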

Below is an illustration of how a tokenizer splits a sentence into a sequence of tokens.
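For a concrete (if approximate) version of that split, the snippet below runs a sentence through the pre-trained GPT-2 tokenizer from the transformers library; the choice of tokenizer is an assumption made here for illustration.

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Tokenization splits rare words into sub-words.")
# Rare or long words come out as several sub-word pieces; the 'Ġ' prefix marks
# a token that begins a new word.
print(tokens)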



