Creating a basic Tokenizer

Arnav Goel
Jun 4, 2023

Introduction

In the previous article, Natural Language Processing basics, we talked about each step of Natural Language Processing and established the entire pipeline of how NLP works. We will now go into detail on one of the topics in Natural Language Processing: tokenization.

The purpose of this article is to go in depth on how tokenization works and how to implement a tokenizer. The main idea is to break the textual input into fragments that contain granular yet useful data; these fragments are called tokens.

“Sentencizing”, or splitting input text into sentences, is another step that is done along with tokenizing the text. It is important to have the sentence structure in order to determine parts of speech, dependency relations, etc. Sentencizing can be a bit trickier than tokenizing, since text structure is more free-form than sentence structure (not in morphological terms, but in graphical terms).

A Basic Sentencizer

The basic sentencizer will start a new sentence at every full stop, exclamation mark, or question mark. The user can change this by providing a different list of split characters.

To mark where the text should be divided, we append a special delimiter tag (delimiter_token, default <SPLIT>) right after each of these punctuation marks, and then split the text on that tag.

class BasicSentencizer:
    def __init__(self, input_text, split_characters=['.', '?', '!'], delimiter_token='<SPLIT>'):
        self.sentences = []
        self.input_text = str(input_text)
        self._split_characters = split_characters
        self._delimiter_token = delimiter_token
        self._index = 0
        self._sentencize()

    # For every iteration, return the object itself.
    def __iter__(self):
        return self

    # On each iteration, return the next sentence.
    def __next__(self):
        if self._index < len(self.sentences):
            result = self.sentences[self._index]
            self._index += 1
            return result
        raise StopIteration

    def _sentencize(self):
        # Append the delimiter token after every split character,
        # then split on that token to obtain the sentences.
        string_with_separator = self.input_text
        for character in self._split_characters:
            string_with_separator = string_with_separator.replace(character, character + " " + self._delimiter_token)
        self.sentences = [x.strip() for x in string_with_separator.split(self._delimiter_token) if x != '']

This class splits the input text into sentences and stores them in a list (self.sentences), which you can also iterate over directly since the class implements __iter__ and __next__.
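For example, here is a short usage sketch (the sample sentence is just an illustration):

sentencizer = BasicSentencizer("Hello world! How are you? I am fine.")
print(sentencizer.sentences)
# ['Hello world!', 'How are you?', 'I am fine.']

# Because __iter__ and __next__ are defined, the object itself is iterable:
for sentence in sentencizer:
    print(sentence)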

Cons

There are several issues with this sentencizer (illustrated in the snippet after the list):

  1. List numbers, decimal numbers, prices: 1.2, 3.4 and so on
  2. Abbreviations: etc., i.e., A.I. and so on
  3. Multiple consecutive exclamation marks, question marks, or full stops: “Hello !!!”, “how are you ??”, “Life goes on ….”
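To make these failure modes concrete, here is what the class above produces on a sentence that mixes a decimal number, an abbreviation, and repeated punctuation (the sample text is just an illustration):

broken = BasicSentencizer("The price is 1.2 dollars. Contact Dr. Smith!!")
print(broken.sentences)
# ['The price is 1.', '2 dollars.', 'Contact Dr.', 'Smith!', '!']
# The decimal and the abbreviation are split apart, and the repeated
# exclamation marks produce a stray one-character "sentence".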

Future Steps

  1. Create a robust tokenizer that overcomes the cons mentioned above (a rough sketch of one idea follows this list)
  2. Build a production-grade tokenizer similar to the ones we see in the NLTK library
  3. Explain all the different tokenizers available
  4. Create a distributed queue that takes in documents and produces tokens, and write a function to store these tokens in a NoSQL database.
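As a rough preview of the first step, one possible direction is a regex-based split that only breaks on punctuation followed by whitespace and an uppercase letter, combined with a small list of known abbreviations. This is only an illustrative sketch under those assumptions, not the approach used by NLTK or the final design:

import re

# Illustrative only: split on '.', '!' or '?' (possibly repeated) when followed
# by whitespace and an uppercase letter, and avoid breaking right after a few
# hard-coded abbreviations. Both the regex and the abbreviation list are assumptions.
_KNOWN_ABBREVIATIONS = {'dr.', 'etc.', 'i.e.', 'e.g.', 'a.i.'}
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def smarter_sentencize(text):
    sentences, current = [], []
    for chunk in _BOUNDARY.split(text):
        current.append(chunk)
        words = chunk.strip().split()
        # Keep accumulating if the chunk ends in a known abbreviation.
        if words and words[-1].lower() not in _KNOWN_ABBREVIATIONS:
            sentences.append(' '.join(current).strip())
            current = []
    if current:
        sentences.append(' '.join(current).strip())
    return sentences

print(smarter_sentencize("Prices rose to 1.2 dollars!! Dr. Smith disagreed. Life goes on..."))
# ['Prices rose to 1.2 dollars!!', 'Dr. Smith disagreed.', 'Life goes on...']

This keeps decimals like 1.2 and abbreviations like Dr. inside a single sentence, at the cost of maintaining an abbreviation list; trained tokenizers such as the ones in NLTK avoid that manual work.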

Conclusion

Thanks to Tiago Duque for his article and his amazing explanation of how to create a simple tokenizer. Check out the original post until the next part.

Also check out https://pypi.org/project/nlpytools/ to install this basic project.
