Creating a basic Tokenizer

Arnav Goel
Jun 4, 2023

Introduction

In the previous article, Natural Language Processing basics, we talked about each step of Natural Language Processing and established the entire pipeline of how NLP works. We will now go into detail on one of the topics in Natural Language Processing: tokenization.

The purpose of this article is to go in depth on how tokenization works and how to implement a tokenizer. The main idea is to break the textual input into fragments that contain granular yet useful data; these fragments are called tokens.

“Sentencizing”, or splitting input text into sentences, is another step that is done along with tokenizing the text. It is important to have the sentence structure in order to determine parts of speech, dependency relations, etc. Sentencizing can be a bit trickier than tokenizing, since text structure is more free-form than sentence structure (not in morphological terms, but in graphical terms).

A Basic Sentencizer

The basic sentencizer will start a new sentence at every full stop, exclamation mark, or question mark. The user can change this by providing a different list of split characters.

To mark where the text should be divided, we append a special delimiter tag (delimiter_token, default <SPLIT>) right after each of these punctuation marks, and then split the text on that tag.

class BasicSentencizer:
    def __init__(self, input_text, split_characters=['.', '?', '!'], delimiter_token='<SPLIT>'):
        self.sentences = []
        self.input_text = str(input_text)
        self._split_characters = split_characters
        self._delimiter_token = delimiter_token
        self._index = 0
        self._sentencize()

    # For every iteration, return the object itself.
    def __iter__(self):
        return self

    # On each iteration, return the next sentence.
    def __next__(self):
        if self._index < len(self.sentences):
            result = self.sentences[self._index]
            self._index += 1
            return result
        raise StopIteration

    def _sentencize(self):
        # Append the delimiter token after every split character,
        # then split on that token to obtain the sentences.
        string_with_separator = self.input_text
        for character in self._split_characters:
            string_with_separator = string_with_separator.replace(character, character + " " + self._delimiter_token)
        self.sentences = [x.strip() for x in string_with_separator.split(self._delimiter_token) if x != '']

This class splits the input text into sentences and stores them in a list (self.sentences), which you can also iterate over directly since the class implements __iter__ and __next__.
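For example, here is a short usage sketch (the sample sentence is just an illustration):

sentencizer = BasicSentencizer("Hello world! How are you? I am fine.")
print(sentencizer.sentences)
# ['Hello world!', 'How are you?', 'I am fine.']

# Because __iter__ and __next__ are defined, the object itself is iterable:
for sentence in sentencizer:
    print(sentence)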

Cons

There are several issues with this sentencizer (illustrated in the snippet after the list):

  1. List numbers, decimal numbers, prices: 1.2, 3.4 and so on
  2. Abbreviations: etc., i.e., A.I. and so on
  3. Multiple consecutive exclamation marks, question marks, or full stops: “Hello !!!”, “how are you ??”, “Life goes on ….”
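To make these failure modes concrete, here is what the class above produces on a sentence that mixes a decimal number, an abbreviation, and repeated punctuation (the sample text is just an illustration):

broken = BasicSentencizer("The price is 1.2 dollars. Contact Dr. Smith!!")
print(broken.sentences)
# ['The price is 1.', '2 dollars.', 'Contact Dr.', 'Smith!', '!']
# The decimal and the abbreviation are split apart, and the repeated
# exclamation marks produce a stray one-character "sentence".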

Future Steps

  1. Create a robust tokenizer that overcomes the cons mentioned above (a rough sketch of one idea follows this list)
  2. Build a production-grade tokenizer similar to the ones we see in the NLTK library
  3. Explain all the different tokenizers available
  4. Create a distributed queue that takes in documents and produces tokens, and write a function to store these tokens in a NoSQL database.
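As a rough preview of the first step, one possible direction is a regex-based split that only breaks on punctuation followed by whitespace and an uppercase letter, combined with a small list of known abbreviations. This is only an illustrative sketch under those assumptions, not the approach used by NLTK or the final design:

import re

# Illustrative only: split on '.', '!' or '?' (possibly repeated) when followed
# by whitespace and an uppercase letter, and avoid breaking right after a few
# hard-coded abbreviations. Both the regex and the abbreviation list are assumptions.
_KNOWN_ABBREVIATIONS = {'dr.', 'etc.', 'i.e.', 'e.g.', 'a.i.'}
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def smarter_sentencize(text):
    sentences, current = [], []
    for chunk in _BOUNDARY.split(text):
        current.append(chunk)
        words = chunk.strip().split()
        # Keep accumulating if the chunk ends in a known abbreviation.
        if words and words[-1].lower() not in _KNOWN_ABBREVIATIONS:
            sentences.append(' '.join(current).strip())
            current = []
    if current:
        sentences.append(' '.join(current).strip())
    return sentences

print(smarter_sentencize("Prices rose to 1.2 dollars!! Dr. Smith disagreed. Life goes on..."))
# ['Prices rose to 1.2 dollars!!', 'Dr. Smith disagreed.', 'Life goes on...']

This keeps decimals like 1.2 and abbreviations like Dr. inside a single sentence, at the cost of maintaining an abbreviation list; trained tokenizers such as the ones in NLTK avoid that manual work.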

Conclusion

Thanks to Tiago Duque for his article and his amazing explanation of how to create a simple tokenizer. Check out the original post until the next part.

Also check out https://pypi.org/project/nlpytools/ to install this basic project.
