Part Of Speech tagging and Hidden Markov Models
Introduction
In this chapter, we will look at the second part of the NLP process, i.e. how to tag sentences with grammatical categories. More importantly, we will understand how relationships are formed between these different categories of textual data. Assigning these grammatical categories to words is called POS tagging.
If you are new to the NLP world, please review the article Natural Language Processing basics. It will help you piece together how the NLP process works from beginning to end. Of course, in the future we will delve into large language models and more complicated topics, but for now let’s understand how the basics of NLP actually work.
Part of Speech Tagging
Part of speech tagging (POS tagging) is a natural language processing (NLP) technique that involves assigning grammatical categories or “parts of speech” to each word in a given text or sentence.
It falls primarily into two distinctive groups: rule-based and stochastic.
The parts of speech include categories such as
- nouns,
- verbs,
- adjectives,
- adverbs,
- pronouns,
- prepositions,
- conjunctions, and more …
POS tagging helps machine learning models understand the grammatical structure of a sentence, which in turn aids in extracting meaning and performing various linguistic analyses.
For example, consider the sentence: “The cat is sitting on the mat.” A POS tagging analysis might label the words as follows:
- “The” — Determiner
- “cat” — Noun
- “is” — Verb
- “sitting” — Verb
- “on” — Preposition
- “the” — Determiner
- “mat” — Noun
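Off-the-shelf taggers produce exactly this kind of analysis. As a quick illustration, here is a minimal sketch using NLTK’s built-in tagger; it assumes the tokenizer and tagger resources have been downloaded (resource names can vary slightly across NLTK versions).
import nltk

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # first run only
sentence = "The cat is sitting on the mat."
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Typically prints something like:
# [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'),
#  ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]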
POS tagging is essential in many NLP applications, such as:
- Text Analysis
- Information Retrieval
- Machine Translation
- Named Entity Recognition
- Sentiment Analysis
- Grammar Checking
- Speech Recognition
POS tagging is typically done using statistical models or machine learning algorithms trained on large labeled datasets. Many natural language processing (NLP) applications use stochastic techniques to determine part of speech. The appeal of stochastic techniques over traditional rule-based techniques comes from the ease with which the necessary statistics can be acquired automatically.
In addition, rule-based applications are often difficult to implement and not as robust. Furthermore, deep learning models have become extremely popular in recent years, although they black-box how the machine understands words in relation to each other.
In this chapter we will cover these techniques in detail and provide a simple way to implement each of them. Here are the different ways POS tagging can be performed:
- Rule-based POS tagging: The rule-based POS tagging models apply a set of handwritten rules and use contextual information to assign POS tags to words. These rules are often known as context frame rules. One such rule might be: “If an ambiguous/unknown word ends with the suffix ‘ing’ and is preceded by a Verb, label it as a Verb”.
- Transformation Based Tagging (Rule based tagging): The transformation-based approaches use a pre-defined set of handcrafted rules as well as automatically induced rules that are generated during training.
- Stochastic (Probabilistic) tagging: A stochastic approach uses frequency, probability, or statistics. The simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text (a minimal sketch of this most-frequent-tag baseline appears right after this list). But this approach sometimes produces sequences of tags that are not acceptable according to the grammar rules of a language. A better approach is to calculate the probabilities of the various tag sequences that are possible for a sentence and assign the POS tags from the sequence with the highest probability. Hidden Markov Models (HMMs) are probabilistic models used to assign POS tags in this way.
- Deep learning models (Stochastic tagging): Various deep learning models have been used for POS tagging, such as Meta-BiLSTM, which has reported an impressive accuracy of around 97 percent.
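Here is that most-frequent-tag (unigram) baseline in code, trained on NLTK’s Penn Treebank sample. This is a minimal sketch; the training/test split size is an arbitrary choice for illustration.
import nltk
from nltk.corpus import treebank

# nltk.download('treebank')  # uncomment on first run
tagged_sents = treebank.tagged_sents()
train_sents, test_sents = tagged_sents[:3000], tagged_sents[3000:]

# For each word, remember the tag it most frequently received in the training data
unigram_tagger = nltk.UnigramTagger(train_sents)

print(unigram_tagger.tag("The cat is sitting on the mat".split()))
# Words never seen in training are tagged None; accuracy() reports held-out performance
# (older NLTK versions call this method evaluate())
print(unigram_tagger.accuracy(test_sents))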
Rule-based POS tagging
Rule-based Part of Speech (POS) tagging is an approach to assigning grammatical categories or parts of speech to words in a text based on predefined linguistic rules.
Unlike statistical or machine learning-based POS tagging, which relies on large labeled datasets and algorithms to make predictions, rule-based POS tagging uses explicit linguistic knowledge and patterns to determine the POS tags of words.
import nltk

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"

# Tokenize the sentence into words
words = nltk.word_tokenize(sentence)

# Define a set of custom rules for POS tagging (checked in order)
custom_rules = [
    (r'.*ing$', 'VBG'),     # Gerunds (e.g., running)
    (r'.*ed$', 'VBD'),      # Past tense verbs (e.g., jumped)
    (r'^[A-Z].*$', 'NNP'),  # Proper nouns (e.g., London)
    (r'.*', 'NN')           # Default: Nouns for all other words
]

# Create a RegexpTagger with the custom rules
regexp_tagger = nltk.RegexpTagger(custom_rules)

# Tag the words using the RegexpTagger
tags = regexp_tagger.tag(words)

# Display the tagged words
print(tags)
Here’s how rule-based POS tagging typically works:
- Lexical Rules: Rule-based POS taggers often start with a lexicon or dictionary that contains words and their associated POS tags. Lexical rules involve looking up each word in the text in the lexicon and assigning the corresponding POS tag. For example, if “run” is found in the lexicon, it might be tagged as a verb (VB).
- Contextual Rules: Beyond simple word lookups, rule-based taggers use contextual rules to handle cases where a word’s POS tag depends on its context. These rules take into account the surrounding words, grammatical structures, and syntactic patterns. For instance, if a word follows “the” and precedes a noun, it is likely an adjective.
- Regular Expressions: Rule-based POS taggers can also use regular expressions to identify specific word forms or patterns that indicate a particular part of speech. For example, a rule might identify words ending in “-ing” as gerunds or present participles.
- Morphological Rules: Morphological rules examine the word’s morphology (word structure) to determine its POS tag. For instance, words ending in “-ed” might be tagged as past tense verbs.
- Syntactic Parsing: Some rule-based systems incorporate simple syntactic parsing rules to analyze sentence structure and assign POS tags accordingly. For example, if a word is the subject of a sentence, it is likely a noun or pronoun.
- Disambiguation Rules: Rule-based taggers may include disambiguation rules to handle cases where a word can have multiple POS tags depending on its usage. These rules consider factors like nearby words or the role of a word in the sentence to make the correct determination.
- Fallback Rules: In cases where none of the above rules apply, rule-based taggers may use default rules to assign a general POS tag (e.g., “unknown” or “noun”). A minimal sketch combining lexical, regex, and fallback rules appears right after this list.
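Several of these rule types can be chained together with NLTK’s backoff mechanism: a hand-written lexicon handles the lexical rules, regular-expression rules cover morphology, and a catch-all default acts as the fallback. This is a minimal sketch; the lexicon entries are made up for illustration.
import nltk

# Lexical rules: a tiny hand-written lexicon mapping words to tags
lexicon = {"the": "DT", "cat": "NN", "is": "VBZ", "sitting": "VBG", "on": "IN", "mat": "NN"}

# Morphological / fallback rules expressed as regular expressions
fallback = nltk.RegexpTagger([
    (r'.*ing$', 'VBG'),  # present participles / gerunds
    (r'.*ed$', 'VBD'),   # past tense verbs
    (r'.*', 'NN'),       # default: noun
])

# Look each word up in the lexicon first; fall back to the regex rules otherwise
lexicon_tagger = nltk.UnigramTagger(model=lexicon, backoff=fallback)

print(lexicon_tagger.tag("the cat is chasing the ball".split()))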
Pros
- Transparency: The rules are explicit and can be examined and understood by linguists or developers.
- Control: Linguists or domain experts can fine-tune the rules to suit specific applications or languages.
- No Need for Large Corpora: Unlike statistical models, rule-based systems do not require extensive labeled training data.
Cons
- Handling Ambiguity: Rule-based systems may struggle with ambiguous words or constructions that require more context for accurate tagging.
- Limited Generalization: These systems might not generalize well to new or uncommon words or languages with complex grammatical structures.
- Manual Effort: Developing and maintaining rule-based taggers can be labor-intensive, as it requires the creation and upkeep of rules.
Transformation Based Tagging
Transformation-Based Tagging (TBT) is based on transformation rules that iteratively modify an initial set of POS tags assigned to words in a sentence to achieve a more accurate tagging. The primary goal of TBT is to iteratively improve the accuracy of POS tagging by applying a series of rule-based transformations.
- Initial Tagging: Start with an initial set of POS tags assigned to each word in the input sentence. This initial tagging can be obtained using a simple rule-based tagger, a lexicon-based approach, or any other method.
- Transformation Rules: Define a set of transformation rules. These rules specify how to change a tag based on the context of the word and its surrounding words. Transformation rules are typically created by analyzing a labeled dataset and identifying common tagging errors.
- Iteration: Apply the transformation rules iteratively to the entire sentence, adjusting the tags for words that meet the conditions specified in the rules. This process continues until no further improvements can be made or until a stopping criterion is met.
- Stopping Criterion: TBT can use different stopping criteria, such as a maximum number of iterations or a threshold for improvement. If the tagging accuracy reaches a satisfactory level, the process stops.
- Output: The final set of POS tags obtained after applying the transformation rules is considered the output of the TBT process.
# Sample sentence and initial POS tags (note "jumps" is deliberately mistagged as NN)
sentence = "The quick brown fox jumps over the lazy dog"
initial_tags = ["DT", "JJ", "JJ", "NN", "NN", "IN", "DT", "JJ", "NN"]

# Transformation rules (simplified): each rule is a (condition, new tag) pair
transformation_rules = [
    (lambda words, tags, i: words[i] == "jumps" and tags[i] == "NN", "VB"),  # Rule 1
    # Add more rules here
]

# Maximum number of iterations
max_iterations = 5

# Apply the transformation rules iteratively
words = sentence.split()
for iteration in range(max_iterations):
    new_tags = initial_tags.copy()
    for i in range(len(words)):
        for rule_condition, new_tag in transformation_rules:
            if rule_condition(words, new_tags, i):
                new_tags[i] = new_tag
    # Stop early once a full pass makes no further changes
    if new_tags == initial_tags:
        break
    initial_tags = new_tags

# Print the final POS tags
print("Final POS Tags:", initial_tags)
Steps for building a more complete transformation-based tagging tool
- Step 1: Data Preparation. Acquire a labeled dataset where each word in a sentence is tagged with its correct POS. You can handwrite this dataset or obtain it using any of the other methods mentioned in this list.
- Step 2: Feature Engineering. Define a set of features for each word in the sentence that capture contextual information. These features can include neighboring words, word shapes, prefixes, suffixes, etc.
- Step 3: Rule Generation. Create a set of transformation rules based on the labeled dataset. These rules can be generated using techniques like error analysis, which identifies common tagging errors and formulates rules to correct them. This becomes the initial rule set for all your future data, and you should measure its accuracy on your training data. (A sketch of automatic rule induction with NLTK’s Brill tagger follows this list.)
- Step 4: Iterative Learning. Initialize a set of POS tags for each word in the sentence, then iterate through the transformation rules and apply them based on the features of the words. Update the tags after each applied rule and repeat the process for a set number of iterations or until convergence.
- Step 5: Evaluation and Validation. Evaluate the TBT system on a separate validation dataset to assess its tagging accuracy. Adjust the rules or feature set based on the validation results to improve performance.
- Step 6: Testing. Test the TBT system on a separate test dataset to evaluate its real-world performance.
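Steps 3 and 4 are precisely what NLTK’s Brill tagger automates: it starts from a baseline tagging and induces the transformation rules that correct the most errors on the training data. Below is a minimal sketch, assuming the Penn Treebank sample corpus is available; the template set, training slice, and rule count are illustrative choices, not requirements.
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BrillTaggerTrainer
from nltk.tag.brill import Template, Pos, Word

# nltk.download('treebank')  # uncomment on first run
train_sents = treebank.tagged_sents()[:3000]

# Initial tagging: a simple most-frequent-tag baseline
baseline = UnigramTagger(train_sents)

# Feature templates: the contexts a transformation rule is allowed to look at
templates = [
    Template(Pos([-1])),            # tag of the previous word
    Template(Pos([1])),             # tag of the next word
    Template(Pos([-1]), Pos([1])),  # tags on both sides
    Template(Word([-1])),           # the previous word itself
]

# Induce transformation rules that correct the baseline's most frequent errors
trainer = BrillTaggerTrainer(baseline, templates, trace=0)
brill_tagger = trainer.train(train_sents, max_rules=10)

# Inspect the learned rules and tag a new sentence
for rule in brill_tagger.rules()[:5]:
    print(rule)
print(brill_tagger.tag("The quick brown fox jumps over the lazy dog".split()))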
Pros
- Rule-Based Improvement: It allows for fine-tuning and improving an initial tagging based on linguistically informed rules, which can be very effective for certain languages and domains.
- Iterative Refinement: It iteratively refines the tagging, potentially capturing complex dependencies and linguistic nuances.
Cons
- Rule Development: Creating a set of transformation rules can be labor-intensive and requires linguistic expertise.
- Data Dependency: TBT may require a large annotated dataset to derive effective transformation rules.
- Risk of Overfitting: If the rules are too specific and tailored to a particular dataset, they may not generalize well to other data.
NOTE: Transformation tagging can be used in conjunction with other rule-based techniques as well to improve the accuracy of tagging.
POS tagging with Hidden Markov Model
Hidden Markov Models (HMMs) capture lexical and contextual information for POS tagging. They are used to represent and model the relationships between words (observations) and their corresponding POS tags (hidden states).
One of the best-known ways to understand and implement POS tagging is with Hidden Markov Models. But what exactly are they, and how do they help with Part of Speech tagging?
Hidden Markov Models
Introduction
A Hidden Markov Model (HMM) is a probabilistic graphical model used for modeling systems that exhibit sequential or temporal behavior, where understanding the underlying states and transitions is essential.
Looking at the image above, we can see that we have several nodes:
- <0 degrees centigrade
- 0–20 degrees centigrade
- >20 degrees centigrade
Each node has arrows going out of it, and the weights of all the arrows leaving a node sum to 1. For example:
- For the node <0 degrees centigrade, the sum is 0.6 + 0.2 + 0.2 = 1
- For the node 0–20 degrees centigrade, the sum is 0.3 + 0.4 + 0.3 = 1
- For the node >20 degrees centigrade, the sum is 0.7 + 0.1 + 0.2 = 1
Each node thus carries a probability distribution over its outgoing edges, and each edge weight is the probability of transitioning from one node to another. Collected together (as sketched in code below), these weights form the transition matrix of the model.
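This is a minimal sketch of that transition matrix; the row order follows the node list above, and the exact column assignment of each listed weight is an assumption on my part, since the figure is not reproduced here. What matters is that every row sums to 1.
import numpy as np

# Rows and columns ordered as: [<0 °C, 0-20 °C, >20 °C]
# (the column assignment of each listed weight is assumed)
weather_transitions = np.array([
    [0.6, 0.2, 0.2],  # from <0 °C
    [0.3, 0.4, 0.3],  # from 0-20 °C
    [0.1, 0.2, 0.7],  # from >20 °C
])

print(weather_transitions.sum(axis=1))  # every row sums to 1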
Here are the fundamental components and concepts associated with Hidden Markov Models:
- States: An HMM models a system as a sequence of hidden states. These states represent underlying, unobservable conditions or situations. For instance, in speech recognition, the states might represent phonemes or words.
- Observations: At each time step, the HMM emits an observable symbol or observation based on the current hidden state. These observations are what we can measure or see. In speech recognition, observations can be acoustic features like spectral information.
- State Transition Probabilities: HMMs incorporate transition probabilities, which specify the likelihood of moving from one hidden state to another at each time step. These probabilities are typically organized as a transition matrix.
- Emission Probabilities: For each hidden state, there are associated emission probabilities. These probabilities determine the likelihood of emitting a particular observation when the system is in that state. Emission probabilities are typically represented as emission matrices or vectors.
- Initial State Probabilities: HMMs also have initial state probabilities, which specify the likelihood of starting in each hidden state at the beginning of the sequence. Together, these components define the joint probability of a state sequence and an observation sequence, written out below.
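Putting these components together: if A holds the transition probabilities, B the emission probabilities, and π the initial state probabilities, then the probability the model assigns to a hidden state sequence s1 … sT together with an observation sequence o1 … oT is
P(s1, …, sT, o1, …, oT) = π(s1) · B(s1, o1) · A(s1, s2) · B(s2, o2) · … · A(sT−1, sT) · B(sT, oT)
For POS tagging, decoding a sentence means finding the tag sequence s1 … sT that maximizes this quantity for the observed words, which is exactly what the Viterbi algorithm computes.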
State vs Observation
There is a subtle but important difference between observation and state. When I first started, I was really confused on how to program this.
Words are generally considered observations and the POS tags are the states.
import numpy as np
from hmmlearn import hmm

# Define POS states (hidden tags)
states = ["DT", "JJ", "NN", "VB", "IN"]

# Define observations (words)
observations = ["The", "quick", "brown", "fox", "jumps"]

# Create mappings from states and observations to numeric indices
state_map = {state: i for i, state in enumerate(states)}
obs_map = {word: i for i, word in enumerate(observations)}

# Define transition probabilities (example values; each row sums to 1)
# In practice, these would be estimated from training data.
transition_matrix = np.array([
    [0.3, 0.2, 0.2, 0.1, 0.2],
    [0.2, 0.3, 0.2, 0.1, 0.2],
    [0.1, 0.2, 0.4, 0.1, 0.2],
    [0.1, 0.2, 0.2, 0.3, 0.2],
    [0.2, 0.2, 0.2, 0.1, 0.3]
])

# Define emission probabilities (example values; each row sums to 1)
# In practice, these would also be estimated from training data.
emission_matrix = np.array([
    [0.6, 0.1, 0.1, 0.1, 0.1],  # DT mostly emits "The"
    [0.1, 0.4, 0.3, 0.1, 0.1],  # JJ mostly emits "quick"/"brown"
    [0.1, 0.2, 0.2, 0.4, 0.1],  # NN mostly emits "fox"
    [0.1, 0.1, 0.1, 0.1, 0.6],  # VB mostly emits "jumps"
    [0.2, 0.2, 0.2, 0.2, 0.2]   # IN is uniform in this toy example
])

# Initialize the HMM model.
# CategoricalHMM models discrete symbol emissions (older hmmlearn releases
# called this model MultinomialHMM).
model = hmm.CategoricalHMM(n_components=len(states))
model.startprob_ = np.array([0.2, 0.2, 0.2, 0.2, 0.2])  # Initial state probabilities
model.transmat_ = transition_matrix
model.emissionprob_ = emission_matrix

# Encode observations (words) as numeric indices
observations_encoded = np.array([obs_map[word] for word in observations]).reshape(-1, 1)

# Use the Viterbi algorithm to find the most likely state sequence
predicted_states = model.predict(observations_encoded)

# Map the predicted state indices back to POS tags
predicted_tags = [states[state_idx] for state_idx in predicted_states]

# Print the results
for word, tag in zip(observations, predicted_tags):
    print(f"Word: {word}, Predicted POS Tag: {tag}")
The central idea behind HMMs is to model a sequence of observations as a sequence of hidden states, where the transitions between states and the emissions of observations are governed by probabilistic models. Given a sequence of observations, HMMs can be used for various tasks, including:
- State Estimation: Given a sequence of observations, HMMs can be used to estimate the most likely sequence of hidden states that generated those observations. This is achieved using algorithms like the Viterbi algorithm.
- Learning: HMMs can be trained using techniques like the Expectation-Maximization (EM) algorithm to learn the model parameters (transition probabilities, emission probabilities, initial state probabilities) when tag labels are not available; with labeled data, the parameters can be estimated directly by counting (a minimal EM sketch follows this list).
- Prediction: HMMs can be used to predict future observations or hidden states in a sequence.
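To make the learning task concrete, here is a minimal sketch of EM (Baum-Welch) training with hmmlearn on made-up symbol sequences. It assumes a recent hmmlearn release, where the discrete-emission model is called CategoricalHMM; the number of hidden states, iteration count, and toy data are arbitrary choices.
import numpy as np
from hmmlearn import hmm

# Two toy sequences of discrete observation symbols (e.g. word indices),
# concatenated into one array, plus the lengths of the individual sequences
X = np.array([[0], [1], [2], [3], [4], [0], [2], [3]])
lengths = [5, 3]

# Learn start, transition, and emission probabilities with EM (Baum-Welch)
model = hmm.CategoricalHMM(n_components=3, n_iter=50, random_state=0)
model.fit(X, lengths)

print(np.round(model.transmat_, 2))  # learned transition matrix
print(model.predict(X[:5]))          # Viterbi decoding of the first sequence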
Applications of Hidden Markov Models include:
- Speech Recognition: Modeling phonemes, words, or language states.
- Part-of-Speech Tagging: Assigning parts of speech to words in a sentence.
- Bioinformatics: Modeling DNA sequences, protein structures, and gene prediction.
- Natural Language Processing: Named Entity Recognition, sentiment analysis, and machine translation.
- Gesture Recognition: Recognizing and interpreting gestures in computer vision systems.
- Financial Modeling: Analyzing stock price movements and modeling economic states.
Hidden Markov Models are powerful tools for modeling sequential data with probabilistic dependencies. They have been foundational in many fields for solving problems related to pattern recognition, classification, and prediction. Here is an implementation from scratch:
import numpy as np

# Define states (hidden variables)
states = ["Sunny", "Rainy"]

# Define the observation vocabulary (visible variables)
observation_vocab = ["Walk", "Shop"]

# Define initial state probabilities
initial_probabilities = np.array([0.6, 0.4])

# Define state transition probabilities
transition_matrix = np.array([[0.7, 0.3], [0.4, 0.6]])

# Define emission probabilities (likelihood of observations given states)
emission_matrix = np.array([[0.4, 0.6], [0.8, 0.2]])

# Predict the most likely sequence of states using the Viterbi algorithm
def predict_states(observed_sequence, observation_vocab, initial_probabilities,
                   transition_matrix, emission_matrix):
    num_states = transition_matrix.shape[0]
    num_observations = len(observed_sequence)
    # Encode the observed symbols as indices into the observation vocabulary
    obs_indices = [observation_vocab.index(obs) for obs in observed_sequence]

    viterbi = np.zeros((num_states, num_observations))
    backpointer = np.zeros((num_states, num_observations), dtype=int)

    # Initialization step
    for s in range(num_states):
        viterbi[s, 0] = initial_probabilities[s] * emission_matrix[s, obs_indices[0]]
        backpointer[s, 0] = 0

    # Recursion step
    for t in range(1, num_observations):
        for s in range(num_states):
            prob = [viterbi[sp, t - 1] * transition_matrix[sp, s] * emission_matrix[s, obs_indices[t]]
                    for sp in range(num_states)]
            viterbi[s, t] = max(prob)
            backpointer[s, t] = np.argmax(prob)

    # Termination step: pick the best final state
    best_last_state = np.argmax(viterbi[:, -1])

    # Backtrack to find the most likely sequence of states
    state_sequence = [best_last_state]
    for t in range(num_observations - 1, 0, -1):
        best_last_state = backpointer[best_last_state, t]
        state_sequence.append(best_last_state)
    state_sequence.reverse()
    return state_sequence

# Example observations
observed_sequence = ["Walk", "Shop", "Walk", "Walk", "Shop"]

# Predict the most likely sequence of states
predicted_sequence = predict_states(observed_sequence, observation_vocab,
                                    initial_probabilities, transition_matrix, emission_matrix)

# Map state indices back to state labels
predicted_states = [states[i] for i in predicted_sequence]

# Print the results
print("Observed Sequence:", observed_sequence)
print("Predicted States:", predicted_states)
https://www.mygreatlearning.com/blog/pos-tagging/ provides a great step-by-step walkthrough of how an HMM works for POS tagging! It explains how the tagger first collects labelled data and then estimates, from the counts, the probability of each tag emitting a given word and of one tag following another.
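To make that concrete, here is a minimal sketch of the counting step using NLTK’s Penn Treebank sample: conditional frequency distributions give relative-frequency estimates of the emission probability P(word | tag) and the transition probability P(next tag | previous tag). The corpus slice and the example lookups are arbitrary illustrations.
import nltk
from nltk.corpus import treebank

# nltk.download('treebank')  # uncomment on first run
tagged_sents = treebank.tagged_sents()[:3000]

# Emission counts: how often each tag emits each word
emission_cfd = nltk.ConditionalFreqDist(
    (tag, word.lower()) for sent in tagged_sents for word, tag in sent
)

# Transition counts: how often each tag follows the previous tag
transition_cfd = nltk.ConditionalFreqDist(
    (t1, t2) for sent in tagged_sents for (_, t1), (_, t2) in nltk.bigrams(sent)
)

# Relative-frequency estimates, e.g. P("the" | DT) and P(NN | DT)
print("P(the | DT) =", round(emission_cfd["DT"].freq("the"), 3))
print("P(NN  | DT) =", round(transition_cfd["DT"].freq("NN"), 3))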
Deep learning models for POS Tagging
This approach has gained significant attention and popularity in natural language processing (NLP) due to its ability to automatically learn complex patterns and representations from large datasets.
Advantages of Deep Learning Tagging:
- End-to-End Learning: Deep learning models can learn both feature representations and tag prediction jointly from the data, eliminating the need for extensive feature engineering.
- High Accuracy: Deep learning models, especially transformer-based models, have achieved state-of-the-art results on many NLP tasks, including POS tagging.
- Generalization: These models tend to generalize well to various languages and domains, as they can capture complex linguistic patterns.
Challenges and Considerations:
- Data Requirements: Deep learning models typically require large amounts of labeled data to perform well. For languages with limited resources, this can be a challenge. Furthermore, the dataset needs to be extensive enough for a neural network with a large number of parameters to capture the nuances of a language.
- Computational Resources: Training and fine-tuning deep learning models can be computationally expensive, requiring powerful GPUs or TPUs.
- Interpretability: Deep learning models are often considered “black boxes,” making it challenging to interpret their decisions compared to rule-based or traditional statistical models.
Let’s look at different types of deep learning architectures that can be used:
- Recurrent Neural Networks (RNNs): RNNs, including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) variants, have been used for POS tagging. RNNs process sequences of words, allowing them to capture contextual information effectively.
- Bidirectional RNNs (BiRNNs): To capture information from both past and future words in a sequence, bidirectional RNNs are employed. They process the input sequence in two directions (forward and backward) and concatenate the hidden states, providing a more comprehensive context.
- Convolutional Neural Networks (CNNs): CNNs are often used for POS tagging when considering local context. They apply convolutional operations over word embeddings to capture patterns within a window of neighboring words.
- Transformers: Transformer models, such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), have achieved state-of-the-art results in POS tagging. These models use self-attention mechanisms to capture both local and global context, making them highly effective.
- CRF on Top of Deep Learning Models: Conditional Random Fields (CRFs) are probabilistic models that capture dependencies between neighboring POS tags. CRFs can be combined with deep learning models, where the deep learning model predicts POS tags, and the CRF refines the predictions by considering transitions between tags.
- LSTM-CRF and GRU-CRF Models: These models combine the strength of LSTM or GRU networks for capturing contextual information with a CRF layer for modeling transitions between POS tags.
- BERT-Based POS Tagging: Fine-tuning pre-trained BERT models for POS tagging has become a popular approach due to BERT’s ability to capture rich contextual information. Researchers and practitioners often adapt BERT for POS tagging by adding a classification layer to predict POS tags (a minimal inference sketch with an already fine-tuned checkpoint follows this list).
- ELMo (Embeddings from Language Models): ELMo generates contextualized word embeddings by considering the entire sentence context. These embeddings can be used as input features for various downstream tasks, including POS tagging.
- ULMFiT (Universal Language Model Fine-tuning): ULMFiT is a transfer learning approach for NLP tasks. It involves pre-training a language model on a large corpus and then fine-tuning it for specific tasks, such as POS tagging.
- Attention-Based Models: Models that use attention mechanisms, similar to those found in transformers, can be designed for POS tagging tasks to weigh the importance of different words in a sentence.
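For the BERT-based approach, inference with an already fine-tuned checkpoint can be as short as a few lines with the Hugging Face transformers library. This is a minimal sketch; the checkpoint name below is an assumption on my part (one publicly shared POS-finetuned model), so substitute any token-classification model fine-tuned for POS tagging.
from transformers import pipeline

# Assumed example checkpoint: any POS-finetuned token-classification model works here
pos_tagger = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")

for item in pos_tagger("The quick brown fox jumps over the lazy dog"):
    print(item["word"], item["entity"])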
Each of these common architectures will be discussed in depth in other articles. But for now, let’s try to understand how exactly different architectures work for POS tagging.
Here is an example in TensorFlow using a bidirectional LSTM tagger. (Keras itself does not ship a CRF layer; a CRF head, for example from the tensorflow_addons package, can be stacked on top of the same architecture to get a BiLSTM-CRF, but the sketch below sticks to a per-token softmax output.)
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, TimeDistributed
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample dataset (replace with your own dataset).
# Each sentence is represented as a list of (word, POS tag) pairs.
data = [
    [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
     ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"), (".", ".")],
    [("She", "PRP"), ("sells", "VBZ"), ("seashells", "NNS"), ("by", "IN"),
     ("the", "DT"), ("seashore", "NN"), (".", ".")]
]

# Create word and tag vocabularies
words = {word.lower() for sentence in data for word, _ in sentence}
tags = {tag for sentence in data for _, tag in sentence}

# Create dictionaries for mapping between words/tags and indices (index 0 is reserved for padding)
word2idx = {word: idx + 1 for idx, word in enumerate(sorted(words))}
tag2idx = {tag: idx for idx, tag in enumerate(sorted(tags))}
idx2tag = {idx: tag for tag, idx in tag2idx.items()}

# Convert sentences to sequences of word and tag indices
X = [[word2idx[word.lower()] for word, _ in sentence] for sentence in data]
Y = [[tag2idx[tag] for _, tag in sentence] for sentence in data]

# Pad sequences to the same length
X = pad_sequences(X, padding='post')
Y = pad_sequences(Y, padding='post', value=tag2idx["."])

# Split data into training and testing sets (with only two toy sentences this
# leaves one sentence on each side; use a real corpus in practice)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=42)

# Define the BiLSTM tagger with a per-token softmax output
# (a CRF layer, e.g. from tensorflow_addons, could replace the softmax head)
model = Sequential([
    Embedding(input_dim=len(word2idx) + 1, output_dim=50, mask_zero=True),
    Bidirectional(LSTM(units=50, return_sequences=True)),
    TimeDistributed(Dense(len(tag2idx), activation='softmax'))
])

# Compile the model with a per-token sparse cross-entropy loss
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model (the toy dataset is too small for a validation split)
model.fit(X_train, Y_train, batch_size=32, epochs=10)

# Predict tag distributions on the test data and take the most likely tag per token
y_pred = np.argmax(model.predict(X_test, verbose=1), axis=-1)

# Convert predicted and gold indices back to tags (flattened over all tokens)
y_pred_tags = [idx2tag[idx] for row in y_pred for idx in row]
Y_test_tags = [idx2tag[idx] for row in Y_test for idx in row]

# Print classification report
print(classification_report(Y_test_tags, y_pred_tags, zero_division=0))
Conclusion
I hope this article explains how POS tagging works and the different ways it can be performed. In the next article, I will cover some common ML architectures, and we will delve further into the Viterbi algorithm to understand how it works from scratch.
There have been several developments in recent years, especially in speech recognition and natural language processing, that black-box POS tagging, so explicit tagging is not necessary for most practical implementations. However, learning the basics will give you more insight and help you train your models better in the future.
Appendix and Further Readings
- https://towardsdatascience.com/markov-and-hidden-markov-model-3eec42298d75
- https://www.sciencedirect.com/topics/medicine-and-dentistry/hidden-markov-model
- https://www.mygreatlearning.com/blog/pos-tagging/
- https://medium.com/data-science-in-your-pocket/pos-tagging-using-hidden-markov-models-hmm-viterbi-algorithm-in-nlp-mathematics-explained-d43ca89347c4