N-gram Examples and Implementations

Example 1: Generating N-grams in Python

Let’s generate N-grams using Python’s NLTK library.

Code Implementation:

import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # fetch tokenizer models on first run (needed once per environment)

text = "I love natural language processing"
tokens = word_tokenize(text.lower())

# Generate Bigrams
bigrams = list(ngrams(tokens, 2))
print(bigrams)

Output:
[('i', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]
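
The same ngrams call works for any n; passing 3 instead of 2 yields trigrams from the same tokens:

# Generate trigrams from the tokens above
trigrams = list(ngrams(tokens, 3))
print(trigrams)

Output:
[('i', 'love', 'natural'), ('love', 'natural', 'language'), ('natural', 'language', 'processing')]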


Example 2: N-gram Frequency Analysis

N-grams are often used to find the most frequent word pairs in a corpus.

Code Implementation:

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

text = "I love NLP. NLP is fun. NLP helps in text analysis."
tokens = word_tokenize(text.lower())  # note: punctuation marks become tokens too

# Generate bigrams
bigrams = list(ngrams(tokens, 2))

# Count frequency
bigram_freq = Counter(bigrams)
print(bigram_freq.most_common(2))  # Top 2 bigrams

Output:
[(('.', 'nlp'), 2), (('i', 'love'), 1)]

The ('.', 'nlp') pair tops the list because the sentence-ending period is itself a token; filter punctuation out of tokens if such pairs are unwanted.


Example 3: N-gram Language Modeling

N-grams can be used to predict the next word in a sequence.

Code Implementation:

from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Training data
text_data = [['i', 'love', 'nlp'], ['nlp', 'is', 'amazing']]
n = 2  # Bigrams

# Prepare data
train_data, vocab = padded_everygram_pipeline(n, text_data)

# Train the model
model = MLE(n)
model.fit(train_data, vocab)

# Predict probability of next word
# Probability of 'nlp' given the context 'love'
# (a bigram model conditions on only the single previous word)
print(model.score("nlp", ["love"]))  # 1.0 on this tiny corpus
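
To produce a word rather than just score one, the trained model can also sample a continuation. A minimal sketch using NLTK's generate method, reusing the model above (with this tiny corpus, 'love' is always followed by 'nlp'):

# Sample one word to follow the context 'love'
print(model.generate(1, text_seed=['love'], random_seed=42))  # 'nlp'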

Example 4: N-gram for Text Prediction (Autocomplete)

N-grams help predict the next word in applications such as search-engine autocomplete; a runnable sketch follows the example below.

Example:

  • Input: "machine"
  • Prediction using bigrams: "learning", "translation", "vision"
  • Prediction using trigrams: "learning algorithms", "translation techniques", "vision models"
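
A bigram autocomplete can be sketched with nothing more than follower counts. The corpus below is a made-up toy standing in for real query logs:

Code Implementation:

from collections import Counter, defaultdict

# Toy corpus (illustrative only; a real system would use query logs)
corpus = [
    "machine learning is fun",
    "machine translation is hard",
    "machine learning algorithms",
    "machine vision models",
]

# For each word, count the words that follow it
followers = defaultdict(Counter)
for sentence in corpus:
    words = sentence.lower().split()
    for prev_word, next_word in zip(words, words[1:]):
        followers[prev_word][next_word] += 1

# Suggest the most frequent continuations of "machine"
print([word for word, _ in followers["machine"].most_common(3)])

Output:
['learning', 'translation', 'vision']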

Example 5: N-gram for Sentiment Analysis

Sentiment classification can be improved using N-grams, as they capture word context.

Code Implementation:

from sklearn.feature_extraction.text import CountVectorizer

text_data = ["I love NLP", "NLP is difficult", "Machine learning is fun"]

# Create bigram features (the default tokenizer drops single-character
# tokens such as "i", so "i love" never appears as a feature)
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(text_data)

print(vectorizer.get_feature_names_out())

Output:
['is difficult' 'is fun' 'learning is' 'love nlp' 'machine learning' 'nlp is']

Note that get_feature_names_out() returns the features in sorted order.
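
To turn these bigram features into an actual sentiment classifier, the matrix X can be fed to any scikit-learn estimator. A minimal sketch with made-up labels (1 = positive, 0 = negative; purely illustrative, and real use needs far more data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

text_data = ["I love NLP", "NLP is difficult", "Machine learning is fun"]
labels = [1, 0, 1]  # made-up sentiment labels for illustration

# Combining unigrams and bigrams usually beats bigrams alone
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(text_data)

clf = LogisticRegression()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["NLP is fun"])))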