What are N-Grams?


N-grams are used extensively in text mining and natural language processing tasks. An n-gram is a sequence of N co-occurring words within a given window. When computing n-grams, you typically move the window one word forward at a time (although you can move X words forward in more advanced scenarios).

For example, take the sentence “The cow jumps over the moon”. If N=2 (known as bigrams), then the n-grams would be:

  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon

So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc, essentially moving one word forward to generate the next bigram.

If N=3, the n-grams would be:

  • the cow jumps
  • cow jumps over
  • jumps over the
  • over the moon

So you have 4 n-grams in this case. When N=1, the n-grams are called unigrams, which are essentially the individual words in a sentence. When N=2 they are called bigrams, and when N=3 they are called trigrams. When N>3, they are usually referred to as four-grams, five-grams and so on.
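
The sliding window described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `sliding_ngrams` and the `step` parameter (how many words the window moves forward each time) are names assumed here, and tokens are obtained simply by lowercasing and splitting on whitespace.

```python
def sliding_ngrams(sentence, n, step=1):
    """Return word n-grams, moving the window `step` words at a time."""
    # Assumption: lowercase and split on whitespace to get tokens
    tokens = sentence.lower().split()
    # Each window starts at index i and covers tokens[i:i + n]
    return [" ".join(tokens[i:i + n])
            for i in range(0, len(tokens) - n + 1, step)]

sentence = "The cow jumps over the moon"
bigrams = sliding_ngrams(sentence, 2)           # window moves one word at a time
trigrams = sliding_ngrams(sentence, 3)
skipping = sliding_ngrams(sentence, 2, step=2)  # window moves two words at a time
print(bigrams)
print(trigrams)
print(skipping)
```

With `step=1` this reproduces the bigram and trigram lists above; a larger `step` corresponds to the "move X words forward" variant.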

How many N-grams in a sentence?

If X is the number of words in a given sentence K, the number of n-grams for sentence K would be:

Ngrams_K = X - (N - 1)
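
As a quick sanity check on the count, assuming the window moves one word at a time, a sentence of X words yields X - (N - 1) n-grams (and none when the sentence is shorter than N). The helper name `count_ngrams` below is hypothetical; it only restates that arithmetic for the example sentence used earlier.

```python
def count_ngrams(num_words, n):
    """Number of n-grams in a sentence of num_words words (window step of 1)."""
    # max(..., 0) covers sentences shorter than n, which have no n-grams
    return max(num_words - (n - 1), 0)

num_words = len("The cow jumps over the moon".split())  # X = 6
for n in (1, 2, 3):
    print(n, count_ngrams(num_words, n))
```

For N=2 and N=3 this gives 5 and 4, matching the bigram and trigram lists above.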

What are N-grams used for?

N-grams are used for a variety of different tasks. For example, when developing a language model, n-grams are used to develop not just unigram models but also bigram and trigram models. Google and Microsoft have developed web-scale n-gram models that can be used in a variety of tasks such as spelling correction, word breaking and text summarization. Here is a publicly available web-scale n-gram model by Microsoft: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx. Here is a paper that uses Web N-gram models for text summarization: Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions
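
To make the language-model use case concrete, here is a minimal sketch of a bigram model estimated by maximum likelihood, i.e. P(w2 | w1) = count(w1 w2) / count(w1 as the first word of a bigram). The toy corpus and the function name `bigram_probability` are made up for illustration, and there is no smoothing, so unseen bigrams get probability 0. Web-scale models such as the ones mentioned above are far more elaborate.

```python
from collections import Counter

# Toy corpus, invented for illustration only
corpus = [
    "the cow jumps over the moon",
    "the cow eats grass",
]

bigram_counts = Counter()
context_counts = Counter()  # how often a word appears as the first word of a bigram
for sentence in corpus:
    tokens = sentence.split()
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1

def bigram_probability(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1), with no smoothing."""
    if context_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / context_counts[w1]

print(bigram_probability("the", "cow"))
print(bigram_probability("cow", "jumps"))
```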

Another use of n-grams is for developing features for supervised Machine Learning models such as SVMs, MaxEnt models, Naive Bayes, etc. The idea is to use tokens such as bigrams in the feature space instead of just unigrams. But be warned: in my personal experience, and in various research papers that I have reviewed, the use of bigrams and trigrams in your feature space may not necessarily yield any significant improvement. The only way to know is to try it!
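
As a rough sketch of what bigrams in the feature space can look like, the snippet below builds a bag-of-n-grams count for a single document using unigrams and bigrams together. The function name `ngram_features` and the example text are assumptions for illustration; in practice a library vectorizer (for example scikit-learn's `CountVectorizer` with its `ngram_range` parameter) would typically do this for you.

```python
from collections import Counter

def ngram_features(text, max_n=2):
    """Count all word n-grams of size 1..max_n in text (bag-of-n-grams)."""
    tokens = text.lower().split()
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features

features = ngram_features("not good not bad")
print(features)
```

Here the bigrams `not good` and `not bad` keep information that the unigrams alone lose, which is the usual motivation for trying them. Whether that helps on a given task is still an empirical question.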

Java for N-gram Generation

This code block generates n-grams at a sentence level. The input consists of N (the size of the n-gram), sent (the sentence) and ngramList (a place to store the generated n-grams).

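// requires: import java.util.List;
private static void generateNgrams(int N, String sent, List<String> ngramList) {
  String[] tokens = sent.trim().split("\\s+"); //split sentence into tokens
  for(int k=0; k<(tokens.length-N+1); k++){
    String s="";
    int start=k;
    int end=k+N;
    for(int j=start; j<end; j++){
      s=s+" "+tokens[j]; //append the next token to the current n-gram
    }
    //Add n-gram to a list (trim removes the leading space)
    ngramList.add(s.trim());
  }
}//End of method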

Python code for N-gram Generation

Similar to the example above, the code below generates n-grams in Python.

import re

def generate_ngrams(text, n):
    # split sentences into tokens
    tokens = re.split(r"\s+", text.strip())
    ngrams = []

    # collect the n-grams
    for i in range(len(tokens) - n + 1):
        temp = [tokens[j] for j in range(i, i + n)]
        ngrams.append(" ".join(temp))

    return ngrams

Example Output

Here is an example of n-grams generated using the Python code above, run from a Jupyter notebook.

The start and end tokens are added to maximize the use of the n-grams. Some phrases tend to occur only at the end and some tend to occur at the very beginning. The _start_ and _end_ tokens help capture this pattern.
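
A minimal sketch of the padding idea, assuming the boundary markers are the literal strings `_start_` and `_end_` added around the sentence before the n-grams are generated. The function name `padded_ngrams` and the sample sentence are hypothetical.

```python
def padded_ngrams(sentence, n):
    """Generate n-grams after wrapping the sentence in _start_/_end_ markers."""
    tokens = ["_start_"] + sentence.split() + ["_end_"]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = padded_ngrams("this is a good blog site", 2)
print(bigrams[0])   # first bigram includes the start marker
print(bigrams[-1])  # last bigram includes the end marker
```

The first and last n-grams now record which words open and close the sentence, which is the pattern the markers are meant to capture.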
