An overview of Text Data Analysis

Published by Ngoc Tran on

Text data analysis

Have you ever wondered how Twitter manages to analyze and update their trend topics so quickly? Or how does the chatbot recognize when your query is out of his expertise and require human assistant? Is there a better way to present your qualitative data? The answer is Text Data Analysis. This article will give you a short yet clear overview of what the term means, and how it helps to spot patterns and insights from text data.

What is Text Data Analysis?

  • Text data includes all data in the form of different natural language texts, texts on all web pages and social media.
  • Text Data Analysis (or Text Mining) is a process of extracting and analyzing digital unstructured text data in order to obtain useful actionable insights and knowledge. A simple example is to identify which key words are positive/ negative in customer reviews, in order to fast-reply and thus improve customer service.

How does Text Data Analysis work?

Generally, text data is collected, usually stored in a Text Information System. Then it is analyzed by machine learning and visualized a meaningful story as a result. As you see, it looks similar to a process of data analysis; however, since text is a different form of data, it indeed requires specific techniques.

#1: Text classification

Before starting, you want to decide the categories of the data needed to analyze. One way to do this is to provide the machine (software) some examples from each desired category and supervise how the machine learns about it by comparing predicted labels against actual labels. This step does need human involvement as most machine learning tools are not fully advanced, or in other words, being able to make interpretation like human. Some common text classification tasks are:

  • Sentiment Analysis – automatically reads and classifies opinion polarity (positive/negative/neutral).
  • Topic analysis – automatically assigns topics to the text (e.g. business tools, business projects, training, etc.)
  • Language identification – automatically defines the languages existing in the text. Common categories can be English, Spanish, Chinese, etc.
  • Intent detection – automatically find out the reason behind customer feedback. A good example is chatbot.

#2: Text similarity

After classification, your machine decides which pieces of language (key words) are the most similar or repeated. You don’t have to involve in supervising the machine at this step, because you don’t want to influence him on what should be included in the categories. You only want to teach him the differences between these categories. A search engine (e.g. Google) is based on this kind of model where finds a set of web pages whose content is most similar to your query. There are two common techniques such as:

  • Keyword extraction – automatically identifies the terms that best describe the content of the text. An example is Word clouds where shows a visualization of the most frequent words used in the word clusters.
  • Named entity recognition (NER) – automatically finds entities (E.g. people, companies, universities, locations, etc.) that exist within the text.

#3: Text visualization

When several pieces of language settle down in their own home, we need to figure out when a pattern is meaningful or not, by using statistical testing (e.g. word frequency TF-IDF, etc.), Business Intelligence (BI) and data visualization tools (E.g. Excel, Google Data Studio, Tableau, etc.). Trending tab on Twitter, and Keyword cloud are typical examples of text visualization.

Fun time: Go to and try to make your own word cloud!

Other detailed sources over this topic:


Thank you for choosing to read this article and I hope we have learnt something new today! Any question feel free to comment.


Leave a Reply

Avatar placeholder

Your email address will not be published. Required fields are marked *