Automated Question-Answering System (Preliminary)


Overview:
1. Objective
2. Data Overview
3. Preprocessing
4. Implementation
5. Results

1. Objective

The primary objective of an automated QA system is to extract to-the-point answers from large paragraphs or texts, reducing the time a reader has to invest and improving resource utilization.

2. Data Overview

A grade-5 reading-comprehension passage consisting of several paragraphs was selected for testing the system.
A set of questions relevant to the passage was prepared.

3. Preprocessing

It is a well-accepted rule of thumb that around 80% of any data-related project is spent on the preprocessing phase. This is because real-world data is often a good fit for the human mind but of little use to machines, which understand only numerical input and produce undesirable results when the data is riddled with noise such as meaningless text and missing datapoints.

Garbage input cannot miraculously produce accurate results. The text therefore needs to be thoroughly cleaned before it is fed to the model, and subsequently represented in numeric form.

3.1 Text Cleaning

Step 1: Divide the paragraphs into sentences.
Step 2: Remove stop-words.
Step 3: Regex-based cleaning:
a. Replace dates with ‘date’
b. Replace remaining numbers with ‘number’
c. Replace mail-IDs with ‘mail’
d. Remove extra spaces and punctuation
Step 4: Correct misspelled words using a spell-checking Python library.
Step 5: Stem or lemmatize to collapse the different surface forms of a word into a single root. (For instance: ‘beautiful’, ‘beauty’, ‘beautifully’)
This makes the sentences robust and fit for the next step in processing; a minimal sketch of the cleaning pipeline is given below.
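
A minimal sketch of Steps 1–3 using Python’s re module. The stop-word list here is a tiny illustrative stand-in (a real pipeline would use NLTK’s or spaCy’s full list), and Steps 4–5 are marked in comments since they rely on external libraries such as pyspellchecker and nltk.stem.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use NLTK/spaCy's.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "to", "in"}

def clean_sentence(sentence: str) -> str:
    s = sentence.lower()
    # Step 3a: replace dates such as 12/05/2019 or 2019-05-12 with 'date'
    s = re.sub(r"\b\d{1,4}[-/]\d{1,2}[-/]\d{1,4}\b", "date", s)
    # Step 3c: replace mail-IDs with 'mail' (done before the number rule
    # so digits inside addresses are not rewritten first)
    s = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "mail", s)
    # Step 3b: replace remaining numbers with 'number'
    s = re.sub(r"\b\d+\b", "number", s)
    # Step 3d: strip punctuation and collapse extra whitespace
    s = re.sub(r"[^\w\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    # Step 2: drop stop-words
    # Steps 4-5 (spell-checking, stemming) would plug in here, e.g. via
    # pyspellchecker and nltk.stem.PorterStemmer.
    return " ".join(t for t in s.split() if t not in STOP_WORDS)

text = "He scored 95 marks on 12/05/2019. Contact: chavez@example.com!"
# Step 1: naive sentence split; nltk.sent_tokenize is more robust.
sentences = re.split(r"(?<=[.!?])\s+", text)
print([clean_sentence(s) for s in sentences])
# ['he scored number marks on date', 'contact mail']
```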

3.2 Vectorization

This is the most interesting and intuitive step. The preprocessed sentences are now converted to their respective numeric representations, advancing the process in terms of both performance and intuition.

3.2.1. Why sentence vectors?

The most popular form of text representation for most of this decade has been word-level embeddings. These numerical representations are then aggregated by some user-defined function (for instance, summing or averaging all the word vectors in a sentence).
However, despite producing good results over the years, this technique is not intuitive when viewed from a logical point of view.

To demonstrate:
Sentence 1: He did very well in his examination.
Sentence 2: He did not do very well in his examination.

These two sentences, despite being completely opposite in meaning, end up with very similar numeric representations. As is evident, aggregated word-level vectorization misses the context of sentences in the paragraph or text; the short sketch below makes this concrete.
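
A minimal sketch of the problem, using made-up 3-dimensional word vectors in place of real 300-dimensional GloVe vectors; averaging nearly identical bags of word vectors produces nearly identical sentence vectors, negation or not:

```python
import numpy as np

# Toy 3-d word vectors, made up purely for illustration.
vecs = {
    "he": [0.1, 0.3, 0.2], "did": [0.2, 0.1, 0.4], "not": [0.2, 0.2, 0.3],
    "do": [0.2, 0.1, 0.3], "very": [0.4, 0.2, 0.1], "well": [0.3, 0.4, 0.2],
    "in": [0.1, 0.1, 0.1], "his": [0.1, 0.2, 0.3], "examination": [0.5, 0.3, 0.4],
}

def avg_embedding(sentence):
    # Aggregate word vectors by simple averaging.
    return np.mean([vecs[w] for w in sentence.lower().split()], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = avg_embedding("He did very well in his examination")
s2 = avg_embedding("He did not do very well in his examination")
print(cosine(s1, s2))  # ~0.99: nearly identical despite opposite meanings
```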

Thus, in order to capture the context of sentences, extensive research led to the development of unsupervised (Skip-Thoughts, Quick-Thoughts) and supervised (InferSent) models that generate vectors at the sentence level itself, meaning that a sentence, rather than a word, is the smallest unit of reference.

Unsupervised techniques capture context by trying to predict the previous and next sentences from the point of view of a given sentence and, based on the error rate, learning which sentences are more probable around one another. In other words, they learn the context of sentences in a self-supervised manner from unlabeled data. This technique gives a pretty good outcome, but supervised learning does not fall short, and in some cases performs better.

3.2.2. InferSent

InferSent is Facebook’s latest sentence embedding model (as of early 2019). It is by far the most successful, and indeed the only, supervised sentence embedding technique I have come across to date, and it disproves the earlier belief that supervised learning yields lower-quality embeddings.
The InferSent model is pretrained on the Stanford Natural Language Inference (SNLI) corpus, a set of 570K sentence pairs labeled according to whether they entail one another, contradict each other, or are neutrally related.

The advantage of learning the relationships between these pairs is that the model attains knowledge of context.
InferSent builds on word embeddings (GloVe embeddings are used here, but any other kind could be substituted) and weights each embedding based on the learned context of the sentences; the repository’s documented usage is sketched below.
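
For reference, here is a sketch of encoding sentences with the pretrained InferSent model, following the documented usage in the facebookresearch/InferSent repository; the checkpoint and GloVe file paths are assumptions that depend on what has been downloaded locally.

```python
import torch
from models import InferSent  # models.py from the facebookresearch/InferSent repo

# Hyperparameters documented in the InferSent README (version 1 = GloVe-based).
params = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
          'pool_type': 'max', 'dpout_model': 0.0, 'version': 1}
model = InferSent(params)
model.load_state_dict(torch.load('encoder/infersent1.pkl'))  # assumed path
model.set_w2v_path('glove.840B.300d.txt')                    # assumed path

sentences = ['He did very well in his examination.',
             'He did not do very well in his examination.']
model.build_vocab(sentences, tokenize=True)
embeddings = model.encode(sentences, tokenize=True)  # shape (2, 4096)
```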

3.2.3. Glove Embeddings

GloVe is a count-based technique for generating word embeddings. It builds a word-word co-occurrence matrix over the corpus, tracking how frequently words appear in each other’s context, and learns a numeric representation for each word from these co-occurrence statistics.
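
Pretrained GloVe vectors are distributed as a plain text file with one token per line followed by its vector components; a minimal loader (the filename is an assumption, matching the common 6B/300d release):

```python
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
```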

4. Implementation

Step 1: Generate the sentence embeddings for the set of questions.
Step 2: Generate the sentence embeddings for the text.
Step 3: For every question, compute the
a. Cosine Similarity
b. Euclidean Distance
of the question’s vector representation with every sentence vector in the text.
Step 4: Record the sentence with the maximum cosine similarity and the one with the least Euclidean distance from the question (Level 1 recording).
Level 2 recording means recording the most optimal sentence under each metric along with the second most optimal one. A sketch of the ranking logic follows.
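
A minimal sketch of Steps 1–4, assuming an encode function that maps a list of sentences to an array of vectors (for instance, model.encode from the InferSent sketch above):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question, sentences, encode, level=1):
    """Return the top `level` sentences under each metric, joined into one answer."""
    q_vec = encode([question])[0]   # Step 1: embed the question
    s_vecs = encode(sentences)      # Step 2: embed every sentence in the text
    # Step 3: score every sentence against the question under both metrics
    by_cos = sorted(range(len(sentences)),
                    key=lambda i: cosine(q_vec, s_vecs[i]), reverse=True)
    by_euc = sorted(range(len(sentences)),
                    key=lambda i: np.linalg.norm(q_vec - s_vecs[i]))
    # Step 4: Level 1 keeps the single best sentence, Level 2 the best two
    return (" ".join(sentences[i] for i in by_cos[:level]),
            " ".join(sentences[i] for i in by_euc[:level]))
```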

5. Results
Here are a few sample automated QA results:

5.1 Level 1:
Question: What did Chavez and his wife Helen do to help Mexican immigrants regarding literacy (i.e., the ability to read and write)?
Answer (Cosine Similarity): Chavez and his wife helped teach Mexican immigrants to read and helped them with voting registration.
Answer (Euclidean Distance): Chavez and his wife helped teach Mexican immigrants to read and helped them with voting registration.

Question: What were some of the concerns regarding farm work?
Answer (Cosine Similarity): They worked on many goals like increasing the wages for the workers, improving their working conditions, and improving the safety for the farm workers.
Answer (Euclidean Distance): The name NFWA was changed to the United Farm Workers (UFW) in 1974.

5.2 Level 2:
Question: What did Chavez and his wife Helen do to help Mexican immigrants regarding literacy (i.e., the ability to read and write)?
Answer (Cosine Similarity): Chavez and his wife helped teach Mexican immigrants to read and helped them with voting registration. Chavez’s children and grandchildren continue in his footsteps to help fight for the rights for migrant workers.
Answer (Euclidean Distance): Chavez and his wife helped teach Mexican immigrants to read and helped them with voting registration. Chavez’s children and grandchildren continue in his footsteps to help fight for the rights for migrant workers.

Question: What were some of the concerns regarding farm work?
Answer (Cosine Similarity): They worked on many goals like increasing the wages for the workers, improving their working conditions, and improving the safety for the farm workers. He organized a group of people to help work for the rights of farm workers.
Answer (Euclidean Distance): The name NFWA was changed to the United Farm Workers (UFW) in 1974. He organized a group of people to help work for the rights of farm workers.

