Automated Question-Answering System (Preliminary)
Overview:
1. Objective
2. Data Overview
3. Data Preprocessing
4. Implementation Logic
5. Results
6. Code
7. References
1. Objective
The primary objective of an automated QA system is to provide to-the-point answers from large paragraphs or texts, reducing the time readers must invest and improving resource utilization.
2. Data Overview
A random grade-5 comprehension passage consisting of several paragraphs is selected for the purpose of testing the system. A set of questions relevant to the passage is prepared.
3. Preprocessing
It is a well-accepted rule of thumb that around 80% of any data-related process cycle is spent on the preprocessing phase. This holds because real-world data is often a good fit for the human mind but is of little use to machines, which understand only a numerical feed and produce undesirable results when the data is largely made up of useless chunks such as meaningless text and missing datapoints.
Garbage input cannot miraculously produce accurate results. Thus, the text needs to be thoroughly cleaned before it is fed to the model, and subsequently represented in numeric format.
3.1 Text Cleaning
Step 1: Divide the paragraphs into sentences.
Step 2: Remove stop-words.
Step 3: Regex-based cleaning:
a. Replace dates with ‘date’
b. Replace remaining numbers with ‘number’
c. Replace mail-IDs with ‘mail’
d. Remove extra spaces and punctuation
Step 4: Correct misspelled words using any good spell-checking Python library.
Step 5: Apply stemming or lemmatization to collapse multiple occurrences of varying forms of words with similar meanings (for instance: ‘beautiful’, ‘beauty’, ‘beautifully’).
These steps make the sentences robust and fit for the next step in processing. A minimal sketch of the cleaning pipeline follows.
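Here is one possible sketch of the pipeline above, using NLTK (an assumption; any tokenizer, stop-word list, and lemmatizer would do, and the regex patterns are deliberately crude). Step 4 is omitted; a library such as pyspellchecker could fill that slot.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")   # crude date pattern
MAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")         # crude mail-ID pattern
NUM_RE = re.compile(r"\b\d+\b")

def clean_sentence(sentence):
    sentence = sentence.lower()
    sentence = DATE_RE.sub("date", sentence)        # Step 3a: dates
    sentence = MAIL_RE.sub("mail", sentence)        # Step 3c: mail-IDs
    sentence = NUM_RE.sub("number", sentence)       # Step 3b: remaining numbers
    sentence = re.sub(r"[^\w\s]", " ", sentence)    # Step 3d: punctuation
    tokens = [t for t in sentence.split() if t not in STOP_WORDS]   # Step 2
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)        # Step 5

def preprocess(text):
    # Step 1: split the text into sentences, then clean each one.
    return [clean_sentence(s) for s in nltk.sent_tokenize(text)]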
3.2 Vectorization
This is the most interesting and intuitive step. The preprocessed sentences are now converted to their respective numeric representations, which advances the process both in performance and in intuition.
3.2.1. Why sentence vectors?
For most of this decade, the most popular form of text representation has been word-level embeddings. These numerical representations are then combined by some user-defined function (for instance, adding up all the word vectors in a sentence).
However, this technique, in spite of good results over the years, is not at all intuitive when viewed from a logical point of view.
To demonstrate:
Sentence 1: He did very well in his examination.
Sentence 2: He did not do very well in his examination.
These two sentences, though completely opposite in meaning, will have very similar numeric representations. As is evident, word-level vectorization misses out on the context of sentences in the paragraph or text, as the quick check below illustrates.
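A crude demonstration with plain bag-of-words counts (a stand-in here for summed word embeddings, which show the same effect):

from collections import Counter
import math

def bow_cosine(a, b):
    # Cosine similarity between bag-of-words count vectors.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in set(ca) | set(cb))
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(ca) * norm(cb))

s1 = "He did very well in his examination"
s2 = "He did not do very well in his examination"
print(bow_cosine(s1, s2))  # ~0.88: near-identical vectors, opposite meanings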
Thus, in order to capture the context of sentences, extensive research led to the development of unsupervised (Skip-Thoughts, Quick-Thoughts) and supervised (InferSent) models that generate vectors at the sentence level itself, meaning that here a sentence, rather than a word, is the smallest unit of reference.
Unsupervised techniques capture context by trying to predict the previous and next sentences from the point of view of a given sentence and, based on the error rate, learning which sentences are more probable with respect to one another. In other words, they learn the context of sentences in a self-supervised manner from unlabeled data. This technique gives pretty good outcomes, but supervised learning does not fall short, and in some cases performs better.
3.2.2. InferSent
InferSent is Facebook’s latest sentence embedding model (as of early 2019). It is by far the most successful, and the only, supervised sentence embedding technique I have come across to date. It disproves the earlier belief that supervised learning provides lower-quality embeddings.
The InferSent model is pretrained on the Stanford Natural Language Inference (SNLI) corpus, a set of 570K sentence pairs labeled according to whether the pair is neutral, a contradiction, or an entailment.
The advantage of learning the relationships between the pairs is that the model attains knowledge of context.
InferSent builds on word embeddings (GloVe embeddings are used here, but any others could be) and assigns a weight to each embedding based on the learned context of the sentences. A usage sketch follows.
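The sketch below roughly follows the official facebookresearch/InferSent README; the file paths are assumptions that depend on where the pretrained encoder and the GloVe vectors were downloaded.

import torch
from models import InferSent  # models.py ships with the InferSent repo

params = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
          'pool_type': 'max', 'dpout_model': 0.0, 'version': 1}  # V1 = GloVe
model = InferSent(params)
model.load_state_dict(torch.load('encoder/infersent1.pkl'))  # pretrained weights
model.set_w2v_path('GloVe/glove.840B.300d.txt')              # word vectors

sentences = preprocess(text)                 # cleaned sentences from section 3.1
model.build_vocab(sentences, tokenize=True)  # restrict vocab to the corpus
embeddings = model.encode(sentences, tokenize=True)  # one 4096-d vector per sentence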
3.2.3. GloVe Embeddings
GloVe is a count-based technique for generating word embeddings. It uses a co-occurrence matrix of words against their contexts to track how frequently words appear together in the corpus and, based on these counts, assigns a numeric representation to each word.
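For reference, the pretrained GloVe vectors ship as a plain text file with one word and its vector components per line; a minimal loader (the path and dimension are assumptions matching the file used above):

import numpy as np

def load_glove(path, dim=300):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-dim])  # a few GloVe tokens contain spaces
            vectors[word] = np.asarray(parts[-dim:], dtype=np.float32)
    return vectors

glove = load_glove("GloVe/glove.840B.300d.txt")
print(glove["beauty"].shape)  # (300,)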
4. Implementation
Step 1: Generate the sentence embeddings for the set of questions.
Step 2: Generate the sentence embeddings for the text.
Step 3: For every question, find the
a. Cosine Similarity
b. Euclidean Distance
of the question’s vector representation with every sentence vector in the text.
Step 4: Record the sentence with the maximum Cosine Similarity and also the one with the least Euclidean Distance from the question (Level 1 recording).
Level 2 recording means recording the sentence with the optimum metric along with the sentence with the second-most-optimum metric. A sketch of Steps 3 and 4 follows.
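A minimal sketch of the ranking, assuming the question and sentence embeddings are NumPy arrays produced by the encoder above (the function and argument names are hypothetical):

import numpy as np

def answer(question_vec, sent_vecs, sentences, level=1):
    # Cosine similarity against every sentence vector (higher = better match).
    cos = sent_vecs @ question_vec / (
        np.linalg.norm(sent_vecs, axis=1) * np.linalg.norm(question_vec))
    # Euclidean distance against every sentence vector (lower = better match).
    euc = np.linalg.norm(sent_vecs - question_vec, axis=1)
    top_cos = np.argsort(-cos)[:level]  # Level 1 keeps the best; Level 2 the best two
    top_euc = np.argsort(euc)[:level]
    return (" ".join(sentences[i] for i in top_cos),
            " ".join(sentences[i] for i in top_euc))

q_vecs = model.encode(questions, tokenize=True)
by_cosine, by_euclidean = answer(q_vecs[0], embeddings, sentences, level=2)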
5. Results
Here are a few sample automated QA results:
5.1 Level 1:
Question: What did Chavez and his wife Helen do to help Mexican immigrants regarding literacy (i.e., the ability to read and write)?
Answer (Cosine Similarity): Chavez and his wife helped teach Mexican immigrants to read and helped them with voting registration.
Answer (Euclidean Distance): Chavez and his wife helped teach Mexican immigrants to read and helped them with voting registration.
Question: What were some of the concerns regarding farm work?
Answer (Cosine Similarity): They worked on many goals like increasing the wages for the workers, improving their working conditions, and improving the safety for the farm workers.
Answer (Euclidean Distance): The name NFWA was changed to the United Farm Workers (UFW) in 1974.
5.2 Level 2:
Question: What did Chavez and his wife Helen do to help Mexican immigrants regarding literacy (i.e., the ability to read and write)?
Answer (Cosine Similarity): Chavez and his wife helped teach Mexican immigrants to read and helped them with voting registration. Chavez’s children and grandchildren continue in his footsteps to help fight for the rights for migrant workers.
Answer (Euclidean Distance): Chavez and his wife helped teach Mexican immigrants to read and helped them with voting registration. Chavez’s children and grandchildren continue in his footsteps to help fight for the rights for migrant workers.
Question: What were some of the concerns regarding farm work?
Answer (Cosine Similarity): They worked on many goals like increasing the wages for the workers, improving their working conditions, and improving the safety for the farm workers. He organized a group of people to help work for the rights of farm workers.
Answer (Euclidean Distance): The name NFWA was changed to the United Farm Workers (UFW) in 1974. He organized a group of people to help work for the rights of farm workers.
6. Code
7. References