
Micro Services

mitaligarg10 edited this page Jul 20, 2017 · 13 revisions

Searcher Service


Core Crawler

The main goal of the core crawler is to fetch the document for each URL it receives from the searcher service.

The Core Crawler service reads each URL and converts the page into a document using jsoup, a Java library for extracting and manipulating HTML data. A DOM object is created to hold the document for each URL. The URLs and their documents are then serialized and passed through Kafka to the Document Parser.
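The serialization step can be sketched as follows. This is only an illustration: the wiki does not say which wire format is used, so plain Java serialization is assumed here, and the `CrawledPage` class and its fields are hypothetical names. The Kafka producer itself is left out; the byte array is what would be sent as the message value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical message type: the wiki does not name the class or its fields.
class CrawledPage implements Serializable {
    private static final long serialVersionUID = 1L;

    final String url;
    final String html;

    CrawledPage(String url, String html) {
        this.url = url;
        this.html = html;
    }

    // Serialize to bytes, the form a Kafka message value takes on the wire.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuild the page on the consumer side (the Document Parser).
    static CrawledPage fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (CrawledPage) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not deserialize crawled page", e);
        }
    }
}
```

A format such as JSON or Avro would work equally well here; the point is only that the URL and its document travel together in one message.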


Document Parser Service

The core crawler crawls the web and supplies the document for each URL it receives. The document parser reads the document and works out the intensity of each word of interest. These words are obtained from neo4j, where they are predeclared for a particular intent.

We use the jsoup library to parse the document. The parser makes every attempt to create a clean parse from the HTML provided, regardless of whether the HTML is well-formed.

The document received is tokenized by tag, and the frequency of occurrence of each word within each tag is calculated. An intensity algorithm then calculates the intensity of each word.
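The per-tag frequency count can be sketched as below. It assumes the text of each tag has already been extracted (for example with jsoup) into a map from tag name to text. The `TagWordCounter` class and its method name are hypothetical, and the intensity algorithm itself is not shown because the wiki does not describe it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class TagWordCounter {
    // Input:  tag name -> text found inside that tag (extracted upstream).
    // Output: tag name -> (word -> number of occurrences in that tag).
    static Map<String, Map<String, Integer>> countWordsPerTag(Map<String, String> textByTag) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (Map.Entry<String, String> entry : textByTag.entrySet()) {
            Map<String, Integer> wordCounts = new HashMap<>();
            // Lowercase, then split on anything that is not a letter or a digit.
            String text = entry.getValue().toLowerCase(Locale.ROOT);
            for (String word : text.split("[^\\p{L}\\p{Nd}]+")) {
                if (!word.isEmpty()) {
                    wordCounts.merge(word, 1, Integer::sum);
                }
            }
            counts.put(entry.getKey(), wordCounts);
        }
        return counts;
    }
}
```

For a `p` tag containing "Learn Java. Java is fun!", the inner map would hold `java` with a count of 2 and `learn`, `is` and `fun` with a count of 1 each.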

The parser creates a document model in the form of a HashMap of HashMaps, consisting of the tag, the word and its intensity. This model is passed to the intent parser and is also stored in Mongo, so that the document model does not have to be rebuilt when a request is received for the same document.
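A minimal sketch of the "build once, reuse afterwards" behaviour is shown below. An in-memory map stands in for the Mongo collection, documents are assumed to be keyed by URL, and the `DocumentModelCache` class and `getOrBuild` method are hypothetical names, not the project's actual code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class DocumentModelCache {
    // Stand-in for the Mongo collection: URL -> (tag -> (word -> intensity)).
    private final Map<String, Map<String, Map<String, Double>>> store = new HashMap<>();

    // Return the stored model for a URL; build and save it only on the first request.
    Map<String, Map<String, Double>> getOrBuild(
            String url,
            Function<String, Map<String, Map<String, Double>>> builder) {
        return store.computeIfAbsent(url, builder);
    }
}
```

The `builder` function is only invoked when no model is stored for the URL, which is the saving the paragraph above describes.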


Intent Parser Service

The purpose of the intent parser is to provide the most accurate documents for the concept selected by the user.

The intent parser implements the algorithm that calculates the confidence score of each intent, using the indicator and counter-indicator relationships between the intent and the terms. It consumes its input through Kafka in the form of the document produced by the document parser, which holds each term and its intensity based on the occurrence of the term in particular tags. It also fetches the intent graph from neo4j, which shows the indicatorOf and counterIndicatorOf relationships with the terms, each carrying a weight property. The confidence score is then calculated from these two inputs.

Finally, the intent parser inserts the URLs into the neo4j graph database, linked to their concepts along with the confidence score and intent.
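The wiki does not give the scoring formula, so the sketch below uses an assumed rule purely for illustration: every term found in the document contributes intensity times weight, added for indicatorOf terms and subtracted for counterIndicatorOf terms. The `ConfidenceScorer` and `TermRelation` names are hypothetical.

```java
import java.util.Map;

class ConfidenceScorer {
    // Hypothetical edge of the intent graph: a term linked to an intent by
    // indicatorOf (indicator == true) or counterIndicatorOf, with a weight.
    static class TermRelation {
        final boolean indicator;
        final double weight;

        TermRelation(boolean indicator, double weight) {
            this.indicator = indicator;
            this.weight = weight;
        }
    }

    // Assumed scoring rule (not the project's actual formula): each document term
    // adds intensity * weight if it is an indicator and subtracts it otherwise.
    static double score(Map<String, Double> termIntensity,
                        Map<String, TermRelation> intentTerms) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : termIntensity.entrySet()) {
            TermRelation relation = intentTerms.get(entry.getKey());
            if (relation == null) {
                continue; // term is not connected to this intent
            }
            double contribution = entry.getValue() * relation.weight;
            total += relation.indicator ? contribution : -contribution;
        }
        return total;
    }
}
```

With an indicator term of intensity 4.0 and weight 0.5, and a counter-indicator term of intensity 2.0 and weight 0.25, this rule gives 2.0 - 0.5 = 1.5. Terms that have no edge to the intent are ignored.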


Term Bank Service

The need for this micro-service arose when the order of the documents returned by the Intent Parser service did not match the intent of the user. This was caused by a lack of terms in our database. The Term Bank Service was created to overcome this.

Initially the term bank service was part of the intent parser service: every time, at runtime, it had to communicate with neo4j and the API to fetch the terms. This turned out to be a bottleneck, so it was made a separate micro-service.

This micro-service enriches our database with additional similar terms, together with their respective weights and relationships, so as to get more accurate results.

It uses the Words API to fetch synonyms and antonyms for a particular term.

These synonyms and antonyms are then stored in the database, linked to their particular intents with the specified relationship and weight.
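The storage step might look like the sketch below. It rests on assumptions the wiki does not confirm: the seed term is taken to be an indicator of the intent, synonyms are stored as indicatorOf and antonyms as counterIndicatorOf, and the weights are supplied by the caller. The `TermBank` and `TermEdge` names are hypothetical, and the call to the Words API and the write to neo4j are omitted.

```java
import java.util.ArrayList;
import java.util.List;

class TermBank {
    // Hypothetical row to be written to the graph:
    // term -[relationship {weight}]-> intent.
    static class TermEdge {
        final String term;
        final String relationship;
        final String intent;
        final double weight;

        TermEdge(String term, String relationship, String intent, double weight) {
            this.term = term;
            this.relationship = relationship;
            this.intent = intent;
            this.weight = weight;
        }
    }

    // Assumption: the seed term indicates the intent, so its synonyms become
    // indicatorOf edges and its antonyms become counterIndicatorOf edges.
    static List<TermEdge> expand(String intent,
                                 List<String> synonyms,
                                 List<String> antonyms,
                                 double synonymWeight,
                                 double antonymWeight) {
        List<TermEdge> edges = new ArrayList<>();
        for (String synonym : synonyms) {
            edges.add(new TermEdge(synonym, "indicatorOf", intent, synonymWeight));
        }
        for (String antonym : antonyms) {
            edges.add(new TermEdge(antonym, "counterIndicatorOf", intent, antonymWeight));
        }
        return edges;
    }
}
```

Each resulting edge carries everything the paragraph above mentions: the term, its relationship to the intent, and the weight.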
