Tagging 26 million Publications within 26 minutes – Adding “Spark” to the code !
At the Data Engineering team at Innoplexus, we have been working tirelessly for the past 2 months to reduce the time taken to tag our entire publications database. The database has 26 million publications and our initial approach to getting this done would tag only a million of these publications daily. This meant 26 days for the entire data which consequently meant our live product would be offline for these many days during an update. We had to reduce this time to hours or even better, to minutes!
Why was tagging needed?
Our product offers users an interface to discover publications related to their field of research. For example, an academician working on “algorithm design” would like to see recent publications in this field. For us to filter all the publications on “algorithm design”, we had to have a classification of our publications database. We thought of adding tags to our publications based on their content. An analogy of this is Netflix classifying movies by genre.
Our approach to tagging
Creating a database of tagging terms:
Once we were convinced of our approach, we found out all relevant tags in different areas of research and the terms we needed to look for in order to add these tags. Also, we realized there are open source databases of terms to be looked for in publications and tags corresponding to these terms. Combined together, they came up with an Ontology. This ontology serves as our knowledge base. Among many other things, it has 6.2 million tagging terms to look for and tags corresponding to them.
Next we had to come up with an efficient tagging algorithm. We had to find a way to leverage this ontology and tag raw data. In our publications database, we had “article title” and “article abstract” among other entries such as ‘authors’, ‘publication date’, ‘journal title’. It was decided that we search for the presence of tagging terms in ‘article title’ and ‘article abstract’ and then tag accordingly. The tagging terms could be a word or a phrase and hence it was required to match them on an “as is” basis except for the case of the letters. The appearance of order of different words in tagging terms had to be the same as in ‘article title’ and ‘abstract’. Also, the entire phrase had to be checked and not just a part of the phrase.
Since the two databases – publications and tagging terms, were huge, we could not choose to traverse through both of them as complexity would be O(n*n). So, we chose to create a map of tagging terms where keys would be the terms that we were looking for in publications and values would be the tag, which we would add to the publication. Redis was an obvious choice as it creates an in-memory database for faster look ups with complexity of O(1). Now, ‘article title’ and ‘abstracts’ had to be broken into a structure which matched with these keys so that we could look for their tags. We decided to create ngrams from titles and abstracts and look for these ngrams in the Redis server and get corresponding tags if present, and then tag the publications accordingly. Maximum number of words in a tagging term was 7, so we created ngrams of 7 or below. This way, the complexity would be O(n).
Getting it done
Initial Approach: Python Code
Next, the job was for us was to implement this algorithm. Publications data was stored in mongoDB and the tagging terms were stored in Redis server. We began with writing a python script to read publications data from mongoDB and create ngrams from them and extract values for these ngrams from Redis if present, and then add a tag key along with the tag. When run on a 6 core server with 8GB RAM each, the algorithm was able to tag around a million publications daily. Another attempt was made with 60 cores only to find that it took 6 days using this method.
We had to get it done within an hour at most. So, a brainstorming session was scheduled to change the approach altogether to reduce the time. Also, it was decided to add “Spark” to the code.
Final Approach: Spark/Scala code
The publications database was shifted to HBase for distributed storage. A python code to write data from mongo to HBase was implemented. The fact that spark itself is written in Scala and that we could import all libraries of java plus its own, we chose to write our tagging algorithm in Scala. Spark does all the processing in-memory and is about 100 times faster than the Mapreduce coding paradigm.
Publications data was fetched as RDD (Resilient Distributed Datasets) from HBase into Spark. Similarly tagging terms were fetched from Redis server. For faster look ups and less memory consumption, we calculated hash values for all the tagging terms imported from Redis and created a map where keys would be these hash values and the values would be the term which we would tag. For publications, ngrams with less than or equal to 7 words were created and then their hash values were calculated. These hash values were looked up in the map we created and values were tagged accordingly if present.
With much eagerness and hope, the code was run on a 16 core server with 64 GB RAM each and it took only 26 minutes for the entire publications dataset to be tagged. We were extremely satisfied with the results. We were able to do what we set out to achieve and our results ended up with our team getting recognized for our efforts and approach.
Ravi Ranjan works as a Data Engineer at Innoplexus. He is an IIT Delhi graduate who can be found reading, traveling, writing blogs and exploring life in his free time.