API Multiprocessing

Motivation

In this article, I am going to show you how we can cut the total run time of a simple API script from 6 minutes 17 seconds to 1 minute 14 seconds.

The Idea

I will share with you one of my favourite simple techniques, which I use especially when I work on data science tasks such as data visualization, data analysis, code optimization, and big data processing.

Processing a task sequentially may take a long time, especially when we are talking about a huge amount of data (e.g. big inputs).

This technique takes advantage of parallelization capabilities in order to reduce the processing time.

The idea is to divide the data into chunks so that each engine (worker process) takes care of classifying the entries in its own chunk. Each engine reads, processes, and writes only its own chunk, so all chunks are handled at the same time, and chunks of equal size should take roughly the same amount of time.
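To make the idea concrete, here is a minimal sketch of chunking plus parallel processing. It is not the code from the repo: it uses a thread pool from Python's standard library instead of separate OS processes (usually enough when the work is waiting on a network API), and `process_chunk` is a hypothetical stand-in for the real per-entry work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Return consecutive slices of `items`, each at most `chunk_size` long."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Hypothetical stand-in for the real work (e.g. one API call per name)."""
    return [name.upper() for name in chunk]


def process_in_parallel(items, chunk_size, max_workers=4):
    """Process every chunk concurrently and return results in input order."""
    chunks = split_into_chunks(items, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input chunks.
        results = pool.map(process_chunk, chunks)
        return [entry for chunk_result in results for entry in chunk_result]


if __name__ == "__main__":
    print(process_in_parallel(["aa", "ab", "ac", "ad", "ae"], chunk_size=2))
```

Because each worker only touches its own chunk, no locking is needed, and the results can simply be joined back together at the end.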

Example

The example I chose for this article is to Genderize names that consist of 2 alphabetic characters, that is, to look up the gender of each name through an API.
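For reference, a single lookup could look roughly like the sketch below. I am assuming the public genderize.io endpoint and a JSON response with `name` and `gender` fields. The function names are hypothetical and this is not necessarily how the repo's scripts call the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.genderize.io"  # assumed endpoint, not taken from the repo


def build_url(name):
    """Build the request URL for one name."""
    return f"{API_URL}?{urlencode({'name': name})}"


def parse_response(body):
    """Extract (name, gender) from a JSON body; gender is None when unknown."""
    data = json.loads(body)
    return data.get("name"), data.get("gender")


def genderize(name, timeout=10):
    """Perform one lookup over the network (needs internet access)."""
    with urlopen(build_url(name), timeout=timeout) as response:
        return parse_response(response.read().decode("utf-8"))
```

Each call spends most of its time waiting for the server, which is exactly why running many lookups side by side pays off.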

Output Analysis Chart

Explanation

Clone the GitHub repo and follow the instructions in its Usage section.

Let’s generate all alphabetic names that consist of 2 characters (to keep the testing process easy).

We can use a Kali Linux penetration testing tool [1] such as crunch:
$ crunch 2 2 > names.txt
This generates all possible alphabetic names of length 2 (676 lines).
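If crunch is not available, the same list can be produced with a few lines of Python. This is only an equivalent sketch, assuming lowercase letters a to z (26 × 26 = 676 combinations); the function name is my own.

```python
from itertools import product
from string import ascii_lowercase


def two_letter_names():
    """Return every lowercase two-letter combination, in alphabetical order."""
    return ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]


if __name__ == "__main__":
    names = two_letter_names()
    # Write one name per line, like the crunch output redirected to names.txt.
    with open("names.txt", "w") as handle:
        handle.write("\n".join(names) + "\n")
    print(len(names))
```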

Then let’s create the directories needed for the splitting process:
$ mkdir subs/ subs/inputs subs/outputs subs/outputs/parts subs/outputs/all

Now we can split our input data. There are many ways to do that, but I prefer the Unix split command [2]:
$ split -l 100 -d names.txt ./subs/inputs/
This splits names.txt into small files of at most 100 lines each (seven files here, the last one holding the remaining 76 lines).
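As one of those other ways, the split step can be sketched in Python. This mirrors what `split -l 100 -d` does for small inputs (two-digit numeric file names 00, 01, ...). The function name is hypothetical, and the sketch assumes fewer than 100 parts.

```python
from pathlib import Path


def split_file(source, out_dir, lines_per_file=100):
    """Write `source` into numbered part files (00, 01, ...) inside `out_dir`."""
    lines = Path(source).read_text().splitlines(keepends=True)
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, start in enumerate(range(0, len(lines), lines_per_file)):
        # Two-digit names match the default numeric suffixes of `split -d`.
        part = out_path / f"{index:02d}"
        part.write_text("".join(lines[start:start + lines_per_file]))
        written.append(part)
    return written


if __name__ == "__main__":
    for part in split_file("names.txt", "subs/inputs"):
        print(part)
```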

Now let’s run all the processes:
$ ./init.bash
When they finish, use the merger.py script to merge all the outputs.
The merge is a separate step to avoid conflicting writes between the parallel processes and to keep the saved output sorted.
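A merge step of this kind can be sketched as below. This is not the repo's merger.py, just an illustration. I am assuming each process wrote one plain-text part file (for example under subs/outputs/parts) and that reading the parts in file-name order restores the original order. The function name and the merged.txt file name are hypothetical.

```python
from pathlib import Path


def merge_parts(parts_dir, merged_file):
    """Concatenate every part file in `parts_dir`, in file-name order."""
    part_paths = sorted(p for p in Path(parts_dir).iterdir() if p.is_file())
    merged_lines = []
    for part in part_paths:
        merged_lines.extend(part.read_text().splitlines())
    target = Path(merged_file)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Only one process writes the final file, so there is nothing to conflict with.
    target.write_text(("\n".join(merged_lines) + "\n") if merged_lines else "")
    return len(merged_lines)


if __name__ == "__main__":
    total = merge_parts("subs/outputs/parts", "subs/outputs/all/merged.txt")
    print(total)
```

Keeping the merge out of the workers means no two processes ever write to the same file.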


The Project on GitHub:
https://github.com/khaledalam/api-multiprocessing

An application that uses this technique:
Hiring-Related Email (https://github.com/khaledalam/amazon-jobs)

Interesting related ideas:
– Parallelizing using GPUs
– MapReduce (https://en.wikipedia.org/wiki/MapReduce)

[1] https://tools.kali.org/password-attacks/crunch
[2] https://en.wikipedia.org/wiki/Split_(Unix)
