In this article, I am going to show you how we can cut the total running time of a simple API script from 6 minutes 17 seconds to 1 minute 14 seconds (roughly 5 times faster).
I will share one of my favourite simple techniques, which I like to use when I work on data science tasks such as data visualization, data analysis, code optimization, and big data processing.
Processing a task sequentially can take a long time, especially when we are dealing with a huge amount of data (e.g. big inputs).
This technique takes advantage of parallelization capabilities in order to reduce the processing time.
The idea is to divide the data into chunks so that each engine takes care of classifying the entries in its own chunk. Once the data is split, each engine reads, processes, and writes its chunk independently of the others; because the chunks are the same size, each one should take roughly the same amount of time to process.
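The chunk-and-distribute idea can be sketched in a few lines of Python. This is only a minimal illustration, not the script used in this article: `make_chunks`, `process_chunk`, and `process_in_parallel` are hypothetical names, the per-entry work is a stand-in (upper-casing the name instead of calling an API), and a thread pool is used here for simplicity, which is a reasonable fit when the work is mostly waiting on network calls.

```python
from concurrent.futures import ThreadPoolExecutor


def make_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Stand-in for the real per-entry work (e.g. one API call per name)."""
    return [name.upper() for name in chunk]


def process_in_parallel(items, chunk_size, workers=4):
    """Process every chunk on its own worker and return the combined results."""
    chunks = make_chunks(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input chunks.
        results = list(pool.map(process_chunk, chunks))
    # Flatten the per-chunk results back into a single list.
    return [entry for chunk_result in results for entry in chunk_result]


if __name__ == "__main__":
    print(process_in_parallel(["ab", "cd", "ef", "gh", "ij"], chunk_size=2))
```

Because the results come back in chunk order, the combined list keeps the order of the original input even though the chunks run concurrently.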
The example I chose for this article is:
Genderize names that consist of 2 alphabetic characters.
Output Analysis Chart
Clone the GitHub repo and follow the instructions in
Let’s generate all alphabetic names that consist of 2 characters (to keep the testing process easy).
We can use a Kali Linux penetration testing tool such as crunch:
$ crunch 2 2 > names.txt
This generates all possible alphabetic names of length 2 (26 × 26 = 676 lines).
Then let’s create the directories needed for the splitting process:
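If crunch is not available, the same list can be produced with a few lines of Python. This is a sketch that assumes the names are built from the 26 lowercase ASCII letters; `two_letter_names` is a hypothetical helper, not part of the repo.

```python
from itertools import product
from string import ascii_lowercase


def two_letter_names():
    """Return every 2-letter combination of lowercase letters, from 'aa' to 'zz'."""
    return ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]


if __name__ == "__main__":
    names = two_letter_names()
    # One name per line, matching the layout of names.txt.
    with open("names.txt", "w") as handle:
        handle.write("\n".join(names) + "\n")
    print(len(names))
```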
$ mkdir subs/ subs/inputs subs/outputs subs/outputs/parts subs/outputs/all
Now we can split our input data. There are many ways to do that, but I prefer to use the Unix split command:
$ split -l 100 -d names.txt ./subs/inputs/
This splits the names.txt file into 7 smaller files (00 to 06) of at most 100 lines each; the last one holds the remaining 76 lines.
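For readers without the split command, here is a rough Python equivalent of the step above. It is a sketch under assumptions: `split_file` is a hypothetical helper, and it mimics the two-digit numeric file names (00, 01, ...) that the `-d` option produces.

```python
from pathlib import Path


def split_file(source, out_dir, lines_per_file=100):
    """Write consecutive blocks of lines from source into numbered files in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = Path(source).read_text().splitlines(keepends=True)
    written = []
    for index, start in enumerate(range(0, len(lines), lines_per_file)):
        # Zero-padded names keep the parts in order when sorted by name.
        part = out_dir / f"{index:02d}"
        part.write_text("".join(lines[start:start + lines_per_file]))
        written.append(part)
    return written


# Example call, assuming names.txt exists in the current directory:
# split_file("names.txt", "subs/inputs")
```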
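Now let’s run all the processes:
Once they have finished, use the merger.py script to merge all outputs.
The merging step is kept separate so that the parallel processes never write to the same file, and so that the final output can be saved in sorted order.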
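To make the merge step concrete, here is a minimal sketch of what such a merger could look like. It is not the merger.py from the repo: `merge_parts` is a hypothetical helper, and it assumes each process wrote one plain-text file into subs/outputs/parts, named so that sorting by file name restores the original order.

```python
from pathlib import Path


def merge_parts(parts_dir, merged_path):
    """Concatenate every file in parts_dir, in name order, into merged_path."""
    parts = sorted(p for p in Path(parts_dir).iterdir() if p.is_file())
    merged_path = Path(merged_path)
    merged_path.parent.mkdir(parents=True, exist_ok=True)
    with merged_path.open("w") as out:
        for part in parts:
            text = part.read_text()
            out.write(text)
            # Make sure a part without a trailing newline does not
            # run into the first line of the next part.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(parts)


# Example call; the output file name is an assumption:
# merge_parts("subs/outputs/parts", "subs/outputs/all/merged.txt")
```

Merging only after every process has finished means no two processes ever write to the same file at the same time.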
An application that uses this technique:
Interesting related ideas:
– Parallelizing using GPUs
– MapReduce (https://en.wikipedia.org/wiki/MapReduce)