How-to: Build a Python Web Scraper to capture IMDb Top-100 Movies
--
Some background
I love movies, and all things cinema-related 📽 🎞️ 🎬️🍿🎫.
I truly think it is immensely cool how so much painstaking and beautiful work goes into making movies. Years and years of hard work, involving meticulous planning, growing expenses, constant deliberation and thinking.
All of that hard work gets compressed into a movie that runs about an hour and a half.
And then it takes the audience only a few minutes, on average, to decide whether a movie was entertaining or boring. Realistic or unrealistic. Interesting or a waste of time. Inspiring or lame.
Ummm, ok, aaannd? 🤷‍♂️
Switching back from my dream movie-land to the real-life software world: last year, I found myself learning Python and data engineering. As I settled into my new data engineering role, I wanted to play around with some of the libraries and tools that are widely used in Python data processing.
So I looked up what each of those libraries did. Once I had read their docs, I felt I knew enough to play around with them, and I built a little Web 🕸️ Scraper.
At a high level, the web scraper we will build in this tutorial does the following:
Short Description:
This will be a Python script that calls out to IMDb, grabs the HTML
elements for the Top-100 movies of all time, and saves the results to a .csv
file.
Step-by-Step Breakdown:
1. Import all of the required libraries.
2. Call out to IMDb.
3. Save the HTML elements off the IMDb page to a results object.
4. Create a movie_soup BeautifulSoup object that stores all the results as text.
5. Create lists to extract all HTML attributes like:
   - name
   - years
   - runtime
   - ratings
   - metascores
   - number of votes
   - gross earnings
6. Create a movie_div object to find all div objects in movie_soup.
7. Loop through each object in the movie_div.
8. Add each attribute's result to its list.
9. Build a movies DataFrame.
10. Store all attributes into the movies DataFrame.
11. Use pandas str.extract to strip the non-numeric characters and save the values as type int.
12. Export the results to a pretty little top_100_movies.csv file.
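Step 11 is the one that tends to look like magic, so here is a tiny standalone sketch of it. The raw values are made up for illustration (they are only shaped like scraped text), and the column names are my own choice, not something IMDb or pandas dictates.

```python
import pandas as pd

# Hypothetical raw values, shaped like text scraped from a movie listing.
raw = pd.DataFrame({
    "year": ["(1994)", "(I) (2017)"],
    "runtime": ["142 min", "95 min"],
    "votes": ["2,345,678", "12,345"],
})

clean = pd.DataFrame({
    # str.extract keeps only what the capture group matches: the 4-digit year.
    "year": raw["year"].str.extract(r"(\d{4})", expand=False).astype(int),
    # Same idea for the runtime: keep the leading digits, drop " min".
    "runtime": raw["runtime"].str.extract(r"(\d+)", expand=False).astype(int),
    # Votes only need their thousands separators removed before converting.
    "votes": raw["votes"].str.replace(",", "", regex=False).astype(int),
})

print(clean.dtypes)
```

Passing expand=False makes str.extract return a single column (a Series) when the pattern has one capture group, which is what lets astype(int) be chained straight onto it.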
Without further ado, lights, camera, action!💡 🎥 🎬
👨‍💻 Code and Step-By-Step Instructions: 📖
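Here is a minimal sketch of how the twelve steps above could fit together in imdb-movie-scraper.py. Treat it as a starting point under a few loud assumptions: the search URL, the Accept-Language header, and every class name (lister-item, lister-item-year, runtime, metascore, and the nv spans) come from the older IMDb search-page layout, so they may not match what the live site serves today. The helper and column names (text_of, extract_number, parse_movies, clean_movies, scrape, gross_millions, and so on) are mine, purely for illustration.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# ASSUMPTION: the URL and every class name below follow the older IMDb
# "lister" search-page markup. Inspect the live page and adjust as needed.
URL = "https://www.imdb.com/search/title/?groups=top_100"
HEADERS = {"Accept-Language": "en-US,en;q=0.5"}


def text_of(tag):
    """Return a tag's stripped text, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_movies(html):
    """Steps 4-10: fill one list per attribute, then build the DataFrame."""
    movie_soup = BeautifulSoup(html, "html.parser")
    titles, years, runtimes, ratings = [], [], [], []
    metascores, votes, gross = [], [], []

    movie_div = movie_soup.find_all("div", class_="lister-item")
    for container in movie_div:
        header = container.find("h3")
        titles.append(text_of(header.find("a")) if header else None)
        years.append(text_of(container.find("span", class_="lister-item-year")))
        runtimes.append(text_of(container.find("span", class_="runtime")))
        ratings.append(text_of(container.find("strong")))
        metascores.append(text_of(container.find("span", class_="metascore")))
        # Votes and gross share the same span type; gross is sometimes absent.
        nv = container.find_all("span", attrs={"name": "nv"})
        votes.append(text_of(nv[0]) if nv else None)
        gross.append(text_of(nv[1]) if len(nv) > 1 else None)

    return pd.DataFrame({
        "movie": titles,
        "year": years,
        "runtime_min": runtimes,
        "rating": ratings,
        "metascore": metascores,
        "votes": votes,
        "gross_millions": gross,
    })


def extract_number(series, pattern):
    """Keep the first regex match in each cell and convert it to a number."""
    return pd.to_numeric(series.str.extract(pattern, expand=False), errors="coerce")


def clean_movies(movies):
    """Step 11: strip the non-numeric characters from each numeric column.

    Missing values become NaN, so a column with gaps ends up float, not int.
    """
    movies = movies.copy()
    movies["year"] = extract_number(movies["year"], r"(\d{4})")
    movies["runtime_min"] = extract_number(movies["runtime_min"], r"(\d+)")
    movies["rating"] = pd.to_numeric(movies["rating"], errors="coerce")
    movies["metascore"] = pd.to_numeric(movies["metascore"], errors="coerce")
    votes_no_commas = movies["votes"].str.replace(",", "", regex=False)
    movies["votes"] = pd.to_numeric(votes_no_commas, errors="coerce")
    movies["gross_millions"] = extract_number(movies["gross_millions"], r"(\d+(?:\.\d+)?)")
    return movies


def scrape(url=URL, out_path="top_100_movies.csv"):
    """Steps 2-3 and 12: download the page, parse and clean it, write the CSV."""
    results = requests.get(url, headers=HEADERS, timeout=30)
    results.raise_for_status()
    movies = clean_movies(parse_movies(results.text))
    movies.to_csv(out_path, index=False)
    return movies
```

The sketch only defines functions, so end the file with a call to scrape() to make the script actually fetch the page and write the CSV. Keeping the download in its own function means parse_movies and clean_movies can be tried on a saved HTML file without touching the network.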
Time to see the Movie results! 🙂
Run your Python script using the following command:
python imdb-movie-scraper.py
Voilà!
You should now see a newly created .csv file: top_100_movies.csv
Conclusion 🎭
Thank you for following my tutorial! 🙂
If you have any questions, or would like to share some of your cool projects using Python, DataFrames, BeautifulSoup, and NumPy, feel free to share in the comments section.

