How-to: Build a Python Web Scraper to capture IMDb Top-100 Movies
I love movies, and all things cinema-related 📽 🎞️ 🎬️🍿🎫.
I truly think it is immensely cool how so much painstaking and beautiful work goes into making movies. Years and years of hard work, involving meticulous planning, growing expenses, constant deliberation and thinking.
All of that hard work gets compressed into an average-length 1.5-hour movie.
And then it takes the audience, on average, only a few minutes to decide whether a movie was entertaining or boring. Realistic or unrealistic. Interesting or a waste of time. Inspiring or lame.
Ummm, ok, aaannd? 🤷‍♂️
Switching back from my dream movie-land to the real-life software world: last year, I found myself learning Python and data engineering. As I settled into my new data engineering role, I wanted to play around with some of the libraries and tools used extensively in Python data processing.
So, I googled what each of those libraries did. Once I had read their docs, I felt I knew enough to play around with them, and I built a little Web 🕸️ Scraper.
At a high level, the Web Scraper we will build in this tutorial is a Python script that calls out to IMDb, grabs the HTML elements for the Top-100 movies of all time, and saves the results to a .csv file. It works through the following steps:
1. Import all of the required libraries.
2. Call out to IMDb.
3. Save the HTML elements off the IMDb page.
4. Create a movie_soup BeautifulSoup object that stores all the results as text.
5. Create lists to hold each attribute to extract, such as number of votes.
6. Create a movie_div object to find all div objects in the parsed page.
7. Loop through each object in movie_div.
8. Add the result for each attribute to its list.
9. Build a movies DataFrame.
10. Store all attributes in the movies DataFrame.
11. Use Pandas str.extract to strip out the non-numeric string characters and save the values as a numeric type.
12. Export the results to a pretty little .csv file.
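To make the first eight steps concrete, here is a minimal sketch of the fetch-and-parse half of the scraper. It is not the exact script from this tutorial: the sample markup, the class names (lister-item, lister-item-year, runtime), the nv attribute, and the helper names fetch_html, text_or_none, and parse_movies are all assumptions made for illustration, and the live IMDb page may be structured differently.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: made-up markup standing in for one movie entry on the page.
# The tag and class names are illustrative, not IMDb's guaranteed structure.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <a href="/title/tt0000001/">Example Movie</a>
    <span class="lister-item-year">(1999)</span>
  </h3>
  <span class="runtime">120 min</span>
  <div class="ratings-imdb-rating"><strong>8.5</strong></div>
  <p class="sort-num_votes-visible">
    <span name="nv" data-value="1234567">1,234,567</span>
  </p>
</div>
"""


def fetch_html(url, timeout=10):
    """Steps 2-3: call out to the site and keep the raw HTML as text."""
    response = requests.get(
        url, headers={"Accept-Language": "en-US"}, timeout=timeout
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def text_or_none(tag):
    """Return a tag's stripped text, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_movies(html):
    """Steps 4-8: parse the HTML and collect one list per attribute."""
    movie_soup = BeautifulSoup(html, "html.parser")
    movie_div = movie_soup.find_all("div", class_="lister-item")

    titles, years, runtimes, ratings, votes = [], [], [], [], []
    for container in movie_div:
        header = container.find("h3")
        title_link = header.find("a") if header is not None else None
        titles.append(text_or_none(title_link))
        years.append(
            text_or_none(container.find("span", class_="lister-item-year"))
        )
        runtimes.append(text_or_none(container.find("span", class_="runtime")))
        ratings.append(text_or_none(container.find("strong")))
        # "name" is also a find() parameter, so the attribute goes via attrs.
        votes.append(text_or_none(container.find("span", attrs={"name": "nv"})))

    return {
        "title": titles,
        "year": years,
        "runtime": runtimes,
        "rating": ratings,
        "votes": votes,
    }


# Demo on the sample markup, so nothing here touches the network.
print(parse_movies(SAMPLE_HTML))
```

To scrape the live page instead, you would pass the result of fetch_html(...) to parse_movies, using the URL of the IMDb Top-100 listing, and adjust the selectors to whatever markup the page currently serves.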
Without further ado, lights, camera, action!💡 🎥 🎬
👨‍💻 Code and Step-By-Step Instructions: 📖
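As a rough sketch of steps 9 to 12 from the list above, the snippet below builds a movies DataFrame from per-attribute lists, uses Pandas str.extract to pull the digits out of the text values, and converts them to numbers. The column names, the sample row, and the helper name build_movies_frame are hypothetical; only the top_100_movies.csv file name comes from this tutorial.

```python
import pandas as pd


def build_movies_frame(titles, years, runtimes, ratings, votes):
    """Build the movies DataFrame and clean its text columns into numbers."""
    movies = pd.DataFrame(
        {
            "movie": titles,
            "year": years,
            "runtime_min": runtimes,
            "imdb_rating": ratings,
            "votes": votes,
        }
    )

    # str.extract keeps only the first run of digits, e.g. "(1999)" -> "1999".
    # Note: astype(int) raises if a value had no digits at all (NaN).
    movies["year"] = (
        movies["year"].str.extract(r"(\d+)", expand=False).astype(int)
    )
    movies["runtime_min"] = (
        movies["runtime_min"].str.extract(r"(\d+)", expand=False).astype(int)
    )
    movies["imdb_rating"] = movies["imdb_rating"].astype(float)
    # Vote counts carry thousands separators, e.g. "1,234,567".
    movies["votes"] = (
        movies["votes"].str.replace(",", "", regex=False).astype(int)
    )
    return movies


# Made-up sample row, used only to show the cleaning step.
movies = build_movies_frame(
    ["Example Movie"], ["(1999)"], ["120 min"], ["8.5"], ["1,234,567"]
)
print(movies.dtypes)

# Step 12 would then be a single call:
# movies.to_csv("top_100_movies.csv", index=False)
```

Passing index=False keeps the DataFrame's row index out of the exported file, so the .csv holds only the movie columns.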
Time to see the Movie results! 🙂
Run your Python script from the command line with the python command.
You will now notice that a .csv file has been created: top_100_movies.csv
Thank you for following my tutorial! 🙂
If you have any questions, or would like to share some of your cool projects using Python, DataFrames, BeautifulSoup, and NumPy, feel free to share in the comments section.