How-to: Build a Python Web Scraper to capture IMDb Top-100 Movies

Some background

I love movies, and all things cinema-related 📽 🎞️ 🎬️🍿🎫.

I truly think it is immensely cool how so much painstaking and beautiful work goes into making movies. Years and years of hard work, involving meticulous planning, growing expenses, constant deliberation and thinking.

All of that hard work gets compressed into a movie that runs, on average, about an hour and a half.

And then it takes the audience only a few minutes, on average, to decide whether a movie was entertaining or boring. Realistic or unrealistic. Interesting or a waste of time. Inspiring or lame.

Ummm, ok, aaannd? 🤷‍♂️

Switching back from my dream movie-land to the real-life software world: last year, I found myself learning Python and data engineering. As I settled into my new data engineering role, I wanted to play around with some of the libraries and tools that are widely used in Python data processing.

So I looked up what each of those libraries did. Once I had read their docs, I felt I knew enough to play around with them, and I built a little Web 🕸️ Scraper.

At a high level, the Web Scraper we will build in this tutorial does the following:

Short Description:

This will be a Python script that calls out to IMDb, grabs the HTML elements for the Top-100 movies of all time, and saves the results to a .csv file.

Step-by-Step Breakdown:

  1. Import all of the required libraries.
  2. Call out to IMDb.
  3. Save the HTML elements off the IMDb page to a results object.
  4. Create a movie_soup BeautifulSoup object from the text of the results.
  5. Create lists to hold each attribute extracted from the HTML:
       - name
       - year
       - runtime
       - rating
       - metascore
       - number of votes
       - gross earnings
  6. Create a movie_div object to find all div objects in movie_soup.
  7. Loop through each object in movie_div.
  8. Append each attribute's value to its list.
  9. Build a movies DataFrame.
  10. Store all attributes in the movies DataFrame.
  11. Use pandas str.extract to pull the digits out of each string value and convert the result to type int.
  12. Export the results to a pretty little top_100_movies.csv file.
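
Steps 1 through 8 can be sketched roughly as follows. Treat this as an illustration under assumptions, not the exact script: the requests library, the helper names (fetch_html, text_or_none, parse_movies), and the dictionary keys are my own picks for this sketch. Above all, the HTML class names are assumptions based on IMDb's older list layout. IMDb changes its markup from time to time, so inspect the live page and adjust the selectors before relying on them.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Steps 2-3: call out to the page and return its HTML as text."""
    # Asking for English keeps the movie titles in English.
    headers = {"Accept-Language": "en-US,en;q=0.5"}
    results = requests.get(url, headers=headers, timeout=30)
    results.raise_for_status()  # stop early on 4xx/5xx responses
    return results.text


def text_or_none(tag):
    """Return a tag's stripped text, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_movies(html):
    """Steps 4-8: parse the HTML and fill one list per attribute."""
    movie_soup = BeautifulSoup(html, "html.parser")

    titles, years, runtimes = [], [], []
    ratings, metascores, votes, gross = [], [], [], []

    # ASSUMPTION: every movie sits in a <div> carrying the class "lister-item".
    movie_div = movie_soup.find_all("div", class_="lister-item")

    for container in movie_div:
        # ASSUMPTION: all tag and class names below come from the older layout.
        titles.append(text_or_none(container.select_one("h3 a")))
        years.append(text_or_none(container.find("span", class_="lister-item-year")))
        runtimes.append(text_or_none(container.find("span", class_="runtime")))
        ratings.append(text_or_none(container.find("strong")))
        metascores.append(text_or_none(container.find("span", class_="metascore")))

        # Votes and gross share the same tag: <span name="nv">, votes first.
        nv = container.find_all("span", attrs={"name": "nv"})
        votes.append(text_or_none(nv[0]) if len(nv) > 0 else None)
        gross.append(text_or_none(nv[1]) if len(nv) > 1 else None)

    return {
        "movie": titles,
        "year": years,
        "runtime_min": runtimes,
        "imdb_rating": ratings,
        "metascore": metascores,
        "votes": votes,
        "us_gross_millions": gross,
    }
```

You would call parse_movies(fetch_html(url)) with the URL of the IMDb listing you want to scrape. Every value comes back as raw text (or None when a movie has no metascore or gross figure), so all seven lists stay the same length and line up row by row.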

Without further ado, lights, camera, action!💡 🎥 🎬

👨‍💻️ Code and Step-By-Step Instructions: 📖
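
Here is a minimal sketch of steps 9 through 12, assuming the scraped values arrive as lists of raw strings such as "(1994)", "142 min", "2,343,110", and "$28.34M". The function names (build_movies_frame, save_movies) and the column names are hypothetical choices for this sketch, so rename them to match your own script.

```python
import pandas as pd


def build_movies_frame(columns):
    """Steps 9-11: build the movies DataFrame and turn text into numbers."""
    movies = pd.DataFrame(columns)

    # "(1994)" or "(I) (2001)" -> 1994 / 2001: keep the first 4-digit run.
    movies["year"] = movies["year"].str.extract(r"(\d{4})", expand=False).astype(int)

    # "142 min" -> 142
    movies["runtime_min"] = (
        movies["runtime_min"].str.extract(r"(\d+)", expand=False).astype(int)
    )

    # "2,343,110" -> 2343110: the commas split the digits, so strip them
    # instead of extracting (str.extract would stop at the first comma).
    movies["votes"] = movies["votes"].str.replace(",", "", regex=False).astype(int)

    movies["imdb_rating"] = movies["imdb_rating"].astype(float)

    # Metascore and gross can be missing, so keep them as floats with NaN gaps.
    movies["metascore"] = pd.to_numeric(movies["metascore"], errors="coerce")
    movies["us_gross_millions"] = pd.to_numeric(
        movies["us_gross_millions"].str.extract(r"([\d.]+)", expand=False),
        errors="coerce",
    )
    return movies


def save_movies(movies, path="top_100_movies.csv"):
    """Step 12: export the DataFrame without the numeric row index."""
    movies.to_csv(path, index=False)
```

In the script itself, you would end with something like save_movies(build_movies_frame(columns)), where columns is the dictionary of lists collected while looping over the page. Columns that can have gaps stay as floats because a plain int column cannot hold a missing value.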

Time to see the Movie results! 🙂

Run your Python script using the following command:

python imdb-movie-scraper.py

Voilà!

You should now see a newly created .csv file: top_100_movies.csv
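
If you want a quick sanity check on that file, something like the sketch below reads it back with pandas and reports anything odd. The function name check_movies_csv and the specific checks are my own suggestions, not part of the original script.

```python
import pandas as pd


def check_movies_csv(path="top_100_movies.csv", expected_rows=100):
    """Read the CSV back and return a list of problems (empty means it looks fine)."""
    movies = pd.read_csv(path)
    problems = []

    # A Top-100 scrape should produce exactly 100 rows.
    if len(movies) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(movies)}")

    # Identical rows usually mean the same page was scraped twice.
    if movies.duplicated().any():
        problems.append("duplicate rows found")

    # A column with no values at all usually means a selector stopped matching.
    for column in movies.columns:
        if movies[column].isna().all():
            problems.append(f"column '{column}' is completely empty")

    return problems
```

An empty list back from check_movies_csv() means the row count matches and no column came out blank. A completely empty column is usually the first sign that the page's HTML has changed since the scraper was written.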

Conclusion 🎭

Thank you for following my tutorial! 🙂

If you have any questions, or would like to share some of your cool projects using Python, DataFrames, BeautifulSoup, and NumPy, feel free to share in the comments section.

Related Reading on Data Analytics & Engineering 📖

If you want to learn more about Data Analytics & Engineering, check out my series linked below! 👇

Data Engineering & Cloud

Abdul Wahab

Multi-disciplinary Software Engineer specialized in building products users love. Today, I manage & secure big data in the cloud. All views shared are my own.