TMDb Movie Dataset

Is it possible to predict the commercial success of a movie before its release? Have certain film studios found a magic formula? Given that quality content can cost over $100 million to produce and still flop (anyone remember The Lone Ranger?), this question is more important than ever to an industry upended by streaming services and free online content. This dataset is a great place to start digging into those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.

The Project

The project uses a dataset of about 10,000 movies collected from The Movie Database (TMDb). The dataset was downloaded from Kaggle, and the analysis was conducted in a Jupyter Notebook running a Python kernel.

What We Learned

  • Using the plot method to build histograms

  • Using plotting.scatter_matrix to build scatter plot visualisations

  • Changing the figsize of a chart to make it more readable, and adding a ‘;’ to the end of the plotting line to suppress the unwanted text output

  • Renaming data frame columns in Pandas

  • Using the groupby and query methods to aggregate and filter selections of data

  • Creating line charts, bar charts, and heatmaps in Matplotlib, and using Seaborn to augment the visuals

  • Using lambda functions to wrangle data formats
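
Several of the techniques listed above can be sketched in a few lines. This is a minimal illustration rather than the project's actual code: the tiny data frame, its column names, and its values are made up for the example, and the two-digit-year date format is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import pandas as pd

# Hypothetical sample data; titles, column names, and numbers are illustrative only
movies = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "release_date": ["6/9/15", "5/13/15", "3/18/14", "12/15/14"],
    "budget_adj": [150.0, 40.0, 10.0, 200.0],
    "revenue_adj": [500.0, 30.0, 45.0, 180.0],
})

# Rename columns to shorter, friendlier names
movies = movies.rename(columns={
    "original_title": "title",
    "budget_adj": "budget",
    "revenue_adj": "revenue",
})

# Use a lambda to wrangle a date string into a numeric release year
# (assumes every year is written with two digits and falls in the 2000s)
movies["release_year"] = movies["release_date"].apply(
    lambda d: 2000 + int(d.split("/")[-1])
)

# query: keep only the films that earned more than they cost
profitable = movies.query("revenue > budget")

# groupby: average revenue per release year
mean_revenue_by_year = movies.groupby("release_year")["revenue"].mean()

# Histogram via the plot method, with a larger figsize for readability.
# In a notebook, a bare plotting call on the last line of a cell echoes its
# return value as text; ending that line with ';' suppresses the echo.
ax = movies["budget"].plot(kind="hist", figsize=(8, 6), title="Budget distribution")

# Scatter matrix of the numeric columns
axes = pd.plotting.scatter_matrix(movies[["budget", "revenue"]], figsize=(8, 8))
```

With this sample, `profitable` holds Film A and Film C, and `mean_revenue_by_year` gives 112.5 for 2014 and 265.0 for 2015.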

The Code and the Report
