TMDb Movie Dataset
Is it possible to predict the commercial success of a movie before its release? Have certain film studios found a magic formula? Given that quality content can cost over $100 million to produce and still flop (anyone remember The Lone Ranger?), this question is more important than ever to an industry upended by streaming services and free online content. The TMDb dataset is a great place to start digging into those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.
The Project
The project uses a dataset of about 10,000 movies collected from The Movie Database (TMDb). The dataset was downloaded from Kaggle, and the analysis was conducted in a Jupyter Notebook running a Python kernel.
What We Learned
Using the plot method to build histograms
Using plotting.scatter_matrix to build scatter plot visualisations
Changing the figsize of a chart to make it more readable and adding a ‘;’ to the end of the line to suppress unwanted text output
Renaming data frame columns in Pandas
Using the groupby and query methods to filter and aggregate selections of data
Creating line charts, bar charts, and heatmaps in Matplotlib and using Seaborn to augment the visuals
Using lambda functions to wrangle data formats
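To make the techniques listed above concrete, here is a minimal sketch of several of them on a tiny hand-made data frame. It is not the notebook's actual code: the column names (original_title, release_year, budget, revenue, genres), the pipe-separated genre format, and every value are assumptions invented for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the TMDb data. Column names and
# values are invented so that the sketch runs on its own.
df = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "release_year": [2010, 2010, 2011, 2011],
    "budget": [100.0, 20.0, 50.0, 10.0],
    "revenue": [250.0, 15.0, 120.0, 40.0],
    "genres": ["Action|Adventure", "Drama", "Comedy|Romance", "Drama|Thriller"],
})

# Rename a data frame column.
df = df.rename(columns={"original_title": "title"})

# Use a lambda to wrangle a text format: keep only the first
# pipe-separated genre of each film.
df["main_genre"] = df["genres"].apply(lambda s: s.split("|")[0])

# query selects rows; groupby then aggregates the selection.
profitable = df.query("revenue > budget")
mean_revenue_by_year = profitable.groupby("release_year")["revenue"].mean()

# Histogram via the plot method; figsize controls the chart size.
# In a notebook, a trailing ';' on the last line of a cell suppresses
# the text representation of the returned object.
ax = df["revenue"].plot(kind="hist", figsize=(8, 6), title="Revenue");

# Pairwise scatter plots of the numeric columns.
axes = pd.plotting.scatter_matrix(df[["budget", "revenue"]], figsize=(8, 8));
```

In a Jupyter Notebook the two charts render inline below the cell; in a plain script they would need matplotlib.pyplot.show() to be displayed.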
The Code and the Report
GitHub repository for the data and the Jupyter Notebook
The PDF report can also be found here