TV Show and Movie Popularity Analysis

Authors

Christian Garduno

Roshan Mehta

Colton Rowe

Published

June 2, 2023

Abstract

The present article analyzes a dataset of titles on different streaming services and builds a model to predict the popularity of an arbitrary title. The datasets were sourced from Kaggle, and the specific links can be found on the Data tab above. They provide valuable insight into the popularity and characteristics of the titles available on the streaming platforms and will help us predict the outcomes of future titles.

Problem Statement

The streaming services industry has seen rapid growth in recent years and has undoubtedly revolutionized the way we consume content. With numerous platforms offering a vast array of media types, it has become increasingly important for creators and streaming platforms to understand what makes a certain movie or TV show popular. This project aims to explore the Kaggle dataset and, by fitting a series of different models, uncover the key factors that contribute to a title's popularity.

Outcomes and Looking Forward

We developed a web app that predicts a score for a show or movie given particular inputs. The hope is to give producers an idea of how popular a show or movie will be before they spend the time and money on producing it. This analysis is important because it shows how data can be used to create media people actually want.

Our dataset has a description column that describes each show and movie. It would have been interesting to use word vectorization and natural language processing (NLP) to quantify this; adding an NLP component would be one avenue for expanding on this project. A neural network may also have been an interesting model to apply. Neural networks are considered highly flexible but not very interpretable, so adding this model to the app may or may not have been beneficial. In a practical sense, the model should be somewhat interpretable in order to use it with the web app; however, predictive power is beneficial in its own right.

Source

We combined the four datasets below, found on Kaggle, to get a comprehensive dataset of movies and TV shows available on the major streaming platforms.

Variable Description

The dataset contains both categorical and numeric variables which may provide insights for our analysis. Here is a brief description of the key variables:

  • type: Indicates whether the show is a TV series or a movie.
  • title: The title of the show or movie.
  • director: The name of the director(s) of the show.
  • cast: The names of the main cast members.
  • country: The countries the show was released in.
  • release_year: The year when the show was released.
  • rating: The content rating assigned to the show (TV-14, PG-13, etc.).
  • duration: The number of seasons (for TV series) or the duration (minutes) of the movie (for movies).
  • listed_in: The genre(s) or categories the show belongs to.
  • description: A brief summary or description of the show.
  • score: Rating of the show or movie - scraped from IMDb.
  • director_score: Calculated score based on the directors of the title.

Together, these variables provide a set of features that allow us to analyze and understand the characteristics that determine the popularity of TV shows and movies. By exploring these variables and their relationships, we can gain insight into the factors that contribute to a title's popularity, such as the impact of different genres or countries of origin, and begin to create better shows that more people will watch.

Cleaning

Original Data

Because the four original datasets came from Kaggle in the same format, we were able to merge them into one dataset that encompassed all of the previously mentioned streaming sites. The main area of concern with this dataset was the missing IMDb scores, as that is the key metric we used for our modeling. Another concern was that directors and cast members were stored in lists, which would prove difficult to feed to our model.

API

To fill in the missing IMDb scores we used an API that can be found here. To use the API, we wrote a Python script with the requests library that queries the API with the title of the show or movie we wanted the score for. The API searches its internal database and returns the IMDb score, which our script then writes back into our dataset.
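A minimal sketch of that script, assuming a hypothetical endpoint and JSON field names (the real API URL and response shape may differ):

Code
import requests

API_URL = "https://example-movie-api.com/search"  # placeholder, not the real endpoint

def fetch_score(title):
    """Query the API for a title and return its IMDb score, or None if not found."""
    response = requests.get(API_URL, params={"title": title}, timeout=10)
    if response.ok:
        return response.json().get("imdb_score")  # hypothetical field name
    return None

# Fill in only the rows that are missing a score
missing = data["score"].isnull()
data.loc[missing, "score"] = data.loc[missing, "title"].apply(fetch_score)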

Director and Cast data

Since we wanted to use each director and cast member as their own predictor, we had to compute the overall average IMDb score for each director and cast member. A challenge we overcame was that each row stored its cast members and directors as a list, which made it hard to compute the averages since we could not iterate over individual cast members or directors. To solve this, we split the lists so that each director or cast member got their own row, and from there we could iterate over them and compute their average scores. Computing the average score raises the question of data leakage, because values of the response would be used to build predictors. To prevent this, we split the entire dataset in half and computed the averages on only one half. We used the other half, now imputed with the average scores, for training and testing. After the averages were computed for the directors and cast members, the train and test datasets were merged back together to be fed into the models.
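A sketch of this split-and-average step for directors (cast members work the same way); the column names match our dataset, but the helper function is illustrative:

Code
import pandas as pd

# Use one half of the data to compute averages, the other half for modeling
score_half = data.sample(frac=0.5, random_state=42)
model_half = data.drop(score_half.index)

# Split the comma-separated director lists so each director gets their own row
exploded = score_half.assign(director=score_half["director"].str.split(", ")).explode("director")

# Average IMDb score per director, computed only on the scoring half
director_avg = exploded.groupby("director")["score"].mean()

def title_director_score(directors):
    """Average the per-director scores for a title's (possibly multiple) directors."""
    if pd.isnull(directors):
        return None
    scores = [director_avg[d] for d in directors.split(", ") if d in director_avg]
    return sum(scores) / len(scores) if scores else None

model_half["director_score"] = model_half["director"].apply(title_director_score)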

Cleaning Factors

For factors such as genre and rating, some values had multiple aliases, such as Not Rated and NR being the same value, but our one-hot encoding would read them as separate predictors. To prevent this, we came up with a normalized naming scheme: we searched for the unique values and merged all common aliases into a uniform name so they could be one-hot encoded.
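A sketch of that normalization using a hand-built alias map, shown for the rating column (the exact aliases are illustrative; the genre values were handled the same way):

Code
# Map each alias to one canonical name before one-hot encoding
alias_map = {
    "NR": "Not Rated",       # illustrative aliases
    "UNRATED": "Not Rated",
    "UR": "Not Rated",
}
data["rating"] = data["rating"].replace(alias_map)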

Overview

The response variable that we used for prediction was the IMDb score of the movie or TV show. After combining the four datasets we found on Kaggle, we discovered some of the shows had null values for their IMDb score. To replace these nulls we connected to an API that can be found here. With this API we were able to query by movie or show title and get its IMDb score. We then cleaned and formatted our data so that it could be vectorized for our models. Since cast, director, and country of origin were all stored as lists, we had to split each list value into its own rows. This splitting let us turn our categorical variables into numerical representations our models can read.

In this section we explore and visualize our dataset to gain a better understanding of what we are working with and to identify any obvious patterns, correlations, or trends.

Figures

Code
print("We have " + str(len(data)) + " total different movies and TV shows that we are working with.")
We have 4985 total different movies and TV shows that we are working with.
Code
print(data.shape)
print("There are a total of 89 columns that were are working with.")
(4985, 89)
There are a total of 89 columns that we are working with.
Code
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x=data["type"])

# Add labels and title
plt.xlabel("Type")
plt.ylabel("Count")
plt.title("Distribution of Titles by Type")

# Display the plot
plt.show()

Code
# 1. Line Chart of the Number of Titles Released per Year

# Group data by year and count number of titles
counts = data.groupby("release_year")["title"].count()

# Create line chart
#plt.figure(figsize=(10,10))
plt.plot(counts.index, counts.values)

# Add labels and title
plt.xlabel("Year")
plt.ylabel("Number of Titles Released")
plt.title("Number of Titles Released per Year")

# Display chart
plt.show()

Code
# 2. Histogram of the Distribution of Content Ratings

# Create histogram of content ratings
#plt.figure(figsize=(20,10))
plt.hist(data["rating"].dropna(), bins=10)

# Add labels and title
plt.xlabel("Content Rating")
plt.xticks(rotation=45)
plt.ylabel("Frequency")
plt.title("Distribution of Content Ratings")

# Display chart
plt.show()

Code
# 3. Bar Chart of the Number of Titles per Country (count > 10)

# Extract first country from country column
chart = data.copy()
chart["country"] = chart["country"].str.split(", ").str[0]

# Group data by country and count number of titles
counts = chart.groupby("country")["title"].count()
counts = counts[counts > 10].sort_values(ascending=False)

# Create bar chart
#plt.figure(figsize=(20,10))
plt.bar(counts.index, counts.values)

# Add labels and title
plt.xlabel("Country")
plt.xticks(rotation=90)
plt.ylabel("Number of Titles")
plt.title("Number of Titles per Country")

# Display chart
plt.show()

Code
# 4. Movie and Rating Scatter Plot

# Filter out TV shows and missing ratings
movies = data[(data["type"] == "Movie") & (data["score"].notnull())]

# Create scatter plot of IMDb rating vs. runtime
#plt.figure(figsize=(20,10))
plt.scatter(movies["score"], movies["duration"], alpha=0.5)

# Add labels and title
plt.xlabel("IMDb Rating")
plt.ylabel("Runtime (minutes)")
plt.title("IMDb Rating vs. Runtime for Movies")

# Display chart
plt.show()

Code
# 5. TV Show and Rating Scatter Plot

# Filter out Movies and missing ratings
tv = data[(data["type"] == "TV Show") & (data["score"].notnull())]

# Create scatter plot of IMDb rating vs. runtime
#plt.figure(figsize=(20,10))
plt.scatter(tv["score"], tv["duration"], alpha=0.5)

# Add labels and title
plt.xlabel("IMDb Rating")
plt.ylabel("Runtime (seasons)")
plt.title("IMDb Rating vs. Runtime for TV Shows")

# Display chart
plt.show()

Code
# 6. Top 15 Genres By Number of Titles

# Get list of genre columns
genre_cols = [col for col in data.columns if col.startswith("genre")]

# Sum the number of true values in each genre column to get the total number of titles for each genre
genre_counts = data[genre_cols].sum().sort_values(ascending=False)

# Get the top 15 genres by number of titles
top_genres = genre_counts[:15]

# Remove "genre." from the genre names in the x-axis labels
labels = [col.replace("genre.", "") for col in top_genres.index]

# Create bar chart
# plt.figure(figsize=(20, 10))
plt.bar(labels, top_genres.values, color='purple')

# Add labels and title
plt.xlabel("Genre")
plt.xticks(rotation=45, ha='right')
plt.ylabel("Number of Titles")
plt.title("Top 10 Genres by Number of Titles")

# Display chart
plt.show()

These visualizations revealed several intriguing patterns. Our data is split almost 80/20 between movies and TV shows. The number of titles released per year grows roughly exponentially, dipping only in 2020, when the COVID-19 pandemic limited production. The bar plots also show the distribution of ratings, major countries, and types of media within the dataset. We now have a much better picture of the data we are working with and can keep it in mind for our future models.

Overview

In this section, we describe the machine learning models employed for predicting the popularity score of TV shows and movies in our project. We utilized several models, including beta regression, decision tree, K-Nearest Neighbors (KNN), and random forest. Each of these models offers unique characteristics and can capture different aspects of the data to make accurate predictions.

Train-Test Split

To evaluate the performance of our machine learning models, we employed a train-test split approach. This process involves dividing our dataset, consisting of TV shows and movies, into two separate subsets: a training set and a testing set. The training set, which constitutes a majority of the data, was used to train our models to learn the underlying patterns and relationships between the features and the target variable, which in our case is the popularity score. The testing set, on the other hand, served as an unseen dataset to assess the models' generalization ability and determine their predictive performance on new, unseen instances. By randomly assigning the data points to the training and testing sets, we ensured that the evaluation process is unbiased and representative of real-world scenarios. After fitting each model on the training set and predicting on the test set, we obtain a root mean squared error (rMSE) value.

This metric gives the typical distance between our model's predictions and the true scores. We found that the response variable score has a sample standard deviation of 21.7, meaning that if we guessed the mean every time, we would be off by about 21.7 points on average. So our models should minimize the rMSE, and at the very least come in under 21.7.
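That baseline can be reproduced directly; a sketch assuming the cleaned data with its score column (the 80/20 split fraction is illustrative):

Code
import numpy as np
from sklearn.model_selection import train_test_split

X = data.drop(columns=["score"])
y = data["score"]

# Hold out 20% of titles as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the training mean; its rMSE is roughly the standard deviation
baseline = np.full(len(y_test), y_train.mean())
baseline_rmse = np.sqrt(np.mean((y_test - baseline) ** 2))
print(f"Baseline rMSE: {baseline_rmse:.2f}")  # should land near 21.7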

Additional Information

For all of these models, we defined the set of predictor variables to be the same. These predictors included genre information, duration, release year, type, rating, director’s average score, cast’s average score, and country.

Also, to evaluate all four models' performance and validate their predictive ability, we employed a cross-validation strategy. Code was written so that the dataset was divided into five subsets, or folds, with each fold serving as a testing set while the remaining four folds were used for training. The grid search was performed within this cross-validation framework, optimizing each model's hyperparameters based on the negative root mean squared error (RMSE) metric. By utilizing cross-validation, we obtained robust estimates of the models' performance and ensured the generalization of the results to unseen data.
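A sketch of that evaluation for a single pipeline (the pipeline, X_train, and y_train objects are assumed from the setup above); the grid searches shown later wrap this same five-fold scheme:

Code
from sklearn.model_selection import cross_val_score

# Five-fold CV scored by negative RMSE (sklearn maximizes scores, so RMSE is negated)
scores = cross_val_score(pipeline, X_train, y_train, cv=5,
                         scoring="neg_root_mean_squared_error")
print(f"Mean CV rMSE: {-scores.mean():.2f}")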

Modeling

Beta Regression

Beta regression is a statistical model specifically designed for modeling continuous variables bounded between 0 and 1. In our case, we used beta regression to predict the popularity score, which ranges from 0 to 100. Beta regression takes into account the distributional characteristics of the data and can handle the inherent boundedness of the popularity score, allowing for accurate predictions. It uses an underlying logistic function to transform a linear regression.

Figure: The sigmoid (logistic) function.

We decided to fit a beta regression as opposed to a linear regression because we wanted the model to predict values only within the range [0,100]. A beta regression allows us to specify this range.
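The idea behind this range restriction can be sketched with a logit/sigmoid target transform: squash scores from [0, 100] into (0, 1), fit on the transformed scale, and map predictions back. This stand-in uses a plain linear regressor and is not our actual BetaRegression implementation:

Code
import numpy as np
from scipy.special import logit, expit
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

EPS = 1e-3  # keep transformed targets strictly inside (0, 1)

def to_logit(y):
    return logit(np.clip(y / 100, EPS, 1 - EPS))

def from_logit(z):
    return expit(z) * 100  # predictions always land inside (0, 100)

bounded_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=to_logit, inverse_func=from_logit,
    check_inverse=False,  # clipping makes the inverse inexact at the boundaries
)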

Pipeline

Our pipeline was structured as follows:

Pipeline(steps=[('predictors',
                 ColumnSelector(columns=['genre.Crime', 'genre.Drama',
                                         'genre.Thriller', 'genre.Action',
                                         'genre.Horror',
                                         'genre.Science_Fiction', 'genre.Music',
                                         'genre.Reality', 'genre.Romance',
                                         'genre.Comedy', 'genre.Mystery',
                                         'genre.Documentary', 'genre.History',
                                         'genre.Teen',
                                         'genre.Health_&_Wellness',
                                         'genre.Lifestyle', 'genre.Culture',
                                         'genre.Blac...
                                         'genre.LGBTQ', 'genre.Adult_Animation',
                                         'genre.Sitcom', ...])),
                ('columntransform',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numericnaonehotencoder',
                                                  NumericNAOneHotEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x12d6d6520>),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['type', 'rating',
                                                   'country'])])),
                ('beta', BetaRegression(from_range=(0, 100)))])

First, we selected the predictors we wanted to use. Then we one-hot encoded the categorical variables and, at the same time, dealt with missing numeric values by encoding whether a value was missing and filling it with a 0 if it was. We added our model as the last step in the pipeline.
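ColumnSelector, NumericNAOneHotEncoder, and BetaRegression are custom classes from our project. A rough equivalent using only sklearn built-ins (SimpleImputer's add_indicator reproduces the missing-flag-plus-zero-fill behavior, and LinearRegression stands in for the beta regression) might look like:

Code
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

preprocess = ColumnTransformer(
    transformers=[
        # Flag missing numeric values, then fill them with 0
        ("numeric", SimpleImputer(strategy="constant", fill_value=0, add_indicator=True),
         make_column_selector(dtype_include="number")),
        # One-hot encode the categorical columns
        ("categorical", OneHotEncoder(handle_unknown="ignore"),
         ["type", "rating", "country"]),
    ],
    remainder="passthrough",
)

pipeline = Pipeline(steps=[("columntransform", preprocess),
                           ("model", LinearRegression())])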

Cross Evaluation

The best cross-evaluation score for this model was a root mean squared error of 19.26. We can loosely interpret this as saying our model is, on average, 19.26 IMDb rating points off from the actual score.

Results

Test rMSE: 19.71

The beta regression achieved a test rMSE of 19.71. This is about a 2 point difference from the sample standard deviation of the scores (21.7), which means the beta regression predicts about 2 rating points more accurately than guessing the mean, on average.

K-Nearest Neighbors

K-Nearest Neighbors is a non-parametric algorithm that makes predictions based on the similarity of a given data point to its k nearest neighbors in the feature space. In our case, KNN is used to predict the popularity score by finding the k most similar instances in the training data. KNN would be particularly suitable if similar shows or movies tend to have similar scores.

Figure: K-Nearest Neighbors.

Pipeline

Our pipeline was structured as follows:

Pipeline(steps=[('predictors',
                 ColumnSelector(columns=['genre.Crime', 'genre.Drama',
                                         'genre.Thriller', 'genre.Action',
                                         'genre.Horror',
                                         'genre.Science_Fiction', 'genre.Music',
                                         'genre.Reality', 'genre.Romance',
                                         'genre.Comedy', 'genre.Mystery',
                                         'genre.Documentary', 'genre.History',
                                         'genre.Teen',
                                         'genre.Health_&_Wellness',
                                         'genre.Lifestyle', 'genre.Culture',
                                         'genre.Blac...
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numericnaonehotencoder',
                                                  NumericNAOneHotEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x12d6b0d00>),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['type', 'rating',
                                                   'country'])])),
                ('knn',
                 TransformedTargetRegressor(regressor=KNeighborsRegressor(n_neighbors=17,
                                                                          weights='distance')))])

This is the same pipeline as in the beta regression, with the model swapped out for KNN.

Hyperparameter Tuning and Cross Evaluation

For KNN, we tuned two hyperparameters: n_neighbors and weights. n_neighbors controls the number of nearest neighbors used when making a prediction. weights controls how the points are averaged to make that prediction: uniformly or by distance. We found that the best value for n_neighbors was 17, and the best setting for weights was weighting by distance. The best cross-evaluation score for KNN was a root mean squared error of 20.48.
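A sketch of that search, with parameter paths taken from the step names in the printout above (the search range for n_neighbors is illustrative):

Code
from sklearn.model_selection import GridSearchCV

param_grid = {
    "knn__regressor__n_neighbors": range(1, 31, 2),
    "knn__regressor__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)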

Results

Test rMSE: 20.52

The K-Nearest Neighbors model achieved a test rMSE of 20.52. This is about a 1 point difference from the sample standard deviation of the scores (21.7), which means KNN predicts about 1 rating point more accurately than guessing the mean, on average.

Decision Tree Regressor

Decision trees are intuitive models that create a flowchart-like structure of decisions and their potential consequences. Each internal node represents a decision based on a specific feature, and each leaf node represents the predicted outcome (popularity score). Decision trees are capable of handling both numerical and categorical features, and they offer interpretability by visualizing the decision-making process.

Figure: A decision tree.
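A fitted tree's decision rules can be drawn directly; a minimal sketch using sklearn's plot_tree, assuming a fitted DecisionTreeRegressor named dtree and a feature_names list (both hypothetical here):

Code
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the first two levels of the tree's decision rules
plt.figure(figsize=(20, 10))
plot_tree(dtree, feature_names=feature_names, max_depth=2, filled=True)
plt.show()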

Pipeline

Our pipeline is structured in the same way as our other models:

Pipeline(steps=[('predictors',
                 ColumnSelector(columns=['genre.Crime', 'genre.Drama',
                                         'genre.Thriller', 'genre.Action',
                                         'genre.Horror',
                                         'genre.Science_Fiction', 'genre.Music',
                                         'genre.Reality', 'genre.Romance',
                                         'genre.Comedy', 'genre.Mystery',
                                         'genre.Documentary', 'genre.History',
                                         'genre.Teen',
                                         'genre.Health_&_Wellness',
                                         'genre.Lifestyle', 'genre.Culture',
                                         'genre.Blac...
                                         'genre.Sitcom', ...])),
                ('columntransform',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numericnaonehotencoder',
                                                  NumericNAOneHotEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x12d72e2e0>),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['type', 'rating',
                                                   'country'])])),
                ('dtree',
                 DecisionTreeRegressor(max_depth=10, min_samples_leaf=8))])

Hyperparameter Tuning and Cross Evaluation

For the decision tree, we chose to tune two hyperparameters: max_depth and min_samples_leaf. Both of these hyperparameters help control overfitting, which decision trees are notorious for if left unchecked. The decision tree achieved a cross-evaluation root mean squared error of 20.20.

Results

Test rMSE: 19.84

The Decision Tree achieved a test rMSE of 19.84. This is about a 2 point difference from the sample standard deviation of the scores (21.7), which means the decision tree predicts about 2 rating points more accurately than guessing the mean, on average.

Random Forest Regressor

Random forest is a model that combines multiple decision trees to make predictions, averaging across the individual tree predictions. Random forest can handle non-linear relationships, capture feature interactions, and effectively handle high-dimensional data. It provides robust predictions and reduces the risk of overfitting compared to individual decision trees.

Figure: A random forest.

Pipeline

Again, we set up our pipeline as follows:

Pipeline(steps=[('predictors',
                 ColumnSelector(columns=['genre.Crime', 'genre.Drama',
                                         'genre.Thriller', 'genre.Action',
                                         'genre.Horror',
                                         'genre.Science_Fiction', 'genre.Music',
                                         'genre.Reality', 'genre.Romance',
                                         'genre.Comedy', 'genre.Mystery',
                                         'genre.Documentary', 'genre.History',
                                         'genre.Teen',
                                         'genre.Health_&_Wellness',
                                         'genre.Lifestyle', 'genre.Culture',
                                         'genre.Blac...
                ('columntransform',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numericnaonehotencoder',
                                                  NumericNAOneHotEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x12d6d6610>),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['type', 'rating',
                                                   'country'])])),
                ('rf',
                 RandomForestRegressor(max_depth=10, min_samples_leaf=2,
                                       n_estimators=300))])

Hyperparameter Tuning and Cross Evaluation

Because a random forest contains many decision trees, you can tune both the hyperparameters of the forest and those of the trees. We chose to again tune max_depth and min_samples_leaf, plus an additional hyperparameter, n_estimators, which controls the number of trees in the forest. Typically, more trees means a better model, and that is exactly what we found here, with n_estimators = 300. The performance gains seemed to drop off around 300 trees, so we did not search for larger values of n_estimators. The cross-evaluation score of the random forest was a root mean squared error of 18.94.
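The forest's grid plugs into the same GridSearchCV call shown for KNN; a sketch with illustrative values (parameter paths assume the rf step name from the printout):

Code
param_grid = {
    "rf__n_estimators": [100, 200, 300],     # forest-level: number of trees
    "rf__max_depth": [5, 10, 20],            # tree-level: limits on each tree
    "rf__min_samples_leaf": [1, 2, 4],
}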

Results

Test rMSE: 19.03

The random forest achieved a test rMSE of 19.03. This is almost a 3 point difference from the sample standard deviation of the scores (21.7), which means the random forest predicts almost 3 rating points more accurately than guessing the mean, on average.

Here, you can input parameters for a movie or TV show and see each of our models' predictions for your show.