Game Success Prediction

Introduction

In this project, I attempt to create an application to predict the success of a video game based on its features. I will use data processing, feature engineering, machine learning, and visualization techniques to analyze the data and build a predictive model.

The raw data is a dataset of Steam games sourced from Kaggle.com. It includes various attributes such as game titles, platforms supported (Windows, Mac, Linux), features (Single-player, Multiplayer, etc.), and genres (Action, Adventure, etc.). The data also includes review scores and review counts.
Dataset

Overview

The project consists of the following steps:

Data Cleaning: Remove duplicates, missing values, and irrelevant columns.
Data Exploration: Analyze the data to understand its structure and relationships.
Feature Engineering: Create new features from existing ones to improve the model's performance.
Data Preprocessing: Prepare the data for modeling.
Model Selection and Training: Choose an appropriate ML algorithm and train the model on the training data.
Model Evaluation: Evaluate the model's performance on the test data.
Model Deployment: Deploy the model to make predictions on new data.

Data Cleaning

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42497 entries, 0 to 42496
Data columns (total 24 columns):
#   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
0   app_id                42497 non-null  int64  
1   title                 42497 non-null  object 
2   release_date          42440 non-null  object 
3   genres                42410 non-null  object 
4   categories            42452 non-null  object 
5   developer             42307 non-null  object 
6   publisher             42286 non-null  object 
7   original_price        4859 non-null   object 
8   discount_percentage   4859 non-null   object 
9   discounted_price      42257 non-null  object 
10  dlc_available         42497 non-null  int64  
11  age_rating            42497 non-null  int64  
12  content_descriptor    2375 non-null   object 
13  about_description     42359 non-null  object 
14  win_support           42497 non-null  bool   
15  mac_support           42497 non-null  bool   
16  linux_support         42497 non-null  bool   
17  awards                42497 non-null  int64  
18  overall_review        40020 non-null  object 
19  overall_review_%      40020 non-null  float64
20  overall_review_count  40020 non-null  float64
21  recent_review         5503 non-null   object 
22  recent_review_%       5503 non-null   float64
23  recent_review_count   5503 non-null   float64
dtypes: bool(3), float64(4), int64(4), object(13)
memory usage: 6.9+ MB

The initial dataset consists of 42,497 entries with 24 columns, including some with missing values and irrelevant information. After cleaning the data by removing duplicates and irrelevant columns and handling missing values, the dataset is reduced to a more manageable and relevant set of features.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42497 entries, 0 to 42496
Data columns (total 9 columns):
#   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
0   title                 42497 non-null  object 
1   genres                42410 non-null  object 
2   categories            42452 non-null  object 
3   win_support           42497 non-null  bool   
4   mac_support           42497 non-null  bool   
5   linux_support         42497 non-null  bool   
6   overall_review        40020 non-null  object 
7   overall_review_%      40020 non-null  float64
8   overall_review_count  40020 non-null  float64
dtypes: bool(3), float64(2), object(4)
memory usage: 2.1+ MB

Data Exploration

To better understand the data, I explore the relationships between various features and their impact on the target variable. Using visualizations such as histograms, scatter plots, and correlation matrices, I identify patterns and trends that would be important for building the model.

Feature Engineering

Feature engineering involves creating new features to improve model performance. I focus on transforming the 'categories' and 'genres' columns into binary columns for each category and genre. This transformation allows the model to better understand the presence or absence of specific game features. I also combine smaller or similar categories into broader ones to reduce dimensionality and improve model interpretability.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42497 entries, 0 to 42496
Data columns (total 23 columns):
#   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
0   title                       42497 non-null  object 
1   win_support                 42497 non-null  bool   
2   mac_support                 42497 non-null  bool   
3   linux_support               42497 non-null  bool   
4   overall_review              42497 non-null  float64
5   overall_review_%            40020 non-null  float64
6   overall_review_count        40020 non-null  float64
7   Single-player               42497 non-null  bool   
8   Cross-Platform Multiplayer  42497 non-null  bool   
9   VR Supported                42497 non-null  bool   
10  LAN Multiplayer             42497 non-null  bool   
11  Online Multiplayer          42497 non-null  bool   
12  Indie                       42497 non-null  bool   
13  Action                      42497 non-null  bool   
14  Adventure                   42497 non-null  bool   
15  Casual                      42497 non-null  bool   
16  Simulation                  42497 non-null  bool   
17  Strategy                    42497 non-null  bool   
18  RPG                         42497 non-null  bool   
19  Free to Play                42497 non-null  bool   
20  Sports                      42497 non-null  bool   
21  Racing                      42497 non-null  bool   
22  Massively Multiplayer       42497 non-null  bool   
dtypes: bool(19), float64(3), object(1)
memory usage: 2.1+ MB

Data Preprocessing

Before training the model, I convert verbose review scores to numerical values and consolidate games with low review scores into a single category, considering any score below 5 negatively received. I also removed games with extremely high review counts to prevent skewed predictions. Finally, I split the dataset into training and testing sets for model validation.

Model Selection and Training

I choose a classification model for this project to predict whether a game will succeed. I use the Multi-Layer Perceptron Classifier (MLPClassifier) from Scikit-learn, an artificial neural network suitable for classification tasks. The model architecture included two hidden layers, with 64 nodes in the first layer and 32 nodes in the second.

Model Evaluation

After training the model, I evaluate its performance using standard metrics such as accuracy, precision, recall, and F1 score. The results were as follows:

Accuracy: 64.59%
Precision: 58.23%
Recall: 64.60%
F1 Score: 58.53%

Model Deployment

The model is saved as a pickle file and can be loaded and used to make predictions. I deploy the model using Django, a high-level Python web framework, as a back-end API. The application's front-end is built using HTML, CSS, and JavaScript. The user can choose the game's features by toggling the checkboxes and clicking the 'Predict' button to see the prediction. The application is available here.

Conclusion

This project sdemonstrates how machine learning can be applied to predict the success of video games based on their features. By leveraging a diverse dataset of Steam games, we explored the relationships between game characteristics and their potential success, as measured by user reviews and other metrics.

The process involved extensive data cleaning, exploration, feature engineering, and preprocessing to ensure the quality and relevance of the data. Through careful model selection and training, we developed a Multi-Layer Perceptron Classifier that accurately predicted game success. Although there is room for improvement in model performance, the results are promising and provide a solid foundation for further development.

The model could be enhanced in the future by:

Completing the existing features in the dataset: Filling in missing genres for certain titles, which is crucial for improving the model's performance.
Incorporating more features: Include additional relevant features, such as user engagement metrics (e.g., average playtime, peak concurrent users), marketing efforts (e.g., advertising spend, trailers released), or social media sentiment analysis (e.g., Twitter mentions, Reddit discussions). These could provide a more comprehensive view of the factors influencing a game's success.
Expanding the dataset: Gather more data to increase the model’s robustness. A larger dataset with more diverse examples can help the model learn better generalizations.
Addressing Class Imbalance: If the dataset is imbalanced (e.g., there are more successful games than unsuccessful ones), use techniques such as oversampling the minority class, undersampling the majority class, or using synthetic data generation methods like SMOTE (Synthetic Minority Over-sampling Technique).
Hyperparameter Tuning: Optimize the model's hyperparameters using grid search or random search with cross-validation to find the best settings for model performance.

Overall, this project provides valuable insights into the factors contributing to a video game's success and demonstrates the potential of machine learning in the gaming industry. The deployed model offers a practical tool for developers and publishers to assess the potential success of their games based on key features, ultimately aiding in better decision-making and strategic planning.