This project aims to predict song popularity from lyrics, group similar songs, and find correlations among lyrics. MusiXmatch’s Million Song Dataset was used for lyrics and artist data. Google’s YouTube API and a YouTube web scraper were used to measure song popularity. The results showed that lyrics and song popularity are correlated, with an R² of 0.0920. This suggests it is feasible, if only roughly, to predict song popularity from lyrics.
Before a song is released, it can be hard to tell whether it will take off. The ability to predict a song's popularity from lyrics alone would help gauge its chances of success before release. Existing works have run into two major problems: they used curated datasets that require a complex knowledge of music, and they discounted the influence lyrics have on the popularity of music. Curated datasets are impractical for real-world use, as building one takes special skills. Here, the lyrics are simply sorted, so the resulting models can be used by anybody. Using these sorted lyrics, we are able to roughly predict the popularity of a song.
A total of 5858 songs were analyzed, along with their lyrics, artists, and popularity. Lyric and artist data were taken from MusiXmatch’s Million Song Dataset, which is available online. It contains lyrics in a bag-of-words format: each song has a list of the words it contains and how many times each word occurs. This list omits obscure words that fall outside the 5000 most common words that MusiXmatch has chosen.
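The bag-of-words layout described above can be read with a few lines of Python. This is a minimal sketch, not the loader used in this project. It assumes the text layout in which a line starting with `%` lists the vocabulary and each song line reads `track_id,mxm_id,index:count,...` with 1-based word indices. The function name `parse_mxm_lines` and the sample lines are hypothetical.

```python
from collections import Counter


def parse_mxm_lines(lines):
    """Parse bag-of-words lines into (vocabulary, {track_id: Counter}).

    Assumed layout: '#' lines are comments, one '%' line holds the
    comma-separated vocabulary, and each remaining line is
    'track_id,mxm_id,index:count,...' with 1-based word indices.
    """
    vocab = []
    songs = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("%"):
            vocab = line[1:].split(",")
            continue
        track_id, _mxm_id, *pairs = line.split(",")
        counts = Counter()
        for pair in pairs:
            idx, cnt = pair.split(":")
            counts[vocab[int(idx) - 1]] = int(cnt)  # indices assumed 1-based
        songs[track_id] = counts
    return vocab, songs


# Hypothetical sample in the assumed layout (not real dataset rows).
sample = [
    "# made-up example",
    "%i,the,you,love",
    "TRACK_A,1,1:3,4:2",
    "TRACK_B,2,2:5,3:1,4:1",
]
vocab, songs = parse_mxm_lines(sample)
print(songs["TRACK_A"])
```

A `Counter` returns 0 for words a song does not contain, which is convenient when the counts are later expanded into one fixed-width row per song.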
Popularity data were found using Google’s YouTube API. The YouTube search list function was used to search for the song's name and the artist’s name, followed by the word "song". The YouTube IDs of the first three videos found were passed to a YouTube web scraper to find each video's view count. Of the three view counts, the highest was used as the popularity measure. Taking the highest of the three avoids artificially low numbers from videos that rank at the top of the search because they were posted recently rather than because they have more views. A natural log was then applied to the popularity numbers.
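The last two steps, taking the highest of the three view counts and applying a natural log, can be sketched as follows. The function name `popularity_score` and the view counts are made up for illustration. The API search and scraping steps are deliberately left out.

```python
import math


def popularity_score(view_counts):
    """Return the natural log of the highest view count among the candidates."""
    if not view_counts:
        raise ValueError("need at least one view count")
    top = max(view_counts)
    if top <= 0:
        raise ValueError("the highest view count must be positive to take a log")
    return math.log(top)


# Made-up view counts for the first three search results of one song.
score = popularity_score([1200, 3_500_000, 48_000])
print(round(score, 3))
```

The log compresses the very wide range of view counts, so a handful of songs with enormous view counts do not dominate the regression target.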
All analysis was run in Spyder using Python. Data storage was handled with the pandas data analysis library, and machine learning was done with scikit-learn. The method can be split into three parts: predicting song popularity based on lyrics using interpretable machine learning, grouping songs using linear and non-linear dimensionality reduction, and predicting missing lyrics using interpretable machine learning.
Here are two runs of Random Forest Regressor. The X axis is the predicted values, and the Y axis is the real values.
Random forest is a meta-estimator. It draws multiple sub-samples of the dataset and fits a decision tree to each one. The estimator then averages the predictions of the individual trees.
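A minimal sketch of this setup with scikit-learn's `RandomForestRegressor` is shown below. The word counts and target are synthetic stand-ins, not the project's data, and the hyperparameters are illustrative assumptions rather than the settings used for the runs shown here. The final step checks that the forest's prediction is the average of its trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: 300 "songs", 20 word-count features.
X = rng.poisson(lam=1.0, size=(300, 20))
# Synthetic target that depends weakly on two of the words, plus noise.
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each tree is fit on a bootstrap sub-sample of the training rows.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R^2 on held-out songs: {r2:.3f}")

# The forest's prediction is the mean of the individual trees' predictions.
tree_mean = np.mean([tree.predict(X_test) for tree in model.estimators_], axis=0)
print(np.allclose(y_pred, tree_mean))
```

Scoring on held-out songs rather than on the training set matters here, because a forest can fit its training data far better than it generalizes.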
Predicting popularity based solely on lyrics is extremely difficult, compounded by the fact that lyric order or timing is not accounted for. Popularity is also at the whim of a host of other factors, such as artist popularity, genre, and tune. Nonetheless, the Random Forest Regressor was able to find a correlation between lyrics and song popularity.
Here are two runs of Bayesian Ridge. The X axis shows the predicted values, and the Y axis shows the real values.
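Bayesian ridge regression is a variant of ridge regression. Ridge regression estimates the coefficients of a multiple-regression model in cases where the independent variables are highly correlated. The Bayesian variant treats the coefficients as random variables with a prior distribution, so the strength of the regularization is estimated from the data.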
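As a rough illustration, scikit-learn's `BayesianRidge` can be fit on word counts in the same way as any other regressor. The data below is synthetic, and the true weights are invented so the fit can be checked. The default hyperparameters are an assumption, not necessarily those used in this project.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 "songs" with 20 word-count features.
X = rng.poisson(lam=1.0, size=(300, 20)).astype(float)
true_w = np.zeros(20)
true_w[0], true_w[1] = 0.5, -0.3  # invented weights for two words
y = X @ true_w + rng.normal(scale=0.5, size=300)

model = BayesianRidge()
model.fit(X, y)

# return_std=True also returns the predictive standard deviation per sample.
y_mean, y_std = model.predict(X[:5], return_std=True)
print(model.coef_[:2])
print(y_std)
```

Because the model is linear, each coefficient can be read directly as the estimated change in log popularity per extra occurrence of a word, which fits the interpretable approach taken here.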
The Bayesian ridge regression was given the same task as the Random Forest Regressor, and it was able to produce slightly stronger results. However, it is still held back by the same limitations as the Random Forest Regressor, such as not taking into account factors like artist popularity and publication date.
SHAP (SHapley Additive exPlanations) is a reliable way to interpret how significant each factor is to a model. SHAP changes the inputs of a model and observes how the output differs.
Here you can see the 20 most impactful words according to SHAP. How far left or right a point sits on the graph signifies the word's impact on the model output, and the color signifies its feature value.
A word like "love" has a positive SHAP value when its feature value is high, so it strongly increases the predicted popularity.
A word like "death" has a negative SHAP value when its feature value is high, so it strongly decreases the predicted popularity of a song.
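The idea behind these values can be illustrated with a brute-force Shapley computation on a toy model. This is a sketch of the underlying concept only. The `shap` library uses faster, model-specific algorithms, and this is not its implementation. The words, weights, counts, and baseline below are all hypothetical and are not results from this project.

```python
from itertools import permutations


def exact_shapley(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`.

    Features are switched from their baseline value to their actual value
    one at a time, in every possible order, and each feature's change in
    the output is averaged over all orders. Only feasible for a few features.
    """
    n = len(x)
    contrib = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]
            new = predict(current)
            contrib[i] += new - prev
            prev = new
    return [c / len(orders) for c in contrib]


# Hypothetical linear "popularity" model over three word counts.
words = ["love", "death", "night"]
weights = [0.4, -0.6, 0.1]  # invented for illustration


def predict(counts):
    return 10.0 + sum(w * c for w, c in zip(weights, counts))


x = [3, 2, 0]         # word counts of one made-up song
baseline = [1, 1, 1]  # made-up "average song" used as the reference point
phi = exact_shapley(predict, x, baseline)
for word, value in zip(words, phi):
    print(word, round(value, 3))
```

The values add up to the difference between the model's output for this song and for the baseline, which is the property that lets the plot above be read as a breakdown of each prediction.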
t-SNE is a way of grouping similar songs together while moving dissimilar songs farther apart. The input dataset is very high-dimensional, and it is reduced to 2D so it can be visualized. The color of each dot is the song's popularity. Some of these groupings of songs seem to share similar popularity. This could be because their lyrical similarity represents a particular genre of music that is either popular or not.
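A minimal sketch of the reduction step with scikit-learn's `TSNE` follows. The two synthetic groups of songs stand in for lyrically similar clusters, and the perplexity value is an illustrative assumption, not the setting used for the figure.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two synthetic groups of 50 "songs" that favor different halves of a
# 20-word vocabulary, standing in for lyrically distinct clusters.
lam_a = np.r_[np.full(10, 3.0), np.full(10, 0.2)]
lam_b = np.r_[np.full(10, 0.2), np.full(10, 3.0)]
X = np.vstack([
    rng.poisson(lam_a, size=(50, 20)),
    rng.poisson(lam_b, size=(50, 20)),
]).astype(float)

# Reduce the 20-dimensional word counts to 2 dimensions for plotting.
# Perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X)
print(embedding.shape)
```

Each row of `embedding` is one song's 2D position. A scatter plot of these points colored by log popularity would give a figure of the kind described above.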