Spotify

Predicting the success of your playlist and providing personalized recommendations

Steps of recommendation:

  • Get the user specified genre: E.g. pop
  • Get the user specified number of tracks: E.g. N
  • Get all tracks of this genre from the tracks database
  • Sort the filtered tracks by popularity in descending order and keep the N+2 most popular as the candidate pool
  • Combinatorially generate (N+2 choose N) different playlists as the recommendation candidates
  • Use our fitted regression model to predict the num_followers of each recommendation candidate
    • We used the Meta Linear Regression model with main predictors and interaction terms for recommendation here. This model has the highest test $R^2$ score among all our models.
  • Return the playlist that has the highest predicted value of num_followers as the final recommendation
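The candidate-generation step above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name `candidate_playlists` and the track IDs are hypothetical, and the input list is assumed to be sorted from most to least popular already.

```python
from itertools import combinations
from math import comb


def candidate_playlists(track_ids_by_popularity, n):
    """Return every n-track subset of the n + 2 most popular tracks."""
    pool = track_ids_by_popularity[: n + 2]
    return list(combinations(pool, n))


# Hypothetical track IDs, sorted from most to least popular.
ranked = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10"]
candidates = candidate_playlists(ranked, 6)

# The number of candidates is C(N + 2, N) = (N + 2)(N + 1) / 2.
print(len(candidates), comb(6 + 2, 6))
```

Because the pool is only two tracks larger than the playlist, the candidate count grows quadratically in N (28 for N = 6, 66 for N = 10, 91 for N = 12), which keeps the prediction step cheap. It also means all candidates share most of their tracks.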

Strategy of validating the recommendation:

  • Find the most similar playlist with the same genre as the user specified one from the playlists database
    • Definition of similarity: the number of tracks in set(recommended_playlist_tracks) $\cap$ set(existing_playlist_tracks)
  • See the actual ranking of the most similar playlist within the user specified genre
    • High rank within genre indicates a good recommendation
    • Low rank within genre indicates a poor recommendation
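As a rough sketch of the similarity search described above: the helper names `overlap` and `most_similar` and the toy playlists are hypothetical, and each existing playlist is assumed to be a dictionary with `id` and `track_ids` keys.

```python
def overlap(recommended_ids, existing_ids):
    """Number of tracks shared by the two playlists."""
    return len(set(recommended_ids) & set(existing_ids))


def most_similar(recommended_ids, playlists):
    """Return the playlist with the largest overlap (the first one wins ties)."""
    return max(playlists, key=lambda p: overlap(recommended_ids, p["track_ids"]))


# Hypothetical data for illustration only.
recommended = ["a", "b", "c", "d"]
existing = [
    {"id": "pl1", "track_ids": ["a", "x", "y"]},
    {"id": "pl2", "track_ids": ["a", "b", "c", "z"]},
    {"id": "pl3", "track_ids": ["q"]},
]

best = most_similar(recommended, existing)
print(best["id"], overlap(recommended, best["track_ids"]))
```

Maximizing the overlap is equivalent to minimizing the number of recommended tracks that are absent from the existing playlist, since |R − E| = |R| − |R ∩ E|. Note that this count is not normalized by playlist length, so a very long existing playlist can score well simply by containing many tracks.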

Summary of findings: The performance of our recommender system varies considerably: the within-genre ranking of the most similar playlist is unstable across the examples below (for pop, it is 596, 2240, and 23 out of 2738 playlists for N = 6, 10, and 12). This variance could arise because 1) the model predicting the number of followers does not have sufficient predictive power, or 2) our similarity metric is too coarse to find a genuinely similar playlist in the existing pool of playlists. We suspect both reasons contributed to the variance we observed.

Define helper functions to access the database

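import collections
import copy
import json
from copy import deepcopy
from itertools import combinations

import joblib
import numpy as np
import pandas as pd
from IPython.display import display

"""
Function to get track features and return a playlist dictionary with track features
"""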
def extract_track_features(tracks, playlists):
    processed_playlists = copy.deepcopy(playlists)
    
    missing_counts = 0
    # Loop over each playlist
    for index, playlist in enumerate(processed_playlists):
        track_feature_keys = ['acousticness', 'album_id', 'album_name', 'album_popularity','artists_genres', 
                              'artists_ids', 'artists_names', 'artists_num_followers', 'artists_popularities',
                              'avg_artist_num_followers', 'avg_artist_popularity', 'danceability', 'duration_ms',
                              'energy', 'explicit', 'instrumentalness', 'isrc', 'key', 'liveness', 
                              'loudness', 'mode', 'genre', 'name', 'num_available_markets',
                              'popularity', 'speechiness', 'std_artist_num_followers', 'std_artist_popularity',
                              'tempo', 'time_signature', 'valence']
        
        # new entries of audio features for each playlist as a list to append each track's audio feature
        for track_feature_key in track_feature_keys:
            playlist['track_' + track_feature_key] = []
        
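        # select the rows of the tracks table that belong to this playlist
        selected_tracks = tracks[tracks['trackID'].isin(playlist['track_ids'])]
        # count the playlist's tracks that have no row in the tracks table
        missing_counts += len(set(playlist['track_ids']) - set(selected_tracks['trackID']))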
        for j, track in selected_tracks.iterrows():
            # append each track's audio feature into the playlist dictionary
            for track_feature_key in track_feature_keys:
                if track_feature_key in list(selected_tracks.columns):
                    playlist['track_' + track_feature_key].append(track[track_feature_key])
        processed_playlists[index] = playlist
    print('tracks that are missing : {}'.format(missing_counts))
    return processed_playlists

"""
Function to build playlist dataframe from playlists dictionary with track features
"""
def build_playlist_dataframe(playlists_dictionary_list):
    
    # features to take the avg and std
    features_avg = ['track_acousticness', 'track_avg_artist_num_followers', 'track_album_popularity',
                    'track_avg_artist_popularity', 'track_danceability', 'track_duration_ms', 
                    'track_energy', 'track_explicit', 'track_instrumentalness','track_liveness', 
                    'track_loudness', 'track_mode', 'track_num_available_markets',
                    'track_std_artist_num_followers', 'track_std_artist_popularity',
                    'track_popularity', 'track_speechiness', 'track_tempo', 'track_valence'
                   ]          
                      
    # features to take the mode, # of uniques
    features_mode = ['track_key','track_time_signature', 'track_genre']

    # features as is
    features = ['collaborative', 'num_followers', 'num_tracks']

    processed_playlists = {}

    for index, playlist in enumerate(playlists_dictionary_list):
        playlist_info = {} 
    #     playlist_info['id'] = playlist['id']

        for key in playlist.keys():
            if key in features_avg: # take avg and std
                playlist_info[key + '_avg'] = np.mean(playlist[key])
                playlist_info[key + '_std'] = np.std(playlist[key])
                if key in set(['track_popularity', 'track_album_popularity', 'track_avg_artist_popularity']):
                    playlist_info[key + '_max'] = max(playlist[key])
            elif key in features_mode: # take mode
                if playlist[key]:
                    if key == 'track_artists_genres':
                        flatten = lambda l: [item for sublist in l for item in sublist]
                        flattened_value = flatten(playlist[key])
                        if flattened_value:
                            counter = collections.Counter(flattened_value)
                            playlist_info[key + '_mode'] = counter.most_common()[0][0]
                            playlist_info[key + '_unique'] = len(set(flattened_value))
                    else:
                        counter = collections.Counter(playlist[key])
                        playlist_info[key + '_mode'] = counter.most_common()[0][0]
                        playlist_info[key + '_unique'] = len(set(playlist[key]))
            elif key in features:
                playlist_info[key] = playlist[key]

        processed_playlists[index] = playlist_info
    df = pd.DataFrame(processed_playlists).T
    df.rename(columns = {'track_genre_mode': 'genre'}, inplace = True)
    
    return df

def load_data(file):
    with open(file, 'r') as fd:
        data_from_json = json.load(fd)
        return data_from_json
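
The aggregation performed by build_playlist_dataframe can be illustrated on a single toy playlist. The sketch below uses only the standard library rather than NumPy, and the helper names and values are hypothetical. `statistics.pstdev` is used because `np.std` computes the population standard deviation by default.

```python
import statistics
from collections import Counter


def summarize_numeric(values):
    """Playlist-level average, population std, and max of a per-track feature."""
    return {
        "avg": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "max": max(values),
    }


def summarize_categorical(values):
    """Most frequent value and number of distinct values of a per-track feature."""
    counts = Counter(values)
    return {"mode": counts.most_common(1)[0][0], "unique": len(counts)}


# Hypothetical per-track values for one playlist.
track_popularity = [80, 90, 100]
track_key = [0, 7, 7, 2]

print(summarize_numeric(track_popularity))
print(summarize_categorical(track_key))
```

Each numeric track feature thus contributes two or three playlist-level columns, and each categorical feature contributes a mode and a count of unique values.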

Load Data

tracks_db = load_data('../../data_archive/tracks.json')
tracks_df = pd.read_csv('../../data/tracks.csv')
playlists_db = load_data('../../data_archive/playlists_from_200_search_words.json')
playlists_df = pd.read_csv('../../data/playlists_with_id.csv')

X_train_main = pd.read_csv('../../data/X_train_main.csv', index_col = 0)
X_train_int = pd.read_csv('../../data/X_train_int.csv', index_col = 0)

Functions to make a recommendation

def generate_playlists(genre, num_tracks):
    num_tracks_pool = num_tracks + 2
    
    # Keep the tracks of this genre, sort them by popularity in descending order, and take the top num_tracks + 2
    tracks_filtered = tracks_df[tracks_df['genre'] == genre].sort_values('popularity', ascending = False).iloc[:num_tracks_pool]
    
    # generate combinations of track ids from the filtered tracks
    combination_tracks_ids = list(combinations(list(tracks_filtered['trackID']), num_tracks))
    
    # build the list of dictionaries consisting of tracks ids of size user_input_tracks
    playlist_track_ids = [{'track_ids':track_ids, 'num_tracks':num_tracks_pool} for track_ids in combination_tracks_ids]
    
    # extract the audio features for each playlists in the list of dictionaries
    processed_playlists = extract_track_features(tracks_df, playlist_track_ids)
    
    # build playlists from processed_playlists
    gen_playlists_df = build_playlist_dataframe(processed_playlists)
    
    return gen_playlists_df, playlist_track_ids

def preprocess_playlist_candidates(candidates_playlists_df):
    '''
    1. One-hot encode the categorical variables
    2. Add columns to match X_train_main
    3. Add interaction terms
    4. Standardize the numerical predictors
    '''
    # ======= One-hot encoding =======
    categorical_predictors = ['genre', 'track_time_signature_mode', 'track_key_mode']
    df_encoded = pd.get_dummies(candidates_playlists_df, prefix = categorical_predictors, columns = categorical_predictors)
    
    
    # ======= Add columns to match X_train_main =======
    # Keep exactly the columns of X_train_main, in the same order;
    # columns that the candidates lack (e.g. other genres' dummies) are filled with 0
    df_encoded_full = df_encoded.reindex(columns = X_train_main.columns, fill_value = 0)
    
    
    # ======= Add interaction terms =======
    audio_features_avg = ['track_acousticness_avg', 'track_album_popularity_avg', 'track_danceability_avg',
                    'track_duration_ms_avg', 'track_energy_avg', 'track_explicit_avg', 
                    'track_instrumentalness_avg', 'track_liveness_avg', 'track_loudness_avg', 'track_mode_avg', 
                    'track_speechiness_avg', 'track_tempo_avg', 'track_valence_avg']
    genres = ['genre_blues', 'genre_classical', 'genre_country', 'genre_dance',
           'genre_edm', 'genre_elect', 'genre_folk', 'genre_funk', 'genre_hip hop',
           'genre_house', 'genre_indie', 'genre_indie pop', 'genre_jazz',
           'genre_korean pop', 'genre_mellow', 'genre_metal', 'genre_other',
           'genre_pop', 'genre_pop christmas', 'genre_pop rap', 'genre_punk',
           'genre_r&b', 'genre_rap', 'genre_reggae', 'genre_rock', 'genre_soul',]

    cross_terms = audio_features_avg + genres
    
    df_encoded_int = deepcopy(df_encoded_full)
    for feature in audio_features_avg:
        for genre in genres:
            df_encoded_int[feature+'_X_'+genre] = df_encoded_int[feature] * df_encoded_int[genre]

    
    # ======= Standardize the numerical predictors =======
    df_recommendation_int = deepcopy(df_encoded_int)
    for col in X_train_int.columns:
        if not np.logical_or((df_recommendation_int[col]==0), ((df_recommendation_int[col]==1))).all():
            mean_train = X_train_int[col].mean()
            std_train = X_train_int[col].std()
            df_recommendation_int[col] = (df_recommendation_int[col] - mean_train) / std_train

    return df_recommendation_int
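
To make the preprocessing steps concrete, here is a simplified standard-library sketch on a single candidate row. All names and numbers are hypothetical. It standardizes only one main predictor, whereas the function above also standardizes the non-binary interaction columns. `statistics.stdev` is used because pandas' `Series.std()` computes the sample standard deviation by default.

```python
import statistics


def align_columns(row, train_columns):
    """Keep only the training columns, filling absent ones with 0."""
    return {col: row.get(col, 0) for col in train_columns}


def add_interactions(row, features, genres):
    """Add feature-by-genre product terms named '<feature>_X_<genre>'."""
    out = dict(row)
    for feature in features:
        for genre in genres:
            out[feature + "_X_" + genre] = row[feature] * row[genre]
    return out


def standardize(value, train_values):
    """Scale with the training mean and sample std, never the candidate's own."""
    return (value - statistics.mean(train_values)) / statistics.stdev(train_values)


# Hypothetical training schema and one candidate whose one-hot encoding
# produced only the genre_pop dummy.
train_columns = ["track_energy_avg", "genre_pop", "genre_rock"]
candidate = {"track_energy_avg": 0.8, "genre_pop": 1}

aligned = align_columns(candidate, train_columns)
full = add_interactions(aligned, ["track_energy_avg"], ["genre_pop", "genre_rock"])

train_energy = [0.2, 0.4, 0.6, 0.8]  # hypothetical training values
z = standardize(full["track_energy_avg"], train_energy)
print(full, z)
```

Aligning to the training columns matters because all candidates share one genre, so one-hot encoding them alone yields a single genre dummy while the fitted models expect the full set.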
    
    
def recommended_playlist(genre, num_tracks):
    gen_playlists_df, playlist_track_ids = generate_playlists(genre, num_tracks)
    df_recommendation_int = preprocess_playlist_candidates(gen_playlists_df)
    
    # Load meta model
    meta_model = joblib.load('../../fitted_models/meta_reg_int.pkl')
    prefix = '../../fitted_models/'
    suffix = '.pkl'
    models_int = ['sim_lin_int', 'ridge_cv_int', 'lasso_cv_int', 'RF_best_int', 'ab_best_int']

    # Record each single model's predicted results
    meta_X_recommendation_int = np.zeros((df_recommendation_int.shape[0], len(models_int)))
    for i, name in enumerate(models_int):
        model_name = prefix + name + suffix
        model = joblib.load(model_name) 
        meta_X_recommendation_int[:, i] = model.predict(df_recommendation_int)

    predicted_log_num_followers = meta_model.predict(meta_X_recommendation_int)
    predicted_num_followers = np.exp(predicted_log_num_followers) - 1

    recommendation_playlist = playlist_track_ids[np.argmax(predicted_num_followers)]
    recommendation_playlist_pred_num_followers = max(predicted_num_followers)
    display_recommendation_df = get_recommendation_tracks_display_info(recommendation_playlist)
    
    print('The recommended playlist is:')
    display(display_recommendation_df)
    print('Predicted num_followers: {}'.format(recommendation_playlist_pred_num_followers))
    return recommendation_playlist, recommendation_playlist_pred_num_followers

def get_recommendation_tracks_display_info(recommendation):
    display_info_list = []
    recommend_track_ids = recommendation['track_ids']
    for track_id in recommend_track_ids:
        display_info = {}
        track_info = tracks_db[track_id]
        display_info['track ID'] = track_id
        display_info['track name'] = track_info['name']
        display_info['artists names'] = track_info['artists_names']
        display_info_list.append(display_info)
    return pd.DataFrame(display_info_list)
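
The stacked prediction in recommended_playlist follows a common pattern: each base model's prediction becomes one input column for the meta model. The sketch below replaces the fitted models with hypothetical plain functions, so the numbers carry no meaning. It also assumes, based on the `np.exp(...) - 1` back-transform above, that the models were fitted on log(1 + num_followers).

```python
import math


def stack_predict(base_models, meta_model, rows):
    """Feed each base model's prediction to the meta model, row by row."""
    meta_features = [[model(row) for model in base_models] for row in rows]
    return [meta_model(features) for features in meta_features]


# Hypothetical stand-ins for the fitted base models and meta regressor.
base_models = [
    lambda row: 2.0 * row["x"],
    lambda row: row["x"] + 1.0,
]
meta_model = lambda features: 0.5 * features[0] + 0.5 * features[1]

rows = [{"x": 1.0}, {"x": 3.0}, {"x": 2.0}]  # one row per candidate playlist
log_preds = stack_predict(base_models, meta_model, rows)

# Undo the assumed log(1 + y) transform and pick the best candidate.
preds = [math.expm1(p) for p in log_preds]
best_index = max(range(len(preds)), key=preds.__getitem__)
print(log_preds, best_index)
```

Because the back-transform is monotonic, the winning candidate is the same whether the comparison is made on the log scale or on the follower scale.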

Functions to validate the recommendation

def get_most_similar_playlist(recommendation, genre):
    '''
    Return the id of the playlist in the playlists database that is most similar to the recommended playlist
    '''
    dissimilarity_list = []
    for playlist in playlists_db:
        cur_pl = playlists_df[playlists_df['id'] == playlist['id']]
        if not cur_pl.empty:
            cur_genre = cur_pl['genre'].values[0]
            if cur_genre == genre:
                dissimilarity_list.append(len(set(recommendation['track_ids']) - set(playlist['track_ids'])))
            else: # dissimilarity = inf if the playlist's genre is different from the user specified genre
                dissimilarity_list.append(float('inf'))
        else:
            dissimilarity_list.append(float('inf'))

    most_similar_playlist_id = playlists_db[dissimilarity_list.index(min(dissimilarity_list))]['id']
    most_similar_playlist = playlists_df[playlists_df['id'] == most_similar_playlist_id]
    most_similar_playlist_num_followers = most_similar_playlist['num_followers'].values[0]
    return most_similar_playlist_id

def predict_most_similar_playlist(most_similar_playlist_id):
    most_similar_playlist = playlists_df[playlists_df['id'] == most_similar_playlist_id]
    
    # Process the most similar playlist to be ready for model prediction
    processed_most_similar_playlist = preprocess_playlist_candidates(most_similar_playlist)
    
    # Load meta model
    meta_model = joblib.load('../../fitted_models/meta_reg_int.pkl')
    prefix = '../../fitted_models/'
    suffix = '.pkl'
    models_int = ['sim_lin_int', 'ridge_cv_int', 'lasso_cv_int', 'RF_best_int', 'ab_best_int']

    # Record each base model's prediction as an input feature for the meta regressor
    meta_X_processed_most_similar_playlist = np.zeros((processed_most_similar_playlist.shape[0], len(models_int)))
    for i, name in enumerate(models_int):
        model_name = prefix + name + suffix
        model = joblib.load(model_name) 
        meta_X_processed_most_similar_playlist[:, i] = model.predict(processed_most_similar_playlist)

    most_similar_predicted_log_num_followers = meta_model.predict(meta_X_processed_most_similar_playlist)
    most_similar_predicted_num_followers = (np.exp(most_similar_predicted_log_num_followers) - 1)[0]
    print('The most similar playlist\'s predicted num_followers: {}'.format(most_similar_predicted_num_followers))

def get_rank_within_genre(playlist_id, genre):
    n_genre = len(playlists_df[playlists_df['genre']==genre])
    print('There are {} playlists in genre = {}'.format(n_genre, genre))
    
    df = deepcopy(playlists_df[playlists_df['genre']==genre])
    df.sort_values(by=['num_followers'], ascending=False, inplace=True)
    
    # Get the within-genre-rank of the playlist in the database
    df.reset_index(inplace=True)
    rank = df[df['id'] == playlist_id].index.values[0] + 1
    print('The most similar playlist\'s rank within genre is: {}'.format(rank))
    return rank
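
The ranking logic can be checked on a toy example without pandas. The function name, playlist IDs, and follower counts below are hypothetical; rank 1 denotes the playlist with the most followers in its genre.

```python
def rank_within_genre(playlists, playlist_id, genre):
    """Return (1-based rank by num_followers, number of playlists in the genre)."""
    same_genre = [p for p in playlists if p["genre"] == genre]
    ordered = sorted(same_genre, key=lambda p: p["num_followers"], reverse=True)
    ids = [p["id"] for p in ordered]
    return ids.index(playlist_id) + 1, len(ordered)


# Hypothetical playlists for illustration only.
playlists = [
    {"id": "p1", "genre": "pop", "num_followers": 50},
    {"id": "p2", "genre": "pop", "num_followers": 900},
    {"id": "p3", "genre": "rock", "num_followers": 5000},
    {"id": "p4", "genre": "pop", "num_followers": 300},
]

rank, total = rank_within_genre(playlists, "p4", "pop")
print(rank, total)
```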
    

Examples of Recommendation

recommendation, recommendation_pred_num_followers = recommended_playlist('pop', 6)
most_similar_pl_id = get_most_similar_playlist(recommendation, 'pop')
predict_most_similar_playlist(most_similar_pl_id)
rank = get_rank_within_genre(most_similar_pl_id, 'pop')
tracks that are missing : 0
The recommended playlist is:
|   | artists names | track ID | track name |
|---|---|---|---|
| 0 | [Camila Cabello, Young Thug] | 0ofbQMrRDsUaVKq2mGLEAb | Havana |
| 1 | [ZAYN, Sia] | 1j4kHkkpqZRBwE0A4CN4Yv | Dusk Till Dawn - Radio Edit |
| 2 | [Charlie Puth] | 32DGGj6KlNuBr6WaqRxpxi | How Long |
| 3 | [Selena Gomez, Marshmello] | 7EmGUiUaOSGDnUUQUDrOXC | Wolves |
| 4 | [Becky G, Bad Bunny] | 7JNh1cfm0eXjqFVOzKLyau | Mayores |
| 5 | [Sam Smith] | 1mXVgsBdtIVeCLJnSnmtdV | Too Good At Goodbyes |
Predicted num_followers: 11497.350942856436
The most similar playlist's predicted num_followers: 1299334.2394193185
There are 2738 playlists in genre = pop
The most similar playlist's rank within genre is: 596
recommendation, recommendation_pred_num_followers = recommended_playlist('pop', 10)
most_similar_pl_id = get_most_similar_playlist(recommendation, 'pop')
predict_most_similar_playlist(most_similar_pl_id)
rank = get_rank_within_genre(most_similar_pl_id, 'pop')
tracks that are missing : 0
The recommended playlist is:
|   | artists names | track ID | track name |
|---|---|---|---|
| 0 | [Camila Cabello, Young Thug] | 0ofbQMrRDsUaVKq2mGLEAb | Havana |
| 1 | [ZAYN, Sia] | 1j4kHkkpqZRBwE0A4CN4Yv | Dusk Till Dawn - Radio Edit |
| 2 | [Ed Sheeran] | 0tgVpDi06FyKpA1z0VMD4v | Perfect |
| 3 | [Selena Gomez, Marshmello] | 7EmGUiUaOSGDnUUQUDrOXC | Wolves |
| 4 | [Becky G, Bad Bunny] | 7JNh1cfm0eXjqFVOzKLyau | Mayores |
| 5 | [Sam Smith] | 1mXVgsBdtIVeCLJnSnmtdV | Too Good At Goodbyes |
| 6 | [Charlie Puth] | 4iLqG9SeJSnt0cSPICSjxv | Attention |
| 7 | [Ed Sheeran] | 7qiZfU4dY1lWllzX7mPBI3 | Shape of You |
| 8 | [Maroon 5, SZA] | 3hBBKuWJfxlIlnd9QFoC8k | What Lovers Do (feat. SZA) |
| 9 | [Lauv] | 1wjzFQodRWrPcQ0AnYnvQ9 | I Like Me Better |
Predicted num_followers: 14741.997686715062
The most similar playlist's predicted num_followers: 12219.764844058263
There are 2738 playlists in genre = pop
The most similar playlist's rank within genre is: 2240
recommendation, recommendation_pred_num_followers = recommended_playlist('pop', 12)
most_similar_pl_id = get_most_similar_playlist(recommendation, 'pop')
predict_most_similar_playlist(most_similar_pl_id)
rank = get_rank_within_genre(most_similar_pl_id, 'pop')
tracks that are missing : 0
The recommended playlist is:
|   | artists names | track ID | track name |
|---|---|---|---|
| 0 | [Camila Cabello, Young Thug] | 0ofbQMrRDsUaVKq2mGLEAb | Havana |
| 1 | [ZAYN, Sia] | 1j4kHkkpqZRBwE0A4CN4Yv | Dusk Till Dawn - Radio Edit |
| 2 | [Ed Sheeran] | 0tgVpDi06FyKpA1z0VMD4v | Perfect |
| 3 | [Dua Lipa] | 2ekn2ttSfGqwhhate0LSR0 | New Rules |
| 4 | [Charlie Puth] | 32DGGj6KlNuBr6WaqRxpxi | How Long |
| 5 | [Selena Gomez, Marshmello] | 7EmGUiUaOSGDnUUQUDrOXC | Wolves |
| 6 | [Becky G, Bad Bunny] | 7JNh1cfm0eXjqFVOzKLyau | Mayores |
| 7 | [Sam Smith] | 1mXVgsBdtIVeCLJnSnmtdV | Too Good At Goodbyes |
| 8 | [Charlie Puth] | 4iLqG9SeJSnt0cSPICSjxv | Attention |
| 9 | [Maroon 5, SZA] | 3hBBKuWJfxlIlnd9QFoC8k | What Lovers Do (feat. SZA) |
| 10 | [Lauv] | 1wjzFQodRWrPcQ0AnYnvQ9 | I Like Me Better |
| 11 | [Justin Bieber, BloodPop®] | 7nZmah2llfvLDiUjm0kiyz | Friends (with BloodPop®) |
Predicted num_followers: 17795.27856902531
The most similar playlist's predicted num_followers: 202172073.55318552
There are 2738 playlists in genre = pop
The most similar playlist's rank within genre is: 23
recommendation, recommendation_pred_num_followers = recommended_playlist('rock', 6)
most_similar_pl_id = get_most_similar_playlist(recommendation, 'rock')
predict_most_similar_playlist(most_similar_pl_id)
rank = get_rank_within_genre(most_similar_pl_id, 'rock')
tracks that are missing : 0
The recommended playlist is:
|   | artists names | track ID | track name |
|---|---|---|---|
| 0 | [Imagine Dragons] | 0tKcYR2II1VCQWT79i5NrW | Thunder |
| 1 | [Imagine Dragons] | 5VnDkUNyX6u5Sk0yZiP8XB | Thunder |
| 2 | [Imagine Dragons] | 0CcQNd8CINkwQfe1RDtGV6 | Believer |
| 3 | [Imagine Dragons] | 1NtIMM4N0cFa1dNzN15chl | Believer |
| 4 | [Imagine Dragons] | 4IWAyPf1KMq7JCyGeCjTeH | Whatever It Takes |
| 5 | [Twenty One Pilots] | 3CRDbSIZ4r5MsZ0YwxuEkn | Stressed Out |
Predicted num_followers: 9612.629344915793
The most similar playlist's predicted num_followers: 18949928.261263046
There are 981 playlists in genre = rock
The most similar playlist's rank within genre is: 60
recommendation, recommendation_pred_num_followers = recommended_playlist('rock', 10)
most_similar_pl_id = get_most_similar_playlist(recommendation, 'rock')
predict_most_similar_playlist(most_similar_pl_id)
rank = get_rank_within_genre(most_similar_pl_id, 'rock')
tracks that are missing : 0
The recommended playlist is:
|   | artists names | track ID | track name |
|---|---|---|---|
| 0 | [Imagine Dragons] | 5VnDkUNyX6u5Sk0yZiP8XB | Thunder |
| 1 | [Imagine Dragons] | 0CcQNd8CINkwQfe1RDtGV6 | Believer |
| 2 | [Imagine Dragons] | 4IWAyPf1KMq7JCyGeCjTeH | Whatever It Takes |
| 3 | [Twenty One Pilots] | 3CRDbSIZ4r5MsZ0YwxuEkn | Stressed Out |
| 4 | [Eurythmics] | 1TfqLAPs4K3s2rJMoCokcS | Sweet Dreams (Are Made of This) - Remastered |
| 5 | [a-ha] | 2WfaOiMkCvy7F5fcp2zZ8L | Take On Me |
| 6 | [Twenty One Pilots] | 6i0V12jOa3mr6uu4WYhUBr | Heathens |
| 7 | [Twenty One Pilots] | 2Z8WuEywRWYTKe1NybPQEW | Ride |
| 8 | [AC/DC] | 2zYzyRzz6pRmhPzyfMEC8s | Highway to Hell |
| 9 | [Eagles] | 40riOy7x9W7GXjyGp4pjAv | Hotel California - Remastered |
Predicted num_followers: 10451.662075669927
The most similar playlist's predicted num_followers: 1113.209291725653
There are 981 playlists in genre = rock
The most similar playlist's rank within genre is: 804
recommendation, recommendation_pred_num_followers = recommended_playlist('rock', 12)
most_similar_pl_id = get_most_similar_playlist(recommendation, 'rock')
predict_most_similar_playlist(most_similar_pl_id)
rank = get_rank_within_genre(most_similar_pl_id, 'rock')
tracks that are missing : 0
The recommended playlist is:
|   | artists names | track ID | track name |
|---|---|---|---|
| 0 | [Imagine Dragons] | 0tKcYR2II1VCQWT79i5NrW | Thunder |
| 1 | [Imagine Dragons] | 5VnDkUNyX6u5Sk0yZiP8XB | Thunder |
| 2 | [Imagine Dragons] | 0CcQNd8CINkwQfe1RDtGV6 | Believer |
| 3 | [Imagine Dragons] | 1NtIMM4N0cFa1dNzN15chl | Believer |
| 4 | [Imagine Dragons] | 4IWAyPf1KMq7JCyGeCjTeH | Whatever It Takes |
| 5 | [Twenty One Pilots] | 3CRDbSIZ4r5MsZ0YwxuEkn | Stressed Out |
| 6 | [Twenty One Pilots] | 6i0V12jOa3mr6uu4WYhUBr | Heathens |
| 7 | [Twenty One Pilots] | 2Z8WuEywRWYTKe1NybPQEW | Ride |
| 8 | [AC/DC] | 2zYzyRzz6pRmhPzyfMEC8s | Highway to Hell |
| 9 | [Eagles] | 40riOy7x9W7GXjyGp4pjAv | Hotel California - Remastered |
| 10 | [Red Hot Chili Peppers] | 3d9DChrdc6BOeFsbrZ3Is0 | Under The Bridge |
| 11 | [AC/DC] | 08mG3Y1vljYA6bvDt4Wqkj | Back In Black |
Predicted num_followers: 14808.540723599252
The most similar playlist's predicted num_followers: 6894626.190676659
There are 981 playlists in genre = rock
The most similar playlist's rank within genre is: 115
recommendation, recommendation_pred_num_followers = recommended_playlist('funk', 6)
most_similar_pl_id = get_most_similar_playlist(recommendation, 'funk')
predict_most_similar_playlist(most_similar_pl_id)
rank = get_rank_within_genre(most_similar_pl_id, 'funk')
tracks that are missing : 0
The recommended playlist is:
|   | artists names | track ID | track name |
|---|---|---|---|
| 0 | [MC Jhowzinho e MC Kadinho] | 0pDaqgIForVNO4jrtTxcWT | Agora Vai Sentar |
| 1 | [1Kilo, Baviera, Pablo Martins, Knust] | 2srL4DYBekshpbprS6H0mO | Deixe Me Ir - Acústico |
| 2 | [Mc Livinho] | 6pSYjx66rlqRmGGTHhnjCo | Fazer Falta |
| 3 | [MC Kevinho, Leo Santana] | 7yYOMPwpV5CsK0cxoAZT6B | Encaixa |
| 4 | [MC G15] | 4BjPsq3MXBNo4Qxg40igEr | Cara Bacana |
| 5 | [Mc Don Juan] | 1kNVJQEkobOlyfbctPZ4fs | Amar Amei |
Predicted num_followers: 75942.19733322992
The most similar playlist's predicted num_followers: 899363319.2529154
There are 66 playlists in genre = funk
The most similar playlist's rank within genre is: 1
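
Raw ranks are hard to compare across genres with different numbers of playlists, so it can help to express each rank as a percentage of its genre's size. The sketch below does this arithmetic on the ranks and genre sizes printed above; the variable names are hypothetical and nothing here is re-run against the data.

```python
# (genre, requested tracks, rank of most similar playlist, playlists in genre),
# copied from the printed output of the examples above.
results = [
    ("pop", 6, 596, 2738),
    ("pop", 10, 2240, 2738),
    ("pop", 12, 23, 2738),
    ("rock", 6, 60, 981),
    ("rock", 10, 804, 981),
    ("rock", 12, 115, 981),
    ("funk", 6, 1, 66),
]

# Rank as a percentage of the genre size: smaller is better.
top_percent = [round(100 * rank / total, 1) for _, _, rank, total in results]

for (genre, n, rank, total), pct in zip(results, top_percent):
    print(f"{genre:>4}  N={n:<2}  rank {rank:>4} of {total:<4}  top {pct}%")
```

On this scale the most similar playlist ranges from roughly the top 1% to roughly the top 82% of its genre, which is the instability noted in the summary of findings.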