Spotify

Predicting the success of your playlist and providing personalized recommendations

Here, we preprocess the Spotify data. By the end of this preprocessing stage, we have two CSV files containing data ready to be used for EDA and for building our predictive model.

Overview of the steps:

  • Load tracks.json and playlists_from_200_search_words.json files
  • Extract track features for each playlist. The result is stored in a list of Python dictionaries, one dictionary per playlist. In each dictionary, each key is a feature and each value is a list of feature values, with one entry per track in the playlist.
  • Build playlists dataframe (feature engineering).
    • For each playlist, there are playlist-level and track-level variables. We take playlist-level variables as they are. For track-level numeric variables, we calculate the average and standard deviation; for track-level categorical variables, we take the mode and count the number of unique values. Since Spotify playlists and tracks are not directly labeled with a genre, we define the genre of a playlist to be the most frequently occurring artist genre among all its tracks.
      • Playlist-level predictors: number of tracks
      • Track-level predictors: average and standard deviation of all numerical track audio features (e.g. danceability, tempo), popularities (e.g. track, album), and number of available markets; mode and unique count of track artist genre, key, and time signature
  • Build tracks dataframe
    • For each track, we add a new column genre, which is based on the mode of its artists' genres.
  • Save the playlists and tracks dataframes to CSV. They are then ready for EDA and model building.
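
The feature engineering described in the list above can be illustrated on a single toy playlist. This is a minimal sketch rather than the project code: the playlist values are made up, only a few features are included, and `summarize_playlist` is a hypothetical helper name.

```python
import collections

import numpy as np


def summarize_playlist(playlist):
    """Collapse track-level lists into playlist-level summary features."""
    summary = {'num_tracks': playlist['num_tracks']}  # playlist-level: kept as is

    # Numeric track-level features: average and (population) standard deviation
    for key in ('track_danceability', 'track_tempo'):
        summary[key + '_avg'] = float(np.mean(playlist[key]))
        summary[key + '_std'] = float(np.std(playlist[key]))

    # Categorical track-level feature: mode and number of unique values
    counter = collections.Counter(playlist['track_key'])
    summary['track_key_mode'] = counter.most_common(1)[0][0]
    summary['track_key_unique'] = len(counter)

    # Genre: each track carries a list of artist genres, so flatten before counting
    genres = [g for track_genres in playlist['track_artists_genres'] for g in track_genres]
    summary['genre_mode'] = collections.Counter(genres).most_common(1)[0][0]
    return summary


# Hypothetical playlist with three tracks
toy_playlist = {
    'num_tracks': 3,
    'track_danceability': [0.5, 0.7, 0.9],
    'track_tempo': [100.0, 120.0, 140.0],
    'track_key': [0, 7, 7],
    'track_artists_genres': [['pop', 'dance pop'], ['pop'], ['rock']],
}

summary = summarize_playlist(toy_playlist)
print(summary)
```

Each playlist becomes one row of summary features, which is the shape the playlists dataframe needs for modeling.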

Define helper functions

These libraries and functions are used to preprocess the data scraped from the Spotify API.

import collections
import json
from copy import deepcopy

import numpy as np
import pandas as pd


def load_data(file):
    """
    Function to load a json file and return its parsed contents
    """
    with open(file, 'r') as fd:
        return json.load(fd)
    
def extract_track_features(tracks_db, playlists):
    """
    Function to get track features and return a playlist dictionary with track features
    """ 
    processed_playlists = deepcopy(playlists)
    
    missing_counts = 0
    # Loop over each playlist
    for index, playlist in enumerate(processed_playlists):
        # get the list of track ids for each playlist
        track_ids = playlist['track_ids']
        track_feature_keys = ['acousticness', 'album_id', 'album_name', 'album_popularity','artists_genres', 
                              'artists_ids', 'artists_names', 'artists_num_followers', 'artists_popularities',
                              'avg_artist_num_followers', 'avg_artist_popularity', 'danceability', 'duration_ms',
                              'energy', 'explicit', 'instrumentalness', 'isrc', 'key', 'liveness', 
                              'loudness', 'mode', 'mode_artist_genre', 'name', 'num_available_markets',
                              'popularity', 'speechiness', 'std_artist_num_followers', 'std_artist_popularity',
                              'tempo', 'time_signature', 'valence']
        
        # new entries of audio features for each playlist as a list to append each track's audio feature
        for track_feature_key in track_feature_keys:
            playlist['track_' + track_feature_key] = []
        
        # append each tracks' audio features into the entries of the playlist
        for track_id in track_ids:
            # check if the track_id is in the scraped tracks database
            if track_id in tracks_db:
                # append each track's audio feature into the playlist dictionary
                for track_feature_key in track_feature_keys:
                    if track_feature_key in tracks_db[track_id].keys():
                        playlist['track_' + track_feature_key].append(tracks_db[track_id][track_feature_key])
            else:
                missing_counts += 1
        processed_playlists[index] = playlist
    print('tracks that are missing : {}'.format(missing_counts))
    return processed_playlists


def build_playlist_dataframe(playlists_dictionary_list):
    """
    Function to build playlist dataframe from playlists dictionary with track features
    """
    
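    # Exclude one specific playlist by ID. Filtering by ID (instead of deleting at the
    # hard-coded index 7914) avoids an IndexError on shorter lists and does not
    # mutate the caller's list.
    playlists_dictionary_list = [p for p in playlists_dictionary_list
                                 if p['id'] != '4krpfadGaaW42C7cEm2O0A']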
        
    # features to take the avg and std
    features_avg = ['track_acousticness', 'track_avg_artist_num_followers', 'track_album_popularity',
                    'track_avg_artist_popularity', 'track_danceability', 'track_duration_ms', 
                    'track_energy', 'track_explicit', 'track_instrumentalness','track_liveness', 
                    'track_loudness', 'track_mode', 'track_num_available_markets',
                    'track_std_artist_num_followers', 'track_std_artist_popularity',
                    'track_popularity', 'track_speechiness', 'track_tempo', 'track_valence'
                   ]                
                      
    # features to take the mode, # of uniques
    features_mode = ['track_artists_genres','track_key','track_time_signature']

    # features as is
    features = ['collaborative', 'num_followers', 'num_tracks']

    processed_playlists = {}

    for index, playlist in enumerate(playlists_dictionary_list):
        playlist_info = {} 
        playlist_info['id'] = playlist['id']

        for key in playlist.keys():
            if key in features_avg: # take avg and std
                playlist_info[key + '_avg'] = np.mean(playlist[key])
                playlist_info[key + '_std'] = np.std(playlist[key])
                if key in set(['track_popularity', 'track_album_popularity', 'track_avg_artist_popularity']):
                    playlist_info[key + '_max'] = max(playlist[key])
            elif key in features_mode: # take mode
                if playlist[key]:
                    if key == 'track_artists_genres':
                        flatten = lambda l: [item for sublist in l for item in sublist]
                        flattened_value = flatten(playlist[key])
                        if flattened_value:
                            counter = collections.Counter(flattened_value)
                            playlist_info[key + '_mode'] = counter.most_common()[0][0]
                            playlist_info[key + '_unique'] = len(set(flattened_value))
                    else:
                        counter = collections.Counter(playlist[key])
                        playlist_info[key + '_mode'] = counter.most_common()[0][0]
                        playlist_info[key + '_unique'] = len(set(playlist[key]))
            elif key in features:
                playlist_info[key] = playlist[key]

        processed_playlists[index] = playlist_info
    df = pd.DataFrame(processed_playlists).T
    
    # Drop all observations (playlists) with missingness
    df_full = df.dropna(axis=0, how='any').reset_index(drop=True)
    
    # Define our genre labels
    predefined_genres =['pop rap', 'punk', 'korean pop', 'pop christmas', 'folk', 'indie pop', 'pop', 
                    'rock', 'rap' , 'house', 'indie', 'dance', 'edm', 'mellow', 'hip hop',  
                    'alternative', 'jazz', 'r&b', 'soul', 'reggae', 'classical', 'funk', 'country',
                    'metal', 'blues', 'elect']
    # Create a new column genre
    df_full['genre'] = None
    
    # Label genres. Iterating in reverse lets labels earlier in the list overwrite
    # later ones, so e.g. 'pop rap' takes priority over 'pop' and 'rap'.
    genres = df_full['track_artists_genres_mode']
    for g in reversed(predefined_genres):
        # .loc avoids chained assignment; regex=False does a plain substring match
        df_full.loc[genres.str.contains(g, regex=False), 'genre'] = g

    # Label all observations that did not match our predefined genres as 'other'  
    df_full['genre'] = df_full['genre'].fillna('other')
    df_full = df_full.drop('track_artists_genres_mode', axis=1)
    
    return df_full
    

def build_track_dataframe(tracks_db):
    """
    Function to build track dataframe
    """
    df = pd.DataFrame(tracks_db).T
    df.reset_index(inplace=True)
    df.rename(columns={'index': 'trackID'}, inplace=True)
    df.drop('album_genres', axis=1, inplace=True) # drop album genre because it's null for all tracks
    
    # Define our genre labels
    predefined_genres =['pop rap', 'punk', 'korean pop', 'pop christmas', 'folk', 'indie pop', 'pop', 
                    'rock', 'rap' , 'house', 'indie', 'dance', 'edm', 'mellow', 'hip hop',  
                    'alternative', 'jazz', 'r&b', 'soul', 'reggae', 'classical', 'funk', 'country',
                    'metal', 'blues', 'elect']
    
    # Drop all observations (tracks) with missingness
    df_full = df.dropna(axis=0, how='any').reset_index(drop=True)
    
    # Create a new column genre
    df_full['genre'] = None
    
    # Label genres. Iterating in reverse lets labels earlier in the list overwrite
    # later ones, so e.g. 'pop rap' takes priority over 'pop' and 'rap'.
    genres = df_full['mode_artist_genre']
    for g in reversed(predefined_genres):
        # .loc avoids chained assignment; regex=False does a plain substring match
        df_full.loc[genres.str.contains(g, regex=False), 'genre'] = g

    # Label all observations that did not match our predefined genres as 'other'  
    df_full['genre'] = df_full['genre'].fillna('other')
    df_full = df_full.drop('mode_artist_genre', axis=1)
    
    return df_full
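
Both dataframe builders label genres by walking `predefined_genres` in reverse and overwriting earlier matches, so labels near the front of the list win whenever several of them match the same raw genre string. The sketch below isolates that behavior with a shortened, illustrative genre list; `label_genre` is a hypothetical helper, not a function from the project.

```python
import pandas as pd


def label_genre(raw_genres, predefined_genres):
    """Map raw genre strings to coarse labels; labels earlier in the list win."""
    labels = pd.Series('other', index=raw_genres.index)  # default for non-matches
    for g in reversed(predefined_genres):
        # Plain substring match; each later iteration overwrites earlier ones
        labels.loc[raw_genres.str.contains(g, regex=False)] = g
    return labels


raw = pd.Series(['pop rap', 'dance pop', 'reggae rock', 'chillhop'])
labels = label_genre(raw, ['pop rap', 'pop', 'rock', 'rap', 'dance', 'reggae'])
print(labels.tolist())  # ['pop rap', 'pop', 'rock', 'other']
```

Because matching is by substring, the order of the list is a modeling decision: 'reggae rock' is labeled 'rock' here only because 'rock' precedes 'reggae'.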

Preprocess the playlists

After extracting the relevant track features and computing the playlist-level summaries from them, we store the playlists dataframe as a CSV file for easy access in later parts of this project.
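
To make the extraction step concrete, here is a condensed sketch of the lookup on toy data. The track IDs, feature values, and the name `attach_track_features` are all made up for illustration; the project's `extract_track_features` uses a fixed, much longer list of feature keys.

```python
from copy import deepcopy


def attach_track_features(tracks_db, playlists, feature_keys):
    """Copy the playlists and add one list per feature, with one entry per found track."""
    processed = deepcopy(playlists)  # leave the input untouched
    missing = 0
    for playlist in processed:
        for key in feature_keys:
            playlist['track_' + key] = []
        for track_id in playlist['track_ids']:
            track = tracks_db.get(track_id)
            if track is None:  # track ID is not in the tracks database
                missing += 1
                continue
            for key in feature_keys:
                if key in track:
                    playlist['track_' + key].append(track[key])
    return processed, missing


# Hypothetical data: playlist 'p1' references a track 't3' that is not in the database
toy_tracks_db = {
    't1': {'tempo': 120.0, 'key': 5},
    't2': {'tempo': 98.5, 'key': 0},
}
toy_playlists = [{'id': 'p1', 'track_ids': ['t1', 't2', 't3']}]

processed, missing = attach_track_features(toy_tracks_db, toy_playlists, ['tempo', 'key'])
print(processed[0]['track_tempo'])  # [120.0, 98.5]
print(missing)                      # 1
```

Tracks that cannot be found are skipped and only counted, so the feature lists of a playlist can be shorter than its list of track IDs.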

playlists = load_data('../../data_archive/playlists_from_200_search_words.json')
tracks_db = load_data('../../data_archive/tracks.json')

playlists_with_track_features = extract_track_features(tracks_db, playlists)

playlists_df = build_playlist_dataframe(playlists_with_track_features)

playlists_df.head()
tracks that are missing : 505
|   | collaborative | id | num_followers | num_tracks | track_acousticness_avg | track_acousticness_std | track_album_popularity_avg | track_album_popularity_max | track_album_popularity_std | track_artists_genres_unique | ... | track_std_artist_num_followers_std | track_std_artist_popularity_avg | track_std_artist_popularity_std | track_tempo_avg | track_tempo_std | track_time_signature_mode | track_time_signature_unique | track_valence_avg | track_valence_std | genre |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | False | 37i9dQZF1DX1N5uK98ms5p | 3000606 | 52 | 0.180999 | 0.17112 | 71.6731 | 96 | 13.1364 | 60 | ... | 921166 | 1.78425 | 3.08155 | 116.689 | 25.1949 | 4 | 1 | 0.456071 | 0.184214 | pop |
| 1 | False | 37i9dQZF1DX5drguwUcl5X | 69037 | 75 | 0.144201 | 0.160799 | 68.44 | 100 | 15.5111 | 70 | ... | 1.53959e+06 | 2.11486 | 3.17182 | 114.454 | 24.115 | 4 | 2 | 0.555027 | 0.19144 | pop |
| 2 | False | 37i9dQZF1DX9bAf4c66TGs | 385875 | 38 | 0.1166 | 0.117615 | 72.4211 | 94 | 16.1923 | 44 | ... | 2.05042e+06 | 2.12676 | 2.15179 | 115.813 | 22.7593 | 4 | 1 | 0.526526 | 0.201783 | pop |
| 3 | False | 37i9dQZF1DX9nq0BqAtM4H | 69344 | 40 | 0.134162 | 0.247197 | 57.025 | 82 | 18.0838 | 97 | ... | 308030 | 0.0375 | 0.172753 | 126.491 | 29.5215 | 4 | 2 | 0.501825 | 0.188804 | pop |
| 4 | False | 1dCUPq7sB98i1jgQmo9d7e | 15612 | 26 | 0.171635 | 0.229736 | 53.4615 | 54 | 0.498519 | 5 | ... | 12787 | 3.34629 | 3.18413 | 126.678 | 33.242 | 4 | 1 | 0.658846 | 0.184523 | pop |

5 rows × 51 columns

print('Number of observations with missing values: ', playlists_df.isnull().any(axis=1).sum())
Number of observations with missing values:  0
playlists_df.to_csv('../../data/playlists.csv', index=False)
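
A note on the missing-value check: `DataFrame.isnull().any()` reduces over rows by default and therefore returns one flag per column, while passing `axis=1` returns one flag per row (observation). The small sketch below, on a made-up dataframe, shows that the two counts can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row has missing values in two different columns
toy_df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [4.0, np.nan, 6.0],
    'c': [7.0, 8.0, 9.0],
})

cols_with_missing = int(toy_df.isnull().any(axis=0).sum())  # columns containing a NaN
rows_with_missing = int(toy_df.isnull().any(axis=1).sum())  # rows containing a NaN
print(cols_with_missing, rows_with_missing)  # 2 1
```

After `dropna(how='any')` both counts are zero, so either check confirms that the playlists dataframe is complete.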

Preprocess the tracks

tracks_df = build_track_dataframe(tracks_db)
tracks_df.head()
|   | trackID | acousticness | album_id | album_name | album_popularity | artists_genres | artists_ids | artists_names | artists_num_followers | artists_popularities | ... | name | num_available_markets | popularity | speechiness | std_artist_num_followers | std_artist_popularity | tempo | time_signature | valence | genre |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 000C3ZY8325A4yktxnnwCl | 0.952 | 3ypgq6ExA3JN8s2biuRK5e | Soft Ice | 36 | [drift] | [4Uqu4U6hhDMODyzSCtNDzG] | [Poemme] | [531] | [47] | ... | When the Sun Is a Stranger | 62 | 26 | 0.0469 | 0 | 0 | 134.542 | 3 | 0.0835 | other |
| 1 | 000EWWBkYaREzsBplYjUag | 0.815 | 5WGfEM0WaAyoJa6AOSfx7T | Red Flower | 59 | [chillhop] | [0oer0EPMRrosfCF2tUt2jU] | [Don Philippe] | [1300] | [56] | ... | Fewerdolr | 62 | 40 | 0.0747 | 0 | 0 | 76.43 | 4 | 0.56 | other |
| 2 | 000hI2Lxs4BxqJyqbw7Y10 | 0.108 | 5RIqRVn99mfdZSVmgjBrfj | Las 35 Baladas de Medina Azahara | 0 | [latin metal, rock en espanol, spanish new wav... | [72XPmW6k6HZT6K2BaUUOhl] | [Medina Azahara] | [43172] | [47] | ... | Tu Mirada | 0 | 0 | 0.029 | 0 | 0 | 72.474 | 4 | 0.41 | metal |
| 3 | 000uWezkHfg6DbUPf2eDFO | 0.00188 | 3MBXzJXHFBslpPUcxNB3jn | Dancehall Days | 36 | [reggae rock] | [0hDJSg859MdK4c9vqu1dS8] | [The Beautiful Girls] | [48518] | [56] | ... | Me I Disconnect From You | 39 | 21 | 0.0298 | 0 | 0 | 134.008 | 4 | 0.362 | rock |
| 4 | 000x2qE0ZI3hodeVrnJK8A | 0.339 | 2N0AgtWbCmVoNUl2GN1opH | Dreamboat Annie | 62 | [album rock, art rock, classic rock, dance roc... | [34jw2BbxjoYalTp8cJFCPv] | [Heart] | [413139] | [70] | ... | (Love Me Like Music) I'll Be Your Song | 62 | 32 | 0.0306 | 0 | 0 | 134.248 | 4 | 0.472 | rock |

5 rows × 33 columns

tracks_df.to_csv('../../data/tracks.csv', index=False)
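
One caveat when loading the tracks CSV later: list-valued columns such as `artists_genres` are written to CSV as their string representation, so they come back as plain strings. The sketch below shows one possible way to restore them. It uses an in-memory buffer and made-up rows in place of the real file, and `ast.literal_eval` is a suggestion here, not necessarily what the later stages of the project do.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row stand-in for the tracks dataframe
toy_tracks = pd.DataFrame({
    'trackID': ['a1', 'b2'],
    'artists_genres': [['reggae rock'], ['album rock', 'art rock']],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_tracks.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
first_value = reloaded.loc[0, 'artists_genres']
print(type(first_value))  # a string: the list was serialized as text

# Parse the string representation back into a Python list
reloaded['artists_genres'] = reloaded['artists_genres'].apply(ast.literal_eval)
restored = reloaded.loc[1, 'artists_genres']
print(restored)  # ['album rock', 'art rock']
```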