Contents
Here, we preprocess the Spotify data. By the end of this preprocessing stage, we have two CSV files containing data ready to be used for EDA and for building our predictive model.
Overview of the steps:
- Load tracks.json and playlists_from_200_search_words.json files
- Extract track features for each playlist. The result is stored as a list of Python dictionaries, one dictionary per playlist; within each dictionary, every key is a feature and every value is a list of feature values, with one entry per track in the playlist.
- Build playlists dataframe (feature engineering).
- For each playlist, there are playlist-level and track-level variables. We kept playlist-level variables as they are. For track-level numeric variables, we computed the average and standard deviation, and for track-level categorical variables, we took the mode and counted the number of unique values (see the short sketch after this list). Since Spotify playlists and tracks are not directly labeled with a genre, we defined the genre of a playlist to be the most frequently occurring artist genre among all its tracks.
- Playlist-level predictors: number of tracks
- Track-level predictors: average and standard deviation of all numerical track audio features (e.g. danceability, tempo), popularities (e.g. track, album), and number of available markets; plus the mode and number of unique values of track artist genre, key, and time signature
- Build tracks dataframe
- For each track, we added a new column genre, which is based on the mode of its artists' genres.
- Save playlists and tracks dataframes to csv. They are ready for EDA and model building.
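To make the structure and the feature engineering above concrete, here is a minimal sketch with made-up values (the playlist dictionary below is hypothetical and only carries a few of the real keys): each track-level feature is stored as a list with one entry per track, and those lists are collapsed into playlist-level summaries.

import collections
import numpy as np

# Hypothetical playlist dictionary after track features have been extracted:
# every 'track_*' key holds one value per track in the playlist.
playlist = {
    'id': 'hypothetical_playlist_id',
    'num_tracks': 3,                           # playlist-level: kept as is
    'track_danceability': [0.71, 0.63, 0.80],  # track-level numeric
    'track_key': [5, 5, 9],                    # track-level categorical
    'track_artists_genres': [['pop'], ['pop', 'dance'], ['rock']],
}

row = {'num_tracks': playlist['num_tracks']}

# track-level numeric feature -> average and standard deviation
row['track_danceability_avg'] = np.mean(playlist['track_danceability'])
row['track_danceability_std'] = np.std(playlist['track_danceability'])

# track-level categorical feature -> mode and number of unique values
counter = collections.Counter(playlist['track_key'])
row['track_key_mode'] = counter.most_common(1)[0][0]
row['track_key_unique'] = len(set(playlist['track_key']))

# playlist genre -> most frequent artist genre across all tracks
all_genres = [g for genres in playlist['track_artists_genres'] for g in genres]
row['genre'] = collections.Counter(all_genres).most_common(1)[0][0]

print(row)  # {'num_tracks': 3, 'track_danceability_avg': 0.713..., ..., 'genre': 'pop'}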
Define helper functions
These libraries and functions are used to preprocess the data scraped from the Spotify API.
import json
import collections
from copy import deepcopy

import numpy as np
import pandas as pd


def load_data(file):
    """
    Function to load a json file
    """
    with open(file, 'r') as fd:
        data_from_json = json.load(fd)
    return data_from_json
def extract_track_features(tracks_db, playlists):
    """
    Function to get track features and return playlist dictionaries with track features
    """
    processed_playlists = deepcopy(playlists)
    missing_counts = 0

    # Loop over each playlist
    for index, playlist in enumerate(processed_playlists):
        # get the list of track ids for this playlist
        track_ids = playlist['track_ids']

        track_feature_keys = ['acousticness', 'album_id', 'album_name', 'album_popularity', 'artists_genres',
                              'artists_ids', 'artists_names', 'artists_num_followers', 'artists_popularities',
                              'avg_artist_num_followers', 'avg_artist_popularity', 'danceability', 'duration_ms',
                              'energy', 'explicit', 'instrumentalness', 'isrc', 'key', 'liveness',
                              'loudness', 'mode', 'mode_artist_genre', 'name', 'num_available_markets',
                              'popularity', 'speechiness', 'std_artist_num_followers', 'std_artist_popularity',
                              'tempo', 'time_signature', 'valence']

        # create an empty list in the playlist for each track-level feature
        for track_feature_key in track_feature_keys:
            playlist['track_' + track_feature_key] = []

        # append each track's features to those lists
        for track_id in track_ids:
            # check if the track_id is among the scraped tracks
            if track_id in tracks_db.keys():
                # append the track's features to the playlist dictionary
                for track_feature_key in track_feature_keys:
                    if track_feature_key in tracks_db[track_id].keys():
                        playlist['track_' + track_feature_key].append(tracks_db[track_id][track_feature_key])
            else:
                missing_counts += 1

        processed_playlists[index] = playlist

    print('tracks that are missing : {}'.format(missing_counts))
    return processed_playlists
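As a rough usage sketch (the IDs and values below are made up, and only a few of the feature keys are shown), extract_track_features expects tracks_db to be a dictionary keyed by track ID and playlists to be a list of dictionaries that each carry a track_ids list:

# Hypothetical toy inputs, for illustration only
tracks_db_example = {
    'track_a': {'danceability': 0.71, 'key': 5, 'artists_genres': ['pop']},
    'track_b': {'danceability': 0.63, 'key': 9, 'artists_genres': ['rock']},
}
playlists_example = [
    {'id': 'playlist_1', 'num_tracks': 2, 'track_ids': ['track_a', 'track_b']},
]

processed = extract_track_features(tracks_db_example, playlists_example)
print(processed[0]['track_danceability'])  # [0.71, 0.63]
print(processed[0]['track_key'])           # [5, 9]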
def build_playlist_dataframe(playlists_dictionary_list):
    """
    Function to build the playlist dataframe from the playlist dictionaries with track features
    """
    # drop one specific problematic playlist, identified by its id
    if playlists_dictionary_list[7914]['id'] == '4krpfadGaaW42C7cEm2O0A':
        del playlists_dictionary_list[7914]

    # features to take the avg and std of
    features_avg = ['track_acousticness', 'track_avg_artist_num_followers', 'track_album_popularity',
                    'track_avg_artist_popularity', 'track_danceability', 'track_duration_ms',
                    'track_energy', 'track_explicit', 'track_instrumentalness', 'track_liveness',
                    'track_loudness', 'track_mode', 'track_num_available_markets',
                    'track_std_artist_num_followers', 'track_std_artist_popularity',
                    'track_popularity', 'track_speechiness', 'track_tempo', 'track_valence'
                    ]
    # features to take the mode and number of uniques of
    features_mode = ['track_artists_genres', 'track_key', 'track_time_signature']
    # features to keep as is
    features = ['collaborative', 'num_followers', 'num_tracks']

    processed_playlists = {}
    for index, playlist in enumerate(playlists_dictionary_list):
        playlist_info = {}
        playlist_info['id'] = playlist['id']

        for key in playlist.keys():
            if key in features_avg:  # take avg and std
                playlist_info[key + '_avg'] = np.mean(playlist[key])
                playlist_info[key + '_std'] = np.std(playlist[key])
                if key in set(['track_popularity', 'track_album_popularity', 'track_avg_artist_popularity']):
                    playlist_info[key + '_max'] = max(playlist[key])
            elif key in features_mode:  # take mode and number of uniques
                if playlist[key]:
                    if key == 'track_artists_genres':
                        # each track carries a list of genres, so flatten before counting
                        flatten = lambda l: [item for sublist in l for item in sublist]
                        flattened_value = flatten(playlist[key])
                        if flattened_value:
                            counter = collections.Counter(flattened_value)
                            playlist_info[key + '_mode'] = counter.most_common()[0][0]
                            playlist_info[key + '_unique'] = len(set(flattened_value))
                    else:
                        counter = collections.Counter(playlist[key])
                        playlist_info[key + '_mode'] = counter.most_common()[0][0]
                        playlist_info[key + '_unique'] = len(set(playlist[key]))
            elif key in features:  # keep as is
                playlist_info[key] = playlist[key]

        processed_playlists[index] = playlist_info

    df = pd.DataFrame(processed_playlists).T

    # Drop all observations (playlists) with missingness
    df_full = df.dropna(axis=0, how='any')
    df_full.reset_index(inplace=True, drop=True)

    # Define our genre labels, ordered from most to least specific
    predefined_genres = ['pop rap', 'punk', 'korean pop', 'pop christmas', 'folk', 'indie pop', 'pop',
                         'rock', 'rap', 'house', 'indie', 'dance', 'edm', 'mellow', 'hip hop',
                         'alternative', 'jazz', 'r&b', 'soul', 'reggae', 'classical', 'funk', 'country',
                         'metal', 'blues', 'elect']

    # Create a new column genre
    df_full['genre'] = None

    # Label genres; iterate in reverse so that the more specific genres
    # (listed first) overwrite the broader ones whenever both match
    genres = df_full['track_artists_genres_mode']
    for g in reversed(predefined_genres):
        df_full.loc[genres.str.contains(g), 'genre'] = g

    # Label all observations that did not match our predefined genres as 'other'
    df_full['genre'] = df_full['genre'].fillna('other')
    df_full.drop('track_artists_genres_mode', axis=1, inplace=True)

    return df_full
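To see why the predefined genres are scanned in reverse, here is a small self-contained sketch (the genre strings are made up): later assignments overwrite earlier ones, so reversing the loop lets the more specific labels listed first (e.g. 'pop rap') take precedence over broader ones (e.g. 'pop' or 'rap') whenever both match.

import pandas as pd

genres = pd.Series(['dance pop', 'southern hip hop', 'pop rap'])  # hypothetical mode genres
labels = pd.Series([None] * len(genres), dtype=object)

specific_first = ['pop rap', 'hip hop', 'pop', 'rap', 'dance']  # most specific first
for g in reversed(specific_first):
    labels[genres.str.contains(g)] = g

print(labels.tolist())  # ['pop', 'hip hop', 'pop rap']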
def build_track_dataframe(tracks_db):
    """
    Function to build the track dataframe
    """
    df = pd.DataFrame(tracks_db).T
    df.reset_index(inplace=True)
    df.rename(columns={'index': 'trackID'}, inplace=True)
    df.drop('album_genres', axis=1, inplace=True)  # drop album genre because it is null for all tracks

    # Define our genre labels, ordered from most to least specific
    predefined_genres = ['pop rap', 'punk', 'korean pop', 'pop christmas', 'folk', 'indie pop', 'pop',
                         'rock', 'rap', 'house', 'indie', 'dance', 'edm', 'mellow', 'hip hop',
                         'alternative', 'jazz', 'r&b', 'soul', 'reggae', 'classical', 'funk', 'country',
                         'metal', 'blues', 'elect']

    # Drop all observations (tracks) with missingness
    df_full = df.dropna(axis=0, how='any')
    df_full.reset_index(inplace=True, drop=True)

    # Create a new column genre
    df_full['genre'] = None

    # Label genres; iterate in reverse so that the more specific genres win
    genres = df_full['mode_artist_genre']
    for g in reversed(predefined_genres):
        df_full.loc[genres.str.contains(g), 'genre'] = g

    # Label all observations that did not match our predefined genres as 'other'
    df_full['genre'] = df_full['genre'].fillna('other')
    df_full.drop('mode_artist_genre', axis=1, inplace=True)

    return df_full
Preprocess the playlists
After extracting the relevant track features and computing the playlist-level summaries, we stored the playlist dataframe as a CSV file for easy access in the later parts of this project.
playlists = load_data('../../data_archive/playlists_from_200_search_words.json')
tracks_db = load_data('../../data_archive/tracks.json')
playlists_with_track_features = extract_track_features(tracks_db, playlists)
playlists_df = build_playlist_dataframe(playlists_with_track_features)
playlists_df.head()
tracks that are missing : 505
  | collaborative | id | num_followers | num_tracks | track_acousticness_avg | track_acousticness_std | track_album_popularity_avg | track_album_popularity_max | track_album_popularity_std | track_artists_genres_unique | ... | track_std_artist_num_followers_std | track_std_artist_popularity_avg | track_std_artist_popularity_std | track_tempo_avg | track_tempo_std | track_time_signature_mode | track_time_signature_unique | track_valence_avg | track_valence_std | genre
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | False | 37i9dQZF1DX1N5uK98ms5p | 3000606 | 52 | 0.180999 | 0.17112 | 71.6731 | 96 | 13.1364 | 60 | ... | 921166 | 1.78425 | 3.08155 | 116.689 | 25.1949 | 4 | 1 | 0.456071 | 0.184214 | pop |
1 | False | 37i9dQZF1DX5drguwUcl5X | 69037 | 75 | 0.144201 | 0.160799 | 68.44 | 100 | 15.5111 | 70 | ... | 1.53959e+06 | 2.11486 | 3.17182 | 114.454 | 24.115 | 4 | 2 | 0.555027 | 0.19144 | pop |
2 | False | 37i9dQZF1DX9bAf4c66TGs | 385875 | 38 | 0.1166 | 0.117615 | 72.4211 | 94 | 16.1923 | 44 | ... | 2.05042e+06 | 2.12676 | 2.15179 | 115.813 | 22.7593 | 4 | 1 | 0.526526 | 0.201783 | pop |
3 | False | 37i9dQZF1DX9nq0BqAtM4H | 69344 | 40 | 0.134162 | 0.247197 | 57.025 | 82 | 18.0838 | 97 | ... | 308030 | 0.0375 | 0.172753 | 126.491 | 29.5215 | 4 | 2 | 0.501825 | 0.188804 | pop |
4 | False | 1dCUPq7sB98i1jgQmo9d7e | 15612 | 26 | 0.171635 | 0.229736 | 53.4615 | 54 | 0.498519 | 5 | ... | 12787 | 3.34629 | 3.18413 | 126.678 | 33.242 | 4 | 1 | 0.658846 | 0.184523 | pop |
5 rows × 51 columns
print('Number of observations with missing values: ', playlists_df.isnull().any(axis=1).sum())
Number of observations with missing values: 0
playlists_df.to_csv('../../data/playlists.csv', index=False)
Preprocess the tracks
tracks_df = build_track_dataframe(tracks_db)
tracks_df.head()
  | trackID | acousticness | album_id | album_name | album_popularity | artists_genres | artists_ids | artists_names | artists_num_followers | artists_popularities | ... | name | num_available_markets | popularity | speechiness | std_artist_num_followers | std_artist_popularity | tempo | time_signature | valence | genre
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 000C3ZY8325A4yktxnnwCl | 0.952 | 3ypgq6ExA3JN8s2biuRK5e | Soft Ice | 36 | [drift] | [4Uqu4U6hhDMODyzSCtNDzG] | [Poemme] | [531] | [47] | ... | When the Sun Is a Stranger | 62 | 26 | 0.0469 | 0 | 0 | 134.542 | 3 | 0.0835 | other |
1 | 000EWWBkYaREzsBplYjUag | 0.815 | 5WGfEM0WaAyoJa6AOSfx7T | Red Flower | 59 | [chillhop] | [0oer0EPMRrosfCF2tUt2jU] | [Don Philippe] | [1300] | [56] | ... | Fewerdolr | 62 | 40 | 0.0747 | 0 | 0 | 76.43 | 4 | 0.56 | other |
2 | 000hI2Lxs4BxqJyqbw7Y10 | 0.108 | 5RIqRVn99mfdZSVmgjBrfj | Las 35 Baladas de Medina Azahara | 0 | [latin metal, rock en espanol, spanish new wav... | [72XPmW6k6HZT6K2BaUUOhl] | [Medina Azahara] | [43172] | [47] | ... | Tu Mirada | 0 | 0 | 0.029 | 0 | 0 | 72.474 | 4 | 0.41 | metal |
3 | 000uWezkHfg6DbUPf2eDFO | 0.00188 | 3MBXzJXHFBslpPUcxNB3jn | Dancehall Days | 36 | [reggae rock] | [0hDJSg859MdK4c9vqu1dS8] | [The Beautiful Girls] | [48518] | [56] | ... | Me I Disconnect From You | 39 | 21 | 0.0298 | 0 | 0 | 134.008 | 4 | 0.362 | rock |
4 | 000x2qE0ZI3hodeVrnJK8A | 0.339 | 2N0AgtWbCmVoNUl2GN1opH | Dreamboat Annie | 62 | [album rock, art rock, classic rock, dance roc... | [34jw2BbxjoYalTp8cJFCPv] | [Heart] | [413139] | [70] | ... | (Love Me Like Music) I'll Be Your Song | 62 | 32 | 0.0306 | 0 | 0 | 134.248 | 4 | 0.472 | rock |
5 rows × 33 columns
tracks_df.to_csv('../../data/tracks.csv', index=False)