Contents
Here, we preprocess the Spotify data. By the end of this preprocessing stage, we have two CSV files containing data ready to be used for EDA and for building our predictive model.
Overview of the steps:
- Load tracks.json and playlists_from_200_search_words.json files
- Extract track features for each playlist. The result is stored as a list of Python dictionaries, one dictionary per playlist; within each dictionary, every key is a feature and every value is a list of feature values, with one entry per track in the playlist.
- Build playlists dataframe (feature engineering).
- For each playlist, there are playlist-level and track-level variables. We kept playlist-level variables as they are. For track-level numeric variables, we computed the average and standard deviation, and for track-level categorical variables, we took the mode and counted the number of unique values (see the short sketch after this list). Since Spotify playlists and tracks are not directly labeled with a genre, we defined the genre of a playlist to be the most frequently occurring artist genre among all its tracks.
- Playlist-level predictors: number of tracks
- Track-level predictors: average and standard deviation of all numerical track audio features (e.g. danceability, tempo), popularities (e.g. track, album), and number of available markets; plus the mode and number of unique values of track artist genre, key, and time signature
- Build tracks dataframe
- For each track, we added a new column genre, which is based on the mode of its artists' genres.
- Save playlists and tracks dataframes to csv. They are ready for EDA and model building.
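To make the structure and the feature engineering above concrete, here is a minimal sketch with made-up values (the playlist dictionary below is hypothetical and only carries a few of the real keys): each track-level feature is stored as a list with one entry per track, and those lists are collapsed into playlist-level summaries.

import collections
import numpy as np

# Hypothetical playlist dictionary after track features have been extracted:
# every 'track_*' key holds one value per track in the playlist.
playlist = {
    'id': 'hypothetical_playlist_id',
    'num_tracks': 3,                           # playlist-level: kept as is
    'track_danceability': [0.71, 0.63, 0.80],  # track-level numeric
    'track_key': [5, 5, 9],                    # track-level categorical
    'track_artists_genres': [['pop'], ['pop', 'dance'], ['rock']],
}

row = {'num_tracks': playlist['num_tracks']}

# track-level numeric feature -> average and standard deviation
row['track_danceability_avg'] = np.mean(playlist['track_danceability'])
row['track_danceability_std'] = np.std(playlist['track_danceability'])

# track-level categorical feature -> mode and number of unique values
counter = collections.Counter(playlist['track_key'])
row['track_key_mode'] = counter.most_common(1)[0][0]
row['track_key_unique'] = len(set(playlist['track_key']))

# playlist genre -> most frequent artist genre across all tracks
all_genres = [g for genres in playlist['track_artists_genres'] for g in genres]
row['genre'] = collections.Counter(all_genres).most_common(1)[0][0]

print(row)  # {'num_tracks': 3, 'track_danceability_avg': 0.713..., ..., 'genre': 'pop'}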
Define helper functions
These libraries and functions are used to preprocess the data scraped from the Spotify API.
import json
import collections
from copy import deepcopy

import numpy as np
import pandas as pd


def load_data(file):
    """
    Function to load a json file
    """
    with open(file, 'r') as fd:
        data_from_json = json.load(fd)
    return data_from_json
def extract_track_features(tracks_db, playlists):
    """
    Function to get track features and return playlist dictionaries with track features
    """
    processed_playlists = deepcopy(playlists)
    missing_counts = 0

    # Loop over each playlist
    for index, playlist in enumerate(processed_playlists):
        # get the list of track ids for this playlist
        track_ids = playlist['track_ids']

        track_feature_keys = ['acousticness', 'album_id', 'album_name', 'album_popularity', 'artists_genres',
                              'artists_ids', 'artists_names', 'artists_num_followers', 'artists_popularities',
                              'avg_artist_num_followers', 'avg_artist_popularity', 'danceability', 'duration_ms',
                              'energy', 'explicit', 'instrumentalness', 'isrc', 'key', 'liveness',
                              'loudness', 'mode', 'mode_artist_genre', 'name', 'num_available_markets',
                              'popularity', 'speechiness', 'std_artist_num_followers', 'std_artist_popularity',
                              'tempo', 'time_signature', 'valence']

        # create an empty list in the playlist for each track-level feature
        for track_feature_key in track_feature_keys:
            playlist['track_' + track_feature_key] = []

        # append each track's features to those lists
        for track_id in track_ids:
            # check if the track_id is among the scraped tracks
            if track_id in tracks_db.keys():
                # append the track's features to the playlist dictionary
                for track_feature_key in track_feature_keys:
                    if track_feature_key in tracks_db[track_id].keys():
                        playlist['track_' + track_feature_key].append(tracks_db[track_id][track_feature_key])
            else:
                missing_counts += 1

        processed_playlists[index] = playlist

    print('tracks that are missing : {}'.format(missing_counts))
    return processed_playlists
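As a rough usage sketch (the IDs and values below are made up, and only a few of the feature keys are shown), extract_track_features expects tracks_db to be a dictionary keyed by track ID and playlists to be a list of dictionaries that each carry a track_ids list:

# Hypothetical toy inputs, for illustration only
tracks_db_example = {
    'track_a': {'danceability': 0.71, 'key': 5, 'artists_genres': ['pop']},
    'track_b': {'danceability': 0.63, 'key': 9, 'artists_genres': ['rock']},
}
playlists_example = [
    {'id': 'playlist_1', 'num_tracks': 2, 'track_ids': ['track_a', 'track_b']},
]

processed = extract_track_features(tracks_db_example, playlists_example)
print(processed[0]['track_danceability'])  # [0.71, 0.63]
print(processed[0]['track_key'])           # [5, 9]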
def build_playlist_dataframe(playlists_dictionary_list):
    """
    Function to build the playlist dataframe from the playlist dictionaries with track features
    """
    # drop one specific problematic playlist, identified by its id
    if playlists_dictionary_list[7914]['id'] == '4krpfadGaaW42C7cEm2O0A':
        del playlists_dictionary_list[7914]

    # features to take the avg and std of
    features_avg = ['track_acousticness', 'track_avg_artist_num_followers', 'track_album_popularity',
                    'track_avg_artist_popularity', 'track_danceability', 'track_duration_ms',
                    'track_energy', 'track_explicit', 'track_instrumentalness', 'track_liveness',
                    'track_loudness', 'track_mode', 'track_num_available_markets',
                    'track_std_artist_num_followers', 'track_std_artist_popularity',
                    'track_popularity', 'track_speechiness', 'track_tempo', 'track_valence'
                    ]
    # features to take the mode and number of uniques of
    features_mode = ['track_artists_genres', 'track_key', 'track_time_signature']
    # features to keep as is
    features = ['collaborative', 'num_followers', 'num_tracks']

    processed_playlists = {}
    for index, playlist in enumerate(playlists_dictionary_list):
        playlist_info = {}
        playlist_info['id'] = playlist['id']

        for key in playlist.keys():
            if key in features_avg:  # take avg and std
                playlist_info[key + '_avg'] = np.mean(playlist[key])
                playlist_info[key + '_std'] = np.std(playlist[key])
                if key in set(['track_popularity', 'track_album_popularity', 'track_avg_artist_popularity']):
                    playlist_info[key + '_max'] = max(playlist[key])
            elif key in features_mode:  # take mode and number of uniques
                if playlist[key]:
                    if key == 'track_artists_genres':
                        # each track carries a list of genres, so flatten before counting
                        flatten = lambda l: [item for sublist in l for item in sublist]
                        flattened_value = flatten(playlist[key])
                        if flattened_value:
                            counter = collections.Counter(flattened_value)
                            playlist_info[key + '_mode'] = counter.most_common()[0][0]
                            playlist_info[key + '_unique'] = len(set(flattened_value))
                    else:
                        counter = collections.Counter(playlist[key])
                        playlist_info[key + '_mode'] = counter.most_common()[0][0]
                        playlist_info[key + '_unique'] = len(set(playlist[key]))
            elif key in features:  # keep as is
                playlist_info[key] = playlist[key]

        processed_playlists[index] = playlist_info

    df = pd.DataFrame(processed_playlists).T

    # Drop all observations (playlists) with missingness
    df_full = df.dropna(axis=0, how='any')
    df_full.reset_index(inplace=True, drop=True)

    # Define our genre labels, ordered from most to least specific
    predefined_genres = ['pop rap', 'punk', 'korean pop', 'pop christmas', 'folk', 'indie pop', 'pop',
                         'rock', 'rap', 'house', 'indie', 'dance', 'edm', 'mellow', 'hip hop',
                         'alternative', 'jazz', 'r&b', 'soul', 'reggae', 'classical', 'funk', 'country',
                         'metal', 'blues', 'elect']

    # Create a new column genre
    df_full['genre'] = None

    # Label genres; iterate in reverse so that the more specific genres
    # (listed first) overwrite the broader ones whenever both match
    genres = df_full['track_artists_genres_mode']
    for g in reversed(predefined_genres):
        df_full.loc[genres.str.contains(g), 'genre'] = g

    # Label all observations that did not match our predefined genres as 'other'
    df_full['genre'] = df_full['genre'].fillna('other')
    df_full.drop('track_artists_genres_mode', axis=1, inplace=True)

    return df_full
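To see why the predefined genres are scanned in reverse, here is a small self-contained sketch (the genre strings are made up): later assignments overwrite earlier ones, so reversing the loop lets the more specific labels listed first (e.g. 'pop rap') take precedence over broader ones (e.g. 'pop' or 'rap') whenever both match.

import pandas as pd

genres = pd.Series(['dance pop', 'southern hip hop', 'pop rap'])  # hypothetical mode genres
labels = pd.Series([None] * len(genres), dtype=object)

specific_first = ['pop rap', 'hip hop', 'pop', 'rap', 'dance']  # most specific first
for g in reversed(specific_first):
    labels[genres.str.contains(g)] = g

print(labels.tolist())  # ['pop', 'hip hop', 'pop rap']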
def build_track_dataframe(tracks_db):
    """
    Function to build the track dataframe
    """
    df = pd.DataFrame(tracks_db).T
    df.reset_index(inplace=True)
    df.rename(columns={'index': 'trackID'}, inplace=True)
    df.drop('album_genres', axis=1, inplace=True)  # drop album genre because it is null for all tracks

    # Define our genre labels, ordered from most to least specific
    predefined_genres = ['pop rap', 'punk', 'korean pop', 'pop christmas', 'folk', 'indie pop', 'pop',
                         'rock', 'rap', 'house', 'indie', 'dance', 'edm', 'mellow', 'hip hop',
                         'alternative', 'jazz', 'r&b', 'soul', 'reggae', 'classical', 'funk', 'country',
                         'metal', 'blues', 'elect']

    # Drop all observations (tracks) with missingness
    df_full = df.dropna(axis=0, how='any')
    df_full.reset_index(inplace=True, drop=True)

    # Create a new column genre
    df_full['genre'] = None

    # Label genres; iterate in reverse so that the more specific genres win
    genres = df_full['mode_artist_genre']
    for g in reversed(predefined_genres):
        df_full.loc[genres.str.contains(g), 'genre'] = g

    # Label all observations that did not match our predefined genres as 'other'
    df_full['genre'] = df_full['genre'].fillna('other')
    df_full.drop('mode_artist_genre', axis=1, inplace=True)

    return df_full
Preprocess the playlists
After extracting the relevant track features and computing the playlist-level summaries, we stored the playlist dataframe as a CSV file for easy access in the later parts of this project.
playlists = load_data('../../data_archive/playlists_from_200_search_words.json')
tracks_db = load_data('../../data_archive/tracks.json')
playlists_with_track_features = extract_track_features(tracks_db, playlists)
playlists_df = build_playlist_dataframe(playlists_with_track_features)
playlists_df.head()
tracks that are missing : 505
  | collaborative | id | num_followers | num_tracks | track_acousticness_avg | track_acousticness_std | track_album_popularity_avg | track_album_popularity_max | track_album_popularity_std | track_artists_genres_unique | ... | track_std_artist_num_followers_std | track_std_artist_popularity_avg | track_std_artist_popularity_std | track_tempo_avg | track_tempo_std | track_time_signature_mode | track_time_signature_unique | track_valence_avg | track_valence_std | genre
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | False | 37i9dQZF1DX1N5uK98ms5p | 3000606 | 52 | 0.180999 | 0.17112 | 71.6731 | 96 | 13.1364 | 60 | ... | 921166 | 1.78425 | 3.08155 | 116.689 | 25.1949 | 4 | 1 | 0.456071 | 0.184214 | pop |
1 | False | 37i9dQZF1DX5drguwUcl5X | 69037 | 75 | 0.144201 | 0.160799 | 68.44 | 100 | 15.5111 | 70 | ... | 1.53959e+06 | 2.11486 | 3.17182 | 114.454 | 24.115 | 4 | 2 | 0.555027 | 0.19144 | pop |
2 | False | 37i9dQZF1DX9bAf4c66TGs | 385875 | 38 | 0.1166 | 0.117615 | 72.4211 | 94 | 16.1923 | 44 | ... | 2.05042e+06 | 2.12676 | 2.15179 | 115.813 | 22.7593 | 4 | 1 | 0.526526 | 0.201783 | pop |
3 | False | 37i9dQZF1DX9nq0BqAtM4H | 69344 | 40 | 0.134162 | 0.247197 | 57.025 | 82 | 18.0838 | 97 | ... | 308030 | 0.0375 | 0.172753 | 126.491 | 29.5215 | 4 | 2 | 0.501825 | 0.188804 | pop |
4 | False | 1dCUPq7sB98i1jgQmo9d7e | 15612 | 26 | 0.171635 | 0.229736 | 53.4615 | 54 | 0.498519 | 5 | ... | 12787 | 3.34629 | 3.18413 | 126.678 | 33.242 | 4 | 1 | 0.658846 | 0.184523 | pop |
5 rows × 51 columns
print('Number of observations with missing values: ', playlists_df.isnull().any(axis=1).sum())
Number of observations with missing values: 0
playlists_df.to_csv('../../data/playlists.csv', index=False)
Preprocess the tracks
tracks_df = build_track_dataframe(tracks_db)
tracks_df.head()
  | trackID | acousticness | album_id | album_name | album_popularity | artists_genres | artists_ids | artists_names | artists_num_followers | artists_popularities | ... | name | num_available_markets | popularity | speechiness | std_artist_num_followers | std_artist_popularity | tempo | time_signature | valence | genre
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 000C3ZY8325A4yktxnnwCl | 0.952 | 3ypgq6ExA3JN8s2biuRK5e | Soft Ice | 36 | [drift] | [4Uqu4U6hhDMODyzSCtNDzG] | [Poemme] | [531] | [47] | ... | When the Sun Is a Stranger | 62 | 26 | 0.0469 | 0 | 0 | 134.542 | 3 | 0.0835 | other |
1 | 000EWWBkYaREzsBplYjUag | 0.815 | 5WGfEM0WaAyoJa6AOSfx7T | Red Flower | 59 | [chillhop] | [0oer0EPMRrosfCF2tUt2jU] | [Don Philippe] | [1300] | [56] | ... | Fewerdolr | 62 | 40 | 0.0747 | 0 | 0 | 76.43 | 4 | 0.56 | other |
2 | 000hI2Lxs4BxqJyqbw7Y10 | 0.108 | 5RIqRVn99mfdZSVmgjBrfj | Las 35 Baladas de Medina Azahara | 0 | [latin metal, rock en espanol, spanish new wav... | [72XPmW6k6HZT6K2BaUUOhl] | [Medina Azahara] | [43172] | [47] | ... | Tu Mirada | 0 | 0 | 0.029 | 0 | 0 | 72.474 | 4 | 0.41 | metal |
3 | 000uWezkHfg6DbUPf2eDFO | 0.00188 | 3MBXzJXHFBslpPUcxNB3jn | Dancehall Days | 36 | [reggae rock] | [0hDJSg859MdK4c9vqu1dS8] | [The Beautiful Girls] | [48518] | [56] | ... | Me I Disconnect From You | 39 | 21 | 0.0298 | 0 | 0 | 134.008 | 4 | 0.362 | rock |
4 | 000x2qE0ZI3hodeVrnJK8A | 0.339 | 2N0AgtWbCmVoNUl2GN1opH | Dreamboat Annie | 62 | [album rock, art rock, classic rock, dance roc... | [34jw2BbxjoYalTp8cJFCPv] | [Heart] | [413139] | [70] | ... | (Love Me Like Music) I'll Be Your Song | 62 | 32 | 0.0306 | 0 | 0 | 134.248 | 4 | 0.472 | rock |
5 rows × 33 columns
tracks_df.to_csv('../../data/tracks.csv', index=False)