Spotify

Predicting the success of your playlist and providing personalized recommendations

Contents

Here, we explore different regression models to predict the number of followers (on a log scale) of each playlist. We use two sets of features and fit a set of regression models on each.

  • Feature set 1: main predictors only
  • Feature set 2: main predictors plus interaction terms of the form (genre × numerical_audio_feature_avg)

For each feature set, we take the following two steps (a minimal sketch of both meta regressors follows this list):

  • Fit the single regression models:
    • Linear Regression
    • (For feature set 2 only: Linear Regression on a PCA transform of the predictors)
    • RidgeCV
    • LassoCV
    • Random Forest Regressor
    • AdaBoost Regressor
  • Stack all the single models fitted on the training set to build a
    • Meta regressor 1: Weighted Average
      • Gather the predicted values of all the fitted single models on the validation set
      • Average the single models’ predictions, weighting each model by its R^2 score on the same validation set
    • Meta regressor 2: Meta Linear Regressor
      • Gather the predicted values of all the fitted single models on the validation set
      • Fit a linear regression model on these single models’ predicted values
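
As a reference, here is a minimal sketch of both meta regressors, assuming a list models of already-fitted scikit-learn regressors; the helper names (stack_predictions, stack_weighted_average, stack_meta_linear) are illustrative and do not appear in the notebook:

import numpy as np
from sklearn.linear_model import LinearRegression

def stack_predictions(models, X):
    # One column of predictions per fitted single model
    return np.column_stack([m.predict(X) for m in models])

def stack_weighted_average(models, X_valid, y_valid, X_test):
    # Meta regressor 1: weight each single model by its validation R^2
    weights = np.array([m.score(X_valid, y_valid) for m in models])
    weights = weights / weights.sum()
    return stack_predictions(models, X_test) @ weights

def stack_meta_linear(models, X_valid, y_valid, X_test):
    # Meta regressor 2: linear regression on the single models' predictions;
    # the validation set serves as the meta model's training set
    meta = LinearRegression().fit(stack_predictions(models, X_valid), y_valid)
    return meta.predict(stack_predictions(models, X_test))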

As shown in the summary at the end of this notebook, the Random Forest Regressor is the best single model, and the stacked Meta Linear Regression model is the best overall. Comparing models across the two feature sets, the test R^2 scores are broadly comparable. The interaction terms between genre and the numerical audio feature averages slightly improved the test R^2 of RidgeCV, LassoCV, and the weighted-average meta model, but not of the other models. This suggests that some of the interaction terms are useful while others are not, so the models fitted on main predictors plus interaction terms are, to some extent, overfitted.
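
One quick way to probe this claim (not part of the original notebook) is to count how many interaction-term coefficients LassoCV shrinks exactly to zero; this sketch assumes the lasso_cv_int model and the X_train_int data frame defined later in this notebook:

# Count interaction-term coefficients that LassoCV shrank exactly to zero;
# interaction columns are named <feature>_X_<genre> (see the construction below)
interaction_cols = [c for c in X_train_int.columns if '_X_genre_' in c]
coefs = dict(zip(X_train_int.columns, lasso_cv_int.coef_))
n_zero = sum(coefs[c] == 0 for c in interaction_cols)
print('{} of {} interaction terms dropped by LassoCV'.format(n_zero, len(interaction_cols)))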

Note on the additional 209 method: in addition to the regression models discussed in class, we implemented two meta regressors that combine the prediction results of all single models, either by weighted averaging or by fitting an additional linear regressor on the single models’ prediction outputs. The meta linear regressor gives the best performance among all models.
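
The code below assumes the following imports, which this export omits; this is a reasonable reconstruction based on what the cells use (on older scikit-learn versions, joblib was imported as from sklearn.externals import joblib):

import json
from collections import OrderedDict
from copy import deepcopy
from itertools import product

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor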

def dump_data(data_to_json, file):
    """Serialize data to a JSON file, e.g. file = '../data/playlists_5088.json'."""
    with open(file, 'w') as fd:
        json.dump(data_to_json, fd)

def load_data(file):
    """Load data back from a JSON file."""
    with open(file, 'r') as fd:
        return json.load(fd)

Regression models with main predictors only

Data Processing

Take the log transform of the skewed response variable:

num_followers -> log_num_followers

playlists_df_full = pd.read_csv('../../data/playlists.csv')
playlists_df_full.head()
collaborative num_followers num_tracks track_acousticness_avg track_acousticness_std track_album_popularity_avg track_album_popularity_max track_album_popularity_std track_artists_genres_unique track_avg_artist_num_followers_avg ... track_std_artist_num_followers_std track_std_artist_popularity_avg track_std_artist_popularity_std track_tempo_avg track_tempo_std track_time_signature_mode track_time_signature_unique track_valence_avg track_valence_std genre
0 False 3000606 52 0.180999 0.171120 71.673077 96 13.136445 60 1.276693e+06 ... 9.211661e+05 1.784248 3.081549 116.689288 25.194937 4 1 0.456071 0.184214 pop
1 False 69037 75 0.144201 0.160799 68.440000 100 15.511063 70 3.791621e+06 ... 1.539587e+06 2.114856 3.171820 114.453907 24.115022 4 2 0.555027 0.191440 pop
2 False 385875 38 0.116600 0.117615 72.421053 94 16.192317 44 2.319518e+06 ... 2.050419e+06 2.126763 2.151793 115.812500 22.759341 4 1 0.526526 0.201783 pop
3 False 69344 40 0.134162 0.247197 57.025000 82 18.083815 97 2.387520e+06 ... 3.080303e+05 0.037500 0.172753 126.490950 29.521523 4 2 0.501825 0.188804 pop
4 False 15612 26 0.171635 0.229736 53.461538 54 0.498519 5 8.566853e+04 ... 1.278704e+04 3.346290 3.184129 126.677692 33.241999 4 1 0.658846 0.184523 pop

5 rows × 50 columns

playlists_df_full['log_num_followers'] = np.log(playlists_df_full['num_followers'] + 1)

playlists_df_full.drop(['num_followers'], axis=1, inplace=True)

One-hot encode the categorical variables

playlists_df_full.drop(['collaborative'], axis=1, inplace=True)
categorical_predictors = ['genre', 'track_time_signature_mode', 'track_key_mode']
numerical_predictors = list(set(playlists_df_full.columns.values) - set(categorical_predictors))

playlists_df_full = pd.get_dummies(playlists_df_full, prefix = categorical_predictors, 
                                         columns = categorical_predictors, drop_first = True)
def change_column_order(df, col_name, index): 
    """
    Function to change column order in a dataframe
    """
    cols = df.columns.tolist() 
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]

playlists_df_full = change_column_order(playlists_df_full, 'log_num_followers', len(playlists_df_full.columns))

playlists_df_full.head()
num_tracks track_acousticness_avg track_acousticness_std track_album_popularity_avg track_album_popularity_max track_album_popularity_std track_artists_genres_unique track_avg_artist_num_followers_avg track_avg_artist_num_followers_std track_avg_artist_popularity_avg ... track_key_mode_3 track_key_mode_4 track_key_mode_5 track_key_mode_6 track_key_mode_7 track_key_mode_8 track_key_mode_9 track_key_mode_10 track_key_mode_11 log_num_followers
0 52 0.180999 0.171120 71.673077 96 13.136445 60 1.276693e+06 1.320843e+06 82.256410 ... 0 0 0 0 0 0 0 0 1 14.914325
1 75 0.144201 0.160799 68.440000 100 15.511063 70 3.791621e+06 3.658858e+06 84.684000 ... 0 0 0 0 0 0 0 0 0 11.142412
2 38 0.116600 0.117615 72.421053 94 16.192317 44 2.319518e+06 2.237652e+06 86.705263 ... 0 0 0 0 0 0 0 0 1 12.863271
3 40 0.134162 0.247197 57.025000 82 18.083815 97 2.387520e+06 3.589807e+06 72.987500 ... 0 0 0 1 0 0 0 0 0 11.146849
4 26 0.171635 0.229736 53.461538 54 0.498519 5 8.566853e+04 2.347853e+05 55.059341 ... 0 0 0 0 0 0 1 0 0 9.655859

5 rows × 86 columns

print('Are there any missing values in our data: {}'.format(playlists_df_full.isnull().values.any()))
Are there any missing values in our data: False

Split data 6:1:1

  1. train set for single models: 6/8 of all data
  2. validation set for single models & train set for meta models: 1/8 of all data
  3. test set: 1/8 of all data
np.random.seed(9001)
msk1 = np.random.rand(len(playlists_df_full)) < 0.75
playlists_df_train_main = playlists_df_full[msk1]
playlists_df_nontrain_main = playlists_df_full[~msk1]

msk12 = np.random.rand(len(playlists_df_nontrain_main)) < 0.5
playlists_df_valid_main = playlists_df_nontrain_main[msk12]
playlists_df_test_main = playlists_df_nontrain_main[~msk12]
X_train_main = playlists_df_train_main.iloc[:,:-1]
X_valid_main = playlists_df_valid_main.iloc[:,:-1]
X_test_main = playlists_df_test_main.iloc[:,:-1]

y_train = playlists_df_train_main['log_num_followers']
y_valid = playlists_df_valid_main['log_num_followers']
y_test = playlists_df_test_main['log_num_followers']

X_train_main.shape, y_train.shape, X_valid_main.shape, y_valid.shape, X_test_main.shape, y_test.shape
((7083, 85), (7083,), (1214, 85), (1214,), (1133, 85), (1133,))

Drop columns that contain only zeros, and standardize the numerical variables using the training set’s mean and std

# Iterate over a copy of the column list since columns may be dropped in the loop
for col in list(X_train_main.columns):
    if (X_train_main[col] == 0).all():
        # All-zero column: drop it from all three splits so they stay aligned
        X_train_main.drop(col, axis=1, inplace=True)
        X_valid_main.drop(col, axis=1, inplace=True)
        X_test_main.drop(col, axis=1, inplace=True)
    elif not np.logical_or(X_train_main[col] == 0, X_train_main[col] == 1).all():
        # Standardize numerical (non-dummy) variables with the train set's statistics
        mean_train = X_train_main[col].mean()
        std_train = X_train_main[col].std()
        X_train_main[col] = (X_train_main[col] - mean_train) / std_train
        X_valid_main[col] = (X_valid_main[col] - mean_train) / std_train
        X_test_main[col] = (X_test_main[col] - mean_train) / std_train

Regression Models

Linear regression

sim_lin_main = LinearRegression().fit(X_train_main, y_train)
joblib.dump(sim_lin_main, '../../fitted_models/sim_lin_main.pkl') 
sim_lin_main = joblib.load('../../fitted_models/sim_lin_main.pkl') 
print('-- Linear Regression with main predictors only --')
print('Training R^2: ', sim_lin_main.score(X_train_main, y_train))
print('Test R^2: ', sim_lin_main.score(X_test_main, y_test))
-- Linear Regression with main predictors only --
Training R^2:  0.263847739771
Test R^2:  0.252153441084

RidgeCV

lambdas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]
ridge_cv_main = RidgeCV(alphas=lambdas, fit_intercept=True).fit(X_train_main, y_train)
joblib.dump(ridge_cv_main, '../../fitted_models/ridge_cv_main.pkl') 
ridge_cv_main = joblib.load('../../fitted_models/ridge_cv_main.pkl') 
ridge_r2_train_main = ridge_cv_main.score(X_train_main, y_train)
ridge_r2_test_main = ridge_cv_main.score(X_test_main, y_test)

print('-- RidgeCV (best alpha = {}) with main predictors only --'.format(ridge_cv_main.alpha_))
print("Training R^2: {}".format(ridge_r2_train_main))
print("Test R^2: {}".format(ridge_r2_test_main))
-- RidgeCV (best alpha = 10.0) with main predictors only --
Training R^2: 0.263451109837063
Test R^2: 0.25052396982949465

LassoCV

lambdas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]
lasso_cv_main = LassoCV(alphas=lambdas, fit_intercept=True).fit(X_train_main, y_train)
joblib.dump(lasso_cv_main, '../../fitted_models/lasso_cv_main.pkl') 
lasso_cv_main = joblib.load('../../fitted_models/lasso_cv_main.pkl') 
lasso_r2_train_main = lasso_cv_main.score(X_train_main, y_train)
lasso_r2_test_main = lasso_cv_main.score(X_test_main, y_test)

print('-- LassoCV (best alpha = {}) with main predictors only --'.format(lasso_cv_main.alpha_))
print("Training R^2: {}".format(lasso_r2_train_main))
print("Test R^2: {}".format(lasso_r2_test_main))
-- LassoCV (best alpha = 0.01) with main predictors only --
Training R^2: 0.2556596987903933
Test R^2: 0.24914685865189712

Random Forest

RF_params = OrderedDict(
    n_estimators = [2**(i+2) for i in np.arange(7)],
    max_features = [0.1, 0.3, 0.5, 0.7, 0.9]
)
RF_params.values()
odict_values([[4, 8, 16, 32, 64, 128, 256], [0.1, 0.3, 0.5, 0.7, 0.9]])
cv_scores_RF_main = []
params_dict_RF_main = {}

for i, (n, f) in enumerate(product(*RF_params.values())):
    params_dict_RF_main[i] = (n, f)
    
    # Cross-validate a Random Forest Regressor for this parameter pair
    RF_main = RandomForestRegressor(oob_score=False, n_estimators=n, max_features=f, n_jobs=-1, random_state=22)
    scores = cross_val_score(RF_main, X_train_main, y_train, cv=5, scoring='r2')
    cv_scores_RF_main.append(scores.mean())

idx_optimal = np.argmax(cv_scores_RF_main)
optimal_params_RF_main = params_dict_RF_main[idx_optimal]
print('-- Random Forest Regression --')
print('Maximum R^2 validation score = {}'.format(max(cv_scores_RF_main)))
print('Optimal n_estimators = {}'.format(optimal_params_RF_main[0]))
print('Optimal max_features = {}'.format(optimal_params_RF_main[1]))

RF_best_main = RandomForestRegressor(oob_score=False, n_estimators=optimal_params_RF_main[0], 
                                     max_features=optimal_params_RF_main[1], n_jobs=-1, 
                                     random_state=22).fit(X_train_main, y_train)
RF_best_r2_train_main = RF_best_main.score(X_train_main, y_train)
RF_best_r2_test_main = RF_best_main.score(X_test_main, y_test)
print('-- Best Random Forest Regression --')
print("Training R^2: {}".format(RF_best_r2_train_main))
print("Test R^2: {}".format(RF_best_r2_test_main))

joblib.dump(RF_best_main, '../../fitted_models/RF_best_main.pkl') 

Load fitted model to reproduce results directly

RF_best_main = joblib.load('../../fitted_models/RF_best_main.pkl') 
RF_best_r2_train_main = RF_best_main.score(X_train_main, y_train)
RF_best_r2_test_main = RF_best_main.score(X_test_main, y_test)
print('-- Best Random Forest Regression with main predictors only --')
print("Training R^2: {}".format(RF_best_r2_train_main))
print("Test R^2: {}".format(RF_best_r2_test_main))
-- Best Random Forest Regression with main predictors only --
Training R^2: 0.9263072080770891
Test R^2: 0.48156931235472966

AdaBoost

ab_params = OrderedDict(
    base_depths = [2, 4, 6, 8],
    n_estimators = [2**(i+2) for i in np.arange(7)]
)
ab_params.values()
odict_values([[2, 4, 6, 8], [4, 8, 16, 32, 64, 128, 256]])
l_rate = 0.05
params_dict_ab_main = {}
cv_scores_ab_main = []
for i, (d, n) in enumerate(product(*ab_params.values())):
    params_dict_ab_main[i] = (d, n)
    ab_main = AdaBoostRegressor(DecisionTreeRegressor(max_depth=d, random_state=22), n_estimators=n, 
                                learning_rate=l_rate, random_state=22)
    scores = cross_val_score(ab_main, X_train_main, y_train, cv=5, scoring='r2')
    cv_scores_ab_main.append(scores.mean())


idx_optimal = np.argmax(cv_scores_ab_main)
optimal_params_ab_main = params_dict_ab_main[idx_optimal]
print('-- AdaBoost Regression --')
print('Maximum R^2 cv score = {}'.format(max(cv_scores_ab_main)))
print('Optimal base tree max_depth = {}'.format(optimal_params_ab_main[0]))
print('Optimal n_estimators = {}'.format(optimal_params_ab_main[1]))


ab_best_main = AdaBoostRegressor(DecisionTreeRegressor(max_depth=optimal_params_ab_main[0], random_state=22), 
                                 n_estimators=optimal_params_ab_main[1], learning_rate=l_rate, 
                                 random_state=22).fit(X_train_main, y_train)
ab_best_r2_train_main = ab_best_main.score(X_train_main, y_train)
ab_best_r2_test_main = ab_best_main.score(X_test_main, y_test)
print('-- Best AdaBoost Regression --')
print("Training R^2: {}".format(ab_best_r2_train_main))
print("Test R^2: {}".format(ab_best_r2_test_main))

joblib.dump(ab_best_main, '../../fitted_models/ab_best_main.pkl') 

Load fitted model to reproduce results directly

ab_best_main = joblib.load('../../fitted_models/ab_best_main.pkl') 
ab_best_r2_train_main = ab_best_main.score(X_train_main, y_train)
ab_best_r2_test_main = ab_best_main.score(X_test_main, y_test)
print('-- Best AdaBoost Regression --')
print("Training R^2: {}".format(ab_best_r2_train_main))
print("Test R^2: {}".format(ab_best_r2_test_main))
-- Best AdaBoost Regression --
Training R^2: 0.6574644836867659
Test R^2: 0.4194601631510584

Meta Model: Weighted Avg

prefix = '../../fitted_models/'
suffix = '.pkl'
models_main = ['sim_lin_main', 'ridge_cv_main', 'lasso_cv_main', 'RF_best_main', 'ab_best_main']

weights_main = np.zeros((len(models_main),))
preds_main_valid = np.zeros((y_valid.shape[0], len(models_main)))
preds_main_test = np.zeros((y_test.shape[0], len(models_main)))
for i, name in enumerate(models_main):
    model_name = prefix + name + suffix
    model = joblib.load(model_name) 
    weights_main[i] = model.score(X_valid_main, y_valid)
    preds_main_valid[:, i] = model.predict(X_valid_main)
    preds_main_test[:, i] = model.predict(X_test_main)

weights_main = weights_main/np.sum(weights_main)
meta_pred_main_valid = np.average(preds_main_valid, axis=1, weights=weights_main)
meta_pred_main_test = np.average(preds_main_test, axis=1, weights=weights_main)

meta_r2_valid_main = r2_score(y_valid, meta_pred_main_valid)
meta_r2_test_main = r2_score(y_test, meta_pred_main_test)
print('-- Meta model 1: Weighted Avg with main predictors only --')
print("Valid R^2: {}".format(meta_r2_valid_main))
print("Test R^2: {}".format(meta_r2_test_main))
-- Meta model 1: Weighted Avg with main predictors only --
Valid (Meta Train) R^2: 0.4051594155555923
Test R^2: 0.41065615616702467

Meta Model: Linear Regression

# Single-model predictions on the validation set become the meta model's
# training features; predictions on the test set become its test features
meta_X_train_main = np.zeros((y_valid.shape[0], len(models_main)))
meta_X_test_main = np.zeros((y_test.shape[0], len(models_main)))
for i, name in enumerate(models_main):
    model_name = prefix + name + suffix
    model = joblib.load(model_name) 
    meta_X_train_main[:, i] = model.predict(X_valid_main)
    meta_X_test_main[:, i] = model.predict(X_test_main)

meta_reg_main = LinearRegression().fit(meta_X_train_main, y_valid)
joblib.dump(meta_reg_main, '../../fitted_models/meta_reg_main.pkl') # Dump fitted model to disk

meta_reg_main = joblib.load('../../fitted_models/meta_reg_main.pkl') 
print('-- Meta model 2: Linear Regression with main predictors only --')
print("Train R^2: {}".format(meta_reg_main.score(meta_X_train_main, y_valid)))
print("Test R^2: {}".format(meta_reg_main.score(meta_X_test_main, y_test)))
-- Meta model 2: Linear Regression with main predictors only --
Train R^2: 0.48592344342771787
Test R^2: 0.4965343020024535

Regression models on main predictors & interaction terms

Data Processing

Add interaction terms between the genres and the averages of the numerical audio features

audio_features_avg = ['track_acousticness_avg', 'track_album_popularity_avg', 'track_danceability_avg',
                    'track_duration_ms_avg', 'track_energy_avg', 'track_explicit_avg', 
                    'track_instrumentalness_avg', 'track_liveness_avg', 'track_loudness_avg', 'track_mode_avg', 
                    'track_speechiness_avg', 'track_tempo_avg', 'track_valence_avg']
genres = ['genre_blues', 'genre_classical', 'genre_country', 'genre_dance',
       'genre_edm', 'genre_elect', 'genre_folk', 'genre_funk', 'genre_hip hop',
       'genre_house', 'genre_indie', 'genre_indie pop', 'genre_jazz',
       'genre_korean pop', 'genre_mellow', 'genre_metal', 'genre_other',
       'genre_pop', 'genre_pop christmas', 'genre_pop rap', 'genre_punk',
       'genre_r&b', 'genre_rap', 'genre_reggae', 'genre_rock', 'genre_soul',]

cross_terms = audio_features_avg + genres
playlists_df_interaction = deepcopy(playlists_df_full)

for feature in audio_features_avg:
    for genre in genres:
        playlists_df_interaction[feature+'_X_'+genre] = playlists_df_full[feature] * playlists_df_full[genre]

playlists_df_interaction = change_column_order(playlists_df_interaction, 'log_num_followers', len(playlists_df_interaction.columns))
playlists_df_interaction.head()
num_tracks track_acousticness_avg track_acousticness_std track_album_popularity_avg track_album_popularity_max track_album_popularity_std track_artists_genres_unique track_avg_artist_num_followers_avg track_avg_artist_num_followers_std track_avg_artist_popularity_avg ... track_valence_avg_X_genre_pop track_valence_avg_X_genre_pop christmas track_valence_avg_X_genre_pop rap track_valence_avg_X_genre_punk track_valence_avg_X_genre_r&b track_valence_avg_X_genre_rap track_valence_avg_X_genre_reggae track_valence_avg_X_genre_rock track_valence_avg_X_genre_soul log_num_followers
0 52 0.180999 0.171120 71.673077 96 13.136445 60 1.276693e+06 1.320843e+06 82.256410 ... 0.456071 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 14.914325
1 75 0.144201 0.160799 68.440000 100 15.511063 70 3.791621e+06 3.658858e+06 84.684000 ... 0.555027 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 11.142412
2 38 0.116600 0.117615 72.421053 94 16.192317 44 2.319518e+06 2.237652e+06 86.705263 ... 0.526526 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.863271
3 40 0.134162 0.247197 57.025000 82 18.083815 97 2.387520e+06 3.589807e+06 72.987500 ... 0.501825 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 11.146849
4 26 0.171635 0.229736 53.461538 54 0.498519 5 8.566853e+04 2.347853e+05 55.059341 ... 0.658846 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 9.655859

5 rows × 424 columns

Split data 6:1:1

  1. train set for single models: 6/8 of all data
  2. validation set for single models & train set for meta models: 1/8 of all data
  3. test set: 1/8 of all data
np.random.seed(9001)
msk2 = np.random.rand(len(playlists_df_interaction)) < 0.75
playlists_df_train_int = playlists_df_interaction[msk2]
playlists_df_nontrain_int = playlists_df_interaction[~msk2]

msk22 = np.random.rand(len(playlists_df_nontrain_int)) < 0.5
playlists_df_valid_int = playlists_df_nontrain_int[msk22]
playlists_df_test_int = playlists_df_nontrain_int[~msk22]
X_train_int = playlists_df_train_int.iloc[:,:-1]
X_valid_int = playlists_df_valid_int.iloc[:,:-1]
X_test_int = playlists_df_test_int.iloc[:,:-1]

X_train_int.shape, y_train.shape, X_valid_int.shape, y_valid.shape, X_test_int.shape, y_test.shape
((7083, 423), (7083,), (1214, 423), (1214,), (1133, 423), (1133,))

Drop columns that contain only zeros, and standardize the numerical variables using the training set’s mean and std

# Iterate over a copy of the column list since columns may be dropped in the loop
for col in list(X_train_int.columns):
    if (X_train_int[col] == 0).all():
        # All-zero column: drop it from all three splits so they stay aligned
        X_train_int.drop(col, axis=1, inplace=True)
        X_valid_int.drop(col, axis=1, inplace=True)
        X_test_int.drop(col, axis=1, inplace=True)
    elif not np.logical_or(X_train_int[col] == 0, X_train_int[col] == 1).all():
        # Standardize numerical (non-dummy) variables with the train set's statistics
        mean_train = X_train_int[col].mean()
        std_train = X_train_int[col].std()
        X_train_int[col] = (X_train_int[col] - mean_train) / std_train
        X_valid_int[col] = (X_valid_int[col] - mean_train) / std_train
        X_test_int[col] = (X_test_int[col] - mean_train) / std_train

Regression Models

Linear regression

sim_lin_int = LinearRegression().fit(X_train_int, y_train)
joblib.dump(sim_lin_int, '../../fitted_models/sim_lin_int.pkl') 
sim_lin_int = joblib.load('../../fitted_models/sim_lin_int.pkl') 
print('-- Linear Regression with main predictors & interaction terms --')
print('Training R^2: ', sim_lin_int.score(X_train_int, y_train))
print('Test R^2: ', sim_lin_int.score(X_test_int, y_test))
-- Linear Regression with main predictors & interaction terms --
Training R^2:  0.324099629417
Test R^2:  0.246580040621

PCA on main predictors & their interaction terms

Use 5-Fold cross-validation to find the best n_components of PCA

n_pca = 200
r2_valid_cv = np.zeros((n_pca, 5))

fold_ctr = 0
for itrain, ivalid in KFold(n_splits=5, shuffle=True, random_state=9001).split(X_train_int.index):
    # fix the random_state so the folds are reproducible across runs
    X_train_cv = X_train_int.iloc[itrain,:]
    y_train_cv = y_train.iloc[itrain]
    X_valid_cv = X_train_int.iloc[ivalid,:]
    y_valid_cv = y_train.iloc[ivalid]
    
    # pca
    pca = PCA()
    pca.fit(X_train_cv)
    X_train_pca_cv = pca.transform(X_train_cv)
    X_valid_pca_cv = pca.transform(X_valid_cv)
    
    for comp in range(1, n_pca+1):
        linear_cv = LinearRegression().fit(X_train_pca_cv[:,:comp], y_train_cv) # fit model
        r2_valid_cv[comp-1,fold_ctr] = linear_cv.score(X_valid_pca_cv[:,:comp], y_valid_cv) # get valid r2 score
        
    fold_ctr += 1


scores_valid = np.mean(r2_valid_cv, axis=1)
best_n = np.argmax(scores_valid) + 1  # index i holds the mean R^2 for i+1 components
print('The best n_components for PCA is {}'.format(best_n))
The best n_components for PCA is 188
pca = PCA()
pca.fit(X_train_int)

X_train_int_pca_best = pca.transform(X_train_int)[:,:best_n]
X_test_int_pca_best = pca.transform(X_test_int)[:,:best_n]

linear_pca_best = LinearRegression().fit(X_train_int_pca_best, y_train)
score_train_pca_best = linear_pca_best.score(X_train_int_pca_best, y_train)
score_test_pca_best = linear_pca_best.score(X_test_int_pca_best, y_test)

print('-- Linear Regression on the best PCA (n_components={}) of main predictors & interaction terms --'.format(best_n))
print("Training R^2: {}".format(score_train_pca_best))
print("Test R^2: {}".format(score_test_pca_best))
-- Linear Regression on the best PCA (n_components=188) of main predictors & interaction terms --
Training R^2: 0.2637031949094367
Test R^2: 0.20381245149309746

RidgeCV

lambdas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]
ridge_cv_int = RidgeCV(alphas=lambdas, fit_intercept=True).fit(X_train_int, y_train)
joblib.dump(ridge_cv_int, '../../fitted_models/ridge_cv_int.pkl') 
ridge_cv_int = joblib.load('../../fitted_models/ridge_cv_int.pkl') 
ridge_r2_train_int = ridge_cv_int.score(X_train_int, y_train)
ridge_r2_test_int = ridge_cv_int.score(X_test_int, y_test)

print('-- RidgeCV (best alpha = {}) with main predictors & interaction terms --'.format(ridge_cv_int.alpha_))
print("Training R^2: {}".format(ridge_r2_train_int))
print("Test R^2: {}".format(ridge_r2_test_int))
-- RidgeCV (best alpha = 100.0) with main predictors & interaction terms --
Training R^2: 0.2996521645411735
Test R^2: 0.25243054110669527

LassoCV

lambdas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]
lasso_cv_int = LassoCV(alphas=lambdas, fit_intercept=True).fit(X_train_int, y_train)
joblib.dump(lasso_cv_int, '../../fitted_models/lasso_cv_int.pkl') 
lasso_cv_int = joblib.load('../../fitted_models/lasso_cv_int.pkl') 
lasso_r2_train_int = lasso_cv_int.score(X_train_int, y_train)
lasso_r2_test_int = lasso_cv_int.score(X_test_int, y_test)

print('-- LassoCV (best alpha = {}) with main predictors & interaction terms --'.format(lasso_cv_int.alpha_))
print("Training R^2: {}".format(lasso_r2_train_int))
print("Test R^2: {}".format(lasso_r2_test_int))
-- LassoCV (best alpha = 0.01) with main predictors & interaction terms --
Training R^2: 0.2875042397561971
Test R^2: 0.2689411261878084

Random Forest

cv_scores_RF_int = []
params_dict_RF_int = {}

for i, (n, f) in enumerate(product(*RF_params.values())):
    params_dict_RF_int[i] = (n, f)
    
    # Cross-validate a Random Forest Regressor for this parameter pair
    # (on the interaction feature set, not the main-only set)
    RF_int = RandomForestRegressor(oob_score=False, n_estimators=n, max_features=f, n_jobs=-1, random_state=22)
    scores = cross_val_score(RF_int, X_train_int, y_train, cv=5, scoring='r2')
    cv_scores_RF_int.append(scores.mean())

idx_optimal = np.argmax(cv_scores_RF_int)
optimal_params_RF_int = params_dict_RF_int[idx_optimal]
print('-- Random Forest Regression with main predictors & interaction terms --')
print('Maximum R^2 validation score = {}'.format(max(cv_scores_RF_int)))
print('Optimal n_estimators = {}'.format(optimal_params_RF_int[0]))
print('Optimal max_features = {}'.format(optimal_params_RF_int[1]))

RF_best_int = RandomForestRegressor(oob_score=False, n_estimators=optimal_params_RF_int[0], 
                                     max_features=optimal_params_RF_int[1], n_jobs=-1, 
                                     random_state=22).fit(X_train_int, y_train)
RF_best_r2_train_int = RF_best_int.score(X_train_int, y_train)
RF_best_r2_test_int = RF_best_int.score(X_test_int, y_test)
print('-- Best Random Forest Regression with main predictors & interaction terms --')
print("Training R^2: {}".format(RF_best_r2_train_int))
print("Test R^2: {}".format(RF_best_r2_test_int))

joblib.dump(RF_best_int, '../../fitted_models/RF_best_int.pkl') 

Load fitted model to reproduce results directly

RF_best_int = joblib.load('../../fitted_models/RF_best_int.pkl') 
RF_best_r2_train_int = RF_best_int.score(X_train_int, y_train)
RF_best_r2_test_int = RF_best_int.score(X_test_int, y_test)
print('-- Best Random Forest Regression with main predictors & interaction terms --')
print("Training R^2: {}".format(RF_best_r2_train_int))
print("Test R^2: {}".format(RF_best_r2_test_int))
-- Best Random Forest Regression with main predictors & interaction terms --
Training R^2: 0.9257887738410054
Test R^2: 0.47591913127397234

AdaBoost

l_rate = 0.05
params_dict_ab_int = {}
cv_scores_ab_int = []
for i, (d, n) in enumerate(product(*ab_params.values())):
    params_dict_ab_int[i] = (d, n)
    ab_int = AdaBoostRegressor(DecisionTreeRegressor(max_depth=d, random_state=22), n_estimators=n, 
                               learning_rate=l_rate, random_state=22)
    scores = cross_val_score(ab_int, X_train_int, y_train, cv=5, scoring='r2')
    cv_scores_ab_int.append(scores.mean())

idx_optimal = np.argmax(cv_scores_ab_int)
optimal_params_ab_int = params_dict_ab_int[idx_optimal]
print('-- AdaBoost Regression with main predictors & interaction terms --')
print('Maximum R^2 validation score = {}'.format(max(cv_scores_ab_int)))
print('Optimal base tree max_depth = {}'.format(optimal_params_ab_int[0]))
print('Optimal n_estimators = {}'.format(optimal_params_ab_int[1]))

ab_best_int = AdaBoostRegressor(DecisionTreeRegressor(max_depth=optimal_params_ab_int[0], random_state=22), 
                                 n_estimators=optimal_params_ab_int[1], learning_rate=l_rate, 
                                 random_state=22).fit(X_train_int, y_train)

ab_best_r2_train_int = ab_best_int.score(X_train_int, y_train)
ab_best_r2_test_int = ab_best_int.score(X_test_int, y_test)
print('-- Best AdaBoost Regression with main predictors & interaction terms --')
print("Training R^2: {}".format(ab_best_r2_train_int))
print("Test R^2: {}".format(ab_best_r2_test_int))

joblib.dump(ab_best_int, '../../fitted_models/ab_best_int.pkl') 

Load fitted model to reproduce results directly

ab_best_int = joblib.load('../../fitted_models/ab_best_int.pkl') 
ab_best_r2_train_int = ab_best_int.score(X_train_int, y_train)
ab_best_r2_test_int = ab_best_int.score(X_test_int, y_test)
print('-- Best AdaBoost Regression with main predictors & interaction terms --')
print("Training R^2: {}".format(ab_best_r2_train_int))
print("Test R^2: {}".format(ab_best_r2_test_int))
-- Best AdaBoost Regression with main predictors & interaction terms --
Training R^2: 0.6450367936426891
Test R^2: 0.4128596865435953

Meta Model: Weighted Avg

prefix = '../../fitted_models/'
suffix = '.pkl'
models_int = ['sim_lin_int', 'ridge_cv_int', 'lasso_cv_int', 'RF_best_int', 'ab_best_int']

weights_int = np.zeros((len(models_int),))
preds_int_valid = np.zeros((y_valid.shape[0], len(models_int)))
preds_int_test = np.zeros((y_test.shape[0], len(models_int)))
for i, name in enumerate(models_int):
    model_name = prefix + name + suffix
    model = joblib.load(model_name) 
    weights_int[i] = model.score(X_valid_int, y_valid)
    preds_int_valid[:, i] = model.predict(X_valid_int)
    preds_int_test[:, i] = model.predict(X_test_int)

weights_int = weights_int/np.sum(weights_int)
meta_pred_int_valid = np.average(preds_int_valid, axis=1, weights=weights_int)
meta_pred_int_test = np.average(preds_int_test, axis=1, weights=weights_int)

meta_r2_valid_int = r2_score(y_valid, meta_pred_int_valid)
meta_r2_test_int = r2_score(y_test, meta_pred_int_test)
print('-- Meta model 1: Weighted Avg with main predictors & interaction terms --')
print("Valid (Meta Train) R^2: {}".format(meta_r2_valid_int))
print("Test R^2: {}".format(meta_r2_test_int))
-- Meta model 1: Weighted Avg with main predictors & interaction terms --
Valid (Meta Train) R^2: 0.40621785748120054
Test R^2: 0.4157773596959048

Meta Model: Linear Regression

meta_X_train_int = np.zeros((y_valid.shape[0], len(models_int)))
meta_X_test_int = np.zeros((y_test.shape[0], len(models_int)))
for i, name in enumerate(models_int):
    model_name = prefix + name + suffix
    model = joblib.load(model_name) 
    meta_X_train_int[:, i] = model.predict(X_valid_int)
    meta_X_test_int[:, i] = model.predict(X_test_int)

meta_reg_int = LinearRegression().fit(meta_X_train_int, y_valid)
joblib.dump(meta_reg_int, '../../fitted_models/meta_reg_int.pkl') # Dump fitted model to disk

meta_reg_int = joblib.load('../../fitted_models/meta_reg_int.pkl') 
print('-- Meta model 2: Linear Regression with main predictors & interaction terms --')
print("Train R^2: {}".format(meta_reg_int.score(meta_X_train_int, y_valid)))
print("Test R^2: {}".format(meta_reg_int.score(meta_X_test_int, y_test)))
-- Meta model 2: Linear Regression with main predictors & interaction terms --
Train R^2: 0.4849953986455844
Test R^2: 0.49444420329210315

Summary

$R^2$ score on the test set of all playlists:

Model                      Main Only    Main & Interaction
Linear Regression          0.25215      0.24658
PCA (n=188)                N/A          0.20381
RidgeCV                    0.25052      0.25243
LassoCV                    0.24915      0.26894
Random Forest              0.48157      0.47592
AdaBoost                   0.41946      0.41286
Meta - Weighted Avg        0.41066      0.41578
Meta - Linear Regression   0.49653      0.49444

model_names = ['Lin.Reg.', 'RidgeCV', 'LassoCV', 'RF', 'Adaboost', 'Meta 1\nWeighted.Avg.', 'Meta 2\nLin.Reg.']
acc_test_main = [0.25215, 0.25052, 0.24915, 0.48157, 0.41946, 0.41066, 0.49653]
acc_test_int = [0.24658, 0.25243, 0.26894, 0.47592, 0.41286, 0.41578, 0.49444]

width = 0.3
plt.figure(figsize=(12, 8))
plt.bar(np.arange(len(model_names)), acc_test_main, width, alpha=0.8, label='Main')
plt.bar(np.arange(len(model_names))+width, acc_test_int, width, alpha=0.8, label='Main & Interaction')
plt.xticks(np.arange(len(model_names))+0.5*width, model_names, fontsize=14)
plt.title('Test $R^2$ VS. Models ', fontsize=20)
plt.ylabel('Test $R^2$', fontsize=16)
plt.legend(fontsize=14)
plt.show()

[Figure: grouped bar chart of test R² for each model, comparing the main-only and main & interaction feature sets]