Chapter 3: Basics of stochastic frontier analysis for panel data
3.1 Exercise 3.1: Within and Between Summary Statistics
This exercise reports summary statistics for the between and within variation of log GDP from the WBpanel.csv data, as well as the correlations of log GDP with itself lagged once, twice, and three times.
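The following is a minimal sketch of these computations, assuming the lny and country columns of WBpanel.csv used in the later exercises; the within deviations add back the overall mean, following the Stata xtsum convention.

import pandas as pd

df = pd.read_csv("WBpanel.csv")

# Between variation: country means of log GDP; within variation: deviations
# from those means (with the overall mean added back, Stata xtsum style)
between = df.groupby("country")["lny"].transform("mean")
within = df["lny"] - between + df["lny"].mean()
print(pd.DataFrame({"overall": df["lny"], "between": between, "within": within}).describe())

# Correlations of log GDP with its own lags, taken within each country
for lag in (1, 2, 3):
    lagged = df.groupby("country")["lny"].shift(lag)
    print("lag", lag, ":", round(df["lny"].corr(lagged), 3))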
# Remove all previous function and variable definitions before the next exercise
%reset
3.4 Exercise 3.4: CSS Model Estimation
This exercise estimates the stochastic frontier model with time-varying inefficiency of Cornwell, Schmidt and Sickles (1990) via the Least Squares Dummy Variable (LSDV) approach.
import numpy as np
import pandas as pd
import statsmodels.api as sm

regression_data = pd.read_csv("WBpanel.csv")
regression_data["intercept"] = 1
regression_data["yr2"] = regression_data["yr"] ** 2

# Country dummies (the first country is the omitted base group)
country_dummies = pd.get_dummies(
    regression_data["country"], prefix="cntrydum", drop_first=True
)
regression_data = pd.concat([regression_data, country_dummies], axis=1)

# Country-specific linear and quadratic time trends
yr_df = pd.DataFrame(
    columns=["yr_{}".format(i) for i in regression_data["country"].unique()[1:]]
)
yr2_df = pd.DataFrame(
    columns=["yr2_{}".format(i) for i in regression_data["country"].unique()[1:]]
)
for i in regression_data["country"].unique()[1:]:
    yr_df["yr_{}".format(i)] = (
        regression_data["yr"] * regression_data["cntrydum_{}".format(i)]
    )
    yr2_df["yr2_{}".format(i)] = (
        regression_data["yr2"] * regression_data["cntrydum_{}".format(i)]
    )
regression_data = pd.concat([regression_data, yr_df, yr2_df], axis=1)

y = regression_data["lny"]
X = regression_data[
    ["intercept", "lnk", "lnl"]
    + [x for x in regression_data.columns if "cntrydum" in x]
    + [x for x in regression_data.columns if "yr_" in x]
    + [x for x in regression_data.columns if "yr2_" in x]
]
model = sm.OLS(y, X)
result = model.fit()
print(result.summary())
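With the LSDV estimates in hand, the CSS time-varying inefficiency estimates can be recovered by assembling each country's fitted effect and benchmarking it against the per-period frontier. The sketch below is our illustration, not part of the original exercise; it assumes the column naming of the code above, with the first country as the omitted base group.

# Fitted country effects alpha_it = intercept + dummy + trend terms
countries = regression_data["country"].unique()
years = np.sort(regression_data["yr"].unique())

alpha = pd.DataFrame(index=years, columns=countries, dtype=float)
for c in countries:
    a = result.params["intercept"]
    if "cntrydum_{}".format(c) in result.params.index:
        a = (
            a
            + result.params["cntrydum_{}".format(c)]
            + result.params["yr_{}".format(c)] * years
            + result.params["yr2_{}".format(c)] * years ** 2
        )
    alpha[c] = a

# Inefficiency relative to the per-period frontier, and technical efficiency
u_hat = -alpha.sub(alpha.max(axis=1), axis=0)
TE = np.exp(-u_hat)
print(TE.round(3).iloc[:5, :5])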
# Remove all previous function and variable definitions before the next exercise
%reset
3.5 Exercise 3.5: LS Model Estimation
This exercise estimates the stochastic frontier model with time-varying inefficiency of Lee and Schmidt (1993) via the Least Squares Dummy Variable (LSDV) approach.
import numpy as np
import pandas as pd
import statsmodels.api as sm

regression_data = pd.read_csv("WBpanel.csv")
regression_data["intercept"] = 1

country_dummies = pd.get_dummies(
    regression_data["country"], prefix="cntrydum", drop_first=True
)
year_dummies = pd.get_dummies(regression_data["yr"], prefix="yrdum", drop_first=True)
regression_data = pd.concat([regression_data, country_dummies, year_dummies], axis=1)

y = regression_data["lny"]
X = regression_data[
    ["intercept", "lnk", "lnl"]
    + [x for x in regression_data.columns if "cntrydum" in x]
    + [x for x in regression_data.columns if "yrdum" in x]
]
model = sm.OLS(y, X)
result = model.fit()
print(result.summary())
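Under the additive dummy specification above, the cross-country efficiency ranking is constant over time, so relative inefficiency can be recovered in the Schmidt and Sickles (1984) fashion from the estimated country intercepts. This sketch is our illustration under that assumption, not part of the original exercise.

# Relative inefficiency from the estimated country intercepts; the omitted
# base country has an implicit effect of zero
alpha_hat = pd.Series(
    {
        c: result.params.get("cntrydum_{}".format(c), 0.0)
        for c in regression_data["country"].unique()
    }
)
u_hat = alpha_hat.max() - alpha_hat  # distance to the most efficient country
TE = np.exp(-u_hat)
print(TE.sort_values(ascending=False).head())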
Log-likelihood: -337.505

| | Est | StEr | t-stat | p-val | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| $$\beta_{0}$$ | 0.068 | 0.033 | 2.082 | 0.037 | 0.004 | 0.132 |
| x1 | 0.502 | 0.027 | 18.666 | 0.0 | 0.45 | 0.555 |
| x2 | 0.681 | 0.024 | 27.983 | 0.0 | 0.633 | 0.728 |
| $$\sigma^{2}_{u}$$: $$\gamma_{0}$$ | -3.199 | 0.251 | -12.763 | 0.0 | -3.69 | -2.708 |
| $$\sigma^{2}_{u}$$: z1 | 1.386 | 0.235 | 5.908 | 0.0 | 0.927 | 1.846 |
| $$\sigma^{2}_{v}$$ | 0.001 | 0.026 | 0.039 | 0.969 | -0.05 | 0.052 |
| $$\sigma^{2}_{w}$$ | 0.354 | 0.06 | 5.902 | 0.0 | 0.236 | 0.472 |
# Remove all previous function and variable definitions before the next exercise
%reset
3.11 Exercise 3.11: “True” Fixed Effects model of Wang and Ho (2010)
This exercise estimates the fixed effects stochastic frontier model of Wang and Ho (2010) by MLE, using the TaiwaneseManufacturing.csv data.
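For reference, the specification being estimated (our summary of Wang and Ho, 2010) is

$$
y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + v_{it} - u_{it}, \qquad
u_{it} = h_{it}\, u_i^{*}, \qquad h_{it} = \exp(\mathbf{z}_{it}'\boldsymbol{\delta}),
$$

with $$v_{it} \sim N(0, \sigma^{2}_{v})$$ and $$u_i^{*} \sim N^{+}(\mu, \sigma^{2}_{u})$$. Because $$h_{it}$$ carries all of the time variation in inefficiency, a within or first-difference transformation removes the fixed effect $$\alpha_i$$ without disturbing the distribution of the transformed composed error, which is what makes MLE feasible here.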
3.12 Exercise 3.12: KLH model of Kumbhakar et al. (2014)
This exercise estimates the model of Kumbhakar et al. (2014) that allows for both persistent and time-varying inefficiency, using the TaiwaneseManufacturing.csv data.
import numdifftools as nd
import numpy as np
import pandas as pd
from linearmodels import (
    RandomEffects,  # To install the linearmodels package run: pip install linearmodels at the Anaconda prompt
)
from scipy import stats
from scipy.optimize import minimize


def loglikelihood(coefs, y, x1):
    # Obtain the negative log-likelihood (minimized below)
    logDen = log_density(coefs, y, x1)
    log_likelihood = -np.sum(logDen)
    return log_likelihood


def log_density(coefs, y, x1):
    # Get parameters
    beta1 = coefs[0]
    sigma2u = coefs[1]
    sigma2v = coefs[2]
    Lambda = np.sqrt(sigma2u / sigma2v)
    sigma2 = sigma2u + sigma2v
    sigma = np.sqrt(sigma2)
    # Composed errors from the production function equation
    eps = y - x1 * beta1
    # Compute the log density of the normal-half-normal composed error
    Den = (
        (2 / sigma)
        * stats.norm.pdf(eps / sigma)
        * stats.norm.cdf(-Lambda * eps / sigma)
    )
    logDen = np.log(Den)
    return logDen


def BC98_TE(theta, y, x1):
    # Technical efficiency point estimator, E[exp(-u) | eps]
    beta1 = theta[0]
    sigma2u = theta[1]
    sigma2v = theta[2]
    sigma2 = sigma2u + sigma2v
    epsilon = y - x1 * beta1
    u_star = (-sigma2u * epsilon) / sigma2
    sigma2_star = (sigma2v * sigma2u) / (sigma2v + sigma2u)
    TE = np.exp(-u_star + 0.5 * sigma2_star) * (
        stats.norm.cdf((u_star / np.sqrt(sigma2_star)) - np.sqrt(sigma2_star))
        / stats.norm.cdf(u_star / np.sqrt(sigma2_star))
    )
    return TE
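The cell above only defines the building blocks. Below is a hedged usage sketch of how they might be strung together for the multi-step KLH procedure; the column names id, year, lny, lnk and lnl are our assumptions about TaiwaneseManufacturing.csv, not taken from the original text. A third, analogous step would apply the same likelihood to the estimated firm effects to recover the persistent component.

# Step 1 (sketch): random effects regression of log output on the inputs;
# column names here are assumed, not from the original text
data = pd.read_csv("TaiwaneseManufacturing.csv").set_index(["id", "year"])
step1 = RandomEffects(data["lny"], data[["lnk", "lnl"]].assign(const=1.0)).fit()
resid = np.asarray(step1.resids).flatten()

# Step 2 (sketch): half-normal SF MLE on the time-varying residuals, using
# the loglikelihood helper with a constant-only "frontier"
ones = np.ones_like(resid)
res = minimize(
    loglikelihood, np.array([0.0, 0.1, 0.1]), args=(resid, ones), method="Nelder-Mead"
)
TE_transient = BC98_TE(res.x, resid, ones)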
3.13 Exercise 3.13: Matrix form of FE and RE estimators
This exercise implements the FE and RE regression by hand using the matrix form outlined in Appendix 3.B. It uses the World Bank growth data for 82 countries between 1960 and 1987, contained in the file WBpanel.csv.
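With $$e_T$$ a $$T \times 1$$ vector of ones, the between and within projection matrices and the FE estimator implemented below are

$$
P^{*} = \frac{1}{T} e_T e_T', \qquad Q^{*} = I_T - P^{*}, \qquad
P = I_N \otimes P^{*}, \qquad Q = I_N \otimes Q^{*},
$$

$$
\hat{\boldsymbol{\beta}}_{FE} = (X'QX)^{-1} X'Qy, \qquad
\hat{\alpha} = \bar{y} - \bar{\mathbf{x}}'\hat{\boldsymbol{\beta}}_{FE}.
$$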
import pandas as pd
import numpy as np
from scipy import stats

# Fixed Effects
WBpanel_data = pd.read_csv('WBpanel.csv')
C = np.unique(WBpanel_data['country'])
N = len(C)  # Number of cross-sectional units
n, p = WBpanel_data.shape
T = 28  # Observations per identity group (balanced panel)

X = WBpanel_data[['lnk', 'lnl', 'yr']].values
y = WBpanel_data[['lny']].values

# Between (P) and within (Q) projection matrices
e_T = np.ones((T, 1))
P_star = 1 / T * (e_T @ e_T.T)
Q_star = np.eye(T) - P_star
P = np.kron(np.eye(N), P_star)
Q = np.kron(np.eye(N), Q_star)

beta_fe = np.linalg.inv(X.T @ Q @ X) @ X.T @ Q @ y
k = len(beta_fe)
y_bar = np.mean(y).reshape(-1, 1)
X_bar = np.mean(X, axis=0).reshape(-1, 1)
alpha_hat = y_bar - X_bar.T @ beta_fe
fe_coef = np.concatenate([beta_fe, alpha_hat])  # display order: lnk, lnl, yr, _cons

# We follow the definition of r-squared used in Stata, using squared correlations
y_hat_overall = alpha_hat + X @ beta_fe
R_squared_overall_corr = np.corrcoef(y.flatten(), y_hat_overall.flatten())[0, 1] ** 2
y_hat_between = alpha_hat + (P @ X) @ beta_fe
R_squared_between_corr = np.corrcoef((P @ y).flatten(), y_hat_between.flatten())[0, 1] ** 2
y_hat_within = (Q @ X) @ beta_fe
R_squared_within_corr = np.corrcoef((Q @ y).flatten(), y_hat_within.flatten())[0, 1] ** 2

# Standard error estimation
y_dot = Q @ y
X_dot = Q @ X
epsilon = y_dot - X_dot @ beta_fe
sigma2_epsilon = (1 / (n - N - k)) * epsilon.T @ epsilon
bread = np.linalg.inv(X_dot.T @ X_dot)
filling = sum(
    X_dot[(i * T):((i + 1) * T), :].T
    @ epsilon[(i * T):((i + 1) * T)]
    @ epsilon[(i * T):((i + 1) * T)].T
    @ X_dot[(i * T):((i + 1) * T), :]
    for i in range(N)
)
cluster_robust_Sigma = (N / (N - 1)) * bread @ filling @ bread
robust_std_err = np.sqrt(np.diag(cluster_robust_Sigma)).reshape(-1, 1)

# Standard error for the intercept (the mean value of the fixed effects)
std_err_alpha_hat = np.sqrt((sigma2_epsilon / T) + (X_bar.T @ cluster_robust_Sigma @ X_bar))
std_errs = np.concatenate([robust_std_err, std_err_alpha_hat])  # same order as fe_coef

# T-statistics, p-values and confidence intervals
Tstat = fe_coef / std_errs
Pvalues = 2 * (1 - stats.t.cdf(np.abs(Tstat), df=n - N - k))
lower_ConfInt95 = stats.t.ppf(0.025, df=n - N - k, loc=fe_coef, scale=std_errs)
upper_ConfInt95 = stats.t.ppf(0.975, df=n - N - k, loc=fe_coef, scale=std_errs)

display(pd.DataFrame(
    data={
        'Coef.': fe_coef.ravel(),
        'Robust Std. Err.': std_errs.ravel(),
        'T-Stat': Tstat.ravel(),
        'P-value': Pvalues.ravel(),
        'Lower CI': lower_ConfInt95.ravel(),
        'Upper CI': upper_ConfInt95.ravel(),
    },
    index=['lnk', 'lnl', 'yr', '_cons'],
))
print('R-squared within ', R_squared_within_corr)
print('R-squared between ', R_squared_between_corr)
print('R-squared overall ', R_squared_overall_corr)
| | Coef. | Robust Std. Err. | T-Stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| lnk | 0.531544 | 0.051674 | 10.286444 | 0.000000 | 0.430209 | 0.632879 |
| lnl | 0.133646 | 0.109927 | 1.215769 | 0.224203 | -0.081925 | 0.349217 |
| yr | 0.007467 | 0.002957 | 2.525304 | 0.011629 | 0.001668 | 0.013266 |
| _cons | 0.509883 | 0.181705 | 2.806108 | 0.005058 | 0.153553 | 0.866212 |
R-squared within 0.8933515713213597
R-squared between 0.9674700339386045
R-squared overall 0.9545972995058635
# Random Effects
beta_hat_within = np.linalg.pinv(X_dot.T @ X_dot) @ X_dot.T @ y_dot  # OLS applied to the demeaned data
Y_bar = np.mean(y).reshape(-1, 1)
X_bar = np.mean(X, axis=0).reshape(-1, 1)
alpha_hat_within = Y_bar - X_bar.T @ beta_hat_within
params = np.concatenate([beta_hat_within, alpha_hat_within])
k = len(beta_hat_within)
epsilon = y_dot - X_dot @ beta_hat_within
sigma2_epsilon = (1 / (n - N - k)) * epsilon.T @ epsilon

# Between regression on the cross-sectional means
cross_sectional_means_regression_data = WBpanel_data.groupby(['country'])[['lny', 'lnk', 'lnl', 'yr']].mean()
Y_bar = cross_sectional_means_regression_data[['lny']].values
all_X_bar = cross_sectional_means_regression_data[['lnk', 'lnl', 'yr']].values
X_bar = all_X_bar[:, :-1]  # Drop the yr variable
X_bar = np.column_stack([np.ones((N, 1)), X_bar])  # Add intercept
beta_between = np.linalg.pinv(X_bar.T @ X_bar) @ X_bar.T @ Y_bar  # OLS applied to the cross-sectional mean data
epsilon_between = Y_bar - X_bar @ beta_between
sigma2_b = 1 / (N - k) * epsilon_between.T @ epsilon_between
sigma2_u = max(0, sigma2_b - (sigma2_epsilon / T))
rho_hat = np.sqrt(sigma2_epsilon) / np.sqrt(sigma2_epsilon + T * sigma2_u)

# Compute feasible (quasi-demeaning) transformations
y_all_cross_sectional_means = P @ y
X_all_cross_sectional_means = P @ X
X_all_cross_sectional_means = np.column_stack([np.ones((n, 1)), X_all_cross_sectional_means])  # Add intercept
Y_tilde = y - (1 - rho_hat) * y_all_cross_sectional_means
X_tilde = np.column_stack([np.ones((n, 1)), X]) - (1 - rho_hat) * X_all_cross_sectional_means
beta_random_effects = np.linalg.pinv(X_tilde.T @ X_tilde) @ X_tilde.T @ Y_tilde  # OLS applied to the transformed data

# We follow the definition of r-squared used in Stata, using squared correlations
y_hat_overall = beta_random_effects[0] + X @ beta_random_effects[1:]
R_squared_overall_corr = np.corrcoef(y.flatten(), y_hat_overall.flatten())[0, 1] ** 2
y_hat_between = beta_random_effects[0] + (P @ X) @ beta_random_effects[1:]
R_squared_between_corr = np.corrcoef((P @ y).flatten(), y_hat_between.flatten())[0, 1] ** 2
y_hat_within = (Q @ X) @ beta_random_effects[1:]
R_squared_within_corr = np.corrcoef((Q @ y).flatten(), y_hat_within.flatten())[0, 1] ** 2

# Cluster-robust standard errors
epsilon_random_effects = Y_tilde - X_tilde @ beta_random_effects
Omega_inv = (1 / sigma2_epsilon) * (Q + (rho_hat ** 2) * P)
Omega_star_inv = np.linalg.inv(sigma2_epsilon * Q_star + (sigma2_epsilon + T * sigma2_b) * P_star)
X = np.column_stack([np.ones((n, 1)), X])
bread = np.linalg.inv(X.T @ Omega_inv @ X)
filling = sum(
    X[(i * T):((i + 1) * T), :].T
    @ Omega_star_inv
    @ epsilon_random_effects[(i * T):((i + 1) * T)]
    @ epsilon_random_effects[(i * T):((i + 1) * T)].T
    @ Omega_star_inv
    @ X[(i * T):((i + 1) * T), :]
    for i in range(N)
)
cluster_robust_Sigma = bread @ filling @ bread
robust_std_err = np.sqrt(np.diag(cluster_robust_Sigma))

# Z-statistics, p-values and confidence intervals
Zstat = beta_random_effects.flatten() / robust_std_err
Pvalues = 2 * (1 - stats.norm.cdf(np.abs(Zstat)))
lower_ConfInt95 = stats.norm.ppf(0.025, loc=beta_random_effects.flatten(), scale=robust_std_err)
upper_ConfInt95 = stats.norm.ppf(0.975, loc=beta_random_effects.flatten(), scale=robust_std_err)
display(pd.DataFrame(
    data={
        'Coef.': beta_random_effects.ravel(),
        'Robust Std. Err.': robust_std_err.ravel(),
        'Z-Stat': Zstat.ravel(),
        'P-value': Pvalues.ravel(),
        'Lower CI': lower_ConfInt95.ravel(),
        'Upper CI': upper_ConfInt95.ravel(),
    },
    index=['_cons', 'lnk', 'lnl', 'yr'],
))
print('R-squared within ', R_squared_within_corr)
print('R-squared between ', R_squared_between_corr)
print('R-squared overall ', R_squared_overall_corr)
| | Coef. | Robust Std. Err. | Z-Stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| _cons | 0.114532 | 0.101448 | 1.128974 | 0.258909 | -0.084302 | 0.313366 |
| lnk | 0.607688 | 0.043080 | 14.106006 | 0.000000 | 0.523253 | 0.692124 |
| lnl | 0.251772 | 0.053337 | 4.720446 | 0.000002 | 0.147235 | 0.356310 |
| yr | 0.000518 | 0.002191 | 0.236569 | 0.812992 | -0.003777 | 0.004814 |
R-squared within 0.890544528855278
R-squared between 0.9565239715120661
R-squared overall 0.9512782824088507
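For reference, the random effects code above quasi-demeans the data before running OLS:

$$
\hat{\rho} = \sqrt{\frac{\hat{\sigma}^{2}_{\varepsilon}}{\hat{\sigma}^{2}_{\varepsilon} + T\hat{\sigma}^{2}_{u}}}, \qquad
\tilde{y}_{it} = y_{it} - (1 - \hat{\rho})\,\bar{y}_{i}, \qquad
\tilde{\mathbf{x}}_{it} = \mathbf{x}_{it} - (1 - \hat{\rho})\,\bar{\mathbf{x}}_{i},
$$

so that OLS of $$\tilde{y}$$ on $$\tilde{\mathbf{x}}$$ yields $$\hat{\boldsymbol{\beta}}_{RE}$$; setting $$\hat{\rho} = 1$$ reproduces pooled OLS and $$\hat{\rho} = 0$$ the within (FE) estimator.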