Practical STATA Study Notes

Last updated: 2024-01-18 07:35:01


University of Science and Technology Beijing

STATA Applications

Study Notes

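Chapter 1 Basic STATA Operations

I. Setting the memory size

set mem 500m, perm   (needed only in Stata 11 and earlier; from Stata 12 on, memory is allocated automatically)

I. Displaying input

display 1

display "clive"

II. Showing the dataset structure: describe

describe   (abbreviation: d)

III. Editing data: edit

edit

IV. Renaming a variable

rename var1 var2

V. Showing the dataset contents: list / browse

list in 1
list in 2/10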

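VI. Importing data from a text file (.csv)

1. insheet:
. insheet using "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.csv", clear

2. A dataset can be imported only when memory is empty; otherwise STATA stops with the message "you must start with an empty dataset". There are two ways to deal with this:
(1) Clear all variables from memory first: . drop _all
(2) Add the clear option to the import command.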

VII. Saving a file

1. save "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.dta"

2. To overwrite an existing file: save "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.dta", replace

VIII. Opening and closing a saved file: use

1. . use (file path and file name), clear
2. . drop _all   or   . exit

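IX. Recording commands and output: log

1. Start a log file: log using "…"
2. Suspend logging: log off
3. Resume logging: log on
4. Close the log file: log close

XI. Creating and saving program files: doedit, do

1. Open the do-file editor: doedit
2. Type in the commands.
3. Save the file with the extension .do.
4. Run the commands: . do (program file path and file name)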

XII. Combining several datasets with the same variables and structure into one (vertical merge): append

. insheet using "…"
. save "…"
. insheet using "…"
. append using "…"
. save "…"

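XIII. Horizontal merge, adding further variables to the original dataset: merge

1. Both files must be sorted on the key variables before merging:
. insheet using "…"
. sort companyid yearend
. save "…"
. describe
. insheet using "…"
. sort companyid yearend
. merge companyid yearend using "…"
. save "…"
. describe

2. merge creates the variable _merge, which records where each observation came from:
_merge==1   observation from the master data only
_merge==2   observation from the using data only
_merge==3   observation from both the master and the using data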
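The meaning of the three _merge codes can be checked on a toy example. The sketch below is illustrative only: the two small datasets are made up, and it uses the merge 1:1 syntax of Stata 11 and later (an assumption about the reader's version), which does not require the files to be sorted beforehand.

```stata
* Hypothetical toy data: a fees file (saved, becomes the "using" data)
clear
input companyid yearend fees
1 2000 10
1 2001 12
2 2000 20
end
tempfile feesfile
save `feesfile'

* Hypothetical assets file, left in memory (the "master" data)
clear
input companyid yearend assets
1 2000 100
2 2000 200
2 2001 210
end

* Match on company and year; _merge records the origin of each row
merge 1:1 companyid yearend using `feesfile'
tabulate _merge
list
```

Company 2 in 2001 appears only in the master data, company 1 in 2001 only in the using data, and the two remaining company-years appear in both.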

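XIV. Help files: help

1. . help describe

XV. Descriptive statistics

1. summarize
summarize incorporationyear   (a single variable)
summarize incorporationyear-big6   (a range of consecutive variables)
summarize _all, or simply summarize   (all variables)

2. More detailed statistics
summarize incorporationyear, detail

3. centile
centile auditfees, centile(0(10)100)
centile auditfees, centile(0(5)100)

4. tabulate: frequencies and proportions for categorical variables
tabulate companytype
tabulate companytype big6, column   (column percentages)
tabulate companytype big6, row   (row percentages)
tab companytype big6 if companytype<=3, row col   (row and column percentages, restricted by a condition)

5. Counting the observations that satisfy a condition
count if big6==1
count if big6==0 | big6==1

6. Descriptive statistics of a continuous variable within groups of a discrete variable:
(1) by companytype, sort: summarize auditfees, detail
(2) sort companytype
by companytype: summarize auditfees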

XVI. Transforming variables

1. Code publicly traded companies as 1 and all others as 0, based on company type:
gen listed=0
replace listed=1 if companytype==2
replace listed=1 if companytype==3
replace listed=1 if companytype==5
replace listed=. if companytype==.
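The five statements above can be collapsed into one. The sketch below is not from the notes; it assumes, as above, that company types 2, 3 and 5 are the publicly traded ones, and it uses made-up data to show that the result is the same.

```stata
* Hypothetical data covering every case, including a missing company type
clear
input companytype
1
2
3
4
5
.
end

* inlist() returns 1 or 0; the if qualifier keeps listed missing when companytype is missing
gen listed = inlist(companytype, 2, 3, 5) if !missing(companytype)
list
```

The if qualifier matters: without it, a missing companytype would be coded 0 rather than missing.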

XVII. Creating new variables: gen

generate newvar = expression

XVIII. Data types

1. Numeric

Storage type   Bytes   Min                     Max
byte           1       -127                    +100
int            2       -32,767                 +32,740
long           4       -2,147,483,647          +2,147,483,620
float          4       -1.70141173319*10^38    +1.70141173319*10^36
double         8       -8.9884656743*10^307    +8.9884656743*10^308

2. String

Storage type   Bytes   Max length (characters)
str1           1       1
str2           2       2
…
str80          80      80

3. Specifying the data type when creating a variable
. gen str3 gender = "…"
. list gender in 1/10

4. When a variable occupies more bytes than it needs
. drop gender
. gen str30 gender = "…"
. browse
. describe gender
. compress gender

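5. Date type: %d dates, which are a count of the number of days elapsed since January 1, 1960.

(1) date(date variable, mask)
. gen fye=date(yearend, "…")   (the second argument must match the order in which day, month and year appear in the string; the result is the number of days since January 1, 1960)
. list yearend fye in 1/10

(2) The %d date format (fye is displayed as a date, but its numeric value is unchanged):
. format fye %d
. list yearend fye in 1/10
. sum fye

(3) Extracting the year, month and day from the day count
. gen year=year(fye)
. gen month=month(fye)
. gen day=day(fye)
. list yearend fye year month day in 1/10

(4) Combining three variables holding year, month and day into one date variable
. drop fye
. gen fye=mdy(month, day, year)
. format fye %d
. list yearend fye in 1/10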

(5) Converting a numeric date such as 20080131 into a date that STATA recognizes
. gen year=int(date/10000)
. gen month=int((date-year*10000)/100)
. gen day=date-year*10000-month*100
. list date year month day in 1/10
. gen edate=mdy(month, day, year)
. format edate %d
. list edate date in 1/10
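A worked run of the steps in (5) on the single value 20080131 may help. This is a sketch rather than part of the notes: the one-row dataset and the names edate2 and the string-based cross-check are assumptions added for illustration. One detail worth noting is the storage type: an 8-digit number does not fit exactly in a float, so the variable is created as long here.

```stata
* One made-up observation holding the numeric date used in the notes
clear
set obs 1
gen long date = 20080131          // long, not float: a float cannot hold 8 digits exactly

* The arithmetic from the notes
gen year  = int(date/10000)
gen month = int((date - year*10000)/100)
gen day   = date - year*10000 - month*100
gen edate = mdy(month, day, year)
format edate %td

* Cross-check: convert the number to a string and parse it with date()
gen edate2 = date(string(date, "%8.0f"), "YMD")
format edate2 %td
list
```

Both routes give the same day count; the string route is shorter but depends on the number always having eight digits.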

XIX. Stored results: r()

. sum auditfees
. gen meanadjaf=auditfees-r(mean)
. list meanadjaf in 1/10

Common r() results after the sum command:

r(N)       Number of cases
r(sum_w)   Sum of weights
r(mean)    Arithmetic mean
r(var)     Variance
r(sd)      Standard deviation
r(min)     Minimum
r(max)     Maximum
r(sum)     Sum of variable

Commands that display these values:
. sum auditfees, detail
. return list
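The same pattern can be tried on the auto dataset that ships with Stata. The sketch below is an illustration, not the course data: price stands in for auditfees, and price_dm is a hypothetical variable name.

```stata
sysuse auto, clear

* summarize leaves its results in r()
summarize price

* Use r(mean) immediately, before another command overwrites r()
gen double price_dm = price - r(mean)

* Show everything summarize stored
return list

* The demeaned variable should average to zero
summarize price_dm
```

The r() results belong to the most recent r-class command, so a stored value should be used (or copied into a scalar) before running another command such as summarize or count.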

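XX. The recode command (PPT slide 61)

1. Creating a dummy variable from a variable with many values: recode
recode year (min/1999 = 0) (2000/max = 1), gen(yeardum)
min/1999 assigns 0 to every value less than or equal to 1999; 2000/max assigns 1 to every value greater than or equal to 2000.

2. Grouping a continuous variable into intervals with chosen cut-offs: the recode() function
gen assets_categ=recode(totalassets, 100, 500, 1000, 5000, 20000, 100000, 1000000)
Each listed value is the upper limit of its group and is included in that group.
sort assets_categ
by assets_categ: sum totalassets assets_categ

3. Grouping a continuous variable into intervals of equal width: autocode()
autocode(variable name, # of intervals, min value, max value)
For example: gen assets_categ=autocode(totalassets, 10, 0, 10000)

4. Grouping a continuous variable into groups with the same number of observations: xtile
xtile assets_categ=totalassets, nquantiles(10)
The group sizes are not necessarily exactly equal.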

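XXI. Computing group means of a variable in one step: egen

Sort by company type, then compute the mean audit fee for each company type and store it in a new variable:
by companytype, sort: egen meanaf2=mean(auditfees)

Other egen functions: count(), mean(), median(), sum()

XXII. _n and _N

1. Show each observation's sequence number and the total number of observations
sort companyid fye
capture drop x
gen x=_n
capture drop y
gen y=_N
list companyid fye x y in 1/30

2. Show the sequence number within each group and the number of observations in each group
. capture drop x y
. sort companyid fye
. by companyid: gen x=_n
. by companyid: gen y=_N
. list companyid fye x y in 1/30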

3. Creating a new variable equal to the first or last value within each group
. sort companyid fye
. by companyid: gen auditfees_first=auditfees[1]
. by companyid: gen auditfees_last=auditfees[_N]
. list companyid fye auditfees auditfees_first auditfees_last in 1/30

4. Creating a new variable equal to the value lagged one or two periods
. sort companyid fye
. by companyid: gen auditfees_lag1=auditfees[_n-1]
. by companyid: gen auditfees_lag2=auditfees[_n-2]
. list companyid fye auditfees auditfees_lag1 auditfees_lag2 in 1/30
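A small made-up panel shows what the [_n-1] subscript does inside a by group. The data values and the helper variable n_in_group are hypothetical.

```stata
* Two hypothetical companies with three and two yearly records
clear
input companyid year auditfees
1 2000 10
1 2001 11
1 2002 13
2 2000 50
2 2001 55
end

sort companyid year
* Within each company, take the value from the previous row
by companyid: gen auditfees_lag1 = auditfees[_n-1]
by companyid: gen n_in_group = _n
list, sepby(companyid)
```

The first record of each company has no previous row, so its lag is missing. The subscript counts rows, not calendar years: if a year is absent, [_n-1] silently picks up the value from two years earlier. Declaring the panel with tsset or xtset and using the L. operator avoids this.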

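XXIII. Changing the dataset structure: reshape

Different databases structure their datasets differently. In long form, the different years of one company are on separate rows. In wide form, all years of one company are on a single row. The reshape command converts between the two. Note that the conversion places a requirement on the data: each company may have only one observation per year, otherwise the command fails.

1. Long to wide:
reshape wide yearend incorporationyear companytype sales auditfees nonauditfees currentassets currentliabilities totalassets big6 fye, i(companyid) j(year)

2. Wide to long:
reshape long yearend incorporationyear companytype sales auditfees nonauditfees currentassets currentliabilities totalassets big6 fye, i(companyid) j(year)

3. After the first conversion, the command can be shortened:
. reshape wide
. reshape long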

XXIV. Example: computing the CAR (cumulative abnormal return)

Given daily stock returns, market returns and the event date, compute the CAR over a three-day window.

1. Define the three-day window:
. sort ticker edate
. gen window=0 if eventdate<.   (the event day is coded 0)
. replace window=-1 if window[_n+1]==0 & ticker==ticker[_n+1]
. replace window=1 if window[_n-1]==0 & ticker==ticker[_n-1]

2. Compute AR and CAR:
. gen ar=ret-vwretd
. gen car=ar+ar[_n-1]+ar[_n+1] if window==0 & ticker==ticker[_n+1] & ticker==ticker[_n-1]

3. Check the result:
. list ticker edate ret vwretd ar car window if window<.
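The window logic can be traced on five days of made-up returns for one stock. The sketch assumes, as the commands above imply, that eventdate is non-missing only on the event-day row; all numbers are hypothetical.

```stata
* Hypothetical returns for one ticker; the event falls on day 3
clear
input str4 ticker edate ret vwretd eventdate
"AAA" 1  0.010 0.005 .
"AAA" 2  0.020 0.010 .
"AAA" 3  0.050 0.010 3
"AAA" 4 -0.010 0.000 .
"AAA" 5  0.000 0.005 .
end

sort ticker edate
gen window = 0 if eventdate < .
replace window = -1 if window[_n+1] == 0 & ticker == ticker[_n+1]
replace window = 1 if window[_n-1] == 0 & ticker == ticker[_n-1]

* Abnormal return = stock return minus market return
gen ar = ret - vwretd
* CAR on the event day = AR of the day before + event day + day after
gen car = ar + ar[_n-1] + ar[_n+1] if window == 0 & ticker == ticker[_n+1] & ticker == ticker[_n-1]
list
```

Here the abnormal returns on days 2, 3 and 4 are 0.01, 0.04 and -0.01, so the CAR on the event day is 0.04. The two ticker comparisons prevent the window from reaching into a neighbouring stock's rows.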

XXV. t-tests of means

1. Test whether audit fees differ significantly between Big 6 and other auditors in the whole sample
. use "…"
. gen lnaf=ln(auditfees)
. by big6, sort: sum lnaf
. ttest lnaf, by(big6)

2. To compare year by year, add the by year prefix
. gen fye=date(yearend, "…")
. format fye %d
. gen year=year(fye)
. sort year
. by year: ttest lnaf, by(big6)

3. t-test of whether the mean equals a specific value
. sum lnaf
. ttest lnaf=2.1

XXVI. Significance tests for medians

1. Commands that report the median:
by big6, sort: sum lnaf, detail
by big6, sort: centile lnaf

2. Median tests:
. median lnaf, by(big6)
. ranksum lnaf, by(big6)

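XXVII. Contingency table tests

1. Creating a contingency table:
. tabulate companytype big6, row
The first variable defines the rows (the leftmost column of the table); the second variable defines the columns (the top row of the table).

2. Test of association between two variables: chi2
tabulate companytype big6, chi2 row

3. Correlation matrix:
pwcorr lnaf big6 year listed

4. Correlation matrix with significance levels:
pwcorr lnaf big6 year listed, sig

5. Showing the number of observations in the matrix:
. pwcorr lnaf big6 listed if year==2000, sig obs

XXVIII. Creating a sample without missing values

1. A flag equal to 1 for observations with no missing values:
gen samp=1 if lnaf<. & big6<. & year<. & listed<.
As written, samp is missing (.) rather than 0 when at least one variable is missing. To obtain a 0/1 variable, use gen samp=(lnaf<. & big6<. & year<. & listed<.).

2. A variable that counts the missing values in each row:
egen miss=rmiss(lnaf big6 year listed)   (rmiss() is called rowmiss() in newer versions)
sum miss, detail

XXIX. Graphs

1. Histograms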

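. histogram incorporationyear, width(1)
. histogram incorporationyear, bin(147)

width() sets the width of each bar; bin() sets the number of bars. Adjusting the width can make the graph easier to read.

• Choosing the range and the bar width: hist lnaf if lnaf>=0 & lnaf<=5, width(0.25)
• Choosing the tick labels on the horizontal axis: hist lnaf if lnaf>=0 & lnaf<=5, width(0.25) xlabel(0(0.5)5)
• Comparing with a normal distribution: hist lnaf if lnaf>=0 & lnaf<=5, width(0.25) normal

2. Scatter plots (scatter)

. scatter lnaf lnta
The first variable is on the vertical axis, the second on the horizontal axis.

. twoway (scatter lnaf lnta, msize(tiny)) (lfit lnaf lnta)
This adds the line of best fit to the scatter plot.

XXX. Winsorizing: winsor

. winsor rev, gen(wrev) p(0.01)   (0.01 is the fraction of observations winsorized in each tail)
. winsor rev, gen(wrev) h(5)   (5 is the number of observations winsorized in each tail)
Winsorized observations are not dropped: each extreme value is replaced by the nearest value that is kept. winsor is a user-written command and must be installed before use.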
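If the user-written winsor command is not installed, a similar result can be obtained with built-in commands. The sketch below is an approximation under stated assumptions: it uses the auto dataset and its price variable for illustration, cuts at the 5th and 95th percentiles, and its percentile-based bounds need not coincide exactly with the order-statistic rule that winsor applies. The scalar names cut_lo and cut_hi are hypothetical.

```stata
sysuse auto, clear

* 5th and 95th percentiles of price, returned in r(r1) and r(r2)
_pctile price, percentiles(5 95)
scalar cut_lo = r(r1)
scalar cut_hi = r(r2)

* Pull values outside [cut_lo, cut_hi] back to the nearest bound; no row is dropped
gen double wprice = min(max(price, cut_lo), cut_hi) if !missing(price)
summarize price wprice
```

The number of observations is the same before and after; only the minimum and maximum move inward. The if qualifier is needed because max() and min() ignore missing arguments, which would otherwise turn a missing price into cut_lo.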

Chapter 2 Linear Regression

Contents:

• 2.1 The basic idea underlying linear regression
• 2.2 Single variable OLS
• 2.3 Correctly interpreting the coefficients
• 2.4 Examining the residuals
• 2.5 Multiple regression
• 2.6 Heteroskedasticity
• 2.7 Correlated errors
• 2.8 Multicollinearity
• 2.9 Outlying observations
• 2.10 Median regression
• 2.11 "Looping"

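2.1 The basic idea underlying linear regression

1. Residuals
Y is the actual value, Ŷ is the predicted value, and ε is the residual. OLS chooses the coefficients that minimize the sum of squared residuals.

2. Basic single-variable regression
regress y x

3. Stored regression results
The estimated coefficients are stored in _b[varname]; the constant is stored in _b[_cons].

4. Predicted values and residuals
. predict yhat
. predict yres, resid   (yres is the actual value minus the predicted value)

5. Scatter plot of the residuals against X
twoway (scatter yres x) (lfit yres x)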

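6. Measuring how precisely a coefficient is estimated: the standard error.

The t-statistic is the coefficient divided by its standard error. The p-value is computed from the distribution of the t-statistic and gives the probability of observing a t-statistic at least this large in absolute value if the true coefficient were zero. Both rely on the residuals satisfying the following assumptions:

Homoscedasticity (i.e., the residuals have a constant variance)
Non-correlation (i.e., the residuals are not correlated with each other)
Normality (i.e., the residuals are normally distributed)

7. What the items in the regression output mean

• Degrees of freedom of each sum of squares:
  • For the ESS, df = k-1 where k = number of regression coefficients (df = 2 - 1)
  • For the RSS, df = n - k where n = number of observations (= 11 - 2)
  • For the TSS, df = n-1 (= 11 - 1)
• MS: The last column (MS) reports the ESS, RSS and TSS divided by their respective degrees of freedom.
• R-squared: The R-squared = ESS / TSS
• Adjusted R-squared: Adj R-squared = 1-(1-R2)(n-1)/(n-k). It corrects for the fact that R-squared rises whenever an explanatory variable is added, even one that is only weakly related to Y.
• Root MSE = square root of RSS/(n-k): the typical size of the model's prediction error.
• The F-statistic = (ESS/(k-1))/(RSS/(n-k)): the overall explanatory power of the model.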
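The formulas in item 7 can be verified against the results that regress stores in e(). The sketch below uses the auto dataset as a stand-in for the course data; the scalar names prefixed my_ are hypothetical.

```stata
sysuse auto, clear
regress price weight

* Stored pieces: e(mss) = ESS, e(rss) = RSS, e(df_m) = k-1, e(df_r) = n-k, e(N) = n
scalar my_r2   = e(mss) / (e(mss) + e(rss))
scalar my_r2a  = 1 - (1 - my_r2) * (e(N) - 1) / e(df_r)
scalar my_rmse = sqrt(e(rss) / e(df_r))
scalar my_F    = (e(mss) / e(df_m)) / (e(rss) / e(df_r))

* Hand-computed values next to the ones regress reports
display "R-squared      " my_r2   "   " e(r2)
display "Adj R-squared  " my_r2a  "   " e(r2_a)
display "Root MSE       " my_rmse "   " e(rmse)
display "F              " my_F    "   " e(F)
```

Each pair of numbers should agree, which confirms where the parentheses belong in the Root MSE and F formulas.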

2.3 Correctly interpreting the coefficients

1. To test whether Big 6 audit fees differ between publicly listed and unlisted companies, use an interaction variable: big6*listed.

• Notice that the ologit and oprobit results are quite close to each other.
• Usually it doesn't make much difference whether you use ordered logit or ordered probit.

3.6 Count data models

1. When they apply:

Count models apply when the dependent variable is a non-negative integer and the count itself is meaningful.

• For example, consider the number of financial analysts that follow a given company:
  • if the company is not followed by any analysts, Y = 0
  • if the company is followed by one analyst, Y = 1
  • if the company is followed by two analysts, Y = 2
  • if the company is followed by three analysts, Y = 3

OLS is not suitable for such data: it requires a continuous dependent variable that can range from minus infinity to plus infinity, whereas a count is discrete and cannot be negative.

2. Suitable regression models

• Two distributions that fulfill the criteria of having non-negative discrete integer values are the "Poisson" and the "negative binomial".
  • the negative binomial (nbreg)
  • the Poisson (poisson)

3. Examples of count data in practice:

• The number of R&D patents awarded
• The number of airline accidents
• The number of murders
• The number of times that mainland Chinese people have visited Singapore
• The number of weaknesses found by peer reviewers at audit firms

4. Choosing the model:

(1) The Poisson model:

• The Poisson distribution is most often used to determine the probability of x occurrences per unit of time, e.g., the number of murders per year.
• The basic assumptions of the Poisson distribution are as follows:
  • The time interval can be divided into small subintervals such that the probability of an occurrence in each subinterval is very small.
  • The probability of an occurrence in each subinterval remains constant over time.
  • The probability of two or more occurrences in each subinterval must be small enough to be ignored.
  • An occurrence or nonoccurrence in one subinterval must not affect the occurrence or nonoccurrence in any other subinterval (this is the independence assumption).

An example that satisfies these conditions:

  • The probability of a murder occurring during any given minute is small.
  • The probability of a murder occurring during any given minute remains constant during the year.
  • The probability of more than one person being murdered during any given minute is very small.
  • The number of murders in any given time period is independent of the number of murders in any other time period.

Estimating the parameter:

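• The only parameter needed to characterize the Poisson distribution is the mean rate at which events occur, the "incidence rate" λ.
• For example, λ can be the average number of murders per month or the average number of analysts per company.

Probability function of the Poisson distribution:

P(Y = y) = e^(-λ) λ^y / y!,   y = 0, 1, 2, …

• Exercise: if the mean number of crimes per month is 2, find the probability of exactly 3 crimes in a month.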
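The exercise can be worked directly from the Poisson probability function P(Y = y) = e^(-λ) λ^y / y! with λ = 2 and y = 3, which gives e^(-2) × 8 / 6, about 0.18. A short sketch that evaluates it in STATA, first by hand and then with the built-in poissonp() function:

```stata
* P(Y = 3) when the mean number of events per month is 2
* By hand: exp(-lambda) * lambda^y / y!
display exp(-2) * 2^3 / exp(lnfactorial(3))

* Built-in: poissonp(m, k) is the probability of exactly k events when the mean is m
display poissonp(2, 3)
```

Both lines display the same value, so a month with exactly three crimes has a probability of roughly 18%.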

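Features of the model:

• The model has a single parameter, λ, which is both the mean and the variance of the distribution. In a Poisson regression the incidence rate is estimated as λ = exp(a0 + a1 X).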

Commands:

• Control for heteroscedasticity using the robust option:
poisson weaknesses reviewed_firm_also_reviewer litigation_dummy, robust
• If this were a panel dataset (it isn't), you would also need to control for time-series dependence using the cluster() option.

Drawback:

• Unobserved heterogeneity in the data (e.g., omitted variables) will often cause the variance to exceed the mean (a phenomenon known as "overdispersion").

Test after estimation:

• Run poisgof immediately after the regression. If the test is significant, the Poisson model is not appropriate and the negative binomial model must be used instead; that model does not assume that the mean and variance of the distribution are the same.

(2) The negative binomial model:

• nbreg weaknesses reviewed_firm_also_reviewer litigation_dummy, robust   (add cluster() where needed)
• If the estimated α is significant, the Poisson model is not appropriate.

3.7 Tobit and interval regression models

1. Type of data:

• Censoring (or truncation) of the dependent variable. For example, when the number of people who want to attend exceeds the number of seats, the true demand is not observed.

2. Choosing the model:

• The censoring problem can be solved by estimating a "tobit" model.
• The tobit model is somewhat similar:

Y* = a0 + a1 X + e
Y = 0 if -∞ < Y* ≤ 0
Y = Y* if 0 < Y* < +∞

The Y* and Y variables are both observed when they are greater than zero (Y* is unobserved when Y = 0).

• Both the probit and tobit models assume that the errors (e) are normally distributed.

3. Example:

• Recall that in our fee dataset, the nonauditfees variable is left-censored at zero because many companies choose not to purchase any non-audit services. This phenomenon is like some individuals choosing not to purchase any cigarettes when the price exceeds P0.

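. gen lnta=ln(totalassets)
. egen miss=rmiss(lnnaf lnta)   (miss counts how many of lnnaf and lnta are missing in each row; miss==0 means neither is missing)
. tobit lnnaf lnta if miss==0, ll(0)   (ll(#) gives the left-censoring limit; ul(#) gives the right-censoring limit)
. tobit lnnaf lnta if miss==0, ll   (same as the previous command)

• After estimation, the following commands show how many observations are censored:
count if miss==0 & lnnaf==0
count if miss==0 & lnnaf>0

4. The tobit model can also be used when the data are censored on both sides

. gen lnnaf1=lnnaf
. replace lnnaf1=5 if lnnaf>5 & lnnaf!=.
. tobit lnnaf1 lnta if miss==0, ll(0) ul(5)
. tobit lnnaf1 lnta if miss==0, ll ul   (when the limits are the sample minimum and maximum they need not be stated; STATA picks them automatically)
. tobit lnnaf lnta if miss==0, ll ul(5) robust cluster(companyid)   (controls for heteroscedasticity and time-series dependence)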
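Why tobit is preferred to OLS under censoring can be seen on simulated data. Everything in the sketch below is an assumption made for illustration: the sample size, the seed, and the true coefficients (intercept 1, slope 2) of the latent variable ystar.

```stata
* Simulated data: the latent variable ystar is observed only when positive
clear
set seed 12345
set obs 2000
gen x = rnormal()
gen ystar = 1 + 2*x + rnormal()     // latent variable, true slope = 2 (assumed)
gen y = max(ystar, 0)                // observed variable, left-censored at 0

* Tobit accounts for the pile-up of observations at zero
tobit y x, ll(0)
scalar b_tobit = _b[x]

* OLS on the censored variable, for comparison
regress y x
scalar b_ols = _b[x]

count if y == 0
display "tobit slope = " b_tobit "   OLS slope = " b_ols
```

The tobit slope should be close to the true value of 2, while the OLS slope is pulled toward zero because the censored observations flatten the fitted line.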

3.8 Duration models (survival models)

1. Type of data:

The dependent variable measures how long something lasts. For example:

• Duration of life (medical, engineering)
  • how long do people live for?
  • how long do machines last?
• Duration of unemployment (economics)
  • how long do people remain unemployed?
  • for example, we may be interested in how retraining schemes affect the duration of unemployment
• Duration of CEO tenure (management)
  • how long does the CEO stay at the same company?
• Duration of auditor-company tenure (accounting)
  • how long do the company and audit firm stay together?

2. The quantity of interest:

• The "hazard rate", h(t), is the probability that the event will occur in period t, given that it has not occurred up to time t.

3. Declaring the data with stset timevar

. use "…"
. list
. stset failtime

This command creates four internal variables. To display them: list failtime _st _d _t _t0

• The _st variable is a dummy equal to one for observations whose data has been stset (e.g., there would have been some zero values if we had excluded some observations using the if qualifier).
• The _d variable indicates whether the status changed (failure occurred).
• The _t variable is the survival time.
• The _t0 variable is the starting time, 0 by default.

4. Estimation with the Cox proportional hazards model

• Command: stcox
stcox load bearings   (load and bearings are the two factors that affect the lifetime)

• The reported hazard ratios are the exponentials of the coefficients.

The hazard ratio for load = 1.52647 = exp(a1) where a1 is the coefficient on load, so a1 = ln(1.52647) = 0.4229578

The coefficient on bearings = ln(0.0636433) = -2.754461

The load coefficient is significantly positive, implying that the machines fail more quickly (higher hazard rate) when they are under greater stress.

The bearings coefficient is significantly negative, implying that the machines fail less quickly (lower hazard rate) when they use the new type of bearing.
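The conversion between hazard ratios and coefficients quoted above is plain arithmetic and can be checked with display. The sketch uses only the numbers reported in the notes:

```stata
* Coefficient = ln(hazard ratio)
display ln(1.52647)       // coefficient on load
display ln(0.0636433)     // coefficient on bearings

* Hazard ratio = exp(coefficient)
display exp(0.4229578)    // hazard ratio for load
```

A hazard ratio above 1 corresponds to a positive coefficient (higher hazard), and a hazard ratio below 1 to a negative coefficient (lower hazard).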

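• To report coefficients rather than hazard ratios, use:
stcox load bearings, nohr

5. Handling ties, method 1: Breslow

• The Breslow method is very fast and is the default method that STATA uses for resolving ties.
• A tie occurs when two or more subjects have the same survival time.
• Command: stcox load bearings, breslow

6. Handling ties, method 2: Efron

• Command: stcox load bearings, efron

7. The Efron method is more accurate than the Breslow method but takes longer. When two failures occur at the same time, each of the two possible orderings is given a probability of 0.5.

8. When there is censoring, i.e., not all subjects in the sample have failed, an option must be added to the command.

• stset failtime, failure(failed)

The failtime variable gives the time of failure or censoring. The failed variable indicates whether failure or censoring occurred. STATA assumes censoring if failed equals zero or is set to missing.

9. All of the above deals with one row per subject. When a characteristic of the subject changes over time, several rows are needed to describe it. In that case the subject identifier must be added as an option to the command that declares the survival data:

• stset t, id(patid) failure(died)

10. When left-censoring occurs, the entry time variable must be added to the stset command:

• stset end, id(id) failure(died) enter(begin)

11. When data are missing for part of the period in the middle, specify the failure time, the subject identifier, the failure indicator and the entry time:

• stset end, id(id) failure(died) enter(begin)

12. To correct for heteroscedasticity and time-series dependence, add robust and cluster() at the end of the regression command:

• stcox x1, robust cluster(id)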

Summary: choose the regression model according to the type of the dependent variable

Dependent variable (Y) | Examples | Estimation method(s) | STATA command
Continuous (-∞ < Y < +∞) | Log of audit fees; Stock returns; Cost of capital | OLS; Quantile regression | regress; qreg
Binary (Y = 0, 1) | Listed / Not listed; Big 6 / Non-Big 6 auditor | Probit; Logit | probit; logit
Discrete and unordered (Y = 0, 1, 2,..) | Method of transport (train, bus, car, bicycle); Type of company (private, public unquoted, quoted) | Multinomial logit; Multinomial probit | mlogit; mprobit
Discrete and ordered (Y = 0, 1, 2,..) | Type of peer review report (adverse, modified, unmodified) | Ordered probit; Ordered logit | oprobit; ologit
Discrete count data (Y = 0, 1, 2, …) | Number of weaknesses disclosed in peer review report | Poisson; Negative binomial | poisson; nbreg
Continuous but censored (kL ≤ Y < kH) | Non-audit fees; Football attendance | Tobit | tobit
Duration data (often censored) | Duration of unemployment; CEO tenure; Company survival | Cox proportional hazards | stcox

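Chapter 4 Panel Data

Contents:

• 4.1 The basic idea
• 4.2 Linear regression
• 4.3 Logit and probit models
• 4.4 Other models

4.1 The basic idea

1. Panel data are cross-sectional time series: the same units (companies or individuals) are observed over several consecutive years while other variables change. A stable company-specific or individual-specific factor may influence the observations of every year. In the earlier chapters, the lack of independence over time was handled by adding the robust cluster() option to the regression command, which corrects for heteroscedasticity and correlated errors.

2. Advantages of panel data: first, the sample is larger, so the estimates are more precise; second, dynamic effects (the previous or the following year's data) can be included in the regression; third, the effect of unobservable variables across years can be controlled for.

3. Econometric problems that can arise with panel data:

A general regression model can be written as

Yit = a0 + a1 Xit + εit

With panel data, ε should contain two parts: a random error and a component that stays fixed from year to year. That is,

εit = ui + eit

so the regression model becomes

Yit = a0 + a1 Xit + ui + eit

where u is the component that stays fixed across years, e is the error term and X is the explanatory variable.

If u and X are uncorrelated, the coefficient on X is unbiased. In practice they are often correlated, so special treatment is needed.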

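4. Some ways of dealing with these problems:

(1) The simplest is to add the robust cluster() option to the regression.

(2) Use a fixed-effects model: create a dummy variable for each value of the identifier and include the dummies in the regression to control for individual characteristics. For example:

. tab persnr, gen(dum_)   (creates the dummies dum_1, dum_2, … from the values of persnr)
. reg lsat age dum_1 dum_2 dum_4
. reg lsat age dum_1 dum_2 dum_3 dum_4, nocons

To avoid perfect multicollinearity, one dummy must be left out if the constant is kept; to keep all the dummies, the constant must be dropped.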
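That the dummy-variable regression is a fixed-effects estimator can be confirmed by comparing it with xtreg, fe on the same data. The sketch below is illustrative: the panel is simulated, and the number of persons, the seed and the coefficient values are all assumptions. The variable names persnr, lsat and age follow the notes.

```stata
* Simulated panel: 4 persons, 6 yearly records each
clear
set seed 2024
set obs 4
gen persnr = _n
gen u = rnormal()                          // person effect, constant over time
expand 6
bysort persnr: gen age = 20 + _n + persnr
gen lsat = 5 + 0.1*age + u + rnormal(0, 0.5)

* Fixed effects through person dummies (one dummy left out, constant kept)
tabulate persnr, gen(dum_)
regress lsat age dum_2 dum_3 dum_4
scalar b_dummy = _b[age]

* Fixed effects through xtreg
xtset persnr
xtreg lsat age, fe
scalar b_fe = _b[age]

display "dummy regression: " b_dummy "   xtreg, fe: " b_fe
```

The two coefficients on age coincide. xtreg, fe is simply more convenient when there are many units, because it does not need one dummy per unit.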

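• Another command splits a variable by the values of the identifier: separate lsat, by(persnr). It creates a separate variable for each value of persnr (lsat1, lsat2, lsat3, …) so that they can be compared one by one, for example:

twoway (lfit lsat1 age) (scatter lsat1 age)
twoway (lfit lsat2 age) (scatter lsat2 age)
twoway (lfit lsat3 age) (scatter lsat3 age)
twoway (lfit lsat4 age) (scatter lsat4 age)

• Another command draws, in one graph, the fitted line of each separated variable, the overall fit between the two variables, and the scatter of each separated variable:

twoway (line lsat_hat1-lsat_hat4 age) (lfit lsat age) (scatter lsat1-lsat4 age)

(Figure: graph produced by the command above.)

(3) Dedicated commands:

Fixed-effects model: xtreg lsat age, fe i(persnr)
Random-effects model: xtreg lsat age, re i(persnr)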

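Both models can be written as a regression on partially demeaned data:

Yit - θ Ȳi = a0 (1 - θ) + a1 (Xit - θ X̄i) + (εit - θ ε̄i)

where Ȳi, X̄i and ε̄i are the means over time for unit i. If θ=1 this is the fixed-effects model; if θ=0 it is the OLS model; if 0<θ<1 it is the random-effects model.

This transformation largely removes the component that stays constant in every year. Sometimes the previous period's value Yi,t-1 is used in place of the mean.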

4.2 Linear regression

As explained above, a linear regression with a continuous dependent variable can be estimated with OLS, the fixed-effects model or the random-effects model. Which model works best in which situation?

1. When ui and Xit are correlated, use the fixed-effects model. When they are not correlated, it is better to use the random-effects model (because it is more efficient). Use OLS only when there are no time-series effects.

2. How to choose the model:

(1) Fixed effects versus random effects:

Use the Hausman test. Its logic is as follows. If ui and Xit are correlated, the fixed-effects estimates are unbiased and the random-effects estimates are biased, so the two differ considerably. If they are not correlated, both are unbiased, the random-effects estimates are more efficient, and the two do not differ much. The test therefore checks whether the coefficients differ significantly.

• Null hypothesis (H0): ui and Xit are uncorrelated.
• The Hausman statistic is distributed as chi2.
• If the chi2 statistic is positive and statistically significant, we can reject the null hypothesis. This would mean that the fixed-effects model is preferable because the coefficients are consistent.
• If the chi2 statistic is not both positive and statistically significant, we cannot reject the null hypothesis. This would mean that the random-effects model is preferable because the coefficients are consistent and efficient.

Commands:

. xtreg lsat age, fe i(persnr)
. estimates store fixed_effects
. xtreg lsat age, re i(persnr)
. estimates store random_effects
. hausman fixed_effects random_effects

Caveat: in a small sample the result is not necessarily reliable. On the other hand, this result is not very reliable because the asymptotic assumption fails to hold in this small sample.
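The Hausman procedure can be tried on simulated data in which the unit effect is correlated with the regressor by construction, so the fixed-effects model should be preferred. The sketch is illustrative only: the panel dimensions, the seed and the coefficient values (true slope 2) are assumptions, and id, x and y are hypothetical variable names.

```stata
* Simulated panel: 200 units, 5 periods each
clear
set seed 42
set obs 200
gen id = _n
gen u = rnormal()                    // unit effect, constant over time
expand 5
gen x = 0.8*u + rnormal()            // x is correlated with u by construction
gen y = 1 + 2*x + u + rnormal()      // true slope on x is 2 (assumed)

xtset id
xtreg y x, fe
scalar b_fe = _b[x]
estimates store fixed_effects

xtreg y x, re
scalar b_re = _b[x]
estimates store random_effects

* H0: u and x are uncorrelated (random effects is consistent)
hausman fixed_effects random_effects
```

Because u enters x, the random-effects slope is pushed above the true value while the fixed-effects slope stays near 2, and the Hausman statistic reflects the gap between them.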

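(2) Random effects versus OLS:

• If we cannot reject the null hypothesis that ui and Xit are uncorrelated, we need to determine whether the ui are distributed randomly across individuals. The Breusch-Pagan test checks whether σu² (the variance of u) is significantly positive.

Commands:

xtreg, re   (random-effects panel regression)
xttest0

In this example the test does not reject the null hypothesis.

3. A caution when using the fixed-effects model: the results are unreliable when … is close to 0.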

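4.3 Logit and probit models

1. When the dependent variable is 0/1 and the data are panel data, the following command handles the time-series problems:

• xtlogit, fe i()

The results are interpreted differently from the continuous case: units whose dependent variable never changes are dropped from the estimation automatically. This is common in practice. For example, in a study of whether a company committed fraud in a given year, most companies never do, so Y=0 throughout the sample years and the fixed-effects model drops these companies.

• xtlogit, re i()

The random-effects model keeps all observations in the sample.

2. Which model to use?

Decide with the following commands, as in the continuous case.

. xtlogit y x, fe i(id)
. estimates store fixed_effects
. xtlogit y x, re i(id)
. estimates store random_effects
. hausman fixed_effects random_effects

Look at the sign of the chi2 statistic and at its statistical significance.

3. When the dependent variable is 0/1, either logit or probit can be used. With panel data, however, probit has no fixed-effects model: if you type xtprobit big6 lnta, fe i(companyid) you will get an error message. The random-effects model is available: xtprobit big6 lnta, re i(companyid).

4. Random effects versus the ordinary binary model:

• Just as with the random-effects logit model, there is a likelihood ratio test that helps us to choose between the random-effects probit and the ordinary probit models.
• In our data, we can reject the hypothesis that rho = 0, so we may decide not to use an ordinary probit model.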

4.4 Other models

1. Count dependent variables: Fixed-effects and random-effects models are available for count data (xtpoisson and xtnbreg). We can test which model is preferable using a Hausman test.

2. Censored data: Random-effects models are available, but fixed-effects models are not currently available.

3. Duration data are time series by nature.

4. Multinomial and ordered models: neither fixed-effects nor random-effects models are currently available.

Summary:

• The xtreg command is used to estimate fixed-effects and random-effects models (where the dependent variable is continuous).
• We can test whether the fixed-effects or random-effects model is preferable using the hausman test.
• If there is a significant correlation between ui and Xit, the fixed effects model is preferable to the OLS and random effects models.
• If there is no significant correlation between ui and Xit, we can test whether the OLS or random-effects model is preferable using a LM test.

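Chapter 5 Endogeneity of Variables

Contents:

• 5.1 The problem of endogeneity bias
• 5.2 The basic idea underlying the use of instrumental variables
• 5.3 When the endogenous right hand side variable is continuous
• 5.4 When the endogenous right hand side variable is binary

5.1 The problem of endogeneity bias

1. Some basic issues with OLS regression

(1) How an ordinary regression deals with them:

In the simplest least-squares regression, Yit = a0 + a1 X1it + uit, the estimate of a1 is unbiased only when X1it is uncorrelated with the error term (uit). One way to make this assumption hold is to control for observable variables that affect Y and are also correlated with X1it. Another way is to use a fixed-effects model on panel data, which controls for unobservable variables that affect Y and are correlated with X. This usually yields an unbiased estimate of a1. When it does, the explanatory variables are said to be exogenous. The final regression model can be written as:

Yit = a0 + a1 X1it + a2 X2it + ui + eit

where X2it are the control variables and ui is the fixed effect in the panel data.

(2) Problems that an ordinary regression cannot avoid:

First, we cannot find every control variable that affects Y and is correlated with X, in which case multiple regression does not solve the problem. Second, without panel data the fixed-effects model cannot be estimated. Even with panel data, the fixed-effects model does not help if Y and X vary little over time. Further, even with panel data and enough variation over time, the fixed-effects model does not help if the unobservable variables correlated with X are not constant over time.

Endogeneity: an endogenous variable is one determined within the economic model being estimated, and such a variable is likely to be correlated with the error term. For example, Y2it is an endogenous explanatory variable in:

Y1it = a0 + a1 Y2it + a2 Xit + uit (1)
Y2it = b0 + b1 Xit + b2 Zit + vit (2)

The OLS estimate of a1 in equation (1) is unbiased only when vit and uit are uncorrelated; otherwise it is biased. To avoid the bias, equation (1) must be estimated by instrumental variables regression rather than OLS.

Equations (1) and (2) are called structural equations; they describe the economic relationship between Y1 and Y2. Substituting equation (2) into equation (1) gives equation (3):

Y1it = a0 + a1 (b0 + b1 Xit + b2 Zit + vit) + a2 Xit + uit (3)

All the variables on the right-hand side of this equation are exogenous.

The basic idea of instrumental variables regression is to remove vit from equation (3), so that the estimate of a1 is unbiased.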

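5.2 The basic idea underlying the use of instrumental variables

1. The basic idea of instrumental variables: one way to eliminate vit is to use the predicted value of Y2 instead of its actual value:

Ŷ2it = b̂0 + b̂1 Xit + b̂2 Zit
Y1it = a0 + a1 Ŷ2it + a2 Xit + uit (4)

The a1 estimated from equation (4) is unbiased because vit has been eliminated. Note that a1 can be estimated only if structural equation (2) contains at least one variable that does not appear in structural equation (1).

2. Under-identification and over-identification:

(1) Under-identified:

Because both variables in equation (2) also appear in equation (1), the estimate of a1 cannot be determined, so the equation is not identified.

(2) Over-identified:

From the equations one can estimate …, and therefore …; there are then two equations but only one endogenous variable, so the system is over-identified. Such a system of equations has a triangular structure.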

5.3 When the endogenous right hand side variable is continuous

1. When the model has a triangular structure (Y2 affects Y1, but Y1 does not affect Y2), it can be estimated with STATA's instrumental variables command, ivregress. The most common instrumental variables estimators are 2SLS, LIML and GMM; 2SLS is the one most widely used in applied research.

2. An example of using instrumental variables:

(1) The model to be estimated:

• rent = a0 + a1 pct_population_urban + a2 housing_value + u
• housing_value = b0 + b1 family_income + b2 region2 + b3 region3 + b4 region4 + v

(2) Commands:

. use "…"   /* open the dataset */
. ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
. ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
. ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)   /* choose one of the three estimators above */

/* Testing the validity of the instruments: We should test whether our chosen instruments are exogenous (i.e., they should be uncorrelated with the error term) and whether it is valid to exclude some of them from the model that has the endogenous regressor. If they are not exogenous or they should not be excluded, they are not valid instruments. */

. estat overid   /* run this immediately after the regression to test the validity of the instruments; a significant result means they are not valid. The test is available only when the equation is over-identified, not when it is exactly identified, because it presumes that at least one instrument is valid */

/* Testing whether the endogenous regressor is biased: We can also test whether the coefficient of the "endogenous" regressor is biased under OLS. However, the Hausman tests for endogeneity bias are only reliable if the chosen instruments are valid. In our example they are not, and so we cannot draw conclusions about the potential for endogeneity bias. The result depends heavily on the sample data. */

. ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
. estat endogenous
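The effect of instrumenting can be seen on simulated data with the triangular structure of equations (1) and (2). The sketch below is not the rent example from the notes: all variables are generated, the sample size and seed are arbitrary, the true coefficient on y2 is set to 1, and z1 and z2 are hypothetical instruments that are valid by construction.

```stata
* Simulated triangular system: y2 affects y1, y1 does not affect y2
clear
set seed 7
set obs 5000
gen z1 = rnormal()
gen z2 = rnormal()
gen x  = rnormal()
gen v  = rnormal()
gen u  = 0.7*v + rnormal()                 // u and v are correlated, so y2 is endogenous
gen y2 = 1 + 0.5*x + z1 + z2 + v           // equation (2): z1, z2 are excluded from (1)
gen y1 = 2 + 1*y2 + 0.5*x + u              // equation (1): true coefficient on y2 is 1

* OLS is biased because y2 is correlated with u
regress y1 y2 x
scalar b_ols = _b[y2]

* 2SLS with two instruments for one endogenous regressor (over-identified)
ivregress 2sls y1 x (y2 = z1 z2)
scalar b_iv = _b[y2]

estat overid        // instruments are valid here by construction
estat endogenous    // tests whether y2 can be treated as exogenous

display "OLS: " b_ols "   2SLS: " b_iv
```

The OLS coefficient on y2 lies above the true value of 1, while the 2SLS estimate is close to it. With two instruments and one endogenous regressor the equation is over-identified, which is what makes estat overid available.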

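3. Points to note when using instrumental variables:

• The key to estimating an instrumental variables model is to find one or more exogenous variables that explain the endogenous variable and that are not included in the main equation.
• Unfortunately, most accounting studies that use IV regression make no attempt to explain why the chosen instruments are exogenous and why they can be excluded from the main structural equation.
• As a result, Larcker and Rusticus (2010) criticize the way in which accounting studies have applied IV regression.
• A key problem is that the IV results can be very sensitive to the researcher's choice of which variables to exclude from the structural model and, in many studies, these variables have been chosen in a very arbitrary way.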

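5. Simultaneous equations: Estimating simultaneous equations using 3SLS (reg3)

In a triangular system, the dependent variable Y2 affects the other dependent variable Y1, but Y2 is not affected by Y1. In a system of simultaneous equations, the two dependent variables affect each other, as in the following two equations:

Y1it = a0 + a1 Y2it + a2 Xit + a3 Z2it + uit (1)
Y2it = b0 + b1 Y1it + b2 Xit + b3 Z1it + vit (2)

If the equations are estimated by OLS, both a1 and b1 are biased. Note that, for identification, each equation must contain at least one exogenous variable that the other equation does not contain, such as Z2it in (1) and Z1it in (2). Such a system can be estimated with the reg3 command.

. reg3 (rent = housing_value pct_population_urban) (housing_value = rent family_income region2 region3 region4)

With this model, the robust cluster() option and the overid and endog commands cannot be used.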

5.4 When the endogenous right hand side variable is binary

• When the endogenous variable is binary: This brings us to a special class of models which are known as "self-selection" or "Heckman" models. "Selectivity" = "Endogeneity" where the endogenous regressor is binary.

1. Binary endogenous variables that are common in accounting:

a) Companies decide whether to use hedge contracts (Barton, 2001; Pincus and Rajgopal, 2002).
b) Companies decide whether to grant stock options (Core and Guay, 1999).
c) Companies decide whether to hire Big 5 or non-Big 5 auditors (e.g., Chaney et al., 2004).
d) Governments decide whether to fully or partially privatize (Guedhami and Pittman, 2006).
e) Companies decide whether to follow international financial reporting strategy (Leuz and Verrecchia, 2000).
f) Companies decide whether to recognize financial instruments at fair value or disclose (Ahmed et al., 2006).
g) Companies decide whether or not to go private (Engel et al., 2002).

2. Concerns about selectivity arise when the RHS dummy variable (D) is endogenous:

Endogeneity results in bias if E(u | D) ≠ 0.
The intuition underlying Heckman is to estimate and then control for E(u | D).
First model the choice of D:

Z is a vector of exogenous variables that affect D but have no direct effect on Y.
