Frances Woolley has posted some great tips on how to clean data in Stata. This post follows up with some tips on how to quickly and robustly estimate models as you vary specifications, and on how to get your results in a publication-ready form. The .do file described in this post can be downloaded by clicking here; note that you must change the extension from .doc to .do.

**THOU SHALT USE .DO FILES**. No, don’t argue, just do it, even for your exploratory data analysis. You will find that doing everything in .do files is much faster than working from the command line (or, even worse, the drop-down menus), and you automatically document your work. Also, make sure you adequately comment your .do files. That convoluted code you just wrote makes perfect sense to you right now, but will you remember what it does when you come back to this project a week from now, or a year from now?

Your goal while writing your .do file is to keep your code clean, easily readable, and efficient. Writing decent code will speed up your work and minimize the chances that a coding error will affect your results. You want to set up your code in such a way that you can easily make changes to your specification and see how your results are affected. And you want to produce output that looks good right out of Stata so you don’t have to do a lot of work writing up your tables.

Suppose you’ve got your data all cleaned following steps such as those Frances lays out. For this example we’ll use the Mroz wage data as presented by Jeff Wooldridge, available online, and we’ll estimate some log wage models. We’ll start the .do file by setting some options:

*** Sample .do file, Chris Auld
*** last modified: October 4, 2011
*** Preliminaries.

#delimit ;
drop _all ;
clear matrix ;
capture log close ;
log using cchs-example, replace ;

*** Read cleaned data, show summary statistics ;
use http://fmwww.bc.edu/ec-p/data/wooldridge/MROZ ;

It's a good idea to use a character to mark the end of command lines. Otherwise you can't wrap long commands over multiple lines, which makes your code hard to read. *#delimit ;* tells Stata to interpret the semicolon as indicating the end of a command. (The downside is that 50% of your syntax errors will henceforth be the result of forgetting a semicolon.) Then we nuke any data or matrices Stata may have in memory when the program starts with the *drop _all* and *clear matrix* commands. If you like, also set memory, scheme, and other options here.

We want to keep a record of the output. *capture log close* tells Stata to stop logging if it is logging, and not to halt on an error if it’s not logging. The next line tells Stata to write all output to a file. And then we read the already-cleaned data.
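To see why the delimiter matters, here is a sketch of one command wrapped over several lines (the extra regressors shown, such as *exper* and *kidslt6*, are my assumed names for other Mroz variables):

```stata
* with "#delimit ;" in effect, one command can span several lines ;
regress lwage city age educ hushrs husage
    exper expersq kidslt6 kidsge6,
    vce(robust) ;
```

Without the semicolon delimiter, each line break would be read as the end of a command and this would fail.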

Now let’s define the sets of variables we’re going to use in the analysis. Assume you want to vary the set of covariates and see how your estimates change, and you want to see whether the estimator you choose has a substantive effect on your results. You want to produce a nice looking table of estimates from various specifications. For this example, suppose you want to compare results when you do and do not control for husband’s characteristics, and you want to compare results from OLS and median regressions (implemented with the command *qreg*) of the same specifications.

*** Define sets of variables ;

local demographic "city age educ" ;
local husbandchars "hushrs husage" ;
local allcovars "`demographic' `husbandchars'" ;
local estimators "regress qreg" ;

We've stuffed strings into four new local macros. *demographic* and *husbandchars* contain sets of right-hand-side covariates we wish to include or exclude in various models, and those sets are collected in another macro, *allcovars*. If you add a new variable to one of these sets, it will automatically be included in every subsequent estimation or other command which references that set, so you never have to go through your code clumsily adding or removing variables. Further, your code will be much more readable: imagine a real project with dozens of variables instead of a handful. Finally, the local *estimators* lists the estimation commands we want to try. Want to try all your specifications with some other estimator too? Just add it to this list.
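For instance, suppose you later decide to also control for experience (a sketch; *exper* is my assumed name for the Mroz experience variable):

```stata
* add exper to the demographic set once... ;
local demographic "city age educ exper" ;

* ...and every later reference to the set picks it up automatically ;
summ `demographic' ;
```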

Now is a good time to generate your descriptive statistics (and remember: *all good papers display descriptive statistics*) and graphs. We'll just ask for plain summary statistics for this example. If you are exporting your output to LaTeX, you can use the command *sutex* instead of *summarize* and you'll get the output in the form of .tex code.

*** Summary statistics ;

summ wage `allcovars' ;
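If you want the LaTeX version instead, the *sutex* call might look something like this (a sketch; the option names are from memory of the SSC *sutex* package, so check *help sutex* after installing it):

```stata
* write summary statistics as a LaTeX table ;
sutex wage `allcovars', labels minmax digits(2)
    file(descstats.tex) replace ;
```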

And finally we'll estimate some models. To make the log file more readable we'll ask Stata to suppress output while running models by wrapping everything in a *quietly* block. We'll loop over each of the estimators we specified above, and save all of the results.

*** Estimate regression models. ;

quietly { ;
foreach estimator of local estimators { ;
    `estimator' lwage `demographic' ;
    estimates store `estimator'm1 ;
    `estimator' lwage `demographic' `husbandchars' ;
    estimates store `estimator'm2 ;
} ;
} ;

Given what we placed in the macros, after this loop executes we will have four sets of estimates in memory: regressm1, regressm2, qregm1, and qregm2. If we want one table to display all these results, we can use:
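Concretely, given the macros defined above, the loop expands to the equivalent of running:

```stata
regress lwage city age educ ;
estimates store regressm1 ;
regress lwage city age educ hushrs husage ;
estimates store regressm2 ;
qreg lwage city age educ ;
estimates store qregm1 ;
qreg lwage city age educ hushrs husage ;
estimates store qregm2 ;
```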

esttab * , b(%8.3f) t(%7.2f) stats(N r2)
    ti("Log wage OLS and median regression estimates")
    booktabs ;

This command tells Stata to make a table of all (*) the estimates it has saved. Since we’re economists, we want coefficients (to three decimal places) and t-ratios (to two decimal places) rather than standard errors. We tell Stata to report the number of observations used and the R2 from the model if it’s available. Finally, the option *booktabs* tells Stata to write the results as .tex code, although there are other options which make it easy to export the output to Word or as .html or several other formats.
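For example, to send the same table to Word instead, *esttab* can write an RTF file (a sketch; *results.rtf* is a placeholder filename, and esttab infers the output format from the file extension):

```stata
* same table, exported as RTF for Word ;
esttab * using results.rtf, b(%8.3f) t(%7.2f)
    stats(N r2) replace ;
```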

Running this code produces a table which looks like this (with ten seconds of editing in .tex to add the “OLS” and “median” labels):

(Click here to see a high resolution version.)