This tutorial is an introduction to Stata emphasizing data management
and graphics. A complementary discussion of statistical models may
be found in the Stata Logs section of my GLM course at
http://data.princeton.edu/wws509/stata.
The window labeled Command is where you type your commands.
Stata then shows the results in the larger window immediately above,
called appropriately enough Results.
Your command is added to a list in the window labeled Review on the left,
so you can keep track of the commands you have used.
The window labeled Variables, on the top right, lists the variables in your dataset.
The Properties window immediately below that, introduced in version 12,
displays properties of your variables and dataset.
The plot shows a curvilinear relationship between GNP per capita and life
expectancy.
We will see if the relationship can be linearized by taking the log of GNP per capita.
In this command each expression in parentheses is a separate two-way plot to be overlaid in the same graph.
The fit looks reasonably good, except for a possible outlier.
There is an independent listserv maintained by Marcello Pagano at the Harvard School of Public Health, where you can post questions and receive prompt and knowledgeable answers from other users. (Quite often from the indefatigable and extremely knowledgeable Nicholas Cox, who deserves special recognition for his service to the user community.) For detailed instructions on how to join the list see http://www.stata.com/statalist/ and follow the link to subscribe. The postings are archived by Stata, Harvard University and Yahoo. Stata also maintains a list of frequently asked questions (FAQ) classified by topic, see http://www.stata.com/support/faqs/.
UCLA maintains an excellent Stata portal at http://www.ats.ucla.edu/stat/stata/, with many useful links, including a list of resources to help you learn and stay up-to-date with Stata. Don't miss their starter kit, which includes "class notes with movies", a set of instructional materials that combine class notes with movies you can view on the web, and their links by topic, which provides how-to guidance for common tasks. There are also more advanced learning modules, some with movies as well, and comparisons of Stata with other packages such as SAS and SPSS.
Some of the books on Stata that I particularly like are Sophia Rabe-Hesketh and Brian Everitt's A Handbook of Statistical Analyses using Stata (4th Edition), Lawrence Hamilton's Statistics with Stata (Updated for version 10), and Scott Long and Jeremy Freese's Regression Models for Categorical Dependent Variables Using Stata (2nd Edition), updated following the release of version 8. All three books include useful tutorials introducing Stata. Section 2.10 of the book by Long and Freese is a set of recommended practices that should be read and followed faithfully by every aspiring Stata programmer. Another book I like is Michael Mitchell's excellent A Visual Guide to Stata Graphics, which was written specially to introduce the new graphs in version 8 and is now in its 3rd edition. Two useful (but more specialized) references written by the developers of Stata are An Introduction to Survival Analysis Using Stata (3rd Edition), by Mario Cleves, William Gould, Roberto Gutierrez and Julia Marchenko, and Maximum Likelihood Estimation with Stata (4th Edition) by William Gould, Jeffrey Pitblado, and Brian Poi.
1. Introduction
Stata is a powerful statistical package with smart data-management facilities, a wide array of up-to-date statistical techniques, and an excellent system for producing publication-quality graphs. Stata is fast and easy to use. In this tutorial we start with a quick introduction and overview and then discuss data management, statistics, graphs, and programming. The tutorial has been updated for version 13, but most of the discussion applies to versions 8 and later.
1.1 A Quick Tour of Stata
Stata is available for Windows, Unix, and Mac computers. This tutorial focuses on the Windows version, but most of the content applies to the other platforms as well. The standard version is called Stata/IC (or Intercooled Stata) and can handle up to 2,047 variables. There is a special edition called Stata/SE that can handle up to 32,766 variables (and also allows longer string variables and larger matrices), and a version for multicore/multiprocessor computers called Stata/MP, which has the same limits but is substantially faster. The number of observations is limited by your computer's memory, as long as it doesn't exceed about two billion. These three versions are available both for 32-bit and 64-bit computers; the latter can handle more memory (and hence more observations) and tend to be faster. There's also a small version of Stata that is limited to about 1,000 observations on 99 variables. All of these versions can read each other's files within their size limits.
Local Note: At OPR you can access Stata/SE on Windows by running the network version on your own workstation; just create a shortcut to \\opr\shares\applications\stata13-se\stataSE.exe. (If you have a 64-bit workstation change the program name to stataSE-64.exe.) For computationally intensive jobs you may want to log in to our Windows server Coale via remote desktop and run Stata/SE there. If you prefer Unix systems, log on to our Unix server Lotka via X-Windows and leave your job running there.
1.1.1 The Stata Interface
When Stata starts up you see five docked windows: Command, Results, Review, Variables, and Properties.
You can resize or even close some of these windows. Stata remembers its settings the next time it runs. You can also save (and then load) named preference sets using the menu Edit|Preferences. I happen to like the Compact Window Layout. You can also choose the font used in each window; just right click and select font from the context menu; my own favorite, Lucida Console, is now the default in Windows. Finally, it is possible to change the color scheme, selecting from seven preset or three customizable styles. One of the preset schemes is classic, the traditional black background used in earlier versions of Stata.
There are other windows that we will discuss as needed, namely the Graph, Viewer, Variables Manager, Data Editor, and Do file Editor.
Starting with version 8 Stata's graphical user interface (GUI) allows selecting commands and options from a menu and dialog system. However, I strongly recommend using the command language as a way to ensure reproducibility of your results. In fact, I recommend that you type your commands on a separate file, called a do file, as explained in Section 1.2 below, but for now we will just type in the command window. The GUI can be helpful when you are starting to learn Stata, particularly because after you point and click on the menus and dialogs, Stata types the corresponding command for you.
1.1.2 Typing Commands
Stata can work as a calculator using the display command. Try typing the following (excluding the dot at the start of a line, which is how Stata marks the lines you type):
    . display 2+2
    4
    . display 2 * ttail(20,2.1)
    .04861759
Stata commands are case-sensitive: display is not the same as Display, and the latter will not work. Commands can also be abbreviated; the documentation and online help underline the shortest legal abbreviation of each command and we will do the same here.
The second command shows the use of a built-in function to compute a p-value, in this case twice the probability that a Student's t with 20 d.f. exceeds 2.1. This result would just make the 5% cutoff. To find the two-tailed 5% critical value try display invttail(20, 0.025). We list a few other functions you can use in Section 2.
If you issue a command and discover that it doesn't work, press the Page Up key to recall it (you can cycle through your command history using the Page Up and Page Down keys) and then edit it using the arrow, insert and delete keys, which work exactly as you would expect. For example, the arrow keys advance a character at a time and Ctrl-Arrows advance a word at a time. Shift-Arrows select a character at a time and Shift-Ctrl-Arrows select a word at a time, which you can then delete or replace. A command can be as long as needed (up to some 64k characters); in an interactive session you just keep on typing and the command window will wrap and scroll as needed.
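As a small sketch of the calculator idea, the lines below repeat the two calculations using di, a standard abbreviation of display that is not shown in the text above, and then check that ttail() and invttail() undo each other. The 20 d.f. simply follow the example in the text; the comment lines start with * so they can also be typed in the Command window.

```stata
* di is an abbreviation of display
* two-sided p-value for t = 2.1 with 20 d.f.
di 2 * ttail(20, 2.1)
* two-tailed 5% critical value with 20 d.f.
di invttail(20, 0.025)
* feeding the critical value back into ttail() should recover .05
di 2 * ttail(20, invttail(20, 0.025))
```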
1.1.3 Getting Help
Stata has excellent online help. To obtain help on a command (or function) type help command_name, which displays the help on a separate window called the Viewer. (You can also type chelp command_name, which shows the help on the Results window; but this is not recommended.) Or just select Help|Command on the menu system. Try help ttail. (Unfortunately, versions 9 and later open a new viewer each time you type help, and before you know it you have dozens of windows cluttering your desktop. To avoid this problem type the help command on the viewer itself, or type , nonew at the end of the help command to instruct it not to open a new window.)
If you don't know the name of the command you need you can search for it. Stata has a search command with a few options; type help search to learn more. Version 13 searches the web by default. If you are using an earlier version of Stata learn about the findit command. Try search Student's t. This will list all Stata commands and functions related to the t distribution. One of the entries is "density functions", which takes you to a table with a list of probability distribution and density functions, including "Student's t and noncentral Student's t distributions", which takes you to a list of functions, which includes ttail(). Along the way you see that Stata can also compute tail probabilities for the normal, chi-squared and F distributions, among others.
One of the nicest features of Stata is that, starting with version 11, all the documentation is available in PDF files. (In fact it looks as if in version 13 you can no longer get printed manuals.) Moreover, these files are linked from the online help, so you can jump directly to the relevant section of the manual. To learn more about the help system type help help.
1.1.4 Loading a Sample Data File
Stata comes with a few sample data files. You will learn how to read your own data into Stata in Section 2, but for now we will load one of the sample files, namely lifeexp.dta, which has data on life expectancy and gross national product (GNP) per capita in 1998 for 68 countries. To see a list of the files shipped with Stata type sysuse dir. To load the file we want type sysuse lifeexp (the file extension is optional). To see what's in the file type describe. (This command can be abbreviated to a single letter but I prefer desc.)
    . sysuse lifeexp
    (Life expectancy, 1998)
    . desc
    Contains data from C:\Program Files (x86)\Stata13\ado\base/l/lifeexp.dta
      obs:            68                          Life expectancy, 1998
     vars:             6                          26 Mar 2011 09:40
     size:         2,652                          (_dta has notes)
    ---------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    region          byte    %12.0g     region     Region
    country         str28   %28s                  Country
    popgrowth       float   %9.0g               * Avg. annual % growth
    lexp            byte    %9.0g               * Life expectancy at birth
    gnppc           float   %9.0g               * GNP per capita
    safewater       byte    %9.0g               *
                                                * indicated variables have notes
    ---------------------------------------------------------------------------
    Sorted by:
We see that we have six variables. The dataset has notes that you can see by typing notes. Four of the variables have annotations that you can see by typing notes varname. You'll learn how to add notes in Section 2.
1.1.5 Descriptive Statistics
Let us run simple descriptive statistics for the two variables we are interested in, using the summarize command followed by the names of the variables (which can be omitted to summarize everything):
    . summarize lexp gnppc
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            lexp |        68    72.27941    4.715315         54         79
           gnppc |        63    8674.857    10634.68        370      39980
We see that life expectancy averages 72.3 years and GNP per capita ranges from $370 to $39,980 with an average of $8,675. We also see that Stata reports only 63 observations on GNP per capita, so we must have some missing values. Let us list the countries for which we are missing GNP per capita:
    . list country gnppc if missing(gnppc)
         +--------------------------------------+
         |                      country   gnppc |
         |--------------------------------------|
      7. |       Bosnia and Herzegovina       . |
     40. |                 Turkmenistan       . |
     44. | Yugoslavia, FR (Serb./Mont.)       . |
     46. |                         Cuba       . |
     56. |                  Puerto Rico       . |
         +--------------------------------------+
We see that we have indeed five missing values. This example illustrates a powerful feature of Stata: the action of any command can be restricted to a subset of the data. If we had typed list country gnppc we would have listed these variables for all 68 countries. Adding the condition if missing(gnppc) restricts the list to cases where gnppc is missing.
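The same if qualifier works with other commands. The lines below are a minimal sketch, assuming the lifeexp data are still in memory; the cutoff of 60 years is an arbitrary value chosen for illustration.

```stata
* count the observations with missing GNP per capita
count if missing(gnppc)
* summarize life expectancy only where GNP per capita is available
summarize lexp if !missing(gnppc)
* list the countries with life expectancy below 60 years
list country lexp if lexp < 60
```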
Note that Stata lists missing values using a dot. We'll learn more about missing values in Section 2.
1.1.6 Drawing a Scatterplot
To see how life expectancy varies with GNP per capita we will draw a scatter plot using the graph command, which has a myriad of subcommands and options, some of which we describe in Section 3.
    . graph twoway scatter lexp gnppc
1.1.7 Computing New Variables
We compute a new variable using the generate command with a new variable name and an arithmetic expression. Choosing good variable names is important. When computing logs I usually just prefix the old variable name with 'log' or 'l', but compound names can easily become cryptic and hard to read. Some programmers separate words using an underscore, as in log_gnp_pc, and others prefer the camel-casing convention which capitalizes each word after the first: logGnpPc. I suggest you develop a consistent style and stick to it. Variable labels can also help, as described in Section 2.
To compute natural logs we use the built-in function log:
    . gen loggnppc = log(gnppc)
    (5 missing values generated)
Stata says it has generated five missing values. These correspond to the five countries for which we were missing GNP per capita. Try to confirm this statement using the list command. We will learn more about generating new variables in Section 2.
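One way to carry out the suggested check, and to document the new variable at the same time, might look like this. The label text is my own wording, not part of the original tutorial.

```stata
* attach a descriptive label to the new variable (label text is illustrative)
label variable loggnppc "Log of GNP per capita"
* the countries with a missing log should be the ones with missing GNP
list country gnppc loggnppc if missing(loggnppc)
```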
1.1.8 Simple Linear Regression
We are now ready to run a linear regression of life expectancy on log GNP per capita. We will use the regress command, which lists the outcome followed by the predictors (here just one, loggnppc).
    . regress lexp loggnppc
          Source |       SS       df       MS              Number of obs =      63
    -------------+------------------------------           F(  1,    61) =   97.09
           Model |  873.264865     1  873.264865           Prob > F      =  0.0000
        Residual |  548.671643    61  8.99461709           R-squared     =  0.6141
    -------------+------------------------------           Adj R-squared =  0.6078
           Total |  1421.93651    62  22.9344598           Root MSE      =  2.9991
    ------------------------------------------------------------------------------
            lexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        loggnppc |   2.768349   .2809566     9.85   0.000     2.206542    3.330157
           _cons |   49.41502   2.348494    21.04   0.000     44.71892    54.11113
    ------------------------------------------------------------------------------
Note that the regression is based on only 63 observations. Stata omits observations that are missing the outcome or one of the predictors. The log of GNP per capita "explains" 61% of the variation in life expectancy in these countries. We also see that a one percent increase in GNP per capita is associated with an increase of 0.0277 years in life expectancy. (To see this point note that if GNP increases by one percent its log increases by 0.01.)
Following a regression (or in fact any estimation command) you can retype the command with no arguments to see the results again. Try typing reg.
1.1.9 Post-Estimation Commands
Stata has a number of post-estimation commands that build on the results of a model fit. A useful command is predict, which can be used to generate fitted values or residuals following a regression. The command
    . predict plexp
    (option xb assumed; fitted values)
    (5 missing values generated)
generates a new variable, plexp, that has the life expectancy predicted from our regression equation. No predictions are made for the five countries without GNP per capita.
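Residuals can be obtained the same way. This is a minimal sketch that assumes the regression above is still the active estimation; the variable name rlexp is a hypothetical choice of mine.

```stata
* residuals (observed minus fitted) from the last regression
predict rlexp, residuals
* the summary should be based on the 63 countries used in the regression
summarize rlexp
```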
(If life expectancy were missing for a country it would be excluded from the regression, but a prediction would be made for it. This technique can be used to fill in missing values.)
1.1.10 Plotting the Data and a Linear Fit
A common task is to superimpose a regression line on a scatter plot to inspect the quality of the fit. We could do this using the predictions we stored in plexp, but Stata's graph command knows how to do linear fits on the fly using the lfit plot type, and can superimpose different types of twoway plots, as explained in more detail in Section 3. Try the command
    . graph twoway (scatter lexp loggnppc) (lfit lexp loggnppc)
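For comparison, here is a sketch of the alternative mentioned above, drawing the line from the stored predictions instead of using lfit. It assumes plexp was created by the predict command in the previous subsection; the sort option asks Stata to connect the points in order of loggnppc.

```stata
* same picture built from the stored predictions instead of lfit
graph twoway (scatter lexp loggnppc) (line plexp loggnppc, sort)
```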
1.1.11 Listing Selected Observations
It's hard not to notice the country on the bottom left of the graph, which has much lower life expectancy than one would expect, even given its low GNP per capita. To find which country it is we list the (names of the) countries where life expectancy is less than 55:
    . list country lexp plexp if lexp < 55, clean
           country   lexp      plexp
     50.     Haiti     54   66.06985
We find that the outlier is Haiti, with a life expectancy 12 years less than one would expect given its GNP per capita. (The keyword clean after the comma is an option which omits the borders on the listing. Many Stata commands have options, and these are always specified after a comma.)
If you are curious where the United States is try
    . list gnppc loggnppc lexp plexp if country == "United States", clean
           gnppc   loggnppc   lexp      plexp
     58.   29240   10.28329     77   77.88277
Here we restricted the listing to cases where the value of the variable country was "United States". Note the use of a double equal sign in a logical expression. In Stata x = 2 assigns the value 2 to the variable x, whereas x == 2 checks to see if the value of x is 2.
1.1.12 Saving your Work and Exiting Stata
To exit Stata you use the exit command (or select File|Exit in the menu, or press Alt-F4, as in most Windows programs). If you have been following along with this tutorial by typing the commands and try to exit, Stata will refuse, saying "no; data in memory would be lost". This happens because we have added a new variable that is not part of the original dataset, and it hasn't been saved. As you can see, Stata is very careful to ensure we don't lose our work.
If you don't care about saving anything you can type exit, clear, which tells Stata to quit no matter what. Alternatively, you can save the data to disk using the save filename command, and then exit. A cautious programmer will always save a modified file using a new name.
1.2 Using Stata Effectively
While it is fun to type commands interactively and see the results straightaway, serious work requires that you save your results and keep track of the commands that you have used, so that you can document your work and reproduce it later if needed. Here are some practical recommendations.
1.2.1 Create a Project Directory
Stata reads and saves data from the working directory, usually C:\DATA, unless you specify otherwise. You can change directory using the command cd [drive:]directory_name, and print the (name of the) working directory using pwd; type help cd for details. I recommend that you create a separate directory for each course or research project you are involved in, and start your Stata session by changing to that directory.
Stata understands nested directory structures and doesn't care if you use \ or / to separate directories. Versions 9 and later also understand the double slash used in Windows to refer to a computer, so you can cd \\opr\shares\research\myProject to access a shared project folder. An alternative approach, which also works in earlier versions, is to use Windows Explorer to assign a drive letter to the project folder, for example assign P: to \\opr\shares\research\myProject and then in Stata use cd p:. Alternatively, you may assign R: to \\opr\shares\research and then use cd R:\myProject, a more convenient solution if you work in several projects.
Stata has other commands for interacting with the operating system, including mkdir to create a directory, dir to list the names of the files in a directory, type to list their contents, copy to copy files, and erase to delete a file. You can (and probably should) do these tasks using the operating system directly, but the Stata commands may come in handy if you want to write a program to perform repetitive tasks.
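A session might start along these lines. This is only a sketch: the folder name myProject is borrowed from the example above and is created under whatever the current working directory happens to be, and the capture prefix (which makes Stata ignore an error from the command that follows) is used so the sketch can be rerun.

```stata
* show the current working directory
pwd
* create a project folder; capture ignores the error if it already exists
capture mkdir myProject
* make it the working directory
cd myProject
* list the files in the folder
dir
```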
1.2.2 Open a Log File
So far all our output has gone to the Results window, where it can be viewed but eventually disappears. (You can control how far you can scroll back; type help scrollbufsize to learn more.) To keep a permanent record of your results, however, you should log your session. When you open a log, Stata writes all results to both the Results window and to the file you specify. To open a log file use the command
    log using filename, text replace
where filename is the name of your log file. Note the use of two recommended options: text and replace.
By default the log is written using SMCL, Stata Markup and Control Language (pronounced "smicle"), which provides some formatting facilities but can only be viewed using Stata's Viewer. Fortunately, there is a text option to create logs in plain text (ASCII) format, which can be viewed in an editor such as Notepad or a word processor such as Word. (An alternative is to create your log in SMCL and then use the translate command to convert it to plain text, postscript, or even PDF if you are a Mac user; type help translate to learn more about this option.)
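The SMCL route mentioned in the parenthesis could be sketched as follows. The log name session is a hypothetical choice, and the first line closes any log that happens to be open, ignoring the error if there is none.

```stata
* close any open log; capture ignores the error if no log is open
capture log close
* open a log in the default SMCL format
log using session, replace
display 2+2
log close
* convert the SMCL log to plain text
translate session.smcl session.log, replace
```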
The replace option specifies that the file is to be overwritten if it already exists. This will often be the case if (like me) you need to run your commands several times to get them right. In fact, if an earlier run has failed it is likely that you have a log file open, in which case the log command will fail. The solution is to close any open logs using the log close command. The problem with this solution is that it will not work if there is no log open! The way out of the catch-22 is to use
    capture log close
The capture keyword tells Stata to run the command that follows and ignore any errors. Use judiciously!
1.2.3 Always Use a Do File
A do file is just a set of Stata commands typed in a plain text file. You can use Stata's own built-in do-file Editor, which has the great advantage that you can run your program directly from the editor by clicking on the run icon or selecting Tools|Run from the menu. You can also select just a few commands and run them by selecting Tools|Run Selection in the menu. To access Stata's do editor use Ctrl-9 in versions 12 and 13 (Ctrl-8 in earlier versions) or select Window|Do-file Editor|New Do-file Editor in the menu system.
Alternatively, you can use an editor such as Notepad. Save the file using extension .do and then execute it using the command do filename. For a thorough discussion of alternative text editors see http://fmwww.bc.edu/repec/bocode/t/textEditors.html, a page maintained by Nicholas J. Cox, of the University of Durham.
You could even use a word processor such as Word, but you would have to remember to save the file in plain text format, not in Word document format. Also, you may find Word's insistence on capitalizing the first word on each line annoying when you are trying to type Stata commands that must be in lowercase. You can, of course, turn auto-correct off. But it's a lot easier to just use a plain-text editor.
1.2.4 Use Comments and Annotations
Code that looks obvious to you may not be so obvious to a co-worker, or even to you a few months later. It is always a good idea to annotate your do files with explanatory comments that provide the gist of what you are trying to do.
In the Stata command window you can start a line with a * to indicate that it is a comment, not a command. This can be useful to annotate your output.
In a do file you can also use two other types of comments: // and /* */
// is used to indicate that everything that follows to the end of the line is a comment and should be ignored by Stata. For example you could write
    gen one = 1 // this will serve as a constant in the model
/* */ is used to indicate that all the text between the opening /* and the closing */, which may be a few characters or may span several lines, is a comment to be ignored by Stata. This type of comment can be used anywhere, even in the middle of a line, and is sometimes used to "comment out" code.
There is a third type of comment used to break very long lines, as explained in the next subsection. Type help comments to learn more about comments.
It is always a good idea to start every do file with comments that include at least a title, the name of the programmer who wrote the file, and the date. Assumptions about required files should also be noted.
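A header following these recommendations, together with the comment styles described above, might look like the sketch below. It is meant to be saved in a do file rather than typed in the Command window; the title, author, and date entries are placeholders to be filled in.

```stata
* Title:    Life expectancy and GNP per capita (illustrative)
* Author:   your name here
* Date:     date of last revision
* Requires: lifeexp.dta, a sample file shipped with Stata

sysuse lifeexp, clear   // everything after the two slashes is ignored

/* the command below is "commented out" and will not run
summarize lexp gnppc
*/

gen one = 1 /* a comment in the middle of a line */ if !missing(gnppc)
```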
1.2.5 Continuation Lines
When you are typing in the command window a command can be as long as needed. In a do-file you will probably want to break long commands into lines to improve readability.
To indicate to Stata that a command continues on the next line you use ///, which says everything else to the end of the line is a comment and the command itself continues on the next line. For example you could write
    graph twoway (scatter lexp loggnppc) ///
                 (lfit lexp loggnppc)
Old hands might write
    graph twoway (scatter lexp loggnppc) /*
              */ (lfit lexp loggnppc)
which "comments out" the end of the line.
An alternative is to tell Stata to use a semi-colon instead of the carriage return at the end of the line to mark the end of a command, using #delimit ;, as in this example:
    #delimit ;
    graph twoway (scatter lexp loggnppc)
                 (lfit lexp loggnppc) ;
Now all commands need to terminate with a semi-colon. To return to using carriage return as the delimiter use
    #delimit cr
The delimiter can only be changed in do files. But then you always use do files, right?
1.2.6 A Sample Do File
Here's a simple do file that can reproduce all the results in our Quick Tour, and illustrates the syntax highlighting introduced in Stata's do file editor in version 11. The file doesn't have many comments because this page has all the details. Following the listing we comment on a couple of lines that require explanation.

    // A Quick Tour of Stata
    // German Rodriguez - Fall 2013
    version 13
    clear
    capture log close
    log using QuickTour, text replace

    display 2+2
    display 2 * ttail(20,2.1)

    // load sample data and inspect
    sysuse lifeexp
    desc
    summarize lexp gnppc
    list country gnppc if missing(gnppc)

    graph twoway scatter lexp gnppc, ///
        title(Life Expectancy and GNP) xtitle(GNP per capita)
    graph export scatter.png, width(400) replace // save the graph in PNG format

    gen loggnppc = log(gnppc)
    regress lexp loggnppc
    predict plexp

    graph twoway (scatter lexp loggnppc) (lfit lexp loggnppc) ///
        , title(Life Expectancy and GNP) xtitle(log GNP per capita)
    graph export fit.png, width(400) replace

    list country lexp plexp if lexp < 55, clean
    list gnppc loggnppc lexp plexp if country == "United States", clean

    log close // make sure you hit enter for the last line

We start the do file by specifying the version of Stata we are using, in this case 13. This helps ensure that future versions of Stata will continue to interpret the commands correctly, even if Stata has changed; see help version for details. (The previous version of this file read version 12, and I could have left that in place to run under version control; the results would be the same because none of the commands used in this quick tour has changed.)
The clear statement deletes the data currently held in memory and any value labels you might have. The reason we need that is that if we had to rerun the program the sysuse command would fail because we already have a dataset in memory and it has not been saved. An alternative with the same effect is to type sysuse lifeexp, clear. (Stata keeps other objects in memory as well, including saved results, scalars and matrices, although we haven't had occasion to use these yet. Typing clear all removes these objects from memory, ensuring that you start with a completely clean slate. See help clear for more information. Usually, however, all you need to do is clear the data.)
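The difference between clearing the data and clearing everything can be seen in a small experiment. This is only a sketch: the scalar name answer is made up for the illustration, and the fragment assumes the lifeexp sample data.

```stata
sysuse lifeexp, clear
scalar answer = 2 + 2            // hypothetical scalar, not part of the Quick Tour

clear                            // drops the data and value labels only
display _N                       // 0: no observations left in memory
display scalar(answer)           // 4: the scalar survived

clear all                        // also drops scalars, matrices, saved results
capture confirm scalar answer    // capture traps the error raised by confirm
display _rc                      // nonzero: the scalar is gone
```

For a do file that only loads and analyzes one dataset, clear (or the clear option of sysuse) is usually enough, as noted above.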
Note also that we use a graph export command to convert the graph in memory to Portable Network Graphics (PNG) format, ready for inclusion in a web page. To include a graph in a Word document you are better off cutting and pasting a graph in Windows Metafile format, as explained in Section 3.

The note on the last line is to remind you that by default Stata uses the (invisible) carriage return at the end of the line as the command delimiter. If you haven't pressed return after the last line, the entire line will usually be ignored by Stata.
1.2.7 Stata Command Syntax
Having used a few Stata commands it may be time to comment briefly on their structure, which usually follows the syntax below, where bold indicates keywords and square brackets indicate optional elements:
[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [using filename] [,options]
We now describe each syntax element:
- command: The only required element is the command itself, which is usually (but not always) an action verb, and is often followed by the names of one or more variables. Stata commands are case-sensitive. The commands describe and Describe are different, and only the former will work. Commands can usually be abbreviated as noted earlier. When we introduce a command we underline the letters that are required. For example, regress with the letters reg underlined indicates that the regress command can be abbreviated to reg.
- varlist: The command is often followed by the names of one or more variables, for example describe lexp or regress lexp loggnppc. Variable names are case sensitive: lexp and LEXP are different variables. A variable name can be abbreviated to the minimum number of letters that makes it unique in a dataset. For example, in our quick tour we could refer to loggnppc as log because it is the only variable that begins with those three letters, but this is a really bad idea. Abbreviations that are unique may become ambiguous as you create new variables, so you have to be very careful. You can also use wildcards such as v* or name ranges such as v101-v105 to refer to several variables. Type help varlist to learn more about variable lists.
- =exp: Commands used to generate new variables, such as generate log_gnp = log(gnp), include an arithmetic expression, basically a formula using the standard operators (+ - * and / for the four basic operations and ^ for exponentiation, so 3^2 is three squared), functions, and parentheses. We discuss expressions in Section 2.
- if exp and in range: As we have seen, a command's action can be restricted to a subset of the data by specifying a logical condition that evaluates to true or false, such as lexp < 55. Relational operators are <, <=, ==, >= and >, and logical negation is expressed using ! or ~, as we will see in Section 2. Alternatively, you can specify a range of the data; for example, in 1/10 will restrict the command's action to the first 10 observations. Type help numlist to learn more about lists of numbers.
- weight: Some commands allow the use of weights; type help weights to learn more.
- using filename: The keyword using introduces a file name; this can be a file on your computer, on the network, or on the internet, as you will see when we discuss data input in Section 2.
- options: Most commands have options that are specified following a comma. To obtain a list of the options available with a command type help command, where command is the actual command name.
- by varlist: A very powerful feature, it instructs Stata to repeat the command for each group of observations defined by distinct values of the variables in the list. For this to work the command must be "byable" (as noted in the online help) and the data must be sorted by the grouping variable(s) (or use bysort instead).
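To make the syntax elements concrete, here is a short sketch that uses most of them on the lifeexp sample data from the Quick Tour. The particular commands and thresholds are chosen only for illustration, and weights and using are left out to keep the fragment small.

```stata
sysuse lifeexp, clear              // options follow a comma
generate loggnppc = log(gnppc)     // =exp: a new variable from an expression
describe lexp gnppc                // command followed by a varlist
summarize lexp if lexp < 55        // if exp: only observations meeting the condition
list country lexp in 1/10          // in range: the first 10 observations
bysort region: summarize lexp      // by varlist: repeat the command for each region
```

Using bysort rather than by saves a separate sort step, since the data must be ordered by region before the by prefix can be applied.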
1.3 Stata Resources
There are many resources available to learn more about Stata, both online and in print.
1.3.1 Online Resources
Stata has an excellent website at http://www.stata.com. Among other things you will find that they make available online all datasets used in the official documentation, that they publish a journal called Stata Journal, and that they have an excellent bookstore with texts on Stata and related statistical subjects. Stata also offers email and web-based training courses called NetCourses, see http://www.stata.com/netcourse/.
There is an independent listserv maintained by Marcello Pagano at the Harvard School of Public Health, where you can post questions and receive prompt and knowledgeable answers from other users. (Quite often from the indefatigable and extremely knowledgeable Nicholas Cox, who deserves special recognition for his service to the user community.) For detailed instructions on how to join the list see http://www.stata.com/statalist/ and follow the link to subscribe. The postings are archived by Stata, Harvard University and Yahoo. Stata also maintains a list of frequently asked questions (FAQ) classified by topic, see http://www.stata.com/support/faqs/.
UCLA maintains an excellent Stata portal at http://www.ats.ucla.edu/stat/stata/, with many useful links, including a list of resources to help you learn and stay up-to-date with Stata. Don't miss their starter kit, which includes "class notes with movies", a set of instructional materials that combine class notes with movies you can view on the web, and their links by topic, which provides how-to guidance for common tasks. There are also more advanced learning modules, some with movies as well, and comparisons of Stata with other packages such as SAS and SPSS.
1.3.2 Manuals and Books
The Stata documentation has been growing with each version and now consists of 20 volumes with more than 11,000 pages, all available in PDF format with your copy of Stata. The basic documentation consists of a Base Reference Manual, separate volumes on Data Management and Graphics, a User's Guide, a Glossary and Index, and Getting Started with Stata, which has platform-specific versions for Windows, Macintosh and Unix. Some statistical subjects that may be important to you are described in ten separate manuals, dealing with Longitudinal/Panel Data, Multilevel Mixed-Effects, Multiple Imputation, Multivariate Statistics, Power and Sample Size, Structural Equation Modeling, Survey Data, Survival Analysis and Epidemiological Tables, Time Series, and Treatment Effects. Additional volumes of interest to programmers, particularly those seeking to extend Stata's capabilities, are manuals on Programming and on Mata, Stata's matrix programming language.
Some of the books on Stata that I particularly like are Sophia Rabe-Hesketh and Brian Everitt's A Handbook of Statistical Analyses using Stata (4th Edition), Lawrence Hamilton's Statistics with Stata (Updated for version 10), and Scott Long and Jeremy Freese's Regression Models for Categorical Dependent Variables Using Stata (2nd Edition), updated following the release of version 8. All three books include useful tutorials introducing Stata. Section 2.10 of the book by Long and Freese is a set of recommended practices that should be read and followed faithfully by every aspiring Stata programmer. Another book I like is Michael Mitchell's excellent A Visual Guide to Stata Graphics, which was written specially to introduce the new graphs in version 8 and is now in its 3rd edition.
Two useful (but more specialized) references written by the developers of Stata are An Introduction to Survival Analysis Using Stata (3rd Edition), by Mario Cleves, William Gould, Roberto Gutierrez and Julia Marchenko, and Maximum Likelihood Estimation with Stata (4th Edition) by William Gould, Jeffrey Pitblado, and Brian Poi.