Stata Data Management
Topics
- Generating and replacing variables
- Missing values
- Variable types and conversion
- Merging, appending, and joining
- Creating summarized data sets
Setup
Software and Materials
Follow the Stata Installation instructions and ensure that you can successfully start Stata.
Class structure and organization
- Please feel free to ask questions at any point if they are relevant to the current topic (or if you are lost!)
- Collaboration is encouraged - please introduce yourself to your neighbors!
- If you are using a laptop, you will need to adjust file paths accordingly
- Make comments in your do-file - save on flash drive or email to yourself
Prerequisites
- This is an introduction to data management in Stata
- Assumes basic knowledge of Stata
- Not appropriate for people already familiar with Stata
- If you are catching on before the rest of the class, experiment with command features described in help files
Goals
We will learn about the Stata language by analyzing data from the general social survey (gss). In particular, our goals are to learn:
- Basic data manipulation commands
- Dealing with missing values
- Variable types and conversion
- Merging and appending datasets
Opening Files
GOAL: To understand the working directory of Stata, how to change working directory, and open files from the working directory. In particular:
Look at bottom left hand corner of Stata screen, this is the directory Stata is currently reading from. Files are located in the StataDatMan folder in your user directory. Let’s start by telling Stata where to look for these:
// change directory
cd "~/Desktop/dss-workshops/Stata/StataDatMan"
// Use dir to see what is in the directory:
dir
dir dataSetsNow we can read in the gss.dta dataset:
Generating & replacing variables
GOAL: You’ll learn how to generate new variables or recode existing variables. In particular, we will learn to use:
generate(gen) to create new variablesegenerate(egen) to create new variables using more complicated calculations thangenallowsreplaceto replace existing variablesrecodeto change existing categorical variables
Generate & Replace
The generate command creates a new variables. Often, this is used in combination with the replace command and logic statements to create a new variable. Available logical operators include the following:
| Operator | Meaning |
|---|---|
| == | equal to |
| != | not equal to |
| > | greater than |
| >= | greater than or equal to |
| < | less than |
| <= | less than or equal to |
| & | and |
For example:
Recode
The recode command is basically generate and replace combined. You can recode an existing variable OR use recode to create a new variable (via the gen option).
// recode the wrkstat variable
recode wrkstat (1=8) (2=7) (3=6) (4=5) (5=4) (6=3) (7=2) (8=1)
// recode wrkstat into a new variable named wrkstat2
recode wrkstat (1=8), gen(wrkstat2)
// tabulate workstat
tab wrkstatThe table below illustrates common forms of recoding:
| Rule | Example | Meaning |
|---|---|---|
| #=# | 3=1 | 3 recoded to 1 |
| ##=# | 2.=9 | 2 and . recoded to 9 |
| #/#=# | 1/5=4 | 1 through 5 recoded to 4 |
| nonmissing=# | nonmiss=8 | nonmissing recoded to 8 |
| missing=# | miss=9 | missing recoded to 9 |
egen
The egen command (“extensions” to the gen command) reaches beyond simple computations (var1 + var2, log(var1), etc.) to add descriptive stats, standardizations and more. For example, we can use egen to create a new variable that counts the number of “yes” responses on computer, email and internet use:
// count number of yes on use comp email and net
egen compuser = anycount(usecomp usemail usenet), values(1)
tab compuserHere are some additional examples of egen in action:
// assess how much missing data each participant has:
egen countmiss = rowmiss(age-wifeft)
codebook countmiss
// compare values on multiple variables
egen ftdiff = diff(wkftwife wkfthusb)
codebook ftdiffYou will need to refer to the documentation to discover what else egen can do: type help egen in Stata to get a complete list of available functions.
Exercise 0
- Open the
gss.dtadata,generatea new variable that represents the squared value ofage.
generatea new variable equal to “1” ifincomeis greater than “19”.
- Create a new variable that counts the number of missing responses for each respondent. What is the maximum number of missing variables?
Click for Exercise 0 Solution
- Open the
gss.dtadata,generatea new variable that represents the squared value ofage.
generatea new variable equal to “1” ifincomeis greater than “19”.
describe income
label list income
recode income (99=.) (98=.)
gen highincome =0 if income != .
replace highincome=1 if income>19
sum highincome- Create a new variable that counts the number of missing responses for each respondent. What is the maximum number of missing variables?
Missing values
GOAL: Learn how missing values are coded and how to recode them.
Stata’s symbol for a missing value is a period . and this value is coded and treated as positive infinity (i.e., the largest possible value), so it’s easy to make mistakes when making logical and relational comparisons!
Making sure missingness is preserved
To identify highly educated women, we might use the command:
// generate and replace without considering missing values
gen hi_ed=0
replace hi_ed=1 if wifeduc>15
// What happens to our missing values?
tab hi_ed, mi nolaIt looks like around 66% have higher education, but look closer:
// generate hi_ed2, but only set a value if wifeduc is not missing
gen hi_ed2 = 0 if wifeduc != .
// only replace non-missing values
replace hi_ed2=1 if wifeduc >15 & wifeduc !=.
//check to see that missingness is preserved
tab hi_ed2, miThe correct value is 10%. Moral of the story? Be careful with missing values and remember that Stata considers missing values to be positive infinity!
Bulk Conversion to missing values
Often the data collection / generating procedure will have used some other value besides . to represent missing values. The mvdecode command will convert all these values to missing. For example:
The _all command tells Stata to perform this conversion for all variables. Use this command carefully! If you have any variables where “999” is a legitimate value, Stata is going to recode it to missing. As an alternative, you could list var names separately rather than using _all.
Variable types
GOAL: Learn about the two main types of variables that Stata uses: string and numeric.
To be able to perform any mathematical operations, your variables need to be in a numeric format. Stata can store numbers with differing levels of precision, as described in the table below:
| type | Minimum | Maximum | being 0 | bytes |
|---|---|---|---|---|
| byte | -127 | 100 | +/-1 | 1 |
| int | -32,767 | 32,740 | +/-1 | 2 |
| long | -2,147,483,647 | 2,147,483,620 | +/-1 | 4 |
| float | -1.70141173319*1038 | 1.70141173319*1038 | +/-10-38 | 4 |
| double | -8.9884656743*10307 | 8.9884656743*10307 | +/-10-323 | 8 |
Precision for float is 3.795x10-8. Precision for double is 1.414x10-16.
Converting to & from Strings
Stata provides several ways to convert to and from strings. You can use tostring and destring to convert from one type to the other:
// convert degree to a string
tostring degree, gen(degree_s)
// and back to a number
destring degree_s, gen(degree_n)Use decode and encode to convert to / from variable labels:
Converting strings to date / time
Often date / time variables start out as strings — you’ll need to convert them to numbers using one of the conversion functions listed below:
| Format | Meaning | String-to-numeric conversion function |
|---|---|---|
| %tc | milliseconds | clock(string, mask) |
| %td | days | date(string, mask) |
| %tw | weeks | weekly(string, mask) |
| %tm | months | monthly(string, mask) |
| %tq | quarters | quarterly(string, mask) |
| %ty | years | yearly(string, mask) |
Date / time variables are stored as the number of units elapsed since 01 January 1960 00:00:00.000. For example, the date function returns the number of days since that time, and the clock function returns the number of milliseconds since that time.
Formatting numbers as dates
Once you have converted a string to a number you can format it for display. You can either accept the defaults used by your formatting string or provide details to customize it.
Exercise 1
Missing values, string conversion, & by processing
- Recode values “99” and “98” on the variable
hrs1as missing.
- Recode the
maritalvariable into a string variable and then back into a numeric variable.
- Create a new variable that associates each individual with the average number of hours worked among individuals with matching educational degrees (see the last
byexample for inspiration).
Click for Exercise 1 Solution
- Recode values “99” and “98” on the variable
hrs1as missing.
- Recode the
maritalvariable into a string variable and then back into a numeric variable.
tostring marital, gen(marstring)
destring marstring, gen(mardstring)
//compare with
decode marital, gen(marital_s)
encode marital_s, gen(marital_n)
describe marital marstring mardstring marital_s marital_n
sum marital marstring mardstring marital_s marital_n- Create a new variable that associates each individual with the average number of hours worked among individuals with matching educational degrees (see the last
byexample for inspiration).
Merging, appending, & collapsing
GOAL: To learn the basic commands to merge, append, or join two dataset in Stata. In particular:
- How to append datasets
- How to merge datasets and types of merge
- Collapse from master data and create a new dataset of summary statistics
Appending datasets
Sometimes you have observations in two different datasets, or you’d like to add observations to an existing dataset. In this case you can use the append command to add observations to the end of the observations in the master dataset. For example:
clear
// from the append help file
webuse even
list
webuse odd
list
// Append even data to the end of the odd data
append using "http://www.stata-press.com/data/r14/even"
list
clearTo keep track of where observations came from, use the generate option as shown below:
webuse odd
append using "http://www.stata-press.com/data/r14/even", generate(observesource)
list
clearThere is a force option will allow for data type mismatches, but this is not recommended. Remember, append is for adding observations (i.e., rows) from a second data set to your current dataset.
Merging datasets
You can merge variables from a second dataset to the dataset you’re currently working with. There are different ways that you might be interested in merging data:
- Two datasets with same participant pool, one row per participant (1:1)
- A dataset with one participant per row with a dataset with multiple rows per participant (1:many or many:1)
Before you begin:
- Identify the
IDthat you will use to merge your two datasets - Determine which variables you’d like to merge
- Variable types must match across datasets (there is a
forceoption to get around this, but it is not recommended)
Note — data do NOT have to be sorted prior to merging.
// Adapted from the merge help page
webuse autosize
list
webuse autoexpense
list
webuse autosize
merge 1:1 make using "http://www.stata-press.com/data/r14/autoexpense"
list
clear
// keep only the matches (AKA "inner join")
webuse autosize, clear
merge 1:1 make using "http://www.stata-press.com/data/r14/autoexpense", keep(match) nogen
list
clearRemember, merge is for adding variables (i.e., columns) from a second data set.
Merge Options
There are several options that provide more fine-grain control over what happens to non-id columns contained in both data sets. If you’ve carefully cleaned and prepared the data prior to merging this shouldn’t be an issue, but here are some details about how Stata handles this situation.
- In standard merge, the current active dataset is the authority and WON’T CHANGE
- If your current dataset has missing data and some of those values are not missing in your new dataset, specify
update– this will fill in missing data in the current dataset - If you want data from your new dataset to overwrite that in your current dataset, specify
replace update— this will replace current data with new data UNLESS the value is missing in the new dataset
Many-to-many merges - joinby command
Stata allows you to specify merges like merge m:m id using newdata.dta, but it is difficult to imagine a situation where this would be useful. If you are thinking about using merge m:m chances are good that you actually need joinby. Please refer to the joinby help page for details.
Collapse
collapse will take your current active dataset and create a new dataset of summary statistics
- Useful in hierarchical linear modeling if you’d like to create aggregate, summary statistics
- Can generate group summary data for many descriptive stats
- Can also attach weights
Before you collapse:
- Save your current dataset and then save it again under a new name (this will prevent
collapsefrom writing over your original data - Consider issues of missing data. Do you want Stata to use all possible observations? If not, the
cw(casewise) option will make casewise deletions
// Adapted from the collapse help page
clear
webuse college
list
// mean and sd by hospital
collapse (mean) mean_gpa = gpa mean_hour = hour (sd) sd_gpa = gpa sd_hour = hour, by(year)
list
clearYou could also generate different statistics for multiple variables.
Exercise 2
Append, merge, & collapse
Open the gss2.dta dataset. This dataset contains only half of the variables that are in the complete gss dataset.
- Merge dataset
gss1.dtawith datasetgss2.dta. The identification variable isid.
- Open the
gss.dtadataset and merge in data from themarital.dtadataset, which includes income information grouped by individuals’ marital status. Themarital.dtadataset contains collapsed data regarding average statistics of individuals based on their marital status.
- Open the
gssAppend.dtadataset and create a new dataset that combines the observations ingssAppend.dtawith those ingssAddObserve.dta.
- Open the
gss.dtadataset and create a new dataset that summarizes the mean and standard deviation of income based on individuals’ degree status (degree). In the process of creating this new dataset, rename your three new variables.
Click for Exercise 2 Solution
Open the gss2.dta dataset. This dataset contains only half of the variables that are in the complete gss dataset.
- Merge dataset
gss1.dtawith datasetgss2.dta. The identification variable isid.
- Open the
gss.dtadataset and merge in data from themarital.dtadataset, which includes income information grouped by individuals’ marital status. Themarital.dtadataset contains collapsed data regarding average statistics of individuals based on their marital status.
use dataSets/gss.dta, clear
merge m:1 marital using dataSets/marital.dta, nogenerate replace update
save gss4.dta, replace- Open the
gssAppend.dtadataset and create a new dataset that combines the observations ingssAppend.dtawith those ingssAddObserve.dta.
- Open the
gss.dtadataset and create a new dataset that summarizes the mean and standard deviation of income based on individuals’ degree status (degree). In the process of creating this new dataset, rename your three new variables.
Wrap-up
Feedback
These workshops are a work-in-progress, please provide any feedback to: help@iq.harvard.edu
Resources
- IQSS
- Workshops: https://www.iq.harvard.edu/data-science-services/workshop-materials
- Data Science Services: https://www.iq.harvard.edu/data-science-services
- Research Computing Environment: https://iqss.github.io/dss-rce/
- HBS
- Research Computing Services workshops: https://training.rcs.hbs.org/workshops
- Other HBS RCS resources: https://training.rcs.hbs.org/workshop-materials
- RCS consulting email: mailto:research@hbs.edu
- Stata
- UCLA website: http://www.ats.ucla.edu/stat/Stata/
- Stata website: http://www.stata.com/help.cgi?contents
- Email list: http://www.stata.com/statalist/