IPUMS CPS Unharmonized Variables
What are unharmonized variables?
IPUMS CPS unharmonized variables are original CPS data packaged for accessibility and utility. Regular IPUMS CPS variables are harmonized for comparability across time: similar concepts are given consistent codes across months and years, unknown and not-in-universe (NIU) categories are coded consistently, and any unexpected values are recoded. By contrast, IPUMS CPS unharmonized variables correspond directly to the original public-use datasets made available by the Census Bureau and the Bureau of Labor Statistics. IPUMS USA and IPUMS International users may be familiar with the source variables available from these IPUMS products. Like source variables, IPUMS CPS unharmonized variables deliver original data. Rather than being unique to each sample, however, a single unharmonized variable covers every sample in which a given original CPS variable has identical codes and value labels.
The Current Population Survey is fielded monthly, and variables rarely change from month to month. However, the large number of samples and inconsistent variable names across time make use of the source data cumbersome. IPUMS CPS has created unharmonized variables to ease browsing and use of CPS source data. Unharmonized variables are (mostly) un-recoded variables that apply a consistent name to variables that have the same meaning and codes across multiple samples.
For example, the race variable exists in all CPS Basic Monthly files from 1976-2020. However, it has eight different original Census Bureau variable names across these years, and five unique coding schemes across the same time period.
IPUMS CPS synthesizes this information into five unharmonized variables: UH_RACE_B1 (1976-1988), UH_RACE_B2 (1989-1995), UH_RACE_B3 (1996-2002), UH_RACE_B4 (2003-2012), and UH_RACE_B5 (2013-2020). Each of these variables bundles the source variables from all samples that share the same set of possible codes, so five easily selected variables cover the full period.
Race | | | | | | | | |
---|---|---|---|---|---|---|---|---|
Years | 1976-1981 | 1982-1983 | 1984-1988 | 1989-1993 | 1994-1995 | 1996-2002 | 2003-2012 | 2013-2020 |
CPS Mnemonic | I29 | I18H | I18J | A-RACE | PERACE | PERACE | PRDTRACE | PTDTRACE |
Unharmonized variable | UH_RACE_B1 | UH_RACE_B1 | UH_RACE_B1 | UH_RACE_B2 | UH_RACE_B2 | UH_RACE_B3 | UH_RACE_B4 | UH_RACE_B5 |
White | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Black | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
American Indian/Alaska Native | | | | 3 | 3 | 3 | 3 | 3 |
Asian | | | | | | | 4 | 4 |
Hawaiian/Pacific Islander | | | | | | | 5 | 5 |
Asian/Pacific Islander | | | | 4 | 4 | 4 | | |
Other | 3 | 3 | 3 | 5 | 5 | | | |
White-Black | | | | | | | 6 | 6 |
White-AI | | | | | | | 7 | 7 |
White-Asian | | | | | | | 8 | 8 |
White-Hawaiian | | | | | | | 9 | 9 |
Black-AI | | | | | | | 10 | 10 |
Black-Asian | | | | | | | 11 | 11 |
Black-Hawaiian | | | | | | | 12 | 12 |
AI-Asian | | | | | | | 13 | 13 |
AI-Hawaiian | | | | | | | | 14 |
Asian-Hawaiian | | | | | | | 14 | 15 |
White-Black-AI | | | | | | | 15 | 16 |
White-Black-Asian | | | | | | | 16 | 17 |
White-Black-Hawaiian | | | | | | | | 18 |
White-AI-Asian | | | | | | | 17 | 19 |
White-AI-Hawaiian | | | | | | | | 20 |
White-Asian-Hawaiian | | | | | | | 18 | 21 |
Black-AI-Asian | | | | | | | | 22 |
White-Black-AI-Asian | | | | | | | 19 | 23 |
White-AI-Asian-Hawaiian | | | | | | | | 24 |
2 or 3 races | | | | | | | 20 | |
3 race combinations | | | | | | | | 25 |
4 or 5 races | | | | | | | 21 | 26 |
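The year ranges for the Basic Monthly race variables can be written as a small lookup, which may help when deciding which unharmonized variable to request for a given sample. This is a minimal sketch: the function name `race_variable_for_year` and the list `UH_RACE_BASIC_MONTHLY` are hypothetical and not part of any IPUMS tool; only the variable names and year ranges come from the text above.

```python
# Year ranges for the Basic Monthly race variables, as listed in the text above.
# The list and function names are illustrative, not part of any IPUMS software.
UH_RACE_BASIC_MONTHLY = [
    (1976, 1988, "UH_RACE_B1"),
    (1989, 1995, "UH_RACE_B2"),
    (1996, 2002, "UH_RACE_B3"),
    (2003, 2012, "UH_RACE_B4"),
    (2013, 2020, "UH_RACE_B5"),
]


def race_variable_for_year(year):
    """Return the unharmonized race variable covering `year`, or None if no range matches."""
    for first_year, last_year, name in UH_RACE_BASIC_MONTHLY:
        if first_year <= year <= last_year:
            return name
    return None


print(race_variable_for_year(1990))
print(race_variable_for_year(2015))
```

Because each variable has its own coding scheme, values drawn from different variables in this list should not be pooled without first mapping them to a common set of categories.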
Unharmonized variable naming conventions
IPUMS CPS unharmonized variables are denoted with a "UH_" prefix and a "_[letter][number]" suffix. When codes are added or removed, or have different labels between years, a new unharmonized variable is created and the number in the suffix increments by one.
Unharmonized variables are available for both Basic Monthly and ASEC files. Variables belonging to the Basic Monthly files include a B in the variable name suffix, and variables from the ASEC files include an A. Variables that appear in both Basic Monthly and ASEC files have been given the same name stem. For example, the race variables begin with UH_RACE_ in both the Basic Monthly and the ASEC files. Users should note that the same variable in the same years may have different numbered suffixes due to differing availability across the two types of files. For example, UH_RACE_B1 is available from 1976-1988, while UH_RACE_A1 is available for 1962 and UH_RACE_A2 includes ASEC samples from 1976-1988.
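The naming convention can be checked mechanically. The sketch below splits a name into its stem, file type, and number using a regular expression. It assumes that stems consist only of uppercase letters, digits, and underscores, and that the suffix letter is always A or B; both are assumptions drawn from the examples on this page rather than a documented rule. The function name `parse_uh_name` is hypothetical.

```python
import re

# "UH_" prefix, a name stem, then "_[letter][number]".
# Assumption: the letter is B (Basic Monthly) or A (ASEC), as described above,
# and the stem uses only uppercase letters, digits, and underscores.
UH_NAME = re.compile(r"^UH_(?P<stem>[A-Z0-9_]+)_(?P<file>[AB])(?P<number>\d+)$")

FILE_TYPES = {"A": "ASEC", "B": "Basic Monthly"}


def parse_uh_name(name):
    """Split an unharmonized variable name into stem, file type, and suffix number."""
    match = UH_NAME.match(name)
    if match is None:
        raise ValueError(f"not an unharmonized variable name: {name!r}")
    return {
        "stem": match.group("stem"),
        "file_type": FILE_TYPES[match.group("file")],
        "number": int(match.group("number")),
    }


print(parse_uh_name("UH_RACE_B1"))
print(parse_uh_name("UH_RACE_A2"))
```

Note that the suffix number orders the variables within one file type only; as the UH_RACE_B1 and UH_RACE_A2 example shows, equal years do not imply equal numbers across Basic Monthly and ASEC files.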
Recoding in Unharmonized Variables
Unharmonized variables are intended to deliver raw CPS data. However, IPUMS CPS has done some recoding to increase the usability of the data. These recodes deal with blank spaces in the original data, unexpected strings in the original data, and meaningful strings that would force the variable to be string type.
Blank spaces and unexpected values
We recode string representations of missing values in variables that are otherwise of numeric type. Consider the variable that indicates whether the respondent worked for 35 hours or more during the week at a given job in 1976-1993, UH_USLFT_B1. In the original data, "Not in universe" is designated by a blank space. These records have been recoded to have a value of -9, "Missing" in UH_USLFT_B1 so that all cases in the data have a numeric value.
Recoding in UH_USLFT_B1 | |||
---|---|---|---|
Input Value | Input Label | Output Value | Output Label |
' ' | NIU | -9 | Missing |
1 | Yes | 1 | Yes |
2 | No | 2 | No |
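The recode in the table above amounts to a simple rule: a blank field becomes -9 and every other field is read as a number. The following is an illustrative sketch of that rule only, not the code IPUMS CPS uses; the function name `recode_uslft` is hypothetical, and the input is assumed to be the raw text of the field.

```python
def recode_uslft(raw):
    """Recode a raw UH_USLFT_B1-style field: blank (NIU) becomes -9, other values stay numeric."""
    if raw.strip() == "":
        return -9  # IPUMS CPS-created "Missing" code from the table above
    return int(raw)


print([recode_uslft(value) for value in [" ", "1", "2"]])
```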
Unexpected Strings
We use a similar approach to recode meaningless strings. In the CPS source data from years prior to 1989, low-frequency junk strings occasionally show up in the data. When this occurs, we assign these values to the IPUMS CPS-created "Missing" category. We note these recodes in the unharmonized variable descriptions.
There are also occasionally numeric values in the source data that are not given a label in any of the original CPS documentation. We leave these unexpected values un-recoded and un-labeled and, as a result, they may not appear in the 'Codes' tab of a variable page.
Meaningful Strings
There are several variables that have both numeric and meaningful string values. We recode the meaningful strings to numeric values so that the variable can be treated as numeric. For example, UH_FAMINCX_B1 contains meaningful string values.
Recoding in UH_FAMINCX_B1 | |||
---|---|---|---|
Input Value | Input Label | Output Value | Output Label |
0 | Under $5,000 | 0 | Under $5,000 |
1 | $ 5,000-7,499 | 1 | $ 5,000-7,499 |
2 | $ 7,500- 9,999 | 2 | $ 7,500- 9,999 |
3 | $ 10,000-12,499 | 3 | $ 10,000-12,499 |
4 | $ 12,500-14,999 | 4 | $ 12,500-14,999 |
5 | $ 15,000-17,499 | 5 | $ 15,000-17,499 |
6 | $ 17,500-19,999 | 6 | $ 17,500-19,999 |
7 | $ 20,000-24,999 | 7 | $ 20,000-24,999 |
8 | $ 25,000-29,999 | 8 | $ 25,000-29,999 |
9 | $ 30,000-34,999 | 9 | $ 30,000-34,999 |
A | $ 35,000-39,999 | 10 | $ 35,000-39,999 |
B | $ 40,000-49,999 | 11 | $ 40,000-49,999 |
C | $ 50,000-74,999 | 12 | $ 50,000-74,999 |
D | $ 75,000 and over | 13 | $ 75,000 and over |
' ' | NIU | -99 | Missing |
- | | -99 | Missing |
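The mapping in the table above can be sketched as a lookup for the letter codes plus a fallback for digits. This is a hedged illustration of the recode logic, not the IPUMS CPS implementation; `recode_famincx`, `LETTER_CODES`, and `MISSING` are hypothetical names, and the input is assumed to be the raw text of the field.

```python
# Letter codes and their numeric replacements, taken from the table above.
LETTER_CODES = {"A": 10, "B": 11, "C": 12, "D": 13}
MISSING = -99  # assigned to blank and "-" inputs in the table above


def recode_famincx(raw):
    """Recode a raw UH_FAMINCX_B1-style field to a numeric code."""
    value = raw.strip()
    if value in ("", "-"):
        return MISSING
    if value in LETTER_CODES:
        return LETTER_CODES[value]
    return int(value)  # digits 0-9 keep their original value


print([recode_famincx(value) for value in ["0", "9", "A", "D", " ", "-"]])
```

Recoding the letters to 10 through 13 preserves the ordering of the income brackets, which is what allows the variable to be treated as numeric.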
(Un)Availability
Some variables in some samples are missing in the original data files, contradicting the original metadata. In these instances, the unharmonized variable is available for all samples for which the original metadata suggests it should exist, but is entirely missing in samples where the original data is not meaningful. For example, according to original documentation, the variable for "Relationship to owner of business" (UH_BUSOWN_B1) should be available from 1994 to July of 2005. However, the columns specified for this variable in the data files from 1999 to July of 2005 contain only -1. These samples are still listed as available in UH_BUSOWN_B1 even though they are entirely missing.
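Because such samples are still listed as available, it can be worth checking an extract for samples in which a variable never takes a meaningful value. The sketch below assumes the values for one sample have already been read into a list of integers. The function name `is_entirely_missing` is hypothetical, and the default code of -1 comes from the UH_BUSOWN_B1 example above; other variables may mark missing data differently.

```python
def is_entirely_missing(values, missing_code=-1):
    """Return True if `values` is non-empty and every entry equals `missing_code`."""
    values = list(values)
    return len(values) > 0 and all(value == missing_code for value in values)


# Hypothetical values for two samples of the same variable.
print(is_entirely_missing([-1, -1, -1]))
print(is_entirely_missing([-1, 2, 1]))
```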