01/23/2025 | Press release | Distributed by Public on 01/23/2025 10:42
Data collection. The 2023 data collection period was slightly longer than 6 months (i.e., 27 weeks), beginning in mid-September 2023. The SDR used a trimodal data collection approach: a self-administered online survey (Web), a self-administered paper questionnaire (via mail), and computer-assisted telephone interviews (CATI). All individuals in the sample were started in the Web mode if a mailing or e-mail address was available. After an initial survey invitation via postal mail and e-mail, the data collection protocol included sequential contacts by postal mail, telephone, and e-mail that ran throughout the data collection period. At any time during data collection, sample members could choose to complete the survey using any of the three modes. Nonrespondents to the initial survey invitation received follow-up contacts by alternating contact methods.
Quality assurance procedures were in place at each data collection step (address updating, printing, package assembly and mailing, e-mail sending, questionnaire receipt, data entry, coding, CATI, and post-data collection processing). Active data collection ended the last week of March 2024. The online survey closed 1 April 2024, and receipt of hard-copy questionnaires ended on 25 April 2024.
Mode. Nearly all participants (96.5%) completed the survey through the Web; 2.3% responded by mail and 1.3% by CATI.
Response rates. Response rates were calculated on complete responses, that is, on instruments with responses to all critical items. Critical items are those containing the information needed to report labor force participation, including employment status, job title, and job description, as well as location of residence on the reference date. The overall unweighted response rate was 65%; the weighted response rate was also 65%. These response rates are consistent with those achieved in the 2021 SDR.
Of the 125,262 persons in the 2023 SDR sample, 80,143 completed the survey. Among those who completed the survey, 71,161 respondents were residing in the United States on the survey reference date and contributed to the U.S. SEH doctoral population estimates. An additional 8,982 persons completed the survey, but they were residing outside of the United States on the survey reference date. This group contributed to the estimates of the internationally residing U.S.-trained SEH doctoral population.
Data editing. All survey data collected in the 2023 SDR were captured in a single survey instrument with mode-specific interfaces. Using a unified instrument supported efficient post-data collection processing and facilitated data harmonization. Prior to data entry, mail questionnaire data were reviewed and edited to resolve unclear or inconsistent responses (e.g., multiple responses to a select-one question), following pre-entry editing procedures. Telephone callbacks were used to obtain additional information for incomplete mail responses. Captured data were exported to a single database for the subsequent coding, editing, and imputation needed to create an analytical database.
Following established NCSES guidelines for coding SDR survey data, including verbatim responses, staff were trained to conduct a standardized review and coding of occupation and education information; "other/specify" verbatim responses, including verbatim items pertaining to the new retirement items; state and country geographic information; and postsecondary institution information. For standardized coding of occupation, the respondent's reported job title, duties and responsibilities, the extent to which the work was related to the first U.S. doctoral degree earned, and other work-related information from the questionnaire were first autocoded using a programmed algorithm. Any remaining uncoded occupations were reviewed by trained coders, who corrected known respondent self-reporting errors to obtain the best occupation codes. The education code for the field of study of a newly earned degree, or for the first bachelor's degree earned if not reported previously, was assigned solely on the basis of the verbatim response for that degree field.
Imputation. Item nonresponse for key employment items, such as employment status, sector of employment, and primary work activity, ranged from 0.0% to 2.6%. Nonresponse to questions about income was higher: nonresponse to salary was 6.7%, and nonresponse to earned income was 15.3%. Item nonresponse rates for personal demographic data varied: sex at 0.0%, birth year at 0.5%, marital status at 10.1%, citizenship at 6.6%, ethnicity at 0.2%, and race at 0.7%. Item nonresponse was addressed using random or hot-deck imputation methods.
Logical imputation was often accomplished as part of editing: in the editing phase, the answer to a question with missing data could sometimes be determined from the answer to another question. In some circumstances, editing procedures identified inconsistent data, which were blanked out and therefore became subject to statistical imputation.
During sample frame construction for the SDR, some missing demographic variables, such as race and ethnicity, were imputed before sample selection by the SED or by using other existing information from the sampling frame. All sample members with imputed values for sex, race, or ethnicity were given the opportunity to report these data during data collection if they responded in the Web or CATI modes.
Respondents with missing race or ethnicity data who did not take the opportunity to report these data and did not have imputed race or ethnicity values from prior SDR rounds or from the SED were assigned values for race or ethnicity through hot-deck procedures during post-data collection processing.
Most SDR variables were subjected to hot-deck imputation, with each variable having its own class and sort variables. Hot-deck imputation was implemented using class and sort variables selected through statistical modeling to identify the variables most strongly related to the item being imputed.
However, imputation was not performed on verbatim-based variables, personal contact information, or a few other system variables, such as mother's and father's education. For some variables, such as day of birth, no set of class and sort variables was reliably related to or suitable for predicting the missing value. In these instances, random imputation was used, so that the distribution of imputed values was similar to the distribution of reported values without relying on class or sort variables.
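The hot-deck and random imputation procedures described above can be sketched as follows. This is a simplified illustration, not the SDR's actual implementation: the class variable, item name, and data layout are hypothetical, and the real procedure also orders donors within each class using sort variables rather than drawing at random.

```python
import random

def hot_deck_impute(records, item, class_vars, rng=random.Random(0)):
    """Fill missing values of `item` using donors from the same
    imputation class (defined by `class_vars`); fall back to random
    imputation from all reported values when a class has no donor."""
    # Build donor pools keyed by the combination of class-variable values.
    donors = {}
    for r in records:
        if r[item] is not None:
            key = tuple(r[v] for v in class_vars)
            donors.setdefault(key, []).append(r[item])
    all_reported = [r[item] for r in records if r[item] is not None]

    for r in records:
        if r[item] is None:
            key = tuple(r[v] for v in class_vars)
            pool = donors.get(key)
            if pool:
                # Hot deck: borrow a reported value from the same class.
                r[item] = rng.choice(pool)
            else:
                # Random imputation: preserve the overall reported distribution.
                r[item] = rng.choice(all_reported)
    return records
```

Because donors are drawn from reported values, the imputed distribution mirrors the reported one within each class, which is the property hot-deck imputation is designed to preserve.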
Weighting. Because the SDR is based on a complex sampling design and is subject to nonresponse bias, sampling weights were created for each respondent to support unbiased population estimates; the final analysis weights account for both the sample design and adjustments for nonresponse.
The final sample weights enable data users to derive survey-based estimates of the SDR target population. The variable name on the SDR public use data files for the SDR final sample weight is WTSURVY.
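As a sketch of how the final weight supports survey-based estimates (the toy values below are hypothetical; only the variable name WTSURVY comes from the file documentation), a population total is estimated by summing the weights, and a characteristic's mean is estimated by weighting each response:

```python
def weighted_total(weights):
    # Estimated population size: the sum of final sample weights (WTSURVY).
    return sum(weights)

def weighted_mean(values, weights):
    # Weighted estimate of a mean (e.g., mean salary): each respondent's
    # value is multiplied by the number of population members they represent.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

Each respondent's weight can be read as the number of persons in the target population that the respondent represents, which is why unweighted tallies of respondents do not yield population estimates.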
Detailed information on weighting is contained in the 2023 SDR Methodology Report, available upon request from the SDR Survey Manager.
Variance estimation. The successive difference replication method (SDRM) was used to develop replicate weights for variance estimation. The theoretical basis for the SDRM is described in Wolter (1984), Fay and Train (1995), and Ash (2014). As with any replication method, successive difference replication involves constructing a number of subsamples (replicates) from the full sample and computing the statistic of interest for each replicate. The mean squared error of the replicate estimates around their corresponding full sample estimate provides an estimate of the sampling variance of the statistic of interest. The 2023 SDR produced 104 sets of replicate weights. Please contact the SDR Survey Manager to obtain the SDR replicate weights and the replicate weight user guide.
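The replicate-based variance computation described above can be sketched as follows. This assumes the conventional SDRM multiplier of 4/R that arises from the usual perturbation factor of 0.5 (Fay and Train 1995); the factor actually used by the SDR is documented in the methodology report and replicate weight user guide.

```python
def sdrm_variance(full_estimate, replicate_estimates, multiplier=None):
    """Successive difference replication (SDRM) variance estimate.

    Sums the squared deviations of the replicate estimates around the
    full-sample estimate. With R replicates and the common perturbation
    factor of 0.5, the multiplier is 4/R (an assumption here, not the
    SDR's documented value).
    """
    R = len(replicate_estimates)
    if multiplier is None:
        multiplier = 4.0 / R
    return multiplier * sum((t - full_estimate) ** 2 for t in replicate_estimates)
```

For the 2023 SDR, a data user would recompute the statistic of interest 104 times, once with each set of replicate weights, and pass those 104 estimates together with the full-sample estimate to a function like this; the square root of the result is the estimated standard error.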
Disclosure protection. To protect against the disclosure of confidential information provided by SDR respondents, the estimates presented in SDR data tables are rounded to the nearest 50, although calculations of percentages are based on unrounded estimates.
Data table cell values based on counts of respondents that fall below a predetermined threshold are deemed sensitive to potential disclosure; the letter "D" in a table cell indicates this type of suppression.
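The rounding and suppression rules above can be sketched as follows. The suppression threshold in this sketch is a placeholder, since the actual predetermined threshold is not stated here.

```python
def round_to_50(estimate):
    # Round a population estimate to the nearest 50 (halves round up).
    return int(estimate / 50 + 0.5) * 50

def table_cell(estimate, respondent_count, threshold=25):
    # threshold=25 is a hypothetical placeholder, not the SDR's actual value.
    if respondent_count < threshold:
        return "D"  # suppressed: too few respondents to publish safely
    return round_to_50(estimate)

def percentage(part, total):
    # Percentages are computed from the unrounded estimates.
    return 100.0 * part / total
```

Note that because percentages are taken from unrounded estimates, a published percentage may not exactly match the ratio of the rounded counts shown in the same table.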