01/21/2026 | Press release | Distributed by Public on 01/21/2026 10:14
Working papers are intended to report exploratory results of research and analysis undertaken by the National Center for Science and Engineering Statistics (NCSES) at the U.S. National Science Foundation (NSF). Any opinions, findings, conclusions, or recommendations expressed in this working paper do not necessarily reflect the views of NSF. This working paper has been released to inform interested parties of ongoing research or activities and to encourage further discussion of the topic.
NCSES has reviewed this product for unauthorized disclosure of confidential information and approved its release (NCSES-DRN25-016).
This working paper describes imputation research with applications to the Survey of Doctorate Recipients (SDR), which provides information about the educational and occupational achievements and career movement of U.S.-trained doctoral scientists and engineers in the United States and abroad. The paper explores a new imputation approach for the SDR that systematically utilizes and maintains reported longitudinal patterns. The paper also summarizes the evaluation of the new cross-sectional imputation for the 2019 SDR cross-sectional data and the 2015-19 longitudinal SDR file.
Missing data in surveys occur when a respondent does not answer a particular question, whether intentionally (the respondent does not know or refuses to answer), unintentionally (through an oversight or a misinterpreted skip pattern instruction), or because the respondent completes a short version of the questionnaire that includes only critical items. Imputation of missing survey data aims to mitigate the risk of nonresponse bias associated with item nonresponse and to give data users a complete data set, with no missing values, that facilitates analysis. In addition, imputation procedures seek to support multivariate inferences by reflecting the relationships among survey variables.
Particularly, in imputation for longitudinal surveys, it is crucial to maintain longitudinal relationships in the data by preserving temporal patterns and accurately predicting missing data at a given time. The Survey of Doctorate Recipients (SDR) provides information about the educational and occupational achievements and career movement of U.S.-trained doctoral scientists and engineers in the United States and abroad and produces both cross-sectional and longitudinal data. The SDR currently imputes the biennial cross-sectional item missing data through a hot deck method. The first release of longitudinal SDR data in 2022 (LSDR 2015-19) added hot deck imputation of missing years of data for longitudinal cohort cases using longitudinal patterns while taking the cross-sectional imputations as fixed, meaning the cross-sectional data were not changed. Although each cross-sectional imputation utilizes reported data from prior cycles, the current method does not incorporate longitudinal patterns consistently.
The research described in this working paper investigates the effects of the systematic usage of information from past cycles in the cross-sectional imputation and in the longitudinal imputation, which already incorporates longitudinal information. The key goal is to improve donor matching in the cross-sectional imputation and, in turn, enhance the data quality of (1) the cross-sectional data sets by strengthening the imputation accuracy and (2) the longitudinal data sets by reducing the chance of creating unobserved and unverifiable longitudinal patterns. The following steps are involved in this investigation.
The results of this research can be used to evaluate both the cross-sectional and longitudinal data sets and to provide specific recommendations on extending and modifying the methodology for use in future SDR cycles.
The two methods sections (sections 2 and 3) describe the cross-sectional imputation and longitudinal imputation, respectively. Section 4 concludes with a summary and recommendations.
The SDR is conducted approximately biennially and provides demographic, education, and career history information from individuals with a U.S. research doctoral degree in a science, engineering, or health field. The majority of data items have low nonresponse. For example, item nonresponse in the SDR 2019 for key employment and demographic items ranged from 0% to 3%. Nonresponse to questions about income was higher and ranged from 13% to 15%. For the SDR cross-sectional files, the cross-sectional imputation has been utilizing information available within a cycle while incorporating data from previous cycles where possible, primarily aiming to enhance the accuracy of the imputation within the cross-sectional file. The research described in this working paper investigated the effect of a new approach that uses data from previous cycles systematically. The new approach resembles the processes used for the original cross-sectional imputation but with some modifications, including (1) conducting a separate imputation step for cases grouped by the historic response patterns of a given item and (2) identifying additional class and sort variables to be used in the imputation of each of the items. These modifications were intended to preserve longitudinal patterns observed in the 2015, 2017, and 2019 SDR data reported by respondents.
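As an illustration of the two modifications, the following Python sketch groups cases by the historic response pattern of an item and then runs a simple nearest-neighbor hot deck within each group after ordering by a sort variable. The record layout, the field names (`history`, `salary_band`, `years_since_phd`), and the nearest-donor rule are assumptions made for this example; they are not taken from the SDR production system.

```python
from collections import defaultdict


def response_pattern(rec, item, prior_years=(2015, 2017)):
    """Return e.g. 'RM': item reported (R) in 2015 and missing (M) in 2017."""
    return "".join(
        "R" if rec["history"].get(year, {}).get(item) is not None else "M"
        for year in prior_years
    )


def hot_deck_by_pattern(records, item, sort_key):
    """Impute `item` separately within each historic response pattern.

    Within a pattern group, cases are ordered by `sort_key` and each
    missing case borrows the value of the nearest reported case.
    """
    classes = defaultdict(list)
    for rec in records:
        classes[response_pattern(rec, item)].append(rec)

    imputed = {}
    for group in classes.values():
        ordered = sorted(group, key=sort_key)
        donor_pos = [i for i, r in enumerate(ordered) if r[item] is not None]
        for i, rec in enumerate(ordered):
            if rec[item] is not None or not donor_pos:
                continue  # already reported, or no donor in this group
            # Closest donor in sort order; ties go to the earlier position.
            nearest = min(donor_pos, key=lambda j: (abs(j - i), j))
            imputed[rec["id"]] = ordered[nearest][item]
    return imputed


# Hypothetical toy data: two cases with full history, two with none.
records = [
    {"id": 1, "salary_band": 3, "years_since_phd": 5,
     "history": {2015: {"salary_band": 3}, 2017: {"salary_band": 3}}},
    {"id": 2, "salary_band": None, "years_since_phd": 6,
     "history": {2015: {"salary_band": 2}, 2017: {"salary_band": 3}}},
    {"id": 3, "salary_band": 1, "years_since_phd": 20, "history": {}},
    {"id": 4, "salary_band": None, "years_since_phd": 4, "history": {}},
]
result = hot_deck_by_pattern(
    records, "salary_band", sort_key=lambda r: r["years_since_phd"]
)
```

In this toy example, case 2 can only receive a value from case 1, because both reported the item in the two prior cycles, while case 4 can only receive a value from case 3. Keeping donors and recipients within the same response-history group is what ties the imputed value to a comparable longitudinal record.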
The exploration focuses on items from questionnaire section A (employment situation) of the 2019 survey, which are of particular interest longitudinally. The original 2019 cross-sectional imputations for questionnaire sections B (past employment), C (other work-related experiences), D (recent educational experiences), and E (demographic) were retained.
The methods section begins by outlining the existing hot deck imputation in section 2.1. Section 2.2 presents the new cross-sectional imputation approach and describes the modifications and their implementation in more detail using the 2019 SDR data. Section 2.3 compares the data re-imputed with the new approach against the published 2019 SDR data, which used the original approach; the comparisons evaluate item response status patterns and variable distributions before and after imputation (marginal and conditional on the longitudinal information).
For the 2015-19 LSDR data file, the longitudinal imputation was limited to imputing missing years of data and incorporated the data from previous cycles in a systematic manner (see section 3.1 Overview of Imputation Approach for details). However, the cross-sectional SDR data, both the respondent-provided values and the imputed values, were used regardless of the source of the data, and the cross-sectional imputation did not systematically utilize the longitudinal information for the 2017 and 2019 cycles.
For the longitudinal imputation research, the 2019 SDR cross-sectional file created through the new imputation approach was used while the general longitudinal imputation approach remained the same. Thus, the imputation approaches in the cross-sectional and longitudinal imputation processes were consistent for the 2019 cycle.
Section 3.1 Overview of Imputation Approach provides an overview of the longitudinal imputation approach. The newly re-imputed variables from section A of the questionnaire from SDR 2019 were compared with the existing 2015-19 LSDR data created based on the published 2019 SDR cross-sectional data.
Longitudinal surveys offer insights that cross-sectional surveys cannot, such as the ability to track outcomes over time and assess the potential impacts of policy changes. However, longitudinal data are more complex by design. Particularly, in imputation for longitudinal surveys, it is critical to preserve temporal patterns in addition to accurately predicting missing data at a given time. NCSES's SDR produces both cross-sectional and longitudinal data and imputes missing data through two processes, one for cross-sectional files and the other for longitudinal files. Using the SDR, the research summarized in this working paper studied the effects of systematically using information from past cycles in both processes.
Systematically incorporating longitudinal information in imputation for the cross-sectional files did not affect the overall distributions of the variables marginally or conditionally in the 2019 cross-sectional file when comparing them to the variable distributions using the existing imputation approach. However, the research findings indicate that the proposed approach slightly reduced the chance of creating new longitudinal patterns when compared with the existing approach. In the longitudinal file, the distribution of changes using the new approach was slightly closer to the distribution using the reported values than the distribution using the existing approach.
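The two kinds of comparison described above can be sketched in Python as follows. The working paper summary does not state which measures were used, so both functions are assumptions chosen for illustration: `new_pattern_rate` is the share of imputed cases whose three-cycle pattern never occurs among fully reported cases, and `total_variation` is one common distance between two categorical distributions. The status codes in the toy data are hypothetical.

```python
from collections import Counter


def new_pattern_rate(reported_patterns, imputed_patterns):
    """Share of imputed patterns that never occur among reported patterns."""
    if not imputed_patterns:
        return 0.0
    observed = set(reported_patterns)
    new = sum(1 for p in imputed_patterns if p not in observed)
    return new / len(imputed_patterns)


def total_variation(sample_a, sample_b):
    """Total variation distance between two categorical samples (0 to 1)."""
    count_a, count_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    categories = set(count_a) | set(count_b)
    return 0.5 * sum(
        abs(count_a[c] / n_a - count_b[c] / n_b) for c in categories
    )


# Hypothetical employment status across 2015, 2017, and 2019
# (E = employed, U = unemployed).
reported = [("E", "E", "E"), ("E", "E", "U"), ("U", "E", "E")]
imputed = [("E", "E", "E"), ("E", "U", "E")]
rate = new_pattern_rate(reported, imputed)

# Hypothetical change indicators between two cycles.
changes_reported = ["same", "same", "change", "change"]
changes_imputed = ["same", "same", "same", "change"]
distance = total_variation(changes_reported, changes_imputed)
```

A lower `new_pattern_rate` corresponds to fewer unobserved longitudinal patterns being created by imputation, and a smaller `total_variation` corresponds to a distribution of changes that sits closer to the one based on reported values.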
This research implemented an alternative imputation approach for primary and secondary work activities to enhance the efficiency and effectiveness of the process. Instead of a random imputation method, the alternative is a two-step approach based on hot deck imputation. The first step uses each of the individual work activity variables, along with the primary and secondary work activities reported in past cycles (if available), as sort variables; this step imputes most of the missing cases. The second step iterates individually through the small number of cases that the first step could not impute and imputes the variables by forming a donor pool of valid cases given the combination of individual work activities the respondent performs. This modification shifted the distributions of these variables among imputed cases closer to the distributions among reported cases.
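A minimal Python sketch of the second step is given below. It assumes that a donor is "valid" for a recipient when the donor's reported primary and secondary activities are both among the individual activities the recipient performs; the summary does not spell out the validity rule, so this definition, the field names, and the seeded random draw from the pool are all assumptions made for illustration.

```python
import random


def valid_donors(recipient, donors):
    """Donors whose primary/secondary activities the recipient also performs."""
    performed = recipient["activities"]
    return [
        d for d in donors
        if d["primary"] in performed
        and (d["secondary"] is None or d["secondary"] in performed)
    ]


def impute_leftovers(leftovers, donors, seed=0):
    """Step 2: handle, one case at a time, what step 1 could not impute."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    imputed = {}
    for rec in leftovers:
        pool = valid_donors(rec, donors)
        if pool:  # cases with an empty pool stay unimputed in this sketch
            donor = rng.choice(pool)
            imputed[rec["id"]] = (donor["primary"], donor["secondary"])
    return imputed


# Hypothetical donors (reported cases) and leftover recipients.
donors = [
    {"id": "d1", "primary": "research", "secondary": "teaching"},
    {"id": "d2", "primary": "management", "secondary": None},
]
leftovers = [
    {"id": "r1", "activities": {"research", "teaching"}},
    {"id": "r2", "activities": {"sales"}},
]
step_two = impute_leftovers(leftovers, donors)
```

Restricting the pool this way means an imputed primary or secondary activity is always one the respondent reported performing, which a purely random draw across all reported cases would not guarantee.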
Although implementing the new approaches for both the cross-sectional and longitudinal imputation required some additional effort, the processes are similar to the original approaches and were not overly burdensome to execute. Given this, although differences in overall distributions of the imputed variables between the existing and new imputation approaches were small, future SDR data processing cycles should consider:
This research demonstrated potential gains in data quality and in the efficiency of survey data processing when applying a unified approach that fully utilizes reported longitudinal data patterns for both cross-sectional and longitudinal imputation. Future work that includes a comprehensive evaluation of results and variance estimation that accounts for imputation is recommended before full implementation.
National Center for Science and Engineering Statistics (NCSES). 2022. Survey of Doctorate Recipients, Longitudinal Data: 2015-19. NSF 22-326. Alexandria, VA: U.S. National Science Foundation. Available at https://ncses.nsf.gov/pubs/nsf22326/.
1 For sample cases who were included in the LSDR 2015-19 and did not complete the 2017 SDR form or the 2019 SDR form, hot deck imputation was conducted to impute all variables in a selected set of 32 critical items.
2 Item response statistics are summarized in the SDR 2019 technical notes at https://ncses.nsf.gov/pubs/nsf21320#technical-notes.
3 Imputation donors that have reached the use limit are excluded from the list of available donors so that the response data from a donor are not used more than three times.
4 Serpentine sorting for hot deck imputation is described in detail in https://support.sas.com/resources/papers/proceedings-archive/SUGI95/Sugi-95-182%20Carlson%20Cox%20Bandeh.pdf.
5 For these missing items, the most and second-most important reasons were imputed by a simple random selection from the subset of reported reasons.
6 See NCSES 2022: data table 1-B at https://ncses.nsf.gov/pubs/nsf22326/table/1-B.
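The donor use limit in footnote 3 and the serpentine sorting in footnote 4 can be sketched in Python as follows. This is a simplified two-variable illustration, not the SDR implementation: the function names, the field names (`field`, `age`), and the rule of taking the first candidate with remaining uses are assumptions. Only the limit of three uses per donor comes from the footnote.

```python
from collections import Counter
from itertools import groupby

MAX_DONOR_USES = 3  # use limit stated in footnote 3


def serpentine_sort(rows, first, second):
    """Sort by `first` ascending; alternate the direction of `second`
    between successive levels of `first`, so that neighboring rows stay
    similar across level boundaries."""
    ordered = sorted(rows, key=lambda r: r[first])
    result = []
    descending = False
    for _, level in groupby(ordered, key=lambda r: r[first]):
        result.extend(
            sorted(level, key=lambda r: r[second], reverse=descending)
        )
        descending = not descending
    return result


def pick_donor(candidate_ids, uses):
    """Return the first candidate still under the use limit, or None."""
    for donor_id in candidate_ids:
        if uses[donor_id] < MAX_DONOR_USES:
            uses[donor_id] += 1
            return donor_id
    return None


# Hypothetical rows: three field codes crossed with two ages.
rows = [{"field": f, "age": a} for f in (1, 2, 3) for a in (30, 40)]
ordered = [(r["field"], r["age"]) for r in serpentine_sort(rows, "field", "age")]

# The same two candidates are offered four times in a row.
uses = Counter()
picks = [pick_donor(["d1", "d2"], uses) for _ in range(4)]
```

In the sorted output, age runs upward within field 1, downward within field 2, and upward again within field 3, so the last row of one level sits next to a similar row in the following level. In the donor example, `d1` is chosen three times and the fourth request falls through to `d2`.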
National Center for Science and Engineering Statistics (NCSES). 2025. A Unified Approach to Cross-Sectional and Longitudinal Imputation with Applications to the Survey of Doctorate Recipients. Working Paper NCSES 25-222. Alexandria, VA: U.S. National Science Foundation. Available at https://ncses.nsf.gov/pubs/ncses25222.
National Center for Science and Engineering Statistics
Directorate for Social, Behavioral and Economic Sciences
U.S. National Science Foundation
Alexandria, VA 22314
Tel: (703) 292-8780
FIRS: (800) 877-8339
TDD: (800) 281-8749
E-mail: [email protected]
Table Appendix-A. List of re-imputed questionnaire section A items
Source(s):
2019 Survey of Doctorate Recipients questionnaire.
Table Appendix-B. List of variables on the Survey of Doctorate Recipients, Longitudinal Data: 2015-19
Note(s):
Survey item ID is not applicable for recode variables.
Source(s):
National Center for Science and Engineering Statistics, Survey of Doctorate Recipients: Longitudinal Data: 2015-19.