Linear regression with linked data files

Friday, March 29, 2019
Dr. Emanuel Ben David, Center for Statistics Research & Methodology, Census Bureau


Large organizations that own or have access to multiple data sources regularly rely on data integration to conduct large-scale scientific projects. Record linkage, also called entity resolution, is an essential task in data integration: it identifies which records in different datasets belong to the same entity. In practice, because unique identifiers are often lacking, record linkage is prone to two kinds of matching error: mismatches and missed matches. Statistical analysis of linked data files, even when the matching error rate is low, can then suffer from selection bias and adverse outliers. It is therefore of interest to develop statistical methods that alleviate the adverse effects of matching errors. In this talk, I consider the regression analysis of "permuted data", in which record linkage results in an unknown permutation of the observations of the response variable. Assuming that the matching error is small, I propose an approach for estimating the parameters that is statistically sound and computationally feasible.
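
As a rough illustration of the permuted-data setting, the sketch below simulates a simple linear model, shuffles the response among a small fraction of records to mimic mismatches, and compares the ordinary least-squares slope before and after. It is not the estimator proposed in the talk; the sample size, coefficients, 10% mismatch rate, and the helper names `ols_slope` and `shuffle_fraction` are all assumptions made for the example.

```python
import numpy as np

def ols_slope(x, y):
    """Least-squares slope of y on x, fitted with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

def shuffle_fraction(y, frac, rng):
    """Copy y and permute a random subset of its entries among themselves.

    This mimics linkage mismatches: the selected responses are attached
    to the wrong covariate rows, while all other rows stay correctly linked.
    """
    y_perm = y.copy()
    k = int(round(frac * len(y)))
    idx = rng.choice(len(y), size=k, replace=False)
    y_perm[idx] = y[rng.permutation(idx)]
    return y_perm

rng = np.random.default_rng(0)

# Hypothetical data: y = 1 + 2x + noise (values chosen only for illustration).
n, beta = 5000, 2.0
x = rng.normal(size=n)
y = 1.0 + beta * x + rng.normal(scale=0.5, size=n)

# Assume 10% of the linked records carry a mismatched response.
y_linked = shuffle_fraction(y, frac=0.10, rng=rng)

slope_true_links = ols_slope(x, y)
slope_with_mismatches = ols_slope(x, y_linked)

print(f"slope with correct links: {slope_true_links:.3f}")
print(f"slope with 10% mismatch:  {slope_with_mismatches:.3f}")
```

In this simulation the mismatched records contribute essentially no covariance between covariate and response, so the naive slope is pulled toward zero by roughly the mismatch rate. This is the kind of bias that methods for linked data aim to correct.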