Biblio

Filters: Keyword is Record Linkage
2017-03-07
Schild, Christopher-J., Schultz, Simone.  2016.  Linking Deutsche Bundesbank Company Data Using Machine-Learning-Based Classification: Extended Abstract. Proceedings of the Second International Workshop on Data Science for Macro-Modeling. :10:1–10:3.

We present a process for linking various Deutsche Bundesbank data sources on companies using semi-automatic classification. The linkage process involves data cleaning and harmonization, blocking, construction of comparison features, and training and testing a statistical classification model on a "ground-truth" subset of known matches and non-matches. The evaluation of our method shows that the process limits the need for manual classification to a small percentage of ambiguously classified match candidates.
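
The stages named in this abstract (cleaning, blocking, comparison features, and routing ambiguous candidates to manual review) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the field names `name`, `zip`, and `city`, the blocking key, the two thresholds, and above all the score, which is a plain average of the features standing in for the trained statistical model.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def clean(text):
    """Harmonize a string: lowercase, drop punctuation, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())


def block_key(record):
    # Hypothetical blocking key: postal code plus first letter of the cleaned name.
    return (record["zip"], clean(record["name"])[:1])


def candidate_pairs(left, right):
    """Yield only the pairs that share a blocking key, not the full cross product."""
    index = defaultdict(list)
    for rec in right:
        index[block_key(rec)].append(rec)
    for rec in left:
        for other in index.get(block_key(rec), ()):
            yield rec, other


def features(a, b):
    """Comparison features for one candidate pair (illustrative choice)."""
    name_sim = SequenceMatcher(None, clean(a["name"]), clean(b["name"])).ratio()
    city_same = 1.0 if clean(a["city"]) == clean(b["city"]) else 0.0
    return [name_sim, city_same]


def score(a, b):
    # Stand-in for a trained classifier's match probability (assumption).
    feats = features(a, b)
    return sum(feats) / len(feats)


def triage(match_score, lower=0.4, upper=0.8):
    """Three-way decision; the thresholds are hypothetical."""
    if match_score >= upper:
        return "match"
    if match_score <= lower:
        return "non-match"
    return "manual review"


left = [{"name": "Acme Maschinenbau GmbH", "zip": "60311", "city": "Frankfurt"}]
right = [
    {"name": "ACME Maschinenbau G.m.b.H.", "zip": "60311", "city": "Frankfurt"},
    {"name": "Acme Maschinenbau GmbH", "zip": "10115", "city": "Berlin"},
]
decisions = [(b["zip"], triage(score(a, b))) for a, b in candidate_pairs(left, right)]
```

In this toy run the Berlin record never reaches the scoring step because its blocking key differs, which is the purpose of blocking: only pairs that share a key are compared. In a real pipeline the stand-in score would be replaced by a model trained on the labelled subset of known matches and non-matches.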

Kim, Kunho, Giles, C. Lee.  2016.  Financial Entity Record Linkage with Random Forests. Proceedings of the Second International Workshop on Data Science for Macro-Modeling. :13:1–13:2.

Record linkage refers to the task of finding the same entity across different databases. We propose a machine-learning-based record linkage algorithm for financial entity databases. Record linkage across financial databases is essential for integrating information about a given financial entity, since those databases do not share a common unified identifier. Our algorithm works in two steps to determine whether a pair of records refers to the same entity. First, we check with proposed rules whether the record pair can be exactly matched after cleaning the entity name and address. Second, inspired by earlier work on author name disambiguation, we train a binary Random Forest classifier to decide the linkage. To reduce computation and scale the method, this process is done only for candidate pairs selected by a proposed heuristic. Initial evaluation of precision, recall, and F1 on two different linking tasks in the Financial Entity Identification and Information Integration (FEIII) Challenge shows promising results.
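
The two-step decision described in this abstract can be sketched in Python as follows. This is a hedged illustration and not the authors' implementation: the suffix list, the `name` and `address` fields, and the candidate heuristic (same first name token) are all hypothetical, and the `classifier` argument is any callable standing in for the trained binary Random Forest. The abstract does not specify the cleaning rules or the heuristic.

```python
import re

# Hypothetical list of legal-form tokens dropped during name cleaning.
LEGAL_SUFFIXES = {"inc", "corp", "co", "ltd", "llc"}


def _tokens(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()


def clean_name(name):
    return " ".join(t for t in _tokens(name) if t not in LEGAL_SUFFIXES)


def clean_address(address):
    return " ".join(_tokens(address))


def is_candidate(a, b):
    # Hypothetical heuristic: both cleaned names start with the same token.
    ta, tb = clean_name(a["name"]).split(), clean_name(b["name"]).split()
    return bool(ta) and bool(tb) and ta[0] == tb[0]


def link(a, b, classifier):
    """Return True if records a and b are judged to be the same entity."""
    name_a, name_b = clean_name(a["name"]), clean_name(b["name"])
    # Step 1: rule-based exact match on the cleaned name and address.
    if name_a and name_a == name_b and clean_address(a["address"]) == clean_address(b["address"]):
        return True
    # Restrict the more expensive step to candidate pairs (assumed placement).
    if not is_candidate(a, b):
        return False
    # Step 2: defer to the classifier (stand-in for a Random Forest).
    return bool(classifier(a, b))


bank_a = {"name": "First National Bank, Inc.", "address": "1 Main St."}
bank_b = {"name": "FIRST NATIONAL BANK INC", "address": "1 Main St"}
bank_c = {"name": "First Natl Bank", "address": "1 Main Street"}
bank_d = {"name": "Second Bank", "address": "1 Main St"}
```

Passing the classifier in as a callable keeps the rule-based step testable on its own. `link(bank_a, bank_b, ...)` succeeds in step 1 without consulting the classifier, while `bank_a` and `bank_c` fail the exact match and are decided by whatever the classifier returns.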