Alias Detection in Link Data Sets
Abstract
The problem of detecting aliases (multiple text string identifiers
corresponding to the same entity) is increasingly important in the
domains of biology, intelligence, marketing, and geoinformatics. This
report investigates the extent to which probabilistic methods can
help. Aliases arise when entities try to hide their identities, when a
person goes by multiple names, or when words are unintentionally or
even intentionally misspelled. While purely orthographic methods
(e.g., string similarity) can resolve unintentional misspellings, many
types of alias (including those adopted with malicious intent) can
fool these methods. However, even when an entity changes its name in
some context, some or all of the other entities with which it has
relationships may remain stable.
Thus, the local social network can be exploited by using the
relationships as semantic information. The proposed combined algorithm
takes advantage of both orthographic and semantic information to
detect aliases. By applying the best combination of both types of
information, the combined algorithm outperforms algorithms built on
only one type of information. Empirical results on three real-world
data sets support this claim.
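
To make the idea of combining the two kinds of evidence concrete, the
following is a minimal sketch and not the algorithm evaluated in this
report. It assumes a difflib string ratio as the orthographic score,
Jaccard overlap of neighbor sets as the semantic score, and a fixed
linear weight to combine them. The function names, the weight, and the
toy link data are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Set


def orthographic_score(a: str, b: str) -> float:
    """String similarity in [0, 1] (difflib ratio, case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_score(a: str, b: str, links: Dict[str, Set[str]]) -> float:
    """Jaccard overlap of the neighbor sets of a and b in the link data."""
    na = links.get(a, set()) - {a, b}
    nb = links.get(b, set()) - {a, b}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


def combined_score(a: str, b: str, links: Dict[str, Set[str]],
                   weight: float = 0.5) -> float:
    """Linear mix of both scores; the weight is an assumed placeholder."""
    return (weight * orthographic_score(a, b)
            + (1.0 - weight) * semantic_score(a, b, links))


# Hypothetical link data: each name maps to the entities it co-occurs with.
links = {
    "Jon Smith": {"Acme Corp", "Mary Lee", "Bob Ray"},
    "The Fox": {"Acme Corp", "Mary Lee", "Bob Ray"},
    "Joan Smith": {"Zeta LLC", "Ann Poe"},
}

for other in ("The Fox", "Joan Smith"):
    print(other, round(combined_score("Jon Smith", other, links), 3))
```

In this toy data, "The Fox" looks nothing like "Jon Smith" as a string
but shares all of its neighbors, while "Joan Smith" is a near match as
a string but shares none. A purely orthographic score would prefer the
second pair, and the combined score prefers the first. How the two
sources should be weighted in practice is left open here.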