Extracting and structuring name data from strings using Python
2 min read · Oct 20, 2020
Introduction:
Structuring name data can resolve data integrity issues and reduce duplication, especially when working with messy data. This article presents a solution for structuring name data and demonstrates it at scale.
I. Introduction: Data Structure of a Name
1. Types of Data Structure for Names:
A person’s name can be represented in one of two ways:
- A long string
'Mr. Lin-Manuel Miranda'
- Attributes
{'title': 'Mr.', 'first': 'Lin-Manuel', 'middle': None, 'last': 'Miranda'}
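To make the contrast concrete, here is a minimal sketch of both representations in Python. The `PersonName` class and its field names are hypothetical, chosen only to mirror the attributes listed above; they are not taken from any particular library. The hyphenated given name is kept as a single first name, which is an assumption.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PersonName:
    """Hypothetical container for the attributes of a name."""
    title: Optional[str] = None
    first: Optional[str] = None
    middle: Optional[str] = None
    last: Optional[str] = None


# Representation 1: a single long string.
raw_name = 'Mr. Lin-Manuel Miranda'

# Representation 2: the same name as separate attributes.
# Assumption: the hyphenated given name stays whole as the first name.
structured_name = PersonName(title='Mr.', first='Lin-Manuel', last='Miranda')

# Individual parts can be read directly, with no string slicing.
print(structured_name.last)

# The structured form converts to a plain dict when needed.
print(asdict(structured_name))
```

With the structured form, each part of the name can be compared, cleaned, or indexed on its own, which is what the rest of this article relies on.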
2. Importance of Data Structure for Names:
Unstructured name data can cause the following data integrity issues:
- Duplication resulting from inconsistent titles. Ex:
'Ms. Katrina Lenk'
and 'Mz. Katrina Lenk'
- Duplication resulting from inconsistent middle name abbreviation. Ex:
'Renée Elise Goldsberry'
and 'Renée E. Goldsberry'
- Improper references to nicknames (or stage names). Ex:
'Scott Leo “Taye”
…