Extracting and structuring name data from strings using Python

Yaakov Bressler
2 min read · Oct 20, 2020

Introduction:
Structuring name data can resolve data integrity issues and reduce duplication, especially when working with messy data. This article presents a solution for structuring name data and demonstrates it at scale.

I. Introduction: Data Structure of a Name

1. Types of Data Structure for Names:

A person’s name can be represented in one of two ways:

  1. A long string: 'Mr. Lin-Manuel Miranda'
  2. Attributes: {first: 'Lin-Manuel', middle: None, last: 'Miranda', title: 'Mr.'}
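To make the difference between the two representations concrete, here is a minimal sketch. The `Name` class and its `as_string` method are hypothetical names introduced for illustration only, not part of any library, and the field names simply mirror the attribute example above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


# Hypothetical container for illustration; fields mirror the attribute
# representation (first, middle, last, title).
@dataclass
class Name:
    first: str
    last: str
    middle: Optional[str] = None
    title: Optional[str] = None

    def as_string(self) -> str:
        """Join the non-empty parts back into one long string."""
        parts = [self.title, self.first, self.middle, self.last]
        return " ".join(p for p in parts if p)


name = Name(first="Lin-Manuel", last="Miranda", title="Mr.")

# Representation 1: a long string
long_string = name.as_string()

# Representation 2: attributes, here as a plain dict
attributes = asdict(name)

print(long_string)
print(attributes)
```

Going from attributes to a string is a simple join, as shown here. Going the other way, from a messy string to attributes, is the harder problem that structuring has to solve.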

2. Importance of Data Structure for Names:

Unstructured name data can cause the following data integrity issues:

  • Duplication resulting from inconsistent titles. Ex: 'Ms. Katrina Lenk' and 'Mz. Katrina Lenk'
  • Duplication resulting from inconsistent middle name abbreviation. Ex: 'Renée Elise Goldsberry' and 'Renée E. Goldsberry'
  • Improper references to nicknames (or stage names). Ex: 'Scott Leo “Taye”
