Extracting and structuring name data from strings using Python

Yaakov Bressler
2 min read · Oct 20, 2020


Introduction:
Structuring name data can solve data integrity issues and reduce duplication, especially when working with messy data. This post shows how to parse full-name strings into structured attributes, and how to do so at scale.

I. Introduction: Data Structure of a Name

1. Types of Data Structure for Names:

A person’s name can be represented in one of two ways:

  1. A long string 'Mr. Lin-Manuel Miranda'
  2. Attributes {first: 'Lin-Manuel', middle: None, last: 'Miranda', title: 'Mr.'}
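
To make the two representations concrete, here is a minimal sketch. The `Name` dataclass and its field names are illustrative assumptions that mirror the attributes listed above; they are not part of any library.

```python
from dataclasses import dataclass
from typing import Optional

# 1. The name as one long string
name_string = 'Mr. Lin-Manuel Miranda'


# 2. The same name as attributes (hypothetical container, for illustration only)
@dataclass(frozen=True)
class Name:
    first: str
    last: str
    middle: Optional[str] = None
    title: Optional[str] = None


name_attrs = Name(first='Lin-Manuel', last='Miranda', title='Mr.')

# Each part can now be read directly instead of being sliced out of a string
print(name_attrs.last)
```
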

2. Importance of Data Structure for Names:

Unstructured name data can cause the following data integrity issues:

  • Duplication resulting from inconsistent titles. Ex: 'Ms. Katrina Lenk' and 'Mz. Katrina Lenk'
  • Duplication resulting from inconsistent middle name abbreviation. Ex: 'Renée Elise Goldsberry' and 'Renée E. Goldsberry'
  • Improper references to nicknames (or stage names). Ex: 'Scott Leo “Taye” Diggs' and '“Taye” Diggs' and 'Scott Leo Diggs'
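
The first issue can be illustrated with a small sketch. The records below are hypothetical and already structured. Matching on first and last name alone is a simplifying assumption, since two different people can share a name.

```python
# Hypothetical, already-structured records for the same performer
records = [
    {'title': 'Ms.', 'first': 'Katrina', 'middle': '', 'last': 'Lenk'},
    {'title': 'Mz.', 'first': 'Katrina', 'middle': '', 'last': 'Lenk'},
]

# Compared as raw strings, the two spellings look like two different people
raw = {f"{r['title']} {r['first']} {r['last']}" for r in records}
print(len(raw))  # 2

# Keyed on first and last name only, they collapse into one person
people = {(r['first'].lower(), r['last'].lower()) for r in records}
print(len(people))  # 1
```
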

II. Structuring Names from String:

Now that the importance of structuring names is clear, here’s how to structure a name from a string:

Note: The code is written in Python 3 and uses the nameparser library (pip install nameparser).

from nameparser import HumanName

# Here's a full name, with a nickname
full_name = 'Ms. Rebecca "The Boss" Teichman'

# Extract values
parsed_name = HumanName(full_name)

# Get just the first and last name
f_name = parsed_name.first
l_name = parsed_name.last

print(f_name, l_name)
# Rebecca Teichman

# ------------------------------

# If you want to see everything:
print(parsed_name.as_dict())
# {'title': 'Ms.',
#  'first': 'Rebecca',
#  'middle': '',
#  'last': 'Teichman',
#  'suffix': '',
#  'nickname': 'The Boss'}

III. Big Data: Structuring Names at Scale

Here’s how you would process names in large data objects, such as a dataframe:

Note: I’ve cached the extraction function to speed up operations, in the case of duplicated names.

Conclusion:

You now know how to extract structured name data from a string representation of a full name. Congrats!

Good luck on your data journey.
