Extracting and structuring name data from strings using Python
--
Introduction:
Structuring name data can resolve data integrity issues and reduce duplication, especially when working with messy data. This post walks through a solution for structuring name data and then demonstrates it at scale.
I. Introduction: Data Structure of a Name
1. Types of Data Structure for Names:
A person’s name can be represented in one of two ways:
- A long string:
'Mr. Lin-Manuel Miranda'
- A set of attributes:
{'first': 'Lin', 'middle': None, 'last': 'Miranda', 'title': 'Mr.'}
2. Importance of Data Structure for Names:
Unstructured name data can cause the following data integrity issues (a short sketch after the list shows how parsed names catch the duplicates):
- Duplication resulting from inconsistent titles. Ex:
'Ms. Katrina Lenk' and 'Mz. Katrina Lenk'
- Duplication resulting from inconsistent middle name abbreviation. Ex:
'Renée Elise Goldsberry' and 'Renée E. Goldsberry'
- Improper references to nicknames (or stage names). Ex:
'Scott Leo "Taye" Diggs' and '"Taye" Diggs' and 'Scott Leo Diggs'
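As a quick illustration of the middle-name issue, here is a minimal sketch that compares parsed first and last names (a deliberately simplistic dedup key, used here only for illustration) with the nameparser library introduced in the next section:
from nameparser import HumanName

a = HumanName('Renée Elise Goldsberry')
b = HumanName('Renée E. Goldsberry')

# The raw strings differ, but the parsed first and last names match,
# so a simple (first, last) key flags these records as duplicates.
print((a.first, a.last) == (b.first, b.last))
# True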
II. Structuring Names from String:
Now that the importance of structuring names is understood, here's how to structure a name from its string format.
Note: the code is written in python3, and I'll be using the nameparser library.
from nameparser import HumanName
# Here's a full name, with a nickname
full_name = 'Ms. Rebecca "The Boss" Teichman'
# Parse the full name into its component parts
parsed_name = HumanName(full_name)
# Get just the first and last name
f_name = parsed_name.first
l_name = parsed_name.last
print(f_name, l_name)
# Rebecca Teichman
# ------------------------------
# If you want to see everything:
print(parsed_name.as_dict())
# {'title': 'Ms.',
#  'first': 'Rebecca',
#  'middle': '',
#  'last': 'Teichman',
#  'suffix': '',
#  'nickname': 'The Boss'}
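Continuing from the snippet above, the parsed parts can also be rendered back into one consistent string, which helps with the duplication issues from section I. A small sketch using nameparser's string_format attribute (the particular format string is my own choice):
# Render the parsed name back out in one consistent format;
# string_format controls how str() builds the name.
parsed_name.string_format = '{title} {first} {last}'
print(parsed_name)
# Ms. Rebecca Teichman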
III. Big Data: Structuring Names at Scale
Here's how you would process names in large data objects, such as a dataframe:
Note: I've cached the extraction function to speed up operations when names are duplicated.
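Below is a minimal sketch of that approach; it assumes a pandas DataFrame with a hypothetical full_name column, and uses functools.lru_cache to stand in for the caching mentioned above:
from functools import lru_cache

import pandas as pd
from nameparser import HumanName

@lru_cache(maxsize=None)
def parse_name(full_name):
    # Cached, so duplicated names are only parsed once.
    return HumanName(full_name).as_dict()

# Example dataframe; in practice this could be millions of rows.
# 'full_name' is an assumed column name for this sketch.
df = pd.DataFrame({'full_name': [
    'Ms. Rebecca "The Boss" Teichman',
    'Renée Elise Goldsberry',
    'Renée E. Goldsberry',
    'Renée Elise Goldsberry',  # duplicate: served from the cache
]})

# Parse each name and expand the resulting dicts into columns.
parsed = df['full_name'].apply(parse_name).apply(pd.Series)
df = pd.concat([df, parsed], axis=1)

print(df[['full_name', 'first', 'last', 'nickname']])
With the cache in place, repeated names skip the parsing step entirely, which is where the speedup on large, duplicate-heavy datasets comes from.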
Conclusion:
You now know how to extract structured name data from a string representation of a full name. Congrats!
Good luck on your data journey.