Topologically Crazy for my Future Baby

On or about July 3rd, 2017, my wife and I are expecting our first child.

Naming your child is one of the biggest decisions parents face and one that can change the course of your child's life. Preparation becomes doubly difficult when you are one of the select few who want to be surprised by your newborn's gender, a decision my wife and I made. Whether you know the gender ahead of time or not, naming your child is a task on which the two deciding parties can have widely diverging views. It is also a decision that, whenever and however it is made, both parties need to feel good about when it is all said and done. Needless to say, it is typically a challenging process. As my wife and I ventured down the path of exploring baby names, I hypothesized that data and machine learning might be a fun and powerful tool to assist me, and possibly even my wife, in this difficult process.

Being an evangelist for topological data analysis [TDA], I thought it might work for generating baby names. Adding fuel to the fire, in early March of 2017 I discovered a topological data analysis algorithm that was new to me, Growing Neural Gas [GNG], which I was excited to explore. As I mentally prepared for the impending commencement of baby name negotiations, I decided to learn how to use the Growing Neural Gas algorithm, mix its results with the self-organizing maps [SOMs] I often use to cluster and explore data, and see if this crazy pair might help me find names I like for my future daughter or son.

Here is how I did it.

The first thing I needed was data. For that I built a function that downloads and munges a long history of U.S. baby names from the Social Security Administration [1881 to 2015]. Once you have the data, you need to make it friendly for topological data analysis, which means each row has to represent an item, in this case a name, and all the columns must be numeric features.

To do that I first filtered out any baby names that appeared in fewer than 15 unique years. Next you must make sure the data is normalized, which in this case means making sure that raw counts are represented as proportions. The next and most crucial part was to take this relatively sparse set of data by year and make it more robust. To do that I settled on a bunch of calculations that capture a name's popularity within a decade. I subjectively settled on the minimum, maximum, standard deviation and overall proportion of appearances for each name by decade. Next, I created the final input matrix by joining together the decade data with information about how often the name appeared since 1881 and the actual proportions of each name for each year. Finally I scaled the huge feature matrices, 13,851 names by 235 features [boys] and 21,886 names by 235 features [girls], to mean zero.
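The steps above can be sketched roughly as follows. This is not the code behind this post, which was written in R; it is a minimal Python illustration that assumes the input is a list of (name, year, count) records for one gender. The function names `build_features` and `scale_columns` are hypothetical, and reading "overall proportion of appearances" as the share of a decade's years in which the name shows up is my interpretation. Proportions here are taken against the total of the listed names for that year, and the scaling step also divides by the standard deviation, which goes one step beyond centering to mean zero.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_features(records, min_years=15):
    """Turn (name, year, count) records into one numeric row per name."""
    records = list(records)

    # Total count per year, used to turn raw counts into proportions.
    year_totals = defaultdict(int)
    for _, year, count in records:
        year_totals[year] += count

    # name -> {year: proportion of that year's listed births}
    props = defaultdict(dict)
    for name, year, count in records:
        props[name][year] = props[name].get(year, 0.0) + count / year_totals[year]

    # Drop names seen in fewer than `min_years` distinct years.
    props = {n: by_year for n, by_year in props.items() if len(by_year) >= min_years}

    years = sorted(year_totals)
    decades = sorted({y // 10 * 10 for y in years})

    names = sorted(props)
    rows = []
    for name in names:
        by_year = props[name]
        row = []
        for decade in decades:
            vals = [by_year.get(y, 0.0) for y in years if y // 10 * 10 == decade]
            appeared = sum(1 for v in vals if v > 0) / len(vals)  # assumed meaning
            row += [min(vals), max(vals), pstdev(vals), appeared]
        row.append(len(by_year) / len(years))        # how often the name appears overall
        row += [by_year.get(y, 0.0) for y in years]  # yearly proportions
        rows.append(row)
    return names, rows


def scale_columns(rows):
    """Center every column on zero and divide by its standard deviation."""
    scaled_cols = []
    for col in zip(*rows):
        m, s = mean(col), pstdev(col)
        scaled_cols.append([(v - m) / s if s > 0 else 0.0 for v in col])
    return [list(r) for r in zip(*scaled_cols)]
```

Calling `build_features(records)` and then `scale_columns(rows)` gives a matrix with one row per surviving name, which is the shape the SOM and GNG steps expect.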

I was locked and loaded with everything I needed to explore my hypothesis. I ran the matrix through both the SOM [using the kohonen package in R] and the GNG [using the gmum.r package in R] algorithms, both of which produced codebooks that included a group for every name. This task takes up to an hour for each gender because the SOM calculations are VERY slow {if you just do GNG it takes only a few seconds!!}. Finally I joined the respective codebooks to the original data and ran a quick calculation to find the optimal number of clusters for the data using a scree plot. I had what I set out to explore.
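For readers who have not seen a scree plot used this way, here is a minimal sketch of the idea. It is an assumption about the approach, not the R code used for this post: it runs a plain k-means over the codebook vectors for several values of k and records the within-cluster sum of squares for each. You then plot those values against k and look for the "elbow" where adding clusters stops helping much. The helper names `kmeans_wss` and `scree` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_wss(points, k, iters=50, seed=0):
    """Run a basic k-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members (keep it if empty).
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def scree(points, max_k, restarts=5):
    """Best within-cluster sum of squares found for each k from 1 to max_k."""
    return {
        k: min(kmeans_wss(points, k, seed=s) for s in range(restarts))
        for k in range(1, max_k + 1)
    }
```

Here `points` would be the list of codebook vectors from the SOM or GNG. A sharp drop followed by a flat tail in the returned values suggests the number of clusters to keep.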

Before arming myself for negotiations with the wife and ranting to my friends about the mind-blowing amazingness of topological data analysis, I wanted to build a tool that would make it easy to visualize and explore those names and relationships. To do that I built a function (the output of which you see at the end of the page) which lets the user input names for a given gender, define the distance of the groups they want to see, and explore the results interactively through a hierarchy.
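To give a feel for what "define the distance of the groups" could mean, here is a small hypothetical sketch. It is not the function described above, and it leaves out the interactive hierarchy entirely. It assumes you have a mapping from each name to its codebook node and a mapping from each node to its vector, and it returns the other names whose nodes sit within a chosen Euclidean distance of the query name's node. The name `related_names` and both data structures are assumptions made for illustration.

```python
from math import dist


def related_names(query, assignments, codebook, max_distance):
    """Names whose codebook node lies within max_distance of the query's node.

    assignments: dict mapping name -> node id
    codebook:    dict mapping node id -> numeric vector
    """
    home = codebook[assignments[query]]

    # Distance from the query's node to every node that is close enough.
    near = {}
    for node, vec in codebook.items():
        d = dist(home, vec)
        if d <= max_distance:
            near[node] = d

    # Collect the other names on those nodes, closest first.
    hits = [
        (near[node], name)
        for name, node in assignments.items()
        if node in near and name != query
    ]
    return [name for _, name in sorted(hits)]
```

Raising `max_distance` widens the universe of names returned, and lowering it keeps only the closest relatives.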

I won't get into the results except to say that they were absolutely incredible and far exceeded my expectations. You can literally take a name you like, don't like, or are curious about and produce a universe of related names, all driven by math. I know for a fact that this has already used names my wife liked that I hate {Asher} to produce names that are at the top of my list!

Stay tuned for a formal release of a dedicated baybNameR R package that will allow you to recreate this in your own way and do much, much more {visualize historic name ranks, generate names by decade or phonetics, play with state-by-state data from 1910}. Also, I am working on a separate interactive tool to let anyone visually explore related names for any of the 35,737 names my wife and I continue to negotiate over. It will produce a plot like the one below for any group of names. In the meantime, if you want to explore the codebooks and/or the source code, I have included those as well.

Hope you enjoyed and stay tuned for baby Bresler sometime in early July!!