I attended the Boston Girl Geeks Dinner last week and left with a mission: to extract the names from all the biographies on Wikipedia and analyze them by gender.
There’s a gem that compares a first name against a 40,000-name dictionary and reports whether that name is female, male, mostly female, mostly male, or androgynous/unknown.
Wikipedia has an API, but it’s a long, convoluted process to get at just the names of biographical subjects. However, Wikipedia also has a set of pages that organize biographies according to the quality of the bio. Each of those pages lists … the *names* of the biography subjects. Bingo!
So, I got to learn how to use XPath to extract data from a page, and then do a bit of work to pull out exactly what I need. The names are not the only text in the resulting string, so I had to strip the extra content out before running each name through the gender tool.
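To make that step concrete, here is a minimal sketch of the idea, not the code I actually wrote. It assumes the names sit in links inside list items and that the extra text is a trailing parenthetical; the sample markup, the XPath expression, and the helper names `link_texts`, `clean_name`, and `first_name` are all made up for illustration. It uses REXML, which ships with Ruby, so the sample has to be well-formed XML; a real Wikipedia page would need a more forgiving HTML parser.

```ruby
require "rexml/document"

# Hypothetical fragment standing in for one list page; real markup will differ.
SAMPLE = <<~HTML
  <ul>
    <li><a href="/wiki/Ada_Lovelace">Ada Lovelace</a></li>
    <li><a href="/wiki/Grace_Hopper">Grace Hopper</a></li>
    <li><a href="/wiki/Alan_Turing">Alan Turing (mathematician)</a></li>
  </ul>
HTML

# Return the text of every node matched by the XPath expression.
def link_texts(markup, xpath = "//li/a")
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map { |node| node.text.to_s }
end

# Strip parenthetical extras such as "(mathematician)" and stray whitespace.
def clean_name(raw)
  raw.gsub(/\s*\([^)]*\)/, "").strip
end

# The gender lookup works on first names, so keep only the first word.
def first_name(full_name)
  full_name.split.first
end

names = link_texts(SAMPLE).map { |text| clean_name(text) }
first_names = names.map { |name| first_name(name) }
p first_names
```

Each entry in `first_names` is what would then be handed to the gender gem.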
I’ve made the tool as gentle on Wikipedia as possible. There’s no automated run of the whole extraction; each page has to be run separately, by hand, so it won’t create a big load and accidentally get my IP address banned. I used a really small page during coding, so the incessant testing only ever grabbed a single page and a few lines of data.
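One way to keep repeated test runs from hitting the site at all is to save each page locally the first time it is fetched. This is an assumption on my part about how it could be done, not a description of the tool: the `cached_page` helper and the `page_cache` directory are hypothetical, and the actual download is passed in as a block so the sketch itself makes no network calls.

```ruby
require "fileutils"
require "digest/sha1"

CACHE_DIR = "page_cache" # hypothetical local directory for saved pages

# Return the page body for url, downloading it at most once.
# The block is the only place a network request can happen.
def cached_page(url, cache_dir: CACHE_DIR)
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, Digest::SHA1.hexdigest(url) + ".html")
  return File.read(path) if File.exist?(path)

  body = yield(url)
  File.write(path, body)
  body
end
```

A by-hand run for a single page could then look like `cached_page(ARGV[0]) { |u| Net::HTTP.get(URI(u)) }` after `require "net/http"`, with the page address given on the command line.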
And it works!
Next up: enabling people to edit records, to handle cases where the tool couldn’t identify the gender, and adding statistical analysis.