Geographic Analysis + Text Mining + Big, Messy Data
I’m interested in the intersection of geographic analysis and the text mining of large, messy data sets. I know a fair amount of work has been done on this in various private and public sectors (maybe the CIA could hold a Bootcamp session for us!), but I’m not sure how much has been done specifically in humanities research. I also want to move beyond metadata-level analysis and into the actual mass of the text. How can we map not just the places mentioned in, say, A Portrait of the Artist as a Young Man, but all the places in every Irish novel published during the 1910s, along with their relative frequencies and the nearby words and other places that form their contexts?
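To make that concrete, here is a minimal sketch of the kind of pipeline I have in mind, assuming spaCy’s stock English NER model; the model choice, the five-token window, and the file path are all illustrative assumptions, not a settled method:

```python
# Sketch: pull place names out of raw novel text, count them, and keep a
# window of nearby words for context. Assumes en_core_web_sm is installed.
from collections import Counter, defaultdict

import spacy

nlp = spacy.load("en_core_web_sm")

place_counts = Counter()
place_contexts = defaultdict(list)

def harvest_places(text, window=5):
    """Record each place entity plus the `window` tokens on either side."""
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in ("GPE", "LOC"):  # geopolitical entities and locations
            place_counts[ent.text] += 1
            start = max(ent.start - window, 0)
            end = min(ent.end + window, len(doc))
            place_contexts[ent.text].append(doc[start:end].text)

# e.g. harvest_places(open("portrait_of_the_artist.txt").read())
# place_counts.most_common(20) then gives the relative frequencies the
# Google Books map hides, and place_contexts the surrounding words.
```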
Think of Google Books and the automatically generated map in the About This Book section (see an example here) that gives you a geographic sense of which places are being named. I’ve always found this only superficially interesting, since I have no idea how it was generated and it makes no qualitative distinction between the places it shows (whether a place occurs 2 times or 2,000 times, for instance, or in what context). Especially in historical research, the quality of the data can be a limiting factor in applying Named Entity Recognition or place-name extraction (to say nothing of disambiguating between identically referenced places/names/words). What specific techniques are being used most effectively right now? Do we need more advanced Natural Language Processing, or can we get by with inelegant blunt force? And how do we apply these techniques to raw, messy, humanistic data?
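For comparison, the blunt-force option might be no NLP at all, just matching the text against a gazetteer of known place names; fast and tolerant of messy prose, but it can’t tell Dublin, Ireland from Dublin, Ohio. The gazetteer.txt file and its one-name-per-line format are assumptions for this sketch:

```python
# "Blunt force" place spotting: exact string matches against a gazetteer.
import re
from collections import Counter

with open("gazetteer.txt") as f:
    gazetteer = {line.strip() for line in f if line.strip()}

# Longest names first, so "New Ross" wins over plain "Ross".
names = sorted(gazetteer, key=len, reverse=True)
pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

def blunt_force_places(text):
    """Count every exact gazetteer hit in the text."""
    return Counter(pattern.findall(text))
```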
2 Responses to “Geographic Analysis + Text Mining + Big, Messy Data”
Hi Cameron,
I, too, am interested in mining big ol’ uncontrolled place-related fields to expose the information buried in them.
On openlibrary.org, we have 2 place-related fields that I’m particularly interested in:
– the place a book was published
– place as subject heading
It’s remarkably messy in there, so I’d love to brainstorm ideas about how to keep the mess but increase the signal. Perhaps some sort of “merge” of variant place names, with one form nominated as the master.
Try a search for any place here to see what I mean:
openlibrary.org/search/subjects
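A fingerprint key of the kind Gridworks uses is one way to cluster those variants without throwing the mess away; this is only a sketch, the sample data is invented, and nominating the most frequent surface form as master is my own assumption:

```python
# Cluster variant spellings under a normalized key, then nominate a master.
import string
from collections import Counter, defaultdict

def fingerprint(name):
    """Lowercase, strip punctuation, sort unique tokens: 'Dublin, Ireland'
    and 'ireland -- dublin' collapse to the same key."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

def nominate_masters(raw_values):
    clusters = defaultdict(Counter)
    for value in raw_values:
        clusters[fingerprint(value)][value] += 1
    # Master = the most common surface form in each cluster (ties arbitrary).
    return {key: counts.most_common(1)[0][0] for key, counts in clusters.items()}

# nominate_masters(["Dublin, Ireland", "Ireland -- Dublin", "dublin ireland"])
# -> one cluster, with "Dublin, Ireland" nominated as the master form.
```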
I’d love to see if we could mush together this uncontrolled data with, say, Freebase’s Gridworks, and then perhaps the Yahoo! Geo toolset to see what we could do…
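Something like this, perhaps; geopy’s Nominatim geocoder stands in here purely as an assumption for the geo step, since I’m only sketching the shape of it rather than the actual Yahoo! toolset:

```python
# Turn each nominated master place name into coordinates we could map.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="messy-places-session")  # identify yourself to the service

def locate(master_names):
    for name in master_names:
        loc = geolocator.geocode(name)  # returns None when nothing matches
        if loc is not None:
            yield name, (loc.latitude, loc.longitude)

# list(locate(["Dublin, Ireland"])) -> [("Dublin, Ireland", (53.3..., -6.2...))]
```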
This sounds like a fun session, Cameron. I look forward to it.