Monday, 21 January 2013

Challenges and opportunities of Twitter as a corpus

In the run-up to our next Knowledge Exchange Event we'll be posting a series of blog posts on new social media and qualitative research methods. The first is by Amy Aisha Brown, a research student in the Faculty of Education and Language Studies at the Open University.

I won’t deny it: I am another one of those researchers who has been wowed by the idea of using social media in research, but I’d argue that it hasn’t been without good reason. I am interested in the ideologies of the English language in Japan, and I am looking to find out how these ideologies play out in everyday discussions. The hope is that a wide-scale investigation will complement research in the area that takes a more ethnographic approach (see Philip Seargeant’s work). So, what really pulled me into the idea of using social media, and Twitter specifically, were the possibilities for accessing a large body of relevant, naturally occurring discourse on everyday topics.

A quick search for “英語” (Japanese for ‘English’, as in the language rather than the people or the muffins) brings up new tweets every few seconds. While this shows just how much potential data is out there, ways of getting hold of tweets and getting them into a format that I can work with for the corpus analysis element of my study are not as easy to find.

NVivo 10 and the associated browser plugin NCapture are two off-the-shelf tools I have used so far. NCapture lets you use Twitter’s simple search feature to find relevant tweets, and once you import the search results into NVivo, they appear alongside their metadata as a searchable data set that is ready for qualitative coding. This has been a useful way of getting an initial idea about what I can expect to get from tweets, but NVivo is unlikely to be a long-term solution for collection and corpus analysis for two reasons:

1. Collecting tweets
    • Using Twitter’s basic search function only gives access to a selection of the public tweets produced, a selection that is “optimized to serve relevant tweets to end-users” rather than a random sample or a sample based on any published definition.
    • This way of collecting tweets also only allows you to collect around 1500 at a time, making it difficult (or at least very time-consuming) to collect most of the relevant tweets accessible through the search function.
2. Corpus tools
    • NVivo has lots of nice tools for visualising text, such as word frequency lists and tag clouds, but it is neither a tool built for corpus analysis nor one that is optimised for Japanese text.
    • NCapture does not capture tweets in a way that makes them easily processed by software other than NVivo.
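
To make the corpus tools point concrete, here is a minimal sketch of what a more portable format might allow. It assumes the tweets have already been exported to a plain CSV file with a `text` column; that layout, the function names, and the sample tweets are all hypothetical and are not something NCapture produces. Because Japanese is not written with spaces between words, the sketch counts character bigrams as a crude stand-in for proper word segmentation, which would need a morphological analyser.

```python
import csv
import io
from collections import Counter


def is_japanese(ch: str) -> bool:
    """True for hiragana, katakana, or common CJK ideographs (kanji)."""
    return (
        "\u3040" <= ch <= "\u309f"      # hiragana block
        or "\u30a0" <= ch <= "\u30ff"   # katakana block
        or "\u4e00" <= ch <= "\u9fff"   # CJK unified ideographs
    )


def char_ngrams(text: str, n: int = 2):
    """Yield every n-character window made up entirely of Japanese characters."""
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if all(is_japanese(ch) for ch in gram):
            yield gram


def ngram_frequencies(csv_file, text_column: str = "text", n: int = 2) -> Counter:
    """Count character n-grams over one column of a CSV file object."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts.update(char_ngrams(row[text_column], n))
    return counts


# Two made-up tweets standing in for an exported file (assumed layout: id,text).
sample = io.StringIO(
    "id,text\n"
    "1,英語の勉強は楽しい\n"
    "2,英語が苦手です\n"
)
freqs = ngram_frequencies(sample)
print(freqs.most_common(3))
```

With a real export, the `io.StringIO` sample would be replaced by something like `open("tweets.csv", encoding="utf-8", newline="")`, where the file name is again only a placeholder. Bigram counts are a rough proxy at best, but they show the sort of processing that becomes possible once tweets are stored as plain text rather than inside a single tool.
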
In many ways, these are not just the limitations of the NVivo/NCapture combo; they are the technical challenges of my research in general. It might be that I have to compromise on what I hope to achieve, but for the time being I am enjoying looking into other options. If you have any suggestions, I'd be happy to hear them. Otherwise, I’ll be getting back to it …