Dr Luke Sloan is a Senior Lecturer in Quantitative Methods, Deputy Director of Cardiff Q-Step and a member of the Collaborative Online Social Media Observatory (COSMOS: www.cosmosproject.net). He is based in the School of Social Sciences at Cardiff University and his research focuses on the development of demographic proxies for Twitter data and understanding how social media data can augment traditional modes of social scientific analysis. @drlukesloan
A perennial criticism of Twitter data is that it’s missing many of the variables that we find interesting as social scientists and, because of this, it will never be a viable source of data for social scientific analysis. We are anchored to the practices of survey methodology, in which a question is asked and answered, thus ensuring that the researcher collects the relevant demographic information that allows us to compare gender, ethnic and socio-economic groups. This is the bread and butter of social science.
In contrast, social media data is naturally occurring – it is not elicited! Because of this it is unfocused, messy and does not neatly address a pre-conceived research question. But it is a rich source of information on attitudes and provides insights into immediate reactions following key events. It’s been used to predict elections and box office revenue, and even to calculate the epicentre of an earthquake. So clearly we shouldn’t be so quick to dismiss this data as useless, particularly if we are creative and innovative in how we conceptualise the ways in which demographic data may manifest, and thus open this data up to social scientific analysis.
Imagine that you are walking down the street and have decided that today you are going to guess the demographic characteristics of the people that you see – the only rule is that you cannot ask them outright; you must observe their behaviour without being obtrusive. How might you work out someone’s gender? Well, perhaps you overhear someone shouting his or her name. What about their occupation? Maybe they have an ID badge or are carrying tools. What about their age? Well, we all make guesses about age based on appearance, often at the risk of offending someone. The point is that through the passive uptake of incidental information which is there to be analysed (and which you have not elicited!) you can tell quite a bit about a person.
Now let’s consider this in the context of Twitter. People put their name on Twitter, thus allowing us to derive a proxy for their gender. For those who have geo-tagging switched on we can tell where they were when they tweeted, or we can use profile information to work out their home town. If we have enough time we can even look at the places they refer to in their tweets. We know about their hobbies as they report on their leisure activities and we know a bit about their work if they report on it via social media. Are they employed? Well, we can have a look at whether they’re complaining about work, about colleagues or about the printer breaking down (‘again!’). When we look closely enough we are flooded with ‘signatures’ that offer us an indication of characteristics that would typically be found in the demographics section of a survey.
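
As a rough illustration of what collecting such ‘signatures’ might look like, here is a minimal sketch in Python. The record layout (`display_name`, `coordinates`, `profile_location`) and both helper functions are hypothetical and chosen purely for illustration; they are not the fields of any real Twitter API response, nor the method used in the research described here.

```python
from typing import Optional

def first_name_token(display_name: str) -> Optional[str]:
    """Return the first token of a display name that contains letters, lowercased."""
    for token in display_name.split():
        cleaned = "".join(ch for ch in token if ch.isalpha())
        if cleaned:
            return cleaned.lower()
    return None  # no usable name, e.g. a display name made only of digits or symbols

def location_signature(tweet: dict) -> tuple:
    """Prefer the geo-tag (where the tweet was sent); fall back to the profile (home town)."""
    if tweet.get("coordinates") is not None:
        return ("geotag", tweet["coordinates"])
    profile_location = (tweet.get("profile_location") or "").strip()
    if profile_location:
        return ("profile", profile_location)
    return ("none", None)

# Hypothetical record: field names are assumptions, not a real API schema.
example = {
    "display_name": "Luke S.",
    "coordinates": None,            # geo-tagging switched off
    "profile_location": "Cardiff",
    "text": "Printer broken down again!",
}

signatures = {
    "first_name": first_name_token(example["display_name"]),
    "location": location_signature(example),
}
```

The point of the fallback order is that a geo-tag and a profile location measure different things (where someone was versus where they say they are from), so the sketch records which source was used rather than treating them as interchangeable.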
The sticking point is that we can’t derive this information for all tweeters, and some proxies are less reliable than others. First names are actually quite an accurate proxy for gender, as identity play is a minority pursuit. As long as you have stringent classification rules and understand that around 52% of UK users can’t be classified, you still have information for the remaining 48%* (around 600,000 successfully identified users). You could think of this 48% as a sample of Twitter users which is analogous to a survey sample, although not randomly sampled… but even then do we have any reason to think that the users we have been able to identify are substantively different from those we can’t?
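
To make the idea of ‘stringent classification rules’ concrete, the following is a minimal sketch, assuming a simple name lookup in which anything ambiguous or unrecognised is left unclassified. The tiny `NAME_TO_GENDER` table and the function names are hypothetical; a real study would rely on a much larger name database and more careful rules.

```python
from collections import Counter

# Hypothetical lookup for illustration only, not a real name database.
NAME_TO_GENDER = {"luke": "male", "sarah": "female", "alex": "unisex"}

def classify_gender(first_name):
    """Stringent rule: only known, unambiguous names receive a gender label."""
    if not first_name:
        return "unknown"
    label = NAME_TO_GENDER.get(first_name.lower())
    return label if label in ("male", "female") else "unknown"

def coverage(labels):
    """Share of users that received a label other than 'unknown'."""
    counts = Counter(labels)
    total = len(labels)
    classified = total - counts["unknown"]
    return classified / total if total else 0.0

names = ["Luke", "Sarah", "Alex", "xX_gamer_Xx", None]
labels = [classify_gender(n) for n in names]
share = coverage(labels)  # fraction of this toy list that could be classified
```

Erring towards ‘unknown’ trades coverage for accuracy: fewer users are classified, but those that are can be treated with more confidence, which is the trade-off behind leaving a large share of users unclassified.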
The bottom line is that it is possible to derive important demographic information from Twitter data if we’re prepared to think creatively. The methods will get better and programmes of work will emerge which allow the confirmation of proxy demographic reliability. We’re only a few metres off the ground on our climb up this new methodological edifice, but seeking out a viable trail enables others to follow and establish safer, more secure routes.