#NSMNSS: Using Big Data to Solve Social Science Problems

Thursday, 7 July 2016

Curtis Jessop is a Senior Researcher at NatCen Social Research and is the Network Lead for the NSMNSS network

On Wednesday 29th June I attended a roundtable hosted by our network partners SAGE on using big data to solve social science problems. It was a great day, with contributions from leading researchers and lots of discussion of some of the key issues of working with big data in social science.

Jane Elliott began with an overview of the ESRC’s Big Data Network. She identified the difficulties with data access that earlier phases had faced, but also highlighted key challenges that big data social science currently faces:

1. Methodological

Can we apply the same qualitative techniques/statistical inferences we have in the past?
Are social scientists (falling) behind in using machine learning & algorithms? What are the implications of these methods?

2. Relevance of research

Making sure we use big data to answer pertinent social science questions, and not just focus on methods

3. Ethics at a macro & micro level

Working ethically with big data - data security, anonymity, informed consent & data ownership
What are the implications of a ‘big data society’/algorithm-led decision making?

New methods, tools and techniques for big data research

Giuseppe Veltri outlined how data-driven science differs from ‘traditional’ social science research as it generates hypotheses and insights from the data, rather than theory, combining abductive, inductive & deductive approaches. Further, Phillip Brooker identified a tension in big data analysis between wanting to use qualitative research approaches and working with data of a scale that requires numerical treatment. As a result, social scientists need to work with ‘unfamiliar’ techniques and software.

Tools for Big Data analysis

It was generally agreed that existing software is not fit for addressing academic/social science research questions. Also, tools offered by commercial companies are often ‘black boxes’, whereas social scientists need to be transparent about the algorithms they use, as these are part of the methodology.

Many at the roundtable have therefore developed their own tools (e.g. COSMOS, Textonics, Chorus, & Method52 from CASM) to enable them to conduct analysis in the way they wanted. However, it was felt there was still some way to go - many of these tools are ‘in-house’ and ongoing funding/support is needed to develop something more stable, well-supported, and ‘outward facing’.

Interdisciplinary working

One approach to addressing the challenges of big data analysis is working in interdisciplinary teams (in particular linking between social & computer science departments). Luke Sloan and Mark Carrigan identified that the key challenge of this at a ‘human level’ is ensuring a common understanding of language, after which it was easy to have an open discussion and there were rarely disagreements. Mark argued that what was key was not necessarily making sure that everyone had the same definitions, but that there was an understanding that different fields may have different perspectives.

Mark Kennedy, based on his experiences at the Data Science Institute, emphasised the importance of ‘getting excited’ about the right research question, not just focusing on the technology, and then building a team based on what skills you need to fill that gap.

However, attendees felt that there were structural barriers to interdisciplinary working in academia – departmental silos, geography, navigating different funding bodies, finding journals to publish in, and demonstrating value for the REF were all recognised as problems, although it was also mentioned that funding increasingly supported this approach.

Training in the social sciences

Quite early in the discussion, the question was raised: if there is such a clear skills gap in the social sciences, why have universities not responded to it?

Although it was accepted that training needed to address big data methods, there were differing opinions on how feasible this might be. Adding new techniques into methods courses was welcomed, but to what extent was this achievable when these are already packed covering ‘traditional’ methods? Further, given the relative rarity of established social scientists with this skill-set, who would provide this teaching?

Although it was felt that new students are open to using Python or R/new statistical techniques, this scarcity of trainers with the skills to teach both programming and its application within social sciences was again identified as a problem. Giving students (and academics) access to data science training materials that are framed by social science problems, and relevant dummy data to work with, was suggested as a way to start addressing this.

Answering social science questions with Big Data

While discussing his own research, Slava Mikhaylov highlighted that a good way to make impact is, rather than starting with a research question, to aim to solve a problem. This was echoed by Carl Miller, who outlined some principles that Demos follow for making impact:

Look beyond academic funders – if research is funded by a government department, they’re going to have to listen to it!
Ask the right question – what is interesting to a researcher vs. a policy maker
Answer quickly – policy interests change, and research won’t make an impact if everyone’s moved on
Diversify outputs – can they be real-time, interactive, engaging?
Networking – who are the champions of big data research?

Carl emphasised that this was just the approach that Demos used, and may not be appropriate for all research or audiences. He also mentioned that, in a new discipline, you need to work hard to be responsible and transparent about what your research doesn’t do or say.

Ethics of research using Big Data

Anne Alexander differentiated between the ethics of research using big data and the ethics of doing research in a networked world.

On the latter, Anne felt that there has not been enough reflection on the implications of the ‘datafication’ of human interaction, and that we need to de-mystify these processes and consider what the use of machine learning/algorithms means for society (e.g. their potential for discrimination).

Anne emphasised the need to take into consideration the public’s views on this when considering Big Data research, a point reinforced by Steve Ginnis, whose work at Ipsos MORI on developing ethical guidelines for social media research drew on public ethics, existing industry guidelines and legal frameworks.

Steve’s research identified that the public both have low awareness of, and are not keen on, their social media data being used for research. This was not just due to concerns about privacy/anonymization – people were uncomfortable with being profiled and its possible implications.

That said, participants were willing to weigh up the risks and benefits, and context (who is doing the research and why) was important. Nonetheless, the ‘fundamentals’ (consent, what information, anonymization, etc.) played a much larger role in whether they felt research using social data was appropriate.

Both Anne & Steve emphasised that ethics is an ongoing process, not a one-off event at the start of a project – it needs to be considered at the collection, analysis and publication stages of the research cycle.

Some concluding thoughts

Carl Miller identified that in the context of pressure for evidence-based policy, digital by default, and the open data initiative, there has never been a better time for social scientists to make impact with big data research.

Wednesday’s session demonstrated how far big data analysis in the social sciences has come over recent years, and it was impressive to hear how much work has been put into developing the tools and methods to mould this rich, but novel, form of data into social insights.

However, the session also showed that there are a number of areas that still need to be addressed if we are to make the most of big data:

Access to large data sets continues to be an issue, be they proprietary, public, or administrative. We need to bargain collectively to talk to large, often global, actors and argue for academic access.
There is a skills gap among social scientists for analysing big data, and support is needed to help develop the required methodological and programming skills.
The interdisciplinary working required for big data analysis can be challenging, and we need to work to enable effective collaboration.
Developing an ethical approach to big data analysis is challenging given its novelty, variety, and changing nature. Any framework needs to provide practical guidance to researchers while remaining flexible and responsive to changing contexts.
Available tools for big data analysis can be expensive, lack transparency, or be inappropriate for social science research. A maintained central library of available tools, with appropriate documentation and guidance, could be extremely useful.