Friday, 28 February 2014

Using “Small Data” to Improve the Use of “Big Data”

Digital Globe
This post was first published on Survey Post on Feb. 3rd, 2014.
Recently, I attended two statistical events in the Washington, DC, area: one was the 23rd Morris Hansen Lecture on “Envisioning the 2030 U.S. Census”; the other was the SAMSI workshop on “Computational Methods for Censuses and Surveys.” “Big data” was a popular keyword at both events and stirred up discussions on how to use such data (for example, from administrative records and online sources) for current government statistics, especially in combination with traditional survey data.
Statisticians are exploring new ways in which big data can be used. The U.S. Census Bureau has initiated investigations on using administrative records in the 2020 Census. The National Center for Health Statistics (NCHS) has identified some research opportunities combining multiple data sources. University-based researchers have launched studies on the use of Google Trends and other online data in small area estimation.
When big data dominated the mainstream discussion at these events, I started thinking more about “small data.” Can small data help us make better use of big data? Here are some of my thoughts.
  1. Applying a conventional sampling-based approach to big data: more and more administrative records are collected electronically. Statisticians are excited about using these records, which may contain information on the entire population, for analytic purposes. The literature of the past two decades has extensively discussed the advantages of administrative records. Processing administrative records data, however, can be quite time consuming, and running analyses on these datasets can be cumbersome because of their sheer volume. Especially when analysts use conventional statistical software, such as SAS, Stata and R, it becomes increasingly complex to handle, store and analyze these data. The question is: is there a way to reduce the data volume and increase computational speed? Applying a conventional sampling-based approach (e.g. optimal sampling, calibration weighting) may make big data smaller and more manageable while allowing researchers to maintain decent data quality.
  2. Combining non-probability sample data with probability sample data: many big data sources, such as data collected by Google/Twitter/Facebook, are not census (population) data. We may treat them as non-probability sample data. Elements are chosen arbitrarily in these datasets and there is no way to estimate the probability that each element in the population will be included. Also, it is not guaranteed that each element has a chance of being included, making it impossible to assess either the validity (usually measured in terms of “bias”) or the reliability (usually measured in terms of “variance”) of the data. One way to make the data more representative of the entire population is to combine them with probability sample data (e.g. survey data), which can be relatively small. This method can also assist us in estimating sampling variability and identifying potential bias in big data.
  3. Using high-quality small data for measuring and adjusting errors in big data: big data are often not only non-representative of the target population, but can also carry a heavy load of measurement error because the construct behind a particular measure in these data can differ from the construct that analysts require. To evaluate errors in big data and improve precision, small survey data can be collected for validation. Take the National Health Interview Survey (NHIS) as an example. This is a household interview survey with only self-reported data. To improve analyses of the NHIS self-reported data, an imputation-based strategy was implemented that uses clinical information from an examination-based health survey (the National Health and Nutrition Examination Survey, NHANES) to predict clinical values from self-reported values and covariates. Estimates of health measures based on the multiply imputed clinical values differ from those based on the NHIS self-reported data alone and have smaller estimated standard errors than those based solely on the NHANES clinical data. Similarly, we may assess potential errors in big data through a more sophisticated and accurate small survey.
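To make the first idea above a little more concrete, here is a minimal Python sketch of drawing a stratified sample from a large file and attaching design weights. Everything in it is an assumption made for illustration: the synthetic "administrative file", the `region` and `cost` fields, and the sample size of 1,000 per stratum are all hypothetical, and simple random sampling within strata stands in for whatever optimal design a real project would use.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_key, n_per_stratum, seed=0):
    """Draw a simple random sample of fixed size within each stratum.

    Returns (record, weight) pairs, where weight = N_h / n_h is the
    design weight for a record sampled from stratum h.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[stratum_key]].append(rec)

    sample = []
    for members in strata.values():
        n_h = min(n_per_stratum, len(members))
        weight = len(members) / n_h
        for rec in rng.sample(members, n_h):
            sample.append((rec, weight))
    return sample


def weighted_mean(sample, value_key):
    """Weighted mean of value_key over (record, weight) pairs."""
    total_weight = sum(w for _, w in sample)
    return sum(rec[value_key] * w for rec, w in sample) / total_weight


# Hypothetical administrative file: 200,000 synthetic records.
data_rng = random.Random(42)
records = [
    {"region": region, "cost": data_rng.gauss(mu, 10.0)}
    for region, mu, size in [("north", 50.0, 120_000), ("south", 80.0, 80_000)]
    for _ in range(size)
]

sample = stratified_sample(records, "region", n_per_stratum=1_000)
full_mean = sum(r["cost"] for r in records) / len(records)
sample_mean = weighted_mean(sample, "cost")
print(f"full file: {full_mean:.2f}  weighted sample: {sample_mean:.2f}")
```

The point of the sketch is only that a 1% sample with appropriate weights can stand in for the full file when estimating a mean; calibration weighting would go one step further and adjust these weights to known population totals.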
While big data provides us with massive and timely information from various sources (e.g. social media, administrative records), small data is simple, easy to collect and process, and can be more accurate and representative. Can small data help you when dealing with your big data problems?
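As a rough illustration of the second idea, the sketch below post-stratifies a non-probability "big" dataset so that its age-group shares match those estimated from a small weighted probability survey. Post-stratification is only one of several possible ways of combining the two sources, and it is my choice for this example. All data, field names (`age`, `uses`) and weights are made up, and the adjustment assumes that, within each age group, the big data resemble the population, which may well not hold in practice.

```python
from collections import Counter, defaultdict


def shares_from_survey(survey):
    """Estimate population shares per group from (group, weight) pairs."""
    totals = defaultdict(float)
    for group, weight in survey:
        totals[group] += weight
    grand_total = sum(totals.values())
    return {group: total / grand_total for group, total in totals.items()}


def poststratify(big_records, group_key, target_shares):
    """Weight each record so weighted group shares match target_shares."""
    counts = Counter(rec[group_key] for rec in big_records)
    n = len(big_records)
    return [
        (rec, target_shares[rec[group_key]] / (counts[rec[group_key]] / n))
        for rec in big_records
    ]


def weighted_mean(weighted, value_key):
    """Weighted mean of value_key over (record, weight) pairs."""
    total_weight = sum(w for _, w in weighted)
    return sum(rec[value_key] * w for rec, w in weighted) / total_weight


# Hypothetical probability survey: (age group, survey weight) pairs.
survey = [("young", 150.0)] * 2 + [("old", 175.0)] * 4

# Hypothetical big data in which young people are heavily over-represented.
big = (
    [{"age": "young", "uses": 1} for _ in range(4_800)]
    + [{"age": "young", "uses": 0} for _ in range(3_200)]
    + [{"age": "old", "uses": 1} for _ in range(400)]
    + [{"age": "old", "uses": 0} for _ in range(1_600)]
)

targets = shares_from_survey(survey)
raw_mean = sum(rec["uses"] for rec in big) / len(big)
adjusted_mean = weighted_mean(poststratify(big, "age", targets), "uses")
print(f"unadjusted: {raw_mean:.2f}  post-stratified: {adjusted_mean:.2f}")
```

In this toy example the unadjusted share is pulled upward by the over-represented young group, and the survey-based weights pull it back toward the population mix.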

Dan Liao is a research statistician at RTI International. She currently works on multiple aspects of data processing and analysis for large, multistage surveys of health care in the United States, including sampling design, calibration weighting, data editing and imputation, statistical disclosure control, and the analysis of survey data. Her survey research interests include multiphase survey designs, combining survey and administrative data, domain estimation, calibration weighting, and regression diagnostics for complex survey data. Dan has a PhD in Survey Methodology from the Joint Program in Survey Methodology at the University of Maryland and has published research focusing on regression diagnostics, calibration weighting and predictive modeling.

Annual SRA conference on social media in social research

The 4th Social Media in Social Research event is being held on 16th May in London. It would be great if lots of network members submitted papers, so please do get your thinking caps on.

The call for papers closes on 10th March.

We'd like to start sharing news about network members who are presenting social media research papers at upcoming conferences, so please let us know if you're presenting at this or another conference by adding a comment here or tweeting @NSMNSS.

SAGE Research Methods Cases

Over the last 40 years SAGE has become world famous for publishing the highest-quality and most cutting-edge books on research methods.  Yet research methods – the “how” of doing research – is still commonly considered a dry and abstract subject.  SAGE needs your help to change that!  

Launching in May 2014, SAGE Research Methods Cases is an innovative and exciting new collection of hundreds of real-life research projects distilled into accessible, peer-reviewed case studies. Our goal is to bring methodological concepts and problems to life in a fully comprehensive resource. The collection will be published online on the award-winning SAGE Research Methods platform.

Get Involved!
We plan to commission cases for SAGE Research Methods Cases on an ongoing basis, so we are still looking for case authors. Have you undertaken a research project? Can you discuss your methodological choices and challenges in an accessible and engaging way? We are looking for short original cases of between 2,000 and 5,000 words that put difficult and abstract methodological concepts into a real research context.

Do you experiment with new social media research? Would you like to discuss the implications of new, up-and-coming methodologies? Our ambition is to represent the breadth and depth of social science research and encompass the full range of possible methodological approaches! We welcome authors from across the methodological and disciplinary divides.

Want to know more?
Contact Bronia Flett for more information on how to submit a case of your own to the collection.
Visit to sign up for a free trial of SAGE Research Methods Cases.

Thursday, 27 February 2014

New Social Media, New Social Science… and New Ethical Issues!

We held a small event on Friday 21st February 2014. Our goal was to have a series of focused discussions around the ethical dimensions of social media research and to come up with a series of action points for the network to take forward. The day also included two presentations of related research by network members. You can read a Storify of the day.

We will post more about the outcomes and discussions shortly but we want to kick off by sharing links to the two reports and presentations. A team from NatCen has been researching the views of social media users on how their posted data should be handled by researchers. The findings are illuminating and will help to inform how we work with social media data in the future. You can read a post by the research team here, and then the findings here.

Now Janet Salmons introduces the research she has undertaken on what researchers need to help support their ethical practice when conducting social research online:

Our NSMNSS network has convened researchers from the UK and around the world in thought-provoking dialogue on topics related to scholarly use of the Internet and social media. Recurrent matters related to research ethics demonstrate that there are many questions and concerns about how to adapt conventional guidelines to different kinds of online research. In the atmosphere of collaboration and exchange that NSMNSS encourages, resources are often suggested. My curiosity led to two questions: to what extent do the resources suggested by the NSMNSS network address the concerns and questions raised by the NSMNSS network? What are the gaps and how can or should they be addressed? The report, “New Social Media, New Social Science… and New Ethical Issues!” is the result of this exploration.

Numerous concerns and queries emerged from network discussions, classified here as seven interrelated themes:
  • Participants: Issues related to online sampling and recruiting to find, screen, and select appropriate and verifiable research participants.
  • Identity: Issues related to the identity, anonymity and/or privacy of the participant and the researcher.
  • Research site: Issues related to the setting for the study or source of data.
  • Informed consent: Issues related to the determination of when consent is needed and what type of consent is adequate.
  • Data: Issues related to user-generated content and ownership and protection of data.
  • Research guidance: Issues related to the academic institutions and committees that prepare the next generation of researchers and must approve researchers’ proposals or decide whether their research is adequate for tenure or promotion.
  • Methods and methodologies: Issues related to implications of social media research for the ways we think about research methods and methodologies.
The first five themes relate specifically to designing, conducting and reporting on research. The final two themes raise larger issues for the field, which NSMNSS members characterize as multidisciplinary.

Many of the recommended guidelines and materials offered little or no advice about online research ethics. However, a few professional societies and organizations have made the effort either to create a set of guidelines specific to research on the Internet, or to create supplementary materials that focus on how to apply that profession’s ethical standards when conducting studies online and reporting on the findings. Examples of the latter were chosen for this review, including materials from the Association of Internet Researchers (AoIR), the British Educational Research Association (BERA), the British Psychological Society (BPS), CASRO, the European Society for Opinion and Marketing Research (ESOMAR), the Market Research Society (MRS) and the Marketing Research Association (MRA).

As you can see, in some cases the profiled guidelines make recommendations aligned with NSMNSS network needs and in other cases they instead identify other risks and concerns.
You can read the full report here or watch the presentation below.

Please use the comment box to add your thoughts, relevant experiences, or to suggest other resources. We are also planning a Tweetchat discussion of the report on Tuesday 11th March at 7pm GMT (8am NZDT/6am AEDT/8pm CET/9pm SAST/3pm EDT/1pm MDT), so please join us! You can read more about taking part in an NSMNSS Twitter chat here.

Wednesday, 19 February 2014

Ethical considerations in my research: My current stance

Amy Aisha Brown is a research student in the Faculty of Education and Language Studies at the Open University and is PhD Blogger for the NSMNSS Network. She reflects here on the current state of ethics for her research project, outlined in this post.

At a conference I attended recently, a plenary speaker gave an inspiring talk about using YouTube videos to research various sociolinguistic phenomena. However, I was a little surprised that he failed to mention any ethical considerations that researchers might need to take into account when using them. I say I was surprised, but I wasn’t really surprised: it seems to be a common argument in many fields that says, “if someone was happy enough to post it, and if anyone with an internet connection can access it, then why shouldn’t it be collected and researched?”

This argument is easy to apply to tweets (the data I use) because Twitter only lets you see and collect public tweets. Furthermore, users agree to their tweets being collected because when they sign up, they accept Twitter’s Terms of Service (TOS) that include warnings that broad re-use of content is both permitted and encouraged. The problem I have with this standpoint is that it treats tweets simply as texts and largely ignores the human beings who produce them.

In relation to my research, I find myself having to question whether this gives sufficient consideration to ethical issues, particularly the potential for my research to cause harm through the publication of examples from my dataset. For example, it is common practice to cite supporting examples in discourse analyses, but would there not be a real cause for concern if, for instance, a child user could be identified?

Some might argue that using Twitter data at all for my research is unethical because, without directly asking permission from every user who authored one of the tweets, it goes against the principle of informed consent (e.g., Davidson, 2012). Twitter users might have technically agreed to the TOS but how can I know that they ever read and understood them? The problem with this stance, however, is that it would prevent medium- to large-scale analyses of social media data, neglecting the social benefits of research.

My personal approach lies somewhere in the middle: I collect tweets and treat them as texts for the purposes of my research, but I also need to recognise the human element and assess the potential for harm at every stage of the research. The recent Ethics Guide of the Association of Internet Researchers (AoIR) has been influential in helping me come to this position, but it also points out that advice and support from people within and outside the field are necessary to make informed decisions in this emerging area. For this reason, it is going to be great to talk about ethics this week at an event with experts in the field and co-coordinators of the #NSMNSS network. Watch this space for feedback after the event.

Monday, 17 February 2014

Collecting stories, moving along

Keurkoon Phoomwittaya is a student in the Social Media MA at the University of Westminster.

They travelled with me to many places in my hometown in Thailand and flew across the sky to England. I take them on every journey I go on.

When I was young, I used to write stories about my travelling experiences in a secret notebook, which I kept in my backpack. Now I have a smartphone as another object that I use for capturing moments and telling my stories via social media platforms.

Turkle says that "We think with the objects we love; we love the objects we think with" (2007:5). The things that we take around us provoke emotions toward moments that we relate to.

Not only do we use mobile phones to share life moments online, we also observe the stories that our friends tell. We often imagine along with their stories, but we do not feel exactly what they feel.

Indeed, we know best our own embodied experience of being in a place, and why we choose to tell about it. The important reminder, then, is to see the values attached to the stories that people share about the space they are in (Farman, 2012). This is what I would like to bring into focus as I think of my research approach.

I am currently working on a postgraduate dissertation entitled “Girlguiding’s Use of Twitter in Storytelling”. My interest is in how people reflect on their experiences by sharing their stories via Twitter. Interestingly, Twitter’s hashtag search helps me as a researcher to find out what people say about events they are involved in.

What draws my attention are hashtags which create a collaboration of common values that people in social groups reflect through stories, even though they are in different locations. 

As a Girlguiding volunteer, I would like to investigate how using hashtags to tell stories via Twitter represents the organization’s values of giving girls and young women a space where they have fun and can be themselves.

A search for #girlguides on Twitter enables me to collect data on the values projected by messages and attached pictures. The expressions seen are of girls and young women having fun at concerts and campsites in different countries. Textual analysis will be the major approach, as it lets me see the bigger picture of how hashtags create a collective feminist identity among Girlguiding participants.

Qualitative interviews will be a minor method to let the participants reflect on how they use Twitter as a platform for sharing Girlguiding stories. 

I think meaningful research projects come from being part of social groups. When I went to a Girlguides’ campsite in London for the first time, I was impressed by its green and comforting environment. However, what interests me more are the stories.

Similarly, while people give meaning to communities they live in as part of their histories, mobile media lets us explore and tell our own experiences from every corner in which we belong.
  • Farman, J. 2012. Mobile Interface Theory: Embodied Space and Locative Media. United Kingdom: Routledge.
  • Turkle, S. 2007. Evocative Objects. United States: Massachusetts Institute of Technology.