Friday, 28 February 2014

Using “Small Data” to Improve the Use of “Big Data”

This post was first published on Survey Post on Feb. 3rd, 2014.
Recently, I attended two statistical events in the Washington, DC, area: one was the 23rd Morris Hansen Lecture on “Envisioning the 2030 U.S. Census”; the other was the SAMSI workshop on “Computational Methods for Censuses and Surveys.” “Big data” was a popular keyword at both events and stirred up discussions on how to use it (for example, from administrative records and online data sources) for current government statistics, especially when combining big data with traditional survey data.
Statisticians are exploring new ways in which big data can be used. The U.S. Census Bureau has initiated investigations into using administrative records in the 2020 Census. The National Center for Health Statistics (NCHS) has identified some research opportunities combining multiple data sources. University-based researchers have launched studies on the use of Google Trends and other online data in small area estimation.
When big data dominated the mainstream discussion at these events, I started thinking more about “small data.” Can small data help us make better use of big data? Here are some of my thoughts.
  1. Applying a conventional sampling-based approach to big data: more and more administrative records are collected electronically. Statisticians are excited about using these records, which may contain information on the entire population, for analytic purposes, and the literature of the past two decades has extensively discussed their advantages. Processing administrative records, however, can be quite time consuming, and running analyses on such large datasets can be cumbersome. In particular, when analysts use conventional statistical software such as SAS, Stata and R, it becomes increasingly complex to handle, store and analyze these data. The question is: is there a way to reduce the data volume and increase computational speed? Applying a conventional sampling-based approach (e.g. optimal sampling, calibration weighting) may make big data smaller and more manageable while allowing researchers to maintain decent data quality.
  2. Combining non-probability sample data with probability sample data: many big data sources, such as data collected by Google/Twitter/Facebook, are not census (population) data. We may treat them as non-probability sample data. Elements are chosen arbitrarily in these datasets and there is no way to estimate the probability that each element in the population will be included. Nor is it guaranteed that each element has a chance of being included, which makes it impossible to assess either the validity (typically measured in terms of “bias”) or the reliability (typically measured in terms of “variance”) of the data. One way to make the data more representative of the entire population is to combine them with probability sample data (e.g. survey data), which can be much smaller. This method can also help us estimate sampling variability and identify potential bias in big data.
  3. Using high-quality small data for measuring and adjusting errors in big data: big data is not only non-representative of the target population, but also carries a good deal of measurement error, because the construct behind a particular measure in these data can differ from the construct that analysts require. To evaluate errors in big data and improve precision, small survey data can be collected for validation. Take the National Health Interview Survey (NHIS) as an example. This is a household interview survey with only self-reported data. To improve analyses of the NHIS self-reported data, an imputation-based strategy was implemented that uses clinical information from an examination-based health survey, the National Health and Nutrition Examination Survey (NHANES), to predict clinical values from self-reported values and covariates. Estimates of health measures based on the multiply imputed clinical values differ from those based on the NHIS self-reported data alone and have smaller estimated standard errors than those based solely on the NHANES clinical data. Similarly, we may assess potential errors in big data through a more sophisticated and accurate small survey.
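To make the first two ideas a little more concrete, here is a minimal Python sketch on synthetic data. It is only an illustration under simplifying assumptions, not a method presented at either event: the two-region population, the `region` and `y` fields, the 5% sampling fraction, and all function names are hypothetical. The first part draws a stratified simple random sample from a large file and attaches design weights; the second part post-stratifies a convenience (non-probability) sample to population totals estimated from that probability sample.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_key, fraction, seed=0):
    """Simple random sample within each stratum, with design weights N_h / n_h."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[stratum_key]].append(rec)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        weight = len(members) / n  # each sampled record stands for N_h / n_h records
        sample.extend({**rec, "weight": weight} for rec in rng.sample(members, n))
    return sample


def weighted_mean(sample, value_key):
    """Weighted mean of value_key using each record's 'weight'."""
    total_weight = sum(rec["weight"] for rec in sample)
    return sum(rec["weight"] * rec[value_key] for rec in sample) / total_weight


def weighted_totals(sample, stratum_key):
    """Estimated population size of each stratum (sum of weights)."""
    totals = defaultdict(float)
    for rec in sample:
        totals[rec[stratum_key]] += rec["weight"]
    return dict(totals)


def poststratify(sample, stratum_key, population_totals):
    """Give each record the weight (population total in its cell) / (records in its cell)."""
    counts = defaultdict(int)
    for rec in sample:
        counts[rec[stratum_key]] += 1
    return [
        {**rec, "weight": population_totals[rec[stratum_key]] / counts[rec[stratum_key]]}
        for rec in sample
    ]


def demo(seed=42):
    rng = random.Random(seed)
    # Hypothetical "big" file: 8,000 urban and 2,000 rural records.
    population = (
        [{"region": "urban", "y": rng.gauss(60, 10)} for _ in range(8000)]
        + [{"region": "rural", "y": rng.gauss(20, 10)} for _ in range(2000)]
    )
    true_mean = sum(rec["y"] for rec in population) / len(population)

    # Idea 1: analyze a 5% stratified sample instead of the full file.
    prob_sample = stratified_sample(population, "region", fraction=0.05, seed=seed)
    sampled_mean = weighted_mean(prob_sample, "y")

    # Idea 2: a convenience sample that over-represents urban records (90% vs 80%).
    urban = [rec for rec in population if rec["region"] == "urban"][:900]
    rural = [rec for rec in population if rec["region"] == "rural"][:100]
    convenience = urban + rural
    naive_mean = sum(rec["y"] for rec in convenience) / len(convenience)

    # Use stratum totals estimated from the probability sample to reweight it.
    totals = weighted_totals(prob_sample, "region")
    adjusted_mean = weighted_mean(poststratify(convenience, "region", totals), "y")

    return {
        "true": true_mean,
        "sampled": sampled_mean,
        "naive": naive_mean,
        "adjusted": adjusted_mean,
        "weight_sum": sum(rec["weight"] for rec in prob_sample),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value:.2f}")
```

In this toy setup the unweighted mean of the convenience sample is pulled toward the urban value, while the post-stratified mean moves back toward the population mean. Real applications would need more adjustment cells, and post-stratification only corrects imbalance on the variables used to define those cells.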
While big data provides us with massive and timely information from various sources (e.g. social media, administrative records), small data is simple, easy to collect and process, and can be more accurate and representative. Can small data help you when dealing with your big data problems?


Dan Liao is a research statistician at RTI International. She currently works on multiple aspects of data processing and analysis for large, multistage surveys of health care in the United States, including sampling design, calibration weighting, data editing and imputation, statistical disclosure control, and the analysis of survey data. Her survey research interests include multiphase survey designs, combining survey and administrative data, domain estimation, calibration weighting, and regression diagnostics for complex survey data. Dan has a PhD in Survey Methodology from the Joint Program in Survey Methodology at the University of Maryland and has published research focusing on regression diagnostics, calibration weighting and predictive modeling.
