Social scientists are seeing new red flags in their field’s predicted big-data future, finding computerised analyses not just vulnerable to bias but perhaps fundamentally limited in their predictive value.
Concern is rising after a large-scale study in which 160 academic research teams, organised by Princeton University sociologists, used machine-learning methods to predict the life pathways of disadvantaged children.
“The best predictions were not very accurate and were only slightly better” than those developed in traditional models using far fewer data inputs, the Princeton team reported in PNAS.
That result is a major warning sign for the quickly expanding ranks of computer-heavy approaches to the social sciences, said Filiz Garip, a professor of sociology at Cornell University who was not part of the Princeton study.
At Cornell, for instance, between a third and half of graduate students in the social sciences are already taking classes in machine learning, said Professor Garip, who assessed the Princeton experiment for a subsequent PNAS article.
“Everybody feels like they need to learn this, they need to gain these skills, to find any kind of job,” she said in an interview. Yet so far, as the Princeton study showed, “we’re not gaining a whole lot by using these methods”, she said.
The findings come as social scientists are already on the defensive over indications that using large databases and sophisticated computer programmes to guide political and legal decisions may be reinforcing and institutionalising human biases.
Long-recognised examples include predictive algorithms that identify black defendants as posing a greater risk of future crime because their community histories often show relatively high levels of police attention.
Advocates of such data-driven assessments have argued that problems within algorithms can eventually be identified and eliminated, thereby making them less biased than decisions that rely on humans alone.
The Princeton study, meanwhile, raises the question of whether the teaching of basic skills and perspectives in the social sciences may be getting pushed aside by an overriding desire to amass and analyse the vast troves of data that can be found on almost any human these days.
Such volumes of data may be adding more confusion than clarity, outstripping social scientists' capacity to understand what each individual piece of data actually contributes to the answer they need, Professor Garip said.
For the Princeton study, the participating research teams were given nearly 13,000 pieces of data on each of 4,200 families with a child who was born in a large US city around the year 2000, derived largely from visits, assessments and questionnaires over the following years with the child, parents, caregivers and teachers.
Given that information for those children up to age 9, the teams were asked to predict various outcomes for the child and family at age 15, including children's school grades and parents' job success.
The teams broadly failed to create computer-aided models that painted a picture of how societal conditions affect people's lives any better than traditional social-science analyses using far less subject data, the Princeton team wrote.
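The dynamic the study points to can be illustrated with a toy sketch — not the study's actual data or code, just hypothetical synthetic numbers in which only a handful of inputs carry real signal. A model that ingests every available variable can then predict held-out outcomes *worse* than a small model built on the right few inputs, because the extra variables add estimation noise rather than information:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the study's setup: many mostly
# uninformative inputs, an outcome driven by only a few of them.
n_train, n_test, n_features = 300, 1000, 250
X = rng.normal(size=(n_train + n_test, n_features))
beta = np.zeros(n_features)
beta[:5] = 1.0  # only the first 5 inputs carry signal
y = X @ beta + rng.normal(scale=2.0, size=n_train + n_test)

X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

def holdout_r2(cols):
    """Fit least squares on the chosen columns; return held-out R^2."""
    cols = list(cols)
    A_tr = np.column_stack([np.ones(n_train), X_tr[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    A_te = np.column_stack([np.ones(n_test), X_te[:, cols]])
    pred = A_te @ coef
    return 1 - np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2)

r2_small = holdout_r2(range(5))           # a few well-chosen inputs
r2_big = holdout_r2(range(n_features))    # throw every variable in
print(f"small model R^2: {r2_small:.3f}")
print(f"big model   R^2: {r2_big:.3f}")
```

On this synthetic data the small model clearly outperforms the kitchen-sink one out of sample — a simplified version of the gap the Princeton teams ran into, though their task involved real families and far richer methods.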
The Princeton authors, led by sociology professors Matthew Salganik and Sara McLanahan, said they expect their social science colleagues will, in coming years, keep improving their methods of big data computer analysis.
Further experimentation, they said, should also help their field better understand what types of societal problems may justify scientists pursuing individual-level predictions, rather than being content with broader understandings of how policies affect people.
Professor Garip said she agreed with such perspectives. But in the meantime, she cautioned, large numbers of younger social scientists and their universities may be betting too heavily on data-intensive training.
“We have to be careful,” she said, “of jumping on this trend or hype.”