Crowdsourcing Predictors of Behavioral Outcomes
Generating models from large data sets, and determining which subsets of data to mine, is becoming increasingly automated. However, choosing what data to collect in the first place requires human intuition or experience, usually supplied by a domain expert. This paper describes a new approach to machine science which demonstrates for the first time that non-domain experts can collectively formulate features, and provide values for those features, such that they are predictive of some behavioral outcome of interest. This was accomplished by building a web platform in which human groups interact to both respond to questions likely to help predict a behavioral outcome and pose new questions to their peers. This cooperative behavior produces a dynamically growing online survey and also leads to models that can predict users’ outcomes based on their responses to the user-generated survey questions. Here we describe two web-based experiments that instantiate this approach: the first site led to models that can predict users’ monthly electric energy consumption; the other led to models that can predict users’ body mass index. As exponential increases in content are often observed in successful online collaborative communities, the proposed methodology may, in the future, lead to similar exponential rises in discovery and insight into the causal factors of behavioral outcomes.
There are many problems in which one seeks to develop predictive models to map between a set of predictor variables and an outcome. Statistical tools such as multiple regression or neural networks provide mature methods for computing model parameters when the set of predictive covariates and the model structure are pre-specified. Furthermore, recent research is providing new tools for inferring the structural form of non-linear predictive models, given good input and output data. However, choosing which potentially predictive variables to study is largely a qualitative task that requires substantial domain expertise. For example, a survey designer must have domain expertise to choose questions that will identify predictive covariates. Similarly, an engineer must develop substantial familiarity with a design in order to determine which variables can be systematically adjusted to optimize performance.
The need for the involvement of domain experts can become a bottleneck to new insights. However, if the wisdom of crowds could be harnessed to produce insight into difficult problems, one might see exponential rises in the discovery of the causal factors of behavioral outcomes, mirroring the exponential growth in other online collaborative communities. Thus, the goal of this research was to test an alternative approach to modeling in which the wisdom of crowds is harnessed both to propose potentially predictive variables to study, by asking questions, and to respond to those questions, in order to develop a predictive model.
This paper introduces, for the first time, a method by which non-domain experts can be motivated to formulate independent variables as well as populate enough of these variables for successful modeling. In short, this is accomplished as follows. Users arrive at a website in which a behavioral outcome is to be modeled. Users provide their own outcome and then answer questions that may be predictive of that outcome. Periodically, models that predict each user’s behavioral outcome are constructed against the growing data set. Users may also pose their own questions that, when answered by other users, become new independent variables in the modeling process. In essence, the task of discovering and populating predictive independent variables is outsourced to the user community.
The rapid growth in user-generated content on the Internet is an example of how bottom-up interactions can, under some circumstances, effectively solve problems that previously required explicit management by teams of experts. Harnessing the experience and effort of large numbers of individuals is frequently known as “crowdsourcing” and has been used effectively in a number of research and commercial applications. For an example of how crowdsourcing can be useful, consider Amazon’s Mechanical Turk. In this crowdsourcing tool a human describes a “Human Intelligence Task” such as characterizing data, transcribing spoken language, or creating data visualizations. By involving large groups of humans in many locations it is possible to complete tasks that are difficult to accomplish with computers alone, and would be prohibitively expensive to accomplish through traditional expert-driven processes.
- Investigator Behavior
- User Behavior
- Model Behavior
1. Investigator Behavior
The investigator is responsible for initially creating the web platform and seeding it with a starting question. Then, as the experiment runs, the investigator filters new survey questions generated by the users.
Once a user posed a question, the investigator reviewed it for suitability. A question was deemed unsuitable if any of the following conditions were met:
(1) the question revealed the identity of its author (e.g. “Hi, I am John Doe. I would like to know if…”) thereby contravening the Institutional Review Board approval for these experiments;
(2) the question contained profanity or hateful text;
(3) the question was inappropriately correlated with the outcome (e.g. “What is your BMI?”).
If the question was deemed suitable it was added to the pool of questions available on the site; otherwise the question was discarded.
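The review step above can be pictured as a small moderation queue. The following is only a minimal sketch under stated assumptions: the class and method names (QuestionModerationQueue, submit, review) and the RejectionReason enum are hypothetical and do not come from the original platform. The judgment itself stays with the investigator; the code only records the decision.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Reasons for discarding a question, mirroring conditions (1) to (3) above. Names are illustrative. */
enum RejectionReason {
    REVEALS_AUTHOR_IDENTITY,
    PROFANITY_OR_HATEFUL_TEXT,
    TRIVIALLY_CORRELATED_WITH_OUTCOME
}

/** Hypothetical moderation queue: posed questions wait here until the investigator reviews them. */
class QuestionModerationQueue {
    private final List<String> pending = new ArrayList<>();
    private final List<String> pool = new ArrayList<>();

    /** A user poses a question; it is not yet visible on the site. */
    void submit(String questionText) {
        pending.add(questionText);
    }

    /**
     * Records the investigator's decision for a pending question.
     * A null reason means the question is suitable and joins the pool;
     * any non-null reason means the question is discarded.
     * Returns true if the question was added to the pool.
     */
    boolean review(String questionText, RejectionReason reason) {
        if (!pending.remove(questionText)) {
            throw new IllegalArgumentException("Not awaiting review: " + questionText);
        }
        if (reason == null) {
            pool.add(questionText);
            return true;
        }
        return false;
    }

    /** Read-only view of the questions currently available to users. */
    List<String> pool() {
        return Collections.unmodifiableList(pool);
    }

    /** Number of questions still waiting for a decision. */
    int pendingCount() {
        return pending.size();
    }
}
```

In this sketch, calling review with a null reason approves the question, while passing, for example, RejectionReason.TRIVIALLY_CORRELATED_WITH_OUTCOME for “What is your BMI?” discards it.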
2. User Behavior
Users who visit the site first provide their individual value for the outcome of interest. Users may then respond to questions found on the site. Their answers are stored in a common data set and made available to the modeling engine.
At any time a user may elect to pose a question of their own devising. Users could pose questions that required a yes/no response, a five-level Likert rating, or a number. Users were not constrained in what kinds of questions to pose.
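The three response formats named above (yes/no, five-level Likert, number) could be represented as follows. This is a hedged sketch, not the platform's actual code: the ResponseType enum, its parse method, and the numeric encoding (yes as 1, no as 0, Likert levels as 1 through 5) are all assumptions made for illustration.

```java
import java.util.Locale;

/** Hypothetical representation of the three response formats a question may require. */
enum ResponseType {
    YES_NO, LIKERT_5, NUMERIC;

    /**
     * Converts raw form input into a number usable as a predictor value.
     * Throws IllegalArgumentException if the input does not fit the format.
     * The encoding (yes = 1, no = 0, Likert = 1..5) is an assumption of this sketch.
     */
    double parse(String raw) {
        String s = raw.trim();
        switch (this) {
            case YES_NO:
                String lower = s.toLowerCase(Locale.ROOT);
                if (lower.equals("yes")) return 1.0;
                if (lower.equals("no")) return 0.0;
                throw new IllegalArgumentException("Expected yes or no: " + raw);
            case LIKERT_5:
                // NumberFormatException is a subclass of IllegalArgumentException.
                int level = Integer.parseInt(s);
                if (level < 1 || level > 5) {
                    throw new IllegalArgumentException("Likert level must be 1 to 5: " + raw);
                }
                return level;
            default:
                double value = Double.parseDouble(s);
                if (!Double.isFinite(value)) {
                    throw new IllegalArgumentException("Expected a finite number: " + raw);
                }
                return value;
        }
    }
}
```

Mapping every format to a number keeps all answers in one numeric data set, which is convenient for a modeling step, although other encodings would be equally defensible.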
3. Model Behavior
The modeling engine continually generates predictive models using the survey questions as candidate predictors of the outcome and users’ responses as the training data.
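The text does not specify which modeling technique the engine uses, so the following is only a simple stand-in: it ranks candidate questions by the absolute Pearson correlation between users' responses and their outcomes. The class name PredictorRanking, the use of NaN to mark an unanswered question, and the data layout (one array entry per user) are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Illustrative stand-in for a modeling step: ranks questions by correlation with the outcome. */
class PredictorRanking {

    /**
     * Pearson correlation between responses x and outcomes y, computed only over
     * users who answered (x[i] is not NaN). Returns 0 when the correlation is undefined.
     */
    static double correlation(double[] x, double[] y) {
        int n = 0;
        double sumX = 0, sumY = 0;
        for (int i = 0; i < x.length; i++) {
            if (Double.isNaN(x[i])) continue;   // user skipped this question
            n++;
            sumX += x[i];
            sumY += y[i];
        }
        if (n < 2) return 0.0;
        double meanX = sumX / n, meanY = sumY / n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < x.length; i++) {
            if (Double.isNaN(x[i])) continue;
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0 || syy == 0) return 0.0;   // constant responses or outcomes
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Returns the questions ordered from strongest to weakest absolute correlation. */
    static List<String> rank(Map<String, double[]> responsesByQuestion, double[] outcome) {
        List<String> questions = new ArrayList<>(responsesByQuestion.keySet());
        questions.sort(Comparator.comparingDouble(
                (String q) -> -Math.abs(correlation(responsesByQuestion.get(q), outcome))));
        return questions;
    }
}
```

Because new questions and answers keep arriving, a ranking like this would be recomputed periodically against the growing data set. A fuller engine would fit multivariate models rather than score one question at a time.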
H/W System Configuration:
Processor - 1.2 GHz
RAM - 512 MB (min)
Hard Disk - 120 GB
Floppy Drive - 1.44 MB
Keyboard - Standard Windows Keyboard
Mouse - Two or Three Button Mouse
Monitor - SVGA
S/W System Configuration:
Operating System - Windows 95/98/2000/XP
Application Server - Tomcat 5.0/6.x
Front End - HTML, Java, JSP
Server Side Script - JavaServer Pages (JSP)
Database - MySQL
Database Connectivity - JDBC