Researchers develop AI model that detects mental disorders using Reddit posts
Researchers at Dartmouth College have developed an artificial intelligence (AI) model that can be used to predict mental disorders using data from conversations on Reddit, according to an article by the university. Researchers Xiaobo Guo, Yaojia Sun and Soroush Vosoughi presented a paper titled “Emotion-based Modeling of Mental Disorders on Social Media” at the 20th International Conference on Web Intelligence and Intelligent Agent Technology.
According to the paper, most existing AI models of this kind rely on psycho-linguistic analysis of the content of user-generated text. Despite performing well, content-based representation models are affected by domain and topic bias.
Speaking to a Dartmouth science writer, Vosoughi explained that if a model learns to correlate the word “COVID” with “sadness” or “anxiety”, it will automatically assume that a scientist doing COVID research and posting about it is suffering from depression and anxiety.
The new model suppresses these topic-specific biases by relying entirely on emotional states, learning nothing about the topics described in posts.
To train the model, researchers collected two sets of data covering 2011 to 2019: the first was a dataset of users with one of three emotional disorders of interest (major depressive, anxiety and bipolar disorders) and the second was a dataset of users without known mental disorders, which acted as a control group.
The first dataset was collected based on self-reported mental disorders: the researchers searched for users who had made posts or comments saying something similar to “I was diagnosed with bipolar/depression/anxiety”. Only posts made before the self-report were considered for the research, because prior work had shown that users’ realisation that they have a disorder changes how they behave online and creates a bias.
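The selection step described above can be illustrated with a minimal Python sketch. It is not the researchers' code: the regular expression, the function name `posts_before_self_report` and the `(timestamp, text)` input format are all assumptions made for illustration, since the article does not give the exact phrases that were matched.

```python
import re
from datetime import datetime

# Hypothetical pattern: the exact wording the researchers matched is not
# given in the article, so this only covers the quoted example phrase.
SELF_REPORT = re.compile(
    r"\bI\s+was\s+diagnosed\s+with\s+(bipolar|depression|anxiety)\b",
    re.IGNORECASE,
)


def posts_before_self_report(posts):
    """Find a user's first self-report and keep only earlier posts.

    posts: list of (timestamp, text) tuples for a single user.
    Returns (disorder, earlier_posts), or None if no self-report is found.
    """
    ordered = sorted(posts, key=lambda post: post[0])
    for timestamp, text in ordered:
        match = SELF_REPORT.search(text)
        if match:
            # Discard the self-report itself and everything after it.
            earlier = [post for post in ordered if post[0] < timestamp]
            return match.group(1).lower(), earlier
    return None


example_posts = [
    (datetime(2017, 1, 1), "Doing better these days"),
    (datetime(2015, 1, 1), "Feeling tired lately"),
    (datetime(2016, 3, 1), "I was diagnosed with bipolar last week"),
]
result = posts_before_self_report(example_posts)
```

In this example only the 2015 post is kept, because it is the only one written before the self-report.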
Researchers then ensured that the data belonging to the four classes (one for each disorder of interest and one control group) had similar temporal distributions, meaning that the posts in each class were spread over time in a similar way. The datasets were also balanced, with 1,997 users in each class.
After this, the researchers split the data into training (70%), validation (15%) and test (15%) sets. After training and testing the model, they found that the emotion-based representation model was more accurate in predicting disorders than a content-based method using TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF scores how important a term is to a post, based on how often the term appears in that post and how rare it is across all posts.
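To make the split and the TF-IDF baseline more concrete, here is a rough Python sketch. It makes several assumptions that the article does not confirm: that the split is done per user within each class, that rounding falls as shown, and that TF-IDF uses the plain `tf * log(N / df)` variant (many variants exist). The function names `split_users` and `tf_idf` are made up for this example.

```python
import math
import random
from collections import Counter


def split_users(user_ids, seed=0):
    """Shuffle user IDs and split them 70/15/15 (assumed per-user split)."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.70 * len(ids))
    n_val = int(0.15 * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def tf_idf(docs):
    """Score each term in each tokenized document.

    docs: list of token lists. Returns one {term: score} dict per document,
    using tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# One class of 1,997 users, as reported in the article.
train, val, test = split_users(range(1997))

# Toy posts: "covid" appears in every post, so it carries no weight.
scores = tf_idf([["covid", "research"], ["covid", "sad"]])
```

A term that appears in every post, such as "covid" in the toy example, gets a score of zero, while rarer terms score higher. This is the kind of content signal that ties a TF-IDF model to the topics in its training data.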