Social media data has been regularly employed by researchers seeking to understand the COVID-19 pandemic, from work to track the virus’ early spread in Wuhan to simulations that predict interaction levels in a given population. Trying to keep up with the breakneck pace of virus-related data on social media, however, is a challenging task for many researchers. Now, the Texas Advanced Computing Center (TACC) is providing a virtual treasure trove of COVID-19-related social media data for researchers – made possible by its supercomputers.
TACC gobbled up around 40 million tweets a day starting in March, then combined that data with similar data from four universities to incorporate trends from January and February. That data is now available in a GitHub repository, which will also contain a series of TACC-led analyses.
First in that series is a collection of the top thousand one-, two-, and three-word sequences (called “n-grams”) for each day of the pandemic – a massive analysis enabled by TACC’s supercomputing power. Next will be an analysis of terms that frequently appear in conjunction with each other (expected in the GitHub repository in the coming weeks). After that, the world is TACC’s oyster; the researchers are looking at creating a searchable database, identifying people and organizations in tweets, and even automatically detecting and categorizing events.
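For readers unfamiliar with n-grams, the daily top-n-gram analysis described above can be illustrated with a minimal Python sketch. This is not TACC's pipeline: the tokenizer, the `(day, text)` input format, and the function names `tokenize`, `ngrams`, and `top_ngrams_by_day` are all assumptions made for illustration, and the sample tweets are invented.

```python
import re
from collections import Counter, defaultdict

# Assumption: a deliberately simple tokenizer (lowercase letters, digits, apostrophes).
# A production pipeline would handle hashtags, mentions, URLs, and emoji more carefully.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams_by_day(tweets, max_n=3, top_k=1000):
    """Count 1- to max_n-grams per day and keep the top_k most frequent of each size.

    `tweets` is an iterable of (day, text) pairs; this input shape is hypothetical.
    Returns a dict mapping (day, n) to a list of (ngram, count) pairs.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            counts[(day, n)].update(ngrams(tokens, n))
    return {key: counter.most_common(top_k) for key, counter in counts.items()}


# Invented sample data, for illustration only.
sample = [
    ("2020-03-25", "Coronavirus cases rise"),
    ("2020-03-25", "coronavirus testing expands"),
    ("2020-04-26", "COVID cases fall"),
]
result = top_ngrams_by_day(sample, top_k=2)
print(result[("2020-03-25", 1)][0])
```

At tens of millions of tweets a day, the counting step would typically be split across many workers and the per-worker counters merged, which is where supercomputing resources come in; the sketch above runs on a single machine and keeps everything in memory.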
The top 20 daily words in the Twitter data sample changing over time, March 25 to April 26. In that time, “COVID” overtook “coronavirus” as the most popular term in the dataset. Image courtesy of TACC.
“There’s a large amount of interest in these types of collections. It’s very useful in data science,” said Weijia Xu, manager of TACC’s Scalable Computational Intelligence group, in an interview with TACC’s Aaron Dubrow. “We’re mostly interested in letting people access curated datasets and helping them do research. We’re collecting, cleaning up, and processing data so it’s ready for others to use.”
The research, which is enabled by TACC’s machine learning text analysis tools, is already being picked up by researchers at the University of Texas at Austin.
“The TACC COVID-19 Twitter collection will be invaluable in enabling us to model communication patterns and topics that emerge across stages of the disease,” said Sharon Strover, a professor of communication at UT Austin. “We may be able to compare the timeline to similar data from other countries such as China that experienced the epidemic earlier. This may lead us toward understanding when typical responses occur and help us to characterize how populations make sense of health pandemics at certain stages in an epidemic’s process.”
Other researchers plan to use the dataset to study COVID-19 fake news trends, the spread of racist messaging and more.
“The large volume of tweets collected at TACC provides a valuable data source to explore various perspectives on COVID-19,” said Ruizhu Huang, a TACC research associate. “And the storage and supercomputing power at TACC will tremendously speed up the data analysis process.”
To read more, visit the article from TACC’s Aaron Dubrow here.