Update (2/24/2014): It appears that the historical backfiles are once again available at the new GDELT project site. I haven’t had a chance to compare the files to the old ones to see if they are identical. Unfortunately, Mr. Leetaru has not provided much in the way of details or documentation about these new files. Did he re-run the TABARI parsing program on the news stories for these newly posted backfiles, or are the ones posted on the site exactly the same as the previous files? Are all news sources the same as before? Which news sources are used for which date ranges (and when are new sources added)?
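For anyone who also kept a copy of the old backfiles, here is a minimal sketch of how that comparison could be done by hashing each file in two local folders and matching them by file name. The function names (`file_sha256`, `compare_dirs`) and the folder layout are my own assumptions for illustration, not anything provided by GDELT. One caveat: a matching hash shows two files are byte-for-byte identical, but a mismatch does not by itself show that the event records differ, since a re-zipped archive can change at the byte level while holding the same data.

```python
import hashlib
import sys
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_dirs(old_dir, new_dir):
    """Compare files with the same name in two folders by content hash.

    Both folders are hypothetical local copies: one holding the backfiles
    saved earlier, one holding the newly posted backfiles.
    """
    old = {p.name: p for p in Path(old_dir).iterdir() if p.is_file()}
    new = {p.name: p for p in Path(new_dir).iterdir() if p.is_file()}
    report = {"identical": [], "changed": [], "only_old": [], "only_new": []}
    for name in sorted(old.keys() | new.keys()):
        if name not in new:
            report["only_old"].append(name)
        elif name not in old:
            report["only_new"].append(name)
        elif file_sha256(old[name]) == file_sha256(new[name]):
            report["identical"].append(name)
        else:
            report["changed"].append(name)
    return report


if __name__ == "__main__":
    # Usage: python compare_backfiles.py OLD_DIR NEW_DIR
    result = compare_dirs(sys.argv[1], sys.argv[2])
    for label, names in result.items():
        print(f"{label}: {len(names)} file(s)")
```

Any file that lands in the "changed" list would still need to be unzipped and compared record by record before concluding that the coding itself was redone.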
Well, this is unfortunate. I’ve been working with GDELT for a little while now, using it for an article manuscript and a part of my dissertation. But as of 10 days ago, GDELT is suspended. See here: GDELT Suspension.
Kalev Leetaru quickly set up a new site at HERE, but I’m not yet convinced that all is well with using the data for publication or the dissertation. Data from before Jan. 1, 2014 is no longer available (so pretty much all of it), raising questions about the legal issues discussed elsewhere (like here). From what I gather, the data itself is legitimately coded, but there are questions about how the underlying sources were obtained. I have the entirety of the dataset saved, but at this point I’m wary of putting time into any analysis based on GDELT.
There have been critical and cautionary comments about using data whose underlying sources are largely unknown. That’s an important lesson here about knowing and considering “the processes through which the data were created.” My main justification for taking on a GDELT-based project without this information (the news stories and sources themselves) was that source citations are supposed to be provided in the next version of GDELT in summer 2014, and there is already a variable in the data for “SOURCE URL” for events from April 2013 onward. Per the FAQ:
“Due to copyright restrictions and publisher agreements we cannot redistribute any of the news content that was used in the creation of GDELT, only the codified numeric event records extracted from that content. However, for web-based content after April 1, 2013 we do include the URL or citation to the source article for each event so that you can locate the material on your own to read more about the event and its surrounding context. In the next release of GDELT, tentatively slated for late Summer 2014, we will be including source citations for all events back to 1979.”
I’ve previously done some work with TABARI, the machine coding program that parses the news stories underlying GDELT, and that remains a valid means of obtaining events data (limited to Reuters and AFP stories). The parser has been shown in various places, including HERE, to perform comparatively well against human-coded news stories. Time to go in that direction for the projects at hand until things get sorted out with GDELT.