[For the previous post in this series, click here].

The BBC Debate

The BBC’s “Great Debate” was broadcast live in the UK by the BBC on Tuesday 21 June 2016 between 20:00 and 22:00 BST. It generated considerable activity on Twitter under the #BBCDebate hashtag.

I collected some of the Tweets tagged with #BBCDebate using a Google Spreadsheet. (See the methodology section below). I have shared an anonymised dataset on figshare:

Priego, E. (2016) “The BBC’s Great Debate”: Anonymised Data from a #BBCDebate Archive. figshare. https://dx.doi.org/10.6084/m9.figshare.3457688.v1

[Note: figshare DOIs are not resolving or there are delays in resolving; it should be fixed soon…]

Archive Summary (#BBCDebate)

Number of links: 16,826
Number of RTs: 32,206 (estimate based on occurrence of “RT”)
Number of Tweets: 38,116
Unique Tweets: 38,066 (used to monitor quality of archive)
First Tweet in archive: 14/06/2016 22:03:18 BST
Last Tweet in archive: 22/06/2016 09:12:32 BST
In Reply Ids: 349
In Reply @s: 456
Tweet rate: 62 Tweets/min (from last archive, 10 mins)
Unique users in archive: 20,243
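These summary metrics can be reproduced from the raw TAGS export. A minimal sketch in pandas, assuming the standard TAGS column names `id_str` and `text`; the tiny sample DataFrame here is hypothetical, standing in for the ~38,000-row archive:

```python
import pandas as pd

# Hypothetical sample mimicking a few rows of a TAGS "Archive" sheet export.
df = pd.DataFrame({
    "id_str": ["1", "2", "2", "3"],
    "text": [
        "RT @BBCDebate: The debate begins at 20:00 #BBCDebate",
        "Watching the debate now #BBCDebate https://t.co/xxxx",
        "Watching the debate now #BBCDebate https://t.co/xxxx",
        "RT @someone: Great point! #BBCDebate",
    ],
})

n_tweets = len(df)
n_unique = df["id_str"].nunique()                      # monitors archive quality
n_rts = df["text"].str.startswith("RT ").sum()         # estimate based on "RT " prefix
n_links = df["text"].str.count(r"https?://\S+").sum()  # rough link count

print(n_tweets, n_unique, n_rts, n_links)
```

The RT figure is, as in the summary above, only an estimate: it counts Tweets whose text begins with “RT ”, which misses quote-Tweets and manual retweets phrased differently.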

[Chart: Tweets from StrongerIn in archive]

[Chart: Tweets from vote_leave in archive]
The raw data was downloaded as an Excel spreadsheet file containing 38,116 Tweets (38,066 unique Tweets) publicly published with the queried hashtag (#BBCDebate) between 14/06/2016 22:03:18 and 22/06/2016 09:12:32 BST.

Due to the expected high volume of Tweets, only users with at least 10 followers were included in the archive.
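A sketch of that follower filter in pandas, assuming the standard TAGS column `user_followers_count` (the sample rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows; user_followers_count is a standard TAGS column.
df = pd.DataFrame({
    "from_user": ["a", "b", "c"],
    "user_followers_count": [3, 10, 2500],
})

# Keep only users with at least 10 followers, mirroring the
# collection filter applied before archiving.
filtered = df[df["user_followers_count"] >= 10]
print(len(filtered))
```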

As indicated above, the BBC Debate was broadcast live on UK national television on Tuesday 21 June 2016 between 20:00 and 22:00 BST. This means the data collection covered the real-time broadcasting of the live debate (see the chart below).

[Chart] #BBCDebate Activity in the last 3 days. Key: blue: Tweet; red: Reply

The data collected indicated that only 12 Tweets in the whole archive contained geolocation data. A variety of user languages (user_lang) were identified.

Number of Different User Languages (user_lang)

Note this is not the language of the Tweets’ text, but the language setting in the application used to post the Tweet. In other words user_lang indicates the language the Twitter user selected from the drop-down list on their Twitter Settings page. This metadata is an indication of a user’s primary language but it might be misleading. For example, a user might select ‘es’ (Spanish) as their preferred language but compose their Tweets in English.

The following list ranks user_lang values by number of Tweets in the dataset in descending order. Specific counts can be obtained from the shared dataset.
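Such a ranking is a one-liner in pandas, assuming a `user_lang` column as in the TAGS export (the sample values here are hypothetical):

```python
import pandas as pd

# Hypothetical user_lang values; the real dataset has one per Tweet.
df = pd.DataFrame({"user_lang": ["en", "en", "en-gb", "es", "en", "fr", "es"]})

# Rank user_lang by number of Tweets, descending.
ranking = df["user_lang"].value_counts()
print(ranking)
```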


If you are interested in user_lang, the Twitter API endpoint GET help/languages returns the list of languages supported by Twitter along with their language codes. At the time of writing, a language code may be formatted as ISO 639-1 alpha-2 (en), ISO 639-3 alpha-3 (msa), or ISO 639-1 alpha-2 combined with an ISO 3166-1 alpha-2 localization (zh-tw).
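As an illustration of those three formats, a small hypothetical classifier (not part of the Twitter API) can tell them apart with regular expressions:

```python
import re

def classify_lang_code(code: str) -> str:
    """Classify a Twitter language code by its format (illustrative only)."""
    if re.fullmatch(r"[a-z]{2}", code):
        return "ISO 639-1 alpha-2"
    if re.fullmatch(r"[a-z]{3}", code):
        return "ISO 639-3 alpha-3"
    if re.fullmatch(r"[a-z]{2}-[a-z]{2}", code):
        return "ISO 639-1 + ISO 3166-1 localization"
    return "unknown"

print(classify_lang_code("en"))
print(classify_lang_code("msa"))
print(classify_lang_code("zh-tw"))
```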

It is interesting to note the variety of European user_lang settings selected by those tweeting about #BBCDebate.

Notes on Methodology

The Tweets contained in the Archive sheet were collected using Martin Hawksey’s TAGS 6.0.

Given the relatively large volume of activity expected around #BBCDebate and the public and political nature of the hashtag, I have only shared indicative data. No full Tweets or any other associated metadata have been shared.

The dataset contains a metrics summary as well as a table with column headings labelled created_at, time, geo_coordinates (anonymised: where data was present, YES is indicated; where no data was present the corresponding cell has been left blank), user_lang and user_followers_count, corresponding to each Tweet.
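The geo_coordinates anonymisation step can be sketched in pandas, assuming the raw TAGS export uses a `geo_coordinates` column with empty strings for missing data (the sample rows are hypothetical):

```python
import pandas as pd

# Hypothetical raw rows; geo_coordinates is a standard TAGS column.
df = pd.DataFrame({
    "created_at": ["Tue Jun 21 19:01:00 +0000 2016",
                   "Tue Jun 21 19:02:00 +0000 2016"],
    "geo_coordinates": ["loc: 51.5074,-0.1278", ""],
})

# Anonymise: mark presence of geolocation data with "YES", else leave blank.
df["geo_coordinates"] = df["geo_coordinates"].apply(lambda v: "YES" if v else "")
print(df["geo_coordinates"].tolist())
```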

Timestamps should suffice to prove the existence of the Tweets and could be useful to run analyses of activity on Twitter around a real-time media event.
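As a sketch of such an analysis, assuming the time column parses to datetimes, Tweets can be counted per minute with pandas (the sample timestamps are hypothetical):

```python
import pandas as pd

# Hypothetical timestamps; the real "time" column holds one entry per Tweet.
times = pd.to_datetime([
    "2016-06-21 20:00:10", "2016-06-21 20:00:40",
    "2016-06-21 20:01:05", "2016-06-21 20:03:30",
])

# Count Tweets per minute to chart activity around the live broadcast.
per_minute = pd.Series(1, index=times).resample("1min").sum()
print(per_minute)
```

Resampling fills quiet minutes with zeros, which is what you want when plotting activity against the broadcast timeline.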

Text analysis of the raw dataset was performed using Stéfan Sinclair’s & Geoffrey Rockwell’s Voyant Tools. I may share results eventually if I find the time.

The collection and analysis of the dataset complies with Twitter’s Developer Rules of the Road.

Some basic deduplication and refining of the collected data was performed.
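Deduplication of this kind can be sketched in pandas, assuming Tweet IDs live in an `id_str` column as in TAGS exports (the sample rows are hypothetical):

```python
import pandas as pd

# Hypothetical archive rows with one duplicate id_str, as can happen when
# overlapping pages are collected from the Search API.
df = pd.DataFrame({
    "id_str": ["100", "101", "101", "102"],
    "text": ["a", "b", "b", "c"],
})

deduped = df.drop_duplicates(subset="id_str")
print(len(df), len(deduped))
```

Comparing the row counts before and after is the simple quality check behind the “Unique Tweets” figure in the archive summary.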

As with all the previous datasets I have created and shared, it must be taken into account that this is a sample dataset containing the Tweets published during the indicated period, not a large-scale collection of the whole output. The data is presented as is, as a research sample and the result of an archival task. The sample’s significance is subject to interpretation.

Again, as in all the previous cases, please note that both research and experience show that the Twitter Search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might “over-represent the more central users”, not offering “an accurate picture of peripheral activity” (González-Bailón et al. 2012). Google Spreadsheet limits must also be taken into account. Therefore it cannot be guaranteed that the dataset contains each and every Tweet actually published with the queried hashtag during the indicated period. [González-Bailón et al. have done very interesting work on political discussions online, and their work remains an inspiration.]

Only data from public accounts was included and analysed. The data was obtained from the public Twitter Search API; it is also available to all Twitter users via that API, and to anyone with an Internet connection via the Twitter and Twitter Search web and mobile clients, without the need for a Twitter account.

Each Tweet and its contents were published openly on the Web; they were explicitly meant for public consumption and distribution and are the responsibility of their original authors. Any copyright belongs to the original authors.

No personally identifiable information (PII) nor sensitive personal information (SPI) was collected or contained in the dataset.

I have shared the dataset including the extra tables as a sample and as an act of citizen scholarship in order to archive, document and encourage open educational and historical research and analysis. It is hoped that by sharing the data someone else might be able to run different analyses and ideally discover different or more significant insights.

For the previous post in this series, click here. If you got all the way here, thank you for reading.

[vote_leave]. (2016) [Twitter account]. Retrieved from https://twitter.com/vote_leave. [Accessed 21 June 2016].

González-Bailón, S., Banchs, R.E. and Kaltenbrunner, A. (2012) Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions. Human Communication Research 38 (2) 121-143.

González-Bailón, S. et al (2012) Assessing the Bias in Communication Networks Sampled from Twitter (December 4, 2012). DOI: http://dx.doi.org/10.2139/ssrn.2185134

Hawksey, M. (2013) What the little birdy tells me: Twitter in education. Published on November 12, 2013. Presentation given from the LSE NetworkED Seminar Series 2013 on the use of Twitter in Education. Available from http://www.slideshare.net/mhawksey/what-the-little-birdy-tells-me-twitter-in-education [Accessed 21 June 2016].

Priego, E. (2016) “Vote Leave”. A Dataset of 1,100 Tweets by vote_leave with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI: https://dx.doi.org/10.6084/m9.figshare.3452834.v1

Priego, E. (2016) “Stronger In”. A Dataset of 1,005 Tweets by StrongerIn with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare.

Priego, E. (2016) “Stronger In”: Looking Into a Sample Archive of 1,005 StrongerIn Tweets. 21 June 2016. Available from https://epriego.wordpress.com/2016/06/21/stronger-in-looking-into-a-sample-archive-of-1005-strongerin-tweets/. [Accessed 21 June 2016].

Priego, E. (2016) “The BBC’s Great Debate”: Anonymised Data from a #BBCDebate Archive. figshare. https://dx.doi.org/10.6084/m9.figshare.3457688.v1