- Reads from the Twitter sample stream (GET statuses/sample)
- Collects the tweets into a new file every hour (yyyy-mm-dd-hh.json)
- Uses OAuth
- Uses Twisted, an event-driven networking engine written in Python
Twisted==11.0.0
httplib2==0.7.4
oauth2==1.5.170
pyOpenSSL==0.13
wsgiref==0.1.2
zope.interface==3.6.3
- Register an application with Twitter here: https://dev.twitter.com/apps/new
- Fill in your details. All details are mandatory.
- Your website can be fictional, but it does require an http:// prefix. Make sure you select 'Client' as the application type. You only need read-only permissions.
- On the application settings page, take note of your consumer key, consumer secret, access token and access token secret.
- Add these four values to the top of the oauth_stream_collect.py script. Search for 'consumer key', 'consumer secret', 'access token' and 'access token secret' and replace each with the appropriate value, keeping it inside the quotes.
- Start the script from the folder where the tweets should be saved:

      python oauth_stream_collect.py

- The tweets will be saved in the file yyyy-mm-dd-hh.json. A new file is created every hour.
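The hourly file name follows directly from a timestamp. As a minimal sketch of how such a name could be derived: `hourly_filename` is a hypothetical helper and is not taken from oauth_stream_collect.py, and whether the script uses local time or UTC is not stated here, so local time is assumed.

```python
from datetime import datetime


def hourly_filename(moment):
    """Return the yyyy-mm-dd-hh.json name for the hour containing `moment`."""
    return moment.strftime('%Y-%m-%d-%H.json')


# Hypothetical usage: the file name for the current hour (local time assumed).
print(hourly_filename(datetime.now()))
```

Because the name changes only when the hour changes, comparing the current name with the name of the open file is one simple way to decide when to start a new file.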
A simple way to parse the JSON files (replace yyyy-mm-dd-hh.json with the actual filename). Run this script from the folder where the JSON files are stored.

    import json

    tweets = []
    for line in open('yyyy-mm-dd-hh.json'):
        try:
            tweets.append(json.loads(line))
        except ValueError:
            # skip blank or malformed lines
            pass
This creates a list called tweets, holding one parsed JSON object (a Python dict) per line, which can be manipulated depending on the use case. Some examples:

    print(len(tweets))  # number of entries in the tweets list
    tweet = tweets[0]   # look at a single tweet
    print(tweet)
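Since an hourly file can be several hundred megabytes, holding every tweet in one list may use a lot of memory. The sketch below is one possible alternative that reads the file lazily; `iter_tweets` and `count_tweets` are hypothetical names introduced here, and the file is assumed to be UTF-8 encoded with one JSON object per line.

```python
import io
import json


def iter_tweets(path):
    """Yield one parsed JSON object per valid line of `path`."""
    with io.open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                yield json.loads(line)
            except ValueError:
                continue  # skip truncated or malformed lines


def count_tweets(path):
    """Count the parsed objects in `path` without keeping them in memory."""
    return sum(1 for _ in iter_tweets(path))
```

For example, `count_tweets('yyyy-mm-dd-hh.json')` gives the same number as `len(tweets)` above, but only one line is held in memory at a time.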
This blog post by Michal Migurski demonstrates how to convert this tweets list into a .csv file, which can easily be read by Excel, MySQL, etc.
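Without reproducing that post, a minimal sketch of the same idea using only the standard library might look like the following. It assumes Python 3, that each tweet dict carries the `id_str`, `created_at` and `text` fields plus a nested `user` object with `screen_name`, and that entries without `text` (such as delete notices) should be skipped. `tweets_to_csv` is a hypothetical helper name, and the choice of columns is only an illustration.

```python
import csv


def tweets_to_csv(tweets, csv_path):
    """Write selected fields of each tweet to csv_path; return rows written."""
    written = 0
    with open(csv_path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['id', 'created_at', 'screen_name', 'text'])
        for tweet in tweets:
            if 'text' not in tweet:
                continue  # e.g. delete notices carry no tweet text
            user = tweet.get('user') or {}
            writer.writerow([
                tweet.get('id_str', ''),
                tweet.get('created_at', ''),
                user.get('screen_name', ''),
                tweet['text'],
            ])
            written += 1
    return written
```

Calling `tweets_to_csv(tweets, 'yyyy-mm-dd-hh.csv')` with the list built earlier would produce one row per tweet; the csv module takes care of quoting commas, quotes and newlines inside the tweet text.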
- The sample stream is 1% of the Twitter stream
- Uncompressed sample stream per hour: approx. 550 MB/hr (13.2 GB/day), approx. 210,000 tweets/hr
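The daily figure follows from the hourly one, and the same numbers give a rough average size per tweet. The arithmetic can be sketched as below, assuming decimal units (1 GB = 1000 MB) and treating the hourly figures above as constant over a day; `daily_estimate` is a hypothetical helper.

```python
MB_PER_HOUR = 550          # approximate uncompressed size per hour, from above
TWEETS_PER_HOUR = 210000   # approximate tweet count per hour, from above


def daily_estimate(mb_per_hour=MB_PER_HOUR, tweets_per_hour=TWEETS_PER_HOUR):
    """Return (GB per day, tweets per day, average KB per tweet)."""
    gb_per_day = mb_per_hour * 24 / 1000.0
    tweets_per_day = tweets_per_hour * 24
    kb_per_tweet = mb_per_hour * 1000.0 / tweets_per_hour
    return gb_per_day, tweets_per_day, kb_per_tweet


print('%.1f GB/day, %d tweets/day, %.1f KB/tweet' % daily_estimate())
# prints: 13.2 GB/day, 5040000 tweets/day, 2.6 KB/tweet
```

These are planning estimates only; actual volume varies with time of day and with Twitter's traffic.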