Learning Python – Project 1: How to find a job using Twitter’s API and Python

This post is my very small spin on a post by Adil Moujahid, An Introduction to Text Mining using Twitter Streaming API and Python. Check out his post and blog in general; there is some really cool stuff on there. I’ve updated the code to work with Python 3 and added more of a walkthrough, so even if you have never coded anything before, this hopefully all makes sense.

Introduction to Twitter API and Python

Finding jobs can be hard. You have to check different job websites, which are always poorly designed. Sometimes jobs are not posted on these aggregate sites and you need to go to university or company career pages to find them. Or jobs are advertised only through social media and can’t easily be found otherwise. So in this post we will use Python to make a Twitter streamer which brings in tweets about career fields we are interested in, and then make a second program that finds tweets referring to hiring and opens the web links they point to.

Twitter is a never-ending source of information: from cat gifs, to breaking news, to how people feel about a topic or jobs. Getting data from Twitter is also extremely easy through its API, or Application Programming Interface. As the name implies, an API allows outside applications, like the one we will build below, to interface with Twitter and either get data from it or put data on it (like posting a Tweet). Furthermore, there are some great free libraries out there that make using APIs very easy.

We’ll be using Tweepy for Python to collect data from Twitter. We’ll be “streaming” Twitter data, meaning we’ll be collecting it in real time as tweets are posted. This allows us to collect an unlimited number of tweets. Otherwise, say if you wanted to search Twitter for previous tweets, you are limited in the number of tweets you get per request, and each request will deliver random tweets, which often results in getting the same tweets over and over.

Python is a widely used programming language designed to be easy to read. Python powers many programs and websites, such as YouTube and Google, as well as various video games and a lot of scientific computing. If you don’t know any Python, or any computer programming at all, that’s totally fine! I will walk through everything in this post.

As an important note here, Python uses white space as part of the language. So if you see any indentation, it is very important! Also, Python uses zero to mean the first thing in a list. So if you have a list of 10 items, they will be labeled as 0, 1, 2 … 9.
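Here is a minimal sketch of both points, using a made-up list of job titles (the list and the variable names are just for illustration). The indented lines belong to the loop, and the first item sits at position 0.

```python
# A made-up example list; any list behaves the same way.
jobs = ['postdoc', 'science writer', 'lab manager']

labels = []
for position in range(len(jobs)):
    # These two indented lines belong to the loop and run once per item.
    label = str(position) + ': ' + jobs[position]
    labels.append(label)

# Back at the left margin: this runs once, after the loop has finished.
print(labels)
print(jobs[0])  # the first item lives at index 0
```

If you remove the indentation from the two lines under the for line, Python will stop with an error instead of guessing what you meant.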

To begin, you need Python. There are two major versions of Python, 2.7 and 3.5 (as of this post). All of this code will be for Python 3. If you are learning Python and find code online that does not work for you, check to make sure the versions are the same. There are also programs to convert code, such as 2to3. However, we are not going to just download Python. Instead, you’ll likely want an IDE, or Integrated Development Environment, which is software that runs Python but has helpful additions like error checking. Anaconda with Spyder is a great IDE for Python which is focused on scientific analysis and comes with some great libraries of extra code. To replicate what you see below, download and install Anaconda 3.5 (which will come with Python) or use whichever IDE you prefer.

Setting up Python

Once you have Anaconda installed, either open Anaconda and then launch Spyder or just find the Spyder program and open that. We’re going to need to install some Python libraries. Libraries are collections of functions which will do many things behind the scenes without us having to write out a bunch of code. For example with Tweepy, all we need to do is write “Stream” and Twitter data will stream to our computer.

On Spyder’s toolbar, click “Tools” and “Open command prompt”

In the Command Window, type in:

pip install tweepy

‘pip’ is an installer program which can download Packages and Libraries for Python. This line simply tells pip to install the tweepy library. If you are using Anaconda and Spyder, that is all you need to do! If you are not, you will also need to install pandas and matplotlib, which are included with Anaconda. The json and re libraries we use later ship with Python itself, so they never need installing.

Setting up Twitter API credentials

In order to get data from Twitter’s API you need credentials to access it. These are very easy to get. When you are logged into Twitter (you must have a Twitter account) go here: https://apps.twitter.com/ Click the “Create New App” button. Fill out the details. If you don’t have a website, just put any website, and don’t forget to include ‘http://’. On the next screen, go to the “Keys and Access Tokens” tab. Scroll down and request your access token. Do not show these credentials to anyone! They could use them to impersonate you, for example, posting Tweets to your account.

 

Creating the streaming program

Copy and paste the consumer key, consumer secret, access token, and access token secret into a new Python file in Spyder (call the file something informative, like “Twitter_streamer_v1.py”). Now create four variables, seen below, to store your credentials. To create a variable all you need to do is type the name of your variable, followed by an =, and what you want that variable to represent. Your credentials should be in string format, meaning surrounded by ‘ or ”.
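As a rough sketch, the four variables might look like this. The variable names are my assumption, and the values are placeholders; paste your own credentials between the quotes.

```python
# Placeholder values only: replace each string with your own credential.
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'
```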

Above these, we will need to import the Tweepy functions we want to use. OAuthHandler will authenticate our API credentials. Stream establishes a connection with Twitter. StreamListener is a class used by tweepy’s Stream to bring in data from Twitter. We will also need to import time so we can tell the program to quit after a certain period of time has passed.

Your program should now look like this. Next we will put in everything we need to get started. 

Some lines have a # in front of them; these are comments. Comments are ignored when the code is run. After the authentication details, we will create a class, a template for building objects, called MyStreamListener which will build on the Tweepy StreamListener class. We will make our own version so it stops after a certain period of time. First we will indent once (with the tab key) and define, with def, a function that opens a file to save our data to. __init__ runs automatically when the listener is created, and anything we store on self stays available to the other functions in the class (otherwise it would be discarded when the function ends). Everything indented under the def line is included in the function.

To begin, we get the self.start_time of the program. time.time() returns the current time in seconds. Next, we make self.limit the time_limit, or how long the program should collect tweets for. We will set this value later on. Next, we set up a variable, self.saveFile, using open(), to hold all of our data. The file is called twitter_jobs.json, given as a string, and this is what you’ll see on your computer in the current directory (below). Python will make this file in whatever your current directory is. The ‘a’ parameter with open means “append”: do not write over preexisting data with new data, just add new data to the end. super gives us access to the parent StreamListener class, so its own __init__ setup still runs.
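To see what the ‘a’ (append) mode and time.time() do on their own, here is a small sketch that does not touch Twitter at all. The file name append_demo.json and the temporary folder are made up for the demo, so your real twitter_jobs.json is left alone.

```python
import os
import tempfile
import time

start_time = time.time()  # current time in seconds, as a decimal number

# A throwaway folder so this demo does not touch your real data file.
demo_path = os.path.join(tempfile.mkdtemp(), 'append_demo.json')

for text in ['first tweet', 'second tweet']:
    saveFile = open(demo_path, 'a')  # 'a' = append: keep what is already there
    saveFile.write(text + '\n')
    saveFile.close()

# Read the file back: both lines survived, because nothing was overwritten.
with open(demo_path, 'r') as demo_file:
    saved_lines = demo_file.read().splitlines()

elapsed = time.time() - start_time  # seconds that have passed since the start
print(saved_lines)
print(elapsed)
```

Had the file been opened with ‘w’ instead, the second pass through the loop would have erased the first line.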

This is your path
This is your directory in spyder

 

Next, the on_data function is what the tweepy streamer uses to pass through data. We will set it up so that as long as the time is less than the limit, we will collect tweets. However, if the time elapsed goes above the time limit, end the program. With each tweet, we will write the data of the tweet to the self.saveFile. When you want to perform a function on something, such as write() to saveFile, you put the thing you want the function to act on, saveFile, followed by a period, followed by the function, so self.saveFile.write(data) is writing the contents of data to the saveFile.

We will also write a new line between tweets using ‘\n’. We make sure to return True so the program continues. If the time does run over the limit, else:, we close() the file and return False to end the program.
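The sketch below shows the time limit logic by itself. To keep it runnable without a Twitter connection, this stand-in class does not inherit from tweepy’s StreamListener (so there is no super call), and the file_name argument and the demo file are my additions. The real listener would be fed tweets by tweepy; here we call on_data by hand.

```python
import os
import tempfile
import time


class MyStreamListener:
    """Stand-in listener: same idea as in the post, but without tweepy."""

    def __init__(self, time_limit, file_name):
        self.start_time = time.time()         # when we started collecting
        self.limit = time_limit               # how many seconds to collect for
        self.saveFile = open(file_name, 'a')  # append, never overwrite

    def on_data(self, data):
        if (time.time() - self.start_time) < self.limit:
            self.saveFile.write(data)
            self.saveFile.write('\n')         # new line between tweets
            return True                       # keep going
        else:
            self.saveFile.close()
            return False                      # time is up: stop


demo_file = os.path.join(tempfile.mkdtemp(), 'twitter_jobs_demo.json')

# Plenty of time left, so the tweet is saved and we are told to continue.
listener = MyStreamListener(time_limit=60, file_name=demo_file)
kept_going = listener.on_data('{"text": "Postdoc position open"}')

# A limit of 0 seconds has already passed, so nothing is saved.
expired = MyStreamListener(time_limit=0, file_name=demo_file)
stopped = expired.on_data('{"text": "arrives too late"}')

listener.saveFile.close()
print(kept_going, stopped)
```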

We’re nearly there! If we ran this now, this would just collect ALL THE TWEETS! But we don’t want that. We’re looking for a job! So let’s set up a tweepy filter. Since I am finishing up my PhD, I’ll be looking for postdoc positions. So my filter looks like this:

This way we will only get tweets which contain one of those words. You can set up whatever filter you want for your occupation. Just make each keyword a string and separate them by commas within the square brackets. For example, [‘science blogger’, ‘science writer’,’science journalist’]. While these are vague and may not be job postings, we will sort through them later for jobs. But first, we want to make sure we don’t miss any tweets! 

Your final program should look like this to fully run: 

myStream is the Stream object which listens for tweets, using the auth API authentication we set up earlier. The time_limit is passed in to set our self.limit. This is in seconds: one hour = 3600 seconds, one day = 86400. Finally, myStream is run with the tweepy filter. Once complete, hit save and run it. In Spyder you only need to hit the green run button.
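For reference, here is a hedged sketch of how the pieces could be wired together. It assumes a tweepy 3.x release (the versions that still provide StreamListener), and the helper name start_stream, the HOUR and DAY constants, and the credential variable names are mine. The listener passed in is assumed to be a MyStreamListener that inherits from tweepy’s StreamListener.

```python
HOUR = 60 * 60   # 3600 seconds
DAY = 24 * HOUR  # 86400 seconds


def start_stream(listener, keywords, consumer_key, consumer_secret,
                 access_token, access_token_secret):
    """Connect to Twitter and stream tweets containing any of the keywords."""
    # Imported inside the function so this file still loads without tweepy.
    from tweepy import OAuthHandler, Stream

    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    myStream = Stream(auth=auth, listener=listener)
    # Runs until the listener's on_data returns False.
    myStream.filter(track=keywords)


# Example call, left commented out because it needs real credentials:
# start_stream(MyStreamListener(time_limit=HOUR), ['postdoc', 'postdoctoral'],
#              consumer_key, consumer_secret, access_token, access_token_secret)
```

Keeping the time limit as a named constant such as HOUR makes it harder to mistype a long number of seconds.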

After running mine for 1 hour I got 141KB of tweets, and after 24 hours I got ~500MB worth of tweets. Make sure you have enough space on your hard drive for this wherever your directory is.

This program can of course be used to collect tweets for anything! Just set your filters accordingly.


Creating the analysis program

Now that we have all our data, we can start to search through it for job postings and other information. These tweets are formatted using JSON (JavaScript Object Notation) which makes them very easy to work with. Create a new Python program; I called mine twitter_stream_analytics.py.

Let’s start with the imports. We will need json so we can handle the JSON data format. We will use pandas to structure our data. We will import pandas as pd to make it quicker to type. matplotlib, as plt, will be used to make nice looking graphs. re, short for regex or “regular expression operations”, will be used to parse through our data. Finally, we need webbrowser which will open all our job postings in our computer’s default web browser.

Next we need to import our data from the json file. The code below stores the path to my file, as a string, in tweets_data_path. Since my file is already in my directory, I do not need to specify the whole path, e.g. ‘C:\Users\blake\Documents\Python\twitter_jobs.json’. If your file is not in your directory, you’ll need to put the whole path.

Next we create an empty list, using square brackets [], called tweets_data to store data in later. Next we open our file with ‘r’, or read only. The next few bits of code mean: for every line in the file, load the json data to a variable called tweet, then append it, or add it to the end of, tweets_data. In other words, write each tweet to a row in tweets_data. If a line cannot be read (for example, there is nothing there), the except block just continues to the next line. The second to last line will print out just how many tweets we have using the len, or length, function to count how many tweets there are. The final line creates an empty pandas DataFrame to store the data we want, which we will fill next.
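A minimal sketch of the loading step is below. So that it runs anywhere, it first writes a tiny made-up sample file into a temporary folder; with your own data you would simply set tweets_data_path to ‘twitter_jobs.json’. I catch ValueError here, which is what json raises for a line it cannot read.

```python
import json
import os
import tempfile

# Made-up sample file standing in for twitter_jobs.json.
tweets_data_path = os.path.join(tempfile.mkdtemp(), 'twitter_jobs_sample.json')
with open(tweets_data_path, 'w') as sample:
    sample.write('{"text": "Postdoc job in neuroscience", "lang": "en"}\n')
    sample.write('\n')  # a blank line, like the ones between saved tweets
    sample.write('{"text": "Busco un postdoc", "lang": "es"}\n')

tweets_data = []
with open(tweets_data_path, 'r') as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)   # turn the text into a Python dictionary
            tweets_data.append(tweet)  # add it to the end of our list
        except ValueError:
            # Blank or cut-off lines are not valid JSON, so skip them.
            continue

print(len(tweets_data))
```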

Next we will get only the important information about these tweets, like what they contain in their text, what language they were written in, and what country they came from. The below code makes a list for each of these data points then stores them into the DataFrame tweets. If you want to look at other things, just change the string. However, beware of nesting. For example, ‘country‘ is nested inside of ‘place‘. So first we need to reach into the place data, then get the country data. I’ve also added a check: if the tweet has place data, tested with !=, we take its country; if it does not, we store a None value instead.
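Here is a small sketch of that extraction, using two made-up tweets trimmed down to the fields we care about. The helper get_country is my own naming, and it uses “is not None” where the description above uses !=; both do the same job here.

```python
import pandas as pd

# Two made-up tweets, trimmed to the fields used here.
tweets_data = [
    {'text': 'Postdoc position open', 'lang': 'en',
     'place': {'country': 'United States'}},
    {'text': 'Busco un postdoc', 'lang': 'es', 'place': None},
]


def get_country(tweet):
    # 'country' is nested inside 'place'; 'place' is None without a location.
    if tweet['place'] is not None:
        return tweet['place']['country']
    return None


tweets = pd.DataFrame()
tweets['text'] = [tweet['text'] for tweet in tweets_data]
tweets['lang'] = [tweet['lang'] for tweet in tweets_data]
tweets['country'] = [get_country(tweet) for tweet in tweets_data]

print(tweets)
```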

tweets is now a DataFrame containing the text, language, and country (as columns) of each tweet (as rows). In Spyder, you can click on the ‘variable explorer’ tab and double click on tweets to check your data.

 


 

Next we can do some basic visualization. Let’s plot what language the tweets are in and plot the top 5. 

First we make a series variable, tweets_by_lang, counting how many times each language comes up using tweets[‘lang’].value_counts(). For example, if 27 tweets were in English, en will have a value of 27. The series is indexed by the language code, and each value is how many tweets were in that language.

The rest of the code sets up the matplotlib graph and is pretty self explanatory. The final line does the actual plotting. Using tweets_by_lang[:5] means plot the first 5, aka [:5], highest value languages. You will see something like this in your IPython console. I only have 2 languages, English and Spanish (es).
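The counting step can be tried on its own with a few made-up language codes. This sketch leaves the graph out so it runs without matplotlib; it only shows what value_counts and [:5] hand to the plot.

```python
import pandas as pd

# Made-up language codes standing in for the real 'lang' column.
tweets = pd.DataFrame({'lang': ['en', 'en', 'es', 'en', 'fr', 'es']})

# Count how often each language appears, most common first.
tweets_by_lang = tweets['lang'].value_counts()

# Keep at most the first five entries, i.e. the five most common languages.
top_languages = tweets_by_lang[:5]

print(top_languages)
```

With matplotlib installed, top_languages.plot(kind='bar') is one way to turn this series into a bar chart.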

 


 

We can do the same thing for countries using the code below.

Now to searching for jobs. First we will check that each tweet has a text field (maybe someone posted a picture with no text). In our tweets DataFrame, we will add two new columns to check which of our keywords was present in each text, returning True or False. To do this, we will define a function called word_in_text.

word_in_text will take in a word and some text and check if the word is in the text. If it is, it will return True; if it is not, it will return False. First we check if there is text at all. If the tweet has no text, text == None, return False. Otherwise, make the word and text all lowercase. This is simple with the .lower() built in Python function. This makes sure we don’t miss a word due to capitalization, like Postdoc vs postdoc vs PostDoc. Next we use re.search to search for the word in the text and store the result as match. If match is true, return True; otherwise, else:, AKA if the word is not in the text, return False. That’s all we need! Now we can check each tweet for the key word and write it to our tweets DataFrame with the below code.
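Following that description step by step, a sketch of word_in_text could look like this. Note that re.search treats the word as a regular expression, so plain words like ‘postdoc’ work as expected, while characters such as . or + would have a special meaning.

```python
import re


def word_in_text(word, text):
    # No text at all (for example a picture-only tweet): nothing to find.
    if text is None:
        return False
    # Lowercase both so that Postdoc, postdoc and PostDoc all match.
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    else:
        return False


print(word_in_text('postdoc', 'PostDoc position in Boston'))
print(word_in_text('postdoc', 'Cat gifs all day'))
print(word_in_text('postdoc', None))
```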

As you can see, for every tweet, the tweets column called ‘postdoc’ will be populated with True or False, depending on whether the ‘text’ of that tweet has the word ‘postdoc’ in it. lambda allows us to pass our word_in_text easily into pandas’ .apply function. The last two lines just print out how many tweets contain each word by counting, value_counts, how many times True appears in each tweets column [‘postdoc’] or [‘postdoctoral’]. The below code can easily visualize this in a bar graph.

Now onto the jobs! Let’s use our word_in_text to check for tweets which also have job related words in them. Just as before, we’ll add new columns to the tweets DataFrame telling us if a word, ‘job‘ or ‘position‘, is contained within the associated tweet‘s text. Next, we will add a final column that is True or False depending on whether the tweet is relevant, AKA contains either ‘job‘ or ‘position‘. Lastly, print out how many tweets have job in them, have position in them, and how many are relevant. You can also add more criteria. For example, look for tweets with the word ‘neuroscience’ or ‘chemistry’. Just keep adding or‘s to the relevant line.
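Here is a sketch of those three columns on a few made-up tweets. A compact version of word_in_text is repeated only so the snippet runs on its own, and I count the True values with .sum() rather than value_counts, which gives the same number.

```python
import re

import pandas as pd


def word_in_text(word, text):
    # Compact repeat of the helper, so this snippet runs by itself.
    return text is not None and re.search(word.lower(), text.lower()) is not None


# Made-up tweets for illustration.
tweets = pd.DataFrame({'text': [
    'Postdoc job opening in our lab',
    'New postdoctoral position available',
    'My postdoc years were fun',
]})

tweets['job'] = tweets['text'].apply(lambda tweet: word_in_text('job', tweet))
tweets['position'] = tweets['text'].apply(
    lambda tweet: word_in_text('position', tweet))
# Relevant means either word is present; add more or's for more criteria.
tweets['relevant'] = tweets['text'].apply(
    lambda tweet: word_in_text('job', tweet) or word_in_text('position', tweet))

print(tweets['job'].sum(), tweets['position'].sum(), tweets['relevant'].sum())
```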

And to visualize the data. 

Now, you could go through all the relevant tweets and get the link and copy and paste it into your browser. Or we can just tell our program to do that for us. Let’s make a function which extracts web links within tweets then opens them up in our browser. 

extract_link will take in some text, in this case the text of a tweet, and first check if there is text; if None, return a blank denoted as ‘ ‘  (single quotation, space, single quotation). Next we will use regex to analyze the text. We will set up the regex as the expressions we are looking for, in this case anything that has http:// or www. in it. Then we use re.search to check if our text has any of this and, if it is a match, return that group of characters. Otherwise, return a blank. Then all we need to do is use this function on our tweets DataFrame. This way, the final column of our tweets will be made of all the links in each tweet (again, each row being a tweet).
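A sketch of extract_link is below. The exact regular expression is my assumption of a reasonable pattern for “starts with http:// (or https://) or www.”, and I use an empty string as the blank value; whichever blank you return, use the same one when you filter out tweets without links.

```python
import re


def extract_link(text):
    # No text means no link: return a blank.
    if text is None:
        return ''
    # Assumed pattern: http:// or https:// or www., then everything up to
    # the next space, quote, or angle bracket.
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    match = re.search(regex, text)
    if match:
        return match.group()  # the characters that matched
    return ''


print(extract_link('Postdoc job, apply at http://t.co/abc123 today'))
print(extract_link('See www.example.com/jobs for details'))
```

On a DataFrame this would be applied with something like tweets['text'].apply(lambda tweet: extract_link(tweet)).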

The next few lines will make a new DataFrame that only takes the relevant tweets (aka relevant column == True for a tweet) and makes sure they have a link. Only rows whose link is not, !=, the blank ‘ ‘ are kept, so tweets without a link are dropped. Finally, any tweets with duplicate links will be removed, using pandas’ .drop_duplicates function.
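The filtering can be sketched on a made-up DataFrame like this. The blank is assumed to be an empty string here; if your extract_link returns something else for “no link”, compare against that instead.

```python
import pandas as pd

# Made-up data: two rows share a link, one has no link, one is not relevant.
tweets = pd.DataFrame({
    'text': ['job A', 'job A again', 'job without link', 'job B', 'cat gif'],
    'relevant': [True, True, True, True, False],
    'link': ['http://t.co/a', 'http://t.co/a', '', 'http://t.co/b',
             'http://t.co/c'],
})

# Keep only the relevant tweets.
tweets_relevant = tweets[tweets['relevant'] == True]
# Keep only rows whose link is not blank.
tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']
# Drop rows that repeat a link we already have.
tweets_relevant_with_link = tweets_relevant_with_link.drop_duplicates(
    subset='link')

print(tweets_relevant_with_link)
```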

Now, be very careful with this final bit of code. This will open every link there is. If you have a ton of links your computer may not be able to handle it. I’ve tested this with Google Chrome as my default browser. If you are using something else I can’t guarantee it will work. 

A for loop will start from the beginning and sequentially run till it reaches the end of a specified range. For example, with range(0, 5) the for loop will run with the value, in our case link_number, equaling 0, then run again with link_number equal to 1, then again with it equal to 2, and so on up to 4. The end value itself, 5, is not included.
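A tiny sketch makes the counting concrete. The list visited is just there to record each value the loop goes through.

```python
visited = []
for link_number in range(0, 5):
    # Runs five times: link_number is 0, then 1, 2, 3 and finally 4.
    visited.append(link_number)

print(visited)
```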

The code above creates a for loop which says: for each value, link_number, within the range from the start to the entire length, len, of tweets_relevant_with_link, meaning its number of rows, open each link in your web browser. In other words, check each row, row numbers being link_number, in tweets_relevant_with_link, and get the link out, using pandas’ .iat. [link_number,8] denotes the [row,column] to look in. So whatever the link_number currently is, use that for the row number, and stick to column 8, which is where our links were stored. webbrowser.get().open opens the link using only your default web browser; if there is an issue with opening it, it does not try another web browser. Some links may be cut off or broken for various reasons, and the web page will not exist.
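Here is a hedged sketch of that loop wrapped in a function. Two things are my additions: the open_link argument, which lets us try the loop without launching a browser, and looking up the column number by name instead of hard coding 8, since the position depends on how many columns your DataFrame has.

```python
import webbrowser

import pandas as pd


def open_links(tweets_relevant_with_link, open_link=None):
    # By default, open each link in the default web browser.
    if open_link is None:
        open_link = webbrowser.get().open
    # Find which column number holds the links.
    link_column = tweets_relevant_with_link.columns.get_loc('link')
    for link_number in range(0, len(tweets_relevant_with_link)):
        # .iat[row, column] pulls out a single value by position.
        open_link(tweets_relevant_with_link.iat[link_number, link_column])


# Dry run on made-up data: collect the links instead of opening them.
demo = pd.DataFrame({
    'text': ['job A', 'job B'],
    'link': ['http://example.com/a', 'http://example.com/b'],
})
opened = []
open_links(demo, open_link=opened.append)
print(opened)
```

Calling open_links(tweets_relevant_with_link) with no second argument would open every link for real.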

If you have a lot of links, you can use the below code instead which will open 10 links and wait for you to press enter, input(), before opening more. 
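One possible way to write the batched version is sketched below. The function name open_in_batches and its arguments are mine. It works on a plain list of links, and both the opener and the pause are passed in so the demo can run without a browser or a keyboard.

```python
def open_in_batches(links, open_link, pause=input, batch_size=10):
    for link_number in range(0, len(links)):
        open_link(links[link_number])
        is_last = link_number == len(links) - 1
        # After every 10th link, wait for enter (unless we are already done).
        if (link_number + 1) % batch_size == 0 and not is_last:
            pause('Press enter to open more links...')


# Dry run with 25 made-up links: record what happens instead of doing it.
demo_links = ['http://example.com/job' + str(n) for n in range(25)]
opened = []
pauses = []
open_in_batches(demo_links, open_link=opened.append, pause=pauses.append)

print(len(opened), len(pauses))
```

For real use, pass webbrowser.get().open as open_link and leave pause as it is, so input() waits for you to press enter.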

All together, the analysis code will look like this: 

I hope this gives you all the tools you need to make your own Twitter data mining program with Python!


Github for the code


I used the Crayon Syntax Highlighter to embed my code. It’s very easy to use and supports all major languages.





