Learning Python – Project 2: How to find the best price on TradeMe

This tutorial will take you through using Python to pull, organize, and store data from TradeMe (the New Zealand website for buying and selling just about everything), then use that data to find the listings with the best price. It assumes only a very basic knowledge of Python; see Python Project 1: Looking for jobs on Twitter for an even more basic introduction. I’ll be using home listings in Dunedin for my data in this tutorial, but you can use this for whatever you need! I’ve included some code to work on car listings as well. I learned a lot from Shreyas Raghavan‘s Medium post titled “Create a model to predict house prices using Python“.

 

Collecting the data

TradeMe has an excellent API that allows users to pull all sorts of data from their site, such as car or real estate listings. The authentication process can be a bit daunting to beginners, but it is quite an easy process using the “OAuth1Session” library.

Step 1: Register your program with TradeMe

All you need to do to get started is sign into TradeMe and fill out this form on their website: https://www.trademe.co.nz/MyTradeMe/Api/RegisterNewApplication.aspx
TradeMe got back to me within a couple of days and approved my application. Once approved, you can see your Consumer Key and Secret at: https://www.trademe.co.nz/MyTradeMe/Api/DeveloperOptions.aspx

Step 2: Authenticate yourself 

In order to request information from TradeMe, you need to authenticate yourself so TradeMe knows who is asking for data. Although this method is a bit tedious, I found it easiest to authenticate myself using their web page: https://developer.trademe.co.nz/api-overview/authentication/

Simply put in your Consumer Key and Secret, select “Production” for the environment (so you get data from the real TradeMe, rather than their sandbox site for testing purposes), and select the permissions you need. Generate your Access Token and Token Secret and copy them, along with your Consumer Key and Secret, into a Python script.

Keep your Consumer Key and Secret, and your Access Token and Token Secret, private! Otherwise, anyone could impersonate you on TradeMe.

Step 3: Using Python to make requests

In order to request data from TradeMe, we need to make authenticated searches of their website. I am using the Python library “requests_oauthlib” to handle all the heavy lifting. TradeMe uses OAuth 1.0 for authentication, so we can use “OAuth1Session” from “requests_oauthlib”: we pass our Consumer Key, Token, and their Secrets into “OAuth1Session” and start making requests. Let’s get into some code:

At the top of our program we import the libraries we will be using. See Project 1 for how to install these libraries. Alternatively, if you are using Anaconda for Python (which I recommend), most of them will be pre-installed. Most importantly, we use OAuth1Session from requests_oauthlib to handle our authentication with TradeMe’s API. We use JSON and Pandas to organize all the data we collect. JSON is a widely used format for storing large data sets. Pandas is a Python library that makes it very easy to work with large data sets (similar to Microsoft Excel).

In the next few lines we store our API credentials; you can just copy and paste your credentials into the corresponding strings. Now our program is all set up and we are ready to start asking TradeMe for data. First we create an authenticated session with OAuth1Session; we will use this session to make our data requests. Next we need to tell TradeMe what information we want. TradeMe’s API uses URL-based searching, meaning we use a long website URL to request the specific data we want to see. TradeMe has extensive documentation on how to create these URLs and also offers a really nice tool to generate them for the data you want; here’s the one for real estate. MAKE SURE YOU CHANGE THE URL TO BE THE REAL TRADE ME WEBSITE AND NOT THEIR TESTING SANDBOX SITE. The tool returns “https://api.tmsandbox.co.nz”; you need to change it to “https://api.trademe.co.nz”. My URL here finds all real estate listings in Dunedin. Finally, we request the data at that URL using our tradeMe session and the .get function from OAuth1Session. The first page of our data is stored in returnedPageAll. However, this is only the first step.
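Put together, the setup described above might look like the sketch below. The credential placeholders and the search URL (including the Dunedin district code) are my assumptions; use TradeMe’s URL-generator tool to build the exact query you want.

```python
from requests_oauthlib import OAuth1Session  # handles OAuth 1.0 for us
import json          # parse TradeMe's JSON responses
import pandas as pd  # organize listings into dataframes
import time          # pause between requests

# Paste your own credentials between the quotes -- and keep them private!
consumerKey = 'YOUR_CONSUMER_KEY'
consumerSecret = 'YOUR_CONSUMER_SECRET'
oauthToken = 'YOUR_ACCESS_TOKEN'
oauthSecret = 'YOUR_TOKEN_SECRET'

# Create the authenticated session we'll use for every request.
tradeMe = OAuth1Session(consumerKey,
                        client_secret=consumerSecret,
                        resource_owner_key=oauthToken,
                        resource_owner_secret=oauthSecret)

# Search URL for residential listings in Dunedin (the district code is an
# assumption -- generate your own URL with TradeMe's search-URL tool).
# Note "api.trademe.co.nz", NOT the sandbox "api.tmsandbox.co.nz".
searchAll = ('https://api.trademe.co.nz/v1/Search/Property/'
             'Residential.json?district=71&rows=500')

# Request the first page of results.
returnedPageAll = tradeMe.get(searchAll)
```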

Step 4: Figure out how much data there is

TradeMe limits us to 500 listings per request. This is similar to when a website only displays 50 listings and you have to click “Next Page”. By default, TradeMe’s API returns page 1, i.e. the first 500 listings. This may be fine if you’re making a very specific search with fewer than 500 results. However, if we want to work with bigger data sets of thousands of listings, we need a way to get all of them. We can do this by requesting every available page of data. Graciously, TradeMe tells us how many total listings there are for our search, making it relatively easy to request all the available pages. The next section of code works out how many pages of listings there are.

First things first, we pull the content from our requested search page with the .content function from OAuth1Session. Next, we use the json library’s .loads function to parse the page content into a JSON object. The total number of listings for our search is stored under the JSON key ‘TotalCount’. Then we do some simple math to figure out the total number of pages. Since we get 500 listings per page, if there are 1,500 listings, there must be 3 pages of data, so we divide totalCount by 500. However, this may result in a fraction: if there are only 100 listings, we get 0.2 as our answer, and we can’t request page 0.2. Converting the result from a float to an integer with the int() function truncates the fraction (so 0.2 becomes 0), and in the request loop below we add 1 to the page count so any partial page is still requested. This works out how many page requests we need to make in order to get all the listings we are interested in. Now let’s get to work and grab some data!
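In code, the page arithmetic looks like this. Here I parse a toy response body so the numbers are easy to check; in the real script, dataAll comes from returnedPageAll.content.

```python
import json

# Toy response body; in the real script this is returnedPageAll.content.
dataAll = b'{"TotalCount": 2517, "List": []}'

# Parse the raw bytes into a JSON object (here, a Python dict).
parsedDataAll = json.loads(dataAll)

# Total listings matching the search.
totalCount = parsedDataAll['TotalCount']   # 2517 in this example

# 500 listings per page; int() truncates the fraction (5.034 -> 5),
# and the request loop adds 1 so the partial final page isn't missed.
totalRequests = int(totalCount / 500)
```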

Step 5: Requesting ALL THE DATA

We are going to set up a simple for loop which makes a request for each page we need, from the first page through to totalRequests. For loops execute code a certain number of times, each time running the same code but with different inputs. In our case, it will run the same code but request a different page number each time. Note that Python uses white space in its code, meaning all the code within our for loop needs to be indented by 4 spaces. When we stop indenting, we tell Python we are no longer in the for loop. The whole for loop is below to make following it easier.
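Here is the whole loop as I’ve reconstructed it. It assumes the tradeMe session, imports, and totalRequests from the earlier steps, and the same assumed Dunedin search URL as before. (On newer versions of Pandas, .append() has been removed; pd.concat([pandaAllStorage, pandaAll]) does the same job.)

```python
for i in range(0, totalRequests + 1):
    # Convert the loop counter to a string so it can go in the URL.
    pageNum = str(i)

    # Same search as before, but asking for one specific page.
    searchAll = ('https://api.trademe.co.nz/v1/Search/Property/'
                 'Residential.json?district=71&rows=500&page=' + pageNum)

    # Request the page, pull out its content, and parse the JSON.
    returnedPageAll = tradeMe.get(searchAll)
    dataRawAll = returnedPageAll.content
    parsedDataAll = json.loads(dataRawAll)

    # The listings themselves live under the 'List' key.
    eachListingAll = parsedDataAll['List']

    # Convert this page of listings into a Pandas dataframe.
    pandaAll = pd.DataFrame.from_dict(eachListingAll)

    if i == 0:
        # First page: create a new pickle file.
        pandaAll.to_pickle('dataDunedin.pkl')
    else:
        # Later pages: load what we have, append, and re-save.
        pandaAllStorage = pd.read_pickle('dataDunedin.pkl')
        pandaAllStorage = pandaAllStorage.append(pandaAll,
                                                 ignore_index=True)
        pandaAllStorage.to_pickle('dataDunedin.pkl')

    time.sleep(0.5)              # be polite to TradeMe's servers
    print('Page ' + pageNum + ' done')

print('Finished')
```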

The first line sets up our for loop. We make a variable called i which goes up by 1 each time the for loop executes. We use the range function to set this up: run this code for each i in the range 0 to totalRequests + 1. Since range excludes its end point, i takes the values 0 through totalRequests. We add the +1 for instances where we rounded the total page count down. For example, if there are 5.1 pages of results, totalRequests would have been rounded down to 5 when we used int(totalCount/500) earlier. However, there are listings on that last partial page and we don’t want to miss those! Now that we have our for loop set up, let’s step inside and see how it is requesting, organizing, and storing our data.

The first step inside the loop simply converts our i variable from a number to a string with the str() function and stores it as pageNum. This allows us to insert the page number into our search URL, which we store as searchAll. The key part of the URL is this bit: …&page=’ + pageNum + ‘…    We break up the search URL at the point where it specifies which page to request and add together, or concatenate, the first part of the URL, the current pageNum of our for loop, and the rest of the URL with the + sign. As we did before with our first request, we use our tradeMe session to get the data specified by our URL, pull out the content, and convert it to JSON. However, rather than just getting out the ‘TotalCount‘ key, we want the actual listings. TradeMe stores all the listings under the ‘List‘ key, so to get them out, all we need to do is ask for [‘List’] from parsedDataAll. Now we have each listing stored in eachListingAll in JSON format. JSON may be great for computers, but it’s pretty hard for humans to read. Instead, we can convert our data from JSON into a Pandas dataframe. A dataframe is an intuitive way to store data, with named columns for each type of data, such as Price, Bedrooms, or LandArea, and a row for each listing from TradeMe. To do this, we simply use Pandas’ pd.DataFrame.from_dict() function on our JSON data. Now pandaAll has all the TradeMe listings from one page stored as a Pandas dataframe. If you are using a Python IDE with a workspace, you can now easily browse all the listings of this page by opening pandaAll. Next, we need to save all of our data.

If we are on the first page of results, we need to create a new file on our computer to store our data. We can determine which page we are currently on with a simple if statement. The if statement asks “Are we currently on the first page?”; if so, create a new file, or else, if we are not on the first page, append to the file we already have. If we did not do this step, each new page would just overwrite our data and we’d only store the last page. Python uses 0 indexing, meaning the first of something is assigned a value of 0, not 1. So our if statement asks: if i == 0, in other words if this is the first page, save our dataframe of page 1 as a pickle. Yes, you read that correctly: as a pickle. Python uses a file type called a pickle (.pkl) to store data, like Excel uses .xlsx. We save pandaAll as a pickle file called ‘dataDunedin.pkl’ using the to_pickle() function. This file will now be on your computer wherever your working directory is set to; if you’re using Spyder, you can see your directory in the top right. Or else, when we are not on the first results page, we just add the new data from the current page to the existing pickle file. To do this, we first open the pickle file with the read_pickle() function, passing in our pickle file’s name. Now we have our stored data in Python as pandaAllStorage. We can then append the data from the current page with Pandas’ .append() and save the combined dataframe as we did before. The last two lines within the for loop simply put our code to sleep for half a second (0.5), so as not to make too many requests too quickly to TradeMe, and print out which page was just run to show us that the loop is indeed running. Finally, after the loop, we print that the program is finished.
At this point we have every listing on TradeMe which meets our search criteria set out by our search URL. Now we need to make sense of all this data! The full program for collecting data is at the bottom of this post.

Step 6: Setting up our data

I’ve opted to create a new Python program to analyze our data. We’ll start by importing some libraries that make graphing and data analysis a breeze. We have Pandas again to house our data in dataframes. We use numpy a bit for graphing. We use matplotlib and seaborn for the bulk of our graphs. Next, we use the SciKit-Learn library to create models of our data. Last, we import webbrowser to open up the deals we find in our browser. Then we load the pickle file which stores all our data.
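The imports and loading step might look like this sketch (the pickle filename matches the one saved by the collection script):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import webbrowser

# Load every listing the collection script saved.
dataAll = pd.read_pickle('dataDunedin.pkl')
```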

TradeMe provides us with a lot of data for each listing that is meaningless to us. If you look into dataAll, you’ll see we have columns for things like the real estate agent, best contact time, or whether a listing has pictures. Some of these may be useful later, but for analyzing the price of homes we don’t need them. We’ll create a list of strings that contains only the data we want to analyze and call it labels. These are the attributes I felt were the most important; I’ve also included ‘ListingId’ to use later on. To keep only the columns in our labels list, we simply use dataAll[labels] and store the result as data.
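A sketch of the filtering step. The column names in labels are my assumptions about the fields TradeMe returns (check dataAll.columns for the real names), and the toy dataframe stands in for the real dataAll so the selection is easy to verify.

```python
import pandas as pd

# Toy stand-in for dataAll -- the real one has dozens more columns.
dataAll = pd.DataFrame({'ListingId': [101, 102],
                        'PriceDisplay': ['$500,000', '$350,000'],
                        'Bedrooms': [3, 2],
                        'Bathrooms': [1, 1],
                        'Area': [120, 90],
                        'LandArea': [600, 400],
                        'PropertyType': ['House', 'Townhouse'],
                        'Agency': ['A Ltd', 'B Ltd'],  # noise we drop
                        'HasPhotos': [True, False]})   # noise we drop

# The attributes to keep; 'ListingId' lets us find listings later.
labels = ['ListingId', 'PriceDisplay', 'Bedrooms', 'Bathrooms',
          'Area', 'LandArea', 'PropertyType']

# Keep only those columns.
data = dataAll[labels]
```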

We do have a problem, though, with TradeMe’s real estate prices. They do not provide us with just the price of the listing. Rather, they provide a string called PriceDisplay which holds the listing’s price advertisement. For example, rather than saying 200,000 it’ll say “Offers over $200,000“. In order to do any data analysis, we need to extract the price from these PriceDisplay strings and convert it to a number (code below). First we’ll take out the PriceDisplay column, which is what Pandas refers to as a “Series” (a Series is one column of data; multiple columns make a Dataframe). We can use regular expressions, or regex, to keep only the numbers in our PriceDisplay strings. Regex is a way to parse text by asking for specific characters or combinations of characters. For our purposes, we ask it to delete everything in the string which is not a number and store the result in priceInt (Int referring to integer). However, priceInt is still a string, meaning Python can’t do math on it. We need to tell Python these are only numbers; we can use Pandas’ to_numeric function for that, passing in our priceInt variable and re-storing it as priceInt. priceInt is now a Series of numbers corresponding to each listing’s displayed price. We can now replace the annoying “PriceDisplay” strings in our data with these much more useful numerical prices.
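The price-extraction step, demonstrated on a toy PriceDisplay column so the result is easy to check; in the tutorial you’d run it on data[‘PriceDisplay’] instead.

```python
import pandas as pd

# Toy PriceDisplay strings standing in for data['PriceDisplay'].
price = pd.Series(['Offers over $200,000', 'Enquiries over $350,500'])

# Regex: delete every character that is not a digit (\D).
priceInt = price.replace(to_replace=r'\D', value='', regex=True)

# The digits are still strings; convert them to actual numbers.
priceInt = pd.to_numeric(priceInt)

# In the tutorial we'd now put the numbers back into the dataframe:
# data['PriceDisplay'] = priceInt
```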

We’re almost ready to dive into this data! But first, a few more steps to clean it up and make it easy to work with. First we remove listings which don’t provide all the information we need. We can easily do this with Pandas’ dropna function, which drops listings that have any empty column, or a NaN (Not a Number). TradeMe posters can be tricky and will sometimes input zeros if they don’t know something; for example, if they don’t know the land area they just put 0. We don’t want this messing with our data analysis, so we’ll remove any listing that has 0s. Let’s give our data a quick once-over now. Pandas has an excellent function called describe() which summarizes all the data in a dataframe, providing things like the average, max, and min values and standard deviations. You may want to use these descriptive stats to constrain your data to be more applicable to you. For example, if you see the max bedroom value is 10, and you do not want a house that large, we can remove all listings which are over a certain number of bedrooms. The last few lines show examples of how to take out listings which do not meet my preferences; feel free to edit these or add your own constraints. This data is already quite useful. For example, you can see the average price of homes within your criteria; homes below that average may be a good buy. But let’s go on and visualize all of our data to get a better feel for the market.
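The clean-up steps, sketched out. The zero-removal line and the exact cut-offs are my reconstruction, so adjust them to taste.

```python
# Drop listings missing any value (NaN).
data = data.dropna()

# Drop listings where the poster entered 0 for an unknown value
# (this assumes every remaining column should be non-zero).
data = data[(data != 0).all(axis=1)]

# Descriptive stats: mean, min, max, std, quartiles for each column.
print(data.describe())

# Constrain the data to listings relevant to you -- my preferences,
# edit these or add your own.
data = data[data['Bedrooms'] <= 5]
data = data[data['Bathrooms'] <= 3]
data = data[data['PriceDisplay'] <= 500000]
data = data[data['LandArea'] <= 1500]
```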

Data Description

Step 7: Visualizing our data

The whole purpose of this program is to try to see what a good price for something is, be it a house or car or anything on TradeMe. Rather than scrolling through pages of listings manually and trying to get a “feel” for a good price, we can plot all our data and see what a good price may be. The next section of code uses the seaborn and matplotlib libraries to plot all our data. Feel free to mess around with these graphs and make your own to plot the data you want to see. The first few plots are basic ones, such as a histogram of all the prices and how aspects of a house, like its area, relate to price. We can see from the histogram that you’d expect to pay around $300,000 for a house in Dunedin, which may indicate that houses listed for less than this are a good deal. For the other graphs we use seaborn’s linear regression plot, regplot. These are really informative, as houses priced below the fit line could be a good deal. As an example, homes with an area of 200 m² should be priced around $500,000; homes that are this large and priced under $500,000 may be a bargain. The last plot is a seaborn pairplot that plots ALL the data in a dataframe against itself. It can be a bit overwhelming at first, but it’s quite informative.

sns.distplot(data['PriceDisplay'])
sns.regplot(x='Area', y='PriceDisplay', data=data)
sns.regplot(x='Bedrooms', y='PriceDisplay', data=data, x_jitter=.1)
sns.regplot(x='Bathrooms', y='PriceDisplay', data=data, x_jitter=.1)
sns.pairplot(data, hue='Bedrooms', palette="PuBuGn_d")

 

A quick analysis we can do is a correlation. We can see how the number of bedrooms or the land area correlates with house price. Area and price have a very high correlation (red square) while the number of bedrooms and the land area are not well correlated (teal).

correlations = data.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
plt.show()

Step 8: Modeling our data with machine learning

At this point we’ve already obtained some useful data. We can see what, on average, we may expect to pay for a home given one other variable, like the area of the house, using our linear regression plots. However, these are simple linear regressions; they only use one value. Instead, we could use a multiple linear regression to use all our variables to predict the price of a house. We can use SciKit-Learn’s LinearRegression to do this. We feed it all the data about the homes we’re interested in (number of bedrooms, bathrooms, land area, etc.) and it learns the relationships between these variables and tries to predict the price. First we need to set up our data to feed into the linear regression.

A linear regression takes in the explanatory variables, usually denoted X, and models their relationship to a dependent variable, in our case price, usually denoted y. First we set up our dependent variable by pulling out just [‘PriceDisplay’] from our data and storing it as y. Next we need our explanatory variables as X: everything in data except the price. Thus, we just drop [‘PriceDisplay’] and store the result as X. We also need to take out [‘ListingId’]. Linear regressions can handle numbers, like the number of bedrooms, but cannot take in string data, such as the dwelling type, which could be “House” or “Townhome”. However, the dwelling type could be useful to our model and we don’t want to drop it. To get around this, we can use a special Pandas function called “get_dummies“. get_dummies creates a new column for each type of dwelling and sets the value for a listing to 1 if it is true or 0 if it is false. For example, a house will go from “House” to a 1 in the House column, while a “Townhome” will be 0 for House and 1 for Townhome. This way our multiple linear regression model can use the dwelling information.
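The set-up as I’d sketch it, assuming the column names used above (‘PriceDisplay’, ‘ListingId’, and a string column such as ‘PropertyType’ for get_dummies to encode):

```python
# Dependent variable: the price we want to predict.
y = data['PriceDisplay']

# Explanatory variables: everything except the price and the ID.
X = data.drop('PriceDisplay', axis=1)
X = X.drop('ListingId', axis=1)

# One-hot encode string columns such as the dwelling type, so
# "House" / "Townhome" become separate 0/1 columns.
X = pd.get_dummies(X)
```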

Now we’re almost ready to set up our model! We just need to split our data into training data and testing data: the model trains on one portion of the data and is then tested against a different portion. SciKit-Learn makes this a breeze with its train_test_split() function. We hand it our X and y, tell it how much to train on as a proportion, in this case 90% or 0.90, and whether or not to shuffle the listings. If shuffle is True, it randomly takes 90% of our data for training; otherwise it just takes the first 90% of listings. It’s usually best to have shuffle be True. train_test_split() gives us back our data portioned into training and test sets. Now let’s create and test our model!
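The split, as a sketch:

```python
# Train on a random 90% of listings, test on the held-out 10%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.90, shuffle=True)
```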

Yup, that’s it! SciKit-Learn makes it extremely easy to set up a linear regression model. First we initialize our model, then use the .fit() function to fit our training data to it. Next, we can use the .score() function to see how well our model did. A score of 1.0 would be perfect, meaning our X data predicted our y data with 100% certainty. Anything less means our model isn’t perfect, but that’s okay! You can play around with which variables to use to try to make your model more accurate. You can also keep fitting the data with different train/test splits: if shuffle is True, your train and test data will be different every time you run it, and you could keep running it until you get a very good score, something > 0.8. I was able to get a model with a score of 0.79, which I’ll use for the rest of this post. SciKit-Learn also has many other ways to model data; I encourage you to try other methods and explore machine learning. But how can we use this model to check for deals?
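Fitting and scoring, as a sketch:

```python
# Initialize the model and fit it to the training listings.
model = LinearRegression()
model.fit(X_train, y_train)

# R^2 on the held-out listings; 1.0 would be a perfect prediction.
print(model.score(X_test, y_test))
```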

We’ll first use our model to predict the price of all of our listings using the predict() function with all our explanatory variables, X, and store the result as a Pandas Series called predictions. Next, we rename the prices our model predicted as “Predictions” and our real prices, y, as “Real”. For ease of use, we create a new dataframe with both our real and predicted prices by concatenating them together, calling it comparisons. Lastly, we create a new Series in our comparisons dataframe called [‘Difference’] which takes the difference between the real price of the house on TradeMe and the predicted price from our model. We can then plot all this data to visualize how well our model did. If our model did a good job, the real price and predicted price should be very similar and we’ll get a tight clustering of our data. However, we can utilize the outliers. Perhaps a house is listed for $200,000 and our model predicts it is worth $300,000. This may indicate the house is priced below market value and could be a deal (or our model is just really bad). We can easily see these listings with our Real price versus Difference plot: large negative differences may be deals, while large positive differences may signal asking prices well above average.
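Building the comparison dataframe; keeping ‘ListingId’ alongside the differences is my way of making the later look-ups possible:

```python
# Predict a price for every listing from its attributes.
predictions = pd.Series(model.predict(X), name='Predictions')

# The real asking prices, renamed for clarity.
real = y.reset_index(drop=True).rename('Real')

# Side-by-side dataframe: ID, real price, predicted price.
listingIds = data['ListingId'].reset_index(drop=True)
comparisons = pd.concat([listingIds, real, predictions], axis=1)

# Negative differences: listed below what the model expects.
comparisons['Difference'] = comparisons['Real'] - comparisons['Predictions']
```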

sns.regplot('Real', 'Predictions', data=comparisons)
sns.regplot('Real', 'Difference', data=comparisons)

All this would be pointless if we could not then look at these deals on TradeMe and assess them ourselves. This is why we kept ‘ListingId’ the whole time. Our listing IDs correspond with our predicted differences, so if a house was predicted to be $300,000 but listed at $200,000, we can find the TradeMe listing associated with that discrepancy through its ID. First we find our goodDeals by only taking the listings which have a real versus predicted price difference of -100000 or less. Next, we write a little for loop that goes through all the goodDeals, gets out the ListingId, and uses the webbrowser library to open each listing’s page in our default web browser. I’ve written the loop to only open 10 links at a time; once you’ve looked over the first 10, you can just press the Enter key in Python for the next 10. Happy hunting!
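The deal-finder loop as I’d sketch it; the listing URL format is an assumption (TradeMe listing pages can be reached by ID):

```python
# Listings priced at least $100,000 below the model's prediction.
goodDeals = comparisons[comparisons['Difference'] <= -100000]

# Open the deals on TradeMe, ten browser tabs at a time.
count = 0
for listingId in goodDeals['ListingId']:
    url = ('https://www.trademe.co.nz/Browse/Listing.aspx?id='
           + str(int(listingId)))
    webbrowser.open(url)
    count += 1
    if count % 10 == 0:
        input('Press Enter to open the next 10 listings...')
```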

Note: Depending on how long you take between collecting data and analyzing it, some of the listings may be expired.

Creative Commons License
This work by Blake Porter is licensed under a Creative Commons Attribution-Non Commercial-ShareAlike 4.0 International License
