![]() ![]() These include the regression parameters ( m and b), and also a very useful set of statistics that give you a sense of how accurate these parameters are. LINEST is an Excel array function, and this just means that it returns not only a single value, but a set of values. This LINEST function can also handle multiple input "features", so it will be used later for the more complex models. In this one-dimensional linear regression case, there are many ways to do this calculation in Excel, but I decided to use the LINEST function. The 'harder' part of determinine the 'best' m and b is handled by Excel. Once we know m and b, its' a trivial matter to plug them into the formula above and out comes a predicted rent for any square footage we enter. Where the constants m and b are the model parameters that the linear regression procedure calculates for us. ![]() In this simplest of cases, we'll make an assumption that the house rent is simply linearly proportional to the square footage plus some constant. Let's start with the simplest model one that relies solely on the Square footage. This procedure may seem abstract but it becomes clearer with some examples. For technical reasons that we won't get into in this article, we often use "square error" (square of the distance between a rent prediction and the actually observed rent in the dataset summed over all the data points). It does this by minimizing the error between this function and the points in the data set that are used as input (the 'training set'). Linear regression gives us a practical statistical method to combine those features into a function (formula) that predicts the rent price. We have four input 'features' at our disposal. We will articulate this relation more precisely by building linear regression models.įigure 1 - Rent as a function of Square Footage and Number of Rooms Linear Regression Models Not surprisingly, the rent tends to increase with the square footage and the number of rooms. The depth is the number of rooms whereas the horizontal axis represents square footage. This simple step allows us to construct richer (sometimes 'too rich') models as we will see shortly.įigure 1 is a 3D plot representing the data. This is the "non-linear transformation" I mentioned in the introduction. In Table 1, I added two extra columns that are the square of the square-footage and the square of the number of rooms respectively. The lowest rent in the data set was $675 while the highest was $2750. As a reference, the mean rent turned-out to be about $1500 but this of course has a huge spread (Table 1 is just the lower portion of the data). The data I collected is for a relatively upscale zip code in the Sacramento region, and I was able to get a total of 70 data points after deleting repeats. Also be sure to delete repeat listings (this is a step I also automated in Excel). I recommend that you fine-tune your search in Craigslist as much as possible so it is not 'contaminated' with properties in different zip codes. Table 1 shows a portion of it, ordered by square footage. I wrote a small VBA procedure in Excel that can extract the prices and number of rooms out of the Craigslist search results. The first step in any data mining exercise is to collect the actual data. Some Excel VBA programming is needed for the cross-validation step, but could be excluded if cross-validation is not used. In all this, I used Microsoft Excel exclusively. This article also shows how one can, in a principled manner, select a model that is just complex enough to perform well in-sample, while avoiding "over fitting" the data (bad out-of-sample performance). In spite of its age and simplicity, regression is still a powerful method, especially when you add non-linear transformations and multiple input features to the mix. To keep this first article relatively simple, I used the 'grandfather' of all prediction techniques - linear regression (it's roots go back to the early 1800s with French mathematician Adrien-Marie Legendre). The problem is determining the appropriate rent value for a property given a set of data points 'mined' from Craigslist. Well, a couple of books, and several online machine learning and statistics courses later, I feel prepared to write an article illustrating a practical situation where these data mining techniques proved useful. I've reviewed a few books in this site that relate to this topic and these have made me very curious about the mathematical "machinery" involved in data mining. Excel-based Rent Prediction Using Linear Regression and Cross-Validation.Įxcel-Based Rent Prediction Using Linear Regression and Cross-ValidationĪs I write this (mid 2014), "big data" is a popular topic in computer science and even the popular media outlets are talking about it. ![]()
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |