
How to Build a Big Data Analysis Project from Scratch?
That’s what we’ll present in this article. We’ll guide you step by step through the main stages of building a data science project from the ground up. The project tackles a real problem: what are the main drivers of rental prices in Berlin? We’ll collect data on this question and analyze it, and we’ll also highlight common mistakes beginners make with machine learning. Here are the steps that will be discussed in detail:
- Finding the problem or situation we want to study
- Extracting and cleaning data from the web
- Extracting insights and deep analyses
- Feature engineering using external APIs
- Common mistakes when applying machine learning
- The key feature: Finding the factors and drivers that control rent prices
- Building machine learning models
Finding a Study or Analysis Topic
There are many topics and problems that can be solved by data analysis, but it’s always best to find one you care about and that motivates you. Still, don’t restrict your thinking to your personal interests. Listen to what people around you talk about. What bothers them? What do they complain about? This can be a good source of ideas for a data analysis project. When people doubt the available data or analytics about a topic, it means the problem wasn’t solved correctly the first time. So you can offer a better solution and impact how the topic is understood.
This might sound theoretical and vague, but we’ll clarify it with the example in this article: why we chose to analyze rental prices in Berlin.
High rent prices are a common complaint I heard from people who recently moved to Berlin for work. Most newcomers didn’t realize how expensive Berlin is. There aren’t accurate statistics about the possible range of apartment prices.
Many said that if they had known in advance, they would have asked for a higher salary or considered other options.
We searched Google, checked many apartment rental websites, and asked many people. But we didn’t find any reasonable statistics or visualizations about the rental market. That’s how we landed on this analysis topic.
We wanted to collect data and build an interactive dashboard that lets users select different options—like their desired price range for a 40-square-meter apartment in Berlin with a balcony and equipped kitchen. This would help people understand Berlin apartment prices. We’ll also identify the reasons and factors influencing rent prices using machine learning algorithms.
Extracting and Cleaning Data from the Web
Data Collection
Now that you have an idea for your data analysis project, you can start searching for data. There are many large, amazing data repositories like Kaggle, UCI ML Repository, dataset search engines, and websites with academic papers that include datasets. You can also use web scraping.
But beware, old data fills the web. When we searched for info about Berlin rents, we found plenty of data and charts, but most were outdated or had no date.
Some stats were also too specific, like prices for unfurnished two-room, 50-square-meter apartments. But what if you’re looking for a smaller apartment with a furnished kitchen?
Because we only found old data, we decided to web scrape sites that list rental apartments. Web scraping is a technique for extracting data from websites using automated processing.
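To make the idea concrete, here is a toy, stdlib-only sketch of extracting prices from hypothetical listing markup (the markup and class names are invented for illustration; real scrapers typically use libraries such as Scrapy or BeautifulSoup):

```python
import re
from html.parser import HTMLParser

# Hypothetical listing markup, roughly as a rental site might render it.
HTML = """
<div class="listing"><span class="price">850 &euro;</span><span class="area">54 m²</span></div>
<div class="listing"><span class="price">1.200 &euro;</span><span class="area">80 m²</span></div>
"""

class ListingParser(HTMLParser):
    """Collects the numeric content of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            # "1.200" uses the German thousands separator; keep digits only.
            digits = re.sub(r"[^\d]", "", data)
            if digits:
                self.prices.append(int(digits))
                self.in_price = False

parser = ListingParser()
parser.feed(HTML)
print(parser.prices)
```

The same pattern scales to any field (area, rooms, date); the hard part in practice is handling markup changes and rate limits, not the parsing itself.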
Data Cleaning
Once you start collecting data, it’s crucial to review it as soon as possible to spot any potential issues.
While scraping rental data from websites, we included some small checks in the script, such as monitoring missing data for all features.
After meeting all technical requirements for web scraping, we thought the data would be almost perfect. However, in the end, we needed a whole week to clean it due to duplicates.
Other cleaning problems include missing fields or data. Also, if you use a comma as a separator when saving the data and a field itself contains commas, you can end up with poorly structured files.
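Python’s csv module sidesteps the embedded-comma problem by quoting any field that contains the delimiter. A minimal sketch with made-up listing fields:

```python
import csv
import io

# A description field that itself contains commas.
rows = [
    ["id", "price", "description"],
    ["a1", "850", "2 rooms, balcony, fitted kitchen"],
]

buf = io.StringIO()
# csv.writer quotes fields containing the delimiter, so embedded
# commas do not break the column structure.
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)

# Reading it back recovers exactly three columns per row.
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
print(parsed[1])
```

Saving with naive string joins (`",".join(fields)`) is what produces the poorly structured files described above.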
There are several reasons for duplication:
- Repeated listings because the apartment was online for a long time.
- Errors by rental agencies entering data, then later correcting it or posting a new ad with updated info.
- Price changes for the same apartment (increases or decreases) after a month.
We had to extract several logical rules to filter out duplicates. Once we confirmed that listings were duplicates with slight modifications, we sorted them by date, keeping only the most recent records.
Some agencies also adjust an apartment’s price after a month: if demand is high, the price goes up; if not, the agency lowers it.
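The deduplication rule described above, sort by date and keep only the newest record per apartment, can be sketched in pandas (the columns and values here are hypothetical):

```python
import pandas as pd

# Toy listings: the same apartment (same address + area) posted twice,
# with a price change a month later.
df = pd.DataFrame({
    "address": ["Foo Str. 1", "Foo Str. 1", "Bar Str. 9"],
    "area":    [54, 54, 80],
    "price":   [850, 900, 1200],
    "date":    pd.to_datetime(["2018-05-01", "2018-06-01", "2018-06-10"]),
})

# Sort by date, then keep only the most recent record per apartment.
deduped = (df.sort_values("date")
             .drop_duplicates(subset=["address", "area"], keep="last"))
print(deduped[["address", "price"]])
```

In the real data, the “same apartment” key needed several logical rules (address, area, description similarity), not just two columns.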
Extracting Deep Analyses and Insights
Now that everything is ready, we can start analyzing the data. Most data scientists prefer seaborn or ggplot2 for analysis and conclusions, but we believe interactive dashboards are the best solution: they help users extract useful, deep insights. There are many powerful tools for creating such dashboards, like Tableau and MicroStrategy.
It took less than 30 minutes to create an interactive dashboard so users can select all relevant factors and see their possible impacts on price.
From the charts we obtained, we can see that apartments with 2.5 rooms are cheaper than two-room apartments. The reason is that most 2.5-room apartments are not in the city center, which lowers the price.
Feature Engineering Using External APIs
Charts help identify the features to use in machine learning algorithms. If the features are weak or uninformative, any algorithm will produce poor predictions. But if the features are strong and relevant, predictions will be satisfactory and effective, even with simple algorithms.
Price is the continuous target variable in the Berlin apartment rent analysis project, so this is a typical regression problem. Next, we’ll extract all the influential factors and use them for prediction.
Factors influencing price:
- Number of rooms
- Total area
- Address
- Furnished or not
- Has a kitchen
- Balcony
- Apartment type
- Floor in the building
- Heating type
- Guest toilet
- Number of bathrooms
- Apartment condition (renovated, new, etc.)
- Elevator
- Year of construction
We then encountered a major issue with the address feature. We had about 6,600 apartments but only about 4,400 unique addresses. There were about 200 unique postal codes, which could be converted to dummy variables, but doing so would lose very valuable location information.
What do you do when you get a new address?
Of course, you’d search for it on Google or look up directions.
We used an external API that provided four additional features describing the apartment’s location:
- Train travel time from the apartment to S-Bahn Friedrichstrasse (the main station)
- Driving distance from the apartment to U-Bahn Stadtmitte (city center)
- Walking time from the apartment to the nearest metro station
- Number of metro stations within one kilometer
These four features greatly improved performance.
Common Mistakes in Using Machine Learning and Data Science
After collecting the data, there are many steps to complete before applying a machine learning model.
You need to visualize all variables to see their distributions. You also need to check for outliers and understand the causes of such values.
What can you do with missing values in some features?
What is the best way to convert categorical features to numerical ones?
We’ll try to provide details and solutions to the most common beginner mistakes.
Visualization
First, you should visualize the distribution of continuous features to see if there are many outliers and whether the distributions are logical.
There are many ways to create constructive data visualizations, like box plots, histograms, cumulative distribution functions, and violin plots. Choose the one that gives you the most information about the data.
For assessing distributions (normal, bimodal, etc.), histograms are the most useful. While histograms are a good starting point, box plots can be superior for identifying outliers.
Based on the previous charts, the most interesting question will be: Do you see what you expect to see? Answering this will help you either find insights or spot data errors.
For inspiration and choosing the best chart, we recommend using Python’s seaborn gallery. You can also use Kaggle kernels for inspiration, building visualizations, and extracting insights.
In the rent analysis, we created charts for all continuous variables and expected to see a long tail in the rent-without-bills and total area distributions.
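The box-plot outlier rule mentioned above, flagging points more than 1.5×IQR beyond the quartiles, is easy to check numerically. A sketch on simulated long-tailed rents (not the project’s data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated rents: a roughly normal bulk plus two extreme listings,
# mimicking the long right tail seen in real rent distributions.
rents = np.concatenate([rng.normal(900, 200, 500), [4500.0, 5200.0]])

# Box-plot rule: outliers lie outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(rents, [25, 75])
iqr = q3 - q1
outliers = rents[(rents < q1 - 1.5 * iqr) | (rents > q3 + 1.5 * iqr)]
print(len(outliers))
```

A histogram of `rents` would show the tail; the IQR rule pins down exactly which listings deserve a closer look before deciding whether to drop or keep them.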
Should we impute values based on the entire dataset?
Often, you get data with lots of missing values. If you drop every record with at least one missing value, you might lose a large part of your dataset.
There are many methods to impute missing data. It’s up to you how to do it, but make sure to calculate imputation stats only on the relevant data to avoid data leakage from your test set.
In the Berlin rent data project, we relied heavily on the apartment description field. If quality, type, or condition of the apartment was missing, we assigned values to those fields based on the description info.
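To make the leakage point concrete, here is a minimal pandas sketch with a toy “area” column: the imputation statistic is computed on the training split only and then applied to both splits.

```python
import pandas as pd

# Toy "area" column with gaps; rows 0-5 act as the training split,
# rows 6-7 as the held-out test split.
df = pd.DataFrame({"area": [50.0, 60.0, None, 80.0, 70.0, 55.0, None, 65.0]})
train, test = df.iloc[:6].copy(), df.iloc[6:].copy()

# Compute the statistic on the TRAINING split ONLY, then apply it to
# both splits -- the test set never leaks into the imputation.
train_median = train["area"].median()
train["area"] = train["area"].fillna(train_median)
test["area"] = test["area"].fillna(train_median)
print(train_median)
```

Computing the median over the full dataset instead would quietly feed test-set information into the model’s training data.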
How do we convert categorical variables?
Many machine learning programs don’t work directly with categorical values. So, we need to convert them to numerical values.
There are many ways to convert categorical features to numerical ones, such as label encoding, one-hot encoding, binary encoding, and hashing encoding. Most people use label encoding when they should be using one-hot encoding.
For example, in our dataset, we had a column for apartment type with values like [ground floor, loft, small house, loft, loft, ground floor]. A label encoder would convert this to [3, 2, 1, 2, 2, 3], implying ground floor > loft > small house. For some algorithms, like decision trees, this encoding is fine, but for others, like linear regression and SVMs, it is not.
In our dataset, apartment condition was encoded as:
- New: 1
- Renovated: 2
- Needs renovation: 3
Apartment quality was encoded as:
- Luxury: 1
- Better than average: 2
- Average: 3
- Simple: 4
- Unknown: 5
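For comparison, the one-hot alternative for the apartment-type example can be sketched with pandas (the values are the ones from the example above):

```python
import pandas as pd

flat_type = pd.Series(
    ["ground floor", "loft", "small house", "loft", "loft", "ground floor"],
    name="apartment_type",
)

# One-hot encoding: one binary indicator column per category,
# so no artificial ordering is imposed on the types.
onehot = pd.get_dummies(flat_type)
print(onehot)
```

The trade-off is dimensionality: one-hot turns 200 postal codes into 200 columns, which is exactly why we preferred API-derived location features for the address.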
Should I standardize variables?
Standardization brings all continuous variables to the same scale. If one variable ranges from 1,000 to 1,000,000 and another from 0.1 to 1, they’ll be on the same scale after standardization.
L1 and L2 regularization are common ways to reduce overfitting and can be used with many regression algorithms, but it’s crucial to standardize features before applying them. For example, if rent appears as a feature measured in euros, its fitted coefficient will be about 100 times larger than if it were measured in cents. L1 and L2 penalize larger coefficients, so they penalize features with smaller ranges more heavily. Standardizing first prevents this.
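A minimal sketch of standardization on two toy features with very different ranges:

```python
import numpy as np

# Two features on wildly different scales, as in the example above.
X = np.array([[1_000.0,     0.1],
              [500_000.0,   0.5],
              [1_000_000.0, 1.0]])

# Standardize: subtract the column mean, divide by the column
# standard deviation, giving zero mean and unit variance per feature,
# so an L1/L2 penalty treats both coefficients comparably.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

In a scikit-learn pipeline the same step is `StandardScaler`, fitted on the training split only for the same leakage reasons as imputation.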
Machine Learning
Once you understand the data well and have cleaned outliers, it’s time to use machine learning. There are many machine learning algorithms available.
We wanted to explore three different families of algorithms and compare their distinguishing features, like performance and speed. The comparison included gradient boosted trees (XGBoost and LightGBM), random forest (RF, via scikit-learn), and a 3-layer neural network (NN, via TensorFlow). We chose RMSLE (root mean squared logarithmic error) as the evaluation metric.
The comparison showed that XGBoost and LightGBM perform almost identically. Random Forest’s performance was lower, while NN was the worst.
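RMSLE is easy to implement directly; a sketch (the numbers are illustrative):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; log1p keeps zeros safe."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# RMSLE penalises relative error: being 100 EUR off matters more
# on a 500 EUR flat than on a 5000 EUR one.
print(rmsle([500, 5000], [600, 5100]))
```

That relative-error behaviour is why a log-scale metric suits prices, whose distribution has a long right tail.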
Decision tree-based algorithms are great for interpreting features, as they can show the importance of each factor.
The Key Feature: Identifying Factors Controlling Rent Prices
After applying a decision tree model, you can identify the factors that control or predict prices.
Feature importance provides a score indicating how much each feature contributes to building decision trees inside the model. One way to compute this is by counting the number of times a feature is used to split data across all trees. There are different ways to compute it.
For Berlin apartment rents, it’s no surprise that total area is the most influential factor. Interestingly, some features engineered using external APIs were also important.
The feature importance chart from the rent price analysis reflected this ranking.
We were also interested in the impact of the following factors:
- Travel time to the nearest metro station
- Number of stations within 1 km
Travel time to the nearest metro station
For some apartments, a high value for this factor indicates a higher price. That’s because these apartments are in very wealthy residential areas outside Berlin.
One can also see that the proximity to a metro station has two effects: it can both lower and increase the price of some apartments. This may be because apartments too close to a metro station may suffer from noise or vibrations, but on the other hand, they are very well connected to public transport.
Number of stations within 1 km
Similarly, more metro stations within 1 km generally means higher rent, but it can also add noise.
Combining the Results
After using different models and comparing their performance, you can now combine the results of each model and create an ensemble.
Ensembling is a machine learning technique that combines the predictions of several algorithms to compute the overall final prediction. It is designed to prevent overfitting and reduce variance.
Since we already had predictions from the algorithms above, we combined all four models in every possible way and selected the best seven single and ensemble models according to RMSLE for the validation set.
You can also create a weighted ensemble where you give more weight to the model you prefer.
In fact, you will not know which ensemble works best without trying.
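Simple and weighted averaging can be sketched as follows; the predictions and weights below are made up for illustration:

```python
import numpy as np

# Hypothetical validation-set predictions from three fitted models.
preds = {
    "xgb":  np.array([820.0, 1150.0, 640.0]),
    "lgbm": np.array([830.0, 1120.0, 655.0]),
    "rf":   np.array([790.0, 1200.0, 610.0]),
}

# Simple average ensemble: every model counts equally.
mean_pred = np.mean(list(preds.values()), axis=0)

# Weighted ensemble: trust the boosted trees more than the forest.
weights = {"xgb": 0.4, "lgbm": 0.4, "rf": 0.2}
weighted_pred = sum(w * preds[m] for m, w in weights.items())
print(mean_pred, weighted_pred)
```

Each candidate weighting is then scored on the validation set with the chosen metric (RMSLE here), and the best combination wins.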
Stacked Models
Averaging or weighted averaging are not the only ways to combine different prediction models. Data scientists also use stacked models.
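A minimal stacking sketch using scikit-learn’s StackingRegressor on simulated data (the base models and data are illustrative, not the article’s exact setup): base models produce out-of-fold predictions, and a meta-model learns how to combine them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
# Simulated data: rent as a noisy linear function of two features.
X = rng.uniform(30, 120, (200, 2))
y = 12 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 20, 200)

# Base models' cross-validated predictions become the inputs of the
# final_estimator (a Ridge meta-model here).
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
                ("ridge", Ridge())],
    final_estimator=Ridge(),
)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```

Unlike plain averaging, the meta-model can learn that one base model deserves more weight in some regions of the feature space than in others.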
Conclusion and Final Notes
- Listen to what people around you talk about; their complaints can be a good starting point for a big data analysis project.
- Let people extract their own insights by providing interactive dashboards.
- Don’t limit yourself to common methods and algorithms. Try to find additional data sources and explanations.
- Try ensemble and stacked models—they can greatly improve performance.
Source: https://medium.freecodecamp.org/how-to-build-a-data-science-project-from-scratch-dc4f096a62a1