::p_load('sf', 'tidyverse', 'tmap', 'spdep', 'onemapsgapi', 'units', 'matrixStats', 'readxl', 'jsonlite', 'olsrr', 'corrplot', 'ggpubr', 'GWmodel',
pacman'devtools', 'kableExtra', 'plotly', 'ggthemes', 'ranger', 'Metrics')
Take-home Exercise 3: Predicting HDB Public Housing Resale Pricies using Geographically Weighted Methods
1 Setting the scene
Several things influence housing costs. Some of them have a worldwide scope, such the overall health of a nation’s economy or the level of inflation. Some may focus more on the properties themselves. You can further divide these characteristics into structural and geographic ones.
Geographical Weighted Models were introduced for enhancing predictive model for housing resale prices as traditional price predictive models failed to take account the spatial correlation and heterogeneity in geographical aspects which caused inaccuracy and bias.
1.1 Loading the data
1.2 Data set used
<- data.frame(
datasets Type=c("Aspatial",
"Geospatial",
"Geospatial - Extracted",
"Geospatial - Extracted",
"Geospatial - Extracted",
"Geospatial - Extracted",
"Geospatial - Extracted",
"Geospatial - Extracted",
"Geospatial - Extracted",
"Geospatial - Selfsourced",
"Geospatial - Selfsourced",
"Geospatial - Selfsourced",
"Geospatial - Selfsourced",
"Geospatial - Selfsourced"),
Name=c("Resale Flat Prices",
"Master Plan 2019 Subzone Boundary (Web)",
"Childcare Services",
"Community Clubs",
"Eldercare Services",
"Hawker Centres",
"Kindergartens",
"Parks",
"Libraries",
"Bus Stop Locations Aug 2021",
"MRT & LRT Locations Aug 2021",
"Supermarkets",
"Shopping Mall SVY21 Coordinates",
"Primary School"),
Format=c(".csv",
".shp",
".shp",
".shp",
".shp",
".shp",
".shp",
".shp",
".shp",
".shp",
".kml",
".shp",
".shp",
".csv"),
Source=c("[data.gov.sg](https://data.gov.sg/dataset/resale-flat-prices)",
"[data.gov.sg](https://data.gov.sg/dataset/master-plan-2014-subzone-boundary-web)",
"[OneMap API](https://www.onemap.gov.sg/docs/)",
"[OneMap API](https://www.onemap.gov.sg/docs/)",
"[OneMap API](https://www.onemap.gov.sg/docs/)",
"[OneMap API](https://www.onemap.gov.sg/docs/)",
"[OneMap API](https://www.onemap.gov.sg/docs/)",
"[OneMap API](https://www.onemap.gov.sg/docs/)",
"[OneMap API](https://www.onemap.gov.sg/docs/)",
"[datamall.lta](https://datamall.lta.gov.sg/content/datamall/en/search_datasets.html?searchText=bus%20stop)",
"[data.gov](https://data.gov.sg/dataset/lta-mrt-station-exit)",
"[Onemap.gov](https://www.onemap.gov.sg/main/v2/essentialamenities)",
"[Valery Lim's Github](https://github.com/ValaryLim/Mall-Coordinates-Web-Scraper/blob/master/mall_coordinates_updated.csv)",
"[data.gov](https://data.gov.sg/dataset/school-directory-and-information)")
)
library(knitr)
library(kableExtra)
kable(datasets, caption="Datasets Used") %>%
kable_material("hover", latex_options = "scale_down")
Type | Name | Format | Source |
---|---|---|---|
Aspatial | Resale Flat Prices | .csv | [data.gov.sg](https://data.gov.sg/dataset/resale-flat-prices) |
Geospatial | Master Plan 2019 Subzone Boundary (Web) | .shp | [data.gov.sg](https://data.gov.sg/dataset/master-plan-2014-subzone-boundary-web) |
Geospatial - Extracted | Childcare Services | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
Geospatial - Extracted | Community Clubs | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
Geospatial - Extracted | Eldercare Services | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
Geospatial - Extracted | Hawker Centres | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
Geospatial - Extracted | Kindergartens | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
Geospatial - Extracted | Parks | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
Geospatial - Extracted | Libraries | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
Geospatial - Selfsourced | Bus Stop Locations Aug 2021 | .shp | [datamall.lta](https://datamall.lta.gov.sg/content/datamall/en/search_datasets.html?searchText=bus%20stop) |
Geospatial - Selfsourced | MRT & LRT Locations Aug 2021 | .kml | [data.gov](https://data.gov.sg/dataset/lta-mrt-station-exit) |
Geospatial - Selfsourced | Supermarkets | .shp | [Onemap.gov](https://www.onemap.gov.sg/main/v2/essentialamenities) |
Geospatial - Selfsourced | Shopping Mall SVY21 Coordinates | .shp | [Valery Lim's Github](https://github.com/ValaryLim/Mall-Coordinates-Web-Scraper/blob/master/mall_coordinates_updated.csv) |
Geospatial - Selfsourced | Primary School | .csv | [data.gov](https://data.gov.sg/dataset/school-directory-and-information) |
2 Data Preparation
2.1 Aspatial data
Loading Resale Data
<- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv") resale
glimpse(resale)
Rows: 148,680
Columns: 11
$ month <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block <chr> "406", "108", "602", "465", "601", "150", "447", "…
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
$ storey_range <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
$ floor_area_sqm <dbl> 44, 67, 67, 68, 67, 68, 68, 67, 68, 67, 68, 67, 67…
$ flat_model <chr> "Improved", "New Generation", "New Generation", "N…
$ lease_commence_date <dbl> 1979, 1978, 1980, 1980, 1980, 1981, 1979, 1976, 19…
$ remaining_lease <chr> "61 years 04 months", "60 years 07 months", "62 ye…
$ resale_price <dbl> 232000, 250000, 262000, 265000, 265000, 275000, 28…
On a personal note, I am currently staying in a 3 room flat at Bedok, and my family is planning to move soon. Hence, I have decided to scope it down to 3 room flats so that I can roughly gauge how much my family can resell the house! It will be interesting to look at how various geographical factors affect resale prices :-)
As such , we will be looking at 3 room flats, within January of 2021 to February of 2023, which will be separated again later on. This is done by using filter() on flat_type and month.
<- resale %>%
resale_sub filter(flat_type == "3 ROOM",
>= "2021-01" & month <= "2023-02") month
To double confirm that we extracted what we really want, we will be using unique() to view what we extracted. As seen below, we have extracted the correct range for the dates, and the flat_type of 3 Room.
unique(resale_sub$month)
[1] "2021-01" "2021-02" "2021-03" "2021-04" "2021-05" "2021-06" "2021-07"
[8] "2021-08" "2021-09" "2021-10" "2021-11" "2021-12" "2022-01" "2022-02"
[15] "2022-03" "2022-04" "2022-05" "2022-06" "2022-07" "2022-08" "2022-09"
[22] "2022-10" "2022-11" "2022-12" "2023-01" "2023-02"
unique(resale_sub$flat_type)
[1] "3 ROOM"
glimpse(resale_sub)
Rows: 13,780
Columns: 11
$ month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block <chr> "331", "534", "561", "170", "463", "542", "170", "…
$ street_name <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 10", "ANG MO K…
$ storey_range <chr> "04 TO 06", "04 TO 06", "01 TO 03", "07 TO 09", "0…
$ floor_area_sqm <dbl> 68, 68, 68, 60, 68, 68, 60, 73, 67, 67, 68, 68, 73…
$ flat_model <chr> "New Generation", "New Generation", "New Generatio…
$ lease_commence_date <dbl> 1981, 1980, 1980, 1986, 1980, 1981, 1986, 1976, 19…
$ remaining_lease <chr> "59 years", "58 years 02 months", "58 years 01 mon…
$ resale_price <dbl> 260000, 265000, 265000, 268000, 268000, 270000, 27…
2.1.2 Aspatial Wrangling
As we can see from the resale_sub, the variable remaining lease and lease commence date essentially provides the same information. We will only be including remaining lease, as it directly tells us the remaining year. Most importantly, this will help us to ensure that the variables are not perfectly collinear during our regression later!
<- resale_sub %>%
resale_sub select(-lease_commence_date)
Additionally, after looking at the resale_sub variables, we notice that remaining lease is <chr>. We would want it to be numeric, as it will be beneficial for us later on. We will have to first split the months and years in the variable remaining_lease.
<- resale_sub %>%
resale_sub mutate(remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2))) %>%
mutate(remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))
glimpse(resale_sub)
Rows: 13,780
Columns: 12
$ month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block <chr> "331", "534", "561", "170", "463", "542", "170", "…
$ street_name <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 10", "ANG MO K…
$ storey_range <chr> "04 TO 06", "04 TO 06", "01 TO 03", "07 TO 09", "0…
$ floor_area_sqm <dbl> 68, 68, 68, 60, 68, 68, 60, 73, 67, 67, 68, 68, 73…
$ flat_model <chr> "New Generation", "New Generation", "New Generatio…
$ remaining_lease <chr> "59 years", "58 years 02 months", "58 years 01 mon…
$ resale_price <dbl> 260000, 265000, 265000, 268000, 268000, 270000, 27…
$ remaining_lease_yr <int> 59, 58, 58, 64, 58, 59, 64, 54, 56, 55, 59, 58, 55…
$ remaining_lease_mth <int> NA, 2, 1, 2, 2, 1, NA, 4, 10, 4, 1, 8, 7, 9, 4, 5,…
From above, we notice that there are NA values for remaining lease month. We will replace it with 0 by identifying it with is.na()
$remaining_lease_mth[is.na(resale_sub$remaining_lease_mth)] <- 0 resale_sub
Our main goal here is to convert remaining lease months into years, and add them together under remaining_lease_year. we will first divide the remaining lease month variable by 12.
$remaining_lease_mth <- resale_sub$remaining_lease_mth/12 resale_sub
Then, we will combine them together under the new variable remaining lease year.
<- resale_sub %>%
resale_sub mutate(resale_sub, remaining_lease_year = rowSums(resale_sub[, c("remaining_lease_yr", "remaining_lease_mth")]))
By referencing to our favorite senior, Megan, she advised that we replace “SAINT” with “ST”, as onemap spells it that way in the variable street_name, which will be a bit confusing. We can change that by using the code chunk below
$street_name <- gsub("ST\\.", "SAINT", resale_sub$street_name) resale_sub
Similarly, by referencing Megan’s work, we notice that there are no coordinates provided in the resale data. we will have to use a geocoding function as shown below.
The steps are as follows:
Combine the block and street name into an address
Pass the address as the searchVal in our query
Send the query to OneMapSG search Note: Since we don’t need all the address details, we can set
getAddrDetails
as ‘N’Convert response (JSON object) to text
Saving the response in text form as a data frame
retaining the latitude and longitude for our output
library(httr)
library(rjson)
<- function(block, streetname) {
geocode <- "https://developers.onemap.sg/commonapi/search"
base_url <- paste(block, streetname, sep = " ")
address <- list("searchVal" = address,
query "returnGeom" = "Y",
"getAddrDetails" = "N",
"pageNum" = "1")
<- GET(base_url, query = query)
res <-content(res, as="text")
restext
<- fromJSON(restext) %>%
output %>%
as.data.frame select(results.LATITUDE, results.LONGITUDE)
return(output)
}
After creating the function, we will have to make sure that the iterations are able to run through the whole resale data to ensure that every entry gets a coordinate. As such, we will be creating a loop
$LATITUDE <- 0
resale_sub$LONGITUDE <- 0
resale_sub
for (i in 1:nrow(resale_sub)){
<- geocode(resale_sub[i, 4], resale_sub[i, 5])
temp_output
$LATITUDE[i] <- temp_output$results.LATITUDE
resale_sub$LONGITUDE[i] <- temp_output$results.LONGITUDE
resale_sub }
Let us save it into a RDS object
saveRDS(resale_sub, file="resale_sub", compress=FALSE)
We will then assign the RDS object to resale_sub
<- readRDS("resale_sub") resale_sub
2.1.3 Creating ordinal arrangement for floor level
We will create an ordinal column for the floor level for the purpose of our analysis later on! We will first use unique() to identify all of the values from storey_range
unique(resale_sub$storey_range)
[1] "04 TO 06" "01 TO 03" "07 TO 09" "10 TO 12" "13 TO 15" "28 TO 30"
[7] "19 TO 21" "16 TO 18" "25 TO 27" "22 TO 24" "34 TO 36" "40 TO 42"
[13] "31 TO 33" "37 TO 39" "46 TO 48" "43 TO 45"
We will then write it down into a vector, and assign it to levels. After that, we will assign story_ordinal by using seq_along() and changing story_ordinal into numeric using as.numeric().
<- c("01 TO 03", "04 TO 06", "07 TO 09", "10 TO 12", "13 TO 15", "16 TO 18", "19 TO 21", "22 TO 24", "25 TO 27", "28 TO 30", "31 TO 33", "34 TO 36", "37 TO 39", "40 TO 42", "43 TO 45", "46 TO 48", "49 TO 51")
levels
<- seq_along(levels)
story_ordinal
$Story_Ordinal <- story_ordinal[match(resale_sub$storey_range, levels)]
resale_sub
levels(resale_sub$Story_Ordinal) <- levels
<- resale_sub |>
resale_sub mutate(Story_Ordinal=as.numeric(Story_Ordinal))
2.1.4 Mutating remaining columns
We noticed that month is still <chr>, which is able to be changed by as.Date. By referencing to this website, %Y denotes the year with century, %m for month, and %d for day.
<- resale_sub %>%
resale_sub mutate(month = as.Date(paste(month, "-01"), format ="%Y-%m-%d"))
2.1.5 Checking for duplicates
Finally, let us check for duplicates. We are good to go!
sum(is.na(resale_sub$LATITUDE))
[1] 0
sum(is.na(resale_sub$LONGITUDE))
[1] 0
Geospatial data
As mentioned above, we will be extracting geospatial data from Onemap. the codes used below will require the onemapsgapi package, which is already loaded in 1.1. I will break down the steps to extract data from Onemap as simple as possible!
First, create an account for Onemap
Fill up your details in this link (with the details sent to your email)
Ready to go after following the code chunks below!
Do note that this is purely for reference, you will have to type in your own email and password to load your token!
Type this command into the console, with your own email and password
<- get_token("email", "password") run_token
Next, since I love food, I will be giving an example to extract hawker data from Onemap, using this code chunk.
<- search_themes(run_token, "hawkercentre") themes
We will use the following function to extract data from Onemap. Lets assign it into hawker.
<- get_theme(run_token, "hawkercentre") hawkercentre
And of course, our favorite part, ensuring that our spatial data have the correct ESPG CRS, which is 3414. We will also need to transform it into sf using st_as_sf(). Since we will be dealing with various extracted data, we do not want to keep extracting it each time we run or render the page. Hence, we will be using st_write() to write the sf object to file or database!
<- st_as_sf(hawkercentre, coords = c("Lng", "Lat"), crs = 4326) %>%
hawkercentre_sf st_transform(crs = 3414)
st_write(obj = hawkercentre_sf,
dsn = "data/geospatial/extracted",
layer = "hawkercentre",
driver = "ESRI Shapefile")
We can just repeat the steps above for the following variables. For the scope of this assignment, we will be extracting the following:
childcare
community clubs
elder care
kindergartens
libraries
parks
I have already extracted the data beforehand, so let us proceed!
(Note that we have already extracted hawker centers!)
For reference, i will be putting the code below to extract the data from onemap
<- get_theme(token,"childcare")
childcare <- st_as_sf(childcare, coords = c("Lng", "Lat"), crs = 4326) %>%
childcare_sf st_transform(crs = 3414)
st_write(obj = childcare_sf,
dsn = "data/geospatial/extracted",
layer = "childcare",
driver = "ESRI Shapefile")
<- get_theme(token,"communityclubs")
communityclubs<- st_as_sf(childcare, coords = c("Lng", "Lat"), crs = 4326) %>%
communityclubs_sf st_transform(crs = 3414)
st_write(obj = communityclubs_sf,
dsn = "data/geospatial/extracted",
layer = "communityclubs",
driver = "ESRI Shapefile")
<- get_theme(token,"eldercare")
eldercare <- st_as_sf(eldercare, coords=c("Lng", "Lat"), crs = 4326) %>%
eldercare_sf st_transform(crs = 3414)
st_write(obj = eldercare_sf,
dsn = "data/geospatial/extracted",
layer = "eldercare",
driver = "ESRI Shapefile")
<- get_theme(token,"kindergartens")
kindergartens <- st_as_sf(kindergartens, coords=c("Lng", "Lat"), crs= 4326) %>%
kindergartens_sf st_transform(crs = 3414)
st_write(obj = kindergartens_sf,
dsn = "data/geospatial/extracted",
layer = "kindergartens",
driver = "ESRI Shapefile")
<- get_theme(token,"libraries")
library <- st_as_sf(library, coords=c("Lng", "Lat"), crs = 4326) %>%
library_sf st_transform(crs = 3414)
st_write(obj = library_sf,
dsn = "data/geospatial/extracted",
layer = "libraries",
driver = "ESRI Shapefile")
<- get_theme(token,"nationalparks")
parks <- st_as_sf(parks, coords=c("Lng", "Lat"), crs = 4326) %>%
parks_sf st_transform(crs = 3414)
st_write(obj = parks_sf,
dsn = "data/geospatial/extracted",
layer = "parks",
driver = "ESRI Shapefile")
2.2.1 Reading Geospatial Data
<- st_read(dsn = "data/geospatial/extracted", layer = "hawkercentre") hawkercentre_sf
Reading layer `hawkercentre' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/extracted'
using driver `ESRI Shapefile'
Simple feature collection with 125 features and 18 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 12874.19 ymin: 28355.97 xmax: 45241.4 ymax: 47850.43
Projected CRS: SVY21 / Singapore TM
<- st_read(dsn = "data/geospatial/extracted", layer = "eldercare") eldercare_sf
Reading layer `eldercare' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/extracted'
using driver `ESRI Shapefile'
Simple feature collection with 133 features and 4 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21 / Singapore TM
<- st_read(dsn = "data/geospatial/extracted", layer = "childcare") childcare_sf
Reading layer `childcare' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/extracted'
using driver `ESRI Shapefile'
Simple feature collection with 1925 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11810.03 ymin: 25596.33 xmax: 45404.24 ymax: 49300.88
Projected CRS: SVY21 / Singapore TM
<- st_read(dsn = "data/geospatial/extracted", layer = "communityclubs") communityclubs_sf
Reading layer `communityclubs' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/extracted'
using driver `ESRI Shapefile'
Simple feature collection with 125 features and 11 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 12308.4 ymin: 28593.37 xmax: 42008.87 ymax: 48958.52
Projected CRS: SVY21 / Singapore TM
<- st_read(dsn = "data/geospatial/extracted", layer = "kindergartens") kindergartens_sf
Reading layer `kindergartens' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/extracted'
using driver `ESRI Shapefile'
Simple feature collection with 448 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11909.7 ymin: 25596.33 xmax: 43395.47 ymax: 48562.06
Projected CRS: SVY21 / Singapore TM
<- st_read(dsn = "data/geospatial/extracted", layer = "libraries") library_sf
Reading layer `libraries' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/extracted'
using driver `ESRI Shapefile'
Simple feature collection with 31 features and 13 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 13665.24 ymin: 27383.57 xmax: 40922.89 ymax: 47759.75
Projected CRS: SVY21 / Singapore TM
<- st_read(dsn = "data/geospatial/extracted", layer = "nationalparks") parks_sf
Reading layer `nationalparks' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/extracted'
using driver `ESRI Shapefile'
Simple feature collection with 421 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 12374.75 ymin: 21917.81 xmax: 52533.09 ymax: 49296.46
Projected CRS: SVY21 / Singapore TM
<- st_read(dsn = "data/geospatial/map", layer = "MP14_SUBZONE_WEB_PL") mpsz_sf
Reading layer `MP14_SUBZONE_WEB_PL' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/map'
using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
<- st_read(dsn = "data/geospatial/sourced", layer = "BusStop") busstop_sf
Reading layer `BusStop' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/sourced'
using driver `ESRI Shapefile'
Simple feature collection with 5159 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
<- st_read(dsn = "data/geospatial/sourced", layer = "goodprimarysch") goodprisch_sf
Reading layer `goodprimarysch' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/sourced'
using driver `ESRI Shapefile'
Simple feature collection with 9 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 103.7611 ymin: 1.305285 xmax: 103.937 ymax: 1.34968
Geodetic CRS: WGS 84
<- st_read(dsn = "data/geospatial/sourced", layer = "mrt") mrt_sf
Reading layer `mrt' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/sourced'
using driver `ESRI Shapefile'
Simple feature collection with 474 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 103.6368 ymin: 1.264972 xmax: 103.9893 ymax: 1.449157
Geodetic CRS: WGS 84
<- st_read(dsn = "data/geospatial/sourced", layer = "primarysch") prisch_sf
Reading layer `primarysch' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/sourced'
using driver `ESRI Shapefile'
Simple feature collection with 183 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 103.6878 ymin: 1.274958 xmax: 103.9628 ymax: 1.456608
Geodetic CRS: WGS 84
<- st_read(dsn = "data/geospatial/sourced", layer = "shoppingmall") shoppingmall_sf
Reading layer `shoppingmall' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/sourced'
using driver `ESRI Shapefile'
Simple feature collection with 184 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 103.6784 ymin: 1.263797 xmax: 103.9897 ymax: 1.448227
Geodetic CRS: WGS 84
<- st_read(dsn = "data/geospatial/sourced", layer = "SUPERMARKETS") supermarket_sf
Reading layer `SUPERMARKETS' from data source
`/Users/keredpoh/Desktop/keredpoh/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial/sourced'
using driver `ESRI Shapefile'
Simple feature collection with 526 features and 8 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 4901.188 ymin: 25529.08 xmax: 46948.22 ymax: 49233.6
Projected CRS: SVY21
2.2.2 Converting Self-sourced data
As we all know, we have yet to change our out-sourced data to the correct crs, which is 3414. Let us change it below using st_transform()! Do note that we do not need to do this all the time, especially when the data is already in the correct CRS. But in this case, all of the data are not.
<- busstop_sf %>%
busstop_sf st_transform(3414)
<- goodprisch_sf %>%
goodprisch_sf st_transform(3414)
<- mrt_sf %>%
mrt_sf st_transform(3414)
<- prisch_sf %>%
prisch_sf st_transform(3414)
<- shoppingmall_sf %>%
shoppingmall_sf st_transform(3414)
<- supermarket_sf %>%
supermarket_sf st_transform(3414)
2.2.3 Checking for invalid geometries
We will be checking invalid geometries using the length(), which() and st_is_valid() functions.
length(which(st_is_valid(hawkercentre_sf) == FALSE))
[1] 0
length(which(st_is_valid(communityclubs_sf) == FALSE))
[1] 0
length(which(st_is_valid(eldercare_sf) == FALSE))
[1] 0
length(which(st_is_valid(childcare_sf) == FALSE))
[1] 0
length(which(st_is_valid(kindergartens_sf) == FALSE))
[1] 0
length(which(st_is_valid(library_sf) == FALSE))
[1] 0
length(which(st_is_valid(parks_sf) == FALSE))
[1] 0
length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 9
length(which(st_is_valid(busstop_sf) == FALSE))
[1] 0
length(which(st_is_valid(goodprisch_sf) == FALSE))
[1] 0
length(which(st_is_valid(mrt_sf) == FALSE))
[1] 0
length(which(st_is_valid(prisch_sf) == FALSE))
[1] 0
length(which(st_is_valid(shoppingmall_sf) == FALSE))
[1] 0
length(which(st_is_valid(supermarket_sf) == FALSE))
[1] 0
Alright, that aside, we notice that the basemap have a few invalid geometries. This can be corrected using st_make_valid()
<- st_make_valid(mpsz_sf) mpsz_sf
2.3.4 Selecting name column
As we only need the name and geometry column, we will use the select() function.
<- hawkercentre_sf %>%
hawkercentre_sf select(1)
<- communityclubs_sf %>%
communityclubs_sf select(1)
<- eldercare_sf %>%
eldercare_sf select(1)
<- childcare_sf %>%
childcare_sf select(1)
<- kindergartens_sf %>%
kindergartens_sf select(1)
<- library_sf %>%
library_sf select(1)
<- parks_sf %>%
parks_sf select(1)
<- busstop_sf %>%
busstop_sf select(1)
<- goodprisch_sf %>%
goodprisch_sf select(1)
<- mrt_sf %>%
mrt_sf select(1)
<- prisch_sf %>%
prisch_sf select(1)
<- shoppingmall_sf %>%
shoppingmall_sf select(1)
<- supermarket_sf %>%
supermarket_sf select(1)
2.3.5 Checking for missing values
What is more scary than missing geometries? Missing values, of course! That is a huge no no. Let us check missing values with is.na().
rowSums(is.na(hawkercentre_sf))!=0,] hawkercentre_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] NAME geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(communityclubs_sf))!=0,] communityclubs_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] NAME geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(eldercare_sf))!=0,] eldercare_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] NAME geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(childcare_sf))!=0,] childcare_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] NAME geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(kindergartens_sf))!=0,] kindergartens_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] NAME geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(library_sf))!=0,] library_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] NAME geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(parks_sf))!=0,] parks_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] NAME geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(mpsz_sf))!=0,] mpsz_sf[
Simple feature collection with 0 features and 15 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] OBJECTID SUBZONE_NO SUBZONE_N SUBZONE_C CA_IND PLN_AREA_N
[7] PLN_AREA_C REGION_N REGION_C INC_CRC FMEL_UPD_D X_ADDR
[13] Y_ADDR SHAPE_Leng SHAPE_Area geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(busstop_sf))!=0,] busstop_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] BUS_STOP_N geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(goodprisch_sf))!=0,] goodprisch_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] schl_nm geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(mrt_sf))!=0,] mrt_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] Name geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(prisch_sf))!=0,] prisch_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] schl_nm geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(shoppingmall_sf))!=0,] shoppingmall_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] name geometry
<0 rows> (or 0-length row.names)
rowSums(is.na(supermarket_sf))!=0,] supermarket_sf[
Simple feature collection with 0 features and 1 field
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] LIC_NAME geometry
<0 rows> (or 0-length row.names)
3 Exploratory Data Analysis
For this section, we will be visualising data with tmap! For colour coding purposes,
Blue denotes public space and transport
Red denotes education
Green denotes park
Purple denotes elder care :-)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(hawkercentre_sf) +
tm_dots(col = "blue", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(communityclubs_sf) +
tm_dots(col = "blue", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(eldercare_sf) +
tm_dots(col = "purple", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(childcare_sf) +
tm_dots(col ="red", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(kindergartens_sf) +
tm_dots(col = "red", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(library_sf) +
tm_dots(col = "blue", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(parks_sf) +
tm_dots(col = "green", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(busstop_sf) +
tm_dots(col = "blue", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(goodprisch_sf) +
tm_dots(col = "red", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(mrt_sf) +
tm_dots(col = "blue", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(prisch_sf) +
tm_dots(col = "red", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(shoppingmall_sf) +
tm_dots(col = "blue", size = 0.04)
tmap_mode("plot")
tm_shape(mpsz_sf) +
tm_polygons(alpha = 0.3) +
tm_shape(supermarket_sf) +
tm_dots(col = "blue", size = 0.04)
4 Locational factors
With aspatial and geospatial data in set, we will now be setting locational factors which will enable us to view the proximity from one location to the area of interest.
4.1 CBD location
We will be taking reference from this website, to get the latitude and longitude of downtown core. SImilarly, we will convert it into the correct CRS, which is 3414.
<- 1.287953
lat <- 103.851784
lng
<- data.frame(lat, lng) %>%
cbd_sf st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_transform(crs = 3414)
4.2 Converting resale data into sf
Notice that our resale data, resale_sub do not have any CRS.
st_crs(resale_sub)
Coordinate Reference System: NA
In that case, we will be assigning CRS using st_as_sf to convert it into a sf object, while inserting CRS 3414 using st_transform.
<- st_as_sf(resale_sub, coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>%
resale_sub st_transform(crs = 3414)
Let us do a check. Looking good!
st_crs(resale_sub)
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
4.3 Proximity distance calculation
Earlier, we were just discussing about how locational factors will help us view the proximity from one location to the other. By referencing to our favourite senior, Megan again, we will be calculating as such using st_distance() to get the shortest distance, together with rowMins().
<- function(df1, df2, varname) {
proximity <- st_distance(df1, df2) |>
dist_matrix drop_units()
<- rowMins(dist_matrix)
df1[,varname] return(df1)
}
We will then input all the proximity into all of our data, using the newly created proximity() function.
<-
resale_sub proximity(resale_sub, cbd_sf, "PROX_CBD") %>%
proximity(., communityclubs_sf, "PROX_COMMUNITYCLUBS") %>%
proximity(., childcare_sf, "PROX_CHILDCARE") %>%
proximity(., kindergartens_sf, "PROX_KINDERGARTEN") %>%
proximity(., eldercare_sf, "PROX_ELDERCARE") %>%
proximity(., hawkercentre_sf, "PROX_HAWKER") %>%
proximity(., busstop_sf, "PROX_BUSSTOP") %>%
proximity(., mrt_sf, "PROX_MRT") %>%
proximity(., library_sf, "PROX_LIBRARY") %>%
proximity(., parks_sf, "PROX_PARK") %>%
proximity(., goodprisch_sf, "PROX_GOODPRISCH") %>%
proximity(., shoppingmall_sf, "PROX_MALL") %>%
proximity(., supermarket_sf, "PROX_SPRMKT")
4.4 Facility count within radius calculation
As the saying goes, size is not everything. Knowing the distance between one point to the other is simply not enough. Hence, we will use st_distance() and rowSums() to find the particular facility within the radius!
<- function(df1, df2, varname, radius) {
num_radius <- st_distance(df1, df2) %>%
dist_matrix drop_units() %>%
as.data.frame()
<- rowSums(dist_matrix <= radius)
df1[,varname] return(df1)
}
For the scope of our assignment, we are required to set:
Numbers of kindergartens within 350m
Numbers of childcare centres within 350m
Numbers of bus stop within 350m
Numbers of primary school within 1km
Let us do that then!
<-
resale_sub num_radius(resale_sub, kindergartens_sf, "NUM_KINDERGARTEN", 350) %>%
num_radius(., childcare_sf, "NUM_CHILDCARE", 350) %>%
num_radius(., busstop_sf, "NUM_BUSSTOP", 350) %>%
num_radius(., prisch_sf, "NUM_PRIMARYSCH", 1000)
4.5 Saving as RDS file
With that, let us save the updated resale_sub as RDS so that we can use it later on!
saveRDS(resale_sub, "data/final/resale_sub.rds")
And assign it back into resale_sub!
<- read_rds("data/final/resale_sub.rds") resale_sub
Let us glimpse resale_sub now. Looking good!
glimpse(resale_sub)
Rows: 13,780
Columns: 32
$ month <date> 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01, …
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO…
$ flat_type <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM",…
$ block <chr> "331", "534", "561", "170", "463", "542", "170", …
$ street_name <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 10", "ANG MO …
$ storey_range <chr> "04 TO 06", "04 TO 06", "01 TO 03", "07 TO 09", "…
$ floor_area_sqm <dbl> 68, 68, 68, 60, 68, 68, 60, 73, 67, 67, 68, 68, 7…
$ flat_model <chr> "New Generation", "New Generation", "New Generati…
$ remaining_lease <chr> "59 years", "58 years 02 months", "58 years 01 mo…
$ resale_price <dbl> 260000, 265000, 265000, 268000, 268000, 270000, 2…
$ remaining_lease_yr <int> 59, 58, 58, 64, 58, 59, 64, 54, 56, 55, 59, 58, 5…
$ remaining_lease_mth <dbl> 0.00000000, 0.16666667, 0.08333333, 0.16666667, 0…
$ remaining_lease_year <dbl> 59.00000, 58.16667, 58.08333, 64.16667, 58.16667,…
$ Story_Ordinal <dbl> 2, 2, 1, 3, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 4, 3, 4…
$ geometry <POINT [m]> POINT (29941.75 38240.88), POINT (30320.3 3…
$ PROX_CBD <dbl> 8200.838, 9524.798, 9161.157, 9666.904, 8757.889,…
$ PROX_COMMUNITYCLUBS <dbl> 326.61835, 570.35903, 931.51879, 171.26726, 612.7…
$ PROX_CHILDCARE <dbl> 1.096366e-03, 1.303650e+02, 1.326133e+02, 7.67633…
$ PROX_KINDERGARTEN <dbl> 296.49085, 130.36502, 162.86335, 123.23507, 133.4…
$ PROX_ELDERCARE <dbl> 412.40614, 1008.60569, 683.47390, 429.68375, 284.…
$ PROX_HAWKER <dbl> 356.0483, 146.7193, 307.8475, 314.3604, 190.8302,…
$ PROX_BUSSTOP <dbl> 52.08994, 81.73559, 49.50312, 105.66510, 123.5681…
$ PROX_MRT <dbl> 818.2893, 668.5038, 889.4626, 1260.4751, 884.9691…
$ PROX_LIBRARY <dbl> 1372.5693, 959.1003, 1443.1690, 1021.9212, 1583.6…
$ PROX_PARK <dbl> 358.0379, 554.3379, 817.5539, 303.1924, 461.9375,…
$ PROX_GOODPRISCH <dbl> 5095.852, 6102.352, 5563.911, 5954.445, 5238.499,…
$ PROX_MALL <dbl> 826.9469, 829.6281, 1012.7966, 1439.9090, 877.514…
$ PROX_SPRMKT <dbl> 436.96170265, 159.23907382, 166.00546756, 341.303…
$ NUM_KINDERGARTEN <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1…
$ NUM_CHILDCARE <dbl> 2, 3, 5, 4, 6, 3, 4, 3, 3, 3, 2, 3, 3, 4, 3, 4, 4…
$ NUM_BUSSTOP <dbl> 7, 11, 8, 3, 6, 9, 3, 4, 6, 4, 6, 9, 9, 5, 4, 8, …
$ NUM_PRIMARYSCH <dbl> 2, 2, 2, 2, 3, 1, 2, 1, 3, 1, 1, 1, 2, 3, 1, 2, 1…
5 Regressions
As we will be doing prediction models, we will be exploring various regressions!
5.1 Visualising correlation matrix
Recall that we have already removed lease_commence_date earlier on to prevent perfect multi-colinearity. However, we should visualise the matrix to ensure that no variables are co-linear. Note that we have dropped resale_price, as that is the dependent variable.
<- resale_sub %>%
resale_corm st_drop_geometry() %>%
select_if(is.numeric) %>%
select(-resale_price)
::corrplot(cor(resale_corm),
corrplotdiag = FALSE,
order = "AOE",
tl.pos = "td",
tl.cex = 0.4,
method = "number",
type = "upper")
Oh my! looks like we forgot that we combined remaining_lease_year from remaining_lease_yr. This shows that we cannot be complacent. Lucky for us, we found it out before anything! Let us remove it using select().
<- resale_sub %>%
resale_sub select(-remaining_lease_yr, -remaining_lease_mth)
5.2 Splitting into training and testing data
For the purpose on this assignment, we will be splitting the data into:
Testing data, ranging from January 2023 to February 2023
Training data, from January 2021 to December 2022
<- resale_sub %>%
test_data filter(month >= "2023-01-01" & month <= "2023-02-28")
<- resale_sub %>%
train_data filter(month >= "2021-01-01" & month <= "2022-12-31")
5.3 Multi-linear regression (non-spatial)
Let us build a multi-linear regression now. The beauty of building a regression is that we can choose what to add and what to not add, as long as it is justified. We will not be adding flat model, as we narrowed the scope to 3 room flats. Geometry will not be useful as well. remaining_lease will be omitted as well, as we have remaining_lease_year.
glimpse(test_data)
Rows: 1,172
Columns: 30
$ month <date> 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, …
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO…
$ flat_type <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM",…
$ block <chr> "225", "225", "310C", "319", "319", "220", "457",…
$ street_name <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 1", "ANG MO K…
$ storey_range <chr> "04 TO 06", "07 TO 09", "25 TO 27", "04 TO 06", "…
$ floor_area_sqm <dbl> 67, 67, 70, 73, 73, 67, 89, 68, 75, 74, 75, 67, 7…
$ flat_model <chr> "New Generation", "New Generation", "Model A", "N…
$ remaining_lease <chr> "54 years 01 month", "54 years 01 month", "88 yea…
$ resale_price <dbl> 380000, 380000, 635000, 365000, 418000, 380000, 4…
$ remaining_lease_year <dbl> 54.08333, 54.08333, 88.75000, 53.33333, 53.33333,…
$ Story_Ordinal <dbl> 2, 3, 9, 2, 3, 2, 3, 2, 2, 1, 1, 2, 3, 3, 2, 4, 2…
$ geometry <POINT [m]> POINT (28537.68 38825.23), POINT (28537.68 …
$ PROX_CBD <dbl> 8914.494, 8914.494, 8511.594, 8545.724, 8545.724,…
$ PROX_COMMUNITYCLUBS <dbl> 287.0599, 287.0599, 481.9450, 720.7452, 720.7452,…
$ PROX_CHILDCARE <dbl> 1.857691e+02, 1.857691e+02, 1.151054e+02, 1.17446…
$ PROX_KINDERGARTEN <dbl> 1.951542e+02, 1.951542e+02, 3.319666e+02, 2.43231…
$ PROX_ELDERCARE <dbl> 404.35866, 404.35866, 254.95445, 58.09058, 58.090…
$ PROX_HAWKER <dbl> 137.8719, 137.8719, 382.8329, 147.7741, 147.7741,…
$ PROX_BUSSTOP <dbl> 166.43714, 166.43714, 28.02879, 180.21394, 180.21…
$ PROX_MRT <dbl> 1266.0217, 1266.0217, 731.4223, 526.3600, 526.360…
$ PROX_LIBRARY <dbl> 1162.8752, 1162.8752, 1128.3688, 1088.8972, 1088.…
$ PROX_PARK <dbl> 320.9723, 320.9723, 500.4366, 443.2952, 443.2952,…
$ PROX_GOODPRISCH <dbl> 5391.555, 5391.555, 5183.752, 5301.687, 5301.687,…
$ PROX_MALL <dbl> 1165.8961, 1165.8961, 649.6052, 470.6596, 470.659…
$ PROX_SPRMKT <dbl> 157.75712, 157.75712, 314.48012, 63.03934, 63.039…
$ NUM_KINDERGARTEN <dbl> 2, 2, 1, 2, 2, 0, 1, 2, 1, 2, 2, 0, 2, 2, 1, 1, 1…
$ NUM_CHILDCARE <dbl> 4, 4, 5, 6, 6, 1, 3, 5, 4, 6, 7, 1, 5, 5, 4, 4, 4…
$ NUM_BUSSTOP <dbl> 6, 6, 4, 4, 4, 5, 4, 10, 4, 5, 5, 4, 6, 10, 7, 5,…
$ NUM_PRIMARYSCH <dbl> 2, 2, 2, 4, 4, 1, 2, 1, 3, 3, 3, 3, 3, 1, 1, 3, 1…
<- lm(resale_price~ Story_Ordinal + remaining_lease_year + floor_area_sqm + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_PARK + PROX_LIBRARY + PROX_COMMUNITYCLUBS + PROX_BUSSTOP + PROX_MALL + PROX_SPRMKT + PROX_GOODPRISCH + PROX_MRT + NUM_CHILDCARE + NUM_KINDERGARTEN + NUM_PRIMARYSCH + month, data = train_data) multi_lr
summary(multi_lr)
Call:
lm(formula = resale_price ~ Story_Ordinal + remaining_lease_year +
floor_area_sqm + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
PROX_PARK + PROX_LIBRARY + PROX_COMMUNITYCLUBS + PROX_BUSSTOP +
PROX_MALL + PROX_SPRMKT + PROX_GOODPRISCH + PROX_MRT + NUM_CHILDCARE +
NUM_KINDERGARTEN + NUM_PRIMARYSCH + month, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-189219 -26995 -3391 21886 472311
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.209e+06 3.663e+04 -60.304 < 2e-16 ***
Story_Ordinal 9.851e+03 2.421e+02 40.697 < 2e-16 ***
remaining_lease_year 3.790e+03 3.433e+01 110.394 < 2e-16 ***
floor_area_sqm 5.094e+03 6.348e+01 80.235 < 2e-16 ***
PROX_CBD -7.016e+00 1.445e-01 -48.558 < 2e-16 ***
PROX_ELDERCARE -4.222e+00 8.173e-01 -5.166 2.42e-07 ***
PROX_HAWKER -9.331e+00 1.139e+00 -8.190 2.87e-16 ***
PROX_PARK -9.948e+00 1.248e+00 -7.973 1.68e-15 ***
PROX_LIBRARY -4.553e+00 8.818e-01 -5.163 2.46e-07 ***
PROX_COMMUNITYCLUBS -1.211e+01 1.638e+00 -7.389 1.57e-13 ***
PROX_BUSSTOP -1.187e+01 7.245e+00 -1.639 0.101
PROX_MALL -1.900e+01 1.204e+00 -15.786 < 2e-16 ***
PROX_SPRMKT 2.526e+01 2.351e+00 10.745 < 2e-16 ***
PROX_GOODPRISCH -2.147e+00 1.682e-01 -12.762 < 2e-16 ***
PROX_MRT -1.177e+01 1.169e+00 -10.063 < 2e-16 ***
NUM_CHILDCARE -1.821e+03 2.520e+02 -7.223 5.37e-13 ***
NUM_KINDERGARTEN 6.897e+03 5.481e+02 12.583 < 2e-16 ***
NUM_PRIMARYSCH -4.088e+03 3.392e+02 -12.052 < 2e-16 ***
month 1.101e+02 1.912e+00 57.586 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44900 on 12589 degrees of freedom
Multiple R-squared: 0.7351, Adjusted R-squared: 0.7348
F-statistic: 1941 on 18 and 12589 DF, p-value: < 2.2e-16
write_rds(multi_lr, "data/regmodels/multi_lr.rds")
<- read_rds("data/regmodels/multi_lr.rds")
multi_reg multi_reg
Call:
lm(formula = resale_price ~ Story_Ordinal + remaining_lease_year +
floor_area_sqm + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
PROX_PARK + PROX_LIBRARY + PROX_COMMUNITYCLUBS + PROX_BUSSTOP +
PROX_MALL + PROX_SPRMKT + PROX_GOODPRISCH + PROX_MRT + NUM_CHILDCARE +
NUM_KINDERGARTEN + NUM_PRIMARYSCH + month, data = train_data)
Coefficients:
(Intercept) Story_Ordinal remaining_lease_year
-2.209e+06 9.851e+03 3.790e+03
floor_area_sqm PROX_CBD PROX_ELDERCARE
5.094e+03 -7.016e+00 -4.222e+00
PROX_HAWKER PROX_PARK PROX_LIBRARY
-9.331e+00 -9.948e+00 -4.553e+00
PROX_COMMUNITYCLUBS PROX_BUSSTOP PROX_MALL
-1.211e+01 -1.187e+01 -1.900e+01
PROX_SPRMKT PROX_GOODPRISCH PROX_MRT
2.526e+01 -2.147e+00 -1.177e+01
NUM_CHILDCARE NUM_KINDERGARTEN NUM_PRIMARYSCH
-1.821e+03 6.897e+03 -4.088e+03
month
1.101e+02
5.4 GWR predictive method
We will convert sf into sp, so that it can be useful in predicting resale prices of 3 room HDB flats. We will convert using as_Spatial()
<- as_Spatial(train_data)
train_data_sp train_data_sp
class : SpatialPointsDataFrame
features : 12608
extent : 11597.97, 45192.3, 28097.64, 48622.47 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables : 29
names : month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, remaining_lease, resale_price, remaining_lease_year, Story_Ordinal, PROX_CBD, PROX_COMMUNITYCLUBS, PROX_CHILDCARE, ...
min values : 18628, ANG MO KIO, 3 ROOM, 1, ALJUNIED AVE 2, 01 TO 03, 51, DBSS, 43 years 01 month, 2e+05, 43.0833333333333, 1, 721.958623503081, 0.000559347187634686, 8.52942410735876e-05, ...
max values : 19327, YISHUN, 3 ROOM, 99, YUAN CHING RD, 46 TO 48, 241, Terrace, 97 years 05 months, 1268000, 97.4166666666667, 16, 19650.0691667807, 3877.80844185295, 2952.47979062617, ...
5.5 Computing adaptive bandwidth
Similarly, after computing the bandwidth, we will use write_rds() and read_rds for efficiency!
<- bw.gwr(resale_price~ Story_Ordinal + remaining_lease_year + floor_area_sqm + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_PARK + PROX_LIBRARY + PROX_COMMUNITYCLUBS + PROX_BUSSTOP + PROX_MALL + PROX_SPRMKT + PROX_GOODPRISCH + PROX_MRT + NUM_CHILDCARE + NUM_KINDERGARTEN + NUM_PRIMARYSCH + month, data = train_data_sp,
bw_adaptive approach = "CV",
kernel = "gaussian",
adaptive = TRUE,
longlat = FALSE)
write_rds(bw_adaptive, "data/regmodels/bw_adaptive.rds")
<- read_rds("data/regmodels/bw_adaptive.rds") bw_adaptive
Constructing adaptive bandwidth GWR
Similarly to above, we will use write_rds() to save, and read_rds() to read the adaptive gwr
<- gwr.basic(resale_price~ Story_Ordinal + remaining_lease_year + floor_area_sqm + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_PARK + PROX_LIBRARY + PROX_COMMUNITYCLUBS + PROX_BUSSTOP + PROX_MALL + PROX_SPRMKT + PROX_GOODPRISCH + PROX_MRT + NUM_CHILDCARE + NUM_KINDERGARTEN + NUM_PRIMARYSCH + month,
gwr_adaptive data = train_data_sp,
bw = bw_adaptive,
kernel = 'gaussian',
adaptive = TRUE,
longlat = FALSE)
write_rds(gwr_adaptive, "data/regmodels/gwr_adaptive.rds")
<- read_rds("data/regmodels/gwr_adaptive.rds") gwr_adaptive
5.6 Extracting Coordinates
With reference to the previous in class exercise, we were taught that we need to use st_coordinates to extract the x & y coordinates of the dataset. We will be using write_rds(), and read_rds() again.
<- st_coordinates(resale_sub)
coords <- st_coordinates(train_data)
coords_training <- st_coordinates(test_data) coords_testing
write_rds(coords_training, "data/regmodels/coords_training.rds" )
write_rds(coords_testing, "data/regmodels/coords_testing.rds")
<- read_rds("data/regmodels/coords_training.rds")
coords_training <- read_rds("data/regmodels/coords_testing.rds") coords_testing
5.7 Dropping geometry field
We will do so by using st_drop_geometry()
<- st_drop_geometry(train_data) train_data
write_rds(train_data, "data/regmodels/train_data.rds")
<- read_rds("data/regmodels/train_data.rds") train_data
5.8 Random forest model
We use set.seed() to ensure that the same value gets generated in our random forest model. Not so random now, is it?
set.seed(1234)
<- ranger(resale_price~ Story_Ordinal + remaining_lease_year + floor_area_sqm + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_PARK + PROX_LIBRARY + PROX_COMMUNITYCLUBS + PROX_BUSSTOP + PROX_MALL + PROX_SPRMKT + PROX_GOODPRISCH + PROX_MRT + NUM_CHILDCARE + NUM_KINDERGARTEN + NUM_PRIMARYSCH + month, data = train_data) forest
write_rds(forest, "data/regmodels/forest.rds")
<- read_rds("data/regmodels/forest.rds") forest
print(forest)
Ranger result
Call:
ranger(resale_price ~ Story_Ordinal + remaining_lease_year + floor_area_sqm + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_PARK + PROX_LIBRARY + PROX_COMMUNITYCLUBS + PROX_BUSSTOP + PROX_MALL + PROX_SPRMKT + PROX_GOODPRISCH + PROX_MRT + NUM_CHILDCARE + NUM_KINDERGARTEN + NUM_PRIMARYSCH + month, data = train_data)
Type: Regression
Number of trees: 500
Sample size: 12608
Number of independent variables: 18
Mtry: 4
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 511184749
R squared (OOB): 0.9327335
5.9 Calibrating geographic random forest using GRF
For this assignment, will be using 30 trees!
Adaptive bandwidth
<- grf.bw(formula = resale_price~ Story_Ordinal + remaining_lease_year + floor_area_sqm + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_PARK + PROX_LIBRARY + PROX_COMMUNITYCLUBS + PROX_BUSSTOP + PROX_MALL + PROX_SPRMKT + PROX_GOODPRISCH + PROX_MRT + NUM_CHILDCARE + NUM_KINDERGARTEN + NUM_PRIMARYSCH + month,
bw_grf_adaptive dataset = train_data,
kernel = "adaptive",
coords = coords_training,
trees = 30)
write_rds(bw_grf_adaptive, "data/regmodels/bw_grf_adaptive.rds")
We will be assigning 638, which is the highest R2 score to bw_grf_adaptive before running our grf
<- 638 bw_grf_adaptive
5.10 Calibrating Geographical Random Forest Model
We will be using grf() to calibrate a model to predict HDB resale price.
write_rds(gwRF_adaptive, "data/regmodels/gwRF_adaptive.rds")
<- read_rds("data/regmodels/gwRF_adaptive.rds") gwRF_adaptive
Here are the snippets of the output gwRF for better visualisation!
6 Predictions
Since we are done with the training data, this section will deal with testing data for our prediction
6.1 Dropping geometry for test data
<- cbind(test_data, coords_testing) %>%
test_data st_drop_geometry()
write_rds(test_data, "data/regmodels/test_data.rds")
<- read_rds("data/regmodels/test_data.rds") test_data
6.2 Predicting with test data
We will use predict.grf() to predict the resale value by using the test data and gwRF_adaptive model calibrated above!
<- predict.grf(gwRF_adaptive,
gwRF_pred
test_data, x.var.name = "X",
y.var.name = "Y",
local.w = 1,
global.w = 0)
<- write_rds(gwRF_pred, "data/regmodels/gwRF_pred.rds") GRF_pred
6.3 Conversion of predicting output into data frame
We will convert the output using as.data.frame()
<- read_rds("data/regmodels/gwRF_pred.rds")
GRF_pred <- as.data.frame(GRF_pred) GRF_pred_df
Next, we will be binding the predicted output into the test data using cbind()
<- cbind(test_data, GRF_pred_df) test_data_p
write_rds(test_data_p, "data/regmodels/test_data_p.rds")
<- read_rds("data/regmodels/test_data_p.rds") test_data_p
6.4 Calculating root mean square error
rsme() will help us measure how far the predicted values are from the observed values in a regression analysis
rmse(test_data_p$resale_price,
$GRF_pred) test_data_p
[1] 26458.07
6.5 Visualising the predicted values
Let us visualise it on a scatterplot, where we plot actual resale price against predicted resale price
ggplot(data = test_data_p,
aes(x = GRF_pred,
y = resale_price)) +
geom_point()
7 Conclusion
In summary, while both OLS and GWR are regression techniques used to analyze the relationship between variables, GWR is better suited for analyzing spatially varying relationships, while OLS is better suited for analyzing relationships that are constant across space.
7.1 OLS vs Geographical weighted methods
The main difference between OLS and GWR is that OLS produces a single set of coefficients that are applicable across the entire data set, while GWR produces a set of coefficients that vary across space, which also means that the strength and direction of the relationship between the dependent variable and independent variables can change depending on the location of the observation, which can be more useful in understanding resale prices of HDB, especially if we are using the model for prediction! Let us prove this below.
<- predict(multi_reg, test_data)
multi_reg_df <- cbind(test_data, multi_reg_df)
multi_reg_test
rmse(multi_reg_test$resale_price,
$multi_reg_df) multi_reg_test
[1] 45573.09
ggplot(data = multi_reg_test,
aes(x = multi_reg_df,
y = resale_price)) +
geom_point() +
geom_abline(col = "Red")
rmse(test_data_p$resale_price,
$GRF_pred) test_data_p
[1] 26458.07
ggplot(data = test_data_p,
aes(x = GRF_pred,
y = resale_price)) +
geom_point() +
geom_abline(col = "Red")
From the scatter plots and the mean square error between both models, we can tell that the OSL aims to minimise residual sum of squares, as the regression line is fitted in that manner. Geographical Random Forest Model on the other hand, aims to reduce mean squared error, which is further supported by the value as computed above.
Hence, we therefore conclude that the Geographical Random Forest Model is better as a basis of predictive model, as compared to Non-spatial OLS.
7.2 Possible Improvements
If we recall, we ignored the variable flat_type as we assumed it to be 3 room throughout. In real life scenarios, that is not the case as different flat types do in fact have different default layout, which may affect the resale prices of the house. This could be done by potentially creating a binary variable with each of the different variables in flat type.
It was also obvious that there were outliers on the scatterplot. These outliers are the real life scenarios of HDB resale prices appearing out in the news where people are selling and buying houses for 1 million dollars. But hey, what can I say? if there is a demand, there will definitely be a supply. Removing those outlier can certainly help the overall model to improve.
Lastly, with a stronger device, we could potentially create more trees generated in the gwRF model!
8 References
I am super grateful for all of the resources that were made available to me, especially from past seniors and prof kam’s in-class resources.
A huge thanks to our favourite senior, Megan, for her work which helped many of us during the initial data wrangling and outsourcing stage!
I also referred to another senior’s work, Aisyah, where i found inspiration for creating the remaining_lease_year column!
And of course, most importantly to Prof Kam’s in-class exercise 9 for the second half of this take home assignment!