nycRealEstateR Hello World

Introduction

The programming universe spawned one of my favorite concepts called Hello World. For those unfamiliar, in many programming language the first concept one is encouraged to learn is how to write a function that prints the phrase Hello World.

Over time it has grown into as much a metaphor for an encompassing yet easy to follow introduction to something. So with this post I intend to create a Hello World that showcases some of the basic functionality of my nycRealEstateR package. It is my goal that everyone taking the time to read this will understand some of what this package does and why I built it.

One extremely important caveat I want to emphasize upfront is that unlike most of my packages, this package is not public and I will not make it public. I do however selectively work with industry friends to make use of this package and continue to be open to working with new relationships under the right circumstances. I am also open to sharing the codebase in extremely rare cases or if there is a compelling feature integration someone is interested in exploring. In either case feel free to email me.

Why nycRealEstateR??

I spent the last year building this package because there is immense value in Tidy Data. Real estate, regardless of the market, is a notriously untransparent universe. Many of the actors in the industry benefit from this inefficency, but to me it just pisses me off and makes my life more difficult. As someone who loves data driven analysis and loathes snake-oil salesman esque practices I want code that gives me what I want to make decisions or have fun explorations.

As a general rule of thumb, putting aside legality and whole host of other things, if data lives on the web, regardless of the form, it can be turned into an API of sorts by a person with enough techinical proficency and an idea of how to connect the data together. That is essentially what I have done with this package.

I wanted a package where in one place I could access, link, explore and visualize the data across the spectrum. This could include everything from the entire New York City zoning data-store to the abiity to read any document recorded against a specific lot and block, to being able to monitor condo offerings, and even keeping tabs on my favorite creature in the real estate universe, the real estate broker.

I also wanted to make the data friendly to outside APIs, everything from Gooogle Street View to some of my other R packages like fundManageR. Finally and most importantly, I wanted to make sure the package had easy to use verbs that a person not programming-language proficent could understand and use when I am providing access to this package.

After alot of work and digging for every possible New York real estate data focused silo I am happy to write this to demonstrate what this package can do!

Package Functionality

Before getting into the Hello World portion I want to spend a moment showcasing what the functions contained in the package.

## Load packages
library(dplyr)
library(nycRealEstateR)
library(purrr)
library(highcharter)

Now I can list the 35 functions warehoused inside of nycRealEstateR

lsf.str("package:nycRealEstateR") %>% as.character()
##  [1] "drop_na_columns"                             
##  [2] "generate_date_df"                            
##  [3] "get_acris_bbl_url"                           
##  [4] "get_addresses_bbls"                          
##  [5] "get_bbl_address"                             
##  [6] "get_class_df"                                
##  [7] "get_data_acris_bbl"                          
##  [8] "get_data_acris_boroughs_docs"                
##  [9] "get_data_acris_df"                           
## [10] "get_data_acris_parties"                      
## [11] "get_data_acris_people"                       
## [12] "get_data_acris_urls"                         
## [13] "get_data_addresses_bbls"                     
## [14] "get_data_bbl_condo_declaration"              
## [15] "get_data_bbls_address"                       
## [16] "get_data_new_york_brokers"                   
## [17] "get_data_ny_ag_ids_addresses"                
## [18] "get_data_ny_ag_plan_ids"                     
## [19] "get_data_ny_ag_sponsors"                     
## [20] "get_data_nyc_boroughs_sales"                 
## [21] "get_data_nyc_property_file"                  
## [22] "get_data_nyc_tax_urls"                       
## [23] "get_data_nyc_urls"                           
## [24] "get_data_nyc_years_tax_lots"                 
## [25] "get_data_oasis_addresses"                    
## [26] "get_data_ucc_entities"                       
## [27] "get_data_ucc_people"                         
## [28] "get_dictionary_boroughs"                     
## [29] "get_dictionary_document_class"               
## [30] "get_dictionary_letter_class"                 
## [31] "get_dictionary_nyc_building_class_name"      
## [32] "import_pluto"                                
## [33] "import_sales"                                
## [34] "import_tax_lots"                             
## [35] "ocr_acris_documents"                         
## [36] "ocr_pdf_file"                                
## [37] "ocr_ucc_data"                                
## [38] "parse_addresses"                             
## [39] "parse_document_detail_data"                  
## [40] "pdf_acris_urls"                              
## [41] "read_rda_file"                               
## [42] "resolve_acris_parties"                       
## [43] "resolve_data_parties"                        
## [44] "save_ocr_documents"                          
## [45] "tidy_column_formats"                         
## [46] "tidy_column_relations"                       
## [47] "visualize_dictionary_nyc_building_class_name"
## [48] "visualize_network"

What I hope becomes apparent is the repeated use of certain words.

get_data: Retrieval of live data from the specified source
get_dictionary: Codebook for specified source
import: Cached version of specified data-store, saves computing time and bandwith
ocr: Peforms Optical character recognition on specified file

The remaining functions perform a variety of data cleaning, visualization, and import-export tasks.

I have found when creating packages it is useful to come up with a logical naming system for all the tasks the package/module perform. It helps if they are easy to remember and consistant in form.

Now let me get into what the package can do.

I will start with my favorite people, real estate brokers.

Easy Exploration of Real Estate Brokers

Whether you like them or not, real estate brokers play a significant role in the real estate universe. Every week I am inundated with emails from brokers selling properties, shopping for debt and searching for equity.

Early in my real estate private equity career I discovered a select few brokers at each shop control most of the dealflow and relationships. I wanted to quantify this and be able to monitor this power structure.

I was able to execute on this plan by effectively turning the New York State Occupational Licensing Management System into an API. In doing so, I accidently also created a mechanism that allows me to monitor every single brokerage firm operating in the state of New York and a littany of information about the brokers that work under that license.

Here is how it works.

Lets explore the one of the United States’s most prominent brokerage firms Eastdil and Lazard, a well-known investment bank that has a licensed commercial real estate brokrage arm.

df_broker_search <- 
  get_data_new_york_brokers(
    search_names = c("Lazard", "Eastdil"),
    parse_brokers = T,
    visual_network = NULL
  )

Here is the structure of the data:

df_broker_search %>% 
  glimpse()
## Observations: 43
## Variables: 22
## $ searchBroker              <chr> "Eastdil", "Eastdil", "Lazard", "Laz...
## $ typeLicenseBroker         <chr> "LIMITED LIABILITY COMPANY BROKER", ...
## $ namePerson                <chr> "MICHAEL COCHRAN", "BENJAMIN LAMBERT...
## $ namePersonFull            <chr> "MICHAEL D COCHRAN", "BENJAMIN V LAM...
## $ nameFirst                 <chr> "MICHAEL", "BENJAMIN", "MATTHEW", "M...
## $ nameMiddle                <chr> "D", "V", "J", "W", "P", "M", "S", "...
## $ nameLast                  <chr> "COCHRAN", "LAMBERT", "LUSTIG", "OGR...
## $ nameBrokerEntity          <chr> "EASTDIL SECURED LLC", "EASTDIL SECU...
## $ typeLicenseBrokerEntity   <chr> "PRINCIPAL OFFICE", "PRINCIPAL OFFIC...
## $ dateLicenseExpiry         <date> 2019-04-02, 2019-04-02, 2018-12-02,...
## $ codeLicenseBroker         <chr> "LL COMP BROKER", "LL COMP BROKER", ...
## $ addressStreetEntity       <chr> "40 W 57TH ST 22ND FL", "40 W 57TH S...
## $ cityStateEntity           <chr> "NEW YORK NY", "NEW YORK NY", "NEW Y...
## $ countryCodeEntity         <chr> "US", "US", "US", "US", "US", "US", ...
## $ stateEntity               <chr> "NEW YORK", "NEW YORK", "NEW YORK", ...
## $ zipEntity                 <chr> "10019", "10019", "10020", "10020", ...
## $ urlBrokers                <chr> "https://appext20.dos.ny.gov/nydos/d...
## $ countEntity               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ idLicenseBrokerEntity     <dbl> 109924135, 109924135, 10991211855, 1...
## $ statusLicenseBrokerEntity <chr> "CURRENT", "CURRENT", "CURRENT", "CU...
## $ codeLicenseBrokerEntity   <chr> "PRINC OFF", "PRINC OFF", "PRINC OFF...
## $ phoneEntity               <chr> "2123157200", "2123157200", NA, NA, ...

What we have is a spreadsheet of sorts containing all the information about Eastdil, Lazard, and the brokers that work for them. The information includes valuable information like broker license, broker license type, date of license expiry and much more.

Most importantly the data is in a tidy format which makes it easy to perform further analysis, create data visualizations, and tie the information into other pipelines.

Let me demonstrate this by creating a a heirarchical graph showing all the brokers at Eastdil and Lazard by license type. The levels of the license roughly correspond with the power structure at each firm.

df_broker_search %>%
  mutate(typeLicenseBroker = factor(
    typeLicenseBroker,
    levels = c(
      "LIMITED LIABILITY COMPANY BROKER",
      "ASSOCIATE BROKER",
      "SALESPERSON"
    )
  )) %>%
  visualize_network(
    nodes = c('nameBrokerEntity', 'typeLicenseBroker', 'namePerson'),
    radial_title = "Eastdil & Lazard Brokers",
    visual_network = "dendrogram"
  )

This sort of analysis can be reproduced for any of the tens of thousands of licensed brokerage firms in New York and will self update as firms and brokers get licensed in the state.

In a bit I will come back to a portion of this data to demonstrate how having tidy data makes it easy to link data across tasks.

Now I am going to move on to the New York City specific data

Primary Land Use Tax Lot Output [PLUTO]

For those that aren’t familiar with PLUTO I best equate it to the list of franchises in a professional sports league. Instead of it being a sports team it is a unique plot of land and the structure that may or may not be built on top of it.

PLUTO returns stat sheet containing all the numeric and categoric data for every plot of land in New York City. The data is extremely robust, you can spend a few moments looking through the information it contains here.

The data is updated annually and changes as land use changes. The data can is also meant to be linked with other data stores including ACRIS, the Department of Buildings, and The Department of Tax and Finance. When you do this you are truly able to get a clear view of the characteristics and history of every piece of land in New York.

One important thing to remember is that there is a one-to-many relationship between a unique plot of land and the number of tax IDs it contains. An easy to understand example is a condominium building. One condominum building can contain many units, each with their own tax-lots. PLUTO will not give you information about the tax lots in the building but nycRealEstateR has another method to allow you to get that information.

Now that I have explained PLUTO lets actually explore some if it.

Acquiring PLUTO Data

The package contains 2 methods for importing PLUTO.

get_data_nyc_property_file(
  id_data_set = "PLUTO",
  filter_na_address = TRUE,
  return_message = FALSE
  )

This method goes onto the web, downloads, cleans, and parses PLUTO. This is is timeconsuming and bandwith consuming but is worth doing if you don’t yet have the data stored locally or if the data was recently updated.

You can also import a cached version of this data which I have stored in the cloud. I am going to us this method to demonstrate some of the interesting things one can do with this data.

df_pluto <- 
  import_pluto(only_key_data = F)

Lets take a look at the structure of the data.

df_pluto %>% 
  glimpse()

Now lets take a look at the top 5 buildings by built size by borough and parent class letter

df_pluto %>%
  group_by(codeBorough, letterClass) %>%
  arrange(desc(areaBuildingSF)) %>%
  slice(1:5) %>%
  ungroup() %>%
  DT::datatable(
  options = list(pageLength = 5,
                 lengthMenu = c(5, 10, 15, 20),
                 scrollX = TRUE),
  class = "display",
  caption = NULL,
  filter = c("none", "bottom",
             "top"),
  escape = TRUE,
  style = "default",
  width = NULL,
  height = NULL,
  elementId = NULL,
  fillContainer = getOption("DT.fillContainer", NULL),
  autoHideNavigation = getOption("DT.autoHideNavigation",
                                 NULL),
  selection = c("multiple", "single", "none"),
  plugins = NULL)

As you can see there are 99 different variables covering things like land use, current building-area, allowable buildable area. We now also know that there are 858,396 unique plots of land throughout the 5 boroughs of New York.

Lets take this a bit further and perform some Exploratory Data Analysis

Distribution by Borough

Lets take a look at how some of this information distributes by borough.

df_buildings_borough <-
  df_pluto %>%
  group_by(codeBorough) %>%
  summarise(countBuildings = n(),
            amountSFBuilt = sum(areaBuildingSF),
            areaSFBuildable = sum(areaAllowableSF),
            areaLand = sum(areaLotSF)) %>%
  mutate(ratioFAR = amountSFBuilt / areaLand,
         ratioFARBuildable = areaSFBuildable / areaLand) %>%
  mutate_if(is.numeric, funs(. %>% formattable::comma(digits = 0))) %>%
  arrange(desc(countBuildings))

df_buildings_borough %>% DT::datatable()

Number of buildings

df_buildings_borough %>%
  asbmisc::plot_hc_xy(
    type = "column",
    category = "codeBorough",
    y_var = "countBuildings",
    title = "New York City Buildings by Borough"
  )

Amount of Built SF

df_buildings_borough %>% 
  mutate(amountSFBuilt = amountSFBuilt/1000000) %>% 
  plot_hc_xy(type = "waterfall", category = "codeBorough", y_var = "amountSFBuilt",
             title = "New York City Built Square Footage by Borough") %>% 
  hc_yAxis(
    tickAmount = 8,
    lineColor = "transparent",
    minorGridLineWidth = 0,
    gridLineColor = "transparent",
    labels = list(format = "{value}M SF"))

FAR Distribuion

df_buildings_borough %>%
  arrange((ratioFAR)) %>%
  plot_hc_xy(
    type = "columnrange" ,
    category = "codeBorough",
    low_var = "ratioFAR",
    high_var = "ratioFARBuildable",
    title = "FAR Built vs Buildable by Borough"
  ) %>%
  hc_yAxis(
    tickAmount = 8,
    lineColor = "transparent",
    minorGridLineWidth = 0,
    gridLineColor = "transparent",
    labels = list(format = "{value} FAR"),
    title = "FAR"
  )
## Joining, by = "idRow"

Trelliscop

Department of Finance

Accessing All BBLs

ACRIS

OCRing Documents

New York AG

Projects by Entity

Projects by Address

Deep Diving a Project

New York UCC

By Entity

OCR