< Find Me a Venture Capitalist



Introduction

This tutorial demonstrates how I am utilizing fundManageR, reproducible research, data visualization and data science to advise two of my portfolio companies who are preparing for an institutional fund-raising round. This exercise will focus on exploring and analyzing venture capitalists.

But Before We Get Started

Accurate, clean, and tidy data is an incredible resource, but it is only a tool.

Success is about a confluence of factors including, but not limited to, luck, execution, perserverence, communication, and a clear, but flexible, understanding of the end goal. Data, analytical skills, programming abilities, and machine learning proficiency are high powered weapons but in isolation they won’t get you very far; they are only tools.

Now, let’s get started

The Mission: Identify and Target Venture Capital Firms Headquartered in a Specified Area

Today’s mission is to find venture capitalists head-quartered in Maryland, Washington D.C., Virginia, New York, Massachusetts, and Florida. We are limiting it to these locations to focus on areas that me, and the portfolio companies have strong ties in.

The workflow is straight forward:

  • Access the most recent monthly ADV Manager summary data
  • Exclude managers outside of our specified locations
  • Exclude all non venture exempt managers
  • Access detailed manager data by piping the CRDs through the get_data_adv_managers_filings function
  • Explore the data
  • Draw some conclusions.

Important Caveat

This is a crude hack for isolating venture capitalists. A pure play venture firm, in many circumstances, utilizes a special exemption allowing for venture managers to avoid many of the standard reporting requirements mandated on SEC regulated investment managers. The SEC captures information and discloses it in its standard monthly reporting data via a logical field that discloses whether a manager has utilized the venture exemption.

It is extremely important to note that this method limits us to only venture managers utilizing this exemption. This method won’t help us find managers who have venture as one of their product lines, like say Texas Pacific Group, or a venture manager that is registered but hasn’t qualified or applied for this exemption like 137 Ventures.

Context and common sense is as important as a powerful data driven work-flow. This tutorial is meant to empower but not mislead you into thinking the results produced are all-encompassing, because they are not. What tutorial will do is provide you with a crude method to acquire data that we know meet the venture criteria we defined.

Step 1 - Import Most Recent Monthly Summary ADV Data

packages <-
  c('fundManageR', 'dplyr', 'magrittr', 'tidyr', 'formattable', 'tibble') # packages we need
lapply(packages, library, character.only = T)

all_managers_recent <- 
  get_data_adv_managers_periods_summaries(all_periods = F, only_most_recent = T, is_exempt = c(F, T), file_directory = NULL, folder_name = NULL)

We now have the data for the 15,461 SEC regulated managers registered as of October 3rd, 2016 which we can pare down according to our search criteria.

Step 2 - Find Exempt Venture Managers in Our Target Locations

Using hasExemptionAsSolelyVentureAdviser and stateOfficePrimary fields we can quickly filter down our data.

locations <- 
  c("DC", "MA", "VA", "FL", "NY", "MD", "NY")
vcs_in_location_df <- 
  all_managers_recent %>%
  dplyr::filter(stateOfficePrimary %in% locations, # comma acts as AND in dplyr
  hasExemptionAsSolelyVentureAdviser == T)

The filtered data includes the 218 registered managers that meet our search criteria. Soon we will take this and use the get_data_adv_managers_filings function to acquire all the disclosed information about each of these managers.

Before doing that I want to highlight something you notice while exploring the vcs_in_location_df data. Some of the most pertinent fields containing information about a manager’s assets under management, employee counts, client counts, fee structures are NA.

The venture exemption allows these managers to avoid the periodic reporting of this information, something non-exempt managers are forced to do. The good news is this data does exist and fundManageR will get it for you, the downside is the data is reported annually and is buried deep in the messy html of the manager’s accepted ADV form.

Step 3 - Acquire Detailed Manager Data

The get_data_adv_managers_filings function requires either free text search names or numeric CRDs to acquire the ADV filing data. Search names may produce less accurate results, and often return managers whose information we did not want.

CRDs are a unique identifier, the primary key of the SEC’s IAPD database for an entity, and if we have a firm’s CR it will always produce accurate results. Our prior work produced each manager’s CRD which we will use to power our next function call.

Please Be Patient

This search will take a while given we are parsing 218 managers so be patient. This function will process a few gigabytes of data to produce our final results. You can track the progress with the message output but I’d recommend doing something else for about an hour or so while this runs.

search_crds <- 
  vcs_in_location_df$idCRD # create vector of CRDs
all_manager_data <-
  get_data_adv_managers_filings(
    crd_ids = search_crds,
    all_sections = T,
    search_names = NULL,
    assign_to_enviornment = T
  )

Step 4 - Explore the Data

Now that we’ve acquired the data lets explore it and answer a few questions.

  • What are each manager’s reported total assets under management?
  • What is the distribution of assets under management by state?
  • How many fund’s does each manager control?
  • What conclusions can we draw from the numeric data about the managers?

Calculate Assets Under Management by Manager

The first thing we want to do is calculate the summary information we’d normally find in the monthly ADV data for these venture exempt managers.

We can do this using the data in the section7BPrivateFundReporting table. We will produce the desired information by grouping the information by CRD and manager name, and summarizing the total assets under management, count of investors and count of fund vehicles.

Finally we will join that data with descriptive information about the manager’s state headquarters and website.

fund_manager_totals <- 
  section7BPrivateFundReporting %>% 
  group_by(nameEntityManager, idCRD) %>%  # set group
  summarise(amountAUMTotal = sum(amountFundGrossAUM, na.rm = T),
         countVehicles = n(),
         countInvestors = sum(countFundOwners, na.rm = T)) %>% 
  left_join(
    vcs_in_location_df %>%  # join the office state
      select(idCRD, stateOfficePrimary)
  ) %>% 
  left_join(
    managerWebsite %>% # join the manager website
      dplyr::select(idCRD, urlManager)
  ) %>% 
  select(nameEntityManager, idCRD, urlManager, stateOfficePrimary, everything()) %>% 
  ungroup() %>% # ungroup
  arrange(desc(amountAUMTotal))
fund_manager_totals

Venture Manager Breakdown by Location

Next we will use the data to paint a picture of venture landscape by location. To make this information visual and explorable I will show you how to use an experimental feature of the formattable package which embeds htmlwidgets into a table.

In order for this code to work you need to have the packages listed below loaded, you can follow the instructions in the code comments to install them if you have yet to do so. The experimental formattable functions have yet to be integrated into the package so the functions must be sourced from my github gist.

library(sparkline) #devtools::install_github("htmlwidgets/sparkline")
library(scatterD3) # devtools::install_github("juba/scatterD3")
library(htmlwidgets) # devtools::install_github("ramnathv/htmlwidgets")
source("https://gist.githubusercontent.com/abresler/8080a046794f97b8e3174b871b8429c0/raw/06f9c66552775facda7aeb7e6780ed74e9fda1c3/htmlwidget_tests.r")

Now we can create the table.

The first step is to use dplyr to create the summary data. We do this by grouping by state, removing the CRD ID, and using the summarise_if function to quickly calculate the sum of all the remaining numeric columns.

Next, we will shorten the column names to make them easier to print. We are left with the core summary data by state but still need to augment it by joining additional information and our visualizations. We don’t have the fund manager count by state so we can derive that and join it in with our existing data. Finally we will join our visualization columns, scatter plots of assets under management by fund vehicles and assets under management by investor count.

ft_lm <-
  fund_manager_totals %>%
  dplyr::select(-idCRD) %>%
  group_by(state = stateOfficePrimary) %>%
  summarise_if(.predicate = is.numeric,
               funs(. %>% sum)) %>% 
  rename(vehicles = countVehicles, investors = countInvestors, AUM = amountAUMTotal) %>%
  left_join(
    fund_manager_totals %>%
      dplyr::select(-idCRD) %>%
      group_by(state = stateOfficePrimary) %>%
      summarise(managers = n())
  ) %>%
  left_join(
    fund_manager_totals %>%
      dplyr::select(-idCRD) %>%
      group_by(state = stateOfficePrimary) %>%
      summarize(
        aumByFundCount = scatter_cell(
          countVehicles,
          amountAUMTotal / 1000000, ## divide by 1M
          xlab = "Fund Vehicles",
          ylab = "AUM Millions",
          tooltip_text = paste0(
            nameEntityManager, # Add the tool tip
            "<br><strong>Amount AUM: </strong>",
            amountAUMTotal,
            "<br><strong>Fund Vehicles: </strong>",
            countVehicles
          ),
          point_size = 20, # modify point size
          hover_size = 6, # change hover size
          ellipses = T, # add grouping ellipse
          hover_opacity = 1, # change hover opacity
          height = 200, # modify the height
          width = 220 # modify the width
        ),
        aumByInvestorCount = scatter_cell(
          countInvestors,
          amountAUMTotal / 1000000,
          tooltip_text = paste0(
            nameEntityManager,
            "<br><strong>Amount AUM:</strong>",
            amountAUMTotal,
            "<br><strong>Investor Total: </strong>",
            countInvestors
          ),
          xlab = "Fund Investors",
          ylab = "AUM Millions",
          point_opacity = 0.5,
          hover_size = 4,
          hover_opacity = 1,
          point_size = 20,
          ellipses = T,
          colors = "#A94175",
          height = 200,
          width = 270
        )
      )
  ) %>%
  formattable() %>%
  as.htmlwidget() %>% 
  suppressMessages()
ft_lm$dependencies <- c(ft_lm$dependencies, # setup the htmlwidget
                        htmlwidgets:::widget_dependencies("scatterD3", "scatterD3"))

Fund Manager Exploration

Next we will try to better understand our list of managers. We also want to make it easy to navigate to the manager’s webpage which we can do by taking our data from step 1 and layering in some html. Finally, we want to add a visual sparkline showcasing a manager’s assets under management by fund, which will give us a good visual cue on the AUM variance between funds, number of funds and overall fund sizes.

manager_df <-
  fund_manager_totals %>%
  ungroup %>%
  dplyr::select(-idCRD) %>%
  mutate(urlManager = ifelse(
    urlManager %>% is.na(),
    NA,
    list(
      "<a href ='",
      urlManager, # create a clickable url that opens a new window
      "' target ='_blank'>",
      urlManager,
      "</a>"
    ) %>% invoke(paste0, .)
  )) %>%
  dplyr::rename(
    url = urlManager, # change naming to make the table fit
    state = stateOfficePrimary,
    investors = countInvestors,
    vehicles = countVehicles,
    AUM = amountAUMTotal
  )

manager_ft <-
  formattableWithSL(
    TableDF = manager_df,
    rawDF =
      section7BPrivateFundReporting %>% 
      select(nameEntityManager, fundAUM = amountFundGrossAUM) %>% 
      data.frame(), # requires data.frame unforutantely 
    key = "nameEntityManager", # key for spark line
    SLfield = "fundAUM" # variable visualized in spark line
  ) %>% 
  suppressMessages()
manager_ft

A Data Driven View of the Manager Landscape

In this final step we will utilize some important numeric descriptors to draw some conclusions about the landscape of our managers. To achieve this we will use a statistical analysis tool known as Principal Components Analysis[PCA].

PCA is a powerful way to take numeric data and break it down into a set of components that explain the variance of the data. The algorithm returns a set scores for each the variables. These scores include a list of components the length of the number of variables that fully explain the variance. We will isolate, explore and visualize the first two principal components.

We have assembled some use numeric data but there are a couple of we can join in to enhance this analysis. In addition to the assets under management, vehicle count, and investor count we want to include information that proxies a firm’s operational size, something we can get by including a count of their offices. We also want to include information a manager’s average fund profile, something we can do by joining in information about a manager’s weighted average investment minimum and percentage breakdown of various fund characteristics {percent of international investors, percent from fund of funds, percent from feeder fund, etc..}. Finally we want information that profiles a manager’s depth of relationships with services providers. We can achieve this by joining in a count of the service providers by service also disclosed in section7BPrivateFundReporting.

Once completed, we will take the data, isolate the the numeric fields, scale to mean zero and use prcomp function to apply the principal components algorithm. We will then explore the loading scores, visualize the result first two components and draw some conclusions from the data.

Assemble Data for PCA

fund_totals <- 
  section7BPrivateFundReporting %>%
  group_by(idCRD, nameEntityManager) %>%
  summarise(
    countFunds = max(numberFund),
    amountTotalAUM = sum(amountFundGrossAUM, na.rm = T),
    countTotalFundOwners = sum(countFundOwners, na.rm = T)
  ) %>%
  ungroup
service_provider_df <- 
  section7BPrivateFundReporting %>% 
  select(idCRD, matches("^name[A-Z]")) %>% 
  select(-c(nameFund, nameFundGPManagerTrusteeDirector, nameFundFeederGPManagerTrusteeDirector)) %>% 
  select(-matches("nameFundFeederGPManagerTrusteeDirector|nameCustodianBusiness|EntityManager")) %>% 
  gather(item, value, -c(idCRD), na.rm = T) %>% 
  mutate(item = item %>% str_replace_all("[0-9]", '')) %>% 
  distinct() %>% 
  group_by(idCRD, item) %>% 
  summarise(count = n()) %>% 
  mutate(item = item %>% str_replace_all('name', 'countFund')) %>%
  spread(item, count, fill = 0)
weighted_fund_avg_df <- 
  section7BPrivateFundReporting %>%
  group_by(idCRD) %>%
  summarise_at(
    c(
      "amountFundMinimumInvestment",
      "pctFundManagerOwned",
      "pctFundFundOfFunds",
      "pctFundNonUSCitizens",
      "pctClientsFundInvestor"
    ),
    funs(weighted.mean(. , amountFundGrossAUM, na.rm = T))
  ) %>% 
  ungroup
office_count_df <- 
  managerOtherOfficeLocations %>% 
  group_by(idCRD) %>% 
  summarise(countOffices = max(countItem)) %>% 
  ungroup()
all_data <- 
  fund_totals %>% 
  left_join(
    vcs_in_location_df %>% select(idCRD, stateOfficePrimary)
  ) %>% 
  left_join(weighted_fund_avg_df) %>% 
  left_join(office_count_df) %>%
  left_join(service_provider_df) %>% 
  suppressMessages()

Scale the Data

set.seed(123) # make it reproducible
scaled_numeric_data <- 
  all_data %>%
  purrr::keep(is.numeric) %>% 
  mutate_all(as.numeric) %>% 
  gather(item, value, -idCRD) %>% 
  spread(item, value, fill = 0) %>% 
  dplyr::select(-idCRD) %>% 
  scale()
rownames(scaled_numeric_data) <-
  all_data$nameEntityManager
scaled_numeric_data %>% data.frame %>% rownames_to_column(var = 'nameEntityManager') %>% as_data_frame()

Perform the PCA

pc_funds <- 
  scaled_numeric_data %>% 
  prcomp()

Explore the Loadings

Now we can explore the results. If you aren’t familiar with the methodology behind PCA loadings I recommend reading this. Try to spend a few minutes exploring each of the 15 components and their variable loadings and think about what the scores may indicate. Try to focus on the top 5 components as they are the influential to explaining the variance of our data.

pc_funds$rotation %>% 
  data.frame() %>% 
  tibble::rownames_to_column(var = 'nameVariable') %>% 
  as_data_frame() %>% 
  arrange(PC1, PC2)

Visualize the Results

Now we will take the results, tidy them with broom package, and build an interactive plot of the top two principal components.

To create the data visualization we will first build a static ggplot2 scatter plot, customize the format, and then feed the static plot into plotly to produce an interactive visualization.

Note, in order to exactly replicate this visualization you must have the Myraid Pro font installed on your system which you can download from here.

If the code does not work I recommend removing code contained in hrbrmisc::theme_hrbrmstr_my() +. Finally please make sure you have the other required packages installed and loaded.

library(broom) # install.packages("broom")
library(ggplot2) # devtools::install_github("hadley/ggplot2") # requires dev version
library(plotly) # devtools::install_github("ropensci/plotly") requires dev version
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")
pca_plot <- 
  augment(pc_funds, data = all_data) %>% # Tidy the results
  ggplot(aes(.fittedPC1, .fittedPC2, colour = stateOfficePrimary, text = nameEntityManager)) +
  geom_jitter(size = .55) + # jitter the results
  hrbrmisc::theme_hrbrmstr_my() + # require myriad pro font installed on your computer as well, remove this code if it doesn't work
  scale_x_continuous(name = "PC1 - AUM / Investor, Vehicle Count", breaks = seq(-13.5, 2.5, length.out = 8)) +
  scale_y_continuous(name = "PC2 - Feeder / Fund of Fund Reliance, Service Provider Complexity", 
                   breaks = seq(-7, 3, length.out = 8)) + 
  geom_hline(yintercept = 0, color = "gray70", size = .15) + # add horizontal line
  geom_vline(xintercept = 0, color = "gray70", size = .15) + ## add vertical line
  theme(panel.grid.major.x = element_blank(),panel.grid.major.y = element_blank()) + # remove the lines
  ggtitle("Venture Manager PCA Analysis")
pca_gg <-
  pca_plot %>% ggplotly()
pca_gg

Interpretting the Results

This visualization may like a standard scatter plot but it is not. The X and Y axis are unrelated and we aren’t looking to the x-axis to explain the y-axis. Each axis represents a distribution of component scores.

The x-axis represents the first principal component which, according to my interpretation of the PCA loadings represents the size of manager’s capital base, count of investors and the size of the manager’s operation. The farther to the left you move, the more assets under management, fund investors and offices the manager has. The farther to the right the less AUM, fund investors and offices a manager has.

To demonstrate this, hover over outliers farthest to the left, NEA and Deer Management, whose business name is actually Bessemer Ventures. These outliers encapsulate of first principal component. These managers have the largest assets under management and fund investor count, generally by an order of magnitude over the rest of the managers.

The y-axis represents the second principal component, this is more difficult to interpret. As you move south along the axis it indicates a heavy influence of having a prime broker, relying on fund of funds for capital and/or using feeder funds for a large portion of a vehicle’s capital. As you move south along the y-axis this indicates having a high number of unrelated fund auditors, administrators and marketers.

You can get a better understanding of this component by hovering over the outlier at the bottom right, FF Asset Management. They only have $144.812M in AUM which is why the firm is on the right on the x-axis and if you look at the all_data you will see the manager has engaged 6 different fund marketers and 3 separate fund auditors is why they are located on the south end of the y-axis. Compared to the other 218 managers, this count of fund marketers and auditors is a huge outlier. While this information doesn’t indicate anything in isolation, this heavy distortion does raise a red flag.

PCA Driven Conclussions

Finally we want to use this visualization and the results from our PCA to communicate feedback to the portfolio companies.

My first piece feedback is to be aware of meaning of NEA and Bessemer as relative whales. They have massive AUM and most likely will look to write significant large checks to optimize efficiency. Any pursuit of their capital should weight the importance of the overall round size and need for them to write a good-sized check.

My second piece of feedback is to treat any firm in the lower right quadrant with extreme skepticism. These managers have small asset bases and a fairly complex network of service providers. This indicates possible high reliance on fund of funds, a source of capital that adds another layer of fees to the end investor and may operate and push decisions onto the the recipient manager that pushes their own agenda.

This component also isolates managers with a higher count of auditors and administrators. This alone doesn’t indicate wrong doing but it does raises serious questions. I recommend before any dialogue with managers in that quadrant there is a serious vetting of these firms, their auditors, fund administrators and most importantly the principals and entities that own the manager.

That leaves the rest of the data, which if I were to guesstimate is about 200 of the 218 managers as candidates meet the criteria as a potential investor. I’d advocate a thorough exploration of the data in Section Schedule A and Section Schedule B which identifies the people and entities who own the management companies. I’d then write a function that takes the legal fund names and Googles for PDF files containing the name. Any matched results would provide us with a good idea of the fund’s financial returns and we may even find the fund’s portfolio investments and LPs. All very important information to process as you try to identify potential venture targets.

There are also 2 fantastic R data acquisition packages I’d integrate into the workflow. The first, rcrunchbase, wraps Crunchbase and is a fantastic way to discover a fund’s portfolio investments and any affiliations/investments of people affiliated with the firm.

The second is gdeltr2 which wraps The GDELT Project’s various APIs and data-stores. You can use this package to search for media or television references of the fund manager and/or any of its employees and portfolio companies. Depending on your R skills you could even right a cron job that queries hourly the API with a list of companies, terms, and people, storing any validated results in a separate database to build an on-going media tracking tool.

Parting Notes

I hope this tutorial was insightful and gave you an idea of how you can use fundManageR to try to answer real-life questions. I also hope it demonstrated how R can help you create powerful and engaging information in just a few lines of code.

I also wanted to encourage you to think about the importance of data driven, re-producible analysis. We have the ability to interactively show and share how we make decisions. Whether you are using R Notebooks or Jupyter Notebooks, it’s easy to create a data driven narrative that you can use to share, discuss and collaborate around a problem.

Thank you for investing your precious time going in going through this tutorial.

Don’t hesitate to reach out to me on twitter or via e-mail. I hope to be back soon with a tutorial integrating fundManageR, gdeltr2, and advanced machine learning techniques,

Stay curious, analytical, visual, open-minded, and away from spreadsheets.

---
title: "Find Me a Venture Capitalist"
author: "Alex Bresler"
date: "`r Sys.Date()`"
output: 
  html_notebook: 
    css: ~/Desktop/Semantic-UI/dist/semantic.css
    fig_width: 10
    highlight: tango
    theme: cosmo
---
<br>
<img src ="http://wolfofyorkstreet.com/wp-content/uploads/2016/09/venture-capital.jpg">

## Introduction

This tutorial demonstrates how I am utilizing [fundManageR](https://github.com/abresler/fundManageR), reproducible research, data visualization and data science to advise two of my portfolio companies who are preparing for an institutional fund-raising round.  This exercise will focus on exploring and analyzing venture capitalists.

### But Before We Get Started

Accurate, clean, and [tidy](http://r4ds.had.co.nz/tidy-data.html) data is an incredible resource, but it is only a tool.  

Success is about a confluence of factors including, but not limited to, *luck*, *execution*, *perserverence*, *communication*, and a *clear*, but *flexible*, understanding of the end goal.  Data, analytical skills, programming abilities, and machine learning proficiency are high powered weapons but in isolation they won't get you very far; they are only tools.

Now, let's get started

### The Mission: Identify and Target Venture Capital Firms Headquartered in a Specified Area

Today's mission is to find venture capitalists head-quartered in Maryland, Washington D.C., Virginia, New York, Massachusetts, and Florida.  We are limiting it to these locations to focus on areas that me, and the portfolio companies have strong ties in.

The workflow is straight forward: 

* Access the most recent monthly ADV Manager summary data 
* Exclude managers outside of our specified locations
* Exclude all non venture exempt managers
* Access detailed manager data by piping the [CRDs]( [CRD IDs](http://www.finra.org/industry/crd)) through the `get_data_adv_managers_filings` function
* Explore the data 
* Draw some conclusions.

### Important Caveat

This is a crude hack for isolating venture capitalists.  A pure play venture firm, in many circumstances, utilizes a special exemption allowing for venture managers to avoid many of the standard reporting requirements mandated on SEC regulated investment managers. The SEC captures information and discloses it in its standard monthly reporting data via a logical field that discloses whether a manager has utilized the venture exemption.

It is extremely important to note that this method limits us to *only* venture managers utilizing this exemption.  This method won't help us find managers who have venture as one of their product lines, like say [Texas Pacific Group](http://www.tpg.com/), or a venture manager that is registered but hasn't qualified or applied for this exemption like [137 Ventures](http://137ventures.com/).

Context and common sense is as important as a powerful data driven work-flow. This tutorial is meant to empower but not mislead you into thinking the results produced are all-encompassing, because they are not.  What tutorial will do is provide you with a crude method to acquire data that we know meet the venture criteria we defined.

### Step 1 - Import Most Recent Monthly Summary ADV Data

```{r load_packages_import}
packages <-
  c('fundManageR', 'dplyr', 'magrittr', 'tidyr', 'formattable', 'tibble') # packages we need
lapply(packages, library, character.only = T)

all_managers_recent <- 
  get_data_adv_managers_periods_summaries(all_periods = F, only_most_recent = T, is_exempt = c(F, T), file_directory = NULL, folder_name = NULL)
```

We now have the data for the 15,461 SEC regulated managers registered as of October 3rd, 2016 which we can pare down according to our search criteria.

### Step 2 - Find Exempt Venture Managers in Our Target Locations

Using `hasExemptionAsSolelyVentureAdviser` and `stateOfficePrimary` fields we can quickly filter down our data.

```{r pare_down}
locations <- 
  c("DC", "MA", "VA", "FL", "NY", "MD", "NY")

vcs_in_location_df <- 
  all_managers_recent %>%
  dplyr::filter(stateOfficePrimary %in% locations, # comma acts as AND in dplyr
  hasExemptionAsSolelyVentureAdviser == T)
```

The filtered data includes the 218 registered managers that meet our search criteria.  Soon we will take this and use the `get_data_adv_managers_filings` function to acquire all the disclosed information about each of these managers.

Before doing that I want to highlight something you notice while exploring the `vcs_in_location_df` data.  Some of the most pertinent fields containing information about a manager's assets under management, employee counts, client counts, fee structures are NA.  

The venture exemption allows these managers to avoid the periodic reporting of this information, something non-exempt managers are forced to do.  The good news is this data *does exist* and fundManageR will get it for you, the downside is the data is reported annually and is buried deep in the messy html of the manager's accepted ADV form.

### Step 3 - Acquire Detailed Manager Data

The `get_data_adv_managers_filings` function requires either free text search names or numeric CRDs to acquire the ADV filing data.  Search names may produce less accurate results, and often return managers whose information we did not want.

CRDs are a unique identifier, the [primary key](https://en.wikipedia.org/wiki/Unique_key) of the SEC's IAPD database for an entity, and if we have a firm's CR it will always produce accurate results.  Our prior work produced each manager's CRD which we will use to power our next function call.

### Please Be Patient

This search will take a while given we are parsing 218 managers so be patient.  This function will process a few *gigabytes* of data to produce our final results. You can track the progress with the message output but I'd recommend doing something else for about an hour or so while this runs.

```{r search_crds}
search_crds <- 
  vcs_in_location_df$idCRD # create vector of CRDs
all_manager_data <-
  get_data_adv_managers_filings(
    crd_ids = search_crds,
    all_sections = T,
    search_names = NULL,
    assign_to_enviornment = T
  )
```

### Step 4 - Explore the Data

Now that we've acquired the data lets explore it and answer a few questions.

* What are each manager's reported total assets under management?
* What is the distribution of assets under management by state?
* How many fund's does each manager control?
* What conclusions can we draw from the numeric data about the managers?


### Calculate Assets Under Management by Manager

The first thing we want to do is calculate the summary information we'd normally find in the monthly ADV data for these venture exempt managers.  

We can do this using the data in the `section7BPrivateFundReporting` table. We will produce the desired information by grouping the information by CRD and manager name, and summarizing the total assets under management, count of investors and count of fund vehicles.  

Finally we will join that data with descriptive information about the manager's state headquarters and website.

```{r get_fund_size}
fund_manager_totals <- 
  section7BPrivateFundReporting %>% 
  group_by(nameEntityManager, idCRD) %>%  # set group
  summarise(amountAUMTotal = sum(amountFundGrossAUM, na.rm = T),
         countVehicles = n(),
         countInvestors = sum(countFundOwners, na.rm = T)) %>% 
  left_join(
    vcs_in_location_df %>%  # join the office state
      select(idCRD, stateOfficePrimary)
  ) %>% 
  left_join(
    managerWebsite %>% # join the manager website
      dplyr::select(idCRD, urlManager)
  ) %>% 
  select(nameEntityManager, idCRD, urlManager, stateOfficePrimary, everything()) %>% 
  ungroup() %>% # ungroup
  arrange(desc(amountAUMTotal))
```

```{r private_funds}
fund_manager_totals
```

### Venture Manager Breakdown by Location

Next we will use the data to paint a picture of venture landscape by location.  To make this information visual and explorable I will show you how to use an experimental feature of the [formattable](https://github.com/renkun-ken/formattable) package which embeds [htmlwidgets](htmlwidgets.org) into a table.  

In order for this code to work you need to have the packages listed below loaded, you can follow the instructions in the code comments to install them if you have yet to do so.  The experimental formattable functions have yet to be integrated into the package so the functions must be sourced from my github gist.

```{r read_widget_code}
library(sparkline) #devtools::install_github("htmlwidgets/sparkline")
library(scatterD3) # devtools::install_github("juba/scatterD3")
library(htmlwidgets) # devtools::install_github("ramnathv/htmlwidgets")
source("https://gist.githubusercontent.com/abresler/8080a046794f97b8e3174b871b8429c0/raw/06f9c66552775facda7aeb7e6780ed74e9fda1c3/htmlwidget_tests.r")
```

Now we can create the table.

The first step is to use dplyr to create the summary data.  We do this by grouping by state, removing the CRD ID, and using the `summarise_if` function to quickly calculate the sum of all the remaining numeric columns.  

Next, we will shorten the column names to make them easier to print. We are left with the core summary data by state but still need to augment it by joining additional information and our visualizations.  We don't have the fund manager count by state so we can derive that and join it in with our existing data.  Finally we will join our visualization columns, scatter plots of assets under management by fund vehicles and assets under management by investor count.

```{r top_fund_table_widget}
ft_lm <-
  fund_manager_totals %>%
  dplyr::select(-idCRD) %>%
  group_by(state = stateOfficePrimary) %>%
  summarise_if(.predicate = is.numeric,
               funs(. %>% sum)) %>% 
  rename(vehicles = countVehicles, investors = countInvestors, AUM = amountAUMTotal) %>%
  left_join(
    fund_manager_totals %>%
      dplyr::select(-idCRD) %>%
      group_by(state = stateOfficePrimary) %>%
      summarise(managers = n())
  ) %>%
  left_join(
    fund_manager_totals %>%
      dplyr::select(-idCRD) %>%
      group_by(state = stateOfficePrimary) %>%
      summarize(
        aumByFundCount = scatter_cell(
          countVehicles,
          amountAUMTotal / 1000000, ## divide by 1M
          xlab = "Fund Vehicles",
          ylab = "AUM Millions",
          tooltip_text = paste0(
            nameEntityManager, # Add the tool tip
            "<br><strong>Amount AUM: </strong>",
            amountAUMTotal,
            "<br><strong>Fund Vehicles: </strong>",
            countVehicles
          ),
          point_size = 20, # modify point size
          hover_size = 6, # change hover size
          ellipses = T, # add grouping ellipse
          hover_opacity = 1, # change hover opacity
          height = 200, # modify the height
          width = 220 # modify the width
        ),
        aumByInvestorCount = scatter_cell(
          countInvestors,
          amountAUMTotal / 1000000,
          tooltip_text = paste0(
            nameEntityManager,
            "<br><strong>Amount AUM:</strong>",
            amountAUMTotal,
            "<br><strong>Investor Total: </strong>",
            countInvestors
          ),
          xlab = "Fund Investors",
          ylab = "AUM Millions",
          point_opacity = 0.5,
          hover_size = 4,
          hover_opacity = 1,
          point_size = 20,
          ellipses = T,
          colors = "#A94175",
          height = 200,
          width = 270
        )
      )
  ) %>%
  formattable() %>%
  as.htmlwidget() %>% 
  suppressMessages()


ft_lm$dependencies <- c(ft_lm$dependencies, # setup the htmlwidget
                        htmlwidgets:::widget_dependencies("scatterD3", "scatterD3"))
```


#### Print the Table

I encourage you to hover over the points to which will give you additional information about the manager.  You can also scroll in and out of the plots which will show you help you explore high level distributions as you scroll out and exact x-y coordinates as you scroll in.

```{r, fig.width=10}
ft_lm
```

### Fund Manager Exploration

Next we will try to better understand our list of managers.  We also want to make it easy to navigate to the manager's webpage which we can do by taking our data from step 1 and layering in some html.  Finally, we want to add a visual sparkline showcasing a manager's assets under management by fund, which  will give us a good visual cue on the AUM variance between funds, number of funds and overall fund sizes.

```{r by_manager_table}
manager_df <-
  fund_manager_totals %>%
  ungroup %>%
  dplyr::select(-idCRD) %>%
  mutate(urlManager = ifelse(
    urlManager %>% is.na(),
    NA,
    list(
      "<a href ='",
      urlManager, # create a clickable url that opens a new window
      "' target ='_blank'>",
      urlManager,
      "</a>"
    ) %>% invoke(paste0, .)
  )) %>%
  dplyr::rename(
    url = urlManager, # change naming to make the table fit
    state = stateOfficePrimary,
    investors = countInvestors,
    vehicles = countVehicles,
    AUM = amountAUMTotal
  )

manager_ft <-
  formattableWithSL(
    TableDF = manager_df,
    rawDF =
      section7BPrivateFundReporting %>% 
      select(nameEntityManager, fundAUM = amountFundGrossAUM) %>% 
      data.frame(), # requires data.frame unforutantely 
    key = "nameEntityManager", # key for spark line
    SLfield = "fundAUM" # variable visualized in spark line
  ) %>% 
  suppressMessages()
```

```{r manager_ft}
manager_ft
```

### A Data Driven View of the Manager Landscape

In this final step we will utilize some important numeric descriptors to draw some conclusions about the landscape of our managers.  To achieve this we will use a statistical analysis tool known as [Principal Components Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis)[PCA].  

PCA is a powerful way to take numeric data and break it down into a set of components that explain the variance of the data.  The algorithm returns a set scores for each the variables.  These scores include a list of components the length of the number of variables that fully explain the variance.  We will isolate, explore and visualize the first two principal components.


We have assembled some use numeric data but there are a couple of we can join in to enhance this analysis.  In addition to the assets under management, vehicle count, and investor count we want to include information that proxies a firm's operational size, something we can get by including a count of their offices.  We also want to include information a manager's average fund profile, something we can do by joining in information about a manager's weighted average investment minimum and percentage breakdown of various fund characteristics {percent of international investors, percent from fund of funds, percent from feeder fund, etc..}.  Finally we want information that profiles a manager's depth of relationships with  services providers.  We can achieve this by joining in a count of the service providers by service also disclosed in section7BPrivateFundReporting.

Once completed, we will take the data, isolate the the numeric fields, scale to mean zero and use `prcomp` function to apply the principal components algorithm.  We will then explore the loading scores, visualize the result first two components and draw some conclusions from the data.

### Assemble Data for PCA

```{r pca_prep}
fund_totals <- 
  section7BPrivateFundReporting %>%
  group_by(idCRD, nameEntityManager) %>%
  summarise(
    countFunds = max(numberFund),
    amountTotalAUM = sum(amountFundGrossAUM, na.rm = T),
    countTotalFundOwners = sum(countFundOwners, na.rm = T)
  ) %>%
  ungroup

service_provider_df <- 
  section7BPrivateFundReporting %>% 
  select(idCRD, matches("^name[A-Z]")) %>% 
  select(-c(nameFund, nameFundGPManagerTrusteeDirector, nameFundFeederGPManagerTrusteeDirector)) %>% 
  select(-matches("nameFundFeederGPManagerTrusteeDirector|nameCustodianBusiness|EntityManager")) %>% 
  gather(item, value, -c(idCRD), na.rm = T) %>% 
  mutate(item = item %>% str_replace_all("[0-9]", '')) %>% 
  distinct() %>% 
  group_by(idCRD, item) %>% 
  summarise(count = n()) %>% 
  mutate(item = item %>% str_replace_all('name', 'countFund')) %>%
  spread(item, count, fill = 0)

weighted_fund_avg_df <- 
  section7BPrivateFundReporting %>%
  group_by(idCRD) %>%
  summarise_at(
    c(
      "amountFundMinimumInvestment",
      "pctFundManagerOwned",
      "pctFundFundOfFunds",
      "pctFundNonUSCitizens",
      "pctClientsFundInvestor"
    ),
    funs(weighted.mean(. , amountFundGrossAUM, na.rm = T))
  ) %>% 
  ungroup

office_count_df <- 
  managerOtherOfficeLocations %>% 
  group_by(idCRD) %>% 
  summarise(countOffices = max(countItem)) %>% 
  ungroup()

all_data <- 
  fund_totals %>% 
  left_join(
    vcs_in_location_df %>% select(idCRD, stateOfficePrimary)
  ) %>% 
  left_join(weighted_fund_avg_df) %>% 
  left_join(office_count_df) %>%
  left_join(service_provider_df) %>% 
  suppressMessages()
```

### Scale the Data

```{r scale_the data}
set.seed(123) # make it reproducible

scaled_numeric_data <- 
  all_data %>%
  purrr::keep(is.numeric) %>% 
  mutate_all(as.numeric) %>% 
  gather(item, value, -idCRD) %>% 
  spread(item, value, fill = 0) %>% 
  dplyr::select(-idCRD) %>% 
  scale()
rownames(scaled_numeric_data) <-
  all_data$nameEntityManager

scaled_numeric_data %>% data.frame %>% rownames_to_column(var = 'nameEntityManager') %>% as_data_frame()
```

### Perform the PCA

```{r pca_calc}
pc_funds <- 
  scaled_numeric_data %>% 
  prcomp()
```

### Explore the Loadings

Now we can explore the results.  If you aren't familiar with the methodology behind PCA loadings I recommend [reading this](http://stats.stackexchange.com/questions/222/what-are-principal-component-scores).  Try to spend a few minutes exploring each of the 15 components and their variable loadings and think about what the scores may indicate.  Try to focus on the top 5 components as they are the influential to explaining the variance of our data.

```{r pca_loadings}
pc_funds$rotation %>% 
  data.frame() %>% 
  tibble::rownames_to_column(var = 'nameVariable') %>% 
  as_data_frame() %>% 
  arrange(PC1, PC2)
```


### Visualize the Results

Now we will take the results, tidy them with [broom](https://github.com/dgrtwo/broom) package, and build an interactive plot of the top two principal components.  

To create the data visualization we will first build a static [ggplot2](https://github.com/hadley/ggplot2) scatter plot, customize the format, and then feed the static plot into [plotly](https://plot.ly/) to produce an interactive visualization.

Note, in order to exactly replicate this visualization you must have the Myraid Pro font installed on your system which you can download from [here](http://fontsup.com/family/myriad+pro.html).  

If the code does not work I recommend removing code contained in  `hrbrmisc::theme_hrbrmstr_my() +`.  Finally please make sure you have the other required packages installed and loaded.

```{r pca_plot_prep}
library(broom) # install.packages("broom")
library(ggplot2) # devtools::install_github("hadley/ggplot2") # requires dev version
library(plotly) # devtools::install_github("ropensci/plotly") requires dev version
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")

pca_plot <- 
  augment(pc_funds, data = all_data) %>% # Tidy the results
  ggplot(aes(.fittedPC1, .fittedPC2, colour = stateOfficePrimary, text = nameEntityManager)) +
  geom_jitter(size = .55) + # jitter the results
  hrbrmisc::theme_hrbrmstr_my() + # require myriad pro font installed on your computer as well, remove this code if it doesn't work
  scale_x_continuous(name = "PC1 - AUM / Investor, Vehicle Count", breaks = seq(-13.5, 2.5, length.out = 8)) +
  scale_y_continuous(name = "PC2 - Feeder / Fund of Fund Reliance, Service Provider Complexity", 
                   breaks = seq(-7, 3, length.out = 8)) + 
  geom_hline(yintercept = 0, color = "gray70", size = .15) + # add horizontal line
  geom_vline(xintercept = 0, color = "gray70", size = .15) + ## add vertical line
  theme(panel.grid.major.x = element_blank(),panel.grid.major.y = element_blank()) + # remove the lines
  ggtitle("Venture Manager PCA Analysis")
pca_gg <-
  pca_plot %>% ggplotly()
```

```{r pca_plot, fig.width=9}
pca_gg
```

#### Interpretting the Results

This visualization may like a standard scatter plot but it is not.  The X and Y axis are unrelated and we aren't looking to the x-axis to explain the y-axis.  Each axis represents a distribution of component scores.

The x-axis represents the first principal component which, according to my interpretation of the PCA loadings represents the size of manager's capital base, count of investors and the size of the manager's operation.  The farther to the left you move, the more assets under management, fund investors and offices the manager has.  The farther to the right the less AUM, fund investors and offices a manager has.

To demonstrate this, hover over outliers farthest to the left, [NEA](https://en.wikipedia.org/wiki/New_Enterprise_Associates) and Deer Management, whose business name is actually [Bessemer Ventures](https://en.wikipedia.org/wiki/Bessemer_Venture_Partners).  These outliers encapsulate of first principal component.  These managers have the largest assets under management and fund investor count, generally by an order of magnitude over the rest of the managers.

The y-axis represents the second principal component, this is more difficult to interpret.  As you move south along the axis it indicates a heavy influence of having a [prime broker](https://en.wikipedia.org/wiki/Prime_brokerage), relying on [fund of funds](https://en.wikipedia.org/wiki/Fund_of_funds) for capital and/or using [feeder funds](https://en.wikipedia.org/wiki/Feeder_fund) for a large portion of a vehicle's capital.  As you move south along the y-axis this indicates having a high number of unrelated fund auditors, administrators and marketers.  

You can get a better understanding of this component by hovering over the outlier at the bottom right, FF Asset Management.  They only have $144.812M in AUM which is why the firm is on the right on the x-axis and if you look at the `all_data` you will see the manager has engaged 6 different fund marketers and 3 separate fund auditors is why they are located on the south end of the y-axis.   Compared to the other 218 managers, this count of fund marketers and auditors is a *huge* outlier.  While this information doesn't indicate anything in isolation, this heavy distortion does raise a red flag.

### PCA Driven Conclussions

Finally we want to use this visualization and the results from our PCA to communicate feedback to the portfolio companies.

My first piece feedback is to be aware of meaning of NEA and Bessemer as relative whales.  They have massive AUM and most likely will look to write significant large checks to optimize efficiency. Any pursuit of their capital should weight the importance of the overall round size and need for them to write a good-sized check.

My second piece of feedback is to treat any firm in the lower right quadrant with extreme skepticism.  These managers have small asset bases and a fairly complex network of service providers.  This indicates possible high reliance on fund of funds, a source of capital that adds another layer of fees to the end investor and may operate and push decisions onto the the recipient manager that pushes their own agenda.  

This component also isolates managers with a higher count of auditors and administrators.  This alone doesn't indicate wrong doing but it does raises serious questions.  I recommend before any dialogue with managers in that quadrant there is a serious vetting of these firms, their auditors, fund administrators and most importantly the principals and entities that own the manager.

That leaves the rest of the data, which if I were to guesstimate is about 200 of the 218 managers as candidates meet the criteria as a potential investor.  I'd advocate a thorough exploration of the data in Section Schedule A and Section Schedule B which identifies the people and entities who own the management companies.  I'd then write a function that takes the legal fund names and Googles for PDF files containing the name.  Any matched results would provide us with a good idea of the fund's financial returns and we may even find the fund's portfolio investments and LPs.  All very important information to process as you try to identify potential venture targets. 

There are also 2 fantastic R data acquisition packages I'd integrate into the workflow.  The first, [rcrunchbase](https://github.com/tarakc02/rcrunchbase), wraps [Crunchbase](https://www.crunchbase.com/) and is a fantastic way to discover a fund's portfolio investments and any affiliations/investments of people affiliated with the firm.  

The second is [gdeltr2](https://github.com/abresler/gdeltr2) which wraps  [The GDELT Project](gdeltproject.org)'s various APIs and data-stores.  You can use this package to search for media or television references of the fund manager and/or any of its employees and portfolio companies.  Depending on your R skills you could even right a [cron job](https://en.wikipedia.org/wiki/Cron) that queries hourly the API with a list of companies, terms, and people, storing any validated results in a separate database to build an on-going media tracking tool.

### Parting Notes

I hope this tutorial was insightful and gave you an idea of how you can use fundManageR to try to answer real-life questions.  I also hope it demonstrated how R can help you create powerful and engaging information in just a few lines of code.  

I also wanted to encourage you to think about the importance of data driven, re-producible analysis.  We have the ability to interactively show and share how we make decisions.  Whether you are using [R Notebooks](http://rmarkdown.rstudio.com/r_notebooks.html) or [Jupyter Notebooks](jupyter.org), it's easy to create a data driven narrative that you can use to share, discuss and collaborate around a problem.

Thank you for investing your precious time going in going through this tutorial. 

Don't hesitate to reach out to me on [twitter](https://twitter.com/abresler) or via e-mail.  I hope to be back  soon with a tutorial integrating fundManageR, gdeltr2, and advanced machine learning techniques,

Stay curious, analytical, visual, open-minded, and away from spreadsheets.