6 min read

Parsing Mass Municipal PDF CAFRs with Tabulizer, pdftools and AWS Textract - Part 1

# Libraries
packages <- 
  c("data.table",
    "rlist",
    "stringr",
    "DT",
    "janitor",
    "readxl",
    "xlsx"
    )

if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')

Introduction

Redwall Analytics had the pleasure of collaborating with Marc Joffe, of Reason Foundation, in its October 2018 post Replicating Yankee Institute Risk Score Over 15 Years for 150 Connecticut towns. This involved taking a well organized public dataset from the State’s website, and analyzing and building an application to view the risk score over time in R. When Marc called to ask if we could report on our blog site about similar analysis for 149 Massachusetts towns, as in Connecticut’s Warning Signs, we jumped at the idea. Naturally, Redwall wanted to replicate the analysis over a longer time period, but this turned out to be more challenging in Massachusetts.

As Redwall discussed in A Walk Though of Accessing Financial Statements with XBRL in R - Part 1, the SEC has required registered companies to report in XBRL since 2008. Unfortunately, this is not the case for the tens of thousands of municipal borrowers around the US. Connecticut is one of approximately 20 states with a forman municipal monitoring program, so the Office of Policy and Management systematically gathers key financial statement data each year in an annual report and open public databases. Unfortunately, Massachusetts does not have formal monitoring and does not make similar databases available.

While Massachusetts does maintain the slick-looking Municipal Trends Dashboard, with some key information going back in some cases 20 years, it doesn’t offer the opportunity to download the complete data, and for some reason, excludes a lot of key elements. For example, it includes “Unassigned General Fund Balance”, but not “Assigned General Fund Balance”, so it is impossible to know the complete picture of the Fund Balance of a municipality. There are similar difficulties finding the “Unrestricted Net Position”, which is essential to the Yankee Institute’s analysis. In Connecticut, we have felt a little left behind by Massachusetts’ successful biotech and high-technology industries, and resulting booming real estate markets, so it was a surprise to see how far ahead we are in this regard.

Project Outlines

As we worked to replicate Marc’s analysis, the outlines of an interesting project, which might be helpful to others needing to extract data from PDFs, became apparent. Redwall will gather some of the information from the Municipal Trends Dashboard, but then try to extract the missing elements from PDFs containing the cities’ audited financial statements (known as CAFRs), using OCR tools available in R. We will then attempt to use the commercial AWS Textract service for the remaining tough cases using the R paws library. For the purposes of this exercise, Marc’s spreadsheets will represents a kind of “labelled data” for comparison purposes with fields from the extracted PDFs.

The relevant 149 PDFs for 2018 were downloaded from a library maintained by the Center for Municipal Finance, and Redwall set out on heavy-duty PDF parsing to be described in multiple upcoming posts. In this post, we will show the data required in Table 1 and results in Table 2 of Marc’s analysis, and explore which data fields will be needed to automate the process. (Please note: the data and methodology shown below are Marc’s, displayed here for information purposes. Redwall Analytics does not vouch for the accuracy of the data or methodology, nor make any investment recommendations based on it.)

After that, we will move on to attempting to extract that data from the PDFs. The first steps of the process to parse these PDFs using two incredible rstat wrappers pdftools and tabulizer. We will also use Amazon’s Textract using the R paws SDK library. While these libraries have been around for a couple of years, we did not find anyone who had put all of these pieces together through the paces like we are about to.

Because these tools are relatively new, we struggled to find much written on using them to their full individual potentials. We also believe that we have found some interesting methods, making them more powerful by using them in combination. One of the main goals of this series is to begin to repay the debt to all those who shared their wisdom and enabled us to reach the level where we can give back to the community.

Blogpost Index

Massachusetts Municipal Vulnerability

To begin with, we load and show his spreadsheet analysis with his key inputs. Needless to say, it is a big job to gather and extract that much data manually PDF-by-PDF and record in a spreadsheet. In addition, Marc’s approach can only be used at a point in time, and would require a similar amount of effort to repeat for past or future years. If it is possible to reliably automate, even partially, it should help the cause of municipal fiscal transparency. We show the data used in Marc’s analysis in Table 1 below. As with most of our posts, it is possible to scroll over the table columns, adjust the number of rows displayed or search for a particular municipality with the search bar.

Figure 1: Massachusetts Municipal Financial Data - Key Inputs

As shown above in Table 1, Marc uses three financial statement metrics for his analysis: (1) Unrestricted Net Position divided by total expenses, (2) All forms of indebtedness including unfunded pension and OBEB liabilities divided by total revenues, (3) Unassigned plus Assigned Net Position divided by General Fund Expenses, and two cyclical/macro metrics inculding: (1) year-over-year change in employment and change in home prices. At the moment of this analysis (ie: based on 2018 conditions), the two cyclical components don’t differentiate much among municipalities at this point in time, because most have high scores. Together, these drive his five Risk score components to give the aggregate “Risk Score” below for each town, shown below in Table 2. For a more detailed explanation, see Towns in Trouble - Assessing Municipal Health in Vermont

According to Marc’s analysis, a handful of towns including: Fall River, New Bedford, Pittsfield, Eastham, Quincy, Glouchester and Springfield are most vulnerable. Some of these towns have weak scores for Unrestricted Net Position, Debt and General Fund, but are helped by the smaller, more cyclical Home Valuation and Employment components. Even the strongest towns have somewhat weak Unrestricted Net Positions, and after the strongest 30-40 towns, General Funds and Debt Scores start to fall off.

Figure 2: Massachusetts Municipal Vulnerability Scores

Conclusion

That completes the first in this series of posts. In the next post Tabulizer and pdftools Together as Super-powers - Part 2, we will show how to the combination of pdftools and tabulizer can enable more accurate table extraction for a large number of slightly varying tables.