Helping Humans Find Homes

Tyler Etheridge
8 min read · Jan 7, 2021

For more than 60 years, Maslow’s hierarchy of needs has described the links between a human’s necessities and internal motivations. The primary level of the hierarchy represents physiological needs, the basic prerequisites to survival: food, water, and shelter. Family Promise of Spokane is a non-profit organization confronting the growing incidence of familial homelessness in the Spokane, Washington area, providing resources to those in danger of not having these basic physiological needs met. As Family Promise of Spokane has grown to meet the needs of its guests, the outdated intake and resource-management workflows in use have struggled to keep pace. To curb the growing labor requirements, and to address how insensitive a written intake interview can be to the stress the shelter’s guests are already under, my team has worked to develop a digital intake solution to replace the out-of-date system currently in place.

2019 Statistics for homelessness in WA from endhomelessness.org

The goal of this software solution is not only to let case managers and supervisors process guests much more efficiently, but also to use the results of machine learning workflows running in the background to better distribute resources to families at critical times. One specific application is predicting the type of exit destination each family is likely to move to when they leave the shelter. Coming into the project as the data analyst and modeling engineer, I became concerned as early as the planning stages about a tension at the heart of the problem: the wide variety of circumstances families come from versus the strict parameters of the master information database system that Family Promise of Spokane must adhere to. Because every family has its own set of circumstances, there are countless ways to explore and model the data for each of them; by the time of deployment, however, the chosen modeling features have to fit within the universal set of features available from the master database so that the predictions generalize and can be applied to whatever circumstances are given.

Family Promise of Spokane aims to help people by building resource networks

Sometimes a Shovel isn’t Good Enough to Dig With

As alluded to above, Family Promise of Spokane’s position within a larger network of homeless shelters imposes specific requirements on data handling that are not convenient to sift through. The dataset given to the data science team was an anonymized set of entries from the master database system that governs the entire network of shelters. The largest consequence of this format is roughly 100 feature entries per guest that are largely irrelevant or missing, because each shelter varies its intake interview questions based on its type, location, and the resources it offers. Thus the dilemma arises: how do you start the workflow when the majority of the data features need individual scrutiny?

As the person responsible for both the data analysis and the modeling, I could do no work on the data until these null features had been resolved. The dataset only contained data from the specific shelter that is our stakeholder for this project, so there were some conveniently creative feature-filtering techniques I was able to use to massively pare down the dataset before manually inspecting the remaining entries. For certain features, the master database system limits the valid string entries, and those entries also apply when shelter guests decline to provide specific information. For example, the “Race” feature accepts “Client refused”, “Client doesn’t know”, “NaN”, and “Data not Collected” as valid responses when an observation has a null entry.

A look at the value counts of the race feature
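
To make that inspection concrete, here is a minimal sketch of checking a single feature’s value counts and collapsing the contextually null responses, assuming a pandas DataFrame loaded from the anonymized export (the file name is hypothetical; the sentinel values are the ones listed above):

```python
import pandas as pd

# Responses the master database accepts as valid but that carry no real information
CONTEXTUAL_NULLS = ["Client refused", "Client doesn't know", "Data not Collected", "NaN"]

# Hypothetical file name for the anonymized export
df = pd.read_csv("anonymized_guest_entries.csv")

# Inspect how often each response appears for a single feature
print(df["Race"].value_counts(dropna=False))

# Collapse the contextually null responses (and true NaNs) into a single value
df["Race"] = df["Race"].replace(CONTEXTUAL_NULLS, "Unknown").fillna("Unknown")
```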

Even though many classifiers can recognize that those entries are contextually the same and do not strictly require binning, binning them produces a more accurate picture of feature cardinality across the entire dataset. That, in turn, allows a programmatic approach to feature paring once confidence in the cardinality is established. The image below shows a snapshot of one of the data cleaning pipelines used to accomplish this. In the replace_values function, a list of features is passed in and the contextually null values are replaced and imputed with “Unknown”. Once cardinality has been regulated, a null-value threshold filter is applied to the dataframe in the remove_null_columns function, which checks the incidence of null or unknown values for each feature and drops the feature if the count exceeds the threshold set on the first line of the function.

The last automated part of the paring builds on the results of that binning. For some features, such as an income source that is not relevant to this particular shelter, the database fills in a default value rather than a null entry, so null-value counts alone are not enough to catch what are essentially completely empty columns. By ensuring that all contextually “Unknown” entries have been coerced to a single value, checking for columns with a cardinality of 1 exposes features that would act as universal constants and contribute nothing to the modeling. Those columns are dropped as well.

pandas pipelines were used to maintain consistency and atomicity
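
The pipeline in the snapshot boils down to roughly the following sketch. The function names replace_values and remove_null_columns come from the description above, while the threshold value, the example columns, and the .pipe() chaining are illustrative rather than the exact code used:

```python
import pandas as pd

CONTEXTUAL_NULLS = ["Client refused", "Client doesn't know", "Data not Collected", "NaN"]


def replace_values(df, columns):
    """Coerce contextually null entries in the given columns to 'Unknown'."""
    df = df.copy()
    df[columns] = df[columns].replace(CONTEXTUAL_NULLS, "Unknown").fillna("Unknown")
    return df


def remove_null_columns(df):
    """Drop any column whose count of null or 'Unknown' entries exceeds a threshold."""
    threshold = 0.5 * len(df)  # illustrative cutoff: more than half the rows
    unknown_counts = df.isna().sum() + (df == "Unknown").sum()
    return df.drop(columns=unknown_counts[unknown_counts > threshold].index)


def remove_constant_columns(df):
    """Drop columns with a cardinality of 1 -- universal constants add nothing to a model."""
    return df.drop(columns=[col for col in df.columns if df[col].nunique() == 1])


df = pd.read_csv("anonymized_guest_entries.csv")  # hypothetical file name

# Chain the steps with .pipe() so each stage hands a fresh dataframe to the next
clean = (
    df.pipe(replace_values, columns=["Race", "Ethnicity", "Gender"])  # illustrative columns
      .pipe(remove_null_columns)
      .pipe(remove_constant_columns)
)
```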

As much as I wished the feature paring could be accomplished entirely through creative coding, dozens of features still had to be manually inspected for relevance, because the digital intake system we were building introduces different fields and retires some older requirements, changing the information future guests will provide. This is where the shovel alone was not enough: I manually generated a list of columns to remove by cross-referencing the new intake fields from the web developers on my team against the master database dictionary.

Becoming Increasingly Picky

The final goal of the data science workflow is to classify and report the exit destination each family is most likely to move to after their time at the shelter. With the data in much better shape for modeling, only a few feature engineering tasks remained. In the original dataset, the “Exit Destination” feature that serves as our classification target has 22 possible values; the stakeholder, however, wanted the classifier to return only 1 of 5 supercategories. So, just as with the unknown values earlier, we worked directly with the stakeholder to remap the values to his desired categories.

The original 22 targets
Targets after remapping
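
That remapping amounts to a simple lookup, sketched below with pandas. The supercategory labels and the handful of raw destinations shown are illustrative stand-ins, not the stakeholder’s full mapping:

```python
# Illustrative mapping from a few of the 22 raw destinations to 5 supercategories;
# the real mapping came from the stakeholder and covers every raw value.
EXIT_DESTINATION_MAP = {
    "Rental by client, no ongoing housing subsidy": "Permanent Exit",
    "Staying or living with family, temporary tenure": "Temporary Exit",
    "Emergency shelter, including hotel or motel paid for with a voucher": "Emergency Shelter",
    "Transitional housing for homeless persons": "Transitional Housing",
    "Client doesn't know": "Unknown/Other",
}

# 'clean' is the dataframe produced by the cleaning pipeline sketched earlier
clean["Exit Destination"] = clean["Exit Destination"].map(EXIT_DESTINATION_MAP)
```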

For the classifier, I chose to start with RandomForestClassifier from scikit-learn. It is straightforward to drop into modeling and hyperparameter tuning pipelines and gives the project a quick baseline. Based on findings from the data analysis, I gradually built up a set of features that I felt carried important variance. The original feature scope was very narrow, using only basic universal values, and produced roughly 50% accuracy on the five-class problem. With the feature cleaning done and the expanded dataset, accuracy quickly jumped to 70%, with significant room left to tune precision and recall on each class through further engineering.

A snapshot of the classification report.
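
A baseline along these lines can be sketched as follows, assuming scikit-learn and the cleaned dataframe from the earlier steps; the feature list, encoder choice, and search grid here are illustrative rather than the exact configuration used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Illustrative feature list -- the real set came out of the analysis described above
features = ["Race", "Ethnicity", "Gender", "Prior Living Situation", "Relationship to Head of Household"]
target = "Exit Destination"

X_train, X_test, y_train, y_test = train_test_split(
    clean[features], clean[target], test_size=0.2, stratify=clean[target], random_state=42
)

pipeline = Pipeline([
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ("model", RandomForestClassifier(random_state=42)),
])

# Light hyperparameter search around the defaults
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "model__n_estimators": [100, 200, 400],
        "model__max_depth": [None, 10, 20],
        "model__min_samples_leaf": [1, 2, 5],
    },
    n_iter=10,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

print(classification_report(y_test, search.predict(X_test)))
```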

My Participation’s End is Another Family’s New Beginning

Even though 70% accuracy is where the classifier stands today, it is far from the end of the work that can be done to improve the lives of families trying to start over and build something better for themselves. With the extensive feature documentation I generated and a far more approachable dataset for deeper engineering and exploration, the project is in a stable state for new team members to contribute to the cause. There are now multiple documents detailing the contextual findings and requirements from the shelter, and the classifier is implemented in an API that returns a prediction to the digital intake dashboard as guests are added to the new database.
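
As a rough sketch of how that integration can look, here is a minimal prediction endpoint assuming FastAPI and a pickled model; the route, model file, and field names are hypothetical, not the project’s actual API:

```python
import pickle

import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical path to the classifier serialized after training
with open("exit_destination_model.pkl", "rb") as f:
    model = pickle.load(f)


class GuestRecord(BaseModel):
    # Hypothetical subset of intake fields; the real schema mirrors the digital intake form
    race: str
    ethnicity: str
    gender: str
    prior_living_situation: str
    relationship_to_head_of_household: str


@app.post("/predict-exit-destination")
def predict_exit_destination(guest: GuestRecord):
    """Return the predicted exit-destination supercategory for a newly added guest."""
    row = pd.DataFrame([guest.dict()])
    # Align the request fields with the column names the model was trained on
    row.columns = [
        "Race", "Ethnicity", "Gender",
        "Prior Living Situation", "Relationship to Head of Household",
    ]
    return {"exit_destination": model.predict(row)[0]}
```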

Moving forward, there is a lot of room for a context-sensitive effort to communicate the classifier’s results. To the supervisors working with these families, a bare classifier output is of very little use; creating a framework that communicates the context of the results in plain language, with sensitivity to the privacy and circumstances of each family, is the next big step toward a full-featured data science workflow for this product. This will certainly prove difficult, as the reports must work for any family that walks through the shelter’s doors and still provide clear, precise insight into how the staff can help each one specifically.

This project has forever underlined the basic data science principle of “garbage in, garbage out.” Multiple times during my initial modeling, I would tweak parameters to try different feature sets, and seemingly out of nowhere a poorly cleaned or irrelevant feature would sneak into my dataframe and lead to abysmal results. Spending more time with the data and revising my criteria for what counted as relevant to the modeling context would routinely net a 10–15% increase in accuracy. My coding structure was a recurring topic of conversation and feedback throughout the project, and it helped me appreciate the benefits of writing code that is DRY and respects ACID-like principles, benefits that are not apparent when you simply type away until you get the solution you want.

Furthermore, this has been the most comprehensive project I have been involved in from a software development point of view. Seeing how web developers work and communicate has influenced my thinking and my approach to problem solving and integrations. Ideally, I will continue to round out my skill set in my future career, not bound to being just a data scientist but becoming a competent software engineer all around. Participating in processes currently outside my technical skill set provided immeasurably important lessons toward that goal of being a more well-rounded and valuable member of a team working in a real-world software development environment.
