The Internet Gap Has a Map, and It Is Not Pretty - Packet Flow: Journey to Network & Cybersecurity Expertise

In one of the labs in our Advanced GIS Applications course, we used ArcGIS Pro and Random Forest machine learning to predict the percentage of households without internet access across U.S. counties.

That sounds more complicated than it really is. The idea was simple: take county-level data, prepare it properly, train a model, test it against data it had not already seen, and see which factors helped explain where internet access is still weak.

This was not about pretending machine learning is magic. It is not. It is a tool. A useful one, but only if the data is prepared correctly and the results are interpreted with common sense.

That last part matters because software will gladly give you an answer even when the question is bad. Very helpful. Very dangerous.

Final map layout showing predicted percentage of households without internet access across U.S. counties

The final map showed predicted internet access gaps by county. Lighter areas had lower predicted percentages of households without internet access. Darker areas had higher predicted percentages.

The pattern made sense. Rural and less populated counties generally showed higher predicted values. More urbanized counties generally showed lower predicted values.

That is not exactly shocking. But predictable does not mean unimportant. A pothole is predictable too. Still ruins your tire.

Getting the Data Ready

The lab started with the basic setup. A new ArcGIS Pro project was created, and the county dataset was added to the map.

ArcGIS Pro project with U.S. county boundaries loaded

The shapefile loaded correctly and displayed county boundaries across the United States.

This is not the exciting part, but it is necessary. Before running a model, the data has to load, display, and behave. Otherwise, we are not doing machine learning. We are just asking the computer to guess with a straight face.

And computers are very good at being wrong with confidence.

Checking the Projection

The next step was checking the coordinate system. The original dataset used WGS 1984, which stores locations using latitude and longitude.

Layer properties showing the original spatial reference

For this lab, the data was reprojected to USA Contiguous Albers Equal Area Conic.

Selecting USA Contiguous Albers Equal Area Conic projection

That projection is better for national-scale analysis across the contiguous United States because it preserves area, so counties keep their correct relative sizes. Since this lab compared county-level patterns, that mattered.

Projection is one of those GIS topics people want to skip until skipping it creates bad results. Then suddenly everyone cares about coordinate systems. Funny how that works.
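To see why raw latitude and longitude are a poor basis for comparing areas, here is a minimal sketch in plain Python. It is not part of the lab workflow. It treats the Earth as a sphere with a 6,371 km radius (an approximation), and the two latitude bands are illustrative picks near the southern and northern edges of the contiguous United States.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)

def cell_area_km2(lat_south_deg, lat_north_deg, lon_width_deg):
    """Ground area of a latitude/longitude cell on a sphere, in square kilometers."""
    phi_south = math.radians(lat_south_deg)
    phi_north = math.radians(lat_north_deg)
    lon_width = math.radians(lon_width_deg)
    return EARTH_RADIUS_KM ** 2 * lon_width * (math.sin(phi_north) - math.sin(phi_south))

# Two cells that are both exactly 1 degree by 1 degree (illustrative latitudes).
south = cell_area_km2(25.0, 26.0, 1.0)
north = cell_area_km2(48.0, 49.0, 1.0)

print(f"1x1 degree cell at 25-26 N: {south:,.0f} km^2")
print(f"1x1 degree cell at 48-49 N: {north:,.0f} km^2")
print(f"north/south area ratio: {north / south:.2f}")
```

Both cells look identical in degrees, but the northern one covers noticeably less ground. An equal-area projection removes that distortion before any county-to-county comparison happens.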

Splitting the Data

The dataset was split into training and test groups. Each group contained 50 percent of the records.

Subset Features tool used to create training and test datasets

This step is important because the model needs to be tested on data it has not already seen.

If we train and test a model on the same data, the model may look better than it really is. That is not learning. That is memorizing the answer key.

The training data was used to build the model. The test data was used to check how well the model worked. Holding back test data does not prevent overfitting, but it exposes it. Overfitting happens when a model performs well on training data but does not generalize well to new data.

In plain English, it means the model learned the worksheet but not the lesson.
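The 50/50 split can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Subset Features tool itself. The county IDs, the `split_half` helper, and the seed are all made up for the example.

```python
import random

def split_half(records, seed=42):
    """Shuffle a copy of the records and split it into two halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

# Hypothetical county IDs standing in for rows of the county dataset.
county_ids = [f"county_{i:04d}" for i in range(1000)]
training, test = split_half(county_ids)

print(len(training), len(test))
print("no overlap:", set(training).isdisjoint(test))
```

The part that matters is the last line. No county appears in both groups, so the test set really is data the model has not seen.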

Building the Random Forest Model

We used ArcGIS Pro’s Forest-based Classification and Regression tool to train a Random Forest model. The model predicted the percentage of households without internet access.

The explanatory variables included income, population density, land use, infrastructure-related variables, and regional characteristics.

Forest-based Classification and Regression tool setup

The model used 100 trees.

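Random Forest works by building many decision trees, each trained on a random sample of the data, and combining the results. For a numeric target like this one, combining means averaging the trees' predictions. One tree can be weak. Many trees together can give a stronger prediction.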

So yes, in this case, more trees helped. GIS finally found a forest it could use without requiring a chainsaw permit.
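Here is a stripped-down sketch of the "many trees, averaged" idea in plain Python. It is not what ArcGIS Pro runs internally. Each "tree" here is a single split on one variable, the data is synthetic, and names like `income_index` are invented for the example. A real Random Forest grows deeper trees and also samples the variables at each split.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Fit a one-split regression tree: keep the threshold with the lowest squared error."""
    overall_mean = statistics.fmean(ys)
    best_sse, best_stump = float("inf"), (None, overall_mean, overall_mean)
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if sse < best_sse:
            best_sse, best_stump = sse, (threshold, left_mean, right_mean)
    return best_stump

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    if threshold is None or x <= threshold:
        return left_mean
    return right_mean

def fit_forest(xs, ys, n_trees=100, seed=0):
    """Train each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in rows], [ys[i] for i in rows]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of every tree in the forest."""
    return statistics.fmean(predict_stump(stump, x) for stump in forest)

# Synthetic data (assumption): higher income index, lower share without internet.
data_rng = random.Random(1)
income_index = [i / 59 for i in range(60)]
pct_no_internet = [30 - 20 * x + data_rng.gauss(0, 2) for x in income_index]

forest = fit_forest(income_index, pct_no_internet, n_trees=100)
for x in (0.1, 0.5, 0.9):
    print(x, round(predict_stump(forest[0], x), 1), round(predict_forest(forest, x), 1))
```

A single one-split tree can only ever answer with one of two numbers. Averaging 100 of them, each split in a slightly different place, gives a smoother prediction that follows the trend more closely.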

Reading the Prediction Map

After the model ran, ArcGIS Pro created a prediction layer for the test counties.

Random Forest prediction layer showing estimated households without internet access.

The prediction map showed higher values in many rural and less densely populated counties. Lower values appeared more often in urban areas.

That makes sense. Urban areas usually have more people in less space. That makes internet service easier and cheaper to provide. Rural areas often have fewer people spread across larger distances. That usually means higher costs and less provider interest.

Then someone calls it a “deployment challenge,” because apparently “not profitable enough” sounds too honest.

This is where GIS helps. It turns a general issue into a location-based issue. Instead of saying “some places lack internet access,” the map shows where the problem is more likely to be.

That is useful.

Uncomfortable, maybe. But useful.

Looking at Variable Importance

The variable importance table showed which predictors had the most influence on the model.

Median household income was the strongest predictor. It contributed about 55.12 percent of total importance.

Population density was second. It contributed about 17.04 percent.

The lowest variable was the SUB_REG Mid Atlantic regional variable, which contributed about 0.06 percent.

That says something important.

Income mattered the most. Population density also mattered. Broad regional labels did not matter much.
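A quick tally of the reported percentages makes the gap concrete. This sketch uses only the three figures above and assumes the importance shares add up to 100 percent across all variables.

```python
# Importance shares reported in the variable importance table (percent of total).
income_pct = 55.12
density_pct = 17.04
mid_atlantic_pct = 0.06  # SUB_REG Mid Atlantic

top_two = income_pct + density_pct
everything_else = 100.0 - top_two  # assumes the shares sum to 100 percent
income_vs_region = income_pct / mid_atlantic_pct

print(f"Income + density: {top_two:.2f}% of total importance")
print(f"All other variables combined: {everything_else:.2f}%")
print(f"Income vs. Mid Atlantic label: about {income_vs_region:.0f} times larger")
```

Two variables carry roughly 72 percent of the importance. Everything else in the model, regional labels included, shares what is left.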

So the model was not saying that internet access is explained mainly by region. It was showing that local conditions matter more. Income, density, and infrastructure-related factors did more to explain the pattern.

In other words, the real conditions on the ground mattered more than the label on the map.

Shocking concept, I know.

Checking the Error

To evaluate the model, the predicted values were joined back to the test dataset. Then the residuals were calculated. A residual is the difference between the actual value and the predicted value.

Spatial Join tool used to combine predicted values with test data.

Calculate Field tool used to compute squared residuals.

The squared residuals were used to calculate Root Mean Square Error, or RMSE.
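The calculation itself is short. The sketch below uses four made-up counties, not the lab data, to show the steps: take each residual, square it, average the squares, then take the square root. It also prints the plain average of the absolute errors for comparison.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared residual."""
    squared_residuals = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_residuals) / len(squared_residuals))

def mean_absolute_error(actual, predicted):
    """Plain average of the absolute residuals, for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical percentages of households without internet in four counties.
actual = [12.0, 25.0, 18.0, 30.0]
predicted = [15.0, 21.0, 18.0, 35.0]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"Mean absolute error: {mean_absolute_error(actual, predicted):.2f}")
```

The RMSE comes out higher than the plain average error because squaring makes the big misses count more. That is by design. One badly missed county should hurt the score more than several near misses.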

The full Random Forest model produced an RMSE of about 6.08.

Since the dependent variable was a percentage, the RMSE is in percentage points. The model's predictions were typically off by about 6 percentage points, with large misses counting more heavily than small ones.

That is not perfect. But it is not useless either.

For a model dealing with income, geography, density, and infrastructure, that is a reasonable result. Real-world data is messy. Anyone expecting perfect predictions has probably never dealt with public data, customer records, or a printer.

RMSE sums up the size of the error in one number, but it does not show where the model was wrong. A good next step would be to map the residuals and see if the model overpredicted or underpredicted in certain areas.

That would make the analysis stronger.

Testing Median Income by Itself

Since median household income was the strongest predictor, the model was also run again using only median income.

Random Forest model rerun using only median household income

The median-income-only model produced an RMSE of about 6.30.

The full model produced an RMSE of about 6.08.

So the full model performed better.

The difference was not huge, but it still mattered. Median income explained a lot, but it did not explain everything. Adding population density, land use, and other variables improved the prediction.
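For a sense of scale, the two RMSE values can be compared directly. This is just arithmetic on the two numbers reported above.

```python
rmse_income_only = 6.30  # model using only median household income
rmse_full = 6.08         # model using all explanatory variables

improvement = rmse_income_only - rmse_full
relative_improvement = improvement / rmse_income_only * 100

print(f"Absolute improvement: {improvement:.2f} percentage points")
print(f"Relative improvement: {relative_improvement:.1f}% lower RMSE")
```

That works out to about 0.22 percentage points, or roughly 3.5 percent lower error. Modest, but it came from adding real information about each county.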

That is the lesson.

One variable can be important, but the real world usually does not fit into one clean explanation. Income matters. Density matters. Infrastructure matters. Geography matters. Local conditions matter.

Simple answers are nice. They are also often wrong.

What This Lab Shows

This lab was about predicting internet access, but the bigger lesson applies to public infrastructure.

Broadband, water, sewer, roads, emergency response, and public services all have a location. They do not exist in theory. They exist somewhere, serve someone, and fail somewhere first.

GIS helps show that.

Machine learning can help find patterns in the data. ArcGIS Pro provides the tools. RMSE gives the model a reality check.

But none of this is magic.

The model did not fix the internet gap. The map did not install fiber. ArcGIS Pro did not become a policy expert.

But the workflow helped show where internet access may be weaker and which factors help explain the pattern.

That matters.

Because once the gap is mapped, it becomes harder to hide behind vague language.

Harder to say, “We are looking into it.”

Harder to pretend the problem is floating somewhere in the abstract.

The map puts the problem on the ground.

And that is why GIS matters.

It gives location to the issue.

Then it leaves us with the harder question:

Now that we can see the gap, what are we going to do about it?