Version 1.0 | First Created Apr 6, 2026 | Updated Apr 12, 2026

1. Introduction

This study revisits one of the most influential demonstrations of the Modifiable Areal Unit Problem (MAUP) by Openshaw (1970). By reproducing and extending Openshaw’s first experiment using open-source computational tools, this study aims to provide both a conceptual and empirical understanding of how zoning and scale influence analytical outcomes.

Two key problems are examined. First, the scale problem, which explores how results change as spatial units are aggregated into larger zones. Second, the aggregation problem, which investigates how different configurations of zones at the same scale can yield different results. Both of these are examined under two forms of spatial arrangement: zoning (contiguous regions) and grouping (non-contiguous sets).

Ultimately, this project serves as a pedagogical vignette that highlights the importance of spatial units in geographic analysis and encourages a more critical interpretation of spatial statistical results.

Keywords: Modifiable Areal Unit Problem, zoning, grouping, scale problem, aggregation problem

2. Background

The choice of areal units for spatial analysis has been a fundamental issue in geography and spatial statistics since the mid-20th century. Since the area over which data are collected is continuous, it follows that there are numerous alternative ways in which it can be partitioned to form areal units for reporting data. In theory, there is an infinite number of ways in which a study area can be divided. These units themselves may also be combined to form larger units at new scales in a large number of alternative ways. This is the Modifiable Areal Unit Problem (MAUP), identified by Yule and Kendall (1950), who emphasize the necessity of not losing sight of the fact that analytical results depend on the choice of spatial units.

The MAUP is, in reality, composed of two separate but interrelated problems. The first is the scale problem, defined as the variation in results that arises when the same areal data are aggregated into increasingly larger spatial units. For instance, analyzing the same census data at scales ranging from enumeration districts, wards, and local authorities up to standard regions will almost certainly yield different results and possibly different interpretations. Although scale differences are the most obvious manifestation of the MAUP, there is also the problem of alternative combinations of base units at equal or similar scales. Any variation in results due to different configurations of units when the number of units (n) is held constant is referred to as the aggregation problem.

In addition, we can distinguish between two forms of areal arrangement. We use the term zoning system for contiguous arrangements and grouping system for noncontiguous cases. Alternatively, a zoning system may be considered a special case of a grouping system that incorporates a contiguity constraint in its formation.

2.1 Openshaw’s First Experiment

Building on this foundation, in the 1970s Openshaw conducted a series of three experiments to demonstrate the impact of the MAUP. In the first experiment, Openshaw examines how statistical relationships change purely as a result of how spatial units are defined. Using data from the 99 counties of Iowa, he selects two variables: the percentage of votes for the Republican presidential candidate in 1968 and the percentage of the population over age 65 from the 1970 census. At the county level, these variables have a baseline Pearson correlation of 0.3466 . This baseline serves as a reference point for all subsequent comparisons. The key idea of the experiment is that the underlying data remain fixed, while only the spatial aggregation scheme changes, allowing him to isolate the effects of zoning and scale on statistical results.

The experimental design proceeds by aggregating the 99 counties into larger spatial units using multiple alternative zoning systems. These include politically defined districts (e.g., Republican- and Democrat-proposed congressional districts), administratively defined districts, and functionally or geographically defined regions such as urban–rural classifications. Each aggregation groups the counties into a smaller number of zones (commonly six in the initial comparisons), after which the same two variables are recomputed at the aggregated level and a new correlation coefficient is calculated. By comparing these correlations across different zoning systems, Openshaw demonstrates that the same underlying data can yield substantially different statistical relationships depending solely on how boundaries are drawn. In general, as the size of zones increases, the correlation values can shift significantly.

Beyond comparing a small set of predefined aggregations, Openshaw extends the analysis using simulation. He generates a large number (on the order of thousands) of alternative zoning and grouping systems, including both contiguous zoning systems and non-contiguous grouping systems. For each simulated aggregation, he recalculates the correlation coefficient, producing a distribution of possible correlation values rather than a single estimate. This allows him to examine the range (maximum and minimum), spread (standard deviation), and shape of the distribution of correlations at different levels of aggregation. Openshaw finds that as the number of zones decreases (i.e., as spatial scale becomes coarser), the range of possible correlations narrows, although substantial variability remains.

In addition to analyzing correlation coefficients, Openshaw introduces a variance decomposition framework to understand how aggregation alters statistical relationships. He expresses total variation as the sum of between-zone and within-zone components and defines a measure of the loss of variation caused by aggregation.

\[ S^T = S^B + S^W \]

where

\[ S^T = \sum_j \sum_i (x_{ij} - \bar{x})^2 \]

is the total sum of squares,

\[ S^B = \sum_j n_j (\bar{x}_j - \bar{x})^2 \]

is the between-zone sum of squares, and

\[ S^W = \sum_j \sum_i (x_{ij} - \bar{x}_j)^2 \]

is the within-zone sum of squares.

Based on this, the loss of variation due to aggregation can be expressed as:

\[ \Delta S^B = \frac{S^T-S^B}{S^T} \]

which measures the amount of variation lost when individual units are aggregated into zones.

Through relating correlation values to this loss of variation, he shows that aggregation not only changes correlation numerically but does so in ways tied to how variance is redistributed across spatial scales.

Openshaw ultimately concludes that correlation is not an intrinsic property of the data but is highly sensitive to how spatial units are defined.

2.2 Reanalysis of Openshaw’s First Experiment

In this report, we attempt to reproduce Openshaw (1970)’s first experiment on the Modifiable Areal Unit Problem (MAUP).

For the data, we obtained population data from the 1970 US decennial census and presidential election results from 1968 at the county level. We were unable to obtain congressional election data at the county level, as available sources report these data only at the district level.

Following Openshaw (1977), we re-implemented the zoning algorithm in R. We first conduct exploratory analysis of the variables and the zoning algorithm, including computing the correlation coefficient at the county level. We then follow the original experiment by applying the zoning algorithm to generate alternative zoning systems with different numbers of zones and compute the correlation coefficient for each generated zoning system. This allows us to examine the scale problem.

To investigate the aggregation problem, we generate multiple zoning systems with the same number of zones and compute the correlation coefficient for each, using 1,000 simulations (Openshaw used 10,000, but this is computationally expensive).We also examine the distribution of correlation coefficients across different zoning configurations by computing their range, standard deviation, and distributional shape, as well as calculating the loss of variation following the procedures described in the original experiment.

3. Methodology

3.1 Data Source and Variables

First, we load the data for the 1968 presidential election results and the 1970 census population data, as well as the spatial data for Iowa counties.

vote1968 <- read.csv(here("data", "raw", "1968vote.csv"))
census1970 <- read.csv(here("data", "raw", "1970census.csv"))
iowa <- st_read(here("data", "raw", "iowa_county.geojson"))

3.2 Data Preprocessing

Second, we preprocess the data by calculating the percentage of Republican votes in 1968 and the percentage of the population over age 60 in 1970. Iowa county data is also cleaned and transformed to a common coordinate reference system for spatial analysis.

vote1968 <- vote1968 %>% 
  mutate(pct_rep = 100 * (repvote / totalvote))

census1970 <- census1970 %>% 
  mutate(Geo_NAME = str_remove(Geo_NAME, " County")) %>% 
  rename(pop65 = SE_T007_005, totalpop = SE_T007_001, pop6064 = X60_64, County = Geo_NAME) %>%
  dplyr:: select(County, pop65, pop6064, totalpop) %>%
  mutate(pop60 = pop6064 + pop65, 
         pop60_pct = pop60 / totalpop * 100)

iowa <- iowa %>% 
  rename(County = NAME) %>%
  st_transform(crs = 6464)

Both the census and vote data contain a “County” column, but they may have different formatting (e.g., extra spaces, “County” suffix, capitalization). To ensure a successful join, we define a function to clean the county names in both datasets.

clean_county <- function(x) {
  x %>%
    str_trim() %>%                 
    str_remove(" County$") %>%     
    str_replace_all("['’]", "") %>%
    str_squish() %>%              
    str_to_title()                
}

census1970 <- census1970 %>%
  mutate(County = clean_county(County))

vote1968 <- vote1968 %>%
  mutate(County = clean_county(County))

Now we can join the datasets together on the “County” column and select the relevant variables for analysis.

iowa_data <- iowa %>%
  left_join(census1970, by = "County") %>%
  left_join(vote1968, by = "County") %>% 
  dplyr::select(County, GEOID, COUNTYNS, pop60, totalpop, pop60_pct, totalvote, repvote, pct_rep, geometry)

3.3 Zoning Aggregation Algorithm

Next, we reconstruct the zoning algorithm used by Openshaw to aggregate the 99 counties into different zoning systems.

The details of the algorithm are described in Openshaw (1977), specifically in Algorithm 3: A Procedure to Generate Pseudo-Random Aggregations of N Zones into M Zones, where \(M<N\). To summarize briefly, the purpose of this algorithm is to generate pseudo-random aggregations of \(N\) zones into \(M\) zones, where \(M\) is less than \(N\), such that each of the original \(N\) zones is allocated to one of the \(M\) zones, and such that all the member zones of each \(M\) zone are spatially contiguous. Successive calls of the subroutine can be used to build up a random sample of alternative zoning systems from the population of all possible zoning systems with \(M\) zones.

The algorithm assumes that the study region is initially divided into \(N\) zones, (in our case, counties), and that a contiguity matrix is available to describe the adjacency relationships among them. The first step is to randomly select \(M\) zones to serve as “core” zones, which act as the seeds for aggregation.

Subsequently, one of these core zones is selected at random. A list is then constructed of all unallocated zones that are adjacent to this core zone. From this list, one zone is randomly selected and assigned to the chosen core zone. This process is repeated by randomly selecting a core zone and assigning adjacent unallocated zones until either all original zones are allocated or no remaining unallocated zones are contiguous to any of the existing aggregated zones.

In the latter case, some zones may remain unassigned due to issues such as isolated spatial configurations or inconsistencies in the contiguity matrix. This can be addressed by rerunning the algorithm with a different set of initial core zones or by modifying the contiguity structure.

To translate this algorithm into R, a function zone1_r() is defined below that takes the neighbor list, the number of original zones, the desired number of aggregated zones and returns a list containing the zone assignments for each unit (county).

Table 1: Algorithm Parameters

Parameter Description Notes
nb Neighbor list representing spatial adjacency between zones Generated using spdep::poly2nb()
N Number of original zones Length of the dataset
M Desired number of aggregated zones Must satisfy M < N
fixed_zones Optional vector of pre-assigned zones Defaults to empty
ipart Vector storing zone assignments for each unit Initialized as all zeros (unassigned)
active_cores Set of core zones that can still grow Updated during the algorithm
max_iter Maximum number of iterations allowed Prevents infinite loops
ifault Number of unallocated zones after completion Diagnostic output

Table 2: Algorithm Steps

Step Description
Initialization Create an assignment vector (ipart) of length N, initialized to 0
Core Selection Randomly select M zones as core zones and assign unique labels
Active Core Setup Define the set of active cores that can grow
Core Selection (Loop) Randomly select one active core
Neighbor Identification Identify unallocated neighboring zones of the selected core
Zone Assignment Randomly assign one neighboring zone to the selected core
Core Deactivation Remove core if no unallocated neighbors remain
Iteration Repeat until all zones are assigned or no cores can expand
Convergence Check Stop if all zones are allocated or max iterations reached
Fault Handling Assign leftover unallocated zones new labels
Output Return zone assignments, number of faults (ifault), and total zones
zone1_r <- function(nb, N, M, fixed_zones = integer(0)) {
  # nb: a neighbor list from spdep::poly2nb()
  # N: number of original zones
  # M: desired number of output zones

  stopifnot(M < N)
  
  # --- Initialize ---
  # Creates a vector of 99 zeros, one slot per county. 
  # A 0 means not yet assigned to any zone.
  ipart <- integer(N)  
  
  # handle fixed zones (optional)
  n_fixed <- length(fixed_zones)
  if (n_fixed > 0) {
    for (i in seq_along(fixed_zones)) {
      ipart[fixed_zones[i]] <- i
    }
  }
  n_cores_needed <- M - n_fixed
  
  # --- Step 1: Randomly select M core zones ---
  available <- which(ipart == 0)
  core_indices <- sample(available, n_cores_needed)
  
  for (i in seq_along(core_indices)) {
    ipart[core_indices[i]] <- n_fixed + i
  }
  
  # track active zones
  active_cores <- (n_fixed + 1):M 
  
  # --- Step 2: Random accretion loop ---
  max_iter <- N * M * 10  # safety limit
  iter <- 0
  
  # stop until safety limit or no active cores with unallocated neighbors
  while (length(active_cores) > 0 && iter < max_iter) {
    iter <- iter + 1
    
    # pick a random active core
    core_label <- sample(active_cores, 1)
    
    # find all zones belonging to this core
    core_members <- which(ipart == core_label)
    
    # find unallocated neighbors of any core member
    unalloc_neighbors <- c()
    for (z in core_members) {
      neighbors <- nb[[z]]
      if (!(length(neighbors) == 1 && neighbors == 0)) {  # 0 means no neighbors in spdep
        unalloc <- neighbors[ipart[neighbors] == 0]
        unalloc_neighbors <- union(unalloc_neighbors, unalloc)
      }
    }
    
    if (length(unalloc_neighbors) == 0) {
      # this core can't grow, deactivate it
      active_cores <- setdiff(active_cores, core_label)
    } else {
      # randomly absorb one neighbor
      chosen <- ifelse(
        length(unalloc_neighbors) == 1,
        unalloc_neighbors,
        sample(unalloc_neighbors, 1)
      )
      ipart[chosen] <- core_label
    }
  }

  # --- Check for unallocated zones (islands) ---
  unallocated <- which(ipart == 0)
  ifault <- length(unallocated)
  
  if (ifault > 0) {
    warning(paste(ifault, "zones could not be allocated (contiguity islands)"))
    # Assign them extra zone labels
    for (i in seq_along(unallocated)) {
      ipart[unallocated[i]] <- M + i
    }
  }
  list(
    ipart    = ipart,
    ifault   = ifault,
    n_zones  = length(unique(ipart))
  )
}

Below is a demonstration of the zoning algorithm, showing how it generates different contiguous zoning systems for Iowa counties with varying numbers of zones.

Figure 1: Contiguous Random Zones

nb <- poly2nb(iowa, queen = FALSE)
N  <- nrow(iowa)

zone_sizes <- c(4, 6, 8, 10)
zone_plots <- lapply(zone_sizes, function(M) {
  result <- zone1_r(nb, N, M)
  
  vis_zones <- iowa %>%
    mutate(zone = factor(result$ipart))
  
  ggplot(vis_zones) +
    geom_sf(aes(fill = zone), color = "white", linewidth = 0.2) +
    theme1 +
    scale_fill_manual(values = palette1) +
    labs(title = paste("Randomly Generated", M, "Zones")) 
})

grid.arrange(grobs = zone_plots, ncol = 2)

Openshaw also generates non-contiguous grouping systems by randomly assigning each original zone to one of the M groups without any spatial constraints. Although he did not provide a specific algorithm for this, it is straightforward to implement through random sampling. The key is to ensure that each group is represented at least once, which can be achieved by first assigning one zone to each group and then randomly assigning the remaining zones.

random_grouping <- function(N, M) {
  # guarantee each group appears at least once
  base <- sample(M)
  rest <- sample(M, N - M, replace = TRUE)
  sample(c(base, rest))
}

Below is a demonstration of the random grouping algorithm, showing how it generates different non-contiguous grouping systems for Iowa counties with varying numbers of groups.

Figure 2: Non-Contiguous Random Zones

# demo of random grouping
group_sizes <- c(4, 6, 8, 10)
group_plots <- lapply(group_sizes, function(M) {
  group_labels <- random_grouping(N, M)
  vis_groups <- iowa %>%
    mutate(group = factor(group_labels))
  ggplot(vis_groups) +
    geom_sf(aes(fill = group), color = "white", linewidth = 0.2) +
    theme1 +
    scale_fill_manual(values = palette1) +
    labs(title = paste("Randomly Generated", M, "Groups"))
})
grid.arrange(grobs = group_plots, ncol = 2)

3.4 Analysis

With the zoning and grouping systems in place, we can now analyze how the correlation between the percentage of Republican votes and the percentage of the population over age 60 changes under different spatial configurations. Two main analyses are conducted to examine the scale problem and the aggregation problem.

For the scale problem, we apply the zoning algorithm to generate alternative zoning systems with different numbers of zones (e.g., 6, 12, 18, etc.) and compute the correlation coefficient for each generated zoning system. This allows us to examine how the correlation changes as we aggregate counties into larger zones. Both zoning (contiguous) and grouping (non-contiguous) systems are analyzed to compare the effects of spatial arrangement on the scale problem.

For the aggregation problem, we generate multiple zoning systems with the same number of zones (e.g., 6 zones) and compute the correlation coefficient for each, using 1,000 simulation. We also examine the distribution of correlation coefficients across different zoning configurations by computing their range, standard deviation, and distributional shape, as well as calculating the loss of variation following the procedures described in the original experiment. Both zoning and grouping systems are analyzed here.

4. Results

4.1 Exploratory Data Analysis

Some preliminary exploratory analysis is conducted to understand the distribution of the key variables and their spatial patterns across Iowa counties. This includes mapping the percentage of the population over age 60 and the percentage of Republican votes, as well as examining their standardized distributions and correlation at the county level. While in Openshaw’s original experiment, the correlation between these two variables at the county level was 0.3466, we find a different value of 0.22 in our analysis, which may be due to differences in data sources.

Figure 3: Exploratory Data Analysis

# map elderly population
age_map <- iowa_data %>% 
  ggplot() +
  geom_sf(aes(fill = pop60_pct), color = "white", linewidth = 0.2) +
  scale_fill_gradientn(colors = green_palette) + 
  theme1 + 
  labs(title = "Pct of Population Over Age 60", fill = "Percentage") +
  theme(
    legend.position = "bottom",
    axis.title = element_text(size = 9))

# map voting patterns
vote_map <- iowa_data %>%
  ggplot() +
  geom_sf(aes(fill = pct_rep), color = "white", linewidth = 0.2) +
  scale_fill_gradientn(colors = green_palette) + 
  theme1 + 
  labs(title = "Pct of Republican Votes", fill = "Percentage") +
  theme(
    legend.position = "bottom",
    axis.title = element_text(size = 9))

density_plot <- iowa_data %>%
  st_drop_geometry() %>%
  mutate(pop60_std = scale(pop60_pct), rep_std = scale(pct_rep)) %>%
  tidyr::pivot_longer(
    cols = c(pop60_std, rep_std),
    names_to = "variable",
    values_to = "value"
  ) %>%
  ggplot(aes(x = value, fill = variable)) +
  geom_density(alpha = 0.6, color = NA) +
  scale_fill_manual(values = c("pop60_std" = "#6EA6C3","rep_std"= "#99CFAB")) +
  theme2 + 
  theme(legend.position = "none") + 
  labs(title = "Standardized Distributions of Variables",
       x = "Standardized Value",
       y = "Density") +
  annotate("text", x = 2.8, y = 0.12,
           label = "Population Over 60",
           fontface = "bold", 
           color = "#6EA6C3", size = 4) +
  annotate("text", x = 1.5, y = 0.42,
           label = "Republican Vote %",
           fontface = "bold", 
           color = "#99CFAB", size = 4)

# correlation plot
r_val <- cor(iowa_data$pop60_pct, iowa_data$pct_rep, use = "complete.obs")
cor_plot <- iowa_data %>%
  st_drop_geometry() %>%
  ggplot(aes(x = pop60_pct, y = pct_rep)) +
  geom_point(color = "#6EA6C3", size = 2) +
  geom_smooth(method = "lm", color = "#99CFAB", se = FALSE) +
  annotate("text", x = Inf, y = Inf, label = paste0("r = ", round(r_val, 3)), hjust = 1.1, vjust = 1.5, size = 6, color = "#6f7f6c") +
  theme2 + 
  labs(title = "Correlation between Pop Over 60 and Rep Vote %",
       x = "Percentage of Population Over Age 60",
       y = "Percentage of Republican Votes")

grid.arrange(age_map, vote_map, density_plot, cor_plot, ncol = 2)

Another important aspect of exploratory analysis is to examine the spatial autocorrelation of the key variables, which can influence the results of zoning and grouping. We conduct Moran’s I test for both the percentage of Republican votes and the percentage of the population over age 60 to assess the degree of spatial clustering in these variables across Iowa counties. Both variables show significant positive spatial autocorrelation, meaning similar values tend to cluster together in space.

Figure 4: Moran’s I Test

nb <- poly2nb(iowa_data, queen = FALSE)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

# simulation (permutation test)
moran_rep <- moran.mc(iowa_data$pct_rep, lw, nsim = 999, zero.policy = TRUE)
moran_age <- moran.mc(iowa_data$pop60_pct, lw, nsim = 999, zero.policy = TRUE)

plot_moran_mc <- function(moran_obj, title_text, fill_color) {
  sim_vals <- moran_obj$res[1:999]
  obs_val  <- moran_obj$statistic
  ggplot(data.frame(sim = sim_vals), aes(x = sim)) +
    geom_histogram(binwidth = 0.01, fill = fill_color, alpha = 0.8) +
    geom_vline(xintercept = obs_val, color = "#DC6D85", linewidth = 1.2) +
    theme2 +
    labs(
      title = title_text,
      subtitle = "Observed Moran's I in red",
      x = "Moran's I",
      y = "Count") +
    theme(plot.title = element_text(face = "bold"), plot.subtitle = element_text(face = "italic"))}

plot_rep <- plot_moran_mc(moran_rep, "Moran's I Distribution: Republican Vote %", "#6EA6C3")
plot_age <- plot_moran_mc(moran_age, "Moran's I Distribution: Population Over 60", "#99CFAB")

grid.arrange(plot_rep, plot_age, ncol = 2)

4.2 Scale Problem

For the scale problem, we apply the zoning algorithm to generate alternative zoning systems with different numbers of zones and compute the correlation coefficient for each generated zoning system. Unlike Openshaw’s experiment, here we tested all values of M from 6 to 98 (the original experiment only tested a few values such as 6, 12, 18, etc.) to provide a more comprehensive analysis of how the correlation changes as we aggregate counties into larger zones.

# neighbor list
nb <- poly2nb(iowa_data, queen = FALSE)

# number of original zones
N <- nrow(iowa_data)

# different target numbers of zones
M_vals <- c(6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 98)

A helper function is defined to compute the correlation coefficient for a given zoning configuration. This function takes the original data and the zone assignments, aggregates the data to the zone level, and then computes the correlation between the percentage of Republican votes and the percentage of the population over age 60 at the zone level. Another deviation from Openshaw’s original approach here is that in the original experiment, variables were each aggregated by summing the country percentages and dividing by the number of counties in each zone, which is not a proper way to aggregate percentages. Openshaw acknowledges this issue and states that this approach was adopted to faciliate computation in the subsequent simulation. In our analysis, we aggregate the raw counts (e.g., total population, total votes, etc.) to the zone level and then compute the percentages at the zone level, which is a more accurate way to aggregate these variables.

compute_zone_correlation <- function(data, zone_ids) {
  zone_summary <- data %>%
    mutate(zone = factor(zone_ids)) %>%
    st_drop_geometry() %>%
    group_by(zone) %>%
    summarise(
      total_pop   = sum(totalpop, na.rm = TRUE),
      total_vote  = sum(totalvote, na.rm = TRUE),
      rep_count   = sum(repvote, na.rm = TRUE),
      pop60_count = sum(pop60, na.rm = TRUE),
      .groups = "drop") %>%
    mutate(rep_pct   = 100 * rep_count / total_vote,
           pop60_pct = 100 * pop60_count / total_pop)
  cor(zone_summary$rep_pct, zone_summary$pop60_pct, use = "complete.obs")
}

The results show how the correlation coefficient decrease as the number of zones decreases for both zoning and grouping arrangements, demonstrating the scale problem of the MAUP. The results also allow us to compare how zoning (contiguous) and grouping (non-contiguous) arrangements affect the correlation at different scales. This matches the general finding in Openshaw’s original experiment.

Table 3: Coefficient of Different Areal Arrangements

zoning_results <- map_dfr(M_vals, function(m) {
  result <- zone1_r(nb, N, m)
  tibble(
    method = "Zoning",
    zones = result$n_zones,
    correlation = compute_zone_correlation(iowa_data, result$ipart))
})
grouping_results <- map_dfr(M_vals, function(m) {
  group_ids <- random_grouping(N, m)
  tibble(
    method = "Grouping",
    zones = length(unique(group_ids)),
    correlation = compute_zone_correlation(iowa_data, group_ids))
})
corr_results <- zoning_results %>%
  select(zones, zoning_correlation = correlation) %>%
  left_join(grouping_results %>% select(zones, grouping_correlation = correlation), by = "zones")

# save cor
write_csv(corr_results, here("data", "derived", "scale_results.csv"))
Number of Zones Zoning (r) Grouping (r)
6 0.536 0.889
12 0.677 0.767
18 0.276 0.721
24 0.606 0.670
30 0.266 0.453
36 0.372 0.363
42 0.291 0.403
48 0.210 0.357
54 0.400 0.300
60 0.216 0.401
66 0.235 0.218
72 0.258 0.329
78 0.236 0.202
98 0.221 0.217

Note that because of the randomness in the zoning and grouping algorithms, the correlation values vary across different runs. But it is evident that the correlation coefficient varied substantially across different zoning and grouping arrangements, and generally decreased as the number of zones decreased, which means that generalizations based on aggregated data can be misleading.

Figure 5: Coefficient of Different Areal Arrangements

# plot changes
corr_results_long <- corr_results %>%
  pivot_longer(cols = c(zoning_correlation, grouping_correlation), names_to = "method", values_to = "correlation") %>%
  mutate(method = recode(method, zoning_correlation = "Zoning", grouping_correlation = "Grouping"))
ggplot(corr_results_long, aes(x = zones, y = correlation, color = method)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  scale_color_manual(values = c("Zoning" = "#6EA6C3", "Grouping" = "#99CFAB")) +
  theme2 +
  labs(title = "Correlation Coefficient by Number of Zones",
       x = "Number of Zones",
       y = "Correlation Coefficient (r)",
       color = "Method") +
  theme(plot.title = element_text(face = "bold"))

4.3 Aggregation Problem

For the aggregation problem, we generate multiple zoning systems with the same number of zones (e.g., 6 zones) and compute the correlation coefficient for each, using 1,000 simulations. We also examine the distribution of correlation coefficients across different zoning configurations by computing their range, standard deviation, and distributional shape, as well as calculating the loss of variation following the procedures described in the original experiment. Both zoning and grouping systems are analyzed here. To facilitate the simulation process, a helper function run_simulations() is defined to run multiple simulations.

run_simulations <- function(M_vals, n_runs, runner) {
  map_dfr(M_vals, function(m) {
    map_dfr(seq_len(n_runs), function(run_id) {
      runner(m = m, run_id = run_id)
    })
  })
}

Then, two helper functions run_one_zoning() and run_one_grouping() are defined to run a single simulation for zoning and grouping arrangements, respectively. These functions take the number of zones, the run ID, the data, and other necessary parameters, and return a tibble containing the method, number of zones, run ID, correlation coefficient, and any faults (for zoning).

run_one_zoning <- function(m, run_id, nb, data, N) {
  result <- zone1_r(nb, N, m)
  
  tibble(
    method = "Zoning",
    M = m,
    run = run_id,
    zones = result$n_zones,
    ifault = result$ifault,
    correlation = compute_zone_correlation(data, result$ipart))
}

run_one_grouping <- function(m, run_id, data, N) {
  group_ids <- random_grouping(N, m)
  
  tibble(
    method = "Grouping",
    M = m,
    run = run_id,
    zones = length(unique(group_ids)),
    ifault = 0,
    correlation = compute_zone_correlation(data, group_ids))
}

Next, we run the simulations for both zoning and grouping arrangements across the specified values of M and the number of runs. The results are stored in two separate data frames, which are then combined.

n_runs <- 1000

all_zoning_corrs <- run_simulations(
  M_vals = M_vals,
  n_runs = n_runs,
  runner = function(m, run_id) {
    run_one_zoning(
      m = m,
      run_id = run_id,
      nb = nb,
      data = iowa_data,
      N = N)}
)

all_grouping_corrs <- run_simulations(
  M_vals = M_vals,
  n_runs = n_runs,
  runner = function(m, run_id) {
    run_one_grouping(
      m = m,
      run_id = run_id,
      data = iowa_data,
      N = N)}
)

saveRDS(all_zoning_corrs, here("data", "derived", "all_zoning_corrs.rds"))
saveRDS(all_grouping_corrs, here("data", "derived", "all_grouping_corrs.rds"))

At the aggregation level, the results show that there is a wide range of correlation coefficients across different zoning and grouping configurations, even when the number of zones / groups is held constant. For both for six zones and six groups the limits show that it is possible to produce the widest range of correlations, spanning from negative to positive values. As the number of zones / groups increases, the range of possible correlations narrows, especially for zoning arrangements where contiguity constraints are involved. This demonstrates the aggregation problem of the MAUP, where different spatial configurations can lead to different analytical results.

We can conclude that for all scales there is a relatively wide range of correlations which would certainly involve alternative substantive interpretations.

Table 4: Maximum and Minimum Values of the Correlation Coefficient

Zoning
Grouping
Number of zones or groups minimum r maximum r minimum r maximum r
6 -0.363 0.990 -0.594 0.985
12 -0.184 0.842 -0.110 0.925
18 -0.113 0.733 0.083 0.902
24 -0.015 0.732 0.032 0.842
30 0.005 0.644 0.083 0.774
36 0.021 0.624 0.019 0.723
42 0.036 0.600 0.068 0.650
48 0.044 0.501 0.068 0.601
54 0.047 0.491 0.066 0.561
60 0.063 0.476 0.093 0.534
66 0.072 0.413 0.095 0.504
72 0.103 0.409 0.077 0.466
78 0.106 0.363 0.132 0.430
98 0.183 0.253 0.184 0.263

Table 5: Mean and Standard Deviation Values of the Correlation Coefficient

Zoning
Grouping
Number of zones or groups mean r sd r mean r sd r
6 0.590 0.218 0.670 0.245
12 0.451 0.169 0.619 0.155
18 0.389 0.141 0.560 0.138
24 0.349 0.119 0.516 0.124
30 0.322 0.108 0.464 0.120
36 0.308 0.097 0.423 0.109
42 0.296 0.089 0.386 0.099
48 0.285 0.077 0.355 0.095
54 0.272 0.071 0.332 0.082
60 0.260 0.065 0.311 0.074
66 0.252 0.055 0.294 0.067
72 0.245 0.048 0.276 0.059
78 0.240 0.045 0.259 0.049
98 0.221 0.008 0.222 0.009

This is the same figure as in Openshaw’s original experiment, but with more values of M. The correlation between Republican vote and population over age 60 is consistently positive, but its magnitude varies substantially depending on the aggregation scheme. At coarse scales, different aggregations can produce correlations ranging from moderate to very strong, while at finer scales the correlation stabilizes and becomes nearly invariant.

Figure 6: Frequencies of correlation coefficients at different scales.

all_corrs %>%
  mutate(M_num = M) %>%
  ggplot(aes(x = correlation, group = M_num, color = M_num)) +
  geom_density(linewidth = 0.7, adjust = 1.2, show.legend = FALSE) +
  scale_color_gradient(low = "#C1E3CD", high =  "#006d2c") +
  facet_wrap(~method, ncol = 2) +
  coord_cartesian(xlim = c(-1, 1)) +
  theme1 +
  labs(
    title = "Frequencies of Correlation Coefficients at Different Scales",
    x = "Correlation coefficient",
    y = "Frequency"
  )

An alternative way to visualize the distribution of correlation coefficients across different scales is through boxplots. The results show that at coarser scales (fewer zones), there is a wider distribution of correlation coefficients, while at finer scales (more zones), the distribution narrows and becomes more consistent.

Figure 7: Distribution of Correlations by Scale.

ggplot(all_corrs, aes(x = factor(M), y = correlation, fill = method)) +
  geom_boxplot(outlier.size = 0.5, alpha = 0.6, color = "darkgrey") +
  scale_fill_manual(values = c("Zoning" = "#6EA6C3", "Grouping" = "#99CFAB")) +
  theme1 +
  labs(
    title = "Distribution of Correlations by Scale",
    x = "Number of Zones",
    y = "Correlation"
  )

One thing that Openshaw did in the original experiment was to compute the loss of variation for each zoning and grouping configuration, which is a measure of how much of the original variation in the data is lost due to aggregation. This is calculated by comparing the total sum of squares (S_total) with the between-zone sum of squares (S_between) and computing the proportion of variation lost. To reproduce that, we define a helper function compute_loss_variation() that takes the original values and the zone assignments, and computes the loss of variation for both the percentage of population over age 60 and the percentage of Republican votes. This allows us to analyze how much information is lost due to aggregation and how it relates to the correlation coefficients.

compute_loss_variation <- function(values, zone_ids) {
  
  df <- tibble(
    value = values,
    zone = factor(zone_ids))
  
  grand_mean <- mean(df$value, na.rm = TRUE)
  
  # total sum of squares
  S_total <- sum((df$value - grand_mean)^2, na.rm = TRUE)
  
  # between-zone sum of squares
  zone_stats <- df %>%
    group_by(zone) %>%
    summarise(
      n = n(),
      zone_mean = mean(value, na.rm = TRUE),
      .groups = "drop")
  
  S_between <- sum(zone_stats$n * (zone_stats$zone_mean - grand_mean)^2, na.rm = TRUE)
  # loss of variation
  delta_Sb <- (S_total - S_between) / S_total
  
  return(delta_Sb)
}

Same as above, we need to recalulate the loss of variation for each zoning and grouping configuration across all simulations.

zoning_loss <- function(m, run_id, nb, data, N) {
  result <- zone1_r(nb, N, m)
  zone_ids <- result$ipart
  corr_val <- compute_zone_correlation(data, zone_ids)
  loss_x <- compute_loss_variation(data$pop60_pct, zone_ids)
  loss_y <- compute_loss_variation(data$pct_rep, zone_ids)
  loss_d <- loss_x - loss_y
  tibble(
    method = "Zoning",
    M = m,
    run = run_id,
    zones = result$n_zones,
    correlation = corr_val,
    loss_x = loss_x,
    loss_y = loss_y,
    loss_d = loss_d)
}

grouping_loss <- function(m, run_id, data, N) {
  zone_ids <- random_grouping(N, m)
  corr_val <- compute_zone_correlation(data, zone_ids)
  loss_x <- compute_loss_variation(data$pop60_pct, zone_ids)
  loss_y <- compute_loss_variation(data$pct_rep, zone_ids)
  loss_d <- loss_x - loss_y
  tibble(
    method = "Grouping",
    M = m,
    run = run_id,
    zones = length(unique(zone_ids)),
    correlation = corr_val,
    loss_x = loss_x,
    loss_y = loss_y,
    loss_d = loss_d)
}
all_zoning_loss <- run_simulations(
  M_vals = M_vals,
  n_runs = n_runs,
  runner = function(m, run_id) {
    zoning_loss(
      m = m,
      run_id = run_id,
      nb = nb,
      data = iowa_data,
      N = N)}
)
all_grouping_loss <- run_simulations(
  M_vals = M_vals,
  n_runs = n_runs,
  runner = function(m, run_id) {
    grouping_loss(
      m = m,
      run_id = run_id,
      data = iowa_data,
      N = N)}
)
saveRDS(all_zoning_loss, here("data", "derived", "all_zoning_loss.rds"))
saveRDS(all_grouping_loss, here("data", "derived", "all_grouping_loss.rds"))

The results show no clear relationship between the magnitude of variation loss and the resulting correlation values. Even when the loss of variation is similar, a wide range of correlations can be produced, particularly at coarser scales. This implies that aggregation-induced changes in correlation cannot be explained solely by the amount of variation lost. Instead, the specific configuration of spatial units plays a critical role. Furthermore, grouping systems exhibit greater variability than zoning systems, indicating that contiguity constraints reduce the range of possible outcomes.

Figure 8: Graph plots of one-thousand correlation coefficients against loss of between sums of squares for the independent variable.

all_zoning_loss <- readRDS(here("data", "derived", "all_zoning_loss.rds"))
all_grouping_loss <- readRDS(here("data", "derived", "all_grouping_loss.rds"))

all_loss <- bind_rows(all_zoning_loss, all_grouping_loss)
all_loss %>% 
  filter(M %in% c(6, 12, 24)) %>% 
  ggplot(aes(x = loss_x, y = correlation)) +
  geom_point(size = 0.7, alpha = 0.4, color = "#006d2c") +
  facet_grid(method ~ M) +
  theme1 +
  labs(
    title = "Correlation Coefficients against Loss of Variation",
    x = "Loss of between sums of squares",
    y = "Correlation coefficient"
  )

Consistent with Openshaw’s findings, there is also no clear relationship between correlation coefficients and differential loss of variation. Even when the difference in variation loss between variables is similar, a wide range of correlation values can result.

Figure 9: Graph plots of one-thousand correlation coefficients against differential loss of between sums of squares.

all_loss %>%
  filter(M %in% c(6, 12, 24)) %>%
  mutate(loss_d = loss_x - loss_y) %>% 
  ggplot(aes(x = loss_d, y = correlation)) +
  geom_point(size = 0.7, alpha = 0.4, color = "#99CFAB") +
  facet_grid(method ~ M) +
  theme1 +
  labs(
    title = "Correlation Coefficients against Differential Loss of Variation",
    x = "Differential loss of variation (ΔSᴰ)",
    y = "Correlation coefficient"
  )

4.4 A Practical Consideration

In practice, researchers often need to work with predefined spatial units, such as administrative boundaries, and in this case, congressional district, rather than randomly generated zones or groups. To examine how the correlation between the percentage of Republican votes and the percentage of the population over age 60 changes at the district level, we can overlay the original county-level data with the district boundaries, aggregate the data to the district level, and compute the correlation coefficient at that level. This can provide a practical example of how the MAUP can affect real-world analyses based on predefined spatial units.

# load district data
districts <- st_read(here("data", "raw", "districts.geojson"))
districts <- st_transform(districts, st_crs(iowa_data)) %>% 
  st_make_valid() %>%
  dplyr::select(district_id = DISTRICT, geometry)

We may see that the correlation coefficient is 0.819, much higher than the correlation at the county level, which is 0.22. Since population data is generally collected at the county level, when we aggregate to the district level, it is inevitable that we need aggregation when doing this kind of analysis. It is therefore necessary that we are aware of the potential impact of the MAUP on our analysis and interpret the results with caution, especially when comparing results across different spatial units or scales.

iowa_district <- iowa_data %>%
  st_join(districts, join = st_intersects) %>%
  group_by(district_id) %>%
  summarise(
    totalpop = sum(totalpop, na.rm = TRUE),
    totalvote = sum(totalvote, na.rm = TRUE),
    repvote = sum(repvote, na.rm = TRUE),
    pop60 = sum(pop60, na.rm = TRUE),
    .groups = "drop") %>%
  mutate(
    pct_rep = 100 * repvote / totalvote,
    pop60_pct = 100 * pop60 / totalpop)
cor <- cor(iowa_district$pct_rep, iowa_district$pop60_pct, use = "complete.obs")
cat("Correlation between pct_rep and pop60_pct at district level:", round(cor, 3))
## Correlation between pct_rep and pop60_pct at district level: 0.819

5. Summary

As a quick summary, while we are unable to reproduce Openshaw’s exact numerical results, the overall patterns closely match those reported in his experiments. We can summarize these experiments on the Iowa data in the following general statements, many of which are consistent with Openshaw’s original findings.

First, the correlation coefficient generally decreases as the number of zones increases. In addition, larger zones tend to produce more variable correlation estimates, indicating greater instability at coarser scales.

Second, there are clear and systematic differences between zoning and grouping systems. These differences appear to arise from the interaction between the contiguity constraint in zoning systems and the underlying spatial autocorrelation present in the data.

Finally, there is no systematic relationship between the loss of variation (as measured by the sums-of-squares terms) and the resulting correlation coefficients. Even when the amount of variation lost is similar, the resulting correlations can vary substantially.

Taken together, these results reinforce the central insight of the MAUP: statistical relationships derived from spatial data are inherently sensitive to how spatial units are defined, and this sensitivity cannot be fully explained by simple measures such as scale or loss of variation alone.

Declaration of Generative AI and AI-Assisted Technologies

During the preparation of this work, we used Claude code to assist with recreating and debugging the zoning algorithm. After using it, we critically reviewed and edited the code and content as needed and take full responsibility for the content in this report.