Project Overview

This project builds a longitudinal relational database to track changes in federal grant opportunities listed on Grants.gov. The federal grants landscape has experienced significant volatility since the 2025 change in presidential administration, with agencies rescinding, restructuring, and freezing funding opportunities at an unprecedented pace (Clegg, 2025). Because Grants.gov only displays a current-state snapshot—once an opportunity is removed, it disappears from the public record—this database fills an accountability gap by preserving a historical record of what was added, changed, and removed over time.

The system has three components:

  1. An R-based extraction pipeline that downloads and parses XML snapshots from Grants.gov
  2. A normalized PostgreSQL database with six tables following a slowly changing dimension (Type 2) design
  3. A Shiny web application for querying, filtering, and visualizing grant trends

This progress update focuses on component 1: the extraction pipeline.


Data Source

Grants.gov publishes daily XML extracts of its full opportunity database. These are available as zipped XML files at https://www.grants.gov/xml-extract. Each extract is a complete snapshot of every grant listing at the time of extraction—typically 50,000–80,000 records per file.

The pipeline scrapes the extract index page, identifies available files, and downloads the most recent (or a specified) snapshot from Grants.gov’s S3-hosted archive.

library(httr2)
library(rvest)
library(dplyr)  # filter(), arrange()
library(purrr)  # pluck(), keep()

get_available_extracts <- function() {
  page <- request("https://www.grants.gov/xml-extract") |>
    req_perform() |>
    resp_body_html()

  # Parse file listing table
  files <- page |>
    html_table(fill = TRUE) |>
    pluck(1)

  colnames(files) <- c("file_name", "size", "extracted_datetime")
  files <- files[-1, ] |>
    filter(grepl("\\.zip", file_name)) |>
    arrange(desc(file_name))

  # Extract S3 download URLs
  urls <- page |>
    html_nodes("a") |>
    html_attr("href") |>
    keep(~grepl("s3.amazonaws.com/extracts/GrantsDBExtract", .x))

  list(files = files, urls = urls)
}

The download function retrieves a specific extract (defaulting to the latest) and saves it locally:

download_extract <- function(file_name = NULL, dir = "data/grants_data") {
  info <- get_available_extracts()
  if (is.null(file_name)) file_name <- info$files$file_name[1]

  dir.create(dir, recursive = TRUE, showWarnings = FALSE)

  s3_url <- info$urls[grepl(file_name, info$urls, fixed = TRUE)][1]
  if (is.na(s3_url)) stop("No download URL found for: ", file_name)
  dest <- file.path(dir, file_name)

  request(s3_url) |> req_perform(path = dest)
  dest
}
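The extracts arrive as .zip archives, while the parsing step below reads an already-extracted XML file from data/grants_data/extracted. A minimal helper bridging those two steps might look like the sketch below; the extract_zip name and output directory are illustrative assumptions, not part of the pipeline above:

```r
# Hypothetical helper: unpack a downloaded extract so the XML can be parsed.
# Uses base R's unzip(); no external dependencies.
extract_zip <- function(zip_path, out_dir = "data/grants_data/extracted") {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  # unzip() returns the path(s) of the files it wrote
  unzip(zip_path, exdir = out_dir)
}
```

Chained with the download step, `xml_path <- extract_zip(download_extract())` would yield a path ready for read_xml().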

XML Parsing Strategy

Each XML extract contains one node per grant opportunity with 30+ fields nested as child elements. The parsing function reads the XML tree, discovers all unique field names from a sample of records, then processes the full file in chunks of 10,000 records to manage memory on large extracts.

library(xml2)

xml_doc  <- read_xml("data/grants_data/extracted/GrantsDBExtract20260301v2.xml")
children <- xml_children(xml_root(xml_doc))
length(children)
#> [1] 74218

The core parsing loop iterates over XML child nodes, extracts each element’s name and text value, and populates a pre-allocated data frame. This is intentionally simple—no XPath queries, no external schema dependency—which makes it robust to minor format changes Grants.gov might introduce.

# Simplified illustration of the per-record extraction logic
node     <- children[[1]]
elements <- xml_children(node)

record <- setNames(
  xml_text(elements),
  xml_name(elements)
)
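The chunking itself can be as simple as splitting record indices into blocks of 10,000 and running the per-record extraction above over each block. A base-R sketch of the index split (the chunk_indices helper is illustrative, not the pipeline's actual function):

```r
# Split n record indices into consecutive chunks of at most chunk_size
chunk_indices <- function(n_records, chunk_size = 10000) {
  split(seq_len(n_records), ceiling(seq_len(n_records) / chunk_size))
}

chunks <- chunk_indices(74218)
length(chunks)        # 8 chunks for the snapshot above
lengths(chunks)[[8]]  # final partial chunk holds 4218 records
```

Each chunk's indices can then be used to subset `children` and fill the corresponding rows of the pre-allocated data frame.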

After parsing, date fields are converted from Grants.gov’s MMDDYYYY string format and numeric fields (AwardCeiling, AwardFloor, EstimatedTotalProgramFunding) are coerced from character:

library(dplyr)

chunk_data <- chunk_data |>
  mutate(
    OpportunityID              = as.numeric(OpportunityID),
    AwardCeiling               = as.numeric(AwardCeiling),
    AwardFloor                 = as.numeric(AwardFloor),
    EstimatedTotalProgramFunding = as.numeric(EstimatedTotalProgramFunding),
    PostDate      = as.Date(PostDate, format = "%m%d%Y"),
    CloseDate     = as.Date(CloseDate, format = "%m%d%Y"),
    ArchiveDate   = as.Date(ArchiveDate, format = "%m%d%Y")
  )
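As a quick sanity check on the format string: Grants.gov encodes March 1, 2026 as "03012026", and "%m%d%Y" recovers it. Malformed or blank values coerce to NA rather than erroring, which suits fields that are frequently empty in the extracts:

```r
as.Date("03012026", format = "%m%d%Y")
#> [1] "2026-03-01"

as.Date("notadate", format = "%m%d%Y")
#> [1] NA
```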

From Flat Extract to Normalized Schema

The raw XML produces a single flat table—one row per opportunity with agency names, category codes, and funding details all in the same row. The database design normalizes this into six tables:

  Table                   What it stores                      Key design decision
  ----------------------  ----------------------------------  ---------------------------------------------
  agencies                Federal agency lookup               Surrogate PK (agency_id) because agency names
                                                              are inconsistent across extracts
  categories              Funding category lookup             Natural PK (category_code) from Grants.gov's
                                                              own type codes
  opportunities           Core grant records                  PK is opportunity_id from Grants.gov; FK to
                                                              agencies
  opportunity_categories  Junction table (M:N)                Composite PK resolves the many-to-many
                                                              between grants and categories
  snapshots               Extraction-run metadata             One row per download; tracks date and record
                                                              count
  change_log              Additions, modifications, removals  FKs to both opportunities and snapshots

The normalization step splits the flat data frame into these tables in R before writing to PostgreSQL:

library(DBI)
library(RPostgres)

# --- agencies: deduplicate and assign surrogate keys
agencies <- raw_data |>
  distinct(AgencyName, AgencyCode) |>
  mutate(agency_id = row_number()) |>
  select(agency_id, agency_name = AgencyName, agency_code = AgencyCode)

# --- categories: natural key from Grants.gov type codes
categories <- raw_data |>
  filter(!is.na(CategoryCode)) |>
  distinct(category_code = CategoryCode, category_name = CategoryName)

# --- opportunities: core records with FK to agencies
opportunities <- raw_data |>
  left_join(agencies,
            by = c("AgencyName" = "agency_name", "AgencyCode" = "agency_code")) |>
  transmute(
    opportunity_id         = OpportunityID,
    agency_id,
    opportunity_title      = OpportunityTitle,
    post_date              = PostDate,
    close_date             = CloseDate,
    award_ceiling          = AwardCeiling,
    award_floor            = AwardFloor,
    estimated_total_funding = EstimatedTotalProgramFunding,
    opportunity_status     = OpportunityStatus
  )

# --- opportunity_categories: junction table
opp_cats <- raw_data |>
  filter(!is.na(CategoryCode)) |>
  distinct(opportunity_id = OpportunityID, category_code = CategoryCode)

# --- snapshots: one row for this extraction run
snapshot <- tibble(
  snapshot_id   = as.integer(format(Sys.Date(), "%Y%m%d")),
  snapshot_date = Sys.Date(),
  record_count  = nrow(raw_data)
)
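The write step itself is planned for the next milestone (see Next Steps), but the load order matters now: parent tables must land before tables that reference them, or foreign key constraints will reject the inserts. A sketch of the planned load, assuming the DBI/RPostgres libraries loaded above and an open connection `con` (connection details are placeholders):

```r
# Parents before children, so foreign key constraints resolve on insert
load_order <- c("agencies", "categories", "snapshots",
                "opportunities", "opportunity_categories")

write_snapshot_tables <- function(con, tables) {
  for (tbl in load_order) {
    dbWriteTable(con, tbl, tables[[tbl]], append = TRUE)
  }
}

# Planned usage once a PostgreSQL instance is available:
# write_snapshot_tables(con, list(
#   agencies = agencies, categories = categories, snapshots = snapshot,
#   opportunities = opportunities, opportunity_categories = opp_cats
# ))
```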

Change Detection

The change_log table is the analytical core of the project. After loading a new snapshot, the pipeline compares it against the previous snapshot to detect three types of changes:

  • NEW: opportunity_id exists in the current snapshot but not the previous
  • REMOVED: opportunity_id existed in the previous snapshot but not the current
  • MODIFIED: opportunity_id exists in both, but one or more field values differ

detect_changes <- function(prev, curr, snapshot_id) {

  prev_ids <- prev$opportunity_id
  curr_ids <- curr$opportunity_id

  # --- NEW grants
  new_grants <- curr |>
    filter(!opportunity_id %in% prev_ids) |>
    transmute(opportunity_id, snapshot_id = !!snapshot_id,
              change_type = "NEW", field_changed = NA,
              old_value = NA, new_value = NA,
              detected_date = Sys.time())

  # --- REMOVED grants
  removed <- prev |>
    filter(!opportunity_id %in% curr_ids) |>
    transmute(opportunity_id, snapshot_id = !!snapshot_id,
              change_type = "REMOVED", field_changed = NA,
              old_value = NA, new_value = NA,
              detected_date = Sys.time())

  # --- MODIFIED grants (field-by-field comparison)
  tracked_fields <- c("opportunity_title", "close_date", "post_date",
                       "award_ceiling", "award_floor",
                       "estimated_total_funding", "opportunity_status")

  shared <- inner_join(
    prev |> select(opportunity_id, all_of(tracked_fields)),
    curr |> select(opportunity_id, all_of(tracked_fields)),
    by = "opportunity_id", suffix = c("_prev", "_curr")
  )

  modifications <- purrr::map_dfr(tracked_fields, function(fld) {
    prev_col <- paste0(fld, "_prev")
    curr_col <- paste0(fld, "_curr")

    shared |>
      filter(!is.na(.data[[curr_col]]) | !is.na(.data[[prev_col]])) |>
      filter(as.character(.data[[prev_col]]) != as.character(.data[[curr_col]]) |
               is.na(.data[[prev_col]]) != is.na(.data[[curr_col]])) |>
      transmute(
        opportunity_id,
        snapshot_id   = !!snapshot_id,
        change_type   = "MODIFIED",
        field_changed = fld,
        old_value     = as.character(.data[[prev_col]]),
        new_value     = as.character(.data[[curr_col]]),
        detected_date = Sys.time()
      )
  })

  bind_rows(new_grants, removed, modifications) |>
    mutate(change_id = row_number())
}

This mirrors the change_detected relationship in the ER diagram: each row in the change log links an opportunity_id to a snapshot_id, with descriptive attributes recording what changed and when.
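The NEW/REMOVED classification reduces to a pair of set differences on opportunity_id. A toy example with illustrative IDs:

```r
prev_ids <- c(1001, 1002, 1003)
curr_ids <- c(1002, 1003, 1004)

setdiff(curr_ids, prev_ids)  # NEW:     1004
setdiff(prev_ids, curr_ids)  # REMOVED: 1001
```

Only the MODIFIED case requires the heavier field-by-field join shown above.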


Next Steps

  • PostgreSQL integration: Write the normalized tables to a PostgreSQL instance using dbWriteTable() with appropriate primary and foreign key constraints
  • Scheduling: Automate daily extraction via cron or GitHub Actions so snapshots accumulate over time
  • Shiny application: Build the front-end dashboard with tabs for active opportunities, change history, and removal tracking