This project builds a longitudinal relational database to track changes in federal grant opportunities listed on Grants.gov. The federal grants landscape has experienced significant volatility since the 2025 change in presidential administration, with agencies rescinding, restructuring, and freezing funding opportunities at an unprecedented pace (Clegg, 2025). Because Grants.gov only displays a current-state snapshot—once an opportunity is removed, it disappears from the public record—this database fills an accountability gap by preserving a historical record of what was added, changed, and removed over time.
The system has three components: (1) an extraction pipeline that downloads and parses the daily Grants.gov XML extracts, (2) a normalized PostgreSQL database that stores the parsed records, and (3) a change-detection layer that compares successive snapshots and logs additions, modifications, and removals.

This progress update focuses on component 1: the extraction pipeline.
Grants.gov publishes daily XML extracts of its full opportunity database. These are available as zipped XML files at https://www.grants.gov/xml-extract. Each extract is a complete snapshot of every grant listing at the time of extraction—typically 50,000–80,000 records per file.
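The daily files follow a date-stamped naming convention. A small helper that builds the expected file name for a given date is sketched below; the pattern is inferred from observed archive entries (e.g. `GrantsDBExtract20260301v2.xml`), and the `v2` suffix is an assumption that may vary across files:

```r
# Build the expected extract file name for a given date.
# Pattern inferred from observed archive entries; "v2" is an assumption.
extract_name <- function(date) {
  sprintf("GrantsDBExtract%sv2.zip", format(date, "%Y%m%d"))
}

extract_name(as.Date("2026-03-01"))
#> [1] "GrantsDBExtract20260301v2.zip"
```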
The pipeline scrapes the extract index page, identifies available files, and downloads the most recent (or a specified) snapshot from Grants.gov’s S3-hosted archive.
```r
library(httr2)
library(rvest)
library(dplyr)
library(purrr)

get_available_extracts <- function() {
  page <- request("https://www.grants.gov/xml-extract") |>
    req_perform() |>
    resp_body_html()

  # Parse the file listing table
  files <- page |>
    html_table(fill = TRUE) |>
    pluck(1)
  colnames(files) <- c("file_name", "size", "extracted_datetime")
  files <- files[-1, ] |>
    filter(grepl("\\.zip", file_name)) |>
    arrange(desc(file_name))

  # Extract S3 download URLs
  urls <- page |>
    html_nodes("a") |>
    html_attr("href") |>
    keep(~ grepl("s3.amazonaws.com/extracts/GrantsDBExtract", .x))

  list(files = files, urls = urls)
}
```
The download function retrieves a specific extract (defaulting to the latest) and saves it locally:
```r
download_extract <- function(file_name = NULL, dir = "data/grants_data") {
  info <- get_available_extracts()
  if (is.null(file_name)) file_name <- info$files$file_name[1]
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)

  # fixed = TRUE so the dots in the file name are not treated as regex wildcards
  s3_url <- info$urls[grepl(file_name, info$urls, fixed = TRUE)][1]
  dest <- file.path(dir, file_name)
  request(s3_url) |> req_perform(path = dest)
  dest
}
```
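A typical invocation, assuming the directory layout used elsewhere in this update, downloads the latest snapshot and unzips it for parsing:

```r
# Download the most recent extract and unpack it (paths are illustrative)
zip_path <- download_extract()
unzip(zip_path, exdir = "data/grants_data/extracted")
```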
Each XML extract contains one node per grant opportunity with 30+ fields nested as child elements. The parsing function reads the XML tree, discovers all unique field names from a sample of records, then processes the full file in chunks of 10,000 records to manage memory on large extracts.
```r
library(xml2)

xml_doc <- read_xml("data/grants_data/extracted/GrantsDBExtract20260301v2.xml")
children <- xml_children(xml_root(xml_doc))
length(children)
#> [1] 74218
```
The core parsing loop iterates over XML child nodes, extracts each element’s name and text value, and populates a pre-allocated data frame. This is intentionally simple—no XPath queries, no external schema dependency—which makes it robust to minor format changes Grants.gov might introduce.
```r
# Simplified illustration of the per-record extraction logic
node <- children[[1]]
elements <- xml_children(node)
record <- setNames(
  xml_text(elements),
  xml_name(elements)
)
```
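The full chunked loop is not reproduced here, but a minimal sketch of the approach described above (field discovery from a sample, then 10,000-record chunks) might look like the following. It reuses `children` from the earlier `xml_children()` call; the sample size of 200 and the padding logic are illustrative choices, not the pipeline's exact implementation:

```r
chunk_size <- 10000
n <- length(children)

# Discover the set of column names from a sample of records
field_names <- children[seq_len(min(200, n))] |>
  lapply(\(nd) xml_name(xml_children(nd))) |>
  unlist() |>
  unique()

# Process the file in chunks of 10,000 records to bound memory use
chunk_starts <- seq(1, n, by = chunk_size)
parsed <- lapply(chunk_starts, function(start) {
  idx <- start:min(start + chunk_size - 1, n)
  rows <- lapply(children[idx], function(nd) {
    els <- xml_children(nd)
    vals <- setNames(xml_text(els), xml_name(els))
    # Pad missing fields with NA so every record has the same columns
    vals[setdiff(field_names, names(vals))] <- NA_character_
    as.data.frame(as.list(vals[field_names]), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
})
raw_data <- do.call(rbind, parsed)
```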
After parsing, date fields are converted from Grants.gov's `MMDDYYYY` string format and numeric fields (`AwardCeiling`, `AwardFloor`, `EstimatedTotalProgramFunding`) are coerced from character:
```r
library(dplyr)

chunk_data <- chunk_data |>
  mutate(
    OpportunityID = as.numeric(OpportunityID),
    AwardCeiling = as.numeric(AwardCeiling),
    AwardFloor = as.numeric(AwardFloor),
    EstimatedTotalProgramFunding = as.numeric(EstimatedTotalProgramFunding),
    PostDate = as.Date(PostDate, format = "%m%d%Y"),
    CloseDate = as.Date(CloseDate, format = "%m%d%Y"),
    ArchiveDate = as.Date(ArchiveDate, format = "%m%d%Y")
  )
```
The raw XML produces a single flat table—one row per opportunity with agency names, category codes, and funding details all in the same row. The database design normalizes this into six tables:
| Table | What it stores | Key design decision |
|---|---|---|
| `agencies` | Federal agency lookup | Surrogate PK (`agency_id`) because agency names are inconsistent across extracts |
| `categories` | Funding category lookup | Natural PK (`category_code`) from Grants.gov's own type codes |
| `opportunities` | Core grant records | PK is `opportunity_id` from Grants.gov; FK to `agencies` |
| `opportunity_categories` | Junction table (M:N) | Composite PK resolves many-to-many between grants and categories |
| `snapshots` | Extraction-run metadata | One row per download; tracks date and record count |
| `change_log` | Additions, modifications, removals | FKs to both `opportunities` and `snapshots` |
The normalization step splits the flat data frame into these tables in R before writing to PostgreSQL:
```r
library(DBI)
library(RPostgres)
library(dplyr)

# --- agencies: deduplicate and assign surrogate keys
agencies <- raw_data |>
  distinct(AgencyName, AgencyCode) |>
  mutate(agency_id = row_number()) |>
  select(agency_id, agency_name = AgencyName, agency_code = AgencyCode)

# --- categories: natural key from Grants.gov type codes
categories <- raw_data |>
  filter(!is.na(CategoryCode)) |>
  distinct(category_code = CategoryCode, category_name = CategoryName)

# --- opportunities: core records with FK to agencies
opportunities <- raw_data |>
  left_join(agencies, by = c("AgencyName" = "agency_name")) |>
  transmute(
    opportunity_id = OpportunityID,
    agency_id,
    opportunity_title = OpportunityTitle,
    post_date = PostDate,
    close_date = CloseDate,
    award_ceiling = AwardCeiling,
    award_floor = AwardFloor,
    estimated_total_funding = EstimatedTotalProgramFunding,
    opportunity_status = OpportunityStatus
  )

# --- opportunity_categories: junction table
opp_cats <- raw_data |>
  filter(!is.na(CategoryCode)) |>
  distinct(opportunity_id = OpportunityID, category_code = CategoryCode)

# --- snapshots: one row for this extraction run
snapshot <- tibble(
  snapshot_id = as.integer(format(Sys.Date(), "%Y%m%d")),
  snapshot_date = Sys.Date(),
  record_count = nrow(raw_data)
)
```
The `change_log` table is the analytical core of the project. After loading a new snapshot, the pipeline compares it against the previous snapshot to detect three types of changes:

- **NEW**: an `opportunity_id` exists in the current snapshot but not the previous
- **REMOVED**: an `opportunity_id` existed in the previous snapshot but not the current
- **MODIFIED**: an `opportunity_id` exists in both, but one or more field values differ

```r
detect_changes <- function(prev, curr, snapshot_id) {
  prev_ids <- prev$opportunity_id
  curr_ids <- curr$opportunity_id

  # --- NEW grants
  new_grants <- curr |>
    filter(!opportunity_id %in% prev_ids) |>
    transmute(opportunity_id, snapshot_id = !!snapshot_id,
              change_type = "NEW", field_changed = NA_character_,
              old_value = NA_character_, new_value = NA_character_,
              detected_date = Sys.time())

  # --- REMOVED grants
  removed <- prev |>
    filter(!opportunity_id %in% curr_ids) |>
    transmute(opportunity_id, snapshot_id = !!snapshot_id,
              change_type = "REMOVED", field_changed = NA_character_,
              old_value = NA_character_, new_value = NA_character_,
              detected_date = Sys.time())

  # --- MODIFIED grants (field-by-field comparison)
  tracked_fields <- c("opportunity_title", "close_date", "post_date",
                      "award_ceiling", "award_floor",
                      "estimated_total_funding", "opportunity_status")
  shared <- inner_join(
    prev |> select(opportunity_id, all_of(tracked_fields)),
    curr |> select(opportunity_id, all_of(tracked_fields)),
    by = "opportunity_id", suffix = c("_prev", "_curr")
  )
  modifications <- purrr::map_dfr(tracked_fields, function(fld) {
    prev_col <- paste0(fld, "_prev")
    curr_col <- paste0(fld, "_curr")
    shared |>
      filter(!is.na(.data[[curr_col]]) | !is.na(.data[[prev_col]])) |>
      filter(as.character(.data[[prev_col]]) != as.character(.data[[curr_col]]) |
               is.na(.data[[prev_col]]) != is.na(.data[[curr_col]])) |>
      transmute(
        opportunity_id,
        snapshot_id = !!snapshot_id,
        change_type = "MODIFIED",
        field_changed = fld,
        old_value = as.character(.data[[prev_col]]),
        new_value = as.character(.data[[curr_col]]),
        detected_date = Sys.time()
      )
  })

  bind_rows(new_grants, removed, modifications) |>
    mutate(change_id = row_number())
}
```
This mirrors the `change_detected` relationship in the ER diagram: each row in the change log links an `opportunity_id` to a `snapshot_id`, with descriptive attributes recording what changed and when.
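To see the output shape, a toy pair of snapshots (all values fabricated for demonstration) can be run through `detect_changes()`:

```r
library(dplyr)

# Fabricated "previous" snapshot with three grants
prev <- tibble(
  opportunity_id = 1:3,
  opportunity_title = c("Rural Health", "STEM Ed", "Water Quality"),
  post_date = as.Date("2026-01-15"),
  close_date = as.Date("2026-06-30"),
  award_ceiling = c(500000, 250000, 100000),
  award_floor = 0,
  estimated_total_funding = 2000000,
  opportunity_status = "posted"
)

# "Current" snapshot: grant 3 disappears, grant 2's ceiling changes,
# and a new grant 4 appears
curr <- prev |>
  filter(opportunity_id != 3) |>
  mutate(award_ceiling = if_else(opportunity_id == 2L, 300000, award_ceiling)) |>
  bind_rows(prev |> slice(1) |> mutate(opportunity_id = 4L,
                                       opportunity_title = "Broadband Access"))

changes <- detect_changes(prev, curr, snapshot_id = 20260302L)
# yields three rows: one NEW (id 4), one REMOVED (id 3),
# and one MODIFIED (id 2, field_changed = "award_ceiling")
```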
Remaining work:

- Load the normalized tables into PostgreSQL via `dbWriteTable()` with appropriate primary and foreign key constraints
- Schedule the pipeline with `cron` or GitHub Actions so snapshots accumulate over time
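A sketch of that load step is below. Connection details and constraint DDL are assumptions; note that `dbWriteTable()` creates plain tables, so key constraints must be added separately with `dbExecute()`:

```r
library(DBI)

# Connection parameters are placeholders
con <- dbConnect(RPostgres::Postgres(), dbname = "grants_tracker")

dbWriteTable(con, "agencies", agencies, overwrite = TRUE)
dbWriteTable(con, "opportunities", opportunities, overwrite = TRUE)

# Constraints are added after the write, since dbWriteTable() does not create them
dbExecute(con, "ALTER TABLE agencies ADD PRIMARY KEY (agency_id)")
dbExecute(con, "ALTER TABLE opportunities ADD PRIMARY KEY (opportunity_id)")
dbExecute(con, "ALTER TABLE opportunities
                  ADD FOREIGN KEY (agency_id) REFERENCES agencies (agency_id)")

dbDisconnect(con)
```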