How to Extract Structured Data with LLMs in R

llm

ellmer

data extraction

Learn to extract structured data from text using LLMs in R. Parse names, dates, products, and entities from unstructured text into clean data frames.

Published

April 4, 2026

Introduction

LLMs excel at extracting structured information from unstructured text. Instead of writing complex regex patterns, you can describe what you want and get clean, structured output.

This tutorial uses the ellmer package. You can use any provider: Claude, OpenAI, or local models with Ollama.

Use cases:

- Extract names, dates, and addresses from text
- Parse product information from descriptions
- Convert free-text survey responses to categories
- Extract entities from documents
- Clean and standardize messy data

Getting Started

library(ellmer)
library(tidyverse)
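
Before the first request, it can help to confirm that a provider key is visible to the R session. The sketch below assumes the Anthropic key lives in the ANTHROPIC_API_KEY environment variable; `has_api_key()` is a hypothetical helper written for this tutorial, not an ellmer function.

```r
# Hypothetical helper: TRUE when the named environment variable is non-empty.
has_api_key <- function(var = "ANTHROPIC_API_KEY") {
  nzchar(Sys.getenv(var))
}

if (!has_api_key()) {
  message("ANTHROPIC_API_KEY is not set; add it to .Renviron and restart R.")
}
```

For another provider, pass that provider's variable name, for example `has_api_key("OPENAI_API_KEY")`.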

Basic Extraction

Define a schema

Tell the LLM what structure you expect:

person_schema <- type_object(
  name = type_string("Person's full name"),
  age = type_integer("Person's age in years"),
  email = type_string("Email address if mentioned", required = FALSE)
)

Extract from text

chat <- chat_claude()

# chat_structured() is the current name for the deprecated extract_data()
result <- chat$chat_structured(
  "John Smith is 35 years old. You can reach him at john@email.com",
  type = person_schema
)

Access the results

result$name   # "John Smith"
result$age    # 35
result$email  # "john@email.com"

Handle missing data

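result <- chat$chat_structured(
  "Sarah is 28 years old.",
  type = person_schema
)

result
# $name: "Sarah"
# $age: 28
# $email: NULL  # Not mentioned in text
# Note: a field comes back as NULL only if it is declared with required = FALSE.
# Fields are required by default, and the model may invent a value for them.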

Type Definitions

Available types

# String
type_string("description of the field")

# Integer
type_integer("description")

# Number (float)
type_number("description")

# Boolean
type_boolean("description")

# Enum (predefined options)
type_enum(
  values = c("positive", "negative", "neutral"),
  description = "Sentiment classification"
)

# Array (list of items)
type_array(
  items = type_string("individual item description"),
  description = "A list of items"
)

# Object (nested structure)
type_object(
  field1 = type_string("..."),
  field2 = type_integer("...")
)

Practical Examples

Extract product information

product_schema <- type_object(
  name = type_string("Product name"),
  price = type_number("Price in dollars"),
  currency = type_string("Currency code"),
  features = type_array(
    items = type_string("A product feature")
  )
)

text <- "The iPhone 15 Pro costs $999. Features include titanium design,
A17 chip, 48MP camera, and USB-C port."

chat <- chat_claude()
product <- chat$chat_structured(text, type = product_schema)

product
# $name: "iPhone 15 Pro"
# $price: 999
# $currency: "USD"
# $features: ["titanium design", "A17 chip", "48MP camera", "USB-C port"]

Classify sentiment

sentiment_schema <- type_object(
  sentiment = type_enum(
    values = c("positive", "negative", "neutral"),
    description = "Overall sentiment"
  ),
  confidence = type_number("Confidence score from 0 to 1"),
  key_phrases = type_array(
    items = type_string("Key phrase indicating sentiment")
  )
)

review <- "Absolutely love this product! Best purchase I've made this year.
The quality is outstanding and shipping was super fast."

chat <- chat_claude()
result <- chat$chat_structured(review, type = sentiment_schema)

result
# $sentiment: "positive"
# $confidence: 0.95
# $key_phrases: ["Absolutely love", "Best purchase", "outstanding", "super fast"]

Parse contact information

contact_schema <- type_object(
  name = type_string("Full name"),
  phone = type_string("Phone number"),
  email = type_string("Email address"),
  address = type_object(
    street = type_string("Street address"),
    city = type_string("City"),
    state = type_string("State"),
    zip = type_string("ZIP code")
  )
)

text <- "Contact Jane Doe at (555) 123-4567 or jane.doe@company.com.
Her office is at 123 Main Street, San Francisco, CA 94102."

chat <- chat_claude()
contact <- chat$chat_structured(text, type = contact_schema)

Extract dates and events

event_schema <- type_object(
  event_name = type_string("Name of the event"),
  date = type_string("Date in YYYY-MM-DD format"),
  location = type_string("Event location"),
  description = type_string("Brief description")
)

text <- "Join us for the R Users Meetup on March 15th, 2024 at the
Downtown Conference Center. We'll discuss data visualization techniques."

chat <- chat_claude()
event <- chat$chat_structured(text, type = event_schema)

event
# $event_name: "R Users Meetup"
# $date: "2024-03-15"
# $location: "Downtown Conference Center"
# $description: "Discussion about data visualization techniques"
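
The schema asks for YYYY-MM-DD, but the field is still a string and the model is not guaranteed to follow the format. A minimal sketch of a post-extraction check in base R; `parse_iso_date()` is a hypothetical helper, and the `event` list here is a stand-in for the value returned by the extraction call.

```r
# Hypothetical helper: convert one extracted date string to a Date,
# returning NA when the value is missing or not in YYYY-MM-DD form.
parse_iso_date <- function(x) {
  is_iso <- is.character(x) && length(x) == 1 && !is.na(x) &&
    grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  if (!is_iso) {
    return(as.Date(NA))
  }
  # Impossible dates such as 2024-02-30 also come back as NA here
  as.Date(x, format = "%Y-%m-%d")
}

# Stand-in for the extraction result shown above
event <- list(event_name = "R Users Meetup", date = "2024-03-15")

event_date <- parse_iso_date(event$date)
one_week_later <- event_date + 7  # Date arithmetic works once parsed
```

Treating an NA result as a failed extraction keeps malformed dates out of downstream joins and filters.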

Extracting Multiple Items

Extract array of objects

person_schema <- type_object(
  name = type_string("Person's name"),
  role = type_string("Person's role or title")
)

people_schema <- type_array(
  items = person_schema,
  description = "List of people mentioned"
)

text <- "The meeting included CEO John Smith, CTO Sarah Johnson,
and CFO Michael Brown. They discussed Q4 results."

chat <- chat_claude()
people <- chat$chat_structured(text, type = people_schema)

people
# [[1]] $name: "John Smith", $role: "CEO"
# [[2]] $name: "Sarah Johnson", $role: "CTO"
# [[3]] $name: "Michael Brown", $role: "CFO"

Convert to data frame

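# Recent ellmer versions may already return an array of objects as a
# data frame; otherwise build the tibble from the list of records
people_df <- if (is.data.frame(people)) {
  as_tibble(people)
} else {
  tibble(
    name = map_chr(people, "name"),
    role = map_chr(people, "role")
  )
}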

people_df
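
When some records lack a field, a plain `map_chr()` call fails on the NULL. The following base R sketch fills such gaps with NA; `records_to_df()` is a hypothetical helper, and the `people` list is hand-written sample data standing in for a list-shaped extraction result.

```r
# Hypothetical helper: turn a list of extracted records into a data frame,
# using NA for any field that is NULL or absent in a record.
records_to_df <- function(records, fields) {
  cols <- lapply(fields, function(f) {
    vapply(records, function(r) {
      value <- r[[f]]
      if (is.null(value)) NA_character_ else as.character(value)
    }, character(1))
  })
  names(cols) <- fields
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Sample records; the second one has no role
people <- list(
  list(name = "John Smith", role = "CEO"),
  list(name = "Sarah Johnson", role = NULL)
)

people_df <- records_to_df(people, c("name", "role"))
```

Every column comes back as character, so convert numeric fields afterwards with `as.integer()` or `as.numeric()` as needed.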

Batch Processing

Process multiple texts

library(purrr)

reviews <- c(
  "Great product, love it!",
  "Terrible quality, very disappointed",
  "It's okay, nothing special",
  "Best purchase ever, highly recommend"
)

sentiment_schema <- type_enum(
  values = c("positive", "negative", "neutral"),
  description = "Sentiment"
)

extract_sentiment <- function(text) {
  chat <- chat_claude()
  Sys.sleep(0.5)  # Rate limiting
  chat$chat_structured(text, type = sentiment_schema)
}

sentiments <- map_chr(reviews, extract_sentiment)

tibble(
  review = reviews,
  sentiment = sentiments
)
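
Real review data often contains repeated texts, and each repeat costs another request. One way to cut calls is to extract each distinct text once and map the results back. This is a sketch under that assumption: `extract_unique()` is a hypothetical helper, and `fake_extract()` is a keyword-based stand-in so the example runs without an API key.

```r
# Hypothetical helper: call extract_fn once per distinct text,
# then map the results back onto the original order.
extract_unique <- function(texts, extract_fn) {
  unique_texts <- unique(texts)
  unique_results <- lapply(unique_texts, extract_fn)
  unique_results[match(texts, unique_texts)]
}

# Stand-in extractor that counts its calls; in practice, pass
# extract_sentiment() from the example above instead.
n_calls <- 0
fake_extract <- function(text) {
  n_calls <<- n_calls + 1
  if (grepl("love|great", text, ignore.case = TRUE)) "positive" else "neutral"
}

repeated_reviews <- c(
  "Great product, love it!",
  "It's okay, nothing special",
  "Great product, love it!"
)

sentiments <- unlist(extract_unique(repeated_reviews, fake_extract))
```

Three reviews are classified with two calls here, because the first and third texts are identical.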

Process data frame column

df <- tibble(
  id = 1:3,
  description = c(
    "John Smith, age 30, engineer",
    "Jane Doe, age 25, designer",
    "Bob Brown, age 45, manager"
  )
)

person_schema <- type_object(
  name = type_string("Name"),
  age = type_integer("Age"),
  job = type_string("Job title")
)

df_extracted <- df |>
  mutate(
    extracted = map(description, \(text) {
      chat <- chat_claude()
      Sys.sleep(0.5)
      chat$chat_structured(text, type = person_schema)
    }),
    name = map_chr(extracted, "name"),
    age = map_int(extracted, "age"),
    job = map_chr(extracted, "job")
  ) |>
  select(-extracted)

df_extracted

Advanced Patterns

Extraction with instructions

chat <- chat_claude(
  system_prompt = "Extract information exactly as specified.
  If information is unclear, make your best inference.
  Use NULL for genuinely missing data."
)

# text and schema stand for your own input string and type definition
result <- chat$chat_structured(text, type = schema)

Validate extracted data

extract_and_validate <- function(text, schema, validation_fn) {
  chat <- chat_claude()
  result <- chat$chat_structured(text, type = schema)

  if (!validation_fn(result)) {
    warning("Extraction may be incomplete or invalid")
  }

  result
}

# Example validation
validate_person <- function(person) {
  !is.null(person$name) && !is.null(person$age)
}

result <- extract_and_validate(
  "Some text",
  person_schema,
  validate_person
)
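
A NULL check catches missing fields but not implausible values. The sketch below goes one step further and reports what is wrong with a record; `check_person()` is a hypothetical helper, and the 0 to 120 age range is an assumption chosen for illustration.

```r
# Hypothetical stricter validator: returns a character vector of problems,
# which is empty when the record passes every check.
check_person <- function(person) {
  problems <- character()
  name <- person$name
  age <- person$age

  if (!is.character(name) || length(name) != 1 || is.na(name) || !nzchar(name)) {
    problems <- c(problems, "name must be a non-empty string")
  }
  if (!is.numeric(age) || length(age) != 1 || is.na(age) || age < 0 || age > 120) {
    problems <- c(problems, "age must be a number between 0 and 120")
  }
  problems
}

ok_problems <- check_person(list(name = "John Smith", age = 35))
bad_problems <- check_person(list(name = "", age = 350))
```

To plug it into `extract_and_validate()`, wrap it as `\(p) length(check_person(p)) == 0`.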

Combine extraction with classification

ticket_schema <- type_object(
  category = type_enum(
    values = c("billing", "technical", "account", "other"),
    description = "Support ticket category"
  ),
  priority = type_enum(
    values = c("low", "medium", "high", "urgent"),
    description = "Priority level"
  ),
  summary = type_string("One-sentence summary"),
  entities = type_object(
    account_id = type_string("Account ID if mentioned"),
    error_code = type_string("Error code if mentioned")
  )
)

ticket <- "Hi, I can't log into my account #12345. Getting error E401.
This is urgent as I need to complete a transaction today!"

chat <- chat_claude()
parsed_ticket <- chat$chat_structured(ticket, type = ticket_schema)

Error Handling

safe_extract <- function(text, schema) {
  tryCatch({
    chat <- chat_claude()
    chat$chat_structured(text, type = schema)
  }, error = function(e) {
    warning("Extraction failed: ", e$message)
    NULL
  })
}

# Use with map for batch processing
results <- map(texts, \(t) safe_extract(t, schema))

# Filter out failures
valid_results <- compact(results)  # Remove NULLs
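
Some failures are transient (timeouts, rate-limit responses), so retrying before giving up can recover results that `safe_extract()` would drop. A minimal sketch of retry with exponential backoff in base R; `with_retry()` is a hypothetical wrapper, and `flaky_extract()` is a stand-in that fails twice before succeeding so the example runs offline.

```r
# Hypothetical wrapper: returns a version of fn that retries on error,
# doubling the wait after each failed attempt, and returns NULL on give-up.
with_retry <- function(fn, max_attempts = 3, base_delay = 1) {
  function(...) {
    for (attempt in seq_len(max_attempts)) {
      result <- tryCatch(fn(...), error = function(e) e)
      if (!inherits(result, "error")) {
        return(result)
      }
      if (attempt < max_attempts) {
        Sys.sleep(base_delay * 2^(attempt - 1))
      }
    }
    warning("Giving up after ", max_attempts, " attempts: ", conditionMessage(result))
    NULL
  }
}

# Stand-in extractor that fails on its first two calls
attempts <- 0
flaky_extract <- function(text) {
  attempts <<- attempts + 1
  if (attempts < 3) stop("temporary failure")
  toupper(text)
}

# base_delay = 0 keeps the demo fast; use a positive delay with a real API
retry_result <- with_retry(flaky_extract, base_delay = 0)("some text")
```

Because the wrapper returns NULL after the last failed attempt, its results can still be filtered with `compact()` as shown above.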

Local LLM Extraction

Use Ollama for free, private extraction:

# Works the same way with local models
chat <- chat_ollama(model = "llama3.2")

result <- chat$chat_structured(
  "John Smith, 35 years old, john@email.com",
  type = person_schema
)

Note: Local models may be less accurate for complex schemas. Test thoroughly.

Common Mistakes

1. Schema too complex

# Too many nested levels can confuse the model
# Break into simpler extractions if needed

2. Ambiguous field descriptions

# Bad
type_string("date")

# Good
type_string("Event date in YYYY-MM-DD format")

3. Not handling NULL values

# Always check for NULLs
result$field %||% "default_value"

# Or use map with default
map_chr(results, "field", .default = NA_character_)

4. Forgetting rate limits in batches

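# Pause between requests to stay under the provider's rate limit
map(texts, \(t) {
  Sys.sleep(0.5)  # Adjust to your provider's limit
  safe_extract(t, schema)  # Not extract(): tidyr already exports that name
})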

Summary

Task	Code
Define string field	`type_string("description")`
Define number field	`type_number("description")`
Define enum field	`type_enum(values = c(...))`
Define array	`type_array(items = type_*())`
Define object	`type_object(field = type_*())`
Extract data	`chat$chat_structured(text, type = schema)`

Define schemas with type_*() functions
Use clear field descriptions
Handle NULL values for missing data
Add delays when batch processing
Validate extracted data when reliability is important

Sources

ellmer Data Extraction Documentation: https://ellmer.tidyverse.org/
ellmer Type Definitions Reference: https://ellmer.tidyverse.org/reference/