The Design of an LLM-powered Unstructured Analytics System

Summary: Aryn compiles natural-language (NL) queries into semantic plans executed by Sycamore, a distributed declarative engine that exposes DocSets for analyzing, enriching, and transforming large unstructured document collections. Luna (NL→Sycamore) and DocParse (PDF→DocSet) improve accuracy over RAG on NTSB reports and surface explainable execution traces to build trust. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 539
Venue: CIDR
Year: 2025
Pagerank: 6.8886648e-05
Overall Rank: 3,639 | 74.72%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 13 of 13 citing papers.

Overall Rank	Citing Paper	Year	Venue	Pagerank
1,839	DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing	2025	VLDB	0.00010351287
5,149	Abacus: A Cost-Based Optimizer for Semantic Operator Systems	2026	VLDB	5.655398e-05
5,756	Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System	2025	SIGMOD	5.3387063e-05
7,933	In-depth Analysis of Graph-based RAG in a Unified Framework	2025	VLDB	4.6089395e-05
8,464	Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS	2025	VLDB	4.5003888e-05
9,728	Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems	2025	VLDB	4.2901665e-05
9,989	Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven Analytics	2026	CIDR	4.1905499e-05
10,064	Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees	2026	SIGMOD	4.1905499e-05
10,144	Beyond Relational: Semantic-Aware Multi-Modal Analytics with LLM-Native Query Optimization	2026	SIGMOD	4.1905499e-05
10,215	Task Cascades for Efficient Unstructured Data Processing	2026	SIGMOD	4.1905499e-05
10,285	Relational Deep Dive: Error-Aware Queries Over Unstructured Data	2026	VLDB	4.1905499e-05
10,481	Approximating Opaque Top-k Queries	2025	SIGMOD	4.1905499e-05
10,718	Cracking Vector Search Indexes	2025	VLDB	4.1905499e-05

Outgoing Citations (Sorted by Pagerank)

Showing 9 of 9 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Overall Rank	Cited Paper	Year	Venue	Pagerank
219	Deep Entity Matching with Pre-Trained Language Models	2021	VLDB	0.00033354456
516	Can Foundation Models Wrangle Your Data?	2023	VLDB	0.00021194444
973	Natural language to SQL: Where are we today?	2020	VLDB	0.0001488435
997	CAESURA: Language Models as Multi-Modal Query Planners	2024	CIDR	0.00014726927
1,088	Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes	2024	VLDB	0.00014158762
1,839	DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing	2025	VLDB	0.00010351287
2,513	Annotating Columns with Pre-trained Language Models	2022	SIGMOD	8.6155767e-05
3,003	Chorus: Foundation Models for Unified Data Discovery and Exploration	2024	VLDB	7.7358219e-05
3,189	Text2SQL is Not Enough: Unifying AI and Databases with TAG	2025	CIDR	7.4140094e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
10,215	Task Cascades for Efficient Unstructured Data Processing	2026	SIGMOD	4.1905499e-05
13,185	Reimagining Deep Learning Systems Through the Lens of Data Systems	2024	VLDB	-
5,669	Databases Unbound: Querying All of the World's Bytes with AI	2024	VLDB	5.3805024e-05
10,462	ScaleLLM: A Technique for Scalable LLM-augmented Data Systems	2025	SIGMOD	4.1905499e-05
1,839	DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing	2025	VLDB	0.00010351287
9,728	Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems	2025	VLDB	4.2901665e-05
7,703	AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries	2025	CIDR	4.668568e-05
13,148	DocDB: A Database for Unstructured Document Analysis	2025	VLDB	-
9,989	Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven Analytics	2026	CIDR	4.1905499e-05
9,153	Unify: A System For Unstructured Data Analytics	2025	VLDB	4.380727e-05