Citation network demo dataset

The per-algorithm Examples in this section all run against one dataset: the DBLP / AMiner Citation Network V12 — 4,894,081 papers connected by 45,564,149 directed citation edges (src cites dst). It is small enough to fit on one GPU and rich enough to ask real questions: every vertex is a paper with a title, year, venue, authors, and fields of study, so every algorithm result can be joined back into a human-readable story.

The examples were re-run against the local Iceberg REST catalog, backed by RustFS, on a single NVIDIA RTX PRO 6000 (96 GB VRAM). The same SQL also runs against local Parquet files registered directly as tables, but the primary demo path is the Iceberg namespace used below.

Tables

The primary demo source is the local Iceberg REST namespace lake.citation_network. It contains five Iceberg tables loaded from the Parquet files generated by fixture/scripts/graph/dblp_ingest.py (see fixture/data/demo/README.md in the repository):

Table	Rows	One row is
`citation_edges`	45.6M	a citation edge (`src`, `dst`, `weight`=1.0), clustered by `src`
`citation_edges_by_dst`	45.6M	the same edges, clustered by `dst` for reverse traversal
`papers`	4.9M	a paper: `paper_id`, `title`, `year`, `venue`, `n_citation`, `n_references`, `primary_fos`, …
`paper_authors`	14.9M	a paper–author link
`paper_fos`	45.0M	a paper–field-of-study link

Load and serve the namespace with the local REST/RustFS stack:

docker compose -f fixture/iceberg-local/docker-compose.yml up -d
fixture/scripts/fixture.sh iceberg rest load --workload citation_network

cargo build --release --features server,cugraph,iceberg --bin datafusion_nexus_server

NEXUS_SERVER_CUGRAPH_ENABLED=true \
NEXUS_SERVER_NATIVE_MEMORY_POLICY=unbounded \
NEXUS_SERVER_BIND=127.0.0.1:50051 \
DATAFUSION_CATALOG_DEFAULT_CATALOG=datafusion \
DATAFUSION_CATALOG_DEFAULT_SCHEMA=public \
NEXUS_ICEBERG_CATALOG_KIND=rest \
NEXUS_ICEBERG_CATALOG_NAME=lake \
NEXUS_ICEBERG_NAMESPACE=citation_network \
NEXUS_ICEBERG_WAREHOUSE=s3://lakehouse/warehouse \
NEXUS_ICEBERG_REST_URI=http://localhost:8181 \
NEXUS_ICEBERG_S3_ENDPOINT=http://localhost:9000 \
NEXUS_ICEBERG_S3_REGION=us-east-1 \
NEXUS_ICEBERG_S3_PATH_STYLE=true \
NEXUS_ICEBERG_S3_ACCESS_KEY_ID=cudfadmin \
NEXUS_ICEBERG_S3_SECRET_ACCESS_KEY=cudfadminsecret \
NEXUS_SERVER_WORKSPACE_CATALOG=datafusion \
NEXUS_SERVER_WORKSPACE_SCHEMA=public \
NEXUS_SERVER_WORKSPACE_BACKING_CATALOG=lake \
NEXUS_SERVER_WORKSPACE_BACKING_SCHEMA=citation_network \
NEXUS_SERVER_WORKSPACE_BACKING_ALIASES=citation_edges,citation_edges_by_dst,papers,paper_authors,paper_fos \
flock /tmp/cudf-gpu.lock bash scripts/run_server.sh

The workspace overlay makes unqualified names such as citation_edges and papers resolve through to lake.citation_network.*, while interactive DDL stays local to datafusion.public. Prefer CREATE VIEW for SQL-defined subgraphs. Use CREATE TABLE ... AS only when an example explicitly needs a stable local snapshot of non-deterministic algorithm labels; it does not write back to the Iceberg source catalog.
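As a quick sanity check of the overlay, the two queries below should return the same count when unqualified names resolve as described. This is a minimal sketch that assumes the server was started with the environment shown above; it uses only table and catalog names that appear on this page.

```sql
-- Unqualified name, resolved through the workspace overlay.
SELECT COUNT(*) AS edges FROM citation_edges;

-- Fully qualified name in the Iceberg source catalog.
SELECT COUNT(*) AS edges FROM lake.citation_network.citation_edges;
```

If the load completed, both counts should match the edge total quoted at the top of this page.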

For direct-Parquet debugging without the REST catalog, start the server with Iceberg disabled and register the same logical names manually:

CREATE EXTERNAL TABLE citation_edges STORED AS PARQUET LOCATION '<data>/parquet/edges_by_src.parquet';
CREATE EXTERNAL TABLE citation_edges_by_dst STORED AS PARQUET LOCATION '<data>/parquet/edges_by_dst.parquet';
CREATE EXTERNAL TABLE papers STORED AS PARQUET LOCATION '<data>/parquet/profiles_by_paper_id.parquet';
CREATE EXTERNAL TABLE paper_authors STORED AS PARQUET LOCATION '<data>/parquet/paper_authors.parquet';
CREATE EXTERNAL TABLE paper_fos STORED AS PARQUET LOCATION '<data>/parquet/paper_fos.parquet';

AWS Glue can serve the same table layout for remote demos, but it requires an AWS account, a warehouse bucket, and credential setup. The local REST catalog is the reproducible path for the website examples.

The two-table pattern

cuGraph functions consume an edge relation and return vertex ids. The demos lean on one idiom throughout:

1. SQL decides what graph the GPU sees: a CREATE VIEW with joins and WHERE clauses builds the subgraph (an era, a field, an ego network), and the view name is passed straight to the cugraph_* function.
2. SQL turns vertex ids back into a story: the (vertex, value) output is joined to papers, windowed, aggregated, or anti-joined like any table.
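The two halves of the idiom can be sketched as follows. The view name edges_since_2012 and the relation algo_output are hypothetical: algo_output stands in for any relation with (vertex, value) columns, such as the result of a cugraph_* call on the view. The exact function signatures and output column names are given in the per-algorithm examples and are not assumed here. The sketch also assumes papers.year is an integer column.

```sql
-- Step 1: SQL decides what graph the GPU sees.
-- Keep only citations where both papers were published in 2012 or later.
-- (edges_since_2012 is a hypothetical view name.)
CREATE VIEW edges_since_2012 AS
SELECT e.src, e.dst, e.weight
FROM citation_edges AS e
JOIN papers AS ps ON ps.paper_id = e.src
JOIN papers AS pd ON pd.paper_id = e.dst
WHERE ps.year >= 2012
  AND pd.year >= 2012;

-- Step 2: SQL turns vertex ids back into a story.
-- algo_output is a hypothetical stand-in for a (vertex, value) result.
SELECT p.title, p.year, p.venue, o.value
FROM algo_output AS o
JOIN papers AS p ON p.paper_id = o.vertex
ORDER BY o.value DESC
LIMIT 10;
```

Because the subgraph is a view rather than a table, changing the era or adding a field-of-study filter only means redefining the view; the join back to papers stays the same.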

Seed papers used in the examples

`paper_id`	Paper
`2963403868`	Attention Is All You Need (2017)
`2896457183`	BERT (2018)
`2163605009`	AlexNet (2012)
`1686810756`	VGG (2014)
`2194775991`	ResNet (2016)
`2064675550`	Long short-term memory (1997)
`2066636486`	The anatomy of a large-scale hypertextual Web search engine — the PageRank paper (1998)

Two practical caveats

Snapshot non-deterministic labels before self-joining

Component and community ids (label, partition) are assigned per execution, so the same vertex can receive a different id on each run. A view over a cugraph_* call re-runs the algorithm on every scan, so a query that scans such a view twice (for example, a CTE joined back to it) can join mismatched labels. If a statement needs to read the algorithm output more than once, materialize it once into the mutable datafusion.public workspace with CREATE TABLE <name> AS SELECT …, then analyze the snapshot. The snapshot is local session/workspace state, not an Iceberg table write.
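A minimal sketch of the snapshot step, under stated assumptions: community_labels_v is a hypothetical view that wraps a cugraph_* call and exposes (vertex, label), and community_labels_snapshot is a hypothetical table name. Neither name comes from the demo itself.

```sql
-- Materialize the labels once. The table lives in datafusion.public,
-- not in the Iceberg source catalog.
-- (community_labels_v is a hypothetical view over a cugraph_* call.)
CREATE TABLE community_labels_snapshot AS
SELECT vertex, label
FROM community_labels_v;

-- Both references below read the same stored labels,
-- so the self-join cannot mix labels from two different runs.
WITH community_sizes AS (
    SELECT label, COUNT(*) AS members
    FROM community_labels_snapshot
    GROUP BY label
)
SELECT p.title, s.label, s.members
FROM community_labels_snapshot AS c
JOIN community_sizes AS s ON s.label = c.label
JOIN papers AS p ON p.paper_id = c.vertex
WHERE p.paper_id = 2963403868;  -- Attention Is All You Need (seed paper)
```

Running the same CTE directly against the view would scan it twice, which is the case the caveat warns about.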

n_citation counts the whole world, in-degree counts this corpus

papers.n_citation is AMiner's global citation count; an in-graph degree only counts edges present in this dataset. The in-degree example measures exactly that gap.
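The gap can also be approximated in plain SQL, without any graph function. The sketch below is not the in-degree example itself. It assumes each citation appears once in citation_edges_by_dst and that n_citation is numeric, and it uses three of the seed papers listed above.

```sql
-- Compare AMiner's global citation count with the number of
-- incoming edges actually present in this corpus.
SELECT p.paper_id,
       p.title,
       p.n_citation,
       COUNT(e.src) AS in_corpus_citations,
       p.n_citation - COUNT(e.src) AS gap
FROM papers AS p
LEFT JOIN citation_edges_by_dst AS e ON e.dst = p.paper_id
WHERE p.paper_id IN (2963403868, 2896457183, 2163605009)
GROUP BY p.paper_id, p.title, p.n_citation
ORDER BY gap DESC;
```

The LEFT JOIN keeps papers with no incoming edges in the corpus, for which COUNT(e.src) is 0.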
