Citation network demo dataset
The per-algorithm Examples in this section all run against one dataset: the
DBLP / AMiner Citation Network V12
— 4,894,081 papers connected by 45,564,149 directed citation edges
(src cites dst). It is small enough to fit on one GPU and rich enough to
ask real questions: every vertex is a paper with a title, year, venue, authors,
and fields of study, so every algorithm result can be joined back into a
human-readable story.
The examples were re-run against the local Iceberg REST catalog backed by RustFS on a single NVIDIA RTX PRO 6000 (96 GB VRAM). The same SQL can still read direct local Parquet tables, but the primary demo path is the Iceberg namespace used below.
Tables
The primary demo source is the local Iceberg REST namespace
lake.citation_network. It contains five Iceberg tables loaded from the
Parquet files generated by fixture/scripts/graph/dblp_ingest.py (see
fixture/data/demo/README.md in the repository):
| Table | Rows | One row is |
|---|---|---|
| citation_edges | 45.6M | a citation edge (src, dst, weight=1.0), clustered by src |
| citation_edges_by_dst | 45.6M | the same edges, clustered by dst for reverse traversal |
| papers | 4.9M | a paper: paper_id, title, year, venue, n_citation, n_references, primary_fos, … |
| paper_authors | 14.9M | a paper–author link |
| paper_fos | 45.0M | a paper–field-of-study link |
Load and serve the namespace with the local REST/RustFS stack:
docker compose -f fixture/iceberg-local/docker-compose.yml up -d
fixture/scripts/fixture.sh iceberg rest load --workload citation_network
cargo build --release --features server,cugraph,iceberg --bin datafusion_nexus_server
NEXUS_SERVER_CUGRAPH_ENABLED=true \
NEXUS_SERVER_NATIVE_MEMORY_POLICY=unbounded \
NEXUS_SERVER_BIND=127.0.0.1:50051 \
DATAFUSION_CATALOG_DEFAULT_CATALOG=datafusion \
DATAFUSION_CATALOG_DEFAULT_SCHEMA=public \
NEXUS_ICEBERG_CATALOG_KIND=rest \
NEXUS_ICEBERG_CATALOG_NAME=lake \
NEXUS_ICEBERG_NAMESPACE=citation_network \
NEXUS_ICEBERG_WAREHOUSE=s3://lakehouse/warehouse \
NEXUS_ICEBERG_REST_URI=http://localhost:8181 \
NEXUS_ICEBERG_S3_ENDPOINT=http://localhost:9000 \
NEXUS_ICEBERG_S3_REGION=us-east-1 \
NEXUS_ICEBERG_S3_PATH_STYLE=true \
NEXUS_ICEBERG_S3_ACCESS_KEY_ID=cudfadmin \
NEXUS_ICEBERG_S3_SECRET_ACCESS_KEY=cudfadminsecret \
NEXUS_SERVER_WORKSPACE_CATALOG=datafusion \
NEXUS_SERVER_WORKSPACE_SCHEMA=public \
NEXUS_SERVER_WORKSPACE_BACKING_CATALOG=lake \
NEXUS_SERVER_WORKSPACE_BACKING_SCHEMA=citation_network \
NEXUS_SERVER_WORKSPACE_BACKING_ALIASES=citation_edges,citation_edges_by_dst,papers,paper_authors,paper_fos \
flock /tmp/cudf-gpu.lock bash scripts/run_server.sh
The workspace overlay makes unqualified names such as citation_edges and
papers resolve through to lake.citation_network.*, while interactive DDL
stays local to datafusion.public. Prefer CREATE VIEW for SQL-defined
subgraphs. Use CREATE TABLE ... AS only when an example explicitly needs a
stable local snapshot of non-deterministic algorithm labels; it does not write
back to the Iceberg source catalog.
For direct-Parquet debugging without the REST catalog, start the server with Iceberg disabled and register the same logical names manually:
CREATE EXTERNAL TABLE citation_edges STORED AS PARQUET LOCATION '<data>/parquet/edges_by_src.parquet';
CREATE EXTERNAL TABLE citation_edges_by_dst STORED AS PARQUET LOCATION '<data>/parquet/edges_by_dst.parquet';
CREATE EXTERNAL TABLE papers STORED AS PARQUET LOCATION '<data>/parquet/profiles_by_paper_id.parquet';
CREATE EXTERNAL TABLE paper_authors STORED AS PARQUET LOCATION '<data>/parquet/paper_authors.parquet';
CREATE EXTERNAL TABLE paper_fos STORED AS PARQUET LOCATION '<data>/parquet/paper_fos.parquet';
AWS Glue can serve the same table layout for remote demos, but it requires an AWS account, a warehouse bucket, and credential setup. The local REST catalog is the reproducible path for the website examples.
The two-table pattern
cuGraph functions consume an edge relation and return vertex ids. The demos lean on one idiom throughout:
- SQL decides what graph the GPU sees: a CREATE VIEW with joins and WHERE clauses builds the subgraph (an era, a field, an ego network), and the view name is passed straight to the cugraph_* function.
- SQL turns vertex ids back into a story: the (vertex, value) output is joined to papers, windowed, aggregated, or anti-joined like any table.
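
The two steps of the idiom can be sketched on toy data. This is a minimal illustration, not the demo itself: it runs on an in-memory SQLite database instead of the Nexus server, the rows are invented, the view names edges_2010s and scores are hypothetical, and a plain in-degree count stands in for the cugraph_* call, whose exact signature is not shown in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE citation_edges (src INTEGER, dst INTEGER, weight REAL);

-- Toy rows for illustration only, not real DBLP data.
INSERT INTO papers VALUES (1, 'Old survey', 1995), (2, 'Method A', 2015),
                          (3, 'Method B', 2016), (4, 'Follow-up', 2018);
INSERT INTO citation_edges VALUES (2, 1, 1.0), (3, 1, 1.0), (3, 2, 1.0),
                                  (4, 2, 1.0), (4, 3, 1.0);

-- Step 1: SQL decides the graph. Keep edges whose endpoints are both from 2010 on.
CREATE VIEW edges_2010s AS
SELECT e.src, e.dst, e.weight
FROM citation_edges e
JOIN papers s ON s.paper_id = e.src
JOIN papers d ON d.paper_id = e.dst
WHERE s.year >= 2010 AND d.year >= 2010;

-- Stand-in for the algorithm output: any relation shaped (vertex, value).
-- Here it is just the in-degree inside the subgraph view.
CREATE VIEW scores AS
SELECT dst AS vertex, COUNT(*) AS value FROM edges_2010s GROUP BY dst;
""")

# Step 2: SQL turns vertex ids back into a story by joining to papers.
ranked = conn.execute("""
SELECT p.title, p.year, s.value
FROM scores s
JOIN papers p ON p.paper_id = s.vertex
ORDER BY s.value DESC, p.paper_id
""").fetchall()

for title, year, value in ranked:
    print(title, year, value)
```

In the real demos the scores relation would come from a cugraph_* function reading the view, but the shape of both steps stays the same: a view in, a (vertex, value) relation out, and an ordinary join afterwards.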
Seed papers used in the examples
| paper_id | Paper |
|---|---|
| 2963403868 | Attention Is All You Need (2017) |
| 2896457183 | BERT (2018) |
| 2163605009 | AlexNet (2012) |
| 1686810756 | VGG (2014) |
| 2194775991 | ResNet (2016) |
| 2064675550 | Long short-term memory (1997) |
| 2066636486 | The anatomy of a large-scale hypertextual Web search engine (the PageRank paper, 1998) |
Two practical caveats
Component and community ids (label, partition) are assigned per execution.
A view over a cugraph_* call re-runs the algorithm on every scan, so a query
that scans such a view twice (for example a CTE joined back to it) can join
mismatched labels. If one statement cannot keep the algorithm output single-use,
materialize once into the mutable datafusion.public workspace with
CREATE TABLE <name> AS SELECT …, then analyze the snapshot. That snapshot is
local session/workspace state, not an Iceberg table write.
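
The re-run hazard and the snapshot fix can be reproduced in miniature. The sketch below is an analogy under stated assumptions: it uses in-memory SQLite, and the built-in random() stands in for a per-execution label from a cugraph_* call. The names components and components_snapshot are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE papers (paper_id INTEGER PRIMARY KEY);
INSERT INTO papers VALUES (10), (20), (30);

-- A view is re-evaluated on every scan, so random() (the stand-in for a
-- per-execution label) yields fresh values each time the view is read.
CREATE VIEW components AS
SELECT paper_id AS vertex, random() AS label FROM papers;
""")

view_query = "SELECT vertex, label FROM components ORDER BY vertex"
first_scan = conn.execute(view_query).fetchall()
second_scan = conn.execute(view_query).fetchall()

# Materialize once, then analyze the snapshot instead of the view.
conn.execute(
    "CREATE TABLE components_snapshot AS SELECT vertex, label FROM components"
)
snap_query = "SELECT vertex, label FROM components_snapshot ORDER BY vertex"
snap_first = conn.execute(snap_query).fetchall()
snap_second = conn.execute(snap_query).fetchall()

print("view scans agree:", first_scan == second_scan)
print("snapshot scans agree:", snap_first == snap_second)
```

The vertices are identical across scans of the view; only the labels drift, which is exactly what makes a self-join on label unreliable until the output is frozen into a table.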
n_citation counts the whole world, in-degree counts this corpus.
papers.n_citation is AMiner's global citation count; an in-graph degree only
counts edges present in this dataset. The in-degree example
measures exactly that gap.
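
As a rough illustration of that gap, the following sketch compares the two counts on invented rows in an in-memory SQLite database. It is not the in-degree example itself; the column alias missing_from_corpus is hypothetical, and the LEFT JOIN is a design choice so that papers with no incoming edges in the corpus still appear with an in-degree of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, title TEXT, n_citation INTEGER);
CREATE TABLE citation_edges_by_dst (src INTEGER, dst INTEGER, weight REAL);

-- Toy rows for illustration only, not real DBLP data.
INSERT INTO papers VALUES (1, 'Widely cited', 50), (2, 'Niche', 2),
                          (3, 'Uncited here', 7);
INSERT INTO citation_edges_by_dst VALUES (2, 1, 1.0), (3, 1, 1.0), (3, 2, 1.0);
""")

gap = conn.execute("""
SELECT p.title,
       p.n_citation,                                   -- global count from metadata
       COUNT(e.src) AS in_degree,                      -- edges present in this corpus
       p.n_citation - COUNT(e.src) AS missing_from_corpus
FROM papers p
LEFT JOIN citation_edges_by_dst e ON e.dst = p.paper_id
GROUP BY p.paper_id, p.title, p.n_citation
ORDER BY missing_from_corpus DESC
""").fetchall()

for row in gap:
    print(row)
```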