Citation network demo dataset

The per-algorithm Examples in this section all run against one dataset: the DBLP / AMiner Citation Network V12, with 4,894,081 papers connected by 45,564,149 directed citation edges (src cites dst). It is small enough to fit on one GPU and rich enough to ask real questions: every vertex is a paper with a title, year, venue, authors, and fields of study, so every algorithm result can be joined back into a human-readable story.

The examples were re-run against the local Iceberg REST catalog backed by RustFS on a single NVIDIA RTX PRO 6000 (96 GB VRAM). The same SQL can still read direct local Parquet tables, but the primary demo path is the Iceberg namespace used below.

Tables

The primary demo source is the local Iceberg REST namespace lake.citation_network. It contains five Iceberg tables loaded from the Parquet files generated by fixture/scripts/graph/dblp_ingest.py (see fixture/data/demo/README.md in the repository):

Table                 | Rows  | One row is
citation_edges        | 45.6M | a citation edge (src, dst, weight=1.0), clustered by src
citation_edges_by_dst | 45.6M | the same edges, clustered by dst for reverse traversal
papers                | 4.9M  | a paper: paper_id, title, year, venue, n_citation, n_references, primary_fos, …
paper_authors         | 14.9M | a paper–author link
paper_fos             | 45.0M | a paper–field-of-study link

Load and serve the namespace with the local REST/RustFS stack:

docker compose -f fixture/iceberg-local/docker-compose.yml up -d
fixture/scripts/fixture.sh iceberg rest load --workload citation_network

cargo build --release --features server,cugraph,iceberg --bin datafusion_nexus_server

NEXUS_SERVER_CUGRAPH_ENABLED=true \
NEXUS_SERVER_NATIVE_MEMORY_POLICY=unbounded \
NEXUS_SERVER_BIND=127.0.0.1:50051 \
DATAFUSION_CATALOG_DEFAULT_CATALOG=datafusion \
DATAFUSION_CATALOG_DEFAULT_SCHEMA=public \
NEXUS_ICEBERG_CATALOG_KIND=rest \
NEXUS_ICEBERG_CATALOG_NAME=lake \
NEXUS_ICEBERG_NAMESPACE=citation_network \
NEXUS_ICEBERG_WAREHOUSE=s3://lakehouse/warehouse \
NEXUS_ICEBERG_REST_URI=http://localhost:8181 \
NEXUS_ICEBERG_S3_ENDPOINT=http://localhost:9000 \
NEXUS_ICEBERG_S3_REGION=us-east-1 \
NEXUS_ICEBERG_S3_PATH_STYLE=true \
NEXUS_ICEBERG_S3_ACCESS_KEY_ID=cudfadmin \
NEXUS_ICEBERG_S3_SECRET_ACCESS_KEY=cudfadminsecret \
NEXUS_SERVER_WORKSPACE_CATALOG=datafusion \
NEXUS_SERVER_WORKSPACE_SCHEMA=public \
NEXUS_SERVER_WORKSPACE_BACKING_CATALOG=lake \
NEXUS_SERVER_WORKSPACE_BACKING_SCHEMA=citation_network \
NEXUS_SERVER_WORKSPACE_BACKING_ALIASES=citation_edges,citation_edges_by_dst,papers,paper_authors,paper_fos \
flock /tmp/cudf-gpu.lock bash scripts/run_server.sh

The workspace overlay makes unqualified names such as citation_edges and papers resolve through to lake.citation_network.*, while interactive DDL stays local to datafusion.public. Prefer CREATE VIEW for SQL-defined subgraphs. Use CREATE TABLE ... AS only when an example explicitly needs a stable local snapshot of non-deterministic algorithm labels; it does not write back to the Iceberg source catalog.

For direct-Parquet debugging without the REST catalog, start the server with Iceberg disabled and register the same logical names manually:

CREATE EXTERNAL TABLE citation_edges STORED AS PARQUET LOCATION '<data>/parquet/edges_by_src.parquet';
CREATE EXTERNAL TABLE citation_edges_by_dst STORED AS PARQUET LOCATION '<data>/parquet/edges_by_dst.parquet';
CREATE EXTERNAL TABLE papers STORED AS PARQUET LOCATION '<data>/parquet/profiles_by_paper_id.parquet';
CREATE EXTERNAL TABLE paper_authors STORED AS PARQUET LOCATION '<data>/parquet/paper_authors.parquet';
CREATE EXTERNAL TABLE paper_fos STORED AS PARQUET LOCATION '<data>/parquet/paper_fos.parquet';

AWS Glue can serve the same table layout for remote demos, but it requires an AWS account, a warehouse bucket, and credential setup. The local REST catalog is the reproducible path for the website examples.

The two-table pattern

cuGraph functions consume an edge relation and return vertex ids. The demos lean on one idiom throughout:

  1. SQL decides what graph the GPU sees — a CREATE VIEW with joins and WHERE clauses builds the subgraph (an era, a field, an ego network), and the view name is passed straight to the cugraph_* function.
  2. SQL turns vertex ids back into a story — the (vertex, value) output is joined to papers, windowed, aggregated, or anti-joined like any table.
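
The two steps above can be sketched end to end on toy data. This is a minimal illustration that uses Python's built-in sqlite3 as a stand-in engine, not the demo server. The rows are invented, the view names edges_since_2017 and vertex_scores are hypothetical, and a plain SQL in-degree count stands in for the (vertex, value) output of a cugraph_* call, whose exact signature is not shown here.

```python
import sqlite3

# Toy stand-in for the demo tables. Column names follow the Tables section;
# the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE citation_edges (src INTEGER, dst INTEGER, weight REAL);
INSERT INTO papers VALUES (1, 'Old survey', 1997), (2, 'Paper A', 2017),
                          (3, 'Paper B', 2018), (4, 'Paper C', 2019);
INSERT INTO citation_edges VALUES (2, 1, 1.0), (3, 1, 1.0), (3, 2, 1.0),
                                  (4, 2, 1.0), (4, 3, 1.0);

-- Step 1: SQL decides what graph the algorithm sees (both endpoints from 2017 on).
CREATE VIEW edges_since_2017 AS
SELECT e.src, e.dst, e.weight
FROM citation_edges e
JOIN papers s ON s.paper_id = e.src
JOIN papers d ON d.paper_id = e.dst
WHERE s.year >= 2017 AND d.year >= 2017;

-- Hypothetical stand-in for a graph function's (vertex, value) output:
-- in-degree inside the subgraph, computed in plain SQL.
CREATE VIEW vertex_scores AS
SELECT dst AS vertex, COUNT(*) AS value
FROM edges_since_2017
GROUP BY dst;
""")

# Step 2: SQL turns vertex ids back into a story by joining to papers.
story = conn.execute("""
    SELECT p.title, p.year, r.value
    FROM vertex_scores r
    JOIN papers p ON p.paper_id = r.vertex
    ORDER BY r.value DESC, p.paper_id
""").fetchall()

for title, year, value in story:
    print(f"{title} ({year}): {value}")
```

The edges that point at the 1997 paper never reach the algorithm stand-in, because the view filters them out before any graph work happens. Changing the WHERE clause is all it takes to ask the same question about a different era or field.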

Seed papers used in the examples

paper_id   | Paper
2963403868 | Attention Is All You Need (2017)
2896457183 | BERT (2018)
2163605009 | AlexNet (2012)
1686810756 | VGG (2014)
2194775991 | ResNet (2016)
2064675550 | Long short-term memory (1997)
2066636486 | The anatomy of a large-scale hypertextual Web search engine (the PageRank paper, 1998)

Two practical caveats

Snapshot non-deterministic labels before self-joining

Component and community ids (label, partition) are assigned per execution. A view over a cugraph_* call re-runs the algorithm on every scan, so a query that scans such a view twice (for example a CTE joined back to it) can join mismatched labels. If one statement cannot keep the algorithm output single-use, materialize once into the mutable datafusion.public workspace with CREATE TABLE <name> AS SELECT …, then analyze the snapshot. That snapshot is local session/workspace state, not an Iceberg table write.
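
The effect can be reproduced in miniature. The sketch below again uses Python's built-in sqlite3 as a stand-in engine rather than the demo server. SQLite's random() plays the role of an algorithm that assigns fresh ids per execution, and the names vertices, labels, and labels_snapshot are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vertices (vertex INTEGER PRIMARY KEY);
INSERT INTO vertices VALUES (10), (20), (30);

-- Stand-in for a view over a non-deterministic algorithm call:
-- random() yields new label values every time the view is scanned.
CREATE VIEW labels AS SELECT vertex, random() AS label FROM vertices;
""")

def scan(relation):
    """Read a relation once and return its {vertex: label} mapping."""
    return dict(conn.execute(f"SELECT vertex, label FROM {relation}"))

# Two scans of the view are two executions, so the labels are not comparable.
view_first, view_second = scan("labels"), scan("labels")

# Materialize once, then analyze the snapshot as often as needed.
conn.execute("CREATE TABLE labels_snapshot AS SELECT vertex, label FROM labels")
snap_first, snap_second = scan("labels_snapshot"), scan("labels_snapshot")

print("view scans agree:    ", view_first == view_second)
print("snapshot scans agree:", snap_first == snap_second)
```

The snapshot costs one extra statement, but every later join or aggregation sees the same labels.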

n_citation counts the whole world, in-degree counts this corpus

papers.n_citation is AMiner's global citation count; an in-graph degree only counts edges present in this dataset. The in-degree example measures exactly that gap.
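
A minimal sketch of that comparison on invented rows is shown below. It uses Python's built-in sqlite3 rather than the demo server, and the column alias missing_from_corpus is hypothetical. The in-degree example in this section remains the reference version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, title TEXT, n_citation INTEGER);
CREATE TABLE citation_edges_by_dst (src INTEGER, dst INTEGER, weight REAL);
-- Invented rows: n_citation is the global count, edges are what the corpus holds.
INSERT INTO papers VALUES (1, 'Paper A', 120), (2, 'Paper B', 3), (3, 'Paper C', 0);
INSERT INTO citation_edges_by_dst VALUES (2, 1, 1.0), (3, 1, 1.0), (3, 2, 1.0);
""")

# LEFT JOIN keeps papers with no incoming edge; COUNT(e.src) is 0 for them.
gap = conn.execute("""
    SELECT p.paper_id,
           p.n_citation,
           COUNT(e.src) AS in_degree,
           p.n_citation - COUNT(e.src) AS missing_from_corpus
    FROM papers p
    LEFT JOIN citation_edges_by_dst e ON e.dst = p.paper_id
    GROUP BY p.paper_id, p.n_citation
    ORDER BY p.paper_id
""").fetchall()

for row in gap:
    print(row)
```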