Data & Code

A selection of publicly available data and code. See the Publications page for the corresponding papers.

Web App

Researcher Population Pyramids
An interactive web app for visualizing and exploring the researcher population structure and gender balance of each country. It lets you browse researcher population pyramids across countries and years directly in your browser, without writing any code.

Paper: Kazuki Nakajima and Takayuki Mizuno. Researcher Population Pyramids: Tracking Demographic and Gender Trajectories Across Countries. PNAS Nexus (2025).

CitationCS dataset
A citation dataset of computer science conference and journal papers. It integrates information from OpenAlex, DBLP, CORE, and SCImago, and contains metadata for 3,720,575 papers and 22,908,275 internal citations. The data is provided in JSON Lines format.
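
Since the data is in JSON Lines format, each line can be parsed on its own as one JSON object, so the file can be streamed without loading it fully into memory. The following is a minimal sketch of such a reader. The field names `id` and `references` are hypothetical placeholders used only for illustration and may not match the dataset's actual schema.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record sample; the real field names may differ.
sample = io.StringIO(
    '{"id": "W1", "references": ["W2"]}\n'
    '{"id": "W2", "references": []}\n'
)
records = list(iter_jsonl(sample))

# Count citations whose target is also in the sample ("internal" citations).
ids = {r["id"] for r in records}
n_internal = sum(1 for r in records for ref in r["references"] if ref in ids)
```

For a file on disk, the same function can be passed a handle opened with `open(path, encoding="utf-8")`, where `path` is wherever the downloaded file was saved.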

Paper 1: Kazuki Nakajima, Yuya Sasaki, Sohei Tokuno, and George Fletcher. Quantifying gendered citation imbalance in computer science conferences. Proc. AIES (2024).

Paper 2: Kazuki Nakajima, Yuya Sasaki, Sohei Tokuno, and George Fletcher. Systemic Gendered Citation Imbalance in Computer Science: Evidence from Conferences and Journals. Scientometrics (2025).

Co-citation hypergraph data
A series of co-citation hypergraph datasets, each representing highly cited papers in a research field as nodes and co-citation relations as hyperedges. Every dataset was constructed from the OpenAlex Snapshot (2024-09-27), and each node has attributes such as OpenAlex work ID, paper title, publication date, topic, subfield, field, domain, and citation count. The datasets are also available as part of XGI-DATA.

A dataset is provided for each research field:
- cs-cocitations (Computer Science): 3,118 nodes and 53,886 hyperedges.
- biochem-cocitations (Biochemistry, Genetics and Molecular Biology): 8,998 nodes and 50,289 hyperedges.
- math-cocitations (Mathematics): 2,972 nodes and 17,099 hyperedges.
- neuro-cocitations (Neuroscience): 4,267 nodes and 16,771 hyperedges.
- physics-cocitations (Physics and Astronomy): 5,347 nodes and 42,535 hyperedges.

Paper: Kazuki Nakajima, Yuya Sasaki, Takeaki Uno, and Masaki Aida. Learning Multi-Order Block Structure in Higher-Order Networks. arXiv preprint (2025).
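
To illustrate what "co-citation relations as hyperedges" can mean for the datasets above, here is a small sketch under one plausible reading: each citing paper contributes a hyperedge made of the highly cited papers (nodes) that it cites together. This is an assumption for illustration and not necessarily the construction used for the released datasets. The function name, the `min_size` parameter, and the toy identifiers are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set


def cocitation_hyperedges(
    references: Dict[str, Iterable[str]],
    nodes: Set[str],
    min_size: int = 2,
) -> Set[FrozenSet[str]]:
    """Build hyperedges from reference lists (assumed construction).

    Each citing paper yields the set of node papers it cites; sets smaller
    than `min_size` are dropped and identical sets are merged.
    """
    hyperedges = set()
    for cited in references.values():
        edge = frozenset(cited) & nodes  # keep only highly cited papers
        if len(edge) >= min_size:
            hyperedges.add(edge)
    return hyperedges


# Toy example: A, B, C are highly cited papers; P1..P4 are citing papers.
nodes = {"A", "B", "C"}
references = {
    "P1": ["A", "B", "X"],  # X is not a node and is ignored
    "P2": ["A", "B", "C"],
    "P3": ["C"],            # a single cited node gives no co-citation
    "P4": ["B", "A"],       # same set as P1, merged
}
hyperedges = cocitation_hyperedges(references, nodes)
```

Whether repeated hyperedges are merged or kept as multi-hyperedges is a design choice; this sketch merges them.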

HyperMOSBM
Python code for a stochastic block model that learns the block structure of higher-order networks for each interaction order (hyperedge size). It infers an optimal partition of the set of interaction orders and captures the mesoscopic structure of hypergraphs.

Paper: Kazuki Nakajima, Yuya Sasaki, Takeaki Uno, and Masaki Aida. Learning Multi-Order Block Structure in Higher-Order Networks. arXiv preprint (2025).

Researcher population pyramids visualization tool
Python code for visualizing and diagnosing changes in the researcher population structure and gender balance of each country using publication data. It constructs researcher population pyramids from per-author publication-year sequences and supports the analysis of demographic dynamics and projected changes in the research ecosystem.
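
As a rough illustration of how a pyramid can be derived from per-author publication-year sequences, the sketch below counts active researchers by gender and career-age bin for a given year. It assumes, purely for illustration, that a researcher is active between their first and last publication years and that career age is measured from the first publication year. The function name, these definitions, and the 5-year bin width are assumptions, not the tool's actual interface or method.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def population_pyramid(
    authors: Sequence[Tuple[str, List[int]]],
    year: int,
    bin_width: int = 5,
) -> Counter:
    """Count active authors per (gender, career-age bin) in `year`.

    `authors` holds (gender, publication years) pairs. An author counts as
    active if `year` lies between their first and last publication years.
    """
    pyramid = Counter()
    for gender, years in authors:
        if not years:
            continue
        first, last = min(years), max(years)
        if first <= year <= last:
            age_bin = (year - first) // bin_width * bin_width
            pyramid[(gender, age_bin)] += 1
    return pyramid


# Toy data: (gender, publication years)
authors = [
    ("F", [2010, 2015, 2020]),
    ("M", [2018, 2020]),
    ("M", [2001, 2004]),  # inactive in 2020
    ("F", [2019, 2021]),
]
pyramid = population_pyramid(authors, year=2020)
```

Each key of `pyramid` is a (gender, lower bound of the age bin) pair, which maps directly onto the bars of a two-sided horizontal bar chart.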

Paper: Kazuki Nakajima and Takayuki Mizuno. Researcher Population Pyramids: Tracking Demographic and Gender Trajectories Across Countries. PNAS Nexus (2025).

HyperNEO
Python code for inferring and visualizing the community structure of attributed hypergraphs. By combining a mixed-membership stochastic block model for hypergraphs with a dimensionality reduction method, it can infer overlapping community structure of nodes and visualize it together with attribute information.

Paper: Kazuki Nakajima and Takeaki Uno. Inference and Visualization of Community Structure in Attributed Hypergraphs Using Mixed-Membership Stochastic Block Models. Social Network Analysis and Mining (2025).

hyper-dK-series
Python/C++ code for generating reference and null models of hypergraphs. Depending on the parameters d_v = 0, 1, 2, 2.5 and d_e = 0, 1, it generates randomized hypergraphs that preserve statistics such as node degree, degree correlation, redundancy coefficient, and hyperedge size. The Python implementation is accelerated with Numba and also includes code for higher-order rich-club detection.
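
To give a feel for what preserving node degree and hyperedge size involves, here is a simplified stand-in: a stub-shuffling randomization that keeps every node's degree and every hyperedge's size fixed. It is not the algorithm implemented in the repository. The function name is hypothetical, and the sketch allows a node to appear more than once in the same hyperedge, which a careful implementation would need to handle.

```python
import random
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def shuffle_hypergraph(
    hyperedges: Sequence[Sequence[Hashable]],
    seed: Optional[int] = None,
) -> List[List[Hashable]]:
    """Randomize memberships, keeping node degrees and hyperedge sizes.

    Every (node, hyperedge) membership is a stub. Stubs are pooled,
    shuffled, and dealt back into hyperedges of the original sizes.
    A node may land in the same hyperedge twice (not filtered here).
    """
    rng = random.Random(seed)
    stubs = [v for edge in hyperedges for v in edge]
    rng.shuffle(stubs)
    randomized, start = [], 0
    for edge in hyperedges:
        randomized.append(stubs[start:start + len(edge)])
        start += len(edge)
    return randomized


original = [[0, 1, 2], [1, 2], [2, 3, 4, 5], [0, 5]]
randomized = shuffle_hypergraph(original, seed=1)

# Degrees are counted with multiplicity.
degree_before = Counter(v for e in original for v in e)
degree_after = Counter(v for e in randomized for v in e)
```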

Paper: Kazuki Nakajima, Kazuyuki Shudo, and Naoki Masuda. Randomizing hypergraphs preserving degree correlation and local clustering. IEEE Transactions on Network Science and Engineering (2022).

dK-series
A Python package for generating reference models of unweighted networks. Depending on the parameter d (0, 1, 1.5, 2, or 2.5), it generates random graphs that preserve progressively more statistics of the original network: the number of edges, the degree distribution, the degree correlation, and the clustering coefficient. It was released in March 2026 as the Python package dk_series and supports simple-graph sampling at d = 1 and d = 2.
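
As an illustration of degree-preserving randomization on simple graphs, the sketch below uses double edge swaps, a standard technique that keeps every node's degree fixed while avoiding self-loops and multi-edges. It is a generic example and does not use or reproduce the dk_series package's API. The function name and parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Edge = Tuple[int, int]


def double_edge_swap(
    edges: Sequence[Edge], n_swaps: int, seed: Optional[int] = None
) -> List[Edge]:
    """Rewire a simple undirected graph while keeping all node degrees.

    Repeatedly picks edges (a, b) and (c, d) and replaces them with
    (a, d) and (c, b), rejecting swaps that would create a self-loop or
    a duplicate edge. Node labels must be mutually comparable.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # pick one of the two possible rewirings
            c, d = d, c
        if a == d or c == b:  # would create a self-loop
            continue
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new1 in edge_set or new2 in edge_set:  # would duplicate an edge
            continue
        edge_set.difference_update((edge_list[i], edge_list[j]))
        edge_set.update((new1, new2))
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list


def degrees(edge_list: Sequence[Edge]) -> dict:
    """Return the degree of every node that appears in `edge_list`."""
    deg = {}
    for u, v in edge_list:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 3)]
rewired = double_edge_swap(original, n_swaps=10, seed=0)
```

Swap-based rewiring only approaches a uniform sample after enough successful swaps, so the number of swaps is a tuning choice left to the user.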