Data & Code
A selection of publicly available data and code.
See the Publications page for the corresponding papers.
Data
-
CitationCS dataset
A citation dataset of computer science conference and journal papers.
It integrates information from OpenAlex, DBLP, CORE, and SCImago, and contains metadata for 3,720,575 papers and 22,908,275 internal citations.
The data is provided in JSON Lines format.
Paper 1: Kazuki Nakajima, Yuya Sasaki, Sohei Tokuno, and George Fletcher. Quantifying gendered citation imbalance in computer science conferences. Proc. AIES (2024).
Paper 2: Kazuki Nakajima, Yuya Sasaki, Sohei Tokuno, and George Fletcher. Systemic Gendered Citation Imbalance in Computer Science: Evidence from Conferences and Journals. Scientometrics (2025).
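Since the data is distributed as JSON Lines, each line can be parsed independently as one JSON object. The following is a minimal reading sketch; the field names "id", "year", and "references" are hypothetical placeholders, not the dataset's actual schema, and the two records are made up for illustration.

```python
import io
import json

# Hypothetical records: the keys "id", "year", and "references" are
# illustrative assumptions, not the real CitationCS schema.
sample = io.StringIO(
    '{"id": "W1", "year": 2019, "references": ["W2"]}\n'
    '{"id": "W2", "year": 2015, "references": []}\n'
)


def read_jsonl(handle):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


papers = list(read_jsonl(sample))
# Count citation links under the assumed "references" field.
n_citations = sum(len(p["references"]) for p in papers)
print(len(papers), n_citations)  # prints: 2 1
```

For the real files, `sample` would be replaced by an opened file handle; reading line by line keeps memory use low for a dataset of this size.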
-
cs-cocitations data
A co-citation hypergraph dataset that represents highly cited computer science papers as nodes and co-citation relations as hyperedges.
It was constructed from the OpenAlex Snapshot (2024-09-27) and contains 3,118 paper nodes and 53,886 hyperedges.
Each node has attributes such as OpenAlex work ID, paper title, publication date, topic, subfield, field, domain, and citation count.
Paper: Kazuki Nakajima, Yuya Sasaki, Takeaki Uno, and Masaki Aida. Learning Multi-Order Block Structure in Higher-Order Networks. arXiv preprint (2025).
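As a rough illustration of the node and hyperedge structure described above, a hypergraph can be held in memory as a list of node sets. The sketch below uses a made-up toy hypergraph (the node IDs "A" to "D" are placeholders, not OpenAlex work IDs) and computes two basic summaries: node degree and the hyperedge size distribution.

```python
from collections import Counter

# Toy hypergraph for illustration only; in the real data each node would be
# a paper and each hyperedge a co-citation relation among papers.
hyperedges = [
    {"A", "B"},
    {"A", "B", "C"},
    {"C", "D"},
]

# Node degree: the number of hyperedges that contain the node.
degree = Counter(node for edge in hyperedges for node in edge)

# Hyperedge size distribution: how many hyperedges have each size.
size_counts = Counter(len(edge) for edge in hyperedges)

print(sorted(degree.items()))
print(sorted(size_counts.items()))
```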
Code
-
HyperMOSBM
Python code for a stochastic block model that learns the block structure of higher-order networks for each interaction order (hyperedge size).
It infers an optimal partition of the set of interaction orders and captures the mesoscopic structure of hypergraphs.
Paper: Kazuki Nakajima, Yuya Sasaki, Takeaki Uno, and Masaki Aida. Learning Multi-Order Block Structure in Higher-Order Networks. arXiv preprint (2025).
-
Researcher population pyramids visualization tool
Python code for visualizing and diagnosing changes in the researcher population structure and gender balance of each country using publication data.
It constructs researcher population pyramids from per-author publication-year sequences and supports the analysis of demographic dynamics and projected changes in the research ecosystem.
Paper: Kazuki Nakajima and Takayuki Mizuno. Researcher Population Pyramids: Tracking Demographic and Gender Trajectories Across Countries. PNAS Nexus (2025).
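One way to picture the construction from per-author publication-year sequences is sketched below. It is not the tool's implementation: the input layout, the definition of an active researcher (the reference year falls between the first and last publication year), and the 5-year bins of career age are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical input: one (gender, publication years) pair per author.
authors = [
    ("F", [2010, 2012, 2020]),
    ("M", [2015, 2016]),
    ("F", [2018, 2021]),
    ("M", [2005, 2008]),
]


def pyramid(authors, year, bin_width=5):
    """Count active authors in `year` by gender and binned career age.

    Assumption: an author is active if `year` lies between their first and
    last publication year; career age is years since the first publication.
    """
    counts = Counter()
    for gender, years in authors:
        first, last = min(years), max(years)
        if first <= year <= last:
            career_age = year - first
            bin_start = (career_age // bin_width) * bin_width
            counts[(gender, bin_start)] += 1
    return counts


counts_2019 = pyramid(authors, 2019)
for (gender, bin_start), n in sorted(counts_2019.items()):
    print(gender, f"{bin_start}-{bin_start + 4} years", n)
```

Each (gender, bin) count corresponds to one bar of a pyramid for the chosen year; repeating the call over several years gives a sequence of pyramids that can be compared over time.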
-
HyperNEO
Python code for inferring and visualizing the community structure of attributed hypergraphs. By combining a mixed-membership stochastic block model for hypergraphs with a dimensionality reduction method, it can infer overlapping community structure of nodes and visualize it together with attribute information.
Paper: Kazuki Nakajima and Takeaki Uno. Inference and Visualization of Community Structure in Attributed Hypergraphs Using Mixed-Membership Stochastic Block Models. Social Network Analysis and Mining (2025).
-
hyper-dK-series
Python/C++ code for generating reference and null models of hypergraphs.
Depending on the parameters dv = 0, 1, 2, 2.5 and de = 0, 1, it generates randomized hypergraphs that preserve statistics such as node degree, degree correlation, redundancy coefficient, and hyperedge size.
The Python implementation is accelerated with Numba and also includes code for higher-order rich-club detection.
Paper: Kazuki Nakajima, Kazuyuki Shudo, and Naoki Masuda. Randomizing hypergraphs preserving degree correlation and local clustering. IEEE Transactions on Network Science and Engineering (2022).
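To give a feel for what preserving node degree and hyperedge size means, here is a minimal stub-shuffling sketch. It is not the algorithm used in hyper-dK-series: it only keeps every node's degree and every hyperedge's size, ignores degree correlation and the redundancy coefficient, and may place the same node twice in one hyperedge. The function name and toy data are hypothetical.

```python
import random
from collections import Counter


def shuffle_hypergraph(hyperedges, seed=None):
    """Reassign node memberships at random, keeping degrees and sizes.

    Each membership of a node in a hyperedge is treated as a stub. All stubs
    are shuffled and dealt back into hyperedges of the original sizes.
    Caveat: a node can end up repeated inside a single hyperedge.
    """
    rng = random.Random(seed)
    stubs = [node for edge in hyperedges for node in edge]
    rng.shuffle(stubs)
    randomized, start = [], 0
    for edge in hyperedges:
        randomized.append(stubs[start:start + len(edge)])
        start += len(edge)
    return randomized


# Toy hypergraph, stored as lists so that repeated nodes stay visible.
original = [["A", "B"], ["A", "B", "C"], ["C", "D"], ["A", "D", "E", "F"]]
randomized = shuffle_hypergraph(original, seed=0)

degree_before = Counter(v for e in original for v in e)
degree_after = Counter(v for e in randomized for v in e)
print(degree_before == degree_after)
print([len(e) for e in original] == [len(e) for e in randomized])
```

Comparing a statistic on the original hypergraph against its distribution over many such randomizations is the usual way a reference model is used.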
-
dK-series
A Python package for generating reference models of unweighted networks.
Depending on the parameter d = 0, 1, 1.5, 2, 2.5, it generates random graphs that preserve increasingly detailed statistics of the original network, from the number of edges through the degree distribution and degree correlation to the clustering coefficient.
Released in March 2026 as the Python package dk_series, with support for simple-graph sampling at d = 1 and d = 2.
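The idea of a degree-preserving reference model for simple graphs can be illustrated with the classic double-edge swap. The sketch below is written in plain Python and does not use or reproduce the dk_series package API; the function name, the attempt limit, and the toy graph are assumptions made for illustration.

```python
import random
from collections import Counter


def degree_preserving_rewire(edges, n_swaps, seed=None):
    """Randomize a simple undirected graph by double-edge swaps.

    A swap replaces edges (a, b) and (c, d) with (a, d) and (c, b), which
    keeps every node's degree. Swaps that would create a self-loop or a
    duplicate edge are rejected, so the graph stays simple.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present or new1 == new2:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


# Toy graph: a 6-node ring plus two chords.
graph = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3), (1, 4)]
rewired = degree_preserving_rewire(graph, n_swaps=20, seed=1)

deg_before = Counter(v for e in graph for v in e)
deg_after = Counter(v for e in rewired for v in e)
print(deg_before == deg_after)
```

The attempt limit only guards against graphs where few valid swaps exist; it does not make the sampling uniform, which is a separate question that this sketch does not address.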