Databases Centralized for AI Development
Samantha Hoffman
Executive Summary:
A new long-term plan to build data infrastructure calls for the construction of sector-specific databases to underpin artificial intelligence (AI) development in the People’s Republic of China (PRC). The plan is tied to a desire for social control and for achieving national security objectives through the promotion of “high quality datasets.”
The PRC has identified “physically distributed but logically centralized” data management as a central goal. Its new National Dataset Management Service Platform is intended to aggregate data collected across government agencies for use in AI model training, while centrally sanctioned data exchanges allow the government to audit and track data flows.
The datasets will also be deployed to improve the effectiveness of the social credit system and the public security apparatus through increased data-sharing across government institutions.
On June 3, 2026, the National Data Administration (NDA) issued its Implementation Plan for Promoting the Construction of High-Quality Industry Datasets (关于推进行业高质量数据集建设行动的实施方案) (NDA, June 8). The plan focuses on building the data foundations on which the PRC’s artificial intelligence (AI) systems depend. It directs the construction of sector-specific datasets for AI model training across 19 sectors, including healthcare, e-commerce, public security, urban governance, and social credit, and five innovation areas, such as the low-altitude economy. [1] As with all aspects of PRC data governance, commercial and administrative incentives coexist with the implementation of the Party’s social control and national security objectives.
High-Quality Datasets
The term “high-quality dataset” (高质量数据集) has a specific official definition that is both technical and political in nature. According to technical standards issued in August 2025 by the “National Technical Committee 609 on Data of Standardization Administration of China” (SAC/TC609; 全国数据标准化技术委员会), a high-quality dataset is defined as a collection of data that has undergone collection, processing, and other data handling, can be directly used for developing and training AI models, and can effectively improve model performance. [2] Dataset quality is reflected across five dimensions: “large” scale, “robust” security, “correct” viewpoint, “good” results, and “wide” application, and can be measured using both static and dynamic quality evaluation methods (NDA, August 2025). Data security compliance is also defined within a series of quality compliance indicators, which state that data in datasets for AI model development and training shall “not contain illegal content including content that violates socialist core values, discriminatory content, commercial violations, or content that infringes upon the legitimate rights and interests of others” (不包含违反社会主义核心价值观的内容、歧视性内容、商业违法违规、侵犯他人合法权益等非法内容) (SAC/TC609, August 2025).
The same standards establish a three-tier classification system for high-quality datasets, arranged by sensitivity. “General knowledge datasets” (通识数据集) contain knowledge accessible to the public without specialist background, sourced from encyclopedias and Internet platforms, and are rated low sensitivity. “Industry general knowledge datasets” (行业通识数据集) contain knowledge requiring professional background, sourced from academic papers, industry reports, and official documents, and are also rated low sensitivity. “Industry professional knowledge datasets” (行业专识数据集) contain knowledge requiring deep professional background and operational experience, sourced from organizations’ internal business systems and management platforms, and are rated as highly sensitive (SAC/TC609, August 29, 2025).
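The three tiers can be read as a simple lookup from dataset type to typical sources and sensitivity rating. The Python sketch below is illustrative only: the class names, field names, and English labels are assumptions made for this example and are not taken from the SAC/TC609 standard, which defines the categories in prose.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class DatasetTier:
    # All field names are hypothetical; they mirror the article's summary.
    name_en: str
    name_zh: str
    typical_sources: tuple
    sensitivity: Sensitivity


# Tier descriptions paraphrase the classification as summarized above.
TIERS = (
    DatasetTier("general knowledge", "通识数据集",
                ("encyclopedias", "Internet platforms"), Sensitivity.LOW),
    DatasetTier("industry general knowledge", "行业通识数据集",
                ("academic papers", "industry reports", "official documents"),
                Sensitivity.LOW),
    DatasetTier("industry professional knowledge", "行业专识数据集",
                ("internal business systems", "management platforms"),
                Sensitivity.HIGH),
)


def highly_sensitive(tiers=TIERS):
    """Return the English names of tiers rated highly sensitive."""
    return [t.name_en for t in tiers if t.sensitivity is Sensitivity.HIGH]


print(highly_sensitive())
```

Only the third tier, which draws on organizations’ internal systems, is returned by the filter.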
The NDA’s High-Quality Dataset Construction Guidelines (高质量数据集建设指引), a methodology document on building high-quality datasets for AI training issued by the NDA in August 2025, recorded that “more than 35,000 high-quality datasets had been built nationally as of June 2025, with a total volume exceeding 400 [petabytes]” (截至2025年6月,全国建设高质量数据集超3.5万个、总量超400PB) (NDA, August 2025). The guidelines identify three main technical approaches to data collection:
Multi-source heterogeneous data fusion collection: pulling together disparate structured and unstructured data streams into a unified pipeline;
Edge-side collection: local real-time capture at the point of generation, before data travels to central systems; and
Synthetic data generation: for scenarios in which real operational data is scarce or too sensitive to use directly.
Each reflects a different aspect of the PRC’s existing data infrastructure, as well as its relationship to the challenges of building AI training data at scale. Social credit, for example, illustrates the first. Social credit data is inherently multi-source and heterogeneous because it is distributed across separate authorities, including public security, courts, tax, and market regulators. The National Credit Information Sharing Platform (全国信用信息共享平台) and the unified social credit code system have been designed to improve aggregation and information sharing across silos (People’s Daily, April 1, 2025). Progress has been uneven, but the direction and intent are clear, and iterative improvements will continue to bring those objectives closer to realization.
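To make the idea of multi-source heterogeneous fusion concrete, the following Python sketch merges records held by separate authorities into one profile per entity, keyed on a shared identifier in the way a unified code is meant to allow. It is a minimal illustration under stated assumptions: the record layouts, field names, entity codes, and the function fuse_by_code are all invented for this example and do not describe any actual PRC system.

```python
from collections import defaultdict

# Hypothetical records from three separate authorities. Field names,
# values, and entity codes are invented for illustration only.
court_records = [{"code": "ENTITY-A", "judgment_defaults": 1}]
tax_records = [
    {"code": "ENTITY-A", "tax_arrears": False},
    {"code": "ENTITY-B", "tax_arrears": True},
]
regulator_records = [{"code": "ENTITY-B", "licence_status": "active"}]


def fuse_by_code(**sources):
    """Merge per-agency records into one profile per shared entity code."""
    profiles = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            code = record["code"]
            for field_name, value in record.items():
                if field_name != "code":
                    # Prefix each field with its source so that provenance
                    # survives the merge.
                    profiles[code][f"{source_name}.{field_name}"] = value
    return dict(profiles)


profiles = fuse_by_code(
    courts=court_records,
    tax=tax_records,
    market_regulator=regulator_records,
)
for code, profile in sorted(profiles.items()):
    print(code, profile)
```

The shared key does the work here: without a common identifier across agencies, the three record sets could not be joined reliably, which is the silo problem the paragraph above describes.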
Physically Distributed but Logically Centralized
The new implementation plan includes dataset management provisions that require full lifecycle oversight covering collection, cleaning, processing, labeling, quality inspection, evaluation, iteration, and audit. The provisions subject data to state management within a system described as “physically distributed but logically centralized” (物理分散、逻辑集中) (NDA, June 8).
This approach is not new: the formulation appeared as the governing principle for cross-departmental data sharing under the 2015 “Big Data Development Action Outline” (促进大数据发展行动纲要) from the National Development and Reform Commission (NDRC; 国家发展和改革委员会), and at the time the national e-government extranet connecting 118 central government units and the social credit system’s information sharing infrastructure were among the systems built on that basis (NDRC, September 25, 2015).
The National Dataset Management Service Platform (国家数据集管理服务平台), built and operated by the National Data Development Research Institute (国家数据发展研究院) under NDA guidance, launched for a trial period at the Digital China Construction Summit (数字中国建设峰会) on April 29, 2026, just before the plan was formally issued (NDA, April 29). The platform provides dataset catalogue management and construction monitoring functions to data management authorities, publication and quality evaluation functions to dataset suppliers, and search and retrieval functions to dataset users.
One of the Party-state’s instruments for addressing the structural weaknesses associated with achieving “physically distributed but logically centralized” (物理分散、逻辑集中) data management has been the rollout of “data exchanges” (数据交易所). Data exchanges are state-supervised marketplaces designed to make bulk data transactable, fusible, and legible across the departmental and jurisdictional lines that fragmented governance had drawn (National Information Center, February 20, 2024). The first exchange was established in Guiyang in 2014, but exchanges began proliferating after the 13th Five-Year Plan, which called for implementing a “national big data strategy” (国家大数据战略) and promoting the open sharing of data resources (Study China, November 12, 2015; NPC, March 2016; CAC, May 18, 2021). Relatedly, the new implementation plan encourages datasets to be listed and traded through data exchanges and data circulation service institutions, with the objective of upgrading existing trading models from basic data packages to application programming interface (API) calls and full-stack services. [3]
Data exchanges are designed so that transactions are logged, participants verified, and audit trails maintained (State Administration for Market Regulation, March 1, 2020). Data exchanges, as data holders, are also subject to the cooperation obligations of the 2021 Data Security Law, which requires provision of data to public security and state security organs on demand (NPC, June 10, 2021).
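The logging, verification, and audit properties described above can be sketched in a few lines of Python. This is a hypothetical toy model, not a description of how any actual exchange is implemented: the Exchange class, its methods, the participant and dataset identifiers, and the two delivery modes (“package” for a static copy, “api_call” for on-demand access) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Exchange:
    """Toy model of an exchange that verifies parties and logs every trade."""
    verified: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)

    def verify(self, participant):
        self.verified.add(participant)

    def transact(self, buyer, seller, dataset_id, mode):
        """Record a transaction; reject it if either party is unverified."""
        if mode not in ("package", "api_call"):
            raise ValueError(f"unknown delivery mode: {mode}")
        for party in (buyer, seller):
            if party not in self.verified:
                raise PermissionError(f"{party} is not a verified participant")
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "buyer": buyer,
            "seller": seller,
            "dataset_id": dataset_id,
            "mode": mode,
        }
        self.audit_log.append(entry)
        return entry

    def trail_for(self, dataset_id):
        """Return every logged transaction that touched a given dataset."""
        return [e for e in self.audit_log if e["dataset_id"] == dataset_id]


exchange = Exchange()
exchange.verify("supplier-1")
exchange.verify("buyer-1")
exchange.transact("buyer-1", "supplier-1", "dataset-42", "package")
exchange.transact("buyer-1", "supplier-1", "dataset-42", "api_call")
print(len(exchange.trail_for("dataset-42")))
```

In this model, a one-off package sale produces a single log entry, whereas API-based delivery produces an entry per call, so moving trade toward API access also yields a more granular record of who used which data and when.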
Conclusion
In the long term, the provisions of the Implementation Plan for Promoting the Construction of High-Quality Industry Datasets form part of a series of regulations, standards, and other measures designed to improve the Party’s ability to make the most of data and technology ecosystems, both for economic development and for its own social and political control. The control elements are embedded subtly in the definition of quality data and in the terms governing how that data is ultimately catalogued, shared, traced, and potentially accessed.
The plan is one regulation in a decades-long systems construction project. Its provisions fit into the bigger puzzle by enabling the kinds of data sharing and dataset building that will improve not only the quality and accuracy but also the effectiveness of a range of governance systems. These include the social credit system, urban governance platforms (such as smart city platforms), and public security applications, all of which the plan refers to specifically.
This article originally appeared in China Brief Notes.
Dr. Samantha Hoffman is the founder and managing director of ANS Analytics LLC. She is an analyst with over ten years of experience and a proven track record for producing groundbreaking open-source research on Chinese politics and national security strategy. Her work has shaped global approaches to understanding challenges posed by the Chinese party-state’s harnessing of technology for security and propaganda purposes.
Notes
[1] According to the NDA Implementation Plan, the full sector list includes: scientific research, industrial manufacturing, agriculture and rural areas, smart energy, transport, financial services, healthcare, education, e-commerce, human resources, culture and tourism, emergency management, meteorological services, green and low-carbon, public security, urban governance, housing construction, natural resources, and social credit. Innovation areas: low-altitude economy, embodied intelligence, autonomous driving, smart ocean, and bio-manufacturing (NDA, June 8).
[2] Key technical standards include:
1. TC609-5-2025-0, Construction Guidelines (高质量数据集 建设指南), which cover full lifecycle construction guidance;
2. TC609-5-2025-02, Format Requirements (高质量数据集 格式要求), which cover metadata format standardization;
3. TC609-5-2025-03, Classification Guidelines (高质量数据集 分类指南), which cover a unified classification framework; and
4. TC609-5-2025-04, Specifications for quality evaluation and test (高质量数据集 质量评测规范), which cover unified evaluation indicators covering accuracy, completeness, timeliness, and consistency.
For a summary, see: https://archive.ph/MbWbi.
The full text of TC609-5-2025-03 can be found at: https://web.archive.org/web/20260622121303/https://www.tc609.org.cn/tc609/tzgg/202509/6d0f135392ca47dfa28dad22f7bb5f6b/files/3.%20%E6%8A%80%E6%9C%AF%E6%96%87%E4%BB%B6%E3%80%8A%E9%AB%98%E8%B4%A8%E9%87%8F%E6%95%B0%E6%8D%AE%E9%9B%86%20%E5%88%86%E7%B1%BB%E6%8C%87%E5%8D%97%E3%80%8B.pdf.pdf.
The full text of TC609-5-2025-04 can be found at: https://web.archive.org/web/20260624011335/https://sjj.hubei.gov.cn/bmdt/tzgg/202511/P020251110587761615062.pdf.
[3] The basic data package model refers to a buyer acquiring a static snapshot copy of a dataset, likely using a database manager or Excel, for further applications. An API allows one application to “call” another application to request data, automating the data request process. API-based or full-stack services provide the buyer with the ability to access real-time data on demand (Cloudflare, accessed June 24).


