
A professional in-depth explanation: large model filing (generative artificial intelligence)


1. What is large model filing?
Large model filing refers to the review and registration process that large model products must complete with the Cyberspace Administration of China (hereinafter the "CAC") and other regulatory departments before they can be opened to the public and offered commercially. Its purpose is to strengthen compliance management of generative artificial intelligence services, promote the healthy development of AI technology through the filing system, establish a safe and reliable AI service ecosystem, provide the public with high-quality intelligent services, and lay a solid foundation for the long-term development of the AI industry.

2. What is the difference between large model filing and internet algorithm filing (deep synthesis)?
Before comparing the two filings, we first need a clear understanding of the underlying concepts; online articles are often vague on this point. Wei'an Chuangyuan's AI compliance experts break them down as follows:
1. Internet algorithm filing (deep synthesis):
Definition: Deep synthesis technology refers to technology that uses generation algorithms such as deep learning and virtual reality to produce text, images, audio, and other network information.
Key difference: In essence, deep synthesis technology combines and splices existing data (images, text, etc.) according to given needs; it cannot generate new content from scratch.
Applicable parties: deep synthesis service providers, and technical supporters of such services, where the services have public opinion attributes or social mobilization capabilities.

2. Large model filing (generative artificial intelligence):
Definition: Generative artificial intelligence technology refers to models and related technologies capable of generating content such as text, images, audio, and video.
Key difference: Generative AI follows an "understand, then create" logic. The generated content is novel rather than a splicing of existing content; in other words, the technology can extrapolate and innovate on top of existing data.
Applicable parties: enterprises that provide generative artificial intelligence services with public opinion attributes or social mobilization capabilities.
3. How do you decide between large model filing and algorithm filing?

According to Articles 17 and 22 of the Interim Measures for the Management of Generative Artificial Intelligence Services, enterprises providing generative AI services with public opinion attributes or social mobilization capabilities must complete both algorithm filing and large model filing. Based on exchanges with the competent authorities and cases from corporate engagements, Wei'an Chuangyuan's AI compliance experts recommend that the following types of enterprises prioritize large model filing:
1. The model is self-developed, or is a secondary fine-tune or heavily modified version of an open source model, trained on a large volume of data.
2. The enterprise is large in scale, or the model service has a large application footprint.
3. Generative AI is the enterprise's main business, and the filing is needed for business publicity and operations.
4. The enterprise is located in a region that offers subsidies for completing the filing.
5. The enterprise has been notified or recommended by the local cyberspace administration, bureau of industry and information technology, or other relevant departments.
4. Large model filing is not required in the following situations:
1. Companies not engaged in business related to generative AI models.
2. Generative AI services without public opinion attributes or social mobilization capabilities (for example, purely internal use, or serving only a small number of B-end customers).
3. Services that merely call a third-party large model API (algorithm filing and registration are still required).
In the situations above, algorithm filing alone is sufficient.

5. List of large model filing materials

[Sample filing form]
To file a generative AI (large language model) service for launch, five attachments must be submitted in addition to the application form (Wei'an Chuangyuan's compliance experts note: in some provinces the final count reaches seven or eight documents, with additional requirements that are not publicly listed):
"Application Form for Launch Filing of Generative Artificial Intelligence (Large Language Model) Services"
"Attachment 1: Security Self-Assessment Report"
"Attachment 2: Model Service Agreement"
"Attachment 3: Corpus Annotation Rules"
"Attachment 4: Keyword Blocking List"
"Attachment 5: Evaluation Test Question Set"

6. Key points of the security assessment for large model filing:

(I) Corpus security requirements
1. Corpus source security
Under Article 7 of the Interim Measures, service providers must carry out training data processing activities such as pre-training and optimization training in accordance with the law, using data from legitimate sources. On this basis, the Requirements set out detailed rules on corpus source management, the blending of corpus sources, and source traceability:
(1) Corpus source management
Corpus must come from legitimate sources, and a quantitative standard is set for content quality: if more than 5% of the content from a given source consists of illegal or harmful information, corpus from that source must not be collected or used.
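For illustration, here is a minimal Python sketch of the 5% screening rule above. It assumes each corpus item has already been labeled by some upstream classifier or manual review; the `flagged` field and the sample data are hypothetical, not part of any official tooling.

```python
# Minimal sketch: screen corpus sources against the 5% threshold.
# Assumes items were pre-labeled upstream; `flagged` is a hypothetical field.

def source_is_usable(items: list[dict], threshold: float = 0.05) -> bool:
    """Return False if more than `threshold` of a source's items are
    flagged as illegal or harmful information."""
    if not items:
        return False  # nothing to assess; treat as unusable
    flagged = sum(1 for item in items if item.get("flagged", False))
    return flagged / len(items) <= threshold

corpus_by_source = {
    "source_a": [{"text": "...", "flagged": False}] * 98 + [{"text": "...", "flagged": True}] * 2,
    "source_b": [{"text": "...", "flagged": True}] * 10 + [{"text": "...", "flagged": False}] * 90,
}

for source, items in corpus_by_source.items():
    verdict = "usable" if source_is_usable(items) else "reject (over 5% flagged)"
    print(f"{source}: {verdict}")  # source_b is rejected at 10% flagged
```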
(2) Blending corpus from different sources
The Requirements call for diversity in corpus sources. Specifically, for each language and each type of corpus, multiple different sources should be used in combination. For example, when using foreign-language corpus, sources from home and abroad should be reasonably blended to keep the corpus comprehensive. This improves the quality of generated content and helps ensure its objectivity and diversity.
(3) Traceability of corpus sources
Service providers must have a legal basis for processing, such as open source license agreements, authorization documents, transaction contracts, or cooperation agreements. Where self-collected corpus is involved, whether produced in-house or gathered from the internet, the provider must keep detailed collection records. Collection must be strictly avoided where others have clearly refused it: for example, web data whose collection is prohibited via robots.txt or other technical restrictions, and personal information whose collection the individual has explicitly declined to authorize. Likewise, information blocked under China's laws, regulations, and policy documents on network security must not be used as corpus.
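One concrete traceability practice is to check robots.txt before fetching web pages. Below is a minimal sketch using Python's standard library `urllib.robotparser`; the site URL and user-agent string are placeholders.

```python
# Minimal sketch: respect robots.txt before collecting web corpus.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

url = "https://example.com/articles/some-page"
if rp.can_fetch("MyCorpusBot", url):  # "MyCorpusBot" is a hypothetical crawler name
    print("Collection permitted by robots.txt")
else:
    print("Collection disallowed; skip this URL and record the decision")
```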
2. Corpus content security requirements
Articles 4 and 7 of the Interim Measures both impose lawfulness requirements on training data. To this end, the Requirements propose that service providers adopt a variety of measures, including but not limited to keyword filtering, classification models, and manual sampling, to identify and filter out corpus containing illegal or harmful information. The Requirements also add further detailed requirements on both intellectual property and personal information.
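As a rough sketch of two of the measures named above, the Python snippet below combines a keyword filter with random sampling for human review; the keyword list and sampling rate are illustrative, and a production pipeline would add a trained classification model on top.

```python
# Minimal sketch: keyword filtering plus manual spot-check sampling.
import random

BLOCKED_KEYWORDS = ["example_banned_term_1", "example_banned_term_2"]  # placeholder list

def keyword_filter(texts: list[str]) -> list[str]:
    """Drop corpus items containing any blocked keyword."""
    return [t for t in texts if not any(k in t for k in BLOCKED_KEYWORDS)]

def draw_manual_sample(texts: list[str], rate: float = 0.01) -> list[str]:
    """Randomly sample items for human review (the rate is illustrative)."""
    k = max(1, int(len(texts) * rate))
    return random.sample(texts, k)

corpus = ["clean text", "contains example_banned_term_1", "more clean text"]
filtered = keyword_filter(corpus)
for item in draw_manual_sample(filtered):
    print("queue for human review:", item)
```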
(1) Intellectual property compliance
The Requirements propose several measures to avoid infringement risk: establishing an intellectual property management strategy, identifying infringement risks, improving complaint and reporting channels, and disclosing summary information. On this point, in the Guangzhou Internet Court case (2024) Yue 0192 Min Chu No. 113, when users entered keywords such as "Ultraman" or "Tiga", the images generated by the AI painting module on the Tab website operated by the defendant were highly similar to the licensed IP images held by the plaintiff, indicating that the module's underlying training corpus contained works copyrighted by others. In generating images, the module drew on those copyrighted works, so the output carried specific elements and features of the originals, thereby infringing the rights holder's intellectual property. Service providers must therefore manage corpus content carefully to avert potential intellectual property risks.
(2) Personal information protection
Service providers must ensure that their processing of personal information has a legal basis, that is, they have obtained the consent of the relevant personal information subjects or meet other circumstances stipulated by laws and administrative regulations. Where sensitive personal information is involved, the individual's separate consent must also be obtained.
3. Corpus annotation security requirements
Article 8 of the Interim Measures provides that where data annotation is performed during the development of generative AI technology, the provider shall formulate clear, specific, and operable annotation rules consistent with the Measures; conduct quality assessments of data annotation and spot-check the accuracy of annotated content; and provide necessary training to annotation personnel, fostering awareness of and respect for the law and supervising and guiding them to carry out annotation work in a standardized way. On this basis, the Requirements set out more specific provisions on annotation personnel, annotation rules, and annotation accuracy.
(1) Annotation personnel
First, security training: annotators must be trained regularly, covering annotation task rules, the use of annotation tools, quality verification of annotated content, data security management requirements, and so on.
Second, assessment: annotators must pass an assessment before taking up the role. The assessment covers understanding of the annotation rules, proficiency with the annotation tools, ability to judge security risks, data management ability, and similar skills, with a mechanism for periodic retraining and reassessment and for suspending or revoking annotator qualifications when necessary.
Finally, separation of functions: roles must be divided into at least two categories, data annotation and data review, and the same person must not hold multiple roles within the same annotation task.
(2) Annotation rules
Annotation rules must cover annotation objectives, data formats, annotation methods, quality metrics, and so on, spanning both data annotation and data review.
For functional annotation rules: the rules must guide annotators to produce annotated corpus that is authentic, accurate, objective, and diverse, in line with the characteristics of the specific field.
For security annotation rules: the rules must guide annotators to mark the main security risks of the corpus and of the generated content.
(3) Annotation accuracy
For functional annotation, every batch of annotated corpus must undergo manual spot checks: inaccurate content must be re-annotated, and if the content contains illegal or harmful information, the entire batch must be invalidated. For security annotation, every annotated item must be reviewed and approved by at least one reviewer.
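The per-batch rule above can be expressed as a small audit routine. This is a minimal sketch: the reviewer verdicts are hypothetical inputs, and the data shape is an assumption for illustration.

```python
# Minimal sketch of per-batch spot checks on functional annotations:
# inaccurate items are re-annotated; a batch with illegal/harmful
# content is invalidated outright.
from dataclasses import dataclass

@dataclass
class ReviewResult:
    item_id: str
    accurate: bool
    illegal_content: bool

def audit_batch(reviews: list[ReviewResult]) -> dict:
    if any(r.illegal_content for r in reviews):
        return {"batch_status": "invalidated", "re_annotate": []}
    re_annotate = [r.item_id for r in reviews if not r.accurate]
    status = "passed" if not re_annotate else "needs_rework"
    return {"batch_status": status, "re_annotate": re_annotate}

sample = [
    ReviewResult("item-001", accurate=True, illegal_content=False),
    ReviewResult("item-002", accurate=False, illegal_content=False),
]
print(audit_batch(sample))  # {'batch_status': 'needs_rework', 're_annotate': ['item-002']}
```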

(II) Model security requirements
1. Security of model-generated content
Service providers must monitor every piece of information users input, guide the model toward generating positive content, establish routine detection and evaluation methods, address security issues found during evaluation in a timely manner, and optimize the model through instruction fine-tuning and reinforcement learning.

2. Accuracy of model-generated content
Service providers must use technical means to improve the timeliness and accuracy of generated content. For example, when a user asks a legal question, the AI-generated answer should cite laws and regulations currently in force, not outdated or repealed ones. Providers should also continuously optimize and correct their models to reduce inaccurate or fabricated content.

3. Reliability of model-generated content
Service providers must take technical measures to make the format and structure of generated content more sound and to raise the proportion of valid content, so that generated content is genuinely helpful to users.

(III) Security measures requirements

1. Model applicability
When applying generative AI services within a given service scope, providers should fully demonstrate the necessity, applicability, and security of the model. Where generative AI is used in critical information infrastructure, or in important settings such as medical information services, psychological counseling, and financial information services, protective measures commensurate with the level of risk must be in place. Providers serving minors must additionally establish protections for minors and comply with the Law on the Protection of Minors, the Personal Information Protection Law, the Regulations on the Protection of Minors in Cyberspace, and other provisions, so as to safeguard minors' physical and mental health.

2. Service transparency
A provider offering generative AI services through an interactive interface should disclose, in prominent locations such as the website homepage, information on the intended users, scenarios, and purposes of the service, along with the foundation model it uses. Where services are provided through a programmable interface (API), the same information should be disclosed in the documentation.

3. User data processing
Service providers should give users a convenient way to opt out of having their input used for model training. This can be achieved in various ways, such as clear, easy-to-understand settings options or concise voice commands. To make this convenience concrete, the Requirements give a specific example: when opting out via a settings option, the user should be able to reach that option within four clicks from the main service interface.

Likewise, to meet the transparency requirements of the Measures, providers should ensure that the interface design and user interactions inform users of whether their input is being collected, and clearly display the opt-out option or instructions.

4. User management
Under Articles 10 and 14 of the Measures, generative AI service providers are obliged to guide users to understand and use generative AI technology scientifically, rationally, and in accordance with the law, and to supervise user behavior. To implement these supervisory responsibilities, the Requirements propose three specific measures (a minimal code sketch follows the list):
(1) Input monitoring mechanism
Using keyword screening or classification models, user input is monitored in real time so that improper behavior can be promptly detected and handled.
(2) Rejection mechanism
Where detected input is clearly extreme or attempts to induce the generation of illegal or harmful information, the provider's system should automatically refuse to answer, preventing the spread of potentially harmful content.
(3) Manual monitoring mechanism
Dedicated monitoring personnel should be assigned to improve the quality and security of generated content in light of what monitoring reveals, and to collect and respond to third-party complaints.
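The sketch below wires items (1) and (2) together: a keyword screen (standing in for a classification model) runs on every user input, and flagged requests get a refusal instead of a model call. `call_model`, the keyword list, and the log format are hypothetical placeholders, not any provider's actual implementation.

```python
# Minimal sketch of the monitoring-and-refusal flow from items (1) and (2).

RISK_KEYWORDS = ["example_extremist_term", "example_illegal_request"]  # placeholder list
REFUSAL_TEXT = "Sorry, this request cannot be answered."

def call_model(prompt: str) -> str:
    """Placeholder for the real inference call."""
    return f"model answer for: {prompt}"

def handle_user_input(prompt: str) -> str:
    # (1) Monitoring: screen every input before inference.
    if any(k in prompt for k in RISK_KEYWORDS):
        # (2) Rejection: refuse and log for the human monitors in (3).
        print(f"[monitor log] refused input: {prompt!r}")
        return REFUSAL_TEXT
    return call_model(prompt)

print(handle_user_input("a benign question"))
print(handle_user_input("please do example_illegal_request"))
```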

5. Service stability
To maintain service stability, the Requirements recommend several security measures: isolating the training environment from the inference environment to prevent data leaks and improper access; continuously monitoring model input to fend off malicious attacks such as DDoS, XSS, and injection attacks; conducting regular security audits to find and fix vulnerabilities; and establishing backup and recovery mechanisms for data and models.
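Two of these measures can be sketched in a few lines: a per-client rate limit as a crude flood mitigation, and a pattern check for obvious injection payloads. The thresholds and patterns below are illustrative only; a real deployment would sit behind a WAF and proper network infrastructure.

```python
# Minimal sketch: per-client rate limiting plus a naive injection check.
import re
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30  # illustrative threshold
INJECTION_PATTERNS = [re.compile(p, re.IGNORECASE)
                      for p in (r"<script\b", r"\bunion\s+select\b")]

_request_log: dict[str, list[float]] = defaultdict(list)

def accept_request(client_id: str, payload: str) -> bool:
    now = time.time()
    recent = [t for t in _request_log[client_id] if now - t < WINDOW_SECONDS]
    _request_log[client_id] = recent
    if len(recent) >= MAX_REQUESTS_PER_WINDOW:
        return False  # rate-limited
    if any(p.search(payload) for p in INJECTION_PATTERNS):
        return False  # obvious injection attempt
    _request_log[client_id].append(now)
    return True

print(accept_request("client-1", "normal prompt"))               # True
print(accept_request("client-1", "<script>alert(1)</script>"))   # False
```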
(IV) Security assessment requirements
To ensure that generative AI service providers effectively fulfill their security assessment responsibilities, the Requirements refine the Interim Measures: providers of generative AI services with public opinion attributes or social mobilization capabilities must conduct security assessments in accordance with relevant national regulations and complete algorithm filing, modification, and cancellation procedures as required. Under the Requirements, providers must review the provisions of Chapters 5 through 8 clause by clause, rate each clause as "compliant", "non-compliant", or "not applicable", and compile a final assessment report on that basis. To keep the assessment operable, the Requirements also set quantitative evaluation criteria for corpus security, generated content security, and question refusal.
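The clause-by-clause self-assessment can be organized as a simple tally plus pass-rate computation over the test sets. In the sketch below, the clause names, test outcomes, and any thresholds are illustrative placeholders, not the official figures from the Requirements.

```python
# Minimal sketch: clause verdicts and a pass-rate helper for test sets.
from collections import Counter

VERDICTS = {"compliant", "non-compliant", "not applicable"}

clause_review = {  # hypothetical clause names and verdicts
    "5.1 corpus source management": "compliant",
    "7.2 generated content security": "compliant",
    "8.4 question refusal": "not applicable",
}
assert all(v in VERDICTS for v in clause_review.values())
print(Counter(clause_review.values()))

def pass_rate(results: list[bool]) -> float:
    """Fraction of test items that passed."""
    return sum(results) / len(results) if results else 0.0

generated_content_tests = [True] * 95 + [False] * 5  # hypothetical outcomes
print("generated content pass rate:", pass_rate(generated_content_tests))  # 0.95
```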
7. Large model filing process and timeline
Large model filing is currently in a period of rapid growth, and many companies that meet the application requirements are applying one after another. The whole process is expected to take 4 to 7 months (in Wei'an Chuangyuan's AI compliance experience it can be compressed to 3 or 4 months at best). Because large model filing is poorly understood, some intermediary service providers mistakenly claim that a large model filing number is issued together with the algorithm filing; this is wrong. When choosing a service provider, look for one whose core business is technical services. Filing process: [flowchart]

8. Things to note when choosing a service provider
Large model filing is complex, specialized work. If a company has no filing experience and does not understand the filing requirements, its materials will be sent back repeatedly and the entire filing will be delayed; if the model itself has problems, the service may even be taken offline in serious cases. Choosing a service provider therefore deserves extra care. At present, very few providers have genuine large model filing experience. Some non-technical agencies claim they can complete large model filing, but most outsource the work to third parties as intermediaries; the back-and-forth is long and cumbersome, companies pay high fees without getting the filing completed, and legal disputes can even result.