
About the "Mulan-Magnolia Open Data License Agreement"

Open source is an important driver of the accelerating global artificial intelligence industry: it improves the efficiency of AI research and development, speeds up technical innovation, and helps build the AI ecosystem. In technical research, product development, and related stages, more innovators can build on existing public R&D resources and relatively mature open source software and hardware platforms to accelerate their own work. In this process, the free circulation of data has become increasingly important, and open data has become a key part of promoting AI innovation and development.

However, the artificial intelligence field still lacks practical open data licenses, so the use and circulation of data still face many obstacles and uncertainties. Data is prone to security and legal problems, and there is still insufficient understanding of its essential attributes, the forms it takes, and the ways it is used. As a result, the way data resources are currently used does not match the value that could be mined from them. For the sustainable development and use of artificial intelligence technology and the related data resources, open data license agreements play an increasingly prominent role. By standardizing the identities of data stakeholders and the definitions of key concepts, such an agreement defines the rights and responsibilities of each stakeholder regarding the conditions and methods under which specific data may circulate, and steers data circulation toward as open a mode as possible, promoting the open sharing, development, and utilization of data.

The "Mulan-Magnolia Open Data License Agreement" is a research project initiated by the "Shanghai Magnolia Open Source and Open Research Institute" under the "Mulan Open Source Community". It aims to create a set of standardized data license agreements that are grounded in China's artificial intelligence practice, promote the circulation of data, and improve the environment for artificial intelligence development.

Drafting notes for the "Mulan-Magnolia Open Data License Agreement"

The agreement was drafted jointly by "Magnolia Open Source" and "Open Data China". In the process, we:

  • Studied widely used international open licenses, such as the Creative Commons licenses and the Open Database License (ODbL), and summarized their terms and drafting strategies.
  • Studied license agreements for data circulation in the international artificial intelligence community, such as O-UDA and C-UDA drafted by Microsoft, the Community Data License drafted by the Linux Foundation, and the Montreal Data License drafted by Element AI. Following the spirit of the Montreal Data License, we customized and refined the terms that define usage behaviors in the artificial intelligence field.
  • Studied China's current Civil Code, as well as the draft Data Security Law and the draft Personal Information Protection Law, and drew on their definitions of relevant terms.

Given the complexity of compliance in data circulation, the current draft is based on the following principles and scope of application:

  • The agreement is intended for the release of artificial intelligence training data sets.
  • The published data must satisfy the basic premise that it can be released publicly and for free.
  • The published data meets national data security requirements and does not involve state secrets, national security, the public interest, trade secrets, and the like.
  • The published data does not involve personal information. (Per the second review draft of the "Personal Information Protection Law (Draft)", personal information is any information, recorded electronically or otherwise, relating to an identified or identifiable natural person; it does not include anonymized information.)

From the perspective of ownership, current artificial intelligence training data sets fall into two categories:

  • In the first category, the data publisher legally and compliantly owns the data or holds usufructuary rights to it.
  • In the second category, the data publisher legally and compliantly obtains the data from third parties and compiles or combines it.

The "Mulan-Magnolia Open Data License Agreement" therefore consists of two groups of agreements, drafted with different strategies for these two situations:

First group: the data is assumed to be legally and compliantly owned by the data publisher, or the publisher has the right to dispose of it

Following the Creative Commons model, we drafted a set of four agreements:

  • MBODL: a permissive open agreement, suited to data releases that need minimal restrictions and only require attribution of the data source
  • MBODL-NC: a non-commercial agreement, suited to publishers who want to prohibit commercial use and sharing of the data and its results
  • MBODL-SA: a share-alike agreement, suited to publishers who require data distributed downstream to be licensed under the same terms; this requirement does not carry over to the license of the output results
  • MBODL-CU: a computational-use-only agreement, suited to publishers who prohibit direct use and display of the data itself. For example, a TV station acting as data publisher may want to prohibit playback, copying, sale, etc. of its video data, while allowing the video to be used as training data for tasks such as video semantic tagging.

All four agreements are based on MBODL, with different restrictions added in the "License Restrictions" section. As with the CC licenses, these restrictions can also be stacked and combined to form new agreements: for example, MBODL-NC-CU stipulates non-commercial use and computational use only, and MBODL-SA-CU stipulates that the data is licensed under the same terms and is for computational use only.
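
The stacking of restrictions described above can be sketched in a few lines of Python. This is only an illustration: the function names `license_id` and `all_variants` are hypothetical, and the suffix order NC, SA, CU is an assumption inferred from the examples MBODL-NC-CU and MBODL-SA-CU, not something the agreement text is known to prescribe.

```python
from itertools import combinations

BASE = "MBODL"

# Restriction modules named in the article, with a short gloss for each.
RESTRICTIONS = {
    "NC": "non-commercial use only",
    "SA": "downstream data must be licensed under the same terms",
    "CU": "computational use only",
}

# Assumed suffix order, inferred from "MBODL-NC-CU" and "MBODL-SA-CU".
ORDER = ["NC", "SA", "CU"]


def license_id(restrictions=()):
    """Build an identifier such as 'MBODL-NC-CU' from a set of restrictions."""
    requested = set(restrictions)
    unknown = requested - set(RESTRICTIONS)
    if unknown:
        raise ValueError(f"unknown restriction(s): {sorted(unknown)}")
    # Emit suffixes in the assumed order, regardless of input order.
    chosen = [r for r in ORDER if r in requested]
    return "-".join([BASE] + chosen)


def all_variants():
    """Enumerate every combination of the three restriction modules."""
    return [
        license_id(combo)
        for n in range(len(ORDER) + 1)
        for combo in combinations(ORDER, n)
    ]


print(license_id(["CU", "NC"]))  # MBODL-NC-CU
print(all_variants())
```

Treating each restriction as an independent module mirrors the way the CC licenses combine their elements; whether every combination is actually intended to be a valid agreement is left open by the article.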

Second group: the data publisher has obtained the data from third parties legally and compliantly

Borrowing the strategy of the ODbL (Open Database License), we license the structure of the database or data set (that is, the way the data is selected and organized, and the database schema) separately from the data content. This licensing strategy is only experimental, and further feedback is needed to determine 1) whether there is real demand for it and 2) whether it is workable in practice.

We offer two possible cases to illustrate the second group:

Case 1: The data publisher obtains pictures of various birds through channels such as Wikipedia and Flickr. Each picture is licensed under an open license such as CC. The publisher selects and combines these pictures and adds its own bird labels (the bird name and subject for each photo), producing a "Bird Picture Training Data Set" that needs to be released under a license. Under the second group's strategy, the whole data set is released under the "Mulan-Magnolia Open Data License Agreement" (licensing the structure only) + the "labeled data" (content, under a new license chosen by the publisher) + "the original agreement for each picture" (content, under each picture's own agreement).

Case 2: The data publisher obtains desensitized lung CT image data from N hospitals under authorization (assuming the authorization allows the publisher to republish the images), and has invested manpower in labeling the lung nodules in these images. The publisher wants to release the image data together with the labeled data as a "standard training data set for lung nodules". It can release the whole data set under the "Mulan-Magnolia Open Data License Agreement" (licensing the structure only) + the "labeled data" (content, under a new license chosen by the publisher) + "the original agreement for each picture" (content, under each picture's own agreement).

Based on the scenarios in these cases, we drafted the MBODL (Structured Content Separation Edition) agreement as a separate experimental agreement, so that all sectors can discuss its applicability and how its terms would work in practice.
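
The three-layer licensing in the two cases (structure, publisher's labels, original content) can be pictured as a small data model. The sketch below is purely illustrative and is not part of the agreement: the class names, file names, and the specific CC licenses are hypothetical, and the label license `MBODL-NC` is only a stand-in for "whatever license the publisher chooses".

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str     # hypothetical file name
    source: str   # where the publisher obtained the item
    license: str  # the item's original license, carried over unchanged


@dataclass
class LayeredDataset:
    title: str
    structure_license: str  # covers selection, organization, and schema only
    label_license: str      # chosen by the publisher for its own labels
    items: list = field(default_factory=list)

    def license_summary(self):
        """Report which license applies to each layer of the data set."""
        content = Counter(item.license for item in self.items)
        return {
            "structure": self.structure_license,
            "labels": self.label_license,
            "content": dict(content),
        }


# Case 1 with made-up entries: each picture keeps its own license.
birds = LayeredDataset(
    title="Bird Picture Training Data Set",
    structure_license="MBODL (Structured Content Separation Edition)",
    label_license="MBODL-NC",  # assumption: any license the publisher picks
    items=[
        Item("sparrow_001.jpg", "Wikipedia", "CC BY-SA"),
        Item("heron_014.jpg", "Flickr", "CC BY"),
        Item("finch_203.jpg", "Flickr", "CC BY"),
    ],
)

summary = birds.license_summary()
print(summary)
```

Keeping the per-item license on each record, rather than one license for the whole data set, is the point of the separation strategy: a downstream user has to check the content layer item by item.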


鸣飞