Editor’s note: This article is from the WeChat public account “AI Frontline” (ID: ai-front), released by 36Kr with permission.

Planning & author | Liu Yan
Interview guest | Miao Guanqiong, co-founder of Biaobei Technology and head of its data business
Editor | Linda

AI Frontline guide: If artificial intelligence is a “rocket”, then data is the “fuel” that propels it. Machine learning relies on large amounts of labeled data, and data labeling is what allows machines to perceive and understand the world. Data annotation is an indispensable part of the development of artificial intelligence and the foundation on which the AI pyramid is built.

In sharp contrast to the glamour of AI’s “front stage”, data annotation stays behind the scenes. It is often overlooked and subject to prejudice: “sweatshop”, “AI Foxconn”, “new migrant workers” …

As AI is deployed more deeply, it places higher requirements on data, and the data labeling industry has gradually moved from a stage of rough growth to a more refined one.

Data annotation behind the “AI pyramid”

Data is the basis of machine learning. Machine learning builds models from data, and rich labels are a prerequisite for successful modeling.

Supervised learning is currently the most widely used machine learning approach. It relies heavily on labeled data: it learns from a large number of labeled training samples to build a predictive model.

Deep learning also needs to be “fed” large amounts of data. Machine learning frameworks represented by deep learning need to be trained on large supervised datasets.
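To make the dependence on labels concrete, here is a minimal sketch of supervised learning in plain Python. It is an illustration only, not anything described by the interviewees: the nearest-centroid classifier, the function names `train_centroids` and `predict`, and the toy “cat”/“dog” samples are all assumptions chosen for brevity.

```python
from collections import defaultdict


def train_centroids(samples):
    """Average the feature vectors of each label to get one centroid per class."""
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [v / counts[label] for v in total]
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))

    return min(centroids, key=lambda label: distance(centroids[label]))


# Hypothetical hand-labeled data: each pair is (features, label).
# The labels are exactly what human annotators would supply.
labeled_samples = [
    ([1.0, 1.2], "cat"),
    ([0.8, 1.0], "cat"),
    ([3.0, 3.1], "dog"),
    ([3.2, 2.9], "dog"),
]

model = train_centroids(labeled_samples)
prediction = predict(model, [2.9, 3.0])
print(prediction)
```

Without the label in each pair, `train_centroids` would have nothing to group by, which is the sense in which labeled samples are a precondition for this kind of model.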
Su Haibo, chief algorithm scientist at Percent, has said that deep learning can exert its power only in scenarios with plenty of labeled data, yet many practical applications do not have enough of it.

The deployment of AI technology across all kinds of scenarios and the arrival of the big data era have generated massive, exponentially growing volumes of data, and acquiring data has become relatively easy. Obtaining a large amount of labeled data, however, is not easy, and it often costs a great deal of manpower, material resources, and money.

In high-threshold segments such as medical AI, the lack of labeled data has become a stumbling block to the industry’s development. Zheng Yefeng, director of Tencent Youtu Labs, said in an interview with AI Frontline that labeling medical data is “difficult”. On the one hand, top talent for medical data labeling is scarce. On the other hand, clinical and research workloads are heavy, and many medical experts do not have the time and energy to label data.

Data annotation mainly targets speech, images, and text. It involves tagging, marking, framing objects in, and annotating datasets, which are then used for machine training and learning.

The types of data annotation include pinyin annotation, prosody annotation, part-of-speech annotation, phoneme time annotation, phonetic transcription, classification annotation, dot annotation, frame annotation, area annotation, and so on.

Because the data to be labeled is large in scale and costly to process, Internet giants and AI companies rarely maintain their own labeling teams. Most of the work is handed to third-party data service companies or data labeling teams.

Data service is the start-up business of Biaobei Technology.
Since its establishment in 2016, Biaobei Technology has provided speech, image, and NLP data collection and labeling services for many companies, including BAT (Baidu, Alibaba, and Tencent) and AI unicorns.

According to Miao Guanqiong, head of Biaobei Technology’s data business, Biaobei has self-developed collection and annotation platforms, including a long-speech (conversational, continuous) annotation platform, a short-speech (ten-second) annotation platform, an AI speech synthesis data annotation platform, and a data workshop app.

The choice of labeling platform is decided by weighing the type of data (image or speech), the data source, and customer needs. Taking speech synthesis data as an example, the labeling covers phonetic transcription, prosody, phoneme time points, and parts of speech.

The boom in artificial intelligence has given birth to and expanded the data labeling industry, and it has also created a large number of jobs. Figures show that China currently has about 200,000 full-time data labeling practitioners, about one million part-time practitioners, and hundreds of companies engaged in the data labeling business across the country.

Data “migrant workers”?

There is a popular saying in the data labeling industry: “There is as much intelligence as there is manual labor.”

Data annotation is a vital part of the development of artificial intelligence, but it is easily overlooked.

Relatively speaking, data labeling is an “entry-level” type of work in the field of artificial intelligence. Judged by the work process alone, its technical content is low, and people are the biggest influencing factor in the work. The outside world has, in turn, attached labels of its own to data labeling.

The low threshold has attracted many farmers, students, and people with disabilities to join the data labeling army.
Distinctive “data labeling villages” have appeared in fourth- and fifth-tier cities in provinces such as Henan, Hebei, Guizhou, and Shanxi.

This is not unique to China. Migration to places with more abundant labor and lower costs is also the trend in the global data labeling industry. India has many data labeling villages that serve AI companies in the United States, Europe, Australia, and Asia, and Facebook has outsourced some social content tagging to an Indian company.

These workers have also become participants in the artificial intelligence wave. Although their pay is far below that of other artificial intelligence practitioners, the data labeler’s job is easier and more respectable than traditional manual work.

The other side of the coin is that the workflow is simple and tedious, and the data labeler repeats the work of “drawing frames” day after day … Claims that data labeling is “dirty work” done by “data migrant workers” have also spread.

Miao Guanqiong disagrees with these “voices”.

“I don’t think it is a ‘dirty and exhausting’ industry, because this is not a job that just anyone can do. AI itself is developing very fast. As products are put into use, the requirements for data are getting higher and higher, and that also places high demands on the quality of the people who collect the data.”

Because the service quality of an outsourced team is difficult to control, the projects undertaken by Biaobei Technology rely mainly on its own data labeling teams, which are located in cities such as Tianjin and Changchun. Hiring places more weight on professional skill: candidates must have a language or dialect background or experience in data annotation, and those with no experience must be trained for at least 6 months.

Miao Guanqiong said that the data annotation industry is becoming more and more specialized. In the early days, the work was mainly Chinese-language data annotation.
Now, with the growth of multilingual, dialect, and personalized annotation, the demands on annotation are rising. It is not something that can be done simply by pulling in a lot of people; it needs professional talent.

In addition, the “sweatshop” situation appeared mostly in the early days of the industry and mostly involved small teams whose only business was data labeling and that could not take on complex, customized projects.

As for workload, which depends on customer needs, take speech annotation as an example: a data annotator at Biaobei Technology produces 1 hour of effectively annotated speech per day.

The share of machine labeling has grown since the industry’s barren early era, but it cannot replace manual work

The “China Artificial Intelligence Basic Data Service Industry White Paper 2019” points out that 2010-2016 was the “nascent period” of the data service industry. Early demand for data labeling surged and the entry threshold was low, so a large number of players entered and their quality was very mixed.

Since 2017, as AI has been applied in depth across various scenarios, the data labeling industry has entered the growth stage.
Vendors on the application side have continually raised their requirements for labeling quality. In areas such as autonomous driving, moving images, and computer vision, data labeling is difficult.

The industry structure is gradually becoming clear, and the Matthew effect is obvious. It is understood that several hundred companies and teams are engaged in the data labeling business in China. About one hundred of them can independently deliver the entire data quality service, dozens can provide integrated data collection services, and only a dozen or so can provide high-standard basic data services.

At this stage, downstream AI algorithm R&D units mostly spread their business across different data service companies, the relevant standards for data labeling have yet to be improved, and no dominant company has emerged in the industry.

This is a market that is not yet saturated, which also means there is huge room for development. According to statistics, China’s artificial intelligence basic data service market was worth 2.586 billion yuan in 2018, and the industry’s compound annual growth rate was 23.5%.

Miao Guanqiong believes that, with the continuous improvement of data security and quality standards and the introduction of related data policies, some providers that do not meet industry standards and customer needs will be eliminated by the market.

She added that the industry is currently in a rising, fast-developing stage and is moving overall toward personalization and specialization, transitioning from the simpler, more general data of the early period to more complex, personalized, scenario-oriented data. In many segments, large amounts of real-scenario data, rather than simple general data, need to be labeled in order to iterate the model.

The data labeling industry has also begun to enter the stage of human-machine collaboration. Demand in the data labeling market is still strong.
More professional people and more efficient machines are needed to help, and the share of machine labeling will continue to increase. AI technology and data are complementary: AI technology improves the efficiency of data work, and the data in turn serves the technology.

To reduce labor costs and improve efficiency, many Internet technology companies and third-party data service providers are developing their own labeling tools.

Last October, Google released Fluid Annotation, a human-machine collaboration interface for full image annotation. When it is used to annotate the class labels and contours of every object and background region in an image, annotated datasets can be created three times faster.

Data annotation crowdsourcing platforms are also emerging, including JD.com, Baidu Census, Figure Eight, and Amazon’s Mechanical Turk.

In the foreseeable future, machine labeling with manual assistance will become the trend. This may not be good news for the data labeling villages.

But Miao Guanqiong believes that machines cannot completely replace humans. At this stage, manual labeling is more accurate than machine labeling. Machines get only a certain percentage of results right; more accurate results still require manual labeling, and the role people play is more critical.

In addition, the role of humans in quality inspection is irreplaceable. The standard data is calibrated manually, and the workflow follows a “first review, second proofreading, third inspection” process. The machine spot-checks and accepts part of the data at random and outputs preprocessed results, while the final results rely on careful manual proofreading.

Guest introduction

Miao Guanqiong is co-founder of Biaobei Technology and head of its data business. An expert in the speech and data fields with more than 17 years of industry experience, she has taken part in writing many professional books and has distinctive solutions for combining products with data scenarios.
AI data annotation is not “dirty”
By ouyangshaoxia