"After Sora, seeing is not necessarily believing."
Many people can relate to this sentiment. With just a text input, Sora can generate a high-definition video that is up to a minute long, with realistic visuals and smooth transitions. The videos produced by Sora are so lifelike that it is difficult for the human eye to distinguish them as AI creations.
But it's not just about generating videos. AI can also "modify" existing videos. Recently, the Xiaopeng Motors research team introduced a new universal video simulation framework called "Any Object in Any Scene," which seamlessly inserts any object into an existing dynamic video. Again, it is hard to tell the difference with the naked eye.
Amidst the difficulty of discerning truth from falsehood, more and more people are starting to worry about the potential chaos that AI-generated videos may bring. For instance, video evidence may no longer be considered reliable: "In the future, you may find yourself sitting in a courtroom, watching a 'crime video' that you don't even know you were involved in," says a concerned individual.
Dong Jing, a researcher at the Institute of Automation, Chinese Academy of Sciences, specializes in artificial intelligence content security and countermeasures, including the detection of image tampering and deepfakes. Many of her team's results have been applied to multimedia authentication. As AI grows more capable, what methods and means are available to meet these technological challenges? And how can the general public stay alert when consuming video content and avoid being deceived? To shed light on these questions, China Science Daily interviewed Dong Jing.
"Authentication is still in a passive state"
Current techniques for authenticating video content follow two main routes, Dong Jing said. One is based on learning from data. This usually involves collecting forged and real videos in advance (preferably as paired data) to build a training dataset on which a powerful deep neural network is trained. As long as the model can "remember" the anomalies or traces in video frames, such as image noise or discontinuous motion trajectories between frames, it can distinguish real videos from fake ones.
She noted that this approach is relatively universal: once the detection model's parameters are determined, it is simple to deploy and performs well in batch testing. However, it relies heavily on the volume and completeness of the training data, and it usually fails on unknown forgery types that were not represented in training.
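As a rough illustration of this data-driven route, the sketch below fine-tunes a small image classifier to label individual frames as real or fake. The data loader, the ResNet-18 backbone, and the hyperparameters are assumptions chosen for brevity; they are not the setup used by Dong Jing's team.

```python
import torch
import torch.nn as nn
from torchvision import models

def train_frame_detector(train_loader, epochs=5, lr=1e-4, device="cuda"):
    """train_loader is assumed to yield (frames, labels), with labels 0 = real, 1 = fake."""
    # Small ImageNet-pretrained backbone with a two-class head.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, 2)
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for frames, labels in train_loader:
            frames, labels = frames.to(device), labels.to(device)
            optimizer.zero_grad()
            # The network learns whatever statistical traces separate real from forged frames.
            loss = criterion(model(frames), labels)
            loss.backward()
            optimizer.step()
    return model
```

Once its parameters are fixed, such a model can be run in batch over incoming videos, which is exactly the deployment convenience, and the generalization risk, described above.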
The other approach is based on specific clues. It starts by defining visual "clues" in the video that are illogical or inconsistent, such as inconsistent lighting, an absence of physiological signs in facial videos, or a mismatch between a speaker's lip movements and the timing of the voice. Dedicated algorithms are then designed to extract and localize these clues as evidence. This method offers better interpretability and performs well in the targeted detection of video segments, but it generalizes less well across diverse data.
This clue-based method can be used to identify videos "modified" by the Xiaopeng Motors team's framework. Dong Jing explained that preliminary analysis revealed slight changes in color and texture across frames after the target object is "inserted." "Taking this as a clue, we collected relevant data for training and detection testing."
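To make the color-and-texture clue concrete, here is a minimal sketch that scores how sharply each frame's color statistics shift relative to the previous frame; frames with abrupt jumps become candidates for closer inspection. The HSV histograms, the Bhattacharyya distance, and any decision threshold are illustrative assumptions, not the detection algorithm the researchers actually trained.

```python
import cv2
import numpy as np

def color_consistency_scores(video_path, bins=32):
    """Return one score per frame transition: color-histogram distance to the previous frame."""
    cap = cv2.VideoCapture(video_path)
    prev_hist, scores = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Bhattacharyya distance: larger values mean a bigger color shift between frames.
            scores.append(cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA))
        prev_hist = hist
    cap.release()
    return np.array(scores)

# Transitions scoring far above the median could be flagged for manual or model-based review.
```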
However, Dong Jing also mentioned that as tools like Sora enhance their capabilities in generating video details and diversified processing, the explicit traces of forgery in generated videos will become less noticeable. This will make it increasingly difficult to determine the authenticity of a video relying solely on traditional video analysis and forgery detection methods.
"Currently, specialized technologies are still in the early stages of development, and there is a need to strengthen the development and optimization of various detection techniques," Dong Jing told China Science Daily. She further explained that the current approach is still based on conventional detection techniques, so it is necessary to improve the model's identification capabilities by constructing new forgery video datasets. Additionally, existing video detection models need to be updated to be compatible with new video generation algorithms. Furthermore, techniques such as digital watermarking, digital signatures, and video retrieval can be utilized to enhance tracking and management of generated video data throughout its lifecycle.
"Overall, the authentication of video content is still relatively passive at the moment, and we need to develop and optimize various detection technologies to keep up with the constantly evolving video synthesis algorithms," Dong Jing stated. Despite the increasing difficulties, AI-generated videos will inevitably leave specific patterns or traces during the generation process, and detection technologies will continue to utilize these subtle clues that are imperceptible to the naked eye for countermeasures, analysis, and authentication.
Dong Jing and her team have proposed new detection algorithms from multiple perspectives. These algorithms are based on reconstruction errors, multimodal contrastive learning, and forged feature purification, representing continuous attempts to explore "new specific forgery clues."
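To give a feel for the reconstruction-error direction (only one of the perspectives listed above, and much simplified here), the sketch below scores frames by how poorly a small convolutional autoencoder, assumed to have been trained only on real footage, reconstructs them; unusually large errors hint at content the model has never seen, which may include synthesized material. The architecture and scoring are illustrative assumptions, not the team's published algorithms.

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Toy autoencoder; in this sketch it would be trained on real frames only."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def reconstruction_scores(model, frames):
    """Mean squared reconstruction error per frame; higher values suggest unfamiliar content."""
    recon = model(frames)
    return ((recon - frames) ** 2).flatten(1).mean(dim=1)
```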
Establishing internationally recognized standards and regulations
To forestall such chaos, non-technical measures such as "control at the source" have frequently been suggested. For example, it has been proposed that agreements be reached with AI technology providers like OpenAI to embed AI-generated marks at the very start of video generation.
Dong Jing said that embedding marks is currently one of the recommended strategies, but it still faces technical challenges and limitations, such as the reliability, concealment, and universality of the marks, along with privacy and security considerations.
Compared with passive detection of videos, watermarks and marks are a form of active defense. Dong Jing told China Science Daily that her team is currently researching visual generative watermarking: they aim to introduce a "robust watermark embedding module" into current generative models so that visible or invisible digital watermarks are incorporated into the generated videos. They have also recently experimented with adding "adversarial noise" to real images or videos to prevent AI models from synthesizing new content from that source data.
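As a toy illustration of this active-defense idea, the sketch below hides a short identifying bit string in the least significant bits of one color channel of a frame and reads it back later. Such a naive scheme would not survive compression or editing, so it only stands in for the far more robust embedding module described above; all names and details are illustrative.

```python
import numpy as np

def embed_watermark(frame: np.ndarray, bits: str) -> np.ndarray:
    """frame: HxWx3 uint8 image (BGR assumed); bits: e.g. '10110...' identifying the generator."""
    assert len(bits) <= frame.shape[0] * frame.shape[1]
    marked = frame.copy()
    channel = marked[..., 0].reshape(-1).copy()
    for i, b in enumerate(bits):
        channel[i] = (channel[i] & 0xFE) | int(b)   # overwrite the least significant bit
    marked[..., 0] = channel.reshape(frame.shape[:2])
    return marked

def extract_watermark(frame: np.ndarray, length: int) -> str:
    channel = frame[..., 0].reshape(-1)
    return "".join(str(channel[i] & 1) for i in range(length))
```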
In addition to technical measures, Dong Jing mentioned some non-technical measures.
"People need to improve AI data governance and regulations on the use of AI tools, while conducting popular science education, strengthening industry standards, and raising public awareness of relevant precautions," Dong Jing said. Regarding overseas AI generation service providers such as OpenAI, she calls for the establishment of internationally recognized AI data technological standards and regulations to form a feasible scheme for reasonable marking and collaborative supervision against generated videos.
Dong Jing believes that by regulating the use of new video generation tools like Sora, for example by managing and curating the source datasets on which training relies, standardizing the output of generated videos, conducting security testing for sensitive or fake content, and implementing governance and control measures, we can minimize the risks associated with the misuse of AI-generated videos. "The difficulty of authentication will not keep increasing indefinitely."
Enhancing immunity to fake videos
Although she agrees that "the task of identifying AI-generated videos should not be entrusted to the public," Dong Jing insists that ordinary individuals can still be cautious and "keep a watchful eye" when consuming video content to avoid being deceived.
For this, Dong Jing suggests a few measures.
Firstly, observe whether the details of the video are logically authentic, such as whether the actions of people and the background environment align with the real world, and whether their physiological features (such as teeth, fingers, skin texture, and iris color) look consistent. She said it is not yet clear whether models like Sora can easily and conveniently generate high-quality video in large quantities; judging from the publicly released clips, careful observation can still reveal imperfections in movement.
Secondly, check whether the picture quality and clarity are consistent throughout the video. AI-generated videos may still show flaws in image quality and sharpness, such as blurring or shaking.
Finally, check the logical consistency of the video's content, such as whether the content and storyline are reasonable and coherent. If there are doubts, further examine the credibility and consistency of video sources, publishing platforms, comments, format, and production time. Cross-validation using specialized tools and software for detecting AI-generated videos can also be helpful.
Dong Jing mentioned that in interactive scenarios such as video calls, one can ask the other party to turn their face to the side, move closer to the camera, or move farther away from it to help judge authenticity, since current forgery techniques still predict and generate poorly when there is significant movement.
In addition, Dong Jing reminded people that in today's complex media and public opinion environment, it is crucial for the general public to actively learn relevant knowledge, understand the mechanisms and vulnerabilities of AI generation to a certain extent, and be prepared for potential challenges.
"It is like regularly getting vaccinated against the latest flu," Dong Jing told reporters. "It enhances immunity against fake videos and prevents blind acceptance." She added, "Although I personally believe that the public should not bear the responsibility of identifying AI-generated content, it is our duty, both publicly and privately, to improve the digital literacy and awareness of security precautions among the public. This will help minimize the spread of fake information, economic fraud, and misleading public opinion, and foster societal trust."