Google releases open source Magika content type detection tool on GitHub
Google has open-sourced Magika, an AI-powered system the search giant built to identify binary and textual file types. Magika stands out for delivering precise file identification within milliseconds, even when running on a CPU.
Magika uses a custom, highly optimized deep-learning model that was designed and trained with Keras. The model is lightweight, at only about 1MB. For inference, Magika uses ONNX as its engine, so files are identified almost as quickly as with non-AI tools, even on a CPU.
On a benchmark of 1 million files covering more than 100 file types, Magika outperforms existing tools by approximately 20%. The gains are even larger for textual files, including code and configuration files, which other tools have traditionally struggled to identify.
Internally, Google already uses Magika to enhance user safety. The system is deployed at scale to route files in Gmail, Drive, and Safe Browsing to the appropriate security and content policy scanners. With Magika, Google has observed a 50% improvement in file type identification accuracy compared with its previous system, which relied on handcrafted rules. That increase in accuracy has allowed Google to scan an additional 11% of files with its specialized AI scanners for malicious documents and has reduced the share of unidentified files to 3%.
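The routing idea described above can be illustrated with a minimal sketch. This is not Google's implementation: the scanner names, the `route` function, the `min_score` threshold, and the `toy_detect` stand-in are all hypothetical, and the detector is passed in as a plain function so that any tool returning a label and a confidence score could be plugged in.

```python
from typing import Callable, Dict, Tuple

# Hypothetical mapping from content-type label to scanner name.
# These names are illustrative only, not Google's configuration.
SCANNERS: Dict[str, str] = {
    "pdf": "document_scanner",
    "docx": "document_scanner",
    "javascript": "script_scanner",
    "python": "script_scanner",
}
FALLBACK = "generic_scanner"

# A detector takes raw bytes and returns (label, confidence score in [0, 1]).
Detector = Callable[[bytes], Tuple[str, float]]


def route(data: bytes, detect: Detector, min_score: float = 0.9) -> str:
    """Pick a scanner for `data` from the detector's label and score."""
    label, score = detect(data)
    if score < min_score:
        # Low-confidence detections go to the generic scanner.
        return FALLBACK
    return SCANNERS.get(label, FALLBACK)


def toy_detect(data: bytes) -> Tuple[str, float]:
    """Stand-in detector with a single handcrafted magic-byte rule.

    A real deployment would call a model-based detector such as Magika here.
    """
    if data.startswith(b"%PDF-"):
        return "pdf", 0.99
    return "unknown", 0.0
```

In this sketch, more accurate detection translates directly into fewer files falling through to the generic scanner, which is the kind of effect the 11% and 3% figures describe.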
Additionally, the upcoming integration of Magika with VirusTotal promises to further enhance the platform’s efficiency and accuracy. Magika will act as a pre-filter before files are analyzed by VirusTotal’s Code Insight, which uses Google’s generative AI to detect malicious code. This collaboration is expected to contribute significantly to the global cybersecurity ecosystem.
By open-sourcing Magika, Google aims to help other software improve its file identification accuracy and to give researchers a reliable method for identifying file types at scale. The code and model for Magika are now freely available on GitHub under the Apache 2.0 license.
Magika can be installed as a standalone command-line utility and Python library from PyPI with the command pip install magika; no GPU is required. An experimental npm package is also available for those interested in using the TensorFlow.js (TFJS) version.
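For readers who want to try the Python library, a minimal sketch might look like the following. It assumes the package exposes a `Magika` class with an `identify_bytes` method, and that the label lives at `result.output.label` in newer releases or `result.output.ct_label` in 0.5.x releases; these attribute names have changed between versions, so check the documentation for the version you install. The helper names `label_of` and `identify` are hypothetical.

```python
def label_of(result) -> str:
    """Extract the content-type label from a Magika result object.

    Assumption: newer releases expose `result.output.label`, while 0.5.x
    releases expose `result.output.ct_label`.
    """
    output = result.output
    label = getattr(output, "label", None)
    if label is None:
        label = getattr(output, "ct_label")
    return str(label)


def identify(data: bytes) -> str:
    """Identify the content type of raw bytes (requires `pip install magika`)."""
    # Imported lazily so this module can be loaded without the package installed.
    from magika import Magika

    detector = Magika()  # loads the bundled model; reuse one instance in real code
    return label_of(detector.identify_bytes(data))
```

Calling identify on a short snippet of source code, for example identify(b"def main():\n    print('hello')\n"), should return a label naming the detected content type; the exact label strings depend on the installed version.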