A research team at McGill’s School of Information Studies (SIS) is well on its way to creating Kam1n0, the world’s first artificial intelligence-powered search engine for assembly code and a tool with great potential to improve cybersecurity worldwide. The project is overseen by Professor Benjamin Fung, Canada Research Chair in Data Mining and Cyber Security and Associate Professor at SIS, working with PhD students Steven Ding and Miles Li. Their primary goal is to help protect information systems at all levels, and to ensure that Kam1n0 remains available for free.
In a normal software development environment, a programmer spends hours writing source code and compiling it into an executable format to create a functioning computer program. However, source code is not available for malicious software such as a virus or Trojan horse. To understand the behaviour of malware, analysts have to disassemble the program (break it down to its assembly code) and work through it in a long and tedious process called ‘reverse engineering’. In a large-scale cyber attack the stakes are magnified, making the timeliness, accuracy, and efficiency of the reverse engineer’s response ever more important.
Enter Kam1n0, an AI-powered search engine that assists the engineer by indicating which parts of the disassembled software need their attention. Type a line of code into Kam1n0 and it searches through the entire database, returning every record and comment made on that code (if any exist) in the reverse engineer’s repository. If you are lost, a video made by the Data Mining and Security Lab (DMaS) breaks it down clearly.
“We assume the one who wants to understand the behaviour of the malware has a large repository of code that they have identified and commented on in their database”, explains Professor Fung. “Kam1n0 is like a search engine for this malware repository, so when a new malware comes it can be used to show what parts have been commented on already. If it is already recorded, Kam1n0 will reveal which lines have been replicated. The reverse engineer, seeing their comments on the already identified code, will know which parts of the assembly code are ‘clones’, and which parts require further investigation.”
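The workflow Professor Fung describes can be illustrated with a toy example. The Python sketch below is not Kam1n0's actual algorithm, which the article does not detail. It assumes a simple stand-in technique: each line of assembly is normalized (registers and constants replaced by placeholders), functions are broken into overlapping windows of three lines, and two functions are scored by how many windows they share (Jaccard similarity). The names `KnownFunction`, `normalize`, `shingles`, and `search`, the 0.5 threshold, and the sample code are all hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class KnownFunction:
    """A previously analysed function stored in the analyst's repository."""
    name: str
    comment: str
    asm: list[str]


def normalize(line: str) -> str:
    """Replace constants and common x86 registers with placeholders."""
    line = line.strip().lower()
    line = re.sub(r"\b0x[0-9a-f]+\b|\b\d+\b", "IMM", line)
    line = re.sub(r"\b(e?[abcd]x|e?[sd]i|e?[sb]p)\b", "REG", line)
    return line


def shingles(asm: list[str], k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping k-line windows of normalized code."""
    norm = [normalize(l) for l in asm if l.strip()]
    if len(norm) < k:
        return {tuple(norm)} if norm else set()
    return {tuple(norm[i:i + k]) for i in range(len(norm) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared windows divided by all distinct windows in either set."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def search(query_asm: list[str], repository: list[KnownFunction],
           threshold: float = 0.5) -> list[tuple[float, KnownFunction]]:
    """Return (score, function) pairs at or above threshold, best first."""
    q = shingles(query_asm)
    hits = []
    for fn in repository:
        score = jaccard(q, shingles(fn.asm))
        if score >= threshold:
            hits.append((score, fn))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits


if __name__ == "__main__":
    repo = [KnownFunction(
        name="sub_401000",
        comment="sets up a counter, seen in an earlier sample",
        asm=["push ebp", "mov ebp, esp", "xor eax, eax",
             "mov ecx, 0x10", "add eax, ecx", "pop ebp", "ret"],
    )]
    new_sample = ["push ebp", "mov ebp, esp", "xor ebx, ebx",
                  "mov edx, 0x20", "add ebx, edx", "pop ebp", "ret"]
    for score, fn in search(new_sample, repo):
        print(f"{score:.2f}  {fn.name}: {fn.comment}")
```

This sketch compares the query against every stored function in turn, which would be far too slow for a repository of the size the team describes. A real system needs an index that avoids scanning everything on each query.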
The objective is to speed up the process of understanding the behaviour and inner workings of any unfamiliar software. This matters most for malicious software, which is often encrypted and has its contents obfuscated. Just reading the code requires breaking into it, which means getting past the decoys and junk code that hackers insert to divert attention and keep analysts from understanding, and therefore defeating, the malware.
Professor Fung likens the challenges of Kam1n0 to those of Google Search. On the back end, searching through trillions of lines of code in a database and returning results within a few seconds requires a very complex platform that is both accurate and efficient. Kam1n0 is the first functioning assembly code search engine that is both.
“In many real-life applications, we need to make decisions efficiently. Kam1n0 has to scan through terabytes of assembly code and return results within one second of the search, as Google does with all the web pages that exist on the World Wide Web. If Kam1n0 took hours to pull up results from the trillions of searches it must do, no one will use it. This is why efficiency and scalability are as important as accuracy for us. It’s one of our challenges.”
The federally funded project has gained wide recognition in the cybersecurity world, placing second in an international reverse-engineering competition and attracting investment and collaboration from Defence Research and Development Canada (DRDC). Although its AI research components are still in their early stages, Kam1n0 can also be used for applications outside cybersecurity, such as searching for patent-infringing code copied into new software.
Under the agreement with the DRDC, Kam1n0 will remain freely accessible on GitHub, open for anyone to use. Professor Fung hopes this will benefit not only governments but also “any companies and organizations who want to build software to defend their information system network.”
The project, in development at the DMaS, began in 2014 with a federal grant of $400,000 from the DRDC. In 2018, the team received another grant from the DRDC as well as NSERC, bringing their total funding to $1.3 million.