Source{d} Applies Machine Learning to Help Companies Manage Their Code Bases

Jan 4, 2018

If you go to GitHub, the most popular developer platform today, and search for a piece of code, it is a plain-text search.

"It's like how we used to search on the web in 1996," said Eiso Kant, CEO and co-founder at source{d}, a startup focused on applying machine learning on top of source code.

"We have been writing trillions of lines of source code across the world, but none of the systems or developer tools or programming languages we've designed actually learn from all the source code we have written."

A compiler, which translates a programming language into machine code that executes on the device, uses a large set of rules that never take into account actual language, he said.

"There have been probably a million functions written that connect to a database, but the compiler does not understand that what's happening is connecting to a database. And none of the tooling essentially does that."

Founded in 2015, source{d} originated in Madrid, but also maintains an office in San Francisco.

As a revenue stream, it initially created a tech recruitment business, using AI to match developers to job openings based on their open source contributions.

"As developers, we're often closer to artists than to engineers. Just like a writer, you get this feeling of style - we have 300,000 words in the English dictionary, but they're incredibly different for Shakespeare writing a play, a journalist writing for The Washington Post or writing about tech. It's the same in programming. There's an incredibly personal, artistic component to it.

"When we started building AI systems to understand code, we really didn't want to just understand how to solve problem X, but how to solve problem X like developer Y. We wanted to give every single developer personalized recommendations and a real understanding of who they are as a developer, so you don't just cookie-cutter out the answers," Kant said.

Tools for Understanding Natural Language in Code
While profitable, recruitment proved too much of a distraction from the core mission, he said. So they went back to the board and investors and asked to drop that part of the business.

What the company offers instead are tools to better understand code and help enterprises better manage their code bases.

It mines a data set based on 57 million public Git repositories - hundreds of terabytes of source code - to train machine-learning models to understand natural language, intent and similarity.

"One of the biggest challenges we've had is how do you understand natural language in code? When we look at the future of search, the future of code suggestion, the future of compilers, it comes a lot down to understanding natural language, understanding what the intent of the developer actually is and what they're trying to do with a piece of code that they're writing," Kant said.
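One small, common step toward recovering natural language from code is splitting identifiers into the words developers packed into them. The following is a minimal sketch of that idea, not source{d}'s method; the function name split_identifier and the sample identifiers are hypothetical.

```python
import re


def split_identifier(name: str) -> list[str]:
    """Split a snake_case or camelCase identifier into lowercase word tokens."""
    words = []
    # First cut on underscores and any other non-word characters.
    for part in re.split(r"[_\W]+", name):
        # Then cut at case transitions, keeping acronyms such as "HTTP" whole.
        words += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
    return [w.lower() for w in words]


# Two differently styled names reduce to comparable word lists.
camel_tokens = split_identifier("connectToDatabase")
snake_tokens = split_identifier("open_db_connection")
```

Word lists like these can then be fed to ordinary text models, which is one way a tool might notice that many differently named functions all deal with databases.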

One large customer has 50,000 developers globally and billions of lines of source code created over years. The company wants to build better developer tools, ones that are no longer dumb, he said.

So source{d} provides a technology stack that the customer can deploy internally, first to collect all that source code in one place, and second to process and analyze it easily at scale. The stack applies Source Engine, a library for running scalable data retrieval pipelines, and then layers machine learning, Source ML, on top to understand what is happening in the customer's code base.

It uses Spark for data processing and TensorFlow for training and inference. A lot of low-level programming is done with Golang and Scala; it uses Python for machine learning and CUDA for programming directly on GPUs, the graphics cards used for machine learning models.

Its engine generates a data set of universal abstract syntax trees from source code, ready to be analyzed or used in machine learning tools and models. It's just one of its data-retrieval tools. Its projects also include models, language analysis tools, machine learning tools and demos.
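To make the idea of an abstract syntax tree in a generic format concrete, here is a minimal sketch that uses Python's standard ast module and handles Python only. It does not use source{d}'s engine or its tree format; the helper names to_generic_tree and collect_types and the sample snippet are assumptions made for illustration.

```python
import ast


def to_generic_tree(node: ast.AST) -> dict:
    """Convert a Python AST node into a plain nested dict: type, optional name, children."""
    generic = {"type": type(node).__name__, "children": []}
    # Record a human-readable name where the node carries one.
    for attr in ("name", "id", "attr"):
        value = getattr(node, attr, None)
        if isinstance(value, str):
            generic["name"] = value
            break
    for child in ast.iter_child_nodes(node):
        generic["children"].append(to_generic_tree(child))
    return generic


def collect_types(tree: dict) -> list:
    """Flatten the generic tree into a pre-order list of node types."""
    result = [tree["type"]]
    for child in tree["children"]:
        result.extend(collect_types(child))
    return result


# Hypothetical snippet: a function that opens a database connection.
source = "def connect(host):\n    return db.open(host)\n"
tree = to_generic_tree(ast.parse(source))
node_types = collect_types(tree)
```

Because the result is plain dicts and lists, it can be serialized or turned into features without any knowledge of the original language's parser, which is the appeal of a shared tree format.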

The language-analysis tool Babelfish, for example, parses programming languages and generates an abstract syntax tree in a universal format. The demo vecino is a CLI app to find the Git repository most similar to another.
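Finding the repository most similar to another can be sketched as a nearest-neighbor lookup over per-repository feature vectors. The example below is a toy illustration under stated assumptions, not a description of how vecino works: each repository is represented by hypothetical identifier-word counts, and similarity is plain cosine similarity.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: str, repos: dict) -> str:
    """Return the name of the repository closest to `query`, excluding itself."""
    candidates = [name for name in repos if name != query]
    return max(candidates, key=lambda name: cosine_similarity(repos[query], repos[name]))


# Hypothetical repositories, each reduced to counts of words found in identifiers.
repos = {
    "orm-lib": Counter({"database": 5, "connect": 3, "query": 4}),
    "sql-client": Counter({"database": 4, "connect": 2, "cursor": 3}),
    "image-tools": Counter({"pixel": 6, "resize": 3, "connect": 1}),
}
closest = most_similar("orm-lib", repos)
```

With real data the vectors would come from a trained model rather than raw word counts, but the lookup step would keep the same shape.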

The Road to Source Code Generation
So far it's working with customers purely to learn and grow with them, Kant said. Next year it will be building out an enterprise sales team with a go-to-market strategy.

However, Kant doesn't foresee this technology replacing developers.

"In the end, the biggest part of writing software is not writing code, it's understanding a problem and figuring out how to solve that problem, what is a solution and how to architect it. Most senior engineers will say that writing code is the least intense part of their job," he said.

"As a developer, you're solving problems. You're taking specifications, sometimes it's business logic, or doing something very specific, such as optimize the speed of X. Writing the code is almost secondary. Where we shine on this is helping you write the code better, faster, more securely. We're not in a world yet where AI can solve a lot of these problems by themselves."

In a blog post, Francesc Campoy, vice president of developer relations, writes:

"There's a never-ending list of use cases that could benefit from ML over source code: autocompletion (that doesn't require a connection to a third-party server), code linters, architecture analyzers, automated code reviews, and (one day) source code generation from unit tests or even natural language specifications."

Source: The New Stack