This past week I attended CSMR-WCRE (soon to be called SANER), where I enjoyed Andi Marcus's acceptance talk for the most influential paper from 2004. As I listened to his talk, I realized that his work on applying IR techniques to more effectively search source code was originally presented at the second conference I ever attended. It's funny how this work is still influencing my work with Sando today, yet at the same time a bit sad how far code search still has to go to gain mainstream acceptance.
Fortunately, while attending CSMR-WCRE and enjoying some great code search talks from both academia and industry, it hit me: code search is extremely well-positioned to make a real impact in the next few years. Here's why:
Researchers are focusing on human aspects
In the early days of code search research, like in most fields, there was an appropriate but strong focus on better algorithms, higher precision and recall, and automated evaluations. Yet, because of the great progress on these fronts, researchers like Sonia Haiduc, Emily Hill, and I are starting to focus on the human aspects of code search, like query re-formulations, query recommendations, and search-appropriate UI. While human aspects may seem soft to hardcore algorithmic researchers, consider this: a UI/UX overhaul for Sando in 2013 had a lot to do with its 10x increase in downloads over 2012. I believe progress on these long-neglected research areas may push code search over the top.
Researchers have the infrastructure to make rapid progress
One of the long-known problems with code search tools has been the difficulty of evaluating new search algorithms and, even harder, comparing two competing ones. As documented in Bogdan and Denys's 2012 survey, only 30 of 89 papers included an evaluation that compared the new approach against any existing approach. Fast forward to 2014 and we now have at least two infrastructures for evaluating code search algorithms: one that lets us quickly prototype new algorithms and test them in the lab, and one that lets us further validate them via field data.
First, we have TraceLab, a framework for creating and carrying out search-algorithm evaluations in a reproducible manner. I've seen TraceLab used, for instance, to show the effect of including or excluding commonly used components, such as splitters and stemmers, in code search algorithms. Thus, for researchers who build any improvement on top of existing approaches, evaluating their work against the state of the art is as simple as implementing a TraceLab component.
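To make the splitter/stemmer idea concrete, here is a minimal sketch of what such preprocessing components look like. This is illustrative Python, not TraceLab's actual component API; the `naive_stem` function is a toy suffix-stripper standing in for a real Porter stemmer, and the toggle flags mimic how an evaluation might switch each component on or off.

```python
import re

def split_identifier(identifier):
    """Split a code identifier on underscores and camelCase boundaries."""
    words = []
    for part in re.split(r"_+", identifier):
        # Match runs of capitals (acronyms), capitalized words, or digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words if w]

def naive_stem(word):
    """A toy suffix-stripping stemmer (a stand-in for a real Porter stemmer)."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(identifier, use_splitter=True, use_stemmer=True):
    """Chain the components; the flags mirror an include/exclude evaluation."""
    words = split_identifier(identifier) if use_splitter else [identifier.lower()]
    return [naive_stem(w) for w in words] if use_stemmer else words

# e.g. preprocess("parseHTTPResponses") -> ["parse", "http", "response"]
```

Because each stage is a small function with a uniform interface, swapping in a competing splitter or stemmer and re-running the same evaluation is trivial, which is exactly the property that makes reproducible comparisons cheap.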
Second, once researchers determine a new top algorithm in the lab using TraceLab, we also have a framework for evaluating that new algorithm in the field: Sando. Assuming that the algorithm can be plugged into Sando, we've recently shown that anonymous activity stream metrics correlate with user satisfaction, and thus by watching these metrics we can determine whether the new algorithm improves performance in the field too.
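The idea of a metric proxying satisfaction can be sketched in a few lines. The data and the `clicked_rank` field below are entirely made up for illustration (they are not Sando's actual telemetry), but they show the kind of correlation check one would run over anonymized session logs.

```python
from math import sqrt

# Hypothetical anonymous session records: the rank of the result the user
# clicked, and whether the session was rated satisfying. Invented data.
sessions = [
    {"clicked_rank": 1, "satisfied": 1},
    {"clicked_rank": 2, "satisfied": 1},
    {"clicked_rank": 7, "satisfied": 0},
    {"clicked_rank": 3, "satisfied": 1},
    {"clicked_rank": 9, "satisfied": 0},
    {"clicked_rank": 8, "satisfied": 0},
]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson([s["clicked_rank"] for s in sessions],
            [s["satisfied"] for s in sessions])
# A strongly negative r says that sessions where users had to click deeper
# into the result list tend to be the unsatisfied ones, so the anonymous
# metric can stand in for satisfaction when comparing algorithms in the field.
```

Once such a correlation is established, deploying a new algorithm and watching the metric shift is far cheaper than surveying users directly.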
I believe this infrastructure will dramatically reduce the time-to-market for new ideas coming out of research. Each new idea can be quickly tested in TraceLab, validated in the field by Sando, and put in front of users in a matter of months instead of years.
Calling all students!
So, PhD students looking for a topic: you may have thought that code search research was quieting down. I would argue the opposite; the research done in the next five years will likely be what appears in practice. There are obvious needs in terms of better query recommendation/expansion algorithms, more software-specific synonym information, ideas for dealing with abbreviation-heavy domains, improved code search UI designs (e.g., lists vs. graphs), and even the performant use of structural information to improve code search results. The infrastructure is available and the low-hanging fruit is ripe... come join us for some of the most exciting work of the next five years!
David Shepherd leverages software engineering research to create useful additions to the IDE.