Our recent work on the Sando Code Search extension, a tool which leverages Lucene to search code, has been focused on making it more scalable and robust. To demonstrate our progress I'll provide demos of both Sando and FindInFiles (i.e., a grep-like feature in Visual Studio) searching the entire Linux kernel. As you'll see, there's a fundamental difference between Lucene-based search tools and regular expression based search tools.
Before we begin, let's first briefly examine the Linux source tree. At the time of our demo it contained 47,528 files which occupied 1.71 GB on disk. Most of these files were C code, yet there was also a fair amount of documentation and configuration files. Sando and FindInFiles both search all text files.
Searching the Linux Source Tree with FindInFiles
To use FindInFiles I configured it to search the directory containing the Linux code, entered my search, and selected Find All. In this running example the user is searching for encryption algorithms, specifically those related to AES, and thus they use the regular expression query "encrypt*aes". Executing this search caused FindInFiles to run its regular expression matching algorithm against every line of every file in that directory, recursively. As you can see in "Starting the Search", this utilized about 50% of the CPU on an eight core machine for a considerable amount of time.
After about one minute and forty seconds the search completed, having searched 47,407 files. Unfortunately, no lines matched this particular search (see "Finishing the Search"). As often happens with a regular expression based search, the word ordering in the query did not match the word ordering in the code. In this situation the user would likely have to run another search with re-ordered search terms (e.g., "aes*encrypt") to find relevant code.
Searching the Linux Source Tree with Sando
Next we searched the same Linux source tree using Sando. Unlike FindInFiles, which is based on regular expression matching, Sando is built upon information retrieval technology (think Google). It leverages Lucene.NET to pre-index source code and provide ranked results almost instantly. Typing in the same query as before minus the regular expression syntax (i.e., "encrypt aes") you can see below that results are returned almost instantly. Just as importantly, the most relevant results are returned first with less relevant results toward the bottom. Additionally, in Sando's UI, selecting a result in the list provides a preview of the program element with matching terms in bold.
Of course, there is a cost to pre-indexing. For the Linux source tree that cost is about 50 minutes of low CPU background processing. Fortunately, this only happens once after which incremental updates and switching branches trigger at most a few seconds of indexing. Additionally, for most medium-sized projects initial indexing completes in a matter of seconds. For instance, Sando can index its own source code in less than ten seconds.
Try It For Yourself: Online, in Eclipse, or in Visual Studio
I hope this quick post has piqued your interest in Lucene-based code search tools. If you'd like to try an advanced code search tool there are several options, both online (e.g., GitHub's search) and offline (e.g., Sando for VS and Instasearch for Eclipse). Follow the links below to try them out.
David Shepherd leverages software engineering research to create useful additions to the IDE.