As developers, most of us use an IDE because it takes care of the messy details of development, allowing us to focus our full attention on higher-level tasks. Most common IDE features work well, yet unfortunately there is one tool embedded within modern IDEs that is failing us: the search tool.*
Regex-based searches are slow
IDEs currently offer search tools that are painfully slow to execute.
Recent studies show that about 88% of developers' searches fail, but because search tools are bundled within an otherwise well-performing IDE, most developers do not recognize the problem, and some even deny it exists. Today's post serves to dispel this myth. Three authors of search tools, including myself, present the case for upgrading your current search tooling.

*In this article we focus on local search tools. These tools are used to search your on-disk project. Examples include 'Find-in-Files' in Visual Studio, 'File Search' in Eclipse, or even grep. We are not discussing web-based searches of code repositories, such as GrepCode or GitHub's search.

Richie Hindle 
Entrian Source Search (Visual Studio)

No matter how good we are as developers, we can't hold the whole source tree in our heads at once - we need tools to help navigate it. IDEs are a big help - the F12 key on my keyboard (Visual Studio's Go to Definition command) is well worn. But the tools that IDEs provide don't have the power, coverage or speed that a full-text indexed search engine gives you. As one Source Search user puts it: "Our products are complex enough that not everything lives in C#: there's XML, PowerShell scripts, WiX config, custom build actions in the .csproj files... only a full-text search engine will do what I want."

Anyone remember when the Yahoo Directory was how you found things on the internet? Life without Google, yow... but that's what developing without a good search tool is. Imagine the leap from Yahoo Directory to Google, but for your coding experience. Obviously speed is a big part of that - anything that breaks your train of thought, interrupts your flow, is a bad thing, so instantaneous search is a big help. But it's also about being able to express what you're looking for, both more exactly ("it's in the Renderer directory, in a .cpp file that I've modified today") and more vaguely ("it's something like InterpolateSomething(), and there's a comment that talks about quaternion rotation").

I believe that a good source code search engine will one day be one of the tools of the trade that we all take for granted, like syntax coloring or networked source control - yes, you could develop without those things (and I'm just about old enough to remember when that was normal!) but you'd feel like you were working with one hand tied behind your back. Better tools make us better developers, and powerful code search is one of those tools that you quickly wonder how you ever did without.


Andrejs Jermakovics 
InstaSearch (Eclipse)

In my experience, good code search is essential in an IDE and can be a massive productivity boost. This is especially true when working with large codebases, and I wrote InstaSearch out of my own need to find code in million-LOC projects. In a way this is similar to desktop search, but there are aspects specific to source code. You have to be able to search for the words inside variable names independent of the naming convention (camel case, underscore delimited) and to search inside specific code projects. And, of course, it helps a lot if the search is fast, since you can tweak your search and see the results change immediately.
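To make the naming-convention point concrete, here is a minimal sketch of splitting identifiers into words so that a query matches regardless of camel case or underscores. The helper names (split_identifier, matches) and the regular expression are assumptions made for this illustration; this is not InstaSearch's actual implementation.

```python
import re

# Hypothetical tokenizer: runs of capitals (acronyms), capitalized or
# lowercase words, and digit runs each become one token.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(name: str) -> list[str]:
    """Split camelCase, PascalCase, or underscore_delimited names into lowercase words."""
    return [word.lower() for word in _WORD.findall(name)]


def matches(query: str, identifier: str) -> bool:
    """True when every word of the query is one of the identifier's words."""
    words = set(split_identifier(identifier))
    return all(term in words for term in query.lower().split())


if __name__ == "__main__":
    print(split_identifier("SrcMLCppParser"))
    print(matches("cpp parser", "SrcMLCppParser"))
    print(matches("cpp parser", "src_ml_cpp_parser"))
```

With this kind of splitting, the query 'cpp parser' matches both SrcMLCppParser and src_ml_cpp_parser, because both break down into the same words.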

The two main use cases for search in IDEs that I'm noticing are: 1) I am looking for an exact string, 2) I don't know exactly what I'm looking for but I want to find code related to a few words. The first one is for looking up all occurrences of some text such as a constant or a method name. The second one is for discovering new code and finding where a particular functionality is implemented. I think code search tools need to support both these use cases to be effective. One way to enhance code search even further is to take advantage of static code information such as classes and methods.


David C. Shepherd
Sando Code Search Tool (Visual Studio)

In the past I have presented arguments as to why searches fail, why an information retrieval approach is better than a regex-based approach, and even pointed out the obvious superiority of ranked results over a flat list. Today, however, I want to guide you through a few searches in your own IDE, as many developers (myself included!) do not truly understand the depth of the issue until they experience it in their own code base. So please humor me by opening your IDE and trying out the following three searches:

  1. Search for the most popular term in your code base.  For the Sando code base that would mean searching for the term 'search'. Using standard search tools this search takes 10s to execute on a relatively small project of about 300 files. Using a next-gen search tool search results are instant, regardless of the number of project files or hits. 
  2. Search for a feature that someone else implemented. For the Sando code base that would mean searching for the method parsing code. Using standard search tools I search for 'parse' and receive 3300+ hits. Using a next-gen tool autocompletion guides me to expand my query to 'parse method', which finds the relevant methods as the top three hits.
  3. Search for 5 - 10 "known" classes, methods, or fields. For the Sando code base, that would mean searching for classes I'm familiar with, like CppParser. Using a standard symbol lookup dialog (i.e., 'Navigate To' in Visual Studio or 'Open Type' in Eclipse) to search for 'cpp', SrcMLCppParser is hidden in an alphabetized list of 47 matching symbols (at slot 32).  Even after expanding my query to 'cpp parser' SrcMLCppParser is still one of 18 matching symbols. A search for 'cpp' using a next-gen tool recommends 'cpp parse' as the first autocompletion and the resulting search returned SrcMLCppParser as the first result. 
The quick execution time, autocompletion help, and information retrieval-based search engine of next-gen tools lead to a much better search experience.


Ready for an Upgrade?

Today we have presented our case for why you should upgrade your search tooling, and I hope we have convinced you.  If our arguments have hit home here is a list of known search tools available for popular IDEs that are based on information retrieval technology.  Enjoy!
Eclipse:
InstaSearch
Visual Studio:
Source Search
Sando
Standalone:
OpenGrok
 
 
When using code search tools, such as Find in Files, developers don't want to waste mental energy crafting the perfect search query; they just want to find relevant code. Previous versions of the Sando Code Search Extension helped by providing users with conceptual autocompletion. Even with this help we found users, including very smart, experienced developers, creating ill-advised queries. For instance, users would perform literal searches (e.g., document.Add) without adding the required quotes (e.g., "document.Add"). Or they'd enter a (previously) unsupported query format, like a wildcard query (e.g., custom*Document). To make it more natural for developers to interact with Sando, we expanded our query interpretation code in today's release. Developers can enter a keyword query (e.g., open file), a literal query (e.g., File.Open), or even a wildcard query (e.g., Document*Create) and expect Sando to take care of the details of returning relevant results.
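As a rough illustration of what interpreting these three query styles could look like, here is a minimal sketch. The QueryKind enum, the classify_query function, and the punctuation heuristic are all assumptions invented for this example; Sando's real interpretation rules may differ.

```python
import re
from enum import Enum


class QueryKind(Enum):
    KEYWORD = "keyword"
    LITERAL = "literal"
    WILDCARD = "wildcard"


# Assumed heuristic: punctuation that normally appears in code, not in prose.
_CODE_PUNCTUATION = re.compile(r"[.()\[\]<>:;=]")


def classify_query(query: str) -> QueryKind:
    """Guess how a raw query string should be interpreted."""
    q = query.strip()
    if "*" in q:
        return QueryKind.WILDCARD
    quoted = len(q) >= 2 and q[0] == q[-1] == '"'
    if quoted or _CODE_PUNCTUATION.search(q):
        return QueryKind.LITERAL
    return QueryKind.KEYWORD


if __name__ == "__main__":
    for example in ("open file", "File.Open", "Document*Create", '"document.Add"'):
        print(example, "->", classify_query(example).value)
```

The point of the sketch is that the user never states the query type; the tool infers it from the shape of the text.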

Please give Sando's new query interpretation code a spin and let us know what you think! If you like this new direction we'll look into even better query interpretation in the future. 

Sando Code Search Tool
Supported Languages: C, C++, and C#
Supported IDEs: Visual Studio 2010 - 2012 (All Editions)
Available: Visual Studio Gallery
 
 
Here's an example search using FindInFiles (left) and Sando (right):
I'm genuinely curious: why do Visual Studio developers prefer FindInFiles over search tools like Entrian's Source Search or my own group's Sando Code Search tool? Let me know in the comments and feel free to be brutally honest!
 
 
Last year we released an early version of Sando, a free, open source code search extension for Visual Studio that was based on the latest advances in code search research. Because of the modest success of this relatively unpolished tool (1300+ downloads), we have taken a number of measures to refine and improve it, making it more appropriate for wider usage. Among other things, we have refactored the code into two independent projects, updated the core search algorithm based on user study feedback, improved the quality of the test suite, shortened indexing time by 2-3x, and added autocompletion. In this post we will highlight several of the new features of Sando 0.4, which searches C, C++, and C# code.*

The ReallyReallyReallySimpleRoguelike project will be used to illustrate several usage scenarios of Sando. This game "...has a really simple goal - pickup a sword and kill the monster(s)."

Conceptual Autocompletion

When searching an unfamiliar codebase, or even an unfamiliar part of a familiar codebase, creating an effective query can be tricky because it is difficult to guess what terms are actually used in the code. Let's consider searching for the concept of picking up a weapon in the Roguelike C# game. Using the default FindInFiles search engine, searches for "pick*weapon" or alternatives such as "grab*weapon" fail to return any results, and a search for "weapon" returns a large number of results that would be tedious to sort through.
Picture
Sando can help you complete your query when you are only partially sure of what you should be searching for. In the example to the left the user has typed in "weapon" and Sando has proposed several autocomplete suggestions.  A quick scan suggests that the #3 result, "Add Weapon", is likely a good candidate. 

Picture
Executing the search for "Add Weapon" finds several relevant methods, including the AddWeapon method in the Player class, which is most directly responsible for picking up a weapon. This search also returns several related fields and methods such as field Player.WeaponSlot.  

Exclude Terms (e.g., Don't Search Test Files)

Picture
When executing a search on a code base it is easy to miss the relevant items in a search result because they are overwhelmed by unrelated results. When trying to search for the concept of reading input from the keyboard in RRRSimpleRoguelike my results were flooded with XML-reading code. To eliminate these results from my search I added '-xml' to my query. Another popular use case for this feature is to exclude tests from search results.
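Here is a minimal sketch of how a leading '-' could be treated as an exclusion marker. The function names and the simple substring matching are assumptions made for illustration only; they are not Sando's actual query parser.

```python
def parse_query(query: str) -> tuple[list[str], list[str]]:
    """Split a query into terms to include and terms to exclude ('-term')."""
    include, exclude = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            exclude.append(token[1:])
        else:
            include.append(token)
    return include, exclude


def filter_results(results: list[str], query: str) -> list[str]:
    """Keep results containing every included term and none of the excluded ones."""
    include, exclude = parse_query(query)
    kept = []
    for text in results:
        lowered = text.lower()
        has_all = all(term in lowered for term in include)
        has_excluded = any(term in lowered for term in exclude)
        if has_all and not has_excluded:
            kept.append(text)
    return kept


if __name__ == "__main__":
    # Made-up result names, not taken from the Roguelike project.
    hits = ["ReadKeyboardInput", "XmlReader.ReadElement", "ReadXmlConfig"]
    print(filter_results(hits, "read -xml"))
```

The same mechanism covers the test-exclusion use case: a query such as 'parse -test' would drop any result whose text mentions 'test'.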

Limit by Filetype (e.g., Only Search .h Files)

Picture
Many projects are a combination of languages (e.g., C++ backend with a C# GUI) and searches meant to explore the C# code can include results from C++ code. In the example to the left we have searched Sando's code base itself for the term "theme".  The relevant C# results are overwhelmed by the results from the C++ code. In fact only a single C# result is shown (as the second result). 

Picture
To eliminate unwanted filetypes, and to enable developers to search only header files, we have implemented a filetype search. In the example on the left we have further scoped the search to only include C# files, thus eliminating the irrelevant C++ results from our search.

Exact Matching (e.g., Find This String Literal)

Picture
As we saw during our user study, developers often want to search for exact strings. By adding quotes to any search string in Sando developers can search for specific literals (or any line snippet) that exists in code.  Because Sando is an indexed searcher the results are near-instant, whereas using FindInFiles for this type of search will cause a delay while each file in the search scope is scanned.
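To show why an index answers queries faster than a scan, here is a toy sketch comparing the two. It is an assumption-laden illustration, not how Sando or Lucene is implemented: the index is a plain dictionary from word to file paths, and matching a full quoted phrase would additionally need word positions, which are omitted here.

```python
from collections import defaultdict


def build_index(files: dict[str, str]) -> dict[str, set[str]]:
    """Build a word -> set-of-paths inverted index once, ahead of any query."""
    index: dict[str, set[str]] = defaultdict(set)
    for path, text in files.items():
        for word in text.lower().split():
            index[word].add(path)
    return index


def indexed_search(index: dict[str, set[str]], word: str) -> set[str]:
    """One dictionary lookup, independent of how many files exist."""
    return index.get(word.lower(), set())


def scan_search(files: dict[str, str], word: str) -> set[str]:
    """Re-read every file on every query, as a scanning tool must."""
    target = word.lower()
    return {path for path, text in files.items() if target in text.lower().split()}


if __name__ == "__main__":
    # Hypothetical file contents for the demonstration.
    project = {
        "Player.cs": "public void AddWeapon ( Weapon weapon )",
        "Monster.cs": "public void Attack ( Player player )",
    }
    index = build_index(project)
    print(indexed_search(index, "weapon"), scan_search(project, "weapon"))
```

Both functions return the same files; the difference is that the scan's cost grows with the size of the project on every query, while the index pays that cost once when it is built.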

Got All That? If Not, Consult the Tooltips

As you use Sando you'll notice that, as part of polishing it, we've added a few tooltips to guide your usage. Both the search box and the [?] icon include helpful tips on how to use Sando effectively.

Try Our Free, Open Source Search Extension

We think that the combination of new features, dramatically improved indexing performance, and polish leads to a much better search experience, but we'd love to hear what you think! Please don't hesitate to drop us a line on CodePlex or, better yet, rate Sando on Visual Studio Gallery.

Download Sando from Visual Studio Gallery
 
 
When performing software engineering research, one of the biggest challenges is gathering information on developers. Developers are busy, usually with an upcoming deadline, and they often don't have time to take an hour-long survey or perform a task using your pre-alpha-research-quality-prototype tool. As a step towards making developer data more available, we (ABB Corporate Research and University of Zurich) are releasing our recently collected developer data set. Our data consists of twelve two-hour transcripts of developer actions, transcribed by hand from the approximately two hours of video collected as developers completed change tasks on open source software. It also includes the patches they submitted and a drawing/description of the program elements and relationships they deemed relevant to complete their task (e.g., see above picture). Our hope is that this data will allow researchers to better understand how developers work, to sanity check hypotheses prior to performing their own user studies, or to investigate the potential impact of a proposed software tool.

For more information visit: Study Artifacts: Supporting Search and Navigation through Code Context Models
Tech Report: Supporting Search and Navigation through Code Context Models
 
 
The above video provides some qualitative insight into how frustrating it can be to search in today's IDEs.
During the last year or so my collaborators and I have been focused on impacting both the state-of-the-practice and the state-of-the-art in code search. To impact the state-of-the-practice we have created a code search extension for Visual Studio, called Sando. To impact the state-of-the-art we have used this extension as the basis for case studies, where we collected usage data from developers performing normal maintenance tasks. For us, this data has directly led to improvements in the upcoming release of Sando. However, to maximize our impact on the state-of-the-art we are releasing this humble data set to other researchers and developers in hopes that it helps them create an even better code search tool.   

Context and Data Format

This data was collected during a comparative analysis of two search engines, namely the default Visual Studio search engine (i.e., Find in Files, or FiF) and the Sando search engine, as developers worked in situ. It was collected in two separate phases. The first phase compared the then-current version of Sando against FiF and contains 325 user queries. The second phase compared an improved version of Sando against FiF and contains 637 user queries.

The primary reason we are posting this data is to share the user queries we have collected.  However, as the results were gathered during our case study they contain some additional information.  The results for phase one are in the following format: <date>; Sando=<wins>, Lex = <wins> ; query=<query string>.  An example data entry is shown below.
11-8-2012; Sando=6, Lex=3 ; query='reader'
The date represents the day on which the data was collected, and the numbers after Sando and Lex represent the number of 'wins' for each approach for this query. In the above example the developer clicked on 9 search results; 6 of those clicked items had a higher ranking in Sando's result set and 3 had a higher ranking in Lex's result set. In phase two the data format was slightly updated, adding: <number of Sando results>, <number of Lex results when grouped by program element>(<number of Lex results when grouped by line number>). An example data entry from this phase is shown below:
11-29-2012; EnhSando=1, Lex=0 ; query='MembershipUser' ; 20, 90(105) 
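If you want to load these entries programmatically, here is a minimal parsing sketch. It assumes every line follows one of the two example formats shown above; the field names (tool, tool_wins, lex_elements, and so on) are invented for this example, and the actual log files may contain variations this pattern does not cover.

```python
import re
from typing import Optional

# Matches both phases; the trailing result counts exist only in phase two.
ENTRY = re.compile(
    r"^(?P<date>\d{1,2}-\d{1,2}-\d{4})\s*;\s*"
    r"(?P<tool>\w+)\s*=\s*(?P<tool_wins>\d+)\s*,\s*"
    r"Lex\s*=\s*(?P<lex_wins>\d+)\s*;\s*"
    r"query\s*=\s*'(?P<query>.*?)'\s*"
    r"(?:;\s*(?P<tool_results>\d+)\s*,\s*"
    r"(?P<lex_elements>\d+)\s*\(\s*(?P<lex_lines>\d+)\s*\))?\s*$"
)

_INT_FIELDS = ("tool_wins", "lex_wins", "tool_results", "lex_elements", "lex_lines")


def parse_entry(line: str) -> Optional[dict]:
    """Parse one log line into a dict, or return None if it does not match."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in _INT_FIELDS:
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    print(parse_entry("11-8-2012; Sando=6, Lex=3 ; query='reader'"))
    print(parse_entry("11-29-2012; EnhSando=1, Lex=0 ; query='MembershipUser' ; 20, 90(105) "))
```

For phase-one lines the three result-count fields come back as None. The date is kept as a string rather than converted, so no assumption about its format is baked in.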
In favor of brevity, details of our comparative study have been omitted from this post. For further information on this study please contact Kostadin Damevski, Lori Pollock, or myself for a preprint. 

The Data

Here we provide the data in two separate files, the first from phase one and the second from phase two. We hope that this data helps further your research or improve your search tool!  
timeordered-sando-v-lex.log (phase one; log file, 16 kb)
timeordered-enhsando-v-lex.log (phase two; log file, 74 kb)

 
 
I read a few articles and a book about what companies usually look for in a full-time candidate and was wondering whether I could fit the bill with the research project in place of the usual internship.
Recently I gave a guest lecture at NCSU for their software engineering course.  One of the students I chatted with afterwards will be using this summer to work on his research thesis.  This particular thesis work involves significant software development.  Having just heard my talk on the importance of gaining software development experience prior to graduation he was wondering how to get the most out of his summer work.  Here is my advice for both this student and any other student taking on summer research. 

Open Source Your Work

Academia affords students great opportunities for opening up their work. While some specific DoD grants may have restrictions, most typical grants (e.g., NSF or industrial grants) allow or even encourage open sourcing your work. As a student, open sourcing your research code will have two major benefits. First, your code will become much cleaner. While many developers may have some bad habits they relapse into when committing to a closed repository, when they write code they know will be public, the code is simply better. This will not only make for a more pleasant summer but will also help you as you begin to build a programmer's portfolio, or a set of publicly available projects that you have written. This second benefit of open sourcing your code, while not a new idea, is a surprisingly simple way of gaining credibility with potential employers. For instance, two of the three interns that made it past the first round at ABB this year provided us with links to their publicly available code (note: the third provided a live demo of their work).

Collaborate if at all Possible

While generating a sizable project alone can certainly be impressive, working with others is an important part of experience that many students are lacking. If at all possible, collaborate with other researchers and students on your code. This can be as simple as building upon an existing framework or a set of libraries that another student has generated. For instance, during my time at UDel, summer undergrad students would often start from an existing tool (e.g., a search tool) and improve a single component of that tool (e.g., an abbreviation expander), investigating the effect of that component's improvement on the overall performance. This forced them to not only understand that existing component, but to also learn how that component interacted with the larger system. In most research labs there is ample opportunity to build upon others' work.

Focus!

I have planned my schedule for the summer and I notice that I could spare time in a day for projects/work outside my thesis.
One point I want to warn against is losing focus. While a summer may seem like a long time, it is short. So short, in fact, that at ABB we communicate with upcoming interns for months in advance to hammer out a well-scoped idea so that when they join us for the summer they can have a chance at finishing something significant. Thus, I advise students to focus on a single project. When doing this, it is important that this project have a significant coding component, as this student's does, so that, should they have any extra time, they can continue to polish, refactor, and improve this code base ad infinitum. If a student can finish the summer with a well-polished research prototype, with clean code and a working demo, in addition to an (eventually) completed thesis, that student will certainly have good job prospects.

Enjoy the Summer...

I hope this post helps you as you organize your summer. To summarize, I recommend open sourcing as much as you can, working with others whenever possible, and focusing on a single project. Following these guidelines, I'm sure that you can have a polished, well-narrated demo video posted on YouTube by the end of the summer (add a link in the comments!). For those of you pursuing PhDs in software engineering who enjoy both research AND developing software tools, consider applying to our internship program next year.
 
 
Many software engineers (i.e., those that actually program on a daily basis) are unaware of the dedicated sub-field of software engineering researchers (like these from Microsoft) whose mission is to help make the daily grind of writing software better.  While software engineering researchers have historically had limited practical impact, there are some notable companies and tools that were born out of software engineering research, and many brilliant, driven individual researchers who want to have impact.  I hope that raising awareness of this research among software engineers will encourage more feedback to the software engineering research community, ultimately leading to more useful output.

To that end, I'm posting a short three-minute video that provides a quick overview of a typical software engineering research project at ABB. I hope this video gives you a better sense of what we're working on, our balance between theoretical and practical impact, and what types of technologies we're investing in. If you're a software engineer, we'd love to hear what you think about our research directions, how we could improve, or even just what problems are currently slowing you down. Feel free to leave comments or even contact me directly.
 
 
Make it work with 2012! :)   -naspinski, from reddit
Well, the people have spoken, and we've (finally) delivered. The Sando code search extension is now available in VS2012 (in addition to VS2010), thanks in large part to Kosta Damevski. As we release this version I realize that not all readers are familiar with Sando's raison d'être. For those of you new to Sando, here are the top few reasons why we've spent the last year creating it.

  1. Regex Searches Fail - Many programmers claim that they can write a regex to find any code they might need. This is absolutely true, but they often don't mention that most of their regex searches fail. In Ko et al.'s 2006 study of developers performing maintenance tasks, 88% of developers' regex searches failed, and these developers were searching over a program of only 500 lines of code. In our study of programmers working on medium-sized code bases we observed a similar failure rate. I'd like to challenge you. Record your search success rate for a single day; if it's under 50%, consider installing Sando.
  2. Information Retrieval Technology Avoids Regex Failures - The tragedy of this high failure rate is that the types of failures that are caused by regex technology are easily avoided. Because Sando is backed by Lucene.NET, which uses the Vector Space Model with TF-IDF scoring, it can handle common regex failure cases such as word re-ordering. Users who would have had to search for both "open*file" and "file*open" using regex technology can now simply search for "open file". Similarly, since Sando uses Lucene's SnowballAnalyzer, a search for "open file" will automatically return matches for different word forms, such as "opened", thus finding the relevant method "OpenedFile". While each individual shortcoming when searching with regexes seems trivial, the combination of issues creates real problems. Imagine searching for the concept of "open file". The JavaScript regex ([oO]pen[ing]*(\s)*[fF]ile[s]*)*((\s)*[fF]ile[s]*[oO]pen[ing]*)* could be used to find the likely relevant strings {openfile, fileopen, OpenFile, FileOpen, fileOpening, OpeningFiles}. Yet even more regex-fu would be needed to match the equally probable {FileOpened}. In contrast, Sando would only require the search terms "open file".
  3. Ranked Results Reduce Human Processing Time - One of the main drivers of code search tools is that they save developers time.  In that sense, regex-based searches (e.g., grep) are a huge time-saver when compared with manual scanning.  Sando aims to build upon this time saving by not only automatically identifying matches, but by ranking those matches. In practice this time savings is significant. Consider this common scenario.  A developer searches for the string "save*failed" using a regex-based search.  This very specific search returns no results, so he creates the more general query "fail".  This query returns about one hundred unranked results, which he slowly scans, finding the relevant match in result #50.  In contrast, when using Sando, which ranks matches according to their similarity score, the most relevant result appears as result #2. 
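The word re-ordering, word-form, and ranking points above can be illustrated with a small sketch. Everything in it is a simplification made for this example: the stem function is a crude suffix stripper (not the Snowball algorithm), the scoring is a bare-bones TF-IDF sum (not Lucene's similarity formula), and the method names are hypothetical.

```python
import math
import re
from collections import Counter


def stem(word: str) -> str:
    """Crude suffix stripping for illustration only; NOT the Snowball stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text: str) -> list[str]:
    """Split on spaces and camel-case boundaries, then lowercase and stem."""
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", text)
    return [stem(word.lower()) for word in words]


def rank(query: str, names: list[str]) -> list[str]:
    """Return names sharing a term with the query, best TF-IDF score first."""
    docs = {name: Counter(terms(name)) for name in names}
    total = len(docs)

    def idf(term: str) -> float:
        containing = sum(1 for counts in docs.values() if term in counts)
        return math.log((1 + total) / (1 + containing)) + 1.0

    query_terms = terms(query)
    scored = [
        (sum(counts[t] * idf(t) for t in query_terms), name)
        for name, counts in docs.items()
    ]
    ordered = sorted(scored, key=lambda pair: -pair[0])  # stable for ties
    return [name for score, name in ordered if score > 0]


if __name__ == "__main__":
    methods = ["OpenedFile", "FileOpen", "CloseFile", "ParseMethod"]
    print(rank("open file", methods))
    print([m for m in methods if re.search(r"open.*file", m, re.IGNORECASE)])
```

In this toy example the keyword query finds OpenedFile and FileOpen regardless of word order and ranks the partial match CloseFile below them, while the single regex "open.*file" finds only OpenedFile. One limitation worth noting: an all-lowercase name such as openfile is not split by this tokenizer.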

Research-Driven Advances

Above I've quickly described a few reasons that we expect Sando to perform better than available regex-based tools and I've used a few scenarios to explain why. However, it's important to know that Sando is not primarily based on my personal insights. Sando is built upon the huge body of code search research, started by Andrian Marcus's thesis work and so ably continued by researchers like Denys Poshyvanyk, Dawn Lawrie and David Binkley, Lori Pollock, Emily Hill, and many others. Thus, you can download and use Sando, assured that it's providing you with high-quality search results influenced by cutting-edge advances in software engineering research.

Sando is available as a Visual Studio extension for VS2010 and VS2012.
 
 
While you may know Sando as a software search tool for Visual Studio, many are unaware that Sando is also a research-enabling framework. Sando was built to be extensible, for open source enthusiasts who want to support new languages, but also for researchers who need to quickly prototype new search ideas.

You may wonder, why do researchers need to prototype their code search ideas? Because code search is a software engineering problem, involving aspects of program analysis, information retrieval, and even natural language processing, it's necessary for researchers to ground their new approaches in the reality of the engineering issues.  They need to test their new search algorithm(s) on realistic source code bases, because it's difficult to simulate the complexity of the system through thought experiments alone.     

So, if you're a researcher interested in code search or a developer looking for an open framework to experiment with, have a look at my demo (above) on Sando from the 2012 Foundations of Software Engineering Demo Track. I'll cover not only how developers use Sando in their day-to-day work but also how Sando can be used to quickly realize your kooky research ideas. Happy searching!