Musings of a Tech Transfer Enthusiast
Recently there's been an increasing amount of excitement around industrial tracks. They are becoming well attended, competitive, and discussion-worthy. Alas, this recent enthusiasm could be quickly extinguished if we, the program committees of these industrial tracks, provide inappropriate or low-quality reviews. Imagine the frustration of a practitioner whose real-world case study gets rejected while an experiment with ten freshman undergraduate students is accepted. To avoid this, we need to provide clear, distinct guidelines for evaluating industry track papers.
An industry track focuses on the same topics and expects the same rigor as the corresponding research track, yet papers featured in the industry track are distinct. What sets the industry track apart is that it values impact and realism over novelty. How should this affect paper reviews? Industry track reviews should be the same in terms of quality, length, objectivity, etc., *except* that they should emphasize industrial values instead of academic values. For instance, the ICSME Research Track in 2015 (credit: Robillard & Krinke) used the following list of criteria to rate their papers (note: the ordering is my interpretation of importance):
*Potential impact* refers to the potential impact of the findings, tool, or technique on industry, and it manifests itself in two ways. First, it can be measured by the magnitude of the result. An evaluation of different regression test selection techniques showing that, on one industrial system, one technique is 5% more effective than another would garner a low score for potential impact. The same study showing that one technique is 50% more effective across several industrial systems would rate higher. Second, it can be measured by the applicability of the technique; an improved solution for the overly specific task of model-based, aspect-oriented, real-time fuzz testing is less widely applicable than an improved solution for the more general task of unit testing.
*Real world focus* differs depending on the type or section of the paper. For a tool-focused paper it refers to the maturity of the tool (an open source tool with hundreds of users is better than a closed source tool with hundreds of users, which is better than an open source tool with a few users). For an evaluation section, it refers to the evaluation environment (an industrial environment with real developers working on their own code is better than industrial developers working on toy tasks, which is better than students). For a case study it refers to the type of industrial environment (a large software company is better than a small startup, which is better than an undergraduate class) or the type of software systems studied (both proprietary and open source systems is better than just open source systems, which is better than student/research projects).
*Related work* is extremely important in research papers because it is used to establish novelty. In industry track papers novelty matters much less, and thus related work serves a diminished role: it should give context to the contribution, cite original sources of ideas, and cite original studies. Reviewers should be lenient with authors who miss related work, as practitioners may not always keep up with late-breaking results, and should provide feedback on the related work section with a collaborative mindset.
While the underlying values of a review are important, appropriate organization can dramatically affect how a review is interpreted. To avoid unnecessary complexity, we recommend including the following in your reviews (from B. Adams and D. Poshyvanyk). Note that this advice also applies to research track reviews.
1. Paper summary
2. Points for the paper
3. Points against the paper
4. Supporting argumentation for your points
5. Suggested paper improvements
Although the Program Committee Co-Chairs determine the available ratings for each track, we recommend using the following ratings and guidelines whenever possible. This rating system is preferred over a five-point rating system because it makes it easier to identify the champion (+3) and the opponent (-3) for each paper.
3: strong accept - I will object if rejected
2: accept - I support acceptance
1: weak accept - accept, but could reject
0: borderline paper - I have no opinion
-1: weak reject - reject, but could accept
-2: reject - I support rejection
-3: strong reject - I will object if accepted
While it's important to have clear values and formatting, I always find it easiest to learn through examples. Here's an anonymized example of a well-done review:
Summary: The authors introduce an interesting problem, the performance of…, and provide a case study on their approach to dramatically reducing the runtime of... They also include some discussion of both the challenges and future opportunities.
Summary of Opinion: This paper introduces an interesting problem and offers a solution that improves performance by 50X (in some cases). While I wish the authors had included an example walkthrough of the porting process and that the takeaways were more interesting, I feel that the introduction of the problem and the improvement in performance are valuable enough to warrant publication.
In addition to the above (nearly) full review, I'd like to include a few snippets from industry track reviews that you may never have seen in research track reviews:
It's also important to know what NOT to include in this type of review. Here are a few examples of phrases I hope NOT to see in an industry track review:
While it's great that the software engineering community is rallying around the importance of bridging the gap between academia and industry, we face a major threat to this progress: our own biases. If we do not consciously shift our values when reviewing industry papers, we will end up alienating practitioners further. Now that we have positive momentum, let's put in the work to create professional, fair, and useful reviews.
As the PC Co-Chair for ICSE's Software Engineering in Practice (SEIP) track in 2017, I'm jealous. While the submission policy for the research track of ICSE 2017 has recently been the center of heated debate, SEIP's submission policies have NOT been slandered. To remedy this imbalance, I propose adding the following to SEIP's submission policy:
All researchers who have submitted to a "Visions" track in the past year *must* submit at least one paper to SEIP to help restore balance to the field.
One of the main reasons for this policy is that every year researchers submit more papers to "Visions" tracks, but the pool of qualified researchers doing technology transfer work does not grow in proportion. Allowing this imbalance between the "Visions" being created and the technology actually being transferred to persist erodes our community's already tenuous relationship with working software engineers. These engineers have recently made comments on research in our field such as: "the tool that would result would not be something I would use or can imagine anyone else using." Perhaps our "Visions" would be better if we spent more time in the field.
Vision without execution is hallucination. -Thomas Edison
A key to breaking this cycle is to recognize that technology transfer work is valuable, and on this point the community has begun to respond. In the past few years SEIP has had an acceptance rate hovering in the low 20-percent range and, more importantly, it has become a legitimate venue for professors interested in impact. For instance, Dr. Shane McIntosh, a tech-transfer-curious assistant professor at McGill, has published at SEIP '14 & '16 with no negative effect on his career. Technology transfer has become tolerated.
Yet tolerance is not enough. We look forward to the day when technology transfer work is seen as a first-class member of the ICSE community. We are working hard to bring this vision to fruition, and thus, while this proposed policy won't be popular, we feel these steps must be taken to push tech transfer forward in an efficient, fair, and sustainable way.
In the summer between my fourth and fifth grade years, my parents sent me to a Lego Logo camp (Lego Logo is now known as Lego Mindstorms). As an elementary student I was already trying to build useful, interesting contraptions with my Construx, like an automated NES game dispenser, and this class changed my perspective. Using software to dynamically control the various motors and sensors opened up a new class of possibilities that I hadn't considered feasible before, and my fascination with the subject subsequently bloomed into an undergraduate degree, internships, and eventually a PhD in Computer Science.
My story ends well, as I'm now happily employed at ABB Corporate Research, but I've often wondered what would have happened if my mom hadn't pushed me to attend that camp. Would I have started college as "undeclared", floating around for a few years until I found CS, or would I have simply done something else? I'm honestly not sure how life would have worked out if that camp hadn't sparked my interest, and I'm hugely thankful that it did.
As I've thought through what might have happened to me, a middle-class child, I wonder what happens to kids in more challenging situations. If no one ever pulls back the curtain of technology and shows them how things work, will they ever get excited about STEM careers? My guess is "no". Without that spark, the same one that hit me during a summer camp way back in elementary school, it's hard to sustain the drive and determination to get through what is certainly a challenging major.
And so, to spark interest in computer science among those unwilling or unable to pay hefty summer camp fees, my friends (Nicholas Kraft and Christopher Corley) and I have, for the past two summers, been running a (very cheap) one-week camp to teach 10- to 12-year-olds how to program. In this post I'll share all about the camp, from the tablets the children used, to the language we taught, to the speakers we brought in.
However, this post is not about how great a camp we put on. This post is about giving you, the working software developer, the corporate sponsor, or the volunteer, enough information to decide whether you want to join us next year; I'll detail exactly what we need at the end of the post.
We hold our camp each year at the Roberts Park Community Center. In addition to having rooms of various sizes to accommodate our growing camp, Roberts Park has an excellent and supportive staff. Ms. Sherri Hartsfield, along with her colleagues, not only takes care of all facilities booking and administrative necessities, but is also instrumental in recruiting students who might otherwise not be exposed to programming. Pulling from the Roberts Park after-school and summer childcare programs, neighborhood contacts, and personal contacts, Ms. Hartsfield has no trouble filling our classroom each summer.
We tried to have a full agenda each day (see Thursday's schedule above), finding that keeping the participants busy was key to a productive camp. We utilized a variety of activity types to fill out our schedule. In between programming tutorials, the meat of our code camp, we had talks from working software engineers, videos on the importance of programming, Computer Science Unplugged activities, and demos. Breaking up tutorial/programming sessions with these other activities helped motivate students and had the nice side effect of making programming time seem scarce, which meant that students were very focused when they finally did have programming sessions.
In my computer science journey the Lego Logo class I attended sparked my interest, but this spark was kindled in a nurturing environment. I had a friend whose parents bought a Lego Logo system for home, a school system where computers were commonplace, and a brand new TI-82 that I could program to my heart's content. However, students with less affluent friends and school systems do not necessarily go home to such a nurturing environment. Thus, a key part of our code camp is that, along with the knowledge of how to program, we deliver the means for students to continue programming after the class ends: students get to keep the tablet they use in class. Thanks to generous corporate sponsors, in the first two years of our camp students have taken home either a Chromebook or a tablet.
In the first iteration of our camp (2014) we utilized Scratch, MIT's visual programming language specifically designed for teaching programming. We enjoyed using Scratch and found it effective for teaching programming concepts. However, Scratch is best used on a laptop or Chromebook, which cost about $250 at a minimum. Fortunately, in our second year we switched to TouchDevelop by Microsoft Research, a language designed from the beginning to be programmed via a touch interface, and so we were able to purchase more cost-effective tablets (e.g., the HP Stream 7). TouchDevelop, which is similar to Scratch, provided the same great teaching experience. In addition, TouchDevelop has built-in, interactive tutorials, which reinforced lecture lessons and gave students clear guidance on projects. We found that the interactive tutorials led to much higher completion rates for individual exercises.
Part of sparking an interest in computer science is setting a vision for a desirable future. At least part of my interest in computer science came from the countless adults who told me, "You'll make a great living in computer science!" That, combined with the minimum-wage jobs I had in high school, helped show me that education really did lead to a better future.
To help foster the same realization in our camp's participants, we brought in speakers who were actively earning a living in the computing field. You would be amazed by how effective these 10-15 minute sessions were in growing the children's visions of their future. I'm not sure what it was--perhaps these students had never met a working computer scientist before--but the kids grew absolutely silent, listening intently during many of these talks. And the talks had a strong effect: by the end of the week I'd hear some kids saying, "I want to be like Mr. Mark when I grow up!"
Join Us Next Year!
As you can tell, we had a lot of fun this past summer. More importantly, we helped level the playing field for twenty students, increasing the chances that they'll benefit from the tech boom that is happening all around them in the Triangle. Yet this year we only reached 20 kids when there are hundreds growing up within walking distance of Roberts Park who are falling prey to the digital divide. We want to reach more.
If you are a working computer scientist, a corporation that wants to make a difference in Raleigh, or someone who is willing to volunteer their time, we need your help. In 2016 we hope to reach twice as many of Raleigh's youth, but we need three additional dedicated teachers (anyone with a CS background), about $3000 in corporate sponsorship, and 60 volunteer hours if we're going to reach our goal of 40 campers.
Please contact email@example.com if you are interested in joining us!
The International Conference on Source Code Analysis and Manipulation (SCAM) has always been a small conference, but it consistently delivers high value for attendees due to its interactive format, engineering roots, and strong community. SCAM's 2016 edition will be held in tech-savvy Raleigh, a place that offers visitors a taste of all that's great about the American South without having to leave the comforts of a city. In this post I'll introduce our organizing committee, explain updates to SCAM's format, and give a sneak peek at potential Raleigh diversions.
For those of you unfamiliar with it, SCAM is what I would call an interactive venue. Speakers still present during each session, but a significant portion of each session is devoted to discussing the papers that were presented, as well as provocative questions that each author prepares. To me this is what sets SCAM apart from other venues: it provides a safe, open forum to quickly get feedback on your work from lots of experts. Additionally, it forces you to become better at public discussion, an invaluable skill for an aspiring researcher. If you are seeking deeper feedback and looking to improve your discussion skills in a friendly environment, I'd encourage you to try SCAM out.
Before I introduce our tracks' co-chairs, I first need to explain one change to SCAM's format that will occur in 2016. As always, SCAM will have only two tracks. However, in 2016 SCAM will have a Research Track and an Engineering Track, the latter replacing the Tools Track. This is not to discourage tool paper submissions--they will now fall into the Engineering Track--but to broaden the scope of the track. I do not want to say too much about this track's scope, as we have two excellent co-chairs working on its concrete definition, but if you invest blood, sweat, and tears into tooling, infrastructure, or realistic field studies, know that SCAM recognizes the value of this work, which is not always pure research, and that we are designing this track to attract it.
Thus far we have been able to assemble a great set of co-chairs for SCAM. Our strategy at every level was to select one co-chair from academia and one from industry. Without further ado, I present our General Co-Chairs:
Next, I'd like to present our Research Track Co-Chairs:
And finally, I'd like to introduce SCAM's very first Engineering Track Co-Chairs:
While SCAM itself is worth the trip, it's always nice when the venue has a cultural experience to offer. On this front Raleigh shines. Raleigh's just Southern enough to offer great BBQ and sweet tea, but with enough outside influence to still be comfortable for visitors. Here I'll point out just a few of the places you'll want to visit while you're here.
North Carolina has great BBQ, as everyone knows, but our deep-seated Southern hospitality compels us to ensure we have great offerings for vegetarians too. We'd hate for anyone to feel left out! Two restaurants are just a short walk from the venue: The Pit for BBQ lovers, and the Fiction Kitchen for vegetarians.
One of America's greatest contributions of late is the craft brewery scene, which has been burgeoning in North Carolina. At last count there were 135+ craft breweries in NC alone! While the diversity of brews available has been great, I must say it's a bit overwhelming for the uninitiated to choose from the aisles and aisles of craft suds. That's why I recommend you visit Tasty Beverage, a bottle shop and bar just a short walk from the conference venue. The staff at Tasty Beverage are true craft beer connoisseurs, and they are always ready with helpful recommendations. Treat them as your tour guides through America's craft brew scene and you'll soon see why this movement has become so popular.
For the music lovers, there's a genre of music you must hear live when you're in Raleigh: bluegrass. There's been a recent resurgence of interest in this art form, led by traditionalists like Balsam Range and the more pop-influenced Mipso. As luck would have it, the Wide Open Bluegrass festival has coincided with SCAM's timing for the past two years, and will likely do so again in 2016, so there will be plenty of chances to experience bluegrass the way it was meant to be heard: live.
As you can see, SCAM '16 has a lot to offer, but I haven't given it all away here. You see, SCAM, much like its participants, is a wonderfully quirky venue. I've ridden a duck boat on the way to the banquet in '15 and paid my tribute to Monty Python each year; SCAM has always surprised and amused me. I hope you'll join us for next year's ride.
Jochen Quante and I recently had the pleasure of co-chairing the Industry Track for ICSME '15. Due to the strong work on the ground from our program committee, aided by a few tweaks from us, the track received about twice as many submissions as in recent years. For those of you chairing industry tracks in the future, we'd like to share our strategy.
1. Choose an Industry-Leaning PC
While an earlier post addresses this topic at length, its main point bears reiterating: choose an industry-heavy PC. We intentionally selected a program committee with a much higher percentage of members from industry, raising this percentage from a respectable 46% (11/24) in 2014 to a notable 78% (18/23) in 2015. This choice not only led to more appropriate reviews but ultimately was key in soliciting quality industry submissions. Here's the email we used to invite PC members.
ICSME Industry PC Invitation Email 2015
2. Leverage Your PC's Professional Network
Between the time that PC members sign on and the time papers start rolling in, there's usually a long silent period. Use this period to build momentum for your track. We did so by leveraging the network of our PC. We encouraged PC members to personally reach out to their contacts in industry, guessing that the success rate via personal contact would be much higher than spamming mailing lists or tweeting ad nauseam. Here's the email that we sent to PC members (a few times), which includes a draft email they could use as a starting point when contacting friends.
3. Blog About it
To help generate buzz around your track it's helpful to blog about it, but what do you say about a track that's existed for years? We focused our single publicity blog post on the change in the percentage of industrialists on the PC. While hardly an earth-shattering change, it represented the industry focus that we were bringing to the track this year, and it perhaps struck a chord with potential authors. Thus, I'd encourage you, too, to find your unique take on a track, implement it via concrete changes (no matter how small), and communicate it in a post. Well-timed scope or theme updates to a track can be especially fruitful.
Summary: Invite, Encourage, Blog
While industry tracks were once second-class citizens, they are now gaining credibility with academics and industrialists alike. If you're an upcoming chair for an industry track, I hope these tips will help you run an even stronger track, as ultimately I see industry tracks as a promising vehicle to help solve our field's impact problem... but that's a post for another day!
To close, I'd like to list the program committee members. Even though their day jobs are demanding, they gave time that made this track much stronger. Thank you PC!
Benjamin Klatt, inovex GmbH
Carl Worms, Credit Suisse AG
David Hovemeyer, York College of Pennsylvania
Davy Landman, CWI/Univ. Amsterdam
Elmar Jürgens, CQSE GmbH
Eric Bouwers, Software Improvement Group
Felienne Hermans, Delft University of Technology
Jacek Czerwonka, Microsoft
Jan Wloka, Quatico Solutions Inc.
Jens Knodel, Fraunhofer IESE
Jeroen van den Bos, Netherlands Forensic Institute
John Penix, Google
Joost Visser, Software Improvement Group
Magiel Bruntink, CWI/Univ. Amsterdam
Marc Rambert, Coverity
Marin Litoiu, York University
Marius Marin, Microsoft
Paul Anderson, GrammaTech Inc.
Ray Buse, Google
Tiago Alves, Microsoft
Tom Tourwe, Sirris
Vinay Augustine, ABB Corporate Research
Zachary Fry, GrammaTech Inc.
Recently I was asked to be the co-chair of the ICSME Industry Track. Of course, one of the responsibilities of the chairs is to choose the program committee, those who review papers and ultimately decide what's in and what's out. While there's some good advice on choosing PCs for academic tracks, there's much less written on choosing PCs for industrial tracks. So, below I'm sharing a couple of tips we learned as we chose an industrial PC.
Tip #1: Don't Invite Pure Academics
Don't get me wrong--I have a ton of respect for academics who are pushing the boundaries of knowledge--but their focus on novelty is not necessarily an asset during industry track reviews. Consider the following review, which my group received for an ICSE Industry Track (SEIP) submission:
The paper presents a tool (and process) called Prodet, to assists developer in navigating the code, in MSFT Visual studio. They validate their assumptions about the benefit of Prodet through an experiment. [...] **I just find it not very exciting, sorry.**
Obviously, this is an unusually kind review (thanks, reviewer!), but let me bring your attention to the bolded ending sentence: "I just find it not very exciting, sorry." This is exactly the reaction that I expect, and regularly get, when presenting work to academics. I don't fault them for it. If your focus is on pushing boundaries, with novelty as the primary metric, industrial work can seem quite boring. However, for industrialists this type of work is quite exciting. Taking an idea in its infancy, finding all the ways that it fails miserably when applied at scale, innovating around these issues, and convincing busy developers that the resulting tool will save them time is what we live for. Put simply, it's disheartening to receive reviews from those who don't value that work.
Fortunately, there's a simple fix for this issue: invite those from industry. As you'll see from a quick glance at our PC, this is what we've tried to do. We have people like:
Tip #2: Invite Some Industry-Leaning Academics
While I've just told you to avoid inviting pure academics, it's actually important to invite some academics. Why? When an industrialist is knee-deep in their latest tool-building effort, it's easy to miss a few relevant papers from the latest conferences, and so academics are sometimes more up-to-date on advances in the field. Thus, a few academics on the PC can ensure an appropriate framing of the work and help inject new ideas into the conversation.
However, when inviting academics, I'd still recommend a certain type: those with an industrial leaning or background. For instance, we have invited people like:
Spread the Word: ICSME Industry Track
Hopefully these tips help others as they go about choosing an industry track PC. More immediately, I hope any industrial researchers or practitioners reading this are encouraged by these improvements, as I think they are part of a larger trend. ICSE's SEIP continues to get better (22.5% acceptance rate this year), ICSME's Industry Track continues to grow and become more competitive, and more attention is being paid to the engineering side of software engineering than perhaps ever before. I think that those living on the boundary between research and practice can expect an exciting future.
Submit to the ICSME Industry Track. Abstracts due Jun 19th.
Our recent work on the Sando Code Search extension, a tool that leverages Lucene to search code, has been focused on making it more scalable and robust. To demonstrate our progress, I'll provide demos of both Sando and FindInFiles (a grep-like feature in Visual Studio) searching the entire Linux kernel. As you'll see, there's a fundamental difference between Lucene-based search tools and regular-expression-based search tools.
Before we begin, let's first briefly examine the Linux source tree. At the time of our demo it contained 47,528 files which occupied 1.71 GB on disk. Most of these files were C code, yet there was also a fair amount of documentation and configuration files. Sando and FindInFiles both search all text files.
Searching the Linux Source Tree with FindInFiles
To use FindInFiles I configured it to search the directory containing the Linux code, entered my search, and selected Find All. In this running example the user is searching for encryption algorithms, specifically those related to AES, and thus uses the regular expression query "encrypt*aes". Executing this search caused FindInFiles to run its regular expression matching algorithm against every line of every file in that directory, recursively. As you can see in "Starting the Search", this utilized about 50% of the CPU on an eight-core machine for a considerable amount of time.
After about one minute and forty seconds the search completed, having searched 47,407 files. Unfortunately, no lines matched this particular search (see "Finishing the Search"). As often happens with a regular-expression-based search, the word ordering in the query did not match the word ordering in the code. In this situation the user would likely have to run another search with reordered search terms (e.g., "aes*encrypt") to find relevant code.
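To make the ordering problem concrete, here's a minimal Java sketch (the kernel-style line is illustrative rather than quoted from Linux, and I use `.*` as the wildcard): a line-oriented regex search only matches when the query's terms appear in the same order on a single line.

```java
import java.util.regex.Pattern;

public class RegexOrderingDemo {
    public static void main(String[] args) {
        // An illustrative kernel-style line (not an exact quote from Linux).
        String line = "static void aes_encrypt(struct crypto_tfm *tfm)";

        // A line-oriented search such as FindInFiles only matches when the
        // pattern's terms appear in this exact order on one line.
        System.out.println(Pattern.compile("encrypt.*aes").matcher(line).find()); // false
        System.out.println(Pattern.compile("aes.*encrypt").matcher(line).find()); // true
    }
}
```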
Searching the Linux Source Tree with Sando
Next we searched the same Linux source tree using Sando. Unlike FindInFiles, which is based on regular expression matching, Sando is built upon information retrieval technology (think Google). It leverages Lucene.NET to pre-index source code and provide ranked results almost instantly. Typing in the same query as before, minus the regular expression syntax (i.e., "encrypt aes"), returns results almost instantly, as you can see below. Just as importantly, the most relevant results are returned first, with less relevant results toward the bottom. Additionally, in Sando's UI, selecting a result in the list provides a preview of the program element with matching terms in bold.
Of course, there is a cost to pre-indexing. For the Linux source tree that cost is about 50 minutes of low-CPU background processing. Fortunately, this happens only once, after which incremental updates and branch switches trigger at most a few seconds of indexing. Additionally, for most medium-sized projects initial indexing completes in a matter of seconds. For instance, Sando can index its own source code in less than ten seconds.
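For readers curious what this index-then-query pattern looks like in code, here's a minimal sketch using a recent version of Lucene's Java API (Sando itself uses Lucene.NET; this is my own illustration, not Sando's actual code, and the file path and contents are stand-ins):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneCodeSearchSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory(); // in-memory index for the demo

        // One-time indexing: each source file becomes a document in an
        // inverted index (term -> files containing it).
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("path", "crypto/aes_generic.c", Field.Store.YES));
            doc.add(new TextField("body", "aes encrypt decrypt block cipher setkey", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Querying: terms are looked up in the index, so word order in the
        // query is irrelevant and results come back ranked by relevance.
        Query query = new QueryParser("body", analyzer).parse("encrypt aes");
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}
```

Because query terms are matched against an inverted index rather than raw lines, "encrypt aes" and "aes encrypt" return the same ranked results, which is exactly the behavior the FindInFiles user was missing.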
Try It For Yourself: Online, in Eclipse, or in Visual Studio
I hope this quick post has piqued your interest in Lucene-based code search tools. If you'd like to try an advanced code search tool there are several options, both online (e.g., GitHub's search) and offline (e.g., Sando for VS and Instasearch for Eclipse). Follow the links below to try them out.
Interviewee: Jason Cohen
This coming Thursday I'll be hosting a webinar targeted at ABB's product managers and software developers that has just been opened to anyone who's interested. Join us! I'll be interviewing Jason Cohen, an entrepreneur who has started four companies with two successful exits. He is currently running his fourth company, WP Engine, which offers WordPress hosting. More relevant to software engineers, he's the founder of SmartBear Software, the company behind CodeCollaborator, a popular code review tool that many groups within ABB use today. Additionally, he's the author of the "A Smart Bear" blog, which has over 40,000 subscribers.
Interview Focus: Lessons Delivering Software
For many large companies (i.e., the enterprise) the last ten years or so have ushered in a new era, one that is increasingly software-focused. For these companies and their employees, it's important to continually gain a deeper understanding of how to build and sell software in order to deliver their core business. While there is some value in learning from large corporations like Microsoft or Apple, I find that the complexity of their stories obfuscates any lessons we might learn. In contrast, Jason's story, especially his SmartBear story, offers unusually clear insight into what it takes to build and sell software. He was laser-focused on creating a company that did one thing well. He found that thing, made thousands of customers happy, and made his company profitable. The issues that he encountered while building a software business, from keeping customers happy with under-performing software to developing a sane pricing model, are what we'll discuss and learn from on Thursday.
This webinar is free, but there's a limited number of spots, so sign up:
Time: 11am EST on Thursday, May 15
Signup Link: https://sdip-abb.webex.com/sdip-abb-en/onstage/g.php?d=950787858&t=a
Update: This interview has been conducted and is now available via the player at the top of this post and direct download below. Enjoy!
This spring I'll be presenting at several code camps as well as online. Please see "Upcoming Talks" below for details. While I've blogged about code search many times in the past, this post contains new developer data, links to non-Sando search tools, and my latest thoughts on code search.
How many times do you Google per day? I'd estimate 20+ times for myself, but across all Internet users the average is much lower, at 2.46 Googles per day (5,922,000,000 Googles per day/2,405,518,375 Internet users). It's amazing how low this average is!
Now, for you developers out there: how many times do you search your code per day, perhaps looking for a feature's implementation while fixing a bug... once or twice? Nope. We collected anonymous activity data from 60 developers, for a total of 1,750 developer-days, and saw that developers interact with known search tools in the Visual Studio IDE 34+ times per day, on average! Extrapolating from the Google example, I wonder if some developers are searching a hundred times per day on occasion...
Unfortunately, while developers are searching, empirical studies show they aren't finding. One study of developers performing maintenance tasks showed that 88% of developers' searches fail! To me, this is no surprise. Current search tooling inside modern IDEs is regex-based and bound to fail for developers with less-than-perfect memories. Why don't developers upgrade from search tools that fail so often, especially ones they interact with 34+ times per day?
The answer is no secret: until recently, there were no better search options. But now there are, and it is time for developers to remove constant failure from their day. In an upcoming set of talks, I'll detail why Lucene-based search tools work better for most developers. I'll talk about the different types of searches (e.g., symbol lookups vs exploratory searches), when to use built-in tooling (e.g., NavigateTo), and when to leverage advanced techniques (e.g., Sando or Entrian Source Search). Finally, I'll even provide advice on when uber-scalable server-based solutions (e.g., OpenGrok) are more appropriate.
This past week I attended CSMR-WCRE (soon to be renamed SANER), where I enjoyed Andi Marcus's acceptance talk for the most influential paper from 2004. As I listened to his talk, I realized that his work on applying IR techniques to more effectively search source code was originally presented at the second conference I ever attended. It's funny how this work is still influencing my work with Sando today, yet at the same time a bit sad how far code search still has to go to gain mainstream acceptance.
Fortunately, while attending CSMR-WCRE and enjoying some great code search talks from both academia and industry, it hit me: code search is extremely well-positioned to make a real impact in the next few years. Here's why:
Researchers are focusing on human aspects
In the early days of code search research, like in most fields, there was an appropriate but strong focus on better algorithms, higher precision and recall, and automated evaluations. Yet, because of the great progress on these fronts, researchers like Sonia Haiduc, Emily Hill, and myself are starting to focus on the human aspects of code search: query reformulation, query recommendation, and search-appropriate UIs. While human aspects may seem soft to hardcore algorithmic researchers, consider this: a UI/UX overhaul for Sando in 2013 had a lot to do with its 10x increase in downloads over 2012. I believe progress in these long-neglected research areas may push code search over the top.
Researchers have the infrastructure to make rapid progress
One of the long-known problems with code search tools was the difficulty of evaluating new search algorithms and, even harder, comparing two competing search algorithms. As documented in Bogdan and Denys's 2012 survey, only 30 papers out of 89 included an evaluation that compared the new approach against any existing approach. Fast forward to 2014 and we now have at least two infrastructures for evaluating code search algorithms: one that allows us to quickly prototype new algorithms and test them in the lab, and one that allows us to further validate them via field data.
First, we have TraceLab, a framework for creating and carrying out search algorithm evaluations in a reproducible manner. I've seen TraceLab used, for instance, to show the effect of including or excluding commonly used components, like splitters and stemmers, in code search algorithms. Thus, for researchers who develop any improvement on top of existing approaches, evaluating their work against the state of the art is as simple as implementing a TraceLab component.
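To make the splitter example concrete, here's a hedged sketch (my own illustration, not TraceLab code) of the identifier-splitting preprocessing that such evaluations toggle on or off; a real pipeline would typically follow it with a Porter-style stemmer:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class IdentifierSplitter {
    // Split a source-code identifier into natural-language terms, the usual
    // first stage of an IR-based code search pipeline. A stemmer (e.g.,
    // Porter) would typically normalize each term afterwards.
    public static List<String> split(String identifier) {
        String spaced = identifier
            .replace('_', ' ')                            // snake_case
            .replaceAll("([a-z0-9])([A-Z])", "$1 $2")     // camelCase
            .replaceAll("([A-Z]+)([A-Z][a-z])", "$1 $2"); // acronym boundaries, e.g., AESBlock
        return Arrays.asList(spaced.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(split("encryptAESBlock")); // [encrypt, aes, block]
        System.out.println(split("aes_encrypt"));     // [aes, encrypt]
    }
}
```

A TraceLab-style evaluation can then compare retrieval accuracy with this component switched on versus off, holding the rest of the pipeline constant.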
Second, once researchers determine a new top algorithm in the lab using TraceLab, we also have a framework for evaluating that new algorithm in the field: Sando. Assuming the algorithm can be plugged into Sando, we've recently shown that anonymous activity stream metrics correlate with user satisfaction, and thus by watching these metrics we can determine whether the new algorithm improves performance in the field too.
I believe that this infrastructure will dramatically reduce the time-to-market for new ideas coming out of research. Each new idea, quickly tested in TraceLab and validated in the field by Sando, can impact users in a matter of months instead of years.
Calling all students!
So, PhD students looking for a topic: you may have thought that code search research was quieting down. I would argue the opposite; the research done in the next five years will likely be what appears in practice. There are obvious needs in terms of better query recommendation/expansion algorithms, more software-specific synonym information, ideas for dealing with abbreviation-heavy domains, improved code search UI designs (e.g., lists vs. graphs), and even the performant use of structural information to improve code search results. The infrastructure is available, the low-hanging fruit is available... come join us for some of the most exciting work of the next five years!
David Shepherd leverages software engineering research to create useful additions to the IDE.