Commit c8da0d7

doc: add some blog to do
1 parent 5577145 commit c8da0d7

1 file changed

+94
-0
lines changed

website/blog/code-search-design-space.md

# Design Space for Code Search

Code search is a critical tool for developers, enabling them to find, understand, and reuse existing code.

ast-grep is, at its core, a code search tool: other features such as linting and rewriting are all derived from its basic code search functionality.

This post is a recap of an excellent survey paper, Code Search: A Survey of Techniques for Finding Code, https://www.lucadigrazia.com/papers/acmcsur2022.pdf

We will not cover every detail of the paper. Instead, we focus on the design space of a code search tool, which spans three factors:

1. Query design
2. Indexing
3. Retrieval

## Query Design

The starting point of every search is a query. We define a query as an explicit expression of the intent of the user of a code search engine. This intent can be expressed in various ways, and different code search engines support different kinds of queries.

The designers of a code search engine typically aim at several goals when deciding what kinds of queries to support:

* Ease. A query should be easy to formulate, enabling users to use the code search engine without extensive training. If formulating an effective query is too difficult, users may get discouraged from using the code search engine.
* Expressiveness. Users should be able to formulate whatever intent they have when searching for code. If a user is unable to express a particular intent, the search engine cannot find the desired results.
* Precision. The queries should allow specifying the intent as unambiguously as possible. If the queries are imprecise, the search is likely to yield irrelevant results.

These three goals are often at odds with each other. For example, free-form keyword queries are easy to write but imprecise, while a structured query language is precise but takes more effort to learn.
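
To make the trade-off concrete, here is a minimal Python sketch that runs three kinds of queries for "calls to `print`" over the same snippet. The function names (`substring_query`, `regex_query`, `call_query`) and the sample source are made up for illustration; this is not how ast-grep or any tool from the survey is implemented.

```python
import ast
import re

# Hypothetical sample code to search through.
SOURCE = '''
def report(value):
    # print the value for debugging
    note = "use print(value) here"
    print(value)
    fingerprint(note)
'''


def substring_query(source, text):
    """Easiest to write, least precise: matches any line containing the text."""
    return [i for i, line in enumerate(source.splitlines(), 1) if text in line]


def regex_query(source, pattern):
    """More expressive, but still unaware of comments and string literals."""
    rx = re.compile(pattern)
    return [i for i, line in enumerate(source.splitlines(), 1) if rx.search(line)]


def call_query(source, func_name):
    """Structural query: only real call expressions to func_name match."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == func_name
    )


print(substring_query(SOURCE, "print"))   # [3, 4, 5, 6]
print(regex_query(SOURCE, r"\bprint\("))  # [4, 5]
print(call_query(SOURCE, "print"))        # [5]
```

The plain substring is the easiest query to formulate but also hits the comment, the string literal, and `fingerprint`. The structural query returns only the real call, at the cost of requiring the user to think in terms of syntax nodes.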

## Preprocessing and Expansion of Queries

The query provided by a user may not be the best possible query to obtain the results the user expects. One reason is that natural language queries suffer from the inherent imprecision of natural language. Another reason is that the vocabulary used in a query may not match the vocabulary used in a potential search result. For example, a query about “container” is syntactically different from “collection”, but both refer to similar concepts. Finally, a user may initially be unsure what exactly she wants to find, which can cause the initial query to be incomplete.

To address the limitations of user-provided queries, approaches for preprocessing and expanding queries have been developed. The paper discusses these approaches along three dimensions: (i) the user interface, i.e., if and how a user gets involved in modifying queries, (ii) the information used to modify queries, i.e., what additional source of knowledge an approach consults, and (iii) the actual technique used to modify queries. Table 1 of the paper summarizes different approaches along these three dimensions, which the paper then discusses in detail.
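
As a rough illustration of the vocabulary-mismatch problem, the following Python sketch expands a query with synonyms before it is sent to a search engine. The `SYNONYMS` table and the `expand_query` function are hypothetical. Approaches in the survey draw on richer knowledge sources than a hand-written dictionary.

```python
# Hypothetical synonym table; real systems derive such knowledge from
# other sources rather than a hand-written dictionary.
SYNONYMS = {
    "container": {"collection"},
    "remove": {"delete", "erase"},
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms plus their known synonyms, without duplicates.

    Original terms keep their order; synonyms are inserted right after the
    term they expand, in alphabetical order.
    """
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *sorted(synonyms.get(term, ()))]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("remove from container"))
# ['remove', 'delete', 'erase', 'from', 'container', 'collection']
```

With the expanded term list, a query about "container" can also match code that only mentions "collection".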

## Indexing or Training, Followed by Retrieval of Code

Perhaps the most important component of a code search engine is the retrieval of code examples relevant to a given query. The vast majority of approaches follow a two-step process inspired by general information retrieval. First, they either index the data to search through, e.g., by representing features of code examples in a numerical vector, or train a model that learns representations of the data to search through. Then, they retrieve relevant data items based on the pre-computed index or the trained model. To simplify the presentation, we refer to the first phase as “indexing” and mean both indexing in the sense of information retrieval and training a model on the data to search through.

The primary goal of indexing and retrieval is effectiveness, i.e., the ability to find the “right” code examples for a query. To effectively identify these code examples, various ways of representing code and queries to compare them with each other have been proposed. A secondary goal, which is often at odds with achieving effectiveness, is efficiency. As users typically expect code search engines to respond within seconds [108], building an index that is fast to query is crucial. Moreover, as the code corpora to search through are continuously increasing in size, the scalability of both indexing and retrieval is important as well [4].

The paper surveys the many different approaches to indexing, training, and retrieval in code search engines along four dimensions, as illustrated in its Figure 4. Section 4.1 discusses what kind of artifacts a search engine indexes. Section 4.2 describes different ways of representing the extracted information. Section 4.3 presents techniques for comparing queries and code examples with each other. Table 2 summarizes the approaches along these first three dimensions. Finally, Section 4.4 discusses different levels of granularity of the source code retrieved by search engines.
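
The two-phase process described in this section (index first, retrieve later) can be sketched in a few lines of Python. This is a toy inverted index over identifier tokens, written under the assumption that snippets are small strings held in memory. The names `tokenize`, `build_index`, `retrieve`, and `CORPUS` are invented for the example and do not come from the paper.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into a set of lower-cased identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower()))


def build_index(corpus):
    """Phase 1 (indexing): map every token to the snippets that contain it."""
    index = defaultdict(set)
    for doc_id, code in corpus.items():
        for token in tokenize(code):
            index[token].add(doc_id)
    return index


def retrieve(index, query):
    """Phase 2 (retrieval): rank snippets by the number of shared tokens."""
    scores = defaultdict(int)
    for token in tokenize(query):
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by snippet name for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical corpus of code snippets keyed by name.
CORPUS = {
    "read_file": "def read_file(path): return open(path).read()",
    "write_file": "def write_file(path, data): open(path, 'w').write(data)",
    "add": "def add(a, b): return a + b",
}
INDEX = build_index(CORPUS)

print(retrieve(INDEX, "open path read"))  # ['read_file', 'write_file']
```

Building the index costs one pass over the corpus, after which each query only touches the posting sets of its own tokens. This is the efficiency argument for indexing up front.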
## Representing the Information for Indexing
* Individual Code Elements: Representing code as sets of individual elements, such as tokens or function calls, without considering their order or relationships.
* Sequences of Code Elements: Preserving the order of code elements by extracting sequences from Abstract Syntax Trees (ASTs) or control flow graphs.
* Relations between Code Elements: Extracting and representing relationships between code elements, such as parent-child relationships, method calls, and data flow.

ast-grep indexes individual code elements.
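
The three representations listed above can be illustrated with Python's standard `ast` module. This is only a sketch of the general idea on a one-line example. The helper names are made up, and it does not reflect how ast-grep, which is built on a different parser, stores its data.

```python
import ast

SOURCE = "total = add(parse(x), 1)"
TREE = ast.parse(SOURCE)


def individual_elements(tree):
    """Unordered set of called function names (order and nesting are lost)."""
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }


def element_sequence(tree):
    """Node type names in pre-order, so the order of elements is preserved."""
    sequence = []

    def visit(node):
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


def parent_child_relations(tree):
    """Set of (parent type, child type) pairs found in the tree."""
    return {
        (type(parent).__name__, type(child).__name__)
        for parent in ast.walk(tree)
        for child in ast.iter_child_nodes(parent)
    }


print(sorted(individual_elements(TREE)))            # ['add', 'parse']
print(element_sequence(TREE)[:2])                   # ['Module', 'Assign']
print(("Call", "Call") in parent_child_relations(TREE))  # True
```

Only the relational view records that `parse(...)` is an argument of `add(...)`. The set of individual elements cannot distinguish this code from two unrelated calls.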
## Representing the Information for Retrieval

Techniques to compare queries and code:

* Feature Vectors: Code and queries are represented as algorithmically extracted numerical feature vectors, which are compared using standard distance measures such as cosine similarity or Euclidean distance.
* Machine Learning-Based Techniques: End-to-end neural models embed both queries and code into a joint vector space, allowing for efficient retrieval based on learned representations.
* Database-Based Techniques: General-purpose databases, such as NoSQL or relational databases, store and retrieve code examples based on precise matches to the query.
* Graph-Based Matching: Code and queries are represented as graphs, and graph similarity scores or rewrite rules are used to match them.
* Solver-Based Matching: SMT solvers are used to match queries against code examples by solving constraints that describe input-output relationships.
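
As an example of the first technique, the sketch below builds bag-of-words feature vectors and ranks snippets by cosine similarity to the query. The tokenizer and the helper names (`to_vector`, `cosine_similarity`, `rank`) are simplifying assumptions for illustration. Practical systems typically use weighted features such as TF-IDF.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Bag-of-words feature vector: lower-cased word -> occurrence count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, snippets):
    """Return the snippets with non-zero similarity, most similar first."""
    query_vector = to_vector(query)
    scored = [(cosine_similarity(query_vector, to_vector(s)), s) for s in snippets]
    scored.sort(key=lambda pair: -pair[0])
    return [snippet for score, snippet in scored if score > 0]


# Hypothetical snippets to search through.
SNIPPETS = [
    "def read_json(path): return json.load(open(path))",
    "def sort_list(items): return sorted(items)",
]

print(rank("load json file", SNIPPETS))
# ['def read_json(path): return json.load(open(path))']
```

Because the comparison is purely lexical, the query "load json file" matches the first snippet through the shared words `load` and `json`, while a query phrased with different vocabulary would score zero.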

* Code Search: A Survey of Techniques for Finding Code, https://www.lucadigrazia.com/papers/acmcsur2022.pdf
* Aroma: Code Recommendation via Structural Search, https://arxiv.org/pdf/1812.01158
* Deep Code Search, https://guxd.github.io/papers/deepcs.pdf
