[Top] [Prev] [Next] [Index] [TOC]

Chapter 15

Find: A Tool for Transitive Pattern Recognition

Find is a static analysis tool in the Toolsuite. Find supports the identification of date-sensitive objects through a simple transitive relation. Date-sensitive patterns are initially specified which are common across the application. The atoms (groups of characters, surrounded by white space) in the code that match each pattern are then either accepted or rejected. Acceptance or rejection can be selected in either local file or global scope. Once the initial atoms are selected, the transitive relation is invoked, which identifies more candidates for date-sensitive objects. By selectively pruning objects that are not date-sensitive, closure on the date-sensitive objects in the application can be reached. Output from Find is then exported to the remaining tools to specify interesting regions for testing. Because the transitive relation is a simple heuristic, namely objects on the same line, Find is language independent, and well-adapted to pointer-based languages like C or C++. Other languages, like Perl or Tcl are also excellent candidates for analysis through Find.


15.1 Background

Find is a tool within the Toolsuite which performs transitive pattern recognition. Its intended use is to assist in identifying pieces of code that are related to one another in a thematic way. The user begins with seed files containing standard and/or customized templates (regular expressions) to identify components with the designated patterns. Unlike the rest of the Toolsuite which uses dynamic traces to analyze code, Find performs a simple static analysis of the code. Because of its simple heuristics, Find is language independent. It does simple lexical analysis rather than a full parse to determine relationships among the elements of a language.

One of Find's principle applications is to analyze and delineate date-sensitive code. The simplest form of a date pattern might be mmddyy (two numbers each for month, day and year such as 032697). A default seed file is provided for this application, which includes the most frequent formats for encoding dates.

The tool applies the patterns to atoms (groups of characters, surrounded by white space) in the code, which are all words except those included in a stop list. The stop list is user definable and typically includes keywords of the language being analyzed. As the atoms satisfy the patterns, they are highlighted in red as candidates.

The candidates are inspected for relevance to the feature code being delineated. Based upon this determination, the atoms matching the patterns in the seed file are accepted or rejected in either file or global scope. Upon completion of this task, the heuristic that generates further candidates is applied. As currently implemented, new candidates are those atoms on the same line but not excluded by the stop list.

Iteration through inspection of candidates, determination of their relevance, and generation of new candidates is performed until no further relevant new candidates are generated. At this point the process ceases.

Find requires close user-interaction with and intuition about the program being analyzed. Without this judgement, static analysis of the type outlined above does not converge. With appropriate pruning of the dependencies among atoms in the code, static analysis becomes a useful tool for isolating feature code in large programs.

15.2 A Tutorial

The use of Find is most easily understood by an example. In this chapter we use the same wordcount program as used in previous chapters to illustrate the basic features of Find. To copy these files, create a new directory, cd into it, and copy the contents of the directory in which the tutorial files are installed into the new directory. This tutorial will not require that any programs be compiled, so if preferred, files from a previous tutorial may be used. For the illustrations in this chapter, we will use two files: main.c and wc.c.

The wordcount program has the ability to count either characters, words or lines depending on the command line options. Each of these abilities is deemed a feature. This tutorial will find and demark all code that implements the word counting feature. In another application the feature might just as easily be date sensitive. Start the tutorial by typing:

	prompt:> xfind main.c wc.c						
at the command-line prompt. The resulting display will appear as in Figure 15-1.

Figure 15-1 The initial Find window
The Find window is divided into two areas, a program display area in the upper region and four control boxes in the lower region. The rightmost control box initially has a generic label ``Derived Seeds''. This label changes to ``Patterned Seeds'' after a seed file is imported and to ``Transitive Seeds'' after the transitive relation is applied. The control boxes may be selected and deselected with the ``Edit'' menu ``Show Controls''strong> option. Deselecting the control boxes causes the program display area to fill the entire window.

Other entries in the ``Edit'' menu include:

As in Figure 15-1, Find displays the plain text of main.c, which may be scrolled with either the scroll bar to the left of the text or by using the PgUp, PgDn or arrow keys. Selecting wc.c in the ``Modules'' control box causes Find to display wc.c in the program area. The displayed module is denoted by an exclamation mark to its left. If the control boxes are deselected by using the ``Edit/Show Controls'' menu entry, then a ``Modules'' menu appears in the tool bar that enables the user to switch between modules. Before continuing the tutorial, reselect the control boxes and main.c in the ``Modules'' box. The resulting display will be the same as that in Figure 15-1.

The program may be seeded with initial patterns to be matched either from a seed file or interactively from the program display area. Selecting seeds interactively is accomplished by sweeping out an atom with the left mouse button and invoking either the ``All'' or ``File'' option in the ``Seed'' menu depending on whether the selected atom is in either global or file scope. Two seed files are included in the tutorial as shown in Figure 15-2. The file seeds.sd is used in this tutorial, whereas dates.sd is displayed to provide a more illustrative example of a seed file. Patterns can be changed or inserted one per line in the seed files, which must have the extension .sd. Patterns may be excluded beyond the default list by populating the excludeSeed region of the file. These patterns should be entered between the curly braces, one pattern per line.

===== File seeds.sd =====
set includeSeed {{seedList {
word
}}}
set excludeSeed {}
===== File dates.sd =====
set includeSeed {{seedList {
ti?me?
da?te?
epoch
ne?we?r
older
ye?a?r
mo?n?th
da?y
ho?u?r
minute
seco?n?d?
}}}
set excludeSeed {}
Figure 15-2 The word-sensitive (seeds.sd) and date-sensitive (dates.sd) seed files

Import the seeds.sd file from the ``File'' menu by selecting the ``Import'' entry which invokes a standard file selection dialog. Other entries in the ``File'' menu include:

Once Find reads the seed file and code modules, analysis begins. The Find window should look like Figure 15-3. Find highlights all the atoms that satisfy the regular expression represented by ``word''. Each of these is now a candidate for inclusion or exclusion into the code-feature set in either file or global scope. The white area covering each line that includes a candidate enhances its visibility in the scroll bar. Small candidates in large modules tend to be overlooked without this highlighting.

Figure 15-3 The updated Find window after seeds.sd is imported
Before proceeding with the analysis, take a few moments to become familiar with the operation of the ``Seed Files'' and ``Seed List'' controls at the bottom of the screen. The ``Update'', ``All'' and ``None'' buttons in each control box determine whether the seed file or individual regular-expression seeds are considered when generating candidates. The ``Update'' button functions as a toggle. Selecting the entry ``word'' in the ``Seed List'' control box and clicking the ``Update'' button removes the highlighting of all atoms that match the pattern ``word''. Reselecting ``word'' and clicking the ``Update'' button highlights all atoms that match the pattern ``word'' again. The ``All'' and ``None'' buttons have the expected effects. The ``Seed Files'' control box operates similarly. An exclamation mark to the left of the pattern or file indicates that the item is active.

Classify each of the candidates highlighted in red according to whether it is included or excluded in file or global scope. Atoms that are defined as static or automatic variables should be selected at file scope while those defined as global should be selected at global scope. Click the left mouse button on the red highlighted wordct at the beginning of main.c. The pop-up menu in Figure 15-4 appears. Move the mouse down to select ``Include File''. This will insert wordct into the ``Patterned Seeds'' control box. Other entries in the pop-up menu include:

Figure 15-4 The updated Find window as the user begins to classify candidates according to whether they are included or excluded in file or global scope
If a classification needs to be changed, remove entries from the ``Patterned Seeds'' control box by clicking the seed and selecting ``Edit/Delete''.

Click twordct and move the mouse down to select ``Include File''. Repeat this for doword. These atoms will appear in the ``Patterned Seeds'' control box and all other instances of the atom will change color to either green or yellow depending on whether they were included or excluded from further consideration. Atoms that the user does not wish to explicitly exclude may be left alone; they will not appear at the next level in the relation. For example, the candidates which match the ``word'' pattern in module wc.c should be ignored, as should all words within comments.

Now invoke the ``New Level'' entry in the ``Seed'' menu, which computes the next set of candidate atoms and updates the Find window. Scroll down until you see what is displayed in Figure 15-5.

Figure 15-5 The updated Find window after ``New Level'' is selected from
the ``Seed'' menu
Going to a new level causes several changes in the window. The title of the ``Patterned Seeds'' control box changes to ``Transitive Seeds'' indicating that atoms are no longer being selected through regular expressions but rather through the transitive relation of residing on the same line in the module. The entry in the ``Seed Files'' control box changes from `seeds.sd', the name of the seeds file, to ``seedGen_1'' indicating the level of the transitive relation between atoms. Most importantly a new set of candidate seeds is highlighted in red and the included seeds remain highlighted in green with red text, which indicates these atoms are included in a previous classification pass.

Other entries in the ``Seed'' menu include:

Go through the modules excluding the following atoms at file scope: linect, charct, tlinect, tcharct, dochar, doline, and file and excluding at global scope stdin, a global variable. These atoms are excluded to relieve clutter on the display at the next level. If they were not excluded they would remain as candidates. The atoms count and print should be included at global scope, thereby selecting them as candidate atoms in wc.c in the next level. Invoke ``Seed/New Level'' to go to the next level.

Select the wc.c module and observe that, ``count'' is a candidate atom. It is known from the previous level that the third argument of ``count'' is connected with the word counting feature, therefore ``p_nw'' should be included in file scope. Everything else on that line should be excluded at the file level. Select ``Seed/New Level'' to generate new candidates. This action leads to one new candidate atom, ``nw'', which should be included in file scope and ``Seed/New Level'' selected. At this point no new relevant candidates emerge. Ignore the candidates on the integer declaration line as there is no true dependency among these. This is because the declaration of a variable does not convey how it is going to be used and hence what it is related to. The analysis is now complete (Figure 15-6). Saving the state in a .xfd file, generating a report or creating a .dif file may be done at this time if desired.

Figure 15-6 The updated Find window after completion of the static analysis
Displaying each module shows the atoms included in the word counting feature for that module in the ``Seed List'' control box. Global atoms are preceded with an exclamation point, and seeds in file scope are preceded with a ``+'' sign.

Examination of Figure 15-6 demonstrates one of the limitations of static analysis. The integer variable ``state'' (e.g. in the statement ``state = OUT'') was not found by Find but is in fact an atom that contributes to counting words, as the code shows. To include the ``state'' variable from this statement (Figure 15-7) in the analysis, double click it with the left mouse button and select the ``Seed/File'' menu entry.

Figure 15-7 Include the variable state in the statement ``state = OUT'' into analysis
To quit Find, click on the ``File'' button in the top button bar, then select ``Exit''.



[Top] [Prev] [Next] [Index] [TOC]