Search and sort
In this introduction we study the most elementary algorithms for searching and sorting. They are less efficient, but easier to understand; the faster ones rely on more advanced techniques that are beyond the objectives of this course. There is some probability theory behind such problems, but we skip it for now.
The usual way to teach these algorithms is by showing a sequence of values and performing a search or sort operation by hand. After that, we translate every step taken during the manual search or sort into a sequence of the algorithm's operations. At the end we can do a simplified analysis of the algorithm's performance: nothing that requires deep mathematical knowledge, just a quick evaluation of the worst case (greatest number of operations), the best case (least number of operations) and an estimate of the average case.
In the performance analysis we make a proportional estimate. That is, the speed at which the algorithm executes doesn't matter, because that depends on the processor; what really matters is the number of operations performed.
Knowing logarithms and the principle of mathematical induction is required to follow the performance proofs of the search and sort algorithms.
- Sequential search
/* Receives an array, the array's number of elements and some value x.
   If x is found in the array, returns 1;
   else, returns -1. */
int sequential_search(int array[], int n, int x) {
    int i;
    for (i = 0; i < n; i++)
        if (array[i] == x) return 1;
    return -1;
}
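To try it out, here is a minimal usage sketch (the main below is ours, added for illustration, not part of the original material):

#include <stdio.h>

int sequential_search(int array[], int n, int x);   /* the function above */

int main(void) {
    int a[] = {1, 4, 6, 8, 9, 11, 39, 56, 78};
    int n = sizeof a / sizeof a[0];                  /* number of elements */
    printf("%d\n", sequential_search(a, n, 39));     /* prints 1: found */
    printf("%d\n", sequential_search(a, n, 5));      /* prints -1: not found */
    return 0;
}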
We have a set of values and we want to know whether x is in that set or not. Suppose the set is unsorted: the elements are randomly distributed. In that case there is only one way to know whether the searched element is there, which is to compare it against the whole set, one element at a time. This is the worst case for a search.
Best case: one operation. The element is the first and is found on the first attempt.
Worst case: n operations, as many as the number of elements in the set. The searched element is the last or isn't contained in the set.
Average: the element is equally likely to be at any of the n positions, so the average number of comparisons is [math]\displaystyle{ (1 + 2 + \cdots + n)/n = \frac{n(n+1)/2}{n} = \frac{n+1}{2} }[/math]. The identity [math]\displaystyle{ 1 + 2 + \cdots + n = n(n+1)/2 }[/math] is proved by induction.
- Binary search (the sequence must already be sorted!)
/* Receives an array, the number of elements and a value x to be found.
   If found, returns the element's index;
   else, returns -1. */
int binary_search(int a[], int n, int x) {
    int start = 0, end = n - 1, middle;
    while (start <= end) {
        middle = (start + end) / 2;
        if (a[middle] == x) return middle;
        if (a[middle] > x) end = middle - 1;
        else start = middle + 1;
    }
    return -1;
}
The decision "if the middle element is greater, go to the left" can be inverted to "if the middle element is smaller, go to the right"; it makes no difference. The algorithm doesn't count the number of comparisons made, but it's easy to add that. However, since the return value is already taken by the search result, we'd have to use a pointer parameter to return the number of comparisons alongside the result of the search itself, as in the sketch below.
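A sketch of that idea, keeping the element's index as the return value and adding a pointer out-parameter for the count (the name binary_search_count and the extra parameter are ours):

/* Same binary search, but also reports how many elements were compared
   against x, writing the count through the pointer parameter. */
int binary_search_count(int a[], int n, int x, int *comparisons) {
    int start = 0, end = n - 1, middle;
    *comparisons = 0;
    while (start <= end) {
        middle = (start + end) / 2;
        (*comparisons)++;                 /* one more element examined */
        if (a[middle] == x) return middle;
        if (a[middle] > x) end = middle - 1;
        else start = middle + 1;
    }
    return -1;
}

A caller would write int c; int pos = binary_search_count(a, n, 39, &c); and read the number of comparisons from c afterwards.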
Let's search the number 39 in an array of 9 elements:
First step: add the first and last index and halve: middle = (0 + 8) / 2 = 4, which points at the value 9.

index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
value | 1 | 4 | 6 | 8 | 9 | 11 | 39 | 56 | 78
Second step: 39 > 9, so the number cannot be to the left of 9; continue to the right, discarding everything up to and including the element from the previous comparison, the number 9. Now middle = (5 + 8) / 2 = 6, which points at the value 39: found.

index | 5 | 6 | 7 | 8
value | 11 | 39 | 56 | 78
Just repeat this process for any searched value. If the quotient isn't an integer, don't round; truncate, that is, keep the integer part and discard the fractional part (exactly what C's integer division does). Unless the element is found earlier, the search narrows until start = end, when middle = (start + end) / 2 points at a single remaining element; this happens in two cases: the element is the last one left to check, or it is nowhere to be found and the next step ends the loop.
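To watch the machine take the same steps, we can print start, middle and end at each iteration; a small tracing sketch for the example above (the trace loop is ours):

#include <stdio.h>

int main(void) {
    int a[] = {1, 4, 6, 8, 9, 11, 39, 56, 78};
    int start = 0, end = 8, middle, x = 39;
    while (start <= end) {
        middle = (start + end) / 2;   /* integer division truncates */
        printf("start=%d middle=%d end=%d a[middle]=%d\n",
               start, middle, end, a[middle]);
        if (a[middle] == x) break;    /* found */
        if (a[middle] > x) end = middle - 1;
        else start = middle + 1;
    }
    return 0;
}

The output shows exactly the two steps of the worked example: first middle = 4 with a[middle] = 9, then middle = 6 with a[middle] = 39.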
Worst case: [math]\displaystyle{ 1 + \log_2{n} }[/math] operations. The searched element is the last one examined or is nowhere to be found. Let's do a quick deduction (do it on paper!): begin with any number of elements and double the quantity; the number of comparisons increases by just one. Try a larger increase: multiply the number of elements by four, that is, by [math]\displaystyle{ 2^2 }[/math]; the number of comparisons increases by just two (the increment equals the power's exponent). Notice that while the number of elements grows geometrically, the number of comparisons grows linearly, much more slowly.
Let's repeat the reasoning, but this time "reversed": at each step of the binary search, half of the sequence is discarded, which yields a geometric progression [math]\displaystyle{ n/2 }[/math], [math]\displaystyle{ n/4 }[/math], ..., [math]\displaystyle{ n/2^k }[/math]. The search's limit is 1, because when the table is reduced to 1 element the halving steps cease. We then have that [math]\displaystyle{ n/2^k }[/math] is at most 1, or [math]\displaystyle{ n/2^k \leq 1 }[/math]. Applying some algebra: [math]\displaystyle{ n \leq 2^k }[/math]. Applying the definition of logarithm we reach [math]\displaystyle{ \log_2{n} \leq k }[/math], where n is the number of elements and k the number of comparison operations.
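The bound can also be checked numerically: counting how many halvings it takes to reduce n elements to a single one gives k, and k stays close to [math]\displaystyle{ \log_2{n} }[/math]. A quick sketch of ours, not part of the original text:

#include <stdio.h>

int main(void) {
    int n = 1000, k = 0;
    /* Halve the table until a single element remains,
       just as binary search does in the worst case. */
    while (n > 1) {
        n /= 2;
        k++;
    }
    printf("k = %d\n", k);   /* prints k = 9; log2(1000) is about 9.97 */
    return 0;
}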
An analogy: a dictionary is a sorted sequence of words. If we searched for a word using sequential search, we'd be forced to browse the dictionary page by page until the word was found. With binary search we can open the dictionary directly at the letter the word begins with. From there we reduce the search boundaries to that letter, discarding the rest of the dictionary. Applying the same criterion to the word's second letter, we reduce the search boundary further, discarding many more pages. In numerical terms, the algorithm behaves as a geometric progression with ratio 1/2. It's not possible to increase the ratio of division to 1/3 or beyond, because each comparison gives us only two ways to go: up or down, left or right.