08 - Lab Hash

In this lab you will be implementing functions on hash tables with two different collision resolution strategies—separate chaining and linear probing. These hash tables serve an implementation of the dictionary abstract data type. In the second part of the lab you will use the dictionary to solve some problems.

Note that there are a LOT of files in this lab—don't get too intimidated as you won't be modifying the vast majority of them.

Notes about `list` Iterators

When you are working with the Separate Chaining Hash Table, you will need to iterate over the linked list of a given bucket. Since the hash tables are templatized, however, this causes us a slight headache syntactically in C++. To define a list iterator on a given bucket, you will need to declare it as follows:

typename list< pair<K,V\> \>::iterator it \= table\[i\].begin();

Alternatively you can just use auto and have the compiler worry about the type:

auto it \= table\[i\].begin();

Also note that if you do not insert your data at the head of the bucket, you may end up with your results being in a different order than ours.

If you use the list::erase() function, be advised that if you erase the element pointed to by an iterator that the parameter iterator is no longer valid. For instance:

auto it \= table\[i\].begin();
table\[i\].erase(it);
it++;

is invalid because it is invalidated after the call to erase(). So, if you are looping with an iterator, remember a break statement after you call erase()!

Checking out the Code

You can download to code here: lab_has.zip. As usual, please upload the files onto the school's Linux servers and complete the lab there.

Note that there are a lot of files, but you only will need to modify four of them, and the spec guides you through which functions you have to implement. Don't panic!

For reference, here are the only files you need to modify (and submit for credit):

lphashtable.cpp
schashtable.cpp
word_counter.cpp
anagram_finder.cpp

Review the `.h` files

It goes without saying that .h files are a great way to understand how the code works. In particular, we are implementing hash tables with two different collision resolution methods—separate chaining, and linear probing. Both types of hash tables are subclasses of the HashTable class. Take a look at the hashtable.h file before proceeding.

`remove` And `resizeTable` on a Separate Chaining Hash Table

Open your schashtable.cpp file. In this file, the above two functions have not been implemented—your job is to implement them.

`remove`

Given a key, remove it from the hash table.
If the given key is not in the hash table, do nothing.

`resizeTable`

This is called when the load factor for our table is >= 0.7. The function shouldResize checks this and is already implemented for you.
It should resize the internal array for the hash table. Use the return value of findPrime with a parameter of double the current size to set the size. See other calls to resize for reference.
You can find the specs for findPrime and the declaration for 'size' in hashtable.h
You will want to look at schashtable.h to see how the SCHashTable is actually implemented.

`insert` On a Linear Probing Hash Table

Open your lphashtable.cpp. In this file, you will be implementing the insert function.

insert, given a key and a value, should insert the (key, value) pair into the hash table. Remember to check if the array needs to be resized prior to insertion.
Remember the collision handling strategy for linear probing! (To maintain compatibility with our outputs, you should probe by moving forwards through the internal array, not backwards).
Check the lphashtable.h file to make sure the appropriate fields have been updated. For example, the should_probe array flags whether we continue to probe forwards in the array if the key is not found at the current index.
You do not need to concern yourself with duplicate keys—simply treat the duplicate key as normal. However, when in client code and using our hash tables, the proper procedure for updating a key is to first remove the key, then re-insert the key with the new data value (which we will not do today).

You MUST handle collisions in your insert function, or your hash table will be broken!

Test Your Code Using `charcount`

The CharFreq class has already been implemented for you. Type the following into the terminal:

make charcount

This will make the charcount executable. This file takes two arguments: the first is the file to analyse, and the second is the frequency for which to return characters. For example, a run on SherlockHolmes.txt for characters with frequency greater or equal to 10000 might look like:

$ ./charcount SherlockHolmes.txt 10000 schash

Finding chars in SherlockHolmes.txt with frequency \>\= 10000 using SCHashTable...
54944	e
40511	t
36142	a
34869	o
31248	i
29731	n
29579	h
27965	s
25684	r
19100	d
17636	l
13636	u
12155	m
11534	w
11103	c

Note You should verify that your solution outputs the right values for both kinds of hash tables!! You can test each one by changing the third command line argument to "schash" or "lphash", respectively.

Implement the `WordFreq` Class

Now we will write our own application for a hash table! Open word_counter.cpp. You will be writing the getWords function. This function, given an integer threshold, should return a vector of (string, int) pairs. Each of these pairs is a string and its frequency in the file. Your vector should only contain those pairs whose frequencies are greater than or equal to threshold. Look at char_counter.cpp if you are stuck. You may find the TextFile class defined in textfile.h helpful.

Test Using the `wordcount` Executable

When you're done with the above, type the following into your terminal:

make wordcount

This will build the wordcount executable, which uses the WordFreq class to determine word frequencies in a file. It, like charcount, takes two arguments: the first is the path to the file to be analyzed, and the second is the threshold for which to grab words. If a word occurs >= threshold times, it will be printed to the console.

For example, a run of wordcount on metamorphoses.txt with a frequency 1000 might look like:

$ ./wordcount metamorphoses.txt 1000 lphash
Finding words in metamorphoses.txt with frequency \>\= 1000 using LPHashTable...
10473 the
5753 of
4739 and
3078 to
2246 a
1989 in
1706 was
1572 his
1548 that
1500 her
1456 with
1306 is
1241 he
1185 by

You should verify your solution works with both kinds of hash tables.

Implement the `AnagramFinder` Class

Finding anagrams is another clever use of hash tables. Open anagram_finder.cpp. You will be writing the checkWord function, which simply returns a boolean value. It should return true if the string test is an anagram of word, and false otherwise.

An anagram of a given word is a word which has the same letters. For example, "retinas" and "stainer" are anagrams of each other. Think carefully about what it means for a word to be an anagram of another. Can we use a hash table to keep track of something? Perhaps counting letter frequencies will help you?

Note: the proper procedure to update a value in our hash tables is to remove the key from the hash table, and re-insert the key with the new value.

Test Using the `anagramtest` Executable

When you're done with the above, type the following into your terminal:

make anagramtest

This will make the anagramtest executable, which you can use to test your AnagramFinder class. It can be run with or without arguments. Without arguments will test a very simplistic example—with arguments is more interesting.

anagramtest can also take two arguments: the first is the path to the file to be analyzed, and the second is the word to find anagrams for. For example, a run of anagramtest on words.txt looking for anagrams of retinas might look like:

$ ./anagramtest words.txt retinas schash
Checking file words.txt for anagrams of retinas using SCHashTable...
anestri is an anagram of retinas
asterin is an anagram of retinas
eranist is an anagram of retinas
nastier is an anagram of retinas
ratines is an anagram of retinas
resiant is an anagram of retinas
restain is an anagram of retinas
retains is an anagram of retinas
retinas is an anagram of retinas
retsina is an anagram of retinas
sainter is an anagram of retinas
stainer is an anagram of retinas
starnie is an anagram of retinas
stearin is an anagram of retinas

You should verify your solution works with both kinds of hash tables to verify their correctness as well.

Submission

anagram_finder.cpp

/**
 * @file anagram_finder.cpp
 * Implementation of the AnagramFinder class.
 *
 * @author Chase Geigle
 * @date Spring 2011
 * @date Summer 2012
 */

#include "anagram_finder.h"

using std::string;
using std::vector;
using std::ofstream;
using std::endl;

/**
 * Constructs an AnagramFinder based on a filename to read potential
 * anagrams from.
 *
 * @param ifilename The name of the file to read in.
 */
template <template <class K, class V> class Dict>
AnagramFinder<Dict>::AnagramFinder(const string& ifilename)
    : file(true), filename(ifilename)
{
    /* nothing */
}

/**
 * Constructs an AnagramFinder based on a set of strings instead of a
 * filename.
 *
 * @param istrings The set of strings to use for this finder.
 */
template <template <class K, class V> class Dict>
AnagramFinder<Dict>::AnagramFinder(const vector<string>& istrings)
    : file(false), strings(istrings)
{
    /* nothing */
}

/**
 * Determines if the given word is an anagram of the test word.
 *
 * @param word Word that is possibly an anagram.
 * @param test Word to check against.
 * @return A boolean value indicating whether word is an anagram of test.
 */
template <template <class K, class V> class Dict>
bool AnagramFinder<Dict>::checkWord(const string& word, const string& test)
{
    /**
     * @todo Implement this function! You should use the provided
     * templated hashtable class Dict.
     */

    if (word.length() != test.length()) return false;

    Dict<char, int> charCount(256);

    for (size_t i = 0; i < word.length(); i++) {
        charCount[word[i]]++;
    }

    for (size_t i = 0; i < test.length(); i++) {
        charCount[test[i]]--;
    }

    for (auto kv = charCount.begin(); kv != charCount.end(); ++kv) {
        if (kv->second != 0) return false;
    }

    return true;
}

/**
 * Retrieves a set of words that are anagrams of a given word.
 *
 * @param word The word we wish to find anagrams of inside the finder.
 */
template <template <class K, class V> class Dict>
vector<string> AnagramFinder<Dict>::getAnagrams(const string& word)
{
    // set up the return vector
    vector<string> ret;

    if (file) {
        TextFile infile(filename);
        string line;
        vector<string> tests;
        while (infile.good()) {
            string test = infile.getNextWord();
            if (checkWord(word, test))
                ret.push_back(test);
        }
    } else {
        for (size_t i = 0; i < strings.size(); i++) {
            if (checkWord(word, strings[i]))
                ret.push_back(strings[i]);
        }
    }
    return ret;
}

/**
 * Retrieves a set of anagrams in the finder of a given words, but writes
 * them out to a file instead of returning a vector.
 *
 * @param word The word we wish to find anagrams of inside the finder.
 * @param output_file The name of the file we want to write to.
 */
template <template <class K, class V> class Dict>
void AnagramFinder<Dict>::writeAnagrams(const string& word,
                                        const string& output_file)
{
    vector<string> anagrams = getAnagrams(word);
    ofstream outfile(output_file.c_str());
    if (outfile.is_open()) {
        for (size_t i = 0; i < anagrams.size(); i++)
            outfile << anagrams[i] << endl;
    }
    outfile.close();
}

lphashtable.cpp

/**
 * @file lphashtable.cpp
 * Implementation of the LPHashTable class.
 *
 * @author Chase Geigle
 * @date Spring 2011
 * @date Summer 2012
 */
#include "lphashtable.h"

using hashes::hash;
using std::pair;

template <class K, class V>
LPHashTable<K, V>::LPHashTable(size_t tsize)
{
    if (tsize <= 0)
        tsize = 17;
    size = findPrime(tsize);
    table = new pair<K, V>*[size];
    should_probe = new bool[size];
    for (size_t i = 0; i < size; i++) {
        table[i] = NULL;
        should_probe[i] = false;
    }
    elems = 0;
}

template <class K, class V>
LPHashTable<K, V>::~LPHashTable()
{
    for (size_t i = 0; i < size; i++)
        delete table[i];
    delete[] table;
    delete[] should_probe;
}

template <class K, class V>
LPHashTable<K, V> const& LPHashTable<K, V>::operator=(LPHashTable const& rhs)
{
    if (this != &rhs) {
        for (size_t i = 0; i < size; i++)
            delete table[i];
        delete[] table;
        delete[] should_probe;

        table = new pair<K, V>*[rhs.size];
        should_probe = new bool[rhs.size];
        for (size_t i = 0; i < rhs.size; i++) {
            should_probe[i] = rhs.should_probe[i];
            if (rhs.table[i] == NULL)
                table[i] = NULL;
            else
                table[i] = new pair<K, V>(*(rhs.table[i]));
        }
        size = rhs.size;
        elems = rhs.elems;
    }
    return *this;
}

template <class K, class V>
LPHashTable<K, V>::LPHashTable(LPHashTable<K, V> const& other)
{
    table = new pair<K, V>*[other.size];
    should_probe = new bool[other.size];
    for (size_t i = 0; i < other.size; i++) {
        should_probe[i] = other.should_probe[i];
        if (other.table[i] == NULL)
            table[i] = NULL;
        else
            table[i] = new pair<K, V>(*(other.table[i]));
    }
    size = other.size;
    elems = other.elems;
}

template <class K, class V>
void LPHashTable<K, V>::insert(K const& key, V const& value)
{
    /**
     * @todo Implement this function.
     *
     * @note Remember to resize the table when necessary (load factor >=
     *  0.7). **Do this check *after* increasing elems!!** Also, don't
     *  forget to mark the cell for probing with should_probe!
     */

    (void)key;   // prevent warnings... When you implement this function, remove this line.
    (void)value; // prevent warnings... When you implement this function, remove this line.

    ++elems;
    if (shouldResize()) resizeTable();

    size_t idx = hash(key, size);
    while (table[idx] != NULL) {
        idx = (idx + 1) % size;
    }

    table[idx] = new pair<K, V>(key, value);
    should_probe[idx] = true;
}

template <class K, class V>
void LPHashTable<K, V>::remove(K const& key)
{
    int idx = findIndex(key);
    if (idx != -1) {
        delete table[idx];
        table[idx] = NULL;
        --elems;
    }
}

template <class K, class V>
int LPHashTable<K, V>::findIndex(const K& key) const
{
    size_t idx = hash(key, size);
    size_t start = idx;
    while (should_probe[idx]) {
        if (table[idx] != NULL && table[idx]->first == key)
            return idx;
        idx = (idx + 1) % size;
        // if we've looped all the way around, the key has not been found
        if (idx == start)
            break;
    }
    return -1;
}

template <class K, class V>
V LPHashTable<K, V>::find(K const& key) const
{
    int idx = findIndex(key);
    if (idx != -1)
        return table[idx]->second;
    return V();
}

template <class K, class V>
V& LPHashTable<K, V>::operator[](K const& key)
{
    // First, attempt to find the key and return its value by reference
    int idx = findIndex(key);
    if (idx == -1) {
        // otherwise, insert the default value and return it
        insert(key, V());
        idx = findIndex(key);
    }
    return table[idx]->second;
}

template <class K, class V>
bool LPHashTable<K, V>::keyExists(K const& key) const
{
    return findIndex(key) != -1;
}

template <class K, class V>
void LPHashTable<K, V>::clear()
{
    for (size_t i = 0; i < size; i++)
        delete table[i];
    delete[] table;
    delete[] should_probe;
    table = new pair<K, V>*[17];
    should_probe = new bool[17];
    for (size_t i = 0; i < 17; i++)
        should_probe[i] = false;
    size = 17;
    elems = 0;
}

template <class K, class V>
void LPHashTable<K, V>::resizeTable()
{
    size_t newSize = findPrime(size * 2);
    pair<K, V> **temp = new pair<K, V> *[newSize];
    delete[] should_probe;
    should_probe = new bool[newSize];
    for (size_t i = 0; i < newSize; i++)
    {
        temp[i] = NULL;
        should_probe[i] = false;
    }

    for (size_t i = 0; i < size; i++)
    {
        if (table[i] != NULL)
        {
            size_t idx = hash(table[i]->first, newSize);
            while (temp[idx] != NULL)
                idx = (idx + 1) % newSize;
            temp[idx] = table[i];
            should_probe[idx] = true;
        }
    }

    delete[] table;
    // don't delete elements since we just moved their pointers around
    table = temp;
    size = newSize;
}

schashtable.cpp

/**
 * @file schashtable.cpp
 * Implementation of the SCHashTable class.
 *
 * @author Chase Geigle
 * @date Spring 2011
 * @date Summer 2012
 */

#include "schashtable.h"

using hashes::hash;
using std::list;
using std::pair;

template <class K, class V>
SCHashTable<K, V>::SCHashTable(size_t tsize)
{
    if (tsize <= 0)
        tsize = 17;
    size = findPrime(tsize);
    table = new list<pair<K, V>>[size];
    elems = 0;
}

template <class K, class V>
SCHashTable<K, V>::~SCHashTable()
{
    delete[] table;
}

template <class K, class V>
SCHashTable<K, V> const& SCHashTable<K, V>::
operator=(SCHashTable<K, V> const& rhs)
{
    if (this != &rhs) {
        delete[] table;
        table = new list<pair<K, V>>[rhs.size];
        for (size_t i = 0; i < rhs.size; i++)
            table[i] = rhs.table[i];
        size = rhs.size;
        elems = rhs.elems;
    }
    return *this;
}

template <class K, class V>
SCHashTable<K, V>::SCHashTable(SCHashTable<K, V> const& other)
{
    table = new list<pair<K, V>>[other.size];
    for (size_t i = 0; i < other.size; i++)
        table[i] = other.table[i];
    size = other.size;
    elems = other.elems;
}

template <class K, class V>
void SCHashTable<K, V>::insert(K const& key, V const& value)
{
    ++elems;
    if (shouldResize())
        resizeTable();
    pair<K, V> p(key, value);
    size_t idx = hash(key, size);
    table[idx].push_front(p);
}

template <class K, class V>
void SCHashTable<K, V>::remove(K const& key)
{
    typename list<pair<K, V>>::iterator it;
    /**
     * @todo Implement this function.
     *
     * Please read the note in the lab spec about list iterators and the
     * erase() function on std::list!
     */
    
    size_t idx = hash(key, size);
    for (it = table[idx].begin(); it != table[idx].end(); it++) {
        if (it->first != key) continue;

        table[idx].erase(it);
        --elems;
        return;
    }
}

template <class K, class V>
V SCHashTable<K, V>::find(K const& key) const
{
    size_t idx = hash(key, size);
    typename list<pair<K, V>>::iterator it;
    for (it = table[idx].begin(); it != table[idx].end(); it++) {
        if (it->first == key)
            return it->second;
    }
    return V();
}

template <class K, class V>
V& SCHashTable<K, V>::operator[](K const& key)
{
    size_t idx = hash(key, size);
    typename list<pair<K, V>>::iterator it;
    for (it = table[idx].begin(); it != table[idx].end(); it++) {
        if (it->first == key)
            return it->second;
    }

    ++elems;
    if (shouldResize())
        resizeTable();

    idx = hash(key, size);
    pair<K, V> p(key, V());
    table[idx].push_front(p);
    return table[idx].front().second;
}

template <class K, class V>
bool SCHashTable<K, V>::keyExists(K const& key) const
{
    size_t idx = hash(key, size);
    typename list<pair<K, V>>::iterator it;
    for (it = table[idx].begin(); it != table[idx].end(); it++) {
        if (it->first == key)
            return true;
    }
    return false;
}

template <class K, class V>
void SCHashTable<K, V>::clear()
{
    delete[] table;
    table = new list<pair<K, V>>[17];
    size = 17;
    elems = 0;
}

template <class K, class V>
void SCHashTable<K, V>::resizeTable()
{
    typename list<pair<K, V>>::iterator it;

    size_t oldSize = size;
    size = findPrime(size * 2);
    list<pair<K, V>>* newTable = new list<pair<K, V>>[size];

    for (size_t i = 0; i < oldSize; i++) {
        for (it = table[i].begin(); it != table[i].end(); it++) {
            size_t idx = hash(it->first, size);
            newTable[idx].push_front(*it);
        }
    }
    
    delete[] table;
    table = newTable;
}

word_counter.cpp

/**
 * @file word_counter.cpp
 * Implementation of the WordFreq class.
 *
 * @author Chase Geigle
 * @date Spring 2011
 * @date Spring 2012
 */

using std::cout;
using std::endl;
using std::ifstream;
using std::istringstream;
using std::pair;
using std::string;
using std::vector;

template <template <class K, class V> class Dict>
WordFreq<Dict>::WordFreq(const string& filename) : dict(256) {
    TextFile infile(filename);

    while (infile.good()) {
        string str = infile.getNextWord();
        dict[str]++;
    }
}

template <template <class K, class V> class Dict>
vector<pair<string, int>> WordFreq<Dict>::getWords(int threshold) {
    vector<pair<string, int>> ret;
    /**
     * @todo Implement this function.
     * @see char_counter.cpp if you're having trouble.
     */


    for (auto kv = dict.begin(); kv != dict.end(); ++kv) {
        if (kv->second < threshold) continue;

        ret.push_back(*kv);
    }

    return ret;
}

Notes about list Iterators

Checking out the Code

Review the .h files

remove And resizeTable on a Separate Chaining Hash Table

remove

resizeTable

insert On a Linear Probing Hash Table

Test Your Code Using charcount

Implement the WordFreq Class

Test Using the wordcount Executable

Implement the AnagramFinder Class

Test Using the anagramtest Executable