Sunday, March 24, 2013

Comparing and Filtering through Lists of Data

In this tutorial, similarly to the Comma Adder Tutorial, we will be reading in data and then modifying it for output into another file.

However, this time, there will be another file of data that we will compare it with. We will remove duplicates from our output, and we will keep track of those duplicates and output them to another file. So, if our input for the first file is this:

1234567
2345678
3456789

and our input from the second file is this (the duplicate is 1234567):

8573968
3958396
1234567
3058376

Our first output file will be this (with duplicate(s) removed):

2345678
3456789

and our second output file will be this (the duplicate(s)):

1234567

Here's a screenshot of our fully finished program working:


and here are some useful links for you in case you want them:
 
Code file ("main.cpp"): http://thecodingwebsite.com/tutorials/listfilter/main.cpp
Code file as a ".txt" file ("main.txt") so you can view it in your browser: http://thecodingwebsite.com/tutorials/listfilter/main.txt
Finished program: http://thecodingwebsite.com/tutorials/listfilter/ListFilter.exe
Sample input1 file ("input1.txt") (Main list): http://thecodingwebsite.com/tutorials/listfilter/input1.txt
Sample input2 file ("input2.txt") (Exclude list): http://thecodingwebsite.com/tutorials/listfilter/input2.txt
Sample output1 file ("output1.txt") (Main output/Filtered list): http://thecodingwebsite.com/tutorials/listfilter/output1.txt
Sample output2 file ("output2.txt") (Duplicate list): http://thecodingwebsite.com/tutorials/listfilter/output2.txt


This program will require that we store 4 lists of data:
  1. The main input list.
  2. The exclude/filter list.
  3. The main output/filtered list.
  4. The duplicate list.
In order to achieve this, we will need to include the "list" library:

#include <list>

The "list" class enables us to store a sequence of data that can change size as it needs to (internally it is a linked list rather than an array). This will especially come in handy for our two output lists, as we do not know at the start of the program how big they will need to be.
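As a small illustration of that growth, here is a minimal sketch. The function name "collectSample" is made up for this example and is not part of the tutorial's program; it only shows a list starting empty and getting bigger as strings are added:

```cpp
#include <list>
#include <string>

// Hypothetical helper, not part of the tutorial's program:
// builds a list by adding one element at a time.
std::list<std::string> collectSample()
{
    std::list<std::string> values;   // starts out empty (size() == 0)

    values.push_back("1234567");     // size() is now 1
    values.push_back("2345678");     // size() is now 2
    values.push_back("3456789");     // size() is now 3

    return values;
}
```

At no point did we have to say in advance how many elements the list would hold.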

The beginning of our program is much the same as in the Comma Adder; however, this time we will be reading from 2 files and writing to 2 files:

#include <list>
#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib> //Needed for system().

using namespace std;

int main()
{
 //Open "input1.txt" and "input2.txt" for reading.
 fstream fin1("input1.txt");
 fstream fin2("input2.txt");

 //Open "output1.txt" and "output2.txt" for writing - the ios::out parameter is required. This will overwrite if applicable.
 fstream fout1("output1.txt", ios::out);
 fstream fout2("output2.txt", ios::out);

 //Make sure both input and output files were opened properly.
 if (fin1 && fin2 && fout1 && fout2)
 {
  //Tell them we're comparing the lists.
  cout << "Comparing lists...\n\n";

Next, we will need to create 4 lists. Each of these lists will be holding values of type "string":

//These are the 4 lists that we will need for this program.
  list<string> mainList, excludeList, outputList, duplicateList;

Then we will read in both files ("input1.txt" and "input2.txt") of data (one at a time) and store each line of data into its appropriate list ("input1.txt" will go into the "mainList", while "input2.txt" will go into the "excludeList"):

//This will hold each line of characters that we read and write.
  string nextLine = "";

  //This is an infinite loop - it will only break (exit) from the loop when a read fails, which happens once we've reached the end of the file.
  while (true)
  {
   //Read the next line from the file.
   getline(fin1, nextLine);

   //Check to see if the read failed (e.g. we've reached the end of the file). If so, break out of the loop before storing anything.
   if (!fin1)
   {
    break;
   }

   //Add the next line of data to the main list.
   mainList.push_back(nextLine);
  }

  //Repeat the procedure above, except use fin2 and excludeList this time.
  while (true)
  {
   //Read the next line from the file.
   getline(fin2, nextLine);

   //Check to see if the read failed (e.g. we've reached the end of the file). If so, break out of the loop before storing anything.
   if (!fin2)
   {
    break;
   }

   //Add the next line of data to the exclude list.
   excludeList.push_back(nextLine);
  }

Note that we used the "push_back" function to add each line (string) of data to each list:

//Add the next line of data to the main list.
   mainList.push_back(nextLine);

   //Add the next line of data to the exclude list.
   excludeList.push_back(nextLine);

That function simply adds (pushes) the string to the end (back) of the list. We don't ever have to worry about the size of the list in this program, although there are functions (e.g. "mainList.size()") for accessing this sort of information about a list if the need arises.
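A more compact way to write the same kind of reading loop is to use the result of "getline" as the loop condition. The sketch below assumes two helpers named "readLines" and "countSampleLines" (both names are invented here, they are not part of the tutorial's program), and it reads from an in-memory "istringstream" so that it can be tried without any files:

```cpp
#include <cstddef>
#include <istream>
#include <list>
#include <sstream>
#include <string>

// Hypothetical helper: reads every line from "in" into a list.
// getline is used as the loop condition, so a failed read
// (such as hitting the end of the input) never adds an element.
std::list<std::string> readLines(std::istream& in)
{
    std::list<std::string> lines;
    std::string nextLine;

    while (std::getline(in, nextLine))
    {
        lines.push_back(nextLine);
    }

    return lines;
}

// Hypothetical demo: tries the helper on in-memory text instead of a real file.
std::size_t countSampleLines()
{
    std::istringstream sample("1234567\n2345678\n3456789\n");
    return readLines(sample).size();
}
```

Because "readLines" takes any input stream, the same function could be handed "fin1" or "fin2" instead of the sample text.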

Now that we have all of our data read into the program and stored into the appropriate lists, we can work on filling the two output lists. Fortunately, we will be able to fill both output lists at once.


Here is the entire duplication-checking code, which I will run through piece by piece later:

//Used to keep track of duplicates.
  bool isDuplicate;

  //Iterate through each element in mainList.
  for (list<string>::iterator i = mainList.begin(); i != mainList.end(); ++i)
  {
   //Assume that a duplicate was not found until proven otherwise.
   isDuplicate = false;

   //Iterate through each element in excludeList.
   for (list<string>::iterator j = excludeList.begin(); j != excludeList.end(); ++j)
   {
    //See if the mainList element and the excludeList element are duplicates.
    if ((string)*i == (string)*j)
    {
     //If so, set the boolean to true so that the mainList value isn't added to the outputList later.
     isDuplicate = true;

     //Instead, add the duplicate value to the duplicateList.
     duplicateList.push_back((string)*i);

     //No need to make any more comparisons for this particular mainList element.
     break;
    }
   }

   //Check to see if this mainList element was not a duplicate of any of the excludeList elements.
   if (!isDuplicate)
   {
    //Place this element in the outputList.
    outputList.push_back((string)*i);
   }
  }


We start by creating a boolean to keep track of whether or not a duplicate was found:

//Used to keep track of duplicates.
  bool isDuplicate;

Then we will "iterate" through (go through) every element in "mainList", one at a time. To do this, we'll need to use:

list<string>::iterator

which is a variable type (like "int" or "string"). This type, however, exists specifically for the purpose of iterating through a "list", which we just so happen to be using. So, here's how we'll iterate through "mainList":


//Iterate through each element in mainList.
  for (list<string>::iterator i = mainList.begin(); i != mainList.end(); ++i)
  {

The "begin" function returns an iterator to the first element of the list, and the "end" function returns an iterator to the position just past the last element. Comparing "i" against "mainList.end()" is how the loop knows when it has visited every element.
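To see "begin" and "end" at work on their own, here is a minimal sketch that simply counts the elements of a list by walking from one to the other. The function name "countElements" is made up for this example; in a real program you would just call "size()":

```cpp
#include <list>
#include <string>

// Hypothetical helper: counts elements by walking from begin() to end().
// end() refers to the position just past the last element, so the loop
// stops without ever dereferencing it.
int countElements(const std::list<std::string>& values)
{
    int count = 0;

    // const_iterator is used because "values" is a const reference.
    for (std::list<std::string>::const_iterator i = values.begin(); i != values.end(); ++i)
    {
        ++count;
    }

    return count;
}
```

For an empty list, "begin()" and "end()" are equal, so the loop body never runs and the function returns 0.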

At the beginning of each iteration, we are going to make the assumption that the current piece of data in "mainList" has no duplicates, until proven otherwise:

//Assume that a duplicate was not found until proven otherwise.
   isDuplicate = false;

Then we will iterate through each element in "excludeList" in the same manner. Notice how we have one for loop inside another, however:

//Iterate through each element in mainList.
  for (list<string>::iterator i = mainList.begin(); i != mainList.end(); ++i)
  {
   //Assume that a duplicate was not found until proven otherwise.
   isDuplicate = false;

   //Iterate through each element in excludeList.
   for (list<string>::iterator j = excludeList.begin(); j != excludeList.end(); ++j)
   {

This means that we will loop through each element in "excludeList" once for each element in "mainList", allowing us to properly check for duplicates:

//See if the mainList element and the excludeList element are duplicates.
    if ((string)*i == (string)*j)
    {
     //If so, set the boolean to true so that the mainList value isn't added to the outputList later.
     isDuplicate = true;

     //Instead, add the duplicate value to the duplicateList.
     duplicateList.push_back((string)*i);

     //No need to make any more comparisons for this particular mainList element.
     break;
    }

In order to retrieve an iterator's value, we first need to "dereference" it. An iterator behaves like a "pointer" that points to the element - it is not the element's value itself. By dereferencing it with "*", however, we gain access to the value of the element:

*i
*j
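Here is a minimal sketch of what dereferencing gives us. The helper names "markAll" and "totalLength" are invented for this example and are not part of the tutorial's program:

```cpp
#include <cstddef>
#include <list>
#include <string>

// Hypothetical helper: appends "!" to every element, in place.
void markAll(std::list<std::string>& values)
{
    for (std::list<std::string>::iterator i = values.begin(); i != values.end(); ++i)
    {
        // *i refers to the element stored in the list,
        // so changing *i changes the list itself.
        *i += "!";
    }
}

// Hypothetical helper: adds up the number of characters in all elements.
std::size_t totalLength(const std::list<std::string>& values)
{
    std::size_t total = 0;

    for (std::list<std::string>::const_iterator i = values.begin(); i != values.end(); ++i)
    {
        // i->size() is shorthand for (*i).size().
        total += i->size();
    }

    return total;
}
```

Notice that "*i" can be used directly as a "string" in both helpers.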

Dereferencing a "list<string>::iterator" already gives us a "string", so the "(string)" casts in the comparison are not strictly required; they only make the type explicit:

//See if the mainList element and the excludeList element are duplicates.
    if ((string)*i == (string)*j)
    {

If they are duplicates, we set the "isDuplicate" boolean to true (to prevent the value from being added to "outputList" later on) and we instead add the value to "duplicateList". We also break from the inner loop, because we already know that the current "mainList" element is a duplicate:

//If so, set the boolean to true so that the mainList value isn't added to the outputList later.
     isDuplicate = true;

     //Instead, add the duplicate value to the duplicateList.
     duplicateList.push_back((string)*i);

     //No need to make any more comparisons for this particular mainList element.
     break;
    }

The last part of the outer for loop is to add the non-duplicate elements to "outputList":

//Check to see if this mainList element was not a duplicate of any of the excludeList elements.
   if (!isDuplicate)
   {
    //Place this element in the outputList.
    outputList.push_back((string)*i);
   }
  }
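If you would rather not write the inner loop by hand, the standard "find" function from the "algorithm" library can do the searching. The following is only a sketch of that alternative, not the code used in this tutorial's program; the function name "filterLists" is made up for the example:

```cpp
#include <algorithm>
#include <list>
#include <string>

// Hypothetical helper: splits mainList into values that are absent from
// excludeList (outputList) and values that are present in it (duplicateList).
void filterLists(const std::list<std::string>& mainList,
                 const std::list<std::string>& excludeList,
                 std::list<std::string>& outputList,
                 std::list<std::string>& duplicateList)
{
    for (std::list<std::string>::const_iterator i = mainList.begin(); i != mainList.end(); ++i)
    {
        // std::find returns excludeList.end() when no element equals *i.
        bool isDuplicate =
            std::find(excludeList.begin(), excludeList.end(), *i) != excludeList.end();

        if (isDuplicate)
        {
            duplicateList.push_back(*i);
        }
        else
        {
            outputList.push_back(*i);
        }
    }
}
```

This does the same amount of work as the nested loops (every "mainList" element may still be compared against every "excludeList" element); it is just shorter to read.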

Now that the duplication-checking process is complete, we can finally iterate through "outputList" and "duplicateList" (separately) and output each element to its corresponding file:

for (list<string>::iterator i = outputList.begin(); i != outputList.end(); ++i)
  {
   fout1 << (string)*i << "\n";
  }

  for (list<string>::iterator i = duplicateList.begin(); i != duplicateList.end(); ++i)
  {
   fout2 << (string)*i << "\n";
  }
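Since both output loops do the same thing, they could also be folded into one small function. This is a sketch under the assumption of a helper named "writeLines" (a name invented here); "joinSample" is likewise hypothetical and writes to an in-memory "ostringstream" so the result can be inspected without a file:

```cpp
#include <list>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes each element of "lines" to "out",
// one element per line.
void writeLines(std::ostream& out, const std::list<std::string>& lines)
{
    for (std::list<std::string>::const_iterator i = lines.begin(); i != lines.end(); ++i)
    {
        out << *i << "\n";
    }
}

// Hypothetical demo: writes two sample values to an in-memory stream
// and returns the text that was produced.
std::string joinSample()
{
    std::list<std::string> lines;
    lines.push_back("2345678");
    lines.push_back("3456789");

    std::ostringstream out;
    writeLines(out, lines);
    return out.str();
}
```

Because "writeLines" accepts any output stream, it could be called once with "fout1" and "outputList" and once with "fout2" and "duplicateList".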

Finally, we have the rest of the program, which is basically the same as the other programs we've written except that we have to close 4 files this time:

//Assume success since there were no file opening errors.
  cout << "Comparing lists successful! Check:\n\n\"output1.txt\" (for a filtered list) and\n\"output2.txt\" (for a list of duplicates found).";
 }
 else
 {
  //The files were not opened properly - let the user know that it was unsuccessful.
  cout << "Error either opening \"input1.txt\" or \"input2.txt\" for reading or opening \"output1.txt\" or \"output2.txt\" for writing.";
 }

 //Close the files.
 fout2.close();
 fout1.close();
 fin2.close();
 fin1.close();

 //Output some new line characters before the program asks for a key press.
 cout << "\n\n";

 system("pause");
}

Here's our entire completed program:

#include <list>
#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib> //Needed for system().

using namespace std;

int main()
{
 //Open "input1.txt" and "input2.txt" for reading.
 fstream fin1("input1.txt");
 fstream fin2("input2.txt");

 //Open "output1.txt" and "output2.txt" for writing - the ios::out parameter is required. This will overwrite if applicable.
 fstream fout1("output1.txt", ios::out);
 fstream fout2("output2.txt", ios::out);

 //Make sure both input and output files were opened properly.
 if (fin1 && fin2 && fout1 && fout2)
 {
  //Tell them we're comparing the lists.
  cout << "Comparing lists...\n\n";

  //These are the 4 lists that we will need for this program.
  list<string> mainList, excludeList, outputList, duplicateList;

  //This will hold each line of characters that we read and write.
  string nextLine = "";

  //This is an infinite loop - it will only break (exit) from the loop when a read fails, which happens once we've reached the end of the file.
  while (true)
  {
   //Read the next line from the file.
   getline(fin1, nextLine);

   //Check to see if the read failed (e.g. we've reached the end of the file). If so, break out of the loop before storing anything.
   if (!fin1)
   {
    break;
   }

   //Add the next line of data to the main list.
   mainList.push_back(nextLine);
  }

  //Repeat the procedure above, except use fin2 and excludeList this time.
  while (true)
  {
   //Read the next line from the file.
   getline(fin2, nextLine);

   //Check to see if the read failed (e.g. we've reached the end of the file). If so, break out of the loop before storing anything.
   if (!fin2)
   {
    break;
   }

   //Add the next line of data to the exclude list.
   excludeList.push_back(nextLine);
  }

  //Used to keep track of duplicates.
  bool isDuplicate;

  //Iterate through each element in mainList.
  for (list<string>::iterator i = mainList.begin(); i != mainList.end(); ++i)
  {
   //Assume that a duplicate was not found until proven otherwise.
   isDuplicate = false;

   //Iterate through each element in excludeList.
   for (list<string>::iterator j = excludeList.begin(); j != excludeList.end(); ++j)
   {
    //See if the mainList element and the excludeList element are duplicates.
    if ((string)*i == (string)*j)
    {
     //If so, set the boolean to true so that the mainList value isn't added to the outputList later.
     isDuplicate = true;

     //Instead, add the duplicate value to the duplicateList.
     duplicateList.push_back((string)*i);

     //No need to make any more comparisons for this particular mainList element.
     break;
    }
   }

   //Check to see if this mainList element was not a duplicate of any of the excludeList elements.
   if (!isDuplicate)
   {
    //Place this element in the outputList.
    outputList.push_back((string)*i);
   }
  }

  for (list<string>::iterator i = outputList.begin(); i != outputList.end(); ++i)
  {
   fout1 << (string)*i << "\n";
  }

  for (list<string>::iterator i = duplicateList.begin(); i != duplicateList.end(); ++i)
  {
   fout2 << (string)*i << "\n";
  }

  //Assume success since there were no file opening errors.
  cout << "Comparing lists successful! Check:\n\n\"output1.txt\" (for a filtered list) and\n\"output2.txt\" (for a list of duplicates found).";
 }
 else
 {
  //The files were not opened properly - let the user know that it was unsuccessful.
  cout << "Error either opening \"input1.txt\" or \"input2.txt\" for reading or opening \"output1.txt\" or \"output2.txt\" for writing.";
 }

 //Close the files.
 fout2.close();
 fout1.close();
 fin2.close();
 fin1.close();

 //Output some new line characters before the program asks for a key press.
 cout << "\n\n";

 system("pause");
}

Hopefully this has helped you learn how to deal with lists, iterators, pointers, and dereferencing. If you have any questions, feel free to ask them below!
