SMILearn Tutorial 1: Creating a Simple Data Set

This tutorial introduces one of the most basic structures in SMILearn: DSL_dataset. This class stores data and acts as the container that carries data between learning steps. The tutorial shows how to create a DSL_dataset object "manually"; in most applications, the data set will instead be extracted from a data source (e.g., by a text parser).


First, we create the data elements and store them in std::vector objects. Note that every variable must have the same number of records:

  vector<int> aData;
  vector<float> bData;
  
  aData.push_back(1); bData.push_back(11.1f);
  aData.push_back(2); bData.push_back(22.2f);
  aData.push_back(3); bData.push_back(33.3f);
  aData.push_back(4); bData.push_back(44.4f);
  aData.push_back(5); bData.push_back(55.5f);

Then we create an empty data set (with no variables).

  DSL_dataset ds;

AddIntVar and AddFloatVar each add a variable together with its data records. The first variable added in this way determines the number of records for the variables that follow.

  ds.AddIntVar("MyDisc",&aData);
  ds.AddFloatVar("MyCont",&bData);
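The rule that the first variable fixes the record count can be illustrated with a small stand-alone sketch. `ToyTable` and `AddColumn` are hypothetical names invented for this illustration and are not part of SMILearn; the sketch only tracks column ids and the record count, and how the real class reacts to a length mismatch is not covered by this tutorial.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy container (not the SMILearn class): the first column
// added fixes the record count, and later columns must match it.
class ToyTable {
public:
    ToyTable() : numRecords_(0) {}

    // Returns true on success, false when the column length does not match
    // the record count set by the first column.
    template <typename T>
    bool AddColumn(const std::string& id, const std::vector<T>& data)
    {
        if (!ids_.empty() && data.size() != numRecords_)
            return false;
        if (ids_.empty())
            numRecords_ = data.size();
        ids_.push_back(id);
        return true;
    }

    std::size_t NumColumns() const { return ids_.size(); }
    std::size_t NumRecords() const { return numRecords_; }

private:
    std::vector<std::string> ids_;  // one id per accepted column
    std::size_t numRecords_;        // set by the first accepted column
};
```

Rejecting a mismatched column up front keeps every record complete, which is the property the tutorial's note about equal record counts is protecting.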

If at this point we print the content of the data set to the screen (the PrintDataset function is provided below), we obtain the following:

  ===================
  -- variable info --
  number of variables = 2
  Variable 0
       id: MyDisc
       is discrete
       Missing element value: -1
       State names:
  Variable 1
       id: MyCont
       is continuous
       Missing element value: -1.#IND
  -- data records --
  number of records = 5
  1       11.1
  2       22.2
  3       33.3
  4       44.4
  5       55.5
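The `-1.#IND` printed as the missing value of MyCont is how older Microsoft C runtimes render an indefinite NaN. This suggests, though the tutorial does not state it, that a missing float is represented as NaN. Under that assumption, the sketch below shows why such a value must be detected with `std::isnan` rather than with `==`. `MissingFloat` and `IsMissingFloat` are hypothetical names used only for illustration.

```cpp
#include <cmath>
#include <limits>

// Hypothetical sentinel for a missing float, assumed to be a quiet NaN.
inline float MissingFloat()
{
    return std::numeric_limits<float>::quiet_NaN();
}

// NaN never compares equal to anything, itself included, so an equality
// test against the sentinel would always fail; std::isnan is the safe check.
inline bool IsMissingFloat(float x)
{
    return std::isnan(x);
}
```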

The code below shows how to add a whole record at once and how to mark an element of that record as missing.

  vector<DSL_dataElement> rec(ds.NumVariables());
  rec[0].i = DSL_MISSING_INT;   
  rec[1].f = 88.8f; 
  ds.AddRecord(rec);
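To make the record-wise step concrete, here is a minimal stand-alone sketch of appending one record to column-wise storage. Everything in it is an assumption made for illustration: `ToyElement`, `ToyColumns`, `AppendRecord` and `kToyMissingInt` are hypothetical, and the column-wise layout is a choice of this sketch, not a statement about how DSL_dataset stores data internally.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a data element: one slot read as int or float.
union ToyElement {
    int i;
    float f;
};

// Assumed sentinel for a missing integer; mirrors the -1 printed earlier.
const int kToyMissingInt = -1;

// Column-wise storage: columns[v][r] is the value of variable v in record r.
typedef std::vector<std::vector<ToyElement> > ToyColumns;

// Appends one record (one element per variable). Returns false and leaves
// the columns untouched if the record width does not match.
bool AppendRecord(ToyColumns& columns, const std::vector<ToyElement>& rec)
{
    if (rec.size() != columns.size())
        return false;
    for (std::size_t v = 0; v < columns.size(); ++v)
        columns[v].push_back(rec[v]);
    return true;
}
```

Because the element is a union, the caller has to know from the variable's type which member to read, which is why the printing code in this tutorial checks whether a variable is discrete before choosing between `.i` and `.f`.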

If we print the content of the data set now, we will obtain the following:

  ===================
  -- variable info --
  number of variables = 2
  Variable 0
      id: MyDisc
      is discrete
      Missing element value: -1
      State names:
  Variable 1
      id: MyCont
      is continuous
      Missing element value: -1.#IND
  -- data records --
  number of records = 6
  1       11.1
  2       22.2
  3       33.3
  4       44.4
  5       55.5
  *       88.8

The function below shows how to read information out of a DSL_dataset object. It was used to produce the output shown above.

  void PrintDataset(const DSL_dataset &d)
  {
  // The code below shows how to access crucial information from
  // a data set.
  int v,r;
  cout << " ===================" << endl;
  cout << " -- variable info --" << endl;
  cout << "number of variables = " << d.NumVariables() << endl;
  for (v=0;v<d.NumVariables();v++)
  {
   DSL_variableInfo vi;
   d.GetVariableInfo(v,vi);
   cout << " Variable " << v << endl;
   cout << "\tid: " << vi.id << endl; 
   cout << "\t"; 
   if (vi.useInt) 
     cout << "is discrete";
   else
     cout << "is continuous";
   cout << endl;
   cout << "\tMissing element value: ";
   if (vi.useInt)
   {
     cout << vi.missingValue.i << endl;
     cout << "\tState names: ";
     for (unsigned s=0;s<vi.stateNames.size();s++)
       cout << vi.stateNames[s] << " ";
     cout << endl;
   }
   else
     cout << vi.missingValue.f << endl;
  }
  
  cout << " -- data records --" << endl;
  cout << "number of records = " << d.NumRecords() << endl;
  for (r=0;r<d.NumRecords();r++)
  {
   for (v=0;v<d.NumVariables();v++)
   {
     if (d.IsMissing(v,r))
       cout << "*";
     else
     {
       if (d.IsDiscrete(v))
         cout << d.At(v,r).i;
       else
         cout << d.At(v,r).f;
     }
     cout << "\t" ;
   }
   cout << endl;
  }
  cout << endl;
  }