SMILearn Tutorial 1: Creating a Simple Data Set
This tutorial introduces one of the most basic structures in SMILearn: DSL_dataset. This class stores data and serves as a container that holds data between learning steps. The tutorial shows how to create a DSL_dataset object "manually". In most applications, the data set will instead be extracted from a data source (e.g., by a text parser).
First, we create the data elements, stored in std::vector objects. Please note that the number of records should be the same for each of the variables:
vector<int> aData;
vector<float> bData;
aData.push_back(1); bData.push_back(11.1f);
aData.push_back(2); bData.push_back(22.2f);
aData.push_back(3); bData.push_back(33.3f);
aData.push_back(4); bData.push_back(44.4f);
aData.push_back(5); bData.push_back(55.5f);
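The equal-length requirement can be checked with the standard library alone before any data set is built. The following is a minimal sketch; the helper names sameLength and fillColumns are hypothetical and are not part of SMILearn.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of SMILearn: returns true when two
// columns hold the same number of records.
template <typename A, typename B>
bool sameLength(const std::vector<A>& a, const std::vector<B>& b) {
    return a.size() == b.size();
}

// Fills two columns in lockstep, one record per loop iteration, so
// their lengths cannot drift apart. Assumes both vectors have the
// same length on entry (for example, both empty).
void fillColumns(std::vector<int>& aData, std::vector<float>& bData,
                 std::size_t numRecords) {
    for (std::size_t k = 1; k <= numRecords; ++k) {
        aData.push_back(static_cast<int>(k));
        bData.push_back(11.1f * static_cast<float>(k));
    }
    assert(sameLength(aData, bData));
}
```

Appending to both vectors inside the same loop iteration, as the tutorial code does with its paired push_back calls, is a simple way to keep the columns aligned.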
Then we create an empty data set (with no variables).
DSL_dataset ds;
AddIntVar and AddFloatVar are methods that add variables together with their corresponding data records. The first variable added in this way determines the number of records for the variables that follow.
ds.AddIntVar("MyDisc",&aData);
ds.AddFloatVar("MyCont",&bData);
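To illustrate the idea that the first variable fixes the record count, here is a sketch of a toy column container. ToyDataset and its methods are hypothetical stand-ins written for this illustration. They are not DSL_dataset, and the sketch makes no claim about how SMILearn itself reacts to a column of the wrong length.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for illustration only; this is NOT DSL_dataset.
class ToyDataset {
public:
    // Adds a column. The first column fixes the record count; later
    // columns are accepted only if their length matches it.
    bool addColumn(const std::string& id, const std::vector<float>& values) {
        if (!ids_.empty() && values.size() != numRecords_) {
            return false;  // rejected: length differs from the first column
        }
        if (ids_.empty()) {
            numRecords_ = values.size();
        }
        ids_.push_back(id);
        columns_.push_back(values);
        return true;
    }

    std::size_t numVariables() const { return ids_.size(); }
    std::size_t numRecords() const { return numRecords_; }

    // Returns the element of variable v in record r.
    float at(std::size_t v, std::size_t r) const {
        assert(v < columns_.size() && r < numRecords_);
        return columns_[v][r];
    }

private:
    std::vector<std::string> ids_;
    std::vector<std::vector<float> > columns_;
    std::size_t numRecords_ = 0;
};
```

In this sketch a five-record first column means that any later column must also contain exactly five records; a two-record column would be rejected by addColumn.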
If at this point we print the contents of the data set to the screen (the function PrintDataset is provided below), we will obtain the following:
===================
-- variable info --
number of variables = 2
Variable 0
id: MyDisc
is discrete
Missing element value: -1
State names:
Variable 1
id: MyCont
is continuous
Missing element value: -1.#IND
-- data records --
number of records = 5
1 11.1
2 22.2
3 33.3
4 44.4
5 55.5
The code below shows how to add elements by record, and how to add missing elements.
vector<DSL_dataElement> rec(ds.NumVariables());
rec[0].i = DSL_MISSING_INT;
rec[1].f = 88.8f;
ds.AddRecord(rec);
If we print the content of the data set now, we will obtain the following:
===================
-- variable info --
number of variables = 2
Variable 0
id: MyDisc
is discrete
Missing element value: -1
State names:
Variable 1
id: MyCont
is continuous
Missing element value: -1.#IND
-- data records --
number of records = 6
1 11.1
2 22.2
3 33.3
4 44.4
5 55.5
* 88.8
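The printout above reports -1 as the missing value for the discrete variable and -1.#IND for the continuous one; the latter appears to be a NaN as printed by some Microsoft C runtimes. The sketch below shows how such sentinel values can be represented and tested in plain C++. ToyElement, kMissingInt, and the helper functions are hypothetical names chosen for this illustration, not SMILearn identifiers.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-in for an element that holds either an int
// (discrete variable) or a float (continuous variable).
union ToyElement {
    int i;
    float f;
};

// Assumed sentinel, matching the -1 printed above for MyDisc.
const int kMissingInt = -1;

// A NaN works as a float sentinel because no valid measurement equals it.
inline float missingFloat() {
    return std::numeric_limits<float>::quiet_NaN();
}

inline bool isMissingInt(const ToyElement& e) { return e.i == kMissingInt; }

// NaN never compares equal to anything, itself included, so the test
// must use std::isnan rather than operator==.
inline bool isMissingFloat(const ToyElement& e) { return std::isnan(e.f); }

// Builds the extra record from the tutorial: MyDisc missing, MyCont = 88.8.
void fillRecord(ToyElement rec[2]) {
    rec[0].i = kMissingInt;
    rec[1].f = 88.8f;
    assert(isMissingInt(rec[0]) && !isMissingFloat(rec[1]));
}
```

One consequence of an integer sentinel such as -1 is that the same value cannot also be used as a legitimate data value for that variable.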
The function below shows how to access the information from the class DSL_dataset. This function was used to produce the results shown above.
void PrintDataset(const DSL_dataset &d)
{
    // The code below shows how to access crucial information from
    // a data set.
    int v,r;
    cout << " ===================" << endl;
    cout << " -- variable info --" << endl;
    cout << "number of variables = " << d.NumVariables() << endl;
    // Print the description of each variable.
    for (v=0;v<d.NumVariables();v++)
    {
        DSL_variableInfo vi;
        d.GetVariableInfo(v,vi);
        cout << " Variable " << v << endl;
        cout << "\tid: " << vi.id << endl;
        cout << "\t";
        if (vi.useInt)
            cout << "is discrete";
        else
            cout << "is continuous";
        cout << endl;
        cout << "\tMissing element value: ";
        if (vi.useInt)
        {
            cout << vi.missingValue.i << endl;
            cout << "\tState names: ";
            for (unsigned s=0;s<vi.stateNames.size();s++)
                cout << vi.stateNames[s] << " ";
            cout << endl;
        }
        else
            cout << vi.missingValue.f << endl;
    }
    cout << " -- data records --" << endl;
    cout << "number of records = " << d.NumRecords() << endl;
    // Print one record per line; missing elements are shown as "*".
    for (r=0;r<d.NumRecords();r++)
    {
        for (v=0;v<d.NumVariables();v++)
        {
            if (d.IsMissing(v,r))
                cout << "*";
            else
            {
                if (d.IsDiscrete(v))
                    cout << d.At(v,r).i;
                else
                    cout << d.At(v,r).f;
            }
            cout << "\t" ;
        }
        cout << endl;
    }
    cout << endl;
}
