SMILearn Tutorial 2: Parsing a Text File

DSL_textParser is now obsolete. DSL_dataset has a built-in parser (see Tutorial 4). This tutorial is kept here for historical purposes.

This tutorial shows how to import data from a text file into a DSL_dataset object. Let us assume we want to parse the text file "data.txt", which is shown below:

  A B C D E
  ordinal discrete ordinal discrete continuous
  0 0 0 true 1.1
  1 0 1 false 2.1
  * * 0 true 0.1 
  1 1 1 * 3

The first step is to create a DSL_textParser object and set the appropriate parameters. Our text file contains a header with the names of the variables (A, B, C, ...) and a second line that explicitly defines their types (ordinal, discrete, and continuous). We need to tell the parser to expect this information in the file. The default marker for a missing data element is "*", so we do not need to set this parameter explicitly. Here is the code that creates the parser object:

  DSL_textParser parser;
  parser.SetUseHeader(true);
  parser.SetTypesSpecified(true);
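
To make these settings concrete, here is a minimal sketch, independent of SMILearn, of how a file laid out like "data.txt" could be read: one row of names, one row of types, then records in which "*" marks a missing element. The names ParsedTable, splitTokens, and parseTable are hypothetical and are not part of the library; the sketch assumes whitespace-separated fields and a C++17 compiler (for std::optional). It says nothing about how DSL_textParser is actually implemented.

```cpp
#include <istream>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed table; not a SMILearn type.
struct ParsedTable {
    std::vector<std::string> names;  // header row, e.g. A B C D E
    std::vector<std::string> types;  // type row, e.g. ordinal discrete ...
    // One inner vector per record; std::nullopt marks a missing element.
    std::vector<std::vector<std::optional<std::string>>> rows;
};

// Split one line into whitespace-separated tokens.
static std::vector<std::string> splitTokens(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> tokens;
    for (std::string tok; in >> tok;) tokens.push_back(tok);
    return tokens;
}

// Read a header row, a type row, and then data records from `in`.
// Tokens equal to `missing` are stored as std::nullopt.
ParsedTable parseTable(std::istream& in, const std::string& missing = "*") {
    ParsedTable table;
    std::string line;
    if (!std::getline(in, line)) throw std::runtime_error("missing header row");
    table.names = splitTokens(line);
    if (!std::getline(in, line)) throw std::runtime_error("missing type row");
    table.types = splitTokens(line);
    if (table.types.size() != table.names.size())
        throw std::runtime_error("header and type rows differ in length");
    while (std::getline(in, line)) {
        std::vector<std::string> tokens = splitTokens(line);
        if (tokens.empty()) continue;  // skip blank lines
        if (tokens.size() != table.names.size())
            throw std::runtime_error("record has wrong number of fields");
        std::vector<std::optional<std::string>> row;
        for (const std::string& tok : tokens) {
            if (tok == missing) row.push_back(std::nullopt);
            else row.push_back(tok);
        }
        table.rows.push_back(row);
    }
    return table;
}
```

Passing a std::ifstream opened on "data.txt" to parseTable would yield five names, five types, and four records, with the missing elements stored as empty optionals.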

Once the parser object is correctly initialized, call the Parse method to read and interpret the file:

  if (parser.Parse("data.txt") != DSL_OKAY)
      cout << "Parsing failed!" << endl;

It is always worth checking the result of this method, because file operations can fail (for example, when the file does not exist or cannot be read). Finally, we want to create a data set object that contains the parsed data. The code below shows how to do that:

  DSL_dataset d = parser.GetDataset();

If we print the content of the data set using the function PrintDataset introduced in Tutorial 1, we obtain the following output:

  ===================
  -- variable info --
  number of variables = 5
  Variable 0
        id: Column_0
        is continuous
        Missing element value: -1.#IND
  Variable 1
        id: Column_1
        is discrete
        Missing element value: -1
        State names: 0 1
  Variable 2
        id: Column_2
        is continuous
        Missing element value: -1.#IND
  Variable 3
        id: Column_3
        is discrete
        Missing element value: -1
        State names: d1 d2
  Variable 4
        id: Column_4
        is continuous
        Missing element value: -1.#IND
  -- data records --
  number of records = 4
  0       0       0       0       1.1
  1       0       1       1       2.1
  *       *       0       0       0.1
  1       1       1       *       3
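
In the record listing above, the discrete values in the fourth column appear as integers (true as 0, false as 1), and the variable info reports -1 as the missing element value for discrete variables. The sketch below shows one way such an encoding could be produced. It assumes that state indices are assigned in order of first appearance, which the tutorial does not state; EncodedColumn and encodeDiscrete are hypothetical names, not SMILearn functions.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type, not part of SMILearn.
struct EncodedColumn {
    std::vector<std::string> stateNames;  // state index -> original text value
    std::vector<int> values;              // one entry per record, -1 = missing
};

// Encode text values as state indices in order of first appearance.
// Values equal to `missing` are encoded as -1.
EncodedColumn encodeDiscrete(const std::vector<std::string>& column,
                             const std::string& missing = "*") {
    EncodedColumn out;
    std::unordered_map<std::string, int> indexOf;  // value -> state index
    for (const std::string& v : column) {
        if (v == missing) {
            out.values.push_back(-1);
            continue;
        }
        auto it = indexOf.find(v);
        if (it == indexOf.end()) {
            // First time this value is seen: give it the next free index.
            it = indexOf.emplace(v, static_cast<int>(out.stateNames.size())).first;
            out.stateNames.push_back(v);
        }
        out.values.push_back(it->second);
    }
    return out;
}
```

Applied to the fourth column of "data.txt" (true, false, true, *), this yields the values 0, 1, 0, -1, which corresponds to the 0, 1, 0, * shown in the record listing.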