GeNIe Tutorials: Tutorial 14 - Cleaning data
Objective of the tutorial: To learn how to clean data before proceeding to learning Bayesian networks.
Estimated Time: 20 Minutes
At the end of the tutorial you will be able to:
- Edit data files to learn Bayesian networks.
Before a structure (or only the parameters) is learned, it might be necessary or desirable to prepare the data. GeNIe allows editing data files in several ways, described in the sections below. For the purpose of this tutorial we will be using the data file retention.txt, which can be found in the examples subfolder of the GeNIe installation folder.
Loading Data Files
To load a data file, choose the File -> Open Data File... option from the main menu, then select the data file you wish to load. GeNIe uses the data grid view to display loaded data files and lets users work with them much like with spreadsheets. If you opened retention.txt, your screen should look like this:
Missing Values
To select rows that contain missing values (for instance, for deletion), select the Data -> Missing Values -> Select... option from the main menu. You will be given two options: (1) select all rows containing missing values, or (2) select only rows with a missing value in the currently selected column. A column in the data grid is considered selected if the cursor is placed in any of its cells. If the data file does not contain any missing values, GeNIe will inform you about that. Otherwise, GeNIe will tell you how many rows were selected, and the corresponding rows will be highlighted in the data grid.
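The two selection modes described above can be sketched outside GeNIe as a quick cross-check. This is only an illustration, not GeNIe's implementation: the column names gpa and sat, the sample rows, and the use of None to mark a missing cell are all assumptions made for the example.

```python
# Hypothetical in-memory table: a list of rows, with None marking a missing cell.
# The column names and values are made up for illustration.
rows = [
    {"gpa": 3.1, "sat": 1200},
    {"gpa": None, "sat": 1100},
    {"gpa": 2.8, "sat": None},
    {"gpa": 3.6, "sat": 1350},
]


def rows_with_missing(rows, column=None):
    """Return the indices of rows that have a missing value.

    column=None mirrors option (1): a missing value in any column counts.
    A column name mirrors option (2): only that column is checked.
    """
    selected = []
    for i, row in enumerate(rows):
        cells = row.values() if column is None else [row[column]]
        if any(cell is None for cell in cells):
            selected.append(i)
    return selected


any_missing = rows_with_missing(rows)
gpa_missing = rows_with_missing(rows, column="gpa")
print(len(any_missing), "rows selected:", any_missing)
print(len(gpa_missing), "rows selected:", gpa_missing)
```

Returning row indices rather than the rows themselves keeps the selection separate from whatever is done with it next (deleting, highlighting, or inspecting).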
If the data file contains any missing values, you can choose to replace them. To do that, select the Data -> Missing Values -> Replace... option from the main menu. You will be given a choice to replace them (1) with a specific value or (2) with the average of the selected column. The values will be replaced in the currently selected column. The replaced values are shown in red, as on the screenshot below:
Your decision will be remembered and applied if you later delete any of the existing values (try deleting one of them to see this).
You can roll back all the replace actions on a column by selecting that column (putting the cursor in one of its cells) and selecting the Data -> Missing Values -> Restore... option from the main menu. The inserted values will be removed from the data grid.
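The replace and restore steps can be pictured with the following minimal sketch. It assumes missing cells are represented as None and that the average is taken over the observed (non-missing) values only; the function names are hypothetical and say nothing about how GeNIe stores this internally.

```python
from statistics import mean


def replace_missing(values, fill=None):
    """Return (new_values, replaced_indices) for one column.

    fill=None uses the average of the observed values (option 2);
    any other value is used as a constant replacement (option 1).
    Raises statistics.StatisticsError if the column has no observed values.
    """
    if fill is None:
        fill = mean(v for v in values if v is not None)
    replaced = [i for i, v in enumerate(values) if v is None]
    new_values = [fill if v is None else v for v in values]
    return new_values, replaced


def restore_missing(values, replaced):
    """Undo a replacement by turning the replaced cells back into None."""
    marked = set(replaced)
    return [None if i in marked else v for i, v in enumerate(values)]


column = [2.0, None, 4.0, None, 6.0]
filled, replaced = replace_missing(column)           # replace with the average
restored = restore_missing(filled, replaced)         # roll the replacement back
print(filled, replaced, restored)
```

Keeping the list of replaced positions is what makes the rollback possible: without it, an inserted average would be indistinguishable from an original value.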
Discretization
GeNIe provides a tool to discretize values. To invoke the discretization interface, select the Data -> Discretize... option from the main menu. First, choose the discretization method: (1) hierarchical, (2) uniform widths or (3) uniform counts. Then select the number of bins. You can change the number of bins for the original distribution using the slider at the bottom of the window. When you press the Discretize button, you should get something like this:
You can keep adjusting the settings after you have pressed the Discretize button and see them applied on the fly. You can also drag the boundaries between two bins (denoted with different colors) on the histogram to adjust the intervals (e.g., to fine-tune the shape of the distribution), or adjust them in the grid view above the graph. After you accept the changes, the discretized column will be shown in blue. You can stop discretization at any time by selecting the Data -> Stop Discretization option from the main menu. Once discretization has been stopped, you can resume it from the same Data menu.
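To make the difference between the uniform widths and uniform counts methods concrete, here is a minimal sketch of how bin boundaries could be computed for each. It is an illustration under stated assumptions, not GeNIe's algorithm: the function names are hypothetical, boundary values are assigned to the upper bin, and the hierarchical method is not covered.

```python
from bisect import bisect_right


def uniform_width_edges(values, bins):
    """Interior boundaries that split [min, max] into equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + k * width for k in range(1, bins)]


def uniform_count_edges(values, bins):
    """Interior boundaries chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // bins] for k in range(1, bins)]


def assign_bins(values, edges):
    """Map each value to a bin index; a value equal to a boundary goes to the upper bin."""
    return [bisect_right(edges, v) for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = assign_bins(values, uniform_width_edges(values, 3))
count_bins = assign_bins(values, uniform_count_edges(values, 3))
print("uniform widths:", width_bins)
print("uniform counts:", count_bins)
```

With a single outlier (100), uniform widths puts almost every value in the first bin, while uniform counts spreads the values evenly. This is the kind of imbalance that dragging the bin boundaries by hand is meant to correct.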
Merging States
To explain this functionality let's load another data file: merge_states.txt. It's a very simple file with only two variables X and Y and a couple of states for each of them. To merge states of the variable X select that column and select Data -> Merge States... from the main menu. You will see the following dialog window:
Select the states that you want to merge and provide a name for the resulting state. The dialog shows the number of occurrences of each state of the selected variable. If you select all the states before merging, GeNIe will ask you to confirm.
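The effect of merging states on a column can be sketched as follows. The state names used here are invented for the example (they are not taken from merge_states.txt), and merge_states is a hypothetical helper rather than part of GeNIe.

```python
from collections import Counter


def merge_states(column, states_to_merge, new_name):
    """Replace every occurrence of the chosen states with a single new state."""
    merged = set(states_to_merge)
    return [new_name if value in merged else value for value in column]


# Made-up states for a variable X.
x = ["low", "medium", "high", "low", "very_high", "high"]
print("before:", Counter(x))   # occurrences per state, as shown in the dialog

x_merged = merge_states(x, ["high", "very_high"], "high_or_more")
print("after: ", Counter(x_merged))
```

The count of the merged state equals the sum of the counts of the states it replaced, which is a quick way to check that no rows were lost.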
Statistics
GeNIe can display some useful statistics for each column in the data: mean, variance, standard deviation, min, max and count (the number of values in the column). You can also display the correlation matrix. To access them, select the Data -> Statistics option from the main menu. The interface for those statistics looks like this:
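As a cross-check outside GeNIe, the same per-column statistics and a correlation matrix can be computed with a short script. This sketch assumes the sample variance (dividing by n - 1) and the Pearson correlation coefficient; the tutorial does not state which definitions GeNIe uses, and the column names and values are made up.

```python
from math import sqrt
from statistics import mean, stdev, variance


def column_stats(values):
    """Summary statistics for one numerical column (sample variance assumed)."""
    return {
        "mean": mean(values),
        "variance": variance(values),
        "std_dev": stdev(values),
        "min": min(values),
        "max": max(values),
        "count": len(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Nested dict giving the correlation for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


data = {"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]}
stats = column_stats(data["x"])
matrix = correlation_matrix(data)
print(stats)
print(matrix)
```

Note that pearson divides by zero for a constant column, so such columns would need to be excluded or handled separately.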
Histogram
To see the distribution of the values in a column as a histogram, select Data -> Histogram from the main menu. It will look like this:
Double-click a bar to select the corresponding data rows in the data grid.
Piechart
To see the distribution of the values in a column as a pie chart, select Data -> Piechart from the main menu. It will look like this:
Scatterplot
To see the joint distribution of values from two columns, select two numerical columns (by clicking their headers while holding down the CTRL key) and select Data -> Scatterplot from the main menu. It will look like this:
Double-click a point to select the corresponding data row in the data grid.
Time Series
To see how the values of a single column change across the rows of the data file, select Data -> Time Series from the main menu. It will look like this:
Double-click a point to locate the corresponding row in the data grid.
To normalize the data, select the Normalize values option.
To view the data points on the series, select the Show Markers option.