Lung Cancer

Creating Decision Tree Models Using Visual Studio

Data Preprocessing in Excel

Abstract

This project attempted to look at different factors and its correlation on lung cancer by utilizing SQL Server Data Tools in Visual Studio. The “Lung Cancer” dataset was found on Kaggle and can be downloaded here: https://www.kaggle.com/datasets/nancyalaswad90/lung-cancer

The dataset contains 16 attributes and 309 instances. The 16 attributes are Gender, Age, Smoking, Yellow Fingers, Anxiety, Peer Pressure, Chronic Disease, Fatigue, Allergy, Wheezing, Alcohol Consuming, Coughing, Shortness of Breath, Swallowing Difficulty, Chest Pain and Lung Cancer. Every attribute is binary except for age.

In the Analysis Services package in Visual Studios, there are 9 different types of mining techniques you can use to create models. They are Association Rules, Clustering, Decision Trees, Linear Regression, Logistic Regression, Naïve Bayes, Neural Network, Sequence Clustering and Time Series.

For this dataset, I needed a model that can classify lung cancer into an output of yes or no. I also wanted to look at how the input attributes affect the classification so for these reasons I decided to use decision trees as it is easy to understand and interpret the results. The decision trees provided a visual mapping of factors as they relate to lung cancer. However, due to limited data, the models I created were not complex enough to detect lung cancer. It is also important to note that the dataset is biased with 87% having lung cancer.


Data Preparation

Since the dataset is fairly small, using Excel would be sufficient to examine and preprocess the data.

First, I checked for null values across multiple cells by using the COUNTBLANK function. Running the function over the entire dataset showed that there are no null values.

Next, to create a decision tree mining model in Visual Studio, I needed to have one of the columns as the Key attribute which the raw dataset did not have. To create one, I simply added another column called “ID” and assigned a value from 1 to 309 for each instance.

To exam the data further, I checked how many records are indicated as having lung cancer in the dataset by using the COUNTIF function.

Roughly, 87% of the dataset have lung cancer which may imply the dataset was likely taken with people who already had indications of lung cancer.

To look at the age of people in our dataset, I plotted a histogram as shown below.

The chart shows that most of the people in the dataset were in their 50s and 60s. The average age was calculated to be 62 years old.

There were two outliers: one at age 21 and another at age 87. However, decision trees are not sensitive to outliers so I did not need to remove them.

To see the binary splits of the other attributes, I first summed up the binary split totals using the COUNTIF function. Then, I created a double bar plot using the split counts.

The number of people with fatigue and shortness of breath is approximately double then without it. However, the rest of the attributes have a roughly even split.

I created a correlation plot in Excel using Data Analysis tools but first I needed to convert the gender and lung cancer columns to numerical types which can be done using the IF or IFS functions.

The correlation plot shows that the attributes allergy, alcohol consumption and swallowing difficulty correlate to lung cancer the most.

Fortunately, decision trees do not require as much preprocessing as many other machine learning algorithms. I did not have to do any further preprocessing for this dataset such as converting data types, normalizing or using one-hot encoding.

Importing Data

To use the dataset in Visual Studio Analysis Services, you need to first download the dataset. Then, add it to Microsoft SQL Server Management Studio. Finally, create a Data Source and a Data Source View to import the data to Visual Studio.

The data is stored in a CSV file. To upload it into Microsoft SQL Server Management Studio (SSMS), use an existing database or create a new one. Then right-click on the database and go to “Task” and then “Import Data”. When selecting a data source, choose “Flat File Source” and choose the location where you saved the CSV file. Then continue on through the Wizard screen to import the data into the database.

Start Visual Studio and create a new project using the Analysis Services Multidimensional and Data Mining Project template. Connect the project to SSMS by using the Data Source Wizard to connect it to the database where the lung cancer data is stored. Then add a Data Source View and select the lung cancer data.

Creating a Mining Model

In the Solution Explorer window, right-click on Mining Structures and select “New Mining Structure”. Using an existing relational database, select “Microsoft Decision Trees” as the mining technique. Select the Data Source View you just created. Make sure you check the box to indicate the Lung Cancer table as a “Case”. On the Specify the Training Data screen, check the ID column as the Key and the LUNG_CANCER column as the Predict. If you click on the “Suggest” option, a window will display showing you correlations for each attribute to lung cancer. The suggestion only selects the top 6 columns with the highest scores. For now, select all the attributes as inputs as we can choose to exclude specific ones later.

On the Specify Columns’ Content and Data Type screen, change AGE to “Continuous” and its data type to “Double”. The rest of the attributes are acceptable with its default content type of “Discrete” and data type of “Text”.

Finally, reserve 20% of the data for testing purposes. The input data will be randomly split into a training set and a testing set.

Adjusting Parameters

There are multiple Algorithm Parameters that you can adjust to build the model. In particular, I looked at adjusting the complexity penalty and the minimum support. The complexity penalty hinders the growth of the tree. The parameter range is from 0 to 1. If you want to inhibit the growth or make the model less complex, you can increase the value closer to 1. This can help with overfitting. The minimum support parameter is the minimum number of cases that a leaf node must have. Increasing this number decreases the likelihood of a split and can also help with overfitting (Duncan, Coulter, Sherer, & Hu, 2022). I created three different decision tree models while adjusting these parameters.

Model 1: Default Parameters

The default complexity penalty value is based on the number of attributes used in the model. Since I used 15 attributes, the default value is 0.9. This value is on the higher end since the range is from 0 to 1.

The default minimum support value is 10. This is also relatively high since we only have 248 cases in the training set.

In the Mining Model Viewer, decision tree displays as follows:

The blue and pink bar show how the cases are split, blue for YES and pink for NO in regards to lung cancer. Since the default complexity penalty and minimum support value were so high, the model produced a fairly simple decision tree with limited branches and nodes. This is actually a very poor model because in each of the nodes, YES outweighs NO so the model will always return a YES. This is displayed in the results below for the test case data.

Model 2: Complex Decision Tree

To see how complex a tree can get, I set the complexity penalty to 0.01 and the minimum support to 1.

Now the decision tree is much more complex. It is 7 layers deep and has numerous nodes. However, this is an overfitted model. The following table shows the model’s results on the test case data.

Model 3: Feature Reduction

This time I cut the features in half and use only the top 7 features that the software suggests which are age, allergy, alcohol consumption, swallowing difficulty, wheezing, coughing and chest pain. Then, I set the complexity penalty to 0.2 and the minimum support to 5.

The model ended up excluding wheezing and chest pain. In this decision tree, the model is 5 layers deep. The test case results are in the following table.

Conclusion

The decision tree models results were poor. In reality, none of these models would be acceptable for use in lung cancer detection. Ideally, you would want to error on the side of not missing any people with lung cancer. In other words, false negatives are worse than false positives. However, to improve the model, better data is needed. First, more records are needed. Having only 309 instances is a small number for a prediction like this, and that number is before splitting data into the test set. Next, data quality should be improved. Having all the fields as binary makes modeling easier but limits what it can do. For example, answering yes/no to smoking or chronic disease hardly tells the whole story. Knowing how many years and how frequently someone smokes can help improve the model. In the same manner, knowing the exact chronic disease(s) can make the model more robust. Finally, even with 15 input attributes, that is hardly enough when it comes to cancer detection. There are many factors that can affect whether someone will get cancer. It can be genetics, diets or environmental factors such as pollution. Improving the dataset is needed to achieve better results.


References

Duncan, O., Coulter, D., Sherer, T., & Hu, J. (2022, June 13). Microsoft Decision Trees Algorithm Technical Reference. Retrieved June 18, 2022, from Microsoft: https://docs.microsoft.com/en-us/analysis-services/data-mining/microsoft-decision-trees-algorithm-technical-reference?view=asallproducts-allversions