I’m not a data scientist. And while I know my way around a Jupyter notebook and have written a fair amount of Python code, I don’t profess to be a machine learning expert. So when I finished the first phase of our no-code/low-code machine learning experiment and got better than 90 percent accuracy on a model, I suspected I had done something wrong.
If you haven’t been following along, here’s a quick overview before I direct you back to the first two articles in this series. To see how far machine learning tools for the rest of us had come – and to redeem myself for the unwinnable machine learning task I was assigned last year – I took a well-worn heart attack dataset from the University of California-Irvine archive and tried to outperform data science students by using the “easy button” of Amazon Web Services’ low-code and no-code tools.
The whole point of this experiment was to see:
- Could a relative novice use these tools effectively and accurately?
- Were the tools more cost-effective than finding someone who knows what the hell they’re doing and handing the problem over to them?
This isn’t exactly a true representation of how machine learning projects usually happen. And as I found, the “no-code” option offered by Amazon Web Services – SageMaker Canvas – is actually intended to work hand in hand with the more data-science-oriented approach of SageMaker Studio. But Canvas outperformed what I was able to do with Studio’s low-code approach – though probably because of my less-than-skilled data-handling hands.
(For those of you who haven’t read the previous two articles, now’s the time to catch up: here’s Part One and here’s Part Two.)
Evaluating the robot’s work
Canvas allowed me to export a shareable link that opened the model I had created with my full build from the 590-plus rows of patient data from the Cleveland Clinic and the Hungarian Institute of Cardiology. That link gave me a little more insight into what went on inside Canvas’ very black box with Studio, a Jupyter-based platform for doing data science and machine learning experiments.
As its name slyly suggests, Jupyter is based on Python. It is a web-based interface to a container environment that allows you to spin up kernels based on different Python implementations, depending on the task.
Kernels can be populated with whatever modules the project requires when you’re doing code-focused exploration, such as the Python Data Analysis Library (pandas) and SciKit-Learn (sklearn). I used a local version of Jupyter Lab to do most of my preliminary data analysis in order to save on AWS compute time.
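That preliminary pass is mostly pandas work. Here’s a minimal sketch of the kind of checks involved; the column names (“thalachh,” “thall,” “caa,” “output”) follow the UCI heart attack dataset’s conventions, but the rows below are made-up values for illustration, not the actual patient data.

```python
import pandas as pd

# Hypothetical subset of the UCI heart-disease columns; values are
# invented for illustration, not taken from the real dataset.
df = pd.DataFrame({
    "age": [63, 54, 67, 52],
    "thalachh": [150, 140, 108, 172],   # maximum heart rate achieved
    "thall": [1, 2, 2, 3],              # thalassemia test result
    "caa": [0, 1, 3, 0],                # vessels seen on angiogram
    "output": [1, 0, 0, 1],             # target: heart attack risk
})

# Typical pre-modeling checks:
print(df.describe())                 # summary statistics per column
print(df.isna().sum())               # missing values per column
print(df["output"].value_counts())   # class balance of the target
```

Running `describe()` and counting missing values is how gaps like the ones in the “thall” and “caa” columns show up before any model is trained.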
The Studio environment created by the Canvas link included prebuilt content providing insight into the model Canvas produced, some of which I discussed briefly in a previous article:
Some of the details included the hyperparameters used by the best-tuned version of the model created by Canvas:
Hyperparameters are tweaks that AutoML made to the algorithm’s calculations to improve accuracy, along with some basic housekeeping – the SageMaker instance parameters, the tuning metric (“F1”), and other inputs. This is all pretty standard stuff for a binary classification like ours.
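For reference, F1 is the harmonic mean of precision and recall, which is why it’s a common tuning target for binary classifiers on imbalanced medical data. A quick sketch of the arithmetic, using hypothetical confusion-matrix counts (not numbers from my model):

```python
# Hypothetical confusion-matrix counts for a binary classifier:
tp, fp, fn = 80, 10, 15   # true positives, false positives, false negatives

precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were caught

# F1 is the harmonic mean of precision and recall:
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))          # → 0.865
```

Unlike plain accuracy, F1 drops sharply if the model wins by simply predicting the majority class, which matters when “had a heart attack” is the rarer outcome.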
The model review in Studio included some basic information about the model Canvas created, including the algorithm used (XGBoost) and the relative importance of each of the columns evaluated, expressed with something called SHAP values. SHAP is a horribly contrived acronym that stands for “SHapley Additive exPlanations,” a game theory-based method of extracting each data feature’s contribution to changes in the model’s output. It turns out that “maximum heart rate achieved” had little effect on the model, while the “thall” and angiogram (“caa”) scores – the data points with missing values – carried more weight than I wanted them to. Apparently I couldn’t just drop them, either. So I downloaded a full performance report for the model to get more detailed information on how it held up:
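To make the game-theory idea behind SHAP concrete, here’s a toy, from-scratch sketch of a Shapley value calculation. The “model” is just a made-up lookup table of predictions for each coalition of known features – real SHAP tooling (such as the `shap` package’s tree explainers) approximates this over a trained model – and the prediction values are invented so that “thall” and “caa” dominate, mirroring what the Canvas report showed:

```python
from itertools import combinations
from math import factorial

# Each feature's Shapley value is its average marginal contribution to
# the prediction, weighted over all possible feature coalitions.
features = ["thall", "caa", "thalachh"]

# Hypothetical model outputs given which features are "known":
pred = {
    frozenset(): 0.50,
    frozenset({"thall"}): 0.70,
    frozenset({"caa"}): 0.65,
    frozenset({"thalachh"}): 0.52,
    frozenset({"thall", "caa"}): 0.85,
    frozenset({"thall", "thalachh"}): 0.72,
    frozenset({"caa", "thalachh"}): 0.67,
    frozenset({"thall", "caa", "thalachh"}): 0.88,
}

def shapley(feature):
    """Exact Shapley value of `feature` under the toy model above."""
    n = len(features)
    others = [f for f in features if f != feature]
    total = 0.0
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            s = frozenset(subset)
            # Shapley weight for a coalition of size |s|:
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (pred[s | {feature}] - pred[s])
    return total

for f in features:
    print(f, round(shapley(f), 3))
```

One nice property this demonstrates: the per-feature values sum exactly to the difference between the full-model prediction and the no-information baseline (0.88 − 0.50 here), which is what makes the explanations “additive.”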