<p>Wenlei's Tech Blog - Wenlei Cao</p>

<h1>Shap Value for Single Record in Model Prediction</h1>
<p>2023-12-24 | <a href="https://wenleicao.github.io/Shap_Value_for_Single_Record">https://wenleicao.github.io/Shap_Value_for_Single_Record</a></p>

<p>Feature importance helps us understand which features play more important roles in a given model. With that knowledge, you can do feature selection, or you can cross-check the results against your business domain knowledge to validate the model.</p>
<p>In the past, most analyses have focused on feature importance at the model level. At the record level, however, it is difficult to describe what causes the model to make a particular prediction for a single record. One can make a guess, but a precise contribution from each feature is still lacking. In the business world, though, stakeholders are eager to know which features drive a prediction. Say our model predicts that a customer will cancel his insurance policy. If the model also tells us the actual cause, the customer service team can handle that customer accordingly. Overall, explainability matters a great deal in the business world.</p>

<p>The Python shap package can help us achieve this goal. I read quite a few blogs and found that most of them show the various fancy charts the package can produce, but few focus on single-instance feature importance.</p>

<p>Let us use the iris dataset to unveil the mystery of single-record feature importance and see if we can write a function that explains the prediction for a record, so that our life is easier next time.</p>

<p>First, let us set up an experiment. We start by importing packages and the iris data and creating a simple random forest classifier. Then we take a look at how SHAP explains the results.</p>
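<p>For readers following along without the screenshots, here is a minimal sketch of that setup. It assumes a TreeExplainer on the fitted random forest; variable names are illustrative rather than copied from the notebook.</p>

<pre><code class="language-python">
import shap
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# load iris into a DataFrame so SHAP can show feature names
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# TreeExplainer works for tree ensembles such as random forests
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)   # a list with one array per class

# overall feature importance across all classes
shap.summary_plot(shap_values, X_test)
</code></pre>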
<p><img src="/images/blog58/1summaryplot.jpg" /></p>
<p>We instantiate an explainer and use it to get the SHAP values. The summary_plot tells us that, among all 4 features, petal length and petal width play more important roles than sepal length and sepal width.</p>

<p>Let us take a look at what shap_values is. You can see it is a list with 3 elements.</p>
<p><img src="/images/blog58/1.5shap_value.png" /></p>
<p>If we dive into shap_values to understand its structure, we can see it is actually a nested array.</p>
<p><img src="/images/blog58/2nest_array.JPG" /></p>
<p><img src="/images/blog58/3shap_value.png" /></p>
<p>Each array has the same dimensions as our input dataset, indicating that SHAP produces one set of values for each class (3 for the iris data). <br />
Before we show the feature importance chart for one record, let us see what the data looks like for a given record.</p>
<p><img src="/images/blog58/4onesamplevalue.JPG" /></p>
<p>We know this record is classified as 0 by the model (see cell 18). We also print out its SHAP values for each class and the corresponding column names. You can see the first list gives larger numbers, except that sepal length (cm) is negative. Maybe that is why the prediction is 0 (the first element)?</p>

<p>The force chart for this record (below) confirms our assumption. The only difference is that SHAP puts negative values on the right. You can see that the number of positive values corresponds to the number of red segments, and the number of negative values corresponds to the number of blue segments. The final prediction is determined by which class produces the largest number. In this case, class 0 produces the value 0.95; therefore, the record is classified as 0.</p>
<p><img src="/images/blog58/5single_feature_importance_chart.JPG" /></p>
<p>Now that we understand the data structure of shap_values and how the decision is made, we can see how to extract the most important features for a given record.</p>
<p><img src="/images/blog58/6_get_feature_name.JPG" /></p>
<p>This is our target: we want to extract the top 2 most important features of record 12 for class 0.</p>

<p>Cell 19 gives us the SHAP values of record 12 for class 0.</p>

<p>In cell 20, we convert them to a pandas Series with the column names as its index, sort the values, and take the first 2 feature names, as shown here.</p>
<p>It will be cumbersome to do this exercise every time we have a new model. A function will be helpful.</p>
<p><img src="/images/blog58/7.create_function.JPG" /></p>
<p>Here I made a slight modification to the previous cell: instead of returning only the feature names, I also include the SHAP value for that class, so that people can see the feature weights. Showing the force chart is optional if the user wants to see it.</p>
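<p>A rough sketch of such a helper is below. It builds on the setup sketch above; the function name and signature are my own, not the ones in the notebook.</p>

<pre><code class="language-python">
def explain_record(shap_values, X, record_idx, class_idx, top_n=2,
                   show_force=False, explainer=None):
    """Return the top_n features (name and SHAP value) pushing one record toward one class."""
    row = pd.Series(shap_values[class_idx][record_idx], index=X.columns)
    top = row.sort_values(ascending=False).head(top_n)

    if show_force and explainer is not None:
        # optional force chart for the same record and class
        shap.force_plot(explainer.expected_value[class_idx],
                        shap_values[class_idx][record_idx],
                        X.iloc[record_idx],
                        matplotlib=True)
    return top

# top 2 features pushing record 12 toward class 0
# explain_record(shap_values, X_test, record_idx=12, class_idx=0)
</code></pre>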
<p>Let us do a test run. If we use the same record as before, we can see that it gives us the same result.</p>
<p><img src="/images/blog58/8verfication1.JPG" /></p>
<p>Let us see if we can use it to check other records. We list the first 20 records and pick the third one in the sequence (highlighted; notice I am not using the index of y_test, since it is shuffled in the train test split). This record is classified as 2.</p>
<p><img src="/images/blog58/9.another_rec.JPG" /></p>
<p>Let us first get all force charts for 3 categories.</p>
<p><img src="/images/blog58/10another_example.JPG" /></p>
<p>We pass in the parameters for this record: index = 2 and category = 2.</p>
<p><img src="/images/blog58/11result.JPG" /></p>
<p>We can see it indeed shows the same result as expected.</p>

<p>Thanks for following along. I hope this helps you understand shap_values a bit better.</p>

<p>Attached is the <a href="/Files/shap_explanibility2.ipynb">notebook</a> used in this post.</p>
<p>Wenlei</p>

<h1>Log errors in python with examples</h1>
<p>2023-09-30 | <a href="https://wenleicao.github.io/log_application_error_in_Python">https://wenleicao.github.io/log_application_error_in_Python</a></p>

<p>Most people use the print function to show variable values in Python for debugging. It works fine while you are in dev mode, but as time goes by, your code grows. It is cumbersome to have print statements everywhere in your code and to clean them up afterwards. Python has a built-in logging library, which can be leveraged to make our life easier.</p>

<p>One reason people do not use logging much is that it takes some effort to set up. Especially when handlers and formats are involved, people can get confused quickly. Normally, you get the log handler and format ready, then add them to the logger, so it is a multi-step process. Since Python 3.7, this has become easier because you can set up everything in the basicConfig function. It is still hard for me to remember every detail since I do not use it on a daily basis, so I am putting it down here to serve as a script template whose parameters can be changed later.</p>

<p>First, import the packages I will use in the example: logging for logging errors, datetime for timestamps, and os for file path manipulation.
I create the log file name dynamically so that later logs will not overwrite the previous ones.</p>
<p><img src="/images/blog57/1settingup.JPG" /></p>
<p>Here is the logger setting. Essentially, you put all the settings into basicConfig.</p>
<p><img src="/images/blog57/1.5logger.JPG" /></p>
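<p>A minimal sketch along those lines is shown below; the file name and format string are illustrative, not the exact settings in the screenshot.</p>

<pre><code class="language-python">
import logging
from datetime import datetime

# build the log file name from the current time so each run writes its own file
log_file = f"app_{datetime.now():%m-%d-%Y_%H%M}.log"

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler(log_file),   # write messages to the file
        logging.StreamHandler(),         # and echo them to the console/notebook output
    ],
)

logger = logging.getLogger(__name__)
logger.info("logger is ready")
</code></pre>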
<p>In this particular case, I set the log format plus a file handler and a stream handler. When I run it, I can see the message in both the notebook output and the file, as follows.</p>
<p><img src="/images/blog57/2fileoutput.JPG" /></p>
<p>I am running this in a Jupyter notebook, but if your code calls multiple Python modules, you might want to include logging in each module with the module name in the format. This post will help solve some issues you might encounter.</p>
<p><a href="https://stackoverflow.com/questions/50714316/how-to-use-logging-getlogger-name-in-multiple-modules">https://stackoverflow.com/questions/50714316/how-to-use-logging-getlogger-name-in-multiple-modules</a></p>
<p>The purpose of logging is to trace back errors should they happen, so it is important to be able to include the traceback info in the message.</p>

<p>Let us simulate a call stack to show a divide-by-zero error. I create an f1 function, then use this function in a try/except block and intentionally use 0 as the denominator, which throws an error that is captured in the except block. We use the logger.error function to show the error. This does show the divide-by-zero error, but it does not tell us where the problem is.</p>
<p><img src="/images/blog57/3showerror.JPG" /></p>
<p>If we add the additional exc_info=True param, it will include the traceback.</p>
<p><img src="/images/blog57/2.5showerror.JPG" /></p>
<p>The same thing can be achieved using logger.exception(e), which you can think of as the shorthand version. But notice the difference in the source file format.</p>
<p><img src="/images/blog57/4showerror.JPG" /></p>
<p>These did not specify the calling file, so if you are familiar with where the function is located, you are all set. But if you run this across different modules, you might also want to include stack_info=True, like the following.</p>
<p><img src="/images/blog57/5showerror.JPG" /></p>
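<p>Putting the variants side by side, a sketch of the pattern looks like this (the f1 function is just a stand-in that divides by its argument):</p>

<pre><code class="language-python">
def f1(x):
    return 10 / x          # raises ZeroDivisionError when x is 0

try:
    f1(0)
except Exception as e:
    logger.error(e)                                  # message only: what happened, not where
    logger.error(e, exc_info=True)                   # message plus the traceback
    logger.exception(e)                              # shorthand for the line above
    logger.error(e, exc_info=True, stack_info=True)  # traceback plus the full call stack
</code></pre>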
<p>I hope this post gives you a quick start with the logging library. It is a powerful tool.</p>

<p>The notebook can be found <a href="/Files/test_logging.ipynb">here</a>.</p>

<p>Thanks,</p>
<p>Wenlei</p>

<h1>Retrieve Data from Teradata stored procedure into DataFrame in Python</h1>
<p>2023-08-27 | <a href="https://wenleicao.github.io/Retrieve_Data_from_Teradata_stored_procedure_into_DataFrame_in_Python">https://wenleicao.github.io/Retrieve_Data_from_Teradata_stored_procedure_into_DataFrame_in_Python</a></p>

<p>Teradata has a smaller user base than databases like Oracle and SQL Server, which means there are fewer resources to be found online. In this post, I will share my research on how to use a stored procedure to output data and retrieve it into a dataframe on the Python end.</p>

<p>To protect privacy, I black out the datalab portion of the table and stored procedure names; you can add your own if you want to repeat this.
First, I create a stored procedure that outputs a dataset based on a parameter value.</p>
<p><img src="/images/blog56/stored_proc1.PNG" /></p>
<p>In this stored procedure, I pass in a major parameter in row 1. This parameter is used at row 7 to filter a student table. If I use major = All, it gives me all rows; if I use a particular major, it gives me the students for that major. Notice that in Teradata you need to create a cursor for this purpose and open the cursor to be able to retrieve the rows.</p>
<p>Let us test the stored proc.</p>
<p><img src="/images/blog56/verify1.PNG" /></p>
<p>We run row 17; this gives all 4 records from the student table (using filter value = All).</p>
<p><img src="/images/blog56/verify2.PNG" /></p>
<p>When we use major = Business, we only get two records, so the stored procedure works properly. <br />
Now let us see how we get the records into a dataframe.</p>

<p>The output of the stored procedure is a little messy, so we first create a function to format it, which lets us see how the data is structured.</p>
<p><img src="/images/blog56/3.5function.JPG" /></p>
<p>In a Jupyter notebook, let us make a connection to Teradata using the teradatasql library.</p>
<p><a href="https://pypi.org/project/teradatasql/">teradatasql</a></p>
<p><img src="/images/blog56/4jupyter_show_stored_proc_result.JPG" /></p>
<p>I can use it to get the business students' info by passing filter major = Business. Please note the format: parameters for the stored procedure need to be in a list or tuple. Also notice that the output has more than one result set; the actual data comes from the 2nd one.</p>

<p>To get the data into a dataframe, the following code gets the rows and column names from the stored procedure output (rows 6 and 7) and then casts the data into a dataframe. Before that, row 4 gets all result sets, and row 5 filters them to keep only the ones that contain values.</p>
<p><img src="/images/blog56/5jupyter_show_stored_proc_result.JPG" /></p>
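<p>A sketch of the whole round trip is below. The host, credentials, and stored procedure name are placeholders, and I am using the DB-API nextset() call to walk through the multiple result sets, which is how I would expect the driver to expose them; adjust to your own setup.</p>

<pre><code class="language-python">
import pandas as pd
import teradatasql

with teradatasql.connect(host="mydbhost", user="myuser", password="mypassword") as con:
    with con.cursor() as cur:
        # stored procedure parameters must be passed as a list or tuple
        cur.execute("CALL mydb.get_students_by_major(?)", ["Business"])

        # the call can return more than one result set; keep the ones with rows
        frames = []
        while True:
            rows = cur.fetchall()
            if rows:
                cols = [d[0] for d in cur.description]
                frames.append(pd.DataFrame(rows, columns=cols))
            if not cur.nextset():
                break

df = frames[-1]   # the student records come from the data-bearing result set
print(df)
</code></pre>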
<p>The other issue I came across is that Teradata has two session modes, ANSI and Teradata. If Python complains about a session mode problem, try changing the setting to default, which helped me solve the issue (mine was using ANSI).</p>

<p>Hope this helps.<br />
Thanks,</p>
<p>Wenlei</p>

<h1>Use Docker to Operationalize the Data Science Prediction</h1>
<p>2023-06-19 | <a href="https://wenleicao.github.io/Use_Docker_to_Operationalize_the_Data_Science_Prediction">https://wenleicao.github.io/Use_Docker_to_Operationalize_the_Data_Science_Prediction</a></p>

<p>There are many challenges in machine learning work. In terms of the most time-consuming parts, one is data preparation and the other is operationalizing the model. If you have the right data, building the model is actually the relatively easy part of the whole process.
Let us talk about operationalizing the model. I have used batch files (<a href="https://wenleicao.github.io/Poor_man_automation">link</a>) to handle R and Python processes and used the task scheduler to schedule a job on a virtual machine in the cloud. But that is probably not the best way to go, for the following reasons:</p>
<ul>
<li>It is hard to switch to a different virtual machine. In other words, the portability is poor.</li>
<li>You are limited to a Windows machine, whereas the majority of servers in the cloud use a Linux-based operating system.</li>
</ul>
<p>Here I want to explore the possibility of using Docker to handle data science operations. It is a modern technology with great potential to solve the issues listed above.</p>

<p>Docker is a container technology that packages all the artifacts for your application in one place so they can be moved to the host machine (a cloud server). Your application will run without you worrying about the operating system, hardware, and so on. This fits in with all cloud service
vendors such as AWS, Azure, and GCP. It uses the same concept as containers in the shipping industry.</p>

<p>The majority of Docker materials and tutorials online use Nginx or Flask, because the end product is a web app and you can easily see the result by browsing the web page. In my case, I would like to use Docker for data science purposes. Unfortunately, I did not find a lot of resources online, but the following sites helped me a lot.</p>
<ol>
<li>The Docker tutorial site is good for getting started, but it is not specific to data science.</li>
</ol>
<p><a href="https://docs.docker.com/get-started/">https://docs.docker.com/get-started/</a></p>
<ol>
<li>The following site helped me use Docker for data science, but the content is a bit dated (the code no longer works) and the methods are oversimplified for a real-life project.</li>
</ol>
<p><a href="https://mlinproduction.com/docker-for-ml-part-1/">https://mlinproduction.com/docker-for-ml-part-1/</a></p>
<p>I want to achieve the following goals:</p>
<ul>
<li>Being able to build customized docker images for data science purposes.</li>
<li>Have the training and inference processes get data from a database (this is common in real life).</li>
<li>Try Docker Compose when multiple containers work together (to simplify the configuration step).</li>
</ul>
<p>To run Docker in a Windows environment, you will need the following setup.</p>
<ol>
<li>Install Docker Desktop, and remember to also check the option to install the Windows Subsystem for Linux (WSL2). This essentially allows you to have a Linux dev environment in Windows.</li>
<li>Install the Ubuntu app from the Windows store so you can use a Linux shell to interact with Docker Desktop. It is the common language of the container world.</li>
<li>In Docker Desktop, configure Ubuntu to integrate with Docker (WSL Integration, see below), so you can use the Ubuntu shell to interact with Docker.</li>
<li>Install VS Code and add the WSL extension, so I can use VS Code to view Linux subsystem files and create the dockerfile or compose YAML file.</li>
</ol>
<p><img src="/images/blog55/intergate_shell_with_docker.JPG" /></p>
<p>Let us take baby steps.</p>
<p>As a first step, I want to replicate what Luigi did, but minor modifications are needed to make it work.</p>

<p>I will still use Luigi Patruno's Boston housing price data for this exploration. But since he published the blog, the dataset source has changed and the original code no longer works, so I will need to make changes to it.</p>

<p>In addition, Luigi used a Jupyter notebook image as his base image, which is bulky (over 1 GB). For some reason, I could not launch a Jupyter notebook by connecting to the container from the host, even though the container ran without issue and I entered the token as suggested. Therefore, I switched to a Python image instead: the python 3.8 image from Docker Hub (about 500 MB). It might be fine to build from something smaller such as Alpine, but I leave that for the future.</p>
<p>The folder contains the following files.</p>
<p><img src="/images/blog55/first_folder_structure.JPG" /></p>
<p>This is the content of the dockerfile, which contains the instructions to build an image as a blueprint for a container.<br />
dockerfile<br />
<img src="/images/blog55/first_dockerfile.JPG" /> <br />
Row 1: get the python:3.8 image from Docker Hub<br />
Row 4: create a folder under home, forming the home/wcao/model structure<br />
Row 5-7: create environment variables to be used in other files; these are paths and file names<br />
Row 10: install packages that are not included in Python's core library<br />
Row 11: copy all files from the current folder into the image<br />
Row 13: run python3 train.py while building the image</p>
<p>Train.py<br />
<img src="/images/blog55/first_train.JPG" /> <br />
Row 1-11: import the necessary packages.<br />
Row 17-21: use environment variables to form the model and metadata paths used later.<br />
Row 27-36: get the training data. Because each observation is wrapped onto a 2nd row in the source file, there is some data manipulation between rows 31-33 using np.hstack (see the data-loading sketch after this description); the data is then divided into train and test.<br />
<img src="/images/blog55/first_train2.JPG" /><br />
Row 40-43: set up a regressor.<br />
Row 45-50: fit the regressor with the training data and compute the train and test MSE. Save these MSE values into the metadata.<br />
Row 55-60: dump the model to a file and also save the metadata.</p>
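<p>For reference, the relocated Boston housing data is usually stitched back together like this; treat the URL and slicing as an assumption about what the training script does, since the screenshot is what is authoritative here.</p>

<pre><code class="language-python">
import numpy as np
import pandas as pd

# each observation is spread across two physical rows in the raw file,
# so the records have to be re-assembled with np.hstack
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])   # 13 feature columns
y = raw_df.values[1::2, 2]                                        # median home value
</code></pre>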
<p>Inference.py<br />
The data steps are similar, so I am going to skip them.<br />
<img src="/images/blog55/first_inference.JPG" /> <br />
Row 41-44: load the trained model from the file generated during training. <br />
Row 46-47: use the model to predict and print out the predictions.</p>
<p>With the files in the folder, let us see how we use Docker to run it.<br />
In the Ubuntu shell, navigate to the folder containing the code.<br />
<img src="/images/blog55/rebuilt.JPG" /></p>

<p>Here I use the build command, giving python-ds as the image name. Don't forget the dot, which indicates building from the current folder, i.e., using the dockerfile in the current folder.</p>

<p>Once it is built, I can use the docker run command to instantiate a container based on the image.</p>
<p><img src="/images/blog55/test_rebuit_using_training2.JPG" /></p>
<p>In the docker run command, we use the cat command to show the content of metadata.json, which is generated during the train.py run. It prints out the two MSE values as below, which indicates the training worked as expected.</p>

<p>Next, we will see if inference is working.</p>
<p><img src="/images/blog55/inference_working_with_new_image.JPG" /></p>
<p>We can run inference.py and get all the predictions (10% of the original data), so it works fine.</p>

<p>The first step worked as expected. <br />
Let us add some challenges to our process. In real life, we often need to query databases to get new data for training and inference.</p>

<p>Let us get data from a database; here I use MySQL to simulate where our source data comes from. In dev, it is acceptable to have a database container. In prod, however, it is recommended to use a managed cloud database, due to the ephemeral nature of containers. But it is easy to switch the connection, and the process is quite similar.</p>
<p>The folder structure of the files.</p>
<p><img src="/images/blog55/second_structure.JPG" /></p>
<p>Since we have a container for data science and another container for MySQL, using docker-compose.yml is a good way to simplify the configuration process. It will create both containers within the same network.</p>

<p>Let us take a look at what is inside the docker-compose.yml.</p>
<p><img src="/images/blog55/second_docker_compose.JPG" /></p>
<p>I have two services in this compose file.<br />
Row 3-6: I create a data science service named ds_container. I build an image from the current folder (I will need to run commands from the current folder) and give the container the name data_science_container.<br />
Row 8-14: create the mysql_db container from the mysql image on Docker Hub. We map a host volume to the MySQL volume (the volume persists data even if the container is destroyed) and set some environment variables, such as the database to use and the password.</p>
<p><img src="/images/blog55/second_dockerfile.JPG" /></p>
<p>I add a couple more commands:<br />
Row 11: in order to deal with the database, I add the sqlalchemy and mysql-connector-python packages.<br />
Row 16: this keeps the container standing by instead of exiting, since I will run other commands against it.</p>
<p>Train.py did not change.</p>
<p>Inference.py: I get the data from a SQL query, so I replaced the data extraction part.</p>
<p><img src="/images/blog55/second_inference.jpg" /></p>
<p>Notice that here I use sqlalchemy to create the MySQL engine, and notice the special format of the connection string:</p>

<p>database+connector://user:password@host(defined in the compose file)/database</p>

<p>Using the connection, we can get a dataframe from read_sql. Here we get data from the boston_housing_price table. We have not created this table yet; once we stand up the container, we can create it inside the container.</p>
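<p>A sketch of that connection, with placeholder credentials (the host alias mysql_db is the service name from the compose file, and the password matches the environment variable set there):</p>

<pre><code class="language-python">
import pandas as pd
from sqlalchemy import create_engine

# database+connector://user:password@host/database
engine = create_engine("mysql+mysqlconnector://root:secret@mysql_db/mysql")

new_data = pd.read_sql("SELECT * FROM boston_housing_price", con=engine)
</code></pre>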
<p>See this blog for more on mysql connection<br />
<a href="https://planetscale.com/blog/using-mysql-with-sql-alchemy-hands-on-examples">https://planetscale.com/blog/using-mysql-with-sql-alchemy-hands-on-examples</a></p>
<p>Let us give it a shot.</p>
<p><img src="/images/blog55/compose_up.JPG" /></p>
<p>Using docker compose up, we can see the two services start. I used -d for detached mode, so the containers run behind the scenes. --remove-orphans is optional; I needed it because I had some other containers with the same name running.</p>
<p><img src="/images/blog55/docker_run.JPG" /></p>
<p>Using docker ps, we can see the names of the running containers.</p>

<p>From here, we know the MySQL container's name is code-mysql_db-1. We can get into the MySQL container to create a table and insert some records for testing.</p>
<p><img src="/images/blog55/getinsidecontainer.JPG" /></p>
<p>Here we use docker exec to run commands against an existing running container. -it is used when you want to get inside a container (i: interactive, t: tty). code-mysql_db-1 is the container name, mysql -p runs the mysql command, and -p means it asks for the password.
After entering the password (we set it to secret in the compose file), we get a mysql prompt. (In real life we would need to encode the password; see the example in this <a href="https://www.dataknowsall.com/postgres.html">blog</a>.)</p>
<p><img src="/images/blog55/create_table.JPG" /></p>
<p>We use the mysql database and create a boston_housing_price table with the columns we need. <br />
Let us assume 3 new records come in that we need to predict.</p>
<p><img src="/images/blog55/table_insert_value.JPG" /></p>
<p>We can insert the 3 records like so and check the values in the table. We can then type exit, and we are out of the container. We only use 3 records here, so we should only get three predictions.</p>
<p>Now let us score the data</p>
<p><img src="/images/blog55/docker_score.JPG" /></p>
<p>Since the container is already running, I use exec, and it gives me three score values. So we can successfully get predictions using Docker Compose.</p>
<p>Once it is done, we can use docker compose down to remove all containers.</p>
<p>With containers, it is convenient and easy to deploy anywhere, as long as I have these files in a folder.</p>

<p>As a next step, we could use Docker Swarm or K8s for orchestration, but that is out of the scope of this post.</p>

<p>Thanks for following along. Here is the <a href="/Files/blog55.zip">folder</a> containing the files you need if you want to repeat this.</p>
<p>Wenlei</p>

<h1>Dynamic SQL and Cursor in Teradata</h1>
<p>2023-06-04 | <a href="https://wenleicao.github.io/dynamic_sql_cursor_in_teradata">https://wenleicao.github.io/dynamic_sql_cursor_in_teradata</a></p>

<p>Dynamic SQL is a SQL statement formed at run time. You can use variables to increase the flexibility of SQL, but variables won't help on certain occasions, e.g., when you want to change the table name or column name of a script at run time. This is where dynamic SQL shines.</p>

<p>Cursors in SQL have been ditched as inefficient old technology in favor of set operations, because a cursor operates row by row. But a cursor is very useful when you deal with a handful of repetitive jobs because of its flexibility. Thinking of a cursor as a foreach loop might help you understand it.</p>

<p>You can achieve amazing results if you combine dynamic SQL and a cursor. Let us take a look at a real-life example.</p>

<p>I was tasked with converting some R code into SQL. The R code saved data into a dataframe and handled it with the dplyr package. In one line of code, you can use lapply to trim every column's whitespace for a given dataframe. But it is not that easy to do in Teradata.</p>

<p>I am not an expert on Teradata, but I will give it a shot. The least techy way would be to run a trim function for every char/varchar column in a table. But what if you have 5 tables with a total of 100 columns? What if the columns change in the future? Manually maintaining that code would be a disaster.</p>

<p>I am thinking of using dynamic SQL and a cursor because we can pass a list of column names to a cursor. The cursor then releases the column names one at a time to dynamic SQL, which executes an update statement with trimmed values. To get the list of char/varchar columns, we can query the metadata system view dbc.ColumnsV, so even if the table changes, we are still good.</p>

<p>I have been complaining about Teradata's documentation: it is difficult to find example code that fits your needs. Of course, I also need to boost my searching skills.</p>
<p>I feel the following two sites are helpful for Teradata users if you cannot google what you need.</p>
<p><a href="https://docs.teradata.com/">https://docs.teradata.com/</a> use search function here to locate some examples<br />
<a href="https://dbmstutorials.com/random_teradata/teradata-dynamic-statements.html">https://dbmstutorials.com/random_teradata/teradata-dynamic-statements.html</a></p>
<p>The second site gives some actual Teradata scripts. Although in its case the dynamic SQL is used to create or insert into tables, it is close enough to our purpose.</p>
<p>I use the script from the 2nd link as a template to create my own stored procedure and create a fake table to test the stored procedure.</p>
<p>Teradata syntax is quite different from other RDBMSs. Honestly, I don't remember all of that syntax unless I am using it on a daily basis, so if you are in the same boat, I suggest using a template and modifying the code based on your needs.</p>

<p>It is time to get our hands dirty; let us go over the code. BTW, I masked the database name for privacy; you just need to change it to yours.</p>
<p><img src="/images/blog54/stored_proc1.PNG" /></p>
<p>Line 1-2: create the stored procedure and pass in two params, dbname and tablename, because I need to use it for different tables and maybe a different database in the future.<br />
Line 4-6: define a few local variables to be used later.<br />
Line 14-15: the FOR loop starts here, in case there are multiple columns to loop through.<br />
Line 19-22: get the column list from the system view based on the dbname and tablename parameter values; notice the syntax with ":".</p>
<p><img src="/images/blog54/stored_proc2.PNG" /></p>
<p>Lines 24-33 use the column name to do the work.<br />
First, get the cursor value with the SET statement at line 27. Notice the syntax cursor.columnname. You should be able to use it directly; assigning it just makes things simpler later with a shorter variable name.<br />
Line 28 forms the SQL statement at run time.<br />
Line 32 executes the update statement.</p>
<p>Let us give a test run.</p>
<p><img src="/images/blog54/verify1.PNG" /></p>
<p>Lines 1-9 create a fake table with various leading and trailing spaces; I just want to see if the stored procedure works.<br />
By querying the table, we can see there are spaces that need to be removed.</p>
<p><img src="/images/blog54/verify2.PNG" /></p>
<p>Line 14 calls the stored procedure we created and passes in the dbname and tablename. <br />
Then select the table again: you can see all values have been trimmed nicely.</p>

<p>Next time, you can just pass in any db and tablename without worrying about how many columns they have or what changes will be made to those tables later. Your heart is free now :).</p>

<p>As always, the code can be downloaded <a href="/Files/blog54.zip">here</a>.</p>
<p>thanks!</p>
<p>Wenlei</p>

<h1>Poor man's automation</h1>
<p>2023-02-26 | <a href="https://wenleicao.github.io/Poor_man_automation">https://wenleicao.github.io/Poor_man_automation</a></p>

<p>One bottleneck of machine learning is operationalizing the model you build. Let us say you are satisfied with the model performance. Now the question is how you link your data engineering step with your model prediction and then save the prediction somewhere. Above all, you need to make it run automatically, so that you can do something more important.</p>

<p>I have been a BI developer on the Microsoft platform. ETL tools such as SSIS are well capable of doing data extraction. Besides that, SSIS also has an Execute Process task, with which you can run other applications such as Python and R. You can put tasks in sequence so that they are carried out according to your design. Alternatively, if you are allowed to use containers, you might be able to recreate your entire environment in Docker and run it in the cloud.
What if you are in an environment where the more advanced technology is not available to you yet? Luckily, we have a very old friend, the command line and batch files, which we can use to automate the process. Because it is old, the majority of software supports it. You can run Python and R from the command line, which means you can put that in your batch file.</p>

<p>I have a data science project in which the source data comes from a MicroStrategy report and is extracted using the R mstrio package (step 1). Some feature engineering takes place in R (step 2). More features are brought in from a variety of sources with Python; the data goes through merging and transformation, is fed to the model, and the prediction is finally saved in AWS S3 (step 3). You can see the different technologies used, and we are able to automate it with a few batch files.</p>

<p>I will share the structure of my batch file implementation, but I will not go into the very basics; people should be able to do some research themselves.</p>
<p>This is what I want to achieve:</p>
<ol>
<li>I can just run master.bat to get the prediction, with no other commands needed. You can use the Windows task scheduler to schedule the job later on.</li>
<li>I want to modularize the child processes. If a problem happens at the child process level, I do not need to touch master.bat.</li>
<li>I want to save job logs into text files, so in case I need to check what is wrong, I can look at those for troubleshooting.</li>
<li>The process needs error handling. If there is an error, the batch file will report it.</li>
</ol>
<p><img src="/images/blog53/1filename.PNG" /></p>
<p>Row 1: the @echo off command suppresses verbose output; by default, the shell repeats your commands in the output.</p>

<p>Row 2-4: automatically use the system time to generate a date-time file name, e.g., step1-2_09-25-2023_1928.txt.</p>

<p>Notice that %xxx% is the variable syntax in a batch file. Some are system variables like %CD% and %TIME%.<br />
You can pull system variable values directly. When you need to create a variable yourself, you need to follow a format like row 14; then you can use it like row 15 to show the value.</p>
<p><img src="/images/blog53/2path.PNG" /></p>
<p>Row 12: when you schedule a job via the task scheduler, this changes the directory to the location of the currently running file (otherwise, it would not know where it is).</p>

<p>Row 14: the parent folder of the running file is the code folder, so I save this path. It makes it easier to change to this path later.</p>

<p>Row 19: I get the directory one level up, because I want to save the log file in the output folder, which is at the same level as the code folder.<br />
Row 24: I build the file paths for the two log files.</p>
<p><img src="/images/blog53/2brunfile.PNG" /></p>
<p>Row 29-32: change directory to the code folder; at 31, call the child bat, step1-2.bat, to start step 1-2 (see below for detail). Because both step 1 and step 2 use R, I combine them into one step. The ">" is used to send the output to the log file whose path was created at row 24. If there is an error, it goes to the error handler at row 49.</p>

<p>Row 34-38: call the step3 child bat and output the log to the file path generated previously.</p>

<p>Row 40: cleanup.bat moves all generated files into the archive folder.</p>

<p>If every step before row 46 succeeds, it prints successful and then exits without error.</p>

<p>Row 49-52: if a previous step has an error, this block of code is called; it shows Failed, kills the process, and then exits with an error.</p>

<p>When row 31 runs, the child bat file step1-2.bat is called. The content of this child bat file is as follows.</p>
<p><img src="/images/blog53/3process_R.PNG" /></p>
<p>Row 3: specify the R application path.</p>

<p>Row 8: under this path, there is a file named Rscript.exe, which you can use from the command line to run step1-2.R.</p>

<p>When row 36 runs, it opens step3.bat (shown below) and runs it.</p>
<p><img src="/images/blog53/3bprocess_python.PNG" /></p>
<p>Row 4: create a time variable, mydate, which looks like 2023-02-15; its value needs to change based on the current month.</p>

<p>In this case, my Python runs under an Anaconda environment, so I need to run rows 8-9 to activate the environment.</p>

<p>Row 17: we use python to execute step3.py. "test" and %mydate% are two parameters that I pass to the Python script. In the Python script, you can use sys.argv[1] to get the value of the first parameter and sys.argv[2] for the second one. This way, you can pass a variable value to the Python script when the date needs to change every month, as in the following Python snippet.</p>
<p><img src="/images/blog53/4passparaminpython.PNG" /></p>
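<p>On the Python side, reading those two arguments is just a couple of lines; this is a minimal sketch rather than the actual step3.py.</p>

<pre><code class="language-python">
# step3.py (sketch): read the two arguments passed in from the batch file
import sys

run_mode = sys.argv[1]   # "test" in the example above
run_date = sys.argv[2]   # the %mydate% value, e.g. "2023-02-15"

print(f"running in {run_mode} mode for {run_date}")
</code></pre>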
<p>This is probably the way to complete automation tasks with minimal tech resources, but I don't think it is best practice. If we have to move it to the cloud, you will need to create a virtual machine and install identical R and Python environments to be able to run this, which takes a long time. If you switch to a different VM, you will need to create yet another environment.</p>
<p>The better approach would be to create a container that you can move anywhere.</p>
<p>But if you are on a budget, this might be something you can try.</p>
<p>Thanks. Hope this is useful.</p>
<p>Wenlei</p>

<h1>How to Evaluate Your Machine Learning Model</h1>
<p>2022-11-12 | <a href="https://wenleicao.github.io/How_to_Evaluate_Your_Machine_Learning_Model">https://wenleicao.github.io/How_to_Evaluate_Your_Machine_Learning_Model</a></p>

<p>Given a scenario, after a few days of hard work, you have a model built. Are you done?<br />
Unfortunately, it is not the end of your data science project; it is the start of many other things!</p>

<p>I need to validate a few things (just my list, not an exhaustive one) so that we know the model is good.</p>
<ol>
<li>Besides a good performance metric score, what are the important features that support the prediction? We can cross-check with business knowledge and ask ourselves: does that make sense?</li>
<li>It would be good to check visually whether the model is overfitting or underfitting.</li>
<li>If the data is imbalanced, what is the optimal threshold to choose?</li>
<li>Other useful charts to check model performance.</li>
</ol>
<p>It is easier to explain with an example, so let us use the previous project to explore the questions we listed. For details of the previous project, the link is <a href="https://wenleicao.github.io/How_to_Handle_Textual_Features_along_with_Other_Features_in_Machine_Learning/">here</a>.</p>

<p>First, I need to import the pickle file I saved from the previous project. To do that, I need to import all the packages, custom classes, and functions used, as I know it will run into errors without those.</p>
<p><img src="/images/blog52/1import_object.PNG" /></p>
<p>In cell 8, I import the sklearn model_selection object. When you look at the imported object, it shows the pipeline details.</p>
<h2 id="1-feature-importance">1. feature importance</h2>
<p>In some industries, model explainability is crucial. Imagine we built an insurance pricing model but cannot explain how the model works. When we submit the model to the state regulator for approval, we will fail, because regulators need to know why we increase insurance premiums. Over the past 20 years, there have been several important machine learning frameworks, such as scikit-learn and the deep learning frameworks. Depending on the model, scikit-learn generally provides model properties that reveal feature importance, whereas deep learning frameworks are famous for being black boxes, due to their hidden layers, despite fairly high performance.
Since I used scikit-learn in my previous blog, I will focus on some of the ways to do this in scikit-learn.</p>

<p>In Dr. Brownlee's blog, he illustrates three important ways to get feature importance.</p>
<p><a href="https://machinelearningmastery.com/calculate-feature-importance-with-python/">https://machinelearningmastery.com/calculate-feature-importance-with-python/</a></p>
<ol>
<li>Coefficients as feature importance, if it is a linear model</li>
<li>Tree-based feature importance (decision tree, random forest, XGBoost, et al.)</li>
<li>Permutation feature importance (pass scrambled predictors to the model and measure the performance drop to get the importance)</li>
</ol>
<p>The optimized model in my previous project is logistic regression. It is a linear model, so let us use the first method, i.e., the coefficient method. We can use named_steps to retrieve the model directly from the sklearn model_selection object. Using its coef_ property, we can see 153 features; each coefficient indicates how much impact that feature has. Please note that your data needs to be scaled to a similar level. In my case, all values are scaled between (-1, 1), so the impact of each feature is relatively comparable.</p>
<p><img src="/images/blog52/1.5coefficient.PNG" /></p>
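<p>A sketch of pulling those coefficients out, assuming the pickled object (called search here) is the fitted search object and the classifier step is named "logisticregression", which is the name make_pipeline would generate:</p>

<pre><code class="language-python">
# search: the fitted RandomizedSearchCV object loaded from the pickle (name assumed)
best_pipeline = search.best_estimator_
clf = best_pipeline.named_steps["logisticregression"]

coefficients = clf.coef_[0]   # one coefficient per transformed feature
print(len(coefficients))      # 153 in this case
</code></pre>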
<p>With the coefficients in place, we need to map the feature names to them.</p>
<p>In my previous blog, I have numeric features, categorical features and text features.</p>
<p><img src="/images/blog52/2Features.PNG" /></p>
<p>num_column, categorical_column, Title in txt_column, and 'Review Text' in txt_column pass through pipeline-1 to pipeline-4 within the column transformer, respectively. After the transformation, the results are in sparse matrix format. Unlike a dataframe, this format does not have column names. So in order to know the feature importance, we have to retrieve the feature names for columns that went through something like one-hot encoding, which expands one column into multiple columns.</p>

<p>Let us see how we get the feature names for the sparse matrix.<br />
First of all, they follow the sequence in which you set the pipelines up (pipeline 1 to pipeline 4).</p>
<ul>
<li>num_column: just simple imputing and scaling; the number of columns has not changed.</li>
<li>Categorical column: it has the column-expanding transformer, OneHotEncoder. Luckily, scikit-learn provides attributes such as named_steps and named_transformers_ (see cell 12, line 5) so that you can navigate to the onehotencoder step. Then you can use get_feature_names() to get all the feature names (not shown in the screenshot because it is too wide, but it is in the notebook). Notice that scikit-learn renames the three variables (Division, Department, Class) to X0-X2.
By the way, displaying the pipeline as in cell 11 is helpful for navigating complicated transformations.</li>
<li>For the text columns, I divide Title and "Review Text" into two pipelines (3 and 4). Text is tokenized into words (expanding the columns). We select the 20 and 100 most important words respectively with the SelectKBest transformer (the column count changes here too). These words are later used as column names.</li>
</ul>
<p><img src="/images/blog52/2Features2.PNG" /></p>
<p>Here we first get the word indices from the SelectKBest step, then use the indices to get the actual words from the previous step, CountVectorizer. There are 20 for Title, so I loop 20 times and rename each column with the prefix 'title' to distinguish these words from the 'review' ones. For Review Text, it works similarly; just change 20 to 100.</p>
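<p>A sketch of that lookup for the Title pipeline; the step and transformer names ("columntransformer", "pipeline-3", and so on) are assumptions about how the column transformer was assembled, so adjust them to whatever your pipeline display shows.</p>

<pre><code class="language-python">
# navigate to the Title text pipeline inside the column transformer
preprocessor = best_pipeline.named_steps["columntransformer"]
title_pipe = preprocessor.named_transformers_["pipeline-3"]

# full vocabulary from CountVectorizer, and the indices SelectKBest kept
vocab = title_pipe.named_steps["countvectorizer"].get_feature_names()
keep_idx = title_pipe.named_steps["selectkbest"].get_support(indices=True)

# prefix with "title_" to distinguish these words from the review words
title_features = ["title_" + vocab[i] for i in keep_idx]
</code></pre>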
<p>Next, I piece together all the feature names and combine them with coefficients.</p>
<p><img src="/images/blog52/3mapfeture_coeffient.PNG" /></p>
<p>Let us see what those important players are.</p>
<p><img src="/images/blog52/4top_feature_importance_result.PNG" /></p>
<p>I list the top 10 and bottom 10 features. Notice the bottom 10 are also important; they just push the prediction toward the negative target (not recommended, target = 0).</p>

<p>Reviewing those features, they make much more sense to me. Rating is definitely positively correlated with recommended (target = 1). I don't see the categorical features playing an important role here; probably they are neutral. The rest of the top 10 features are mainly good words, and the bottom 10 features are mainly bad words. The only question here is title_wanted in the bottom 10, meaning the word "wanted" in the title. I would think this is a positive word, but maybe I don't understand fashion: in the fashion world, if other people want it, it might not be a good thing. It is worth taking time to look at the original context to see if this is the case and make further adjustments.</p>

<p>If we visualize it with a bar chart:</p>
<p><img src="/images/blog52/5feature_importance.PNG" /></p>
<p>These analyses are all at the detailed level. I often ask myself: what if we treat the expanded categorical/text features as one feature each? As a whole, what would the feature importance landscape be?</p>

<p>For that, we can use the permutation feature importance calculation. The idea is as follows.</p>

<p>We already have a model and know what its performance is. Now, if I scramble/shuffle one feature's values and pass the dataset to the model, you will see the performance drop. If that particular feature is important, the drop is larger.</p>
<p>The following is the process. I modified part of the code from this <a href="https://towardsdatascience.com/from-scratch-permutation-feature-importance-for-ml-interpretability-b60f7d5d1fe9">blog</a>.</p>
<p><img src="/images/blog52/6permutaion_calculation.PNG" /></p>
<p>Before this section, I didn't need the source data, because all model-related data is pickled and can be retrieved from the pickle. Since I need to do some data scrambling, I re-import the data and make a prediction at cell 26, using f1 as the performance metric. We get f1 = 0.94 without scrambling. Now, in cell 39, we loop through each column; at row 12, we shuffle the values. We recalculate the f1 score and compare it with the baseline at row 19. Then we put all the changes into a dataframe and sort it.</p>
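<p>The loop itself is short; a sketch of it is below, with search, X_test, and y_test standing in for the fitted search object and the re-imported test data.</p>

<pre><code class="language-python">
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

baseline = f1_score(y_test, search.predict(X_test))   # 0.94 without scrambling

drops = {}
for col in X_test.columns:
    X_shuffled = X_test.copy()
    # scramble one raw column at a time and measure the drop in f1
    X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
    drops[col] = baseline - f1_score(y_test, search.predict(X_shuffled))

importance = pd.Series(drops).sort_values(ascending=False)
</code></pre>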
<p><img src="/images/blog52/6permutaion_results.PNG" /></p>
<p>From the result, you can still see Rating is the most important feature. The two text features still rank 2 and 3, but value-wise they are not as important as at the detailed level. This makes sense, because each contains both positive and negative words, which might balance out the impact as a whole.</p>

<p>There is a popular feature importance package called SHAP. The difference between permutation importance and SHAP is that the former is determined by the drop in a performance metric, while the latter is the magnitude of feature attributions. SHAP can also explain deep learning models, so check it out.</p>
<p><a href="https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html">https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html</a></p>
<h2 id="2-model-quality">2. Model quality</h2>
<p>Generally, if you see your model perform well on your test dataset, it is a good sign. But it is reassuring to have a chart showing the learning curve.
In Marcelino's blog, he has two functions which I think are useful.</p>
<p><a href="https://www.kaggle.com/code/pmarcelino/data-analysis-and-feature-extraction-with-python/notebook">https://www.kaggle.com/code/pmarcelino/data-analysis-and-feature-extraction-with-python/notebook</a></p>
<p>The functions can be found in my notebook as well; the download link is listed below.</p>
<p><img src="/images/blog52/7learning_curve.PNG" /></p>
<p>Basically, if your learning score is high, there is no underfitting. If you don't see an obvious gap between the two curves, there is no overfitting. In my case, I don't see obvious underfitting or overfitting.</p>
<p><img src="/images/blog52/8validation_curve.PNG" /></p>
<p>The second function is used to check a hyperparameter. You can see that at 10e-1, the two curves start to separate. It looks like the hyperparameter C should be chosen at 10e-1 or lower, which is consistent with the best param in the previous blog.</p>
<h2 id="3-optimize-the-threshold">3. optimize the threshold</h2>
<p>When doing classification, the algorithm gives a probability for each record, and the final label is assigned by comparing the probability with a threshold. By default, the threshold is 0.5, but that is not always the best choice. I have seen the optimal threshold at 0.05 for some imbalanced datasets. How do you find the optimal threshold? The idea is to put your predicted probabilities together with your target, plot them, and find the threshold that separates them best.</p>
<p><img src="/images/blog52/9threshold1.PNG" /></p>
<p>First, you use the predict_proba function to get the probabilities.</p>
<p><img src="/images/blog52/9threshold2.PNG" /></p>
<p>Second, you list the probabilities alongside the actual labels.</p>
<p><img src="/images/blog52/9threshold3.PNG" /></p>
<p>Then you can use displot in Seaborn to plot the distribution. You can find the optimal threshold at the point where it separates the two populations well. In this particular case, it is close to 0.54.</p>
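<p>A sketch of those three steps, assuming the same search, X_test, and y_test objects as above:</p>

<pre><code class="language-python">
import pandas as pd
import seaborn as sns

# probability of the positive class for each test record
proba = search.predict_proba(X_test)[:, 1]

scores = pd.DataFrame({"probability": proba, "actual": list(y_test)})

# distributions of the predicted probability, split by the true label;
# the threshold that best separates the two populations (about 0.54 here) is the one to pick
sns.displot(data=scores, x="probability", hue="actual", kind="kde")
</code></pre>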
<h2 id="4-other-useful-chart-to-check-model-performance">4. Other useful chart to check model performance</h2>
<p>You can draw a ROC curve to see how good the performance is. You can use AUC to compare different runs.</p>
<p><img src="/images/blog52/10ROC.PNG" /></p>
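<p>A sketch of the ROC curve and AUC, reusing the predicted probabilities from the previous step:</p>

<pre><code class="language-python">
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(y_test, proba)
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
</code></pre>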
<p>The lift chart tells you how much your model improves prediction compared with having no model. In this case, it improves performance about 2-fold when you compare the top 40% of the sample with the baseline.</p>
<p>To understand the lift chart, the following link will be helpful.</p>
<p><a href="https://scikit-plot.readthedocs.io/en/stable/metrics.html">https://scikit-plot.readthedocs.io/en/stable/metrics.html</a><br />
<a href="http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html ">http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html </a></p>
<p><img src="/images/blog52/lift_chart.PNG" /></p>
<p>Thanks for following along.</p>
<p>It is a long series. The notebook is <a href="/Files/understand_model.ipynb">here</a>.</p>
<p>Keep safe.</p>
<p>Wenlei</p>

<h1>How to Handle Textual Features along with Other Features in Machine Learning</h1>
<p>2022-10-02 | <a href="https://wenleicao.github.io/How_to_Handle_Textual_Features_along_with_Other_Features_in_Machine_Learning">https://wenleicao.github.io/How_to_Handle_Textual_Features_along_with_Other_Features_in_Machine_Learning</a></p>

<p>When working on a real-life data science project, you will realize there are things you might not expect from what you learned in school.</p>
<ol>
<li>You spend the majority of your time working on data prep and data preprocessing.</li>
<li>You will need to handle all kinds of data, not just numeric and categorical.</li>
</ol>
<p>Scikit-learn's pipeline greatly simplifies data prep and makes it much easier to apply exactly the same data prep logic to new incoming data. But in real work, you often come across text-type data, like notes or social media text saved in a database. Those could be important features, which means you have to use text mining techniques. People usually stay away from those because of the challenges, and it is a pity not to take advantage of them, since more often than not they can improve overall model performance.</p>

<p>Text is a bit trickier than other types of data in that you usually have to clean it and then vectorize it before it can be put to use. Other forms of data, like audio and images, are by default already in a numpy array format; therefore, those are relatively easier to incorporate than text.</p>

<p>Before writing this, I found only one blog discussing the topic. I suggest people take a look, since the author covers two different approaches for scikit-learn and for deep learning networks. One issue with the blog is that the author did not provide the source dataset, so it is hard for readers to repeat the work.</p>
<p><a href="https://towardsdatascience.com/how-to-combine-textual-and-numerical-features-for-machine-learning-in-python-dc1526ca94d9">https://towardsdatascience.com/how-to-combine-textual-and-numerical-features-for-machine-learning-in-python-dc1526ca94d9</a></p>
<p>Here, I would like to show how I go about it with an example using the scikit-learn framework.</p>

<p>I use the Women's Clothing E-Commerce Reviews dataset. This dataset contains numeric, categorical, and text fields. You can download the dataset <a href="/Files/Womens Clothing E-Commerce Reviews.zip">here</a> to follow along.</p>
<p>First, we import all packages that will be used in the analysis.</p>
<p><img src="/images/blog51/1import_package.PNG" /></p>
<p>Now we import the dataset and take a peek at what the data looks like.</p>
<p><img src="/images/blog51/2check_data1.PNG" /></p>
<p>Using the info function, we can see the data types and notice some missing data.</p>
<p><img src="/images/blog51/2check_data2.PNG" /></p>
<p>Among all the columns, Review Text and Title are text columns. Recommended IND is the recommendation indicator, which we will use as the binary classification target.</p>

<p>Besides that, checking the Recommended IND value counts in cell 12 shows the data is imbalanced. In cell 13, I undersample value = 1 so that I have an equal number of positive and negative records in the dataset used for training and testing. It is very important to handle imbalanced data properly; otherwise, your model can learn to just predict the predominant value and still get high accuracy, which is not what you want.</p>

<p>You can use the Python package imblearn to handle the imbalance problem, which gives you more flexibility; the following links will get you started. But that is not our goal here, so I used a simple step.</p>
<p><a href="https://youtu.be/YMPMZmlH5Bo">https://youtu.be/YMPMZmlH5Bo</a><br />
<a href="https://youtu.be/OJedgzdipC0">https://youtu.be/OJedgzdipC0</a></p>
<p><img src="/images/blog51/3under_sample.PNG" /></p>
<p>I defined column lists in cell 15.</p>
<p>I split the data into train and test in cell 17.</p>
<p><img src="/images/blog51/4sample_split.PNG" /></p>
<p>Since many machine learning algorithms only take a numpy array as input, we need to impute nulls and scale the data; if the data is not in a normal distribution, we try to correct that; for categorical data, we need to encode it to numeric values; and for text data, we need to vectorize it.</p>

<p>Next, we build a pipeline for the numeric data. Normally you will see an imputer and a scaler; we will add a custom transformer too. Since the data is not in a normal distribution, let us try a log transformation. Because there are some 0 values in the data, directly using the log function would generate -inf, which would cause trouble for the next transformation in the pipeline. Therefore, we add 1 to the original value, then do the log transformation. In my case, the minimum value is 0, so this is fine; but if you have negative values, you might want to try another kind of transformation, like Box-Cox.</p>

<p>In cell 20, I use FunctionTransformer to turn a function into a transformer. This is a shortcut so you don't have to write a class for the transformer.</p>
<p><img src="/images/blog51/5log_function.PNG" /></p>
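<p>In code, the idea is simply log(x + 1), which numpy provides as log1p; this sketch may differ slightly from the cell in the screenshot.</p>

<pre><code class="language-python">
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def log_transform(x):
    # log(x + 1) keeps zero values at 0 instead of producing -inf
    return np.log1p(x)

log_transformer = FunctionTransformer(log_transform)
</code></pre>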
<p>After the transformation, notice that in the third row (yellow shaded), log(0+1) = 0; there is no -inf value anymore.</p>
<p><img src="/images/blog51/6log_function_after.PNG" /></p>
<p>I use the make_pipeline function to create a pipeline. Here, I put the custom log transformer into the pipeline, then test both the numerical and categorical pipelines. Both work fine.</p>
<p><img src="/images/blog51/7pipeline_handle_num_cat.PNG" /></p>
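<p>A sketch of those two pipelines; the exact imputation strategies and scaling range are assumptions (the follow-up post mentions values scaled between -1 and 1), so treat them as illustrative.</p>

<pre><code class="language-python">
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# numeric: impute, log-transform, then scale into a common range
num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    log_transformer,
    MinMaxScaler(feature_range=(-1, 1)),
)

# categorical: impute the most frequent value, then one-hot encode
cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
</code></pre>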
<p>Next, I start to work on the text pipeline.</p>
<p>Since our focus is not training a model but rather handling heterogeneous data, I will borrow Bert's clean_text class to clean up the Review Text column. His blog is at the following address if you want to know more about the clean_text function.</p>
<p><a href="https://towardsdatascience.com/sentiment-analysis-with-text-mining-13dd2b33de27">https://towardsdatascience.com/sentiment-analysis-with-text-mining-13dd2b33de27</a></p>
<p><img src="/images/blog51/8clean_text_class.PNG" /></p>
<p>Both pipelines have similar components. The only difference is that txt2_pipeline contains the clean_text step. Also, I chose to pick the top 100 important features instead of 20, since the review text is longer.</p>
<p>Let us give it a shot. Whoops! We get our first error in the txt1_pipeline.</p>
<p>This error tells us that our first step, the imputer, needs 2-dimensional data, whereas we provide 1-dimensional data. On the other hand, we have to flatten the structure with the ravel function, because in the next step CountVectorizer expects 1-dimensional data.</p>
<p><img src="/images/blog51/9text_pipeline_init_error.PNG" /></p>
<p>So, we need to fix that.</p>
<p><img src="/images/blog51/9text_pipeline_fix.PNG" /></p>
<p>In cell 30, we reshape the data for imputation. After that, we flatten the structure so that it is ready for CountVectorizer. To use the same logic inside the pipeline, the wrapper idea from Stack Overflow in cell 31 works well: once we wrap the SimpleImputer class, it yields the correct format for the next step.</p>
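<p>The wrapper is essentially a thin subclass that reshapes on the way in and flattens on the way out; this sketch is my own version of that idea, not necessarily identical to the cell 31 code.</p>

<pre><code class="language-python">
import numpy as np
from sklearn.impute import SimpleImputer

class Imputer1D(SimpleImputer):
    """SimpleImputer that accepts a 1-D text column and returns a 1-D array,
    which is the shape CountVectorizer expects next in the pipeline."""

    def fit(self, X, y=None):
        return super().fit(np.asarray(X).reshape(-1, 1), y)

    def transform(self, X):
        return super().transform(np.asarray(X).reshape(-1, 1)).ravel()
</code></pre>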
<p>In the same way, we can handle txt2_pipeline.</p>
<p><img src="/images/blog51/10text_pipeline.PNG" /></p>
<p>It is time to put those pipelines together. In cell 36, we merge all these different pipelines using the make_column_transformer function. Here I came across the 2nd error, which is a bit weird: since all the previous pipelines work individually, I did not expect an error here. The error is the same as the one shown in the Stack Overflow post. I had to change line 6 from ['Review Text'] to 'Review Text'. It puzzles me why the same error did not show up for txt1_pipeline. So far, I have no explanation, but if you know, you can leave it in the comment section. The preprocessing is successful, as you can see from the dimensions.</p>
<p><img src="/images/blog51/11text_pipeline.PNG" /></p>
<p>I put together a list of classifiers and params (partly shown in the screenshot). I included 8 different classifiers to compare which one performs better.</p>
<p><img src="/images/blog51/12classfier_param.PNG" /></p>
<p>Now, in cell 39, we create the final pipeline with a classifier, and in cell 40 we run a RandomizedSearchCV with 100 iterations to save time, since our goal is not how good the model is but how to handle textual features alongside the others. Training completed in 42 minutes.</p>
<p><img src="/images/blog51/13classifier.PNG" /></p>
<p>I expected XGBoost to do a better job, but here logistic regression won the gold medal, with XGBoost in second place.</p>
<p><img src="/images/blog51/14result.PNG" /></p>
<p>Let us see how it performed on test data it has never seen.</p>
<p>It looks like the model performed pretty well on the unseen data; both recall and precision are high.</p>
<p><img src="/images/blog51/15test_result.PNG" /></p>
<p>I introduced some custom steps to simulate what you would encounter in real life. I hope you find this useful.</p>
<p>Some of the errors I ran into may have to do with scikit-learn’s design, but I hope this post helps you avoid some of the pain I went through.</p>
<p>Thanks, you can download the notebook <a href="/Files/Handle_different_type_data4.ipynb">here</a>.</p>
<p>Wenlei</p>Entitle the Custom Transformer a Memory in Python2022-08-13T00:00:00+00:002022-08-13T00:00:00+00:00https://wenleicao.github.io/entitle_Custom_Transformer_the_memory<p>Before data can be used in machine learning, it needs to be preprocessed. Many machine learning algorithms cannot handle nulls and outliers well, so these steps are important. Sklearn has a preprocessing module to help you do that, but oftentimes something like capping outliers or log-transforming the data requires you to write your own transformer class.</p>
<p>Let us say that when training your model, you capped the outliers at the 99th percentile of the training data. Now you start using the model to predict, and your testing data also has outliers: should you cap them at the 99th percentile of the training data or of the testing data? Because your model’s parameters were built from the training data, if you want the model to work properly you should treat the test data the same way as the training data, by using the 99th percentile of the training data.</p>
<p>This means we have to “remember” the training data’s 99th-percentile value. That might not be a problem for 1 or 2 features, but what if you work on a model with 100 features, and thirty of them need to be preprocessed with their 99th-percentile values? How do you track all those numbers? It is not practical to remember those values and hard-code them in a function. Let us see how we can use a class to track them for us.</p>
<p>In addition, there are times when we need to preprocess data separately and then feed it to a model hosted somewhere else. A custom transformer class can preprocess the data in one step. Besides that, you can store the fitted transformer on disk and reload it only when you need it. That is convenient for lazy people like me. I will show how that works as well.</p>
<p>Let us see how a custom class can help us with that.</p>
<p>First, let me show the non-class version.</p>
<p><img src="/images/blog50/1nonclass.PNG" /></p>
<p>Cell 79 creates a data frame from a tuple. Notice that I introduce an outlier in row 4 and a null value in row 5. You can fill the null value and cap the outlier in cells 80 and 81 (hard-coded for simplicity).</p>
<p>Now that the hard-coded method works, let us see how a class can simplify our work.</p>
<p><img src="/images/blog50/1class.PNG" /></p>
<p>I put the code into the transform function and create a custom transformer called model_transformer. Let us see if the custom transformer works. I reset the data frame in cell 83. In cell 84, I first instantiate the model_transformer class and name the instance mt, then use its fit_transform function to preprocess the data frame. The result in cell 84 is the same as in cell 81. The class basically packages all the functions together and transforms the data in one step. That is neat.</p>
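<p>For readers without the screenshot, the no-memory version is roughly the following; the column name and the hard-coded numbers are placeholders, not the notebook’s actual values.</p>
<pre><code class="language-python">from sklearn.base import BaseEstimator, TransformerMixin

class ModelTransformer(BaseEstimator, TransformerMixin):
    """Package the preprocessing steps into a single fit_transform call."""

    def fit(self, X, y=None):
        return self  # nothing is learned -- this version has no memory yet

    def transform(self, X):
        X = X.copy()
        X["salary"] = X["salary"].fillna(50000)       # fill nulls (hard-coded value)
        X["salary"] = X["salary"].clip(upper=150000)  # cap the outlier (hard-coded cap)
        return X

mt = ModelTransformer()
df_clean = mt.fit_transform(df)  # df stands for the toy frame reset in cell 83
</code></pre>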
<p>Let us say, you have a new dataset for testing. Can this instance handle it?</p>
<p><img src="/images/blog50/2class_new_data.PNG" /></p>
<p>I created a new dataset with the values in rows 4 and 5 changed. The mt instance gets the job done.</p>
<p>If I want to use the mt instance in another notebook to do the transformation, will that work?</p>
<p><img src="/images/blog50/3pickle.PNG" /></p>
<p>Here I pickle the transformer and then reload it under a different name. In cell 11, I create a different dataset, and the loaded instance processes the data just fine. (For simplicity I did this in the same notebook, but when you use another notebook you need to make the class definition available there so that Python knows where to find it.)</p>
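<p>The pickling round trip might look like this; the file name and variable names are mine.</p>
<pre><code class="language-python">import pickle

# Save the fitted transformer to disk ...
with open("model_transformer.pkl", "wb") as f:
    pickle.dump(mt, f)

# ... and reload it later (the class definition must be importable there too).
with open("model_transformer.pkl", "rb") as f:
    mt_loaded = pickle.load(f)

df_new_clean = mt_loaded.transform(df_new)  # df_new: any new frame with the same columns
</code></pre>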
<p>Can the class remember values from training? In the previous code, I simplified things with hard-coded values to show how the class and pickle work.</p>
<p>Here I have a training dataset and a testing dataset, each with different mean values for salary and bonus. We use the mean as a simple example, but the same idea applies to other scenarios such as capping outliers.</p>
<p>Let us see whether, if I fit with the training data and then transform the testing data, the mean values from training are used to fill the null values in testing.</p>
<p><img src="/images/blog50/4hardcodemean.PNG" /></p>
<p>The hard-coded Python script fills the salary and bonus null values with the averages.</p>
<p>Now let us create a transformer.</p>
<p><img src="/images/blog50/5.nomemory_transformer.PNG" /></p>
<p>This custom transformer is able to handle the training data, but obviously it has no memory. You can see that when it transforms the testing data, it fills the null values with the testing data mean (yellow shading below).</p>
<p><img src="/images/blog50/6nomemory_transform_result.PNG" /></p>
<p>In order to let the class remember the values, you need to create an instance variable at line 8, where you can save the training values. The instance variable will differ from instance to instance, depending on the training data you fit. When you fit with the training data, the average of each column is saved into the instance variable (line 20). When you transform the testing data, these values are retrieved by the set_null_as_avg function (see lines 14 and 15).</p>
<p><img src="/images/blog50/7memory_transformer.PNG" /></p>
<p>Let us give it a try.</p>
<p><img src="/images/blog50/7memory_transformer2.PNG" /></p>
<p>You can see that df5 no longer uses the testing data averages but instead uses the training data averages. In fact, you can check the instance variable values in cell 23: once you fit with the training data, these averages are saved there.</p>
<p>I hope this smarter class can save you some time on tedious work.</p>
<p>As always, the Jupyter code file can be found <a href="/Files/model_transformer-add_memory.ipynb">here</a>.</p>
<p>Thanks for reading</p>
<p>Wenlei</p>Custom Python Functions Used in Exploratory Data Analysis2022-06-21T00:00:00+00:002022-06-21T00:00:00+00:00https://wenleicao.github.io/Custom_Python_Functions_Used_in_Exploratory_Data_Analysis<p>A journey of a thousand miles begins with a single step. Similarly, any fancy machine learning project begins with exploratory data analysis (EDA). Because of that, you would think there would be a lot of resources on it. Surprisingly, there are not. Let us google the best book for EDA. Here are the two top links from a Google search.</p>
<p><a href="https://www.kaggle.com/getting-started/173448">https://www.kaggle.com/getting-started/173448</a></p>
<p><a href="https://www.quora.com/What-is-a-good-book-on-exploratory-data-analysis">https://www.quora.com/What-is-a-good-book-on-exploratory-data-analysis</a></p>
<p>The first link gives a few more Kaggle links where some experts in the field share their notebooks, which enlightened me quite a bit. In the second link, a few people mention a book by John Tukey named “Exploratory Data Analysis”, first published in 1977. I don’t question the theory in the book, but it will not teach you much about using Python to do EDA.</p>
<p>After reading a few Kaggle articles and data science blogs, I find the following links very helpful. They provide different perspectives on how others approach this part. You might want to read them over.</p>
<ul>
<li>regression</li>
</ul>
<p><a href="https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python/notebook">https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python/notebook</a></p>
<ul>
<li>classification, plus a methodology of growing from a minimal model to a chubby model while monitoring model performance</li>
</ul>
<p><a href="https://www.kaggle.com/code/pmarcelino/data-analysis-and-feature-extraction-with-python/notebook">https://www.kaggle.com/code/pmarcelino/data-analysis-and-feature-extraction-with-python/notebook</a></p>
<ul>
<li>NLP</li>
</ul>
<p><a href="https://towardsdatascience.com/sentiment-analysis-with-text-mining-13dd2b33de27">https://towardsdatascience.com/sentiment-analysis-with-text-mining-13dd2b33de27</a></p>
<ul>
<li>visualize variable relationship</li>
</ul>
<p><a href="https://towardsdatascience.com/exploratory-data-analysis-eda-python-87178e35b14">https://towardsdatascience.com/exploratory-data-analysis-eda-python-87178e35b14</a></p>
<p>I personally do not like repeating myself, but when you work on a different project you often have to repeat a similar process. Therefore, I would like to turn those commonly used steps into functions, so that I can save them somewhere and call them when I need them.</p>
<p>In this blog, I would like to share how I go about doing EDA using the Titanic dataset. I have just started my collection, and these functions are not fully tested, so they might be buggy. Please let me know if you run into issues.</p>
<p><img src="/images/blog49/1package.PNG" /></p>
<p>First, I import the common packages, then add the function folder to the module search path at row 11 so that Python can find the function file, and import the EDA_function.py file at row 12. This file contains all the functions needed. Please change the folder accordingly (either an absolute or a relative path will do).</p>
<p>Seaborn has some built-in datasets. I will use the Titanic dataset as an example.</p>
<p><img src="/images/blog49/2importdataset.PNG" /></p>
<p>I am going to touch on several aspects of EDA:</p>
<ol>
<li>overall EDA</li>
<li>further EDA on data variation and dup</li>
<li>further EDA on data missing or inf (inlier)</li>
<li>further EDA on data with outlier</li>
<li>further EDA on feature correlation</li>
<li>further EDA on feature selection</li>
</ol>
<h2 id="1-overall-eda">1. overall EDA</h2>
<p>You will probably use df.info() or df.describe(). Since these are run back to back, I combined them into one function.</p>
<p><img src="/images/blog49/3overall.PNG" /></p>
<p>This gives you an overview of the data. From the first result, you can see the data types and whether there are missing-value issues. The second result covers only the numeric variables; you can see how the data is distributed and whether there are any outliers.</p>
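<p>The combined helper is probably little more than this; the real EDA_function.py version may differ.</p>
<pre><code class="language-python">def overall_eda(df):
    """Run the two back-to-back overview calls in one step."""
    df.info()             # dtypes and non-null counts, i.e. missing-value hints
    return df.describe()  # distribution of the numeric columns

overall_eda(titanic)
</code></pre>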
<h2 id="2-data-variation-and-duplication">2. Data variation and duplication</h2>
<p>If you do not check this, you might not believe that the dataset contains duplicates.</p>
<p><img src="/images/blog49/4dup.PNG" /></p>
<p>If a variable has only one value, it can be removed because it will not contribute anything to the machine learning process. In the Titanic case, the smallest count is 2, so that is likely a binary variable. But it is not uncommon to find a column that contains only one value in a real-life dataset.</p>
<p><img src="/images/blog49/5variationPNG.PNG" /></p>
<h2 id="3-data-missing-or-inf">3. Data missing or inf</h2>
<p>The custom function lists only the variables containing NaN values. If too many values are missing, you might want to remove the variable, like “deck”. If only a few are missing, you can delete those particular rows, like “embarked”. For “age”, you might want to impute with the average or another strategy (please check the classification <a href="https://www.kaggle.com/code/pmarcelino/data-analysis-and-feature-extraction-with-python/notebook">blog</a> above; it is impressive).</p>
<p><img src="/images/blog49/6missing.PNG" /></p>
<p>You might find infinite values in the overview result. This can happen because the data source did some calculation that divides by 0. Do not be surprised; this does happen when you deal with real-life data.</p>
<p><img src="/images/blog49/7inf1.PNG" /></p>
<p>The Titanic dataset does not have inf values, so I manually created a small fake dataset, which shows the function works as expected.</p>
<p><img src="/images/blog49/7inf2.PNG" /></p>
<h2 id="4-outlier">4. Outlier</h2>
<p>It depends on how you define outliers; here I use 1.5 standard deviations. The column list is as follows. Please note that there are some discrete variables. Although I planned to remove them programmatically, there is no easy function to identify discrete variables, so you have to use your business sense and remove them manually.</p>
<p><img src="/images/blog49/8outlier1.PNG" /></p>
<p>The following function shows a histogram with a normal curve overlaid. It looks like age more or less follows a normal distribution.</p>
<p><img src="/images/blog49/8outlier2.PNG" /></p>
<p>If you are tired of showing plots one by one manually, here is another function, which shows them in bulk. It looks like fare does not follow a normal distribution.
There are multiple ways to handle outliers and non-normal distributions, such as a log transformation. The <a href="https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python/notebook">blog</a> above has some details.</p>
<p><img src="/images/blog49/8outlier3.PNG" /></p>
<p>You can create box plots in bulk to see where the outliers are. There is one fare over $500.</p>
<p><img src="/images/blog49/8outlier4.PNG" /></p>
<h2 id="5-feature-correlation">5. feature correlation</h2>
<p>The DataFrame’s built-in corr function shows all the numeric correlations.</p>
<p><img src="/images/blog49/9all.PNG" /></p>
<p>If you want to show only the correlation with the target variable, this custom function lists them all in descending order. It looks like “fare” has a higher correlation with survival. Money talks. Adult_male has a stronger negative correlation with survival. Poor man!</p>
<p><img src="/images/blog49/9target.PNG" /></p>
<p>You can take a closer look using a pairplot.</p>
<p><img src="/images/blog49/9pairplot.PNG" /></p>
<p>Another useful plot is the heatmap.</p>
<p><img src="/images/blog49/9heatmap.PNG" /></p>
<p>Notice that parch and sibsp, as well as adult_male and alone, are relatively highly correlated. Because of this collinearity, you can sometimes choose just one of the pair to include in your analysis.</p>
<p>Another way to observe the relationship between variables is to use a scatter plot for two numeric variables. If it is one numeric and one categorical, you can use a bar chart to compare. These two blogs (<a href="https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python/notebook">blog1</a>, <a href="https://towardsdatascience.com/exploratory-data-analysis-eda-python-87178e35b14">blog2</a>) give some good examples.</p>
<h2 id="6-use--statistical-method-to-select-feature">6. Use statistical method to select feature</h2>
<p><img src="/images/blog49/10selectKb.PNG" /></p>
<p>Say you want the 10 most important features. Here the numeric features need to be imputed, and the categorical features need to be imputed and one-hot encoded, before the SelectKBest function can be used. Therefore, you need to put the numeric and categorical columns into separate lists.</p>
<p>The result gives some hints. It looks like fare is important here too, and sex is also an important factor.</p>
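<p>For completeness, here is one way to run SelectKBest on the Titanic data; the column choices and imputation strategies are mine and may differ from the notebook’s helper.</p>
<pre><code class="language-python">import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

num_cols = ["pclass", "age", "sibsp", "parch", "fare"]
cat_cols = ["sex", "embarked", "who", "alone"]

data = titanic[num_cols + cat_cols + ["survived"]].copy()
data[num_cols] = data[num_cols].fillna(data[num_cols].median())  # impute numeric columns
data = pd.get_dummies(data, columns=cat_cols)                    # one-hot encode categoricals

X, y = data.drop(columns="survived"), data["survived"]
selector = SelectKBest(f_classif, k=10).fit(X, y)
print(pd.Series(selector.scores_, index=X.columns).nlargest(10))  # 10 highest-scoring features
</code></pre>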
<p>I hope these functions can save you some time and guide you through EDA in a systematic fashion.</p>
<p>Of course, I may have missed some important parts here; please let me know.</p>
<p>The notebook can be viewed <a href="/Files/blog49_EDA_Titanic.ipynb">here</a>. The function file can be downloaded <a href="/Files/EDA_function.py">here</a>.</p>
<p>Thanks and keep safe.</p>
<p>Wenlei</p>