<p>Wenlei's Tech Blog - Wenlei Cao</p>

<h1>Shap Value for Single Record in Model Prediction</h1>
<p>2023-12-24 | <a href="https://wenleicao.github.io/Shap_Value_for_Single_Record">https://wenleicao.github.io/Shap_Value_for_Single_Record</a></p>

<p>Feature importance helps us understand which features play more important roles in a given model. With that knowledge, you can do feature selection, or you can cross-check the results against your business domain knowledge to validate the model.</p>
<p>In the past, most analyses have focused on feature importance at the model level. At the record level, however, it is difficult to describe what causes the model to make a particular prediction for a single record. One can make a guess, but a precise contribution from each feature is still lacking. In the business world, though, stakeholders are eager to know which features drive a prediction. Say our model predicts that a customer will cancel his insurance policy. If the model also tells us the actual cause, the customer service team can handle that customer accordingly. Overall, explainability matters a great deal in the business world.</p>

<p>The Python shap package can help us achieve this goal. I read quite a few blogs and found that most of them show the various fancy charts the package can produce, but few focus on single-instance feature importance.</p>

<p>Let us use the iris dataset to unveil the mystery of single-record feature importance and see if we can write a function that explains the prediction for a record, so that our life is easier next time.</p>

<p>First, let us set up an experiment. We start by importing packages and the iris data and creating a simple random forest classifier. Then we take a look at how SHAP explains the results.</p>
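<p>For readers following along without the screenshots, here is a minimal sketch of that setup. It assumes a TreeExplainer on the fitted random forest; variable names are illustrative rather than copied from the notebook.</p>

<pre><code class="language-python">
import shap
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# load iris into a DataFrame so SHAP can show feature names
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# TreeExplainer works for tree ensembles such as random forests
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)   # a list with one array per class

# overall feature importance across all classes
shap.summary_plot(shap_values, X_test)
</code></pre>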
<p><img src="/images/blog58/1summaryplot.jpg" /></p>
<p>We instantiate an explainer and use it to get the SHAP values. The summary_plot tells us that, among all 4 features, petal length and petal width play more important roles than sepal length and sepal width.</p>

<p>Let us take a look at what shap_values is. You can see it is a list with 3 elements.</p>
<p><img src="/images/blog58/1.5shap_value.png" /></p>
<p>If we dive into shap_values to understand its structure, we can see it is actually a nested array.</p>
<p><img src="/images/blog58/2nest_array.JPG" /></p>
<p><img src="/images/blog58/3shap_value.png" /></p>
<p>Each array has the same dimensions as our input dataset, indicating that SHAP produces one set of values for each class (3 for the iris data). <br />
Before we show the feature importance chart for one record, let us see what the data looks like for a given record.</p>
<p><img src="/images/blog58/4onesamplevalue.JPG" /></p>
<p>We know this record is classified as 0 by the model (see cell 18). We also print out its SHAP values for each class and the corresponding column names. You can see the first list gives larger numbers, except that sepal length (cm) is negative. Maybe that is why the prediction is 0 (the first element)?</p>

<p>The force chart for this record (below) confirms our assumption. The only difference is that SHAP puts negative values on the right. You can see that the number of positive values corresponds to the number of red segments, and the number of negative values corresponds to the number of blue segments. The final prediction is determined by which class produces the largest number. In this case, class 0 produces the value 0.95; therefore, the record is classified as 0.</p>
<p><img src="/images/blog58/5single_feature_importance_chart.JPG" /></p>
<p>Now that we understand the data structure of shap_values and how the decision is made, we can see how to extract the most important features for a given record.</p>
<p><img src="/images/blog58/6_get_feature_name.JPG" /></p>
<p>This is our target: we want to extract the top 2 most important features of record 12 for class 0.</p>

<p>Cell 19 gives us the SHAP values of record 12 for class 0.</p>

<p>In cell 20, we convert them to a pandas Series with the column names as its index, sort the values, and take the first 2 feature names, as shown here.</p>
<p>It will be cumbersome to do this exercise every time we have a new model. A function will be helpful.</p>
<p><img src="/images/blog58/7.create_function.JPG" /></p>
<p>Here I made a slight modification to the previous cell: instead of returning only the feature names, I also include the SHAP value for that class, so that people can see the feature weights. Showing the force chart is optional if the user wants to see it.</p>
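<p>A rough sketch of such a helper is below. It builds on the setup sketch above; the function name and signature are my own, not the ones in the notebook.</p>

<pre><code class="language-python">
def explain_record(shap_values, X, record_idx, class_idx, top_n=2,
                   show_force=False, explainer=None):
    """Return the top_n features (name and SHAP value) pushing one record toward one class."""
    row = pd.Series(shap_values[class_idx][record_idx], index=X.columns)
    top = row.sort_values(ascending=False).head(top_n)

    if show_force and explainer is not None:
        # optional force chart for the same record and class
        shap.force_plot(explainer.expected_value[class_idx],
                        shap_values[class_idx][record_idx],
                        X.iloc[record_idx],
                        matplotlib=True)
    return top

# top 2 features pushing record 12 toward class 0
# explain_record(shap_values, X_test, record_idx=12, class_idx=0)
</code></pre>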
<p>Let us do a test run. If we use the same record as before, we can see that it gives us the same result.</p>
<p><img src="/images/blog58/8verfication1.JPG" /></p>
<p>Let us see if we can use it to check other records. We list the first 20 records and pick the third one in the sequence (highlighted; notice I am not using the index of y_test, since it is shuffled in the train test split). This record is classified as 2.</p>
<p><img src="/images/blog58/9.another_rec.JPG" /></p>
<p>Let us first get all force charts for 3 categories.</p>
<p><img src="/images/blog58/10another_example.JPG" /></p>
<p>We pass in the parameters for this record: index = 2 and category = 2.</p>
<p><img src="/images/blog58/11result.JPG" /></p>
<p>We can see it indeed shows the same result as expected.</p>

<p>Thanks for following along. I hope this helps you understand shap_values a bit better.</p>

<p>Attached is the <a href="/Files/shap_explanibility2.ipynb">notebook</a> used in this post.</p>
<p>Wenlei</p>

<h1>Log errors in python with examples</h1>
<p>2023-09-30 | <a href="https://wenleicao.github.io/log_application_error_in_Python">https://wenleicao.github.io/log_application_error_in_Python</a></p>

<p>Most people use the print function to show variable values in Python for debugging. It works fine while you are in dev mode, but as time goes by, your code grows. It is cumbersome to have print statements everywhere in your code and to clean them up afterwards. Python has a built-in logging library, which can be leveraged to make our life easier.</p>

<p>One reason people do not use logging much is that it takes some effort to set up. Especially when handlers and formats are involved, people can get confused quickly. Normally, you get the log handler and format ready, then add them to the logger, so it is a multi-step process. Since Python 3.7, this has become easier because you can set up everything in the basicConfig function. It is still hard for me to remember every detail since I do not use it on a daily basis, so I am putting it down here to serve as a script template whose parameters can be changed later.</p>

<p>First, import the packages I will use in the example: logging for logging errors, datetime for timestamps, and os for file path manipulation.
I create the log file name dynamically so that later logs will not overwrite the previous ones.</p>
<p><img src="/images/blog57/1settingup.JPG" /></p>
<p>Here is the logger setting. Essentially, you put all the settings into basicConfig.</p>
<p><img src="/images/blog57/1.5logger.JPG" /></p>
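<p>A minimal sketch along those lines is shown below; the file name and format string are illustrative, not the exact settings in the screenshot.</p>

<pre><code class="language-python">
import logging
from datetime import datetime

# build the log file name from the current time so each run writes its own file
log_file = f"app_{datetime.now():%m-%d-%Y_%H%M}.log"

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler(log_file),   # write messages to the file
        logging.StreamHandler(),         # and echo them to the console/notebook output
    ],
)

logger = logging.getLogger(__name__)
logger.info("logger is ready")
</code></pre>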
<p>In this particular case, I set the log format plus a file handler and a stream handler. When I run it, I can see the message in both the notebook output and the file, as follows.</p>
<p><img src="/images/blog57/2fileoutput.JPG" /></p>
<p>I am running this in a Jupyter notebook, but if your code calls multiple Python modules, you might want to include logging in each module with the module name in the format. This post will help solve some issues you might encounter.</p>
<p><a href="https://stackoverflow.com/questions/50714316/how-to-use-logging-getlogger-name-in-multiple-modules">https://stackoverflow.com/questions/50714316/how-to-use-logging-getlogger-name-in-multiple-modules</a></p>
<p>The purpose of logging is to trace back errors should they happen, so it is important to be able to include the traceback info in the message.</p>

<p>Let us simulate a call stack to show a divide-by-zero error. I create an f1 function, then use this function in a try/except block and intentionally use 0 as the denominator, which throws an error that is captured in the except block. We use the logger.error function to show the error. This does show the divide-by-zero error, but it does not tell us where the problem is.</p>
<p><img src="/images/blog57/3showerror.JPG" /></p>
<p>If we add the additional exc_info=True param, it will include the traceback.</p>
<p><img src="/images/blog57/2.5showerror.JPG" /></p>
<p>The same thing can be achieved using logger.exception(e), which you can think of as the shorthand version. But notice the difference in the source file format.</p>
<p><img src="/images/blog57/4showerror.JPG" /></p>
<p>These did not specify the calling file, so if you are familiar with where the function is located, you are all set. But if you run this across different modules, you might also want to include stack_info=True, like the following.</p>
<p><img src="/images/blog57/5showerror.JPG" /></p>
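<p>Putting the variants side by side, a sketch of the pattern looks like this (the f1 function is just a stand-in that divides by its argument):</p>

<pre><code class="language-python">
def f1(x):
    return 10 / x          # raises ZeroDivisionError when x is 0

try:
    f1(0)
except Exception as e:
    logger.error(e)                                  # message only: what happened, not where
    logger.error(e, exc_info=True)                   # message plus the traceback
    logger.exception(e)                              # shorthand for the line above
    logger.error(e, exc_info=True, stack_info=True)  # traceback plus the full call stack
</code></pre>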
<p>I hope this post gives you a quick start with the logging library. It is a powerful tool.</p>

<p>The notebook can be found <a href="/Files/test_logging.ipynb">here</a>.</p>

<p>Thanks,</p>
<p>Wenlei</p>

<h1>Retrieve Data from Teradata stored procedure into DataFrame in Python</h1>
<p>2023-08-27 | <a href="https://wenleicao.github.io/Retrieve_Data_from_Teradata_stored_procedure_into_DataFrame_in_Python">https://wenleicao.github.io/Retrieve_Data_from_Teradata_stored_procedure_into_DataFrame_in_Python</a></p>

<p>Teradata has a smaller user base than databases like Oracle and SQL Server, which means there are fewer resources to be found online. In this post, I will share my research on how to use a stored procedure to output data and retrieve it into a dataframe on the Python end.</p>

<p>To protect privacy, I black out the datalab portion of the table and stored procedure names; you can add your own if you want to repeat this.
First, I create a stored procedure that outputs a dataset based on a parameter value.</p>
<p><img src="/images/blog56/stored_proc1.PNG" /></p>
<p>In this stored procedure, I pass in a major parameter in row 1. This parameter is used at row 7 to filter a student table. If I use major = All, it gives me all rows; if I use a particular major, it gives me the students for that major. Notice that in Teradata you need to create a cursor for this purpose and open the cursor to be able to retrieve the rows.</p>
<p>Let us test the stored proc.</p>
<p><img src="/images/blog56/verify1.PNG" /></p>
<p>We run row 17; this gives all 4 records from the student table (using filter value = All).</p>
<p><img src="/images/blog56/verify2.PNG" /></p>
<p>When we use major = Business, we only get two records, so the stored procedure works properly. <br />
Now let us see how we get the records into a dataframe.</p>

<p>The output of the stored procedure is a little messy, so we first create a function to format it, which lets us see how the data is structured.</p>
<p><img src="/images/blog56/3.5function.JPG" /></p>
<p>In a Jupyter notebook, let us make a connection to Teradata using the teradatasql library.</p>
<p><a href="https://pypi.org/project/teradatasql/">teradatasql</a></p>
<p><img src="/images/blog56/4jupyter_show_stored_proc_result.JPG" /></p>
<p>I can use it to get the business students' info by passing filter major = Business. Please note the format: parameters for the stored procedure need to be in a list or tuple. Also notice that the output has more than one result set; the actual data comes from the 2nd one.</p>

<p>To get the data into a dataframe, the following code gets the rows and column names from the stored procedure output (rows 6 and 7) and then casts the data into a dataframe. Before that, row 4 gets all result sets, and row 5 filters them to keep only the ones that contain values.</p>
<p><img src="/images/blog56/5jupyter_show_stored_proc_result.JPG" /></p>
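<p>A sketch of the whole round trip is below. The host, credentials, and stored procedure name are placeholders, and I am using the DB-API nextset() call to walk through the multiple result sets, which is how I would expect the driver to expose them; adjust to your own setup.</p>

<pre><code class="language-python">
import pandas as pd
import teradatasql

with teradatasql.connect(host="mydbhost", user="myuser", password="mypassword") as con:
    with con.cursor() as cur:
        # stored procedure parameters must be passed as a list or tuple
        cur.execute("CALL mydb.get_students_by_major(?)", ["Business"])

        # the call can return more than one result set; keep the ones with rows
        frames = []
        while True:
            rows = cur.fetchall()
            if rows:
                cols = [d[0] for d in cur.description]
                frames.append(pd.DataFrame(rows, columns=cols))
            if not cur.nextset():
                break

df = frames[-1]   # the student records come from the data-bearing result set
print(df)
</code></pre>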
<p>The other issue I came across is that Teradata has two session modes, ANSI and Teradata. If Python complains about a session mode problem, try changing the setting to default, which helped me solve the issue (mine was using ANSI).</p>

<p>Hope this helps.<br />
Thanks,</p>
<p>Wenlei</p>

<h1>Use Docker to Operationalize the Data Science Prediction</h1>
<p>2023-06-19 | <a href="https://wenleicao.github.io/Use_Docker_to_Operationalize_the_Data_Science_Prediction">https://wenleicao.github.io/Use_Docker_to_Operationalize_the_Data_Science_Prediction</a></p>

<p>There are many challenges in machine learning work. In terms of the most time-consuming parts, one is data preparation and the other is operationalizing the model. If you have the right data, building the model is actually the relatively easy part of the whole process.
Let us talk about operationalizing the model. I have used batch files (<a href="https://wenleicao.github.io/Poor_man_automation">link</a>) to handle R and Python processes and used the task scheduler to schedule a job on a virtual machine in the cloud. But that is probably not the best way to go, for the following reasons:</p>
<ul>
<li>It is hard to switch to a different virtual machine. In other words, the portability is poor.</li>
<li>You are limited to a Windows machine, whereas the majority of servers in the cloud use a Linux-based operating system.</li>
</ul>
<p>Here I want to explore the possibility of using Docker to handle data science operations. It is a modern technology with great potential to solve the issues listed above.</p>

<p>Docker is a container technology that packages all the artifacts for your application in one place so they can be moved to the host machine (a cloud server). Your application will run without you worrying about the operating system, hardware, and so on. This fits in with all cloud service
vendors such as AWS, Azure, and GCP. It uses the same concept as containers in the shipping industry.</p>

<p>The majority of Docker materials and tutorials online use Nginx or Flask, because the end product is a web app and you can easily see the result by browsing the web page. In my case, I would like to use Docker for data science purposes. Unfortunately, I did not find a lot of resources online, but the following sites helped me a lot.</p>
<ol>
<li>The Docker tutorial site is good for getting started, but it is not specific to data science.</li>
</ol>
<p><a href="https://docs.docker.com/get-started/">https://docs.docker.com/get-started/</a></p>
<ol>
<li>The following site helped me use Docker for data science, but the content is a bit dated (the code no longer works) and the methods are oversimplified for a real-life project.</li>
</ol>
<p><a href="https://mlinproduction.com/docker-for-ml-part-1/">https://mlinproduction.com/docker-for-ml-part-1/</a></p>
<p>I want to achieve the following goals:</p>
<ul>
<li>Being able to build customized docker images for data science purposes.</li>
<li>Have the training and inference processes get data from a database (this is common in real life).</li>
<li>Try Docker Compose when multiple containers work together (to simplify the configuration step).</li>
</ul>
<p>To run Docker in a Windows environment, you will need the following setup.</p>
<ol>
<li>Install Docker Desktop, and remember to also check the option to install the Windows Subsystem for Linux (WSL2). This essentially allows you to have a Linux dev environment in Windows.</li>
<li>Install the Ubuntu app from the Windows store so you can use a Linux shell to interact with Docker Desktop. It is the common language of the container world.</li>
<li>In Docker Desktop, configure Ubuntu to integrate with Docker (WSL Integration, see below), so you can use the Ubuntu shell to interact with Docker.</li>
<li>Install VS Code and add the WSL extension, so I can use VS Code to view Linux subsystem files and create the dockerfile or compose YAML file.</li>
</ol>
<p><img src="/images/blog55/intergate_shell_with_docker.JPG" /></p>
<p>Let us take baby steps.</p>
<p>As a first step, I want to replicate what Luigi did, but minor modifications are needed to make it work.</p>

<p>I will still use Luigi Patruno's Boston housing price data for this exploration. But since he published the blog, the dataset source has changed and the original code no longer works, so I will need to make changes to it.</p>

<p>In addition, Luigi used a Jupyter notebook image as his base image, which is bulky (over 1 GB). For some reason, I could not launch a Jupyter notebook by connecting to the container from the host, even though the container ran without issue and I entered the token as suggested. Therefore, I switched to a Python image instead: the python 3.8 image from Docker Hub (about 500 MB). It might be fine to build from something smaller such as Alpine, but I leave that for the future.</p>
<p>The folder contains the following files.</p>
<p><img src="/images/blog55/first_folder_structure.JPG" /></p>
<p>This is the content of the dockerfile, which contains the instructions to build an image as a blueprint for a container.<br />
dockerfile<br />
<img src="/images/blog55/first_dockerfile.JPG" /> <br />
Row 1: get the python:3.8 image from Docker Hub<br />
Row 4: create a folder under home, forming the home/wcao/model structure<br />
Row 5-7: create environment variables to be used in other files; these are paths and file names<br />
Row 10: install packages that are not included in Python's core library<br />
Row 11: copy all files from the current folder into the image<br />
Row 13: run python3 train.py while building the image</p>
<p>Train.py<br />
<img src="/images/blog55/first_train.JPG" /> <br />
Row 1-11: import the necessary packages.<br />
Row 17-21: use environment variables to form the model and metadata paths used later.<br />
Row 27-36: get the training data. Because each observation is wrapped onto a 2nd row in the source file, there is some data manipulation between rows 31-33 using np.hstack (see the data-loading sketch after this description); the data is then divided into train and test.<br />
<img src="/images/blog55/first_train2.JPG" /><br />
Row 40-43: set up a regressor.<br />
Row 45-50: fit the regressor with the training data and compute the train and test MSE. Save these MSE values into the metadata.<br />
Row 55-60: dump the model to a file and also save the metadata.</p>
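<p>For reference, the relocated Boston housing data is usually stitched back together like this; treat the URL and slicing as an assumption about what the training script does, since the screenshot is what is authoritative here.</p>

<pre><code class="language-python">
import numpy as np
import pandas as pd

# each observation is spread across two physical rows in the raw file,
# so the records have to be re-assembled with np.hstack
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])   # 13 feature columns
y = raw_df.values[1::2, 2]                                        # median home value
</code></pre>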
<p>Inference.py<br />
The data steps are similar, so I am going to skip them.<br />
<img src="/images/blog55/first_inference.JPG" /> <br />
Row 41-44: load the trained model from the file generated during training. <br />
Row 46-47: use the model to predict and print out the predictions.</p>
<p>With the files in the folder, let us see how we use Docker to run it.<br />
In the Ubuntu shell, navigate to the folder containing the code.<br />
<img src="/images/blog55/rebuilt.JPG" /></p>

<p>Here I use the build command, giving python-ds as the image name. Don't forget the dot, which indicates building from the current folder, i.e., using the dockerfile in the current folder.</p>

<p>Once it is built, I can use the docker run command to instantiate a container based on the image.</p>
<p><img src="/images/blog55/test_rebuit_using_training2.JPG" /></p>
<p>In the docker run command, we use the cat command to show the content of metadata.json, which is generated during the train.py run. It prints out the two MSE values as below, which indicates the training worked as expected.</p>

<p>Next, we will see if inference is working.</p>
<p><img src="/images/blog55/inference_working_with_new_image.JPG" /></p>
<p>We can run inference.py and get all the predictions (10% of the original data), so it works fine.</p>

<p>The first step worked as expected. <br />
Let us add some challenges to our process. In real life, we often need to query databases to get new data for training and inference.</p>

<p>Let us get data from a database; here I use MySQL to simulate where our source data comes from. In dev, it is acceptable to have a database container. In prod, however, it is recommended to use a managed cloud database, due to the ephemeral nature of containers. But it is easy to switch the connection, and the process is quite similar.</p>
<p>The folder structure of the files.</p>
<p><img src="/images/blog55/second_structure.JPG" /></p>
<p>Since we have a container for data science and another container for MySQL, using docker-compose.yml is a good way to simplify the configuration process. It will create both containers within the same network.</p>

<p>Let us take a look at what is inside the docker-compose.yml.</p>
<p><img src="/images/blog55/second_docker_compose.JPG" /></p>
<p>I have two services in this compose file.<br />
Row 3-6: I create a data science service named ds_container. I build an image from the current folder (I will need to run commands from the current folder) and give the container the name data_science_container.<br />
Row 8-14: create the mysql_db container from the mysql image on Docker Hub. We map a host volume to the MySQL volume (the volume persists data even if the container is destroyed) and set some environment variables, such as the database to use and the password.</p>
<p><img src="/images/blog55/second_dockerfile.JPG" /></p>
<p>I add a couple more commands:<br />
Row 11: in order to deal with the database, I add the sqlalchemy and mysql-connector-python packages.<br />
Row 16: this keeps the container standing by instead of exiting, since I will run other commands against it.</p>
<p>Train.py did not change.</p>
<p>Inference.py: I get the data from a SQL query, so I replaced the data extraction part.</p>
<p><img src="/images/blog55/second_inference.jpg" /></p>
<p>Notice that here I use sqlalchemy to create the MySQL engine, and notice the special format of the connection string:</p>

<p>database+connector://user:password@host(defined in the compose file)/database</p>

<p>Using the connection, we can get a dataframe from read_sql. Here we get data from the boston_housing_price table. We have not created this table yet; once we stand up the container, we can create it inside the container.</p>
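<p>A sketch of that connection, with placeholder credentials (the host alias mysql_db is the service name from the compose file, and the password matches the environment variable set there):</p>

<pre><code class="language-python">
import pandas as pd
from sqlalchemy import create_engine

# database+connector://user:password@host/database
engine = create_engine("mysql+mysqlconnector://root:secret@mysql_db/mysql")

new_data = pd.read_sql("SELECT * FROM boston_housing_price", con=engine)
</code></pre>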
<p>See this blog for more on mysql connection<br />
<a href="https://planetscale.com/blog/using-mysql-with-sql-alchemy-hands-on-examples">https://planetscale.com/blog/using-mysql-with-sql-alchemy-hands-on-examples</a></p>
<p>Let us give it a shot.</p>
<p><img src="/images/blog55/compose_up.JPG" /></p>
<p>Using docker compose up, we can see the two services start. I used -d for detached mode, so the containers run behind the scenes. --remove-orphans is optional; I needed it because I had some other containers with the same name running.</p>
<p><img src="/images/blog55/docker_run.JPG" /></p>
<p>Using docker ps, we can see the names of the running containers.</p>

<p>From here, we know the MySQL container's name is code-mysql_db-1. We can get into the MySQL container to create a table and insert some records for testing.</p>
<p><img src="/images/blog55/getinsidecontainer.JPG" /></p>
<p>Here we use docker exec to run commands against an existing running container. -it is used when you want to get inside a container (i: interactive, t: tty). code-mysql_db-1 is the container name, mysql -p runs the mysql command, and -p means it asks for the password.
After entering the password (we set it to secret in the compose file), we get a mysql prompt. (In real life we would need to encode the password; see the example in this <a href="https://www.dataknowsall.com/postgres.html">blog</a>.)</p>
<p><img src="/images/blog55/create_table.JPG" /></p>
<p>We use the mysql database and create a boston_housing_price table with the columns we need. <br />
Let us assume 3 new records come in that we need to predict.</p>
<p><img src="/images/blog55/table_insert_value.JPG" /></p>
<p>We can insert the 3 records like so and check the values in the table. We can then type exit, and we are out of the container. We only use 3 records here, so we should only get three predictions.</p>
<p>Now let us score the data</p>
<p><img src="/images/blog55/docker_score.JPG" /></p>
<p>Since the container is already running, I use exec, and it gives me three score values. So we can successfully get predictions using Docker Compose.</p>
<p>Once it is done, we can use docker compose down to remove all containers.</p>
<p>With containers, it is convenient and easy to deploy anywhere, as long as I have these files in a folder.</p>

<p>As a next step, we could use Docker Swarm or K8s for orchestration, but that is out of the scope of this post.</p>

<p>Thanks for following along. Here is the <a href="/Files/blog55.zip">folder</a> containing the files you need if you want to repeat this.</p>
<p>Wenlei</p>

<h1>Dynamic SQL and Cursor in Teradata</h1>
<p>2023-06-04 | <a href="https://wenleicao.github.io/dynamic_sql_cursor_in_teradata">https://wenleicao.github.io/dynamic_sql_cursor_in_teradata</a></p>

<p>Dynamic SQL is a SQL statement formed at run time. You can use variables to increase the flexibility of SQL, but variables won't help on certain occasions, e.g., when you want to change the table name or column name of a script at run time. This is where dynamic SQL shines.</p>

<p>Cursors in SQL have been ditched as inefficient old technology in favor of set operations, because a cursor operates row by row. But a cursor is very useful when you deal with a handful of repetitive jobs because of its flexibility. Thinking of a cursor as a foreach loop might help you understand it.</p>

<p>You can achieve amazing results if you combine dynamic SQL and a cursor. Let us take a look at a real-life example.</p>

<p>I was tasked with converting some R code into SQL. The R code saved data into a dataframe and handled it with the dplyr package. In one line of code, you can use lapply to trim every column's whitespace for a given dataframe. But it is not that easy to do in Teradata.</p>

<p>I am not an expert on Teradata, but I will give it a shot. The least techy way would be to run a trim function for every char/varchar column in a table. But what if you have 5 tables with a total of 100 columns? What if the columns change in the future? Manually maintaining that code would be a disaster.</p>

<p>I am thinking of using dynamic SQL and a cursor because we can pass a list of column names to a cursor. The cursor then releases the column names one at a time to dynamic SQL, which executes an update statement with trimmed values. To get the list of char/varchar columns, we can query the metadata system view dbc.ColumnsV, so even if the table changes, we are still good.</p>

<p>I have been complaining about Teradata's documentation: it is difficult to find example code that fits your needs. Of course, I also need to boost my searching skills.</p>
<p>I feel the following two sites are helpful for Teradata users if you cannot google what you need.</p>
<p><a href="https://docs.teradata.com/">https://docs.teradata.com/</a> use search function here to locate some examples<br />
<a href="https://dbmstutorials.com/random_teradata/teradata-dynamic-statements.html">https://dbmstutorials.com/random_teradata/teradata-dynamic-statements.html</a></p>
<p>The second site gives some actual Teradata scripts. Although in its case the dynamic SQL is used to create or insert into tables, it is close enough to our purpose.</p>
<p>I use the script from the 2nd link as a template to create my own stored procedure and create a fake table to test the stored procedure.</p>
<p>Teradata syntax is quite different from other RDBMSs. Honestly, I don't remember all of that syntax unless I am using it on a daily basis, so if you are in the same boat, I suggest using a template and modifying the code based on your needs.</p>

<p>It is time to get our hands dirty; let us go over the code. BTW, I masked the database name for privacy; you just need to change it to yours.</p>
<p><img src="/images/blog54/stored_proc1.PNG" /></p>
<p>Line 1-2: create the stored procedure and pass in two params, dbname and tablename, because I need to use it for different tables and maybe a different database in the future.<br />
Line 4-6: define a few local variables to be used later.<br />
Line 14-15: the FOR loop starts here, in case there are multiple columns to loop through.<br />
Line 19-22: get the column list from the system view based on the dbname and tablename parameter values; notice the syntax with ":".</p>
<p><img src="/images/blog54/stored_proc2.PNG" /></p>
<p>Lines 24-33 use the column name to do the work.<br />
First, get the cursor value with the SET statement at line 27. Notice the syntax cursor.columnname. You should be able to use it directly; assigning it just makes things simpler later with a shorter variable name.<br />
Line 28 forms the SQL statement at run time.<br />
Line 32 executes the update statement.</p>
<p>Let us give a test run.</p>
<p><img src="/images/blog54/verify1.PNG" /></p>
<p>Lines 1-9 create a fake table with various leading and trailing spaces; I just want to see if the stored procedure works.<br />
By querying the table, we can see there are spaces that need to be removed.</p>
<p><img src="/images/blog54/verify2.PNG" /></p>
<p>Line 14 calls the stored procedure we created and passes in the dbname and tablename. <br />
Then select the table again: you can see all values have been trimmed nicely.</p>

<p>Next time, you can just pass in any db and tablename without worrying about how many columns they have or what changes will be made to those tables later. Your heart is free now :).</p>

<p>As always, the code can be downloaded <a href="/Files/blog54.zip">here</a>.</p>
<p>thanks!</p>
<p>Wenlei</p>

<h1>Poor man's automation</h1>
<p>2023-02-26 | <a href="https://wenleicao.github.io/Poor_man_automation">https://wenleicao.github.io/Poor_man_automation</a></p>

<p>One bottleneck of machine learning is operationalizing the model you build. Let us say you are satisfied with the model performance. Now the question is how you link your data engineering step with your model prediction and then save the prediction somewhere. Above all, you need to make it run automatically, so that you can do something more important.</p>

<p>I have been a BI developer on the Microsoft platform. ETL tools such as SSIS are well capable of doing data extraction. Besides that, SSIS also has an Execute Process task, with which you can run other applications such as Python and R. You can put tasks in sequence so that they are carried out according to your design. Alternatively, if you are allowed to use containers, you might be able to recreate your entire environment in Docker and run it in the cloud.
What if you are in an environment where the more advanced technology is not available to you yet? Luckily, we have a very old friend, the command line and batch files, which we can use to automate the process. Because it is old, the majority of software supports it. You can run Python and R from the command line, which means you can put that in your batch file.</p>

<p>I have a data science project in which the source data comes from a MicroStrategy report and is extracted using the R mstrio package (step 1). Some feature engineering takes place in R (step 2). More features are brought in from a variety of sources with Python; the data goes through merging and transformation, is fed to the model, and the prediction is finally saved in AWS S3 (step 3). You can see the different technologies used, and we are able to automate it with a few batch files.</p>

<p>I will share the structure of my batch file implementation, but I will not go into the very basics; people should be able to do some research themselves.</p>
<p>This is what I want to achieve:</p>
<ol>
<li>I can just run master.bat to get the prediction, with no other commands needed. You can use the Windows task scheduler to schedule the job later on.</li>
<li>I want to modularize the child processes. If a problem happens at the child process level, I do not need to touch master.bat.</li>
<li>I want to save job logs into text files, so in case I need to check what is wrong, I can look at those for troubleshooting.</li>
<li>The process needs error handling. If there is an error, the batch file will report it.</li>
</ol>
<p><img src="/images/blog53/1filename.PNG" /></p>
<p>Row 1: the @echo off command suppresses verbose output; by default, the shell repeats your commands in the output.</p>

<p>Row 2-4: automatically use the system time to generate a date-time file name, e.g., step1-2_09-25-2023_1928.txt.</p>

<p>Notice that %xxx% is the variable syntax in a batch file. Some are system variables like %CD% and %TIME%.<br />
You can pull system variable values directly. When you need to create a variable yourself, you need to follow a format like row 14; then you can use it like row 15 to show the value.</p>
<p><img src="/images/blog53/2path.PNG" /></p>
<p>Row 12: when you schedule a job via the task scheduler, this changes the directory to the location of the currently running file (otherwise, it would not know where it is).</p>

<p>Row 14: the parent folder of the running file is the code folder, so I save this path. It makes it easier to change to this path later.</p>

<p>Row 19: I get the directory one level up, because I want to save the log file in the output folder, which is at the same level as the code folder.<br />
Row 24: I build the file paths for the two log files.</p>
<p><img src="/images/blog53/2brunfile.PNG" /></p>
<p>Row 29-32: change directory to the code folder; at 31, call the child bat, step1-2.bat, to start step 1-2 (see below for detail). Because both step 1 and step 2 use R, I combine them into one step. The ">" is used to send the output to the log file whose path was created at row 24. If there is an error, it goes to the error handler at row 49.</p>

<p>Row 34-38: call the step3 child bat and output the log to the file path generated previously.</p>

<p>Row 40: cleanup.bat moves all generated files into the archive folder.</p>

<p>If every step before row 46 succeeds, it prints successful and then exits without error.</p>

<p>Row 49-52: if a previous step has an error, this block of code is called; it shows Failed, kills the process, and then exits with an error.</p>

<p>When row 31 runs, the child bat file step1-2.bat is called. The content of this child bat file is as follows.</p>
<p><img src="/images/blog53/3process_R.PNG" /></p>
<p>Row 3: specify the R application path.</p>

<p>Row 8: under this path, there is a file named Rscript.exe, which you can use from the command line to run step1-2.R.</p>

<p>When row 36 runs, it opens step3.bat (shown below) and runs it.</p>
<p><img src="/images/blog53/3bprocess_python.PNG" /></p>
<p>Row 4: create a time variable, mydate, which looks like 2023-02-15; its value needs to change based on the current month.</p>

<p>In this case, my Python runs under an Anaconda environment, so I need to run rows 8-9 to activate the environment.</p>

<p>Row 17: we use python to execute step3.py. "test" and %mydate% are two parameters that I pass to the Python script. In the Python script, you can use sys.argv[1] to get the value of the first parameter and sys.argv[2] for the second one. This way, you can pass a variable value to the Python script when the date needs to change every month, as in the following Python snippet.</p>
<p><img src="/images/blog53/4passparaminpython.PNG" /></p>
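<p>On the Python side, reading those two arguments is just a couple of lines; this is a minimal sketch rather than the actual step3.py.</p>

<pre><code class="language-python">
# step3.py (sketch): read the two arguments passed in from the batch file
import sys

run_mode = sys.argv[1]   # "test" in the example above
run_date = sys.argv[2]   # the %mydate% value, e.g. "2023-02-15"

print(f"running in {run_mode} mode for {run_date}")
</code></pre>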
<p>This is probably the way to complete automation tasks with minimal tech resources, but I don't think it is best practice. If we have to move it to the cloud, you will need to create a virtual machine and install identical R and Python environments to be able to run this, which takes a long time. If you switch to a different VM, you will need to create yet another environment.</p>
<p>The better approach would be to create a container that you can move anywhere.</p>
<p>But if you are on a budget, this might be something you can try.</p>
<p>Thanks. Hope this is useful.</p>
<p>Wenlei</p>

<h1>How to Evaluate Your Machine Learning Model</h1>
<p>2022-11-12 | <a href="https://wenleicao.github.io/How_to_Evaluate_Your_Machine_Learning_Model">https://wenleicao.github.io/How_to_Evaluate_Your_Machine_Learning_Model</a></p>

<p>Given a scenario, after a few days of hard work, you have a model built. Are you done?<br />
Unfortunately, it is not the end of your data science project; it is the start of many other things!</p>

<p>I need to validate a few things (just my list, not an exhaustive one) so that we know the model is good.</p>
<ol>
<li>Besides a good performance metric score, what are the important features that support the prediction? We can cross-check with business knowledge and ask ourselves: does that make sense?</li>
<li>It would be good to check visually whether the model is overfitting or underfitting.</li>
<li>If the data is imbalanced, what is the optimal threshold to choose?</li>
<li>Other useful charts to check model performance.</li>
</ol>
<p>It is easier to explain with an example, so let us use the previous project to explore the questions we listed. For details of the previous project, the link is <a href="https://wenleicao.github.io/How_to_Handle_Textual_Features_along_with_Other_Features_in_Machine_Learning/">here</a>.</p>

<p>First, I need to import the pickle file I saved from the previous project. To do that, I need to import all the packages, custom classes, and functions used, as I know it will run into errors without those.</p>
<p><img src="/images/blog52/1import_object.PNG" /></p>
<p>In cell 8, I import the sklearn model_selection object. When you look at the imported object, it shows the pipeline details.</p>
<h2 id="1-feature-importance">1. feature importance</h2>
<p>In some industries, model explainability is crucial. Imagine we built an insurance pricing model but cannot explain how the model works. When we submit the model to the state regulator for approval, we will fail, because regulators need to know why we increase insurance premiums. Over the past 20 years, there have been several important machine learning frameworks, such as scikit-learn and the deep learning frameworks. Depending on the model, scikit-learn generally provides model properties that reveal feature importance, whereas deep learning frameworks are famous for being black boxes, due to their hidden layers, despite fairly high performance.
Since I used scikit-learn in my previous blog, I will focus on some of the ways to do this in scikit-learn.</p>

<p>In Dr. Brownlee's blog, he illustrates three important ways to get feature importance.</p>
<p><a href="https://machinelearningmastery.com/calculate-feature-importance-with-python/">https://machinelearningmastery.com/calculate-feature-importance-with-python/</a></p>
<ol>
<li>Coefficients as feature importance, if it is a linear model</li>
<li>Tree-based feature importance (decision tree, random forest, XGBoost, et al.)</li>
<li>Permutation feature importance (pass scrambled predictors to the model and measure the performance drop to get the importance)</li>
</ol>
<p>The optimized model in my previous project is logistic regression. It is a linear model, so let us use the first method, i.e., the coefficient method. We can use named_steps to retrieve the model directly from the sklearn model_selection object. Using its coef_ property, we can see 153 features; each coefficient indicates how much impact that feature has. Please note that your data needs to be scaled to a similar level. In my case, all values are scaled between (-1, 1), so the impact of each feature is relatively comparable.</p>
<p><img src="/images/blog52/1.5coefficient.PNG" /></p>
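<p>A sketch of pulling those coefficients out, assuming the pickled object (called search here) is the fitted search object and the classifier step is named "logisticregression", which is the name make_pipeline would generate:</p>

<pre><code class="language-python">
# search: the fitted RandomizedSearchCV object loaded from the pickle (name assumed)
best_pipeline = search.best_estimator_
clf = best_pipeline.named_steps["logisticregression"]

coefficients = clf.coef_[0]   # one coefficient per transformed feature
print(len(coefficients))      # 153 in this case
</code></pre>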
<p>With the coefficients in place, we need to map the feature names to them.</p>
<p>In my previous blog, I have numeric features, categorical features and text features.</p>
<p><img src="/images/blog52/2Features.PNG" /></p>
<p>num_column, categorical_column, Title in txt_column, and 'Review Text' in txt_column pass through pipeline-1 to pipeline-4 within the column transformer, respectively. After the transformation, the results are in sparse matrix format. Unlike a dataframe, this format does not have column names. So in order to know the feature importance, we have to retrieve the feature names for columns that went through something like one-hot encoding, which expands one column into multiple columns.</p>

<p>Let us see how we get the feature names for the sparse matrix.<br />
First of all, they follow the sequence in which you set the pipelines up (pipeline 1 to pipeline 4).</p>
<ul>
<li>num_column: just simple imputing and scaling; the number of columns has not changed.</li>
<li>Categorical column: it has the column-expanding transformer, OneHotEncoder. Luckily, scikit-learn provides attributes such as named_steps and named_transformers_ (see cell 12, line 5) so that you can navigate to the onehotencoder step. Then you can use get_feature_names() to get all the feature names (not shown in the screenshot because it is too wide, but it is in the notebook). Notice that scikit-learn renames the three variables (Division, Department, Class) to X0-X2.
By the way, displaying the pipeline as in cell 11 is helpful for navigating complicated transformations.</li>
<li>For the text columns, I divide Title and "Review Text" into two pipelines (3 and 4). Text is tokenized into words (expanding the columns). We select the 20 and 100 most important words respectively with the SelectKBest transformer (the column count changes here too). These words are later used as column names.</li>
</ul>
<p><img src="/images/blog52/2Features2.PNG" /></p>
<p>Here we first get the word indices from the SelectKBest step, then use the indices to get the actual words from the previous step, CountVectorizer. There are 20 for Title, so I loop 20 times and rename each column with the prefix 'title' to distinguish these words from the 'review' ones. For Review Text, it works similarly; just change 20 to 100.</p>
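<p>A sketch of that lookup for the Title pipeline; the step and transformer names ("columntransformer", "pipeline-3", and so on) are assumptions about how the column transformer was assembled, so adjust them to whatever your pipeline display shows.</p>

<pre><code class="language-python">
# navigate to the Title text pipeline inside the column transformer
preprocessor = best_pipeline.named_steps["columntransformer"]
title_pipe = preprocessor.named_transformers_["pipeline-3"]

# full vocabulary from CountVectorizer, and the indices SelectKBest kept
vocab = title_pipe.named_steps["countvectorizer"].get_feature_names()
keep_idx = title_pipe.named_steps["selectkbest"].get_support(indices=True)

# prefix with "title_" to distinguish these words from the review words
title_features = ["title_" + vocab[i] for i in keep_idx]
</code></pre>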
<p>Next, I piece together all the feature names and combine them with coefficients.</p>
<p><img src="/images/blog52/3mapfeture_coeffient.PNG" /></p>
<p>Let us see what those important players are.</p>
<p><img src="/images/blog52/4top_feature_importance_result.PNG" /></p>
<p>I list the top 10 and bottom 10 features. Notice the bottom 10 are also important; they just push the prediction toward the negative target (not recommended, target = 0).</p>

<p>Reviewing those features, they make much more sense to me. Rating is definitely positively correlated with recommended (target = 1). I don't see the categorical features playing an important role here; probably they are neutral. The rest of the top 10 features are mainly good words, and the bottom 10 features are mainly bad words. The only question here is title_wanted in the bottom 10, meaning the word "wanted" in the title. I would think this is a positive word, but maybe I don't understand fashion: in the fashion world, if other people want it, it might not be a good thing. It is worth taking time to look at the original context to see if this is the case and make further adjustments.</p>

<p>If we visualize it with a bar chart:</p>
<p><img src="/images/blog52/5feature_importance.PNG" /></p>
<p>These analyses are all at the detailed level. I often ask myself: what if we treat the expanded categorical/text features as one feature each? As a whole, what would the feature importance landscape be?</p>

<p>For that, we can use the permutation feature importance calculation. The idea is as follows.</p>

<p>We already have a model and know what its performance is. Now, if I scramble/shuffle one feature's values and pass the dataset to the model, you will see the performance drop. If that particular feature is important, the drop is larger.</p>
<p>The following is the process. I modified part of the code from this <a href="https://towardsdatascience.com/from-scratch-permutation-feature-importance-for-ml-interpretability-b60f7d5d1fe9">blog</a>.</p>
<p><img src="/images/blog52/6permutaion_calculation.PNG" /></p>
<p>Before this section, I didn't need the source data, because all model-related data is pickled and can be retrieved from the pickle. Since I need to do some data scrambling, I re-import the data and make a prediction at cell 26, using f1 as the performance metric. We get f1 = 0.94 without scrambling. Now, in cell 39, we loop through each column; at row 12, we shuffle the values. We recalculate the f1 score and compare it with the baseline at row 19. Then we put all the changes into a dataframe and sort it.</p>
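<p>The loop itself is short; a sketch of it is below, with search, X_test, and y_test standing in for the fitted search object and the re-imported test data.</p>

<pre><code class="language-python">
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

baseline = f1_score(y_test, search.predict(X_test))   # 0.94 without scrambling

drops = {}
for col in X_test.columns:
    X_shuffled = X_test.copy()
    # scramble one raw column at a time and measure the drop in f1
    X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
    drops[col] = baseline - f1_score(y_test, search.predict(X_shuffled))

importance = pd.Series(drops).sort_values(ascending=False)
</code></pre>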
<p><img src="/images/blog52/6permutaion_results.PNG" /></p>
<p>From the result, you can still see Rating is the most important feature. The two text features still rank 2 and 3, but value-wise they are not as important as at the detailed level. This makes sense, because each contains both positive and negative words, which might balance out the impact as a whole.</p>

<p>There is a popular feature importance package called SHAP. The difference between permutation importance and SHAP is that the former is determined by the drop in a performance metric, while the latter is the magnitude of feature attributions. SHAP can also explain deep learning models, so check it out.</p>
<p><a href="https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html">https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html</a></p>
<h2 id="2-model-quality">2. Model quality</h2>
<p>Generally, if you see your model perform well on your test dataset, it is a good sign. But it is reassuring to have a chart showing the learning curve.
In Marcelino's blog, he has two functions which I think are useful.</p>
<p><a href="https://www.kaggle.com/code/pmarcelino/data-analysis-and-feature-extraction-with-python/notebook">https://www.kaggle.com/code/pmarcelino/data-analysis-and-feature-extraction-with-python/notebook</a></p>
<p>The functions can be found in my notebook as well; the download link is listed below.</p>
<p><img src="/images/blog52/7learning_curve.PNG" /></p>
<p>Basically, if your learning score is high, there is no underfitting. If you don't see an obvious gap between the two curves, there is no overfitting. In my case, I don't see obvious underfitting or overfitting.</p>
<p><img src="/images/blog52/8validation_curve.PNG" /></p>
<p>The second function is used to check a hyperparameter. You can see that at 10e-1, the two curves start to separate. It looks like the hyperparameter C should be chosen at 10e-1 or lower, which is consistent with the best param in the previous blog.</p>
<h2 id="3-optimize-the-threshold">3. optimize the threshold</h2>
<p>When doing classification, the algorithm gives a probability for each record, and the final label is assigned by comparing the probability with a threshold. By default, the threshold is 0.5, but that is not always the best choice. I have seen the optimal threshold at 0.05 for some imbalanced datasets. How do you find the optimal threshold? The idea is to put your predicted probabilities together with your target, plot them, and find the threshold that separates them best.</p>
<p><img src="/images/blog52/9threshold1.PNG" /></p>
<p>First, you use the predict_proba function to get the probabilities.</p>
<p><img src="/images/blog52/9threshold2.PNG" /></p>
<p>Second, you list the probabilities alongside the actual labels.</p>
<p><img src="/images/blog52/9threshold3.PNG" /></p>
<p>Then you can use displot in Seaborn to plot the distribution. You can find the optimal threshold at the point where it separates the two populations well. In this particular case, it is close to 0.54.</p>
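<p>A sketch of those three steps, assuming the same search, X_test, and y_test objects as above:</p>

<pre><code class="language-python">
import pandas as pd
import seaborn as sns

# probability of the positive class for each test record
proba = search.predict_proba(X_test)[:, 1]

scores = pd.DataFrame({"probability": proba, "actual": list(y_test)})

# distributions of the predicted probability, split by the true label;
# the threshold that best separates the two populations (about 0.54 here) is the one to pick
sns.displot(data=scores, x="probability", hue="actual", kind="kde")
</code></pre>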
<h2 id="4-other-useful-chart-to-check-model-performance">4. Other useful chart to check model performance</h2>
<p>You can draw a ROC curve to see how good the performance is. You can use AUC to compare different runs.</p>
<p><img src="/images/blog52/10ROC.PNG" /></p>
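<p>A sketch of the ROC curve and AUC, reusing the predicted probabilities from the previous step:</p>

<pre><code class="language-python">
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(y_test, proba)
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
</code></pre>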
<p>The lift chart tells you how much your model improves prediction compared with having no model. In this case, it improves performance about 2-fold when you compare the top 40% of the sample with the baseline.</p>
<p>To understand the lift chart, the following link will be helpful.</p>
<p><a href="https://scikit-plot.readthedocs.io/en/stable/metrics.html">https://scikit-plot.readthedocs.io/en/stable/metrics.html</a><br />
<a href="http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html ">http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html </a></p>
<p><img src="/images/blog52/lift_chart.PNG" /></p>
<p>Thanks for following along.</p>
<p>It is a long series. The notebook is <a href="/Files/understand_model.ipynb">here</a>.</p>
<p>Keep safe.</p>
<p>Wenlei</p>

<h1>How to Handle Textual Features along with Other Features in Machine Learning</h1>
<p>2022-10-02 | <a href="https://wenleicao.github.io/How_to_Handle_Textual_Features_along_with_Other_Features_in_Machine_Learning">https://wenleicao.github.io/How_to_Handle_Textual_Features_along_with_Other_Features_in_Machine_Learning</a></p>

<p>When working on a real-life data science project, you will realize there are things you might not expect from what you learned in school.</p>
<ol>
<li>You spend the majority of your time working on data prep and data preprocessing.</li>
<li>You will need to handle all kinds of data, not just numeric and categorical.</li>
</ol>
<p>Scikit-learn's pipeline greatly simplifies data prep and makes it much easier to apply exactly the same data prep logic to new incoming data. But in real work, you often come across text-type data, like notes or social media text saved in a database. Those could be important features, which means you have to use text mining techniques. People usually stay away from those because of the challenges, and it is a pity not to take advantage of them, since more often than not they can improve overall model performance.</p>

<p>Text is a bit trickier than other types of data in that you usually have to clean it and then vectorize it before it can be put to use. Other forms of data, like audio and images, are by default already in a numpy array format; therefore, those are relatively easier to incorporate than text.</p>

<p>Before writing this, I found only one blog discussing the topic. I suggest people take a look, since the author covers two different approaches for scikit-learn and for deep learning networks. One issue with the blog is that the author did not provide the source dataset, so it is hard for readers to repeat the work.</p>
<p><a href="https://towardsdatascience.com/how-to-combine-textual-and-numerical-features-for-machine-learning-in-python-dc1526ca94d9">https://towardsdatascience.com/how-to-combine-textual-and-numerical-features-for-machine-learning-in-python-dc1526ca94d9</a></p>
<p>Here, I would like to show how I go about it with an example using the scikit-learn framework.</p>

<p>I use the Women's Clothing E-Commerce Reviews dataset. This dataset contains numeric, categorical, and text fields. You can download the dataset <a href="/Files/Womens Clothing E-Commerce Reviews.zip">here</a> to follow along.</p>
<p>First, we import all packages that will be used in the analysis.</p>
<p><img src="/images/blog51/1import_package.PNG" /></p>
<p>Now we import the dataset and take a peek at what the data looks like.</p>
<p><img src="/images/blog51/2check_data1.PNG" /></p>
<p>Using the info function, we can see the data types and notice some missing data.</p>
<p><img src="/images/blog51/2check_data2.PNG" /></p>
<p>Among all the columns, Review Text and Title are text columns. Recommended IND is the recommendation indicator, which we will use as the binary classification target.</p>

<p>Besides that, checking the Recommended IND value counts in cell 12 shows the data is imbalanced. In cell 13, I undersample value = 1 so that I have an equal number of positive and negative records in the dataset used for training and testing. It is very important to handle imbalanced data properly; otherwise, your model can learn to just predict the predominant value and still get high accuracy, which is not what you want.</p>

<p>You can use the Python package imblearn to handle the imbalance problem, which gives you more flexibility; the following links will get you started. But that is not our goal here, so I used a simple step.</p>
<p><a href="https://youtu.be/YMPMZmlH5Bo">https://youtu.be/YMPMZmlH5Bo</a><br />
<a href="https://youtu.be/OJedgzdipC0">https://youtu.be/OJedgzdipC0</a></p>
<p><img src="/images/blog51/3under_sample.PNG" /></p>
<p>I defined column lists in cell 15.</p>
<p>I split the data into train and test in cell 17.</p>
<p><img src="/images/blog51/4sample_split.PNG" /></p>
<p>Since many machine learning algorithms only take a numpy array as input, we need to impute nulls and scale the data; if the data is not in a normal distribution, we try to correct that; for categorical data, we need to encode it to numeric values; and for text data, we need to vectorize it.</p>

<p>Next, we build a pipeline for the numeric data. Normally you will see an imputer and a scaler; we will add a custom transformer too. Since the data is not in a normal distribution, let us try a log transformation. Because there are some 0 values in the data, directly using the log function would generate -inf, which would cause trouble for the next transformation in the pipeline. Therefore, we add 1 to the original value, then do the log transformation. In my case, the minimum value is 0, so this is fine; but if you have negative values, you might want to try another kind of transformation, like Box-Cox.</p>

<p>In cell 20, I use FunctionTransformer to turn a function into a transformer. This is a shortcut so you don't have to write a class for the transformer.</p>
<p><img src="/images/blog51/5log_function.PNG" /></p>
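<p>In code, the idea is simply log(x + 1), which numpy provides as log1p; this sketch may differ slightly from the cell in the screenshot.</p>

<pre><code class="language-python">
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def log_transform(x):
    # log(x + 1) keeps zero values at 0 instead of producing -inf
    return np.log1p(x)

log_transformer = FunctionTransformer(log_transform)
</code></pre>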
<p>After the transformation, notice that in the third row (yellow shaded), log(0+1) = 0; there is no -inf value anymore.</p>
<p><img src="/images/blog51/6log_function_after.PNG" /></p>
<p>I use the make_pipeline function to create a pipeline. Here, I put the custom log transformer into the pipeline, then test both the numerical and categorical pipelines. Both work fine.</p>
<p><img src="/images/blog51/7pipeline_handle_num_cat.PNG" /></p>
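<p>A sketch of those two pipelines; the exact imputation strategies and scaling range are assumptions (the follow-up post mentions values scaled between -1 and 1), so treat them as illustrative.</p>

<pre><code class="language-python">
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# numeric: impute, log-transform, then scale into a common range
num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    log_transformer,
    MinMaxScaler(feature_range=(-1, 1)),
)

# categorical: impute the most frequent value, then one-hot encode
cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
</code></pre>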
<p>Next, I start to work on the text pipeline.</p>
<p>Since our focus is not training a model but rather handling heterogeneous data, I will borrow Bert's clean_text class to clean up the Review Text column. His blog is at the following address if you want to know more about the clean_text function.</p>
<p><a href="https://towardsdatascience.com/sentiment-analysis-with-text-mining-13dd2b33de27">https://towardsdatascience.com/sentiment-analysis-with-text-mining-13dd2b33de27</a></p>
<p><img src="/images/blog51/8clean_text_class.PNG" /></p>
<p>Both pipelines have similar components. The only difference is that txt2_pipeline contains the clean_text step. Also, I chose to pick the top 100 important features instead of 20, since the review text is longer.</p>
<p>Let us give it a shot. Whoops! We get our first error in the txt1_pipeline.</p>
<p>This error tells us that our first step, the imputer, needs 2-dimensional data, whereas we provide 1-dimensional data. On the other hand, we have to flatten the structure with the ravel function, because in the next step CountVectorizer expects 1-dimensional data.</p>
<p><img src="/images/blog51/9text_pipeline_init_error.PNG" /></p>
<p>So, we need to fix that.</p>
<p><img src="/images/blog51/9text_pipeline_fix.PNG" /></p>
<p>In cell 30, we reshape the data for imputation. After that, we flatten the structure so that it is ready for CountVectorizer. To use the same logic inside the pipeline, the wrapper idea from Stack Overflow in cell 31 works well: once we wrap the SimpleImputer class, it yields the correct format for the next step.</p>
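<p>The wrapper is essentially a thin subclass that reshapes on the way in and flattens on the way out; this sketch is my own version of that idea, not necessarily identical to the cell 31 code.</p>

<pre><code class="language-python">
import numpy as np
from sklearn.impute import SimpleImputer

class Imputer1D(SimpleImputer):
    """SimpleImputer that accepts a 1-D text column and returns a 1-D array,
    which is the shape CountVectorizer expects next in the pipeline."""

    def fit(self, X, y=None):
        return super().fit(np.asarray(X).reshape(-1, 1), y)

    def transform(self, X):
        return super().transform(np.asarray(X).reshape(-1, 1)).ravel()
</code></pre>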
<p>In the same way, we can handle txt2_pipeline.</p>
<p><img src="/images/blog51/10text_pipeline.PNG" /></p>
<p>It is time to put those pipelines together. In cell 36, we merge all these different pipelines using the make_column_transformer function. Here I came across the 2nd error, which is a bit weird: since all the previous pipelines work individually, I did not expect an error here. The error is the same as the one shown in the Stack Overflow post. I had to change line 6 from ['Review Text'] to 'Review Text'. It puzzles me why the same error did not show up for txt1_pipeline. So far, I have no explanation, but if you know, you can leave it in the comment section. The preprocessing is successful, as you can see from the dimensions.</p>
<p><img src="/images/blog51/11text_pipeline.PNG" /></p>
<p>I put together a list of classifiers and params (partly shown in the screenshot). I included 8 different classifiers to compare which one performs better.</p>
<p><img src="/images/blog51/12classfier_param.PNG" /></p>
<p>Now, in cell 39, we create the final pipeline with a classifier, and in cell 40 we run a RandomizedSearchCV with 100 iterations to save time, since our goal is not how good the model is but how to handle textual features alongside the others. Training completed in 42 minutes.</p>
<p><img src="/images/blog51/13classifier.PNG" /></p>
<p>I expected XGBoost to do a better job, but here logistic regression won the gold medal, with XGBoost in second place.</p>
<p><img src="/images/blog51/14result.PNG" /></p>
<p>Let us see how it performed on test data it has never seen.</p>
<p>It looks like the model performed pretty well on the unseen data; both recall and precision are high.</p>
<p><img src="/images/blog51/15test_result.PNG" /></p>
<p>I introduced some custom steps to simulate what you would encounter in real life. I hope you find this useful.</p>
<p>Some of the errors I ran into may have to do with scikit-learn’s design, but I hope this post helps you avoid some of the pain I went through.</p>
<p>Thanks, you can download the notebook <a href="/Files/Handle_different_type_data4.ipynb">here</a>.</p>
<p>Wenlei</p>Entitle the Custom Transformer a Memory in Python2022-08-13T00:00:00+00:002022-08-13T00:00:00+00:00https://wenleicao.github.io/entitle_Custom_Transformer_the_memory<p>Before data can be used in machine learning, it needs to be preprocessed. Many machine learning algorithms cannot handle nulls and outliers well, so these steps are important. Sklearn has a preprocessing module to help you do that, but oftentimes something like capping outliers or log-transforming the data requires you to write your own transformer class.</p>
<p>Let us say that when training your model, you capped the outliers at the 99th percentile of the training data. Now you start using the model to predict, and your testing data also has outliers: should you cap them at the 99th percentile of the training data or of the testing data? Because your model’s parameters were built from the training data, if you want the model to work properly you should treat the test data the same way as the training data, by using the 99th percentile of the training data.</p>
<p>This means we have to “remember” the training data’s 99th-percentile value. That might not be a problem for 1 or 2 features, but what if you work on a model with 100 features, and thirty of them need to be preprocessed with their 99th-percentile values? How do you track all those numbers? It is not practical to remember those values and hard-code them in a function. Let us see how we can use a class to track them for us.</p>
<p>In addition, there are times when we need to preprocess data separately and then feed it to a model hosted somewhere else. A custom transformer class can preprocess the data in one step. Besides that, you can store the fitted transformer on disk and reload it only when you need it. That is convenient for lazy people like me. I will show how that works as well.</p>
<p>Let us see how a custom class can help us with that.</p>
<p>First, let me show the non-class version.</p>
<p><img src="/images/blog50/1nonclass.PNG" /></p>
<p>Cell 79 creates a data frame from a tuple. Notice that I introduce an outlier in row 4 and a null value in row 5. You can fill the null value and cap the outlier in cells 80 and 81 (hard-coded for simplicity).</p>
<p>Now that the hard-coded method works, let us see how a class can simplify our work.</p>
<p><img src="/images/blog50/1class.PNG" /></p>
<p>I put the code into the transform function and create a custom transformer called model_transformer. Let us see if the custom transformer works. I reset the data frame in cell 83. In cell 84, I first instantiate the model_transformer class and name the instance mt, then use its fit_transform function to preprocess the data frame. The result in cell 84 is the same as in cell 81. The class basically packages all the functions together and transforms the data in one step. That is neat.</p>
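<p>For readers without the screenshot, the no-memory version is roughly the following; the column name and the hard-coded numbers are placeholders, not the notebook’s actual values.</p>
<pre><code class="language-python">from sklearn.base import BaseEstimator, TransformerMixin

class ModelTransformer(BaseEstimator, TransformerMixin):
    """Package the preprocessing steps into a single fit_transform call."""

    def fit(self, X, y=None):
        return self  # nothing is learned -- this version has no memory yet

    def transform(self, X):
        X = X.copy()
        X["salary"] = X["salary"].fillna(50000)       # fill nulls (hard-coded value)
        X["salary"] = X["salary"].clip(upper=150000)  # cap the outlier (hard-coded cap)
        return X

mt = ModelTransformer()
df_clean = mt.fit_transform(df)  # df stands for the toy frame reset in cell 83
</code></pre>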
<p>Let us say, you have a new dataset for testing. Can this instance handle it?</p>
<p><img src="/images/blog50/2class_new_data.PNG" /></p>
<p>I created a new dataset with the values in rows 4 and 5 changed. The mt instance gets the job done.</p>
<p>If I want to use the mt instance in another notebook to do the transformation, will that work?</p>
<p><img src="/images/blog50/3pickle.PNG" /></p>
<p>Here I pickle the transformer and then reload it under a different name. In cell 11, I create a different dataset, and the loaded instance processes the data just fine. (For simplicity I did this in the same notebook, but when you use another notebook you need to make the class definition available there so that Python knows where to find it.)</p>
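<p>The pickling round trip might look like this; the file name and variable names are mine.</p>
<pre><code class="language-python">import pickle

# Save the fitted transformer to disk ...
with open("model_transformer.pkl", "wb") as f:
    pickle.dump(mt, f)

# ... and reload it later (the class definition must be importable there too).
with open("model_transformer.pkl", "rb") as f:
    mt_loaded = pickle.load(f)

df_new_clean = mt_loaded.transform(df_new)  # df_new: any new frame with the same columns
</code></pre>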
<p>Can the class remember values from training? In the previous code, I simplified things with hard-coded values to show how the class and pickle work.</p>
<p>Here I have a training dataset and a testing dataset, each with different mean values for salary and bonus. We use the mean as a simple example, but the same idea applies to other scenarios such as capping outliers.</p>
<p>Let us see whether, if I fit with the training data and then transform the testing data, the mean values from training are used to fill the null values in testing.</p>
<p><img src="/images/blog50/4hardcodemean.PNG" /></p>
<p>The hard-coded Python script fills the salary and bonus null values with the averages.</p>
<p>Now let us create a transformer.</p>
<p><img src="/images/blog50/5.nomemory_transformer.PNG" /></p>
<p>This custom transformer is able to handle the training data, but obviously it has no memory. You can see that when it transforms the testing data, it fills the null values with the testing data mean (yellow shading below).</p>
<p><img src="/images/blog50/6nomemory_transform_result.PNG" /></p>
<p>In order to let the class remember the values, you need to create an instance variable at line 8, where you can save the training values. The instance variable will differ from instance to instance, depending on the training data you fit. When you fit with the training data, the average of each column is saved into the instance variable (line 20). When you transform the testing data, these values are retrieved by the set_null_as_avg function (see lines 14 and 15).</p>
<p><img src="/images/blog50/7memory_transformer.PNG" /></p>
<p>Let us give it a try.</p>
<p><img src="/images/blog50/7memory_transformer2.PNG" /></p>
<p>You can see that df5 no longer uses the testing data averages but instead uses the training data averages. In fact, you can check the instance variable values in cell 23: once you fit with the training data, these averages are saved there.</p>
<p>I hope this smarter class can save you some time on tedious work.</p>
<p>As always, the Jupyter code file can be found <a href="/Files/model_transformer-add_memory.ipynb">here</a>.</p>
<p>Thanks for reading</p>
<p>Wenlei</p>Custom Python Functions Used in Exploratory Data Analysis2022-06-21T00:00:00+00:002022-06-21T00:00:00+00:00https://wenleicao.github.io/Custom_Python_Functions_Used_in_Exploratory_Data_Analysis<p>A journey of a thousand miles begins with a single step. Similarly, any fancy machine learning project begins with exploratory data analysis (EDA). Because of that, you would think there would be a lot of resources on it. Surprisingly, there are not. Let us google the best book for EDA. Here are the two top links from a Google search.</p>
<p><a href="https://www.kaggle.com/getting-started/173448">https://www.kaggle.com/getting-started/173448</a></p>
<p><a href="https://www.quora.com/What-is-a-good-book-on-exploratory-data-analysis">https://www.quora.com/What-is-a-good-book-on-exploratory-data-analysis</a></p>
<p>The first link gives a few more Kaggle links where some experts in the field share their notebooks, which enlightened me quite a bit. In the second link, a few people mention a book by John Tukey named “Exploratory Data Analysis”, first published in 1977. I don’t question the theory in the book, but it will not teach you much about using Python to do EDA.</p>
<p>After reading a few Kaggle articles and data science blogs, I find the following links very helpful. They provide different perspectives on how others approach this part. You might want to read them over.</p>
<ul>
<li>regression</li>
</ul>
<p><a href="https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python/notebook">https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python/notebook</a></p>
<ul>
<li>classification, plus a methodology of growing from a minimal model to a chubby model while monitoring model performance</li>
</ul>
<p><a href="https://www.kaggle.com/code/pmarcelino/data-analysis-and-feature-extraction-with-python/notebook">https://www.kaggle.com/code/pmarcelino/data-analysis-and-feature-extraction-with-python/notebook</a></p>
<ul>
<li>NLP</li>
</ul>
<p><a href="https://towardsdatascience.com/sentiment-analysis-with-text-mining-13dd2b33de27">https://towardsdatascience.com/sentiment-analysis-with-text-mining-13dd2b33de27</a></p>
<ul>
<li>visualize variable relationship</li>
</ul>
<p><a href="https://towardsdatascience.com/exploratory-data-analysis-eda-python-87178e35b14">https://towardsdatascience.com/exploratory-data-analysis-eda-python-87178e35b14</a></p>
<p>I personally do not like repeating myself, but when you work on a different project you often have to repeat a similar process. Therefore, I would like to turn those commonly used steps into functions, so that I can save them somewhere and call them when I need them.</p>
<p>In this blog, I would like to share how I go about doing EDA using the Titanic dataset. I have just started my collection, and these functions are not fully tested, so they might be buggy. Please let me know if you run into issues.</p>
<p><img src="/images/blog49/1package.PNG" /></p>
<p>First, I import the common packages, then add the function folder to the module search path at row 11 so that Python can find the function file, and import the EDA_function.py file at row 12. This file contains all the functions needed. Please change the folder accordingly (either an absolute or a relative path will do).</p>
<p>Seaborn has some built-in datasets. I will use the Titanic dataset as an example.</p>
<p><img src="/images/blog49/2importdataset.PNG" /></p>
<p>I am going to touch on several aspects of EDA:</p>
<ol>
<li>overall EDA</li>
<li>further EDA on data variation and dup</li>
<li>further EDA on data missing or inf (inlier)</li>
<li>further EDA on data with outlier</li>
<li>further EDA on feature correlation</li>
<li>further EDA on feature selection</li>
</ol>
<h2 id="1-overall-eda">1. overall EDA</h2>
<p>You will probably use df.info() or df.describe(). Since these are run back to back, I combined them into one function.</p>
<p><img src="/images/blog49/3overall.PNG" /></p>
<p>This gives you an overview of the data. From the first result, you can see the data types and whether there are missing-value issues. The second result covers only the numeric variables; you can see how the data is distributed and whether there are any outliers.</p>
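<p>The combined helper is probably little more than this; the real EDA_function.py version may differ.</p>
<pre><code class="language-python">def overall_eda(df):
    """Run the two back-to-back overview calls in one step."""
    df.info()             # dtypes and non-null counts, i.e. missing-value hints
    return df.describe()  # distribution of the numeric columns

overall_eda(titanic)
</code></pre>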
<h2 id="2-data-variation-and-duplication">2. Data variation and duplication</h2>
<p>If you do not check this, you might not believe that the dataset contains duplicates.</p>
<p><img src="/images/blog49/4dup.PNG" /></p>
<p>If a variable has only one value, it can be removed because it will not contribute anything to the machine learning process. In the Titanic case, the smallest count is 2, so that is likely a binary variable. But it is not uncommon to find a column that contains only one value in a real-life dataset.</p>
<p><img src="/images/blog49/5variationPNG.PNG" /></p>
<h2 id="3-data-missing-or-inf">3. Data missing or inf</h2>
<p>The custom function lists only the variables containing NaN values. If too many values are missing, you might want to remove the variable, like “deck”. If only a few are missing, you can delete those particular rows, like “embarked”. For “age”, you might want to impute with the average or another strategy (please check the classification <a href="https://www.kaggle.com/code/pmarcelino/data-analysis-and-feature-extraction-with-python/notebook">blog</a> above; it is impressive).</p>
<p><img src="/images/blog49/6missing.PNG" /></p>
<p>You might find infinite values in the overview result. This can happen because the data source did some calculation that divides by 0. Do not be surprised; this does happen when you deal with real-life data.</p>
<p><img src="/images/blog49/7inf1.PNG" /></p>
<p>The Titanic dataset does not have inf values, so I manually created a small fake dataset, which shows the function works as expected.</p>
<p><img src="/images/blog49/7inf2.PNG" /></p>
<h2 id="4-outlier">4. Outlier</h2>
<p>It depends on how you define outliers; here I use 1.5 standard deviations. The column list is as follows. Please note that there are some discrete variables. Although I planned to remove them programmatically, there is no easy function to identify discrete variables, so you have to use your business sense and remove them manually.</p>
<p><img src="/images/blog49/8outlier1.PNG" /></p>
<p>The following function shows a histogram with a normal curve overlaid. It looks like age more or less follows a normal distribution.</p>
<p><img src="/images/blog49/8outlier2.PNG" /></p>
<p>If you are tired of showing plots one by one manually, here is another function, which shows them in bulk. It looks like fare does not follow a normal distribution.
There are multiple ways to handle outliers and non-normal distributions, such as a log transformation. The <a href="https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python/notebook">blog</a> above has some details.</p>
<p><img src="/images/blog49/8outlier3.PNG" /></p>
<p>You can create box plots in bulk to see where the outliers are. There is one fare over $500.</p>
<p><img src="/images/blog49/8outlier4.PNG" /></p>
<h2 id="5-feature-correlation">5. feature correlation</h2>
<p>The DataFrame’s built-in corr function shows all the numeric correlations.</p>
<p><img src="/images/blog49/9all.PNG" /></p>
<p>If you want to show only the correlation with the target variable, this custom function lists them all in descending order. It looks like “fare” has a higher correlation with survival. Money talks. Adult_male has a stronger negative correlation with survival. Poor man!</p>
<p><img src="/images/blog49/9target.PNG" /></p>
<p>You can take a closer look using a pairplot.</p>
<p><img src="/images/blog49/9pairplot.PNG" /></p>
<p>Another useful plot is the heatmap.</p>
<p><img src="/images/blog49/9heatmap.PNG" /></p>
<p>Notice that parch and sibsp, as well as adult_male and alone, are relatively highly correlated. Because of this collinearity, you can sometimes choose just one of the pair to include in your analysis.</p>
<p>Another way to observe the relationship between variables is to use a scatter plot for two numeric variables. If it is one numeric and one categorical, you can use a bar chart to compare. These two blogs (<a href="https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python/notebook">blog1</a>, <a href="https://towardsdatascience.com/exploratory-data-analysis-eda-python-87178e35b14">blog2</a>) give some good examples.</p>
<h2 id="6-use--statistical-method-to-select-feature">6. Use statistical method to select feature</h2>
<p><img src="/images/blog49/10selectKb.PNG" /></p>
<p>Say you want the 10 most important features. Here the numeric features need to be imputed, and the categorical features need to be imputed and one-hot encoded, before the SelectKBest function can be used. Therefore, you need to put the numeric and categorical columns into separate lists.</p>
<p>The result gives some hints. It looks like fare is important here too, and sex is also an important factor.</p>
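<p>For completeness, here is one way to run SelectKBest on the Titanic data; the column choices and imputation strategies are mine and may differ from the notebook’s helper.</p>
<pre><code class="language-python">import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

num_cols = ["pclass", "age", "sibsp", "parch", "fare"]
cat_cols = ["sex", "embarked", "who", "alone"]

data = titanic[num_cols + cat_cols + ["survived"]].copy()
data[num_cols] = data[num_cols].fillna(data[num_cols].median())  # impute numeric columns
data = pd.get_dummies(data, columns=cat_cols)                    # one-hot encode categoricals

X, y = data.drop(columns="survived"), data["survived"]
selector = SelectKBest(f_classif, k=10).fit(X, y)
print(pd.Series(selector.scores_, index=X.columns).nlargest(10))  # 10 highest-scoring features
</code></pre>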
<p>I hope these functions can save you some time and guide you through EDA in a systematic fashion.</p>
<p>Of course, I may have missed some important parts here; please let me know.</p>
<p>The notebook can be viewed <a href="/Files/blog49_EDA_Titanic.ipynb">here</a>. The function file can be downloaded <a href="/Files/EDA_function.py">here</a>.</p>
<p>Thanks and keep safe.</p>
<p>Wenlei</p>