Use Optuna to Tune Sklearn Pipeline Hyperparam

Pipeline makes data scientists’ life easier by combining different data transformation together. You just place a training dataset on one end, the other end is a model to be trained. By the time you want to predict, you just input the test data instead. Pipeline will process test data the exact same way as training dataset and help you get correct predictions.

Read More

Shap Value for Single Record in Model Prediction

Feature importance helps us understand what features play more importance roles in a given model. After that, you can do feature selection, or you can use results to cross check with your business domain knowledge to validate the model.

Read More

Log errors in python with examples

Most of people use print function to show variable value in python for debugging purpose. It works fine when you are in dev mode. But as time goes by, your code also grows. It is cumbersome to have the print statement everywhere in your code and clean that afterwards. Python have building-in logging library, which can be leveraged to make our life easier.

Read More

Use Docker to Operationalize the Data Science Prediction

There are many challenges working in machine learning fields. In terms of the time consuming part, one is data preparation, the other one would be operationalizing the model. If you have the right data, building a model actually is a relatively easier part of the whole process. Let us talk about the operationalization of the model. I have used batch files (link) to handle R and python processes and used the task scheduler to schedule a job at a virtual machine in the cloud. But it is probably not the best way to go because the followings:

Read More

Dynamic SQL and Cursor in Teradata

Dynamic SQL is SQL statement formed when it is at run time. You can use variable to increase the flexibility of SQL, but variable won’t help in certain occasions. For e.g., you want to change the table name or column name of script in run time. This is where dynamic SQL shines.

Read More

Poor man's automation

One bottleneck of machine learning is to operationalize the model you build. Let us say you are satisfied with the model performance. Now the question is how you link your data engineering step with your model prediction and then save the prediction somewhere. Above all, you will need to make it run automatically. So that you can do something more important.

Read More

How to Handle Textual Features along with Other Features in Machine Learning

When working on a real-life data science project. You will realize there are things you might not expect from what you learn in school.

  1. You spend the majority of your time working on data prep and data preprocessing.
  2. You will need to handle all kinds of data, not just numeric and categorical.

Scikit learn’s pipeline will greatly simplify the data prep and make it much easier that you can apply the exact same data prep logic to up-coming new data. But in real work, you often come across text type data, like notes, or social media text saved in the database. Those could be important features, which means you will have to use text mining techniques. People usually stay away from those due to challenges. It is a pity that we cannot take advantage of that, oftentimes then not, it could improve overall model performance.

Read More

Entitle the Custom Transformer a Memory in Python

Before data can be used in machine learning, it will need to be preprocessed. Many machine learning algorithms cannot handle null and outlier well. So, these steps are important. Sklearn has a preprocessing class to help you do that. But oftentimes, something like capping outliers, using log function to transform the data, which requires you to write your own transformer class.

Read More

Custom Python Functions Used in Exploratory Data Analysis

A thousand miles begins with a single step. Similarly, any fancy machine learning begins with exploratory data analysis (EDA). Because of that, you would think there will be a lot of resources for this. Surprisingly, it is not. Let us google the best book for EDA. Here are two top links for EDA from google search.

Read More

Variable and Stored Procedure in Teradata Database

If you have been using Oracle or SQL Server, creating variables and stored procedures is like a breeze. When you are told in Teradata, you cannot use a variable in Teradata script (you can declare a variable in stored proc though), you might be shocked, especially when you want to implement something flexible.

Read More

Use Regex to Partially Replace Text in Python

People who work on databases know wild card search will help if you don’t want the search to be exact. In SQL server world (and most DBMS systems), you use % to replace multiple characters, _ to replace single character and use like as operator in where clause. For example, if you want to check if a column contains Massachusetts, sometimes people use Mass for abbreviation. You can use a where clause: where column like ‘Mass%’ to get rows containing values starting with Mass.

Read More

Launch A New Android APP - Solution Maker

About 15 years ago, while I was working at University of Pennsylvania. I started to play with Java to create a few small desktop applications for scientific calculation. For example, solution maker, Real-time PCR calculator. You can still find those at google code repository.

Read More

Example Using Table Partition in Incremental Loading

Many SQL Server database developers have heard of table partition. These technologies have been around for quite some time. But most people have never used it partially because people think they can make it do without it. However, If you work with large amounts of data such as a data warehouse or dealing with incremental loading. It is actually a very powerful tool should you know how to use it.

Read More

Compare Different Unpivot Methods in SQL Server

Recently, I have a request to unpivot two sets of multiple columns into two columns. It gave me an opportunity to explore different ways to do unpivoting, which is not common, but these methods are life saver when you need them.

Read More

Programmatically Retrieve Flat File MetaData by Python

A fair number of ETL tasks for Business intelligence developers are to load raw flat files. For large scale flat file loading, BIML provides a good solution to be able to generate massive packages programmatically. The following simple talk post showed how elegant the solution could be.

Read More

Solving SSRS Multi-value Parameter Data Driven Subscription Issue Step by Step

Subscription is very important for report automation. When you only have a handful subscription, you might get by setting regular subscription for each one. When there are many subscriptions like over 10 for a given SSRS, it will be painful to set up and manage. Luckily, we have data driven subscription (DDS) to help, but DDS has an limitation for multi-value parameter. In regular subscription, system will provide parameter value and you can choose it. DDS will get parameter value from a query.

Read More

Machine Learning Dynamic Feature Engineering and Optimization

Machine learning empower you to explore the deep relation between your data and result. In order to examine the deep relationship, you might need to do some feature engineering. The most common example, you can combine two features to get a more correlated new feature. Occasionally , you might need to generate feature dynamically while you examine the intrinsic relationship. When I say dynamically, I mean the feature value is based on certain parameter of its own. Let me give you an example.

Read More

Pandas Cheat Sheet for Database Developer

Python has been a Swiss army knife in the software world. Especially, in the data science field, it offers numerous library, such as numpy, sklearn. Pandas is one of most fundamental library, which helps user to manipulate data in the fashion like relational database. You often see people import data from csv or other source into data frame. Essentially data frame is a relational database table. You can do similar thing on it just like you can use SQL to work on table. I compiled most common used command in SQL and the corresponding python code using Pandas, so hopefully, this can server as a quick start for SQL database developer to work on python.

Because Github pages does not support display Jupyter notebook, However, repository does. So, I load Jupyter notebook as Github repository. You can click the image below to link to that.

Read More

Pass Query Result to Variable in Talend

Talend is new competitor in the Data integration tool market. If you has been using Microsoft SSIS, you know it is a very good tool. But when it comes to the needs for connecting to different type of databases, you will have a lot of hassle to make it work. Talend is very good at this. Also it has tons of components to be able to work with big data and cloud. If you know how to use, it is a very powerful tool.

Read More

Novel Way to Send HTML Table Formatted Email Via SSIS

The send mail task in SSIS can handle regular email just fine. But when requirements ask you to send an html email, it won’t work. There are two ways that you can html.

  1. Use sp_send_dbmail.
    It is not working by default, You will need certain permissions to work on the Database mail configuration wizard in SQL server. Once you are able to send email via SQL server, the next step, use execute SQL task to send email. You can find the configuration in the following link. This is by far the simpliest solution.
Read More

Using Correlated Subquery Fix Data Quality Issue

Correlated subquery is a special type of subquery. It takes outer query value and uses in inner query. Therefore, it will loop though every row of outer query. Unlike regular subquery, correlated subquery has the dependency on outer query. Thus, it cannot run by itself.

Read More

Use Recursive CTE Solve Real Life Issue

If you go through SQL interview, more often than not, you will be asked what difference between common table expression (CTE) and temporary table (temp table) is. Well, CTE can do what temp table does, i.e. holding data in a temporary space, but CTE has additional unique functionality which cannot be replaced by temp table: Recursive.

Read More

Deal with complex DAX measure

If you are working on tabular model, power BI, or Excel pivot table, you probably are familiar with DAX. Although DAX seems like excel expression, I have experienced many frustration moments. You might be asked to create measure like sum(value). But every so often, you will be requested to create something much more complex than that. I don’t see many book teach you how to deal with that. I will use an example to show you my approach to solve a complex measure.

Read More

Generate Metadata for BIML via PowerShell

This is a real scenario. A request was sent to me a couple weeks ago. There are about 15 main tables flat files and 65 reference tables flat files, which need to be loaded into a database so that client can do data analysis on it. The data is directly outputted from an application, but client does not have access to the application database. All column can be loaded as text. User can convert the data type themselves afterwards in their query.

Read More

Show D3 and React Visualization in Github Page

Let us say, you create some javascript data visualization somewhere and you would like it show in Github page site. Is it possible? The following is Massachusetts county map, I created with D3 and react in javascript. The legend is interactive if you hover mouse over it.

Read More

Solve Readmission Flag Issue with Windows Analytic Function

I was requested to rewrite the old readmission script which is cursor based. The script is not long, but it has nested cursor inside code. It took me some time to figure out how it works. Basically, it works on one patient at a time (sorted by discharge date). Compare the first record discharge date with the rest record admission date. if the period < =30 (readmission), we flag it 1; if >30, it is a new admission, we flag it as 0. Now same process will start with this new admission record to compare the rest of record with the same patientID.

Read More

Gain insight of Google Analytics on my tech blog via Power BI

I have this tech blog for more than 2 years. The blog is built on Github pages, which can be tracked by Google Analytics. I used to go to Google Analytics to see some web activity stat. I have to say Google Analytics did a very good job from tracking to presenting data with various visualization. One drawback though, it has some preset params, like previous week. If I want to see what is going on today, I need to change that at drop down list. In addition, there are some charts I am not interested. It seems there is no options that I can personalize the chart presented. Since Power BI can connect to Google Analytics and retrieve data. I am wondering if I can retrieve data and set up desired chart in Power BI. Next time, if I want to check the web stat, the only thing I need to do is to hit the refresh button.
First, let us define what I want to see in the dashboard.

  1. I want to know some key statistics, such as total page view, total session, I also want to know what the maximum page view number is within a single day. This one is not available from Google analytics. I will create this measure with DAX in Power BI.
  2. Show user number and new user number over time. This will be displayed in line chart, so that I can have an idea of my consumers.
  3. Show which post is popular
  4. Where are my consumers from? It is always interesting to me that people from different countries are working on the same thing. It is a small world indeed.
Read More

Extract Active Directory User info via PowerShell

A few month ago, I was asked to get meta-data who is accessing our Power BI (PBI) dashboard. Besides PBI audit log, we would like to understand more about user behavior so that we can improve power BI report. PBI service now can add Active Directory (AD) group to manage permission as opposed to office 365 group, so it is much more efficient now in my opinion. Much of the user info also can be extracted from AD group. The easiest way is using powershell.

Read More

Troubleshoot DAX issue quickly with SQL Server Profiler and beyond

Assuming you are a developer of big organization, your Power BI dashboards were all built from one popular Tabular model.
One day, your boss calls you in his office and points to a Power BI dashboard saying this number does not make sense. Can you take a look and tell me why? I will meet big boss in one hour. Can you give me some insights by then?

Read More

Extract Relationships from Tabular Models through C#

Let us say, your BSA found something fishy when he checks data. The whole month of sale in MA is only $2000. He suspects something wrong at the model level or database level. He is not familiar with visual studio. So he is asking you for help to identify the source table and join key, so that he can write some SQL query to pinpoint if this is data issue.

Read More

Leverage metadata in BIML

I have been working on SSIS for many years. I came across BIML in multiple occasions, such as reading reference, browsing BI topics… BIML as a tool is used to massively create SSIS package and meta-data driven ETL solution.
My view on BIML
• Good candidate project for BIML
If your project is following certain pattern, such as moving multiple table data from one database to another (you can do it via script if it is between same types of database, but would be harder between SQL server and Oracle for example). In addition, it is much easier to maintain the big project using BIML.
• Not good candidate project: small task with unique requirement

Read More

Compare Tabular Model Difference For Better Promotion Experience

If you work on the same project with others, you must be familiar with source control. There are a variety of source control products on the market. There is open source one, such as Git offered by Github. I worked with SAP products before, which uses central repository to check in and out. People working with visual studio will probably be more familiar with Team foundation server (TFS).

Read More

Power BI Report Sevice "Could not load folder contents" issue

Among current Microsoft reporting service tools. Power BI is mainly design for dashboard type of report, whereas SSRS is designed for old school paginated report, although they can overlap to some extent. Since they are designed for different purpose, you might experience difficulties if you don’t choose the tool carefully. An example is we have a legacy rolling week report, which the week column label will change every week. Power BI takes the column name from measure name, this has been a tough issue to resolve if we are using Power BI because you cannot change measure name every week. But SSRS can easily address that with expression. On the other hand, some reports have aggregation at subtotal and total level for non-addable measure, a lot of visuals. Those are good candidates for Power BI, not for SSRS. SSRS can do it but might need a lot of MDX or DAX coding to work around the issue.

Read More

Many to Many relationship in Microsoft Tabular model

One of features that Multi-dimension cube has whereas Tabular model doesn’t is that Tabular model cannot handle many to many relationship. This has been resolved in the 2016 release by Bi-directional cross filtering by setting it in the relationship window. However, not all companies can keep pace with the latest release. In fact, our company just start to update from SQL Server 2012 to 2014 this year. Our analysis server is still using 2012 SP2. But that does not mean we cannot deal with many to many relationship in tabular model. Actually, there are workaround by using customized DAX measure. And if you understand the context concept and know some basic of DAX programming, it is actually not that difficult.

Read More

SAP Data Service SQL Push Down

SAP Business Object Data Service (BODS) is a powerful ETL tool. Part of ETL developer job is to convert SQL script into data flows. This is good for both data lineage and documentation. When you build a data flow, you add different components together, source, target, query transfer, merge… As you add more and more components to the data flow, it becomes more complex. Do you notice, at the beginning, the data flow run very fast, but once it become complex, it can take hours to run. But the same SQL script only run 10 min in Toad, what is wrong with the data flow?

Read More

When Python comes across Data Service

I am working on an ETL process lately. The requirements are as follows.

  • Using SAP data service (DS)
  • Destination tables are in oracle
  • Business logic are fairly complicated, using a lot of oracle analytic functions, grouping set to dynamically generate subtotal…
Read More

Use Temp Table in SSIS?

As a database developer in SQL SERVER world, you will use temp table quite often to store data temporarily for next step in your SQL script. But when it comes to the SSIS, is it necessary?

Read More

Implement Analytical SQL Function in Data Service

To accommodate the analytical requirement of business, both PL SQL and T SQL have developed similar analytical function over the years. For example, row_number(), rank(), dense_rank(), max(), sum(), lead(), lag(). The main purpose of these functions is to help business get insight of data more conveniently. It works differently from ordinary aggregation function in that it can add aggregation in the same row, as oppose to regular aggregation which only list the unique combination of columns you grouped by

Read More

Make Batch API Requests via Python

Recently, I was working with Python on a project. A few features in the dataset are missing, which can be acquired from the third party via API calls. The original dataset is a csv file. Basically, I need to loop through each row of csv, pass the value to API, and get the result back. It will be nice if we can call in batch, return in batch and export all result to csv.
Here I am using Zillow API as an example. I have a list of street address and zip code, I need to know the Zillow webpage of those properties.

Read More

How to let SSIS wait to certain time point

We had a SSIS ETL package which populates the report and dashboard for daily EDI status monitoring. This ETL process runs daily. Since it loads thousands and millions record, the final load to fact table need truncate and load about 30 min to 1 hour. This could take report offline for this amount of time and therefore could interfere report user experience. The request is to delay loading till night. If you have used BO Data service, there is a built-in function, called sleep function. But in SSIS, there are no such tasks. I did a little research online. You can do it through either execute sql task, or script task using VB.net or C#, or using for loop.

Read More

Create a String Builder for Youreself

People create tool to make their job easier. Carpenters make wood jig. Farmers are using spinning wheel for making cloth. As a BI developer, I came across many scenario that the job is tedious, but you have to be careful or the code won’t work.

Read More

Unconventional way to solve column comparison issue in BODS

We had a request from an internal customer that they need to compare two tables to see what records are inserted, deleted and updated. These tables contain thousands of records, therefore it is not manageable for a person to do this work. For this purpose, Business Object Data Service (BODS) table comparison transform and Map operation transform can identify table level change (please check my previous blog ). Customer also want to detail what particular columns were changed in the format like

Read More

Compare MDX to DAX programming pattern

SQL is used in relational database (OLTP) for query purpose. For analytical purpose, OLAP cube is built on top of data warehouse. A new type of language called MultiDimensional eXpressions (MDX) was introduced in 1997 by Microsoft to query the OLAP cube. Thereafter, MDX is widely adopted by major OLAP vendors such as Microsoft SSAS, Oralce Essbase et al. In 2009, Microsoft pushed out Data Analysis Expressions (DAX) and use it in powerpivot, powerBI desktop and SSAS tabular model. As of now, I am not aware of other major vendor adopted DAX yet. It seems to me that Microsoft will continue to promote application of DAX. But I don’t believe DAX will eventually replace MDX, since MDX is still the only language used in multi-dimensional model (see table below). In fact, DAX is translated into MDX at the backend to query tabular model. Therefore, you can query tabular model with MDX, but not vice versa.

Read More

How to handle many to many relationship in MongoDB

MongoDB is one type of NoSQL database (Not Only SQL). MongoDB was generated by 10gen in 2009 to bridge the gap between key-value stores (which are fast and scalable) and relational databases (which have rich functionality). Its name is from “Humongos”. It is a document oriented NoSQL database, which is in JSon like format.

Read More

Spark RDD join operation with step by step example

Compared with Hadoop, Spark is a newer generation infrastructure for big data. It stores data in Resilient Distributed Datasets (RDD) format in memory, processing data in parallel. RDD can be used to process structural data directly as well. It is hard to find a practical tutorial online to show how join and aggregation works in spark. I did some research. For presentation purpose, I just use a small dataset, but you can use much larger one.

Read More

Observe SAP data service CDC behavior using Table Comparison Transform

Change data capture (CDC) is very important for ETL developer. It is the foundation of data incremental loading. A lot of companies now require data to be refreshed on the daily basis. If full load of ETL process takes less than 24 hours, you can still manage to do it with full load daily. As source data is growing, the full load cannot be finished within 24 hours. Now, it is the time that CDC comes in to play. In SAP data service, table comparison transform can be used to compare incoming data with existing data and further update the existing data. It has some configurations which might affect the outcome of CDC, I would like to use some dummy data to see different CDC behaviors so that we can understand the configuration and know when to use it. Let us first create a customer info dummy table test_cdc_1, we use it as an existing table original In this case, I have 3 persons and they have the address info listed in table. Now we want to see CDC’s three actions, insert, update, and delete. We design an incoming table, test_cdc_2. So we can see these 3 actions.
2nd In this table, you can see John keeps the same location, Kevin changed location to Cambridge, Lisa is not customer anymore. But we have new customer Mary. So John keep the same, Kevin will need to be updated, Lisa will be deleted and Mary will be inserted. I created data service job like the follows, what is displayed is data flow part.

Read More

Add percentage to SAP webi chart legend

SAP has an array of BI products, which is comparable to Microsoft stack. Check the following table for comparison. I have been using SAP BI Webi to create some annual reports for a client. I am impressed by its section feature and how easy to create a summary and detail break down report. In SSRS you would have to add code for summary and use group feature to do that. It also has variable feature which like expression in SSRS make it more flexible to handle different situation.

Read More

Address the Many to Many relationship in Multi-dimensional SSAS cube

In relational database, there are a few relationships between tables, One to One, Many to one, Many to Many. Some typical real life examples for Many to many relationship are student vs class, bank customer vs bank account, order number vs product. To model many to many relationship, we typically use a bridge table, or a mapping table, to connect the two tables. As to why we need to use the bridge table, this youtube explain very well.

Read More

Workaround for "There is already an object named xxxx error" in T-SQL

I have been using temp table a lot while writing long SQL script and stored procedures. Compared with CTE or table variable, temp table saves data in tempdb. So, you can retrieve data as long as you do not close the session. That is very convenient if you want to track data change and do some trouble shooting.

Read More