Sklearn Pipeline Deep Dive
Anyone who has used the Sklearn pipeline knows how convenient it is for handling train and test data consistently.
In this post, I want to take a deep dive into the Sklearn pipeline. I will try to answer the following questions, each with a minimal working example (MWE).
- By default, a Sklearn transformer outputs a numpy (np) array for computational efficiency. Is this the case for every transformer?
- How do we dynamically choose different columns for different pipelines in a column transformer?
- People have complained about the np array output: the column names are gone, so they cannot tell which column is which. This can matter for feature importance analysis. Sklearn has an option to enable pandas output, but does it always work?
- What if you have to use np array output? Are you stuck without the feature names?