TY - GEN
T1 - A conditional-probability zone transformation coding method for categorical features
AU - He, Liang
AU - Shen, Chao
AU - Li, Yun
N1 - Publisher Copyright: © 2019 Association for Computing Machinery.
PY - 2019/5/17
Y1 - 2019/5/17
N2 - Encoding categorical features is a key issue in solving problems efficiently with machine learning models. One-hot coding, the state-of-the-art approach, is a widely accepted method for converting categorical features into numerical values. However, it produces a sparse space and meaningless values after coding. We propose a novel coding method based on conditional probability after dividing the features into zones, called Conditional-probability-based Zone Transformation (CZT) coding. CZT coding calculates the conditional probability of each feature, then divides the features into several zones according to that probability, and finally codes the features in each zone. We prove mathematically that, compared with the state-of-the-art method, CZT coding reduces the code length by at least the mean of the feature space, and that the problem becomes easier for the downstream machine learning model after CZT coding. Finally, using the same neural network as the classifier, we compare the performance of CZT coding and one-hot coding on the Titanic dataset, where most of the features are categorical. The result is that CZT coding makes the classifier perform better in both accuracy and steadiness.
KW - Categorical features
KW - Conditional probability formatting
KW - Feature engineering
KW - Feature extraction
UR - https://www.scopus.com/pages/publications/85072820247
U2 - 10.1145/3321408.3326636
DO - 10.1145/3321408.3326636
M3 - Conference contribution
AN - SCOPUS:85072820247
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the ACM Turing Celebration Conference - China, ACM TURC 2019
PB - Association for Computing Machinery
T2 - 2019 ACM Turing Celebration Conference - China, ACM TURC 2019
Y2 - 17 May 2019 through 19 May 2019
ER -