MACHINE LEARNING MODEL FOR STUNTING PREDICTION

This study aims to find the best Supervised Machine Learning (SML) model for stunting prediction. The research follows an experimental approach on a custom dataset of 192 infant records, of which 183 are normal and 9 are stunted. The study concludes that the combination of the Random Forest classification algorithm with Support Vector Machine weighting and Genetic Algorithm feature selection has the best performance. The best-performing parameters are as follows. The data are split into 90% training data and 10% testing data. The random forest uses 100 trees, the Gain Ratio criterion, and a max_depth of 10. In the Genetic Algorithm

of the uses of ML in the health sector is as a tool for detecting factors that cause stunting (Berhe et al. 2019). ML works by recognizing patterns in the existing stunting cases in the dataset and can then predict stunting for new data (Vaivada et al. 2020). As a promising technology, ML can solve problems related to regression, prediction, association, classification, and clustering (Lee et al. 2017). However, despite these advantages, ML has several drawbacks (Abid et al. 2022). For example, many algorithms are available, and the choice depends on the features of the problem and the data (Kotthoff 2016). Each algorithm achieves a different accuracy depending on the given dataset (Dogan and Tanrikulu 2013), so the algorithm with the best accuracy on one problem does not necessarily achieve the same accuracy on another (Bierman, Li, and Lu 2023). The problems do not stop there. Once an algorithm has been chosen, its accuracy also varies with its parameters (Zhang et al. 2023). Ultimately, no single parameter value is guaranteed to give the best results, so tuning proceeds by trial and error (Sohrabi et al. 2023). There is therefore a research opportunity to find suitable algorithms and parameters for each case (Salachoris et al. 2023).
The contribution of this study is to propose an ML model that performs better for stunting detection (Ndagijimana et al. 2023). The model covers algorithm selection, oversampling, feature selection, feature weighting, data splitting, and algorithm parameter tuning (Sultana and Islam 2023). A further contribution is identifying the causes of stunting (Simatupang, Gultom, and Rahman 2023).

Related Paper
Based on our search, several studies are similar to this research. The first study analyzed ML performance on stunting data from Zambia. It concluded that the Random Forest algorithm performed best and selected 13 factors that cause stunting from 58 input variables. The second study examined the causes of stunting using ML. Its results state that the Extreme Gradient Boost algorithm performed best and selected 5 factors contributing to stunting from an Ethiopian stunting dataset. The third study concludes that the Random Forest algorithm performed best; it used 5 input variables and 1 output label.
The synthesis of these studies shows that no prior research has used a dataset with a total of 73 input variable columns. Our data were obtained by survey in Blora Regency, Central Java Province, and collected through the SITEKSTAGI (Nutrition Status Detection System) application from our previous research. The dataset has 192 rows, of which 183 have normal status and 9 have stunting status. It has 78 input variables and 1 output label.

Machine Learning
There are four types of ML: Unsupervised Learning, Supervised Learning, Semisupervised Learning, and Reinforcement Learning. Unsupervised Learning is a type of ML used when the available data does not have an output label. Supervised Learning is a type of ML whose data has an output label. Semisupervised Learning is ML used for problems that combine Unsupervised Learning and Supervised Learning. Reinforcement Learning is a type of ML that solves problems by searching for the best solution.

Supervised Learning
Supervised learning is a category of ML that requires output labels in the dataset it processes. Many Supervised Learning classification algorithms are available; in this study, we used nine. Logistic Regression (LR) is a classification algorithm that draws a boundary for each label/class and then assigns each sample according to its proximity to these boundaries, which gives a predicted relationship between the variables and the label/class. LR is generally used in applied statistics to solve discrete analysis problems. Support Vector Machine (SVM) is a classification algorithm that finds the maximum-margin hyperplane separating the classes. K-Nearest Neighbor (KNN) is a classification algorithm that labels a sample according to its similarity, or distance, to labeled samples. A Decision Tree (DT) is a classification algorithm that builds a decision tree diagram and considers each component to find the relationship between each input variable and the output label. Random Forest (RF) is a classification algorithm developed from DT. It forms a collection of decision trees, each grown from a random sample, and compares the results of the individual trees to obtain the final value. Gradient Boost (GB) is also an algorithm derived from DT. Naïve Bayes (NB) is a probabilistic classification algorithm. Neural Network (NN) is an algorithm that imitates the workings of neuron cells in the human brain. Deep Learning (DL) is an algorithm developed from the principle of the NN algorithm. The difference between DL and NN lies principally in the existence and number of hidden layers used [8].

Feature Weighting
Weighting the input variables can improve the performance of the feature selection process. The RapidMiner application provides several weighting methods, namely Information Gain (IG), Gain Ratio (GR), Correlation (Corr), Chi Square (Chi), Gini Index (Gini), and Support Vector Machine (SVM) [8].
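As an illustration of SVM-based weighting, one common reading is to fit a linear SVM and take the absolute value of each hyperplane coefficient as the weight of the corresponding input variable. The sketch below assumes that reading; the text does not describe how the RapidMiner operator works internally. The function name, the training settings, the scaling of the weights to a maximum of 1, and the toy data are all hypothetical.

```python
def svm_feature_weights(rows, labels, epochs=200, lr=0.01, lam=0.01):
    """Fit a linear SVM by sub-gradient descent; return |w| scaled so the max is 1."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):              # y must be +1 or -1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi - lr * lam * wi for wi in w]    # L2 regularisation step
            if margin < 1:                          # hinge loss is active
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    largest = max(abs(wi) for wi in w) or 1.0
    return [abs(wi) / largest for wi in w]

# Toy data (an assumption): feature 0 separates the classes, feature 1 is noise.
rows = [[1.0, 0.2], [0.8, -0.3], [0.9, 0.1],
        [-1.0, 0.3], [-0.7, -0.2], [-0.9, -0.1]]
labels = [1, 1, 1, -1, -1, -1]

weights = svm_feature_weights(rows, labels)
print(weights)
```

A variable with a weight near 1 contributes most to the separating hyperplane, while a weight near 0 marks a candidate for removal during feature selection.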

Feature Selection
Feature selection plays a crucial role in improving the performance of ML classification. In classification problems, not all input variables/features in the dataset affect the output label, and variables without such influence decrease the overall performance of the ML algorithm. Feature selection identifies the variables that influence the output label and eliminates those that do not, which improves ML performance [9]. This study uses the RapidMiner 9.10 application as the ML experimental tool. RapidMiner provides several feature selection methods: Forward Selection, Backward Selection, and Optimized Selection. In this study, we chose Optimized Selection (Evolutionary), which uses a Genetic Algorithm (GA). This algorithm was chosen because the existing literature reports that it improves ML accuracy. Another advantage of GA is that its performance can be tuned through its parameters.
The Genetic Algorithm (GA) is a heuristic algorithm that mimics the genetic evolution of living things. The GA first generates random candidate solutions, as many as the population value. It then crosses individuals with one another at a rate given by the crossover value, calculates their fitness, and crosses them again to produce new generations, as many as the generation value, in search of the best fitness value. The GA can be tuned by changing its selection scheme, population size, number of generations, mutation percentage, crossover percentage, and crossover method [10].
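The loop described above can be sketched as follows. This is a minimal illustration, not the RapidMiner operator: each individual is a 0/1 mask over the input variables, parents are drawn by roulette wheel (probability proportional to fitness), and the sketch assumes one-point crossover, bit-flip mutation, and elitism. The function name `ga_select` and the toy fitness function are hypothetical; in practice the fitness would be a cross-validated classifier score.

```python
import random

def ga_select(n_features, fitness, population=20, generations=30,
              crossover=0.9, mutation=0.03, seed=0):
    """Evolve 0/1 feature masks; return the best mask found and its fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(population)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        scores = [fitness(m) for m in pop]
        # Roulette wheel: parents are picked with probability proportional to fitness.
        weights = [s + 1e-9 for s in scores]
        next_pop = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(next_pop) < population:
            a, b = rng.choices(pop, weights=weights, k=2)
            child = a[:]
            if rng.random() < crossover:  # one-point crossover
                cut = rng.randrange(1, n_features)
                child = a[:cut] + b[cut:]
            # Bit-flip mutation with per-gene probability `mutation`.
            child = [1 - g if rng.random() < mutation else g for g in child]
            next_pop.append(child)
        pop = next_pop
        best = max(pop, key=fitness)
    return best, fitness(best)

# Toy fitness (an assumption): features 0-2 are informative, the rest add a small cost.
def toy_fitness(mask):
    return max(0.0, sum(mask[:3]) - 0.1 * sum(mask[3:]))

mask, score = ga_select(n_features=10, fitness=toy_fitness)
print(mask, score)
```

The default arguments mirror the population, crossover, and mutation values the study reports as best; the number of generations is an arbitrary choice for the sketch.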

RESEARCH METHODS
This research applied an experimental approach. The dataset comes from the SITEKSTAGI application from our previous study [7]. The research begins with a pre-processing stage. First, the researchers remove irrelevant data such as addresses and names. Then, they correct values that were entered incorrectly when the data were filled in. Next is the encoding stage, which changes the nominal data type to numeric by breaking each variable column into several columns based on the content of the data, so that each input column holds only one kind of value. The data are then normalized so that all values have a consistent distribution. After cleaning the data, 73 numeric independent variables remain, along with 1 dependent variable/output label taking the values stunting and normal.
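The encoding and normalization steps can be illustrated with a small sketch. The text does not name the normalization method, so min-max scaling to [0, 1] is assumed here, and the column names and values are hypothetical rather than taken from the SITEKSTAGI dataset.

```python
def one_hot(values):
    """Split one nominal column into one 0/1 column per distinct value."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def min_max(values):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:               # constant column: no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical columns; the names are illustrative only.
breast_milk = ["yes", "no", "yes", "yes"]
birth_weight_kg = [2.4, 3.1, 3.6, 2.9]

print(one_hot(breast_milk))      # {'no': [0, 1, 0, 0], 'yes': [1, 0, 1, 1]}
print(min_max(birth_weight_kg))
```

Encoding of this kind increases the number of columns, since each nominal variable becomes several 0/1 columns.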
Next, the researchers separate the data into training and testing parts. The training data are used for algorithm learning, and the testing data are used for algorithm testing. To ensure that the results are valid and consistent, the researchers test using k-fold cross-validation with k = 10.
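As a sketch of how 10-fold cross-validation partitions the data, the snippet below shuffles the row indices and yields ten train/test index pairs, so that every row is used for testing exactly once. The function name and the seed are hypothetical, and plain (non-stratified) folds are assumed because the text does not say otherwise.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(192, k=10))
print([len(test) for _, test in splits])
```

The reported score is then the average over the ten folds, which makes it less sensitive to one lucky or unlucky split.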

RESULTS AND DISCUSSION

Distribution of Data on the Dataset
The SITEKSTAGI dataset has 192 rows: 183 normal and 9 stunted infant records. Because of this imbalance, the machine learning algorithms would be inaccurate. On the other hand, we wanted to limit the amount of synthetic data. We therefore applied SMOTE (Synthetic Minority Over-sampling Technique), adding 300% synthetic data for stunted infants. After oversampling, the dataset remains imbalanced, with the ratio of stunting to normal data moving from 1:24 to 1:7. The results are shown in Figure 2.
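The oversampling step can be sketched in a few lines. SMOTE creates a synthetic minority row by picking a minority row, picking one of its nearest minority neighbours, and interpolating a random point on the segment between them. The sketch below is a simplified illustration, not the tool used in the study; the function name, the value of k, the toy rows, and the reading of "300%" as three new rows per original row are assumptions.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen row (excluding itself)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical two-feature minority rows (for example, scaled weight and height).
stunted = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.22], [0.12, 0.30]]
new_rows = smote(stunted, n_new=3 * len(stunted))  # "300%" read as 3 new rows per original
print(len(new_rows))
```

Because every synthetic row lies between two real minority rows, the new data stay inside the region already occupied by the stunted class.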

Algorithm Selection with Feature Selection
The first step is to compare the performance of nine classification algorithms, namely k-Nearest Neighbor (KNN), Naïve Bayes (NB), Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR), Neural Network (NN), Deep Learning (DL), and Gradient Boost (GB). The results of this comparison are as follows: The performance assessment in Table 1 was carried out on the original data, in which the ratio of stunting to normal data is 1:20 (9 stunting and 182 normal). All algorithms scored poorly on all five scoring elements. This indicates that when the output labels are imbalanced and there is too little data, ML performance is poor.
The next stage is to optimize ML to increase the scores of the five assessment elements for all algorithms. Stunting data were increased with 300% synthetic data using the SMOTE technique, which changes the ratio of stunting to normal data to 1:7 (19 stunting and 182 normal). The performance evaluation results are as follows: The performance test of the nine machine learning algorithms concludes that the Random Forest algorithm performs best. The RF algorithm is therefore selected for the optimization process in the next step.

Selection of Feature Weights
This stage tests the five input variable weighting methods to determine which input variables influence the stunting or normal label. The results are as follows: Table 3 shows that the SVM method scored highest on the five assessment elements. The SVM method is therefore applied in the following testing stage.

Genetic Algorithm Optimization at the Feature Selection stage
This study uses a genetic algorithm as the feature selection method. The performance of the genetic algorithm depends on its parameter values. The first parameter tested is the selection scheme, for which there are five methods: Tournament, Roulette Wheel, Boltzmann, Stochastic, and Non-Dominated Sorting. The test results in Table 4 show that the Roulette Wheel method performs best, with an accuracy of 0.96, AUC of 0.99, precision of 0.93, recall of 0.83, and f_measure of 0.86. The Roulette Wheel method is therefore used in the next test. The test results in Table 5 show that the Shuffle crossover method performs best, with an accuracy of 0.96, AUC of 0.99, precision of 0.95, recall of 0.80, and f_measure of 0.84. The Shuffle method is therefore used in the following stages of testing. The next test of the Genetic Algorithm (GA) parameters covers the crossover, mutation, and population values. Four combinations were tested:
a) Crossover = 0.9, mutation = 0.03, and population = 5
b) The RapidMiner default values: crossover = 0.5, mutation = -1.0, and population = 5
c) Crossover = 0.9, mutation = 0.03, and population = 20
d) Crossover = 0.5, mutation = -1.0, and population = 20
The results are as follows: The test results in Table 6 show that the combination of a crossover value of 0.9, a mutation value of 0.03, and a population of 20 produced the best accuracy, AUC, precision, recall, and f_measure scores. The researchers therefore use these values in the next stage of testing.

Split Data Testing
The division of training and testing data (data splitting) is an essential factor in machine learning theory. The researchers divide the dataset into two parts with a certain percentage. Machine learning uses the training data for the learning process and the testing data for the testing process. The percentage split between training and testing data strongly influences algorithm performance. The researchers therefore tested five split ratios taken from the literature. The results of these tests are as follows: The split data percentage test results in Table 7 show that the best-performing ratio of training to testing data is 0.9:0.1. The researchers use this percentage in the following testing stage.
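A 0.9:0.1 split on a dataset this imbalanced can leave the test part with no stunting rows at all unless each class is split separately. The sketch below shows such a stratified split; whether the split in the study was stratified is not stated, so this is an assumption, and the function name and seed are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio=0.9, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * train_ratio)
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx

labels = ["normal"] * 183 + ["stunted"] * 9   # class counts reported for the dataset
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With only 9 stunted rows, a 10% test part holds about one of them, which is one reason a single holdout score on this data can vary widely.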

Optimize the Random Forest Algorithm
The performance of machine learning algorithms depends on the given parameters, and each algorithm has different parameters. It is therefore necessary to test these parameters to determine which produce the best performance. At this stage, four criteria assessment methods are tested on RF, namely Gain Ratio (GR), Correlation (Corr), Gini Index (Gini), and Support Vector Machine (SVM). The results of these tests are as follows: The test results in Table 8 show that the Gini and SVM methods have the highest accuracy. However, the SVM method has the highest recall score, which means that it detects stunting labels better. It also has the highest f_measure value, which reflects a better balance between the number of stunting detections (recall) and the accuracy of stunting detection (precision) than the other methods.
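The relationship between accuracy, precision, recall, and f_measure discussed above can be made concrete with the standard definitions computed from confusion-matrix counts, taking stunting as the positive class. The counts in the example are hypothetical and are not results from the study.

```python
def classification_scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure for the positive (stunting) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical confusion-matrix counts, not results from the study.
print(classification_scores(tp=8, fp=2, fn=2, tn=88))

# A model that labels every infant "normal" on a 183:9 dataset:
print(classification_scores(tp=0, fp=0, fn=9, tn=183))
```

The second call shows why accuracy alone is misleading on imbalanced data: accuracy is above 0.95 while recall and f_measure are zero.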

Model Testing on Nine Algorithms
After obtaining all the optimal parameters, namely the feature weighting method, the feature selection method, the GA selection scheme, the GA crossover method, the GA crossover, mutation, and population values, and the split data percentage, the researchers use these parameters to test the other algorithms. The results are as follows: The test results for the nine ML algorithms in Table 9, using the best parameters from the previous stages, show that the RF algorithm performs best and leads on all five scores: accuracy, AUC, precision, recall, and f_measure. The researchers therefore choose the RF algorithm for parameter optimization to get the best performance.

Optimization of the Random Forest Classification Algorithm
Random Forest is an ML algorithm whose performance can be tuned through its parameters. The parameters that affect the results of the RF algorithm are the number of trees (T), max_depth (D), criterion (C), and pruning (P). The results of testing the four parameters are as follows: The results show that the best random forest parameters are trees = 100, depth = 10, criterion = gain ratio, and pruning disabled.
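The gain ratio criterion is commonly defined as the information gain of a split divided by its split information, which penalizes features with many distinct values. The sketch below computes it for a nominal feature under that standard definition; how RapidMiner implements the criterion internally is not described in the text, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(feature, labels):
    """Information gain of a nominal feature divided by its split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0

labels = ["stunted", "stunted", "normal", "normal"]
print(gain_ratio(["a", "a", "b", "b"], labels))  # 1.0: the feature separates the classes
print(gain_ratio(["a", "b", "a", "b"], labels))  # 0.0: the feature carries no information
```

At each tree node, the feature with the highest gain ratio would be chosen for the split.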

Feature Rank
Feature rank, or input variable rank, is the result of ML learning and pattern recognition on the dataset. It gives the weight of the relationship between each input variable and the output label (Stunting / Normal). The ranking of each variable using the SVM method is as follows:

CONCLUSION
This study concludes that machine learning can predict stunting status in infants, according to test results on a small, imbalanced dataset across several machine learning classification algorithms and their parameters. Combining the Random Forest classification algorithm, weighting with a Support Vector Machine, and feature selection with the Genetic Algorithm gives the best performance. The best-performing parameters are as follows. The data are split into 90% training data and 10% testing data. The random forest uses 100 trees, the Gain Ratio criterion, and a max_depth of 10. In the Genetic Algorithm, the best parameters are the Roulette Wheel selection method, a population of 20, a mutation value of 0.03, and a crossover value of 0.9. The validation method is k-fold cross-validation with k = 10. Another conclusion is that there are 44 supporting factors for stunting. Ranked from most to least significant, the top 10 supporting factors for stunting are:
1. Baby's weight at birth.
2. Baby's height at birth.
3. Number of meals per day.
4. Breast milk.
5. Number of diarrhea episodes per 3 months.
6. Child development examination during COVID by a health worker at home.
7. Mother's age at birth.
8. Mother's height at birth.
9. Number of siblings.
10. Age when the first food was given.
A limitation of this research is that the model was not tested on other datasets, so the reliability of the findings on different datasets has not been assessed.

Figure 2. Comparison of Original Data vs SMOTE Data

Figure 1. Feature Rank

Figure 4. Proposed Model of ML

Table 1
Performance of Nine ML Algorithms

Table 2
Performance of Nine Algorithms on the SMOTE Dataset

Table 3
Feature Weighting Performance on RF Algorithm

Table 4
Feature Selection Performance on GA Algorithm

After finding the best selection parameters on GA, the next step is to test the crossover method on GA to see which performs best. Three crossover methods were tested: One Point, Uniform, and Shuffle. The test results are as follows:

Table 5
Crossover Method Performance on GA Algorithm

Table 6
Crossover, Mutation, Population Performance on GA Algorithm

Table 7
Split Data Percentage Performance on RF Algorithm

Table 8
Criteria Assessment Method Performance on RF Algorithm

Table 9
Performance of Nine Algorithms Using Chosen Parameters

Table 11. Feature Rank

Proposed Machine Learning Models
The contribution of this study is to propose a Supervised Learning classification model for detecting stunting cases using seven stages, namely: 1. SMOTE Upsampling. 2. Variable encoding. 3. Data Normalization. 4. SVM Feature Weighting. 5.