CHURN RATE MODELING FOR TELECOMMUNICATION OPERATORS USING DATA SCIENCE METHODS

Telecommunication companies operate in an extremely competitive market. Attracting a new customer costs 5-10 times more than retaining an existing one. As a result, effective churn management and analysis of the reasons for customer churn are vital tasks for telecommunication operators, and predicting which subscribers will switch to competitors becomes very important. Data Science and machine learning create enormous opportunities for solving this task: evaluating customer satisfaction with company services, determining the factors that cause dissatisfaction, and forecasting which clients are at greater risk of leaving and changing service provider. A company that uses data analysis and modelling to develop churn prediction models can improve its churn management and its business results. The purpose of the research is to apply machine learning models to a telecommunications company, in particular to build models that predict user churn, and to show that Data Science models and machine learning are high-quality and effective tools for forecasting the key marketing metrics of a telecommunications company. Using the example of a Telco, the article presents the results of implementing several classification models (logistic regression, Random Forest, SVM, and XGBoost) in the Python programming language. All models are of high quality (overall accuracy is over 80%). The paper thus demonstrates that it is feasible to use such models to classify customers, anticipate subscriber churn (clients who may abandon the company's services), and minimise customer outflow on this basis. The main factors influencing customer churn are established, which provides basic information for further forecasting of client outflow.
Implementing customer outflow prediction models will help to reduce customer churn and maintain customer loyalty. The research results can be useful for optimising the marketing activity of telecommunication companies in managing customer outflow, by supporting effective data-based decisions and improving the mathematical methodology of outflow forecasting. Therefore, the study's main theoretical and practical contributions are an efficient forecasting tool that enterprises can use to control outflow risks, and an enrichment of the research on data analysis and Data Science methodology for identifying the essential factors that determine the propensity of customers to churn.


Introduction.
Modern digitalisation trends extend to all areas of people's lives and the economy. In today's global economy, working with data is a vital component of any strategy for business development and expansion. The continued expansion of telecommunication technologies during COVID-19 has increased competition between businesses and permanently changed consumer behaviour, putting considerable pressure on businesses while opening new opportunities for them to compete locally and globally. This has accelerated the digitisation of the economy, as e-commerce has grown in importance and large numbers of consumers have moved online. The trend is expected to continue, so that in the future at least 90% of purchases will be made online and retaining consumers will become an increasingly difficult task. At the same time, consumer standards have risen, increasing the pressure on businesses to adopt new technologies.
Big Data Analytics (BDA) is the process of analysing big data, which is characterised by large volumes and high value, and it provides the insight needed for effective management decisions about business development (Ryfiak, 2020). BDA in many ways determines the activities of companies because it contributes to understanding the behaviour and interests of users, identifying user satisfaction or the risk of their outflow, and increasing business revenues (Riddle, 2020). With this in mind, Data Science and Machine Learning methods, models, and technologies must be increasingly applied to improve the operations and business development of enterprises. Companies must therefore assess their relationships with clients, evaluate success from the client's perspective, and carry out a customer-centric strategy. The advancement of technologies such as database processing contributes to innovations that increase the interactivity and personalisation of the consumer experience (Rosario et al., 2021). This intensifies competition and drives enterprises to pay more attention to the needs of consumers, turning those needs into powerful elements of business strategy and business processes. Management decisions and operations should be underpinned by knowledge of current and potential consumers and by systematic analysis of customer relationships. Consumer research based on modern ICT tools supports customer profiling and helps determine the multiple behavioural dimensions that shape customers' behaviour and intentions (Singh et al., 2020). The telecommunication industry is currently approaching saturation as supply overtakes demand (Fang, 2021), so the key source of business growth is customers switching between providers. Marketing research shows that attracting a new client costs 5-10 times more than retaining a current one.
Effective management of client outflow and determining the reasons for client churn have therefore become very important tasks for telecommunication companies (Thakkar et al., 2022). Forecasting customer outflow is one of the main tasks in the marketing activity of a telecommunication company, which increases the relevance of the current research and the need to find appropriate methods for solving it.
The purpose of the current study is the application of different machine learning models for a telecommunications company, in particular, the construction of models for predicting the user churn rate and proving that Data Science models and machine learning are high-quality and effective tools for solving the tasks of forecasting the key marketing metrics of a telecommunications company.
Literature Review. Many research papers by modern scientists prove the relevance of using big data to analyse various processes, and attention to BDA has been growing over the last ten years. With the rapid increase in the use of Internet technologies, a huge amount of data is generated and stored (Sekli et al., 2021). Many companies analyse this big data to improve their business strategy and gain market advantages (Maroufkhani et al., 2019) by improving their processes, improving their work with the customer base, increasing loyalty to the company, and maintaining and increasing revenues (Riddle, 2020). The main goal of BDA is to improve decision-making processes by exploring and analysing big data such as social media data (text, photo, video), text messages, etc. (Ryfiak, 2020). Implementing BDA helps companies improve their operational activities and develop effective strategies to gain benefits and improve customer value (Arya et al., 2016). Technologies such as BDA are required to generate opportunities to innovate by extracting valuable insights. Sekli and Vega (2021) examine the factors influencing the adoption of BDA and assess its relationships with knowledge and performance management. In addition, the authors provide practical guidelines for decision-makers who are responsible for defining a strategy for the use of big data in institutions of higher education.
Govorek (2021) substantiates the strategies by which telecom operators use BDA to improve their competitive positions in the market and attract a wide customer base. Radukych et al. (2019) examine the impact and consequences of digital transformation on telecommunication industries, which show signs of network externalities. Additionally, the article presents a methodological basis for assessing the information society. Yusuf-Asaiah et al. (2017) propose evaluating the perceived quality of experience to help telecom operators manage network performance more effectively and ensure satisfaction with mobile Internet services.
Mobile network operators are facing a huge increase in data on their networks. Big data technologies present an up-to-date approach to working with accumulated data. A study by Simakovich et al. (2021) proposes big data solutions that can accumulate and process huge amounts of data to obtain valuable insights. This creates an opportunity for mobile operators to improve the quality of their networks and offer their customers complete IoT solutions. Such analysis matters because the attitude of customers towards brands and the products they purchase has grown in importance. Data analysis is a main tool for identifying customer sentiment and specific product or service problems, and for extending customer lifetime value (Parveen et al., 2021). A study by Dai (2017) highlights the significance of consumer feedback, which is a great source of information for telecommunication companies. That article uses a quantitative methodology to predict the number of dissatisfied users. The author focuses first on selecting predictive features and then on the classification model. Dai (2017) used several data sets related to complaints and survey data.
The success of e-commerce is mainly supported by consumers' engagement in the Internet environment. Hsu (2019) shows that a marketing strategy built around consumers' participation in or observation of an event is effective because it becomes an incentive and induces purchase intention. A customer-centric marketing strategy focuses on consumers' feelings, ideas, emotions, and behaviours as reflected in their interactions in the online environment. However, Yen (2014) claims that consumer behaviour and purchase intentions change with the expansion of the Internet and Internet-based e-commerce platforms. For example, today's customers are concerned about information satisfaction, an organisation's reputation, and enhanced benefits. Consequently, companies are combining social media and e-commerce to maximise exposure and grow brand visibility and awareness.
Marketing can be defined as profitably identifying and satisfying customers' needs. Varga and Gabor (2021) state that marketing on social networks helps companies promote products and services more easily by focusing advertising on the most appropriate consumer segments. The study's main hypothesis, that social networks influence people's decisions, was confirmed.
Social media allows companies to convey brand information and offers, and e-commerce platforms provide a place where clients can see all products and the relevant product information needed to make decisions. Company website data can be actively used to improve the organisation's business strategy. Unlike traditional offline marketing, e-commerce makes it possible for clients to receive relevant information about enterprises, their offers, and their business activities from websites, as well as reviews and recommendations from other consumers (Rosario et al., 2021). As a result, competition increases because the consumer can easily switch from one service provider to another. Increasing satisfaction through better service, higher-quality products, interactions, and the formation of good experiences is the basis for the success of companies operating in a competitive environment.
E-commerce involves implementing digital technology in business processes to increase online transactions and sales. The evolution of e-commerce, however, follows changes in technological development (Yang et al., 2019). Thanks to data analysis technologies based on Data Science and Machine Learning, businesses can monitor demographic characteristics and consumer purchasing activity in order to match the products and services they offer to consumer requirements and needs and to predict key business performance indicators. Machine learning and Data Science technologies allow real-time monitoring of the relationship between business efficiency and marketing activity, including media activity (Chornous et al., 2021; Fareniuk, 2022). In general, such studies are highly relevant for making predictions and developing optimal decisions for business management in conditions of uncertainty and high risk.
Implementing the modern technologies that make up Data Science, such as Data Engineering, Machine Learning, and Data Management, is now crucial for processing data effectively (Moorthi et al., 2020). Hurtado et al. (2019) note that it is important to form an effective process for collecting and analysing data on all possible business functions and to implement it in a complete Data Science solution to improve and control business performance. At the current stage of market development, it is not enough for enterprises only to analyse datasets in order to form highly accurate predictions, develop the most cost-effective processes, and minimise costs (Fedirko et al., 2019).
In general, there is a huge number of Data Science tools, and they should be implemented in business processes. Nevertheless, the technologies and methodology of Data Science and Machine Learning are not a magic solution for business, because the way they are implemented directly affects their effectiveness. Experts claim that around 85% of projects must be implemented more effectively. A correct management process for implementing and developing Data Science projects is key. It is important to follow the general rules for working efficiently with innovation and for using technology effectively in the company so that the innovation is useful (Zhang et al., 2021).
Methodology and research methods. In the current research, the following methods of general scientific research were applied: empirical (familiarisation with the basics of Data Science methods), theoretical, observation, experiment, analysis and synthesis, induction and deduction, as well as mathematical methods, statistical information processing methods, and parametric models.
Machine learning classification models were used to predict e-commerce metrics, in particular the churn rate for telecommunication companies (a Telco was selected for the experiment). Four classical machine learning methods were chosen for analysis and simulation: boosting (XGBoost), logistic regression, classification trees, exemplified by Random Forest, and support vector machines (SVM). These methods are among the most popular and the most useful given the task and the specifics of the data collected for the experiment.
We define the stages of machine learning as follows: training itself; evaluation of the quality of the models; choosing the best of them; optimisation of the model's input parameters to increase accuracy; estimation of the target parameter or classification; and analysis of the results. By training a model, we mean passing data to it so that it derives a pattern of dependencies in those data. The data that the model accepts for analysis should be divided into two or three groups: training, test, and validation samples. When building a computer model, software is selected for calculating the constructed mathematical models and obtaining results in line with the set goal. This research used the Python programming language for data processing and modelling.
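The division of the data into training, validation, and test samples can be sketched as follows. This is a minimal illustration rather than the code used in the study: the synthetic data, the 60/20/20 proportions, the stratification, and the choice of scikit-learn's `train_test_split` are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 customers, 5 numeric features,
# and a binary churn label with roughly a quarter of positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.26).astype(int)

# Step 1: hold out 20% of all customers as the final test sample.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Step 2: split the remaining 80% into training (60% of all customers)
# and validation (20% of all customers) samples.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying on the label keeps the share of churners approximately equal in all three samples, which matters when one class is much rarer than the other.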
To assess the quality of the classification methods, the confusion matrix is used together with the following measures: accuracy, which shows the percentage of correctly classified objects in the database; precision, the proportion of objects that the classifier determined as positive (belonging to the specific group under consideration) and that are indeed positive; and recall, which reflects the ability of the algorithm to find the positive group as a whole and is defined as the share of objects of the positive group that the algorithm classified correctly among all objects of this group. These characteristics allow us to choose the optimal model for further analysis.
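The three measures follow directly from the four cells of the confusion matrix. The sketch below shows the arithmetic; the function name `classification_metrics` and the counts are hypothetical and are not taken from the study's tables.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall from confusion-matrix counts.

    tp / fp: customers predicted to churn who did / did not churn.
    fn / tn: customers predicted to stay who did / did not churn.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts for 1,000 customers (illustrative only).
metrics = classification_metrics(tp=150, fp=70, fn=110, tn=670)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is high while recall is much lower, which is the typical pattern when churners are the minority class.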
Results. To investigate the practical application of Data Science in e-commerce, we use several models based on the open data of a telecommunications company that provides several services to its users. The data describe a Telco and were downloaded from the IBM Sample Data Sets for customer retention programs. The study aims to predict churn behaviour to help the company develop a customer retention strategy. Each row of the collected database represents a customer, and each column contains a factor that characterises the customer:
• customers with churn within the last month;
• the services subscribed to by each customer, such as phone connection, Internet, streaming TV and movies, multiple lines, online security, device protection, online backup, and technical support;
• information about the client's account: how long they have been a client, type of contract, method of payment, monthly payments, paperless account, and general costs;
• information about the demographic characteristics of the clients: age, gender, presence of partners.
Customer outflow is one of the largest problems for the telecommunications market. Studies report that the average monthly churn rate among leading mobile and internet operators in the US market is from 1.9% to 2%. Before building models for outflow forecasting, it is necessary to examine the dataset's structure and quality and to form hypotheses.
By examining the correlations between the variables and the churn metric, we can conclude that monthly contracts, lack of technical support, and lack of online security correlate positively with churn. On the other hand, two-year contracts correlate negatively with churn. Subscriptions to online services such as streaming TV, online security, technical support, online backup, etc. are negatively associated with outflow (Figure 1).
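One way to obtain such correlations for categorical variables is to one-hot encode them and correlate each indicator with a 0/1 churn flag. The sketch below illustrates the idea on a tiny hypothetical table; the column names and values are assumptions and do not reproduce the study's dataset or Figure 1.

```python
import pandas as pd

# Tiny hypothetical sample; column names imitate a Telco-style table (assumption).
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year",
                 "Month-to-month", "Two year", "Month-to-month", "One year"],
    "TechSupport": ["No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
    "Churn": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "No"],
})

# One-hot encode the categorical columns: each level becomes a 0/1 indicator.
encoded = pd.get_dummies(df.drop(columns="Churn"), dtype=float)
encoded["Churn"] = (df["Churn"] == "Yes").astype(float)

# Pearson correlation of every indicator with the churn flag, sorted
# from the most positive to the most negative association.
corr_with_churn = (
    encoded.corr()["Churn"].drop("Churn").sort_values(ascending=False)
)
print(corr_with_churn)
```

In this toy sample the monthly-contract and no-technical-support indicators come out positively correlated with churn and the two-year-contract indicator negatively, mirroring the direction of the relationships described above.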
Many customers use the services of the telecommunications company for only a month, while quite a few have used them for around 72 months. This could be because different customers have different contracts: depending on the contract they entered into, it may be easier or harder for clients to leave the Telco.
The number of customers under each type of contract should also be considered. Most customers (almost 4,000) are on a monthly contract, while 1-year and 2-year contracts have roughly the same number of customers (1,500-2,000). Most monthly contracts last 1-2 months, while 2-year contracts tend to last around 70 months. This shows that clients who take a longer contract are more loyal to the provider and tend to subscribe to services for a longer period.
Next, it is worth looking at the value of the main metric to be predicted. The churn rate is the share of consumers who stopped using the company's services during a certain period. According to the data, 26% of customers stopped using the services of this telecommunications company. The data are imbalanced because a much larger part of the customers does not stop using the services. It is critical to keep in mind during modelling that this imbalance can lead to many false negatives.
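The churn rate and the degree of imbalance can be computed directly from the labels. The sketch below also derives "balanced" class weights, one common remedy for imbalance; the study does not state that such weights were used, so this step and the hypothetical label list are assumptions.

```python
from collections import Counter

# Hypothetical labels: 26 churners (1) among 100 customers, mirroring a 26% share.
labels = [1] * 26 + [0] * 74

counts = Counter(labels)
churn_rate = counts[1] / len(labels)

# "Balanced" weights: n_samples / (n_classes * class_count). The rarer class
# (churners) receives a larger weight, so misclassifying a churner costs more
# during training.
class_weights = {
    cls: len(labels) / (len(counts) * n) for cls, n in counts.items()
}

print(f"churn rate: {churn_rate:.2%}")
print(class_weights)
```

Weights of this kind can be passed to classifiers that accept per-class weights, which penalises false negatives on the minority class more heavily.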
Let us consider the dependence of the outflow coefficient (churn rate) on some other variables and form hypotheses: • Churning vs Tenure: As can be seen from Figure 2, non-churning customers tend to stay longer with a Telco.
• Churn by type of contracts: Customers with a monthly contract are characterised by a higher churn rate (Figure 3).
• Outflow by age: The attrition rate of the elderly is almost twice as high as that of the younger population (Figure 4).
• Churn by monthly and total payments: customer churn is higher when monthly or total payments are high (Figure 5 and Figure 6).
Figure 6. Distribution of total charges by churn
Sources: developed by the authors.
The study applies four models: Random Forest, XGBoost, logistic regression, and SVM. For logistic regression, it is crucial to scale all indicators so that they lie between 0 and 1. This helped increase the accuracy from 79.7% to 80.8%. Logistic regression has low false negative results (Table 1).
Sources: developed by the authors.
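Scaling the indicators to the range from 0 to 1 before fitting a logistic regression can be sketched as below. The example is illustrative only: the two synthetic features, the rule generating the labels, and the use of scikit-learn's `MinMaxScaler` inside a pipeline are assumptions, and the resulting accuracy is unrelated to the figures reported in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in features on very different scales (assumed names):
# tenure in months and monthly charges in currency units.
rng = np.random.default_rng(0)
tenure = rng.integers(1, 73, size=500)
monthly = rng.uniform(20, 120, size=500)
X = np.column_stack([tenure, monthly])

# Synthetic rule: short tenure and high charges make churn more likely.
logit = -0.08 * tenure + 0.03 * monthly
y = (rng.random(500) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The scaler is fitted on the training sample only and then reused on the
# test sample, so no information leaks from the test data.
model = make_pipeline(MinMaxScaler(), LogisticRegression())
model.fit(X_train, y_train)

scaled_train = model.named_steps["minmaxscaler"].transform(X_train)
print("scaled range:", scaled_train.min(), scaled_train.max())
print("test accuracy:", model.score(X_test, y_test))
```

With all indicators on a common scale, the fitted coefficients also become directly comparable, which helps when interpreting the direction and strength of each factor.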
By examining the effects of the factors (Figure 7), it can be seen that some characteristics have a positive relationship with churn as the predicted variable, while others have a negative relationship. A negative relationship indicates that the probability of outflow drops with those characteristics. Let us single out certain observations:
• A subscription with a 2-year contract decreases the likelihood of outflow. Tenure and a 2-year contract have the most negative relationship with outflow, as estimated by the logistic regression model.
• A subscription to DSL Internet service also lessens churn.
• General costs, internet services, monthly contracts, and the age of users can lead to an increase in churn rate. This is quite curious in the case of fibre optic internet: although it offers higher speeds than other services, its subscribers are more likely to stop using the services.
Implementing the Random Forest model achieves an accuracy of 80.9%, which is very close to that of logistic regression. According to the Random Forest algorithm, tenure, monthly contract, and general costs are the most important characteristics for churn prediction and modelling. These results are similar to those of logistic regression.

Figure 8. The importance of factors which influenced customer churn, determined by the Random Forest model
Sources: developed by the authors.
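A ranking of factors of this kind can be read from a fitted Random Forest. The following sketch uses scikit-learn's impurity-based `feature_importances_` on synthetic data; the feature names, the data-generating rule, and the hyperparameters are assumptions and do not reproduce Figure 8.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): two informative features and one
# pure-noise feature, with churn more likely for short tenure and for
# customers on a monthly contract.
rng = np.random.default_rng(1)
n = 600
feature_names = ["tenure", "monthly_contract", "noise"]
tenure = rng.integers(1, 73, size=n)
monthly_contract = rng.integers(0, 2, size=n)
noise = rng.normal(size=n)
X = np.column_stack([tenure, monthly_contract, noise])
logit = -0.06 * tenure + 1.5 * monthly_contract
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X, y)

# Impurity-based importances are non-negative and sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

Impurity-based importances tend to favour features with many distinct values, so a ranking like this is best cross-checked against another measure, such as permutation importance on held-out data.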
Implementing the SVM algorithm increases accuracy to 82%. Analysis of the confusion matrix (Table 2) shows that the false negative rate is lower than in the logistic regression model (8.5% vs 10.3%).
Sources: developed by the authors.
At the final stage, we construct the XGBoost model based on boosting algorithms, which increases accuracy to 83%. Of the Data Science models applied, XGBoost was the best prediction model for this dataset, with the highest accuracy. However, the support vector machine algorithm gives the lowest level of false negative results. As a result, the combination of XGBoost and SVM is recommended for implementation in a telecommunications company to predict the churn rate for the customer base.
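The text does not specify how the two models should be combined. One possible reading is a soft-voting ensemble that averages the predicted churn probabilities of a boosting model and an SVM, sketched below. Several things here are assumptions: the voting scheme itself, the synthetic data, and the use of scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that the example needs no extra dependency.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 4 numeric features, binary churn label.
rng = np.random.default_rng(7)
n = 800
X = rng.normal(size=(n, 4))
logit = 1.2 * X[:, 0] - 0.8 * X[:, 1] - 1.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

# Stand-in for XGBoost: scikit-learn's gradient boosting classifier.
boosting = GradientBoostingClassifier(random_state=7)
# SVMs are scale-sensitive, so the SVM gets its own scaler;
# probability=True enables the probability estimates soft voting needs.
svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=7))

# Soft voting averages the churn probabilities predicted by both models.
ensemble = VotingClassifier(
    estimators=[("boosting", boosting), ("svm", svm)], voting="soft"
)
ensemble.fit(X_train, y_train)

pred = ensemble.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
accuracy = accuracy_score(y_test, pred)
# False negatives expressed as a share of all test customers.
false_negative_share = fn / len(y_test)
print(f"accuracy={accuracy:.3f}, false negatives={false_negative_share:.3f}")
```

Reporting accuracy together with the false negative share reflects the two criteria on which the models are compared above: overall correctness and the number of churners that go undetected.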
Conclusions. The active use of big data has become a requirement in business development and decision-making in the 21st century, thanks to the ever-increasing opportunity to produce and collect it at extraordinary speed. Thus, the terms "data analytics", "big data", and "Data Science" have become frequently used in the last 5-10 years, which has led to their active application in numerous areas of economic and social life.
Reducing the outflow of customers is one of the most important tasks of the marketing activity of a telecommunications enterprise operating in a highly competitive market. To guarantee the effective development of business, there is a need to use Data Science and machine learning technologies.
It was determined that BDA methods are finding more and more practical applications in business, in particular in the telecommunications industry, to evaluate the level of clients' satisfaction with company services, determine the factors that cause dissatisfaction, and foresee which clients are at greater risk of leaving and changing service provider.
Data Science and machine learning allow management and the marketing department to establish marketing processes dedicated to retaining customers who tend to switch suppliers. The study proves the necessity of the implementation of Data Science models and machine learning for the telecommunications market because they demonstrate high-quality forecasting of key indicators and make it possible to improve business results.
At the modelling stage, the main models (Random Forest, XGBoost, logistic regression, and support vector machines) were implemented using the Python programming language. The accuracy of the results exceeded 80%, which points to the possibility and feasibility of applying the models to predict which customers may abandon the company's services and to minimise customer outflow on this basis. The built models are effective because they have high accuracy and a sufficiently low level of false positive/negative results (within 10%). A high-quality classification makes it possible to adequately determine the level of client loyalty and the intention to leave the provider with an overall accuracy of more than 80%. As the results of the study are intended to be used to prevent customer outflow when planning marketing activities, it is appropriate to prioritise the models considering both accuracy and ease of interpretation.
The key factors affecting the churn of clients from the company's client base are the type of contract (monthly or long-term), the duration of using the services, and the costs of the telecommunication company's services. Managing these factors and finding additional levers for influencing customer behaviour is the basis for improving the company's services and increasing the business's operational results. Because the models built in the study showed high accuracy in classifying customer loyalty, i.e. the client's desire to stay with the enterprise or switch to another telecommunication provider, there is an opportunity to implement prompt, proactive management that prevents customer churn (since acquiring a new client is 5 to 10 times more expensive than keeping a current one), retains clients' satisfaction and loyalty, and reduces expenses.
The results of the research can be useful for optimising the marketing activity of telecommunication enterprises in managing customer outflow, by supporting effective decisions based on data analysis and modelling, as well as for improving the mathematical methodology of forecasting customer outflow. Therefore, the most significant theoretical and practical contributions of the study are an efficient forecasting system that businesses can use to monitor churn risks and an enrichment of the scientific resources on data analytics, machine learning methods, and Data Science methodology for identifying the vital factors that determine the propensity of customers to churn. Additionally, regular monitoring is relevant for the timely identification of new potential factors that will lead to customer churn. Furthermore, it is recommended to repeat the research periodically and to consider additional variables that could influence the churn rate.
The main limitations of this study are the use of only four machine learning models and the sample imbalance, which can be overcome by implementing additional balancing procedures and applying cost-sensitive classification methods. Models may become outdated due to fundamental changes in market conditions or tariff plans, a decline in the service and quality of communication, a substantial change in market positions, or large changes in the economic or political situation in the country. To prevent such a situation, it is relevant to update the models monthly and to react to possible threats in time.