Issue 
Wuhan Univ. J. Nat. Sci.
Volume 28, Number 5, October 2023



Page(s)  451-460
DOI  https://doi.org/10.1051/wujns/2023285451  
Published online  10 November 2023 
Information Technology
CLC number: TP301.6
Improved Hybrid Collaborative Filtering Algorithm Based on Spark Platform
^{1} College of Computer Information Engineering, Jiangxi Normal University, Nanchang 330022, Jiangxi, China
^{2} National-Level International Science and Technology Cooperation Base of Networked Supporting Software, Nanchang 330022, Jiangxi, China
Received: 1 July 2022
An improved Hybrid Collaborative Filtering algorithm (HCF) is proposed, addressing the issues of data sparsity, low recommendation accuracy, and poor scalability present in traditional collaborative filtering algorithms. The core of HCF is a linear weighted hybrid algorithm based on the Latent Factor Model (LFM) and the Improved Item Clustering and Similarity Calculation Collaborative Filtering Algorithm (ITCSCF). To begin with, the items are clustered based on their attribute dimension, which accelerates the computation of the nearest neighbor set. Subsequently, HCF enhances the formula for scoring similarity by penalizing popular items and optimizing unpopular items. This improvement enhances the rationality of scoring similarity and reduces the impact of data sparseness. Furthermore, a weighting function is employed to combine the various improved algorithms. The balance factor of the weighting function is dynamically adjusted to attain the optimal recommendation list. To address the real-time and scalability concerns, the algorithm leverages the Spark big data distributed cluster computing framework. Experiments were conducted using the public dataset MovieLens, where the improved algorithm's performance was compared against the algorithm before enhancement and the algorithm running on a single machine. The experimental results demonstrate that the improved algorithm performs better in terms of data sparsity, recommendation personalization, accuracy, recall, and efficiency.
Key words: recommendation algorithm / collaborative filtering / latent factor model / score weighting / item clustering / Spark / similarity calculation
Biography: YOU Zhen, female, Associate professor, research direction: software formalization, concurrent distributed computing, virtual reality, big data algorithm. Email: youzhenjxnu@163.com
Foundation item: Supported by the Natural Science Foundation of Jiangxi Province (20212BAB202018), Provincial Virtual Simulation Experiment Education Project of Jiangxi Education Department (202020048) and the Science and Technology Research Project of Jiangxi Province Educational Department (GJJ210333)
© Wuhan University 2023
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
0 Introduction
With the rapid development of Internet information technology, people have entered the era of big data, leading to an explosive growth of data information. Faced with this vast amount of information resources, people often have to invest significant time and effort in filtering the content they are interested in. To address this issue of information overload^{[1]}, the recommendation algorithm emerged^{[2]}. Serving as a novel form of implicit information service, the recommendation algorithm has found extensive application in the Internet industry, yielding favorable outcomes in e-commerce, short video platforms, and other domains. Furthermore, it substantially reduces the cost of user information retrieval.
Among the recommendation algorithms, the collaborative filtering algorithm is widely used^{[3]}, and it can be categorized into item-based collaborative filtering^{[4]}, user-based collaborative filtering^{[5]}, and model-based collaborative filtering^{[6]}. The fundamental concept of this algorithm is to calculate user similarity or item similarity solely from users' interactive behavior data, without extracting item characteristics. This allows us to recommend content that might be of interest to users. However, traditional collaborative filtering also has some shortcomings, such as data sparsity, low accuracy, lack of timeliness, and scalability issues^{[7]}. Numerous scholars have conducted extensive research on recommendation algorithms and systems. Liu et al^{[8]} proposed a multi-factor weight collaborative filtering recommendation algorithm. They introduced time data and penalty factors into the similarity coefficient, resulting in a new similarity calculation method. Moreover, by considering the dynamic changes in interest factors, the recommendation accuracy was significantly improved. Nonetheless, this approach suffers from high calculation complexity, and its effectiveness on sparse data is unclear. Tao et al^{[9]} proposed a recommendation algorithm based on grey correlation clustering, which employs the grey correlation degree to determine user similarity. This method maintains recommendation quality on highly sparse data but can only handle datasets with internal correlations between users and items. Other approaches, like the Kullback-Leibler (KL)-based Similarity Measure (KLCF)^{[10]}, incorporate item similarity weight into the user similarity formula, enhancing the accuracy of similarity to some extent. However, this involves a substantial number of Cartesian products, leading to high calculation costs.
Zhang et al^{[11]} proposed an explicit-implicit feedback algorithm with a similarity weighting strategy, which effectively improved the recommendation accuracy based on the differential privacy index. However, it failed to improve accuracy when the dataset was sparse. Most of the mentioned references focus on enhancing recommendation accuracy by improving similarity, but they only address a single direction, neglecting the simultaneous problems of low accuracy and data sparsity. Furthermore, with the exponential growth of data volume, the previous standalone recommendation models struggle to handle large-scale recommendation calculations. The processing efficiency of standalone computing is low, resulting in poor scalability and long processing times for real-time recommendation.
To address the issues of data sparsity, low accuracy, and poor scalability in traditional collaborative filtering algorithms, we propose an enhanced Hybrid Collaborative Filtering algorithm (HCF). This algorithm incorporates several improvements. Firstly, we cluster item attributes to shorten the calculation time of nearest neighbor sets. Secondly, we enhance the formula for scoring similarity calculation so that the computed similarity between items is closer to its actual value, thereby reducing the impact of data sparsity and enhancing the accuracy of recommendations. Thirdly, the two improved algorithms are combined with linear weights using a balance factor, allowing us to fully explore users' personalization and potential preferences in order to obtain the optimal recommendation list. Finally, we leverage the advantages of Spark distributed computing and algorithmic clustering techniques to enhance the scalability of the recommendation system and the responsiveness of the algorithm.
1 Collaborative Filtering Algorithm
1.1 Alternating Least Squares (ALS) Algorithm Based on Latent Factor Model (LFM)
LFM^{[12]} is a type of matrix factorization, belonging to the realm of machine learning algorithms. It incorporates latent factors to establish the relationship between user interests and items. The fundamental idea of this model lies in breaking down the high-dimensional data matrix of user ratings on items. To tackle this problem, the key algorithm employed is the ALS algorithm, which involves three main stages. Firstly, the user-item rating matrix is constructed. Then, the rating matrix is decomposed into the product of two low-rank matrices. Finally, ratings are predicted and recommendations are made. The calculation process is as follows.
1) Build the user-item rating matrix: Build a rating matrix R_{mn} with m rows and n columns, as shown in Fig. 1, where R_{ui} represents the interest score of user u for item i. Since not every user rates every item, R_{mn} is often a sparse matrix.
Fig. 1 LFM principle user item rating representation diagram 
2) ALS matrix dimension reduction calculation: For the matrix R_{mn}, find two low-rank matrices X_{mf} and Y_{fn} whose product $\tilde{R}$ approximates it, where X_{mf} is the user's implicit preference matrix for the item, Y_{fn} is the implicit feature matrix contained in the item, f is the number of hidden class features, and f ≤ min(m, n). The formula is as follows:
${\tilde{\mathit{R}}}_{mn}={\mathit{X}}_{mf}^{\mathrm{T}}\cdot {\mathit{Y}}_{fn}$(1)
In order to make the product $\tilde{\mathit{R}}$ as close as possible to R_{mn}, a loss function that minimizes the squared error is used. And to prevent overfitting, a regularization term is added:
$\min C(\mathit{X},\mathit{Y})=\sum _{u,i}{({r}_{ui}-{\mathit{x}}_{u}^{\mathrm{T}}\cdot {\mathit{y}}_{i})}^{2}+\lambda \sum _{u}{\Vert {\mathit{x}}_{u}\Vert}^{2}+\lambda \sum _{i}{\Vert {\mathit{y}}_{i}\Vert}^{2}$(2)
Among them, x_{u} is the implicit feature vector of user u's preference for items, y_{i} is the implicit feature vector of item i, r_{ui} is the actual rating of the ith item by the uth user, ${\mathit{x}}_{u}^{\mathrm{T}}\cdot {\mathit{y}}_{i}$ is the approximate score of user u for item i, and $\lambda \sum _{u}{\Vert {\mathit{x}}_{u}\Vert}^{2}+\lambda \sum _{i}{\Vert {\mathit{y}}_{i}\Vert}^{2}$ is the regularization term, where the regularization coefficient $\lambda $ can be obtained by cross-validation.
Because formula (2) contains two coupled variables, x_{u} and y_{i}, we use the alternating least squares (ALS) method to solve it. Fixing X, taking the partial derivative of the loss function C(X,Y) with respect to y_{i}, and setting it equal to 0, we get:
$\frac{\partial C(\mathit{X},\mathit{Y})}{\partial {\mathit{y}}_{i}}=-2{\mathit{X}}_{mf}{r}_{i}+2{\mathit{X}}_{mf}{\mathit{X}}_{mf}^{\mathrm{T}}{\mathit{y}}_{i}+2\lambda {\mathit{y}}_{i}=0$(3)
${\mathit{y}}_{i}={({\mathit{X}}_{mf}{\mathit{X}}_{mf}^{\mathrm{T}}+\lambda E)}^{-1}{\mathit{X}}_{mf}{r}_{i}$(4)
In the same way, fixing Y and taking the partial derivative with respect to x_{u}, we can get:
${\mathit{x}}_{u}={({\mathit{Y}}_{fn}{\mathit{Y}}_{fn}^{\mathrm{T}}+\lambda E)}^{-1}{\mathit{Y}}_{fn}{r}_{u}$(5)
The equations (4) and (5) are calculated repeatedly in sequence, and the root mean square error (RMSE)^{[13]} is introduced as the condition parameter for terminating the iteration.
3) Prediction score: Convergence of the result can be determined when the RMSE value experiences slight fluctuations and reaches a certain level of accuracy, or when the maximum number of iterations is reached. For prediction and scoring, it is recommended to use the final training result matrix following formula (6).
${\mathit{P}}_{\mathrm{a}\mathrm{l}\mathrm{s}}=\tilde{\mathit{R}}={\mathit{X}}^{\mathrm{T}}\mathit{Y}$(6)
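The alternating updates of equations (4) and (5) can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the paper: it assumes the sums run only over observed ratings (a common convention), and the names `als`, `ridge_update`, and `solve`, as well as the default values of `f`, `lam`, and `tol`, are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ridge_update(factors, targets, f, lam):
    """Closed-form update (A A^T + lam E)^(-1) A r, the shape of equations (4) and (5).
    factors holds the fixed latent vectors, targets the matching observed ratings."""
    gram = [[lam if p == q else 0.0 for q in range(f)] for p in range(f)]
    rhs = [0.0] * f
    for vec, r in zip(factors, targets):
        for p in range(f):
            rhs[p] += vec[p] * r
            for q in range(f):
                gram[p][q] += vec[p] * vec[q]
    return solve(gram, rhs)

def rmse(ratings, x, y):
    """RMSE of the predictions x_u^T y_i over the observed ratings."""
    errors = [(r - sum(a * b for a, b in zip(x[u], y[i]))) ** 2
              for (u, i), r in ratings.items()]
    return (sum(errors) / len(errors)) ** 0.5

def als(ratings, m, n, f=2, lam=0.1, max_iter=20, tol=1e-4, seed=0):
    """ratings: dict (user index, item index) -> score. Returns X, Y and the RMSE history."""
    rng = random.Random(seed)
    x = [[rng.random() for _ in range(f)] for _ in range(m)]
    y = [[rng.random() for _ in range(f)] for _ in range(n)]
    history = [rmse(ratings, x, y)]
    for _ in range(max_iter):
        # Fix X and update every item vector y_i (equation (4)).
        for i in range(n):
            obs = [(u, r) for (u, j), r in ratings.items() if j == i]
            if obs:
                y[i] = ridge_update([x[u] for u, _ in obs], [r for _, r in obs], f, lam)
        # Fix Y and update every user vector x_u (equation (5)).
        for u in range(m):
            obs = [(i, r) for (v, i), r in ratings.items() if v == u]
            if obs:
                x[u] = ridge_update([y[i] for i, _ in obs], [r for _, r in obs], f, lam)
        history.append(rmse(ratings, x, y))
        # Stop when the RMSE only fluctuates slightly, as described in step 3).
        if abs(history[-2] - history[-1]) < tol:
            break
    return x, y, history
```

The predicted score for user u and item i is then the inner product of `x[u]` and `y[i]`, which corresponds to one entry of the matrix in formula (6).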
1.2 Item-Based Collaborative Filtering Algorithm
The Item-Based Collaborative Filtering algorithm (Item CF) is based on the similarity of user ratings of items and recommends items using the nearest neighbor set as a reference. It suggests items similar to the ones the user has previously liked. The dataset used is the user-item rating data. As in the rating matrix R_{mn} in Fig. 1, R_{ui} represents user u's interest score for item i. Since each user has only rated a subset of items, the matrix is sparse. The implementation process of the algorithm is divided into two stages. Firstly, the similarity between items is calculated to obtain the nearest neighbor set. Subsequently, scores are predicted and recommendations are made accordingly. The procedure is as follows:
1) Similarity calculation: The similarity calculation constitutes the most crucial component of the algorithm. Its primary objective is to characterize the similarity between items and derive the nearest neighbor set, a pivotal factor in enhancing recommendation accuracy. To achieve this, a modified cosine similarity is employed for the computation. This choice is made due to the variation in score scales among individual users, necessitating score normalization to eliminate the user dimension. The formula is as follows:
$\mathrm{sim}(i,j)=\frac{\sum _{u\in {U}_{i,j}}({R}_{ui}-\overline{{R}_{u}})({R}_{uj}-\overline{{R}_{u}})}{\sqrt{\sum _{u\in {U}_{i,j}}{({R}_{ui}-\overline{{R}_{u}})}^{2}\sum _{u\in {U}_{i,j}}{({R}_{uj}-\overline{{R}_{u}})}^{2}}}$(7)
Among them, the larger the sim(i,j), the higher the similarity between items i and j. U_{i,j} is the set of users who have rated both i and j, R_{ui} is user u's rating for item i, and $\overline{{R}_{u}}$ is the average rating of user u over all the items that user has rated.
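As a minimal sketch of the modified cosine similarity in formula (7), the function below works on a nested dictionary of ratings. The function name `adjusted_cosine` and the choice to return 0 when the denominator vanishes are assumptions made for illustration.

```python
from math import sqrt

def adjusted_cosine(ratings, i, j):
    """ratings: dict user -> dict item -> score. Returns sim(i, j) as in formula (7)."""
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        # Only users in U_{i,j}, who rated both items, contribute.
        if i in user_ratings and j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[i] - mean
            dj = user_ratings[j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    if den_i == 0.0 or den_j == 0.0:
        return 0.0  # assumption: an undefined similarity is treated as 0
    return num / sqrt(den_i * den_j)
```

Subtracting each user's own mean is what removes the differences in score scales between users.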
2) Predicted score: The score is calculated based on the similarity between the user's unrated items and the items in the nearest neighbor set. The scoring model formula is as follows:
${P}_{\mathrm{item}}=P(u,i)=\overline{{R}_{i}}+\frac{\sum _{j\in {N}_{i}}\mathrm{sim}(i,j)({R}_{uj}-\overline{{R}_{u}})}{\sum _{j\in {N}_{i}}\mathrm{sim}(i,j)}$(8)
where P(u,i) represents the predicted score of item i by user u, N_{i} represents the top n neighbors with the largest similarity to item i, and $\overline{{R}_{i}}$ represents the mean score of item i.
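Formula (8) can be illustrated with the short sketch below. It assumes the similarities are already available in a dictionary `sims` keyed by item pairs, and it keeps only neighbors with positive similarity so that the denominator stays positive. Both choices, like the names `predict` and `item_mean`, are assumptions of this example and not details taken from the paper.

```python
def item_mean(ratings, i):
    """Mean score of item i over the users who rated it."""
    scores = [r[i] for r in ratings.values() if i in r]
    return sum(scores) / len(scores)

def predict(ratings, sims, u, i, n_neighbors=2):
    """ratings: dict user -> dict item -> score; sims: dict (i, j) -> sim(i, j)."""
    user = ratings[u]
    user_mean = sum(user.values()) / len(user)
    # N_i: the most similar items that user u has rated
    # (assumption: only positive similarities are kept).
    candidates = [(sims.get((i, j), 0.0), j) for j in user if j != i]
    neighbors = sorted((c for c in candidates if c[0] > 0), reverse=True)[:n_neighbors]
    base = item_mean(ratings, i)
    if not neighbors:
        return base  # fall back to the item mean when no neighbor is usable
    num = sum(s * (user[j] - user_mean) for s, j in neighbors)
    den = sum(s for s, _ in neighbors)
    return base + num / den
```

For instance, with an item mean of 3.5, a user mean of 4, and two neighbors with similarities 0.8 and 0.2 that the user rated 5 and 3, the predicted score is 3.5 + (0.8 × 1 + 0.2 × (−1)) / 1.0 = 4.1.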
2 Improved Hybrid Collaborative Filtering (HCF) Algorithm
Data sparsity, accuracy, and scalability have consistently been the most prominent issues in recommendation algorithms. Traditional single collaborative filtering algorithms are unable to comprehensively address these problems, nor can they enhance individual performance unilaterally. To address these limitations, this paper introduces an enhanced HCF algorithm that effectively compensates for these shortcomings. The key components of the proposed approach include item clustering based on the traditional Item CF algorithm, enhancement of scoring similarity, and linear weighted fusion of the improved algorithm.
2.1 Improved Collaborative Filtering Algorithm Based on Item CF
To reduce the impact of data sparsity and computational complexity, an enhancement of the traditional Item CF algorithm has been designed. It is named the Improved Item Clustering and Similarity Calculation Collaborative Filtering Algorithm (ITCSCF). This algorithm is also a sub-algorithm within the final hybrid collaborative filtering algorithm. Firstly, items are clustered based on their attributes using clustering techniques, which narrows the search scope of the nearest neighbor set, reducing the complexity of similarity calculations. Then, the formula for rating similarity calculation is improved. Given the different interests of users in popular and unpopular items when data is sparse, a penalty is applied to popular items, and parameters for unpopular items are optimized, reducing the impact of data sparsity and enhancing the accuracy of similarity.
2.1.1 Item clustering
Item CF needs to traverse the entire item data set when finding the nearest neighbor of the target item. For matrix data with m rows and n columns, the time complexity of the traditional collaborative filtering algorithm is O(n*m*m)^{[14]}. To mitigate the impact of data sparsity and reduce computational complexity, items can be scored and clustered. The approach involves selecting M points from the items as the initial cluster centers, then traversing the similarity between all items and the center points. Items with the highest similarity are assigned to their corresponding clusters. Next, the average value in each cluster is calculated and used to update the current cluster center.
This process is iterated repeatedly until the center point remains unchanged. As a result, the neighborhood calculation is reduced from the entire item space to several clusters, significantly reducing computational complexity. The time complexity after clustering is O(m*k*t), where m represents the number of data points, k represents the number of class center points, and t represents the number of cyclic iterations. The clustering algorithm is presented as follows:
Clustering algorithm
Input: Scored item set UIDB and number of clusters M
Output: M cluster classes
Step 1: Select the M starting cluster center points: traverse the item set UIDB, calculate the number of ratings S of each item i, sort all items by S, and take the M items with the largest S values as the starting cluster centers. The cluster center set is recorded as C={c_{1},c_{2},…,c_{M}}; the set of clusters in which the centers are located is recorded as U_{C}={C_{1},C_{2},…,C_{M}}.
Step 2: Traverse and calculate the cosine similarity sim(i, c_{j}) of all items i and the cluster center point c_{j}.
Step 3: Divide each item i into the cluster set C_{j} where the center with the highest similarity is located, and calculate the average value of the clusters as the latest cluster center point.
Step 4: Repeat Steps 2 and 3 until the position of each center point remains unchanged, then stop and return the result.
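The four steps above can be sketched as follows. This is an illustrative version under several assumptions: each item is represented by its vector of scores over all users with 0 meaning "not rated", convergence is detected through unchanged assignments (which implies unchanged centers), and the names `cluster_items` and `cosine` are hypothetical.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two equally long score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_items(item_vectors, num_clusters, max_iter=100):
    """item_vectors: dict item -> list of scores per user (0 = not rated).
    Returns a dict item -> cluster index."""
    # Step 1: the M items with the most ratings become the starting centers.
    by_count = sorted(item_vectors,
                      key=lambda i: sum(1 for s in item_vectors[i] if s),
                      reverse=True)
    centers = [list(item_vectors[i]) for i in by_count[:num_clusters]]
    assignment = {}
    for _ in range(max_iter):
        # Steps 2 and 3: assign each item to the center it is most similar to.
        new_assignment = {
            i: max(range(num_clusters), key=lambda c: cosine(vec, centers[c]))
            for i, vec in item_vectors.items()
        }
        # Step 4: unchanged assignments mean the centers no longer move.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3 (continued): the mean vector of each cluster is the new center.
        for c in range(num_clusters):
            members = [item_vectors[i] for i, a in assignment.items() if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

A nearest-neighbor search for a target item can then be limited to the items that share its cluster index.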
Item clustering involves grouping similar items into the same cluster. When computing the nearest neighbor set, one only needs to select the first few clusters that exhibit the highest similarity to the target item, and then perform the search within those clusters. This approach significantly reduces computational complexity and enhances the efficiency of the algorithm. Additionally, utilizing cosine similarity calculation in clustering helps mitigate the issue of "similarity not being the same", which can arise due to variations in individual attributes.
2.1.2 Improved scoring similarity calculation formula
At the heart of the collaborative filtering algorithm lies the similarity calculation, and its accuracy directly impacts the quality of recommendations. The varying degrees of user interest in popular and unpopular items can influence the precision of similarity calculation. To address this, this paper introduces two enhancements to the similarity calculation formula. The details are as follows:
1) Tuning of unpopular items
In formula (7), the user set U_{ij}, which contains the users who have scored both items i and j, is used to calculate the similarity. However, when the data is sparse, the number of users who provide joint scores in U_{ij} will be very small, leading to significant deviations in the similarity calculation. For example, when there is only one user in U_{ij}, the result of calculating the similarity is 100%. No matter whether the two items are similar or not, they will be selected into the N_{i} set, which directly interferes with the calculation accuracy of P(u,i). To solve this problem, the above formula is multiplied by a weight function g(x). g(x) is an increasing function of x whose growth levels off as x grows, so that the weighting interferes as little as possible with the similarity result and the calculation accuracy of P(u,i) remains high. To sum up, we choose g(x)=lg(1+x), and the improved formula is as follows:
$\mathrm{sim}{(i,j)}_{\mathrm{new}}=\mathrm{lg}(1+x)\frac{\sum _{u\in {U}_{i,j}}({R}_{ui}-\overline{{R}_{u}})({R}_{uj}-\overline{{R}_{u}})}{\sqrt{\sum _{u\in {U}_{i,j}}{({R}_{ui}-\overline{{R}_{u}})}^{2}\sum _{u\in {U}_{i,j}}{({R}_{uj}-\overline{{R}_{u}})}^{2}}}$(9)
where x represents the number of common users in U_{ij}. When x is very small, the similarity is multiplied by a small weight, so the item is unlikely to be selected into the N_{i} set. In the rare event that it is included, its impact on the accuracy of the P(u,i) calculation will be minimal. Therefore, by multiplying with g(x), the resulting similarity value will be closer to the actual similarity.
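To make the effect of the weight g(x)=lg(1+x) concrete, the sketch below applies it to two already computed similarity values. The function names and the example numbers are illustrative assumptions.

```python
from math import log10

def overlap_weight(num_common_users):
    """g(x) = lg(1 + x) from formula (9); x is the number of users who rated both items."""
    return log10(1 + num_common_users)

def weighted_similarity(raw_similarity, num_common_users):
    """Scale an already computed sim(i, j) by g(x)."""
    return overlap_weight(num_common_users) * raw_similarity

# A pair supported by a single common user (raw similarity 1.0) now ranks
# below a pair supported by nine common users with a raw similarity of 0.8.
single = weighted_similarity(1.0, 1)
broad = weighted_similarity(0.8, 9)
```

One common user gives a weight of about 0.301, while nine common users give a weight of 1, so the second pair keeps its similarity of 0.8 and overtakes the first.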
2) Popular item punishment
In formula (7), if item j is extremely popular and receives ratings from numerous users, then every other item will appear very similar to the popular item. This, in turn, can significantly influence the ratings of less popular items. To mitigate this issue, we apply a penalty to reduce the impact on the similarity calculation. The formulas are as follows:
$\mathrm{P}{\mathrm{P}}_{j}=\frac{{U}_{i,j}}{\sqrt[]{{U}_{i}\times {U}_{j}}}$(10)
$\mathrm{sim}{(i,j)}_{\mathrm{new}}=\mathrm{lg}(1+x)\times \frac{\sum _{u\in {U}_{i,j}}\mathrm{P}{\mathrm{P}}_{j}({R}_{ui}-\overline{{R}_{u}})({R}_{uj}-\overline{{R}_{u}})}{\sqrt{\sum _{u\in {U}_{i,j}}{({R}_{ui}-\overline{{R}_{u}})}^{2}\sum _{u\in {U}_{i,j}}{({R}_{uj}-\overline{{R}_{u}})}^{2}}}$(11)
where PP_{j} is the penalty for popular item j, U_{i,j} is the number of users who have rated both items i and j, U_{i} is the number of users who have rated item i, and U_{j} is the number of users who have rated popular item j. The penalty PP_{j} is applied when calculating the item similarity, as shown in formula (11).
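A minimal sketch of the improved similarity, combining the penalty of formula (10) with the weight lg(1+x) as in formula (11), might look as follows. The function name `improved_similarity`, the nested-dictionary input format, and the decision to return 0 when the similarity is undefined are assumptions of this example.

```python
from math import log10, sqrt

def improved_similarity(ratings, i, j):
    """ratings: dict user -> dict item -> score. Returns sim(i, j)_new as in formula (11)."""
    users_i = {u for u, r in ratings.items() if i in r}
    users_j = {u for u, r in ratings.items() if j in r}
    common = users_i & users_j
    if not common:
        return 0.0  # assumption: no common users means no measurable similarity
    # Formula (10): penalty for popular items.
    penalty = len(common) / sqrt(len(users_i) * len(users_j))
    num = den_i = den_j = 0.0
    for u in common:
        mean = sum(ratings[u].values()) / len(ratings[u])
        di = ratings[u][i] - mean
        dj = ratings[u][j] - mean
        num += penalty * di * dj
        den_i += di * di
        den_j += dj * dj
    if den_i == 0.0 or den_j == 0.0:
        return 0.0  # assumption: an undefined similarity is treated as 0
    # Formula (11): weight by lg(1 + x), where x is the number of common users.
    return log10(1 + len(common)) * num / sqrt(den_i * den_j)
```

When every user has rated both items the penalty equals 1, and it shrinks as item j collects many ratings from users who never rated item i.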
2.2 Hybrid Collaborative Filtering Algorithm Based on ITCSCF and LFM
To fully leverage user personalization and enhance recommendation accuracy, the HCF algorithm proposed in this paper primarily utilizes ITCSCF and LFM to apply linear weighting of scores. Specifically, the clustering technology employed in the ITCSCF algorithm reduces computational complexity and significantly improves the recommendation response rate. Additionally, the enhanced similarity calculation effectively mitigates the impact of data sparsity, resulting in improved recommendation accuracy. To accommodate the diverse preferences of individual users, the algorithm maximizes the potential of ITCSCF personalization and LFM preferences by linearly combining the scores from both algorithms. This approach enhances the accuracy of prediction score calculations and increases the interpretability of recommendations. Moreover, the algorithm takes advantage of Spark's big data distributed computing and storage capabilities to address scalability issues faced by traditional recommendation methods. Building upon clustering, it further enhances data processing capabilities and reduces computing time. The algorithm flow is depicted in Fig. 2.
Fig. 2 A hybrid collaborative filtering algorithm model based on ITCSCF and LFM 
The final recommendation result is the hybrid weighted prediction score based on ITCSCF and ALS in the LFM, and the score expression is shown in equation (12).
${P}_{\mathrm{h}\mathrm{c}\mathrm{f}}=\alpha \times {P}_{\mathrm{a}\mathrm{l}\mathrm{s}}+\beta \times {P}_{\mathrm{i}\mathrm{t}\mathrm{e}\mathrm{m}},\alpha +\beta =\mathrm{1}$(12)
where P_{hcf} represents the prediction score of the hybrid recommendation algorithm, P_{als} represents the prediction score of the ALS algorithm, α is the P_{als} weighted balance factor, P_{item} represents the improved ITCSCF algorithm prediction score, and β is the P_{item} weighted balance factor.
With regard to the weight values, cross-validation is employed to train the model and capture variations among the sub-algorithms in diverse environments. This approach dynamically adjusts the weights and employs iterative weighted regression calculations, where the data to be predicted is sequentially fed into the regression model during each iteration. The label value is continuously updated until the predefined target value is achieved^{[15]}.
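The weighted fusion of equation (12) can be sketched as below. Note that the weight selection shown here is a plain grid search over α on held-out ratings, used as a simple stand-in for the iterative weighted regression described above, and the names `hybrid_score`, `best_alpha`, and `steps` are hypothetical.

```python
from math import sqrt

def hybrid_score(p_als, p_item, alpha):
    """Equation (12) with beta = 1 - alpha."""
    return alpha * p_als + (1.0 - alpha) * p_item

def rmse(predicted, actual):
    """Root mean square error between two equally long score lists."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def best_alpha(p_als, p_item, actual, steps=10):
    """Grid search for alpha in [0, 1] (assumption: a stand-in for the
    paper's regression-based weight adjustment)."""
    candidates = [k / steps for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda a: rmse([hybrid_score(x, y, a) for x, y in zip(p_als, p_item)], actual),
    )
```

Running the search separately on dense and sparse splits of the data would show how the preferred α shifts between the two sub-algorithms.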
3 Spark Big Data Distributed Implementation
3.1 Spark Big Data Distributed Computing Platform
Spark^{[16]} is a parallel distributed computing framework based on Resilient Distributed Datasets (RDD). Its most prominent feature is the in-memory computing capability of RDD, which stores intermediate data generated during the calculation process in memory, effectively reducing disk I/O overhead. As a result, Spark is particularly well-suited for iterative data processing, and its distributed approach represents the optimal means to enhance system scalability. The Spark framework structure is illustrated in Fig. 3.
Fig. 3 Spark framework 
The Driver Program serves as a task control node responsible for processing user code logic. The Cluster Manager is in charge of managing cluster resources, scheduling tasks and monitoring their progress. The Worker Node functions as a computing node within the cluster, capable of executing tasks concurrently using multiple threads, which constitutes one of its key advantages. The Executor is tasked with executing individual tasks, while the Task represents a computing subunit within a node. Upon task execution, the Driver requests resources from the Cluster Manager and delegates them to the Executor for processing. Once the execution is completed, the resulting data is returned to the Driver.
3.2 Algorithm Distributed Implementation
Different from the traditional recommendation algorithms, the HCF algorithm adopts the Spark big data distributed framework, deploying computing and data storage across multiple servers. It leverages Spark's in-memory computing and distributed multithreading capabilities, which enable maximum computational efficiency for both model training and offline computing. This framework offers two key advantages: 1) All operations are based on RDD in-memory computing, making it highly suitable for a large number of iterative recommendation algorithms and significantly accelerating computing efficiency; 2) The Spark framework supports various distributed structures, including concurrent computing, real-time stream processing, and data storage, effectively enhancing the scalability of recommender systems. The distributed flow of the algorithm is illustrated in Fig. 4.
Fig. 4 Distributed computing process of HCF algorithm on Spark platform 
Initially, data is gathered from distributed data sources, and items undergo pre-classification through clustering, thereby reducing the computational requirements of the nearest neighbor set. Next, by taking into account the popularity of items and data sparsity, an enhanced cosine similarity measure denoted as sim is employed to determine the similarity between items. Various sub-algorithms are then distributed to compute the predicted scores, resulting in individual temporary score lists. Ultimately, weight parameters are calculated through iterative regression to obtain the optimal recommendation list, which is subsequently stored in the distributed database.
4 Experiments and Results Analysis
4.1 Experimental Dataset
The experiment utilizes the MovieLens^{[17]} public dataset provided by GroupLens to evaluate the algorithm. The dataset comprises three main scales: 100K, 1M, and 10M, representing 100 000, 1 000 000, and 10 000 000 ratings, respectively. The dataset contains user information, movie information, and user ratings for each movie item. The ratings range from 1 to 5, where higher values indicate a greater level of user interest in the respective item.
4.2 Experimental Environment
The experimental system is deployed on a Spark distributed cluster of 5 machines, of which 1 is the master node and the other 4 are slave nodes. The hardware configuration is as follows: CentOS 7.6; Intel Xeon E5-2650 v3 CPU at 2.3 GHz; 32 GB memory. The experimental software versions are as follows: Java JDK 1.8.0.4; Spark 2.1; ZooKeeper 3.4 (cluster management).
4.3 Evaluation Standard
1) RMSE
To assess the accuracy of the algorithm, the RMSE is utilized for evaluation. The concept involves calculating the difference between the predicted score generated by the algorithm and the actual score. The smaller the deviation value, the more accurate the recommendation. Here, $\tilde{R}$ represents the predicted value, and R denotes the actual value. The formula is as follows:
$\mathrm{RMSE}=\sqrt{\frac{1}{N}{\displaystyle \sum _{i=1}^{N}}{(R-\tilde{R})}^{2}}$(13)
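As a small worked illustration of formula (13), the function below computes the RMSE for two equally long lists of actual and predicted scores. The function name and the input validation are choices made for this example.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean square error between actual scores R and predicted scores."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("both lists must be non-empty and of equal length")
    squared = sum((r - p) ** 2 for r, p in zip(actual, predicted))
    return sqrt(squared / len(actual))
```

For example, if every prediction is off by exactly one point, as with actual scores [1, 2, 3] and predictions [2, 3, 4], the RMSE is 1.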
2) Recall and precision
The evaluation of algorithm recommendation quality and personalization is typically done using recall and precision. The recall rate measures the proportion of the user's interested items in the recommendation list out of all the user's interested items in the system. The precision rate, on the other hand, measures the proportion of items in the recommendation list that users are interested in, out of all the items recommended to users. It is worth noting that recall rate and precision rate are often inversely correlated. The formulas are as follows:
${R}_{\mathrm{recall}}=\frac{\sum _{u\in U}|R(u)\bigcap T(u)|}{\sum _{u\in U}|T(u)|},\quad {P}_{\mathrm{precision}}=\frac{\sum _{u\in U}|R(u)\bigcap T(u)|}{\sum _{u\in U}|R(u)|}$(14)
Among them, R_{recall} is the recall rate, P_{precision} is the precision rate, R(u) is the list of items recommended to user u based on the user's evaluations and interactions in the training set, and T(u) is the list of items that the user has evaluated and interacted with in the test set.
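The two ratios in formula (14) can be computed with a few lines of Python. The sketch assumes that R(u) and T(u) are given as sets per user, and the function name `recall_precision` is hypothetical.

```python
def recall_precision(recommended, relevant):
    """recommended: dict user -> set R(u); relevant: dict user -> set T(u).
    Returns (recall, precision) aggregated over all users."""
    hits = sum(len(recommended.get(u, set()) & t) for u, t in relevant.items())
    total_relevant = sum(len(t) for t in relevant.values())
    total_recommended = sum(len(r) for r in recommended.values())
    recall = hits / total_relevant if total_relevant else 0.0
    precision = hits / total_recommended if total_recommended else 0.0
    return recall, precision
```

With 3 hits among 4 test-set items and 5 recommended items, the recall is 0.75 and the precision is 0.6.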
4.4 Experimental Results and Analysis
To assess the performance of the HCF algorithm, we will conduct experiments on the algorithm from various perspectives.
Experiment 1: The impact of the weighted balance factor α at different data sparsity levels. The value of the weighted balance factor α plays a crucial role in achieving the best comprehensive prediction score for the HCF algorithm. In this experiment, the 1M dataset is taken as a sample, and three levels of data sparsity are selected for both training and testing. The ratios of the training set to the test set are 9:1, 7:3, and 5:5, respectively. By controlling other variables, we compare the effects of different α values on the RMSE value, and the results are depicted in Fig. 5.
Fig. 5 Effect of different data sparsity balance factor on RMSE 
The experimental results indicate that a significant enhancement in the recommendation accuracy of the HCF algorithm can be achieved by increasing the P_{als} scoring weight of the LFM when the density of scoring data is high. Similarly, in sparse data scenarios, elevating the P_{item} score weight of the ITCSCF algorithm can effectively address accuracy issues.
Experiment 2: A comparison between the HCF algorithm and traditional algorithms. To assess the recommendation accuracy of the HCF algorithm, we compared the RMSE values of its predicted scores with those of several other classical algorithms. During the experiment, we utilized the optimal parameters for each algorithm within this specific environment. For testing purposes, we split the 1M data sample into training and test sets at various proportions, resulting in different levels of data sparsity. The outcomes of this comparison are depicted in Fig. 6.
Fig. 6 Comparison of root mean square error between HCF algorithm and traditional algorithm 
The experimental results indicate that the enhanced HCF algorithm outperforms the Item CF and LFM algorithms by 8.7% and 7.6% in terms of RMSE values when considering a 5:5 ratio and sparse data. Moreover, at a 9:1 ratio, the HCF algorithm exhibits an improvement of 6.3% and 2.8% over the RMSE values of the other two algorithms. A lower RMSE value corresponds to higher recommendation accuracy, and the HCF algorithm consistently achieves a lower RMSE value across various data volume ratios, highlighting its superior performance in both recommendation accuracy and data sparsity.
Experiment 3: Variations in the recall rate and accuracy rate (precision) of the HCF algorithm's recommendations under different data scales. The results are depicted in Fig. 7.
Fig. 7 Changes in precision and recall 
The experimental results indicate that as the data scale increases, the accuracy of recommendations improves, but the recall rate decreases. This observation suggests that, with other conditions unchanged, a higher density of high-rated data corresponds to greater user interest in popular items, resulting in more accurate recommendations but lower item novelty. Accuracy and recall are thus inversely related here.
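The two measures can be computed as in the following minimal sketch, which assumes that "accuracy rate" here means precision (as the caption of Fig. 7 suggests) and uses the usual top-N definitions. The function name and the toy recommendation lists are hypothetical.

```python
def precision_recall(recommended, relevant):
    """Compute overall precision and recall for top-N recommendation lists.

    recommended: dict mapping user -> list of recommended item ids
    relevant:    dict mapping user -> set of item ids the user liked in the test set
    Only users that received recommendations are counted.
    """
    hits = 0
    n_recommended = 0
    n_relevant = 0
    for user, items in recommended.items():
        liked = relevant.get(user, set())
        hits += len(set(items) & liked)   # recommended items the user actually liked
        n_recommended += len(items)
        n_relevant += len(liked)
    precision = hits / n_recommended if n_recommended else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    return precision, recall

# Hypothetical toy data, for illustration only.
recommended = {"u1": [1, 2, 3], "u2": [4, 5, 6]}
relevant = {"u1": {1, 3, 9, 10}, "u2": {4, 7, 8, 11}}
precision, recall = precision_recall(recommended, relevant)
print(precision, recall)
```

Precision divides the hits by the number of recommended items, while recall divides them by the number of items the users actually liked, so the two can move in opposite directions when the set of liked items grows faster than the hits.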
Experiment 4: Algorithm execution time test. The enhanced HCF algorithm is deployed on 5 Spark clusters and 1 standalone machine, respectively, and their running times are compared. The ratio of the training set to the test set is set at 9:1, and the data scale is gradually increased for testing purposes. The results are presented in Fig. 8.
Fig. 8 Comparison of computing time between single machine and Spark cluster under different data scales 
As shown in Fig. 8, running the algorithm on the Spark cluster is significantly more efficient than standalone operation, and the gap in running time grows as the data scale increases. There are two main reasons: 1) The clustering step in the early stage reduces the time complexity from O(n*m*m) to O(m*k*t), thus saving iteration time; 2) Communication and data transmission between Spark cluster nodes consume running time, so the difference is negligible when the data scale is small, whereas on large-scale data the parallel computation significantly improves computing efficiency. Moreover, combining the algorithm with the Spark cluster greatly enhances its scalability, which surpasses that of the traditional algorithm.
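The size of the reduction in reason 1) can be gauged by taking the ratio of the two complexity expressions, treating n, m, k, and t simply as the symbols given in the text and ignoring constant factors. The numeric values below are hypothetical and serve only to show the arithmetic.

```latex
\frac{n \cdot m \cdot m}{m \cdot k \cdot t} = \frac{n \cdot m}{k \cdot t}

% Hypothetical values: n = 1000, m = 500, k = 10, t = 20
\frac{1000 \times 500}{10 \times 20} = \frac{500\,000}{200} = 2500
```

Under these assumed values the clustered computation would need on the order of 2500 times fewer basic operations, which is consistent with the saving growing as n and m increase while k and t stay small.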
5 Conclusion
An improved HCF algorithm has been proposed to address the issues of data sparsity, low accuracy, and poor scalability faced by traditional collaborative filtering algorithms. The approach can be summarized in three key aspects: 1) Building upon the traditional Item CF algorithm, we cluster items by their attributes, which significantly reduces the calculation time of the nearest neighbor set; 2) Taking the popularity of items (hot versus cold) into account, we improve the score similarity calculation within the Item CF algorithm, which mitigates the impact of data sparsity and brings the similarity between items closer to its actual value; 3) We combine the newly devised ITCSCF algorithm and the LFM through linear weighting, leveraging their individual strengths in personalization and potential preferences, and dynamically adjust the balance factor to obtain the optimal recommendation list. These improvements collectively contribute to tackling data sparsity, enhancing accuracy, and boosting the scalability of the collaborative filtering process.
The experimental results demonstrate that the enhanced hybrid collaborative filtering algorithm achieves higher recommendation accuracy, improved personalization, and better computational efficiency than the conventional Item CF algorithm and the LFM under identical conditions. Moreover, by leveraging the Spark distributed platform and item clustering, the algorithm's handling of data sparsity and its scalability are noticeably improved. Nevertheless, the clustering parameters and the weighted balance factor are crucial factors influencing the recommendation accuracy. For this reason, these parameters will be subjected to thorough testing and fine-tuning in future experiments and research.