Web News Media Retrieval Analysis Integrating with Knowledge Recognition of Semantic Grouping Vector Space Model

Traditional Web news media retrieval technology can only meet the specific


Introduction
Early search engines generally used content-based search methods, which were developed from the theory and technology of traditional information retrieval. Their primary consideration is the relationship between web pages and search terms, together with the frequency and location of the query terms in the documents. This approach improves search quality and accuracy to a certain extent, but it is based on keyword queries. Synonyms and polysemous words in natural language cannot be handled, so the retrieval rate is not high and the approach has clear limitations [1][2]. In recent years, the rapid progress of information technology has ushered in the digital age, and paper files are gradually being replaced by digital and electronic files. How can the information a user needs be queried quickly and accurately in a Web news media database? The significant problem at present is how to match the supply of archive information with the demand for it. Web news media retrieval is a helpful solution. By using a personalized search system that draws on information shared among users, new calculation methods for the semantic grouping vector space model can be developed.
The user type is the most important factor, but user participation is required because of the calculation method, and collaborative personalized search methods based on the semantic grouping vector space model have their own characteristics. With the continuous development of the multimedia entertainment field, Web news media has gradually become part of people's lives. However, adapting large-capacity Web news media data to the environment and to different user characteristics, and recommending Web news media suitable for each user, remains a challenge. This paper presents the knowledge recognition of the semantic grouping vector space model in the Web news media retrieval process. It conducts Web news media retrieval accurately by using the deep belief network algorithm in the audio segmentation stage of the Web news media. Finally, the experimental results show that the proposed method can quickly retrieve the corresponding type of Web news media according to the user's preferences. One related study emphasized the use of data mining tools to examine event representations, integrating case studies and user evaluation procedures.

Factors including data velocity and volume, the study's dependence on data quality, and the extent of user examination were among its constraints. Another study focused on the deployment of the Hermes Framework for Personalized News in the Hermes News Portal. Three main approaches were examined in that study: the implementation of the Hermes News Portal, semantic query languages, and semantic text analysis. Its limitations were its complexity and learning curve, as well as its dependence on the quality of the ontology. A further study concentrated on the challenges associated with traditional search engines and the integration of NLP. It used Web crawling and NLP techniques to gather and preprocess data, followed by the application of sentiment analysis algorithms.
To overcome these challenges, we propose a vector space model that focuses on semantic grouping based on feature words. The paper aims to organize news information into different categories based on the meaning of specific words.

Semantic grouping vector space model
The knowledge recognition process based on the semantic grouping vector space model mainly addresses the problems that occur during database retrieval. To solve these problems completely, the knowledge algorithm of the semantic grouping vector space model is used to optimize the parameters of the model. This paper adopts a weight-sharing strategy. Weight sharing makes the semantic grouping vector space model differ from the conventional vector model in its connection mode and parameter-sharing mode: the model can be locally connected, and data information can be shared. Weight sharing mainly refers to collecting parameter data from multiple nodes in the hierarchical process. The feasibility of sharing data parameters is primarily related to the various goals of the calculation process. Unlike the traditional method, the semantic grouping vector space model takes the initial feature values of the collected input signal and learns the input signal through hierarchical retrieval [3]. These features usually include time-domain features such as the average amplitude difference. This article uses the short-time energy, among others, as the initial input feature of the model. The Web news media search signal can be represented by x(n), from which the short-time energy is computed.
However, the high dimensionality of the time-domain features at initialization causes a lot of interference and noise. Therefore, the dimensionality of the input search signal needs to be reduced. Principal component analysis is used to compute statistics over the multiple variables under investigation, and the internal structure among the variables can be analyzed by studying several principal components. After the Web news media data has been reduced in dimensionality, the input data information can be retrieved. In deep learning, the semantic grouping vector space model can then perform data processing on the output data. The Web news media retrieval structure at this stage is shown in Figure 1.
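As an illustration of the short-time energy feature mentioned above, the following minimal sketch computes the energy of each frame of a signal x(n). The rectangular window, the frame length, and the hop size are assumptions made for the example; the paper's exact formula is not reproduced in this text.

```python
def short_time_energy(x, frame_len, hop):
    """Return the energy (sum of squared samples) of each full frame of x.

    Assumption: a rectangular window of frame_len samples, advanced by hop samples.
    """
    energies = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energies.append(sum(sample * sample for sample in frame))
    return energies


# Hypothetical toy signal, for illustration only.
signal = [0.0, 1.0, -1.0, 2.0, 0.0, 0.0, 3.0, 1.0]
print(short_time_energy(signal, frame_len=4, hop=2))
```

The resulting per-frame energies could then serve as one of the initial input features before dimensionality reduction.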

W. Xiong
Figure 1: Semantic grouping vector space structural diagram.

Deep belief network algorithm in the audio segmentation stage
The Deep Belief Network (DBN) algorithm, utilized in audio segmentation, employs a hierarchical, generative model consisting of multiple layers of stochastic, latent variables. Initially trained layer by layer, the DBN captures intricate patterns in audio data through unsupervised learning. The top layer of the network functions as a discriminative model, facilitating the identification of relevant features for segmentation. Through iterative fine-tuning of weights during training, the DBN learns hierarchical representations, enabling the extraction of meaningful audio features. This hierarchical approach enhances the algorithm's efficiency in discriminating between various audio segments, thereby improving its ability to segment and classify different components within the audio signal accurately.

Knowledge recognition algorithm of semantic grouping vector space model
A news report contains four elements: time, place, person, and event. Therefore, for Web news information, feature words are first distinguished according to these four elements, and the associated semantic groups are defined to form four vectors. Each feature word is then assigned to the vector space it belongs to, an inverted index is established for each vector space, and the weight and similarity of the feature words are calculated for each semantic group. Finally, the weighted sum of the similarities is obtained, and the search results whose value exceeds a specific threshold are sorted using link analysis [4][5].
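The grouping and indexing step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the feature words of each document have already been assigned to the four groups (for example by a named-entity recognizer), and the names `build_group_indexes` and `candidates` are hypothetical.

```python
from collections import defaultdict

GROUPS = ("time", "place", "person", "event")


def build_group_indexes(docs):
    """Build one inverted index per semantic group.

    docs maps a document id to {group: [feature words]}.
    Returns {group: {word: set of document ids}}.
    """
    indexes = {group: defaultdict(set) for group in GROUPS}
    for doc_id, grouped_words in docs.items():
        for group in GROUPS:
            for word in grouped_words.get(group, []):
                indexes[group][word].add(doc_id)
    return indexes


def candidates(indexes, query):
    """Return ids of documents sharing at least one feature word with the query."""
    hits = set()
    for group in GROUPS:
        for word in query.get(group, []):
            hits |= indexes[group].get(word, set())
    return hits


# Hypothetical toy documents, for illustration only.
docs = {
    "d1": {"time": ["2007"], "place": ["beijing"], "event": ["summit"]},
    "d2": {"time": ["2007"], "place": ["shanghai"], "event": ["expo"]},
}
indexes = build_group_indexes(docs)
print(sorted(candidates(indexes, {"place": ["beijing"], "event": ["expo"]})))
```

Keeping a separate index per group means a word such as a place name only matches documents in which it plays the same role.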

Weight and similarity of feature word
In Web news information, feature words appear at different positions in a document, and the positions differ in how strongly they express the document's content. The feature word level describes this characteristic. If the feature word T appears n times in the document, then in formula (2) the level score T_k is the level of the k-th appearance of T in the document.
The weight of the feature word is given by formula (3). When calculating the weighted similarity between the query Q and a specific news feature word group D, the conventional VSM similarity incurs a large amount of calculation and time overhead, so the ratio of the weight of the part shared by Q and D to the sum of the weights of Q and D is used instead, as in formula (4).
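One possible reading of the overlap-based similarity is sketched below. Because formula (4) is not reproduced in this text, the exact form is an assumption: the weights of the shared feature words (counted on both sides) are divided by the total weight of Q and D, so identical groups score 1 and disjoint groups score 0.

```python
def overlap_similarity(q_weights, d_weights):
    """Ratio of the weight of shared feature words to the total weight of Q and D.

    q_weights and d_weights map feature words to non-negative weights.
    Assumption: shared words contribute their weight from both Q and D.
    """
    shared = set(q_weights) & set(d_weights)
    cross = sum(q_weights[t] + d_weights[t] for t in shared)
    total = sum(q_weights.values()) + sum(d_weights.values())
    return cross / total if total else 0.0


# Hypothetical weights, for illustration only.
query = {"beijing": 1.0, "summit": 1.0}
document = {"beijing": 1.0, "trade": 1.0}
print(overlap_similarity(query, document))
```

Unlike cosine similarity, this ratio needs no square roots or vector norms, which matches the stated motivation of reducing calculation overhead.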

Semantic grouping vector space calculation method
For the vector space model, the conventional method of semantic grouping vector space calculation is to compute the cosine similarity between vectors, and the semantic grouping vector space similarity of user u and Web news media d is defined in this way. For the probability model, however, the cosine similarity of vectors cannot be calculated directly [6][7]. The following proposition is proposed to express the diversity of user interests. With it, formula (7) can be transformed so that the semantic grouping vector space problem of the probability model becomes one of finding a conditional probability, which captures the diversity of user interests.
The adopted system records the user's search history and clicks and continuously collects the data sources for the user's behavior model. The system completes this operation automatically, and the user experience is not disturbed. First, the user's historical search information in the browser is saved to learn the user's interests; the user's interest in the search information is then refined from the user's operations on the search results. Time stamps are added to the data of interest so that the points of interest the user cares about most can be updated. The design process of the user interest model is shown in Figure 2.
In the keyword weighting, N is the number of generated texts, and n is the number of texts containing the keyword k_i. The method of calculating the weight of keyword k_i can be adjusted accordingly.
When the data of the model is compared with the document vector, the angle θ between them is inversely proportional to the degree of user concern. The smaller θ is, the higher the correlation between the file and the user's interests and preferences.
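The relationship between the angle θ and user interest can be illustrated with a short sketch. It assumes sparse keyword-weight dictionaries for the user interest model and the document, and it uses the standard log(N/n) inverse document frequency as a stand-in for the keyword weighting, whose exact formula is not reproduced in this text.

```python
import math


def idf(n_total, n_containing):
    """Standard inverse document frequency log(N / n); an assumed stand-in."""
    return math.log(n_total / n_containing)


def angle(user_model, document):
    """Angle in radians between two sparse keyword-weight vectors."""
    keys = set(user_model) | set(document)
    dot = sum(user_model.get(k, 0.0) * document.get(k, 0.0) for k in keys)
    norm_u = math.sqrt(sum(v * v for v in user_model.values()))
    norm_d = math.sqrt(sum(v * v for v in document.values()))
    if norm_u == 0.0 or norm_d == 0.0:
        return math.pi / 2  # treat an empty vector as unrelated
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_d)))
    return math.acos(cosine)


# Hypothetical weights, for illustration only.
user = {"football": 2.0, "election": 1.0}
doc_sports = {"football": 1.0}
doc_finance = {"stocks": 1.0}
print(angle(user, doc_sports) < angle(user, doc_finance))
```

A document whose keywords overlap the user model yields a smaller θ and would therefore be ranked as more relevant.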

Web news media retrieval analysis
In formula (9), p(u) does not affect the results of the recommendation probability. Based on this, the retrieval calculation for multimedia digital archive users is described in detail as follows.
Algorithm 1. Web news media retrieval analysis algorithm based on knowledge recognition of semantic grouping vector space model.
Input: domain classification model, user interest model, retrieval keywords, search engine.
Output: Web news media search results for multimedia digital archive users.
(1) According to the search keywords, the search engine is used to generate a preliminary search result set X.
(2) Set the number of iterations i=0.
(3) For the i-th Web news media in the set X, formula (1) is used to calculate its probability distribution in the domain classification model.
(4) Equation (9) is used to calculate the probability that the multimedia digital file i is recommended to the current user, and the result is added to the list Y.
(5) If the multimedia digital file i is the last multimedia digital file in the set X, go to (6); otherwise, set i=i+1 and return to (3).
(6) Sort and output the multimedia digital files according to the probability in the list Y in descending order.
Because the algorithm is actually based on another search engine, the probability distribution in the domain classification model must be calculated for each multimedia digital archive in the search results. This has a great impact on the performance of the algorithm. If the search engine calculates the probability distribution in the domain classification model of each Web news media in advance, the performance of the algorithm will be significantly improved to meet the needs of real-time processing.
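The steps of Algorithm 1 can be sketched as a re-ranking loop. This is a minimal illustration under stated assumptions: `classify` is a hypothetical callable returning a category distribution for a document, `user_interest` is a hypothetical category distribution for the user, and the score (the sum over categories of the product of the two probabilities) is only a stand-in for Equation (9), which is not reproduced in this text.

```python
def rank_results(results, classify, user_interest):
    """Re-rank a preliminary result set X by recommendation score.

    results: list of document ids returned by the search engine (set X).
    classify: callable doc_id -> {category: probability} (step 3).
    user_interest: {category: probability} for the current user.
    Returns list Y of (doc_id, score) pairs sorted by score, highest first.
    """
    scored = []
    for doc_id in results:  # steps (2), (3), (5): iterate over X
        distribution = classify(doc_id)
        # Step (4): assumed stand-in for Equation (9).
        score = sum(p * user_interest.get(category, 0.0)
                    for category, p in distribution.items())
        scored.append((doc_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)  # step (6)
    return scored


# Hypothetical precomputed distributions, for illustration only.
precomputed = {
    "n1": {"sports": 0.9, "politics": 0.1},
    "n2": {"sports": 0.2, "politics": 0.8},
}
user = {"sports": 0.25, "politics": 0.75}
ranked = rank_results(["n1", "n2"], precomputed.get, user)
print([doc_id for doc_id, _ in ranked])
```

Passing a lookup into a precomputed table as `classify`, as in the example, corresponds to the case where the search engine calculates the distributions in advance.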

Web news media retrieval analysis process

Web news media retrieval
The experimental system consists of four parts: the browser plug-in, the personal manager, the user model learner, and the personalized information searcher, as shown in Figure 3. The browser plug-in provides users with a convenient tool. After the user registers and logs in, the browser plug-in can be used to complete the Web news media retrieval of multimedia digital archives without logging in to the server. In addition, the browser plug-in collects the user's personal information and transmits it to the server. The personal manager is used to manage the user's personal information, hobbies, and bookmarks. The purpose of tracking user behavior is to learn user interests. The personalized information searcher, which acts as the Web news media retriever, completes the user's query and recommendation over the multimedia digital archives using the semantic grouping vector space model.
Our system can track user behavior; it is distributed across the client and the server and does not affect the user's reading or system performance.

News page judgment and related information extraction
Through retrieval and learning, the link weights and node biases of each layer are obtained, and the network initialization is completed. The backpropagation (BP) algorithm is then adopted to fine-tune the deep belief network model under top-down supervision, which overcomes the shortcomings of local optima and long search time. Although the deep belief network model shows strong feature learning ability, the above principles imply that Internet search requires a large amount of sample data to fit the many parameter values. For the Web news media recommendation search problem, however, a large amount of sample data is lacking, and generating a large number of parameters takes a long time, which is unfavorable for practical applications. During the calculation, a feature vector is used to represent the web page. The keyword weight W is determined by the TF*IDF method, and if a term item is determined to be a named entity, its weight is appropriately enhanced by the weighting factor α, which is 5 in this experiment.
Finally, the m term items with the largest weights are selected to generate the web page feature vector. The number of term items shared by two web page feature vectors is used as the basis for judging similarity: if the number of shared terms is greater than the threshold, the two web pages are similar. After the reprinting or similarity relationship is determined, the relevant information is extracted and recorded. The main information recorded is the reprinting website, the source site of the reprint, the number of responses on the reprinting website, and the time of the news release. The reprinting website and the source site recorded here only document the reprinting relationship; they are not the final, accurate source site and reprinting website.
The final source site will be determined in the next step.
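The page-similarity test described above can be sketched as follows. The sketch assumes TF*IDF weights and the set of named entities are already available, and it assumes the enhancement is multiplicative (weight times α); the text only states that the weight is enhanced and that α is 5. The function names are hypothetical.

```python
ALPHA = 5  # weighting factor stated in the text


def top_terms(tfidf, named_entities, m, alpha=ALPHA):
    """Return the m terms with the largest (possibly boosted) weights.

    Assumption: named entities have their TF*IDF weight multiplied by alpha.
    Ties are broken alphabetically so the result is deterministic.
    """
    boosted = {term: (weight * alpha if term in named_entities else weight)
               for term, weight in tfidf.items()}
    ranked = sorted(boosted, key=lambda term: (-boosted[term], term))
    return set(ranked[:m])


def is_similar(terms_a, terms_b, threshold):
    """Two pages are similar if they share more than threshold terms."""
    return len(terms_a & terms_b) > threshold


# Hypothetical weights and entities, for illustration only.
entities = {"beijing", "olympics"}
page_a = top_terms({"olympics": 0.2, "beijing": 0.1, "ticket": 0.3, "the": 0.05},
                   entities, m=3)
page_b = top_terms({"olympics": 0.1, "beijing": 0.2, "hotel": 0.4, "price": 0.3},
                   entities, m=3)
print(is_similar(page_a, page_b, threshold=1))
```

Counting shared terms between two small sets is much cheaper than a full vector comparison, which suits hourly crawling of many pages.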

Judgment of news reprinting relationship and calculation of authority of news source sites
The attributes A(pt) of each node are updated iteratively according to the above formula.
The extracted reprinting information is first used to derive the relationships between news reprinting sites, including both direct and indirect reprinting, and to calculate the authority value of each reprinting site. The website that is reprinted most often is taken as the source site, and its authority value is treated as the reprint rate of the news.
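The iterative update of node attributes can be illustrated with a HITS-style sketch. The paper's own update formula is not reproduced in this text, so the rule below is an assumption: each site has a content quality value and a reprint value, both initialized to 1, updated from the reprinting edges, and normalized to sum to 1 after every iteration.

```python
def normalize(values):
    """Scale the values so that they sum to 1 (left unchanged if all are zero)."""
    total = sum(values.values())
    return {k: v / total for k, v in values.items()} if total else values


def reprint_authority(edges, iterations=20):
    """HITS-style iteration over reprinting edges (an assumed update rule).

    edges: list of (reprinting_site, reprinted_site) pairs.
    Returns (quality, reprint) dictionaries, each summing to 1.
    """
    nodes = sorted({site for edge in edges for site in edge})
    quality = {n: 1.0 for n in nodes}
    reprint = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_quality = {n: 0.0 for n in nodes}
        for reprinter, source in edges:
            new_quality[source] += reprint[reprinter]
        new_reprint = {n: 0.0 for n in nodes}
        for reprinter, source in edges:
            new_reprint[reprinter] += new_quality[source]
        quality = normalize(new_quality)
        reprint = normalize(new_reprint)
    return quality, reprint


# Hypothetical reprinting graph: b and c reprint a, and c also reprints b.
quality, reprint = reprint_authority([("b", "a"), ("c", "a"), ("c", "b")])
print(max(quality, key=quality.get))
```

In this toy graph the most reprinted site receives the highest content quality value, which is consistent with taking it as the source site.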

Calculation of news response rate
The response rate (denoted as Rep) directly reflects people's reaction to Web news. It is usually defined as

$$\mathrm{Rep} = \frac{\text{number of responses}}{\text{number of clicks}}$$

Observation shows that most news pages only provide the number of responses rather than the number of clicks or views. The number of clicks or views is stored on the server side and cannot be obtained through simple crawling and information extraction. Based on a large number of observations, a relative response rate is derived from the number of news responses, and this ratio is used as the news response rate. Here, the number of responses is the sum of the number of responses on the source site and on the reprinting websites [8][9]. Figure 4 shows the distribution of the number of news responses. It can be seen from Figure 4 that most news items receive fewer than 1,000 responses, and very few receive more than 3,000. According to the statistical pattern in the figure, the relative response rate values are shown in Table 1. For example, a number of responses in the range 0~500 means that the number of people who responded to the event lies in that range, and the corresponding relative response rate means that the respondents are considered to account for 5% of the viewers. If there are more than 5,000 respondents, essentially everyone who has read the report is considered to have responded, and the relative response rate is 100%.

The influence of time factors on news ranking
There are usually two trends in people's interest in news, as shown in Figure 5. Attention here is measured by the number of news viewers per unit of time. The first is the slow-growing type, typical of knowledge-oriented news such as national policy news. News in these categories is not strongly time-sensitive, and people's attention increases slowly over time. The other is the type that grows rapidly and then declines. It mainly applies to news on current events, which is very time-sensitive: people's attention increases rapidly in a short period and then drops quickly [10][11][12]. Therefore, news must first be classified before it is sorted, taking into account the influence of time. From this perspective, the importance of news is inversely proportional to the time since publication. In addition, the longer a news item has been published, the higher the probability of reprinting and replying and the greater the number of responses and reprints. If the time factor is not taken into consideration, this is unfair to newly published news. Therefore, the time factor is included among the parameters that affect the importance of news, so that for reports published long ago, the contribution of the number of responses and reprints is reduced. Summarizing the above two points, and drawing on the definition of the news decay time parameter in the literature [4], the time parameter is defined in terms of the publication time of the news and a parameter α.
The value of α depends on the recession time of the news category to which the news belongs. Recession time refers to the time from the release of the news until almost no one pays attention to it. In the relationship between news category and recession time, β is the recession time of current affairs news and γ is the recession time of non-current affairs news.

Judgment of news influence
Through the above steps, the news reprint rate, the news response rate, the influence factor of the news source site (W_S), and the time parameter of the news release can be obtained. Reprinting and replying to news are considered to be recognition of the news, so the Web news recognition rate (denoted as Rec) is defined as

$$\mathrm{Rec} = a \times \text{reprint rate} + b \times \text{response rate}$$

To ensure that the recognition rate does not exceed 1, the relationship between a and b is defined as a + b = 1.
Since there is no suitable corpus, the values of a and b cannot be obtained by training, so they are set according to the 80/20 rule.
Many people may read a news item, but few respond to it, and even fewer repost it. Therefore, the reprint rate is considered to better reflect the influence of news. Experiments show that this definition is feasible [13][14].
Finally, the above information is combined to define the influence of news (NF).
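The recognition rate can be sketched in a few lines. The default a = 0.8 (and hence b = 0.2) is an assumption drawn from the 80/20 rule and the statement that the reprint rate better reflects influence; the text does not state the values explicitly.

```python
def recognition_rate(reprint_rate, response_rate, a=0.8):
    """Rec = a * reprint rate + b * response rate, with b = 1 - a.

    Assumption: a = 0.8 and b = 0.2, following the 80/20 rule.
    Both rates are expected to lie in [0, 1], so Rec also lies in [0, 1].
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1] so that a + b = 1 with b >= 0")
    b = 1.0 - a
    return a * reprint_rate + b * response_rate


# Hypothetical rates, for illustration only.
print(recognition_rate(reprint_rate=0.6, response_rate=0.05))
```

Constraining a + b = 1 makes Rec a weighted average of the two rates, so it cannot exceed the larger of them.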

Experiment results and analysis
Precision and recall are generally used as benchmarks for the performance evaluation of an information retrieval system, and the comprehensive evaluation rate F can also be used for evaluation:

$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{25}$$
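As a worked illustration of the comprehensive evaluation rate F, the following minimal sketch computes precision, recall, and F from a set of retrieved documents and a set of relevant documents. The document ids are hypothetical.

```python
def f_measure(retrieved, relevant):
    """Return F = 2 * precision * recall / (precision + recall)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0  # avoids division by zero when nothing relevant is retrieved
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ids: 4 documents retrieved, 3 relevant, 2 in common.
print(f_measure(retrieved=[1, 2, 3, 4], relevant=[2, 3, 5]))
```

Here precision is 2/4 and recall is 2/3, so F equals 4/7.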

Experimental data set
The top nine news information websites in the Chinese website rankings were tracked for a week. Because many features in the algorithm research were derived from summaries of Chinese web pages, Chinese web pages were used as the experimental subjects. The news on the homepages of these nine websites was captured every hour. The list of captured experimental data is shown in Figure 6.

Experimental result
The recommended task is executed using Python 3.5. NumPy, pandas, scikit-learn, the Natural Language Toolkit (NLTK), and Matplotlib must be installed alongside Python to carry out the procedure.
(1) News influence ranking for the week of September 10 to 16, 2007
Figure 8 shows the top 10 news items and their influence values in the week of September 10 to 16, 2007. Here, the recession time of current affairs news is defined as 72 hours instead of 120 hours. Figure 9 shows the distribution of news influence values within the week.
(2) Designated news influence ranking
The algorithm is also suitable for sorting designated news. Given some news items on different topics, the topics are used as keywords to search for relevant news pages with popular search engines, and the top 50 search results are selected for statistical calculation. The top 50 are chosen because they include basically all reprinted pages and almost all similar pages, as well as the top 50 news information websites in the Chinese website rankings, which is sufficient to determine the source website of the news and its reprint rate. After browsing many web pages, it was found that the comments of netizens are concentrated on the top websites; netizens on other websites make almost no responses, so selecting these pages also yields a more accurate news response rate. After each news topic and its reprinting pages are obtained, the relevant information is extracted to analyze the influence of each topic according to the above algorithm: the influence coefficient is computed, the topics are sorted by it, and the ranking result of the quantitative analysis is obtained. Next, several people were asked to rank these topics, and their rankings were combined into the result of a manual qualitative analysis. Finally, the consistency of the two results is compared.
From multiple comparisons, it is found that the sorting results calculated by this method are almost the same as the manual sorting results. The comparison results are shown in Figure 10 and Figure 11. This comparison concerns the influence of news on unrelated topics; experiments show that the method is also applicable to related topics.
Query speed is a metric that measures the efficiency of a system in processing and responding to user queries. It quantifies the time it takes for a system to execute a search or retrieval operation and deliver the relevant results to the user. Table 2 and Figure 12 show the query speed results. Compared with the existing methods (Generalized Vector Space Model (GVSM): 10 sec; TF-IDF: 12 sec), the proposed method (SGVSM: 5 sec) is faster in Web news media retrieval.
Calibration rate is a metric used to assess the accuracy of predicted probabilities in a predictive model. It measures how well the predicted probabilities align with the actual outcomes or events. Table 3 and Figure 13 show the calibration rate results. Compared with the existing methods (GVSM: 70%; TF-IDF: 68%), the proposed method (SGVSM: 82%) achieves a higher calibration rate in Web news media retrieval.

Implication
The combination of Web News Media Retrieval Analysis and Knowledge Recognition of Semantic Grouping Vector Space Model has significant practical applications for information retrieval and analysis in the field of online news media. This strategy seeks to improve the efficiency and relevancy of web news searches by integrating sophisticated approaches in information retrieval, semantic grouping, and knowledge recognition. The model enhances the extraction of significant insights by identifying semantic relationships within the material, allowing for more precise categorization and collection of articles based on their underlying knowledge structures. This integration enhances a news retrieval system by making it more sophisticated and contextually aware.
Additionally, it has the potential to improve the user experience by delivering more coherent and informative results. Furthermore, this approach may be utilized in several domains, such as journalism, research, and data analysis, providing a valuable instrument for effectively navigating and understanding the extensive and ever-changing realm of web-based news media.

Discussion
The Generalized Vector Space Model may struggle to capture semantic nuances and ambiguity in language. Words with multiple meanings or contexts may be represented by a single vector, leading to potential confusion. The TF-IDF vector space model faces challenges with synonymy and polysemy, where different words may have similar meanings or a single word may have multiple meanings. With the proposed semantic vector space model, superior performance in capturing semantic nuances and addressing the challenges of ambiguity becomes possible. This model endeavors to represent words in a multidimensional space, taking into account their semantic relationships, thereby offering a more nuanced and context-aware representation. By considering the inherent meanings and associations between words, the semantic vector space model aims to enhance accuracy and effectiveness.

Conclusions
Compared with previous vector space models, the vector space model based on the semantic grouping of feature words proposed in this paper is more accurate for Web news information and is suitable for retrieval in Web news information systems. The calibration rate of document queries, the comprehensive evaluation rate F, and the query speed have been significantly improved. The current trend is toward personalized services. The general retrieval systems of the past can no longer meet retrieval requirements across different environments, purposes, and times. This paper has carried out a series of research and analysis on Web news media retrieval. The experiments reveal the interference factors in the calculation of the semantic grouping vector space model. They also show that the accuracy of analysis has been improved and that the interests and needs of users can be correctly expressed, so that the accuracy of Web news media retrieval is further enhanced. Keeping up with real-time updates and delivering the latest news can be a challenge. Systems might struggle to provide up-to-the-minute information due to processing delays or the dynamic nature of news. Future research may focus on improving semantic understanding to provide more accurate and contextually relevant results. This could involve advanced natural language processing techniques, including sentiment analysis, entity recognition, and topic modeling.

Proposition 1.
Assume that user u is conditionally independent of multimedia digital archive d given the predetermined classification model. Then the probability that multimedia digital archive d is recommended to user u can be obtained from the total probability formula.

Figure 2: Design process of user modeling.
Since the reprinting relationship of Web news has two types, direct reprinting and indirect reprinting, the source site cannot be determined initially, and the two attribute values of all nodes are initialized to 1 across the entire network. In pt→qt, the website pt reprints the news of the website qt. The content quality attribute value and the reprint attribute value are calculated iteratively. The attribute values of all web pages are normalized to 1 when each iteration is completed.

Figure 4: Statistics of the person-time number of responses.

Figure 6: List of experimental web pages.

Figure 10: Results of manual ranking of designated news influence.

Figure 13: Calibration rate.
For multimedia digital archives, user Web news media retrieval is completed in three stages: collecting and analyzing user data, constructing the user interest model, and updating the user interest model. If the Web news media system wants to obtain user information, it must first get the user's explicit information, such as the user'
Because the user's interest is not fixed, it is necessary to establish an update mechanism when building the model, so that forgotten topics are removed in time, new content is added, the weights of the user's interests are calculated, and the interests are ranked according to the proportion of the weight. People's forgetting follows a trend that is fast at the beginning and gradually slows. In the interest model system, the weight of each keyword of interest is multiplied by a factor based on the update time, the keyword weights are sorted, and the forgotten interest topics are deleted. The effective behavior of the user is then tracked, and each newly harvested keyword has its weight recalculated. If the weight exceeds the threshold, the keyword is added to the user interest model to complete the model update.
It can be proved from Proposition 1 that, according to the results of the ranking query based on the recommendation probability, the semantic grouping vector space model calculation can be used for queries by multimedia digital archive users. Based on p(u) of inequality (

Table 1: List of relative response rates.

Table 2: Query speed

Table 3: Calibration rate
The correctness of case predictions is measured as a percentage of complete occurrences. Figure 14 and Table 4 depict the comparative evaluation of accuracy in suggested and traditional methods. When compared to currently existing methods such as GVSM

Table 7: Computational time