Yuri Arkhipkin
Any content sensitizes sense, or yields results, when processed by a brain or a computer. Content search aims to provide access to the results of content processing. These results must be of sufficient quality to rely on, so content search must be reliable enough to provide relevant content for processing.
Reliability of content search may be viewed as the probability of relevant content detection, or coverage, by applying a query. Irrelevancy is the deviation of content coverage results from customer requirements. The deviation is defined by the compliance of the query and content specifications with the search engine implementation. The quality of the subject matter query specification influences content search reliability (relevancy).
In general we know nothing about the content to be searched, only some ideas concerning queries and content processing results.
The Content Relevancy Quantification Model (CRQM) considers any content as a set of n queries with a queries' variety number N ≤ n. Any content query qv (v = 1, 2, …, N) may sensitize sense and is a potentially sense-inherent one, even if the input (search) query is empty for a given content. An input query, applied for sense detection (sensitization), may discover or recover content with some relevancy to the search results.
Queries are viewed as structural and (or) functional content (document, collection, corpus) elements (e.g., see the XML and XQuery grammars [http://www.w3.org/]), including subject matter data. Let the query set structure define the queries' variety number N and the occurrence probability pv (v = 1, 2, …, N) of the v-th query, which defines the probability of sensitizing sense in response to this query.
It is most improbable that an input query includes all N queries' variety specifying the content under search, because N is very large for almost any content. In practice, to yield the specified search results at the required relevancy level, it is enough for a content to have only sv sensitive queries. The number sv of potentially sensitive queries, sensitized throughout content search, defines the content's semantic coverage cv = sv/n and is a known parameter of search models.
It is natural to consider any content under search relevant enough if there exists at least some fixed, previously stated number sv of sensitive queries as a minimal part cv of the total content queries' number n.
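The coverage criterion above can be sketched as a small check, assuming only the definition cv = sv/n; the function names and the required-coverage threshold are illustrative, not part of the model:

```python
# Semantic coverage check: a minimal sketch of the criterion above.
# Assumes only the definition c_v = s_v / n; names are illustrative.

def semantic_coverage(s_v: int, n: int) -> float:
    """Coverage: sensitive queries s_v as a fraction of total queries n."""
    return s_v / n

def relevant_enough(s_v: int, n: int, c_required: float) -> bool:
    """Content is relevant enough if its coverage meets the required minimum."""
    return semantic_coverage(s_v, n) >= c_required

# Example using numbers from the worked example below:
# 41 occurrences of "content" out of n = 8388793 total content queries.
print(semantic_coverage(41, 8388793))
```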
Consider any content as a system comprised of n queries; the sensitive set of this system consists of states, each including more than sv sensitive queries. Consider the queries pairwise independent and uniformly distributed over all possible queries' number n = const. Content relevancy R (reliability), or the probability of sensitizing sense, is quantified in the CRQM by the known equation
R = exp(−λ), (1)
where λ is the content's irrelevancy intensity, defined by the CRQM and measured in irrelevancy (probability of not sensitizing sense) per content query.
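Equation (1) can be checked numerically against the quantification table later in this article; a minimal sketch, assuming only R = exp(−λ):

```python
import math

def relevancy(lam: float) -> float:
    """CRQM content relevancy R from irrelevancy intensity λ, per equation (1)."""
    return math.exp(-lam)

# λ values taken from the quantification table in this article:
print(relevancy(0.0))         # 1.0 (fully relevant input query, λ = 0)
print(relevancy(2.07590127))  # the "term" query
print(relevancy(3.15341073))  # the "recover" and "deviation" queries
```

The printed values reproduce the Rv column of the table (0.125443318 and 0.042706219 for the last two calls), confirming that the tabulated relevancies follow equation (1).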
The content query set structure is defined during indexing as a content query semantics matrix (CQSM). The semantics of the query's parameter values is defined by the content subject matter and the particularities of the search engine's algorithm implementation. The CQSM is the semantic quantifier's (SEQUANTIC tool) database for queries' indexing (quantification), refining the content's semantic mean ps, semantic coverage cv, and semantic shift cv − ps, thus providing content relevancy quantification. Content search assures the achievement of the required semantic coverage value.
Consider a content structured as N queries' variety number with semantic mean ps = 1/N. The most relevant query (Rv = 1, λv = 0) for any content is the input query that either complies in full (cv = 1) with the content queries or is empty (cv = 0) for the content. The most irrelevant input query (Rv → 0) in general is defined at cv = ps = pv and λv → ∞, so that the semantic coverage equals the semantic mean and the occurrence probability of the query qv for a content consisting of equally unique content queries (terms). A query discovers a content with higher relevancy the greater the inequality cv < ps; a query recovers a content with higher relevancy the greater the inequality cv > ps.
Content relevancy metrics, continuously evaluated throughout the search process and compared with the required ones, provide a basis for making decisions on searching. The achieved relevancy level reflects changes either in subject matter requirements (input queries) or in search engine implementation particularities, thus providing possibilities for optimization throughout the content engineering process.
The SEQUANTIC tool, as a content product lifecycle management solution, may be viewed as a test bed or (and) a cradle for many existing or upcoming content relevancy models and search engine implementations.
Each content product released under the SEQUANTIC tool may be equipped with a relevancy e-certificate needed for acquisition, remedial, and continual improvement processes, thus suggesting a quantifiable improvement approach to the content development W3C standards, e.g., XML and XQuery refinements and enhancements.
The CRQM-based Sequantic engineering yields validated and verifiable relevancy estimations, integrating qualitative subject matter queries and quantitative relevancy metrics of content. Sequantic engineering provides continuous relevancy evaluations throughout the content engineering process, thus accounting for and tracing quantitative content relevancy requirements from the customer to the content product.
The CRQM improves latent semantic indexing, especially for unknown and (or) heterogeneous collections, by increasing the relevancy, precision, and recall of content search, including full-text search. The CRQM may be used for data exploration and data integration tasks (due to its potential to quantify the content's semantics), to solve heterogeneity problems, and to provide varied levels of querying services, which facilitates knowledge discovery at different levels of granularity.
Consider the shaded and underlined one-word content queries qv (v = 3, 4, …, 25) of this article as t = 23 tokens. The content above may then be structured as N = Σ_{i=0}^{t} C(t, i) = 2^t = 8388608 queries' variety number with semantic mean ps = 1/N = 0.0000001192 and total queries' number n = 8388793 (see the table below). According to the CRQM quantification, the most relevant recovery, λ26 = 1.6663E-66, is performed by the "content search" query, and the most irrelevant recoveries, λ11 = λ21 = 3.15341073, are performed by the "recover" and "deviation" queries at the 0.042706219 relevancy level. The queries "semantics" and "engine" may be considered four-sigma (λ ≤ 0.00621) relevant recovery queries for the content above. The query "term" is a 0.125443318-relevant discovery query for the content example.
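The queries' variety number in this example can be reproduced directly; a minimal sketch, assuming N counts the subsets of the t = 23 tokens (the sum of binomial coefficients C(t, i), which equals 2^t):

```python
from math import comb

t = 23  # one-word content queries (tokens) of this article

# Queries' variety number: sum of binomial coefficients C(t, i) over i = 0..t,
# i.e. the number of token subsets, which equals 2**t.
N = sum(comb(t, i) for i in range(t + 1))
p_s = 1 / N  # semantic mean

print(N)    # 8388608
print(p_s)  # ≈ 1.192e-07, the semantic mean quoted above
```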
Table. The CRQM relevancy quantification example for the content above
v  | Query set structure qv       | Occur # nv | Occurrence probability pv = nv/n | Semantic coverage cv | Irrelevancy intensity λv | Relevancy Rv
1  | All queries' set             | 1  | 0.0000001192 | 1.0          | 0.0        | 1.0
2  | Empty query                  | 0  | 0.0          | 0.0          | 0.0        | 1.0
3  | content                      | 41 | 0.0000048872 | 0.0000048872 | 2.9692E-41 | 1-2.9692E-41
4  | search                       | 22 | 0.0000026224 | 0.0000026224 | 1.1102E-16 | 1-1.1102E-16
5  | relevant                     | 13 | 0.0000015496 | 0.0000015496 | 1.4346E-07 | 0.999999857
6  | sensitize                    | 12 | 0.0000014304 | 0.0000014304 | 1.1129E-06 | 0.999998887
7  | coverage                     | 5  | 0.000000596  | 0.000000596  | 0.10854887 | 0.897135052
8  | query                        | 32 | 0.0000038144 | 0.0000038144 | 4.5650E-29 | 1-4.5650E-29
9  | reliable                     | 4  | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
10 | discover                     | 3  | 0.0000003576 | 0.0000003576 | 0.97014278 | 0.379028916
11 | recover                      | 2  | 0.0000002384 | 0.0000002384 | 3.15341073 | 0.042706219
12 | irrelevancy                  | 3  | 0.0000003576 | 0.0000003576 | 0.97014278 | 0.379028916
13 | sense                        | 6  | 0.0000007152 | 0.0000007152 | 0.03063331 | 0.969831136
14 | semantics                    | 8  | 0.0000009536 | 0.0000009536 | 0.00160992 | 0.99839137
15 | engine                       | 8  | 0.0000009536 | 0.0000009536 | 0.00160992 | 0.99839137
16 | quantify                     | 7  | 0.0000008344 | 0.0000008344 | 0.00751214 | 0.99251601
17 | processing                   | 3  | 0.0000003576 | 0.0000003576 | 0.97014278 | 0.379028916
18 | results                      | 7  | 0.0000008344 | 0.0000008344 | 0.00751214 | 0.99251601
19 | probability                  | 4  | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
20 | requirements                 | 4  | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
21 | deviation                    | 2  | 0.0000002384 | 0.0000002384 | 3.15341073 | 0.042706219
22 | implementation               | 4  | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
23 | specification                | 4  | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
24 | collection                   | 2  | 0.0000002384 | 0.0000002384 | 3.15341073 | 0.042706219
25 | term                         | 1  | 0.0000001192 | 0.0000001192 | 2.07590127 | 0.125443318
26 | content search               | 5  | 0.000000596  | 0.0000069136 | 1.6663E-66 | 1-1.6663E-66
27 | relevant content             | 2  | 0.0000002384 | 0.0000061984 | 2.6302E-57 | 1-2.6302E-57
28 | content coverage             | 2  | 0.0000002384 | 0.0000052448 | 2.6302E-57 | 1-2.6302E-57
29 | search engine                | 3  | 0.0000003576 | 0.0000032184 | 9.3429E-23 | 1-9.3429E-23
30 | search engine implementation | 2  | 0.0000002384 | 0.000003576  | 2.6302E-57 | 1-2.6302E-57
31 | content processing           | 2  | 0.0000002384 | 0.0000050064 | 2.6302E-57 | 1-2.6302E-57
32 | content engineering          | 2  | 0.0000002384 | 0.0000056024 | 2.6302E-57 | 1-2.6302E-57
33 | Sequantic engineering        | 2  | 0.0000002384 | 0.000001192  | 2.6302E-57 | 1-2.6302E-57
…  | …                            | …  | …            | …            | …          | …