Content Relevancy Quantification Model  

 

                                               Yuri Arkhipkin

                                                aryur@yandex.ru

 

 

Any content sensitizes sense, or yields results, when processed by a brain or a computer. Content search aims to provide access to the results of content processing. These results must be of sufficient quality to rely on, so content search must be reliable enough to supply relevant content for processing.

The reliability of content search may be viewed as the probability of detecting, or covering, relevant content by applying a query. Irrelevancy is the deviation of the content coverage results from customer requirements. This deviation is determined by the compliance of the query and content specifications with the search engine implementation. The quality of the subject-matter query specification influences the content search reliability (relevancy).

In general we know nothing about the content to be searched, only some ideas concerning the queries and the content-processing results.

The Content Relevancy Quantification Model (CRQM) considers any content as a set of n queries with a queries' variety number N ≤ n. Any content query qv (v = 1, 2, …, N) may sensitize sense and is thus potentially sense-inherent, even if the input (search) query is empty for a given content. The input query, applied for sense detection (sensitization), may discover or recover content with some relevancy to the search results.

Queries are viewed as structural and/or functional elements of the content (document, collection, corpus), including subject-matter data (e.g., see the XML and XQuery grammars [http://www.w3.org/]). Let the query set structure define the queries' variety number N and an occurrence probability pv (v = 1, 2, …, N) of the v-th query, which defines the probability of sensitizing sense in response to this query.

It is highly improbable that an input query includes all N queries' variety specifying the content under search, because N is very large for almost any content. In practice, to yield the specified search results at the required relevancy level, it is enough for a content to have only sv sensitive queries. The number sv of potentially sensitive queries sensitized throughout the content search defines the content's semantic coverage cv = sv/n and is a known parameter of search models.
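The ratios just introduced can be sketched directly; a minimal illustration (the function and variable names are mine, not from the CRQM):

```python
# Sketch of the CRQM's basic ratios (illustrative names, not from the source).
# n   -- total number of content queries
# s_v -- number of potentially sensitive queries sensitized during search
# n_v -- number of occurrences of the v-th query

def semantic_coverage(s_v: int, n: int) -> float:
    """Content's semantic coverage c_v = s_v / n."""
    return s_v / n

def occurrence_probability(n_v: int, n: int) -> float:
    """Occurrence probability p_v = n_v / n of the v-th query."""
    return n_v / n

# Example: 41 occurrences of a query in a content of 8_388_793 queries.
print(occurrence_probability(41, 8_388_793))  # ~4.89e-06
```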

It is natural to consider any content under search relevant enough if there exists at least some fixed, previously stated number sv of sensitive queries as a minimal part cv of the total content queries' number n.

Any search query may sensitize sense and is potentially sense-inherent.

Consider any content as a system comprised of n queries; the sensitive set of this system consists of states, each including more than sv sensitive queries. Assume the queries are pairwise independent and uniformly distributed over the total queries' number n = const. Content relevancy R (reliability), i.e., the probability of sensitizing sense, is quantified in the CRQM by the known equation

R = exp(-λ),                                                                       (1)

where λ is the content's irrelevancy intensity, defined by the CRQM and measured in irrelevancy (probability of not sensitizing sense) per content query.
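Eq. (1) is simple to evaluate; a minimal sketch (the function name is an illustrative choice, not from the source):

```python
import math

def relevancy(irrelevancy_intensity: float) -> float:
    """Content relevancy R = exp(-lambda), Eq. (1) of the CRQM."""
    return math.exp(-irrelevancy_intensity)

# A fully relevant query has lambda = 0, so R = 1.
print(relevancy(0.0))         # 1.0
# Larger irrelevancy intensity drives relevancy toward 0.
print(relevancy(3.15341073))  # ~0.0427
```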

The content query set structure is defined during indexing as a content query semantics matrix (CQSM). The semantics of the query's parameter values is defined by the content's subject matter and by particularities of the search engine's algorithm implementation. The CQSM is the semantic quantifier's (SEQUANTIC tool's) database for query indexing (quantification), refining the content's semantic mean ps, semantic coverage cv, and semantic shift cv - ps, thus providing content relevancy quantification. Content search assures the achievement of the required semantic coverage value.

Consider a content structured as an N queries' variety number with semantic mean ps = 1/N. The most relevant query (Rv = 1, λv = 0) for any content is the input query that either complies in full (cv = 1) with the content queries or is empty (cv = 0) for the content. The most irrelevant input query (Rv → 0, λv → ∞) is, in general, defined at cv = ps = pv, so that the semantic coverage equals the semantic mean and the occurrence probability of the query qv for a content consisting of equally unique content queries (terms). A query discovers a content with higher relevancy the greater the inequality cv < ps. A query recovers a content with higher relevancy the greater the inequality cv > ps.
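The discovery/recovery conditions can be expressed as a small classifier; a sketch under the assumption that the equality cv = ps is tested within a tolerance (the function name and tolerance are mine):

```python
def classify_query(c_v: float, p_s: float, tol: float = 1e-12) -> str:
    """Classify an input query against a content, per the CRQM conditions:
    c_v < p_s  -> the query discovers the content,
    c_v > p_s  -> the query recovers the content,
    c_v == p_s -> the most irrelevant case (coverage equals semantic mean).
    """
    if abs(c_v - p_s) <= tol:
        return "most irrelevant"
    return "discovery" if c_v < p_s else "recovery"

p_s = 1 / 8_388_608  # semantic mean of the article's own worked example
print(classify_query(0.0000069136, p_s))  # 'recovery'
```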

Content relevancy metrics, continuously evaluated throughout the search process and compared with the required ones, make it possible to take decisions on searching. The achieved relevancy level reflects changes either in the subject-matter requirements (input queries) or in the search engine implementation particularities, thus enabling optimization throughout the content engineering process.

The SEQUANTIC tool, as a content product lifecycle management solution, may be viewed as a test bed and/or a cradle for many existing or upcoming content relevancy models and search engine implementations.

Each content product released under the SEQUANTIC tool may be equipped with a relevancy e-certificate needed for acquisition, remedial, and continual improvement processes, thus suggesting a quantifiable-improvement approach to the W3C content development standards, e.g., XML and XQuery refinements and enhancements.

CRQM-based Sequantic engineering yields validated and verifiable relevancy estimations, integrating qualitative subject-matter queries and quantitative relevancy metrics of content. Sequantic engineering provides continuous relevancy evaluation throughout the content engineering process, thus accounting for and tracing quantitative content relevancy requirements from the customer to the content product.

The CRQM improves latent semantic indexing, especially for unknown and/or heterogeneous collections, by increasing the relevancy, precision, and recall of content search, including full-text search. The CRQM may be used for data exploration and data integration tasks (due to its potential to quantify the content's semantics), to solve heterogeneity problems, and to provide varied levels of querying services, which facilitates knowledge discovery at different levels of granularity.

Consider the shaded and underlined one-word content queries qv (v = 3, 4, …, 25) of this article as t = 23 tokens. The content above may then be structured as

N = Σ_{i=0}^{t} C(t, i) = 2^t = 8388608

queries' variety number with semantic mean ps = 1/N = 0.0000001192 and total queries' number n = 8388793 (see the table below). According to the CRQM quantification, the most relevant recovery, λ26 = 1.6663E-66, may be performed by the "content search" query, and the most irrelevant recoveries, λ11 = λ21 = 3.15341073, may be performed by the "recover" and "deviation" queries at the 0.042706219 relevancy level. The queries "semantics" and "engine" may be considered four-sigma (λ ≤ 0.00621) relevant recovery queries for the content above. The query "term" is a 0.125443318-relevant discovery query for the content example.
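The queries'-variety count used above follows from counting all subsets of the t = 23 tokens: N = Σ_{i=0}^{t} C(t, i) = 2^t. A quick check of the arithmetic:

```python
from math import comb

t = 23  # one-word content queries (tokens) of the article
N = sum(comb(t, i) for i in range(t + 1))  # all subsets of t tokens
assert N == 2 ** t == 8_388_608

p_s = 1 / N  # semantic mean
print(N, p_s)  # 8388608 1.1920928955078125e-07
```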

 

Table. The CRQM relevancy quantification example for the content above

 v | Query set structure qv       | Occurrences nv | Occurrence probability pv = nv/n | Semantic coverage cv | Irrelevancy intensity λv | Relevancy Rv
 1 | All queries' set             |  1 | 0.0000001192 | 1.0          | 0.0        | 1.0
 2 | Empty query                  |  0 | 0.0          | 0.0          | 0.0        | 1.0
 3 | content                      | 41 | 0.0000048872 | 0.0000048872 | 2.9692E-41 | 1-2.9692E-41
 4 | search                       | 22 | 0.0000026224 | 0.0000026224 | 1.1102E-16 | 1-1.1102E-16
 5 | relevant                     | 13 | 0.0000015496 | 0.0000015496 | 1.4346E-07 | 0.999999857
 6 | sensitize                    | 12 | 0.0000014304 | 0.0000014304 | 1.1129E-06 | 0.999998887
 7 | coverage                     |  5 | 0.000000596  | 0.000000596  | 0.10854887 | 0.897135052
 8 | query                        | 32 | 0.0000038144 | 0.0000038144 | 4.5650E-29 | 1-4.5650E-29
 9 | reliable                     |  4 | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
10 | discover                     |  3 | 0.0000003576 | 0.0000003576 | 0.97014278 | 0.379028916
11 | recover                      |  2 | 0.0000002384 | 0.0000002384 | 3.15341073 | 0.042706219
12 | irrelevancy                  |  3 | 0.0000003576 | 0.0000003576 | 0.97014278 | 0.379028916
13 | sense                        |  6 | 0.0000007152 | 0.0000007152 | 0.03063331 | 0.969831136
14 | semantics                    |  8 | 0.0000009536 | 0.0000009536 | 0.00160992 | 0.99839137
15 | engine                       |  8 | 0.0000009536 | 0.0000009536 | 0.00160992 | 0.99839137
16 | quantify                     |  7 | 0.0000008344 | 0.0000008344 | 0.00751214 | 0.99251601
17 | processing                   |  3 | 0.0000003576 | 0.0000003576 | 0.97014278 | 0.379028916
18 | results                      |  7 | 0.0000008344 | 0.0000008344 | 0.00751214 | 0.99251601
19 | probability                  |  4 | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
20 | requirements                 |  4 | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
21 | deviation                    |  2 | 0.0000002384 | 0.0000002384 | 3.15341073 | 0.042706219
22 | implementation               |  4 | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
23 | specification                |  4 | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
24 | collection                   |  2 | 0.0000002384 | 0.0000002384 | 3.15341073 | 0.042706219
25 | term                         |  1 | 0.0000001192 | 0.0000001192 | 2.07590127 | 0.125443318
26 | content search               |  5 | 0.000000596  | 0.0000069136 | 1.6663E-66 | 1-1.6663E-66
27 | relevant content             |  2 | 0.0000002384 | 0.0000061984 | 2.6302E-57 | 1-2.6302E-57
28 | content coverage             |  2 | 0.0000002384 | 0.0000052448 | 2.6302E-57 | 1-2.6302E-57
29 | search engine                |  3 | 0.0000003576 | 0.0000032184 | 9.3429E-23 | 1-9.3429E-23
30 | search engine implementation |  2 | 0.0000002384 | 0.000003576  | 2.6302E-57 | 1-2.6302E-57
31 | content processing           |  2 | 0.0000002384 | 0.0000050064 | 2.6302E-57 | 1-2.6302E-57
32 | content engineering          |  2 | 0.0000002384 | 0.0000056024 | 2.6302E-57 | 1-2.6302E-57
33 | Sequantic engineering        |  2 | 0.0000002384 | 0.000001192  | 2.6302E-57 | 1-2.6302E-57
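The last two table columns are mutually consistent with Eq. (1): each Rv equals exp(-λv). The article does not give the CRQM's formula for λv itself, so only this relation is spot-checked here (row values copied from the table):

```python
import math

# (query, irrelevancy intensity lambda_v, relevancy R_v) from the table above
rows = [
    ("coverage", 0.10854887, 0.897135052),
    ("term",     2.07590127, 0.125443318),
    ("recover",  3.15341073, 0.042706219),
]
for query, lam, r_v in rows:
    assert abs(math.exp(-lam) - r_v) < 1e-6, query
print("R = exp(-lambda) holds for the sampled rows")
```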

 


