Yuri Arkhipkin
Any content sensitizes sense, or yields results, when processed by a brain or a computer. Content search aims to provide access to the results of content processing. These results must be of sufficient quality to rely on, so content search must be reliable enough to provide relevant content for processing.
Reliability of content search may be viewed as the probability of relevant content detection, or coverage, by applying a query. Irrelevancy is the deviation of content coverage results from customer requirements. The deviation is defined by the compliance of the query and content specifications with the search engine implementation. The quality of the subject matter query specification influences content search reliability (relevancy).
In general we know nothing about the content to be searched, only some ideas concerning queries and content processing results.
The Content Relevancy Quantification Model (CRQM) considers any content as a set of n queries with a queries' variety number N ≤ n. Any content query qv (v = 1, 2, …, N) may sensitize sense and is a potentially sense-inherent one, even if the input (search) query is empty for a given content. An input query, applied for sense detection (sensitization), may discover or recover content with some relevancy to the search results.
Queries are viewed as structural and (or) functional content (document, collection, corpus) elements (e.g., see the XML and XQuery grammars [http://www.w3.org/]), including subject matter data. Let the query set structure define the queries' variety number N and the occurrence probability pv (v = 1, 2, …, N) of the v-th query, which defines the probability of sensitizing sense in response to this query.
It is most improbable that an input query includes all N queries' variety specifying the content under search, because N is very large for almost any content. In practice, to yield the specified search results at the required relevancy level, it is enough for a content to have only sv sensitive queries. The number sv of potentially sensitive queries, sensitized throughout content search, defines the content's semantic coverage cv = sv/n and is a known parameter of search models.
It is natural to consider any content under search relevant enough if there exists at least some fixed, previously stated number sv of sensitive queries as a minimal part cv of the total content queries' number n.
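The coverage criterion above can be sketched as a small check, assuming only the definition cv = sv/n; the function names and the required-coverage threshold are illustrative, not part of the model:

```python
# Semantic coverage check: a minimal sketch of the criterion above.
# Assumes only the definition c_v = s_v / n; names are illustrative.

def semantic_coverage(s_v: int, n: int) -> float:
    """Coverage: sensitive queries s_v as a fraction of total queries n."""
    return s_v / n

def relevant_enough(s_v: int, n: int, c_required: float) -> bool:
    """Content is relevant enough if its coverage meets the required minimum."""
    return semantic_coverage(s_v, n) >= c_required

# Example using numbers from the worked example below:
# 41 occurrences of "content" out of n = 8388793 total content queries.
print(semantic_coverage(41, 8388793))
```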
Consider any content as a system comprised of n queries; the sensitive set of this system consists of states, each including more than sv sensitive queries. Consider the queries pairwise independent and uniformly distributed over all possible queries' number n = const. Content relevancy R (reliability), or the probability of sensitizing sense, is quantified in the CRQM by the known equation
R = exp(−λ), (1)
where λ is the content's irrelevancy intensity, defined by the CRQM and measured in irrelevancy (probability of not sensitizing sense) per content query.
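Equation (1) can be checked numerically against the quantification table later in this article; a minimal sketch, assuming only R = exp(−λ):

```python
import math

def relevancy(lam: float) -> float:
    """CRQM content relevancy R from irrelevancy intensity λ, per equation (1)."""
    return math.exp(-lam)

# λ values taken from the quantification table in this article:
print(relevancy(0.0))         # 1.0 (fully relevant input query, λ = 0)
print(relevancy(2.07590127))  # the "term" query
print(relevancy(3.15341073))  # the "recover" and "deviation" queries
```

The printed values reproduce the Rv column of the table (0.125443318 and 0.042706219 for the last two calls), confirming that the tabulated relevancies follow equation (1).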
The content query set structure is defined during indexing as a content query semantics matrix (CQSM). The semantics of the query's parameter values is defined by the content subject matter and the particularities of the search engine's algorithm implementation. The CQSM is the semantic quantifier's (SEQUANTIC tool) database for queries' indexing (quantification), refining the content's semantic mean ps, semantic coverage cv, and semantic shift cv − ps, thus providing content relevancy quantification. Content search assures the achievement of the required semantic coverage value.
Consider a content structured as N queries' variety number with semantic mean ps = 1/N. The most relevant query (Rv = 1, λv = 0) for any content is the input query that either complies in full (cv = 1) with the content queries or is empty (cv = 0) for the content. The most irrelevant input query (Rv → 0) in general is defined at cv = ps = pv and λv → ∞, so that the semantic coverage equals the semantic mean and the occurrence probability of the query qv for a content consisting of equally unique content queries (terms). A query discovers a content with higher relevancy the greater the inequality cv < ps; a query recovers a content with higher relevancy the greater the inequality cv > ps.
Content relevancy metrics, continuously evaluated throughout the search process and compared with the required ones, provide a basis for making decisions on searching. The achieved relevancy level reflects changes either in subject matter requirements (input queries) or in search engine implementation particularities, thus providing possibilities for optimization throughout the content engineering process.
The SEQUANTIC tool, as a content product lifecycle management solution, may be viewed as a test bed or (and) a cradle for many existing or upcoming content relevancy models and search engine implementations.
Each content product released under the SEQUANTIC tool may be equipped with a relevancy e-certificate needed for acquisition, remedial, and continual improvement processes, thus suggesting a quantifiable improvement approach to the content development W3C standards, e.g., XML and XQuery refinements and enhancements.
The CRQM-based Sequantic engineering yields validated and verifiable relevancy estimations, integrating qualitative subject matter queries and quantitative relevancy metrics of content. Sequantic engineering provides continuous relevancy evaluations throughout the content engineering process, thus accounting for and tracing quantitative content relevancy requirements from the customer to the content product.
The CRQM improves latent semantic indexing, especially for unknown and (or) heterogeneous collections, by increasing the relevancy, precision, and recall of content search, including full-text search. The CRQM may be used for data exploration and data integration tasks (due to its potential to quantify the content's semantics), to solve heterogeneity problems, and to provide varied levels of querying services, which facilitates knowledge discovery at different levels of granularity.
Consider the shaded and underlined one-word content queries qv (v = 3, 4, …, 25) of this article as t = 23 tokens. The content above may then be structured as N = Σ_{i=0}^{t} C(t, i) = 2^t = 8388608 queries' variety number with semantic mean ps = 1/N = 0.0000001192 and total queries' number n = 8388793 (see the table below). According to the CRQM quantification, the most relevant recovery, λ26 = 1.6663E-66, is performed by the "content search" query, and the most irrelevant recoveries, λ11 = λ21 = 3.15341073, are performed by the "recover" and "deviation" queries at the 0.042706219 relevancy level. The queries "semantics" and "engine" may be considered four-sigma (λ ≤ 0.00621) relevant recovery queries for the content above. The query "term" is a 0.125443318-relevant discovery query for the content example.
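The queries' variety number in this example can be reproduced directly; a minimal sketch, assuming N counts the subsets of the t = 23 tokens (the sum of binomial coefficients C(t, i), which equals 2^t):

```python
from math import comb

t = 23  # one-word content queries (tokens) of this article

# Queries' variety number: sum of binomial coefficients C(t, i) over i = 0..t,
# i.e. the number of token subsets, which equals 2**t.
N = sum(comb(t, i) for i in range(t + 1))
p_s = 1 / N  # semantic mean

print(N)    # 8388608
print(p_s)  # ≈ 1.192e-07, the semantic mean quoted above
```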
Table. The CRQM relevancy quantification example for the content above
v  | Query set structure qv       | Occur # nv | Occurrence probability pv = nv/n | Semantic coverage cv | Irrelevancy intensity λv | Relevancy Rv
1  | All queries' set             | 1  | 0.0000001192 | 1.0          | 0.0        | 1.0
2  | Empty query                  | 0  | 0.0          | 0.0          | 0.0        | 1.0
3  | content                      | 41 | 0.0000048872 | 0.0000048872 | 2.9692E-41 | 1-2.9692E-41
4  | search                       | 22 | 0.0000026224 | 0.0000026224 | 1.1102E-16 | 1-1.1102E-16
5  | relevant                     | 13 | 0.0000015496 | 0.0000015496 | 1.4346E-07 | 0.999999857
6  | sensitize                    | 12 | 0.0000014304 | 0.0000014304 | 1.1129E-06 | 0.999998887
7  | coverage                     | 5  | 0.000000596  | 0.000000596  | 0.10854887 | 0.897135052
8  | query                        | 32 | 0.0000038144 | 0.0000038144 | 4.5650E-29 | 1-4.5650E-29
9  | reliable                     | 4  | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
10 | discover                     | 3  | 0.0000003576 | 0.0000003576 | 0.97014278 | 0.379028916
11 | recover                      | 2  | 0.0000002384 | 0.0000002384 | 3.15341073 | 0.042706219
12 | irrelevancy                  | 3  | 0.0000003576 | 0.0000003576 | 0.97014278 | 0.379028916
13 | sense                        | 6  | 0.0000007152 | 0.0000007152 | 0.03063331 | 0.969831136
14 | semantics                    | 8  | 0.0000009536 | 0.0000009536 | 0.00160992 | 0.99839137
15 | engine                       | 8  | 0.0000009536 | 0.0000009536 | 0.00160992 | 0.99839137
16 | quantify                     | 7  | 0.0000008344 | 0.0000008344 | 0.00751214 | 0.99251601
17 | processing                   | 3  | 0.0000003576 | 0.0000003576 | 0.97014278 | 0.379028916
18 | results                      | 7  | 0.0000008344 | 0.0000008344 | 0.00751214 | 0.99251601
19 | probability                  | 4  | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
20 | requirements                 | 4  | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
21 | deviation                    | 2  | 0.0000002384 | 0.0000002384 | 3.15341073 | 0.042706219
22 | implementation               | 4  | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
23 | specification                | 4  | 0.0000004768 | 0.0000004768 | 0.33795093 | 0.713230286
24 | collection                   | 2  | 0.0000002384 | 0.0000002384 | 3.15341073 | 0.042706219
25 | term                         | 1  | 0.0000001192 | 0.0000001192 | 2.07590127 | 0.125443318
26 | content search               | 5  | 0.000000596  | 0.0000069136 | 1.6663E-66 | 1-1.6663E-66
27 | relevant content             | 2  | 0.0000002384 | 0.0000061984 | 2.6302E-57 | 1-2.6302E-57
28 | content coverage             | 2  | 0.0000002384 | 0.0000052448 | 2.6302E-57 | 1-2.6302E-57
29 | search engine                | 3  | 0.0000003576 | 0.0000032184 | 9.3429E-23 | 1-9.3429E-23
30 | search engine implementation | 2  | 0.0000002384 | 0.000003576  | 2.6302E-57 | 1-2.6302E-57
31 | content processing           | 2  | 0.0000002384 | 0.0000050064 | 2.6302E-57 | 1-2.6302E-57
32 | content engineering          | 2  | 0.0000002384 | 0.0000056024 | 2.6302E-57 | 1-2.6302E-57
33 | Sequantic engineering        | 2  | 0.0000002384 | 0.000001192  | 2.6302E-57 | 1-2.6302E-57
…  | …                            | …  | …            | …            | …          | …