PostgreSQL, trigrams and similarity

Just tested PostgreSQL 9.6.2 on my Mac, playing with n-grams.
Assume there is a GIN trigram index on the winery column.

Similarity limit (I know this is deprecated):

SELECT set_limit(0.5);

I am building a trigram search on a table with 2.3M rows.

My selection code:

SELECT winery, similarity(winery,'chateau chevla blanc') AS similarity 
FROM usr_wines
WHERE status=1 AND winery%'chateau chevla blanc'
ORDER BY similarity DESC;

My results (329 milliseconds on my Mac):

Chateau ChevL Blanc 0,85
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc, 0,736842
Chateau Blanc 0,736842
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc (7) 0,666667
Chateau Cheval Blanc Cbo 0,64
Chateau Du Cheval Blanc 0,64
Chateau Du Cheval Blanc 0,64

Well, I don’t understand how “Chateau blanc” can have a higher similarity than “Chateau Cheval Blanc” in this case. I understand that both contain the exact same words “chateau” and “blanc”, but “Chateau blanc” has no “cheval” at all.

Also, why is “Chateau ChevL Blanc” first? The letter “a” is missing!

Well, my goal is to match all possible duplicates when I give a winery name, even if it is misspelled. What am I missing?

The concept of trigram similarity relies on dividing any sentence into “trigrams” (sequences of three consecutive letters) and treating the result as a SET (i.e. the order does not matter, and there are no repeated values). Before the sentence is processed, two spaces are added at the beginning and one at the end, and single spaces are replaced by double ones.

Trigrams are a special case of N-grams.
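
The padding-and-windowing rule described above can be mimicked outside the database. What follows is only a minimal Python sketch, not pg_trgm's actual implementation: the function name trigrams is made up for illustration, and the sketch assumes plain ASCII text, lowercases the input, treats every non-alphanumeric character as a word separator, and pads each word with two leading spaces and one trailing space.

```python
import re


def trigrams(text):
    """Approximate the trigram set pg_trgm builds for ASCII text (illustrative sketch)."""
    result = set()
    # Assumption: a word is a maximal run of ASCII letters or digits, compared in lowercase.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces before and one after every word
        # Slide a three-character window over the padded word.
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result  # a set: order is irrelevant and duplicates collapse


print(sorted(trigrams("Chateau blanc")))
print(len(trigrams("Chateau blanc")))  # prints 14
```

Because the result is a set, a trigram that occurs in several words is counted only once.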

The trigrams corresponding to “Chateau blanc” are found by looking at all the sequences of three characters that appear in it:

  chateau  blanc 
---                 => '  c'
 ---                => ' ch'
  ---               => 'cha'
   ---              => 'hat'
    ---             => 'ate'
     ---            => 'tea'
      ---           => 'eau'
       ---          => 'au '
        ---         => 'u  '
         ---        => '  b'
          ---       => ' bl'
           ---      => 'bla'
            ---     => 'lan'
             ---    => 'anc'
              ---   => 'nc '

Sort them and remove repetitions to get the list below. Note that 'u  ' is not in it: pg_trgm pads each word separately, so a window that straddles the gap between two words never becomes a trigram.

'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'
'tea'

This can be calculated by PostgreSQL through the function show_trgm:

SELECT show_trgm('Chateau blanc') AS A

A = [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

... which is 14 trigrams. (See pg_trgm.)

The trigrams corresponding to “chateau chevla blanc” are:

SELECT show_trgm('chateau chevla blanc') AS B

B = [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

... which is 19 trigrams.

If you count how many trigrams the two sets have in common, you find the following:

A intersect B = 
[  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

And all the trigrams they have in total are:

A union B = 
[  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

That is, the two sentences have 14 trigrams in common and 19 in total.
The similarity is calculated as follows:

similarity = 14 / 19

You can check it with:

SELECT 
cast (14.0/19.0 as real) AS computed_result,
similarity('Chateau blanc','chateau chevla blanc') AS function_in_pg

You will see that you get: 0.736842

... which explains how the similarity is computed and why you get the values you get.
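
The same ratio also accounts for the other rows in the question. The sketch below is a rough Python imitation of the calculation, not pg_trgm itself: trigrams and similarity are illustrative names, and the word-splitting rule (ASCII letters and digits only, everything else a separator) is an assumption.

```python
import re


def trigrams(text):
    """Approximate pg_trgm's trigram set for ASCII text (illustrative sketch)."""
    result = set()
    # Assumption: a word is a maximal run of ASCII letters or digits, compared in lowercase.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces before and one after every word
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Trigrams in common divided by distinct trigrams in total."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0  # nothing to compare
    return len(ta & tb) / len(ta | tb)


query = "chateau chevla blanc"
for name in ("Chateau ChevL Blanc", "Chateau Blanc", "Chateau Cheval Blanc"):
    print(f"{name}: {similarity(name, query):.6f}")
```

Under these assumptions the sketch gives 0.850000, 0.736842 and 0.727273, the same values as in the question. “Chateau ChevL Blanc” shares 17 of 20 trigrams with the misspelled query, while the correctly spelled “Chateau Cheval Blanc” shares only 16 of 22, because swapping “al” for “la” changes three trigrams on each side.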

Note: You can calculate the intersection and union in the following ways:

SELECT
    array_agg(t) AS in_common
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t
    INTERSECT
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
    ORDER BY t
) AS trigrams_in_common ;

SELECT
    array_agg(t) AS in_total
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t
    UNION
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;
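
Set differences are just as informative as the intersection and the union, because they show which trigrams a misspelling actually changes. The following Python sketch compares the misspelled query with the correctly spelled name. It is an approximation of pg_trgm, not its real code: the helper name trigrams and the ASCII-only word splitting are assumptions made for illustration.

```python
import re


def trigrams(text):
    """Approximate pg_trgm's trigram set for ASCII text (illustrative sketch)."""
    result = set()
    # Assumption: a word is a maximal run of ASCII letters or digits, compared in lowercase.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces before and one after every word
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


query = trigrams("chateau chevla blanc")
candidate = trigrams("Chateau Cheval Blanc")

only_in_query = sorted(query - candidate)      # trigrams produced by the typo
only_in_candidate = sorted(candidate - query)  # trigrams the typo destroyed

print("only in query:    ", only_in_query)
print("only in candidate:", only_in_candidate)
print("in common:", len(query & candidate), "in total:", len(query | candidate))
```

With these inputs the query alone contributes 'evl', 'la ' and 'vla', and the candidate alone contributes 'al ', 'eva' and 'val', so a single transposition costs three trigrams on each side.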

This is a way to explore the similarity of different sentence pairs:

WITH p AS
(
    -- The two sentences to compare
    SELECT
        'This is just a sentence I''ve invented'::text AS f1,
        'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
    SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
    SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
    SELECT
        (SELECT count(*) FROM
            (SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
        (SELECT count(*) FROM
            (SELECT * FROM t1 UNION SELECT * FROM t2) AS s0)::integer AS total,
        similarity(f1, f2) AS sim_2
    FROM
        p
)
SELECT
    same, total, same::real/total::real AS sim_1, sim_2
FROM
    x ;

You can view it at Rextester
