Using PostgreSQL
In doing a large amount of data (100,000 level) import, certain fields and database table data (ten million) duplication needs to be excluded, the table data query to use the program way to determine the de-emphasis of the efficiency is very low, and then consider the use of temporary tables.
- First insert the new data into the temporary table, which has the same structure as the original table.
- Query the non-duplicated data DataA using SQL.
- Insert DataA into the original table.
Because of the non-repeating data I have to do some other processing, so check out DataA, if you do not need to do special processing can be used directlyinsert into select
Insert the data from step 2 into the original table in the same way as in step 2, and then empty the temporary table.
There are two ways to do step 2, one is to use thenot exists
in a manner such as
SELECT
*
FROM
tableTemp t1
WHERE
NOT EXISTS ( SELECT 1 FROM tableName WHERE columnA = AND columnB = )
The second way is to useleft join
SELECT
*
FROM
tableTemp t1
LEFT JOIN tableName t2 ON = AND =
WHERE
IS NULL
Tested (everyone's table structure and indexes are different, I'm just providing a reference here)
When the amount of data in the temporary table is small, thenot exists
It took less time, with more data it took longer. When the data reaches 10w, it takes 25s.
When the amount of data in the temporary table is small, theleft join
It took 30s, not much change with more data, and 40s when the data reached 10w.
Conclusion 1: When de-duplicating a single table, as long as the amount of imported data is not particularly large (20w or more), it is preferred to use not exists to do de-duplication.
But there is another case where you need to do de-duplication of two tables.
for example
SELECT
*
FROM
tableTemp t1
WHERE
NOT EXISTS (
SELECT 1 FROM tableNameA WHERE columnA = AND columnB =
union all
select 1 from tableNameB WHERE columnA = AND columnB =
);
SELECT
*
FROM
tableTemp t1
LEFT JOIN tableNameA t2 ON = AND =
LEFT JOIN tableNameB t3 ON = AND =
WHERE
IS NULL
AND IS NULL
In this case.
When there is little data in the temporary table, thenot exists
It took less time, with more data it took longer. When the data reached 10w, it took 150s!!!!
When there is little data in the temporary table, theleft join
The time taken is still 30s, and it gets longer as the amount of data gets bigger. When the data reaches 10w, the time taken is still 40s.
Both are comparable in terms of time spent at 3w data size
Conclusion 2: For double table de-duplication, use not exists when the imported data is below 3w, and use left join when it is above 3w.