SQL Server Development Bookmark and Share   
 index > SQL Server Integration Services > De-duping data
 

De-duping data

I have a list of companies that all have the same Company Id but have formatting issues with the names. For example I have the following:

CompanyId Name
12345 Microsoft
12345 Microsoft Inc.
12345 Microsoft, Inc.

I figured I would use a simple data flow task to insert the company id and name into a company table where the primary key is the company id (I would then ignore the errors of duplicate keys). I was expecting to see the first instance of the company inserted into the company table and the dupes ignored. That's not the case - I get an error on duplicate primary keys. Can I not do this?
JasonRP  Tuesday, February 06, 2007 1:54 AM
jrp210 wrote:
I have a list of companies that all have the same Company Id but have formatting issues with the names. For example I have the following:

CompanyId Name
12345 Microsoft
12345 Microsoft Inc.
12345 Microsoft, Inc.

I figured I would use a simple data flow task to insert the company id and name into a company table where the primary key is the company id (I would then ignore the errors of duplicate keys). I was expecting to see the first instance of the company inserted into the company table and the dupes ignored. That's not the case - I get an error on duplicate primary keys. Can I not do this?


This is more of a transact-sql question. The observed behavior is correct. In the sense of SSIS, you can use a sort transformation to eliminate dups. Or, you can structure your source query to do more of a select companyid, max(name) from table group by companyid.
Phil Brammer  Tuesday, February 06, 2007 2:05 AM
jrp210 wrote:

You are right. But, how would a sort transorm work to eliminate dupes?

There is an option in the sort transformation (a check box) to eliminate dups. You'd have to sort only on the companyID though.

Phil Brammer  Tuesday, February 06, 2007 3:49 AM
jrp210 wrote:
I have a list of companies that all have the same Company Id but have formatting issues with the names. For example I have the following:

CompanyId Name
12345 Microsoft
12345 Microsoft Inc.
12345 Microsoft, Inc.

I figured I would use a simple data flow task to insert the company id and name into a company table where the primary key is the company id (I would then ignore the errors of duplicate keys). I was expecting to see the first instance of the company inserted into the company table and the dupes ignored. That's not the case - I get an error on duplicate primary keys. Can I not do this?


This is more of a transact-sql question. The observed behavior is correct. In the sense of SSIS, you can use a sort transformation to eliminate dups. Or, you can structure your source query to do more of a select companyid, max(name) from table group by companyid.
Phil Brammer  Tuesday, February 06, 2007 2:05 AM

You are right. But, how would a sort transorm work to eliminate dupes?

JasonRP  Tuesday, February 06, 2007 3:41 AM
jrp210 wrote:

You are right. But, how would a sort transorm work to eliminate dupes?

There is an option in the sort transformation (a check box) to eliminate dups. You'd have to sort only on the companyID though.

Phil Brammer  Tuesday, February 06, 2007 3:49 AM

You can use google to search for other answers

Custom Search

More Threads

• need better performance than the lookup can provide
• How to upload two columns on an exisiting database table
• Getting connection info in a script task - how?
• Reusability & SSIS - issues they do not write in books about
• Help with child package script component
• Cannot open data file error while running SSIS package through SQL Server job
• Problem using DBTIMESTAMP data type date field in a query.
• Cannot add Reference to a Visual Basic Script Task
• Persist ConnectionString
• Importing Excel 2007