Patient matching is one of the main barriers to achieving interoperability in healthcare. Mismatched patient records and the inability to retrieve a patient's history can severely hinder clinical decision-making and lead to missed diagnoses or delayed treatment. In addition, healthcare providers often expend significant effort deduplicating patient data, especially as the number of patient records in their databases grows rapidly. Electronic health records (EHRs) have dramatically improved patient safety and care coordination in recent years, but accurate patient matching remains a challenge for many healthcare organizations.
Duplicate patient records can arise for a variety of reasons, including insertion, deletion, substitution, or transposition errors in human-generated records. Optical character recognition (OCR) software used to digitize patient records can also introduce errors.
We can employ a variety of record matching algorithms to solve this problem. They include basic deterministic methods (such as grouping and comparing key fields like SSN, name, or date of birth), phonetic encoding systems, and more advanced algorithms that use machine learning (ML).
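To make the contrast concrete, here is a minimal sketch of a deterministic matching rule in Python. The field names (ssn, first_name, last_name, date_of_birth) are illustrative only and are not the schema of the dataset used later in this article.

# Minimal deterministic matching: two records are the same patient only if
# all key fields agree exactly after light normalization.
def normalize(value):
    return (value or "").strip().lower()

def deterministic_match(rec_a, rec_b, keys=("ssn", "last_name", "first_name", "date_of_birth")):
    return all(normalize(rec_a.get(k)) == normalize(rec_b.get(k)) for k in keys)

a = {"ssn": "123-45-6789", "last_name": "Smith", "first_name": "John", "date_of_birth": "1970-01-01"}
b = {"ssn": "123-45-6789", "last_name": "smith ", "first_name": "John", "date_of_birth": "1970-01-01"}
print(deterministic_match(a, b))  # True: exact match after normalization
c = dict(b, last_name="Smyth")    # a single typo defeats the deterministic rule
print(deterministic_match(a, c))  # False

Rules like this break down as soon as a field contains a typo or is missing, which is exactly the gap that phonetic encodings and ML-based approaches such as FindMatches are meant to fill.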
Amazon Lake Formation is a HIPAA compliant service that helps you build a secure data lake in a few simple steps. Lake Formation also has built-in FindMatches, an ML transformation feature that lets you match records across different datasets, and identify and remove duplicate records with little or no human intervention.
This article will show you how to use the FindMatches ML transformation to identify matching patient records in a synthetically generated dataset. To use FindMatches, you don't need to write code or understand how ML works. FindMatches can find matches in your data even when the records have no reliable unique personal identifier and even when fields don't match exactly.
Generating the patient dataset
Due to its sensitive nature, patient data is governed by different regulations in different countries. As a result, patient data for training matching algorithms is often scarce, which further complicates model development. A common way to work around this challenge is to use synthetically generated data. This article generates patient data with the open source Freely Extensible Biomedical Record Linkage Program (FEBRL). FEBRL employs a Hidden Markov Model (HMM) to prepare name and address data for patient record matching, and it can also simulate real-life patient datasets containing duplicates with the following types of mismatches:
1. Blank field.
2. Typographical errors, such as spelling mistakes, character transposition or field swapping, etc.
3. Middle names abbreviated in some records and spelled out in full in others.
4. Mailing addresses in different formats.
5. Errors related to OCR.
6. Phonetic errors.
7. No globally unique patient or personal identifier. Each healthcare provider may assign its own patient identifier to the same person, but that identifier is not a personal identifier like an SSN, so the datasets share no common key.
FEBRL can generate such datasets based on configurable parameters that vary the likelihood of each error, covering the various scenarios that lead to duplication. The generation of synthetic datasets is beyond the scope of this article, so a pre-generated dataset is provided for you to explore.
In a nutshell, here are the steps to generate the synthetic dataset used to run FindMatches:
1. Download and install FEBRL.
2. Modify the parameters to create a dataset that simulates your expectations. For more information, see the FEBRL dataset generation instructions:
https://github.com/J535D165/FEBRL-fork-v0.4.2/tree/master/dsgen
3. Clean the dataset (this ensures that every record has the same schema and removes single quotes and family roles); a minimal cleaning sketch follows this list.
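The exact cleanup depends on how you configure FEBRL; the following is a minimal, hypothetical sketch of step 3. The column count and the set of family-role tokens are assumptions for illustration, not the actual FEBRL output schema.

import csv

EXPECTED_COLUMNS = 11                                  # assumed column count for the generated file
FAMILY_ROLES = {"husband", "wife", "son", "daughter"}  # assumed family-role tokens to drop

with open("febrl_raw.csv", newline="") as src, open("patients_clean.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        if len(row) != EXPECTED_COLUMNS:   # keep only rows that conform to the expected schema
            continue
        cleaned = [field.replace("'", "").strip() for field in row]          # strip single quotes
        cleaned = ["" if f.lower() in FAMILY_ROLES else f for f in cleaned]  # blank out family roles
        writer.writerow(cleaned)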
The Amazon region used for this dataset is US East (N. Virginia).
FEBRL Patient Data Structure
The following table shows the structure of FEBRL patient data. This data contains 40,000 records.
Original records and duplicate records are grouped together, and the patient_id column follows a specific format:

rec-<record number>-org/dup-<duplicate record number>

That is, an original record carries the -org suffix, and each of its duplicates carries a -dup-<duplicate record number> suffix.
The table below is a preview of what you're trying to achieve with the FindMatches ML transformation. After the dataset is matched, the resulting table reflects the structure and data of the input table, with a match_id column added. Matching records share the same match_id value. False positives and false negatives are still possible, but the benefit of the transformation is clear.
Prerequisites
The example synthetic patient dataset used in this article is in the US East (N. Virginia) Region, so all steps mentioned in this article must be performed in that same Amazon region (i.e., us-east-1). If your data is in a different region, however, you can easily modify the steps accordingly.
Solution Architecture
The following diagram shows the architecture of the solution.
Solution Overview
At a macro level, the matching process includes the following steps:
1. Upload the original patient dataset in CSV format to an Amazon S3 bucket.
2. Crawl the uploaded patient dataset with an Amazon Glue crawler (a scripted alternative for steps 1 and 2 is sketched after this list).
3. Catalog your patient data with the Amazon Glue Data Catalog and create a FindMatches ML transformation.
4. Create a label set, either through the ML transformation or manually, and provide labeled examples of matching and non-matching records to train FindMatches. Upload your labels and estimate the quality of the predictions. Add more label sets and repeat this step as required until you reach the desired accuracy, precision, and recall.
5. Create and run an Amazon Glue ETL job that uses your FindMatches transformation.
6. Store the results of the FindMatches transformation in an Amazon S3 bucket.
7. Create an Amazon Glue Data Catalog table for the FindMatches ML transformation results.
8. Use Amazon Athena to examine the transformation results.
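The walkthrough below uses the console, but the first two steps can also be scripted. The following boto3 sketch assumes a bucket name and crawler name of your own choosing; they are placeholders, not resources created by the CloudFormation stack used later in this article.

import boto3

BUCKET = "my-patient-data-bucket"    # placeholder bucket name
CRAWLER = "patient-dataset-crawler"  # placeholder crawler name (must already exist)

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Step 1: upload the raw patient dataset in CSV format to an S3 bucket.
s3.upload_file("patients_clean.csv", BUCKET, "patients/patients_clean.csv")

# Step 2: run a Glue crawler over the uploaded data to catalog it.
glue.start_crawler(Name=CRAWLER)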
Catalog your data and create a FindMatches ML transformation
FindMatches operates on tables defined within the Amazon Glue Data Catalog. Use an Amazon Glue crawler to discover and catalog the patient data. You can use the FEBRL patient data generated for this article via the CloudFormation stack provided below, which must be launched in us-east-1 (US East (N. Virginia)).

To create the catalog and the FindMatches ML transformation in Amazon Glue, launch the following stack:
This stack creates the following resources:
1. An Amazon S3 bucket to store the ML transformation results (configurable as part of the launch). You can find the bucket name under Outputs in the Amazon CloudFormation console; this article refers to it as S3BucketName.
2. An IAM role that allows Amazon Glue to access other services, including S3.
3. An Amazon Glue database (configurable as part of the launch).
4. An Amazon Glue table (configurable as part of the launch) that points to the raw synthetic patient dataset.
5. An Amazon Glue ML transformation with your Amazon Glue table as the source, accuracy set to 1, and precision set to 0.9.
For more information, see Integrating Datasets and Deduplication Using Amazon Lake Formation FindMatches:
ML Transformation Tuning
The safety risk of a false-positive match (i.e., misleading clinicians into believing that incorrect information about a patient is accurate) may be greater than the safety risk of a false-negative match (i.e., clinicians failing to access existing information about a patient); for more information, see the related study on the NCBI website. Therefore, moving the recall vs. precision slider toward the precision side increases confidence that records identified as the same patient really do match, and minimizes the safety risk of false-positive matches.
A higher accuracy setting helps improve recall, but at the cost of a longer runtime (and higher cost), because more records must be compared.
To obtain relatively better results for this particular dataset, the stack creates the transformation with the recall vs. precision slider set to 0.9, biased toward the precision side, and the lower cost vs. accuracy slider set to accuracy. If necessary, you can adjust these values later by selecting the transformation and choosing Tune.
Train the transformation with labeled data
After successfully launching the stack, you can train the transformation by providing it with a set of labeled matching and non-matching records.
Create a label set
You can create your own label sets, or allow Amazon Glue to generate label sets based on heuristics.
Amazon Glue pulls records from your source data and suggests potential matching records. The resulting label set file contains about 100 data samples for you to work with.
This article provides an Amazon Glue-generated label data file with the label column already populated, which you can use at any time.
If you choose to use the pre-generated label data file provided in this article, skip the steps below for generating a label file.
To create a training set, perform the following steps:
1. In the Amazon Glue console, under ETL, ML Transforms, you will see the ML transformation named cfn-findmatches-ml-transform-demo that the stack created for you.
2. Select the ML transformation cfn-findmatches-ml-transform-demo, then choose Action and select Teach transform.
3. For teaching the transformation using labels, choose I do not have labels.
4. Choose Generate labeling file.
5. Provide the S3 path where the generated labeling file will be stored.
6. Choose Next.
The following table shows the generated label data file, where the label column is empty.
To populate the label column, you need to mark records that actually match with the same label value. Each label set should contain both positive and negative matching examples.
This article provides a label data file with the label column fully populated. You can use this file at any time.
The table below shows the table with the label column fully populated.
The labeling file has the same schema as the input data, plus two extra columns: labeling_set_id and label.
The training dataset can be divided into multiple label sets, each identified by its labeling_set_id value. This identification simplifies the labeling process, letting you focus on the match relationships of records within the same label set rather than scanning the entire file. For the dataset above, the label values were derived from the -org and -dup suffixes of patient_id, but in general you should assign labels based on which records ought to match given their attribute values.
If you assign the same label to two or more records within a label set, you are teaching the FindMatches transformation to treat those records as a match. Conversely, when two or more records within the same label set have different labels, FindMatches learns not to treat those records as a match. The transformation only evaluates record relationships within the same label set; it does not evaluate relationships across different label sets.
You should label hundreds of records for proper match quality, and thousands of records for better match quality.
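Because this synthetic dataset encodes ground truth in patient_id, one way to populate a label file programmatically is sketched below. The labeling_set_id and label columns come from the label file described above; the regular expression and file names are assumptions for illustration.

import csv
import re

# FEBRL-style IDs look like rec-123-org or rec-123-dup-0; the shared record
# number identifies records that truly belong to the same patient.
ID_PATTERN = re.compile(r"^rec-(\d+)-(org|dup)(?:-\d+)?$")

def ground_truth_label(patient_id):
    match = ID_PATTERN.match(patient_id or "")
    return match.group(1) if match else ""

with open("label_file_unlabeled.csv", newline="") as src, open("label_file.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Records that share a label value within the same labeling_set_id are taught as matches.
        row["label"] = ground_truth_label(row["patient_id"])
        writer.writerow(row)

On real data there is no such ground-truth column, so you would assign labels by reviewing the records in each label set by hand.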
Upload your labels and check match quality
After you create the labeled dataset (in .csv format), teach FindMatches where to find it. Follow these steps:
1. On the Amazon Glue console, select the transformation you created earlier.
2. Choose Action.
3. Choose Teach transform.
4. For uploading labels, choose I have labels.
5. Choose to upload the labeling file from S3.
6. Choose Next.
7. If you would like to use the label set provided in this blog post, download the label set here:
8. In the same S3 bucket created by the launched CloudFormation template above, create a folder named training set.
9. Upload the label set from step 7 to the training set folder in the same S3 bucket.
10. Choose Overwrite my existing labels, because you are using only a single label set. To add labels iteratively, choose the Append to my existing labels option instead.
11. Choose Upload. After the labels are uploaded, the transformation is ready to use. Although it is not strictly required, it is good practice to check the quality of the transformation's matching by examining matching and non-matching records.
12. Choose Estimate transform quality. The quality estimate uses 70% of your labels for learning. After training completes, it tests the transformation's learned ability to identify matching records against the remaining 30%. Finally, the transformation compares the matches and non-matches predicted by the algorithm with your actual labels to generate quality metrics. This process can take several minutes.
Your results should look similar to the screenshot below.
Treat these metrics as approximations, because the test uses only a small subset of the data to estimate match quality. When you are satisfied with the metrics, go ahead and create and run a record-matching job. Alternatively, upload more labeled records to further improve match quality.
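As a rough illustration of what the reported metrics mean, the snippet below computes precision, recall, and F1 from a small set of hypothetical pairwise predictions; this is only a conceptual sketch, not how Amazon Glue computes its estimate internally.

# Hypothetical pairwise labels: True means "these two records are the same patient".
predicted = [True, True, False, True, False, False]
actual    = [True, False, False, True, True, False]

tp = sum(p and a for p, a in zip(predicted, actual))       # correctly linked pairs
fp = sum(p and not a for p, a in zip(predicted, actual))   # pairs linked by mistake (the risky case here)
fn = sum(a and not p for p, a in zip(predicted, actual))   # true matches that were missed

precision = tp / (tp + fp)   # of the pairs we linked, how many were right
recall = tp / (tp + fn)      # of the true matches, how many we found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")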
Create and run an Amazon Glue ETL job
After you create the FindMatches transformation and verify that it has learned to identify matching records in your data, you can identify matches in your full dataset. To create and run a record matching job, perform the following steps:
1. Create a folder named transformresults inside the S3 bucket created by the Amazon CloudFormation template when you launched the stack. This folder will store the results of your Amazon Glue job's ML transformation.
2. In the Amazon Glue console, under Jobs, choose Add job.
3. Under Configure the job properties, for Name, enter a name for the job.
4. For IAM role, choose from the drop-down menu the AWSGlueServiceRoleMLTransform role created by the Amazon CloudFormation stack. For more information, see Creating an IAM Role for Amazon Glue:
https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html?icmpid=docs_glue_console
5. For Type, select Spark, and for Glue version, select Spark 2.2, Python 2 (Glue version 0.9).
6. Select the option to have Amazon Glue generate a proposed script for the job to run.
7. For the data source, select the data source for the transformation. This article uses the data source cfn_table_patient.
8. For the transform type, select Find matching records.
9. For Worker type, select G.2X.
10. For Number of workers, enter 10. You can enter a higher number to add more workers, depending on the size of your dataset.
11. To review the records identified as duplicates, do not select Remove duplicate records.
12. Choose Next.
13. Select the transformation you created.
14. Choose Next.
15. For the data target, select the option to create tables in your data target.
16. For Data store, select Amazon S3.
17. For Format, select CSV.
18. For Compression type, select None.
19. For Target path, choose the path for the job output: the S3 bucket created by Amazon CloudFormation plus the transformresults folder you created earlier.
20. Choose Save job and edit script.
The script for your job is now generated and ready to use. Alternatively, you can further customize the script to meet your specific ETL needs.
21. To start identifying matches in this dataset, choose Run job, as shown in the screenshot below. For now, leave the job parameters at their default settings, and close this page after starting the job. The screenshot below shows the proposed Python Spark script generated by Amazon Glue, including the ML transformation.
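If the screenshot is not available, the generated script typically has the shape sketched below. The transform ID and bucket name are placeholders you must replace with your own values, and the boilerplate Amazon Glue emits for your job may differ slightly.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglueml.transforms import FindMatches  # ML transforms live in the awsglueml package

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the cataloged patient table created by the CloudFormation stack.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="cfn-database-patient",
    table_name="cfn_table_patient",
    transformation_ctx="datasource0",
)

# Apply the FindMatches ML transform; replace the transform ID with your own.
findmatches1 = FindMatches.apply(
    frame=datasource0,
    transformId="tfm-0123456789abcdef",  # placeholder
    transformation_ctx="findmatches1",
)

# Write the matched records, including the added match_id column, back to S3 as CSV.
glueContext.write_dynamic_frame.from_options(
    frame=findmatches1,
    connection_type="s3",
    connection_options={"path": "s3://<your-S3BucketName>/transformresults"},
    format="csv",
    transformation_ctx="datasink2",
)
job.commit()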
If the execution is successful, FindMatches displays a run status of Succeeded. Execution may take several minutes to complete.
FindMatches saves your output data as partitioned .csv files in the destination path you specified during job creation. The resulting .csv files reflect the structure and data of the input table, with a match_id column added. Matching records share the same match_id value.
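As a quick sanity check outside of Athena, you could also download the output locally and group it by match_id. The sketch below assumes the partitioned .csv files (with a header row) have been copied into a local output/ directory and that pandas is installed.

import glob
import pandas as pd

# Load every partitioned CSV file downloaded from the transformresults folder.
frames = [pd.read_csv(path, dtype=str) for path in glob.glob("output/*.csv")]
results = pd.concat(frames, ignore_index=True)

# Clusters with more than one record are the groups FindMatches considers the same patient.
cluster_sizes = results.groupby("match_id").size()
duplicates = cluster_sizes[cluster_sizes > 1]
print(duplicates.sort_values(ascending=False).head(10))

# Inspect one matched cluster in full to review the underlying records.
print(results[results["match_id"] == duplicates.index[0]])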
Create a Data Catalog table for your transformation results
To view the output, you can download the partitioned .csv files directly from the S3 bucket and open them in an editor, or you can use Athena to run SQL-like queries on the data stored in S3. To view the data through Athena, you need to crawl the folder with the partitioned .csv files that you created as part of the FindMatches ETL job output.
Go to Amazon Glue and use an Amazon Glue crawler to create a new table for the matched patient records in your existing database, using as source data the S3 folder that contains the partitioned .csv files from your FindMatches ETL job output. In this article, the source data is the transformresults folder in the bucket created by the Amazon CloudFormation stack.
To create a crawler, perform the following steps:
1. In the Amazon Glue console, under Data catalog, choose Crawlers.
2. Choose Add crawler to create a new crawler that crawls the transformation results.
3. Provide a name for the crawler and choose Next.
4. Select Data stores as the crawler source type and choose Next.
5. In the Add a data store section, for the data store, select S3.
6. For the crawl data location, select the specified path in my account.
7. For Include path, enter your path: the S3 bucket created earlier by CloudFormation plus the transformresults folder. Verify that the partitioned .csv files are present in the folder.
8. Choose Next.
9. In the IAM role section, choose to create an IAM role.
10. For the IAM role, enter a name (for example, the name of the crawler).
11. Choose Next.
12. For Frequency, select Run on demand.
13. Configure the crawler's output by setting the database to cfn-database-patient.
14. Set the prefix added to tables to table_results_. This prefix helps identify the table that contains the transformation results.
15. Choose Finish.
16. Select the crawler and choose Run crawler. After the crawler runs successfully, you should see a new table created in the database you selected during crawler configuration.
17. In the Amazon Glue console, under Databases, choose Tables.
18. Select the results table, then choose Action.
19. Choose Edit table details.
20. For the Serde serialization library, enter org.apache.hadoop.hive.serde2.OpenCSVSerde
21. Under Serde parameters, add the key escapeChar with the value \ (backslash).
22. Add the key quoteChar with the value " (a double quotation mark).
23. Set the key field.delim to the value , (a comma).
24. Add the key separatorChar with the value , (a comma).
You can set the Serde parameters as required based on the type of dataset you have.
25. Edit the schema of the table by setting the data type of all columns to String. To edit a table's schema, click the table and then click the Edit Schema button.
You can also choose to retain the data types inferred by the crawler, depending on your requirements. For simplicity, this article sets all columns to the String data type, except for the match_id column, which is set to bigint.
Using Amazon Athena to check the output
To examine the output using Amazon Athena, perform the following steps:
1. In the Data Catalog, choose Tables.
2. Choose the results table created by your crawler.
3. Choose Action.
4. Choose View data.
The Athena console opens. If you are running Amazon Athena for the first time, you may have to choose Get Started. Before running a query for the first time, you also need to set the query result location in Amazon S3: in the Amazon Athena console, choose the prompt to set up a query result location in Amazon S3, and then set the location. You can create an additional folder for this in the same Amazon S3 bucket created earlier by CloudFormation. Make sure the S3 path ends with a /.
5. Select the appropriate database. For this article, choose cfn-database-patient. If you don't see the database in the drop-down menu, you may need to refresh the data source.
6. Select the results table that contains the FindMatches output, with the patient records and the match_id column. In this case, it is table_results_transformresults. If you chose a different name for the results table, change the query below to reflect the correct table name.
7. Select Run Query to run the following query.
SELECT * FROM "cfn-database-patient"."table_results_transformresults" order by match_id;
The screenshot below shows your output.
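If you prefer to run the query programmatically instead of in the console, a minimal boto3 sketch is shown below; the athena-results/ output folder is a placeholder you would create in the bucket, as described above.

import boto3

athena = boto3.client("athena")

# Start the same query; Athena writes the results to the configured S3 location.
response = athena.start_query_execution(
    QueryString='SELECT * FROM "cfn-database-patient"."table_results_transformresults" ORDER BY match_id',
    QueryExecutionContext={"Database": "cfn-database-patient"},
    ResultConfiguration={"OutputLocation": "s3://<your-S3BucketName>/athena-results/"},
)
print(response["QueryExecutionId"])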
Security considerations
Amazon Lake Formation helps protect your data by giving you a central location where you can configure granular data access policies, regardless of which service is used to access the data.
To centralize data access policy control using Lake Formation, first turn off direct access to your bucket in S3 so that Lake Formation manages all data access. Configure data protection and access policies through Lake Formation, which enforces such policies across all Amazon services that access data in your data lake. You can configure users and roles, and define what data those roles can access, down to the table and column level.
Amazon Lake Formation provides a permission model based on a simple grant/revoke mechanism. Lake Formation permissions combine with IAM permissions to control permissions to access the data stored in the database and the metadata describing that data. For more information, see Metadata and Data Security and Access Control in Lake Formation:
https://docs.aws.amazon.com/lake-formation/latest/dg/security-data-access.html
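As a small illustration of the grant/revoke model, the boto3 sketch below grants SELECT on the results table to a hypothetical analyst role. The role ARN is a placeholder and this grant is not part of the article's CloudFormation stack; column-level grants would use the TableWithColumns resource instead.

import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on the matched-results table to a hypothetical analyst role.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},  # placeholder ARN
    Resource={
        "Table": {
            "DatabaseName": "cfn-database-patient",
            "Name": "table_results_transformresults",
        }
    },
    Permissions=["SELECT"],
)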
Lake Formation currently supports server-side encryption (SSE-S3, AES-256) on S3. Lake Formation also supports private endpoints in your VPC and logs all activity in Amazon CloudTrail, giving you network isolation and auditability.
The Amazon Lake Formation service is a HIPAA compliant service.
Summary
This article demonstrates how to use the Lake Formation FindMatches ML transformation to find matching records in a patient database. It lets you find matches when records in two datasets do not share the same identifier or contain duplicate data. This method helps you find matches between dataset rows if fields do not match exactly, or if attributes are missing or corrupted.
You are now ready to start building with Lake Formation and try FindMatches on your data.
References
Amazon Lake Formation:
https://amazonaws-china.com/lake-formation/
Open Source Freely Extensible Biomedical Record Linkage Program (FEBRL):
https://github.com/J535D165/FEBRL-fork-v0.4.2
Amazon Glue:
https://amazonaws-china.com/glue
Amazon Athena:
http://amazonaws-china.com/athena
Amazon S3:
Amazon CloudFormation:
http://amazonaws-china.com/cloudformation
Label data file generated by Amazon Glue:
Pre-generated label data files:
Label data file:
Amazon CloudTrail:
http://amazonaws-china.com/cloudtrail
HIPAA Compliant:
https://amazonaws-china.com/compliance/hipaa-eligible-services-reference/
About the authors
Dhawalkumar Patel
Amazon Cloud Technologies Senior Solutions Architect
He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He currently focuses on machine learning and serverless technologies.
Ujjwal Ratan
Chief Machine Learning Specialist Solution Architect, Global Health Care and Life Sciences Team, Amazon Cloud Technologies
His main focus is applying machine learning and deep learning to solve a variety of real-world industry problems, such as medical imaging, unstructured clinical text, genomics, precision medicine, clinical trials, and quality-of-care optimization, to name a few. He specializes in scaling machine learning/deep learning algorithms on the Amazon Cloud Technology Cloud to speed up training and inference. In his spare time, he enjoys listening to (and playing) music and going on impromptu road trips with his family.