Matching string in Mathematica
| Coleman, Mark 2007-11-20, 4:29 am |
| Greetings
I've got a large file with a list of names (a couple of million
records). I'd like to organize this data by individual name, so that I
can perform a variety of statistical analyses for any given name a user
may input. The problem, naturally, is that these names were drawn from a
database where the name field was free-form text. Thus the same
underlying name can have many different permutations of spelling,
punctuation, abbreviation, etc.
Given its powerful string-processing features, Mathematica v6 seems like it
might be a good choice for cleaning up this data. I'm wondering if anyone on
MathGroup has successfully used Mathematica in this context?
Thanks,
-Mark
Mark S. Coleman
Manager, Claims Analytics
Personal Market Claims
Liberty Mutual
175 Berkeley Street, Mail Stop 02J
Boston, MA 02116
Mark.Coleman@libertymutual.com
(617) 654-4572 SDN: 8-654-4572
| |
| congruentialuminaire@yahoo.com 2007-11-22, 4:40 am |
| Hello Mark:
Here is a brief "design sketch" outlining how I would approach this
problem...
Although the solution to this problem is related to string matching, I
would start with a different approach: cluster analysis. This can be
done with another function that is new in V6, namely FindClusters.
There are many possible choices for the DistanceFunction option, so an
exploratory approach tailored to your application would be needed.
Of course, you also need a measure to pick the "center" of
each cluster.
Then you can map each record to its respective center.
Then you can determine which string match/replacement methods to use
to "clean up" the data.
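The first three steps above could be sketched roughly as follows. This is only a minimal illustration under my own assumptions, not a tested recipe for the real data: the sample names are made up, EditDistance is just one candidate for the DistanceFunction, and taking the medoid (the member closest in total to all the others) is just one possible way to define the "center". The names medoid, rules and cleaned are hypothetical helpers.

```mathematica
(* Hypothetical sample; real data would be the names read from the file. *)
names = {"john smith", "jon smith", "john smyth",
         "mary jones", "mary jone", "marie jones"};

(* Step 1: cluster on edit distance. With no count given, FindClusters
   chooses the number of clusters itself. *)
clusters = FindClusters[names, DistanceFunction -> EditDistance];

(* Step 2: one possible "center" is the medoid, i.e. the member with the
   smallest total edit distance to all members of its cluster. *)
medoid[c_List] := c[[First[Ordering[Total /@ Outer[EditDistance, c, c], 1]]]];

(* Step 3: map every record to the center of its cluster. *)
rules = Dispatch[Flatten[Function[c, Thread[c -> medoid[c]]] /@ clusters]];
cleaned = names /. rules
```

The medoid step compares every pair of names within a cluster, so its cost grows quadratically with cluster size. With a couple of million records it would probably make sense to cluster only the distinct names (Union[names]) and perhaps to work in smaller blocks; how the data is best split is something that would have to be tried out.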
Finally, my impression is that lots of people want to keep some
"misspellings" (e.g. Kirsten vs. Kristen).
HTH.
Regards..Roger W.
| |
Hi Mark,
It is hard to advise without more specific information. But note that
Mathematica supports regular expressions, which are a pretty powerful tool.
You may use them either in the standard form, as RegularExpression[...], or
in the string pattern form.
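As a small illustration of the two forms, here is a sketch of a name normalizer. The function name normalizeName and the particular cleanup rules (lower-casing, dropping punctuation, collapsing spaces) are only assumptions about what "cleaning up" might mean for this data.

```mathematica
normalizeName[s_String] := Module[{t},
  t = ToLowerCase[s];
  (* Regular-expression form: delete everything except a-z and whitespace. *)
  t = StringReplace[t, RegularExpression["[^a-z\\s]"] -> ""];
  (* String-pattern form: collapse each run of whitespace to one space. *)
  t = StringReplace[t, Whitespace -> " "];
  (* String-pattern form: drop a leading or trailing space. *)
  StringReplace[t, {StartOfString ~~ " " -> "", " " ~~ EndOfString -> ""}]
];

normalizeName["  SMITH,   John  Q. "]
(* "smith john q" *)
```

Note that the character class [^a-z\s] also deletes accented letters and digits, so it would need to be widened if such characters occur in the names.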
Hope this helps, Daniel