While reading a post by Peter Norvig (see the post) titled "How to Write a Spelling Corrector", I got the idea of making one myself. In that post, Peter Norvig explains the theory behind how a spelling corrector works.
His program was only 24 lines long and achieved 80%-85% accuracy. After reading the theory (most of which went over my head), I decided to write another program for the same task that is very simple to understand. Peter Norvig's program is very good but not beginner friendly; those (like me) who do not know about sets, collections, or the re module cannot follow what it does. I used the same theory (as much as I understood) but in my own way. The main idea I took from the post is to modify the word in every possible way and then check each variant for being an English word. Among the variants that are English words, the most probable one is chosen.
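To make that concrete, here is a minimal sketch of the "modify the word in every possible way" step, written with plain loops and string slicing only (no sets, collections, or re). The names edits1 and LETTERS are mine for illustration, not necessarily the ones in my program:

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    # Every string one edit (delete, transpose, replace, insert) away from word.
    variants = []
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        if right:
            variants.append(left + right[1:])                        # delete right[0]
        if len(right) > 1:
            variants.append(left + right[1] + right[0] + right[2:])  # transpose first two
        for ch in LETTERS:
            if right:
                variants.append(left + ch + right[1:])               # replace right[0]
            variants.append(left + ch + right)                       # insert ch at position i
    return variants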
For finding the most probable word, Peter Norvig made a file named big.txt, a concatenation of several Sherlock Holmes stories and other books and novels, containing a couple of million words. The file therefore covers the most commonly used English words.
The file was used to build a dictionary that stores the number of occurrences of each word.
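A dictionary like that can be built with nothing more than a plain dict and string methods. The sketch below assumes big.txt sits in the working directory; build_counts is a name I made up for illustration:

def build_counts(path="big.txt"):
    # Count how often each word occurs in the text file.
    counts = {}
    with open(path, errors="ignore") as f:
        for line in f:
            for raw in line.lower().split():
                word = raw.strip(".,;:!?\"'()[]")  # crude punctuation trimming
                if word.isalpha():
                    counts[word] = counts.get(word, 0) + 1
    return counts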
big.txt weighs 6.xx MB, so to keep the size down I made a file in which I stored only the dictionary; that file is about 453.1 KB.
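I have not shown the exact format of my dictionary file here, but one simple way to store only the word counts is one "word count" pair per line. The sketch below is an assumption about the format, not necessarily what my program uses:

def save_counts(counts, path="dictionary.txt"):
    # Write one "word count" pair per line.
    with open(path, "w") as f:
        for word, n in counts.items():
            f.write("%s %d\n" % (word, n))

def load_counts(path="dictionary.txt"):
    # Read the pairs back into a plain dict.
    counts = {}
    with open(path) as f:
        for line in f:
            word, n = line.split()
            counts[word] = int(n)
    return counts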
Then the candidate with the highest count in the dictionary is chosen and returned.
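Putting the pieces together, the correction step could look like this sketch, which reuses edits1 and the counts dictionary from above (correct is again my own name):

def correct(word, counts):
    word = word.lower()
    if word in counts:
        return word                          # already a known word
    best, best_count = word, 0
    for candidate in edits1(word):
        n = counts.get(candidate, 0)
        if n > best_count:
            best, best_count = candidate, n
    return best                              # unchanged if nothing matched

# Example: correct("speling", build_counts()) should return "spelling".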
That's it, but it's not as easy as it sounds: despite 61 lines of code, my program achieves only about 80% accuracy, which is enough for toying around but not that good.
The program can correct 345,230 words!
Download the program and text file here: Datafilehost
Note: Keep both the files in the above archive in the same folder for the program to work.
See the program in action: