Extracting Newspaper Data Using OCR

From: Adam Randall


Hello
Apart from researching avenues for refinancing, lately I have been looking into compiling my own statistics (rentals, sales) for Adelaide from newspapers.
I have a scanner and OCR software (optical character recognition, OmniPage 9.0), and have successfully scanned all the for-sale real estate pages from The Advertiser into the PC. However, converting them into useful data has proved a little trickier (i.e. grouping prices with suburb names, etc.). As for the rental pages, the text must be too small or too close together for the software to handle, as it comes out garbled (is that a real word?). I was wondering if anyone else has experimented with this, and whether they have had any success.
If I am able to do this, I could plot up-to-date price vs rental figures for each suburb on a map (a 1.22m x 2.44m street-level map of the Adelaide metropolitan area). This would be a great tool for indicating undervalued suburbs, and would remove the need to rely on second-hand statistics that have probably already been acted upon. Manual input of the data is not an option, as it would be a full-time job, and a comparison of every suburb is required for accurate analysis.
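For anyone wanting to try the grouping step (matching prices to suburb names), here is a rough Python sketch. It is only a starting point under some loud assumptions: each ad is taken to sit on one line of the OCR output and to start with the suburb name, prices are taken to look like "$185,000", and the suburb list, function names and sample layout are all made up for illustration. Real OCR output will be messier and will need extra clean-up.

```python
import re
from collections import defaultdict
from statistics import median

# Illustrative subset only; a real run would need the full suburb list.
KNOWN_SUBURBS = ["Glenelg North", "Glenelg", "North Adelaide", "Norwood"]

# Assumed price format: "$185,000" or "$95000" (hypothetical ad layout).
PRICE_PATTERN = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)")


def find_suburb(ad_line, suburbs=KNOWN_SUBURBS):
    """Return the known suburb the ad line starts with, or None."""
    upper = ad_line.strip().upper()
    # Try longer names first so "Glenelg North" wins over "Glenelg".
    for name in sorted(suburbs, key=len, reverse=True):
        if upper.startswith(name.upper()):
            return name
    return None


def group_prices_by_suburb(ocr_text, suburbs=KNOWN_SUBURBS):
    """Map each recognised suburb to the list of prices found in its ads."""
    prices = defaultdict(list)
    for line in ocr_text.splitlines():
        suburb = find_suburb(line, suburbs)
        price = PRICE_PATTERN.search(line)
        if suburb is None or price is None:
            continue  # skip ads with no known suburb or no stated price
        prices[suburb].append(int(price.group(1).replace(",", "")))
    return dict(prices)


def median_prices(grouped):
    """Reduce each suburb's price list to a single median figure."""
    return {suburb: median(values) for suburb, values in grouped.items()}
```

Feeding the OCR text of a page through group_prices_by_suburb and then median_prices gives one figure per suburb. Doing the same for the rental pages (if they can be scanned cleanly) would give the second number needed for a price vs rental comparison.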
Regards Adam
 
Reply: 1
From: Jay Hunter


Hi Adam,

Sounds like a big job.

If you could get the details you need from your newspaper's web site (or other web sites containing the information), you could write some software to parse the HTML and place the information into a formatted text file for importing into your spreadsheet or database.
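As a minimal sketch of that idea in Python, using only the standard library: the page layout here is entirely an assumption (each listing in a div with class "listing", holding spans with classes "suburb" and "price"), as are the class and function names. A real site will use different markup, so the tag and class checks would need adjusting to match.

```python
import csv
import io
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect suburb/price pairs from an assumed (hypothetical) page layout:
    <div class="listing"><span class="suburb">..</span><span class="price">..</span></div>
    Nested divs inside a listing are not handled.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # listing being read, if any
        self._field = None    # field name of the span being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "listing":
            self._current = {"suburb": "", "price": ""}
        elif tag == "span" and self._current is not None and css_class in self._current:
            self._field = css_class

    def handle_data(self, data):
        if self._current is not None and self._field is not None:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current is not None:
            self.rows.append(self._current)
            self._current = None


def listings_to_csv(html_text):
    """Return the listings found in html_text as CSV text with a header row."""
    parser = ListingParser()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["suburb", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return out.getvalue()
```

The text returned by listings_to_csv can be saved to a .csv file and opened directly in a spreadsheet. The csv module quotes any field containing a comma, so a price such as $185,000 stays in one column.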

This could make life a bit easier.

enjoy
Jay
 
Reply: 1.1
From: Adam Randall


Hi Jay
I have already thought of that and downloaded a shareware program called parseRat, which filters the text and places the output into an Excel spreadsheet (I am still working out how to use it).
I will go to The Advertiser's web site to see if they have a full electronic copy of their real estate section (that would be nice).
Regards Adam
 