I do a lot of data mining, especially web scraping, but sometimes have trouble scraping the info I need from the site I’m looking at and have to resort to collecting the data manually. Of course, I depend a lot on Excel to clean up the data I’ve collected. Since I’ve started learning Visual Basic, I’ve been able to write a couple small scripts to help me with my data cleanup.
Here are a couple small modules that I’ve used to clean up various kinds of datasets.
If you’re not familiar with working with Visual Basic or macros, you’re going to need to enable the Developer tab on the ribbon:
- click on the Office button on the top left side of the screen
- select ‘Excel Options’
- select ‘Popular’ on the left hand side (it’s usually the default)
- tick the ‘Show Developer tab in the Ribbon’ box
- select ‘OK’
After adding the Developer tab, you can install a module using the following steps:
- under the developer tab on the ribbon, select ‘Visual Basic’
- in the Visual Basic screen, select the Sheet you are working in (usually Sheet 1)
- right click on the sheet and choose Insert -> Module
- write/copy/paste the module code
- run the code by clicking the green arrow or selecting Run -> Run Sub/User Form from the menu
- go back to your Excel sheet by clicking on the green Excel button on the left
Handy modules for data cleaning
get URLs from a list of hyperlinks
Let’s look at https://www.reesbroome.com/our-professionals for an example. I want a link for every individual on this page. I could collect that info manually by right clicking -> ‘Copy Link Address -> pasting to my list, but that can be time consuming. I could also also copy the page source code (right click -> Inspect or Page Source Code), but again, picking the URLs out of a whole page of code is inefficient.
Luckily, we can get the url for each link with a few lines of code:
Sub ExtractHL() Dim HL As Hyperlink For Each HL In ActiveSheet.Hyperlinks HL.Range.Offset(0, 1).Value = HL.Address Next End Sub
So what is going on here? Well, ‘Sub’ and ‘End Sub’ are wrapping a procedure called ‘ExtractHL’. The middle four lines are stating the steps to perform when the procedure is called. We’re only using this single module now, but if we were writing a bigger program, we can run these steps again by calling ExtractHL().
‘Dim HL As Hyperlink’ is naming a variable ‘HL’. The ‘HL’ name is arbitrary, you could declare ‘Dim X As Hyperlink’ and it would be the same thing. More importantly, the code is also declaring that the type of ‘HL’ is a Hyperlink. Declaring the type of a variable lets Visual Basic know what kind of code can be performed for that variable.
‘For Each HL In ActiveSheet.Hyperlinks’ is fairly self-explanatory. You are telling VB that you want to perform the following action on all hyperlinks in your current sheet.
‘HL.Range.Offset(0, 1).Value = HL.Address’ telling VB where to put it! ‘HL.Address’ means the URL contained in the Hyperlink. The .Range.Offset(0,1).Value is telling it to put the URL in the cell right next to the link. Range.Offset can be defined in columns, rows, or columns as rows and is noted like (RowOffset, ColumnOffset). This, obviously, is something you can define yourself. If I wanted to put the data below the hyperlink I’d write ‘HL.Range.Offset(1,0).Value’.
‘Next’ is simply iterating the procedure until there are no links left to extract the address from.
Here, I used .Range.Offset(0, 2) to put my data in column C. Easy!
delete all pics or objects from a sheet
Now, I want the information from this page, but just the names, titles, and phone numbers. I don’t need any images. Using what you learned, can you write a module that will delete all the images from a sheet?
Sub DeleteAllPics() Dim Pic As Object For Each Pic In ActiveSheet.Pictures Pic.Delete Next Pic End Sub
And there you go!