Google SERP and Website Scraping with Python
While preparing content briefs, one of the analyses we do is to browse the SERP for our target keywords and look at the pages ranking on the first page. The headings used in these blog posts are especially important to us. Of course, to reach these articles, we normally have to search for each keyword, open the results one by one, and examine their headings.
One day, my co-worker Duygu Garip and I were talking about how much time this process takes, and we started to think about how we could make it easier through automation. Duygu pointed out that if we could pull this data from the sites automatically, our work would be much faster and we would save a lot of time.
We can list what we need to do here in two steps:
1- Searching for the relevant keywords on Google
2- Going to the URLs in the SERP and pulling the H headings from those pages.
At the end of our research, we achieved this with Python. Let's see how to write this code.
1- Install
Let's install the required libraries first:
pip install google-api-python-client
pip install selenium
pip install webdriver-manager
pip install pandas
pip install openpyxl
2- Import Process
Now we can start writing our code by opening a new Python file.
To avoid confusion later, let's do all of the imports at the beginning. As you write the code, you will see where each imported library is used.
from googleapiclient.discovery import build
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
3- Pulling Data from Google
First we need to pull data from Google. For this we need two things: an API Key and a CSE (Custom Search Engine) ID.
Once we have them, we can search Google with the following function and return the results.
my_api_key = 'Your API Key'
my_cse_id = "Your CSE ID"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res

result = google_search("Apple", my_api_key, my_cse_id)
print(result)
The return value here is in JSON format, and you can extract whatever information you need from it.
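For example, here is a minimal sketch of how the ranking URLs and page titles could be read out of that response (assuming the result dictionary returned by google_search above):

# A minimal sketch: read the ranking URLs and titles out of the JSON response.
# The 'items' key may be missing when a query returns no results, so check for it first.
if 'items' in result:
    for item in result['items']:
        print(item['title'], item['link'])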
But we want to search for more than one keyword, and we also want to visit the URLs on the results page and pull their H headings. At this point, we need to make some changes and additions.
4- Scraping Heading Tags
In the article I wrote about internal link opportunities, I talked about how to scrape the hrefs on a site using Python. Here we will use a similar code structure.
First, we will pull the URLs from the JSON data returned by the code above. Then we will open these URLs with the Selenium library, scrape the H headings we want, and finally export them to an Excel file.
The final version of our code looks like this:
from googleapiclient.discovery import build
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

my_api_key = 'Your API Key'
my_cse_id = "Your CSE ID"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res

KEYWORDS = ["tea", "coffee", "cola"]
df = pd.DataFrame(columns=['Keywords', 'URLs', 'Headings', 'Contents'])

for keyword in KEYWORDS:
    try:
        result = google_search(keyword, my_api_key, my_cse_id)
        if 'items' in result:
            URLS = [item['link'] for item in result['items']]
            for URL in URLS:
                options = webdriver.ChromeOptions()
                options.add_argument('--headless')
                driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
                driver.set_page_load_timeout(10)
                try:
                    driver.get(URL)
                    # Collect every heading level on the page, one row per heading
                    for level in range(1, 7):
                        elements = driver.find_elements(By.XPATH, f'//h{level}')
                        for header in elements:
                            df.loc[len(df.index)] = [keyword, URL, f"H{level}", header.text]
                except Exception:
                    # Pages that time out or fail to load are recorded as "No Data"
                    df.loc[len(df.index)] = [keyword, URL, "No Data", "No Data"]
                finally:
                    driver.quit()
    except Exception:
        print("No Data")

df.to_excel(r'/Users/gokaysevilmis/Downloads/Result.xlsx')
print("Done")
By exporting the heading tags into Excel, you can easily analyse the headings.
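For instance, here is a quick sketch of how the exported file could be analysed with pandas (assuming the same column names and file path used above):

import pandas as pd

# Read the exported file back and count how often each heading text appears
# across the ranking pages, broken down by heading level (H1-H6).
df = pd.read_excel(r'/Users/gokaysevilmis/Downloads/Result.xlsx')
summary = (
    df.groupby(['Headings', 'Contents'])
      .size()
      .reset_index(name='Count')
      .sort_values('Count', ascending=False)
)
print(summary.head(20))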
Thanks to Duygu Garip for her support in the making of this study.