Java: Check if a web request is from the Google crawler
If a web request comes from the Google crawler (Googlebot), the request's “user agent” header should look similar to this :
    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
or (rarely used) :
    Googlebot/2.1 (+http://www.google.com/bot.html)
Source : Google crawlers
1. Java Example
In Java, you can get the “user agent” from HttpServletRequest.
    @Autowired
    private HttpServletRequest request;

    //...
    String userAgent = request.getHeader("user-agent");
    System.out.println("User Agent : " + userAgent);

    if (!StringUtils.isEmpty(userAgent)) {
        // case-insensitive match on the crawler token
        if (userAgent.toLowerCase().contains("googlebot")) {
            System.out.println("This is Google bot");
        } else {
            System.out.println("Not from Google");
        }
    }
The above solution works for genuine crawlers, but it cannot detect a fake or spoofed user agent.
2. Fake User Agent
It’s easy to create a fake/spoof user agent request. For example :
    package com.mkyong.web;

    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.HttpClientBuilder;

    public class test {

        public static void main(String[] args) throws Exception {

            HttpClient client = HttpClientBuilder.create().build();

            // HttpGet needs an absolute URI, including the scheme
            HttpGet request = new HttpGet("http://abcdefg.com");

            // any client can send whatever user agent it likes
            request.setHeader("user-agent", "fake googlebot");
            HttpResponse response = client.execute(request);
        }
    }
Output at abcdefg.com.
    User Agent : fake googlebot
    This is Google bot
3. Verifying Googlebot
To verify the real Googlebot, you can run a “reverse DNS lookup” on the request IP manually, followed by a forward lookup on the returned host name, which should resolve back to the same IP :
    > host 66.249.66.1
    1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

    > host crawl-66-249-66-1.googlebot.com
    crawl-66-249-66-1.googlebot.com has address 66.249.66.1
Source : Verifying Googlebot
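The two lookups above can also be done from Java without an external command. The following is only a minimal sketch under stated assumptions: the class name GooglebotDnsCheck and the two lookup interfaces are hypothetical names introduced here so the DNS calls can be replaced in tests, and the default lookups rely on java.net.InetAddress. It assumes IPv4 addresses in dotted form and accepts only host names under googlebot.com, the domain shown in the output above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class GooglebotDnsCheck {

    // Hypothetical seams so the DNS calls can be replaced in tests.
    interface ReverseLookup {
        String hostFor(String ip) throws UnknownHostException;
    }

    interface ForwardLookup {
        List<String> ipsFor(String host) throws UnknownHostException;
    }

    // Convenience overload that uses the JVM's own resolver.
    static boolean isGooglebot(String ip) {
        return isGooglebot(ip,
                addr -> InetAddress.getByName(addr).getCanonicalHostName(),
                host -> {
                    List<String> ips = new ArrayList<>();
                    for (InetAddress a : InetAddress.getAllByName(host)) {
                        ips.add(a.getHostAddress());
                    }
                    return ips;
                });
    }

    static boolean isGooglebot(String ip, ReverseLookup reverse, ForwardLookup forward) {
        try {
            // Step 1: reverse lookup, IP -> host name
            String host = reverse.hostFor(ip).toLowerCase(Locale.ROOT);
            if (host.endsWith(".")) {
                host = host.substring(0, host.length() - 1); // drop the trailing root dot
            }
            // Suffix match, so a name like "googlebot.com.evil.example" is rejected
            if (!host.endsWith(".googlebot.com")) {
                return false;
            }
            // Step 2: forward lookup, host name -> IPs, must contain the original IP
            return forward.ipsFor(host).contains(ip);
        } catch (UnknownHostException e) {
            return false; // no DNS answer: treat the request as not verified
        }
    }
}
```

In a servlet, a call such as GooglebotDnsCheck.isGooglebot(request.getRemoteAddr()) would run both steps. The forward step matters because whoever controls an IP range also controls its reverse DNS record, so the host name alone proves nothing.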
4. Verifying Googlebot – Java Example
Based on the above theory, we can simulate the first part of the “reverse DNS lookup” in Java : run the host command on the request IP to see which host name it points to.
If the request is coming from Googlebot, the output will contain a host name matching this pattern : *.googlebot.com.
P.S. The host command is available on *nix systems only.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    @Autowired
    private HttpServletRequest request;

    //...
    String requestIp = getRequestIp();
    String userAgent = request.getHeader("user-agent");
    System.out.println("User Agent : " + userAgent);

    if (!StringUtils.isEmpty(userAgent)) {

        if (userAgent.toLowerCase().contains("googlebot")) {

            // check for a fake user agent
            String output = executeCommand("host " + requestIp);
            System.out.println("Output : " + output);

            if (output.toLowerCase().contains("googlebot.com")) {
                System.out.println("This is Google bot");
            } else {
                System.out.println("This is fake user agent");
            }

        } else {
            System.out.println("Not from Google");
        }
    }

    // get the requested IP
    // note: X-FORWARDED-FOR is a request header, set by a proxy or by the client itself
    private String getRequestIp() {
        String ipAddress = request.getHeader("X-FORWARDED-FOR");
        if (ipAddress == null) {
            ipAddress = request.getRemoteAddr();
        }
        return ipAddress;
    }

    // execute an external command and return its standard output
    private String executeCommand(String command) {

        StringBuilder output = new StringBuilder();

        try {
            Process p = Runtime.getRuntime().exec(command);
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(p.getInputStream()));

            // read the output before waiting, so the process never blocks on a full pipe
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append("\n");
            }
            p.waitFor();

        } catch (Exception e) {
            e.printStackTrace();
        }

        return output.toString();
    }
Try the “step 2” fake user agent example again. Now, you get this output:
    User Agent : fake googlebot
    Output : Host 142.1.168.192.in-addr.arpa. not found: 3(NXDOMAIN)    //this output may vary.
    This is fake user agent
This simple solution may not be able to stop every fake/spoofed user agent, but this extra security layer should stop most basic user agent spoofing attempts.
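One gap worth closing is the request IP itself. getRequestIp() returns the X-FORWARDED-FOR header unchanged, and that header can be set by the client, can hold a comma-separated list of addresses, and ends up inside an external command. The sketch below is one possible way to tighten this, not a definitive implementation: the class name RequestIp, the method names and the behindTrustedProxy flag are hypothetical, and it assumes at most one reverse proxy of your own that appends the real peer address as the last entry. It handles IPv4 only.

```java
import java.util.regex.Pattern;

final class RequestIp {

    // Dotted-quad IPv4 only; IPv6 is deliberately left out of this sketch.
    private static final Pattern IPV4 = Pattern.compile(
            "^(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)"
            + "(\\.(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)){3}$");

    static boolean isIpv4(String s) {
        return s != null && IPV4.matcher(s).matches();
    }

    /**
     * Returns the address to verify, or null if no usable IPv4 address is found.
     *
     * @param forwardedFor       raw X-Forwarded-For header value, may be null
     * @param remoteAddr         value of request.getRemoteAddr()
     * @param behindTrustedProxy true only if your own proxy sets the header
     */
    static String clientIp(String forwardedFor, String remoteAddr, boolean behindTrustedProxy) {
        String candidate = remoteAddr;
        if (behindTrustedProxy && forwardedFor != null && !forwardedFor.trim().isEmpty()) {
            // Earlier entries may be client-supplied; the last one was appended by the proxy
            candidate = forwardedFor.substring(forwardedFor.lastIndexOf(',') + 1).trim();
        }
        // Reject anything that is not a plain IPv4 literal before it reaches a command line
        return isIpv4(candidate) ? candidate : null;
    }
}
```

With this helper, a null result can be treated the same way as a failed lookup, so the host command is never run on unvalidated input.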
If you have a better solution, do share below, thanks.
From:一号门