Android Web Scraping With Retrofit


In this tutorial, we’ll implement web scraping in an Android application. We’ll scrape the JournalDev.com home page and collect every word listed on it, using the Retrofit library to fetch the page.

Android Retrofit Converters

We’ve covered Retrofit at length in earlier tutorials.

Most of the time, we used Gson to serialize/deserialize JSON responses.
For this, we added GsonConverterFactory to our Retrofit Builder.

There can be instances when you just need plain text as the response body from the network call.
In such cases, instead of GsonConverterFactory, we need to use the Scalars converter (ScalarsConverterFactory).

To use the Scalars converter, add the following dependency alongside the Retrofit and OkHttp dependencies in build.gradle:


implementation 'com.squareup.retrofit2:converter-scalars:2.3.0'

To add the Scalars converter to the Retrofit Builder, do the following:


Retrofit retrofit = new Retrofit.Builder()
                .addConverterFactory(ScalarsConverterFactory.create())
                .baseUrl("BASE URL")
                .client(okHttpClient).build();

We can add multiple converters to the builder as well, but the order is important because Retrofit picks the first compatible converter.
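To see why the order matters, here is a toy model in plain Java. It is not Retrofit’s actual implementation: it assumes, as a simplification, that each factory either claims a type or declines by returning null, and that a JSON-style factory accepts every type it is asked about. The names ConverterOrderSketch, SCALARS, JSON and pick are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class ConverterOrderSketch {

    // Toy stand-in for a scalars factory: returns a label when it can
    // handle the requested type, or null when it declines.
    static final Function<Class<?>, String> SCALARS =
            type -> (type == String.class || type == Integer.class) ? "scalars" : null;

    // Toy stand-in for a JSON factory that accepts every type (an assumption of this sketch).
    static final Function<Class<?>, String> JSON = type -> "json";

    // Walks the factories in registration order and returns the first non-null answer.
    static String pick(List<Function<Class<?>, String>> factories, Class<?> type) {
        for (Function<Class<?>, String> factory : factories) {
            String chosen = factory.apply(type);
            if (chosen != null) {
                return chosen;
            }
        }
        throw new IllegalArgumentException("No converter for " + type);
    }

    public static void main(String[] args) {
        // Scalars first: String is claimed by scalars, other types fall through to JSON.
        System.out.println(pick(Arrays.asList(SCALARS, JSON), String.class));         // scalars
        System.out.println(pick(Arrays.asList(SCALARS, JSON), java.util.Date.class)); // json
        // JSON first: the scalars factory is never consulted.
        System.out.println(pick(Arrays.asList(JSON, SCALARS), String.class));         // json
    }
}
```

In this model, registering the scalars factory ahead of the JSON one keeps String responses as plain text, which is why the Scalars converter is usually added first when both are present.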

If we don’t want to use the Scalars converter, we can use the RequestBody and ResponseBody classes from OkHttp as the types instead.
ResponseBody lets us receive any type of response data through response.body() in the enqueue callback, and RequestBody lets us send any type of request data.

The only disadvantage: you need to handle the RequestBody object creation yourself.

Web pages are written in HTML, so to parse them we’ll use the jsoup library.

In the following section, we’ll use the Scalars converter to receive the requested web page as a string. We’ll extract all the words from its text and display each word with its count in a RecyclerView.

We’ll also add a button that sorts the words by count. We’ll use a HashMap to store the word/count pairs and sort it by value.

Project Structure

[Image: Android Retrofit web scraping project structure]

The dependencies in build.gradle are:


implementation 'com.squareup.retrofit2:retrofit:2.4.0'
implementation 'com.squareup.okhttp3:logging-interceptor:3.9.1'
implementation 'com.android.support:design:28.0.0'
implementation 'com.squareup.retrofit2:converter-scalars:2.3.0'
implementation 'org.jsoup:jsoup:1.10.1'

Code

The code for the activity_main.xml is defined below:


<?xml version="1.0" encoding="utf-8"?>
<android.support.constraint.ConstraintLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:app="http://schemas.android.com/apk/res-auto"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent">


    <android.support.v7.widget.RecyclerView
        android:id="@+id/wordList"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:orientation="vertical"
        app:layoutManager="android.support.v7.widget.LinearLayoutManager"
        app:layout_constraintBottom_toBottomOf="parent"
        app:layout_constraintLeft_toLeftOf="parent"
        app:layout_constraintRight_toRightOf="parent"
        app:layout_constraintTop_toTopOf="parent" />


    <android.support.design.widget.FloatingActionButton
        android:id="@+id/fab"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_margin="16dp"
        app:layout_constraintBottom_toBottomOf="parent"
        app:layout_constraintLeft_toLeftOf="parent"
        android:src="@drawable/ic_filter_list"
        app:layout_constraintRight_toRightOf="parent" />

</android.support.constraint.ConstraintLayout>

The code for the ApiService.java class is given below:


package com.journaldev.androidwebscrapingretrofit;

import retrofit2.Call;
import retrofit2.http.GET;

public interface ApiService {


    @GET(".")
    Call<String> getStringResponse();
}

The "." path specifies no additional path, so the request goes to the base URL itself.
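As a rough analogy, the same relative-path rule can be tried with java.net.URI from the standard library. This is only a sketch of how relative references resolve; Retrofit relies on OkHttp’s own URL resolution, and the class name BaseUrlResolveSketch and the path "page/2" are hypothetical.

```java
import java.net.URI;

final class BaseUrlResolveSketch {

    // Resolves a relative endpoint path against a base URL
    // using java.net.URI's relative-reference resolution.
    static String resolve(String baseUrl, String relativePath) {
        return URI.create(baseUrl).resolve(relativePath).toString();
    }

    public static void main(String[] args) {
        String base = "https://www.journaldev.com/";
        // "." adds nothing, so the result is the base URL itself.
        System.out.println(resolve(base, "."));      // https://www.journaldev.com/
        // A non-empty relative path is appended to the base URL.
        System.out.println(resolve(base, "page/2")); // https://www.journaldev.com/page/2
    }
}
```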

The code for the MainActivity.java is given below:


package com.journaldev.androidwebscrapingretrofit;

import android.support.design.widget.FloatingActionButton;
import android.support.v7.app.AppCompatActivity;
import android.os.Bundle;
import android.support.v7.widget.RecyclerView;
import android.view.View;

import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;

import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import okhttp3.OkHttpClient;
import okhttp3.logging.HttpLoggingInterceptor;
import retrofit2.Call;
import retrofit2.Callback;
import retrofit2.Response;
import retrofit2.Retrofit;
import retrofit2.converter.scalars.ScalarsConverterFactory;

public class MainActivity extends AppCompatActivity {


    RecyclerView recyclerView;
    FloatingActionButton fab;
    HashMap<String, Integer> occurrences = new HashMap<>();
    WordsAdapter wordsAdapter;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        recyclerView = findViewById(R.id.wordList);

        fab = findViewById(R.id.fab);

        // Log full request/response bodies; handy while developing, too verbose for release builds.
        OkHttpClient okHttpClient = new OkHttpClient.Builder()
                .addInterceptor(new HttpLoggingInterceptor().setLevel(HttpLoggingInterceptor.Level.BODY))
                .build();

        Retrofit retrofit = new Retrofit.Builder()
                .addConverterFactory(ScalarsConverterFactory.create())
                .baseUrl("https://www.journaldev.com/")
                .client(okHttpClient).build();


        final ApiService apiService = retrofit.create(ApiService.class);


        Call<String> stringCall = apiService.getStringResponse();
        stringCall.enqueue(new Callback<String>() {
            @Override
            public void onResponse(Call<String> call, Response<String> response) {
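                if (response.isSuccessful() && response.body() != null) {

                    // The Scalars converter hands us the raw HTML as a String.
                    String responseString = response.body();
                    // jsoup strips the markup and returns only the text content.
                    Document doc = Jsoup.parse(responseString);
                    responseString = doc.text();
                    createHashMap(responseString);
                }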

            }

            @Override
            public void onFailure(Call<String> call, Throwable t) {
                // Surface network errors instead of silently ignoring them.
                t.printStackTrace();
            }
        });

        fab.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View view) {


                occurrences = sortByValueDesc(occurrences);

                wordsAdapter = new WordsAdapter(MainActivity.this, occurrences);
                recyclerView.setAdapter(wordsAdapter);


            }
        });

    }

    private void createHashMap(String responseString) {


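        // Replace every non-alphanumeric character with a space, then trim so that
        // split() does not produce an empty first token.
        responseString = responseString.replaceAll("[^a-zA-Z0-9]", " ").trim();

        String[] splitWords = responseString.split(" +");

        for (String word : splitWords) {

            // Skip empty tokens and purely numeric ones.
            if (word.isEmpty() || StringUtil.isNumeric(word)) {
                continue;
            }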

            Integer oldCount = occurrences.get(word);
            if (oldCount == null) {
                oldCount = 0;
            }
            occurrences.put(word, oldCount + 1);
        }

        wordsAdapter = new WordsAdapter(this, occurrences);
        recyclerView.setAdapter(wordsAdapter);
    }

    public static HashMap<String, Integer> sortByValueDesc(Map<String, Integer> map) {
        List<Map.Entry<String, Integer>> list = new LinkedList<>(map.entrySet());
        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });

        HashMap<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : list) {
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }


}

The following lines parse the HTML response and extract its plain text:


Document doc = Jsoup.parse(responseString);
responseString = doc.text();

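Inside createHashMap, we replace all special characters with spaces and skip purely numeric tokens, so they never reach the HashMap.
sortByValueDesc uses a Comparator to order the entries by value, descending, and copies them into a LinkedHashMap, which preserves that order (a plain HashMap has no defined iteration order).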
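The counting and sorting steps can also be tried outside Android. The following is a minimal sketch in plain Java 8, not the tutorial’s exact code: the class name WordCountSketch and the sample sentence are made up for illustration, and Map.merge and List.sort need API level 24 or higher if moved into an Android app.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WordCountSketch {

    // Counts words after replacing every non-alphanumeric character with a space.
    // Purely numeric tokens are skipped, mirroring the tutorial's behavior.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String cleaned = text.replaceAll("[^a-zA-Z0-9]", " ").trim();
        if (cleaned.isEmpty()) {
            return counts;
        }
        for (String word : cleaned.split(" +")) {
            if (word.matches("[0-9]+")) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Copies the entries into a LinkedHashMap in descending order of count.
    static LinkedHashMap<String, Integer> sortByValueDesc(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        LinkedHashMap<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("Retrofit, jsoup & Retrofit! 2018 jsoup Retrofit");
        System.out.println(sortByValueDesc(counts)); // {Retrofit=3, jsoup=2}
    }
}
```

The result is returned as a LinkedHashMap because that is what keeps the sorted order intact when the adapter later iterates over the keys.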

The code for the list_item_words.xml which contains the layout for RecyclerView rows is given below:


<?xml version="1.0" encoding="utf-8"?>
<RelativeLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="wrap_content"
    android:background="?attr/selectableItemBackground"
    android:padding="24dp">


    <TextView
        android:id="@+id/txtWord"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_alignParentStart="true"
        android:layout_centerVertical="true" />

    <TextView
        android:id="@+id/txtCount"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_alignParentEnd="true"
        android:layout_centerVertical="true" />

</RelativeLayout>

The code for the WordsAdapter.java class is given below:


package com.journaldev.androidwebscrapingretrofit;

import android.content.Context;
import android.support.annotation.NonNull;
import android.support.v7.widget.RecyclerView;
import android.view.LayoutInflater;
import android.view.View;
import android.view.ViewGroup;
import android.widget.TextView;

import java.util.HashMap;

public class WordsAdapter extends RecyclerView.Adapter<WordsAdapter.WordsHolder> {

    HashMap<String, Integer> modelList;
    Context mContext;
    private String[] mKeys;

    class WordsHolder extends RecyclerView.ViewHolder {


        TextView txtWord;
        TextView txtCount;

        public WordsHolder(View itemView) {
            super(itemView);


            txtWord = itemView.findViewById(R.id.txtWord);
            txtCount = itemView.findViewById(R.id.txtCount);
        }
    }

    public WordsAdapter(Context context, HashMap<String, Integer> modelList) {
        this.modelList = modelList;
        mContext = context;
        mKeys = modelList.keySet().toArray(new String[modelList.size()]);
    }

    @NonNull
    @Override
    public WordsHolder onCreateViewHolder(@NonNull ViewGroup parent, int viewType) {
        View view = LayoutInflater.from(mContext).inflate(R.layout.list_item_words, parent, false);
        return new WordsHolder(view);
    }

    @Override
    public void onBindViewHolder(@NonNull WordsHolder holder, int position) {
        holder.txtWord.setText(mKeys[position]);
        holder.txtCount.setText(String.valueOf(modelList.get(mKeys[position])));
    }

    @Override
    public int getItemCount() {
        return modelList.size();
    }


}

The output of the above application is given below:

[Image: Android Retrofit web scraping output]

The output above shows every word present on the JournalDev home page at the time of writing this tutorial, along with its frequency.

This brings an end to this tutorial. You can download the project from the link below:
