In this tutorial, we’ll implement web scraping in an Android application. We’ll scrape the JournalDev.com home page to collect every word listed on it, using the Retrofit library to fetch the page.
Android Retrofit Converters
We’ve covered Retrofit extensively in the tutorials below:
- Retrofit Basics
- Retrofit And RxJava
- Retrofit Offline Caching
- Retrofit Calling In Intervals
- Retrofit Downloading Files
- Retrofit MVP Dagger RxJava
- Retrofit Downloading And Showing Progress in Notifications
Most of the time, we have used Gson to serialise/deserialise JSON responses.
For this, we’ve used the Gson converter (GsonConverterFactory) in our Retrofit Builder.
There can be instances when you just need plain text as the response body from the network call.
In such cases, instead of the Gson converter, we need to use the Scalars converter.
To use the Scalars converter, add the following dependency alongside the Retrofit and OkHttp dependencies in the build.gradle:
implementation 'com.squareup.retrofit2:converter-scalars:2.3.0'
To add Scalar Converters to the Retrofit Builder, do the following:
Retrofit retrofit = new Retrofit.Builder()
.addConverterFactory(ScalarsConverterFactory.create())
.baseUrl("BASE URL")
.client(okHttpClient).build();
We can add multiple converters to the builder as well, but the order matters: Retrofit asks each converter factory in the order it was added and uses the first one that can handle the type.
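To make the ordering rule concrete, here is a minimal plain-Java sketch of a "first compatible converter wins" chain. It is a simplified stand-in, not Retrofit's actual classes: the Factory interface and the SCALARS and CATCH_ALL names are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Simplified model of a converter chain; these are NOT Retrofit's real classes.
class ConverterOrderDemo {

    /** Hypothetical factory: returns a converter for the type, or null to decline. */
    interface Factory {
        Function<String, Object> converterFor(Class<?> type);
    }

    // Mimics a scalars-style factory: handles String only and declines everything else.
    static final Factory SCALARS = type -> {
        if (type != String.class) {
            return null;
        }
        Function<String, Object> identity = body -> body;
        return identity;
    };

    // Mimics a JSON-style factory that claims every type it is asked about.
    static final Factory CATCH_ALL = type -> body -> "parsed-as-json(" + body + ")";

    /** Walks the factories in registration order and uses the first non-null converter. */
    static Object convert(List<Factory> factories, Class<?> type, String body) {
        for (Factory factory : factories) {
            Function<String, Object> converter = factory.converterFor(type);
            if (converter != null) {
                return converter.apply(body);
            }
        }
        throw new IllegalArgumentException("No converter for " + type);
    }

    public static void main(String[] args) {
        List<Factory> scalarsFirst = Arrays.asList(SCALARS, CATCH_ALL);
        List<Factory> catchAllFirst = Arrays.asList(CATCH_ALL, SCALARS);
        System.out.println(convert(scalarsFirst, String.class, "<html>"));  // <html>
        System.out.println(convert(catchAllFirst, String.class, "<html>")); // parsed-as-json(<html>)
    }
}
```

In a real builder this corresponds to adding ScalarsConverterFactory before a factory that accepts every type, such as GsonConverterFactory, so that String responses are still handed to the Scalars converter.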
Besides Scalar Converters, declaring RequestBody and ResponseBody as the request and response types lets you send and receive any kind of data without a converter; the raw response is read from response.body() in the enqueue callback. The only disadvantage: you need to handle the RequestBody object creation yourself.
In the following section, we’ll use the Scalars converter to fetch the website passed in the Retrofit request as a plain string. We’ll extract all the words from its text and show each word with its count in a RecyclerView.
We’ll also add a filter button that reorders the words by count. We’ll use a HashMap to store the word/count pairs and sort it by value.
Project Structure
The dependencies in the build.gradle are:
implementation 'com.squareup.retrofit2:retrofit:2.4.0'
implementation 'com.squareup.okhttp3:logging-interceptor:3.9.1'
implementation 'com.android.support:design:28.0.0'
implementation 'com.squareup.retrofit2:converter-scalars:2.3.0'
implementation 'org.jsoup:jsoup:1.10.1'
Code
The code for the activity_main.xml layout is defined below:
<?xml version="1.0" encoding="utf-8"?>
<android.support.constraint.ConstraintLayout xmlns:android="http://schemas.android.com/apk/res/android"
xmlns:app="http://schemas.android.com/apk/res-auto"
xmlns:tools="http://schemas.android.com/tools"
android:layout_width="match_parent"
android:layout_height="match_parent">
<android.support.v7.widget.RecyclerView
android:id="@+id/wordList"
android:layout_width="match_parent"
android:layout_height="match_parent"
android:orientation="vertical"
app:layoutManager="android.support.v7.widget.LinearLayoutManager"
app:layout_constraintBottom_toBottomOf="parent"
app:layout_constraintLeft_toLeftOf="parent"
app:layout_constraintRight_toRightOf="parent"
app:layout_constraintTop_toTopOf="parent" />
<android.support.design.widget.FloatingActionButton
android:id="@+id/fab"
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:layout_margin="16dp"
app:layout_constraintBottom_toBottomOf="parent"
app:layout_constraintLeft_toLeftOf="parent"
android:src="@drawable/ic_filter_list"
app:layout_constraintRight_toRightOf="parent" />
</android.support.constraint.ConstraintLayout>
The code for the ApiService.java class is given below:
package com.journaldev.androidwebscrapingretrofit;
import retrofit2.Call;
import retrofit2.http.GET;
public interface ApiService {
@GET(".")
Call<String> getStringResponse();
}
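The "." passed to @GET above is a relative reference that resolves to the base URL itself. The sketch below illustrates that resolution with java.net.URI from the standard library. This is an assumption made for illustration: Retrofit resolves paths through OkHttp rather than java.net.URI, and the example.com URLs and the "words" path are hypothetical.

```java
import java.net.URI;

// Illustrates how a relative path such as "." resolves against a base URL.
// java.net.URI is used as a stand-in here; Retrofit itself relies on OkHttp for this step.
class BaseUrlDemo {

    /** Resolves relativePath against baseUrl using standard relative-reference rules. */
    static String resolve(String baseUrl, String relativePath) {
        return URI.create(baseUrl).resolve(relativePath).toString();
    }

    public static void main(String[] args) {
        // "." keeps the base URL unchanged when the base ends with a slash.
        System.out.println(resolve("https://www.journaldev.com/", "."));
        // The same holds for a base URL that has a path (hypothetical URL).
        System.out.println(resolve("https://example.com/api/v1/", "."));
        // A relative path without a leading slash is appended to the base path.
        System.out.println(resolve("https://example.com/api/v1/", "words"));
    }
}
```

The trailing slash on the base URL matters for this to work as shown, which is consistent with Retrofit requiring the base URL to end in "/".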
The "." in @GET(".") specifies no path, so only the base URL is used.
The code for the MainActivity.java class is given below:
package com.journaldev.androidwebscrapingretrofit;
import android.support.design.widget.FloatingActionButton;
import android.support.v7.app.AppCompatActivity;
import android.os.Bundle;
import android.support.v7.widget.RecyclerView;
import android.view.View;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import okhttp3.OkHttpClient;
import okhttp3.logging.HttpLoggingInterceptor;
import retrofit2.Call;
import retrofit2.Callback;
import retrofit2.Response;
import retrofit2.Retrofit;
import retrofit2.converter.scalars.ScalarsConverterFactory;
public class MainActivity extends AppCompatActivity {
RecyclerView recyclerView;
FloatingActionButton fab;
HashMap<String, Integer> occurrences = new HashMap<>();
WordsAdapter wordsAdapter;
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
recyclerView = findViewById(R.id.wordList);
fab = findViewById(R.id.fab);
OkHttpClient okHttpClient = new OkHttpClient().newBuilder().addInterceptor(new HttpLoggingInterceptor().setLevel(HttpLoggingInterceptor.Level.BODY))
.build();
Retrofit retrofit = new Retrofit.Builder()
.addConverterFactory(ScalarsConverterFactory.create())
.baseUrl("https://www.journaldev.com/")
.client(okHttpClient).build();
final ApiService apiService = retrofit.create(ApiService.class);
Call<String> stringCall = apiService.getStringResponse();
stringCall.enqueue(new Callback<String>() {
@Override
public void onResponse(Call<String> call, Response<String> response) {
if (response.isSuccessful()) {
String responseString = response.body();
Document doc = Jsoup.parse(responseString);
responseString = doc.text();
createHashMap(responseString);
}
}
@Override
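public void onFailure(Call<String> call, Throwable t) {
// The request failed (for example, no network); log it instead of failing silently
t.printStackTrace();
}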
});
fab.setOnClickListener(new View.OnClickListener() {
@Override
public void onClick(View view) {
occurrences = sortByValueDesc(occurrences);
wordsAdapter = new WordsAdapter(MainActivity.this, occurrences);
recyclerView.setAdapter(wordsAdapter);
}
});
}
private void createHashMap(String responseString) {
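// Replace special characters with spaces and trim so that split() yields no empty first token
responseString = responseString.replaceAll("[^a-zA-Z0-9]", " ").trim();
String[] splitWords = responseString.split(" +");
for (String word : splitWords) {
// Skip blanks and purely numeric tokens
if (word.isEmpty() || StringUtil.isNumeric(word)) {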
continue;
}
Integer oldCount = occurrences.get(word);
if (oldCount == null) {
oldCount = 0;
}
occurrences.put(word, oldCount + 1);
}
wordsAdapter = new WordsAdapter(this, occurrences);
recyclerView.setAdapter(wordsAdapter);
}
public static HashMap<String, Integer> sortByValueDesc(Map<String, Integer> map) {
List<Map.Entry<String, Integer>> list = new LinkedList<>(map.entrySet());
Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
@Override
public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
return o2.getValue().compareTo(o1.getValue());
}
});
HashMap<String, Integer> result = new LinkedHashMap<>();
for (Map.Entry<String, Integer> entry : list) {
result.put(entry.getKey(), entry.getValue());
}
return result;
}
}
The following code parses the HTML response and extracts its plain text:
Document doc = Jsoup.parse(responseString);
responseString = doc.text();
Inside createHashMap, we replace all special characters with spaces, split the text into words, and skip purely numeric tokens so they never enter the HashMap.
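The counting step can be tried outside Android with a small standalone sketch. It assumes a regular expression in place of jsoup's StringUtil.isNumeric, and the class and method names (WordCountDemo, countWords) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the word-counting step; names here are illustrative only.
class WordCountDemo {

    /** Counts alphanumeric words, skipping blanks and purely numeric tokens. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Replace every non-alphanumeric character with a space, then trim the ends
        // so that split() does not produce an empty first token.
        String cleaned = text.replaceAll("[^a-zA-Z0-9]", " ").trim();
        for (String word : cleaned.split(" +")) {
            // A regex check stands in for jsoup's StringUtil.isNumeric (assumption).
            if (word.isEmpty() || word.matches("[0-9]+")) {
                continue;
            }
            // merge() inserts 1 for a new word or adds 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("  Retrofit, retrofit & Retrofit 2 (2018)!");
        System.out.println(counts.get("Retrofit")); // 2
        System.out.println(counts.get("retrofit")); // 1
        System.out.println(counts.containsKey("2018")); // false
    }
}
```

Counting is case-sensitive here, as it is in the tutorial's code, so "Retrofit" and "retrofit" are separate entries. Lower-casing the text first would merge them if that is preferred.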
sortByValueDesc uses a Comparator to sort the entries by value in descending order and copies them into a LinkedHashMap, which preserves that order.
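As a hedged alternative, the same sort can be written more compactly with Java 8's Map.Entry.comparingByValue. This is a plain-Java sketch rather than the tutorial's implementation; on older Android API levels these calls may need core library desugaring, and the sample words and counts are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Compact variant of sorting a word/count map by value, highest count first.
class SortByValueDemo {

    static LinkedHashMap<String, Integer> sortByValueDesc(Map<String, Integer> map) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(map.entrySet());
        // Sort by count, largest first; the sort is stable, so ties keep their input order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        // LinkedHashMap remembers insertion order, so the sorted order survives.
        LinkedHashMap<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5);
        counts.put("android", 3);
        counts.put("retrofit", 9);
        System.out.println(sortByValueDesc(counts)); // {retrofit=9, the=5, android=3}
    }
}
```

The result has to be a LinkedHashMap because the adapter reads keySet() in iteration order; copying the sorted entries into a plain HashMap would discard the ordering.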
The code for the list_item_words.xml layout, which defines the RecyclerView rows, is given below:
<?xml version="1.0" encoding="utf-8"?>
<RelativeLayout xmlns:android="http://schemas.android.com/apk/res/android"
android:layout_width="match_parent"
android:layout_height="wrap_content"
android:background="?attr/selectableItemBackground"
android:padding="24dp">
<TextView
android:id="@+id/txtWord"
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:layout_alignParentStart="true"
android:layout_centerVertical="true" />
<TextView
android:id="@+id/txtCount"
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:layout_alignParentEnd="true"
android:layout_centerVertical="true" />
</RelativeLayout>
The code for the WordsAdapter.java class is given below:
package com.journaldev.androidwebscrapingretrofit;
import android.content.Context;
import android.support.annotation.NonNull;
import android.support.v7.widget.RecyclerView;
import android.view.LayoutInflater;
import android.view.View;
import android.view.ViewGroup;
import android.widget.TextView;
import java.util.HashMap;
public class WordsAdapter extends RecyclerView.Adapter<WordsAdapter.WordsHolder> {
HashMap<String, Integer> modelList;
Context mContext;
private String[] mKeys;
class WordsHolder extends RecyclerView.ViewHolder {
TextView txtWord;
TextView txtCount;
public WordsHolder(View itemView) {
super(itemView);
txtWord = itemView.findViewById(R.id.txtWord);
txtCount = itemView.findViewById(R.id.txtCount);
}
}
public WordsAdapter(Context context, HashMap<String, Integer> modelList) {
this.modelList = modelList;
mContext = context;
mKeys = modelList.keySet().toArray(new String[modelList.size()]);
}
@NonNull
@Override
public WordsHolder onCreateViewHolder(@NonNull ViewGroup parent, int viewType) {
View view = LayoutInflater.from(mContext).inflate(R.layout.list_item_words, parent, false);
return new WordsHolder(view);
}
@Override
public void onBindViewHolder(@NonNull WordsHolder holder, int position) {
holder.txtWord.setText(mKeys[position]);
holder.txtCount.setText(String.valueOf(modelList.get(mKeys[position])));
}
@Override
public int getItemCount() {
return modelList.size();
}
}
The output of the above application is given below:
The output shows every word present on the JournalDev home page at the time of writing this tutorial, along with its frequency.
This brings an end to this tutorial. You can download the project from the link below: