2016-12-16 · Algorithms

Our Siwa Auditorium training requires everyone to write up their learning notes as short posts on Jianshu. The company also tracks per-person metrics: number of articles, total reads, number of comments, number of likes, and so on. Counting all of this by hand for so many people is tedious, and programmers naturally hand such chores to code, so I wrote a crawler. Since I am currently learning Ruby, I built this Jianshu user-statistics crawler in Ruby to drive my own learning.

If I'm asked to crawl a site's content, my first thought is to grab its HTML, but I also ask whether the site offers an RSS feed. An RSS feed is XML, which is more concise and efficient to parse than HTML, and its structure is more stable (the site changing a bit of CSS or markup one day could break an HTML-based crawler), so parsing it is more convenient. After checking, Jianshu does not provide RSS, so I decided to crawl the HTML directly.

Analyzing Jianshu URLs

  • Home: http://www.jianshu.com/
  • User homepage: http://www.jianshu.com/users/user ID (roughly this pattern)

    From here we can get the user's following count, follower count, article count, word count, likes received, and other information.


  • User's latest articles: http://www.jianshu.com/users/user ID (roughly this pattern)/latest_articles

    From here we can get the list of a user's articles, which lets us count the comments, reads, and so on for each article. By traversing the list and summing each article's comment and read counts, we obtain the user's totals.

Note that the article list page does not show all of a user's articles at once: it lists 10, and when you scroll to the bottom of the list the browser automatically loads more, using JavaScript to request data from the backend and splice it into the page. So a single fetch is not enough to total up a user's comments and reads; the article list is paginated. How do we know how many pages a user's articles span? We already get the user's total article count from the home page, so dividing it by 10 and adding 1 gives the number of pages.

  • The article list is paginated at 10 articles per page; the URL of page m is:

http://www.jianshu.com/users/user ID (roughly this pattern)/latest_articles?page=m
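The page-count rule and URL pattern above can be sketched as follows. This is a minimal illustration, not the author's original code: `page_count` and `list_page_url` are hypothetical helper names, and the formula is the "divide by 10 plus 1" rule from the text (which may yield one empty trailing page when the total is an exact multiple of 10).

```ruby
# Number of list pages, per the rule in the text: total / 10 + 1.
def page_count(total_articles, per_page = 10)
  total_articles / per_page + 1
end

# URL of page m of a user's article list. user_id stands in for
# the real Jianshu user ID.
def list_page_url(user_id, m)
  "http://www.jianshu.com/users/#{user_id}/latest_articles?page=#{m}"
end
```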



Fetching pages to get the HTML

Ruby's HTTP access is very concise and efficient, and there is more than one way to do it; if you are interested in the alternatives, Google them as I did. Here I post my own code:

  Dear reader, I don't think the code needs much explanation. Following the Jianshu URL patterns described above, we can use it to crawl the HTML of the corresponding page.
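The author's original snippet is not reproduced here, so the following is only a minimal sketch of the fetch step using Ruby's standard Net::HTTP library; `fetch_html` is a hypothetical name.

```ruby
require 'net/http'
require 'uri'

# Fetch the raw HTML of a page by URL and return it as a String.
def fetch_html(url)
  uri = URI.parse(url)
  Net::HTTP.get(uri)
end

# e.g. fetch_html("http://www.jianshu.com/users/<user ID>/latest_articles")
```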

Analyzing the structure of the crawled content

Once the HTML of the page has been fetched, we analyze its content and structure. With our eyes we can easily read the rendered page, but the crawler sees only the HTML source. Below I pick out the useful fragments from the fetched HTML:

  • Fragment used to extract the user's following count, follower count, article count, word count, and likes received

  • Fragment used to extract the user's total comment count and total read count


Extracting the key information

Above, I picked out the useful key HTML by hand. Now I want the crawler to do the same thing, so I use regular expression matching.

  • Regex matching for "Fans", "Following", "Articles", "Word Count", "Likes Received"

  See the source code for the other matching patterns.
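The matching step can be sketched like this. The HTML fragment below is invented to mimic the kind of markup the author describes; the real jianshu.com markup, labels, and the author's actual regexes may differ.

```ruby
# Illustrative profile-page fragment (not real Jianshu markup).
html = <<-HTML
  <b>42</b>Following
  <b>128</b>Fans
  <b>36</b>Articles
  <b>52000</b>Words
  <b>310</b>Likes
HTML

# Scan for <b>number</b>label pairs and collect them into a hash.
stats = {}
html.scan(/<b>(\d+)<\/b>(\w+)/) do |count, label|
  stats[label] = count.to_i
end

puts stats["Fans"]  # prints 128
```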

Integrate information and diversify output

Once the user's article statistics have been gathered, they need to be output. To make the output richer and more customizable, I use template rendering to separate the data from the presentation. Template file:

  Then load the template file in the Ruby code and replace @{title}, @{time}, and @{content} with the real statistics.
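The substitution step can be sketched with `String#gsub` and a block. The template string below stands in for the actual template file, and the value names follow the placeholders mentioned in the post.

```ruby
# Stand-in for the loaded template file.
template = "Title: @{title}\nTime: @{time}\n@{content}"

# The real statistics gathered by the crawler.
values = {
  'title'   => 'Jianshu stats',
  'time'    => '2016-12-16',
  'content' => 'Total comments: 57'
}

# Replace every @{name} placeholder with its value.
report = template.gsub(/@\{(\w+)\}/) { values[Regexp.last_match(1)] }
```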

  Of course, if you add an out2json step someday, you can easily build an API on top of this for an even higher degree of customization.
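The post only names out2json without showing it; a minimal sketch of what such a step could look like, using Ruby's standard JSON library to serialize the statistics hash:

```ruby
require 'json'

# Serialize the collected statistics as a JSON string,
# suitable for returning from an API endpoint.
def out2json(stats)
  JSON.generate(stats)
end
```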

Project home page



  • Download the project code and run it

    For a more detailed project description, please see the GitHub project homepage.


This article carries a copyright notice and is protected by copyright law; it must not be reproduced without permission. To reprint it, please contact the author for authorization. If you found this article useful, you can click "Sponsor the Author" below to support the author!

Reprint source: Baiyuan's Blog >> https://wangbaiyuan.cn/en/in-ruby-jane-books-crawler-statistics-users-post-information-2.html
