Memory Error When Dealing With Huge Data
Solution 1:
As others have noted, six million bins is unlikely to be useful. A simple fix, though, is to reuse the same figure: since the histograms stay the same and only the other plot elements change, keep references to those elements when you first create them:
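import matplotlib.pyplot as plt

# keep handles to the elements that will change between figures
vline1 = plt.axvline(...)
vline2 = plt.axvline(...)
lgd = plt.legend()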
Then, after savefig, don't close the figure and plot the histograms again. Instead, reuse the figure and change only what needs to change:
# change vline1 and vline2 positions and labels
vline1.set_data([240,240],[0,1])
vline1.set_label('new label')
vline2.set_data(...)
vline2.set_label(...)
# remove old legend, replace with new
lgd.remove()
lgd = plt.legend(loc=4)
plt.xlabel('new xlabel')
# etc
Finally, call savefig again with the new filename.
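Putting the pieces together, a minimal sketch of the reuse pattern might look like the following. The synthetic data, the marker positions, the labels, and the file names are all assumptions for illustration and are not taken from the original question.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the real data (assumption): answer times in minutes.
rng = np.random.default_rng(0)
times = rng.exponential(scale=60.0, size=10_000)

fig, ax = plt.subplots()
# Draw the expensive histogram exactly once.
ax.hist(times, bins=np.arange(121), density=True,
        histtype="step", cumulative=True, color="b")
vline = ax.axvline(30, color="r", label="marker at 30 min")
lgd = ax.legend(loc=4)

out_dir = tempfile.mkdtemp()
saved = []
# Hypothetical per-figure settings: (marker position, label, xlabel, file name).
variants = [
    (30, "marker at 30 min", "time (min), variant A", "hist_a.png"),
    (90, "marker at 90 min", "time (min), variant B", "hist_b.png"),
]
for x, label, xlabel, name in variants:
    # Move and relabel the existing line instead of creating a new one.
    vline.set_data([x, x], [0, 1])
    vline.set_label(label)
    # Rebuild the legend so that it picks up the new label.
    lgd.remove()
    lgd = ax.legend(loc=4)
    ax.set_xlabel(xlabel)
    path = os.path.join(out_dir, name)
    fig.savefig(path)
    saved.append(path)

plt.close(fig)  # close only once, after the last file has been written
```

The histogram is computed and drawn a single time, so each additional output file only costs a redraw of the figure rather than another pass over the data.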
Solution 2:
You plot six million bins and then zoom in on (presumably) a small part of them. With two lines per figure, that is 12 million data points, so I'm not surprised matplotlib crashes once you try to plot another 12 million in the next figure. I highly doubt you really need six million bins, so let's get your histogram down to a more manageable size!
Let's say that your data spans the 44 or 48 hours that you wish to look at. With six million bins, this would imply a resolution of roughly 30 milliseconds in your data (48 hours is 172,800 seconds, which divided by six million is about 29 ms per bin). Considering that you display a resolution of minutes, this seems unreasonable. Alternatively, if your data has a resolution of seconds, six million bins would imply that it spans about 70 days, of which you only look at two.
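The two estimates above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the 48-hour span and the one-second bin width are the assumptions made in the text, not values known from the question.

```python
N_BINS = 6_000_000

# Case 1: the bins cover 48 hours, so each bin is this many milliseconds wide.
span_seconds = 48 * 3600
bin_width_ms = span_seconds / N_BINS * 1000

# Case 2: each bin is one second wide, so the bins cover this many days.
span_days = N_BINS / 86_400

print(f"{bin_width_ms:.1f} ms per bin")   # 28.8 ms per bin
print(f"{span_days:.1f} days covered")    # 69.4 days covered
```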
Let's assume you are interested in two days of data with a resolution in terms of seconds or minutes.
The bins argument does not have to be a number of bins; it can also be a sequence of bin edges. Thus for your first graph, you could say
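# recent Matplotlib versions no longer accept normed; density is its replacement
# range(121) yields 121 edges (0..120), i.e. 120 one-minute bins
plt.hist(taccept_list, bins=range(121), density=True, histtype="step", cumulative=True, color='b', label='accepted answer')
plt.hist(tfirst_list, bins=range(121), density=True, histtype="step", cumulative=True, color='g', label='first answer')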
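This gives you a resolution of one minute over the first 120 minutes. The histogram disregards anything higher than 120, which is fine for the shape of the curve, since you are not going to show those values in your plot anyway. Be aware, though, that the normalization then only counts the values inside the range, so the cumulative curve reaches 1 at the last bin edge.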
For a resolution of one second over the same 120 minutes, you could pass these bin edges instead:
numpy.linspace(0, 120, 7201)  # 7201 edges give 7200 one-second bins
Either way, the number of points in your histogram is now far more reasonable (hundreds or thousands instead of six million) and probably more in line with the data you are actually displaying.
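To see what explicit bin edges do, here is a small sketch that uses numpy.histogram, which performs the same kind of binning that plt.hist relies on. The exponential sample is a synthetic stand-in for the real answer times, and the variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): times in minutes, many far beyond 120.
times = rng.exponential(scale=600.0, size=100_000)

minute_edges = np.arange(121)              # 121 edges -> 120 one-minute bins
second_edges = np.linspace(0, 120, 7201)   # 7201 edges -> 7200 one-second bins

counts_min, _ = np.histogram(times, bins=minute_edges)
counts_sec, _ = np.histogram(times, bins=second_edges)

# Samples outside [0, 120] are dropped rather than piled into the last bin.
in_range = np.count_nonzero((times >= 0) & (times <= 120))

# Cumulative fraction relative to ALL samples, not only those inside the range.
cum_fraction = np.cumsum(counts_min) / times.size

print(len(counts_min), len(counts_sec))
print(counts_min.sum() == in_range)
```

Dividing the cumulative counts by the total sample size, as in cum_fraction, is one possible way to keep the curve comparable to a histogram over the full data, since a normalized histogram restricted to a range only accounts for the values inside that range.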